Amplicon_sorter: A tool for reference‐free amplicon sorting based on sequence similarity and for building consensus sequences

Abstract Oxford Nanopore Technologies (ONT) is a third‐generation sequencing technology that is gaining popularity in ecological research for its portable and low‐cost sequencing possibilities. Although the technology excels at long‐read sequencing, it can also be applied to sequence amplicons. The downside of ONT is the low quality of the raw reads. Hence, generating a high‐quality consensus sequence is still a challenge. We present Amplicon_sorter, a tool for reference‐free sorting of ONT sequenced amplicons based on their similarity in sequence and length and for building solid consensus sequences.


| INTRODUC TI ON
Long-read sequencing methods from Oxford Nanopore Technologies (ONT) (Eisenstein, 2012) can also be used to mass sequence amplicons. In comparison with short-read sequencers such as Illumina (2 × 300 bp) and IonTorrent (600 bp) (Slatko et al., 2018), there is virtually no limit to the amplicon length for ONT. However, to this date, the main disadvantage of ONT is the relatively low read quality, which most recently reached a modal of 99.3% with the new Q20+ technology and an R10.4 flow cell (https://nanop orete ch.com/accuracy).
The reads are clustered and a consensus is made with Canu (Koren et al., 2017), MAFFT (Katoh & Standley, 2013), vsearch (Rognes et al., 2016b), IsONclust (Sahlin & Medvedev, 2020), or Consension (https:// micro biolo gy.se/softw are/conse nsion). 4. In most cases, a last consensus polishing step is performed with Medaka (https://github.com/ nanop orete ch/medaka), Racon (Vaser et al., 2017), Nanopolish (Loman et al., 2015), or a reading frame correction for coding genes (Menegon et al., 2017;Srivathsan et al., 2018Srivathsan et al., , 2021aSrivathsan et al., , 2021b. Most current pipelines (Maestri et al., 2019;Menegon et al., 2017;Srivathsan et al., 2018) need these consecutive programs to demultiplex, sort amplicons based on length/species identity with references to finally create a consensus sequence. IsoCon and ToFu are referencefree long-read consensus algorithms for transcriptome data that have been described but aim for a different application (Gordon et al., 2015;Sahlin et al., 2018). The recent programs ONTrack (Maestri et al., 2019), NGSpeciesID (Sahlin et al., 2021), and ONTbarcoder (Srivathsan et al., 2021a) perform reference-free clustering of amplicons and create a high-quality consensus sequence and are designed for specific amplicon sequencing applications. ONTrack needs demultiplexed files, processes only the reads in the most abundant cluster, and needs the large fast5 files to polish the consensus sequence. NGSpeciesID processes demultiplexed files with one or a few divergent amplicons. It only needs a fastq file as input and clusters the sequences based on similarity. A preferred amplicon length and deviation thereof can be entered in the script. ONTbarcoder is specifically made to process COI amplicons that are uniquely tagged with a barcode. It needs a demultiplexing file which contains the unique barcode-primer sequences, a fastq file with the sequences, and the expected fragment length. Although it expects one amplicon per unique barcode, it can find divergent amplicons (even other genes) with the same length if more are present. Here we present Amplicon_sorter which is developed to sort sequences based on similarity and length, and to build a robust consensus sequence for each group of sequences in one simple run. Amplicon_sorter can process all sorts of amplicons, with or without barcode unlike ONTbarcoder that processes coding genes and needs a barcode for each sample. Amplicon_ sorter and NGSpeciesID can process a range of amplicon lengths in one go unlike ONTbarcoder that need one expected fragment length.
Amplicon_sorter does not limit the search to the most abundant clusters like ONTrack and ONTbarcoder but searches for everything. Unlike ONTrack and NGSpeciesID which are pipelines that are dependent on other programs to do the job, Amplicon_sorter is a python script that only needs python3 and a few python plugins. Amplicon_sorter might perform even better in some cases in conjunction with Medaka, but this is in most cases not needed. It has been written for metagenetics samples that contain amplicons of several genes with the same or different lengths from all the species in the samples. Nevertheless, it can also be used for demultiplexed samples that only contain one amplicon.

| Gene group creation
The Amplicon_sorter script reads the input file in fasta or fastq format ( Figure 1). Prior to analysis, minimum and maximum read lengths can be delimited and the maximum number of sampled reads for analysis can be set. In absence of a user limit, Amplicon_sorter will analyze 10,000 reads by default. If the number of reads in the input file is lower than 1000, all reads are used. All reads get a unique serial number. An option (a --all) is available to compare all reads with each other, but this is discouraged for sequence sets of over 100,000 reads because it is computation intensive. For example: on a 3. Without this option, the script subsamples the selected number of reads in batches of 1,000 in the same order as the reads in the inputfile (an option (ra --random) is available to randomly sample from the inputfile) and compares the reads pairwise within each batch for read length differences smaller than 5%. If the similarity is lower than 50%, the reverse complement of one of the sequences is also compared. If the similarity is greater than or equal to 80%, the serial numbers of the two compared sequences and their similarity is added to a list. This list is saved to disk for later use (step 2.2.2). Next, for each read, only the read to which it has the highest similarity is kept resulting in a high-similarity pair. Gene groups are created by merging high-similarity pairs with overlapping reads. It may occur that, eventually, several gene groups remain that actually represent the same gene. To combine those, Amplicon_sorter samples 50 random reads | 3 of 17 VIERSTRAETE And BRAECKMAn F I G U R E 1 A step-wise schematic diagram of the workflow of Amplicon_sorter from each group, creates a consensus, and compares the consensus of each group with each other. A length difference of 8% is allowed, and if the similarity is greater than or equal to 60%, the gene groups are merged. The script saves the result in gene group files that contain reads of the same gene based on length and similarity (e.g., group_1 contains 18S reads, group_2 contains COI reads, etc.).

| Gene-to-species sorting
Unlike Tofu and IsoCon which use a nearest neighbor graph method to cluster the reads, Amplicon_sorter uses a more straightforward approach. For each read, only the read to which it has the highest similarity is retained. Gene-to-species sorting is an iterative process within each gene group, starting with grouping reads with high sequence similarity (greater than or equal to 93%) into species groups.
Species groups that contain common sequences are merged. A consensus sequence for each species group is built to which all remaining sequences in the gene group are compared with a maximum length difference of 5%. A sequence is added to the species group to which it has the highest similarity (of at least 95%). A new consensus is built after each iteration and therefore becomes increasingly more accurate. When no more reads can be added (or a limit of 3 cycles for the same similarity), the similarity threshold is dropped by 1% and a new cycle starts. Every other cycle, the consensuses from all species groups that have a maximum length difference of 8% are compared. If the similarity between two consensuses is greater than or equal to 96%, the two species groups are merged. When the similarity threshold dropped to 85%, the loop ends and the sequences from each species group are saved in a file. This iterative process converges to a stable point in each iteration but is limited to 3 cycles because adding more cycles increases the processing time and is only marginally improving the consensus in that cycle. As a result, each output file contains all sequences with high similarity and similar length as well as a final consensus sequence based on 200 random reads (e.g., file_1_1.fasta is 18S from species1, file1_2.fasta is 18S from species2, file_2_1 is COI from species 3 …). Amplicon_sorter generates extra files containing all consensus sequences per species group and a list of all consensus sequences in the project. Reads that could not be grouped are saved in "unique sequences" files. The script allows for parallel processing to speed up the analysis. Output files can be saved in fasta and fastq format. Amplicon_sorter writes and reads temporary files to keep the RAM memory consumption low.

| Parameter optimization
An online available amplicon dataset sequenced on an R9.4 MinION flow cell (Maestri et al., 2019) was used to optimize the parameters of Amplicon_sorter. The dataset contains barcoded amplicons from two snails and five beetles with similarities ranging from 69% to 89% (Table A1). To test the maximal consensus accuracy of Amplicon_sorter in a single species, the script was run on the separate barcode files. Consensus accuracies were reached ranging from 98.41% to 99.54% and all errors were homopolymer underestimations ( Figure A1). After pooling all seven barcode samples and creating input files with quality score between Q7 and Q12 using NanoFilt, we were able to retrieve all original barcodes from all input files (Table 1). The low abundant barcodes (BC04: 7.6%, BC07: 3.2%) were detected in the Q7 input file and even the lowest abundance of 1.5% from BC07 was detected in the Q12 input file. If very low abundance barcodes should be found, we recommend using the --all option (to compare all reads with each other) at the cost of processing time. An alternative option to find very low abundance barcodes is to use the random function --random and increase the number of comparisons --maxreads to a number higher than the available reads. This way the program samples reads randomly several times to increase the chances to find its best match (example command for Q12 33% reads: python3 amplicon_sorter.py -i poolded_q12.fastq -o q12_30 -min 600 -max 800 -np 10 -maxr 13062) (-i = input file, -o = output folder, -min = minimum read length, -max = maximum read length, -np = number of cores, -maxr = maximum number of reads to use). Lower quality reads are less likely to be assigned to a group (Table A2, Figure A2).
The percentage of all reads within a barcode that were used to create the consensus is shown. For this concatenated dataset of 7 barcodes, the "random" setting of the program was used. Because by default Amplicon_sorter samples several times 1000 reads from the input file, some reads are selected multiple times from the pool while others are never selected. This results in an average of 60% of reads that are used for consensus creation per barcode when sampling 100% of the number of reads from the high-quality pool.
When sampling the low-quality pool, only 43% of the reads are recovered. When choosing the "compare all" option, there are no duplicate reads in the comparison. This results in 68% on average for the low-quality dataset to 96% for the high-quality reads. We can conclude that Amplicon_sorter has a high sorting and recovery capability for the reads in the sample.

| Separation limits
To further test the potential and boundaries of Amplicon_sorter, we generated a new ONT sequence data set using a specific set of amplicons and species that allowed us to cover several questions. The first goal was to combine amplicons of up to three genes per barcode to test if Amplicon_sorter could distinguish them and how accurate the resulting consensus would be compared to the Sanger reference sequence. The second goal was to detect the separation limit of Amplicon_sorter for a given gene of closely related species. In our third goal, we wanted to test whether long amplicons can be sequenced with only a part of that amplicon being available as reference to check the consensus accuracy. Our ONT sequence data set was comprised of several barcoded amplicons (spacer and COI) from two mollusks and several insect species with similarities ranging from 85 to 100% (Tables A3 and A4).

| Amplicon_sorter, ONTBarcoder, and NGSpeciesID output for separate barcodes
In a first approach, we tested the separation limit of Amplicon_ sorter, ONTbarcoder, and NGSpeciesID using our demultiplexed ONT sequence data set. Reads were selected with NanoFilt for a quality score of minimum 12 and demultiplexing was done with Minibar. Each barcode sample contained up to three genes (COI 700 bp, spacer 750 bp, and some a tandem repeat part of 2800 bp).
Amplicon_sorter was able to sort the reads and build the consensus for each gene of which we had the complete Sanger sequence with an accuracy between 98.2% and 100% (  When considering raw reads in a group, the similarity to the Sanger reference varies between 86% and 98% (Table A6, Figure   A3), which may explain the species separation limit of around 95%-96% similarity. When species with over 95% similarity occur in the same pool, the consensus sequence will have a lower similarity to the Sanger reference (if available) because of the averaging effects.
If the accuracy of the basecaller will further improve, especially for homopolymer calling, this limit will likely increase.
While the other tools only search for the most abundant cluster(s), Amplicon_sorter searches for everything. As a result, for amplicons from low-quality PCR, it may produce more/false/redundant consensus sequences than the other tools, sometimes even multiple consensus sequences for the same species. If the initial PCR is of high quality, the result should be one consensus per species. If the PCR is less successful (smear, multiple bands), Amplicon_sorter produces multiple consensus sequences ( Figure   A4). These can have the same similarity with the Sanger reference, but still contain an adapter or primer at one side that was missed by trimming. In case the PCR produced incomplete amplicons, also shorter consensus sequences from the same species are

| DISCUSS ION
Amplicon_sorter creates high-quality consensus sequences for barcoded or non-barcoded amplicons sequenced with ONT. When compared to programs with similar purpose, the consensus sequences have a similar or higher quality which is mostly between 99% and 100%. It is remarkable that Amplicon_sorter by default is

TA B L E 3 Species separated by Amplicon_sorter and NGSspeciesID from a pooled dataset
Amplicon_sorter has been tested on two datasets (Srivathsan et al., 2021b) containing 511 and 9929 species with numbers of reads used ranging from 100,000 to 568,000 ( Figure A5, Table A7). The memory usage peaked to 80 GB when creating the species groups using the highest number of reads. Using a higher number of reads is necessary to have sufficient read coverage for each species. Analyzing datasets with a large number of species is limited by the amount of available memory on the computer.
In mixed samples, Amplicon_sorter can find low abundance samples (1.5%) with the default settings, and if the option to compare all sequences with each other (all) is used, even lower abundance species can be recovered. This option is computation intensive and is discouraged for samples with more than 100,000 reads. By default, Amplicon_sorter compares the reads in batches of 1000 sequences with each other to speed up the process.

Amplicon_sorter outperforms NGSpeciesID and ONTbarcoder
when processing metagenetic samples which contain several amplicons of the same or different length from distant or closely related species. The separating limit is around 95 or 96% depending on the type of flow cell and basecaller version used. There is no need to specifically indicate an expected amplicon length, instead a range with minimum and maximum length can be entered to search for all possible amplicons within that range.

| CON CLUS ION
Amplicon_sorter is an easy-to-use tool to group sequences to species or genus level without the need for reference sequences. It automatically creates a consensus sequence for each group of reads. It can be used for samples where only one species is present or samples with several species and genes with different lengths. The limit for separating closely related species within a sample is currently around 95%-96%.

ACK N OWLED G M ENTS
The authors thank the editor and reviewers for the helpful comments to improve this manuscript.

CO N FLI C T O F I NTE R E S T
The authors declare no conflicts of interest. Writing -review & editing (supporting).

AVA I L A B I LIT Y A N D I M PLE M E NTATI O N
Amplicon_sorter is written in Python3 and released under the GNU GPL 3.0 License. The source code and documentation are available at https://github.com/avier str/ampli con_sorter. The script is written for Linux/Unix/MacOSx and is a command line tool.

DATA AVA I L A B I L I T Y S TAT E M E N T
The sequencing data generated for this study is available at Dryad

F I G U R E A 3
Comparison of a few raw ONT spacer reads from Cordulegaster insignis and C. mzymtae with the Sanger references. There are no consistent differences between insignis and mzymtae reads because of the high error rate (red rectangles in the alignment indicate the differences in Sanger sequence between both species). This is likely the reason why the script cannot separate these reads into specific groups F I G U R E A 4 Example of false/redundant consensus sequences produced by Amplicon_sorter. (a) four consensus sequences of the same species. The first two with similar identity to the Sanger reference and more than 300 reads. The third and fourth have a much lower similarity. (b) alignment of those reads. The first consensus has 40 extra bases at the 5' end, the second read has 40 extra bases more at the 3' end. The middle parts of both reads are almost identical. The third sequence differs in many positions and is built from 152 reads with similar errors. The last sequence differs even more and is a consensus built from only two reads

TA B L E A 2
The percentage of all reads within a barcode that are used to create the Amplicon_sorter consensus sequence depends on quality score, the fraction of total reads sampled, and sampling strategy (random sampling (R) vs all reads (A)). For quality scores Q7, Q9, and Q12, the abundance (%) of reads in each barcode pool is added. The average for all barcodes is shown in the last two columns