Abstract
Sequence alignment is an essential method in bioinformatics and the basis of many analyses, including phylogenetic inference, ancestral sequence reconstruction, and gene annotation. Sequencing artifacts and errors made during genome assembly, such as abiological frameshifts and incorrect early stop codons, can impact downstream analyses leading to erroneous conclusions in comparative and functional genomic studies. More significantly, while indels can occur both within and between codons in natural sequences, most amino-acid- and codon-based aligners assume that indels only occur between codons. This mismatch between biology and alignment algorithms produces suboptimal alignments and errors in downstream analyses. To address these issues, we present COATi, a statistical, codon-aware pairwise aligner that supports complex insertion–deletion models and can handle artifacts present in genomic data. COATi allows users to reduce the amount of discarded data while generating more accurate sequence alignments. COATi can infer indels both within and between codons, leading to improved sequence alignments. We applied COATi to a dataset containing orthologous protein-coding sequences from humans and gorillas and conclude that 41% of indels occurred between codons, agreeing with previous work in other species. We also applied COATi to semiempirical benchmark alignments and find that it outperforms several popular alignment programs on several measures of alignment quality and accuracy.
Keywords: pairwise alignment, coding sequences, indel phases, statistical alignment, codon models
Introduction
Sequence alignment is a fundamental task in bioinformatics and a cornerstone step in comparative and functional genomic analysis (Rosenberg 2009). While sophisticated advancements have been made, the challenge of alignment inference has not been fully solved (Morrison 2015). The alignment of protein-coding DNA sequences is one such challenge, and a common approach to this problem is to perform alignment inference in amino-acid space (e.g. Bininda-Emonds 2005; Abascal et al. 2010). While this approach is an improvement over DNA models, it discards information, underperforms compared to alignment at the codon level, and fails in the presence of artifacts, such as frameshifts and early stop codons. While some aligners can utilize codon substitution models, they are often not robust against coding-sequence artifacts.
Within protein-coding sequences, indels may occur between any pair of adjacent nucleotides, and therefore, gaps in alignments of natural sequences may occur both between and within codons (Fig. 1). In our nomenclature, gaps that start after the first, second, and third positions of a codon are known as phase-1, phase-2, and phase-3 gaps, respectively. Here, gaps that occur between codons are phase-3 gaps, whereas they may be known as phase-0 gaps in earlier studies (e.g. Taylor et al. 2004). Indels that occur within codons can produce an amino-acid change in addition to adding or removing nucleotides (Fig. 1), and due to the structure of the genetic code, phase-1 gaps are less common than phase-2, which are less common than phase-3 (Taylor et al. 2004; Zhu 2022). While all three phases of gaps occur in natural sequences, alignments performed in amino-acid or codon space force all gaps to be phase-3 gaps. Because only about 42% of indels are phase-3 (Taylor et al. 2004; Zhu 2022), this mismatch between aligner assumptions and biology can produce suboptimal alignments and inflated estimates of sequence divergence (Fig. 1; Redelings and Suchard 2007).
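The phase nomenclature above can be illustrated with a minimal sketch. The function name `gap_phase` and the convention of counting the reference nucleotides that precede a gap are assumptions made for illustration; they are not taken from COATi's source code.

```python
def gap_phase(ref_nucleotides_before_gap: int) -> int:
    """Return the phase (1, 2, or 3) of a gap in a coding sequence.

    The argument is the number of in-frame reference nucleotides that
    precede the gap. A gap that starts after the first position of a codon
    is phase-1, after the second is phase-2, and after the third (that is,
    between codons) is phase-3.
    """
    if ref_nucleotides_before_gap < 0:
        raise ValueError("position must be non-negative")
    remainder = ref_nucleotides_before_gap % 3
    # A remainder of 0 means the gap sits on a codon boundary (phase-3,
    # called phase-0 in some earlier studies).
    return 3 if remainder == 0 else remainder
```

Under this convention, a gap at the very start of a sequence is treated as a phase-3 gap because it falls on a codon boundary.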
Fig. 1.
Standard algorithms produce suboptimal alignments. a) The true alignment of an ancestral sequence (A) and a descendant sequence (D). b) to d) The results of different aligners. Nucleotide mismatches are highlighted in red. Notably, COATi is the only aligner able to retrieve the biological alignment in this example. Indels in protein-coding sequences can be classified as having one of three different phases and being one of two different types. Phases refer to the location of the gap with respect to the reading frame, while types refer to the consequence of the indel. Phase-1, phase-2, and phase-3 indels are shown in blue, orange, and green, respectively. Additionally, the orange indel is type-II (an amino-acid indel plus an amino-acid change) while the blue indel is type-I (an amino-acid indel only). The difference between an in-frame and a frameshift indel is not displayed.
Bioinformatic pipelines need to be robust to variation in quality across genomic datasets because uncorrected errors in the alignment stage can lead to erroneous results in comparative and functional genomic studies (Schneider et al. 2009; Fletcher and Yang 2010; Hubisz et al. 2011). While genomes for model organisms are often refined over many iterations and contain meticulously curated protein-coding sequences, genomes for nonmodel organisms might only receive partial curation and typically have lower-quality sequences and annotations. These genomes often lack the amount of sequencing data needed to fix artifacts, including missing exons, erroneous mutations, and erroneous indels (Jackman et al. 2018). When comparative and functional genomics studies include data from nonmodel organisms, care must be taken to identify and manage such artifacts; however, current alignment methods are ill-equipped to handle common artifacts in genomic data, requiring costly curation practices that discard significant amounts of information.
To address current limitations of alignment software to accurately align protein-coding sequences, we present COATi, short for COdon-aware Alignment Transducer, a pairwise statistical aligner that incorporates evolutionary models for protein-coding sequences and is robust to artifacts present in modern genomic datasets. We find that COATi generates biologically reasonable alignments and downstream inferences when applied to empirical data. We also find that COATi generates more accurate alignments when applied to a semiempirical benchmark dataset.
Methods
Statistical Alignment via Finite State Transducers
In statistical alignment, sequence alignments are scored based on a stochastic model, typically derived from molecular evolutionary processes (Hein 2001; Holmes and Bruno 2001; Lunter et al. 2005; Holmes 2020; De Maio 2021). An advantage of statistical alignment is that its parameters are derived from biological processes, allowing them to be estimated directly from data or extracted from previous studies. While approaches vary, a statistical aligner for a pair of sequences, X and Y, typically finds an alignment, Aln, that maximizes the joint probability $P(\mathrm{Aln}, X, Y)$ or samples alignments from the posterior $P(\mathrm{Aln} \mid X, Y)$. This is typically performed using pairwise hidden Markov models (pair-HMMs; Holmes and Bruno 2001; Redelings and Suchard 2007; Cartwright 2009). Pair-HMMs are computational machines with two output tapes. Each tape represents one sequence, and a path through the pair-HMM represents an alignment of the two sequences. Conceptually, pair-HMMs generate two sequences from an unknown ancestor and can calculate the joint probability $P(X, Y)$ (Yoon 2009).
While the use of pair-HMMs is ubiquitous in bioinformatics, they are limited to modeling the evolution of two related sequences from an unknown ancestor. As an alternative, finite-state transducers (FSTs, Fig. 2) allow researchers to model the evolution of a descendant sequence from an ancestral sequence. FSTs are computational machines with one input tape and one output tape and provide similar benefits to pair-HMMs, while being more suitable for evolutionary models (Bradley and Holmes 2007). FSTs consume symbols from an input tape and emit symbols to an output tape based on the symbols consumed and the structure of the FST. Conceptually, FSTs generate a descendant sequence, D, from a known ancestor, A, and can calculate the conditional probability $P(D \mid A)$.
Fig. 2.
FSTs model the generation of an output sequence based on an input sequence. a) A graph of a probabilistic FST (Cotterell et al. 2014) for base-calling errors using a Mealy-machine architecture, where parameter u is the error rate. This graph contains two states (S and M) connected by arcs, with labels “input symbols : output symbols/weight.” Arcs consume symbols from the input sequence and emit symbols to the output sequence. Weights describe the probability that an arc is taken given the input symbols. Epsilon ($\varepsilon$) is a special symbol denoting that no symbols were either consumed or emitted. b) An FST for matching sequences against ambiguous nucleotides (N). c) An FST that results from the composition ($\circ$ operation) of the Error FST with the Ambiguity FST.
Well-established algorithms exist for combining FSTs, including concatenation, composition, intersection, union, and reversal, which allow complex models to be designed from simpler FSTs (Bradley and Holmes 2007; Silvestre-Ryan et al. 2021). Specifically, composition is an algorithm to combine two FSTs by sending the output of one FST into the input of another, creating a new, more complex transducer (Mohri et al. 1996). Figure 2 illustrates how FSTs modeling sequencing errors (Fig. 2a) and ambiguity (Fig. 2b) can be combined via composition to produce an FST that does both (Fig. 2c). Conceptually, composition creates an FST that generates a descendant sequence from a known ancestor via an unknown intermediate, J, and can calculate the conditional probability $P(D \mid A) = \sum_{J} P(D \mid J)\, P(J \mid A)$.
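The idea behind composition can be sketched for the simplest possible case: two single-state, per-nucleotide channels with no epsilon arcs, where composition reduces to summing over the intermediate symbol. This is an illustrative simplification and not how OpenFst or COATi implement composition; the function names are hypothetical, and splitting the error rate u evenly across the three incorrect nucleotides is an assumption.

```python
NUCS = "ACGT"

def error_channel(u):
    """Weight of emitting j given input a; assumes errors are split as u/3."""
    return {a: {j: (1.0 - u if j == a else u / 3.0) for j in NUCS} for a in NUCS}

def ambiguity_channel():
    """Weight 1 for emitting the same nucleotide or 'N', and 0 otherwise."""
    return {j: {d: (1.0 if d in (j, "N") else 0.0) for d in NUCS + "N"}
            for j in NUCS}

def compose(first, second):
    """Send the output of `first` into `second`, summing over intermediates."""
    composed = {}
    for a, row in first.items():
        composed[a] = {}
        for j, w1 in row.items():
            for d, w2 in second[j].items():
                composed[a][d] = composed[a].get(d, 0.0) + w1 * w2
    return composed

def weight_of(channel, ancestor, descendant):
    """Weight of an ungapped, equal-length descendant given an ancestor."""
    if len(ancestor) != len(descendant):
        raise ValueError("this sketch handles equal-length sequences only")
    total = 1.0
    for a, d in zip(ancestor, descendant):
        total *= channel[a][d]
    return total
```

In this sketch the composed weights for a given input do not sum to one, because emitting N always has weight 1. Full FST composition must additionally track pairs of states and handle epsilon arcs.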
The COATi FST
COATi aligns pairs of sequences using a statistical alignment model, which is implemented as an FST derived from the composition of multiple FSTs (Figs. 2 and 3), each representing a specific biological or technical process: (i) the codon substitution FST, (ii) the indel FST, (iii) the error FST, and (iv) the ambiguity FST. We call this transducer the COATi FST. Codon substitution models are uncommon in sequence aligners, despite their extensive use in phylogenetics. COATi implements both the Muse and Gaut (1994) codon model (codon-triplet-mg) and the Empirical Codon Model (codon-triplet-ecm; Kosiol et al. 2007). It also lets the user provide a codon substitution matrix. A key innovation of COATi is that it combines a codon substitution model with a nucleotide-based indel model, allowing gaps to occur both between and within codons (see Hein 1994; Arvestad 1997; Pedersen et al. 1998; Ranwez et al. 2011, 2018 for earlier approaches). This also allows the aligner to be robust against sequencing artifacts that produce sequences with disrupted reading frames.
Fig. 3.
The COATi FST is built from simpler FSTs via composition. a) The substitution FST encodes a codon substitution model with 3721 arcs from S to M. These arcs consume three nucleotides from the input tape and emit three nucleotides to the output tape. The weight of each arc is a conditional probability derived from a codon substitution model. See Fig. 2 for more details about reading this graph. b) The indel FST allows for insertions (H to I) and deletions (C to D). Here g is the gap-opening parameter and e is the gap-extension parameter. Insertion arcs are weighted according to the codon model’s stationary distribution of nucleotides, and deletion arcs have a weight of 1. This FST is structured such that if insertions and deletions are contiguous, insertions will precede deletions (cf. Holmes and Bruno 2001; De Maio 2021). c) The COATi FST is derived via composition from the codon substitution, indel, error, and ambiguity FSTs.
COATi FST is not a true probabilistic FST (Cotterell et al. 2014) and cannot be used as-is to simulate output sequences based on an input sequence. This is because it is missing a parameter to control how often Ns are added to the output sequence. This is a design feature to allow COATi FST to properly weight ambiguous nucleotides as representing any other symbol. In addition, COATi’s indel FST (Fig. 3b) implicitly assumes that match states occur both immediately before and immediately after the alignment. This allows COATi FST to assign the same weight to gaps at the beginning and the end of an alignment but also introduces a normalization constant (not shown) to reflect that mass is lost by not allowing the indel FST to terminate from node C. While this normalization constant is not needed to find the most likely alignment or to sample alignments, it can be calculated using a Markov model that considers only the transitions between states M and D in the indel FST (Fig. 3b) and eliding any states in between:
where g is the gap-opening parameter, e is the gap-extension parameter, and n is the length of the input sequence. The exponent has a $+1$ due to the presence of an “immortal match” immediately after the input sequence.
Since codon substitution is the first process in COATi’s model, the input sequence to the COATi FST must be compatible with a codon substitution model, i.e. its length must be a multiple of three nucleotides and it must not contain any ambiguous symbols or stop codons. The phases, reading frames, and amino-acid contexts of alignment columns are determined by the input sequence, and better alignments will be generated if the input is of high quality and free of artifacts. Depending on context, we may refer to the input sequence as the “ancestral” or “reference” sequence. In contrast, the output sequence must be compatible with the ambiguity FST, can be of any length, and can contain any nucleotides or “N.” This allows COATi to align lower-quality sequences that may contain artifacts against a high-quality reference sequence. We may refer to the output sequence as the “descendant” or “nonreference” sequence. The choice of which sequence is the input sequence and which sequence is the output sequence is left up to the user.
In order to use the COATi FST to align an output sequence against an input sequence, we first convert each sequence into an acceptor, represented as a linear transducer where the input and output symbols of each transition are identical and each transition represents one nucleotide of a sequence (Allauzen et al. 2007). By composing the input and output acceptors with the COATi FST, we generate a transducer of all possible alignments of the two sequences. Any path through this FST represents a pairwise alignment, while the shortest path (by weight) corresponds to the best alignment. If more than one optimal alignment exists, ties are broken according to the implementation of the shortest-path algorithm. All FST operations in COATi, including model development, composition, search for the shortest path, and other optimization algorithms, are performed using the C++ openFST library (Allauzen et al. 2007). An example of an FST-based alignment can be found in supplementary fig. S1, Supplementary Material online.
The Marginal Model
The COATi FST has a large state space to keep track of codon substitution rates when codons can be interspersed with indel events. This additional state space increases the computational complexity of the alignment algorithm. To reduce the runtime complexity of COATi, we have also developed an approximation of the COATi FST that can be implemented with standard dynamic programming techniques. This approximation uses a marginal substitution model where the output nucleotides are independent of one another and only depend on the input codon and position. This produces a ($61 \times 3 \times 4$) substitution model and eliminates the need to track dependencies between output nucleotides.
A marginal substitution model is calculated from a standard substitution model by calculating the marginal probabilities that each ancestral codon produces specific descendant nucleotides at each reading frame position. Specifically, let
$$P_{\mathrm{cod}}(Y_1 Y_2 Y_3 \mid X_1 X_2 X_3)$$
represent transition probabilities from a codon model, and
$$P_{\mathrm{mar}}(Y_p = y \mid X_1 X_2 X_3) = \sum_{Y_1 Y_2 Y_3} I(Y_p = y)\, P_{\mathrm{cod}}(Y_1 Y_2 Y_3 \mid X_1 X_2 X_3)$$
represent the marginal transition probabilities, where $p \in \{1, 2, 3\}$ is the position of the descendant nucleotide relative to the ancestral reading frame and $I$ is an indicator function. COATi contains marginal models for both Muse and Gaut (1994) and the Empirical Codon Model (Kosiol et al. 2007), resulting in the marginal models codon-marginal-mg and codon-marginal-ecm. These models emphasize the position in a codon where the substitution occurs, help restrict the effects of low-quality data in the descendant sequence, and allow more than one substitution per codon. In combination with the indel model, alignment using the marginal model is implemented using dynamic programming.
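The marginalization described above can be sketched as follows. The codon transition probabilities here come from a deliberately artificial toy model (a fixed probability of staying at the same codon and a uniform probability of changing to any other sense codon), which is an assumption for illustration only; it is not the Muse and Gaut or ECM model, and the function names are hypothetical.

```python
from itertools import product

NUCS = "ACGT"
STOPS = {"TAA", "TAG", "TGA"}  # standard genetic code
SENSE = ["".join(c) for c in product(NUCS, repeat=3) if "".join(c) not in STOPS]

def toy_codon_model(stay=0.94):
    """Toy P(descendant codon | ancestral codon); NOT a real codon model."""
    off = (1.0 - stay) / (len(SENSE) - 1)
    return {x: {y: (stay if x == y else off) for y in SENSE} for x in SENSE}

def marginalize(p_cod):
    """Collapse codon transition probabilities into per-position marginals.

    Returns p_mar[x][p][nuc], the probability that ancestral codon x yields
    nucleotide nuc at descendant position p (0-based here, so p is 0, 1, 2).
    """
    p_mar = {}
    for x, row in p_cod.items():
        p_mar[x] = [dict.fromkeys(NUCS, 0.0) for _ in range(3)]
        for y, prob in row.items():
            for p in range(3):
                # Add this codon's probability to the nucleotide it carries
                # at position p; this is the indicator-weighted sum.
                p_mar[x][p][y[p]] += prob
    return p_mar
```

The result has 61 ancestral codons, 3 positions, and 4 nucleotides, and each codon-position pair defines a probability distribution over nucleotides.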
Empirical Dataset and Alignments
Humans and gorillas are two closely related species with very different levels of genome curation. The human reference genome has been revised dozens of times and is currently on version GRCh38.p14, while the gorilla reference genome has only been revised a handful of times and is currently on version gorGor4 (cf. ENSEMBL database v110; Hubbard et al. 2002). Additionally, significant levels of investment have been made to correctly identify and annotate human genes, while gorilla annotations have received limited support in comparison. Together, these reference genomes provide a good opportunity to compare COATi against other aligners as they offer one genome that is high-quality (human) and a sister genome that is lower-quality (gorilla).
We used ENSEMBL database v110 (Hubbard et al. 2002) to create an empirical dataset of protein-coding sequences for both human genes and their gorilla orthologs. We first selected human protein-coding genes that belonged to the Consensus Coding Sequence Set, that were located on an autosomal chromosome, and that had a one-to-one gorilla ortholog. We selected the canonical isoform for both species and removed any pair in which the total nucleotide length was larger than 6,000 nucleotides. This resulted in 14,127 sequence pairs and corresponding FASTA files containing CDS sequences. Due to the way that canonical isoforms are identified, there is no guarantee that the isoforms are orthologous even though the genes are. Therefore, a subset of the sequence pairs in this dataset contain human and gorilla sequences with different exon compositions. We have made no attempt to correct these sequence pairs because in our experience genome-wide studies rarely control for such artifacts.
Alignment Methods
In order to compare COATi against other aligners, we evaluated five different alignment models: (i) COATi’s FST model (i.e. codon-triplet-mg), (ii) the amino-acid model of ClustalΩ (Sievers et al. 2011) via translation and reverse translation, (iii) the amino-acid aware nucleotide model of MACSE (Ranwez et al. 2018), (iv) the DNA model of MAFFT (Katoh and Standley 2013), and (v) the codon model of PRANK (Löytynoja 2014). COATi is not a symmetric pairwise aligner, as reference sequences are more constrained than nonreference sequences. In order to evaluate the importance of the choice for reference sequence, we also aligned sequence pairs using (vi) COATi with gorilla sequences as the reference (i.e. COATi-rev). Together, these six methods allow us to evaluate both different alignment strategies and different software implementations. See supplemental materials for additional details, including results for COATi’s marginal and ECM models.
We aligned our dataset of human–gorilla orthologs using all six alignment methods and calculated multiple biological and technical statistics on each alignment in order to compare how different alignment methods influence biological conclusions. Additionally, we checked alignments to ensure that our pipeline did not introduce any artifacts into the results, including unexpected characters, empty columns, and aligned sequences of different lengths. We also generated checksums of our sequences with gaps removed to ensure that they were not modified during alignment.
Evolutionary Distances and Gaps
Alignment inference impacts the estimation of evolutionary distances (Redelings and Suchard 2007). To quantify the impact of aligners on the estimation of evolutionary distances, we calculated Kimura’s 2-parameter (K2P; Kimura 1980) distance for each sequence pair and aligner combination. K2P distances correct for multiple mutations at a site and take into account differences in transition and transversion rates. K2P also assumes equal nucleotide frequencies and no variation in the rate of substitution across sites. While K2P distances are more suitable for noncoding sequences, they are straightforward to calculate and provide a quantitative measure of the evolutionary divergence between sequences. We also calculated p-distances (Saitou and Nei 1987) for each sequence pair and aligner combination, which are simply the proportions of sites that differ between two sequences. We used the R (R Core Team 2024) software package ape (Paradis and Schliep 2019) to calculate the distances. Since aligners influence evolutionary distances based on their tendencies to insert gaps into sequences, we also quantified the phases and lengths of gaps introduced by each method as well as the fraction of nucleotides that are aligned against a match, mismatch, or gap.
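For readers who want to see the two distances side by side, the following minimal sketch computes the p-distance and the K2P distance from a pairwise alignment using the standard formula $d = -\tfrac{1}{2}\ln\left[(1 - 2P - Q)\sqrt{1 - 2Q}\right]$, where P and Q are the proportions of transitions and transversions. It is not the ape implementation used in the study; skipping columns that contain gaps or ambiguous symbols is an assumption of this sketch, and the function name is hypothetical.

```python
import math

PURINES = {"A", "G"}

def pairwise_distances(row1, row2):
    """Return (p_distance, k2p_distance) for two aligned sequences."""
    if len(row1) != len(row2):
        raise ValueError("aligned sequences must have the same length")
    sites = transitions = transversions = 0
    for a, b in zip(row1.upper(), row2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: ignore gaps and ambiguous symbols
        sites += 1
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    P = transitions / sites
    Q = transversions / sites
    inner = (1.0 - 2.0 * P - Q) * math.sqrt(max(1.0 - 2.0 * Q, 0.0))
    if inner <= 0.0:
        raise ValueError("sequences too divergent for the K2P correction")
    return P + Q, -0.5 * math.log(inner)
```

For closely related sequences such as human and gorilla orthologs, the two distances are expected to be nearly identical because few sites will have experienced multiple substitutions.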
Selection
Alignment inference also impacts the identification of genes that have experienced positive and negative selection. To evaluate the impact of aligners on selection identification, we estimated $K_s$ and $K_a$ statistics (also known as $d_S$ and $d_N$) for each estimated alignment. Briefly, $K_s$ is the number of substitutions per synonymous site and $K_a$ is the number of substitutions per nonsynonymous site between two protein-coding sequences. We used the method developed by Li (1993) and independently derived by Pamilo and Bianchi (1993) to estimate these statistics as implemented in the R package seqinr (Charif and Lobry 2007).
Briefly, this method takes two aligned sequences and classifies the sites in each sequence as nondegenerate, 2-fold degenerate, or 4-fold degenerate based on the standard genetic code. A site is nondegenerate if 0/3 possible nucleotide changes to that site are synonymous, 2-fold degenerate if 1/3 of the possible changes are synonymous, and 4-fold degenerate if 3/3 possible changes are synonymous. (The rare 3-fold degenerate sites are treated as 2-fold degenerate in this method.) First the method calculates the average numbers of sites of each type in the sequence pair ($L_0$, $L_2$, and $L_4$). Next, following Li et al. (1985), it uses the K2P model (Kimura 1980) and the alignment to estimate the numbers of transitions ($A_i$) and transversions ($B_i$) that occurred per site of each ith type. Using these statistics, Li (1993) estimated $K_s$ and $K_a$ as
$$K_s = \frac{L_2 A_2 + L_4 A_4}{L_2 + L_4} + B_4 \quad \text{and} \quad K_a = A_0 + \frac{L_0 B_0 + L_2 B_2}{L_0 + L_2}.$$
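The final step of the Li (1993) and Pamilo and Bianchi (1993) estimator can be written out as a short sketch. It assumes that the site counts and the per-class transition and transversion estimates have already been computed (for example by seqinr) and only combines them; the function names and the numbers in the usage note are illustrative and not taken from the study.

```python
def lpb93(L, A, B):
    """Combine per-class statistics into (Ka, Ks).

    L, A, and B are dicts keyed by degeneracy class (0, 2, 4) holding the
    average number of sites, the transitional distance, and the
    transversional distance for that class, respectively.
    """
    # Synonymous changes: transitions at 2-fold and 4-fold sites,
    # plus transversions at 4-fold sites.
    ks = (L[2] * A[2] + L[4] * A[4]) / (L[2] + L[4]) + B[4]
    # Nonsynonymous changes: transitions at nondegenerate sites,
    # plus transversions at nondegenerate and 2-fold sites.
    ka = A[0] + (L[0] * B[0] + L[2] * B[2]) / (L[0] + L[2])
    return ka, ks

def selection_call(ka, ks):
    """Label a sequence pair by comparing Ka with Ks."""
    if ka > ks:
        return "positive"
    if ka < ks:
        return "negative"
    return "neutral"
```

For example, with the made-up inputs `L = {0: 200, 2: 60, 4: 40}`, `A = {0: 0.01, 2: 0.03, 4: 0.05}`, and `B = {0: 0.005, 2: 0.01, 4: 0.02}`, the synonymous rate exceeds the nonsynonymous rate and the pair would be labeled as showing negative selection.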
We considered any alignment to be showing evidence of positive selection if $K_a / K_s > 1$ and negative selection if $K_a / K_s < 1$. We estimated F1 scores for both positive and negative selection by comparing estimated alignments to benchmark alignments. F1 is the harmonic mean of precision and recall:
$$F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}},$$
where TP is the number of alignments that correctly predicted positive (or negative) selection, FP is the number of alignments that incorrectly predicted positive (or negative) selection, and FN is the number of alignments that incorrectly predicted the absence of positive (or negative) selection. F1 allows us to measure how well aligners produce alignments that correctly identify the presence and absence of positive selection or negative selection.
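As a small worked example of the F1 calculation, the sketch below counts true positives, false positives, and false negatives from paired boolean calls (one per sequence pair) and combines them. The function names are hypothetical, and returning 0.0 when the score is undefined is a convention chosen for this sketch rather than something stated in the text.

```python
def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when undefined (assumption)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def f1_from_calls(estimated, benchmark):
    """Compare per-pair selection calls from estimated and benchmark alignments.

    Each argument is a sequence of booleans: True if the alignment for that
    sequence pair showed evidence of the type of selection being scored.
    """
    pairs = list(zip(estimated, benchmark))
    tp = sum(1 for e, b in pairs if e and b)
    fp = sum(1 for e, b in pairs if e and not b)
    fn = sum(1 for e, b in pairs if not e and b)
    return f1_score(tp, fp, fn)
```

Scoring positive and negative selection separately simply means calling `f1_from_calls` twice with the corresponding boolean calls.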
Semiempirical Benchmark Alignments
In order to compare COATi against other aligners on realistic datasets with known alignments, we developed a procedure to introduce realistic gap patterns into human–gorilla orthologous gene pairs that did not previously contain indels. We separated our 14,127 empirical sequence pairs into two sets: (i) 8,261 sequence pairs that did not contain any gaps when aligned by COATi FST, ClustalΩ, MACSE, MAFFT, and PRANK, and (ii) 5,821 sequence pairs for which at least one of these five aligners added gaps. (The 45 pairs that could not be aligned by at least one aligner were dropped.) We identified gap patterns from the sequence pairs that were aligned with gaps and randomly introduced them into the ungapped sequence pairs. We used an equal number of randomly sampled gap patterns from each aligner to produce a benchmark dataset of known alignments. We preserved the phases of gaps and the spacings of any clusters of gaps. Segments of matches that were 96 nucleotides or longer were allowed to change length to accommodate the length of the ungapped sequence pair. (This criterion was recursively lowered if the gap pattern did not fit the ungapped sequence pair.)
For example, consider the gapped alignment represented by the CIGAR string “170M 3D 10M 6I 102M” applied to an ungapped alignment of human and gorilla sequences that are both 300 nucleotides long. First, we modified the CIGAR string to insert flexible lengths for any match segment that is 96 or more nucleotides long while preserving phase. This results in a new CIGAR string of “98M *M 3D 10M 6I 96M *M,” where * represents the locations that have flexible length. Considering only matches and deletions, this CIGAR string has 207 fixed nucleotides, leaving 93 nucleotides in the human sequence to be allocated to the flexible locations. Since there are two locations of flexible length, we drew one random break point uniformly while maintaining phase, producing the final CIGAR string of “98M 12M 3D 10M 6I 96M 81M”, which was used to add gaps into the target, ungapped sequence pair. To apply deletions, we removed the corresponding nucleotides from the gorilla sequence, and to apply insertions, we added nucleotides to the gorilla sequence by sampling codons from their stationary frequency while respecting gap phases. The human sequence was left unchanged. We generated a benchmark dataset of 8,261 sequence pairs with known alignments using this strategy.
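The CIGAR manipulation in this example can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the authors' pipeline: the function names are hypothetical, flexible segments are marked with `None` in place of `*`, break points are drawn in whole codons to maintain phase, and the recursive lowering of the 96-nucleotide threshold is not implemented.

```python
import random
import re

def parse_cigar(text):
    """Parse a string such as '170M 3D 10M 6I 102M' into (length, op) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MID])", text)]

def add_flexible_segments(ops, min_len=96):
    """Split each long match into a fixed part and a flexible slot (None)."""
    flexible = []
    for length, op in ops:
        if op == "M" and length >= min_len:
            # Keep the same remainder modulo 3 so that gap phases survive.
            fixed = min_len + (length - min_len) % 3
            flexible.append((fixed, "M"))
            flexible.append((None, "M"))
        else:
            flexible.append((length, op))
    return flexible

def fit_to_reference(flexible, ref_len, rng):
    """Allocate leftover reference nucleotides to the flexible slots."""
    fixed = sum(n for n, op in flexible if n is not None and op in "MD")
    free = ref_len - fixed
    slots = sum(1 for n, _ in flexible if n is None)
    if free < 0 or free % 3 != 0 or (slots == 0 and free > 0):
        raise ValueError("gap pattern does not fit the reference sequence")
    codons = free // 3
    # Draw slots - 1 break points uniformly, measured in codons.
    breaks = sorted(rng.randint(0, codons) for _ in range(slots - 1))
    edges = [0] + breaks + [codons]
    sizes = [3 * (hi - lo) for lo, hi in zip(edges, edges[1:])]
    return [(sizes.pop(0) if n is None else n, op) for n, op in flexible]
```

Applied to the example above with a 300-nucleotide human sequence, the two flexible slots always receive a total of 93 nucleotides, each in a multiple of three; a slot may receive zero nucleotides in this sketch.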
Measuring Aligner Accuracy
We aligned our benchmark alignments using our suite of aligners and quantified the similarity of each estimated alignment to its respective benchmark. As above, we measured the K2P distance (Kimura 1980) and $K_a / K_s$ (Li 1993) for the benchmark alignments and estimated alignments. Additionally, we quantified the error of estimated alignments using the $d_{seq}$ metric (Blackburne and Whelan 2011).
Intuitively, $d_{seq}$ ranges between zero and one and can be interpreted as the fraction of nucleotides in the sequence pair that are aligned differently between estimated and benchmark alignments. This metric summarizes each alignment by building homology sets for each nucleotide in the sequence pair. Briefly, if an alignment column contains position i from the first sequence and position j from the second sequence, then $\{j\}$ is the homology set for position i in the first sequence and $\{i\}$ is the homology set for position j in the second sequence. If a position is aligned against a gap, then its homology set is simply $\{-\}$. This treats all gaps equally as the location of the gap is not recorded. Finally, $d_{seq}$ is calculated between an estimated and a benchmark alignment as the average, normalized Hamming distance between the homology sets for each nucleotide in the sequence pair (Blackburne and Whelan 2011).
Using $d_{seq}$, we quantified the number of perfect, imperfect, and best alignments each aligner produced. We define perfect alignments as alignments with a distance of zero to the benchmark alignment ($d_{seq} = 0$) or any alignment that is evolutionarily equivalent to an alignment with a distance of zero. We included equivalency in our definition of perfect alignments to prevent the manner by which aligners break ties from influencing our results. Evolutionary equivalence was determined by scoring alignments using COATi’s marginal model. Any alignment that had a score which matched the score of its benchmark alignment was considered perfect even if its distance to the benchmark alignment was greater than zero. Imperfect alignments are defined as alignments that are not perfect when another method successfully produces a perfect alignment for the same pair of sequences. Best alignments are the alignments that have the lowest distance to the true alignment, including ties and equivalent alignments. Taken together, these three statistics not only allow a direct comparison of aligners but also expose instances where all aligners fall short of achieving a perfect result.
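A minimal sketch of this metric for the pairwise case is shown below. It assumes that, because every homology set in a pairwise alignment holds a single element, the normalized Hamming distance for each nucleotide is either 0 (same partner) or 1 (different partner). The function names are hypothetical, and this is not the reference implementation of Blackburne and Whelan (2011).

```python
def homology_sets(row_a, row_b):
    """Record, for each residue, its partner index or '-' if it faces a gap."""
    sets_a, sets_b = [], []
    i = j = 0  # running positions in the ungapped sequences
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            sets_a.append(j)
            sets_b.append(i)
            i += 1
            j += 1
        elif a != "-":
            sets_a.append("-")  # gap location is deliberately not recorded
            i += 1
        elif b != "-":
            sets_b.append("-")
            j += 1
    return sets_a, sets_b

def pairwise_distance(estimated, benchmark):
    """Fraction of nucleotides whose homology set differs between alignments.

    Each alignment is a (row_a, row_b) tuple of gapped strings over the same
    two underlying sequences.
    """
    est_a, est_b = homology_sets(*estimated)
    ben_a, ben_b = homology_sets(*benchmark)
    if len(est_a) != len(ben_a) or len(est_b) != len(ben_b):
        raise ValueError("alignments are not of the same sequences")
    differing = sum(x != y for x, y in zip(est_a + est_b, ben_a + ben_b))
    return differing / (len(ben_a) + len(ben_b))
```

For instance, shifting a single one-nucleotide gap by one column changes the homology sets of three of the nine nucleotides in a 4-nucleotide versus 5-nucleotide pair, giving a distance of one third.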
Results
Empirical Data and Alignments
We generated 14,127 sequence pairs containing the coding sequences of a human gene and the orthologous gorilla gene. Twenty-two human sequences contained early stop codons. These were not artifacts, but rather UGA codons which encoded for selenocysteine. One human sequence had a length that was not a multiple of 3, and no human sequences contained ambiguous nucleotides. Conversely, no gorilla sequences contained early stop codons (i.e. proteins containing selenocysteines were improperly annotated). Twenty gorilla sequences had lengths that were not multiples of 3, and 173 sequences contained ambiguous nucleotides.
As expected, COATi was not able to align 23 sequence pairs and COATi-rev was not able to align 193 sequence pairs. PRANK was not able to align 24 sequence pairs. MAFFT, MACSE, and ClustalΩ were able to align all sequence pairs, although the latter received some help from its wrapper script. Additionally, MACSE produced alignments with columns that contained only gaps, which were removed by its wrapper script. A total of 10,296 sequence pairs were aligned equivalently by all five methods, including 8,261 sequence pairs that did not contain any gaps, 664 that had identical alignments, and another 1,371 that had identical scores.
Compared to other aligners, COATi inferred more indels and aligned more nucleotides against a gap while producing fewer mismatches (Table 1). Additionally, COATi inferred that 59% of indels occurred in phase-1 or phase-2 in our coding sequences. MACSE’s amino-acid aware nucleotide model inferred 7%, while MAFFT’s noncoding DNA model inferred 48%. PRANK’s codon and ClustalΩ’s amino-acid models were constrained by their assumptions and inferred no phase-1 or phase-2 gaps. Note that ClustalΩ’s gaps with phases of 1 and 2 were created by our wrapper script handling DNA sequences with lengths that are not multiples of 3.
Table 1.
COATi produced more indels and fewer mismatches compared to other aligners
| Method | Phase-1 indelsᵃ | Phase-2 indelsᵃ | Phase-3 indelsᵃ | Matches (%)ᵇ | Mismatches (%)ᵇ | Gaps (%)ᵇ |
|---|---|---|---|---|---|---|
| COATi | 4,493 | 3,962 | 5,822 | 95.93 | 0.79 | 3.28 |
| ClustalΩ | 8 | 6 | 10,230 | 95.93 | 1.54 | 2.52 |
| MACSE | 455 | 497 | 13,017 | 96.13 | 1.33 | 2.54 |
| MAFFT | 2,317 | 2,631 | 5,458 | 96.12 | 1.36 | 2.52 |
| PRANK | 0 | 0 | 10,862 | 95.64 | 0.84 | 3.52 |
| COATi-rev | 3,992 | 3,775 | 6,176 | 95.95 | 0.79 | 3.25 |
ᵃ Indel phases: the total number of indels inferred by each aligner, separated by phase.
ᵇ Homology patterns: the percent of nucleotides that were aligned against a match, mismatch, and gap, respectively. This is different from the percent of columns that contain a match, mismatch, or gap because match and mismatch columns are counted twice.
Evolutionary Distances
The distributions of evolutionary distances for COATi, COATi-rev, and PRANK were biologically reasonable for human–gorilla gene pairs, with means of 0.8% to 0.9% (Fig. 4a). Conversely, the means for ClustalΩ, MACSE, and MAFFT were much higher, 1.5% to 1.8%. These larger means are due to the fact that the right tails of the distributions for ClustalΩ, MACSE, and MAFFT were larger than those of the other methods. As expected, the alignment methods that “overaligned” sequences (produced more mismatches and fewer gaps) also produced higher evolutionary distances on average (Table 1 and Fig. 4a). The differences between the distances inferred by COATi and other methods also show that ClustalΩ, MACSE, and MAFFT produce higher distance estimates for a subset of sequences (Fig. 4b). COATi’s alignments produced significantly lower K2P distances than ClustalΩ, MACSE, MAFFT, and PRANK. Single-tailed Wilcoxon signed-rank tests on matched data produced P-values . Additionally, COATi’s K2P distances did not depend on whether human or gorilla was used as the reference sequence (P-value of for a two-sided test). Additional analyses of the empirical alignments can be found in supplementary methods, Supplementary Material online, including results using p-distances, which did not differ from the results using K2P distances.
Fig. 4.
COATi’s alignments produce biologically reasonable evolutionary distances. a) The distribution of K2P distances inferred from alignments generated by each method. The averages of each distribution are indicated by vertical lines. b) The distribution of the differences between distances inferred by COATi and other methods. The x-axes of both plots have been pseudo-log transformed using the inverse hyperbolic sine.
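For readers unfamiliar with the K2P distance (Kimura 1980) and the inverse hyperbolic sine transform mentioned in the caption, the following is a minimal sketch. It is not the code used in this study; the function name `k2p_distance` and the choice to skip columns containing gaps or ambiguous bases are assumptions made for illustration.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance between two aligned rows.

    Assumption: columns with a gap or ambiguous base are skipped.
    """
    transitions = transversions = sites = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    # d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q)
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# One transition in ten sites (toy data).
d = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")
# Pseudo-log axis transform: roughly linear near zero, logarithmic for large values.
d_axis = math.asinh(d)
```

The inverse hyperbolic sine is defined at zero and for negative values, which makes it suitable for plotting both distances and differences between distances.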
Selection
The vast majority of sequence pairs (12,344) were identified as showing evidence of negative selection by all six alignment strategies: COATi, ClustalΩ, MACSE, MAFFT, PRANK, and COATi-rev (Fig. 5). Another 1,087 sequence pairs were identified as showing evidence of positive selection by all six strategies. The most common pattern of disagreement between aligners was a set of 91 sequence pairs that were identified as positively selected by COATi, PRANK, and COATi-rev. The most common singleton pattern was a set of 64 sequence pairs identified as positively selected only by MAFFT. Notably, 36 sequence pairs were identified as positively selected, and 9 sequence pairs were identified as negatively selected, by COATi and COATi-rev only. In total, COATi identified 1,367 (∼10%) sequence pairs as showing evidence of positive selection, compared with 1,228 for ClustalΩ, 1,183 for MACSE, 1,340 for MAFFT, and 1,349 for PRANK. Note that we did not explore whether any of these inferences were biologically or statistically significant.
Fig. 5.
Aligners varied in which sequence pairs they identified as undergoing positive selection. In this UpSet plot, the bottom panel displays the 16 most frequent intersection patterns among aligners. A black circle represents positive selection. The most frequent pattern was that no aligner found positive selection while the second most frequent pattern was that all aligners found positive selection. Other patterns involved a disagreement between aligners about whether a sequence pair showed evidence of positive or negative selection. The top panel displays the number of sequence pairs in each grouping.
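The intersection patterns summarized in an UpSet plot can be tallied as in the sketch below. The function name, the gene identifiers, and the calls are hypothetical toy data, not results from this study.

```python
from collections import Counter


def intersection_patterns(calls: dict, all_pairs: set) -> Counter:
    """Count sequence pairs by the exact set of aligners calling positive selection.

    calls maps an aligner name to the set of sequence pairs it flagged.
    """
    aligners = sorted(calls)
    patterns = Counter()
    for pair in all_pairs:
        # The pattern is the set of aligners that flagged this pair (may be empty).
        pattern = frozenset(name for name in aligners if pair in calls[name])
        patterns[pattern] += 1
    return patterns


# Hypothetical calls for five sequence pairs and three aligners.
toy_calls = {
    "COATi": {"g1", "g2", "g3"},
    "PRANK": {"g1", "g2"},
    "MAFFT": {"g1", "g4"},
}
toy_patterns = intersection_patterns(toy_calls, {"g1", "g2", "g3", "g4", "g5"})
for pattern, count in toy_patterns.most_common():
    print(sorted(pattern) or ["none"], count)
```

Each sequence pair falls into exactly one pattern, so the counts sum to the number of sequence pairs; the empty pattern corresponds to pairs that no aligner flagged.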
Semiempirical Benchmark
COATi produced better alignments than ClustalΩ, MACSE, MAFFT, and PRANK when evaluated on our benchmark dataset. It produced more best alignments, more perfect alignments, and fewer imperfect alignments (Table 2). COATi was significantly more accurate (lower average alignment error) at inferring the benchmark alignments than the other methods (Table 2). All P-values were significant according to one-tailed, paired Wilcoxon signed-rank tests. Notably, the average alignment error of the second best protocol was more than four times larger than COATi’s. We also calculated the average distance of alignments between aligners and used principal coordinate analysis (PCoA) to project the resulting distance matrix into two dimensions (Fig. 6). COATi was the closest aligner to the benchmark dataset, and other aligners tended to diverge from the benchmark dataset in different directions. Because benchmark alignments were derived from gap patterns estimated by different aligners, we also performed separate PCoA analyses for every source of gap patterns (supplementary fig. S6, Supplementary Material online). While all aligners, except ClustalΩ, do exceptionally well on their own patterns, COATi does exceptionally well on every set of patterns.
Table 2.
COATi generates better alignments than other alignment algorithms on a semiempirical benchmark dataset^a
| | COATi | ClustalΩ | MACSE | MAFFT | PRANK |
|---|---|---|---|---|---|
| Average alignment error | 0.23% | 2.65% | 1.37% | 1.67% | 1.02% |
| Number of best^b alignments | 7,050 | 3,988 | 5,226 | 5,918 | 6,128 |
| Number of perfect^b alignments | 6,795 | 3,954 | 5,138 | 5,840 | 5,860 |
| Number of imperfect^b alignments | 850 | 3,691 | 2,507 | 1,805 | 1,785 |
| RMSE for K2P distances | 0.000694 | 0.0920 | 0.0528 | 0.0561 | 0.0305 |
| Overestimated K2P distances | 10.3% | 52.7% | 38.0% | 26.2% | 25.6% |
| F1 score for negative selection | 99.8% | 97.6% | 98.4% | 98.3% | 99.1% |
| F1 score for positive selection | 97.8% | 77.2% | 84.9% | 85.5% | 91.8% |
^a Total number of sequence pairs in the benchmark dataset was 8,261.
^b Best alignments have the lowest alignment error (including equivalent alignments). Perfect alignments have an alignment error of zero (including equivalent alignments). Imperfect alignments are alignments that are not perfect when at least one method found a perfect alignment.
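The best, perfect, and imperfect categories can be tallied from per-pair alignment errors as in the following sketch. The function name, method labels, and error values are hypothetical; only the category definitions come from the table footnote.

```python
def tally_alignments(errors: list) -> dict:
    """Tally best, perfect, and imperfect alignments per method.

    errors holds one dict per sequence pair, mapping method name -> alignment error.
    """
    tally = {}
    for per_pair in errors:
        lowest = min(per_pair.values())
        any_perfect = lowest == 0  # at least one method matched the benchmark
        for method, err in per_pair.items():
            counts = tally.setdefault(
                method, {"best": 0, "perfect": 0, "imperfect": 0}
            )
            if err == lowest:
                counts["best"] += 1  # ties all count as best
            if err == 0:
                counts["perfect"] += 1
            elif any_perfect:
                counts["imperfect"] += 1
    return tally


# Hypothetical errors for three sequence pairs and two methods.
toy_errors = [
    {"A": 0.0, "B": 0.0},
    {"A": 0.0, "B": 0.02},
    {"A": 0.01, "B": 0.03},  # no method is perfect, so nothing is imperfect
]
toy_tally = tally_alignments(toy_errors)
```

Under these definitions, the perfect and imperfect counts of any method sum to the number of sequence pairs for which at least one method was perfect. The counts in Table 2 are consistent with this: they sum to 7,645 for every method.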
Fig. 6.
COATi’s alignments were closer to the semiempirical benchmark dataset than other methods according to a PCoA of the average alignment distance between alignments generated by different methods.
For evolutionary distances, COATi had the lowest root-mean-square error (0.000694) relative to the benchmark alignments (Table 2). The second lowest aligner, PRANK, had a root-mean-square error (0.0305) that was over 40 times larger than COATi’s. COATi overestimated only 10% of its evolutionary distances, while other aligners overestimated 26% to 53% of their evolutionary distances. Additionally, COATi more accurately inferred events of positive and negative selection (Table 2).
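The summary statistics reported for distances and selection in Table 2 can be sketched as follows. The function names and toy values are hypothetical, and the sketch assumes that each estimate is compared with the value obtained from the corresponding benchmark alignment.

```python
import math


def rmse(estimated, truth):
    """Root-mean-square error of estimates relative to benchmark values."""
    squared = [(e - t) ** 2 for e, t in zip(estimated, truth)]
    return math.sqrt(sum(squared) / len(squared))


def fraction_overestimated(estimated, truth):
    """Fraction of estimates that exceed their benchmark value."""
    return sum(e > t for e, t in zip(estimated, truth)) / len(truth)


def f1_score(predicted, actual, label):
    """F1 score (harmonic mean of precision and recall) for one class label."""
    tp = sum(p == label and a == label for p, a in zip(predicted, actual))
    fp = sum(p == label and a != label for p, a in zip(predicted, actual))
    fn = sum(p != label and a == label for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy distances: one of three estimates is too high.
toy_rmse = rmse([0.010, 0.012, 0.020], [0.010, 0.010, 0.020])
toy_over = fraction_overestimated([0.010, 0.012, 0.020], [0.010, 0.010, 0.020])
# Toy selection calls: one positively selected pair is missed.
toy_f1 = f1_score(["pos", "neg", "neg", "pos"], ["pos", "neg", "pos", "pos"], "pos")
```

The F1 score is computed separately for each class, which is why Table 2 reports one value for negative selection and another for positive selection.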
Discussion
We have developed COATi, a statistical alignment program based on FSTs. On empirical data, COATi produced alignments that provided reasonable biological inferences. On a semiempirical benchmark dataset, COATi was more accurate than the other aligners/alignment strategies that we tested: ClustalΩ’s amino-acid model, MACSE’s amino-acid aware nucleotide model, MAFFT’s DNA model, and PRANK’s codon model. Unlike standard codon or amino-acid alignment strategies, COATi supports aligning protein-coding sequences using all three phases of gaps. Consistent with Taylor et al. (2004) and Zhu (2022), our COATi results indicate that only 41% of indels in protein-coding sequences are phase-3 indels between humans and gorillas. Therefore, alignment strategies that only support phase-3 gaps in protein-coding sequences are suboptimally aligning the 59% of indels that are phase-1 and phase-2.
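The 41% and 59% figures follow directly from the indel counts in Table 1, as the short calculation below shows. The counts are copied from Table 1; the dictionary and function names are hypothetical.

```python
# Indel counts by phase (1, 2, 3), copied from Table 1.
INDELS_BY_PHASE = {
    "COATi": (4493, 3962, 5822),
    "ClustalΩ": (8, 6, 10230),
    "MACSE": (455, 497, 13017),
    "MAFFT": (2317, 2631, 5458),
    "PRANK": (0, 0, 10862),
    "COATi-rev": (3992, 3775, 6176),
}


def phase_shares(counts):
    """Percentage of indels falling in each phase."""
    total = sum(counts)
    return tuple(100 * c / total for c in counts)


coati_shares = phase_shares(INDELS_BY_PHASE["COATi"])
for method, counts in INDELS_BY_PHASE.items():
    print(method, [f"{share:.1f}%" for share in phase_shares(counts)])
```

For COATi, 5,822 of 14,277 inferred indels (about 41%) are phase-3, and the remaining 59% are split between phase-1 and phase-2.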
In addition to supporting all three phases of gaps, COATi’s advantage over the other aligners was at least partially due to its default parameters fitting the human–gorilla coding sequences better than the other aligners’ default settings. Different datasets may show different results. However, because COATi uses a statistical model with biologically meaningful parameters, users can easily optimize COATi by using parameters estimated by previous studies or estimating parameters themselves (e.g. Zhu 2022). This contrasts with other aligners that have limited ability to customize parameters and/or use parameters without biological interpretations.
COATi’s primary disadvantage is that its FST model was slower and consumed more resources than other methods. However, COATi’s marginal model produced results similar to the FST model, was the fastest method tested, and consumed reasonable amounts of memory (supplementary materials and García Mesa 2023). COATi is not a symmetric pairwise aligner, as reference sequences are more constrained than nonreference sequences. To test whether the choice of reference sequence impacts COATi’s results, we aligned our empirical sequence pairs with COATi using gorilla as the reference. We found no biologically or statistically significant differences between the results. The most noticeable difference was that, with gorilla as the reference, COATi rejected more sequence pairs due to the presence of ambiguous nucleotides in gorilla sequences.
After COATi, PRANK’s codon model was the second most accurate alignment method. Like COATi, PRANK also uses statistical models and supports aligning protein-coding sequences using codon models developed for phylogenetics. Both COATi and PRANK had reasonable distributions of evolutionary distances that lacked large right tails. This is likely because COATi and PRANK produced better alignments when tasked with aligning sequence pairs that contained nonorthologous exons. However, in contrast with COATi, PRANK’s codon model only permits phase-3 gaps and is very sensitive to frameshift artifacts. In fact, PRANK refuses to align any sequence using a codon model if its length is not a multiple of three.
ClustalΩ’s amino-acid model also only permits phase-3 gaps and is very sensitive to frameshift artifacts. Because this approach depends on amino-acid translations, if a coding sequence contains an abiological frameshift, then translation will produce an abiological amino-acid sequence resulting in a poor alignment. ClustalΩ was the worst performing protocol that we tested. It had the highest average alignment error and also had difficulties correctly identifying positive selection and estimating evolutionary distances. It even performed poorly on its own gap patterns. ClustalΩ’s relatively poor performance likely derives from its default assumptions being a poor fit for our dataset and also from the loss of information that occurred when we translated protein-coding DNA sequences into protein sequences.
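The effect of a frameshift artifact on translation-based approaches can be illustrated with a small sketch. The coding sequence below is invented for illustration, and the helper names are hypothetical; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate complete codons; a trailing partial codon is dropped."""
    usable = len(cds) - len(cds) % 3
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def has_codon_length(cds: str) -> bool:
    """True when the sequence length is a multiple of three."""
    return len(cds) % 3 == 0


original = "ATGGCCAAGGTGCTGACC"        # invented 6-codon sequence
shifted = original[:4] + original[5:]  # simulate a 1-nt deletion artifact

print(translate(original), has_codon_length(original))
print(translate(shifted), has_codon_length(shifted))
```

In this toy case, the single-nucleotide deletion changes every downstream amino acid and introduces an early stop codon, so an amino-acid alignment of the two translations would be misleading even though the nucleotide sequences differ at only one position. The shifted sequence also fails the multiple-of-three length check.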
We obtained better results when aligning in nucleotide space using both amino-acid aware MACSE and amino-acid agnostic MAFFT. These results indicate that aligning protein-coding sequences in nucleotide space works reasonably well if the sequences come from closely related species, such as humans and gorillas, as it supports all three gap phases and is robust to frameshift artifacts. If the sequences were further apart, we predict that amino-acid agnostic approaches would begin to produce unreasonable alignments.
We can look at distances between alignments generated by different aligners to understand the diversity of approaches used by our alignment protocols. Aside from the cluster of different COATi methods, we detected no other clusters in our analysis (supplementary fig. S7, Supplementary Material online). Different aligners tended to spread out from one another and in different directions from the benchmark dataset. This indicates that each protocol that we utilized approached alignment inference from a different section of “aligner space.” While each aligner, except ClustalΩ, performed well on its own gap patterns, MACSE also did well on ClustalΩ and MAFFT gap patterns, and PRANK did well on ClustalΩ patterns. COATi did well on all patterns.
This study has focused on COATi’s module for estimating pairwise alignments; however, COATi has additional tools, including a module to sample pairwise alignments from alignment space and a module to generate multiple-sequence alignments. We have used the sampling module in a large study (Zhu 2022) and believe that it is ready to be used in further studies. On the other hand, COATi’s multiple-sequence alignment algorithm is still in early development and needs additional work to ensure that it produces high-quality alignments. Currently, COATi generates multiple-sequence alignments by aligning nonreference sequences against a reference sequence and collapsing insertions following a user-provided guide tree.
The present study has shown that COATi offers a biologically significant improvement over other methods. COATi produces more accurate alignments of protein-coding sequences and better downstream inferences of sequence divergence, selection, and indel processes. While here we have focused on the challenge of aligning pairs of sequences from sister species that may contain genomic artifacts, in other work, we have begun to explore application of COATi to more divergent sequences and evaluate in detail the accuracy of the marginal approximation of the COATi FST (García Mesa 2023).
COATi is under active development, and future work will enhance the multiple-sequence aligner to refine the initial alignment according to a scoring function consistent with COATi’s FST model. We also plan on extending COATi to support more complex gap models, e.g. mixtures of single-nucleotide and triple-nucleotide indel models and weighting gap openings to reflect known selection on indel phases (Zhu 2022). We also plan on improving its alignment sampling capabilities, as well as implementing new models for aligning long-read sequences of genes against reference genomes. Our goal is to develop COATi into a user-friendly suite of tools that will allow researchers to analyze more data with higher accuracy and facilitate the study of important biological processes that shape genomic data.
Acknowledgments
The authors would like to thank Profs. Marco Mangone, Ted Pavlic, Banu Ozkan, Jay Taylor, and Jeremy Wideman for their helpful support on two separate PhD dissertation projects. The authors would also like to thank the associate editor and two reviewers for their helpful comments on earlier versions of this manuscript.
Contributor Information
Juan José García Mesa, The Biodesign Institute, Arizona State University, Tempe, AZ, USA; Ira A. Fulton Schools of Engineering, Arizona State University, Tempe, AZ, USA.
Ziqi Zhu, The Biodesign Institute, Arizona State University, Tempe, AZ, USA; School of Life Sciences, Arizona State University, Tempe, AZ, USA.
Reed A Cartwright, The Biodesign Institute, Arizona State University, Tempe, AZ, USA; School of Life Sciences, Arizona State University, Tempe, AZ, USA.
Supplementary Material
Supplementary material is available at Molecular Biology and Evolution online.
Funding
This research was funded by NSF award DBI-1929850.
Conflict of Interest
None declared.
Data Availability
The source code for COATi, along with documentation, is freely available on GitHub: https://github.com/CartwrightLab/coati (doi: 10.5281/zenodo.11499800) and is implemented in C++. COATi is released under an MIT license. Additional information, code, and workflows to replicate our analyses can be found on GitHub: https://github.com/jgarciamesa/coati-testing (doi: 10.5281/zenodo.11515933) and https://github.com/jgarciamesa/alignpair_letter (doi: 10.5281/zenodo.11512411).
References
- Abascal F, Zardoya R, Telford MJ. TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Res. 2010:38(suppl_2):W7–W13. 10.1093/nar/gkq291.
- Allauzen C, Riley M, Schalkwyk J, Skut W, Mohri M. OpenFst: a general and efficient weighted finite-state transducer library. In: Holub J, Žďárek J, editors. Implementation and application of automata. Berlin, Heidelberg: Springer; 2007. p. 11–23. 10.1007/978-3-540-76336-9_3.
- Arvestad L. Aligning coding DNA in the presence of frame-shift errors. In: Apostolico A, Hein J, editors. Combinatorial pattern matching. Berlin, Heidelberg: Springer; 1997. p. 180–190. 10.1007/3-540-63220-4_59.
- Bininda-Emonds O. transAlign: using amino acids to facilitate the multiple alignment of protein-coding DNA sequences. BMC Bioinformatics. 2005:6(1):1–6. 10.1186/1471-2105-6-156.
- Blackburne BP, Whelan S. Measuring the distance between multiple sequence alignments. Bioinformatics. 2011:28(4):495–502. 10.1093/bioinformatics/btr701.
- Bradley RK, Holmes I. Transducers: an emerging probabilistic framework for modeling indels on trees. Bioinformatics. 2007:23(23):3258–3262. 10.1093/bioinformatics/btm402.
- Cartwright RA. Problems and solutions for estimating indel rates and length distributions. Mol Biol Evol. 2009:26(2):473–480. 10.1093/molbev/msn275.
- Charif D, Lobry J. SeqinR 1.0-2: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis. In: Bastolla U, Porto M, Roman H, Vendruscolo M, editors. Structural approaches to sequence evolution. Berlin, Heidelberg: Springer; 2007. p. 207–232. 10.1007/978-3-540-35306-5_10.
- Cotterell R, Peng N, Eisner J. Stochastic contextual edit distance and probabilistic FSTs. In: Toutanova K, Wu H, editors. Proceedings of the 52nd annual meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Baltimore, Maryland: Association for Computational Linguistics; 2014. p. 625–630. 10.3115/v1/P14-2102.
- De Maio N. The cumulative indel model: fast and accurate statistical evolutionary alignment. Syst Biol. 2021:70(2):236–257. 10.1093/sysbio/syaa050.
- Fletcher W, Yang Z. The effect of insertions, deletions, and alignment errors on the branch-site test of positive selection. Mol Biol Evol. 2010:27(10):2257–2267. 10.1093/molbev/msq115.
- García Mesa JJ. Statistical sequence alignment of protein coding regions [PhD thesis]. Arizona State University; 2023.
- Hein J. An algorithm combining DNA and protein alignment. J Theor Biol. 1994:167(2):169–174. 10.1006/jtbi.1994.1062.
- Hein J. An algorithm for statistical alignment of sequences related by a binary tree. Pac Symp Biocomput. 2001:6:179–190.
- Holmes I. A model of indel evolution by finite-state, continuous-time machines. Genetics. 2020:216(4):1187–1204. 10.1534/genetics.120.303630.
- Holmes I, Bruno WJ. Evolutionary HMMs: a Bayesian approach to multiple alignment. Bioinformatics. 2001:17(9):803–820. 10.1093/bioinformatics/17.9.803.
- Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T. The Ensembl genome database project. Nucleic Acids Res. 2002:30(1):38–41. 10.1093/nar/30.1.38.
- Hubisz MJ, Lin MF, Kellis M, Siepel A. Error and error mitigation in low-coverage genome assemblies. PLoS One. 2011:6(2):e17034. 10.1371/journal.pone.0017034.
- Jackman SD, Coombe L, Chu J, Warren RL, Vandervalk BP, Yeo S, Xue Z, Mohamadi H, Bohlmann J, Jones SJ, et al. Tigmint: correcting assembly errors using linked reads from large molecules. BMC Bioinformatics. 2018:19(1):393. 10.1186/s12859-018-2425-6.
- Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013:30(4):772–780. 10.1093/molbev/mst010.
- Kimura M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980:16(2):111–120. 10.1007/BF01731581.
- Kosiol C, Holmes I, Goldman N. An empirical codon model for protein sequence evolution. Mol Biol Evol. 2007:24(7):1464–1479. 10.1093/molbev/msm064.
- Li WH. Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J Mol Evol. 1993:36(1):96–99. 10.1007/BF02407308.
- Li WH, Wu CI, Luo CC. A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol Biol Evol. 1985:2(2):150–174. 10.1093/oxfordjournals.molbev.a040343.
- Löytynoja A. Phylogeny-aware alignment with PRANK. In: Russell D, editor. Multiple sequence alignment methods. Totowa (NJ): Humana Press; 2014. p. 155–170. 10.1007/978-1-62703-646-7_10.
- Lunter G, Drummond AJ, Miklós I, Hein J. Statistical alignment: recent progress, new applications, and challenges. In: Nielsen R, editor. Statistical methods in molecular evolution. New York (NY): Springer; 2005. p. 375–405. 10.1007/0-387-27733-1_14.
- Mohri M, Pereira F, Riley M. Weighted automata in text and speech processing. In: Wahlster W, editor. Proceedings of the 12th biennial European Conference on Artificial Intelligence (ECAI-96), Workshop on extended finite state models of language. Chichester: John Wiley and Sons; 1996. p. 1–5. 10.48550/arXiv.cs/0503077.
- Morrison DA. Is sequence alignment an art or a science? Syst Bot. 2015:40(1):14–26. 10.1600/036364415X686305.
- Muse SV, Gaut BS. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol. 1994:11(5):715–724. 10.1093/oxfordjournals.molbev.a040152.
- Pamilo P, Bianchi NO. Evolution of the Zfx and Zfy genes: rates and interdependence between the genes. Mol Biol Evol. 1993:10(2):271–281. 10.1093/oxfordjournals.molbev.a040003.
- Paradis E, Schliep K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics. 2019:35(3):526–528. 10.1093/bioinformatics/bty633.
- Pedersen CNS, Lyngsø R, Hein J. Comparison of coding DNA. In: Farach-Colton M, editor. Combinatorial pattern matching. Berlin, Heidelberg: Springer; 1998. p. 153–173. 10.1007/BFb0030788.
- Ranwez V, Douzery EJP, Cambon C, Chantret N, Delsuc F. MACSE v2: toolkit for the alignment of coding sequences accounting for frameshifts and stop codons. Mol Biol Evol. 2018:35(10):2582–2584. 10.1093/molbev/msy159.
- Ranwez V, Harispe S, Delsuc F, Douzery EJ. MACSE: multiple alignment of coding sequences accounting for frameshifts and stop codons. PLoS One. 2011:6(9):e22594. 10.1371/journal.pone.0022594.
- R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2024.
- Redelings BD, Suchard MA. Incorporating indel information into phylogeny estimation for rapidly emerging pathogens. BMC Evol Biol. 2007:7(1):40. 10.1186/1471-2148-7-40.
- Rosenberg MS. Sequence alignment: methods, models, concepts, and strategies. Berkeley (CA): University of California Press; 2009.
- Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987:4(4):406–425. 10.1093/oxfordjournals.molbev.a040454.
- Schneider A, Souvorov A, Sabath N, Landan G, Gonnet GH, Graur D. Estimates of positive Darwinian selection are inflated by errors in sequencing, annotation, and alignment. Genome Biol Evol. 2009:1:114–118. 10.1093/gbe/evp012.
- Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011:7(1):539. 10.1038/msb.2011.75.
- Silvestre-Ryan J, Wang Y, Sharma M, Lin S, Shen Y, Dider S, Holmes I. Machine Boss: rapid prototyping of bioinformatic automata. Bioinformatics. 2021:37(1):29–35. 10.1093/bioinformatics/btaa633.
- Taylor MS, Ponting CP, Copley RR. Occurrence and consequences of coding sequence insertions and deletions in mammalian genomes. Genome Res. 2004:14(4):555–566. 10.1101/gr.1977804.
- Yoon BJ. Hidden Markov models and their applications in biological sequence analysis. Curr Genomics. 2009:10(6):402–415. 10.2174/138920209789177575.
- Zhu Z. Profiling of indel phases in coding regions [PhD thesis]. Arizona State University; 2022.