WAS IT A MATch I SAW? Approximate palindromes lead to overstated false match rates in benchmarks using reversed sequences

George Glidden-Handgis; Travis J Wheeler

doi:10.1093/bioadv/vbae052

. 2024 Apr 8;4(1):vbae052. doi: 10.1093/bioadv/vbae052

WAS IT A MATch I SAW? Approximate palindromes lead to overstated false match rates in benchmarks using reversed sequences

George Glidden-Handgis ^1,^✉, Travis J Wheeler ²

Editor: Aida Ouangraoua

PMCID: PMC11099658 PMID: 38764475

Abstract

Background

Software for labeling biological sequences typically produces a theory-based statistic for each match (the E-value) that indicates the likelihood of seeing that match’s score by chance. E-values accurately predict false match rate for comparisons of random (shuffled) sequences, and thus provide a reasoned mechanism for setting score thresholds that enable high sensitivity with low expected false match rate. This threshold-setting strategy is challenged by real biological sequences, which contain regions of local repetition and low sequence complexity that cause excess matches between non-homologous sequences. Knowing this, tool developers often develop benchmarks that use realistic-seeming decoy sequences to explore empirical tradeoffs between sensitivity and false match rate. A recent trend has been to employ reversed biological sequences as realistic decoys, because these preserve the distribution of letters and the existence of local repeats, while disrupting the original sequence’s functional properties. However, we and others have observed that sequences appear to produce high scoring alignments to their reversals with surprising frequency, leading to overstatement of false match risk that may negatively affect downstream analysis.

Results

We demonstrate that an alignment between a sequence S and its (possibly mutated) reversal tends to produce higher scores than alignment between truly unrelated sequences, even when S is a shuffled string with no notable repetitive or low-complexity regions. This phenomenon is due to the unintuitive fact that (even randomly shuffled) sequences contain palindromes that are on average longer than the longest common substrings (LCS) shared between permuted variants of the same sequence. Though the expected palindrome length is only slightly larger than the expected LCS, the distribution of alignment scores involving reversed sequences is strongly right-shifted, leading to greatly increased frequency of high-scoring alignments to reversed sequences.

Impact

Overestimates of false match risk can motivate unnecessarily high score thresholds, leading to potentially reduced true match sensitivity. Also, when tool sensitivity is only reported up to the score of the first matched decoy sequence, a large decoy set consisting of reversed sequences can obscure sensitivity differences between tools. As a result of these observations, we advise that reversed biological sequences be used as decoys only when care is taken to remove positive matches in the original (un-reversed) sequences, or when overstatement of false labeling is not a concern. Though the primary focus of the analysis is on sequence annotation, we also demonstrate that the prevalence of internal palindromes may lead to an overstatement of the rate of false labels in protein identification with mass spectrometry.

1 Introduction

The utility of tools that label biological sequences is dependent on the ability to reasonably predict the risk that a proposed label is due to some random or biased process, rather than true signal. An example is the case of sequence homology detection, in which an unlabeled sequence q is compared to a collection of known sequences, and similarity to one or more of those sequences serves as the basis of annotation of q.

Homology detection tools typically represent the risk of false annotation by providing the supporting alignment’s E-value along with its score S. For search of a particular query sequence against a given target database T, the E-value gives the number of alignments expected to meet or exceed the score S if T consisted exclusively of unrelated (random) sequences. Theoretically-grounded calculation of the E-value has been established for gap-free alignments (Karlin and Altschul 1990, Altschul and Gish 1996), and empirical studies demonstrate that these E-value predictions are accurate for search against randomly generated sequences of nucleotides or amino acids (Mott 1992, Eddy 2008).

Even so, genomics practitioners are keenly aware that these E-values are often overestimates of the confidence with which we should label one sequence as being related to another, so that it is common for analysis pipelines to require seemingly-extreme E-value cutoffs for search matches (Pearson 2013). This is because the models underlying the scoring schemes used in alignment (Henikoff and Henikoff 1992, Sjölander et al. 1996) assume that the database sequences are generated under a simple random distribution (sample with replacement), whereas true biological sequences are not always random in this way.

One notable deviation comes in the form of locally repetitive sequence. So-called tandem repeats are typically the result of some form of replication error (Pumpernik et al. 2008), and can present with a repeat period ranging from 1 to several hundred (Thakur et al. 2021). They produce substantial risk of spurious matches when both query and target sequences contain (possibly degenerate) repeats with similar repeated units. One strategy for compensating for tandem repeats is to employ a tandem repeat masking tool (Benson 1999, Frith 2011, Olson and Wheeler 2018) to hide regions of repetitive sequence away from homology detection. Such strategies are necessarily limited by the selection of some score threshold, and may thus either over-mask (hiding true homologs) or under-mask (allowing decayed repetitive regions to be identified as true homologs (Wheeler et al. 2012)).

Continuing concerns about real-world accuracy of E-values, and of the compensatory score adjustment/masking procedures, have lead researchers to evaluate tools with benchmarks that empirically measure false positive annotations in addition to true positive recall, rather than rely solely on theoretical E-values. A benchmark will generally contain a set of query sequences and a distinct set of target sequences. In the target set, some sequences are related to a query (these are the positives that the tool will ideally annotate as such), while other sequences are not related to a query (these are negatives or decoys that the tool should not annotate). The benchmarked algorithm is used to compute alignments without knowledge of true labels, and a predicted classification is produced. Wherever the computed classification agrees with ground truth, the algorithm is said to have produced a true positive or true negative, and elsewhere it produces false negatives and false positives, respectively.

A variety of techniques have been developed to automate the production of positive and negative sequence pairs. In this paper, we are particularly interested in methods of creating negative sequence pairs, because these decoys inform efforts to establish score cutoffs that limit false discovery rate. One option for generating decoy sequences is to randomly sample real biological sequences from a large dataset. This strategy is likely to produce an overestimate of the false positive rate because each randomly chosen ‘decoy’ has some possibility of being homologous to the query. Alternative methods for producing decoys include (i) shuffling the letters of a real biological sequence, or (ii) generating a sequence by sampling letters under a random model. In both of these cases, failure to include repetition or other realistic features of biological sequences may lead to an underestimate of the rate of false positives that will be observed when the tool is applied to real biological sequence.

Another mechanism for producing non-homologous bio-realistic sequence is to use reversed biological sequences, so that decoys are generated by selecting true biological sequences and reversing the order of their characters. Reversal has several immediate benefits over stochastic methods. Reversal preserves the distribution of letters in a sequence, and also preserves the existence of short (possibly degenerate) tandemly repetitive regions and regions of low compositional complexity that are often found in biological sequences. Meanwhile, the reversed sequence almost certainly does not fold or behave the same as the original.

Thus, a decoy set consisting of reversed sequences is expected to possess the low complexity and repetitiveness of real biological sequence, but with no true homology to an actual sequence. Such sequences demonstrate much higher rates of apparent “false” matching than do shuffled sequences, and it is tempting to view these as a better (more realistic) benchmark for false labeling, in the context of both nucleotide (Frith 2011) and protein (Steinegger and Söding 2017) annotation. We have employed reversed sequences in this way for the same reasons (Wheeler et al. 2012); however, during post-publication analysis of our benchmark results, we realized that “false” matches of transposable element queries to a reversed (not complemented) genome were surprisingly likely to be to “self” – a match of a query sequence to the reversed genome was highly likely to be to a reversal of an instance its own family. In an earlier paper evaluating software for protein search, Eddy (2011) suggested the source of this observation, while explaining why that manuscript’s analysis did not include reversed decoy: (“reversed sequences are surprisingly significantly more likely to show a significant match to the original sequence […] because of a counterintuitive statistical effect of the frequency of approximate palindromes”). Here, and throughout this manuscript, the term “palindrome” follows the canonical definition and refers to a sequence that reads the same backwards and forwards (e.g. ATGG $\bar{GGTA}$ or LIWM $\bar{MWIL}$ ). This is in contrast to an alternate usage of the term “palindrome” that sometimes appears in genomics literature: a sequence that consists of two spaced or adjacent inverted repeats, in which matching sequences occur as reverse complement (e.g. GTGAxx $\bar{TCAC}$ ).

Here, we explore the phenomenon of surprisingly high palindrome lengths, and the extent to which they may cause the use of reversed decoys to overstate the risk of false discovery in sequence database search tools. Through empirical analysis, we demonstrate that this palindromic phenomenon explains much of the increase in false discovery relative to shuffled sequence.

1.1 Why are unexpectedly high scores for reversed sequences a problem?

It has recently become common to present the accuracy of homology search tools by reporting the percent of all true instances that are assigned an E-value better than the highest-scoring decoy (which we refer to as recall-0 and is equivalent to the measure primarily reported in (Steinegger and Söding 2017) and (Buchfink et al. 2021)).

When the top-scoring decoy produces E-value D, recall-0 will ignore differences in sensitivity that occur for matches with E-value $> D$ . If the method of generating decoys is prone to producing false matches with inappropriately high score (low E-value), then the recall-0-imposed cutoff for differentiating between tools will be extreme, such that sensitivity differences in interesting score ranges are ignored. For example, a decoy with D = 1e-10 will ensure that recall-0 does not distinguish between one tool that reports 20 matches with E-value between 1e-5 and 1e-10, and a separate tool that only reports a single match in that range. Failure to distinguish tools producing different rates of recall in this ‘marginal’ score range may be concerning, particularly when there is a risk that the marginal cutoff might be unintentionally set higher than is appropriate.

As we demonstrate in the Results section, reversed decoy sequences are prone to producing surprisingly high scores (low E-values). This is true both between a query sequence q and its reversal and also between q and reversals containing many mutations. As a result, we suggest that care be taken in using reversed sequences as decoy sequences when evaluating the occurrence of false discovery (Steinegger and Söding 2017) or selecting score thresholds for reporting (Storer et al. 2021). If reversed sequences are to be used as decoys, the input sequences should first be filtered to remove matches to the query set. Alternatively it may be preferable to use simulated sequences in which care is taken to produce realistic sequence (for DNA, see Caballero et al. 2014). We hope that this article will induce exploration into alternative mechanisms for improving the generation of bio-realistic decoy sequences, particularly for proteins, for approximating false similarity rates.

1.2 Intuition and background for palindrome frequencies

We find it instructive to consider a simple calculation related to the expected frequency of a length-n palindrome in a string A, and of a length-n common substring in two strings A and B, both over a uniformly distributed alphabet of size k. The next two paragraphs are a sketch, intended to provide intuition about expected palindrome length; they do not represent a complete analysis.

Consider some position i in A. Ignoring edge cases, the chance that this is the center position of a length-n palindrome (where n is odd) is ${(1 / k)}^{(n - 1) / 2}$ —a length-n palindrome is formed if the $(n - 1) / 2$ letters to the right if i are matched (in reverse) to the $(n - 1) / 2$ letters to the left of i, each with $1 / k$ probability. This corresponds directly to the probability that position i in A and position $(m - i + 1)$ in the reversal of A (which are the same letter) will be the center of an identical common substring of length n.

Contrast this with the chance of observing a length-n identical common substring in two length-m strings, A and B. For some position i in A with value $a_{i}$ , with reasonably large string length m and modest value k, there will almost certainly be some position j in B with value $b_{j}$ = $a_{i}$ ; this is a length-1 common substring between A and B. Ignoring edge effects, the chance that this is the center position of a length-n common substring is ${(1 / k)}^{(n - 1)}$ , because each of the neighboring $n - 1$ characters in A has a $1 / k$ chance of being matched by the corresponding position in B. For each position i, there are expected to be $m / k$ candidate centers for common substrings, but this number is dwarfed by the much lower probability of producing a common substring of even modest length.

The above sketch ignores edge effects and cases of even-length or approximate palindromes, yet suggests that with a randomly sampled sequence A, a length-n palindrome is less surprising than is a length-n substring shared by A and some other randomly sampled sequence B (or a shuffled variant of A). Generally, the expected length of the longest common sequence (LCS) of two strings of length-k is $2 {log}_{1 / λ_{2}} (k)$ (Arratia and Waterman 1985), and the expected longest palindrome length in a length-k string is $2 {log}_{1 / λ_{2}} (k) + 1$ . (The probability amplitude $λ_{2}$ is determined from frequencies of each letter; for an alphabet of size s, on uniformly-distributed sequences, $λ_{2} = 1 / s$ ).

Finally, we consider approximate palindromes. An $α$ -gapped palindrome is a kind of approximate palindrome defined as $w = u v \bar{u}$ where u and v can be any strings and $\bar{u}$ is the reversal of u. We say w is $α$ -gapped whenever $α \geq \frac{| u | + | v |}{| u |}$ . Duchon et al. (2017) give an expectation for the length of $α$ -gapped palindromes in a length-k string ( $L_{k}$ ) as $E [L_{k}] \sim (α + 1) {log}_{1 / λ_{2}} (k)$ or equivalently $E [L_{k}] \sim (β + 2) {log}_{1 / λ_{2}} (k)$ where $β = α + 1 \geq \frac{| v |}{| u |}$ . By considering odd-length palindromes as gapped palindromes with $| v | = 1$ , it follows that $β {log}_{1 / λ_{2}} (k) ≳ 1$ , and it can be demonstrated that the expected length of odd maximal palindromes becomes $E [L_{k}] ≳ 2 {log}_{1 / λ_{2}} (k) + 1$ [manuscript in prep]. To our knowledge, there is no further published analysis of approximate palindromes involving substitutions or a more general pattern of gaps distributed throughout the palindrome.

Though the expected (mean) palindrome length of a sequence is only slightly higher than the expected length of the longest common substring (LCS) of that sequence and its shuffled partner, we hasten to note that palindromes as described above (exact, or simple approximate) are not exclusively responsible for high scoring alignments. High-scoring self-reversal alignments usually involve a gapped alignment with some mismatches and a reasonably-good approximate palindrome at the center. The scoring dynamics of these mismatched and gapped approximate palindromes appear to be similar to those of pure palindromes, but analysis is sufficiently complicated to exceed the scope of this manuscript. We also highlight that the average match length does not reflect the expected frequency of outliers, which will have almost no effect on the mean due to their rarity; it is these rare high-scoring matches that drive most of the alignment score results seen below.

1.3 Paper outline

In the following sections, we explore the connection between approximate palindrome length and the increased scores produced when aligning sequences against their reversal, relative to aligning against shuffled sequences. We begin by describing the sequence data used in our analyses, then examine the extent to which sequences achieve higher than random score when aligned to their reversal. We follow this with a study of the frequency of long palindromes and common substrings, and an investigation of the roles of tandem repeats and mutations in the context of using reversed sequences as decoy sequences. Finally, we explore the impact of palindromes on assessing the false discovery rate in sequence annotation and in mass spectrometry analysis.

2 Methods

2.1 Preparation of datasets

2.1.1 Real protein sequences and their paired relatives

To gain insight into the influence of decoy type on the distribution of high-scoring matches between sequences, we developed related sets of paired protein sequences. For each pair, we first identified one sequence q from Swiss-Prot (Bateman et al. 2023), then either (a) selected another Swiss-Prot sequence t, (b) shuffled q, or (c) reversed q.

It is possible that two sequences that are randomly selected from Swiss-Prot will in fact be homologous. To reduce this risk, we ensured that paired sequences did not share folds as defined by the SCOP database (Andreeva et al. 2014, 2020). Specifically, sequences from release 2022_04 of Swiss-Prot were length-filtered to identify 565,854 sequences with length between 50 and 2000; this set is called sprot_all. These were aligned to all SCOP2 superfamily representatives using blastp from the BLAST suite (version 2.12.0+) (Kent 2002). Swiss-Prot sequences were retained for this dataset only if they yielded an alignment that covered at least 95% of the length of both the Swiss-Prot and SCOP sequence—this ensured that we could reasonably annotate all folds on retained proteins. The result is a set of 106,740 sequences, called (sprot). Length distributions are shown in Fig. 1.

Figure 1. — Distribution of sequence lengths for sequences sampled from Swiss-Prot (**sprot_all)** and for those that match a SCOP family (**sprot**).

Sequences in (sprot) were assigned to clusters according to their SCOP superfamily. For each sequence q in sprot, its cluster v was identified, and a distinct cluster $w \neq v$ was chosen via sampling weighted by cluster size. A sequence t was then sampled from w and assigned to random. This second set (random) contains presumably non-homologous partners for the set sprot, and can be used to produce an alignment score distribution by aligning the $i th$ sequence of sprot to the ith sequence of random.

To explore the distribution of scores when aligning a biological sequence q to a shuffled copy of q, each sequence in sprot was shuffled to produce the set shuf; the ith sequence in shuf will share length and amino acid composition with the ith sequence in sprot, but no homology. Similarly, each sequence in sprot was reversed to produce the set rev (note: this is true reversal, not reverse complementation). All shuffling and reversals were computed using the esl_shuffle tool from the Easel package as released in HMMER 3.3.2 (Finn et al. 2011).

It is plausible that the high rate of palindromes in biological sequences is due to repetition in those sequences. To demonstrate that the palindrome phenomenon is general to all sequences, and not just repetitive ones, we reversed the sequences of shuf to produce shufrev; because the shuffled sequences do not retain signal of repetition found in true proteins, an alignment of a sequence in shuf to its mate in shufrev will highlight pure features of reversal alignment. Analogous sets of unrestricted relatives of sprot_all are called rev_all, shuf_all, and shufrev_all respectively. Figure 2 provides a visual guide to the various alignment pairings used throughout the article.

Figure 2. — The core experiment design compares five related sequence sets, **random**, **sprot**, **rev**, **shuf**, and **shufrev**, represented here as nodes, by means of pairwise alignment. Each colored edge represents a comparison of two sequence sets. For each connected pairing, the edge color corresponds to the color used for that pairing throughout the manuscript in both figures and text.

2.1.2 Reversals in mutated sequences

In the reversal analyses described above, each sprot sequence is compared to a reversal of itself. In a carefully constructed benchmark, query sequences themselves may not be present among the sequences that are reversed to produce the decoy set, calling the applicability of this analysis into question. But the benchmark may contain their homologs, which will show some level of sequence similarity to the queries. To understand how the level of divergence impacts distribution of scores induced by approximate palindromes, we generated six mutated variants of shuf_all, and reversed each to form a new diverged reversed benchmark. For each variant, a conditional distribution was used as the basis of mutation, so that $s = s_{1} \dots s_{n}$ in shuf_all was transformed to a related sequence $t = t_{1} \dots t_{n}$ , in which $t_{i}$ is an amino acid sampled from the distribution $p (x | s_{i})$ . For the first transformation, substitution probabilities were derived (Yu and Altschul 2005) from the BLOSUM90 substitution matrix. This matrix has a mean self-emission frequency ( $p (a | a)$ ) of 45.6%. Additional transformations were produced based on modified distributions in which the likelihood of mutation was reduced for each amino acid a, adjusting the likelihoods of mutation to the other 19 amino acids $p (b | a)$ proportionately. These distributions have average self-emission frequencies ( $p (a | a)$ ) of 50%, 60%, 70%, 80%, and 90% respectively. These enable consideration of the likelihood of finding matches between a query and reversals of homologs that show varying levels of percent identity to the query. These augmented distributions preserve the relative frequency of mutations derived from BLOSUM90 ( $p (b | a)$ for $b \neq a$ ) while increasing the expected percent identity between the original sequence s and its relative t. We refer to these distributions as shufrev_all-46pct, shufrev_all-50pct, shufrev_all-60pct, shufrev_all-70pct, shufrev_all-80pct, and shufrev_all-90pct, respectively.

2.1.3 Repeat masking

To further investigate the role of repetition in false matching for biological sequences, we hard-masked repetitive sequences in sprot_all using tantan version 40 (Frith 2011) with default parameters, producing masked sprot_all.

2.1.4 Simulated fragmentation with trypsin

In mass spectrometry-based proteomics, proteins are broken into fragments, often employing trypsin digestion. The identity of the resulting fragments is determined based in part on their mass. Typically, the fragments evaluated by mass spectrometry have length 5–100 amino acids because the signal induced by short sequences ( $< 5$ residues) is indistinguishable from noise, while long sequences ( $> 100$ residues) may exceed the physical limits of the machine. To simulate this process, we applied a simplified approximation of trypsin digestion as follows. Consider a protein $P = (p_{1}, p_{2}, \dots, p_{n})$ and its set $S = (s_{1}, s_{2}, \dots, s_{k})$ of potential trypsin digestion positions at which there is an arginine or lysine, ignoring any position that is not followed by a proline. Taking a pair of positions $(s_{i}, s_{j})$ from S, P may be split into three fragments, $(p_{1} .. p_{s_{i}}), (p_{s_{i} + 1} .. p_{s_{j}}), (p_{s_{j} + 1} .. p_{n})$ . For all protein sequences in sprot_all, we enumerated all pairs $(i, j)$ of potential digestion sites and produced all resulting three-way splits, then filtered to remove fragments with length below 5 or greater than 100. The remaining fragments were accumulated into the set sprot_tr. Each member of sprot_tr was shuffled to produce a shuffled control denoted shuf_tr. This approach does not capture the true complexity of trypsin digestion patterns (Rodriguez et al. 2008), but is sufficient for our purposes of analyzing palindromes in relatively short fragments.

2.1.5 Annotation of chromosome 22

The preceding datasets only comprise protein and peptide sequences. To demonstrate that the elevated presence of approximate palindromes in sequence benchmarks is not merely a feature of amino acid sequences, we consider DNA sequences from the human genome.

The sequence of human chromosome 22 in hg38.p14 along with the NCBI RefSeq annotation was downloaded from UCSC’s Genome Browser (Kent et al. 2002). The command-line tool Gffread (Pertea and Pertea 2020) was used to extract and concatenate exon regions into sequences corresponding to the mRNA transcripts of chromosome 22, denoted mRNA. The intron and intergenic regions of chromosome 22 were also sampled for sets of sequences with similar lengths as mRNA, denoted intron and intergenic.

As above, shuffling and masking were performed, and the resulting variant datasets are denoted shuf_mRNA, shuf_intron, shuf_intergenic, and masked mRNA, masked intron, and masked intergenic.

2.2 Software and algorithms

Exploration of score and count distributions was achieved using established tools and algorithms. For the large Swiss-Prot-based dataset, sequence database search scores were computed using blastp (Altschul et al. 1997). For our purposes, pairwise alignment was performed by creating a blast database for each singular target, and this target was “searched” with its paired query sequence. The parameter for (minimum) word length was set to 4. The default substitution matrix (BLOSUM62, half-bit) and gap penalties (gap open $= - 11$ , gap extend $= - 1$ ) were left unchanged. One blastp alignment may contain several results, each with its own score—for our experiments, only the highest-scoring match was retained and reported. We report score, rather than the associated E-value, because E-value is dependent on the size of the query and target datasets, while score is stable for an aligned pair.

One problem with the BLAST-based analysis is that for many comparisons, the similarity between sequences is sufficiently poor that not even a low-scoring blastp result is produced. To gain insight to the distribution of lower-scoring results, we used the BioJulia implementation of Smith-Waterman (SW) algorithm BioAlignments.jl (Christensen et al. 2023) for finding the optimal local alignment between two sequences. The algorithm is parameterized by a substitution matrix (whose values at row i and column j determine the cost of substituting letter i with letter j), as well as gap-open and gap-extend penalties, which determine the cost of skipping one or more characters in either sequence. When gap penalties and off-diagonal values in a substitution matrix are very large negative numbers (e.g. $- 1 e 12$ ), Smith-Waterman will never produce gaps or mismatches and effectively performs exact alignment; we use this mode for computing LCS between sequences. In all other cases, the gap penalties and substitution matrix are set to BLAST’s defaults.

Using either BLAST or Smith-Waterman (with typical parameters) for reversed alignment readily produces approximate palindromes, but rarely identifies true palindromes. To measure the frequency of exact palindromes in Swiss-Prot, we implemented Manacher’s algorithm (Gusfield 1997) in Julia.

Code used to produce all results and figures may be found at https://github.com/TravisWheelerLab/WASITAMATchISAW.

3 Results

3.1 Alignment scores for Swiss-Prot variations

To understand the influence of reversals on sequence alignment score distributions, we aligned sequences taken from Swiss-Prot (length between 50 and 2000) against a variety of decoy datasets. Each pairing involves either 106 740 (sprot and relatives) or 565 855 (sprot_all and relatives) pairwise alignments, with the ith sequence of one set aligned to the ith sequence of the other set using blastp (see Section 2 for definitions of sequence sets). The first pairing (sprot↔random) replicates the case in which a real protein sequence is aligned to some randomly chosen (presumed non-homologous) protein. The second pairing (sprot↔shuf) corresponds to a simple decoy strategy: the true protein is aligned to a sequence that is a shuffled copy of the original—it shares length and amino acid composition, but is not a true protein. As seen in Fig. 3A, the distribution of scores in these two pairings are similar.

Figure 3. — Score distributions from BLAST alignments. (A) All four pairings **sprot↔random**, **sprot↔shuf**, **sprot↔rev**, and **shuf↔shufrev** are restricted to sequences with 95% identity to SCOP2 superfamily representatives. (B) BLAST alignment scores are presented for the unfiltered sets **sprot_all↔rev_all** and **shuf_all↔shufrev_all.**

The next two pairings involve reversed sequences. The sprot↔rev pairing aligns each of the 106 740 true Swiss-Prot proteins to the sequence’s reversal. Figure 3A shows that the distribution of these scores is strongly right-shifted relative to alignments involving random or shuf. According to common wisdom this is expected, as the argument for using reversed sequences as decoys is that real biological sequences contain patterns of repetition and regions of low complexity, and these patterns are responsible for a higher observed rate of matches against reversed decoys. This result is contrasted with the pairing shuf↔shufrev. In this pairing, each query sequence is a shuffled sequence from Swiss-Prot, and the target for each shuffled query is its reversal. The shuffled sequences retain only random levels of repeats and local low complexity, so by conventional wisdom are expected to score similarly to sprot↔shuf. However, the score distribution of shuf↔shufrev is nearly the same as that of the biological sequence and reversal, sprot↔rev. This indicates that the observed right-shift of sprot↔rev scores is primarily due to the reversal itself, and not the biological or repetitive nature of the sequences. Figure 3B presents score distributions for the reversal alignments of the unfiltered set of proteins (sprot_all), and shows a very small right shift for sprot_all↔rev_all relative to shuf↔shufrev. This shift represents the impact of repetition and low complexity, and is much smaller than the shift between shuf↔shufrev and sprot↔shuf.

Figure 4 presents scores for these various pairings as a function of length. Specifically, every pair was binned according to sequence length, then the mean blastp score was computed for each bin. Sequences aligned to other sprot sequences (sprot↔random) may differ in length, so were binned by the geometric mean length of the query and target sequences; they show a score distribution that is essentially identical to that of sequences aligned to their shuffled match (sprot↔shuf). When sprot sequences are aligned to their reversals (sprot↔rev), they produce consistently higher scores than they do when aligned to their shuffled partners (sprot↔shuf). The majority of the difference is explained by the simple fact of reversal, as evidenced by the gap between sprot↔shuf and shuf↔shufrev and the overlap between sprot↔rev and shuf↔shufrev.

3.2 Role of tandem repeats in BLAST score gap

In Fig. 4A, Swiss-Prot sequences aligned to their reversals show a modest increase in BLAST score relative to shuffled sequences to their reversals. To explore the role of tandem repeats in this shift, we created the dataset masked sprot_all by hard-masking repetitive regions using tantan (Frith 2011) (see Section 2). Figure 4B presents these analyses, focused on shuf_all↔shufrev_all and sprot_all↔rev_all distributions, while Fig. 4C replaces sprot_all↔rev_all with another distribution showing the result of first masking with tantan (Frith 2011) (default settings), then searching against its reversal (masked sprot_all↔rev_all). These results demonstrate that the small gap between shuf_all↔shufrev_all and sprot_all↔rev_all is almost entirely due to repetitive sequence that can be effectively masked. This phenomenon, and its response to masking, reoccurs in the analyses replicated on DNA sequences in the corresponding section below.

3.3 Sequences match their own reversals much more than other reversals

The previous results show that alignment of a shuffled sequence to its reversal produces a score that is typically 10–20 bits higher than alignment of that shuffled sequence to the original protein. When evaluating false discovery risk, this average score is less important than the frequency of very high scoring outliers. To explore the right tail of this score distribution, we captured a slice of all sequences in sprot with length between 280 and 320, and computed the frequency with which scores are observed in (i) alignments of each sequence in this set with a shuffled copy of the sequence, (ii) alignments of shuffled copies of each sequence in this set with reversed copies of those shuffled copies, and (iii) alignments of shuffled copies of each sequence in this set with reversed copies that we mutated according to the 80pct substitution matrix described in Methods (and in the next section).

Figure 5 shows that alignment to reversals produces a highly right-skewed distribution, and that this right skew is evident even when reversals have been subjected to mutation to 80% identity. All three score distributions appear to be well modeled by a Gumbel distribution, so we fit Gumbel parameters (Chafaï 2018) to estimate rare event frequency. As a concrete example of how these distributions relate to rare high scores: a score of 165 is expected to occur by chance with a probability of $\sim$ 3.0e-12 in masked sequence-to-shuffled analysis. Meanwhile the same score is expected to occur $\sim$ 150 000 more frequently ( $\sim$ 4.5e-7) for shuffled sequence aligned to reversed masked sequences.

To further explore the overabundance of high scores in self-reversal alignments, we aligned a set of 10 000 sequences against reversals of that set, and found that sequences were much more likely to find a high-scoring match to their own reversals than to the reversals of any other sequences. Specifically, we selected 10 000 random sequences from UniProtKB/TrEMBL (Boeckmann et al. 2003) and masked tandemly repetitive regions in each sequence using tantan (Frith 2011) with default settings, resulting in a set that we call trembl. Each sequence in this set was reversed, resulting a set that we call revtrembl. Blast was used to search trembl against the revtrembl database, so that each query sequence in trembl was aligned to the 10 000 reversed sequences in revtrembl. Under conventional reasoning, the reversed sequences in revtrembl are unrelated to the original sequences in trembl, so each query should have a similar chance of matching any of the sequences in revtrembl; thus, a query sequence should produce a high-scoring alignment to the reversal of some other sequence roughly 10 000× as often as it will produce a high-scoring alignment to its own reversal. In conflict with this expectation, among all blastp matches with E-value <0.001, there were 12 matches between any query and its self-reversal, and only three matches between a query and any of the 9999 other reversals. Other E-value thresholds produce similar relative abundances.

These self-reversal matches are not due to repetitive sequence, since repeat sequences were masked before alignment. Because low complexity (biased use of the available protein alphabet) could cause elevated rate of self-reversal matches, we also assessed the potential role of sequence complexity using phmmer (Finn et al. 2011) (which reports a measure of sequence complexity): matches reported negligible composition bias.

3.4 The role of mutations (approximate reversals)

In practice, a database of reversed sequences may contain the reversal of one or more homologs to the query sequence. It seems reasonable to expect the effect of reversal on alignment score to be diminished, but perhaps not entirely removed. To simulate this situation, we produced a collection of reversal variants by sampling amino acid substitution mutations from conditional distributions with self-emission probabilities ranging from 46% to 90% (see Section 2). These transformations are of shufrev_all, named shufrev_all-46pct, shufrev_all-50pct, shufrev_all-60pct, shufrev_all-70pct, shufrev_all-80pct, and shufrev_all-90pct. Pairwise alignments are performed for each transform to shuf_all using the Smith-Waterman (SW) algorithm implemented in BioAlignments.jl (Christensen et al. 2023). SW was used instead of blastp in order to maximize the number of alignment score datapoints captured, because blast often produces no alignment (or score) for a comparison between a query and decoy.

The SW alignments between shuf_all and shufrev_all-Xpct for X $\in$ {46, 50, 60, 70, 80, 90} were computed using the BLOSUM62 scoring matrix and gap penalties set to $- 11$ for opening and $- 1$ for extension, to mimic the default configuration of blastp. Figure 6 displays the spread of these scores, as well as those of sprot_all↔shuf_all and shuf_all↔shufrev_all. The self-matching effect of reversal is lost completely in shufrev_all-46pct and almost undetectable in shufrev_all-50pct, which present score distributions similar to sprot_all↔shuf_all. But when simulated homologs are created with higher average percent identity between query and partner, and the partner is then reversed (yielding shufrev_all-Xpct for X $\in$ {60,70,80,90}), the associated pairs exhibit higher-than-random affinity between sequences and their (approximate) reversal. Thus, there is a risk of overstating false match rates even when the reversed decoy set contains homologs sharing 60% or greater similarity to query sequences.

Figure 6. — Box-and-whisker plot of SW alignment score distributions for alignment pairs **sprot_all↔shuf_all**, **shuf_all↔shufrev_all**, and **shufrev_all-Xpct** for X $\in$ {46, 50, 60, 70, 80, 90}. Above each box plot, the symbol *** indicates distributions that are significantly higher than **sprot_all↔shuf_all** based on the Wilcoxon rank test (P = 0, due to computational underflow). The values [Y] indicates the fraction of shuffled instances for which the associated box plot produces a higher score than **sprot_all↔shuf_all.**

These results demonstrate that elevated false match risk is not limited to self-reversal. In a benchmark of practical size, most of the elevated risk is likely to result from the presence of homologs in the decoy set. Very large decoy sets (Steinegger and Söding 2017, Buchfink et al. 2021) include sufficiently numerous proteins that for some queries there may be hundreds or thousands of high-identity homologs, each of which has a greatly elevated chance of producing a high-scoring match upon reversal (Figs 6 and 5). Collectively, these analyses highlight that reversals can potentially overstate false discovery risk, thus skewing analyses in which a single high-scoring decoy serves as the basis for selecting a score threshold (as when reporting recall-0 metrics).

3.5 Exact palindromes in Swiss-Prot

To gain insight into the high frequency of high-scoring alignments between sequences and their reversals, we computed all maximal-length palindromes in each of the 565 855 sequences in sprot_all as well as shuf_all for a control, using our implementation of Manacher’s algorithm (Gusfield 1997). An additional control was observed by computing the LCS between sequence pairs in sprot_all and shuf_all.

The palindromic substrings in sprot_all, shuf_all, and the LCS between them, were binned by the length of the sequence containing the substring; averages computed per each length are presented in Fig. 7. The length distribution of maximal palindromes is closely related to that of LCS; a result of Duchon et al. (2017) implies that the expected length of (ungapped) maximal palindromes is asymptotically equivalent to the expected length of LCS as derived by Arratia and Waterman (1985). In order to contextualize our analyses within this theory, we plotted the expected lengths of (i) LCS between two random sequences ( $2 {log}_{1 / λ_{2}} (n)$ ), and (ii) palindrome within a sequence ( $2 {log}_{1 / λ_{2}} (n) + 1$ ).

Figure 7 shows that sprot_all contains, on average, longer palindromic substrings than shuf_all. While palindromes found in the shuffled dataset are densely concentrated around their expected value, the average palindrome length in sprot_all diverges from the expectation, particularly as sequence length increases. This asymptotic behavior only yields a large divergence from expected score in the small fraction of sequences in sprot_all that are longer than $\sim$ 500 amino acids.

To explore the role of repetitive sequence in this shift, we computed maximal palindromes for each masked sequence. The resulting distribution of palindrome lengths (Fig. 7B) mirrors that of the shuffled sequences in Fig. 4C, clustering densely around the expected value. Based on manual review, the divergence of biological palindromes from their stochastic counterparts appears to be almost entirely explained entirely by long tandem repeats. In particular, these palindromes are composed of smaller, palindrome-supporting subunits like “PNAN” or “SD,” leading respectively to palindromes like PNANPNANPNANP and SDSDSDSDS.

3.6 Palindromes in the context of mass spectrometry

When using mass spectrometry to identify proteins in a sample, a library of reversed sequences is usually used to set thresholds based on risk of false labeling (Wright and Choudhary 2016). Here, we consider the potential that including reversals in mass spectrometry benchmarks may lead to overestimated false match rates and a resulting decrease in search sensitivity.

With all sequences in sprot_all, we produced protein fragments similar to those used in mass spectrometry, splitting each sequence at each instance of a lysine or arginine, except where followed by a proline (see Section 2). After splitting, substrings with length between 5 and 100 were accumulated into sets named sprot_tr. Fragments in sprot_tr were shuffled to produce a related set shuf_tr.

Figure 8 displays the average lengths of longest palindromic subsequence (LPS) in each set of strings, binned by the length of the string. To place these in context, the LCS was computed between paired sequences in sprot_tr and shuf_tr.

Trends established by the preceding results are still apparent in Fig. 8. LPS are longer than LCS, and because tryptic peptides are so short, the constant component separating LPS and LCS lengths is large relative to their shared logarithmic component. However, the relationship between LPS and LCS appears to invert for very short sequences. The data indicate that $LCS (s pro t_{t r}, s h u f_{t r}) > LPS (s pro t_{t r})$ for sequences shorter than 10 amino acids, and $LPS (s pro t_{t r}) \sim LPS (s h u f_{t r})$ for sequences shorter than approximately 20 peptides. Composition bias in short tryptic peptides may contribute to quasi-invariance under shuffling.

3.7 Palindromes in DNA

The analyses presented above have been focused on protein sequences, but the surprising prevalence of palindromes is a general feature of strings, regardless of alphabet. As with the protein results shown in this manuscript, a random DNA sequence (i.e. one with no repetitive or low-complexity regions) will on average have a longer exact match to its reversal than it will to a shuffled version of itself; it will also typically produce a higher score when aligned (under a reasonable scoring scheme) to its reversal than when aligned to a shuffled version of itself.

To explore this, we analyzed DNA occurring in human chromosome 22. Chromosome 22 was downloaded and annotated using the UCSC Genome Browser. Gene sequences were extracted from the chromosome using GffRead; introns and intergenic regions were also sampled from chromosome 22 to match the length distribution of genes. These datasets are respectively denoted mRNA, intron, and intergenic. Each set was filtered to contain sequences less than 50 000 bp. As above, variant sets were created using shuffling and masking, respectively denoted shuf_mRNA, shuf_intron, shuf_intergenic, and masked mRNA, masked intron, and masked intergenic. For more detail, see the preceding section in Methods.

The LPS of each DNA sequence was computed using Manacher’s algorithm. LCS were constructed between each sequence and a shuffled counterpart using a substring automata. Figure 9 plots the lengths of these common substrings, binned by sequence length, and Figure 10 plots these lengths after masking short tandem repeats.

Figure 9. — The average lengths of palindromes occurring in human chromosome 22 are plotted by sequence length against shuffled controls.

Figure 10. — Repetitive regions in human chromosome 22 are masked via tantan; the remaining palindromes in non-repetitive DNA, as well as those in corresponding shuffled sequences, are plotted. Both data are distributed closely around their expectation. As above, when repetitive subsequences are suppressed, the biological contribution to reverse affinity appears negligible.

The analysis of DNA reproduces the results observed in peptides and proteins. On average, the longest palindromic substring of a sequence is longer than the LCS between that sequence and a random other. When biologially-induced repetitions are present, the difference between LPS and LCS grows superlogarithmically. In the absence of repetition (either by shuffling or tantan masks), this difference appears constant.

4 Summary of results

In summary, alignment to reversal produces a substantial increase in score compared to alignment to unrelated sequences, even when there is no inherent repetitiveness or low complexity in the initial sequence. This effect is diminished when the reversed sequence contains mutations relative to the query, but remains significant for related sequences exceeding $60 %$ identity to the non-reversed sequence. As a result, when reversed biological sequences are used as benchmarks they may overstate the risk of false matches.

5 Discussion

Up to this point, we have discussed the prevalence of palindromes in terms of their impact on bioinformatic analysis (specifically, overstatement of false positive rates in sequence search). It is also interesting to consider how this existence of palindromes in sequences may play a role in biological processes. The role of palindromes in proteins is currently unknown and research on their function is ongoing (Giel-Pietraszuk et al. 2003, Sheari et al. 2008, Sridhar et al. 2015, Kefala et al. 2021). Sheari et al. (2008) argue that the prevalence of palindromic regions in proteins is a side-effect of repetitive and low-complexity regions. Sridhar et al. (2015) found that 60% of palindromes in proteins are “not associated with any regular secondary structure”, consistent with the notion that palindromes in proteins are not dependent on repetitiveness such as that observed in alpha helical structures. Furthermore, 80% of palindromes in proteins do not interact with active sites, catalytic residues, ligands, or metal ions. In a more direct study, Kefala et al. (2021) synthesized a 4-alpha-helical bundle and its reversal, and demonstrated that while the reversed sequence does fold into a helix—likely due to the aforementioned effects of repetition in helical sequences—the resulting structure was not a true alpha helix. In general, these results are consistent with our results above, in that the frequency of palindromes in true proteins appears to be primarily a function of the frequency of palindromes in random sequences, and repetitive aspects of some sequences may lead to a moderate increase in the observed rate of palindromes.

5.1 A better benchmark?

Can we compensate for the overabundance of palindromic regions in biosequence benchmarks? A repository containing all experiments found in this manuscript (https://github.com/TravisWheelerLab/WASITAMATchISAW) includes a script for generating improved reversed decoy benchmarks, given a query set Q and a target set T. The target set T is restricted to $T^{'}$ , containing sequences with less than $50 %$ percent identity to any sequence in Q. The reversal of $T^{'}$ , denoted R, retains the original biological signal present in T and omits reversal-induced false matches. As above, it may be necessary to mask repetitive regions before performing the initial alignment.

Acknowledgements

We wish to thank Dave Rich for exploratory analysis of the role of reversals on benchmarking issues during his thesis work, and for discussions that motivated early work on this analysis. We thank Robert Hubley and Daniel Olson for illuminating conversations regarding both palindrome theory and application to annotation benchmarks. We also thank Martin Frith for providing valuable feedback and insights in response to our first preprint draft. Analyses were made possible thanks to high performance computing (HPC) resources supported by the University of Arizona TRIF, UITS, and Research, Innovation, and Impact (RII) and maintained by the UArizona Research Technologies department.

Contributor Information

George Glidden-Handgis, R. Ken Coit College of Pharmacy, University of Arizona, Tucson, AZ 85721, United States.

Travis J Wheeler, R. Ken Coit College of Pharmacy, University of Arizona, Tucson, AZ 85721, United States.

Author contributions

George Glidden-Handgis (Formal analysis [lead], Investigation [lead], Methodology [lead], Software [lead], Visualization [lead], Writing—original draft [equal], Writing—review & editing [equal]) Travis Wheeler (Conceptualization [lead], Funding acquisition [lead], Methodology [equal], Project administration [lead], Writing—original draft [equal], Writing—review & editing [equal])

Conflict of interest

None declared.

Funding

GGH and TJW were supported by NSF DBI 1933305 and NIH NIGMS R01GM132600.

Data availability

Code and data used to produce all results and figures may be found at https://github.com/TravisWheelerLab/WASITAMATchISAW.

References

Altschul SF, Gish W.. Local alignment statistics. In: Methods in Enzymology, vol. 266. Elsevier, 1996, 460–480. [DOI] [PubMed] [Google Scholar]
Altschul SF, Madden TL, Schäffer AA. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389–402. [DOI] [PMC free article] [PubMed] [Google Scholar]
Andreeva A, Howorth D, Chothia C. et al. SCOP2 prototype: a new approach to protein structure mining. Nucleic Acids Res 2014;42:D310–D314. [DOI] [PMC free article] [PubMed] [Google Scholar]
Andreeva A, Kulesha E, Gough J. et al. The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures. Nucleic Acids Res 2020;48:D376–D382. [DOI] [PMC free article] [PubMed] [Google Scholar]
Arratia R, Waterman MS.. An erdös-rényi law with shifts. Advances in Mathematics 1985;55:13–23. 10.1016/0001-8708(85)90003-9 [DOI] [Google Scholar]
Bateman A, , MartinM-J, , Orchard S. et al. UniProt: The universal protein knowledgebase in 2023. Nucleic Acids Res 2023;51:D523–31. [DOI] [PMC free article] [PubMed] [Google Scholar]
Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 1999;27:573–80. [DOI] [PMC free article] [PubMed] [Google Scholar]
Boeckmann B, Bairoch A, Apweiler R. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003;31:365–70. [DOI] [PMC free article] [PubMed] [Google Scholar]
Buchfink B, Reuter K, Drost H-G.. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods 2021;18:366–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
Caballero J, Smit AFA, Hood L. et al. Realistic artificial DNA sequences as negative controls for computational genomics. Nucleic Acids Res 2014;42:e99. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chafaï D. Gumbel fit with Julia. https://djalil.chafai.net/blog/2018/02/18/gumbel-fit-with-julia/., 2018. (27 12 2023, date last accessed).
Christensen TA II, O’Mara C, Stukalov A. et al. Tim Holy, Kapara newbie, Morten Piibeleht, and tanhevg. Biojulia/bioalignments.jl. 2023. 10.5281/zenodo.7502559 (27 December 2023, date last accessed). [DOI] [Google Scholar]
Duchon P, Nicaud C, Pivoteau C. Gapped pattern statistics. In: Annual Symposium on Combinatorial Pattern Matching, Warsaw, Poland, 2017.
Eddy SR. A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol 2008;4:e1000069. [DOI] [PMC free article] [PubMed] [Google Scholar]
Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol 2011;7:e1002195. [DOI] [PMC free article] [PubMed] [Google Scholar]
Finn RD, Clements J, Eddy SR.. HMMER web server: interactive sequence similarity searching. Nucleic Acids Research 2011;39:W29–W37. [DOI] [PMC free article] [PubMed] [Google Scholar]
Frith MC. A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Res 2011;39:e23. [DOI] [PMC free article] [PubMed] [Google Scholar]
Giel-Pietraszuk M, Hoffmann M, Dolecka S. et al. Palindromes in proteins. J Protein Chem 2003;22:109–13. [DOI] [PubMed] [Google Scholar]
Gusfield D. Algorithms on strings, trees, and sequences. Cambridge, UK: Press Syndacate of the Univestity of Cambridge, 1997.
Henikoff S, Henikoff JG.. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992;89:10915–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
Karlin S, Altschul SF.. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A 1990;87:2264–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kefala A, Amprazi M, Mylonas E. et al. Probing protein folding with sequence-reversed $α$ -helical bundles. International Journal of Molecular Sciences 2021;22:1955. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kent WJ. BLAT—the BLAST-like alignment tool. Genome Research 2002;12:656–64. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kent WJ, Sugnet CW, Furey TS. et al. The human genome browser at UCSC. Genome Res 2002;12:996–1006. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mott R. Maximum-likelihood estimation of the statistical distribution of smith-waterman local sequence similarity scores. Bull Math Biol 1992;54:59–75. [DOI] [PubMed] [Google Scholar]
Olson D, Wheeler T. ULTRA: a model based tool to detect tandem repeats. In: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. 2018, 37–46. [DOI] [PMC free article] [PubMed]
Pearson WR. An introduction to sequence similarity (“homology”) searching. Curr Protoc Bioinformatics 2013;Chapter 3:3.1.1–.1.8. [DOI] [PMC free article] [PubMed] [Google Scholar]
Pertea G, Pertea M.. GFF utilities: gffread and gffcompare. F1000Res 2020;9:ISCB Comm J-304. [DOI] [PMC free article] [PubMed] [Google Scholar]
Pumpernik D, Oblak B, Borštnik B.. Replication slippage versus point mutation rates in short tandem repeats of the human genome. Mol Genet Genomics 2008;279:53–61. [DOI] [PubMed] [Google Scholar]
Rodriguez J, Gupta N, Smith RD. et al. Does trypsin cut before proline? J Proteome Res 2008;7:300–5. [DOI] [PubMed] [Google Scholar]
Sheari A, Kargar M, Katanforoush A. et al. A tale of two symmetrical tails: structural and functional characteristics of palindromes in proteins. BMC Bioinformatics 2008;9:274–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sjölander K, Karplus K, Brown M. et al. Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Comput Appl Biosci 1996;12:327–45. [DOI] [PubMed] [Google Scholar]
Sridhar S, Nagamruta M, Guruprasad K.. Analyses of the sequence and structural properties corresponding to pentapeptide and large palindromes in proteins. PLoS One 2015;10:e0139568. [DOI] [PMC free article] [PubMed] [Google Scholar]
Steinegger M, Söding J.. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 2017;35:1026–8. [DOI] [PubMed] [Google Scholar]
Storer J, Hubley R, Rosen J. et al. The Dfam community resource of transposable element families, sequence models, and genome annotations. Mob DNA 2021;12:2–14. [DOI] [PMC free article] [PubMed] [Google Scholar]
Thakur J, Packiaraj J, Henikoff S.. Sequence, chromatin and evolution of satellite DNA. Int J Mol Sci 2021;22:4309. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wheeler TJ, Clements J, Eddy SR. et al. Dfam: a database of repetitive DNA based on profile hidden Markov models. Nucleic Acids Research 2012;41:D70–D82. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wright JC, Choudhary JS.. DecoyPyrat: fast non-redundant hybrid decoy sequence generation for large scale proteomics. J Proteomics Bioinform 2016;9:176–80. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yu Y-K, Altschul SF.. The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions. Bioinformatics 2005;21:902–11. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

Code and data used to produce all results and figures may be found at https://github.com/TravisWheelerLab/WASITAMATchISAW.

[vbae052-B2] Altschul SF, Gish W.. Local alignment statistics. In: Methods in Enzymology, vol. 266. Elsevier, 1996, 460–480. [DOI] [PubMed] [Google Scholar]

[vbae052-B3] Altschul SF, Madden TL, Schäffer AA. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389–402. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B4] Andreeva A, Howorth D, Chothia C. et al. SCOP2 prototype: a new approach to protein structure mining. Nucleic Acids Res 2014;42:D310–D314. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B5] Andreeva A, Kulesha E, Gough J. et al. The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures. Nucleic Acids Res 2020;48:D376–D382. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B6] Arratia R, Waterman MS.. An erdös-rényi law with shifts. Advances in Mathematics 1985;55:13–23. 10.1016/0001-8708(85)90003-9 [DOI] [Google Scholar]

[vbae052-B4918033] Bateman A, , MartinM-J, , Orchard S. et al. UniProt: The universal protein knowledgebase in 2023. Nucleic Acids Res 2023;51:D523–31. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B7] Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 1999;27:573–80. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B8] Boeckmann B, Bairoch A, Apweiler R. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003;31:365–70. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B9] Buchfink B, Reuter K, Drost H-G.. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods 2021;18:366–8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B10] Caballero J, Smit AFA, Hood L. et al. Realistic artificial DNA sequences as negative controls for computational genomics. Nucleic Acids Res 2014;42:e99. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B11] Chafaï D. Gumbel fit with Julia. https://djalil.chafai.net/blog/2018/02/18/gumbel-fit-with-julia/., 2018. (27 12 2023, date last accessed).

[vbae052-B12] Christensen TA II, O’Mara C, Stukalov A. et al. Tim Holy, Kapara newbie, Morten Piibeleht, and tanhevg. Biojulia/bioalignments.jl. 2023. 10.5281/zenodo.7502559 (27 December 2023, date last accessed). [DOI] [Google Scholar]

[vbae052-B13] Duchon P, Nicaud C, Pivoteau C. Gapped pattern statistics. In: Annual Symposium on Combinatorial Pattern Matching, Warsaw, Poland, 2017.

[vbae052-B14] Eddy SR. A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol 2008;4:e1000069. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B15] Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol 2011;7:e1002195. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B16] Finn RD, Clements J, Eddy SR.. HMMER web server: interactive sequence similarity searching. Nucleic Acids Research 2011;39:W29–W37. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B17] Frith MC. A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Res 2011;39:e23. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B18] Giel-Pietraszuk M, Hoffmann M, Dolecka S. et al. Palindromes in proteins. J Protein Chem 2003;22:109–13. [DOI] [PubMed] [Google Scholar]

[vbae052-B19] Gusfield D. Algorithms on strings, trees, and sequences. Cambridge, UK: Press Syndacate of the Univestity of Cambridge, 1997.

[vbae052-B20] Henikoff S, Henikoff JG.. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992;89:10915–9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B21] Karlin S, Altschul SF.. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A 1990;87:2264–8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B22] Kefala A, Amprazi M, Mylonas E. et al. Probing protein folding with sequence-reversed $α$ -helical bundles. International Journal of Molecular Sciences 2021;22:1955. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B23] Kent WJ. BLAT—the BLAST-like alignment tool. Genome Research 2002;12:656–64. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B24] Kent WJ, Sugnet CW, Furey TS. et al. The human genome browser at UCSC. Genome Res 2002;12:996–1006. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B25] Mott R. Maximum-likelihood estimation of the statistical distribution of smith-waterman local sequence similarity scores. Bull Math Biol 1992;54:59–75. [DOI] [PubMed] [Google Scholar]

[vbae052-B26] Olson D, Wheeler T. ULTRA: a model based tool to detect tandem repeats. In: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. 2018, 37–46. [DOI] [PMC free article] [PubMed]

[vbae052-B27] Pearson WR. An introduction to sequence similarity (“homology”) searching. Curr Protoc Bioinformatics 2013;Chapter 3:3.1.1–.1.8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B28] Pertea G, Pertea M.. GFF utilities: gffread and gffcompare. F1000Res 2020;9:ISCB Comm J-304. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B29] Pumpernik D, Oblak B, Borštnik B.. Replication slippage versus point mutation rates in short tandem repeats of the human genome. Mol Genet Genomics 2008;279:53–61. [DOI] [PubMed] [Google Scholar]

[vbae052-B30] Rodriguez J, Gupta N, Smith RD. et al. Does trypsin cut before proline? J Proteome Res 2008;7:300–5. [DOI] [PubMed] [Google Scholar]

[vbae052-B31] Sheari A, Kargar M, Katanforoush A. et al. A tale of two symmetrical tails: structural and functional characteristics of palindromes in proteins. BMC Bioinformatics 2008;9:274–7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B32] Sjölander K, Karplus K, Brown M. et al. Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Comput Appl Biosci 1996;12:327–45. [DOI] [PubMed] [Google Scholar]

[vbae052-B33] Sridhar S, Nagamruta M, Guruprasad K.. Analyses of the sequence and structural properties corresponding to pentapeptide and large palindromes in proteins. PLoS One 2015;10:e0139568. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B34] Steinegger M, Söding J.. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 2017;35:1026–8. [DOI] [PubMed] [Google Scholar]

[vbae052-B35] Storer J, Hubley R, Rosen J. et al. The Dfam community resource of transposable element families, sequence models, and genome annotations. Mob DNA 2021;12:2–14. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B36] Thakur J, Packiaraj J, Henikoff S.. Sequence, chromatin and evolution of satellite DNA. Int J Mol Sci 2021;22:4309. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B37] Wheeler TJ, Clements J, Eddy SR. et al. Dfam: a database of repetitive DNA based on profile hidden Markov models. Nucleic Acids Research 2012;41:D70–D82. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B38] Wright JC, Choudhary JS.. DecoyPyrat: fast non-redundant hybrid decoy sequence generation for large scale proteomics. J Proteomics Bioinform 2016;9:176–80. [DOI] [PMC free article] [PubMed] [Google Scholar]

[vbae052-B39] Yu Y-K, Altschul SF.. The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions. Bioinformatics 2005;21:902–11. [DOI] [PubMed] [Google Scholar]

PERMALINK

WAS IT A MATch I SAW? Approximate palindromes lead to overstated false match rates in benchmarks using reversed sequences

George Glidden-Handgis

Travis J Wheeler

Roles

Abstract

Background

Results

Impact

1 Introduction

1.1 Why are unexpectedly high scores for reversed sequences a problem?

1.2 Intuition and background for palindrome frequencies

1.3 Paper outline

2 Methods

2.1 Preparation of datasets

2.1.1 Real protein sequences and their paired relatives

Figure 1.

Figure 2.

2.1.2 Reversals in mutated sequences

2.1.3 Repeat masking

2.1.4 Simulated fragmentation with trypsin

2.1.5 Annotation of chromosome 22

2.2 Software and algorithms

3 Results

3.1 Alignment scores for Swiss-Prot variations

Figure 3.

Figure 4.

3.2 Role of tandem repeats in BLAST score gap

3.3 Sequences match their own reversals much more than other reversals

Figure 5.

3.4 The role of mutations (approximate reversals)

Figure 6.

3.5 Exact palindromes in Swiss-Prot

Figure 7.

3.6 Palindromes in the context of mass spectrometry

Figure 8.

3.7 Palindromes in DNA

Figure 9.

Figure 10.

4 Summary of results

5 Discussion

5.1 A better benchmark?

Acknowledgements

Contributor Information

Author contributions

Conflict of interest

Funding

Data availability

References

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases