How sequence alignment scores correspond to probability models

Martin C Frith

doi:10.1093/bioinformatics/btz576

. 2019 Jul 22;36(2):408–415. doi: 10.1093/bioinformatics/btz576

How sequence alignment scores correspond to probability models

Martin C Frith ^1,^2,^3,^✉

Editor: Alfonso Valencia

PMCID: PMC9883716 PMID: 31329241

Abstract

Motivation

Sequence alignment remains fundamental in bioinformatics. Pair-wise alignment is traditionally based on ad hoc scores for substitutions, insertions and deletions, but can also be based on probability models (pair hidden Markov models: PHMMs). PHMMs enable us to: fit the parameters to each kind of data, calculate the reliability of alignment parts and measure sequence similarity integrated over possible alignments.

Results

This study shows how multiple models correspond to one set of scores. Scores can be converted to probabilities by partition functions with a ‘temperature’ parameter: for any temperature, this corresponds to some PHMM. There is a special class of models with balanced length probability, i.e. no bias toward either longer or shorter alignments. The best way to score alignments and assess their significance depends on the aim: judging whether whole sequences are related versus finding related parts. This clarifies the statistical basis of sequence alignment.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

The main way of analyzing nucleotide and protein sequences is by comparing them to related sequences. This is usually done by defining scores for aligned monomers, insertions and deletions, then finding alignments with maximal total score. An alternative approach is to use a pair hidden Markov model (PHMM): a probability model with probabilities for aligned monomers, insertions and deletions (Durbin et al., 1998). The probability approach has three major advantages:

The probabilities can be fitted to each kind of comparison [e.g. error-prone deoxyribonucleic acid (DNA) reads versus a genome]. This is expected to make the comparisons more accurate.
We can estimate the reliability of any alignment part, for example, each column (Fig. 1). This is useful because alignments often have parts that are uncertain, due to high divergence or repetitive sequence.
We can measure similarity between two sequences integrated over possible alignments between them. This is expected to detect subtle relationships more powerfully than single optimal alignments (Allison et al., 1992; Eddy, 2009).

Fig. 1. — An alignment color-coded by column reliability. The upper sequence is from a human mitochondrial genome, and the lower sequence is from a sea urchin mitochondrial genome. The figure was made using LAST (last.cbrc.jp)

We should bear in mind that ‘all models are wrong, but some are useful’. Alignment models typically omit rapid evolution of tandem repeats, neighbor-dependence of substitutions, etc., but have proven useful.

Another approach is to define alignment probabilities as exponentiated scores (Miyazawa, 1995; Zhang and Marr, 1995):

prob (alignment) \propto exp (alignment score / t),

(1)

where t is a parameter whose value must be chosen somehow. It follows that:

prob (alignment) = \frac{exp (alignment score / t)}{\sum_{alignments} exp (alignment score / t)} .

(2)

The denominator in Equation (2) is called a ‘partition function’, and t is often called ‘temperature’ due to an analogy with physics. It has been argued that this approach should be distinguished from generative models such as PHMMs (Eddy, 2008).

This study describes the equivalence between the partition function approach and PHMMs. The aim is to further unify our understanding of alignment methods, and thereby encourage the use of probability approaches with their aforementioned advantages. This study also clarifies the notion of alignment models with balanced length probability, i.e. no bias toward either longer or shorter alignments. It concludes with a discussion of the best way to score alignments and assess their significance, depending on our precise aim. One previous study also describes a one-to-many relationship between alignment scores and probabilities (Allison, 1993), but lacks most of the results presented here.

1.1 Review of score-based alignment

We wish to align two sequences:

$R_{1}, \dots, R_{m}$ : 1st sequence (e.g. ‘reference’), of length m.

$Q_{1}, \dots, Q_{n}$ : 2nd sequence (e.g. ‘query’), of length n.

The classic approach is to specify a scoring scheme, which assigns a numeric score for aligning any pair of letters or inserting a gap:

$S_{xy}$ ?>: score for aligning reference letter x to query letter y.

$a_{D}$ : first deletion score.

$b_{D}$ :?>deletion extension score. A deletion of length k scores $a_{D} + b_{D} \times (k - 1)$ .

$a_{I}$ ?>:?>first insertion score.

$b_{I}$ ?>:?>insertion extension score. An insertion of length k scores $a_{I} + b_{I} \times (k - 1)$ .

This describes affine gap scores, proportional to the gap length plus a constant. (There are two common parameterizations of affine score for a gap of length k: $a + b \times k$ and $a + b \times (k - 1)$ . It turns out the latter is a better fit to the mathematics of this study.) This is not the only way to score gaps, but it is the most common. Usually, $a_{D} \leq b_{D} < 0$ and $a_{I} \leq b_{I} < 0$ . Having defined a scoring scheme, we seek alignments with maximal total score. (In the unusual case that a_D > b_D or a_I > b_I, we must decide whether a gap of length k may be scored as k separate gaps. Algorithm I does not allow this; Algorithm II does.) Sometimes we seek an optimal global alignment (which aligns the sequences end-to-end), sometimes a local alignment (which aligns a pair of substrings).

1.1.1 Degrees of freedom

Some transformations of the score parameters have no effect on alignment. For local alignment, these transformations have one degree of freedom: if we multiply all the scores by any positive constant there is no effect on alignment. For global alignment, there are two further degrees of freedom. First, we can add any constant (say h) to a_D, b_D and S_xy: this also has no effect on alignment, because it simply adds h × m to the total score of any alignment. Second, we can add any constant to a_I, b_I and S_xy. These freedoms can be used to set $b_{D} = b_{I} = 0$ , which might allow global alignment software to run a bit faster (Katoh et al., 2002).

1.1.2 Algorithms for local alignment

Maximal-scoring local alignments can be found by a classic algorithm, which builds up the solution using prefixes of the sequences (Gotoh, 1982; Smith and Waterman, 1981). This algorithm has several variants.

One variant calculates the optimal score for aligning the length-i prefix of R to the length-j prefix of Q, ending with R_i aligned to Q_j (X_ij), R_i aligned to a gap (Y_ij), or Q_j aligned to a gap (Z_ij):

Algorithm I.

\begin{matrix} X_{i 0} = 0 Y_{i 0} = - \infty Z_{i 0} = - \infty \forall_{i = 0}^{m} \\ X_{0 j} = 0 Y_{0 j} = - \infty Z_{0 j} = - \infty \forall_{j = 0}^{n} \end{matrix}

\begin{matrix} X_{ij} = \max (X_{i - 1 j - 1}, Y_{i - 1 j - 1}, Z_{i - 1 j - 1}, 0) + S_{R_{i} Q_{j}} \forall_{i = 1}^{m} \forall_{j = 1}^{n} \\ Y_{ij} = \max (X_{i - 1 j} + a_{D}, Y_{i - 1 j} + b_{D}) \forall_{i = 1}^{m} \forall_{j = 1}^{n} \\ Z_{ij} = \max (X_{i j - 1} + a_{I}, Y_{i j - 1} + a_{I}, Z_{i j - 1} + b_{I}) \forall_{i = 1}^{m} \forall_{j = 1}^{n} \end{matrix}

\begin{matrix} Optimal alignment score = \max_{i = 0}^{m} \max_{j = 0}^{n} X_{ij} \end{matrix}

This variant is asymmetric in that it considers insertions after deletions but not deletions after insertions. This is because those two configurations always have the same score, so we may as well only consider the former. The algorithm only returns the alignment score, but an actual alignment can be found by tracing back from a maximal scoring (i, j) endpoint.

Another variant uses $W_{ij} = \max (X_{ij}, Y_{ij}, Z_{ij}, 0)$ instead of X_ij:

Algorithm II.

\begin{matrix} W_{i 0} = 0 Z_{i 0} = - \infty \forall_{i = 0}^{m} \\ W_{0 j} = 0 Y_{0 j} = - \infty \forall_{j = 0}^{n} \end{matrix}

\begin{matrix} Y_{ij} = \max (W_{i - 1 j} + a_{D}, Y_{i - 1 j} + b_{D}) \forall_{i = 1}^{m} \forall_{j = 0}^{n} \\ Z_{ij} = \max (W_{i j - 1} + a_{I}, Z_{i j - 1} + b_{I}) \forall_{i = 0}^{m} \forall_{j = 1}^{n} \\ W_{ij} = \max (W_{i - 1 j - 1} + S_{R_{i} Q_{j}}, Y_{ij}, Z_{ij}, 0) \forall_{i = 1}^{m} \forall_{j = 1}^{n} \end{matrix}

\begin{matrix} Optimal alignment score = \max_{i = 0}^{m} \max_{j = 0}^{n} W_{ij} \end{matrix}

This variant gives identical results to Algorithm I (assuming that $a_{D} \leq b_{D}$ and $a_{I} \leq b_{I}$ ), but is more efficient (fewer operations). There are even more efficient variants (Cameron et al., 2004; Farrar, 2007; Rognes, 2011; Suzuki and Kasahara, 2018; Zhang et al., 1997), discussed in the Supplementary Material.

1.2 Review of alignment probability models

A PHMM is a scheme for randomly generating a pair of sequences. Let us first consider a gapless model, which is simpler but illustrates the main issues (Fig. 2). Starting at ‘begin’, we randomly traverse the arrows until we hit the ‘end’. Whenever we have a choice of arrows, we randomly choose one, with the labeled probabilities: for example, at the first choice point we choose one arrow with probability ω_D and the other with probability $1 - ω_{D}$ . Whenever we reach a circle labeled ‘D’, we randomly generate a reference sequence letter, with probabilities $ϕ$ (e.g. $ϕ_{a} = 0.3, ϕ_{c} = 0.2, ϕ_{g} = 0.2, ϕ_{t} = 0.3$ ). Likewise, at ‘I’ we generate a query letter, and at ‘M’ we generate an aligned pair of letters. Thus, the model generates a stretch of aligned letters flanked by unaligned letters: a gapless local alignment.

Fig. 2. — A PHMM for gapless local alignment. The state labeled M emits aligned letters $x : y$ with probability *π_xy*. States labeled D emit reference letters x with probability $ϕ_{x}$ . States labeled I emit query letters y with probability *ψ_y*

Typically, a PHMM is used not to generate but to analyze a pre-existing pair of sequences. The simplest analysis is to find a path through the model (i.e. an alignment) most likely to have generated those sequences. Consider an alignment where the first c letters of R and d letters of Q are unaligned, the next e letters of R and Q are aligned and the final f letters of R and g letters of Q are unaligned. The probability of this alignment under the model is:

\begin{matrix} prob (path, R, Q) & = & (\prod_{k = 1}^{c} ω_{D} ϕ_{R_{k}}) (1 - ω_{D}) \\ (\prod_{k = 1}^{d} ω_{I} ψ_{Q_{k}}) (1 - ω_{I}) \\ (\prod_{k = 1}^{e} γ π_{R_{c + k} Q_{d + k}}) (1 - γ) \\ (\prod_{k = 1}^{f} ω_{D} ϕ_{R_{c + e + k}}) (1 - ω_{D}) \\ (\prod_{k = 1}^{g} ω_{I} ψ_{Q_{d + e + k}}) (1 - ω_{I}) \end{matrix}

This can be simplified by factoring out a constant μ, defined as:

μ = (\prod_{k = 1}^{m} ω_{D} ϕ_{R_{k}}) (\prod_{k = 1}^{n} ω_{I} ψ_{Q_{k}}) {(1 - ω_{D})}^{2} {(1 - ω_{I})}^{2} (1 - γ) .

(3)

Because μ does not depend on the path, we can find a most-probable path by maximizing:

\frac{prob (path, R, Q)}{μ} = \prod_{k = 1}^{e} \frac{γ}{ω_{D} ω_{I}} \frac{π_{R_{c + k} Q_{d + k}}}{ϕ_{R_{c + k}} ψ_{Q_{d + k}}} .

Finally, because maximizing a value is equivalent to maximizing its logarithm, we can maximize:

\ln (\frac{prob (path, R, Q)}{μ}) = \sum_{k = 1}^{e} \ln (\frac{γ}{ω_{D} ω_{I}} \frac{π_{R_{c + k} Q_{d + k}}}{ϕ_{R_{c + k}} ψ_{Q_{d + k}}}) .

This reveals the connection between model-based and score-based alignment. If we define a substitution score matrix as follows:

S_{xy} = t \ln (\frac{π_{xy}}{ϕ_{x} ψ_{y}} \cdot \frac{γ}{ω_{D} ω_{I}})

(4)

then a most-probable path is one which maximizes the sum of scores. Note that μ is the probability of a null alignment (with zero aligned letters): so an alignment score is a log probability ratio relative to a null alignment.

2 Degrees of freedom in the gapless model

The model in Figure 2 has several degrees of freedom, meaning that we can vary the parameters with no effect on S_xy and thus no effect on alignment. First, we can freely vary ω_D, ω_I and γ, provided that $\frac{γ}{ω_{D} ω_{I}}$ remains fixed. This is rather interesting: it means, for example, that a PHMM with $ω_{D} = 0.8, ω_{I} = 0.2$ is equivalent to one with $ω_{D} = ω_{I} = 0.4$ , when we analyze pre-existing sequences. Second, we can freely vary $ϕ$ , ψ and t, provided that we co-vary π and $\frac{γ}{ω_{D} ω_{I}}$ in a suitable way. If we fix values for S, $ϕ$ , ψ and t; then π and $\frac{γ}{ω_{D} ω_{I}}$ are given by this equation, where the right-hand side is fully specified:

\frac{γ}{ω_{D} ω_{I}} π_{xy} = ϕ_{x} ψ_{y} exp (S_{xy} / t) .

(5)

Since π_xy must sum to 1, we can infer that:

\frac{γ}{ω_{D} ω_{I}} = \sum_{x, y} ϕ_{x} ψ_{y} exp (S_{xy} / t) .

(6)

Then we can calculate π using Equation (5).

2.1 Homogeneous letter probabilities

The letter probabilities in aligned and unaligned regions need not be the same, but we may wish to make them so (Yu et al., 2003):

\begin{matrix} ϕ_{x} = \sum_{y} π_{xy} ψ_{y} = \sum_{x} π_{xy} . \end{matrix}

In this case, we can freely vary t, provided we co-vary π and $\frac{γ}{ω_{D} ω_{I}}$ in a suitable way. If we fix values for S and t, and define $S'_{xy} = exp (S_{xy} / t)$ and $ψ'_{y} = \frac{ω_{D} ω_{I}}{γ} ψ_{y}$ , we get:

\sum_{y} S'_{xy} ψ'_{y} = 1.

This is a set of simultaneous linear equations, which can be solved by standard methods (provided that $S'$ is not a singular matrix). We can then calculate:

\begin{matrix} \frac{γ}{ω_{D} ω_{I}} = 1 / \sum_{y} ψ'_{y} ψ_{y} = ψ'_{y} \cdot \frac{γ}{ω_{D} ω_{I}} \end{matrix} .

$ϕ$ can be calculated in the same way. These calculations may yield negative $ϕ_{x}$ or ψ_y, meaning that these values of S and t are not consistent with homogeneous letter probabilities.

2.2 Uniform length probability

We may wish to focus on the case where $\frac{γ}{ω_{D} ω_{I}} = 1$ , which means that alignments of different lengths have uniform prior probability. (Despite using a hidden Markov model, we do not have a geometrically decaying length distribution here.) If we require both uniform length probability and homogeneous letter probabilities, π has no degrees of freedom (usually). Starting from a given substitution score matrix S, t must have a value such that $\sum_{y} ψ'_{y} = 1$ . Yu et al. showed that t has at most one such value that also yields $ϕ$ and $ψ \geq 0$ , and they described a numerical procedure for finding it (Yu and Altschul, 2005; Yu et al., 2003). It is possible (though unlikely) to have a solution where $S'$ is singular, in which case π does have degrees of freedom, i.e. a linear space of values that yield the same S (Yu and Altschul, 2005).

3 Examples

A substitution score matrix named ‘HoxD70’ is often used for inter-species genome alignment (Chiaromonte et al., 2002):

S_{HoxD 70} = (\begin{matrix} 91 & - 114 & - 31 & - 123 \\ - 114 & 100 & - 125 & - 31 \\ - 31 & - 125 & 100 & - 114 \\ - 123 & - 31 & - 114 & 91 \end{matrix})

Assuming a PHMM of the form shown in Figure 2, with homogeneous letter probabilities, any of the parameter combinations in Table 1 yield HoxD70. This is rather striking: apparently different PHMM parameter settings lead to the exact same score parameters, and therefore always produce the same optimal alignments.

Table 1.

Sets of PHMM parameters that correspond to the HoxD70 score matrix

t	$\frac{γ}{ω_{D} ω_{I}}$	$ϕ = ψ$	π
30	6.06	$(\begin{matrix} 0.288 \\ 0.212 \\ 0.212 \\ 0.288 \end{matrix})$	$(\begin{matrix} 0.284 & 0.000225 & 0.00359 & 0.000226 \\ 0.000225 & 0.208 & 0.000115 & 0.00359 \\ 0.00359 & 0.000115 & 0.208 & 0.000225 \\ 0.000226 & 0.00359 & 0.000225 & 0.284 \end{matrix})$
96.1735	1	$(\begin{matrix} 0.266 \\ 0.234 \\ 0.234 \\ 0.266 \end{matrix})$	$(\begin{matrix} 0.182 & 0.019 & 0.0451 & 0.0197 \\ 0.019 & 0.155 & 0.0149 & 0.0451 \\ 0.0451 & 0.0149 & 0.155 & 0.019 \\ 0.0197 & 0.0451 & 0.019 & 0.182 \end{matrix})$
10 000	0.996	$(\begin{matrix} 0.258 \\ 0.242 \\ 0.242 \\ 0.258 \end{matrix})$	$(\begin{matrix} 0.0673 & 0.062 & 0.0625 & 0.0658 \\ 0.062 & 0.0596 & 0.0583 & 0.0625 \\ 0.0625 & 0.0583 & 0.0596 & 0.062 \\ 0.0658 & 0.0625 & 0.062 & 0.0673 \end{matrix})$

Open in a new tab

The ‘Simple’ substitution score matrix provides another example:

S_{Simple} = (\begin{matrix} 1 & - 1 & - 1 & - 1 \\ - 1 & 1 & - 1 & - 1 \\ - 1 & - 1 & 1 & - 1 \\ - 1 & - 1 & - 1 & 1 \end{matrix})

Any of the PHMM parameter sets in Table 2 yield this matrix (again using homogeneous letter probabilities).

Table 2.

Sets of PHMM parameters that correspond to the Simple score matrix

t	$\frac{γ}{ω_{D} ω_{I}}$	$ϕ = ψ$	π
0.3	7.03	$(\begin{matrix} 0.25 \\ 0.25 \\ 0.25 \\ 0.25 \end{matrix})$	$(\begin{matrix} 0.249 & 0.000317 & 0.000317 & 0.000317 \\ 0.000317 & 0.249 & 0.000317 & 0.000317 \\ 0.000317 & 0.000317 & 0.249 & 0.000317 \\ 0.000317 & 0.000317 & 0.000317 & 0.249 \end{matrix})$
0.910239	1	$(\begin{matrix} 0.25 \\ 0.25 \\ 0.25 \\ 0.25 \end{matrix})$	$(\begin{matrix} 0.188 & 0.0208 & 0.0208 & 0.0208 \\ 0.0208 & 0.188 & 0.0208 & 0.0208 \\ 0.0208 & 0.0208 & 0.188 & 0.0208 \\ 0.0208 & 0.0208 & 0.0208 & 0.188 \end{matrix})$
10	0.955	$(\begin{matrix} 0.25 \\ 0.25 \\ 0.25 \\ 0.25 \end{matrix})$	$(\begin{matrix} 0.0723 & 0.0592 & 0.0592 & 0.0592 \\ 0.0592 & 0.0723 & 0.0592 & 0.0592 \\ 0.0592 & 0.0592 & 0.0723 & 0.0592 \\ 0.0592 & 0.0592 & 0.0592 & 0.0723 \end{matrix})$

Open in a new tab

These examples show a common pattern. When t is low, $\frac{γ}{ω_{D} ω_{I}}$ is high, and π has low mismatch probabilities. High $\frac{γ}{ω_{D} ω_{I}}$ means that longer alignments are preferred, which counteracts the aversion to mismatches. As t increases, $\frac{γ}{ω_{D} ω_{I}}$ decreases, and the mismatch probabilities increase: the net effect is that optimal alignments remain the same.

As t increases further, however, $\frac{γ}{ω_{D} ω_{I}}$ reaches a minimum (whose value differs slightly in the two examples), and then rises asymptotically to 1 (Fig. 3). This means that not any value of $\frac{γ}{ω_{D} ω_{I}}$ can be consistent with a given substitution score matrix: the value must be $\geq$ this minimum.

3.1 Effect of mismatch score

If we strengthen the Simple matrix’s mismatch score (i.e. make it more negative), the value of $\frac{γ}{ω_{D} ω_{I}}$ decreases for any given t (Fig. 3, Equation 6). One consequence is that uniform length probability occurs at lower values of t.

4 Linear gap costs

Let us now consider a simple gapped model (Fig. 4). Just as for the gapless model, we can align two sequences by finding a path through the model most likely to have generated the sequences. This is equivalent to maximizing an alignment score defined as $t \ln (prob (path, R, Q) / μ_{G})$ , where μ_G is the probability of a null alignment (that never traverses the γ, α_D or α_I arrows):

μ_{G} = (\prod_{k = 1}^{m} ω_{D} ϕ_{R_{k}}) (\prod_{k = 1}^{n} ω_{I} ψ_{Q_{k}}) {(1 - ω_{D})}^{2} {(1 - ω_{I})}^{2} (1 - γ - α_{D} - α_{I}) .

(7)

This alignment score is a sum of scores for substitutions, deletions and insertions:

\begin{matrix} S_{xy} = t \ln (\frac{π_{xy}}{ϕ_{x} ψ_{y}} \cdot \frac{γ}{ω_{D} ω_{I}}) \\ a_{D} = b_{D} = t \ln (α_{D} / ω_{D}) \\ a_{I} = b_{I} = t \ln (α_{I} / ω_{I}) \end{matrix} .

Fig. 4. — A simple PHMM for gapped local alignment. The state labeled M emits aligned letters $x : y$ with probability *π_xy*. States labeled D emit reference letters x with probability $ϕ_{x}$ . States labeled I emit query letters y with probability *ψ_y*

4.1 Degrees of freedom

This model has several degrees of freedom, meaning that we can vary the model probabilities with no effect on alignment scores. First, we can freely vary ω_D and ω_I, provided we co-vary γ, α_D and α_I so as to keep these values fixed:

\begin{matrix} c' = γ / (ω_{D} ω_{I}) = \sum_{xy} ϕ_{x} ψ_{y} exp (S_{xy} / t) \\ a'_{D} = α_{D} / ω_{D} = exp (a_{D} / t) \\ a'_{I} = α_{I} / ω_{I} = exp (a_{I} / t) \end{matrix}

Second, we can freely vary $ϕ$ , ψ and t, provided we co-vary π, $c', α_{D} / ω_{D}$ and $α_{I} / ω_{I}$ in a suitable way. Overall, the degrees of freedom are similar to those of the gapless model.

4.2 Balanced length probability

There is a useful gapped generalization of gapless alignment with uniform length probability. If ω_D and ω_I are relatively low, but γ, α_D and α_I are high, the model has a bias toward longer alignments. Conversely, if ω_D and ω_I are high but γ, α_D and α_I are low, the model is biased toward shorter alignments. A natural balance occurs when:

\frac{γ}{ω_{D} ω_{I}} + \frac{α_{D}}{ω_{D}} + \frac{α_{I}}{ω_{I}} = 1,

(8)

or equivalently:

c' + a'_{D} + a'_{I} = 1.

(9)

If $c' + a'_{D} + a'_{I} < 1$ , the degrees of freedom allow us to vary ω_D and ω_I arbitrarily close to 1, in which case $γ + α_{D} + α_{I}$ (the probability of continuing the alignment) will be <1. This means there is a bias toward shorter alignments. Conversely, if $c' + a'_{D} + a'_{I} > 1$ , we cannot vary both ω_D and ω_I arbitrarily close to 1: they have an upper bound that occurs when $γ + α_{D} + α_{I} = 1$ . This means there is a bias toward longer alignments.

Balanced length probability can also be understood as conservation of probability ratio relative to a null alignment (Supplementary Material 2.2; Yu and Hwa, 2001). Its similarity to gapless alignment with uniform length probability can be understood from their implications for the unaligned flanks of a local alignment (Supplementary Material 2.3).

5 Affine gap costs

There is more than one PHMM topology corresponding to affine-gap alignment: Figure 5 shows two options. Model A is more ‘ambiguous’, meaning that different paths through the PHMM yield indistinguishable alignments. For example, if we have an alignment with two consecutive deleted bases, this could be one deletion of length two (traversing the arrow labeled β_D), or two deletions of length one (traversing α_D twice). Using model B, on the other hand, it would have to be one deletion of length two. I believe model A is the most elegant possible model for local affine-gap alignment: it makes insertions and deletions symmetric, and produces simple equations.

Fig. 5. — Two PHMMs for gapped local alignment. States labeled M emit aligned letters $x : y$ with probability *π_xy*. States labeled D emit reference letters x with probability $ϕ_{x}$ . States labeled I emit query letters y with probability *ψ_y*

Let us now relate model A probabilities to alignment scores. We must first decide whether ‘alignment’ means ‘path’ or ‘set of indistinguishable paths’: let us use the former definition here, and explore the latter in the Supplementary Material. An alignment score is then $t \ln (prob (path, R, Q) / μ_{G})$ , which is a sum of scores for substitutions, deletions and insertions:

\begin{matrix} S_{xy} = t \ln (\frac{π_{xy}}{ϕ_{x} ψ_{y}} \cdot \frac{γ}{ω_{D} ω_{I}}) \\ a_{D} = t \ln (α_{D} (1 - β_{D}) / ω_{D}) \\ b_{D} = t \ln (β_{D} / ω_{D}) \\ a_{I} = t \ln (α_{I} (1 - β_{I}) / ω_{I}) \\ b_{I} = t \ln (β_{I} / ω_{I}) \end{matrix}

5.1 Degrees of freedom

The affine-gap models have several degrees of freedom, meaning that we can vary the model probabilities with no effect on alignment scores. In model A we can freely vary ω_D and ω_I, provided we co-vary β_D, β_I, α_D, α_I and γ so as to keep these values fixed:

\begin{matrix} c' = γ / (ω_{D} ω_{I}) \\ a'_{D} = exp (a_{D} / t) = α_{D} (1 - β_{D}) / ω_{D} \\ b'_{D} = exp (b_{D} / t) = β_{D} / ω_{D} \\ a'_{I} = exp (a_{I} / t) = α_{I} (1 - β_{I}) / ω_{I} \\ b'_{I} = exp (b_{I} / t) = β_{I} / ω_{I} \end{matrix}

There are also degrees of freedom involving $ϕ$ , ψ and t, similar to the gapless and linear-gap models.

5.2 Limits to degrees of freedom

Our freedom to vary ω_D and ω_I (while keeping $c'$ , etc. fixed) may have an upper limit. In model A, we must have $γ + α_{D} + α_{I} < 1$ , and:

γ + α_{D} + α_{I} = c' ω_{D} ω_{I} + \frac{a'_{D} ω_{D}}{1 - b'_{D} ω_{D}} + \frac{a'_{I} ω_{I}}{1 - b'_{I} ω_{I}} .

(10)

Thus, $γ + α_{D} + α_{I}$ increases with increasing ω_D and ω_I. [Let us assume that b_D and b_I are negative, thus $b'_{D}$ and $b'_{I}$ are <1, thus the denominators in Equation (10) are always positive.] This may prevent high values of ω_D and ω_I. Conversely, $γ + α_{D} + α_{I}$ may have an upper limit: its maximum possible value occurs when $ω_{D} \approx 1 \approx ω_{I}$ .

Let us see some examples (Fig. 6), with symmetric insertions and deletions, so we can drop the D and I subscripts (e.g. $ω_{D} = ω_{I} = ω$ ). The upper limit on ω can be found by solving this cubic equation:

b' c' ω^{3} - c' ω^{2} - (2 a' + b') ω + 1 = 0.

(11)

These examples assume homogeneous letter probabilities, so that $c' = γ / ω^{2}$ is a unique function of t. In all cases, as t increases, the maximum possible value of ω rises to a peak, and then decreases toward 0.311. For the Simple:6:1 scoring scheme, there is a range of t for which ω can be arbitrarily close to 1. In precisely this range, $γ + 2 α$ (i.e. $γ + α_{D} + α_{I}$ ) cannot be arbitrarily close to 1. Both can be arbitrarily close to 1 at the two endpoints of this range.

Fig. 6. — How model A’s parameter limits vary with t, for four scoring schemes. Each scoring scheme is written as three colon-separated components, matrix:first gap cost:gap extension cost

For the other scoring schemes, ω can never be arbitrarily close to 1. Its maximum possible value is 0.948 for Simple:3:1 (when $c' \approx 0.980$ ), 0.962 for HoxD70:430:30 (when $c' \approx 0.999$ ) and 0.900 for BLOSUM50:12:2 (when $c' \approx 1.08$ ).

The other parameters have limits too. When t is low γ can be arbitrarily close to 1, but as t increases the maximum possible γ steadily decreases toward ${0.311}^{2}$ . As t increases, the maximum possible β rises to a peak, before decreasing toward 0.311. The maximum possible α is extremely low when t is low, but rises toward 0.452.

Previously, Durbin et al. (1998, Section 4.5) reverse-engineered a PHMM for BLOSUM50:12:2. Their PHMM has $ω = 0.8$ and $γ = 0.64$ (though its topology is slightly different from model A), which they argue is unrealistic since it implies very short sequences. The reverse-engineering problem is, however, underdetermined: e.g. when $c' = 1$ , the maximum possible ω is 0.88 and the maximum possible γ is 0.77. This probably does not alter their conclusion.

5.3 Balanced length probability

For model A, balanced length probability occurs when:

\frac{γ}{ω_{D} ω_{I}} + \frac{α_{D} (1 - β_{D})}{ω_{D} - β_{D}} + \frac{α_{I} (1 - β_{I})}{ω_{I} - β_{I}} = 1,

(12)

or equivalently:

C' = c' + \frac{a'_{D}}{1 - b'_{D}} + \frac{a'_{I}}{1 - b'_{I}} = 1.

(13)

When $C' < 1$ , there is an upper limit on $γ + α_{D} + α_{I}$ (Equation 10). When $C' > 1$ , there is an upper limit on ω_D and ω_I. Only when $C' = 1$ are there no such limits.

5.4 Non-uniqueness of t

For gapped (unlike gapless) alignment, balanced length probability does not imply a unique value for t. Recall that, if we assume homogeneous letter probabilities, there is one degree of freedom involving π, $ϕ$ , ψ, $c'$ and t. If we wish $C' = 1$ , then $c'$ must have some value <1 (Equation 13, assuming $b'_{D}$ and $b'_{I}$ are <1). Figure 3 shows there are two values of t for $c'$ slightly <1, and no t for $c'$ much <1. Accordingly, Figure 6 shows balanced length probability at two values of t for the Simple:6:1 scoring scheme, and none for the other three schemes. If the gap and mismatch scores are strong (highly negative), balanced length probability implies $c'$ slightly <1, thus two ts. If they are weak (near zero), balanced length probability implies $c'$ much <1, which never occurs for any t.

6 Sum of alignment probabilities

Probability models enable various useful calculations (e.g. Fig. 1), which are based on the total probability of all possible alignments (Durbin et al., 1998):

\sum_{paths} prob (path, R, Q) .

This can be calculated by the Forward algorithm (Durbin et al., 1998). The Forward algorithm for model A uses these recurrence relations, which calculate the sum of probabilities for aligning the length-i prefix of R with the length-j prefix of Q $(W'_{ij})$ , ending with R_i aligned to Q_j $(X'_{ij})$ , R_i aligned to a gap $(Y'_{ij})$ or Q_j aligned to a gap $(Z'_{ij})$ :

Algorithm III.

\begin{matrix} X'_{i 0} = Z'_{i 0} = 0 \forall_{i = 0}^{m} \\ X'_{0 j} = Y'_{0 j} = 0 \forall_{j = 0}^{n} \end{matrix}

\begin{matrix} W'_{ij} = X'_{ij} + Y'_{ij} + Z'_{ij} + 1 & \forall_{i = 0}^{m} \forall_{j = 0}^{n} \\ X'_{ij} = W'_{i - 1 j - 1} \cdot S'_{R_{i} Q_{j}} & \forall_{i = 1}^{m} \forall_{j = 1}^{n} \\ Y'_{ij} = W'_{i - 1 j} \cdot a'_{D} + Y'_{i - 1 j} \cdot b'_{D} & \forall_{i = 1}^{m} \forall_{j = 0}^{n} \\ Z'_{ij} = W'_{i j - 1} \cdot a'_{I} + Z'_{i j - 1} \cdot b'_{I} & \forall_{i = 0}^{m} \forall_{j = 1}^{n} \end{matrix}

\begin{matrix} \sum_{paths} prob (path, R, Q) / μ_{G} = \sum_{i = 0}^{m} \sum_{j = 0}^{n} W'_{ij} \end{matrix}

This algorithm uses four dynamic programming matrices, but $X'$ could easily be eliminated (at the expense of making the initializations a bit less simple).

Algorithm III is almost a mechanical transformation of Algorithm II. First, since it uses probabilities without log transformation, addition is replaced by multiplication, and the primitive terms are replaced by $exp (• / t)$ . Second, since it calculates the sum of probabilities rather than the optimal probability, maximization is replaced by summation. The only nuance is that the initialization of $W'$ has to be specified more carefully.

The Forward algorithm for model B, on the other hand, is very similar to Algorithm I (Supplementary Material 4.6). The important point is that, although Algorithms I and II give identical results, the two Forward algorithms do not. The precise form of the algorithm reflects the model topology, i.e. which possible paths there are. To find optimal alignments, this need not be fully specified, because there are paths that cannot be optimal (e.g. two length-1 deletions in a row), and so we need not care whether the model allows them. To calculate the total probability, on the other hand, we do need to specify the topology (possible paths).

7 Discussion

This study describes the many-to-one relationship between probability models and score parameters for sequence alignment. In order to perform alignment probability calculations with the Forward algorithm, we need not fully specify the probability model. (To get useful results, we should of course use a model that fits the data.) Starting from an alignment scoring scheme, we need only specify two further things: (i) a value for t and (ii) a model topology (or equivalently, the form of the dynamic programming algorithm). Typically, any value of t is valid and corresponds to some PHMM.

In practice, alignment is often done with ad hoc gap costs, but a probability-based score matrix (like HoxD70 or BLOSUM62) with homogeneous letter probabilities and uniform (gapless) length probability. For such a score matrix, it seems reasonable to use the unique t that recovers the original letter probabilities (π_xy). For gapped alignment, however, this corresponds to a model without balanced length probability: weaker gap costs increasingly favor longer alignments. This suggests that such scoring schemes are ‘wrong’, and should be fixed by subtracting a value (e.g. $t \ln [ω_{D} ω_{I} / γ]$ ) from the matrix scores.

On the other hand, if we get an alignment scoring scheme from a model with homogeneous letter probabilities and balanced length probability, it is generally impossible to recover the original t, because it is not unique.

7.1 Useful probability calculations

This study highlights the importance of Formula (1), which is fairly simple and should be more widely known. Although this formula is true by fiat in the partition function approach, it is true by derivation in the PHMM approach, provided we use the t corresponding to our PHMM. Even if we use the ‘wrong’ t, it will correspond to some other PHMM.

An example of the formula’s utility arises in alignment of short DNA reads to a reference genome. Suppose one read aligns strongly to three loci, A, B and C, with scores s_A, s_B and s_C. The probability that A is correct is:

prob (A) = \frac{exp (s_{A} / t)}{exp (s_{A} / t) + exp (s_{B} / t) + exp (s_{C} / t)} .

This assumes that exactly one locus is correct, which would be a good assumption for an ancestral reference, without reference-specific deletions or duplications (Frith and Khan, 2018). Ideally the denominator would sum over all possible alignments, and perhaps the numerator would sum over alternative alignments to the same locus, but in practice this simple calculation often works well (Frith et al., 2010). It is worth emphasizing the generality of this calculation, e.g. it remains valid if we use specialized alignment parameters to model AT-rich genomes or bisulfite-converted DNA (Frith et al., 2012), or incorporate sequence quality data into the model (Frith et al., 2010).

A more sophisticated use of the formula is to estimate the probability that an alignment part is correct:

prob (alignment part) = \frac{\sum_{alignments with the part} exp (alignment score / t)}{\sum_{alignments} exp (alignment score / t)}

In particular, we can estimate the probability that each column of an alignment is correct (Fig. 1) (Durbin et al., 1998). This is important when studying sequence divergence at fine resolution.

7.2 Model ambiguity

To infer an alignment part as reliably as possible, we should sum the probabilities of all alignments that include the part. Likewise, to infer an alignment as reliably as possible, we should sum the probabilities of all PHMM paths corresponding to that alignment. Thus it is convenient if just one path corresponds to one alignment: model A (Fig. 5) unfortunately lacks this property, but we can easily sum over (at least some) paths for each alignment (Supplementary Material 3.1).

7.3 Sequences with multiple similar segments

The models in this study (Figs 2, 4 and 5) model sequences with one similar segment, but long sequences often have multiple similar segments. This particular mismatch between model and reality can cause standard sequence comparison methods to work poorly (Zhang et al., 1999). This can be addressed by modeling a set of alignments (Frith and Kawaguchi, 2015) [which is more tractable when comparing a derived sequence to an ancestral sequence (Frith and Khan, 2018)]: but this method uses a partition function approach without clear correspondence to a generative probability model.

7.4 Alignment significance

A fundamental task is detection of significant sequence similarities, i.e. similarities stronger than is likely to occur by chance. ‘By chance’ is usually defined as: between random sequences of independent monomers. ‘Strength’ is usually defined as optimal alignment score, though a more powerful definition can be made by integrating over alternative alignments (Allison et al., 1992; Eddy, 2009).

Optimal local alignment scores of random sequences typically follow a Gumbel distribution, enabling us to know the significance of any score. For gapped alignment, it is not known how to determine this distribution from first principles, but typically it can be determined rapidly by importance sampling (Park et al., 2009). There is evidence that significant similarity reliably indicates homology of biological sequences, only if simple repeats are excluded in a particular way (Frith, 2011).

7.4.1 ‘Hybrid’ alignment

There is a special kind of alignment for which scores of random sequences appear to follow a simple distribution (Yu and Hwa, 2001). Here, alignment scores are computed by a modified Forward algorithm, where the final summation is replaced by maximization (e.g. $\max_{i = 0}^{m} \max_{j = 0}^{n} t \ln W'_{ij}$ in Algorithm III). Thus, these scores integrate over some, but not all, alternative alignments. (Note that integrating over all alignments is not necessarily desirable. For example, if we compare the human and mouse X chromosomes, our aim is probably not to judge whether the whole chromosomes are related.)

Suppose we define ‘by chance’ as: between two random sequences with monomer probabilities $Φ$ and $Ψ$ , and redefine $c' = \sum_{x, y} Φ_{x} Ψ_{y} S'_{xy}$ . Typically, $Φ = ϕ$ and $Ψ = ψ$ , so the redefinition does not change $c'$ . Anyway, if we use parameters that satisfy the balanced length condition (Equation 9 or 13) using this redefined $c'$ , these alignment scores follow a Gumbel distribution with scale parameter $λ = 1 / t$ (Yu and Hwa, 2001).

7.5 Aims of sequence comparison

One possible aim of sequence comparison is to judge whether two given sequences are related. For this aim, it is appropriate to calculate a likelihood ratio, for models of related and unrelated sequences (Eddy, 2008; Neyman and Pearson, 1933). Typical alignment methods do not do this, e.g. the Forward algorithm in this study outputs a probability ratio whose denominator is not the likelihood of an unrelated-sequences model. One symptom of this is that optimal alignment scores between random sequences tend to increase with sequence length (Eddy, 2008).

A different aim, more relevant for longer sequences, is to find related segments. For example, if we compare the human and mouse X chromosomes, it is of limited use to obtain one likelihood ratio indicating whether these chromosomes are related. We probably wish to find related parts. For this aim, it is natural that optimal alignment scores between random sequences tend to increase with sequence length, because the search space increases.

A related issue is how to report the significance of sequence similarities. A BLAST E-value is the expected number of ‘distinct’ alignments with greater or equal score, between two random sequences with lengths equal to the given query sequence and the database (Altschul et al., 1997, 2001). This means that identical alignments have different E-values depending on the query length. For example, if we find three identical alignments, two in a long query (say chromosome 1) and one in a short query (say chromosome Y), only the latter may be deemed significant. This is appropriate for judging whether each whole query has a significant match, but not for finding significant matches in a set of queries. For the latter aim, significance could be reported as, say, expected number of alignments per million query bases.

Supplementary Material

btz576_Supplementary_Data

Click here for additional data file.^{(281.7KB, pdf)}

Acknowledgements

I thank John Spouge, Michiaki Hamada, Kiyoshi Asai, Anish Shrestha and members of the CBRC Genome Meeting for useful comments.

Conflict of Interest: none declared.

References

Allison L. (1993) Normalization of affine gap costs used in optimal sequence alignment. J. Theor. Biol., 161, 263–269. [DOI] [PubMed] [Google Scholar]
Allison L. et al. (1992) Finite-state models in the alignment of macromolecules. J. Mol. Evol., 35, 77–89. [DOI] [PubMed] [Google Scholar]
Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402. [DOI] [PMC free article] [PubMed] [Google Scholar]
Altschul S.F. et al. (2001) The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Res., 29, 351–361. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cameron M. et al. (2004) Improved gapped alignment in BLAST. IEEE/ACM Trans. Comput. Biol. Bioinform., 1, 116–129. [DOI] [PubMed] [Google Scholar]
Chiaromonte F. et al. (2002) Scoring pairwise genomic sequence alignments. Pac. Symp. Biocomput., 7, 115–126. [DOI] [PubMed] [Google Scholar]
Durbin R. et al. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press. [Google Scholar]
Eddy S.R. (2008) A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput. Biol., 4, e1000069.. [DOI] [PMC free article] [PubMed] [Google Scholar]
Eddy S.R. (2009) A new generation of homology search tools based on probabilistic inference. Genome Inform., 23, 205–211. [PubMed] [Google Scholar]
Farrar M. (2007) Striped Smith-Waterman speeds database searches six times over other SIMD implementations. Bioinformatics, 23, 156–161. [DOI] [PubMed] [Google Scholar]
Frith M.C. (2011) A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Res., 39, e23. [DOI] [PMC free article] [PubMed] [Google Scholar]
Frith M.C., Kawaguchi R. (2015) Split-alignment of genomes finds orthologies more accurately. Genome Biol., 16, 106. [DOI] [PMC free article] [PubMed] [Google Scholar]
Frith M.C., Khan S. (2018) A survey of localized sequence rearrangements in human DNA. Nucleic Acids Res., 46, 1661–1673. [DOI] [PMC free article] [PubMed] [Google Scholar]
Frith M.C. et al. (2010) Incorporating sequence quality data into alignment improves DNA read mapping. Nucleic Acids Res., 38, e100.. [DOI] [PMC free article] [PubMed] [Google Scholar]
Frith M.C. et al. (2012) A mostly traditional approach improves alignment of bisulfite-converted DNA. Nucleic Acids Res., 40, e100.. [DOI] [PMC free article] [PubMed] [Google Scholar]
Gotoh O. (1982) An improved algorithm for matching biological sequences. J. Mol. Biol., 162, 705–708. [DOI] [PubMed] [Google Scholar]
Katoh K. et al. (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res., 30, 3059–3066. [DOI] [PMC free article] [PubMed] [Google Scholar]
Miyazawa S. (1995) A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng., 8, 999–1009. [DOI] [PubMed] [Google Scholar]
Neyman J., Pearson E.S. (1933) On the problem of the most efficient tests of statistical hypotheses. Phil. Trans. R. Soc. Lond. A, 231, 289–337. [Google Scholar]
Park Y. et al. (2009) Estimating the Gumbel scale parameter for local alignment of random sequences by importance sampling with stopping times. Ann. Statist., 37, 3697–3714. [DOI] [PMC free article] [PubMed] [Google Scholar]
Rognes T. (2011) Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinformatics, 12, 221.. [DOI] [PMC free article] [PubMed] [Google Scholar]
Smith T.F., Waterman M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197. [DOI] [PubMed] [Google Scholar]
Suzuki H., Kasahara M. (2018) Introducing difference recurrence relations for faster semi-global alignment of long sequences. BMC Bioinformatics, 19 (Suppl. 1), 45.. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yu Y.K., Altschul S.F. (2005) The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions. Bioinformatics, 21, 902–911. [DOI] [PubMed] [Google Scholar]
Yu Y.K., Hwa T. (2001) Statistical significance of probabilistic sequence alignment and related local hidden Markov models. J. Comput. Biol., 8, 249–282. [DOI] [PubMed] [Google Scholar]
Yu Y.K. et al. (2003) The compositional adjustment of amino acid substitution matrices. Proc. Natl. Acad. Sci. USA, 100, 15688–15693. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhang M.Q., Marr T.G. (1995) Alignment of molecular sequences seen as random path analysis. J. Theor. Biol., 174, 119–129. [DOI] [PubMed] [Google Scholar]
Zhang Z. et al. (1997) Aligning a DNA sequence with a protein sequence. J. Comput. Biol., 4, 339–349. [DOI] [PubMed] [Google Scholar]
Zhang Z. et al. (1999) Post-processing long pairwise alignments. Bioinformatics, 15, 1012–1019. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

btz576_Supplementary_Data

Click here for additional data file.^{(281.7KB, pdf)}

[btz576-B1] Allison L. (1993) Normalization of affine gap costs used in optimal sequence alignment. J. Theor. Biol., 161, 263–269. [DOI] [PubMed] [Google Scholar]

[btz576-B2] Allison L. et al. (1992) Finite-state models in the alignment of macromolecules. J. Mol. Evol., 35, 77–89. [DOI] [PubMed] [Google Scholar]

[btz576-B3] Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btz576-B4] Altschul S.F. et al. (2001) The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Res., 29, 351–361. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btz576-B5] Cameron M. et al. (2004) Improved gapped alignment in BLAST. IEEE/ACM Trans. Comput. Biol. Bioinform., 1, 116–129. [DOI] [PubMed] [Google Scholar]

[btz576-B6] Chiaromonte F. et al. (2002) Scoring pairwise genomic sequence alignments. Pac. Symp. Biocomput., 7, 115–126. [DOI] [PubMed] [Google Scholar]

[btz576-B7] Durbin R. et al. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press. [Google Scholar]

[btz576-B8] Eddy S.R. (2008) A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput. Biol., 4, e1000069.. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btz576-B9] Eddy S.R. (2009) A new generation of homology search tools based on probabilistic inference. Genome Inform., 23, 205–211. [PubMed] [Google Scholar]

[btz576-B10] Farrar M. (2007) Striped Smith-Waterman speeds database searches six times over other SIMD implementations. Bioinformatics, 23, 156–161. [DOI] [PubMed] [Google Scholar]

[btz576-B11] Frith M.C. (2011) A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Res., 39, e23. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btz576-B12] Frith M.C., Kawaguchi R. (2015) Split-alignment of genomes finds orthologies more accurately. Genome Biol., 16, 106. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btz576-B13] Frith M.C., Khan S. (2018) A survey of localized sequence rearrangements in human DNA. Nucleic Acids Res., 46, 1661–1673. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btz576-B14] Frith M.C. et al. (2010) Incorporating sequence quality data into alignment improves DNA read mapping. Nucleic Acids Res., 38, e100.. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btz576-B15] Frith M.C. et al. (2012) A mostly traditional approach improves alignment of bisulfite-converted DNA. Nucleic Acids Res., 40, e100.. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btz576-B16] Gotoh O. (1982) An improved algorithm for matching biological sequences. J. Mol. Biol., 162, 705–708. [DOI] [PubMed] [Google Scholar]

[btz576-B17] Katoh K. et al. (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res., 30, 3059–3066. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btz576-B18] Miyazawa S. (1995) A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng., 8, 999–1009. [DOI] [PubMed] [Google Scholar]

[btz576-B19] Neyman J., Pearson E.S. (1933) On the problem of the most efficient tests of statistical hypotheses. Phil. Trans. R. Soc. Lond. A, 231, 289–337. [Google Scholar]

[btz576-B20] Park Y. et al. (2009) Estimating the Gumbel scale parameter for local alignment of random sequences by importance sampling with stopping times. Ann. Statist., 37, 3697–3714. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btz576-B21] Rognes T. (2011) Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinformatics, 12, 221.. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btz576-B22] Smith T.F., Waterman M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197. [DOI] [PubMed] [Google Scholar]

[btz576-B23] Suzuki H., Kasahara M. (2018) Introducing difference recurrence relations for faster semi-global alignment of long sequences. BMC Bioinformatics, 19 (Suppl. 1), 45.. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btz576-B24] Yu Y.K., Altschul S.F. (2005) The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions. Bioinformatics, 21, 902–911. [DOI] [PubMed] [Google Scholar]

[btz576-B25] Yu Y.K., Hwa T. (2001) Statistical significance of probabilistic sequence alignment and related local hidden Markov models. J. Comput. Biol., 8, 249–282. [DOI] [PubMed] [Google Scholar]

[btz576-B26] Yu Y.K. et al. (2003) The compositional adjustment of amino acid substitution matrices. Proc. Natl. Acad. Sci. USA, 100, 15688–15693. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btz576-B27] Zhang M.Q., Marr T.G. (1995) Alignment of molecular sequences seen as random path analysis. J. Theor. Biol., 174, 119–129. [DOI] [PubMed] [Google Scholar]

[btz576-B28] Zhang Z. et al. (1997) Aligning a DNA sequence with a protein sequence. J. Comput. Biol., 4, 339–349. [DOI] [PubMed] [Google Scholar]

[btz576-B29] Zhang Z. et al. (1999) Post-processing long pairwise alignments. Bioinformatics, 15, 1012–1019. [DOI] [PubMed] [Google Scholar]

PERMALINK

How sequence alignment scores correspond to probability models

Martin C Frith

Roles

Abstract

Motivation

Results

Supplementary information

1 Introduction

Fig. 1.

1.1 Review of score-based alignment

1.1.1 Degrees of freedom

1.1.2 Algorithms for local alignment

Algorithm I.

Algorithm II.

1.2 Review of alignment probability models

Fig. 2.

2 Degrees of freedom in the gapless model

2.1 Homogeneous letter probabilities

2.2 Uniform length probability

3 Examples

Table 1.

Table 2.

Fig. 3.

3.1 Effect of mismatch score

4 Linear gap costs

Fig. 4.

4.1 Degrees of freedom

4.2 Balanced length probability

5 Affine gap costs

Fig. 5.

5.1 Degrees of freedom

5.2 Limits to degrees of freedom

Fig. 6.

5.3 Balanced length probability

5.4 Non-uniqueness of t

6 Sum of alignment probabilities

Algorithm III.

7 Discussion

7.1 Useful probability calculations

7.2 Model ambiguity

7.3 Sequences with multiple similar segments

7.4 Alignment significance

7.4.1 ‘Hybrid’ alignment

7.5 Aims of sequence comparison

Supplementary Material

Acknowledgements

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases