Appendix 1: S1–S2 Filter
Assessing the alignment of sequence reads involves calculating only the sensitivity, but in practice one also wants high specificity. We find that a simple filter, the so-called (s1–s2) filter, can be used to decide whether to accept or reject the highest-scoring alignment of a sequence read. The (s1–s2) filter accepts the highest-scoring alignment only if the difference between the scores of the two highest-scoring alignments exceeds a threshold. Because of its simplicity, we were able to apply the (s1–s2) filter to alignments generated with both the BLASTZ and Smith-Waterman algorithms.
With a threshold chosen for each alignment method, the (s1–s2) filter rejects nearly all of the instances in which the top-scoring alignment is incorrect, while discarding few of the instances in which the top-scoring alignment is correct. We find that in these tests, the difference between the top two alignment scores (s1–s2) is much more effective than the top score alone (s1) in deciding whether to accept or reject the best alignment.
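The acceptance rule just described can be written in a few lines. The following is a minimal sketch, not the code used in the analysis; the function name `s1_s2_filter` is hypothetical, and the treatment of a read with only one reported alignment (taking s2 = 0) is an assumption that the text does not specify.

```python
from typing import Sequence


def s1_s2_filter(scores: Sequence[float], threshold: float) -> bool:
    """Return True if the top-scoring alignment of a read should be accepted.

    `scores` holds the scores of all alignments found for one sequence read.
    The top alignment is accepted only if (s1 - s2) exceeds `threshold`.
    """
    if not scores:
        # No alignment at all: nothing to accept.
        return False
    ranked = sorted(scores, reverse=True)
    s1 = ranked[0]
    # Assumption: with a single alignment there is no competitor, so s2 = 0.
    s2 = ranked[1] if len(ranked) > 1 else 0.0
    return (s1 - s2) > threshold
```

For example, with a threshold of 3,000, a read whose two best scores are 9,000 and 5,000 would be accepted, whereas a read with scores 9,000 and 7,000 would be rejected even though its top score is the same.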
Below we present the data behind these conclusions and then offer a heuristic explanation and discuss the choice of cutoff for the (s1–s2) filter.
The Data. For each of the 1,590 sequence reads aligned by each of the two methods, we recorded the scores of the best and second-best alignments and displayed this information graphically (see Fig. 7).
The goal in filtering the set of top-scoring alignments (each sequence read has a top-scoring alignment somewhere in the human genome) is to reject incorrect alignments (blue) while retaining correct alignments (red). In Fig. 7, a 45° line [i.e., a threshold for (s1–s2)] appears to separate the red and blue points better than a vertical line [a threshold for (s1)]. For the BLASTZ alignments, we confirm this intuition by plotting the histograms of (s1) for both the red and blue points and comparing them with the corresponding histograms of (s1–s2). In the latter case, there is much better separation between the two histograms.
A visual impression of increased separation between histograms (Fig. 8) can be misleading, so we also generated receiver operating characteristic (ROC) curves (Fig. 9) for all four combinations of alignment method (BLASTZ and Smith-Waterman) and filtering method [(s1–s2) and just (s1)]. In an ROC curve, the rate of false positives (FP/n) is shown on the x axis, and the rate of true positives (TP/n) is shown on the y axis, where n is the total number of sequence reads. As the threshold for either (s1) or (s1–s2) is continuously lowered, both the false-positive and true-positive rates increase, tracing out the ROC curve.
The fraction of reads with an orthologous position is some unknown fraction q < 1.0; this forms a horizontal line at TP = q, indicating the best performance possible. If the alignment method performed perfectly, always giving the orthologous position (if there is one) as the top alignment, and every top-scoring alignment were accepted, then it would have a true-positive rate of (q) and a false-positive rate of (1 – q). In practice, with every top-scoring alignment accepted, an alignment method has a true-positive rate of (p) and a false-positive rate of (1 – p), with p < q.
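The construction of such a curve can be sketched as follows. This is an illustrative sketch only: the function name `roc_points` and its inputs are hypothetical, and both rates are normalized by the total number of reads n, following the convention in the text rather than the more common per-class normalization.

```python
from typing import List, Sequence, Tuple


def roc_points(values: Sequence[float],
               is_correct: Sequence[bool]) -> List[Tuple[float, float]]:
    """Trace (FP/n, TP/n) as the acceptance threshold is lowered.

    `values[i]` is the filter statistic for read i, either s1 or (s1 - s2),
    and `is_correct[i]` says whether the top alignment of read i is correct.
    """
    n = len(values)
    # Visit reads from the highest statistic to the lowest.
    ranked = sorted(zip(values, is_correct), key=lambda t: t[0], reverse=True)
    points = [(0.0, 0.0)]  # threshold so high that nothing is accepted
    tp = fp = 0
    i = 0
    while i < n:
        current = ranked[i][0]
        # Lowering the threshold past `current` accepts all reads tied there.
        while i < n and ranked[i][0] == current:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n, tp / n))
    return points
```

Calling `roc_points` once with the (s1) values and once with the (s1–s2) values for the same reads would give two curves that can be compared directly; the final point of each is (1 – p, p) in the notation of the text.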
For a given performance p < q of the alignment method, perfect performance of the filtering method would mean that all of the correct alignments can be accepted without accepting any incorrect alignments. The ROC curve for the perfect filter would begin at (0, 0) when the threshold is so high that no alignments are accepted, rise to (FP = 0, TP = p) as the threshold is lowered and the correct alignments are all accepted, make a right-angled turn, and then, as the threshold is lowered further, continue horizontally to (FP = 1 – p, TP = p) as the incorrect alignments are also accepted.
In practice, we use the BLASTZ alignments with a cutoff of (s1–s2) > 3,000 for most of the analyses presented in the paper. For the BLASTZ alignments, p = 0.515 is the fraction of sequence reads for which the top alignment is correct. Before filtering, the false-positive and true-positive rates are (FP = 0.485 = 771/1,590; TP = 0.515 = 819/1,590), and after filtering by (s1–s2) > 3,000, the performance improves to (FP = 0.005 = 8/1,590; TP = 0.470 = 747/1,590).
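The arithmetic behind these rates can be checked directly from the counts quoted above. The short sketch below uses only those counts; the two derived quantities at the end (the fraction of correct top alignments retained and the fraction of incorrect ones rejected) are not stated in the text and are computed here purely for illustration.

```python
N_READS = 1590  # total sequence reads

# Counts quoted in the text for BLASTZ alignments.
TP_BEFORE, FP_BEFORE = 819, 771   # every top alignment accepted
TP_AFTER, FP_AFTER = 747, 8       # after requiring (s1 - s2) > 3,000


def rates(tp: int, fp: int, n: int = N_READS) -> tuple:
    """Return (FP/n, TP/n), both normalized by the total number of reads."""
    return (fp / n, tp / n)


fp_rate_before, tp_rate_before = rates(TP_BEFORE, FP_BEFORE)
fp_rate_after, tp_rate_after = rates(TP_AFTER, FP_AFTER)

# Derived for illustration only (not stated in the text).
correct_retained = TP_AFTER / TP_BEFORE        # share of correct tops kept
incorrect_rejected = 1 - FP_AFTER / FP_BEFORE  # share of incorrect tops removed

print(f"before: FP={fp_rate_before:.3f}, TP={tp_rate_before:.3f}")
print(f"after:  FP={fp_rate_after:.3f}, TP={tp_rate_after:.3f}")
print(f"correct retained: {correct_retained:.3f}")
print(f"incorrect rejected: {incorrect_rejected:.3f}")
```

Note that the counts before filtering sum to the total number of reads (819 + 771 = 1,590), as expected when every top-scoring alignment is accepted.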
Heuristic Explanation. We propose a heuristic explanation for why the (s1–s2) filter is effective at determining whether the top alignment is correct. If a sequence read has an orthologous position in the human genome, then the score associated with its correct alignment is a random variable, depending on such things as the fraction of the bases in the read that have an ortholog in humans, whether there are any big insertions or deletions, and the local mutation rate. There is one such correct score associated with each sequence read. Each sequence read also has its own "background distribution" of scores for incorrect alignments. Different sequence reads have different background distributions, which may depend on repeat content, sequence complexity, and other factors.
The top-scoring alignment of a read will be correct if and only if the score of the true alignment exceeds the highest score from the background distribution. Now assume that a reasonable cutoff for the (s1–s2) filter is known. The only way that the (s1–s2) filter would discard a correct top alignment would be if the score of the true alignment were just barely larger than the best score from the background distribution, and there is no reason for this coincidence to occur. The score of the true alignment is more likely to be much larger or much smaller than the best score from the background distribution. Thus, the rejection of a correct top alignment is a rare event.
The only way that the (s1–s2) filter would accept an incorrect alignment would be if the top two extreme values from the background distribution were widely separated. However, it is a general property of extreme values from any distribution that the top two extreme values are very close, if enough samples are taken and the tails of the distribution are not exceedingly broad. Thus, the acceptance of an incorrect top alignment is a rare event.
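The behavior of the top two extreme values can be illustrated with a small simulation. The sketch below is a toy example under a strong assumption: the background scores are drawn from a standard Gaussian, which is chosen only because its tails are not broad and is not a model of real alignment scores. The function name `mean_top_two_gap` and all parameter values are hypothetical.

```python
import heapq
import random


def mean_top_two_gap(n_samples: int, n_trials: int,
                     rng: random.Random) -> float:
    """Average gap (s1 - s2) between the two largest of `n_samples` draws."""
    total = 0.0
    for _ in range(n_trials):
        # Assumption: background scores are i.i.d. standard Gaussian.
        scores = (rng.gauss(0.0, 1.0) for _ in range(n_samples))
        s1, s2 = heapq.nlargest(2, scores)
        total += s1 - s2
    return total / n_trials


rng = random.Random(0)  # fixed seed so the run is repeatable
gap_few = mean_top_two_gap(n_samples=10, n_trials=500, rng=rng)
gap_many = mean_top_two_gap(n_samples=1000, n_trials=500, rng=rng)
print(f"mean gap with 10 background scores:   {gap_few:.3f}")
print(f"mean gap with 1000 background scores: {gap_many:.3f}")
```

Under this assumption the average gap shrinks as the number of background samples grows, which is the property the argument relies on. A distribution with much heavier tails need not behave this way, consistent with the caveat in the text.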