Blood -- Comprehensive assessment of T-cell receptor β-chain diversity in αβ T cells

Blood, Vol. 114, Issue 19, 4099-4107, November 5, 2009

Comprehensive assessment of T-cell receptor β-chain diversity in αβ T cells
Blood Robins et al. 114: 4099

Supplemental materials for: Robins et al

Error correction

Our experiment has multiple potential sources of error (or noise) in the sequence data. Two of the major contributors are PCR amplification errors and Genome Analyzer sequencing errors. Anticipating these errors, we chose the quantity of DNA for sequencing so that our sequence data would in effect act as an error correcting code. The principle underlying error correcting codes is that redundancy in the data can be used to reconstruct the original message sent through a “noisy” channel (assuming that the channel is not too noisy). To enable the implementation of an error correcting code, the quantity of genomic DNA used for sequencing in each lane of the Genome Analyzer flow cell had to be adjusted so that each TCRβ CDR3 region would be sequenced multiple times. Each lane of the Genome Analyzer produces 5–8 million sequence reads, so we attempted to sequence a million TCRβ CDR3 templates in each lane, which would generate, on average, at least five sequence reads from each CDR3 template. Human DNA has ~300 duplex genomes per nanogram. In our experiment, we utilized on the order of ten micrograms of DNA as our template for PCR amplification (see Table S1).

The total error rate should average less than 1% per base, according to both the PCR and Genome Analyzer specifications. The Genome Analyzer error rate is position-dependent and remains very low until cycle 40. The first 40 nucleotides of each sequence read are sufficient to capture the entirety of any particular TCRβ CDR3 sequence. This low error rate suggests that multiple errors in the same sequence are rare within the first 40 nucleotides of the sequence read.
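The coverage and error-rate arithmetic above can be checked with a short calculation. The sketch below is illustrative only: it assumes that errors occur independently at each base and uses a uniform 1% per-base rate, which is the upper bound quoted above, not a measured value. The function name `prob_more_than_k_errors` is ours.

```python
from math import comb


def prob_more_than_k_errors(n_bases: int, per_base_error: float, k: int) -> float:
    """Probability of more than k errors among n_bases positions,
    assuming independent errors at a uniform per-base rate (binomial model)."""
    p_at_most_k = sum(
        comb(n_bases, i) * per_base_error**i * (1 - per_base_error) ** (n_bases - i)
        for i in range(k + 1)
    )
    return 1.0 - p_at_most_k


# Reads per template: 5-8 million reads per lane over ~1 million templates.
reads_per_lane = (5_000_000, 8_000_000)
templates_per_lane = 1_000_000
coverage = tuple(r / templates_per_lane for r in reads_per_lane)

# Chance that a 40-nt read carries more than two errors at the 1% upper bound.
p_gt2 = prob_more_than_k_errors(n_bases=40, per_base_error=0.01, k=2)

print(f"reads per template: {coverage[0]:.0f} to {coverage[1]:.0f}")
print(f"P(more than 2 errors in 40 nt) = {p_gt2:.4f}")
```

Under these assumptions, fewer than 1% of reads would carry more than two errors in the first 40 nucleotides, which is why an error tolerance of two mismatches is a reasonable choice.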

We also know that the entire set of TCRβ CDR3 sequences present in an individual (~4 × 10⁶) sparsely covers the space of all theoretically possible TCRβ CDR3 sequences (~10¹⁰). We therefore expect two independent TCRβ CDR3 rearrangement events to have negligible probability of generating CDR3 sequences that differ by three or fewer mismatches.

Combining the low error rate and sparse covering of the sequence space allows a simple metric to be used to correct the error in the sequences. Our strategy is to cluster all sequences into groups with Hamming distance less than or equal to two. These groups represent the unique TCRβ CDR3 sequences, including errors. The true underlying sequence in each cluster is resolved using parsimony, although this is not required for the calculation of sequence diversity.
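As a minimal sketch of this clustering step, the code below groups reads whose Hamming distance is at most two and picks a representative for each group. Several choices here are assumptions rather than details given in the text: groups are formed by single linkage (a read joins a cluster if it is within two mismatches of any member), only reads of equal length are compared, and parsimony is approximated by taking the most abundant read in each cluster as the underlying sequence. It uses the naive all-against-all comparison, so it is suitable only for small inputs.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads, max_dist=2):
    """Group reads by single linkage at Hamming distance <= max_dist.

    Returns a list of (representative_sequence, total_read_count) pairs,
    where the representative is the most abundant read in the cluster
    (an assumed stand-in for the parsimony step).
    """
    counts = Counter(reads)
    uniq = list(counts)
    parent = list(range(len(uniq)))  # union-find forest over unique reads

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Naive O(N^2) pairwise comparison of unique reads.
    for i in range(len(uniq)):
        for j in range(i + 1, len(uniq)):
            if len(uniq[i]) == len(uniq[j]) and hamming(uniq[i], uniq[j]) <= max_dist:
                parent[find(i)] = find(j)

    clusters = {}
    for i, seq in enumerate(uniq):
        clusters.setdefault(find(i), Counter())[seq] = counts[seq]

    return [(c.most_common(1)[0][0], sum(c.values())) for c in clusters.values()]
```

The number of clusters returned is the quantity that matters for diversity; the representative sequence is only needed when the underlying CDR3 sequence itself is of interest.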

The strategy is simple, but the computational task is not straightforward, because of the large number of sequences (on the order of 10⁷) that need to be compared in order to perform the clustering. With N ≈ 10⁷ CDR3 sequences, a naive all-against-all approach requires on the order of N² = (10⁷)² = 10¹⁴ sequence comparisons. We utilized a computational “trick” to reduce the number of sequence comparisons needed for the clustering to order N log N. The objective is to identify pairs of sequences with Hamming distance less than 3. If each sequence in a pair is first broken up into 3 non-overlapping subsequences, then at least one of those subsequences must be an exact match between the two sequences, because at most two mismatches cannot fall in all three subsequences. Therefore, we sort our list of sequences in 3 different ways, once by each of the non-overlapping substrings. The sorting is fast, O(N log N). Since each substring has > 10 nucleotides, we reduce the number of Hamming distance comparisons needed by a factor > 4¹⁰, which leaves a number of comparisons that is very small for our data set. This strategy can be readily utilized to quickly search for sequences with any number of allowable errors significantly less than the total string size. It leverages the extremely fast matching of identical strings.
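The substring-sorting idea can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: the function names (`split_three`, `candidate_pairs`, `close_pairs`) are ours, sequences are cut into three roughly equal parts, and only sequences of equal length are compared. Each pass sorts the sequences by one substring, so sequences sharing that substring become adjacent, and the Hamming distance is computed only within those groups.

```python
from itertools import groupby


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def split_three(seq: str):
    """Cut a sequence into three non-overlapping, roughly equal substrings."""
    n = len(seq)
    a, b = n // 3, 2 * n // 3
    return seq[:a], seq[a:b], seq[b:]


def candidate_pairs(seqs):
    """Index pairs (i < j) of equal-length sequences that share at least one
    identical substring in the same position (first, middle, or last third)."""
    pairs = set()
    for k in range(3):
        # Sort by (length, k-th substring); ties are ordered by index.
        keyed = sorted((len(s), split_three(s)[k], i) for i, s in enumerate(seqs))
        for _, group in groupby(keyed, key=lambda t: (t[0], t[1])):
            idx = [i for _, _, i in group]
            for x in range(len(idx)):
                for y in range(x + 1, len(idx)):
                    pairs.add((idx[x], idx[y]))
    return pairs


def close_pairs(seqs, max_dist=2):
    """All index pairs within max_dist mismatches.

    Complete only for max_dist <= 2: with three substrings, two mismatches
    must leave at least one substring identical (pigeonhole argument).
    """
    return {
        (i, j)
        for i, j in candidate_pairs(seqs)
        if hamming(seqs[i], seqs[j]) <= max_dist
    }
```

To allow k mismatches in general, the same argument requires k + 1 non-overlapping substrings. The resulting pairs could then be merged into clusters, for example with a union-find structure.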

Files in this Data Supplement: