Abstract
We have entered the era of individual genomic sequencing, and can already see exponential progress in the field. It is of utmost importance to exclude false-positive variants from reported datasets. However, because of the nature of the algorithms used, this task has not been optimized to the required level of precision. This study presents a unique strategy for identifying SNPs, called COIN-VGH, that largely minimizes the presence of false-positives in the generated data. The algorithm was developed using the X-chromosome–specific regions from the previously sequenced genomes of Craig Venter and James Watson. The algorithm is based on the concept that a nucleotide can be individualized if it is analyzed in the context of its surrounding genomic sequence. COIN-VGH consists of defining the most comprehensive set of nucleotide strings of a defined length that map with 100% identity to a unique position within the human reference genome (HRG). This set is used to retrieve sequence reads from a query genome (QG), allowing the production of a genomic landscape that represents a draft HRG-guided assembly of the QG. This landscape is analyzed for specific signatures that indicate the presence of SNPs. The fidelity of the variation signature was assessed using simulation experiments by virtually altering the HRG at defined positions. Finally, the signature regions identified in the HRG and in the QG reads are aligned and the precise nature and position of the corresponding SNPs are detected. The advantages of COIN-VGH over previous algorithms are discussed.
Keywords: DNA sequencing, genomic algorithms, genomic analysis, individual human genomes, single nucleotide variants
Unraveling the relationships between genotype and phenotype is central to understanding the differences among human beings and using such knowledge to benefit mankind. In this regard the human reference genome (HRG) represents a crucial landmark (1, 2), as it comprises the nucleotide sequence of most of the euchromatic component of the human genome and offers a high-quality standard to which other human sequences can be compared.
It has been well established that human genomes can vary at the different levels of potential mutation, ranging from SNPs and microindels to genomic rearrangements that result in different structural variations (3–5). It has proven particularly challenging to elucidate the genetic predispositions to common diseases, the genetic factors underlying variations in drug responses, and the mutations that can drive specific cancers. The HapMap Project was established to determine the common human variations within different population groups (6, 7), and researchers are currently using nucleotide sequencing of individual genomes to gain a comprehensive understanding of the genomic differences among humans.
The first whole individual genome draft constructed was that of Craig Venter (CV) (8) and was obtained using the Sanger methodology, producing sequence reads of ∼800 nucleotides. The second was that of James Watson (JW) (9), which was obtained with the 454 pyrosequencing technology, with reads of ∼250 nucleotides. To date, several individual genomes, including those of individuals from different populations, have been sequenced using the latest generation of high-throughput platforms (10, 11). Recently, the 1,000 Genomes Project (12) was launched with the stated goal of constructing a highly complete catalog of human variation.
In order for researchers to take full advantage of the data from individual genomic sequencing, very high standards of quality must be achieved. In particular, it is imperative to avoid the inclusion of false-positive variants (13). The contamination of databases with false-positive results inevitably lowers the quality of, and confidence in, the overall accumulated data. Sequencing errors inherent to the particular platforms used and, most important, the mislocalization of sequence reads or partial assemblies onto the HRG, are the most common causes of mistakes. Different types of quality filters have been used to cope with these problems during data analysis, but only partial success has been achieved to date (9, 14, 15). In this regard, recent studies have highlighted the importance of considering the potential alignment ambiguity of different regions of the genome (16).
The present study offers a unique strategy for the precise identification and localization of SNPs in sequenced human genomes. Throughout this work we use the term SNP in a broad sense, without implying its frequency in the population. This strategy was developed using the haploid chromosome X-specific regions from the previously sequenced genomes of CV and JW as working models. The bioinformatic pipeline of our strategy, which we call COIN-VGH, culminates with the direct support of each SNP with high-quality local alignments between corresponding regions of the query genome (QG) and HRG.
Results
Rationale of the COIN-VGH Strategy.
If a given nucleotide is considered in the context of its surrounding genomic sequence, at a certain sequence length the nucleotide becomes unique in the genome and thus unambiguously identifiable. The length of the sequence required for identification varies for each nucleotide, but most nucleotides within the HRG may be identified using relatively small strings (see below). The sequence strings used as probes for nucleotide identification, herein dubbed COIN-Strings (CSs), produce only one perfect match from the entire HRG and each can be used to identify all of the nucleotides present in its sequence. Here, we use all of the CSs present within a target genomic region for analysis.
Once the CS set has been defined for a particular genomic region, each CS is hybridized in silico (virtual genomic hybridization, VGH) against all sequence reads from the QG. All of the CSs are ordered according to their positions in the HRG, and the number of QG sequence reads attracted and their identifiers are indicated, thereby creating a COIN-VGH genomic landscape (CVGL). We then analyze the CVGL to detect signatures indicating the presence of SNPs (see below). Finally, alignments between the signature regions from the HRG and the QG are used to reveal the nature and precise location of each SNP.
Selection of the CS Set.
The length of the used CSs defines the number of nucleotides that can be unambiguously identified in the corresponding genomic region to be analyzed. The present study focuses on the chromosome X-specific region, which comprises 148,431,524 sequenced nucleotides in the HRG, Build 37. To determine the relation between the size of the CSs and the identifiable nucleotides, the target region was divided into sets of strings of different size. For each set, starting at the 5′ end of the region, strings move downstream one nucleotide with respect to the preceding one. Each set thus represents the most comprehensive array of sequence strings of a specific size that can be constructed for a particular genomic region.
Each string of a particular set is then hybridized in silico against the whole sequence of the HRG (Materials and Methods) to detect the corresponding CSs that are, as mentioned above, those that can be uniquely mapped with 100% identity to the HRG. The total number of identifiable nucleotides can then be inferred for each set based on the regions covered by the corresponding CSs.
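The selection step can be illustrated with a minimal sketch on a toy sequence. It is not the pipeline used in this study, which relied on Bowtie (Materials and Methods); here an in-memory k-mer count stands in for the mapping step. The function names, the toy genome, and the decision to count perfect matches on both strands are assumptions made for illustration only.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def coin_string_starts(genome: str, start: int, end: int, k: int) -> list[int]:
    """Start positions (0-based) in [start, end) of k-mers with exactly one
    perfect match in the whole genome.

    Assumption: matches are counted on both strands; a k-mer that equals its
    own reverse complement is counted once per locus.
    """
    forward = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    starts = []
    for i in range(start, end - k + 1):
        kmer = genome[i:i + k]
        rc = reverse_complement(kmer)
        hits = forward[kmer] + (forward[rc] if rc != kmer else 0)
        if hits == 1:
            starts.append(i)
    return starts


def identifiable_fraction(starts: list[int], k: int, start: int, end: int) -> float:
    """Fraction of nucleotides in [start, end) covered by at least one CS."""
    covered = set()
    for s in starts:
        covered.update(range(s, s + k))
    return len(covered) / (end - start)


# Toy example (hypothetical data): 3-mers of a 9-nucleotide "genome".
toy_genome = "ACGACGTTT"
toy_starts = coin_string_starts(toy_genome, 0, len(toy_genome), 3)
print(toy_starts)
print(identifiable_fraction(toy_starts, 3, 0, len(toy_genome)))
```

In the toy sequence, the string ACG occurs twice and CGT is its reverse complement, so none of these qualifies as a CS, yet their nucleotides can still be identified through neighboring CSs that overlap them.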
The HRG does not yield any CS of 10 nucleotides. As shown in Fig. 1, CSs of 15 nucleotides are capable of uniquely identifying only about 10% of the total nucleotides in the region. The number of identifiable nucleotides significantly increases, to about 80%, with CSs of 20 nucleotides and then increases steadily thereafter, reaching about 95% when strings of 100 nucleotides are used. It is important to point out that the COIN-VGH strategy uses the unmasked HRG and can uniquely identify nucleotides in highly repetitive genomic elements (e.g., transposons) and highly similar regions (e.g., segmental duplications). As might be expected, exons can be identified using smaller CSs, and the regions containing fewer identifiable nucleotides using CSs of a technologically useful size are those corresponding to highly identical (>99%) segmental duplications. However, even in the latter regions, numerous nucleotides can be unambiguously identified. For the present study, we chose to use the set of CSs of 50 nucleotides, which were able to identify about 92% of the total nucleotides present in the target region.
Fig. 1.
CS length and the genomic coverage of COIN-VGH. The X-specific region of the HRG was divided into all possible strings of the indicated lengths, and each string was mapped to the HRG. Strings presenting only one perfect match were designated as CSs and used to form COIN-Tigs (see text), and the total number of identifiable nucleotides was calculated for different categories of functional and structural elements present in the region. The tridimensional plot relates the percentage of identifiable nucleotides (y axis) to the different CS lengths (z axis) for the different elements analyzed (x axis). T, total nucleotides; E, exons; I, introns; IG, intergenic regions; R, common repeats; SD, segmental duplications, with the percentage of identity indicated; and N, nucleotides.
Construction of the CVGL.
The chromosome X sequence contained in version 37 of the HRG is reported to be 155,270,561 nucleotides long. It includes two pseudoautosomal regions, PAR 1 and PAR 2, which are located at the tips of the p and q arms, respectively, and has 15 sequence gaps (Fig. 2A). As mentioned above, the X-specific region comprises 148,531,524 sequenced nucleotides. Here, we generated a set of 127,804,890 50-nucleotide CSs covering the region and ordered them according to their positions in the HRG. Overlapping CSs can be concatenated to form a contig (herein referred to as a COIN-Tig), and a region not covered by CSs is referred to as a COIN-Gap. All of the nucleotides in a COIN-Tig can be uniquely identified, whereas the nucleotides present in COIN-Gaps are not accessible to analysis by COIN-VGH with the set of CSs used here.
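The grouping of CSs into COIN-Tigs and COIN-Gaps amounts to merging intervals, as in the following sketch. The function names and the use of 0-based, half-open coordinates are illustrative assumptions. As a further assumption, CSs that abut without overlapping are merged as well, so that no zero-length COIN-Gap is produced.

```python
def coin_tigs(cs_starts: list[int], k: int) -> list[tuple[int, int]]:
    """Merge the intervals [s, s + k) of all CSs into COIN-Tigs.

    Coordinates are 0-based and half-open. Assumption: abutting CSs
    (no uncovered nucleotide between them) are merged too.
    """
    tigs: list[tuple[int, int]] = []
    for s in sorted(cs_starts):
        if tigs and s <= tigs[-1][1]:
            # Extends the current COIN-Tig.
            tigs[-1] = (tigs[-1][0], max(tigs[-1][1], s + k))
        else:
            # Starts a new COIN-Tig.
            tigs.append((s, s + k))
    return tigs


def coin_gaps(tigs: list[tuple[int, int]], region_start: int,
              region_end: int) -> list[tuple[int, int]]:
    """Return the stretches of [region_start, region_end) not covered by any COIN-Tig."""
    gaps = []
    cursor = region_start
    for tig_start, tig_end in tigs:
        if tig_start > cursor:
            gaps.append((cursor, tig_start))
        cursor = max(cursor, tig_end)
    if cursor < region_end:
        gaps.append((cursor, region_end))
    return gaps


# Toy example (hypothetical positions): five 4-nucleotide CSs in a 20-nucleotide region.
tigs = coin_tigs([0, 1, 2, 10, 11], k=4)
print(tigs)                    # [(0, 6), (10, 15)]
print(coin_gaps(tigs, 0, 20))  # [(6, 10), (15, 20)]
```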
Fig. 2.
Construction of the CVGL. (A) Construction of the CS set. (Upper) A schematic of the general landscape of the X chromosome from the HRG. The two pseudoautosomal regions are shown (large bars at the tips), as are the regions for which build 37 does not contain sequence information (inner bars). (Lower Left) A magnified portion of the X-specific region shows the COIN-Tigs (CT) and COIN-Gaps (CG). (Lower) A magnified portion of a COIN-Tig shows the relative positions of CSs and highlights an identifiable nucleotide (CN, green). (B) Schematic of the VGH with two CSs, one that retrieves three sequence reads from the QG (black) and another that lacks a perfect match within the set of QG reads (red). (C) Schematic of the ongoing construction of the CVGL, including the data from the two CSs shown in B.
In the next step, VGH, each CS is mapped in silico against all reads of the QG, and exact matches are retrieved (Fig. 2B). This step represented virtual hybridizations of about 133 million CSs against about 32- and 74-million sequence reads from the CV and JW genome projects, respectively.
Next, the CVGL of the QG is constructed by listing each CS according to its position in the HRG and indicating the number of sequence reads attracted from the QG along with their identifiers (Fig. 2C). Notably, the CVGL constitutes a partial HRG-based assembly of the QG (Discussion).
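A minimal sketch of the VGH and CVGL steps follows. In this study the CSs were mapped to the reads with Bowtie (Materials and Methods); the sketch swaps in a plain in-memory k-mer index instead. All names and the toy data are hypothetical, and the sketch assumes that a read may match a CS on either strand.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_read_index(reads: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map every k-mer found in the reads to the identifiers of the reads containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for read_id, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)
    return index


def build_cvgl(reference: str, cs_starts: list[int], reads: dict[str, str],
               k: int) -> list[tuple[int, int, list[str]]]:
    """List each CS by reference position with the number and identifiers of
    the reads it retrieves by exact match (either strand)."""
    index = build_read_index(reads, k)
    landscape = []
    for s in sorted(cs_starts):
        cs = reference[s:s + k]
        hits = index.get(cs, set()) | index.get(reverse_complement(cs), set())
        landscape.append((s, len(hits), sorted(hits)))
    return landscape


# Toy example (hypothetical data): r1 is a forward read, r2 a reverse-strand read.
reference = "GATTACAGGCT"
reads = {"r1": "GATTACAG", "r2": reverse_complement("TACAGGCT")}
for row in build_cvgl(reference, list(range(7)), reads, k=5):
    print(row)
```

Each row of the output corresponds to one line of the landscape: the CS position in the reference, the number of reads attracted, and their identifiers.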
Signature for the Presence of SNPs.
Analysis of the CVGL reveals signatures indicating the presence of SNPs between the QG and the HRG. The signature is schematically represented in Fig. 3. A segment covering 520 nucleotides of the CVGL is shown in Fig. 3A. The actual data from this sector of the CVGL are presented in Dataset S1. The number of reads of the QG (in this case that of CV) attracted by each CS is plotted according to the position of the first nucleotide of the corresponding CS in the HRG. As can be appreciated, the number of reads attracted drops to zero in a region corresponding to 50 nucleotides, herein called the zero sector.
Fig. 3.
Signature for the presence of SNPs in the CVGL. (A) Representation of a fragment of the CVGL, with the number of reads retrieved from the QG plotted with respect to the corresponding CSs. The numbers in the x axis indicate the position of the region in the CV genome. (B) Bidimensional plot looking more closely at the region containing a SNP signature. Symbols: black dot, position of the SNP; green dots, initial position of each CS that retrieved sequence reads from the QG; red dots, initial positions of the CSs that did not retrieve sequence reads from the QG. (C) Schematic of the whole signature region indicating the position of the pCS and dCS (green), the SNP (black dot), and the zero sector (red with bars indicating the first nucleotide of each CS), and showing part of the alignment confirming the SNP (A in the HRG, G in the QG).
A segment of the region is projected in a 2D graph in Fig. 3B. The location of an SNP is indicated (black dot). The relative position of each nucleotide in the HRG (x axis) corresponds to that in the QG reads (y axis), and thus the plot reveals a diagonal line with a slope of 1. CSs that contain the SNP fail to attract reads from the QG (zero sector). Two important landmarks of this region are the last CS to attract sequence reads from the QG (proximal CS, pCS) and the first CS to resume attracting reads from the QG after the SNP (distal CS, dCS).
In the context of the COIN-VGH strategy, the presence of an SNP in the CVGL is indicated with a signature characterized by: (i) a region (the zero sector) spanning the length of the used CSs (in this case, 50 nucleotides) through which no CS attracts any read; and (ii) an equal distance in nucleotides from the 5′ end of the pCS to the 3′ end of the dCS in both the HRG and QG reads. This region is herein referred to as the “signature region” (blue in Fig. 3C). In some cases, the pCS and the dCS are located in different reads of the QG. In our experience, these cases are prone to errors because of the presence of low-quality regions in the reads (Discussion). Thus, in the present analysis we only considered SNPs for which the complete signature region was embedded in at least one sequence read of the QG. In addition to SNP signatures, analysis of the CVGL may also reveal signatures for the presence of indels. In some cases, SNPs and indels are closely located, in this case less than 50 nucleotides apart, and thus the corresponding signatures are combined to produce a larger signature region. However, the simultaneous presence of SNPs and indels may result in ambiguities in the corresponding alignment, thereby not allowing the precise localization of the SNPs. In the present study, we did not consider SNPs located within signature regions that also contained indels.
For SNP identification, the signature region is extracted from both the HRG and QG reads, and the corresponding sequences are aligned (Fig. 3C). Such alignments reveal the nucleotides present in the HRG versus the QG, and the SNP is identified and localized.
Overall Analysis of the CVGL for Detecting SNPs.
Once the CVGL is constructed, its entire length is screened for the presence of signatures for variation. Regions spanning 50 nucleotides or more for which all of the corresponding CSs fail to attract sequence reads from the QG (zero sectors) are selected for further analysis. The attracted QG reads are then searched for the presence of the whole signature region, from the 5′ end of the pCS to the 3′ end of the dCS. The region is extracted from the HRG and the corresponding reads of the QG, and the length difference between them is calculated. Equal lengths indicate the presence of one or more SNPs (see above). Finally, for each of the detected signatures, the alignments between the HRG and the QG are analyzed to determine the precise nature and location of the SNPs.
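The screening procedure can be sketched as follows. The actual analysis used ad hoc Perl scripts and bl2seq alignments (Materials and Methods); this simplified sketch instead assumes a landscape given as read counts for consecutive CS start positions within a single COIN-Tig, forward-strand reads only, and a position-by-position comparison in place of an alignment. Function names and data are hypothetical.

```python
def zero_sectors(counts: list[int], k: int) -> list[tuple[int, int]]:
    """Find runs of at least k consecutive CSs that attract no reads.

    counts[i] is the number of reads attracted by the CS starting at i.
    Returns (pCS start, dCS start) pairs: the last CS before the run and
    the first CS after it. Runs touching either end are skipped.
    """
    sectors = []
    i, n = 0, len(counts)
    while i < n:
        if counts[i] == 0:
            j = i
            while j < n and counts[j] == 0:
                j += 1
            if j - i >= k and i > 0 and j < n:
                sectors.append((i - 1, j))
            i = j
        else:
            i += 1
    return sectors


def call_snps(reference: str, read: str, pcs_start: int, dcs_start: int, k: int):
    """Compare the signature region (5' end of pCS to 3' end of dCS) in the
    reference and in one read. Returns a list of (position, HRG base, QG base),
    or None if the read lacks the whole region or the lengths differ (indel)."""
    read_p = read.find(reference[pcs_start:pcs_start + k])
    if read_p == -1:
        return None
    read_d = read.find(reference[dcs_start:dcs_start + k], read_p + 1)
    if read_d == -1:
        return None
    ref_region = reference[pcs_start:dcs_start + k]
    read_region = read[read_p:read_d + k]
    if len(ref_region) != len(read_region):
        return None
    return [(pcs_start + i, a, b)
            for i, (a, b) in enumerate(zip(ref_region, read_region)) if a != b]


# Toy example (hypothetical data): one read carrying a G>T change at position 15.
k = 5
reference = "ACGTTGCAAGCTTAGGCATCCGATATGGCA"
read = reference[:15] + "T" + reference[16:]
counts = [int(reference[i:i + k] in read) for i in range(len(reference) - k + 1)]
for pcs, dcs in zero_sectors(counts, k):
    print(pcs, dcs, call_snps(reference, read, pcs, dcs, k))
```

With a single substitution, the zero sector spans exactly k CSs and the signature region is 2k + 1 nucleotides long in both the reference and the read.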
Comparison of SNPs Detected by the COIN-VGH Strategy with Those in Previously Reported Data.
When the data obtained using the COIN-VGH strategy were compared with those from the previous studies, we were surprised to find that the existing database entries included a large number of “biallelic SNPs” for the haploid chromosome X-specific region. The databases contained 8,211 such SNPs for CV and 18,912 for JW, representing 11% and 44%, respectively, of the total SNPs reported for this region in the corresponding genomes. Although some of these SNPs might represent somatic variation, the large number found suggests that the vast majority arose through either sequencing errors or misalignment of the reads to the HRG (see below). If one of the two nucleotides reported for a biallelic SNP coincides with that present in the HRG, most probably the other nucleotide represents an error generated during data analysis. Notably, when the COIN-VGH algorithm detects a nucleotide corresponding to that present in the HRG at a particular site of a haploid region, it automatically bypasses any other alternative nucleotide in the same position. In the present analysis, COIN-VGH detected only 55 sites in which two nucleotides were different from that present in the HRG, in contrast to the 102,759 monoallelic SNPs found in the genomes of CV and JW (see below). Our comparison between the COIN-VGH data and those from previous reports considered only monoallelic SNPs.
As shown in Fig. 4, the COIN-VGH strategy detected more SNPs than those that had been previously reported for either genome. Of the previously reported SNPs, COIN-VGH detected about 80% for CV and 70% for JW (see Fig. 4 for the actual numbers). Of the SNPs detected by COIN-VGH, about half (53%) of those found in the genome in which fewer SNPs were detected (JW) were also detected in the CV genome. These common SNPs should include those inherited by descent by both individuals, as well as those that represent variations in the genomes used to construct the HRG.
Fig. 4.
Comparison of SNPs found in the X-specific region by COIN-VGH vs. those in the previously reported data. The numbers indicate the quantity of SNPs represented by each intersection. (A) CV genome. Color scheme: yellow, detected by COIN-VGH; enclosed in black, previously reported; shown in gray, those previously reported as biallelic. (B) JW genome. Color scheme: blue, detected by COIN-VGH; enclosed in black, previously reported; shown in gray, those previously reported as biallelic. (C) Data obtained with COIN-VGH. Color scheme: yellow, detected only in the CV genome; blue, detected only in the JW genome; green, detected in both genomes.
A bioinformatic analysis was used to further examine the SNPs that were not jointly detected by both COIN-VGH and the previous studies. This experiment used the CV data, because we could only access a partial set of sequence reads (about 74 million) for JW, not the full dataset mentioned in the original publication (about 106 million reads) (9).
Two sets of randomly selected SNPs were constructed: (i) a set of 100 SNPs that were detected by COIN-VGH but had not been previously reported for CV; and (ii) a set of 126 SNPs previously reported for CV and not detected by COIN-VGH. For the first set, we manually analyzed all of the final alignments between the HRG and the QG that had been used to ascertain the positions of the SNPs identified by COIN-VGH. In all 100 cases, the alignments covered the whole length of the signature region (see above) and no ambiguity was detected, thus confirming the position of the corresponding SNPs (Dataset S2). For the second set, the CVGL was analyzed to locate CSs proximal to the site of the previously reported SNP, which could then be used to retrieve sequence reads containing the target site. In 26 cases, we were unable to find an appropriate CS, suggesting that the reported site was located within a COIN-Gap (see above) and thus not available for the COIN-VGH analysis as used in this study. For the 100 cases in which we were able to retrieve the reported site, the region stretching from the first nucleotide of the CS to 50 nucleotides downstream of the reported site was extracted and alignments between the HRG and each of the attracted sequence reads were performed. The previously reported SNPs were confirmed in 68 cases, and inconsistencies were detected in 32 cases. The latter included 15 cases of positional ambiguity because of the presence of indels and low complexity regions, 5 cases of sequence ambiguities among the corresponding reads, and 12 cases in which the reported SNPs were not present (Dataset S3).
Simulation of the Addition and Subtraction of SNPs.
The CVGL obtained from the CV genome was used for a simulation experiment in which the HRG was individually altered at 200 sites and subjected to analysis. In 100 sites at which COIN-VGH had detected SNPs, the variant nucleotide was changed to that found in the CV genome. In all cases, this change resulted in the signature for the SNP being erased from the subsequently generated CVGL. In 100 sites located in regions present within COIN-Tigs (see above) in which no SNP had been detected by COIN-VGH, a nucleotide of the HRG was substituted with a different one, thus artificially creating a novel SNP. In all cases, the subsequently generated CVGL showed the expected signature for the presence of an SNP. Examples of the corresponding regions of the CVGL before and after the simulation are presented in Fig. 5.
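The logic of this simulation can be reproduced on toy data, as in the sketch below. It is an illustration under simplifying assumptions, not the experiment itself: the sequences are hypothetical, reads are error-free forward-strand windows of the query, and the landscape is computed by exact substring matching rather than with Bowtie.

```python
def substitute(seq: str, pos: int, base: str) -> str:
    """Return seq with the nucleotide at pos replaced by base."""
    return seq[:pos] + base + seq[pos + 1:]


def tile_reads(query: str, length: int, step: int) -> list[str]:
    """Cut overlapping, error-free 'reads' from a query sequence (toy model)."""
    return [query[i:i + length] for i in range(0, len(query) - length + 1, step)]


def landscape(reference: str, reads: list[str], k: int) -> list[int]:
    """Number of reads containing each k-mer of the reference, by start position."""
    return [sum(reference[i:i + k] in r for r in reads)
            for i in range(len(reference) - k + 1)]


def zero_positions(counts: list[int]) -> list[int]:
    """Start positions of k-mers that attract no read."""
    return [i for i, c in enumerate(counts) if c == 0]


k = 5
reference = "ACGTTGCAAGCTTAGGCATCCGATATGGCA"

# Addition of a SNP: the query equals the reference, then the reference is altered.
reads = tile_reads(reference, length=20, step=5)
print(zero_positions(landscape(reference, reads, k)))                       # []
print(zero_positions(landscape(substitute(reference, 15, "T"), reads, k)))  # [11, 12, 13, 14, 15]

# Subtraction of a SNP: the query carries a variant, then the reference is
# changed to the query nucleotide, which erases the signature.
variant_reads = tile_reads(substitute(reference, 15, "T"), length=20, step=5)
print(zero_positions(landscape(reference, variant_reads, k)))                       # [11, 12, 13, 14, 15]
print(zero_positions(landscape(substitute(reference, 15, "T"), variant_reads, k)))  # []
```

A single substitution at position p affects exactly the k strings starting at positions p − k + 1 through p, which is why the zero sector has the length of the strings used.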
Fig. 5.
Alteration of the CVGL following the exclusion or addition of SNPs. The HRG was altered by either changing the nucleotide to that found in the CV genome (exclusion of SNP, A) or by changing a nucleotide in a region where no SNP was detected in the CV genome (addition of SNP, B). The corresponding regions of the CVGL are shown as plots of QG sequence reads versus the corresponding CS, both before (red lines) and after (blue lines) the HRG was altered.
Characterization of SNPs Localized by the COIN-VGH Strategy in the Chromosome X-Specific Region of the Genomes of CV and JW.
The algorithm used in the present study detected 68,697 and 34,062 SNPs in the CV and JW genomes, respectively. Each SNP was characterized with regard to its location in functional and structural elements of the genome. A summary of the data is presented in Table 1. It is interesting that there were relatively fewer SNPs found in exons (0.27/Kb) than in introns (0.34/Kb) or intergenic regions (0.52/Kb). The actual position of each SNP is presented in Dataset S4 for CV and Dataset S5 for JW. These datasets also present the small number of biallelic SNPs found by COIN-VGH. The general landscape of the SNPs present in the X-specific region of the CV genome is shown schematically in Fig. 6.
Table 1.
SNPs localized by COIN-VGH
Category | Venter | Watson |
Total | 68,697 | 34,062 |
Exonic | 712 | 335 |
Intronic | 15,250 | 7,925 |
Intergenic | 52,735 | 25,802 |
Common repeats* | 42,525 | 20,366 |
SD† >90 <95 | 1,238 | 637 |
SD† >95 <99 | 1,951 | 571 |
SD† >99 | 104 | 15 |
*LINEs, SINEs, LTRs, and DNA transposons.
†Segmental duplications and percentage of identity.
Fig. 6.
Location of SNPs found by COIN-VGH in the X-specific region of the CV genome. Note that the concentric circles are open, thus representing a linear region. The outermost circle represents the chromosome bands, with the pseudoautosomal regions (PAR 1 and PAR 2) and the centromere (C) indicated. From this circle inward, SNPs are localized with respect to exons (red), introns (green), intergenic regions (blue), and segmental duplications (black).
Discussion
The central concept of the COIN-VGH strategy is that of CS, as described above, a continuous sequence of the HRG that can unambiguously map, without any mismatch, to only one site within the whole HRG. Recently, Koehler et al. (16) have described a method dubbed “the Uniqueome” that assesses the potentiality to uniquely map a particular region of the genome. The nucleotides in the HRG that can be unambiguously identified by a particular set of CSs can be viewed as a particular “Uniqueome.”
In principle, the COIN-VGH strategy can be used to individualize any nucleotide in the genome. However, very long strings would be required to individualize some nucleotides. The actual length of the used CSs is restricted by the length of high-quality regions in the sequence reads of the QG. In this study, we selected a set of 50-nucleotide long CSs from the X-specific region, as our analysis indicated that CSs of this length could detect about 92% of all nucleotides in the region and penetrate regions of highly repetitive elements and segmental duplications. Using these CSs, we were able to identify more high-quality SNPs than those reported before for this region of the CV and JW genomes (8, 9).
The present study showed that the signature for the presence of SNPs in the CVGL was very clear for haploid regions of the genome, where the ability of nearby CSs to attract reads from the QG drops to zero. In the case of diploid regions, heterozygous SNPs would be characterized by a ∼50% reduction in the number of attracted reads (i.e., the depth of coverage). The coverage of the CV genome is about 7.5×, with a standard deviation of about 3, and that of the reads made public for the JW genome is even lower, with an even higher standard deviation. For this reason, a 50% drop in the genomic coverage of a diploid region cannot be detected with the required confidence. To clearly detect signals for heterozygous variants in diploid regions, a deep genomic coverage is essential. This can be achieved using the latest generation of high-throughput sequencing platforms. We are currently adapting COIN-VGH to analyze both haploid and diploid regions of the genome with high-coverage Illumina reads of 150 nucleotides.
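The coverage argument can be made concrete with a short calculation. The mean and standard deviation are the approximate CV values quoted above; the normal approximation to the depth distribution is an assumption introduced here purely for illustration and was not part of the analysis.

```python
import math


def normal_cdf(x: float) -> float:
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


mean_depth = 7.5   # approximate coverage of the CV genome
sd_depth = 3.0     # approximate standard deviation of that coverage

# Depth expected at a heterozygous site: half of the reads match the HRG allele.
het_depth = 0.5 * mean_depth

# Distance of that depth from the mean, in standard deviations.
z = (mean_depth - het_depth) / sd_depth

# Assumption: depth is roughly normal. Fraction of ordinary (non-variant)
# positions expected to show a depth this low or lower by chance alone.
chance_fraction = normal_cdf(-z)

print(f"expected heterozygous depth: {het_depth:.2f}x")
print(f"drop in standard deviations: {z:.2f}")
print(f"fraction of positions at or below that depth: {chance_fraction:.3f}")
```

Under these assumptions the halved depth lies only 1.25 standard deviations below the mean, so roughly one position in ten would reach it by chance, which is why a 50% drop cannot serve as a reliable signature at this coverage.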
It should be pointed out that special sets of CSs can be designed to target specific regions or elements of the genome, such as the exome, the histocompatibility antigen region, sets of known cancer genes, and so forth. Furthermore, regions where the method has low penetrance could be analyzed by combining strings having different characteristics. For example, high-identity segmental duplications might be analyzed using sets of CSs of different sizes, including some that are longer and thus better suited for individualizing certain nucleotides.
At this stage of the development of individual genome sequencing technologies and repositories, it is very important to avoid including false positives in the reported results and databases. In fact, major concerns have been expressed in regard to the inclusion of false positives in reported studies (13). The main sources of false-positive variants include mislocalization of the reads with regard to the HRG, as well as sequencing errors. As shown in this study, the use of the COIN-VGH algorithm greatly diminishes the probability of such errors. In this regard, our simulation experiments strongly support the fidelity of the signature used to detect SNPs. The fact that our strategy individualizes most of the nucleotides in the genome, and each nucleotide is in turn specifically detected by several CSs, makes it highly unlikely that the attracted reads will be mislocalized. In regard to sequencing errors, the signature region (Results) demands that two 50-nucleotide CSs show perfect matches between the HRG and one or more QG sequence reads bordering the SNP-containing region. This requirement should automatically discard low-quality regions in the zone of the sequence reads that contain the called SNP, without the need for the threshold-based quality filters that most available methods include. In fact, we determined the Phred quality score of each nucleotide called as SNP for the CV genome by COIN-VGH. The corresponding position in each of the 153,897 sequence reads that support the 68,697 SNPs reported here revealed an average Phred score of 33, well above the commonly used threshold quality filters.
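For orientation, a Phred score Q corresponds to an estimated base-calling error probability P through Q = −10 log10(P). The short sketch below converts between the two; the function names are illustrative, and the value of 20 is included only as an example of a filter threshold, not as one taken from this study.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10.0 ** (-q / 10.0)


def error_probability_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p."""
    return -10.0 * math.log10(p)


# A base with Q = 33 versus a base at an example threshold of Q = 20.
for q in (33, 20):
    print(f"Q{q}: error probability {phred_to_error_probability(q):.2e}")
```

A single base with a score of 33 thus has an estimated error probability of about 5 in 10,000. Note that the error probability at an average score is not the same as the average error probability across the supporting reads.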
In addition, the exclusion of signatures that include both potential SNPs and indels prevents ambiguities in the position of called variants, even at the expense of missing some bona fide SNPs (false negatives). In fact, we might have discarded up to 7% of potential SNPs relative to those reported here for the CV genome. Although this represents a significant amount of false-negative data, our main goal is to avoid including false-positive data in the reported results. Even without including these potential SNPs, our strategy calls more SNPs than those previously found for the chromosome X-specific region of either CV or JW.
Furthermore, at the end of the bioinformatic pipeline of COIN-VGH, all of the detected SNPs are supported by very high-quality alignments between the corresponding zones of the HRG and the QG. These alignments match equal-length regions and allow the corresponding SNPs to be precisely localized (Materials and Methods). In the present study, we manually analyzed 100 such alignments from our data using the CV reads, and we failed to find any potential ambiguity upon comparison with data herein reported, thus confirming the nature and position of the corresponding SNPs. In contrast, a significant number (about 30%) of the SNPs previously reported for CV and not found by COIN-VGH, showed ambiguities in regard to the existence or position of the corresponding SNPs.
The COIN-VGH landscape of a particular region of the genome actually represents an HRG-oriented partial assembly of the QG. In fact, CSs that retrieve reads correspond to exact matches between the HRG and the QG. To define the nucleotides present in the zero sectors of the CVGL (Results), the QG data obtained through alignments with the HRG should be used.
The COIN-VGH approach requires a large amount of computational processing; in particular, each CS of the HRG must be hybridized with the whole set of sequence reads of the QG. However, the processes themselves are simple (Materials and Methods) and can be easily parallelized.
In conclusion, COIN-VGH offers a unique algorithm that analyzes individual genome sequencing data from a new perspective. The main goal of this strategy is to limit the inclusion of false-positive variants in the generated data. Here, we show the potential of the COIN-VGH strategy using the X-specific regions of two previously sequenced genomes. Work in progress suggests that the principle of the COIN-VGH strategy has the potential to detect other types of genomic variation, including microindels and structural variants. It must be emphasized that although the present study focuses on the human genome, the principles of the COIN-VGH strategy could be applied to any genome for which a high-quality reference is available.
Materials and Methods
Bioinformatic Procedures.
CS sets were defined by mapping every possible k-mer of the established length (Fig. 1) to the HRG using Bowtie 0.12.2 (17), and selecting those that produced a unique and perfect alignment with the whole GRCh37. CSs were mapped to raw sequence reads from the query genomes using Bowtie (17), with no mismatches allowed. Ad hoc Perl scripts were used to identify signatures indicating the presence of SNPs (Results). Sequences from both GRCh37 and the corresponding regions of the retrieved reads from the QG were aligned using bl2seq (18).
Data Sources.
The Human Reference Genome Build 37 (GRCh37) was obtained from the University of California at Santa Cruz (UCSC) Genome Browser. Raw sequence reads for the CV genome (31,861,638 reads) and the JW genome (74,198,831 reads) were retrieved from the National Center for Biotechnology Information Trace Archive. Annotation data for functional, structural, and repeated elements of the GRCh37 were obtained from the UCSC Table Browser (19). Segmental duplication coordinates corresponding to the HRG Build 36 were retrieved from the Segmental Duplication Database (http://humanparalogy.gs.washington.edu/build36/build36.htm) and updated to GRCh37 coordinates using the LiftOver tool from Galaxy (20, 21). Previously reported SNPs for the QGs were obtained from the UCSC Table Browser (19).
Acknowledgments
The authors thank Patricia Bustos, Angeles Moreno, and Marisa Rodríguez for their most valuable technical assistance throughout the project. This work was supported by the National Autonomous University of Mexico.
Footnotes
The authors declare no conflict of interest.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1112567108/-/DCSupplemental.
References
- 1. Lander ES, et al.; International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. doi: 10.1038/35057062.
- 2. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004;431:931–945. doi: 10.1038/nature03001.
- 3. Iafrate AJ, et al. Detection of large-scale variation in the human genome. Nat Genet. 2004;36:949–951. doi: 10.1038/ng1416.
- 4. Sebat J, et al. Large-scale copy number polymorphism in the human genome. Science. 2004;305:525–528. doi: 10.1126/science.1098918.
- 5. Alkan C, Coe BP, Eichler EE. Genome structural variation discovery and genotyping. Nat Rev Genet. 2011;12:363–376. doi: 10.1038/nrg2958.
- 6. International HapMap Consortium. A haplotype map of the human genome. Nature. 2005;437:1299–1320. doi: 10.1038/nature04226.
- 7. Frazer KA, et al.; International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449:851–861. doi: 10.1038/nature06258.
- 8. Levy S, et al. The diploid genome sequence of an individual human. PLoS Biol. 2007;5:e254. doi: 10.1371/journal.pbio.0050254.
- 9. Wheeler DA, et al. The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008;452:872–876. doi: 10.1038/nature06884.
- 10. Kim JI, et al. A highly annotated whole-genome sequence of a Korean individual. Nature. 2009;460:1011–1015. doi: 10.1038/nature08211.
- 11. Bentley DR, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456:53–59. doi: 10.1038/nature07517.
- 12. 1000 Genomes Project Consortium, et al. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. doi: 10.1038/nature09534.
- 13. Alberts B. Editorial expression of concern. Science. 2010;330:912. doi: 10.1126/science.330.6006.912-b.
- 14. Ley TJ, et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature. 2008;456:66–72. doi: 10.1038/nature07485.
- 15. Handsaker RE, Korn JM, Nemesh J, McCarroll SA. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nat Genet. 2011;43:269–276. doi: 10.1038/ng.768.
- 16. Koehler R, Issac H, Cloonan N, Grimmond SM. The uniqueome: A mappability resource for short-tag sequencing. Bioinformatics. 2011;27:272–274. doi: 10.1093/bioinformatics/btq640.
- 17. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25. doi: 10.1186/gb-2009-10-3-r25.
- 18. Altschul SF, et al. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
- 19. Karolchik D, et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 2004;32(Database issue):D493–D496. doi: 10.1093/nar/gkh103.
- 20. Goecks J, Nekrutenko A, Taylor J; Galaxy Team. Galaxy: A comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010;11:R86. doi: 10.1186/gb-2010-11-8-r86.
- 21. Blankenberg D, et al. Galaxy: A web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol. 2010;Chapter 19:1–21. doi: 10.1002/0471142727.mb1910s89. Unit 19.10.