Abstract
In this review, we focus on progress that has been made with detecting small insertions and deletions (INDELs) in human genomes. Over the past decade, several million small INDELs have been discovered in human populations and personal genomes. The amount of genetic variation that is caused by these small INDELs is substantial. The number of INDELs in human genomes is second only to the number of single nucleotide polymorphisms (SNPs), and, in terms of base pairs of variation, INDELs cause levels of variation similar to those caused by SNPs. Many of these INDELs map to functionally important sites within human genes, and thus are likely to influence human traits and diseases. Therefore, small INDEL variation will play a prominent role in personalized medicine.
INTRODUCTION
Next generation sequencing technologies have unlocked a new era of human genome sequencing. Human genomes are being sequenced at unprecedented rates (1–9), and the age of personalized medicine is rapidly approaching. In the near future, an individual's genome will be sequenced to predict the future health of the individual and to develop personalized medical treatments that are tailored to work with the genetic variation that is detected. Variation profiles also will be used to predict a host of other human traits including height, weight, appearance, lifespan and intelligence. Genome-wide association studies (GWASs) have already begun to identify the predictive variants that are necessary for this enterprise, and many additional variants will be identified. Ultimately, personal genomes will be compared with databases of informative variants, genes and biological pathways to make predictions about human traits and diseases.
To fulfill this vision, it will be necessary to routinely detect all of the genetic variation that is present in a given human genome. This includes: (i) single nucleotide polymorphisms [SNPs (10–12)], (ii) small insertions and deletions (INDELs) ranging from 1 to 10 000 bp in length (13–15) and (iii) larger forms of structural variation (16–22). In this review, we focus on progress that has been made with detecting small INDELs in human genomes. Despite the fact that small INDELs are highly abundant in humans and cause a great deal of variation in human genes, they have received far less attention than SNPs and larger forms of structural variation. Small INDELs have been particularly challenging to detect, validate and genotype, and customized tools have been necessary to study this class of variation. Moving forward, new technologies will be needed to evaluate the impact of these INDELs on human traits and diseases.
INDEL VARIATION ON HUMAN CHROMOSOME 22
The first large-scale efforts to identify INDELs in the human genome were focused on human chromosome 22 (23,24). Mullikin et al. (23) analyzed ABI re-sequencing data from 31 humans to identify genetic variation on chromosome 22 and determined that 13% of the variants on this chromosome were INDEL polymorphisms. Dawson et al. (24) later identified dense clusters of both SNPs and INDELs on chromosome 22 by comparing the overlapping sequences of adjacent bacterial artificial chromosome (BAC) clones and P1-derived artificial chromosome (P1) clones. The BAC and P1 libraries that were generated to sequence chromosome 22 were constructed from nine diverse diploid individuals and harbored 18 distinct haplotypes. Therefore, new genetic variation could be readily discovered by comparing the overlapping sequences of adjacent clones. A total of 10 051 SNPs and 2180 small INDELs were identified from the 358 overlapping regions that were compared [13.2 Mb of overlapping DNA (24)]. These studies suggested that small INDELs were likely to be abundant in human genomes (82% of the variants that were detected were SNPs and 18% were INDELs). Ninety-two percent of these variants were confirmed in follow-up validation studies (Table 1).
Table 1.
Study | Year | No. of INDELs | Individual(s) | Method | INDEL size range (bp) | Validation study^a | Rate (%) |
---|---|---|---|---|---|---|---|
Human populations | | | | | | | |
Mullikin et al. (23) | 2000 | NR^b | 31 | ABI trace mapping | NR | ND | N/A |
Dawson et al. (24) | 2001 | 2180 | 9^c | BAC overlap | NR | PCR/RFLP/invader | 92 |
Weber et al. (13) | 2002 | 2000 | NR | Various | 2–55 | PCR | 58 |
Bhangale et al. (14) | 2005 | 2393 | 137 | PCR/sequencing | 1–543 | ND | N/A |
Bhangale et al. (25) | 2006 | 1126 | ENCODE | PCR/sequencing | 1–30 | Manual inspection | 100 |
Mills et al. (15) | 2006 | 415 436 | 36 | ABI trace mapping | 1–9989 | PCR | 97 |
Kidd et al. (22) | 2008 | 796 273 | 8 | ABI trace mapping | 1–100 | ND^d | 96 |
R.E. Mills et al. (submitted for publication) | 2010 | 1.96 million | 79^e | ABI trace mapping | 1–10 000 | PCR^f | 97 |
Personal human genomes | | | | | | | |
Levy et al. (1) | 2007 | 823 396 | Venter | ABI | 1–82 711 | PCR | 84–100^g |
Wheeler et al. (2) | 2008 | 222 718 | Watson | 454 | 2–38 896 | PCR | 70^h |
Wang et al. (3) | 2008 | 135 262 | Han Chinese | Illumina/SOAP | 1–3 | PCR | 90–100 |
Bentley et al. (4) | 2008 | 400 000^i | Yoruban | Illumina/ELAND/MAQ | 1–16 | ND | N/A |
Ley et al. (5) | 2008 | 726^h | AML | Illumina/Custom | 1–30 | PCR | 93^j |
Ahn et al. (7) | 2009 | 342 965 | Korean (SJK) | Illumina/MAQ | 1–26 | PCR | 100^h,k |
Kim et al. (6) | 2009 | 170 202 | Korean (AK1) | Illumina/Alpheus | 1–29 | Re-sequencing | 100 |
Schuster et al. (8) | 2010 | NR | African | N/A | N/A | N/A | N/A |
NR, not reported; ND, not done; N/A, not applicable.
^a Defined as a study examining a randomly selected set of INDEL variants predicted in real humans and then measuring the confirmation rate of the predicted variants using an assay such as PCR-RFLP, PCR sequencing, microarray genotyping or similar assay.
^b 13% of the variants were INDELs.
^c BAC/PAC libraries (each representing a human).
^d SSAHA-DIP, which was used to detect the INDELs, has a validation rate of 96% (46).
^e Number of humans or libraries (each library representing a human).
^f The pipeline that was developed and tested above (15) was used in this study.
^g 36/43 (84%) of 100–1000 bp homozygous insertions and 15/15 (100%) of 1–8 bp INDELs examined were confirmed.
^h Coding INDELs only.
^i Reported as: ‘0.4 million' (4).
^j 24/26 (93%) of the candidates that were examined successfully were validated.
^k 9/9 (100%) of the INDELs examined were validated.
A GENOME-WIDE STUDY OF SMALL INDEL VARIATION IN HUMANS
One of the earliest genome-wide INDEL discovery efforts in humans was conducted by Weber et al. (13). Similar to the Dawson et al. (24) study, most of the INDELs in this study were identified by comparing overlapping sequences from adjacent BAC clones that were sequenced for the human genome project. Three other methods also were used, and a total of 2000 INDELs were identified (Table 1). Most of these INDELs (96%) were 2–16 bp in length, and the largest INDEL was 55 bp in length. These INDELs were genotyped extensively in three panels of diverse humans, both to validate the INDELs and to examine the allelic frequencies of these polymorphisms. The INDELs also were genotyped in chimpanzees to examine the ancestral alleles. These studies confirmed that INDEL variation is the second most abundant form of genetic variation in humans after SNPs (79% of the variants detected were SNPs and 21% of the variants were INDELs).
INDEL DISCOVERY BY TARGETED RE-SEQUENCING OF HUMAN GENES AND ENCODE REGIONS
Bhangale et al. (14) re-sequenced 330 genes in diverse humans to identify 2393 small INDELs ranging from 1 to 543 bp in length (Table 1). The genes that were targeted for re-sequencing were involved in several medically relevant biological pathways, including lipid metabolism and DNA repair. The authors discovered that heterozygous INDELs produce a characteristic signature on an ABI chromatogram in which the two superimposed INDEL alleles cause a transition from a high quality region at the beginning of the chromatogram to a poor quality region that abruptly begins at the INDEL. Bhangale et al. found that this was a highly reproducible signature for identifying heterozygous INDELs from polymerase chain reaction (PCR) amplicons. The authors developed a computational package called PolyPhred that automates INDEL detection and also detects SNPs. In a follow-up study, Bhangale et al. (25) used PolyPhred ver. 6.0 to identify INDELs in the ENCODE regions of the human genome, and identified 1126 additional small INDELs (Table 1). A similar INDEL detection pipeline called PolyScan has been developed by Chen et al. (26).
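The chromatogram signature described above (a high quality region that abruptly gives way to a poor quality region at the INDEL) can be illustrated with a toy change-point search over per-base quality scores. This is only a minimal sketch and is not the algorithm implemented in PolyPhred or PolyScan; the function name `find_quality_drop`, the window size and the quality thresholds are all assumptions chosen for illustration.

```python
from statistics import mean


def find_quality_drop(quals, window=10, high=40, low=20):
    """Return the index at which a high-quality stretch abruptly turns
    into a low-quality stretch, or None if no such transition exists.

    quals     -- per-base quality scores (e.g. Phred-like values)
    window    -- bases averaged on each side (hypothetical default)
    high, low -- mean-quality thresholds (hypothetical defaults)
    """
    best_index, best_drop = None, 0.0
    for i in range(window, len(quals) - window + 1):
        before = mean(quals[i - window:i])  # quality upstream of position i
        after = mean(quals[i:i + window])   # quality downstream of position i
        drop = before - after
        # Keep the position with the sharpest high-to-low transition.
        if before >= high and after <= low and drop > best_drop:
            best_index, best_drop = i, drop
    return best_index


# Synthetic trace: 30 clean bases followed by 30 noisy bases.
trace = [50] * 30 + [12] * 30
candidate = find_quality_drop(trace)
```

On this synthetic trace the candidate position is the first noisy base. Real traces are far less clean, so a production detector would need additional evidence (for example, confirmation from the opposite strand) before calling a heterozygous INDEL.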
LARGE-SCALE INDEL DISCOVERY PROJECTS IN HUMAN POPULATIONS
Mills et al. (15) developed a custom variation discovery pipeline that detects SNPs, INDELs, and transposon insertions from ABI trace data. The pipeline was used to discover 415 436 unique INDEL variants from DNA re-sequencing traces that originally were generated for SNP discovery projects [including the HapMap project (Table 1) (10–12)]. In a follow-up study, almost two million non-redundant INDELs were identified from DNA re-sequencing data (R.E. Mills et al., submitted for publication). The INDELs that were discovered ranged from 1 to 10 000 bp in length, and were distributed throughout the human genome on all 24 chromosomes. Several classes of INDELs were identified, including repeat expansions, transposon insertions and random sequences. PCR-based validation studies indicated that the Mills et al. (15) pipeline has an accuracy of 97%.
Kidd et al. (22) also reported 796 273 small INDELs in a project that was designed to detect structural variation in eight diverse humans (Table 1). In addition to the structural variants that were detected in fosmids by paired-end sequencing, SNPs and small INDELs also were discovered from the ABI traces that were generated. A comparison of the Mills et al. INDELs [dbSNP handle: DEVINE_LAB (15, R.E. Mills et al., submitted for publication)] with those discovered by Kidd et al. [dbSNP handle: HSVAR (22)] revealed minimal overlap between the two collections (only 10% of the DEVINE_LAB INDELs were present in the HSVAR collection). This low level of overlap suggests that small INDEL discovery is far from complete in human genomes. The 1000 Genomes Project is likely to close this gap by discovering a very large number of INDELs in the next few years (9).
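The kind of overlap calculation referred to above can be sketched as a set intersection over variant keys. The record layout (chromosome, position, type, sequence) and the example records are hypothetical and are not drawn from dbSNP; the sketch also assumes that both collections have already been normalized to the same coordinates, which matters because an INDEL within a repeat can be placed at several equivalent positions.

```python
def overlap_fraction(collection_a, collection_b):
    """Fraction of the variants in collection_a that also occur in collection_b."""
    set_a, set_b = set(collection_a), set(collection_b)
    if not set_a:
        return 0.0
    return len(set_a & set_b) / len(set_a)


# Hypothetical records: (chromosome, position, type, inserted/deleted sequence).
collection_a = [("chr1", 1000 + 100 * i, "DEL", "AT") for i in range(10)]
collection_b = [
    ("chr1", 1000, "DEL", "AT"),   # shared with collection_a
    ("chr2", 5000, "INS", "G"),    # unique to collection_b
    ("chr3", 7500, "DEL", "CCA"),  # unique to collection_b
]

shared = overlap_fraction(collection_a, collection_b)
```

In this toy example one of the ten variants in the first collection is also present in the second, giving an overlap of 10%.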
INDEL DISCOVERY IN PERSONAL GENOMES
Small INDELs also have been discovered in most of the personal human genomes that have been sequenced thus far (Table 1) (1–8). Both the number of INDELs reported and the range of INDEL sizes that have been detected vary greatly among these genomes (Table 1). For example, 823 396 INDELs were discovered in the Venter genome (1), 222 718 INDELs were found in the Watson genome (2), and 135 262 INDELs were discovered in the Han Chinese genome (3). It seems unlikely that INDEL variation is genuinely this different among three humans. The INDELs discovered in the Venter genome ranged from 1 to 82 711 bp in length (1), whereas those discovered in the Han Chinese genome were limited to 1–3 bp in length (3). Thus, some of these differences appear to reflect differences in the INDEL discovery methods used by these projects. The sequencing platforms and analysis approaches that were employed also were quite variable. For example, the Venter genome was sequenced with traditional ABI reads (1), the Watson genome was sequenced with 454 reads (2) and the Han Chinese genome was sequenced with Illumina reads (3). The validation rates for these approaches also varied considerably. In most cases, only a small subset of the reported INDELs was evaluated in validation studies. For example, although 342 965 small INDELs were reported in the Korean genome [SJK (7)], only nine of these INDELs were validated by PCR and sequencing (for a validation rate of 9/9 or 100%, Table 1) (7). For the Yoruban genome, no INDEL validation was performed (Table 1) (4).
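To illustrate why a validation rate based on only a handful of INDELs is hard to interpret, the uncertainty around such a rate can be estimated with a binomial confidence interval. The sketch below uses the Wilson score interval, a standard statistical method that we introduce here purely for illustration; it was not used in the studies cited above, and the function name and the 95% level (z = 1.96) are assumptions.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# 9 of 9 INDELs validated (Korean genome, Table 1).
low_9, high_9 = wilson_interval(9, 9)

# 24 of 26 candidates validated (AML genome, Table 1).
low_26, high_26 = wilson_interval(24, 26)
```

Under these assumptions, the lower bound for 9/9 is roughly 0.70, so a reported rate of 100% from nine INDELs remains compatible with a true validation rate well below 100%.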
SOAP (27), ELAND (Illumina) and MAQ (28) have been used extensively for trace mapping and variation discovery in Illumina-based personal genome projects (Table 2). Several other approaches also have been developed (29–32), and most of these approaches have INDEL detection capabilities (Table 2). Although all of these approaches initially attempt to map either the whole trace or a segment of the trace to the reference genome, accurate mapping is often limited by the presence of INDELs [particularly if they are larger than 1–3 bp (27–32)]. In some cases, sequence reads that initially do not map well to the genome are re-mapped and aligned using programs that can support gapped or split-read alignments (30). Paired-end reads also are used to detect INDELs: one read serves to map the pair of reads, whereas the second read is subjected to gapped alignments and INDEL detection (28–30). In this instance, the first read is not allowed to vary as much as the second read, and variation is undoubtedly missed when both reads contain INDELs. In other cases, all traces are mapped and aligned to the genome using algorithms that perform gapped alignments (29,32). Although simulation studies suggest that several of these methods have high levels of sensitivity, the false negative rates for these approaches range from 10 to 35% (Table 2). Thus, up to a third of the small INDELs in personal genomes are being missed with these approaches.
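The idea behind split-read detection can be shown with a toy example in which the start of a read is anchored to the reference and the remainder of the read is searched for further downstream. This is a deliberately simplified sketch and does not reproduce the algorithm of Pindel or any other program cited above: it assumes an error-free read, a known anchor position and a single deletion, and the function name and `max_del` limit are hypothetical.

```python
def find_deletion(reference, read, anchor_pos, max_del=50):
    """Toy split-read search for one deletion in an error-free read.

    The read is assumed to start at reference[anchor_pos].
    Returns (deletion_start, deletion_length) or None.
    """
    # Step 1: extend the ungapped match from the anchor as far as possible.
    k = 0
    while (k < len(read) and anchor_pos + k < len(reference)
           and read[k] == reference[anchor_pos + k]):
        k += 1
    if k == len(read):
        return None  # the whole read matches without a gap

    # Step 2: try each gap size until the rest of the read matches downstream.
    suffix = read[k:]
    for gap in range(1, max_del + 1):
        start = anchor_pos + k + gap
        if reference[start:start + len(suffix)] == suffix:
            return anchor_pos + k, gap
    return None


reference = "GATTACAGGCTTCCATGCAAGT"
# Simulate a read from a genome lacking reference bases 7-10 ("GGCT").
read = reference[:7] + reference[11:19]
call = find_deletion(reference, read, 0)
```

Here the 4 bp deletion is recovered at reference position 7. When the deleted sequence lies within a repeat, the greedy extension in step 1 reports the right-most equivalent placement, which is one reason why INDEL positions from different pipelines can disagree.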
Table 2.
Program | Reference | INDEL discovery: false positive (%) | INDEL discovery: false negative (%) | Validation study^a | Rate (%) |
---|---|---|---|---|---|
SOAP | Li et al. (27) | NR | NR | PCR^b | 90–100^b |
MAQ | Li et al. (28) | <2 | 35^c | PCR^d | 100^d |
BWA | Li and Durbin (29) | NR | NR | ND | N/A |
Pindel | Ye et al. (30) | 2 | 20–31.4 | ND | N/A |
Bowtie | Langmead et al. (31) | NR | NR | ND | N/A |
BFAST | Homer et al. (32) | <3 | 10–20 | ND | N/A |
NR, not reported; ND, not done; N/A, not applicable.
^a Defined as a study examining a randomly selected set of INDEL variants predicted in real humans and then measuring the confirmation rate of the predicted variants using an assay such as PCR-RFLP, PCR sequencing, microarray genotyping or similar assay.
^b From the Han Chinese genome sequencing project (3).
^c At 20X coverage.
^d From the Korean Genome Project (7), nine coding INDELs were examined.
FUNCTIONALLY IMPORTANT INDELS
Like SNPs and structural variation, INDELs are of great interest because they can alter human traits and can cause human diseases. One of the most common genetic diseases in humans, cystic fibrosis, is frequently caused by a coding INDEL polymorphism within the CFTR gene that eliminates a single amino acid (33). The Mills et al. (15) study described above revealed that coding INDELs are prevalent in human populations. A total of 262 coding INDELs were identified in the 36 healthy humans that were examined. More than half of these coding INDELs (160/262 or 61.1%) were multiples of 3 bp, and thus maintained the open reading frames of the proteins. In some of these cases, amino acid(s) would be precisely inserted or deleted. However, in other cases (when an in-frame INDEL does not coincide perfectly with codon boundaries), additional amino acid changes may also occur in the region that is altered. The remaining INDELs from the Mills et al. study (102/262 or 38.9%) were not multiples of 3 bp and thus caused frame shifts in the encoded proteins. This class of INDELs generally would be expected to abolish gene function. Although the 3 bp in-frame CFTR coding INDEL abolishes gene function and produces cystic fibrosis, other in-frame INDELs might have less severe effects on protein function.
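The reading-frame logic described above can be written as a short classification rule. The sketch below is illustrative only; the function name, the category labels and the `offset_in_codon` parameter (the position of the INDEL start within a codon, 0 meaning that it falls exactly on a codon boundary) are assumptions rather than part of any published pipeline.

```python
def classify_coding_indel(length_bp, offset_in_codon=0):
    """Classify a coding INDEL by its expected effect on the reading frame.

    length_bp       -- INDEL length in base pairs (must be positive)
    offset_in_codon -- 0, 1 or 2; 0 means the INDEL starts on a codon boundary
    """
    if length_bp <= 0:
        raise ValueError("length_bp must be positive")
    if length_bp % 3 != 0:
        return "frameshift"
    if offset_in_codon % 3 == 0:
        # Whole codons are inserted or deleted precisely.
        return "in-frame, on codon boundary"
    # Frame is kept, but flanking codons are altered as well.
    return "in-frame, spans codon boundary"


# Counts reported for the 262 coding INDELs of Mills et al. (15).
in_frame, frameshift = 160, 102
pct_in_frame = round(100 * in_frame / (in_frame + frameshift), 1)
pct_frameshift = round(100 * frameshift / (in_frame + frameshift), 1)
```

For example, a 3 bp deletion that starts on a codon boundary is classified as in-frame, whereas a 4 bp deletion is classified as a frameshift regardless of its position. The two percentages reproduce the 61.1% and 38.9% figures given above.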
Coding INDELs subsequently have been found in several of the personal genome sequences that have been examined. For example, 739 coding INDELs were identified in the Venter genome (1,34) and 345 coding INDELs were identified in the Watson genome (2). Although it is not clear whether all of the personal genomes that have been sequenced thus far harbor coding INDELs, it does not appear that attempts were made to identify coding INDELs in all cases (Table 1). Exome re-sequencing projects likewise have identified a large number of coding INDELs in typical humans (35), suggesting that most, if not all humans, carry a substantial genetic load of coding INDELs. Thus, coding INDELs are likely to have a major impact on human biology and diseases.
INDELs that occur in other functionally important regions of genes also could be envisioned to affect gene function. For example, DNA insertions within the promoter region of the FMR1 gene cause Fragile X syndrome in humans (36). Once the trinucleotide expansion reaches a critical size threshold, the methylation patterns of the promoter are altered, leading to changes in FMR1 gene expression. Mills et al. (15) identified 3829 small INDELs that map to the promoter regions of RefSeq genes. In addition to repeat expansions as outlined for FMR1, INDELs that occur within transcription factor binding sites or enhancers also could be envisioned to diminish or abolish gene expression. INDELs also have the capacity to dramatically alter the phasing and spacing of DNA sequences within promoters. For example, a 5 bp INDEL could rotate a transcription factor binding site to the opposite face of the DNA helix, and a 100 bp INDEL could increase the spacing between two binding sites such that productive interactions are disrupted. Thus, INDELs within gene promoter regions might help explain the differences in gene expression that have been observed in diverse humans (37).
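The phasing and spacing effects mentioned above can be made concrete with a simple calculation. The sketch assumes idealized B-form DNA with a helical repeat of about 10.5 bp per turn and a rise of about 0.34 nm per base pair; both values are textbook approximations that we supply as assumptions, and DNA in chromatin will deviate from them.

```python
BP_PER_TURN = 10.5    # assumed helical repeat of B-form DNA (bp per turn)
RISE_NM_PER_BP = 0.34  # assumed axial rise of B-form DNA (nm per bp)


def rotation_degrees(indel_bp, bp_per_turn=BP_PER_TURN):
    """Angular shift (0 to <360 degrees) of a downstream site around the helix."""
    return (indel_bp % bp_per_turn) / bp_per_turn * 360.0


def added_spacing_nm(indel_bp, rise=RISE_NM_PER_BP):
    """Extra linear distance introduced by an insertion of indel_bp bases."""
    return indel_bp * rise


half_turn = rotation_degrees(5)     # 5 bp INDEL: close to half a turn
spacing = added_spacing_nm(100)     # 100 bp insertion: added separation in nm
```

Under these assumptions, a 5 bp INDEL rotates a downstream site by roughly 171 degrees (close to the opposite face of the helix), and a 100 bp insertion adds about 34 nm between two binding sites.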
ENDOGENOUS RETROTRANSPOSONS CAUSE SMALL INDELS IN HUMAN GENOMES
Mobile genetic elements produce a great deal of small INDEL variation in humans (38,39). In particular, active Alu, L1 and SVA retrotransposons continue to produce new insertions when they ‘jump' from one place to another in the human genome (38,39). When they jump, Alu elements cause new insertions of ∼300 bp, whereas L1 and SVA elements cause insertions that range from tens of base pairs in length up to 3 kb (SVA) or 6 kb (L1) in length (38). Recent studies indicate that Alu and L1 elements produce millions of new insertions worldwide in human populations (40–45). New insertions, by definition, cause rare insertion alleles because they initially occur in only a single human. Many of these new insertions occur in human genes (including the exons of these genes) and such insertions can affect gene function (38,40). The full-length source elements that produce new L1 insertions are themselves highly variable among diverse humans, suggesting that some genomes might produce more new L1 insertions than others (41). Nevertheless, endogenous retrotransposons appear to cause rare insertion alleles in most, if not all, human genomes and these insertions can influence human biology (38,40).
THE FUTURE OF INDEL DISCOVERY IN PERSONAL HUMAN GENOMES
Moving forward, there are several major questions. First, exactly how many small INDELs are present in the average human genome? And how does this number differ from one human to the next? The numbers that have been reported thus far are quite variable (Table 1). It appears that variability in INDEL discovery is at the root of these differences rather than true differences in the number of INDELs. Genuine differences between two individuals might be of great interest. Do some genomes actually carry fewer INDELs than others? To answer these questions, it will be necessary to define a standardized size range for INDEL discovery that will be applied to all genomes (1 to 100 bp? 1 to 10 000 bp?). In this regard, it might be useful to compare two or more of the genomes that have already been sequenced (Table 1) using a single uniform approach. Another major issue is INDEL validation. Most of the validation studies that have been conducted thus far have sampled only a very limited number of INDELs, and as a consequence, the true rates of INDEL discovery are hard to estimate. The false negative rates that have been reported (Table 2) indicate that many of the INDELs in personal genomes are being missed with current detection methods. What are these missing INDELs and can we afford to overlook them? Can the detection methods be modified to detect more of these INDELs?
And the biggest question of all: which small INDELs affect human traits and diseases and how will we find these INDELs? Several experimental approaches could be taken to answer this question. First, the small INDELs that have been discovered thus far (Table 1; totaling several million in dbSNP) could be genotyped and integrated into the human HapMap in the same way that SNPs were genotyped for the HapMap (10–12). Common INDELs that map to exons, promoters and other functionally important sites could be given the highest priority. Once integrated into the HapMap, these INDELs would greatly expand the number of high-scoring variants that are identified in GWASs. As efforts turn to whole genome re-sequencing as an approach for identifying causative variants, it will become even more critical to detect INDELs fully and accurately. The biggest and perhaps most interesting challenge ahead will be to develop a comprehensive framework of variants, genes and biological pathways that can be used to predict human health.
FUNDING
This work was supported by the National Human Genome Research Institute, National Institutes of Health (grant number R01HG002898 to S.E.D.).
ACKNOWLEDGEMENTS
We thank Shari Corin for critical review of the manuscript and for helpful discussions. We also thank Anup Mahurkar, Todd Creasy and Umar Farooq for helpful discussions.
Conflict of Interest statement. None declared.
REFERENCES
- 1. Levy S., Sutton G., Ng P.C., Feuk L., Halpern A.L., Walenz B.P., Axelrod N., Huang J., Kirkness E.F., Denisov G., et al. The diploid genome sequence of an individual human. PLoS Biol. 2007;5:e254. doi: 10.1371/journal.pbio.0050254.
- 2. Wheeler D.A., Srinivasan M., Egholm M., Shen Y., Chen L., McGuire A., He W., Chen Y., Makhijani V., Roth G.T., et al. The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008;452:872–876. doi: 10.1038/nature06884.
- 3. Wang J., Wang W., Li R., Li Y., Tian G., Goodman L., Fan W., Zhang J., Li J., Zhang J., et al. The diploid genome sequence of an Asian individual. Nature. 2008;456:60–65. doi: 10.1038/nature07484.
- 4. Bentley D.R., Balasubramanian S., Swerdlow H.P., Smith G.P., Milton J., Brown C.G., Hall K.P., Evers D.J., Barnes C.L., Bignell H.R., et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456:53–59. doi: 10.1038/nature07517.
- 5. Ley T.J., Mardis E.R., Ding L., Fulton B., McLellan M.D., Chen K., Dooling D., Dunford-Shore B.H., McGrath S., Hickenbotham M., et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature. 2008;456:66–72. doi: 10.1038/nature07485.
- 6. Kim J.I., Ju Y.S., Park H., Kim S., Lee S., Yi J.H., Mudge J., Miller N.A., Hong D., Bell C.J., et al. A highly annotated whole-genome sequence of a Korean individual. Nature. 2009;460:1011–1016. doi: 10.1038/nature08211.
- 7. Ahn S.M., Kim T.H., Lee S., Kim D., Ghang H., Kim D.S., Kim B.C., Kim S.Y., Kim W.Y., Kim C., et al. The first Korean genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome Res. 2009;19:1622–1629. doi: 10.1101/gr.092197.109.
- 8. Schuster S.C., Miller W., Ratan A., Tomsho L.P., Giardine B., Kasson L.R., Harris R.S., Peterson D.C., Zhao F., Qi J., Alkan C., et al. Complete Khoisan and Bantu genomes from southern Africa. Nature. 2010;463:943–947. doi: 10.1038/nature08795.
- 9. Hayden E.C. International genome project launched. Nature. 2008;451:378–379. doi: 10.1038/451378b.
- 10. The International SNP Map Working Group. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature. 2001;409:928–933. doi: 10.1038/35057149.
- 11. The International HapMap Consortium. The International HapMap Project. Nature. 2003;426:789–796. doi: 10.1038/nature02168.
- 12. The International HapMap Consortium. A haplotype map of the human genome. Nature. 2005;437:1299–1320. doi: 10.1038/nature04226.
- 13. Weber J.L., David D., Heil J., Fan Y., Zhao C., Marth G. Human diallelic insertion/deletion polymorphisms. Am. J. Hum. Genet. 2002;71:854–862. doi: 10.1086/342727.
- 14. Bhangale T.R., Rieder M.J., Livingston R.J., Nickerson D.A. Comprehensive identification and characterization of diallelic insertion–deletion polymorphisms in 330 human candidate genes. Hum. Mol. Genet. 2005;14:59–69. doi: 10.1093/hmg/ddi006.
- 15. Mills R.E., Luttig C.T., Larkins C.E., Beauchamp A., Tsui C., Pittard W.S., Devine S.E. An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Res. 2006;16:1182–1190. doi: 10.1101/gr.4565806.
- 16. Iafrate A.J., Feuk L., Rivera M.N., Listewnik M.L., Donohue P.K., Qi Y., Scherer S.W., Lee C. Detection of large-scale variation in the human genome. Nat. Genet. 2004;36:949–951. doi: 10.1038/ng1416.
- 17. Sebat J., Lakshmi B., Troge J., Alexander J., Young J., Lundin P., Maner S., Massa H., Walker M., Chi M., et al. Large-scale copy number polymorphism in the human genome. Science. 2004;305:525–528. doi: 10.1126/science.1098918.
- 18. Tuzun E., Sharp A.J., Bailey J.A., Kaul R., Morrison V.A., Pertz L.M., Haugen E., Hayden H., Albertson D., Pinkel D., et al. Fine-scale structural variation of the human genome. Nat. Genet. 2005;37:727–732. doi: 10.1038/ng1562.
- 19. McCarroll S.A., Hadnott T.N., Perry G.H., Sabeti P.C., Zody M.C., Barrett J.C., Dallaire S., Gabriel S.B., Lee C., Daly M.J., et al. Common deletion polymorphisms in the human genome. Nat. Genet. 2006;38:86–92. doi: 10.1038/ng1696.
- 20. Hinds D.A., Kloek A.P., Jen M., Chen X., Frazer K.A. Common deletions and SNPs are in linkage disequilibrium in the human genome. Nat. Genet. 2006;38:82–85. doi: 10.1038/ng1695.
- 21. Conrad D.F., Andrews T.D., Carter N.P., Hurles M.E., Pritchard J.K. A high-resolution survey of deletion polymorphism in the human genome. Nat. Genet. 2006;38:75–81. doi: 10.1038/ng1697.
- 22. Kidd J.M., Cooper G.M., Donahue W.F., Hayden H.S., Sampas N., Graves T., Hansen N., Teague B., Alkan C., Antonacci F., et al. Mapping and sequencing of structural variation from eight human genomes. Nature. 2008;453:56–64. doi: 10.1038/nature06862.
- 23. Mullikin J.C., Hunt S.E., Cole C.G., Mortimore B.J., Rice C.M., Burton J., Matthews L.H., Pavitt R., Plumb R.W., Sims S.K., et al. An SNP map of human chromosome 22. Nature. 2000;407:516–520. doi: 10.1038/35035089.
- 24. Dawson E., Chen Y., Hunt S., Smink L.J., Hunt A., Rice K., Livingston S., Bumpstead S., Bruskiewich R., Sham P., et al. A SNP resource for human chromosome 22: extracting dense clusters of SNPs from the genomic sequence. Genome Res. 2001;11:170–178. doi: 10.1101/gr.156901.
- 25. Bhangale T.R., Stephens M., Nickerson D.A. Automating resequencing-based detection of insertion–deletion polymorphisms. Nat. Genet. 2006;38:1457–1462. doi: 10.1038/ng1925.
- 26. Chen K., McLellan M.D., Ding L., Wendl M.C., Kasai Y., Wilson R.K., Mardis E.R. PolyScan: an automatic indel and SNP detection approach to the analysis of human resequencing data. Genome Res. 2007;17:656–666. doi: 10.1101/gr.6151507.
- 27. Li R., Li Y., Kristiansen K., Wang J. SOAP: short oligonucleotide alignment program. Bioinformatics. 2008;24:713–714. doi: 10.1093/bioinformatics/btn025.
- 28. Li H., Ruan J., Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008;18:1851–1858. doi: 10.1101/gr.078212.108.
- 29. Li H., Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324.
- 30. Ye K., Schulz M.H., Long Q., Apweiler R., Ning Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics. 2009;25:2865–2871. doi: 10.1093/bioinformatics/btp394.
- 31. Langmead B., Trapnell C., Pop M., Salzberg S.L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25. doi: 10.1186/gb-2009-10-3-r25.
- 32. Homer N., Merriman B., Nelson S.F. BFAST: an alignment tool for large scale genome resequencing. PLoS ONE. 2009;4:e7767. doi: 10.1371/journal.pone.0007767.
- 33. Collins F.S., Drumm M.L., Cole J.L., Lockwood W.K., Vande Woude G.F., Iannuzzi M.C. Construction of a general human chromosome jumping library, with application to cystic fibrosis. Science. 1987;235:1046–1049. doi: 10.1126/science.2950591.
- 34. Ng P.C., Levy S., Huang J., Stockwell T.B., Walenz B.P., Li K., Axelrod N., Busam D.A., Strausberg R.L., Venter J.C. Genetic variation in an individual human exome. PLoS Genet. 2008;4:e1000160. doi: 10.1371/journal.pgen.1000160.
- 35. Ng S.B., Turner E.H., Robertson P.D., Flygare S.D., Bigham A.W., Lee C., Shaffer T., Wong M., Bhattacharjee A., Eichler E.E., et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009;461:272–276. doi: 10.1038/nature08250.
- 36. Warren S.T., Zhang F., Licameli G.R., Peters J.F. The fragile X sites in somatic cell hybrids: an approach for molecular cloning of fragile sites. Science. 1987;237:420–423. doi: 10.1126/science.3603029.
- 37. Cheung V.G., Spielman R.S. Genetics of human gene expression: mapping DNA variants that influence gene expression. Nat. Rev. Genet. 2009;10:595–604. doi: 10.1038/nrg2630.
- 38. Mills R.E., Bennett E.A., Iskow R.C., Devine S.E. Which transposable elements are active in the human genome. Trends Genet. 2007;23:183–191. doi: 10.1016/j.tig.2007.02.006.
- 39. Bennett E.A., Coleman L.E., Tsui C., Pittard W.S., Devine S.E. Natural genetic variation caused by transposable elements in humans. Genetics. 2004;168:933–951. doi: 10.1534/genetics.104.031757.
- 40. Iskow R.C., McCabe M.T., Mills R.E., Torene S., Pittard W.S., Neuwald A.F., Van Meir E.G., Vertino P.M., Devine S.E. Natural mutagenesis of human genomes by endogenous retrotransposons. Cell. 2010;141:1253–1261. doi: 10.1016/j.cell.2010.05.020.
- 41. Beck C.R., Collier P., Macfarlane C., Malig M., Kidd J.M., Eichler E.E., Badge R.M., Moran J.V. LINE-1 retrotransposition activity in human genomes. Cell. 2010;141:1159–1170. doi: 10.1016/j.cell.2010.05.021.
- 42. Huang C.R., Schneider A.M., Lu Y., Niranjan T., Shen P., Robinson M.A., Steranka J.P., Valle D., Civin C.I., Wang T., et al. Mobile interspersed repeats are major structural variants in the human genome. Cell. 2010;141:1171–1182. doi: 10.1016/j.cell.2010.05.026.
- 43. Lupski J.R. Retrotransposition and structural variation in the human genome. Cell. 2010;141:1110–1112. doi: 10.1016/j.cell.2010.06.014.
- 44. Ewing A.D., Kazazian H.H. Jr. High-throughput sequencing reveals extensive variation in human-specific L1 content in individual genomes. Genome Res. 2010; doi: 10.1101/gr.106419.110 [Epub ahead of print, PMID: 20488934].
- 45. Witherspoon D.J., Xing J., Zhang Y., Watkins W.S., Batzer M.A., Jorde L.B. Mobile element scanning (ME-Scan) by targeted high-throughput sequencing. BMC Genomics. 2010;11:410. doi: 10.1186/1471-2164-11-410.
- 46. Mullikin J.C., Hansen N.F., Shen L., Ebling H., Donahue W.F., Tao W., Saranga D.J., Brand A., Rubenfield M.J., Young A.C., et al. Light whole genome sequence for SNP discovery across domestic cat breeds. BMC Genomics. 2010;11:406. doi: 10.1186/1471-2164-11-406.