Limitations of next-generation genome sequence assembly

Can Alkan; Saba Sajjadian; Evan E Eichler

doi:10.1038/nmeth.1527

Author manuscript; available in PMC: 2011 Jul 1.

Published in final edited form as: Nat Methods. 2010 Nov 21;8(1):61–65. doi: 10.1038/nmeth.1527

PMCID: PMC3115693 NIHMSID: NIHMS297025 PMID: 21102452

Abstract

High-throughput sequencing technologies promise to transform the fields of genetics and comparative biology by delivering tens of thousands of genomes in the near future. Although it is feasible to construct de novo genome assemblies in a few months, there has been relatively little attention to what is lost by sole application of short sequence reads. We compared the recent de novo assemblies using the short oligonucleotide analysis package (SOAP), generated from the genomes of a Han Chinese individual and a Yoruban individual, to experimentally validated genomic features. We found that de novo assemblies were 16.2% shorter than the reference genome and that 420.2 megabase pairs of common repeats and 99.1% of validated duplicated sequences were missing from the genome. Consequently, over 2,377 coding exons were completely missing. We conclude that high-quality sequencing approaches must be considered in conjunction with high-throughput sequencing for comparative genomics analyses and studies of genome evolution.

The plummeting costs and massive throughput of second-generation sequencing platforms are paving the way for de novo sequencing applications to characterize the genomes of thousands of species. Recently, researchers from the Beijing Genome Institute sequenced the cucumber genome using both capillary sequencing and Illumina technology¹, and the panda genome was the first mammalian genome to be assembled using sequence data generated solely using next-generation sequencing (NGS) platforms². An international consortium of scientists proposes more ambitious projects, such as the Genome 10K Project to sequence the genomes of 10,000 vertebrate species³. The information obtained from sequencing these genomes will help us better understand genome evolution, providing rapid access to gene models from many more organisms than previously anticipated. However, a critical assessment of NGS genome assemblies should be performed in comparison to known standards, and a robust classification of what is missing from the assemblies needs to be taken into account. Such analyses are essential to correctly perform comparative genomics studies. Typical genome assembly standards, such as complete cDNA libraries or sequences from large-insert genomic clones that sample the genome, do not yet exist for newly sequenced genomes such as the cucumber and panda and are unlikely to be generated to test the assembly quality. We can, however, compare the recently generated de novo sequence assemblies of two human individuals⁴ with the human reference genome⁵,⁶ to assess the limitations of such genomes assembled primarily with short reads. Here we present a formal analysis of the de novo sequence assembly generated from the genome of a Han Chinese individual (YH; Supplementary Note) using the Illumina platform⁴ with an emphasis on the repeats and segmental duplications that cover approximately half of the human genome. In addition, we analyzed the new sequences discovered from the genome of another individual (Yoruban from Ibadan, Nigeria; NA18507) to test the utility of de novo assemblies in the characterization of new sequence insertions.

Sequence properties and algorithmic challenges

NGS technologies typically generate shorter sequences with higher error rates from relatively short insert libraries⁷,⁸. For example, one of the most commonly used technologies, Illumina’s sequencing by synthesis, routinely produces read lengths of 75–100 base pairs (bp) from libraries with insert sizes of 200–500 bp. It is, therefore, expected that assembly of longer repeats and duplications will suffer from this short read length. Similar to the whole-genome shotgun sequence (WGS) assembly algorithms that use capillary-based data such as the Celera assembler⁹, the predominant assembly methods for short reads are based on de Bruijn graph and Eulerian path approaches¹⁰, which have difficulty in assembling complex regions of the genome. As argued by groups that presented various implementations of this approach (for example, the algorithms named EULER-USR¹¹, ABySS¹² and SOAPdenovo⁴), paired-end sequence libraries with long inserts help to ameliorate this bias. However, even the longest currently available inserts (<17 kilobases with Roche 454 sequencing¹³) are insufficient to bridge across regions that harbor the majority of recently duplicated human genes. Criticisms of WGS assembly algorithms and characterization of various types of errors associated with them, as well as requirements for better assemblies, have been previously discussed¹⁴⁻¹⁶.
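The difficulty that graph-based assemblers have with repeats can be illustrated with a toy de Bruijn graph. The following is a minimal sketch, not the implementation used by any of the assemblers named above: the sequence, the repeat, the value of k and all function names are made up for illustration, and the graph is built directly from the sequence rather than from error-containing reads. It shows that two copies of an identical repeat contribute the same k-mers, so they occupy a single path in the graph, and that the node where the repeat ends has two possible continuations.

```python
from collections import defaultdict


def kmers(sequence, k):
    """All k-mers of `sequence`, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def de_bruijn_graph(sequence, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for kmer in kmers(sequence, k):
        graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, where extension becomes ambiguous."""
    return sorted(node for node, nxt in graph.items() if len(nxt) > 1)


# Toy sequence (hypothetical): one 8-bp repeat placed twice between unique flanks.
REPEAT = "ACGTACGG"
genome = "TTGCA" + REPEAT + "CCATG" + REPEAT + "GATTC"

K = 5
all_kmers = kmers(genome, K)
graph = de_bruijn_graph(genome, K)

print("k-mer positions:", len(all_kmers))
print("distinct k-mers:", len(set(all_kmers)))  # fewer, because both repeat copies share k-mers
print("branching nodes:", branching_nodes(graph))
```

In this toy case the only branching node is the last (k-1)-mer of the repeat. Resolving which continuation belongs to which copy requires a read or a read pair that spans the whole repeat, which is why read length and insert size matter for longer repeats and duplications.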

Contamination or new insertions?

An important consideration of any sequencing project, including those that use Sanger sequencing, is DNA contamination from other organisms. Before analyzing the genomes, we searched for possible contaminants by comparing the repeat-masked YH genome against the US National Center for Biotechnology Information (NCBI) nucleotide (nt) database¹⁷ (Supplementary Note). We identified 1,033 contigs and 166 scaffolds (361 kbp) with high-identity matches to other species (Supplementary Note). Although this represents a small fraction of the total genome, nearly half the sequence (152 kbp) (Supplementary Table 1) was classified as new human sequences corresponding to ~15% of all reported insertion polymorphisms for YH (1,079 of 7,211 sequences or 3% of 5.12 Mbp)¹⁸ (Supplementary Note). Similarly, 2.8% (136.6 kbp of 4.8 Mbp) of the new sequences reported using the genome of a Yoruban individual (NA18507) likely represent contamination. The majority of these contaminations had high sequence identity to Epstein-Barr virus, an agent commonly used to immortalize cell lines (Fig. 1a and Supplementary Table 1). (Note that the NA18507 genome was sequenced using DNA from a cell line, whereas the YH genome sequence was generated from blood DNA.) Thus, although de novo sequence assemblies may be an important source for the discovery of insertion polymorphisms and are complementary to clone-based methods (Fig. 1a), such sequences require particular scrutiny and additional validation because of their tendency to enrich for contamination artifacts. Discriminating such sequences before sequence assembly becomes particularly problematic when the underlying sequence read data are short.

Figure 1.

Summary of de novo genome assembly and new sequence analysis. (a) Venn diagram comparing insertion sequences (total base pairs that do not exist in the reference genome build 36) detected by fosmid end sequencing²⁷ and de novo assembly¹⁸ for the same genome (NA18507). The number of base pairs of Epstein-Barr virus contamination is also shown. Approximately 1.6 Mbp of new insertion sequence aligns with 1.42 Mbp detected by de novo assembly with NGS. (b) Average sequence identity of L1 common repeat sequences and depletion ratio in the YH genome assembly. (c) The pairwise sequence identity distribution of duplicated sequences in the YH genome compared to the human reference genome and a WGS assembly based on capillary sequence²⁴ (Celera). (d) Number of base pairs in segmental duplications detected in the YH assembly (YH WGAC) compared with duplications common to NCBI build 36 WGAC analysis (≥94% sequence identity) and read-depth analyses of the capillary-based (Celera) and YH (intersection of three datasets).

Repeat content

Any WGS-based de novo sequence assembly algorithm will collapse identical repeats, resulting in reduced or lost genomic complexity¹⁴. We compared the repeat content of the YH genome and the human reference genome (build 36) generating summary statistics for various repeat classes¹⁹ (Supplementary Table 2). Although the repeat structure may vary between individuals, most retrotransposons are fixed in the human lineage²⁰, thus we would expect to observe a similar number of base pairs corresponding to retrotransposon-derived repeats in the genome of any human individual and the reference genome assembly. We identified 420.2 Mbp of missing common repeat sequence from the YH assembly corresponding to 173.6 Mbp of missing LINE1 (L1) and 159.2 Mbp of missing Alu repeats. As highly identical sequences will be more problematic, we quantified this effect by comparing this depletion as a function of sequence divergence. The depletion of repeat sequences was enriched in L1 classes with lower sequence divergence (R² = 0.86; Fig. 1b). We found that the depletion rose rapidly (>50%) for L1 repeat subfamilies where sequence identity exceeded 85%.

In general, most Alu subfamilies were underrepresented, and evolutionarily younger Alu repeats with higher identity to consensus sequences tended to have higher depletion rates, although this trend was weak (R² = 0.02, Supplementary Fig. 1), likely because of the shorter sequence length of the Alu repeat class. Most common repeat classes showed reduced representation in the YH genome (Supplementary Table 2).
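The relationship between depletion and sequence identity described above can be made concrete with a short calculation. This is a sketch under stated assumptions: the depletion ratio is taken here to be the fraction of reference repeat base pairs absent from the assembly, R² is the coefficient of determination of a least-squares line, and the three subfamily records are invented toy values that do not come from the paper or its supplementary tables.

```python
def depletion_ratio(reference_bp, assembly_bp):
    """Fraction of reference repeat sequence that is absent from the assembly."""
    if reference_bp <= 0:
        raise ValueError("reference_bp must be positive")
    return 1.0 - assembly_bp / reference_bp


def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


# Hypothetical subfamilies: (name, mean identity to consensus in %, bp in reference, bp in assembly).
toy_subfamilies = [
    ("toy_old", 75.0, 1000, 900),
    ("toy_mid", 85.0, 1000, 500),
    ("toy_young", 95.0, 1000, 100),
]

identities = [identity for _, identity, _, _ in toy_subfamilies]
ratios = [depletion_ratio(ref, asm) for _, _, ref, asm in toy_subfamilies]

print("depletion ratios:", [round(r, 2) for r in ratios])
print("R^2:", round(r_squared(identities, ratios), 3))
```

The toy values lie on a straight line, so the fit is perfect; real subfamily data would scatter around the trend, as the reported R² values for L1 and Alu indicate.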

Segmental duplications

We used the whole-genome assembly comparison (WGAC) method²¹ to analyze the segmental duplication content in the YH genome. Although genomes typically contain 140.2 Mbp to 159.6 Mbp (25,914 pairwise alignments) of euchromatic segmental duplication²², we detected only 10 Mbp of segmental duplications (3,406 pairwise alignments) in the YH assembly (Table 1). Although the depletion becomes more pronounced with increasing sequence identity, the number of pairwise alignments was dramatically reduced (>90%) for all classes of duplication (Fig. 1c). This is in stark contrast to capillary sequencing–based WGS assembly, which recovered a substantial fraction of duplications with less than 95% sequence identity²². We previously constructed a duplication map of the YH genome using read-depth methods and validated copy-number differences using array comparative genomic hybridization²³. We discovered 92 Mbp of segmental duplications (>94% sequence identity) and found that the duplication content was similar to that of other human genomes (Supplementary Fig. 2a). We did not observe the common human duplication pattern within the YH genome de novo assembly (Fig. 1d and Supplementary Fig. 2b). If we limit our analysis to those duplications commonly present in the human reference genome and duplications we detected through read-depth analysis of a capillary sequencing–based WGS dataset²⁴ (Celera) and YH (total of 72 Mbp common duplications), we conclude that 99.1% of true pairwise segmental duplications were absent. We predict that 95.6% of the duplications in the YH de novo assembly are likely false because they did not correspond to duplications predicted by read depth or were not detected by array comparative genomic hybridization analysis²³ of the YH genome.

Table 1.

Summary of segmental duplication statistics

Statistic	Type	NCBI build 36	Celera WGS assembly	YH genome assembly
Genome size (bp)		3,107,677,273	2,695,614,880	2,874,204,399
Nonredundant duplication space (bp)	Intrachromosomal	114,538,257	36,232,042	5,178,588
Nonredundant duplication space (bp)	Interchromosomal	74,560,372	32,383,828	4,891,680
Nonredundant duplication space (bp)	Total	159,204,446	58,887,898	10,034,278
Pairwise alignments	Intrachromosomal	9,245	7,080	1,652
Pairwise alignments	Interchromosomal	16,699	13,308	1,754
Pairwise alignments	Total	25,944	20,388	3,406

The YH genome assembly includes 496 Mb of scaffold gaps.
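To put the totals in Table 1 on a common scale, the short calculation below expresses each WGS assembly as a percentage of the NCBI build 36 values. It is only an arithmetic sketch: the numbers are copied from the "Total" rows of Table 1, the dictionary and function names are illustrative, and the resulting ratios are raw table ratios, not the validated-duplication estimate discussed in the text, which is restricted to duplications confirmed by several methods.

```python
# Totals copied from Table 1: nonredundant duplication space (bp) and pairwise alignments.
TABLE1_TOTALS = {
    "NCBI build 36": {"dup_bp": 159_204_446, "alignments": 25_944},
    "Celera WGS assembly": {"dup_bp": 58_887_898, "alignments": 20_388},
    "YH genome assembly": {"dup_bp": 10_034_278, "alignments": 3_406},
}


def percent_of_reference(assembly, field, reference="NCBI build 36"):
    """Value of `field` for `assembly` as a percentage of the reference assembly's value."""
    return 100.0 * TABLE1_TOTALS[assembly][field] / TABLE1_TOTALS[reference][field]


for name in ("Celera WGS assembly", "YH genome assembly"):
    dup = percent_of_reference(name, "dup_bp")
    aln = percent_of_reference(name, "alignments")
    print(f"{name}: {dup:.1f}% of duplicated bp, {aln:.1f}% of pairwise alignments")
```

By this measure the capillary-based WGS assembly retains roughly a third of the duplicated base pairs annotated in build 36, whereas the YH assembly retains well under a tenth.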

Missing and fragmented genes

Finally, we analyzed the impact of this genomic reduction on both gene coverage and fragmentation of genes into multiple scaffolds. We examined a nonredundant autosomal gene set (17,601 genes; Supplementary Note) and required ≥98% sequence identity between the assembly and the reference gene set. At the exon level, we found that 93% of all coding exons (159,621 of 171,746 exons) were completely represented in the YH assembly. At the gene level, however, only 56.3% of the genes (9,909 of 17,601 genes) had sufficient representation in the assembly (≥95% of the gene). Not surprisingly, among the 2,377 protein-coding exons that were completely absent, 47.7% (1,133 of 2,377 exons) mapped to segmental duplications (Supplementary Tables 3 and 4), representing a tenfold enrichment of duplicated sequence. Although these losses would prevent appropriate annotation of at least 1,112 genes, we also noted 83 genes for which all exons were completely missing or had less than 1% of their protein-coding sequence represented. Of these genes, 81.9% (68 of 83 genes) corresponded to members of duplicated gene families, many of which are high in copy number in the YH genome, as we previously characterized²³ (Supplementary Tables 3 and 4).

The analysis described above did not consider gene fragmentation (that is, parts of the same gene represented in different scaffolds). The presence of duplicated and repetitive sequences in introns complicates complete gene assembly and annotation, leading to genes being broken among multiple sequence scaffolds. To test for this effect of gene fragmentation, we calculated the minimum number of scaffolds in the YH de novo assembly required to reconstruct every human gene according to the reference genome (Supplementary Note). We found that 69.7% of the genes (12,268 of 17,601 genes) are contained in a single scaffold. Among the fragmented genes (those mapping to two or more scaffolds), we found that 42% intersect with segmental duplications (1,779 of 5,291 genes) or map to regions in which repeat content exceeded 50% (1,582 of 5,291 genes) (Supplementary Table 3). Of 11,766 nonfragmented genes with all protein-coding exons present (Supplementary Table 3), 255 were shuffled in their respective scaffolds (that is, the exons were out of order). We observed that 29 genes were fragmented into >100 scaffolds and most (93%) corresponded to duplicated genes (Table 2 and Supplementary Table 3). Among the most shattered genes with more than 200 scaffolds were two genes (HYDIN2 and PRIM2) that have high-identity segmental duplications in YH²³,²⁵. Although HYDIN2 was not present in the NCBI build 36 assembly, it is now partially represented in the GRCh37 human genome assembly but not assigned to a chromosomal location.
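The fragmentation and exon-order checks described above can be sketched as follows. This is a simplified illustration, not the procedure from the Supplementary Note: it assumes each exon has already been assigned a single best scaffold hit, counts distinct scaffolds per gene instead of solving for a true minimum, and treats a gene as shuffled when its exons, read in reference order, are neither consistently ascending nor consistently descending along the scaffold. All gene names, scaffold names and coordinates are hypothetical.

```python
def scaffold_count(hits):
    """Number of distinct scaffolds that carry at least one exon of a gene."""
    return len({scaffold for _, scaffold, _ in hits})


def exons_out_of_order(hits):
    """True if exon start positions, in reference exon order, are not monotonic.

    Descending order is accepted because a gene may lie on the reverse
    strand of its scaffold.
    """
    positions = [start for _, _, start in sorted(hits)]
    pairs = list(zip(positions, positions[1:]))
    ascending = all(a < b for a, b in pairs)
    descending = all(a > b for a, b in pairs)
    return not (ascending or descending)


def classify(hits):
    """Label a gene as 'fragmented', 'shuffled' or 'intact'."""
    if scaffold_count(hits) > 1:
        return "fragmented"
    return "shuffled" if exons_out_of_order(hits) else "intact"


# Hypothetical input: gene -> list of (exon index in reference, scaffold id, start on scaffold).
exon_hits = {
    "toy_gene_intact": [(1, "scaffold_1", 100), (2, "scaffold_1", 900), (3, "scaffold_1", 1500)],
    "toy_gene_reverse": [(1, "scaffold_2", 5000), (2, "scaffold_2", 4000), (3, "scaffold_2", 3000)],
    "toy_gene_shuffled": [(1, "scaffold_3", 100), (2, "scaffold_3", 1500), (3, "scaffold_3", 900)],
    "toy_gene_fragmented": [(1, "scaffold_4", 100), (2, "scaffold_5", 200), (3, "scaffold_4", 700)],
}

for gene, hits in exon_hits.items():
    print(gene, scaffold_count(hits), classify(hits))
```

A fuller analysis would also have to handle exons with several equally good hits, which is common for duplicated genes and is one reason such genes appear shattered across many scaffolds.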

Table 2.

Top 20 most-fragmented or missing genes in the YH genome assembly

Chromosome	Start	End	Length (bp)	Gene symbol	Transcript name	Copy number	Type	Fragments	Coverage (%)	Duplicated^a (%)
16	69398790	69822070	423,280	HYDIN	NM_032821	3.47	Fragmented	215	95.82	94.12
6	57290381	57621334	330,953	PRIM2	NM_000947	3.87	Fragmented	213	82.30	98.27
9	39062766	39278300	215,534	CNTNAP3	NM_033655	4.65	Fragmented	208	84.92	100
5	21786731	22889488	1,102,757	CDH12	NM_004061	1.87	Fragmented	184	95.86	50.91
11	87877393	88438782	561,389	GRM5	NM_001143831	2.11	Fragmented	162	90.40	67.86
7	66099252	66341931	242,679	TYW1	NM_018264	3.27	Fragmented	155	82.94	99.61
10	50696330	51041337	345,007	PARG	NM_003631	4.17	Fragmented	154	57.14	80.96
1	143663118	143787436	124,318	PDE4DIP	NM_022359	7.4	Fragmented	147	93.31	100
7	153380709	154316928	936,219	DPP6	NM_130797	1.93	Fragmented	146	80.46	38.63
1	120255701	120413799	158,098	NOTCH2	NM_024408	2.97	Fragmented	142	95.26	68.52
8	7408721	7427585	18,864	FAM90A7	NM_001136572	36.03	Missing	0	0	100
16	14938801	14953432	14,631	NPIP	NM_006985	30.73	Missing	0	0	100
7	76506733	76520291	13,558	LOC100132832	NM_001129851	19.82	Missing	0	0	100
8	12327497	12338223	10,726	FAM86B2	NM_001137610	20.82	Missing	0	0	100
15	80895765	80905166	9,401	LOC440295	NM_198181	27.21	Missing	0	0	100
7	74962235	74971564	9,329	LOC442590	NM_001099435	32.2	Missing	0	0	100
7	44007014	44016247	9,233	WBSCR19	NM_175064	32.98	Missing	0	0	100
10	135333669	135341873	8,204	DUX4	NM_033178	195.66	Missing	0	0	100
22	22706139	22714284	8,145	GSTT1	NM_000853	0.44	Missing	0	0	44.20
8	86755947	86762978	7,031	REXO1L1	NM_172239	134.92	Missing	0	0	100

^a Includes RepeatMasker data, whole-genome assembly comparison, Celera whole-genome shotgun sequence detection and YH whole-genome shotgun sequence detection.

Outlook

This is a watershed moment in genomics. Although data production capabilities are substantially improved, accurately building genome assemblies and correctly annotating them remains challenging, especially among complex genomes with higher repeat and duplication content. The de novo assembly of the YH genome, coupled with experimental validation of its duplication and repeat content, allows us to quantify this effect. Other than contaminating sequence, the most noticeable casualties of a de novo NGS assembly are segmental duplications and larger common repeats. We found that this depletion became acute when the sequence identity exceeded 85%, resulting in the loss of ~16% of the genome. This is a more considerable bias when compared to capillary sequencing–based WGS assembly of the human genome, in which sequence misassembly and collapse occurred for only ~8% of the genome when duplications or repeats exceeded 95% sequence identity. In the absence of alternative NGS-based human genome assemblies with different algorithms, we cannot test the effects of assembly method, but we believe that the limitations we present in this work are due to the properties of the data and whole-genome shotgun sequencing approach in general, rather than algorithmic inefficiency.

Without complementary efforts to fully sequence complex genomes, the field of comparative genomics may face a crisis. The problem is that, although the genomes of many more species are now accessible, the portion of each genome that can be reliably accessed has diminished substantially (<80%). Such biases are ironically transforming our definition of what it means to sequence a genome and threaten to skew our understanding of organismal biology and genome evolution. Third-generation technologies, which increase read length or library insert sizes, promise to alleviate this deficit, but the issue is fundamentally greater than a technological gap. The expertise and motivation to sequence genomes to a high quality are disappearing. Large-insert clone library resources such as bacterial artificial chromosomes, required for accurate assembly of the human genome, were once a mainstay of genome sequencing projects but are now considered too costly to create or maintain for many organisms. Moreover, if ‘genome manuscripts’ can now be published without accounting for the 20% that is missing, what incentive remains to spend the additional effort and cost to sequence these genomes well? Such biases can be minimized when the genome of a closely related species finished with high-quality, clone-based sequencing is available (such as closely related nonhuman primate genomes compared against the reference human genome assembly). The problem is exacerbated when analyzing genomes without a reference index genome. In these cases, the portions that are missing or misassembled cannot be readily inferred and are invisible to the biologist. Biases against duplications and repeats, as well as fragmentation, raise questions related to the accuracy and completeness of similarly assembled genomes such as the panda genome², as recently discussed²⁶. It is the responsibility of the scientific community to enforce standards of quality that can be measured and assessed.
In our opinion, it is critical to develop new hybrid sequencing approaches, such as multiplatform strategies including the third-generation long-read technologies, high-quality finished long-insert clones and new assembly algorithms that can accommodate these heterogeneous datasets. The genome assemblies themselves must be experimentally validated. Large-molecule, high-quality sequencing should not be abandoned until the balance between quantity and quality of genomes has been reestablished.

Supplementary Material

Supplementary Note & Tables

NIHMS297025-supplement-Supplementary_Note___Tables.pdf (402.8KB, pdf)

Supplementary Table 1

NIHMS297025-supplement-Supplementary_Table_1.xls (406KB, xls)

Supplementary Table 3

NIHMS297025-supplement-Supplementary_Table_3.xls (5MB, xls)

Supplementary Table 4

NIHMS297025-supplement-Supplementary_Table_4.xlsx (15.2MB, xlsx)

Supplementary Table 5

NIHMS297025-supplement-Supplementary_Table_5.xls (2.2MB, xls)

ACKNOWLEDGMENTS

We thank E. Karakoc and P. Sudmant for helpful discussions, T. Marques-Bonet and J.M. Kidd for providing the nonredundant gene table, and T. Brown for proofreading the manuscript. This work was partly supported by US National Institutes of Health grant HG002385 to E.E.E. E.E.E. receives funds as an Investigator of the Howard Hughes Medical Institute.

Footnotes

Note: Supplementary information is available on the Nature Methods website.

COMPETING FINANCIAL INTERESTS

The authors declare competing financial interests: details accompany the full-text HTML version of the paper at http://www.nature.com/naturemethods/.

References

1. Huang S, et al. The genome of the cucumber, Cucumis sativus L. Nat. Genet. 2009;41:1275–1281. doi: 10.1038/ng.475.
2. Li R, et al. The sequence and de novo assembly of the giant panda genome. Nature. 2010;463:311–317. doi: 10.1038/nature08696.
3. Genome 10K Community of Scientists. Genome 10K: a proposal to obtain whole-genome sequence for 10,000 vertebrate species. J. Hered. 2009;100:659–674. doi: 10.1093/jhered/esp086.
4. Li R, et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010;20:265–272. doi: 10.1101/gr.097261.109.
5. Lander ES, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. doi: 10.1038/35057062.
6. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004;431:931–945. doi: 10.1038/nature03001.
7. Wheeler DA, et al. The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008;452:872–876. doi: 10.1038/nature06884.
8. Bentley DR, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456:53–59. doi: 10.1038/nature07517.
9. Myers EW, et al. A whole-genome assembly of Drosophila. Science. 2000;287:2196–2204. doi: 10.1126/science.287.5461.2196.
10. Pevzner PA, Tang H, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA. 2001;98:9748–9753. doi: 10.1073/pnas.171285098.
11. Chaisson MJ, Brinza D, Pevzner PA. De novo fragment assembly with short mate-paired reads: Does the read length matter? Genome Res. 2009;19:336–346. doi: 10.1101/gr.079053.108.
12. Simpson JT, et al. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19:1117–1123. doi: 10.1101/gr.089532.108.
13. Schuster SC, et al. Complete Khoisan and Bantu genomes from southern Africa. Nature. 2010;463:943–947. doi: 10.1038/nature08795.
14. Green P. Whole-genome disassembly. Proc. Natl. Acad. Sci. USA. 2002;99:4143–4144. doi: 10.1073/pnas.082095999.
15. Schatz MC, Delcher AL, Salzberg SL. Assembly of large genomes using second-generation sequencing. Genome Res. 2010;20:1165–1173. doi: 10.1101/gr.101360.109.
16. Meader S, Hillier LW, Locke D, Ponting CP, Lunter G. Genome assembly quality: assessment and improvement using the neutral indel model. Genome Res. 2010;20:675–684. doi: 10.1101/gr.096966.109.
17. Zhang Z, Schwartz S, Wagner L, Miller W. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 2000;7:203–214. doi: 10.1089/10665270050081478.
18. Li R, et al. Building the sequence map of the human pan-genome. Nat. Biotechnol. 2010;28:57–63. doi: 10.1038/nbt.1596.
19. Jurka J, et al. Repbase Update, a database of eukaryotic repetitive elements. Cytogenet. Genome Res. 2005;110:462–467. doi: 10.1159/000084979.
20. Mills RE, Bennett EA, Iskow RC, Devine SE. Which transposable elements are active in the human genome? Trends Genet. 2007;23:183–191. doi: 10.1016/j.tig.2007.02.006.
21. Bailey JA, Yavor AM, Massa HF, Trask BJ, Eichler EE. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 2001;11:1005–1017. doi: 10.1101/gr.187101.
22. She X, et al. Shotgun sequence assembly and recent segmental duplications within the human genome. Nature. 2004;431:927–930. doi: 10.1038/nature03062.
23. Alkan C, et al. Personalized copy number and segmental duplication maps using next-generation sequencing. Nat. Genet. 2009;41:1061–1067. doi: 10.1038/ng.437.
24. Venter JC, et al. The sequence of the human genome. Science. 2001;291:1304–1351. doi: 10.1126/science.1058040.
25. Doggett NA, et al. A 360-kb interchromosomal duplication of the human HYDIN locus. Genomics. 2006;88:762–771. doi: 10.1016/j.ygeno.2006.07.012.
26. Worley KC, Gibbs RA. Genetics: decoding a national treasure. Nature. 2010;463:303–304. doi: 10.1038/463303a.
27. Kidd JM, et al. Characterization of missing human genome sequences and copy-number polymorphic insertions. Nat. Methods. 2010;7:365–371. doi: 10.1038/nmeth.1451.
