Highly contiguous assemblies of 101 drosophilid genomes

Bernard Y Kim; Jeremy R Wang; Danny E Miller; Olga Barmina; Emily Delaney; Ammon Thompson; Aaron A Comeault; David Peede; Emmanuel RR D'Agostino; Julianne Pelaez; Jessica M Aguilar; Diler Haji; Teruyuki Matsunaga; Ellie E Armstrong; Molly Zych; Yoshitaka Ogawa; Marina Stamenković-Radak; Mihailo Jelić; Marija Savić Veselinović; Marija Tanasković; Pavle Erić; Jian-Jun Gao; Takehiro K Katoh; Masanori J Toda; Hideaki Watabe; Masayoshi Watada; Jeremy S Davis; Leonie C Moyle; Giulia Manoli; Enrico Bertolini; Vladimír Košťál; R Scott Hawley; Aya Takahashi; Corbin D Jones; Donald K Price; Noah Whiteman; Artyom Kopp; Daniel R Matute; Dmitri A Petrov

doi:10.7554/eLife.66405

. 2021 Jul 19;10:e66405. doi: 10.7554/eLife.66405

Highly contiguous assemblies of 101 drosophilid genomes

Bernard Y Kim ^1,^†,^✉, Jeremy R Wang ^2,^†, Danny E Miller ³, Olga Barmina ⁴, Emily Delaney ⁴, Ammon Thompson ⁴, Aaron A Comeault ⁵, David Peede ⁶, Emmanuel RR D'Agostino ⁶, Julianne Pelaez ⁷, Jessica M Aguilar ⁷, Diler Haji ⁷, Teruyuki Matsunaga ⁷, Ellie E Armstrong ¹, Molly Zych ⁸, Yoshitaka Ogawa ⁹, Marina Stamenković-Radak ¹⁰, Mihailo Jelić ¹⁰, Marija Savić Veselinović ¹⁰, Marija Tanasković ¹¹, Pavle Erić ¹¹, Jian-Jun Gao ¹², Takehiro K Katoh ¹², Masanori J Toda ¹³, Hideaki Watabe ¹⁴, Masayoshi Watada ¹⁵, Jeremy S Davis ¹⁶, Leonie C Moyle ¹⁷, Giulia Manoli ¹⁸, Enrico Bertolini ¹⁸, Vladimír Košťál ¹⁹, R Scott Hawley ²⁰, Aya Takahashi ⁹, Corbin D Jones ⁶, Donald K Price ²¹, Noah Whiteman ⁷, Artyom Kopp ⁴, Daniel R Matute ^6,^†,^✉, Dmitri A Petrov ^1,^†,^✉

Editors: Graham Coop²², Patricia J Wittkopp²³

¹Department of Biology, Stanford University, Stanford, United States

²Department of Genetics, University of North Carolina, Chapel Hill, United States

³Department of Pediatrics, Division of Genetic Medicine, University of Washington and Seattle Children’s Hospital, Seattle, United States

⁴Department of Evolution and Ecology, University of California Davis, Davis, United States

⁵School of Natural Sciences, Bangor University, Bangor, United Kingdom

⁶Biology Department, University of North Carolina, Chapel Hill, United States

⁷Department of Integrative Biology, University of California, Berkeley, Berkeley, United States

⁸Molecular and Cellular Biology Program, University of Washington, Seattle, United States

⁹Department of Biological Sciences, Tokyo Metropolitan University, Hachioji, Japan

¹⁰Faculty of Biology, University of Belgrade, Belgrade, Serbia

¹¹University of Belgrade, Institute for Biological Research "Siniša Stanković", National Institute of Republic of Serbia, Belgrade, Serbia

¹²School of Ecology and Environmental Science, Yunnan University, Kunming, China

¹³Hokkaido University Museum, Hokkaido University, Sapporo, Japan

¹⁴Biological Laboratory, Sapporo College, Hokkaido University of Education, Sapporo, Japan

¹⁵Graduate School of Science and Engineering, Ehime University, Matsuyama, Japan

¹⁶Department of Biology, University of Kentucky, Lexington, United States

¹⁷Department of Biology, Indiana University, Bloomington, United States

¹⁸Neurobiology and Genetics, Theodor Boveri Institute, Biocentre, University of Würzburg, Würzburg, Germany

¹⁹Institute of Entomology, Biology Centre, Academy of Sciences of the Czech Republic, Prague, Czech Republic

²⁰Department of Molecular and Integrative Physiology, University of Kansas Medical Center, Stowers Institute for Medical Research, Kansas City, United States

²¹School of Life Science, University of Nevada, Las Vegas, United States

²²University of California, Davis, United States

²³University of Michigan, United States

^†

These authors contributed equally to this work.

^✉

Corresponding author.

Roles

Graham Coop: Reviewing Editor

Patricia J Wittkopp: Senior Editor

PMCID: PMC8337076 PMID: 34279216

Abstract

Over 100 years of studies in Drosophila melanogaster and related species in the genus Drosophila have facilitated key discoveries in genetics, genomics, and evolution. While high-quality genome assemblies exist for several species in this group, they only encompass a small fraction of the genus. Recent advances in long-read sequencing allow high-quality genome assemblies for tens or even hundreds of species to be efficiently generated. Here, we utilize Oxford Nanopore sequencing to build an open community resource of genome assemblies for 101 lines of 93 drosophilid species encompassing 14 species groups and 35 sub-groups. The genomes are highly contiguous and complete, with an average contig N50 of 10.5 Mb and greater than 97% BUSCO completeness in 97/101 assemblies. We show that Nanopore-based assemblies are highly accurate in coding regions, particularly with respect to coding insertions and deletions. These assemblies, along with a detailed laboratory protocol and assembly pipelines, are released as a public resource and will serve as a starting point for addressing broad questions of genetics, ecology, and evolution at the scale of hundreds of species.

Research organism: D. melanogaster, Other

Introduction

The rise of long-read sequencing alongside the continuously decreasing costs of next-generation sequencing have served to greatly democratize the process of genome assembly, making it feasible to assemble high-quality genomes at a previously unthinkable scale. Currently, a number of large consortia are leading well-publicized efforts to assemble the genomes of many taxa throughout the Tree of Life. Some often overlapping examples include the Vertebrate Genomes Project (Rhie et al., 2021), the Bird 10,000 Genomes Project (Feng et al., 2020), the Zoonomia Project (Zoonomia Consortium et al., 2020), the Darwin Tree of Life (Threlfall and Blaxter, 2021), the Earth Biogenome Project (Lewin et al., 2018), and the 5000 Arthropod Genomes Initiative (Robinson et al., 2011a). In addition to establishing new standards for modern large-scale genomics projects and opening avenues for genomic research that were previously only feasible in model organisms across a multitude of species, these projects are creating an opportunity to study genetic variation and address fundamental biological questions at a scope that was simply not possible before.

In many respects, the foundation for modern genomics was built by those studying the vinegar (also called fruit or pomace) fly Drosophila melanogaster and related species in the family Drosophilidae. As a premier model organism for genetic and biological research since the foundational work of Morgan and colleagues, D. melanogaster was, after C. elegans, the second metazoan organism to undergo whole-genome sequencing (Adams et al., 2000). At that time, the completion of the D. melanogaster genome proved the viability of shotgun sequencing approaches and paved the way for larger, more complicated genomes (Hales et al., 2015). The genomic tractability that made drosophilids attractive for this work has led to their continued widespread use as model organisms in the genomic era: the whole-genome sequencing of 12 Drosophila species (Clark et al., 2007) and the characterization of functional elements in Drosophila genomes (Roy et al., 2010) are prominent milestones in the history of modern genomics.

As it is a popular model system, an extensive collection of genomic resources exists for drosophilids today. Excluding genomes from this study, there are representative genome assemblies available on NCBI databases (GenBank and RefSeq) for about 75 different drosophilid species (Hotaling et al., 2021). About a third of these genomes are provided as chromosome-level scaffolds. Along with this diverse catalog of whole-genome sequences are collections of expression and regulation data (Chen et al., 2014; Roy et al., 2010), maps of constrained (i.e. functional) sequences inferred with comparative genomics tools (Stark et al., 2007), and population genomic data (e.g. Guirao-Rico and González, 2019; Lack et al., 2016; Signor et al., 2018). Well-studied D. melanogaster was among the first species to have high-quality genomes assembled for multiple individuals, revealing population variation in structural variants (Chakraborty et al., 2019; Long et al., 2018). Yet even with the intense scientific interest and effort thus far, only a small portion of the remarkably diverse drosophilids, a family which includes over 1600 described and possibly thousands of other undescribed species (O'Grady and DeSalle, 2018), is available for genomic study today.

There is much scientific potential to be unlocked by improving the catalog of genomic diversity within this group, and the simplification that long reads bring to the genome assembly process is key. Long reads have proved to be a way to quickly generate affordable yet high-quality genomes, in fact the cost of a highly contiguous and complete Drosophila assembly based on long-read sequencing was recently estimated to be about $1,000 US dollars (Miller et al., 2018; Solares et al., 2018), orders of magnitude less than the first D. melanogaster genome. While a number of studies have already used long reads to assemble the genomes of one or a few drosophilid species (Bracewell et al., 2019; Chakraborty et al., 2021; Comeault et al., 2020; Flynn et al., 2020; Hill et al., 2020; Mai et al., 2020; Miller et al., 2018; Paris et al., 2020; Rezvykh et al., 2021; Solares et al., 2018), a sequencing and genome assembly project at a scale similar to that of the large genome assembly consortia, especially without similar resources and funding, remains challenging even with the benefits of long reads. Yet, there continue to be rapid improvements to long-read sequencing that may alleviate some of these logistical challenges. Long-read sequencing costs have dropped significantly in the past few years as protocols, kits, and the underlying technology improves. Ultra-long (50–100 kb or longer) reads are obtainable with Oxford Nanopore (ONT) sequencing and under the right conditions should allow entire chromosomes to be fully assembled without additional time-consuming and costly scaffolding methods (e.g. Nurk et al., 2021). By simplifying the genome assembly process and reducing the cost of genome assembly even further, these techniques finally make it possible to assemble tens or hundreds of drosophilid genomes at a time.

Here, we present another step toward a comprehensive drosophilid genome dataset: a community resource of 101 de novo genome assemblies from 93 drosophilid species. These genomes were assembled using lines contributed by Drosophila researchers from across the world, and represent a diversity of ecologies and geographical distributions. We improve upon the Nanopore-based hybrid assembly (Nanopore plus Illumina) approach for Drosophila lines (Miller et al., 2018) to substantially increase the sequencing throughput contained in ultra-long reads while reducing overall costs. The contiguity, completeness, and quality of these genomes is assessed. We show that under ideal conditions, about two Drosophila lines (assuming an average 180 Mb genome) can be sequenced to at least 30× depth of coverage per ONT r9.4.1 (rev D) flow cell, at an approximate cost of 350 US dollars per line. Along with this manuscript and data, we provide a detailed Nanopore sequencing laboratory protocol specifically optimized for Drosophila lines, along with containerized computational pipelines. These genome assemblies and technical resources should facilitate the process of conducting large-scale genome projects in this key model clade and beyond.

Results and discussion

Taxon sampling

Our selection of species and strains for sequencing (Table 1) improves the geographic, ecological, and phylogenetic diversity of genomic data from the family Drosophilidae. Most (99 of 101) of the genome assemblies presented here are from 14 species groups in subgenera Drosophila and Sophophora of the subfamily Drosophilinae (Toda, 2020). One species of each of the genera Leucophenga and Chymomyza, both contained in less-studied sister subfamily Steganinae, have also been sequenced. We note some taxonomic inconsistencies arising from the paraphyly or polyphyly of certain drosophilid taxa (Finet et al., 2021; O'Grady and DeSalle, 2018; Yassin, 2013) but will make no attempt to address those issues here. The sequenced species originate from mainland and island locations in North America, Europe, Africa, and Asia; are distributed from northern (e.g. D. tristis, D.littoralis) to equatorial (e.g. D. bocqueti) latitudes; represent two independent transitions to leaf-mining herbivory (Scaptomyza and Lordiphosa); and for some species, like the pest Zaprionus indianus, represent reproductively isolated populations taken from throughout the range. For difficult to culture species, for instance Leucophenga varia and some Lordiphosa spp., only wild-caught flies were sequenced. Finally, we have sequenced lines in active research use. Additional genomic resources like gene expression or population data should be expected in the near future to accompany many of these assemblies. For species where multiple lines were assembled, we have selected a recommended line to use based on genome quality and denote this recommendation in Table 1.

Table 1. Species and strain information for all samples assembled for this work.

Note: Species group and subgroup information is taken from the NCBI Taxonomy Browser with slight modifications following O'Grady and DeSalle, 2018. Strain names along with corresponding NDSSC and Kyoto DGRC stock center numbers are provided to the best of our knowledge. See Supplementary file 1 and Supplementary file 6 for detailed information on samples and data. When multiple lines of a species are listed, * denotes the preferred assembly.

Subgenus	Group	Subgroup	Species	Sex	Strain name	NDSSC	Kyoto DGRC/ Ehime	Additional notes
Sophophora	melanogaster	melanogaster	D. melanogaster	MF	ISO-1 GENOME	14021-0231.36	NA	BDGP reference strain
			D. mauritiana	F	NA	14021-0241.01	NA	Miller et al., 2018
			D. simulans	F	NA	14021-0251.006	NA	Miller et al., 2018
			D. sechellia	F	NA	14021-0248.01	NA	Miller et al., 2018
			D. teissieri *	M	273.3	NA	NA
			D. teissieri	M	CT02	NA	NA
			D. yakuba	F	NA	14021-0261.01	NA	Miller et al., 2018
			D. erecta	F	NA	14021-0224.01	NA	Miller et al., 2018
		eugracilis	D. eugracilis	F	NA	14026-0451.02	NA	Miller et al., 2018
		suzukii	D. subpulchrella	M	L1	NA	NA
		suzukii	D. biarmipes	MF	361.0 iso1 l-11 GENOME strain 1	14023-0361.10	NA	modENCODE strain
		takahashii	D. takahashii	F	IR98-3 E-12201	NA	E-912201	inbred derivative of Ehime stock IR98-3
		ficusphila	D. ficusphila	F	631.0-iso1 l-10 GENOME	14025-0441.05	NA	modENCODE strain
		rhopaloa	D. carrolli	MF	KB866	NA	NA
			D. rhopaloa	MF	BaVi067 GENOME	14029-0021.01	E-24701	modENCODE strain
			D. kurseongensis	F	SaPa58	NA	NA
			D. fuyamai	F	KB-1217	14029-0011.01	NA
		elegans	D. elegans	F	HK0461.03 GENOME	14027-0461.03	NA	modENCODE strain
		suzukii	D. oshimai	M	MT-04	NA	NA
		montium	D. bocqueti	M	YAK3_mont-66	NA	NA
			D. sp aff chauvacae	M	mont_up-71	NA	NA
			D. jambulina	MF	st-2	14028-0671.01	NA
			D. kikkawai	F	561.0-iso4 l-10 GENOME	14028-0561.14	NA	modENCODE strain
			D. rufa	F	EH091 iso-C L_3	NA	914802	inbred derivative of Ehime stock EH091
			D. triauraria	F	NA	14028-0691.9	NA	Miller et al., 2018; previously mis-identified as D. kikkawai
		ananassae	D. malerkotliana pallens	F	palQ-isoG	NA	NA
			D. malerkotliana malerkotliana	MF	mal0-isoC	14024-0391.00	NA	inbred derivative of strain 14024-0391.00
			D. bipectinata	MF	4-4-2-3-1-1-1-1-1 BackUp	14024-0381.04	NA	Inbred derivative of NDSSC strain
			D. parabipectinata	MF	par2-isoB	14024-0401.02	NA	inbred derivative of strain 14024-0401.02 (now extinct)
			D. pseudoananassae pseudoananassae	F	Wau 125	NA	NA
			D. pseudoananassae nigrens	F	VT04-31	NA	NA
			D. ananassae	F	14024-0371.13	NA	NA	Miller et al., 2018
			D. varians	MF	CKM15-L1	NA	NA
			D. ercepeace	MF	164-14	14024-0432.00	NA
	obscura	obscura	D. ambigua	M	R42	NA	NA	isofemale strain from the wild
			D. tristis	M	D2	NA	NA	isofemale strain from the wild
			D. obscura	M	BZ-5	NA	NA	isofemale strain from the wild
			D. subobscura	M	Küsnacht	NA	NA	standard laboratory strain
		pseudoobscura	D. persimilis	F	NA	14011-0111.01	NA	Miller et al., 2018
		pseudoobscura	D. pseudoobscura	F	NA	14011-0121.94	NA	Miller et al., 2018
	willistoni	willistoni	D. willistoni (Uruguay) *	M	L-G3	14030-0811.17	NA
			D. willistoni	F	NA	14030-0811.00	NA	Miller et al., 2018
			D. paulistorum L06 *	M	(Heed) H66.1C	14030-0771.06	NA
			D. paulistorum L12	M	L12	14030-0771.12	NA
			D. tropicalis	M	(Heed) H65.2	14030-0801.00	NA
			D. insularis	M	jp01i	NA	NA	isofemale line from J. Powell
		bocainensis	D. sucinea	M	49.15	14030-0791.01	NA
		bocainensis	D. sucinea**	M	H176.10	14030-0761.01	NA	NDSSC strain is misidentified as D. nebulosa
	saltans	saltans	D. saltans	M	(Heed) H180.40	14045-0911.00	NA
		saltans	D. prosaltans	M	(Heed) H29.6	14045-0901.02	NA
		neocordata	D. neocordata	M	2536.7	14041-0831.00	NA
		sturtevanti	D. sturtevanti	F	H191.23	14043-0871.01	NA
	Lordiphosa	miki	L. clarofinis	MF	Guizhou062018LC	NA	NA	Line inbred for 2 generations in the lab before sequencing
			L. stackelbergi	MF	UCILTSSapporo052019LS	NA	NA	Pool of 50 wild-caught flies
			L. magnipectinata	MF	UCKTSapporo052019LM	NA	NA	Pool of 50 wild-caught flies
		fenestrarum	L. collinella	MF	UCKTSapporo052019LC	NA	NA	Pool of 30 wild-caught flies
			L. mommai	MF	MMSapporo052014LM	NA	NA
Drosophila	Zaprionus	vittiger	Z. nigranus	M	st01n	NA	NA	line derived from wild collection
			Z. camerounensis	M	jd01cam	NA	NA	isofemale line from J. David
			Z. lachaisei	M	jd01l	NA	NA	line derived from wild collection
			Z. vittiger	M	jd01v	NA	NA	isofemale line from J. David
			Z. davidi	M	jd01d	NA	NA	isofemale line from J. David
			Z. taronus	M	st01t	NA	NA	line derived from wild collection
			Z. capensis	M	jd01cap	NA	NA	isofemale line from J. David
			Z. gabonicus	M	jd01gab	NA	NA	isofemale line from J. David
			Z. indianus RCR04	M	RCR04	NA	NA
			Z. indianus 16GNV01	M	16GNV01	NA	NA
			Z. indianus BS02 *	M	BS02	NA	NA
			Z. indianus CDD18	M	CDD18	NA	NA
			Z. africanus	M	BS06	NA	NA
			Z ornatus	M	jd01o	NA	NA	isofemale line from J. David
		tuberculatus	Z. tsacasi	M	car7-4	NA	NA
		tuberculatus	Z. tsacasi *	M	jd01t	NA	NA	isofemale line from J. David
		inermis	Z. kolodkinae	M	jd01k	NA	NA	isofemale line from J. David
			Z. inermis	M	18BSZ10	NA	NA
			Z. ghesquierei	M	jd01ghe	NA	NA	isofemale line from J. David
	cardini	dunni	D. dunni	M	H254.21	15182-2291.00	NA
		dunni	D. arawakana	M	MONHI050227(B)-104	15182-2261.03	NA
		cardini	D. cardini	M	NA	15181-2181.03	917701
	funebris	funebris?	undescribed (Sao Tome mushroom)	M	st01m	NA	NA	undescribed species collected on mushroom, Sao Tome
	funebris	funebris	D. funebris	M	fst01	NA	NA	line derived from wild collection
	immigrans	immigrans	D. immigrans *	F	FK05-19	15111.1731.12	NA
		immigrans	D. immigrans kari17	M	kari17	NA	NA
		(incertae sedis)	D. pruinosa	M	iso-A1 l-9	NA	NA
		quadrilineata	D. quadrilineata	M	quad-TMU	NA	914402
	tumiditarsus		D. repletoides	M	ISZ-isoB I-10	NA	NA
	Scaptomyza	Scaptomyza	S. montana	MF	iso-CA-L1	NA	NA
		Scaptomyza	S. graminum	F	TMU-2019	NA	NA	30 wild-caught females
		Parascaptomyza	S. pallida	MF	iso-CA-L1	NA	NA
		Hemiscaptomyza	S. hsui	MF	iso-CA-L1	NA	NA
	HawaiianDrosophila	orphnopeza	D. sproati	MF	DKPTOMS02	NA	NA	Pool of wild-caught flies
		orphnopeza	D. murphyi	MF	DKPHETFM01	NA	NA	Flies from recently established but not inbred lab line
		grimshawi	D. grimshawi	F	NA	15287-2541.00	NA	Same line as caf1 genome
	virilis	virilis	D. virilis	F	NA	15010-1051.87	NA	Miller et al., 2018
			D. americana	M	3367.1	15010-0951.00	NA	Also called Anderson strain
			D. littoralis	M	Kilpisjärvi 1	NA	NA	Originally misidentified as D. ezoana (Lankinen 1986, J Comp Physiol A 159: 123-142)
	repleta	repleta	D. repleta	M	kari30	NA	NA
	repleta	mulleri	D. mojavensis	F	15081-1352.22	NA	NA	Miller et al., 2018
genus: Leucophenga			L. varia	M	nc01v	NA	NA	Sequenced single wild-caught fly, no amplification
genus: Chymomyza			C. costata	M	Sapporo	NA	NA

Open in a new tab

* denotes the genome of best quality when multiple assemblies are available for a species.

Near chromosome-scale assembly with ultra-long reads

We sequenced the fly samples using a ONT 1D ligation kit approach, replacing magnetic bead cleanups with size selective precipitation. This modified workflow is optimized for genomic DNA extractions from 15 to 30 whole flies, increases the yield of ultra-long reads relative to the standard ligation kit protocol, increases overall sequencing throughput, and significantly reduces the cost of library preparation. Sequencing runs varied with sample quality and type, and in general read lengths and throughput increased over the course of this work with improved iterations of the protocol. Under optimal conditions and with enough starting material (at least 2,000 ng of very high molecular weight DNA) to prepare at least three library loads (~1200–500 ng total prepared library, 350–500 ng per load), along with regular DNAse flushes to maintain yields, Nanopore sequencing runs following the supplied protocol should net 12–15 Gb of data per R9.4.1 flow cell with a read N50 greater than 20 kb, and about 30% of data in reads longer than 50 kb. We generated paired-end, 150 bp Illumina reads for most strains unless public datasets were available.

Deep (average 52×) sequencing coverage with a substantial fraction of ultra-long reads (Supplementary file 1) resulted in high-quality genome assemblies that were comparable to and often better than currently available reference genomes in terms of contiguity and completeness (Figure 1, Figure 1—figure supplement 1, Supplementary file 2). We chose Flye (Kolmogorov et al., 2019) as our assembler based on superior contiguity and favorable runtimes relative to Miniasm (Li, 2016) and Canu (Koren et al., 2017; Figure 1—figure supplement 2). To provide standardization for measures of contiguity, we estimated genome size for each assembly using long-read coverage over single-copy BUSCO loci (Supplementary file 2).

Figure 1. — (**A,B**) Assembly contiguity is compared to the *D. melanogaster* v6.22 reference genome (blue) as well as five recently published, highly contiguous Illumina assemblies (red lines, *D. birchii, D. bocki, D. bunnanda, D. kanapiae, D. truncata*; Bronski et al., 2020). (A) Nx curves, or the (y-axis) size of each contig when contigs are sorted in descending size order, in relation to the (x-axis) cumulative proportion of the genome assembly that is covered. (B) The distribution of contig N50, the size of the contig at which 50% of the assembly is covered. (C) Assembly completeness assessed by BUSCO v4.0.6 (Seppey et al., 2019). Note, *D. equinoxialis* was evaluated with BUSCO v4.1.4 due to an issue with v4.0.6. *L. stackelbergi* has >10% missing BUSCOs. Individual assembly summary statistics are provided in Supplementary file 2.

Figure 1—figure supplement 1. — (**A,B**) Assembly contiguity is compared to the *D. melanogaster* v6.22 reference genome (blue) as well as five recently published, highly contiguous Illumina assemblies (red lines, *D. birchii, D. bocki, D. bunnanda, D. kanapiae, D. truncata*; Bronski et al., 2020). (A) Nx curves, or the (y-axis) size of each contig when contigs are sorted in descending size order, in relation to the (x-axis) cumulative proportion of the genome assembly that is covered. (B) The distribution of contig N50, the size of the contig at which 50% of the assembly is covered. (C) Assembly completeness assessed by BUSCO v4.0.6 (Seppey et al., 2019). Note, *D. equinoxialis* was evaluated with BUSCO v4.1.4 due to an issue with v4.0.6. *L. stackelbergi* has >10% missing BUSCOs. Individual assembly summary statistics are provided in Supplementary file 2.

Of 101 total assemblies, 94 contain over 98% of the assembly in contigs larger than 10 kb, and both contig N50s and NG50s exceed 1 Mb for these genomes (Figure 1A, Figure 1B, Figure 1—figure supplement 3, Supplementary file 2). Assembly sizes were highly correlated with estimated genome sizes (Figure 1—figure supplement 4). In addition to meeting the megabase contig N50 standard for new genomes proposed by the Vertebrate Genomes Project (Rhie et al., 2021), these statistics show that most of the genome is present in the assembly in megabase-sized contigs. In other words, the assemblies are nearly at the chromosome level. For comparison, of the 76 representative drosophilid genomes that were previously available on NCBI (Hotaling et al., 2021), only 25 have an N50 greater than 1 Mb (Figure 1—figure supplement 1). Moreover, many of these highly contiguous NCBI genomes are scaffolded, an additional step that would have added a significant amount of time and additional expenses to this study. Even when DNA was extracted from pools of wild-caught flies or a single fly (Leucophenga varia) resulting in sub-optimal read lengths and output, the assembly was comparable to existing short read assemblies (Figure 1A, Figure 1B). High contiguity resulted in benchmarking universal single-copy ortholog (BUSCO) completeness (Seppey et al., 2019; Simão et al., 2015) in the range of 97–99+% for all but the three most fragmented genomes (Figure 1C). As with contiguity, the completeness of these genomes is comparable to reference genomes on NCBI (Figure 1—figure supplement 1).

Estimates of sample diversity

We have utilized a variety of fly samples, from highly inbred lab lines to wild-caught flies, for genome assembly. We therefore sought to quantify the level of diversity inherent to each sample and use variant calls to estimate the error rate for each assembly. Long and short reads (if available) were mapped separately to each finished genome and variant calling was performed with PEPPER-Margin-DeepVariant (Shafin et al., 2021) for long reads and BCFtools (Danecek et al., 2021; Li, 2011) for short reads. After quality filtering and masking genomic regions annotated as repeats, the counts of single nucleotide polymorphisms (SNPs), indels, and the fraction of sites with a non-reference SNP were computed (Figure 2, Supplementary file 3). Note, when short reads were not from the same strain as used for the assembly, short read polishing was used to only correct indels, and called SNPs will not accurately represent the variation in the sample that was sequenced with Nanopore. Also note that SNP calls from Nanopore data should be relatively accurate but indel calls will not (Shafin et al., 2021).

Figure 2. — Per-site SNP heterozygosity (number of heterozygous SNPs/number of callable sites) is plotted for each of the 101 assembled lines. Blue dots represent heterozygosity estimates from Nanopore reads with PEPPER-Margin-DeepVariant (Shafin et al., 2021). Orange dots represent heterozygosity estimates from short reads with BCFtools (Li, 2011). The genomes on the right are for species that did not have available short-read data. Numerical values for these estimates are provided in Supplementary file 4.

Figure 2—figure supplement 1. — Per-site SNP heterozygosity (number of heterozygous SNPs/number of callable sites) is plotted for each of the 101 assembled lines. Blue dots represent heterozygosity estimates from Nanopore reads with PEPPER-Margin-DeepVariant (Shafin et al., 2021). Orange dots represent heterozygosity estimates from short reads with BCFtools (Li, 2011). The genomes on the right are for species that did not have available short-read data. Numerical values for these estimates are provided in Supplementary file 4.

Large variation in sample diversity over several orders of magnitude was observed. Estimated SNP heterozygosity, the number of heterozygous SNPs divided by the number of callable sites, ranged from 0.00035% to 1.1% from long reads and 0.0015% to 2.1% from short reads, and heterozygosity estimated from long reads was systematically lower than that from short reads, particularly when sample diversity was high (Figure 2, Figure S6). Qualitative patterns of heterozygosity generally tracked the history of the samples (e.g. the highly inbred reference strains had very low diversity). Conditioning on datasets where both long and short reads were generated from the same sample, heterozygosity estimates from both types of reads were positively correlated (Pearson correlation R²=0.50, p=1.13×10–12). If we ignore Lordiphosa, the group with wild-caught or recently collected samples that was consequently the most challenging to assemble, this correlation is greatly increased (Pearson correlation R²=0.81, p<2.2×10–16). Interestingly, we did not observe a significant relationship (p=0.30) between estimated heterozygosity and assembly contiguity (Figure 2—figure supplement 1). The number of heterozygous non-reference variants almost always exceeded the number of homozygous variants (Supplementary file 3), as would be expected from residual diversity in the sequenced lines.

Estimates of sequence quality

Next, we estimated the genome-wide error rates in our assemblies using both the variant calls obtained previously and a reference-free method (Supplementary file 4). For the first approach, Phred-scaled (Ewing et al., 1998) consensus quality (QV) was estimated by assuming all sites with a non-reference variant were an error. The error rate was then computed by dividing the number of sites with at least one non-reference variant by the total number of callable bases. As expected from the patterns of heterozygosity estimated from long and short reads, there was a large amount of variability in quality scores. Estimates from short reads ranged from QV17 to QV45 and from long reads were slightly higher, from QV19 to QV52 (Supplementary file 4).

This method is likely to be biased by assembly features that affect the quality of read mapping, for example, we remove sequences annotated as repeats when filtering the variant calls. To address this bias, we employed the reference-free approach implemented in Merqury (Rhie et al., 2020) for the 94 assemblies which had some kind of short-read data available (Figure 3A, Supplementary file 4). Estimated quality scores ranged from QV16 to QV40, and once again, samples for which reads from a different strain or a genetically diverse sample (i.e. wild samples or recent isolates) were used had the lowest estimated QV. Merqury-estimated QV was on average higher than consensus quality estimated by the variant calling methods, but the relative ranking of QV estimates remained largely consistent with QV based on short-read (Spearman’s ρ=0.642, p<2.2e-16) and long-read (Spearman’s ρ=0.684, p<2.2e-16) variant calls.

Figure 3. — (A) Genome-wide, Phred quality scores estimated with the reference-free, k-mer based approach implemented in Merqury (Rhie et al., 2020). Merqury requires a short-read dataset to perform the evaluation. Filled circles represent QV estimates with short-read data from the same strain used for Nanopore sequencing, and empty circles denote estimates using short-read data from a different strain than used for Nanopore sequencing. (**B, C, D**) Phred quality score cutoffs for the bottom 10th percentile of 100 kb genomic windows, as evaluated with a reference-based approach, in coding sequences only. Quality scores are capped at 60 for visualization purposes. At least 90% of 100 kb windows are this accurate. Only Nanopore assemblies with an NCBI RefSeq genome counterpart of the same strain were evaluated. Accuracy is shown for SNVs (B), insertions (C), and deletions (D) separately. Additional details on quality score estimates are provided in Figure 3—figure supplement 1 and Supplementary file 4.

Figure 3—figure supplement 1. — (A) Genome-wide, Phred quality scores estimated with the reference-free, k-mer based approach implemented in Merqury (Rhie et al., 2020). Merqury requires a short-read dataset to perform the evaluation. Filled circles represent QV estimates with short-read data from the same strain used for Nanopore sequencing, and empty circles denote estimates using short-read data from a different strain than used for Nanopore sequencing. (**B, C, D**) Phred quality score cutoffs for the bottom 10th percentile of 100 kb genomic windows, as evaluated with a reference-based approach, in coding sequences only. Quality scores are capped at 60 for visualization purposes. At least 90% of 100 kb windows are this accurate. Only Nanopore assemblies with an NCBI RefSeq genome counterpart of the same strain were evaluated. Accuracy is shown for SNVs (B), insertions (C), and deletions (D) separately. Additional details on quality score estimates are provided in Figure 3—figure supplement 1 and Supplementary file 4.

While these estimates showed our genomes to mostly fall below the often-recommended QV40 threshold for reference genomes (Koren et al., 2019; Rhie et al., 2021), there are many reasons to expect that sequence quality in certain regions of the genome will be far better than the average. As expected, we found that QV estimates were particularly low when short-read data from a different sample was used for the estimation, as any true variation between strains will inflate the error rate. Because we sequenced pools of flies, residual polymorphism will be found in the data even when long and short reads are sampled from the same pool of flies. In these cases QV might be considered as a lower bound estimate of the true accuracy of the assembly. Additionally, complex coding sequences are likely to be far more accurate than other regions of the genome, like repeats, due to better short-read mapping. The single genome-wide estimates of QV we report obscure this variation.

Nanopore-based assemblies are highly accurate in coding regions

For these reasons, we found it critical to further examine how errors are distributed in Nanopore assemblies. Of particular concern is the accuracy of coding sequences. Gene annotation is an important and obvious next step after assembling a new genome, but Nanopore sequences are known to systematically contain indels in homopolymer runs that cannot be called accurately when a run exceeds the size of the nanopore reader head. Indel disruptions to otherwise highly accurate coding sequences would have a disproportionately large negative impact on protein prediction (Watson and Warr, 2019). On the other hand, it is likely that coding sequences are generally more accurate than the rest of the genome since short-read mapping is generally more reliable there. In theory, most exons should be free of errors somewhere between a genome-wide quality of QV30 to QV40 (Koren et al., 2019), but many of our assemblies do not appear to reach this benchmark.

Reference-based quality assessments were used to better understand how error rates vary across different genomic elements. We downloaded the 8 NCBI RefSeq genome assemblies for which we had a Nanopore genome of the same species and strain: D. biarmipes, D. elegans, D. ficusphila, D. grimshawi, D. kikkawai, D. melanogaster, D. mojavensis, and D. rhopaloa. Using the ONT Pomoxis software, we aligned each Nanopore assembly to its corresponding reference genome and estimated QV in non-overlapping 100 kb windows, using the entire sequence, then only coding sequences, introns, intergenic regions, and repeats, using gene and repeat definitions provided through NCBI RefSeq. All differences between query and reference assemblies were considered to be errors.

As expected, we found that sequence accuracy varied greatly within each genome assembly (Figure 3—figure supplement 1). Mean genome-wide QV ranged from QV15 to QV24 while median QV across the 100 kb windows ranged from QV14 to QV36. When looking only at coding sequences, mean QV ranged from QV23 to QV29, while the median window accuracy, with the exceptions of D. grimshawi (QV25) and D. rhopaloa (QV30), indicated complete identity (>QV50) between assembly and reference. For D. grimshawi and D. rhopaloa, SNVs were the primary contributor to the error rate and the number of indels was similar to the other genomes (median QV(indel)>50). Sequence accuracy was lower when looking at introns, intergenic regions, and repeats, in that order. However, regardless of the genomic element type, median QV across the windows always exceeded mean QV, often by more than QV10, or an order of magnitude difference in the error rate. In other words, differences between Nanopore and reference assemblies were clustered heavily into a few genomic regions, and most coding sequences were very accurate despite the seemingly high mean error rate (Figure 3B, Figure 3C, Figure 3D). Further caution is warranted in the interpretation of these quality scores: we have assumed that all differences between our Nanopore-based genomes and the reference genomes are errors in the Nanopore assembly, rather than errors in the reference, or true differences between the two sequenced samples. We will shortly show that reference-based comparisons might be heavily biased against the Nanopore assemblies.

To better understand the nature of putative indel errors in coding sequences, we focused on the D. melanogaster reference strain, where we have the best information about the genome from multiple independent high-quality assemblies (Kim et al., 2014; Koren et al., 2017; Solares et al., 2018). While D. melanogaster is a best-case scenario for genome assembly with a fly line, we think it reasonable to expect errors in other assemblies, for which we have utilized the same genome assembly workflow, to be similar in nature. Across the 22,209,264 bp of our D. melanogaster genome that aligned to reference coding sequences, our assembly contained 15 insertions and 17 deletions in 21 out of 13,913 (0.15%) queried protein-coding genes, with a total of 10,092 inserted and 46 deleted base pairs relative to the reference. All deletions (15 out of 15) were under 50 bp, 8 out of 15 insertions were under 50 bp, and the remaining 7 out of 15 insertions ranged from 120 bp to 4410 bp (Figure 3—figure supplement 2, Supplementary file 5). These larger insertions account for nearly all (99.3%) of the coding sequence differences between our genome and the reference. There is a clear, disproportionate impact of these large insertions in an otherwise nearly identical protein-coding genome.

We followed up on each of these 32 coding indels through manual curation with the genome browser IGV (Robinson et al., 2011b). Using the Release 6 D. melanogaster genome (Hoskins et al., 2015) as the reference, we aligned Nanopore and Illumina reads, a different Nanopore-based de novo assembly of the reference strain (Solares et al., 2018), a PacBio-based de novo assembly (Kim et al., 2014; Koren et al., 2017), and our assembly. We were particularly interested in the two other long-read assemblies, as we wondered if they might independently support any of the large variants in our assembly. RepeatMasker annotations for the Nanopore-based assembly were lifted over into Release 6 coordinates to see if these indels overlapped with a repetitive element.

This manual curation process revealed that the coding indels, in addition to being exceedingly rare, could be straightforwardly explained by regions of poor short read mapping and the presence of a duplicate contig in the assembly (Supplementary file 5). A series of large and small indels, including four out of the five insertions longer than 100 bp, overlapped a tandem repeat in genes CR44666, Mu68Ca, and Mu68E. While short reads mapped poorly to this region, limiting our ability to determine accuracy locally, long reads spanning the entire region and the two other long-read assemblies supported the large insertions. The remaining long (1414 bp) insertion was similarly supported by long reads and the other assemblies, but did not overlap with a repeat. Again, these insertions account for more than 99% of the indel differences between our genome and the reference. The remaining indels occurred in either repetitive regions (simple repeats and long interspersed nuclear element retrotransposons), in homopolymer runs in regions with poor short read mapping, or along a single contig that appeared to be a short duplicated segment of chromosome 2L. The other contig was error-free. All indels occurred on contigs with poor short-read mapping, suggesting they were a consequence of locally ineffective short read polishing, but also that sensible filtering based on short read depth or map quality would prevent these issues from propagating into downstream analyses. Importantly, these results suggest that reference-based quality analyses can be heavily biased against long-read assemblies and further support our caution against a naive projection of genome-wide quality score estimates onto coding regions.

A comparative genomics resource

To demonstrate the potential this dataset holds for the study of genome evolution and chromosome organization, we revisit a classic result with our highly contiguous assemblies. Although the ordering of genes in drosophilid chromosomal (Muller) elements has been extensively shuffled throughout ~53 million years of evolution (Suvorov et al., 2021), the gene content of each element remains largely conserved (Bracewell et al., 2019; Ranz et al., 2001; Sturtevant and Novitski, 1941). To examine synteny in our assemblies, many of which contain several contigs tens of megabases in length, we constructed an undirected graph using single-copy orthologous markers (i.e. BUSCOs). The number of times two markers were connected by assemblies determined the weight of the graph’s edges. A graph layout method was applied to spatialize (map) these relationships, clustering together BUSCOs that are frequently connected in the assemblies. We found that BUSCOs formed six major clusters following the D. melanogaster chromosome arm on which they are found, consistent with the expected conservation of gene content in Muller elements across drosophilids (Figure 4). Furthermore, the lack of a clear order within groups is consistent with extensive shuffling within Muller elements. This demonstrates that our dataset can be used for studies of genome evolution. New reference-free, whole-genome alignment methods (Armstrong et al., 2020) should substantially facilitate more detailed comparative analyses.

Figure 4. — Each node in this graph represents an orthologous marker corresponding to single-copy orthologs annotated by BUSCOv4 (Seppey et al., 2019). An edge between two nodes represents the number of times that BUSCO pair is directly connected within an assembly. Each BUSCO is colored by the chromosome arm in *D. melanogaster* that it is found on. The ForceAtlas2 (Jacomy et al., 2014) graph layout algorithm was used for visualization.

Repeat content

A large number of genome assemblies enables comparative analysis of repeat variation against a wide range of genome assembly sizes (140–450 Mb), for example the independent expansions of satellite repeats in D. grimshawi or retroelements in D. paulistorum, D. bipectinata, or D. subpulchrella (Figure 5). Within our dataset alone, RepeatMasker annotations reveal large variation in repeat content among drosophilids (Figure 5). No correlation exists between assembly contiguity and repeat content (Figure 5—figure supplement 1), suggesting long-read sequencing overcomes many of the challenges to drosophilid genome assembly posed by repetitive sequences. Additionally, we observe a positive relationship between the size of repetitive sequences and non-repetitive sequences, suggesting that genome size is influenced by expansions and contractions of both portions of the genome (Figure 5—figure supplement 2). Some discretion is warranted in the interpretation of these results. Repeats are likely to be better annotated in genomes from well-studied species groups, since they are more likely to be well-characterized in the repeat databases we used. Nevertheless, the high continuity of these assemblies should allow for the proper identification of new transposable elements in the genomes and allow for the analyses of transposable element evolution at the level of individual transposable elements or transposable element families in a way that is not feasible with more fragmented genome assemblies (Clark et al., 2007).

Figure 5. — For each species, the proportion of each genome annotated with a particular repeat type is depicted. Species relationships were inferred by randomly selecting 250 of the set of BUSCOs (Seppey et al., 2019) that were complete and single-copy in all assemblies. RAxML-NG (Kozlov et al., 2019) was used to build gene trees for each BUSCO then ASTRAL-MP (Yin et al., 2019) to infer a species tree. Repeat annotation was performed with RepeatMasker (Smit et al., 2013) using the Dfam 3.1 (Hubley et al., 2016) and RepBase RepeatMasker edition (Bao et al., 2015) databases. ASTRAL local posterior probabilities are reported at each node.

Figure 5—figure supplement 1. — For each species, the proportion of each genome annotated with a particular repeat type is depicted. Species relationships were inferred by randomly selecting 250 of the set of BUSCOs (Seppey et al., 2019) that were complete and single-copy in all assemblies. RAxML-NG (Kozlov et al., 2019) was used to build gene trees for each BUSCO then ASTRAL-MP (Yin et al., 2019) to infer a species tree. Repeat annotation was performed with RepeatMasker (Smit et al., 2013) using the Dfam 3.1 (Hubley et al., 2016) and RepBase RepeatMasker edition (Bao et al., 2015) databases. ASTRAL local posterior probabilities are reported at each node.

Next steps

We have built an open resource of 101 nearly chromosome-level drosophilid genome assemblies, adding to the rapidly growing number of high-quality genomes available for this model system (Hotaling et al., 2021; Suvorov et al., 2021). We envision this dataset being used to address a large number of outstanding questions entailing large comparative analyses among species, including the comparison of population genomic data between a large number of species, providing unprecedented resolution to investigate fundamental questions about the evolutionary process. In addition, we provide detailed laboratory and computational workflows that we hope will provide a jumping off point for future genome assembly projects in drosophilids or other taxa. While we hope this to already be a valuable resource to the scientific community, we acknowledge there is much to be done to build upon the resource and to improve its usability.

Despite our best efforts to improve species diversity, both the species we sequenced and the drosophilid genomes available today are significantly biased towards well-studied, easy-to-maintain species already in use for scientific research. Reducing sampling bias, with respect to phylogenetic diversity, geographic distribution, and ecology should be a goal of future genome assembly projects in this group. The high input DNA requirement for PCR-free long-read sequencing is a major limitation of our assembly workflow in this context. Our protocol requires a DNA extraction from multiple flies, ideally from an inbred line to minimize genetic diversity in the sample used for assembly. High diversity in the sample usually results in a fragmented assembly with many duplicated sequences, and while these issues can be addressed with computational tools, the quality of the final assembly is still affected. However, some species, for instance the Lordiphosa spp. or Hawaiian Drosophila we sequenced, cannot be quickly raised in the lab on standard media and thus cannot easily be inbred like other drosophilids. Many other species are simply understudied and sample availability is limited to a few flies collected from the wild and possibly preserved in ethanol for many years. Methods for assembling genomes with small quantities of DNA from single insects (Adams et al., 2020; Kingan et al., 2019; Schneider et al., 2021) or dealing with degraded specimens from older collections will be particularly important as the scope of future work expands beyond stock center and laboratory lines.

Some of these sequencing challenges will be better addressed by new technology and techniques. While we hesitate to make specific recommendations due to the rapidly changing landscape of long-read sequencing and genome assembly methods, there are a few clear ways in which many recently assembled long-read genomes can be improved. Even in the short time since we performed the sequencing for this work, there have been remarkable improvements to library preparation workflows, the accuracy of base calling algorithms, and assembly tools. At a minimum, we plan to iteratively update these assemblies using newer base calling methods to maximize the usefulness of the dataset.

This alone is unlikely to future-proof Nanopore R9 flow cell-based assemblies when the ultimate goal is to build genomes that are free of errors, and we recommend that a genome assembly project initiated today look beyond a Nanopore and Illumina approach. There is ample room to reduce the per-assembly cost while improving both contiguity and accuracy. The current major obstacle to high genome-wide accuracy is the difficulty of calling bases accurately in homopolymer runs combined with the limitations of short reads for correcting these errors when they occur in genomic regions with poor short read mappability. One new strategy to address this is to generate supplementary lower coverage data from high fidelity long read sequencing, for instance with PacBio HiFi (Nanopore versions are currently in development). New polishing tools are specifically designed to polish Nanopore assemblies with higher-fidelity reads (e.g. Shafin et al., 2021) and users should see greatly improved overall sequence accuracy.

This kind of hybrid long-read assembly approach may prove to be even more efficient than the assembly workflow we have presented. Interestingly, we find that high contiguity can be achieved even with minimal (10–20×) coverage of moderately long (read N50 25 kb) Nanopore reads (Figure 6). Similar coverage with even longer reads could serve as a cheap way to generate almost chromosome-level contigs, which will then be polished with higher fidelity long reads or Illumina reads. Ultra-long Nanopore sequencing is also significantly more accessible than before. Recently (as of March 2021), Oxford Nanopore and Circulomics released new ultra-long sequencing kits that, under ideal conditions, allows users to perform ultra-long Nanopore sequencing runs where read N50s exceed 100 kb while nearly doubling overall flow cell throughput compared to the sequencing runs performed for this study. Further cost savings should be possible if sequencing is done with the ONT PromethION or PacBio CLR, depending on the scale of the project. Both technologies have a lower per-base cost than MinION sequencing and similarly long reads can be obtained.

Figure 6. — The NGx curve is shown for *Drosophila jambulina* assemblies at varying levels of coverage. The length of the assembly with the full data is assumed to be the genome size. Read sets used for each assembly were obtained by randomly downsampling the basecalled reads (read N50 ~27.5 kb) to varying (5× to 30×) depth of coverage. Proportionally, these read sets contain ~55% of total sequenced bases in reads longer than 25 kb, ~25% of bases in reads longer than 50 kb, and ~7% of bases in reads longer than 100 kb. Near chromosome scale assemblies (N50>20Mb) were achievable even at 15× to 20× depth with this read length distribution. This corresponds to approximately 8× to 10× depth in reads longer than 25 kb.

Finally, we are in the process of improving the utility of this resource by generating a suite of comparative genomics tools and annotations to be released in the upcoming months. Specifically, we are utilizing Progressive Cactus (Armstrong et al., 2020), a reference-free whole-genome aligner that is designed to be scalable to modern genomic datasets and that has already been applied to hundreds of mammal and bird genomes generated by the Zoonomia (Zoonomia Consortium et al., 2020) and Bird 10K (Feng et al., 2020) projects. These alignments will be used to create sequence conservation maps (Hickey et al., 2013; Pollard et al., 2010; Siepel et al., 2005), the precision of which should be close to single nucleotide resolution given the large number of drosophilid genomes that are now available. While ultimately RNA-seq across all species will be needed for annotation, we plan to quickly generate the first round of gene annotations using comparative annotation tools. For new assemblies where a previously annotated reference genome is available, LiftOff (Shumate and Salzberg, 2020) provides a way to quickly transfer annotations to a new genome. For the more challenging task of gene annotation in species that do not already have a well-annotated reference, we are using the Comparative Annotation Toolkit (Fiddes et al., 2018), software to perform first-pass annotations assisted by homology information from the Progressive Cactus alignment. New RNA-seq data will be generated for select species in clades without a well-annotated member (e.g. Zaprionus). These tools will provide a framework for anyone to apply iterative improvements as new data become available.

Reproducibility

Detailed laboratory protocols, computational pipelines, and computational container recipes are provided as a reference and to maximize reproducibility. The protocol is publicly available at Protocols.io and pipeline scripts along with associated compute containers are provided in a public GitHub repository. See Materials and methods for additional details on compute containers, accession numbers, and web links to these resources.

Materials and methods

Key resources table.

Reagent type (species) or resource	Designation	Source or reference	Identifiers	Additional information
Strain, strain background (Drosophila spp. and relatives)	See Table 1 and Supplementary files 1–6 for sample information, strain designations, stock center line identifiers (when applicable), biomaterial provider, and NCBI accession numbers.
Commercial assay or kit	Blood and Cell Culture DNA Mini Kit	Qiagen	cat # 13323
Commercial assay or kit	Ligation Sequencing Kit	Oxford Nanopore	SQK-LSK109	Superseded by SQK-LSK110
Commercial assay or kit	Flow cell wash kit	Oxford Nanopore	EXP-WSH003	Superseded by EXP-WSH004
Commercial assay or kit	Short Read Eliminator kit	Circulomics	SKU # SS-100-101-01
Commercial assay or kit	Companion Module for ONT Ligation Sequencing	NEBNext	cat # E7180S
Commercial assay or kit	Nextera XT DNA Library Preparation Kit	Illumina	cat # FC-131–1002	Superseded by version 2
Commercial assay or kit	Kapa HyperPrep Kit	Roche	cat # KK8502
Software, algorithm	Flye	Kolmogorov et al., 2019	2.6
Software, algorithm	Canu	Koren et al., 2017	1.8
Software, algorithm	Miniasm	Li, 2016	0.3
Software, algorithm	Guppy	Oxford Nanopore	3.2.4
Software, algorithm	Medaka	Oxford Nanopore	0.9.1
Software, algorithm	Minimap2	Li, 2016	2.17
Software, algorithm	SAMtools	Li et al., 2009	1.12
Software, algorithm	Racon	Vaser et al., 2017	1.4.3
Software, algorithm	BUSCO	Simão et al., 2015	3.0.2
Software, algorithm	BUSCO	Seppey et al., 2019	4.0.6
Software, algorithm	Purge_haplotigs	Roach et al., 2018	1.1.1
Software, algorithm	npScarf	Cao et al., 2017	1.9-2b
Software, algorithm	Pilon	Walker et al., 2014	1.23
Software, algorithm	BLAST	Altschul et al., 1990	2.10.0
Software, algorithm	SPAdes	Bankevich et al., 2012	3.11.1
Software, algorithm	FMLRC	Wang et al., 2018	1.0.0
Software, algorithm	LINKS	Warren et al., 2015	1.8.7
Software, algorithm	RepeatMasker	Smit et al., 2013	4.1.0
Software, algorithm	Dfam repeat databse	Hubley et al., 2016	3.1	Library for RepeatMasker
Software, algorithm	RepBase RepeatMasker edition	Bao et al., 2015	20181026	Library for RepeatMasker
Software, algorithm	cross_match	Green, 2009	1.090518
Software, algorithm	Tandem Repeat Finder	Benson, 1999	4.0.9
Software, algorithm	Bioawk	Li, 2017	1.0
Software, algorithm	GenomeScope	Vurture et al., 2017	1.0.0
Software, algorithm	Jellyfish	Marçais and Kingsford, 2011	2.2.3
Software, algorithm	Sambamba	Tarasov et al., 2015	0.8.0
Software, algorithm	PEPPER-Margin-Deepvariant	Shafin et al., 2021	0.4
Software, algorithm	BCFtools	Li, 2011	1.12
Software, algorithm	Merqury	Rhie et al., 2020	1.3
Software, algorithm	Pomoxis	Oxford Nanopore	0.3.7
Software, algorithm	bedtools	Quinlan and Hall, 2010	2.30.0
Software, algorithm	HALtools	Hickey et al., 2013	2.1
Software, algorithm	Integrative Genomics Viewer	Robinson et al., 2011b	2.9.4
Software, algorithm	MAFFT	Katoh and Standley, 2013	7.453
Software, algorithm	RAxML-NG	Kozlov et al., 2019	0.9.0
Software, algorithm	ASTRAL-MP	Yin et al., 2019	5.14.7
Software, algorithm	ForceAtlas2	Jacomy et al., 2014		Implemented in R package https://github.com/analyxcompany/ForceAtlas2
Software, algorithm	ape	Paradis and Schliep, 2019	5.4.1	R package
Software, algorithm	Docker	docker.com
Software, algorithm	Singularity	sylabs.io

Open in a new tab

Taxon sampling and sample collection

The selection of species used for this study was driven by several key objectives. First, we aimed to provide data for ongoing research projects. Second, we aimed to supplement existing genomic data, both as a benchmarking resource against well-studied references (e.g. D. melanogaster) and to provide a technological update to some older assemblies (Roy et al., 2010). Third, we aimed to increase the phylogenetic and ecological diversity of publically available Drosophila genome assemblies.

In most cases, genomic DNA was collected from lab-raised flies, which were either derived from lines maintained at public Drosophila stock centers and individual labs or, in a few cases, from F1 or F2 progeny of flies recently collected in the wild. We collected specimens from the wild with standard fruit or mushroom-baited traps, sweep netting, and aspiration. We established isofemale lines from individual females collected using these baits unless otherwise specified (Supplementary file 1). For species difficult to culture in the lab (all Lordiphosa spp. except Lo. clarofinis, D. sproati, D. murphyi, Le. varia, S. graminum), either wild-caught flies or flies from a transient lab culture were used. In accordance with domestic and international shipping laws, these flies were either fixed in ethanol before transport (Lordiphosa spp., D. subobscura, D. obscura, C. costata, D. littoralis, D. tristis, D. ambigua) or transported with permits (P526P-15–02964 to D. Matute, P526P-20–02787 and P526P-19–01521 to A. Kopp, and Hawaii State permit I1302 to D. Price).

Of 101 total assemblies, we include 13 genomes assembled with re-analyzed sequences from Miller et al., 2018; 60 genomes from stock center lines or established lab cultures; 22 genomes from lab-raised flies derived from recent wild collections; and six genomes from wild-caught flies. Of note, 6 Zaprionus lines used in this study (Z. africanus, Z. indianus, Z. tsacasi, Z. nigranus, Z. taronus) were assembled by Comeault et al., 2020, but updated higher contiguity assemblies are provided with this manuscript with the exception of Z. indianus line 16GNV01 (see ‘Alternative hybrid assembly process’ section below). Details on each sample including (if available) line designations and collection information, are provided in Table 1 and Supplementary file 6.

DNA extraction and nanopore sequencing

A high molecular weight (HMW) genomic DNA (gDNA) extraction and ONT library prep was performed for each sample, with slight variation in the protocol through time and to deal with differences in sample quality or preservation. Here, we briefly describe a recommended general protocol for HMW gDNA extraction and library prep from 15 to 30 flies. This protocol is sufficient to reproduce all results from this manuscript at the same or higher levels of data quality. Detailed step-by-step instructions are provided at Protocols.io (see Data availability). We note one exception made necessary by sample availability and shipping laws. Scaptomyza graminum gDNA was extracted by using the Qiagen Blood and Cell Culture DNA Mini Kit (Qiagen, Germantown, MD) from 30 unfrozen flies and prepared with the ONT LSK109 kit (Oxford Nanopore, Oxford, UK) without any modifications to the manufacturer's instructions.

Genomic DNA was prepared from about 30 flash frozen or ethanol fixed adult flies. For non-inbred samples, we tried to use 15 flies or less to minimize the genetic diversity of the sample. In the absence of amplification, about 1.5–3 μg of input DNA is needed to prepare 3–4 library loads with the ONT LSK109 kit. Sufficient input DNA is particularly important when selecting for longer reads. Ethanol preserved samples were soaked in a rehydration buffer (400 mM NaCl, 20 mM Tris-HCl pH 8.0, 30 mM EDTA) for 30 min at room temperature (~23°C), dabbed dry with a Kimwipe, then frozen for 1 hr at −80°C before extraction. Frozen flies were ground in 1.5 mL of homogenization buffer (0.1M NaCl, 30 mM Tris HCl pH 8.0, 10 mM EDTA, 0.5% Triton X-100) with a 2 mL Kimble (DWK Life Sciences, Millville, NJ) Kontes Dounce homogenizer. The homogenate was centrifuged for 5 min at 2000 ×g, the supernatant discarded by decanting, and the pellet resuspended in 100 μL of fresh homogenization buffer. This mixture was then added to a tube with 380 μL extraction buffer (0.1M Tris-HCl pH 8.0, 0.1M NaCl, 20 mM EDTA) along with 10 μL of 20 mg/mL Proteinase K (Thermo Fisher Scientific, Waltham, MA), 10 μL SDS (10% w/v), and 2 μL of 10 mg/mL RNAse A (Millipore Sigma, Hayward, CA). This tube was incubated at 50°C for 4 hr, with mixing at 30–60 min intervals by gentle inversion.

High-molecular-weight gDNA was purified with a standard phenol-chloroform extraction. The lysate was extracted twice with an equal volume of 25:24:1 v/v phenol chloroform isoamyl alcohol (Thermo Fisher Scientific, Waltham, MA) in a 2 mL light phase lock gel tube (Quantabio, Beverly, MA). Next, the aqueous layer was decanted into a fresh 2 mL phase lock gel tube then extracted once with an equal volume of chloroform (Millipore Sigma, Hayward, CA). The use of the phase lock gel tube reduces DNA shearing at this stage by minimizing pipette handling. HMW DNA was precipitated by adding 0.1 vol of 3M sodium acetate and 2.0–2.4 volumes of cold absolute ethanol. Gentle mixing resulted in the precipitation of a white, stringy clump of DNA, which was then transferred to a DNA LoBind tube (Eppendorf, Hamburg, Germany) and washed twice with 70% ethanol. After washing, the DNA was pelleted by centrifugation and all excess liquid removed from the tube. The pellet was allowed to air dry until the moment it became translucent, resuspended in 65 μL of 1× Tris-EDTA buffer on a heat block at 50°C for 60 min, then incubated for at least 48 hr at 4°C. After 48 hr, the viscous DNA solution was mixed by gentle pipetting with a P1000 tip. This controlled shearing step encourages resuspension of HMW DNA and improves library prep yield. DNA was quantified with Qubit (Thermo Fisher Scientific, Waltham, MA) and Nanodrop (Thermo Fisher Scientific, Waltham, MA) absorption ratios were checked to ensure 260/280 was greater than 1.8 and 260/230 was greater than 2.0.

The sequencing library was prepared following the ONT Ligation Sequencing Kit (SQK-LSK109) protocol, with two important modifications. First, we started with approximately 3 μg of input DNA, three times the amount recommended by the manufacturer. Second, we utilized a form of size-selective polymer precipitation (Paithankar and Prasad, 1991) with the Circulomics Short Read Eliminator (SRE) buffer (Circulomics, Baltimore, MD) plus centrifugation to isolate DNA instead of magnetic beads. We found this to be necessary because magnetic beads irreversibly clumped with viscous HMW gDNA, decreasing library yield and limiting read lengths. The manner in which this was performed was specific to the cleanup step. After the end-prep/repair step (New England Biolabs, Ipswich, MA), the SRE buffer was used according to the manufacturer’s instructions. After adapter ligation, DNA was pelleted by centrifuging the sample at 10,000×g for 30 min without the addition of any reagents, since DNA readily precipitated upon addition of the ligation buffer. Ethanol washes were avoided past this step since ethanol will denature motor proteins in the prepared library. Instead, the DNA pellet was washed with 100 μL SFB or LFB (interchangeably) from the ligation sequencing kit instead of 70% ethanol. If library yield was sufficient (>50 ng/μL), the Circulomics SRE buffer was used for a final round of size selection, replacing the ethanol wash with LFB/SFB as described above. Of note, a cheaper and open-source alternative made with polyethylene glycol MW 8000 (PEG 8000), although less effective at size selection, to the SRE buffer is described by Tyson, 2020 (dx.doi.org/10.17504/protocols.io.7euhjew). A 1:1 dilution of the PEG 8000 solution described in that protocol can be substituted for SFB or LFB in the washing steps described above.

The typical yield of a library prepared in this manner is in the range of 1–1.5 μg. Approximately 350 ng of the prepared library was loaded for each sequencing run. To maintain flow cell throughput and read length, flow cells were flushed every 8–16 hr with the ONT Flow Cell Wash Kit (EXP-WSH003) and reloaded with a fresh library.

Obtaining short read datasets for polishing

We performed 2×150 bp Illumina sequencing for most of the strains that did not have publicly available short read data available. Illumina libraries were prepared from the same gDNA extractions as the Nanopore library for most samples, with some exceptions as described in Supplementary file 1. The libraries were prepared in either of two manners. For the majority of samples, sequencing libraries were prepared with a modified version of the Nextera DNA Library Kit (Illumina, San Diego, CA) protocol (Baym et al., 2015) and sequencing was performed by Admera Health on NextSeq 4000 or HiSeq 4000 machines. Alternatively, Illumina libraries were prepared with the KAPA Hyper DNA kit (Roche, Basel, Switzerland) according to the manufacturer’s protocol and sequenced at the UNC sequencing core on a HiSeq 4000 machine. In either case, all samples on a lane were uniquely dual indexed. Illumina sequencing was not performed for D. equinoxialis, D. funebris, D. subpulchrella, D. tropicalis, Le. varia, Z. lachaisei, Z. taronus, and the unidentified São Tomé mushroom feeder due to material unavailability (line extinction/culling). Details for each sample, including accession numbers for any public data used in this work, are provided in Supplementary file 1.

Choice of long read assembly program

Flye v2.6 (Kolmogorov et al., 2019) was used due to its quick CPU runtime, low memory requirements, excellent assembly contiguity, and its consistent performance on benchmarking datasets (Wick and Holt, 2020). We additionally validated the performance of Flye for Drosophila genomes using Nanopore data previously generated by Miller et al., 2018 and 60× depth of new Nanopore sequencing of the Berkeley Drosophila Genome Project ISO-1 strain of D. melanogaster. We assembled genomes with Flye v2.6 and Canu v1.8 (Koren et al., 2017) to evaluate simple benchmarks of assembly contiguity and run time and to provide a comparison to the Miniasm (Li, 2016) assemblies from Miller et al., 2018 Canu produced relatively contiguous assemblies, but a single assembly took several days on a 92-core cloud server and even longer when a large number of extra-long (>50kb) reads were present in the data. This was determined to be too costly when scaled to >100 species. In addition to a much shorter (8–12 hr wall-clock time) runtime, Flye also produced significantly more contiguous assemblies than those reported by Miller et al. (Figure 1—figure supplement 2). Note, several new long read assemblers have been released and these assembly programs have been significantly updated since this work was performed. Assembler performance should be evaluated with up-to-date versions in any future work.

Assembly and long read polishing

After Nanopore sequencing was performed, raw Nanopore data were basecalled with Guppy v3.2.4, using the high-accuracy caller (option: -c dna_r0.4.1_450bps_hac.cfg). Raw Nanopore data previously generated by Miller et al., 2018 were processed in the same manner.

Next, basecalled reads were assembled using Flye v2.6 with default settings. Genome size estimates (option: --genomeSize) were obtained through a web search or taken from a closely related species. If no such information was available, an initial estimate of 200 Mb was used. The specific genome size estimate provided to Flye (separate from the one estimated later from BUSCO coverage) is provided in Supplementary file 2.

After generating a draft assembly, we performed long read polishing using Medaka following the developer’s instructions (https://nanoporetech.github.io/medaka/draft_origin.html). Reads were aligned to the draft genome with Minimap2 v2.17 (Li, 2016) and parsed with SAMtools v1.12 (Danecek et al., 2021; Li et al., 2009) before each round of polishing (option: -ax ont). The draft was polished with two rounds of Racon v1.4.3 (Vaser et al., 2017) (options: -m8 -x 6 g 8 w 500) and then a single round of Medaka v0.9.1.

Haplotig identification and removal

Next, we assessed each Medaka-polished assembly for the presence of duplicated haplotypes (haplotigs) using BUSCO v3.0.2 (Simão et al., 2015; Waterhouse et al., 2018) along with the OrthoDB v9 dipteran gene set (Zdobnov et al., 2017). If the BUSCO duplication rate exceeded 1%, haplotig identification and removal was performed, but on the draft assembly produced by Flye rather than the polished assembly. Purge_haplotigs v1.1.1 (Roach et al., 2018) was run on these sequences following the guidelines provided by the developer (https://bitbucket.org/mroachawri/purge_haplotigs). Illumina reads were mapped to the draft assembly with Minimap2 (option: -ax sr) to obtain read depth information. The optional clipping step was performed to remove overlapping (duplicate) contig ends. Finally, remaining contigs were re-scaffolded with Nanopore reads using npScarf v1.9-2b (Cao et al., 2017), with support from at least four long reads required to link two contigs (option: --support=4). These sequences were polished with Racon and Medaka as described above.

Final polishing and decontamination

The Medaka-polished assembly was further polished with Illumina data and any contigs identified as microbial sequences were removed. Illumina reads were mapped to the draft assembly with Minimap2 (option: -ax sr) and the assembly polished with Pilon v1.23 (--fix snps,indels) (Walker et al., 2014). If a genome did not have an accompanying short read dataset but Illumina reads were available from a different strain of the same species (Supplementary file 1), Pilon was run without correcting SNVs (option: --fix indels). We found that allowing Pilon to fix gaps or local misassemblies in default mode introduced large spurious indels in regions where short reads map poorly such as tandem repeats. These variants were not supported by long reads or by comparison to a reference assembly. Thus, we chose to use Pilon to only fix base-level errors.

Assembly decontamination

After Pilon polishing, assembly completeness was assessed again with BUSCO v3.0.2. We used BLAST v2.10.0 (Altschul et al., 1990) to remove any contigs not associated with at least one BUSCO that were also of bacterial, protozoan, or fungal origin. Finally, any sequences flagged by the NCBI Contamination Screen were excluded or trimmed.

A flow chart outline of the full genome assembly process described here is provided in Figure 7.

Alternative hybrid assembly process

Zaprionus indianus line 16GNV01 had insufficient Nanopore data for a Flye assembly. For this line only and to consolidate all assemblies as a single resource, the same genome assembly from Comeault et al., 2020 is both reported here and associated with the NCBI BioProject associated with this work. An alternative assembly strategy was taken for this line. Briefly, short-read sequence data was assembled first using SPAdes v3.11.1 (Bankevich et al., 2012) using default parameters. Nanopore reads were corrected with Illumina data using FMLRC v.1.0.0 (Wang et al., 2018) and subsequently used to scaffold the SPAdes assembly using LINKS v.1.8.7 (Warren et al., 2015) using the recommended iterative approach of 33 iterations with incrementally increasing k-mer distance threshold. The resulting scaffolds were polished with four rounds of Racon followed by four rounds of Pilon (but without Medaka) as described above.

Repeat annotation and masking

Each draft assembly was soft repeat masked with RepeatMasker v4.1.0 (Smit et al., 2013) at medium sensitivity, with both Dfam 3.1 (Hubley et al., 2016) and RepBase RepeatMasker edition (Bao et al., 2015) repeat libraries installed (options: --species Drosophila --xsmall). RepeatMasker was initialized with cross_match v1.090518 (Green, 2009) as the sequence search engine and Tandem Repeat Finder v4.0.9 (Benson, 1999).

Genome size estimation

Genome size was estimated with Nanopore and Illumina data separately (Supplementary file 2). To estimate genome size from Illumina reads, we used the k-mer counting approach implemented in GenomeScope v1.0.0 (Vurture et al., 2017). Briefly, we followed the developer-provided workflow (https://github.com/schatzlab/genomescope) and generated a k-mer count histogram using a k-mer size of 21 (option: -m 21) with Jellyfish v2.2.3 (Marçais and Kingsford, 2011). The histogram was passed to the genomescope.R script to estimate the haploid genome size. We found these estimates to be somewhat unreliable, particularly when we tried to estimate genome size from a non-inbred sample. Due to this issue and because some samples were missing short read data, we took additional steps to estimate genome size from long reads.

Since the higher error rate of Nanopore reads (5–15%) precludes the use of k-mer based reference-free approaches for genome size estimation, we instead used regions annotated as a single-copy BUSCO gene to estimate genome size. Our rationale was that non-duplicated complete BUSCOs in each assembly could reasonably be assumed to be true single-copy markers and serve a similar function as unique k-mers for genome size estimation. Then, genome size can be roughly estimated from the depth of coverage across single-copy BUSCOs:

g e n o m e s i z e = (t o t a l b a s e s i n N a n o p o r e r e a d s) / (c o v e r a g e)

To perform this estimation, Nanopore reads were aligned to the coding sequences with Minimap2, only keeping primary alignments (options: --ax map-ont --secondary=no). Read depth was computed from genomic regions annotated as a single-copy BUSCO with SAMtools. If some proportion of the genome assembly was identified as non-fly and removed during the contaminant removal step, we adjusted the genome size estimate based on the total length of removed sequence:

g e n o m e s i z e = (t o t a l b a s e s i n r e a d s) * (1 - p r o p o r t i o n o f a s s e m b l y r e m o v e d) / (m e a n d e p t h o f c o v e r a g e)

This assumes uniform Nanopore coverage across fly and contaminant sequence in the assembly and serves only as a rough approximation.

Assessing assembly contiguity and completeness

Assembly contiguity statistics were computed using a series of custom shell and R scripts. Fasta files were parsed with Bioawk v1.0 (Li, 2017) and summary statistics were computed in the standard manner with the custom scripts. Contig N50 and NG50 were computed in the standard manner, in the latter case using the long-read based estimates of genome size. In addition to these statistics, we present contiguity in terms of auN. The auN statistic (Li, 2020) is the area under an Nx curve, and can be computed by multiplying the length of each contig (L_i) by the proportion of the assembled genome it accounts for (L_i /∑L_i), then summing these values for all i contigs:

a u N = \sum_{i}^{} (L_{i} \frac{L_{i}}{\sum_{j}^{} L_{j}})

Contig N50 represents a single point on the Nx curve and may or may not be affected by assembly breaks, but auN is always sensitive to a break in the assembly. Therefore, auN is a fairer statistic for comparison between different versions of the same assembly.

Assembly completeness was assessed with BUSCO v4.0.6 (Seppey et al., 2019), using the OrthoDB v10 Diptera database (Kriventseva et al., 2019) (options: --m geno -l diptera_odb10 --augustus_species fly). Note, the BUSCO version used here is different from what was used during the assembly process. When this work was started, BUSCO v3 was the current version. Version 4 was released while the project was ongoing. For consistency, version three was used during the assembly process for all assemblies, but the completeness of all final assemblies was assessed with BUSCO v4.For D. equinoxialis only, BUSCO v4.1.4 was used instead of v4.0.6 due to the presence of a bug that precluded the use of earlier versions.

Computation of sample heterozygosity and sequence quality from long and short reads

Sample diversity was estimated by counting the number of non-reference single-nucleotide polymorphisms (SNPs) and indels, called separately from long and short reads. We mapped ONT reads to the finished genome with Minimap2 (option: -ax map-ont) then sorted the output with sambamba 0.8.0 (Tarasov et al., 2015). Variants were called with PEPPER-Margin-Deepvariant r0.4 (Shafin et al., 2021), following the developer’s Singularity container-based ‘Nanopore variant calling’ instructions (https://github.com/kishwarshafin/pepper), to generate both variant call format (VCF) and banded genomic variant call format (gVCF) files, that is, a variant call file including intervals of invariant sites. Similarly, we mapped Illumina reads to each finished genome genome with Minimap2 (option: -ax sr), sorted and removed duplicates with sambamba, then performed variant calling, with output including all invariant sites, with BCFtools v1.12 (Danecek et al., 2021; Li, 2011). For both types of variant calls, we performed additional quality filtering using BCFtools. Only sites with minimum read depth 10, site quality score 30, and (if applicable) genotype quality score 30 filters were kept. The number of callable sites was estimated by adding the number of sites and the lengths of the intervals that passed these quality filters.

Sequence quality was estimated from variant calls following the standard workflow (e.g. Koren et al., 2017; Solares et al., 2018). Error was estimated by counting the number of non-reference variants (SNPs or indels), in either heterozygous or homozygous form, then dividing this count by the number of informative bases for variant calling: P_error = (Number of variants)/(Number of callable sites). A Phred-scaled quality score (QV) was computed in the standard manner: QV = −10 * log₁₀(P_error).

Reference-free consensus quality scores

Reference-free quality score estimates were computed with Merqury v1.3 (Rhie et al., 2020), following the instructions provided by the developer on the GitHub repository (https://github.com/marbl/merqury). Briefly, we used the tools included with the installation of Merqury to estimate an optimal k-mer size for each genome assembly, at a collision rate of 0.001. Then, we built a k-mer database using the Illumina reads used to polish the genome assembly. Note, in some cases, Illumina reads from a different strain were used, and polishing was only used to correct indels. Finally, we ran the main Merqury script on the assembly of interest to estimate a genome-wide Phred-scaled consensus quality score (QV).

Reference-based quality assessment

Reference-based quality score estimates were computed ONT Pomoxis v0.3.7 (https://github.com/nanoporetech/pomoxis) for assemblies where a well-annotated counterpart of not only the same species but the same strain was available through the NCBI RefSeq database. Gene and repeat annotations were downloaded from NCBI and coding regions, introns, intergenic regions, and repeats were parsed into BED formatted intervals with bedtools v2.30.0 (Quinlan and Hall, 2010). Introns were computed as the within-gene complement of exons, and intergenic regions were computed as the complement of genic regions. Then, we ran Pomoxis, which aligns each Nanopore-based assembly to the reference genome and assessed differences between the two genomes in 100 kb windows (option: -c 100000). Consensus quality was estimated by counting SNVs, insertions, and deletions and dividing the number of affected base pairs by the length of the alignment. This computation was performed separately for exons, introns, intergenic regions, repeats, and for the whole genome, using the genomic intervals described above.

For manual validation, we first used Pomoxis, as described above, to generate a list of all 1 bp or longer (option: -l 1) indel differences between our Nanopore-based assembly and the Release six assembly (Hoskins et al., 2015) of the D. melanogaster reference strain. The CA 8.2 MHAP version of the PacBio-based (Kim et al., 2014; Koren et al., 2017) D. melanogaster ISO-1 assembly was obtained from GenBank accession GCA_000778455.1. The iso1_onta2_quickmerge_scaffolds version from Solares et al., 2018 was downloaded from the GitHub repository associated with that project (https://github.com/danrdanny/Nanopore_ISO1). We aligned short reads, long reads, and each of the non-reference genomes to the Release six reference genome using Minimap2 (option, for short reads: -ax sr; for long reads: -ax map-ont; for genomes: -ax asm5), then sorted and parsed the output into BAM format with SAMtools. Repeat annotations for the Nanopore-based assembly were generated as described previously, then lifted over into reference coordinates. The liftover was performed with HALtools v2.1 (Hickey et al., 2013). Specifically, we aligned our Nanopore assembly to the Release six assembly with Minimap2 (options: --cx asm5 --cs long), converted the PAF alignment to MAF with Minimap2’s paftools.js program, MAF alignment to HAL with HALtools’ hal2maf program, and executed the liftover with HALtools’ halLiftover program. Alignments and genomic intervals were viewed in the Integrative Genomics Viewer v2.9.4 (Robinson et al., 2011b).

Species tree inference from BUSCO orthologs

We inferred species relationships using complete and single-copy orthologs identified by the BUSCO analysis. Amino acid sequences were used instead of nucleotide sequences to achieve better alignments in the face of high-sequence divergence (Bininda-Emonds, 2005). Out of 990 single-copy orthologs present in all assemblies, we randomly selected 250 to construct gene trees. The predicted protein sequence of each ortholog was aligned separately with MAFFT v7.453 (Katoh and Standley, 2013), using the E-INS-i algorithm (options: --ep 0 --genafpair --maxiterate 1000). Gene trees were inferred with RAxML-NG v0.9.0 (Kozlov et al., 2019), using the Le and Gascuel, 2008 amino acid substitution model (options: --msa-format FASTA --data-type AA --model LG). The summary method ASTRAL-MP v.5.14.7 (Yin et al., 2019) was run with default settings to reconstruct the species tree. We note that this is not intended to be a definitive phylogenetic reconstruction of species relationships; see Suvorov et al., 2021 for a time-calibrated phylogeny utilizing 158 drosophilid whole genomes.

Analysis of chromosome organization

Syntenic comparisons were performed by representing the genome assemblies as paths through an undirected graph. The path each genome traverses can be considered a series of connections between single copy orthologous markers (i.e. BUSCOs). Using BUSCO v4 annotations for each final genome, we constructed a 3285 by 3285 symmetric adjacency matrix, with row and column headers (nodes) corresponding to 3285 possible BUSCOs from the diptera_odb10 database. Off-diagonal entries in each matrix (edges) were the number of times two single-copy BUSCOs were found as connected and immediate neighbors in the assemblies. Sequences of three or more BUSCOs were not considered. The graph was then visualized in two dimensions using the ForceAtlas2 graph layout algorithm (Jacomy et al., 2014) as implemented in the ForceAtlas2 R package (https://github.com/analyxcompany/ForceAtlas2). While this method is primarily designed for flexible, user-friendly tuning of graph visualization, it is similar in effect to other nonlinear dimensionality reduction techniques (Böhm et al., 2020). ForceAtlas2 was run with the settings: tolerance=1, gravity=1, iterations=3000. D. equinoxialis was omitted from this analysis due to the BUSCO v4 issues mentioned previously.

Repeat content and genome size analysis

The contribution of repeat content to genome size variation in Drosophila was examined by comparing the number of bases in each genome annotated as a type of repeat (previously described) to the number of bases not annotated as repetitive sequence. Phylogenetic independent contrasts (Felsenstein, 1985) were computed for the counts of bases in both categories using the R package ape v5.4.1 (Paradis and Schliep, 2019) using the species tree described above with the root age set to 53 million years following the estimate in Suvorov et al., 2021.

Compute containers

While the overall computational demands of this work were high, the unique computational challenge we faced was the variety of computational resources used for various stages of the assembly process. Assemblies took place across local servers, institutional clusters, and cloud computing resources. A key factor in ensuring reproducibility across computing environments was the use of computing containers, which is like a lightweight virtual machine that can be customized such that sets of programs and their dependencies are packaged together. Specifically, we used the programs Docker and Singularity to manage containers. These programs allow containers to be built and packaged as an image file which is transferred to another computer. A Dockerfile, a text file containing instructions to set up an image, is used to select the Linux operating system and the suite of programs to be installed within a Docker container. Singularity is used to package the Docker container as an image file that can be transferred to and used in a cluster or cloud environment without the need for administrative permissions. Standard commands are then run inside the container environment. The files and instructions necessary to build these containers, which will allow for the exact reproduction of the computing environment in which this work was performed, are provided at: https://github.com/flyseq/drosophila_assembly_pipelines, (copy archived at swh:1:rev:4e40d28d0bdcd1bc7e4eabb7709f301df9ad7ead, Kim, 2021). We hope these files will facilitate the work of researchers new to Nanopore sequencing or the genome assembly process.

Acknowledgements

We thank Brandon Cooper, Antonio Serrato-Capuchina, and David Turissini for help with collections and field logistics; Sarah CR Elgin, Wilson Leung, Elena Gracheva, and Sophia Bieser for help with modENCODE fly lines; Jonathan Chang for helpful discussions about phylogenetic methods; Charlotte Helfrich-Förster for providing lab resources for G Manoli and E Bertolini; and John Tyson along with the staff at Circulomics, in particular Kelvin Liu and Michelle Kim, for many illuminating discussions about long read library prep and sequencing.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Bernard Y Kim, Email: bernardkim@stanford.edu.

Daniel R Matute, Email: dmatute@email.unc.edu.

Dmitri A Petrov, Email: dpetrov@stanford.edu.

Graham Coop, University of California, Davis, United States.

Patricia J Wittkopp, University of Michigan, United States.

Funding Information

This paper was supported by the following grants:

National Institute of General Medical Sciences F32GM135998 to Bernard Y Kim.
National Institute of General Medical Sciences R35GM118165 to Dmitri A Petrov.
National Institute of Diabetes and Digestive and Kidney Diseases K01DK119582 to Jeremy R Wang.
National Science Foundation DEB-1457707 to Corbin D Jones.
National Institute of General Medical Sciences R01GM121750 to Daniel R Matute.
National Institute of General Medical Sciences R01GM125715 to Daniel R Matute.
Google Google Cloud Research Credits to Bernard Y Kim, Jeremy R Wang.
National Institute of General Medical Sciences R35GM122592 to Artyom Kopp.
National Institute of General Medical Sciences R35GM119816 to Noah Whiteman.
Uehara Memorial Foundation 201931028 to Teruyuki Matsunaga.
Ministry of Education, Science and Technological Development of the Republic of Serbia 451-03-68/2020-14/200178 to Marina Stamenković-Radak, Mihailo Jelić, Marija Savić Veselinović.
Ministry of Education, Science and Technological Development of the Republic of Serbia 451-03-68/2020-14/200007 to Marija Tanasković, Pavle Erić.
National Natural Science Foundation of China 32060112 to Jian-Jun Gao.
Japan Society for the Promotion of Science JP18K06383 to Masayoshi Watada.
Horizon 2020 - Research and Innovation Framework Programme 765937-CINCHRON to Giulia Manoli, Enrico Bertolini.
Czech Science Foundation 19-13381S to Vladimír Košťál.
Japan Society for the Promotion of Science JP19H03276 to Aya Takahashi.
National Science Foundation 1345247 to Donald K Price.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Resources, Data curation, Software, Formal analysis, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing - original draft, Project administration, Writing - review and editing.

Conceptualization, Resources, Data curation, Software, Investigation, Methodology, Writing - original draft, Writing - review and editing.

Conceptualization, Data curation, Investigation, Methodology, Writing - original draft, Writing - review and editing.

Resources, Investigation.

Resources, Data curation, Investigation, Writing - review and editing.

Data curation, Validation.

Resources, Data curation, Investigation, Writing - review and editing.

Investigation.

Resources, Validation, Writing - review and editing.

Resources, Writing - review and editing.

Resources, Data curation, Methodology, Writing - review and editing.

Investigation.

Resources, Investigation, Methodology.

Resources, Writing - review and editing.

Resources.

Resources, Writing - review and editing.

Resources, Supervision, Writing - review and editing.

Resources.

Resources, Writing - review and editing.

Resources, Supervision, Writing - review and editing.

Resources, Data curation, Supervision, Investigation, Writing - review and editing.

Resources, Supervision, Funding acquisition, Writing - review and editing.

Resources, Data curation, Funding acquisition, Writing - review and editing.

Conceptualization, Resources, Supervision, Funding acquisition, Project administration, Writing - review and editing.

Conceptualization, Resources, Supervision, Funding acquisition, Visualization, Project administration, Writing - review and editing.

Conceptualization, Resources, Supervision, Funding acquisition, Methodology, Writing - original draft, Project administration, Writing - review and editing.

Conceptualization, Resources, Formal analysis, Supervision, Funding acquisition, Visualization, Methodology, Writing - original draft, Project administration, Writing - review and editing.

Additional files

Supplementary file 1. Detailed information on both long-read and short-read data used for this project, including accession numbers if publicly available data were used for assembly.

elife-66405-supp1.xlsx^{(16.5KB, xlsx)}

Supplementary file 2. Assembly summary statistics and genome size estimates.

elife-66405-supp2.xlsx^{(27.2KB, xlsx)}

Supplementary file 3. Counts of SNPs, indels, and per-site heterozygosity estimated from both long reads and short reads.

elife-66405-supp3.xlsx^{(25.5KB, xlsx)}

Supplementary file 4. Consensus quality scores estimated with reference-free and reference-based methods.

elife-66405-supp4.xlsx^{(23.3KB, xlsx)}

Supplementary file 5. Characterization of all coding sequence indel differences between Nanopore and Release six reference D. melanogaster assemblies.

elife-66405-supp5.xlsx^{(11.7KB, xlsx)}

Supplementary file 6. Detailed sample information.

elife-66405-supp6.xlsx^{(33.6KB, xlsx)}

Transparent reporting form

elife-66405-transrepform.pdf^{(201.4KB, pdf)}

Data availability

All sequencing data and assemblies generated by this study are deposited at NCBI SRA and GenBank under NCBI BioProject PRJNA675888. Accession numbers for all data used but not generated by this study are provided in the supporting files. Dockerfiles and scripts for reproducing pipelines and analyses are provided on GitHub (https://github.com/flyseq/drosophila_assembly_pipelines; copy archived at https://archive.softwareheritage.org/swh:1:rev:4e40d28d0bdcd1bc7e4eabb7709f301df9ad7ead). A detailed wet lab protocol is provided at https://Protocols.io (https://doi.org/10.17504/protocols.io.bdfqi3mw).

The following dataset was generated:

Kim BY, Wang JR. 2020. Nanopore-based assembly of many drosophilid genomes. NCBI BioProject. PRJNA675888

The following previously published datasets were used:

Miller DE. 2018. Sequencing and assembly of 14 Drosophila species. NCBI BioProject. PRJNA427774

The Drosophila modENCODE Project 2011. modENCODE Drosophila reference genome sequencing (fruit flies) NCBI BioProject. 63477

Yang H. 2018. DNA-seq of sexed Drosophila grimshawi, Drosophila silvestris, and Drosophila heteroneura. NCBI BioProject. PRJNA484408

Bronski M. 2019. Drosophila montium Species Group Genomes Project. NCBI BioProject. PRJNA554346

Rane R. 2018. Invertebrate sample from Drosophila repleta. NCBI BioProject. 476692

Turissini D. 2017. Fly lines. NCBI BioProject. 395473

National Institute of Genetics [Japan] 2016. Genome sequences of 10 Drosophila species. NCBI BioProject. PRJDB4817

Ellison C. 2019. Raw genomic sequencing data from 16 Drosophila species. NCBI BioProject. PRJNA550077

References

Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF, George RA, Lewis SE, Richards S, Ashburner M, Henderson SN, Sutton GG, Wortman JR, Yandell MD, Zhang Q, Chen LX, Brandon RC, Rogers YH, Blazej RG, Champe M, Pfeiffer BD, Wan KH, Doyle C, Baxter EG, Helt G, Nelson CR, Gabor GL, Abril JF, Agbayani A, An HJ, Andrews-Pfannkoch C, Baldwin D, Ballew RM, Basu A, Baxendale J, Bayraktaroglu L, Beasley EM, Beeson KY, Benos PV, Berman BP, Bhandari D, Bolshakov S, Borkova D, Botchan MR, Bouck J, Brokstein P, Brottier P, Burtis KC, Busam DA, Butler H, Cadieu E, Center A, Chandra I, Cherry JM, Cawley S, Dahlke C, Davenport LB, Davies P, de Pablos B, Delcher A, Deng Z, Mays AD, Dew I, Dietz SM, Dodson K, Doup LE, Downes M, Dugan-Rocha S, Dunkov BC, Dunn P, Durbin KJ, Evangelista CC, Ferraz C, Ferriera S, Fleischmann W, Fosler C, Gabrielian AE, Garg NS, Gelbart WM, Glasser K, Glodek A, Gong F, Gorrell JH, Gu Z, Guan P, Harris M, Harris NL, Harvey D, Heiman TJ, Hernandez JR, Houck J, Hostin D, Houston KA, Howland TJ, Wei MH, Ibegwam C, Jalali M, Kalush F, Karpen GH, Ke Z, Kennison JA, Ketchum KA, Kimmel BE, Kodira CD, Kraft C, Kravitz S, Kulp D, Lai Z, Lasko P, Lei Y, Levitsky AA, Li J, Li Z, Liang Y, Lin X, Liu X, Mattei B, McIntosh TC, McLeod MP, McPherson D, Merkulov G, Milshina NV, Mobarry C, Morris J, Moshrefi A, Mount SM, Moy M, Murphy B, Murphy L, Muzny DM, Nelson DL, Nelson DR, Nelson KA, Nixon K, Nusskern DR, Pacleb JM, Palazzolo M, Pittman GS, Pan S, Pollard J, Puri V, Reese MG, Reinert K, Remington K, Saunders RD, Scheeler F, Shen H, Shue BC, Sidén-Kiamos I, Simpson M, Skupski MP, Smith T, Spier E, Spradling AC, Stapleton M, Strong R, Sun E, Svirskas R, Tector C, Turner R, Venter E, Wang AH, Wang X, Wang ZY, Wassarman DA, Weinstock GM, Weissenbach J, Williams SM, Woodage T, Worley KC, Wu D, Yang S, Yao QA, Ye J, Yeh RF, Zaveri JS, Zhan M, Zhang G, Zhao Q, Zheng L, Zheng XH, Zhong FN, Zhong W, Zhou X, Zhu S, Zhu X, Smith HO, Gibbs RA, Myers EW, Rubin GM, Venter JC. The genome sequence of Drosophila melanogaster. Science. 2000;287:2185–2195. doi: 10.1126/science.287.5461.2185. [DOI] [PubMed] [Google Scholar]
Adams M, McBroome J, Maurer N, Pepper-Tunick E, Saremi NF, Green RE, Vollmers C, Corbett-Detig RB. One fly–one genome: chromosome-scale genome assembly of a single outbred Drosophila melanogaster. Nucleic Acids Research. 2020;356:450. doi: 10.1093/nar/gkaa450. [DOI] [PMC free article] [PubMed] [Google Scholar]
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of Molecular Biology. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2. [DOI] [PubMed] [Google Scholar]
Armstrong J, Hickey G, Diekhans M, Fiddes IT, Novak AM, Deran A, Fang Q, Xie D, Feng S, Stiller J, Genereux D, Johnson J, Marinescu VD, Alföldi J, Harris RS, Lindblad-Toh K, Haussler D, Karlsson E, Jarvis ED, Zhang G, Paten B. Progressive Cactus is a multiple-genome aligner for the thousand-genome era. Nature. 2020;587:246–251. doi: 10.1038/s41586-020-2871-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology : a journal of computational molecular cell biology. 2012;19:455–477. doi: 10.1089/cmb.2012.0021. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bao W, Kojima KK, Kohany O. Repbase update, a database of repetitive elements in eukaryotic genomes. Mobile DNA. 2015;6:11. doi: 10.1186/s13100-015-0041-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
Baym M, Kryazhimskiy S, Lieberman TD, Chung H, Desai MM, Kishony R. Inexpensive multiplexed library preparation for megabase-sized genomes. PLOS ONE. 2015;10:e0128036. doi: 10.1371/journal.pone.0128036. [DOI] [PMC free article] [PubMed] [Google Scholar]
Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids research. 1999;27:573–580. doi: 10.1093/nar/27.2.573. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bininda-Emonds OR. transAlign: using amino acids to facilitate the multiple alignment of protein-coding DNA sequences. BMC bioinformatics. 2005;6:156. doi: 10.1186/1471-2105-6-156. [DOI] [PMC free article] [PubMed] [Google Scholar]
Böhm JN, Berens P, Kobak D. A unifying perspective on neighbor embeddings along the Attraction-Repulsion spectrum. arXiv. 2020 http://arxiv.org/abs/2007.08902
Bracewell R, Chatla K, Nalley MJ, Bachtrog D. Dynamic turnover of centromeres drives karyotype evolution in Drosophila. eLife. 2019;8:e49002. doi: 10.7554/eLife.49002. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bronski MJ, Martinez CC, Weld HA, Eisen MB. Whole Genome Sequences of 23 Species from the Drosophila montium Species Group (Diptera: Drosophilidae): A Resource for Testing Evolutionary Hypotheses. G3: Genes, Genomes, Genetics. 2020;10:1443–1455. doi: 10.1534/g3.119.400959. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cao MD, Nguyen SH, Ganesamoorthy D, Elliott AG, Cooper MA, Coin LJ. Scaffolding and completing genome assemblies in real-time with nanopore sequencing. Nature communications. 2017;8:14515. doi: 10.1038/ncomms14515. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chakraborty M, Emerson JJ, Macdonald SJ, Long AD. Structural variants exhibit widespread allelic heterogeneity and shape variation in complex traits. Nature communications. 2019;10:4872. doi: 10.1038/s41467-019-12884-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chakraborty M, Chang CH, Khost DE, Vedanayagam J, Adrion JR, Liao Y, Montooth KL, Meiklejohn CD, Larracuente AM, Emerson JJ. Evolution of genome structure in the Drosophila simulans species complex. Genome research. 2021;31:380–396. doi: 10.1101/gr.263442.120. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chen ZX, Sturgill D, Qu J, Jiang H, Park S, Boley N, Suzuki AM, Fletcher AR, Plachetzki DC, FitzGerald PC, Artieri CG, Atallah J, Barmina O, Brown JB, Blankenburg KP, Clough E, Dasgupta A, Gubbala S, Han Y, Jayaseelan JC, Kalra D, Kim YA, Kovar CL, Lee SL, Li M, Malley JD, Malone JH, Mathew T, Mattiuzzo NR, Munidasa M, Muzny DM, Ongeri F, Perales L, Przytycka TM, Pu LL, Robinson G, Thornton RL, Saada N, Scherer SE, Smith HE, Vinson C, Warner CB, Worley KC, Wu YQ, Zou X, Cherbas P, Kellis M, Eisen MB, Piano F, Kionte K, Fitch DH, Sternberg PW, Cutter AD, Duff MO, Hoskins RA, Graveley BR, Gibbs RA, Bickel PJ, Kopp A, Carninci P, Celniker SE, Oliver B, Richards S. Comparative validation of the D. melanogaster modENCODE transcriptome annotation. Genome research. 2014;24:1209–1223. doi: 10.1101/gr.159384.113. [DOI] [PMC free article] [PubMed] [Google Scholar]
Clark AG, Eisen MB, Smith DR, Bergman CM, Oliver B, Markow TA, Kaufman TC, Kellis M, Gelbart W, Iyer VN, Pollard DA, Sackton TB, Larracuente AM, Singh ND, Abad JP, Abt DN, Adryan B, Aguade M, Akashi H, Anderson WW, Aquadro CF, Ardell DH, Arguello R, Artieri CG, Barbash DA, Barker D, Barsanti P, Batterham P, Batzoglou S, Begun D, Bhutkar A, Blanco E, Bosak SA, Bradley RK, Brand AD, Brent MR, Brooks AN, Brown RH, Butlin RK, Caggese C, Calvi BR, Bernardo de Carvalho A, Caspi A, Castrezana S, Celniker SE, Chang JL, Chapple C, Chatterji S, Chinwalla A, Civetta A, Clifton SW, Comeron JM, Costello JC, Coyne JA, Daub J, David RG, Delcher AL, Delehaunty K, Do CB, Ebling H, Edwards K, Eickbush T, Evans JD, Filipski A, Findeiss S, Freyhult E, Fulton L, Fulton R, Garcia AC, Gardiner A, Garfield DA, Garvin BE, Gibson G, Gilbert D, Gnerre S, Godfrey J, Good R, Gotea V, Gravely B, Greenberg AJ, Griffiths-Jones S, Gross S, Guigo R, Gustafson EA, Haerty W, Hahn MW, Halligan DL, Halpern AL, Halter GM, Han MV, Heger A, Hillier L, Hinrichs AS, Holmes I, Hoskins RA, Hubisz MJ, Hultmark D, Huntley MA, Jaffe DB, Jagadeeshan S, Jeck WR, Johnson J, Jones CD, Jordan WC, Karpen GH, Kataoka E, Keightley PD, Kheradpour P, Kirkness EF, Koerich LB, Kristiansen K, Kudrna D, Kulathinal RJ, Kumar S, Kwok R, Lander E, Langley CH, Lapoint R, Lazzaro BP, Lee SJ, Levesque L, Li R, Lin CF, Lin MF, Lindblad-Toh K, Llopart A, Long M, Low L, Lozovsky E, Lu J, Luo M, Machado CA, Makalowski W, Marzo M, Matsuda M, Matzkin L, McAllister B, McBride CS, McKernan B, McKernan K, Mendez-Lago M, Minx P, Mollenhauer MU, Montooth K, Mount SM, Mu X, Myers E, Negre B, Newfeld S, Nielsen R, Noor MA, O'Grady P, Pachter L, Papaceit M, Parisi MJ, Parisi M, Parts L, Pedersen JS, Pesole G, Phillippy AM, Ponting CP, Pop M, Porcelli D, Powell JR, Prohaska S, Pruitt K, Puig M, Quesneville H, Ram KR, Rand D, Rasmussen MD, Reed LK, Reenan R, Reily A, Remington KA, Rieger TT, Ritchie MG, Robin C, Rogers YH, Rohde C, Rozas J, Rubenfield MJ, Ruiz A, Russo S, Salzberg SL, Sanchez-Gracia A, Saranga DJ, Sato H, Schaeffer SW, Schatz MC, Schlenke T, Schwartz R, Segarra C, Singh RS, Sirot L, Sirota M, Sisneros NB, Smith CD, Smith TF, Spieth J, Stage DE, Stark A, Stephan W, Strausberg RL, Strempel S, Sturgill D, Sutton G, Sutton GG, Tao W, Teichmann S, Tobari YN, Tomimura Y, Tsolas JM, Valente VL, Venter E, Venter JC, Vicario S, Vieira FG, Vilella AJ, Villasante A, Walenz B, Wang J, Wasserman M, Watts T, Wilson D, Wilson RK, Wing RA, Wolfner MF, Wong A, Wong GK, Wu CI, Wu G, Yamamoto D, Yang HP, Yang SP, Yorke JA, Yoshida K, Zdobnov E, Zhang P, Zhang Y, Zimin AV, Baldwin J, Abdouelleil A, Abdulkadir J, Abebe A, Abera B, Abreu J, Acer SC, Aftuck L, Alexander A, An P, Anderson E, Anderson S, Arachi H, Azer M, Bachantsang P, Barry A, Bayul T, Berlin A, Bessette D, Bloom T, Blye J, Boguslavskiy L, Bonnet C, Boukhgalter B, Bourzgui I, Brown A, Cahill P, Channer S, Cheshatsang Y, Chuda L, Citroen M, Collymore A, Cooke P, Costello M, D'Aco K, Daza R, De Haan G, DeGray S, DeMaso C, Dhargay N, Dooley K, Dooley E, Doricent M, Dorje P, Dorjee K, Dupes A, Elong R, Falk J, Farina A, Faro S, Ferguson D, Fisher S, Foley CD, Franke A, Friedrich D, Gadbois L, Gearin G, Gearin CR, Giannoukos G, Goode T, Graham J, Grandbois E, Grewal S, Gyaltsen K, Hafez N, Hagos B, Hall J, Henson C, Hollinger A, Honan T, Huard MD, Hughes L, Hurhula B, Husby ME, Kamat A, Kanga B, Kashin S, Khazanovich D, Kisner P, Lance K, Lara M, Lee W, Lennon N, Letendre F, LeVine R, Lipovsky A, Liu X, Liu J, Liu S, Lokyitsang T, Lokyitsang Y, Lubonja R, Lui A, MacDonald P, Magnisalis V, Maru K, Matthews C, McCusker W, McDonough S, Mehta T, Meldrim J, Meneus L, Mihai O, Mihalev A, Mihova T, Mittelman R, Mlenga V, Montmayeur A, Mulrain L, Navidi A, Naylor J, Negash T, Nguyen T, Nguyen N, Nicol R, Norbu C, Norbu N, Novod N, O'Neill B, Osman S, Markiewicz E, Oyono OL, Patti C, Phunkhang P, Pierre F, Priest M, Raghuraman S, Rege F, Reyes R, Rise C, Rogov P, Ross K, Ryan E, Settipalli S, Shea T, Sherpa N, Shi L, Shih D, Sparrow T, Spaulding J, Stalker J, Stange-Thomann N, Stavropoulos S, Stone C, Strader C, Tesfaye S, Thomson T, Thoulutsang Y, Thoulutsang D, Topham K, Topping I, Tsamla T, Vassiliev H, Vo A, Wangchuk T, Wangdi T, Weiand M, Wilkinson J, Wilson A, Yadav S, Young G, Yu Q, Zembek L, Zhong D, Zimmer A, Zwirko Z, Jaffe DB, Alvarez P, Brockman W, Butler J, Chin C, Gnerre S, Grabherr M, Kleber M, Mauceli E, MacCallum I, Drosophila 12 Genomes Consortium Evolution of genes and genomes on the Drosophila phylogeny. Nature. 2007;450:203–218. doi: 10.1038/nature06341. [DOI] [PubMed] [Google Scholar]
Comeault AA, Wang J, Tittes S, Isbell K, Ingley S, Hurlbert AH, Matute DR. Genetic Diversity and Thermal Performance in Invasive and Native Populations of African Fig Flies. Molecular biology and evolution. 2020;37:1893–1906. doi: 10.1093/molbev/msaa050. [DOI] [PMC free article] [PubMed] [Google Scholar]
Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10:giab008. doi: 10.1093/gigascience/giab008. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ewing B, Hillier L, Wendl MC, Green P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome research. 1998;8:175–185. doi: 10.1101/gr.8.3.175. [DOI] [PubMed] [Google Scholar]
Felsenstein J. Phylogenies and the Comparative Method. The American Naturalist. 1985;125:1–15. doi: 10.1086/703055. [DOI] [PubMed] [Google Scholar]
Feng S, Stiller J, Deng Y, Armstrong J, Fang Q, Reeve AH, Xie D, Chen G, Guo C, Faircloth BC, Petersen B, Wang Z, Zhou Q, Diekhans M, Chen W, Andreu-Sánchez S, Margaryan A, Howard JT, Parent C, Pacheco G, Sinding MS, Puetz L, Cavill E, Ribeiro ÂM, Eckhart L, Fjeldså J, Hosner PA, Brumfield RT, Christidis L, Bertelsen MF, Sicheritz-Ponten T, Tietze DT, Robertson BC, Song G, Borgia G, Claramunt S, Lovette IJ, Cowen SJ, Njoroge P, Dumbacher JP, Ryder OA, Fuchs J, Bunce M, Burt DW, Cracraft J, Meng G, Hackett SJ, Ryan PG, Jønsson KA, Jamieson IG, da Fonseca RR, Braun EL, Houde P, Mirarab S, Suh A, Hansson B, Ponnikas S, Sigeman H, Stervander M, Frandsen PB, van der Zwan H, van der Sluis R, Visser C, Balakrishnan CN, Clark AG, Fitzpatrick JW, Bowman R, Chen N, Cloutier A, Sackton TB, Edwards SV, Foote DJ, Shakya SB, Sheldon FH, Vignal A, Soares AER, Shapiro B, González-Solís J, Ferrer-Obiol J, Rozas J, Riutort M, Tigano A, Friesen V, Dalén L, Urrutia AO, Székely T, Liu Y, Campana MG, Corvelo A, Fleischer RC, Rutherford KM, Gemmell NJ, Dussex N, Mouritsen H, Thiele N, Delmore K, Liedvogel M, Franke A, Hoeppner MP, Krone O, Fudickar AM, Milá B, Ketterson ED, Fidler AE, Friis G, Parody-Merino ÁM, Battley PF, Cox MP, Lima NCB, Prosdocimi F, Parchman TL, Schlinger BA, Loiselle BA, Blake JG, Lim HC, Day LB, Fuxjager MJ, Baldwin MW, Braun MJ, Wirthlin M, Dikow RB, Ryder TB, Camenisch G, Keller LF, DaCosta JM, Hauber ME, Louder MIM, Witt CC, McGuire JA, Mudge J, Megna LC, Carling MD, Wang B, Taylor SA, Del-Rio G, Aleixo A, Vasconcelos ATR, Mello CV, Weir JT, Haussler D, Li Q, Yang H, Wang J, Lei F, Rahbek C, Gilbert MTP, Graves GR, Jarvis ED, Paten B, Zhang G. Dense sampling of bird diversity increases power of comparative genomics. Nature. 2020;587:252–257. doi: 10.1038/s41586-020-2873-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fiddes IT, Armstrong J, Diekhans M, Nachtweide S, Kronenberg ZN, Underwood JG, Gordon D, Earl D, Keane T, Eichler EE, Haussler D, Stanke M, Paten B. Comparative Annotation Toolkit (CAT)-simultaneous clade and personal genome annotation. Genome research. 2018;28:1029–1038. doi: 10.1101/gr.233460.117. [DOI] [PMC free article] [PubMed] [Google Scholar]
Finet C, Kassner VA, Carvalho AB, Chung H, Day JP, Day S, Delaney EK, Ré FCD, Dufour HD, Dupim E, Izumitani HF, Gautério TB, Justen J, Katoh T, Kopp A, Koshikawa S, Longdon B, Loreto EL, Nunes MDS. DrosoPhyla: genomic resources for drosophilid phylogeny and systematics. bioRxiv. 2021 doi: 10.1101/2021.03.23.436709. [DOI] [PMC free article] [PubMed]
Flynn JM, Long M, Wing RA, Clark AG. Evolutionary Dynamics of Abundant 7-bp Satellites in the Genome of Drosophila virilis. Molecular biology and evolution. 2020;37:1362–1375. doi: 10.1093/molbev/msaa010. [DOI] [PMC free article] [PubMed] [Google Scholar]
Green P. Phrap. 2009 http://phrap.org
Guirao-Rico S, González J. Evolutionary insights from large scale resequencing datasets in Drosophila melanogaster. Current opinion in insect science. 2019;31:70–76. doi: 10.1016/j.cois.2018.11.002. [DOI] [PubMed] [Google Scholar]
Hales KG, Korey CA, Larracuente AM, Roberts DM. Genetics on the Fly: A Primer on the Drosophila Model System. Genetics. 2015;201:815–842. doi: 10.1534/genetics.115.183392. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hickey G, Paten B, Earl D, Zerbino D, Haussler D. HAL: a hierarchical format for storing and analyzing multiple genome alignments. Bioinformatics. 2013;29:1341–1342. doi: 10.1093/bioinformatics/btt128. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hill T, Rosales-Stephens H-L, Unckless RL. Rapid divergence of the copulation proteins in the Drosophila dunni group is associated with hybrid post-mating-prezygotic incompatibilities. bioRxiv. 2020 doi: 10.1101/2020.05.20.106724. [DOI]
Hoskins RA, Carlson JW, Wan KH, Park S, Mendez I, Galle SE, Booth BW, Pfeiffer BD, George RA, Svirskas R, Krzywinski M, Schein J, Accardo MC, Damia E, Messina G, Méndez-Lago M, de Pablos B, Demakova OV, Andreyeva EN, Boldyreva LV, Marra M, Carvalho AB, Dimitri P, Villasante A, Zhimulev IF, Rubin GM, Karpen GH, Celniker SE. The Release 6 reference sequence of the Drosophila melanogaster genome. Genome research. 2015;25:445–458. doi: 10.1101/gr.185579.114. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hotaling S, Sproul JS, Heckenhauer J, Powell A, Larracuente AM, Pauls SU, Kelley JL, Frandsen PB. Long-reads are revolutionizing 20 years of insect genome sequencing. Genome Biology and Evolution. 2021;10:evab138. doi: 10.1101/2021.02.14.431146. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hubley R, Finn RD, Clements J, Eddy SR, Jones TA, Bao W, Smit AF, Wheeler TJ. The Dfam database of repetitive DNA families. Nucleic acids research. 2016;44:D81–D89. doi: 10.1093/nar/gkv1272. [DOI] [PMC free article] [PubMed] [Google Scholar]
Jacomy M, Venturini T, Heymann S, Bastian M. ForceAtlas2, a continuous graph layout algorithm for handy network visualization designed for the Gephi software. PLOS ONE. 2014;9:e98679. doi: 10.1371/journal.pone.0098679. [DOI] [PMC free article] [PubMed] [Google Scholar]
Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular biology and evolution. 2013;30:772–780. doi: 10.1093/molbev/mst010. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kim KE, Peluso P, Babayan P, Yeadon PJ, Yu C, Fisher WW, Chin CS, Rapicavoli NA, Rank DR, Li J, Catcheside DE, Celniker SE, Phillippy AM, Bergman CM, Landolin JM. Long-read, whole-genome shotgun sequence data for five model organisms. Scientific data. 2014;1:140045. doi: 10.1038/sdata.2014.45. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kim BY. Software Heritage. https://archive.softwareheritage.org/swh:1:rev:4e40d28d0bdcd1bc7e4eabb7709f301df9ad7ead: 2021. [Google Scholar]
Kingan S, Heaton H, Cudini J, Lambert C, Baybayan P, Galvin B, Durbin R, Korlach J, Lawniczak M. A High-Quality de novo genome assembly from a single mosquito using PacBio sequencing. Genes. 2019;10:62. doi: 10.3390/genes10010062. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kolmogorov M, Yuan J, Lin Y, Pevzner PA. Assembly of long, error-prone reads using repeat graphs. Nature biotechnology. 2019;37:540–546. doi: 10.1038/s41587-019-0072-8. [DOI] [PubMed] [Google Scholar]
Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome research. 2017;27:722–736. doi: 10.1101/gr.215087.116. [DOI] [PMC free article] [PubMed] [Google Scholar]
Koren S, Phillippy AM, Simpson JT, Loman NJ, Loose M. Reply to 'Errors in long-read assemblies can critically affect protein prediction'. Nature biotechnology. 2019;37:127–128. doi: 10.1038/s41587-018-0005-y. [DOI] [PubMed] [Google Scholar]
Kozlov AM, Darriba D, Flouri T, Morel B, Stamatakis A. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics. 2019;35:4453–4455. doi: 10.1093/bioinformatics/btz305. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kriventseva EV, Kuznetsov D, Tegenfeldt F, Manni M, Dias R, Simão FA, Zdobnov EM. OrthoDB v10: sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs. Nucleic acids research. 2019;47:D807–D811. doi: 10.1093/nar/gky1053. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lack JB, Lange JD, Tang AD, Corbett-Detig RB, Pool JE. A Thousand Fly Genomes: An Expanded Drosophila Genome Nexus. Molecular biology and evolution. 2016;33:3308–3313. doi: 10.1093/molbev/msw195. [DOI] [PMC free article] [PubMed] [Google Scholar]
Le SQ, Gascuel O. An improved general amino acid replacement matrix. Molecular biology and evolution. 2008;25:1307–1320. doi: 10.1093/molbev/msn067. [DOI] [PubMed] [Google Scholar]
Lewin HA, Robinson GE, Kress WJ, Baker WJ, Coddington J, Crandall KA, Durbin R, Edwards SV, Forest F, Gilbert MTP, Goldstein MM, Grigoriev IV, Hackett KJ, Haussler D, Jarvis ED, Johnson WE, Patrinos A, Richards S, Castilla-Rubio JC, van Sluys MA, Soltis PS, Xu X, Yang H, Zhang G. Earth BioGenome project: sequencing life for the future of life. PNAS. 2018;115:4325–4333. doi: 10.1073/pnas.1720115115. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27:2987–2993. doi: 10.1093/bioinformatics/btr509. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics. 2016;32:2103–2110. doi: 10.1093/bioinformatics/btw152. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li H. Bioawk. 2017 https://github.com/lh3/bioawk
Li H. AuN: A New Metric to Measure Assembly Contiguity. 2020 https://lh3.github.io/2020/04/08/a-new-metric-on-assembly-contiguity
Long E, Evans C, Chaston J, Udall JA. Genomic Structural Variations Within Five Continental Populations of Drosophila melanogaster. G3: Genes, Genomes, Genetics. 2018;8:3247–3253. doi: 10.1534/g3.118.200631. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mai D, Nalley MJ, Bachtrog D. Patterns of Genomic Differentiation in the Drosophila nasuta Species Complex. Molecular biology and evolution. 2020;37:208–220. doi: 10.1093/molbev/msz215. [DOI] [PMC free article] [PubMed] [Google Scholar]
Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27:764–770. doi: 10.1093/bioinformatics/btr011. [DOI] [PMC free article] [PubMed] [Google Scholar]
Miller DE, Staber C, Zeitlinger J, Hawley RS. Highly Contiguous Genome Assemblies of 15 Drosophila Species Generated Using Nanopore Sequencing. G3: Genes, Genomes, Genetics. 2018;8:3131–3141. doi: 10.1534/g3.118.200160. [DOI] [PMC free article] [PubMed] [Google Scholar]
Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze A, Mikheenko A, Vollger MR, Altemose N, Uralsky L, Gershman A, Aganezov S, Hoyt SJ, Diekhans M, Logsdon GA, Alonge M, Antonarakis SE, Borchers M, Bouffard GG, Brooks SY, Phillippy AM. The complete sequence of a human genome. bioRxiv. 2021 doi: 10.1101/2021.05.26.445798. [DOI] [PMC free article] [PubMed]
O'Grady PM, DeSalle R. Phylogeny of the Genus Drosophila. Genetics. 2018;209:1–25. doi: 10.1534/genetics.117.300583. [DOI] [PMC free article] [PubMed] [Google Scholar]
Paithankar KR, Prasad KS. Precipitation of DNA by polyethylene glycol and ethanol. Nucleic Acids Research. 1991;19:1346. doi: 10.1093/nar/19.6.1346. [DOI] [PMC free article] [PubMed] [Google Scholar]
Paradis E, Schliep K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics. 2019;35:526–528. doi: 10.1093/bioinformatics/bty633. [DOI] [PubMed] [Google Scholar]
Paris M, Boyer R, Jaenichen R, Wolf J, Karageorgi M, Green J, Cagnon M, Parinello H, Estoup A, Gautier M, Gompel N, Prud'homme B. Near-chromosome level genome assembly of the fruit pest Drosophila suzukii using long-read sequencing. Scientific reports. 2020;10:11227. doi: 10.1038/s41598-020-67373-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome research. 2010;20:110–121. doi: 10.1101/gr.097857.109. [DOI] [PMC free article] [PubMed] [Google Scholar]
Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842. doi: 10.1093/bioinformatics/btq033. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ranz JM, Casals F, Ruiz A. How malleable is the eukaryotic genome? Extreme rate of chromosomal rearrangement in the genus Drosophila. Genome research. 2001;11:230–239. doi: 10.1101/gr.162901. [DOI] [PMC free article] [PubMed] [Google Scholar]
Rezvykh AP, Funikov SY, Protsenko LA, Kulikova DA, Zelentsova ES, Chuvakova LN, Blumenstiel JP, Evgen'ev MB. Evolutionary Dynamics of the Pericentromeric Heterochromatin in Drosophila virilis and Related Species. Genes. 2021;12:175. doi: 10.3390/genes12020175. [DOI] [PMC free article] [PubMed] [Google Scholar]
Rhie A, Walenz BP, Koren S, Phillippy AM. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome biology. 2020;21:245. doi: 10.1186/s13059-020-02134-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
Rhie A, McCarthy SA, Fedrigo O, Damas J, Formenti G, Koren S, Uliano-Silva M, Chow W, Fungtammasan A, Kim J, Lee C, Ko BJ, Chaisson M, Gedman GL, Cantin LJ, Thibaud-Nissen F, Haggerty L, Bista I, Smith M, Haase B, Mountcastle J, Winkler S, Paez S, Howard J, Vernes SC, Lama TM, Grutzner F, Warren WC, Balakrishnan CN, Burt D, George JM, Biegler MT, Iorns D, Digby A, Eason D, Robertson B, Edwards T, Wilkinson M, Turner G, Meyer A, Kautt AF, Franchini P, Detrich HW, Svardal H, Wagner M, Naylor GJP, Pippel M, Malinsky M, Mooney M, Simbirsky M, Hannigan BT, Pesout T, Houck M, Misuraca A, Kingan SB, Hall R, Kronenberg Z, Sović I, Dunn C, Ning Z, Hastie A, Lee J, Selvaraj S, Green RE, Putnam NH, Gut I, Ghurye J, Garrison E, Sims Y, Collins J, Pelan S, Torrance J, Tracey A, Wood J, Dagnew RE, Guan D, London SE, Clayton DF, Mello CV, Friedrich SR, Lovell PV, Osipova E, Al-Ajli FO, Secomandi S, Kim H, Theofanopoulou C, Hiller M, Zhou Y, Harris RS, Makova KD, Medvedev P, Hoffman J, Masterson P, Clark K, Martin F, Howe K, Flicek P, Walenz BP, Kwak W, Clawson H, Diekhans M, Nassar L, Paten B, Kraus RHS, Crawford AJ, Gilbert MTP, Zhang G, Venkatesh B, Murphy RW, Koepfli KP, Shapiro B, Johnson WE, Di Palma F, Marques-Bonet T, Teeling EC, Warnow T, Graves JM, Ryder OA, Haussler D, O'Brien SJ, Korlach J, Lewin HA, Howe K, Myers EW, Durbin R, Phillippy AM, Jarvis ED. Towards complete and error-free genome assemblies of all vertebrate species. Nature. 2021;592:737–746. doi: 10.1038/s41586-021-03451-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
Roach MJ, Schmidt SA, Borneman AR. Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC bioinformatics. 2018;19:460. doi: 10.1186/s12859-018-2485-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
Robinson GE, Hackett KJ, Purcell-Miramontes M, Brown SJ, Evans JD, Goldsmith MR, Lawson D, Okamuro J, Robertson HM, Schneider DJ. Creating a buzz about insect genomes. Science. 2011a;331:1386. doi: 10.1126/science.331.6023.1386. [DOI] [PubMed] [Google Scholar]
Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP. Integrative genomics viewer. Nature biotechnology. 2011b;29:24–26. doi: 10.1038/nbt.1754. [DOI] [PMC free article] [PubMed] [Google Scholar]
Roy S, Ernst J, Kharchenko PV, Kheradpour P, Negre N, Eaton ML, Landolin JM, Bristow CA, Ma L, Lin MF, Washietl S, Arshinoff BI, Ay F, Meyer PE, Robine N, Washington NL, Di Stefano L, Berezikov E, Brown CD, Candeias R, Carlson JW, Carr A, Jungreis I, Marbach D, Sealfon R, Tolstorukov MY, Will S, Alekseyenko AA, Artieri C, Booth BW, Brooks AN, Dai Q, Davis CA, Duff MO, Feng X, Gorchakov AA, Gu T, Henikoff JG, Kapranov P, Li R, MacAlpine HK, Malone J, Minoda A, Nordman J, Okamura K, Perry M, Powell SK, Riddle NC, Sakai A, Samsonova A, Sandler JE, Schwartz YB, Sher N, Spokony R, Sturgill D, van Baren M, Wan KH, Yang L, Yu C, Feingold E, Good P, Guyer M, Lowdon R, Ahmad K, Andrews J, Berger B, Brenner SE, Brent MR, Cherbas L, Elgin SC, Gingeras TR, Grossman R, Hoskins RA, Kaufman TC, Kent W, Kuroda MI, Orr-Weaver T, Perrimon N, Pirrotta V, Posakony JW, Ren B, Russell S, Cherbas P, Graveley BR, Lewis S, Micklem G, Oliver B, Park PJ, Celniker SE, Henikoff S, Karpen GH, Lai EC, MacAlpine DM, Stein LD, White KP, Kellis M, modENCODE Consortium Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science. 2010;330:1787–1797. doi: 10.1126/science.1198374. [DOI] [PMC free article] [PubMed] [Google Scholar]
Schneider C, Woehle C, Greve C, D'Haese CA, Wolf M, Hiller M, Janke A, Bálint M, Huettel B. Two high-quality de novo genomes from single ethanol-preserved specimens of tiny metazoans (Collembola) GigaScience. 2021;10:giab035. doi: 10.1093/gigascience/giab035. [DOI] [PMC free article] [PubMed] [Google Scholar]
Seppey M, Manni M, Zdobnov EM. In: Gene Prediction: Methods and Protocols. Kollmar M, editor. Springer; 2019. BUSCO: Assessing Genome Assembly and Annotation Completeness; pp. 227–245. [DOI] [PubMed] [Google Scholar]
Shafin K, Pesout T, Chang P-C, Nattestad M, Kolesnikov A, Goel S, Baid G, Eizenga JM, Miga KH, Carnevali P, Jain M, Carroll A, Paten B. Haplotype-aware variant calling enables high accuracy in Nanopore long-reads using deep neural networks. bioRxiv. 2021 doi: 10.1101/2021.03.04.433952. [DOI] [PMC free article] [PubMed]
Shumate A, Salzberg SL. Liftoff: accurate mapping of gene annotations. Bioinformatics. 2020;15:btaa1016. doi: 10.1093/bioinformatics/btaa1016. [DOI] [PMC free article] [PubMed] [Google Scholar]
Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome research. 2005;15:1034–1050. doi: 10.1101/gr.3715005. [DOI] [PMC free article] [PubMed] [Google Scholar]
Signor SA, New FN, Nuzhdin S. A Large Panel of Drosophila simulans Reveals an Abundance of Common Variants. Genome biology and evolution. 2018;10:189–206. doi: 10.1093/gbe/evx262. [DOI] [PMC free article] [PubMed] [Google Scholar]
Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31:3210–3212. doi: 10.1093/bioinformatics/btv351. [DOI] [PubMed] [Google Scholar]
Smit AFA, Hubley R, Green P. RepeatMasker Open-4.0. 2013 https://www.repeatmasker.org/
Solares EA, Chakraborty M, Miller DE, Kalsow S, Hall K, Perera AG, Emerson JJ, Hawley RS. Rapid Low-Cost Assembly of the Drosophila melanogaster Reference Genome Using Low-Coverage, Long-Read Sequencing . G3: Genes, Genomes, Genetics. 2018;8:3143–3154. doi: 10.1534/g3.118.200162. [DOI] [PMC free article] [PubMed] [Google Scholar]
Stark A, Lin MF, Kheradpour P, Pedersen JS, Parts L, Carlson JW, Crosby MA, Rasmussen MD, Roy S, Deoras AN, Ruby JG, Brennecke J, Hodges E, Hinrichs AS, Caspi A, Paten B, Park SW, Han MV, Maeder ML, Polansky BJ, Robson BE, Aerts S, van Helden J, Hassan B, Gilbert DG, Eastman DA, Rice M, Weir M, Hahn MW, Park Y, Dewey CN, Pachter L, Kent WJ, Haussler D, Lai EC, Bartel DP, Hannon GJ, Kaufman TC, Eisen MB, Clark AG, Smith D, Celniker SE, Gelbart WM, Kellis M, Harvard FlyBase curators. Berkeley Drosophila Genome Project Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature. 2007;450:219–232. doi: 10.1038/nature06340. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sturtevant AH, Novitski E. The homologies of the chromosome elements in the genus Drosophila. Genetics. 1941;26:517–541. doi: 10.1093/genetics/26.5.517. [DOI] [PMC free article] [PubMed] [Google Scholar]
Suvorov A, Kim BY, Wang J, Armstrong EE, Peede D, D’Agostino ERR, Price DK, Wadell P, Lang M, Courtier-Orgogozo V, David JR, Petrov D, Matute DR, Schrider DR, Comeault AA. Widespread introgression across a phylogeny of 155 Drosophila genomes. bioRxiv. 2021 doi: 10.1101/2020.12.14.422758. [DOI] [PMC free article] [PubMed]
Tarasov A, Vilella AJ, Cuppen E, Nijman IJ, Prins P. Sambamba: fast processing of NGS alignment formats. Bioinformatics. 2015;31:2032–2034. doi: 10.1093/bioinformatics/btv098. [DOI] [PMC free article] [PubMed] [Google Scholar]
Threlfall J, Blaxter M. Launching the tree of life gateway. Wellcome Open Research. 2021;6:125. doi: 10.12688/wellcomeopenres.16913.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
Toda MJ. bioinfo—Taxonomic information. 2020 https://bioinfo.museum.hokudai.ac.jp/db/index.php
Tyson J. Bead-Free Long Fragment LSK109 Library Preparation. 2020 doi: 10.17504/protocols.io.7euhjew. [DOI]
Vaser R, Sović I, Nagarajan N, Šikić M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome research. 2017;27:737–746. doi: 10.1101/gr.214270.116. [DOI] [PMC free article] [PubMed] [Google Scholar]
Vurture GW, Sedlazeck FJ, Nattestad M, Underwood CJ, Fang H, Gurtowski J, Schatz MC. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics. 2017;33:2202–2204. doi: 10.1093/bioinformatics/btx153. [DOI] [PMC free article] [PubMed] [Google Scholar]
Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, Cuomo CA, Zeng Q, Wortman J, Young SK, Earl AM. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLOS ONE. 2014;9:e112963. doi: 10.1371/journal.pone.0112963. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wang JR, Holt J, McMillan L, Jones CD. FMLRC: Hybrid long read error correction using an FM-index. BMC bioinformatics. 2018;19:50. doi: 10.1186/s12859-018-2051-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
Warren RL, Yang C, Vandervalk BP, Behsaz B, Lagman A, Jones SJ, Birol I. LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads. GigaScience. 2015;4:35. doi: 10.1186/s13742-015-0076-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
Waterhouse RM, Seppey M, Simão FA, Manni M, Ioannidis P, Klioutchnikov G, Kriventseva EV, Zdobnov EM. BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics. Molecular biology and evolution. 2018;35:543–548. doi: 10.1093/molbev/msx319. [DOI] [PMC free article] [PubMed] [Google Scholar]
Watson M, Warr A. Errors in long-read assemblies can critically affect protein prediction. Nature biotechnology. 2019;37:124–126. doi: 10.1038/s41587-018-0004-z. [DOI] [PubMed] [Google Scholar]
Wick RR, Holt KE. Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Research. 2020;8:1–22. doi: 10.12688/f1000research.21782.3. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yassin A. Phylogenetic classification of the Drosophilidae rondani (Diptera): the role of morphology in the postgenomic era. Systematic Entomology. 2013;38:349–364. doi: 10.1111/j.1365-3113.2012.00665.x. [DOI] [Google Scholar]
Yin J, Zhang C, Mirarab S. ASTRAL-MP: scaling ASTRAL to very large datasets using randomization and parallelization. Bioinformatics. 2019;35:3961–3969. doi: 10.1093/bioinformatics/btz211. [DOI] [PubMed] [Google Scholar]
Zdobnov EM, Tegenfeldt F, Kuznetsov D, Waterhouse RM, Simão FA, Ioannidis P, Seppey M, Loetscher A, Kriventseva EV. OrthoDB v9.1: cataloging evolutionary and functional annotations for animal, fungal, plant, archaeal, bacterial and viral orthologs. Nucleic acids research. 2017;45:D744–D749. doi: 10.1093/nar/gkw1119. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zoonomia Consortium. Genereux DP, Serres A, Armstrong J, Johnson J, Marinescu VD, Murén E, Juan D, Bejerano G, Casewell NR, Chemnick LG, Damas J, Di Palma F, Diekhans M, Fiddes IT, Garber M, Gladyshev VN, Goodman L, Haerty W, Houck ML. A comparative genomics multitool for scientific discovery and conservation. Nature. 2020;587:240–245. doi: 10.1038/s41586-020-2876-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

eLife. doi: 10.7554/eLife.66405.sa1

Decision letter

Editor: Graham Coop¹

Reviewed by: Timothy B Sackton²

Our editorial process produces two outputs: i) public reviews designed to be posted alongside the preprint for the benefit of readers; ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Acceptance summary:

Drosophila species have long served as an important model system for genetics and genomics. The authors have developed an important community resource of high standard genomes for many species across the Drosophila clade. This resource will serve to empower the next generation of Drosophila research and provides an important road map for similar efforts in other groups of organisms.

Decision letter after peer review:

Thank you for submitting your article "Highly contiguous assemblies of 101 drosophilid genomes" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Patricia Wittkopp as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Timothy B Sackton (Reviewer #2).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

The reviewers and we appreciated the impressive community resource that has been carefully brought together and that this acts as an important road map for other clade-wide sequencing efforts.

The strong points of consensus that emerge across the reviews, and in the discussion with the reviewers, were:

1) The need for base accuracy to be incorporated into the quality metrics discussed.

2) The need to place the work into a broad context of large-scale genome efforts to expand the readership outside of Drosophila researchers and show the expanding wave of community efforts focused on high-quality reference genomes. Suggestions included the vertebrate genome project and the earth biogenome project.

3) The reviewers all noted that acknowledgement of previous work in Drosophilidae was lacking. Drosophila researches have clearly been leading this charge for a while and it would be good to place the current effort into prior work more thoroughly.

4) While annotation of these genomes is beyond the scope of the current paper, it would be good to include more of a discussion of the planned road map for annotation going forward.

These points all should be addressed in a revised version. Below we have copied each review in its entirety. The reviewers have read each other's reviews and agree that all of the points are reasonable. Please provide a point by point response to the reviews, with a focus on the above broad goals.

Reviewer #1 (Recommendations for the authors):

– Data availability. I found all (or at least most) of the raw sequencing data. I couldn't find assembly accessions.

– Guidance for future sequencing work

A back of the envelope calculation based on typical core prices for PacBio Sequel II CLR sequencing and Illumina Novaseq PE150 reads suggest a price total price of $14.50 / GB (< $2,175 / 150GB) and $9 / GB (< $6,750 / 750 GB) respectively. To sample a 150MB genome at 100-fold depth (15GB) for both platforms would seem to imply a cost of ~$350 (comparable to the price cited by the authors) or less. Accepting lower depth would lead to concomittant decreases in price / genome. The average sequencing depth for this project appears from Table S1 to be 9.8GB for long reads and 12.8GB for short reads, implying that the PacBio approach might be a bit cheaper. I don't think a detailed price analysis is necessary (or even advisable), but communicating the fact that the authors' approach is one of at least two more or less equally viable approaches would be both valuable and accurate.

– Quality control metrics: sequencing error and sample polymorphism

A description of the consensus error rate of the assemblies would be an important piece of documentation serving two purposes. It permits users to quantify the amount of error they might expect from this particular resource. Relatedly, since many of these samples are conducted on strains for which near isogenic samples are difficult to acquire, measuring and reporting the heterozygosity would help guide users as to the extent of this property of the material from which the assembly was derived, especially if such users intend to use the reference strains described here for conducting genetic work.

– Context via comparison to existing resources

Existing resources for highly contiguous assemblies (operatively, contig N50 > 1MB)

– The total number of existing assemblies is already at least 75 with N50 >= 1MB (57 from NCBI[1], 13 from Miller[2], 4 from Comeault[3], 1 from Rezvykh[4]).

– Obviously, this includes a lot of within-species samples (especially for D. melanogaster, including many assemblies of the reference strain) and many assemblies that have been subsequently improved by the authors. However, the authors are also sampling the same species multiple times (including the D. melanogaster reference strain) to their total count, so this is at least consistent with their counting.

– Importantly, this resource triples unique species possessing highly contiguous assemblies from 34 to 102 and expands species group representation from 8 to 15. Although, the quinaria group, represented by D. innubila, already existed and wasn't sampled again here, they re-sample 7 of the previous 8 species groups and add 7 more.

– As far as I can tell, until now, the most distant relative of D. melanogaster within Drosophilidae with a highly contiguous genome was Scaptodrosophila lebanonensis, which is in the tribe Colocasiomyini. The manuscript adds two additional members of this tribe, Leucophenga varia and Chymomyza costata. So, the number of distant relatives is tripled, but the actual phylogenetic breadth isn't (if my understanding of Drosophilid taxonomy is correct, and it may not be)

Scholarship

The $1,000 Nanopore genome was cited by both [2] and [5].

"Future work to improve biological and taxonomic diversity, particularly for species difficult to culture, should employ single fly sequencing and assembly workflows (Adams et al., 2020)."

An earlier long read precedent can be found in [6].

In addition to enumerating other high-quality Drosophila genomes that already exist, it would be extremely useful to users to see a comparison of the quality of the resources when they have published descriptions (in order to guide authors as to what types of contiguity, completeness, and error they can expect, especially in comparison to this work). At the least, I think the authors should put their work in the context of best previous assemblies for each species (an operative definition of contig N50 > 1 Mb seems consistent with their own thinking), particularly when that work has been published. To be consistent with their own accounting, they might even consider addressing works like [7], and species that have experienced extensive high quality sequencing effort like D. obscura, D. simulans, and D. pseudoobscura. I have attached a table with entries corresponding to every species in Table S2 as well as additional species exceeding contig N50 of 1Mb, including citations and NCBI accession numbers when they could be found.

The authors have actually cited many of these assemblies in other work [8], including genomes that have yet to be published, so it would seem that the authors are aware of them and trust the quality enough to incorporate into their own work, so improving the scholarship should be straightforward.

Typo

– In the abstract. There are 93 species represented, not 95.

Reviewer #2 (Recommendations for the authors):

While this manuscript presents a large amount of valuable new data, and is inherently important for that reason alone, I believe that some key improvements and additional analyses could greatly strengthen this manuscript and really improve the value to the community.

1) Improvements to quality metrics.

This paper reports genomes that are at the low-cost end of the cost/quality tradeoff in genome assembly. This is a extremely valuable contribution, because many other large-scale projects in genomics (most notably, the Vertebrate Genome Project) have focused on the other end of this spectrum. Yet, for many researchers, a low-cost way to produce 10 genomes from related species may be higher value than a "complete" assembly from one species. However, it remains somewhat unclear exactly how good these genomes are, beyond the observation that the gene space is largely complete, and contig N50s are generally high.

Therefore, I think the biggest and most important improvement that would increase the reach and usefulness of this manuscript is improvements to quality metrics. A recent preprint from the Vertebrate Genome Project team (Rhie et al., 2020; https://www.biorxiv.org/content/10.1101/2020.05.22.110833v1.full) provides a number of potentially useful quality metrics that may be worth considering applying here, although of course not all will be relevant to this project, and I realize the computational burden of trying to do everything could be large. Nonetheless, I think it is crucial to be able to give some sense of consensus quality, as base-level errors in assemblies has negative effects on many downstream applications. Based on Koren et al., 2019, there appears to be a large drop in likelihood of disrupting a gene due to a indel error between QV30 and QV40, and Rhie et al., 2020 has a lot more detail on various aspects of consensus quality metrics. I realize that many existing tools, e.g. Merqury (also Rhie et al., 2020, in Genome Biology) make the assumption that Illumina data is available for the same individual as the genome assembly, which is not universally true here (even in approximation, e.g. treating a strain as an individual). Still, some attempt to tackle this problem seems necessary, even if it cannot be done perfectly.

2) Assembly content

Related to the first point, basic descriptions of genome size (e.g., estimated from k-mers) would help to contextualize the resource produced, as would a definition of "near chromosome level contiguity" and validation of which of the newly reported assemblies here reach that threshold (especially as contig N50s vary by several orders of magnitude). Again, I don't think the VGP definitions are the only possible way to approach this question, but there is value in having some systematic summary of overall contiguity.

This paper does not describe the extent to which heterogametic sex chromosomes (not expected in all species based on sampling) or mitochondrial genomes are recovered. Presumably at least the presence of mt scaffolds is picked up in the NCBI screens, and is the kind of information that would be relatively straightforward to add to a table.

3) Drosophila focus.

There is a tension throughout this manuscript between describing a basically Drosophila-specific resource, and describing a more generally applicable approach to low-cost, clade-wide assembly. I think that the latter is really necessary and important, since not every community or group has the resources (in money, samples, compute) or desire (for their scientific questions) to use a "VGP-style" approach (with long reads, short reads, HiC, and optical mapping to produce as close to error-free chromosomal assemblies as possible). But the value of this manuscript as a blueprint for low-cost community genomics is somewhat limited by the Drosophila-centric nature of the results.

The most obvious Drosophila-specific assumptions are the availability of a inbred strain, and a genome size of 100-250 Mb or so, with a few exceptions. Notably, the assemblies of the larger genomes (and the ones derived from wild-caught flies) tend to be worse, with lower contig N50s and auN metrics, more contigs, and more fragmented or missing BUSCOs.

Of course it would be well beyond the scope of this manuscript to attempt to validate any of these approaches in other clades, or provide a simple recipe for how to assembly any possible genome. Nonetheless, it would certainly be possible to broaden the discussion, and be clearer in the text when certain statements are Drosophila specific (e.g., the $350 in sequencing costs assumes a genome on the order of 100-200 Mb).

4) Limitations of the existing resource and future prospects for improvement:

The genomes presented here do not include annotations, or any other form of supplemental resource such as whole genome alignments (e.g. Armstrong et al., 2020), and don't use HiC or any kind of scaffolding to obtain true chromosomal scaffolds. I think these are understandable and defensible choices, given the computational and technical requirements to extend this work in those directions. However, it may be valuable to discuss more explicitly what this resource is and is not. At a minimum, doing so could prevent the corresponding authors from receiving many emails asking where the gene annotations for species X are once this work is published.

Reviewer #3 (Recommendations for the authors):

This is really an impressive resource and has the potential to be widely used both in terms of the data itself and also the methodology. I have several suggestions that may improve the manuscript.

There are several species (willistoni, paulistorum, etc.) that are sequenced more than once without any reference to why. It might be useful to describe that somewhere (Table 1?). As a resource, it would be simpler to use if it was clear when one isolate per species vs. multiple were appropriate to use for analyses.

Lines 59-61 – a bit more detail about modifications here since you are writing methods last.

Lines 107-111 – I'm concerned about the conclusions drawn about repeat content based upon the way the data was analyzed. A comprehensive analysis of repeat content is probably beyond the scope of this manuscript, but without de novo characterization of repeat sequences, I'm worried that satellites and TEs in more distantly related species may be missed if they are lineage restricted. I believe TRF would get at this to a point, but maybe not robustly. Other software like RepeatModeler might be better. However, my suggestion is not that these genomes are individually de novo annotated for repeats. It is just that the conclusions about relationships between contiguity or genome size and repeat content are presented with more caveats.

Figure 2 – This figure is a bit difficult to intuit, so this is another place where more detail in the main text would be useful.

Figure 3 – though the authors argue that this tree is not meant as a robust measure of phylogenetic relationships, it would be nice to put some support values on the tree.

Line 336 – I think the bioawk people would appreciate a citation (though all I can find on the internet is to cite the github page).

Line 340 – Please describe the auN statistic (what is L and what are we summing over?) in more detail. The Li github page describes nicely.

eLife. 2021 Jul 19;10:e66405. doi: 10.7554/eLife.66405.sa2

Author response

Essential revisions:

The reviewers and we appreciated the impressive community resource that has been carefully brought together and that this acts as an important road map for other clade-wide sequencing efforts.

The strong points of consensus that emerge across the reviews, and in the discussion with the reviewers, were:

1) The need for base accuracy to be incorporated into the quality metrics discussed.

The revised manuscript now contains several new sections with evaluations of sample diversity and base quality. In the text, see the new “Estimates of sample diversity”, “Estimates of sequence quality”, and “Nanopore-based assemblies are highly accurate in coding regions” sections in the Results and Discussion and their accompanying sections in the Methods. The figures associated with these new sections are: Figure 2 along with Figure 2—figure supplement 1, Figure 3 along with Figure 3—figure supplements 1-2, and Supplementary Files 3-5. As with the rest of the manuscript, sample scripts and any raw data underlying the figures is provided in the GitHub repository. We will briefly summarize our new results below.

As the reviewers recognized, a big challenge in conducting quality evaluation properly was the variation in usable data for each assembly. As a brief summary, Illumina data were generated for most strains, downloaded from NCBI for other strains, and for a small number of strains we did not have Illumina reads (due to strain extinction/culling in 2020). A limited number of these genomes had reference assemblies of the same strain. This limited the usefulness of both reference-free and reference-based methods to specific assemblies, for example, we could not effectively utilize Merqury on genomes without short reads. Because of this, we chose to utilize several complementary strategies to evaluate both sample diversity and accuracy using long reads, short reads, and reference assemblies so that: (1) each genome had at least one associated measurement of sample diversity and quality and (2) a broad enough evaluation was performed with each method such that readers could infer a sense of quality even if some data were missing.

Briefly, we first performed variant calling using long reads and short reads separately, to both get a sense of the heterozygosity in each line and to get a count of non-reference variants to estimate the error rate. The number of non-reference variants divided by the number of callable sites was used as a proxy for the error rate similar to how error rates were estimated by Solares et al., (2018) and Koren et al., (2017). Reference-free estimates of sequence quality were computed with Merqury, using any short read data used for the assembly (when available). We find that overall, Nanopore-based assemblies do not seem to meet the often-cited QV40 “reference quality” threshold, but this is not unexpected as the effectiveness of short-read mapping and polishing will vary greatly across the genome.

We therefore followed up on genome-wide quality assessments with reference-based comparisons for the few genomes where we assembled the same line as the NCBI reference genomes. These comparisons allowed us to directly assess how our assemblies differed from the reference genomes in specific genomic regions (CDS, introns, repeats, etc.).

These reference-based comparisons showed our assemblies to indeed be highly accurate in coding regions, with a few important subtleties to note. At first glance, the mean consensus quality for coding sequences is concerningly low, ranging between QV20-QV30 even for D. melanogaster. However, when assessing quality in genomic windows or gene by gene, it quickly becomes clear that most coding sequences in the Nanopore assemblies are identical to the reference genomes.

After seeing this pattern, we thought to perform a manual assessment of every indel difference between our D. melanogaster genome and the reference, bringing in other long-read assemblies of the reference strain to help determine whether these large indels might be real or artifacts of our assembly. We found that nearly all (99.7%) protein-coding genes in our D. melanogaster genome were identical to the reference in coding sequences, and out of the total coding sequence differences between the two genomes, 99% of “erroneous” bases were explained by large indels that were also supported by the other long-read assemblies.

In the “Next Steps” section, we point out these issues and recommend that future work add lower-coverage high-fidelity sequencing to improve genome assemblies outside of complex coding sequences.

2) The need to place the work into a broad context of large-scale genome efforts to expand the readership outside of Drosophila researchers and show the expanding wave of community efforts focused on high-quality reference genomes. Suggestions included the vertebrate genome project and the earth biogenome project.

We have carefully re-examined our manuscript based on the reviewer comments and agree that too many relevant background details were omitted for the sake of brevity. Significant changes have been applied to the Introduction and Results and Discussion to address this shortcoming.

The Introduction is now started by citing the growing wave of large-scale genome assembly projects, to show the timeliness of the work. We also clarify why a cheap approach to high-quality genome assembly is valuable: time and cost are still big limitations to generating clade-scale genome datasets if one is not a part of a well-funded consortium. Ultra-long reads play a key role in constructing high-quality genomes at low cost. We hope this will provide readers with a better sense of the timeliness and significance of this work.

In addition, we have also revised the Next Steps section with better descriptions of forthcoming additions to this resource (annotations and whole-genome alignments). Many of the other large genome assembly projects are building (or have built) similar resources; we hope to both let the reader know these resources are forthcoming but also that we are utilizing modern tools to do so. Specifics are provided below (Point 4).

3) The reviewers all noted that acknowledgement of previous work in Drosophilidae was lacking. Drosophila researches have clearly been leading this charge for a while and it would be good to place the current effort into prior work more thoroughly.

This is an important point and we certainly did not mean to downplay the important role that Drosophilia labs have played in modern genomics. We have added both historical context and more background on the current state of Drosophila genomes to address this shortcoming. The revised Introduction emphasizes the important roles of D. melanogaster, the 12 Drosophila genomes, and the modENCODE project. The quantity and quality of drosophilid genome assemblies that were available prior to this work are more clearly discussed.

Our original plan for adding background on existing drosophilid genome assemblies was to include a new supplementary table listing the genome assemblies currently available on NCBI alongside some quality comparisons. However, a recently released (in the time since this manuscript was submitted) review of arthropod genome assemblies by Hotaling et al., (2021) provides exactly this: lists of genomes, accessions, the technology used to build the assembly, and contiguity/completeness statistics. Seventy-six drosophilid genomes for 75 species (two D. pseudoobscura subspecies) are listed in a file provided at the following link:

https://github.com/pbfrandsen/insect_genome_assemblies

Instead of essentially duplicating this table for this manuscript, we decided to point the reader to Hotaling et al. We compare the contiguity and completeness of our genomes to the other drosophilid assemblies and show that, while not fully chromosome-level, our genomes are comparable to many of the best reference genomes (Figure 1—figure supplement 1).

4) While annotation of these genomes is beyond the scope of the current paper, it would be good to include more of a discussion of the planned road map for annotation going forward.

Related to the last point from Point 2, we have revised the “Next Steps” section to clearly state what kinds of resources we plan to release in the near future.

Briefly, our plan follows the well-established workflows of the UCSC Comparative Genomics toolkits, similar to some recently published genome consortium studies (e.g. Genereux et al., 2020 and Feng et al., 2020). We have a whole-genome alignment with ~170 genomes, built with the Progressive Cactus aligner, in hand and plan to release it shortly. The alignment will greatly facilitate other kinds of analyses. Using the HALtools and PHAST software, we have prepared pipelines for generating PhastCons, PhyloP, and GERP tracks. We have also set up LiftOff (same species) and Comparative Annotation Toolkit (different species) based pipelines as a first pass at gene annotation. In the select cases where clades do not contain a well-annotated representative (e.g. Zaprionus), we plan to generate RNA-seq data for gene annotation.

While this annotation strategy was chosen because we do not have the centralization, personnel, or funding of the large genome consortia, the plan is to make these resources easily accessible for anyone wishing to utilize them or to perform their own iterative improvements such as annotation for specific taxa. Like this study, we will release these resources along with carefully documented workflows.

These points all should be addressed in a revised version. Below we have copied each review in its entirety. The reviewers have read each other's reviews and agree that all of the points are reasonable. Please provide a point by point response to the reviews, with a focus on the above broad goals.

Reviewer #1 (Recommendations for the authors):

– Data availability. I found all (or at least most) of the raw sequencing data. I couldn't find assembly accessions.

The assemblies should now be available on NCBI. The assemblies were submitted to NCBI before the initial draft of this paper was submitted. We apologize for the delay; it seems COVID has significantly delayed the genome submission process.

– Guidance for future sequencing work

A back of the envelope calculation based on typical core prices for PacBio Sequel II CLR sequencing and Illumina Novaseq PE150 reads suggest a price total price of $14.50 / GB (< $2,175 / 150GB) and $9 / GB (< $6,750 / 750 GB) respectively. To sample a 150MB genome at 100-fold depth (15GB) for both platforms would seem to imply a cost of ~$350 (comparable to the price cited by the authors) or less. Accepting lower depth would lead to concomitant decreases in price / genome. The average sequencing depth for this project appears from Table S1 to be 9.8GB for long reads and 12.8GB for short reads, implying that the PacBio approach might be a bit cheaper. I don't think a detailed price analysis is necessary (or even advisable), but communicating the fact that the authors' approach is one of at least two more or less equally viable approaches would be both valuable and accurate.

This is a great point. Although this manuscript and the accompanying methods are very specific to the Nanopore and Illumina hybrid approach, there are certainly other equally viable approaches to be considered if one were planning a similar project today. Long-read sequencing with PromethION or PacBio CLR would almost certainly result in a lower $/GB cost with comparable read lengths. Assembly quality would also benefit greatly from lower coverage of higher-accuracy long reads like PacBio HiFi.

We are hesitant to make a specific recommendation given there have been significant (one might say game-changing) improvements to Nanopore protocols, computational tools, and new flow cell types in the last year, with more promised to come soon. The best practices also depend on sample type, quality, and other factors that can vary from experiment to experiment. There are a few new developments available today that are worth mentioning. It is not uncommon to see Nanopore runs with half the data contained in reads 100kb or longer, and new kits are reported to have substantially increased (reportedly doubled) the throughput of Nanopore flow cells. Even Nanopore-only assemblies are far more accurate than before simply due to improved basecalling methods. For example, in internal testing of Nanopore-only assemblies, we have seen genome-wide consensus accuracy increase by more than 2-fold simply by updating the basecaller. Still, homopolymer indels and regions of poor short read mapping remain problematic in the final assembly.

Because of this, we think it is important (if possible) to start looking past the Nanopore+Illumina approach and at alternative strategies. A Nanopore ultra-long read and lower-coverage high-accuracy long read (e.g. HiFi) approach will be far more effective at improving accuracy outside of coding regions, and seems to be gaining traction in the literature. Combining ultra-long and high-fidelity long reads could be a similarly affordable and more future-proof alternative to Nanopore+Illumina.

We have added this discussion to the “Next Steps” section and hope it is useful to readers of the manuscript that are planning their own assembly project today.

– Quality control metrics: sequencing error and sample polymorphism

A description of the consensus error rate of the assemblies would be an important piece of documentation serving two purposes. It permits users to quantify the amount of error they might expect from this particular resource. Relatedly, since many of these samples are conducted on strains for which near isogenic samples are difficult to acquire, measuring and reporting the heterozygosity would help guide users as to the extent of this property of the material from which the assembly was derived, especially if such users intend to use the reference strains described here for conducting genetic work.

We have added measures of sample diversity and sequence quality to the manuscript; please see Points 1 and 2 above for specific details.

– Context via comparison to existing resources

Existing resources for highly contiguous assemblies (operatively, contig N50 > 1MB)

– The total number of existing assemblies is already at least 75 with N50 >= 1MB (57 from NCBI[1], 13 from Miller[2], 4 from Comeault[3], 1 from Rezvykh[4]).

– Obviously, this includes a lot of within-species samples (especially for D. melanogaster, including many assemblies of the reference strain) and many assemblies that have been subsequently improved by the authors. However, the authors are also sampling the same species multiple times (including the D. melanogaster reference strain) to their total count, so this is at least consistent with their counting.

– Importantly, this resource triples unique species possessing highly contiguous assemblies from 34 to 102 and expands species group representation from 8 to 15. Although, the quinaria group, represented by D. innubila, already existed and wasn't sampled again here, they re-sample 7 of the previous 8 species groups and add 7 more.

– As far as I can tell, until now, the most distant relative of D. melanogaster within Drosophilidae with a highly contiguous genome was Scaptodrosophila lebanonensis, which is in the tribe Colocasiomyini. The manuscript adds two additional members of this tribe, Leucophenga varia and Chymomyza costata. So, the number of distant relatives is tripled, but the actual phylogenetic breadth isn't (if my understanding of Drosophilid taxonomy is correct, and it may not be).

As described in Point 3 above, we added comparisons to genome assemblies on NCBI, using the summaries provided in the very recent review of long-read insect genomes by Hotaling et al., 2021. As there are several genome versions for several species, of different or the same strain, and because we have re-analyzed or added to data previously used by Miller et al., 2018 and Comeault et al., 2020 (the earlier versions are also not on NCBI), we opted to perform genome comparisons using only the representative genome on NCBI. This also made it easier since we directly utilize Hotaling et al.’s table for any comparisons.

Citations to other studies generating high-quality drosophilid genomes, including the references provided by the reviewer, have been added to the Introduction. We also substantially revised the “Taxon Sampling” and the “Next Steps” sections to better reflect how our genomes fit into the phylogeny relative to the ones that exist: that is, that genomes we assembled were mostly from already well-studied groups and that subfamily Steganinae (along with several other clades) remains poorly sampled.

Scholarship

The $1,000 Nanopore genome was cited by both [2] and [5].

"Future work to improve biological and taxonomic diversity, particularly for species difficult to culture, should employ single fly sequencing and assembly workflows (Adams et al., 2020)."

An earlier long read precedent can be found in [6].

References to these studies have been added in the appropriate locations.

In addition to enumerating other high-quality Drosophila genomes that already exist, it would be extremely useful to users to see a comparison of the quality of the resources when they have published descriptions (in order to guide authors as to what types of contiguity, completeness, and error they can expect, especially in comparison to this work). At the least, I think the authors should put their work in the context of best previous assemblies for each species (an operative definition of contig N50 > 1 Mb seems consistent with their own thinking), particularly when that work has been published. To be consistent with their own accounting, they might even consider addressing works like [7], and species that have experienced extensive high quality sequencing effort like D. obscura, D. simulans, and D. pseudoobscura. I have attached a table with entries corresponding to every species in Table S2 as well as additional species exceeding contig N50 of 1Mb, including citations and NCBI accession numbers when they could be found.

The authors have actually cited many of these assemblies in other work [8], including genomes that have yet to be published, so it would seem that the authors are aware of them and trust the quality enough to incorporate into their own work, so improving the scholarship should be straightforward.

As discussed in the Essential revisions, we now compare the contiguity and completeness of Nanopore assemblies to representative genomes currently available on NCBI. Additionally, we conduct reference-based quality assessments for Nanopore genomes that have an NCBI RefSeq counterpart of the same strain. This should give the reader a better sense of how our Nanopore-based assemblies stack up against the highest-quality reference genomes. Importantly, we find coding sequences in Nanopore assemblies to be nearly identical to NCBI RefSeq genomes.

Typo

– In the abstract. There are 93 species represented, not 95.

We initially counted genomes of subspecies (D. malerkotliana malerkotliana and D. malerkotliana pallens, D. pseudoananassae pseudoananassae and D. pseudoananassae nigrens) as separate species. This was incorrect and the number has been revised to 93.

Reviewer #2 (Recommendations for the authors):

While this manuscript presents a large amount of valuable new data, and is inherently important for that reason alone, I believe that some key improvements and additional analyses could greatly strengthen this manuscript and really improve the value to the community.

1) Improvements to quality metrics.

This paper reports genomes that are at the low-cost end of the cost/quality tradeoff in genome assembly. This is an extremely valuable contribution, because many other large-scale projects in genomics (most notably, the Vertebrate Genome Project) have focused on the other end of this spectrum. Yet, for many researchers, a low-cost way to produce 10 genomes from related species may be higher value than a "complete" assembly from one species. However, it remains somewhat unclear exactly how good these genomes are, beyond the observation that the gene space is largely complete, and contig N50s are generally high.

Therefore, I think the biggest and most important improvement that would increase the reach and usefulness of this manuscript is improvements to quality metrics. A recent preprint from the Vertebrate Genome Project team (Rhie et al., 2020; https://www.biorxiv.org/content/10.1101/2020.05.22.110833v1.full) provides a number of potentially useful quality metrics that may be worth considering applying here, although of course not all will be relevant to this project, and I realize the computational burden of trying to do everything could be large. Nonetheless, I think it is crucial to be able to give some sense of consensus quality, as base-level errors in assemblies has negative effects on many downstream applications. Based on Koren et al., 2019, there appears to be a large drop in likelihood of disrupting a gene due to a indel error between QV30 and QV40, and Rhie et al., 2020 has a lot more detail on various aspects of consensus quality metrics. I realize that many existing tools, e.g. Merqury (also Rhie et al., 2020, in Genome Biology) make the assumption that Illumina data is available for the same individual as the genome assembly, which is not universally true here (even in approximation, e.g. treating a strain as an individual). Still, some attempt to tackle this problem seems necessary, even if it cannot be done perfectly.

Briefly, we now include both reference-free and reference-based (if possible) quality assessments of our genomes in the revised text. We found the latter to be highly informative given the non-uniform distribution of errors in the Drosophila genomes. Although overall sequence quality often appears to be quite far from the QV30-40 (i.e. reference quality) level, we show that most coding sequences have reference-quality accuracy and most protein-coding genes are unaffected by indel errors. Moreover, large indels, many of which could be real, have a disproportionate effect on the reference-based error rate, leading to large underestimates of true sequence quality.

2) Assembly content

Related to the first point, basic descriptions of genome size (e.g., estimated from k-mers) would help to contextualize the resource produced, as would a definition of "near chromosome level contiguity" and validation of which of the newly reported assemblies here reach that threshold (especially as contig N50s vary by several orders of magnitude). Again, I don't think the VGP definitions are the only possible way to approach this question, but there is value in having some systematic summary of overall contiguity.

We now provide genome size estimates and genome quality measures (NGx curves, NG50) in Figure 1—figure supplement 3 and Supplementary File 2. Similar to the quality assessment, we ran into issues trying to estimate genome size from k-mers. Not all samples had short reads plus samples with high diversity and/or low coverage had unusual k-mer count histograms leading to odd genome size estimates. Although we have long read data for every genome, the noisiness of Nanopore read (5-15% error rates) precludes k-mer based estimation of genome size.

To get around these issues, we estimated genome size by mapping Nanopore reads back to each assembly and looking at depth of coverage over single-copy BUSCO genes. Our reasoning here was that both noisy long reads and short reads should be mappable to the reference genome, and that single-copy BUSCO genes could safely be assumed to be reliably mappable single-copy regions in each genome. Assuming this, genome size can be roughly estimated from the depth of coverage and the summed length of all Nanopore reads.

These estimates of genome size turned out to be surprisingly consistent with the size of the assemblies. In most cases they were slightly larger than the assembly size. Of course there are caveats here, the most significant one being how we handled datasets with lots of apparent contamination (e.g., the assemblies from wild-collected Hawaiian Drosophila). Here we assumed uniform coverage over all contigs and adjusted the genome size estimate by the proportion of the assembly identified as contaminant sequence.

This paper does not describe the extent to which heterogametic sex chromosomes (not expected in all species based on sampling) or mitochondrial genomes are recovered. Presumably at least the presence of mt scaffolds is picked up in the NCBI screens, and is the kind of information that would be relatively straightforward to add to a table.

Admittedly we did not adopt a rigorous way to identify either sex chromosomes and mitochondrial genomes/contigs from the beginning, although we certainly should have the data to do so for most samples. Another complication is that high-coverage contigs were tagged for removal when running purge_haplotigs, so it is possible that many mitochondrial sequences were removed. This was roughly consistent with the NCBI contamination screen as mtDNA did not show up for many of the genomes. Unfortunately, we have not retained the NCBI contamination screen files, so the information on mtDNA contigs is not as straightforward to recover. We attempted to run the contamination screen through Docker (https://github.com/NCBI-Hackathons/ContamFilter) but there is an issue with the NCBI databases the Docker image points to and we could not get it to run.

Despite these shortsighted initial decisions, we expect this information to be eventually released. At least one person we know of is started working on identifying the sex chromosomes after we released the data. This is not a formal collaboration so we do not have a timeline for this work. We are also planning to run the VGP mitochondrial assembly pipeline (mitoVGP) with our sequences and attempt to properly assemble the mitochondrial genomes. This is not ready to go into this manuscript, but we plan to include it in a future update.

3) Drosophila focus.

There is a tension throughout this manuscript between describing a basically Drosophila-specific resource, and describing a more generally applicable approach to low-cost, clade-wide assembly. I think that the latter is really necessary and important, since not every community or group has the resources (in money, samples, compute) or desire (for their scientific questions) to use a "VGP-style" approach (with long reads, short reads, HiC, and optical mapping to produce as close to error-free chromosomal assemblies as possible). But, the value of this manuscript as a blueprint for low-cost community genomics is somewhat limited by the Drosophila-centric nature of the results.

The most obvious Drosophila-specific assumptions are the availability of a inbred strain, and a genome size of 100-250 Mb or so, with a few exceptions. Notably, the assemblies of the larger genomes (and the ones derived from wild-caught flies) tend to be worse, with lower contig N50s and auN metrics, more contigs, and more fragmented or missing BUSCOs.

Of course it would be well beyond the scope of this manuscript to attempt to validate any of these approaches in other clades, or provide a simple recipe for how to assembly any possible genome. Nonetheless, it would certainly be possible to broaden the discussion, and be clearer in the text when certain statements are Drosophila specific (e.g., the $350 in sequencing costs assumes a genome on the order of 100-200 Mb).

In the revised manuscript, we have tried to make it clearer that the $350 assembly is assuming Drosophila-specific parameters, both in terms of genome size and the amount of genomic DNA one might start the library prep process with. We have also improved the “Next Steps” to improve the discussion around sequencing species that cannot easily be reared in the lab or are only available as a few ethanol-preserved specimens. While the text remains somewhat specific to drosophilids, the discussion of the challenging samples we face and plans to address these challenges should have broader relevance to those not sequencing drosophilids.

4) Limitations of the existing resource and future prospects for improvement:

The genomes presented here do not include annotations, or any other form of supplemental resource such as whole genome alignments (e.g. Armstrong et al., 2020), and don't use HiC or any kind of scaffolding to obtain true chromosomal scaffolds. I think these are understandable and defensible choices, given the computational and technical requirements to extend this work in those directions. However, it may be valuable to discuss more explicitly what this resource is and is not. At a minimum, doing so could prevent the corresponding authors from receiving many emails asking where the gene annotations for species X are once this work is published.

This is mostly addressed in our reply to Point 4, but to add a few specific comments here: we thought this was a very important point and have addressed it in the updated manuscript. We revised the “Next Steps” section to state clearly that whole genome alignments are forthcoming, and that whole genome alignments will allow us to start the annotation process in a logistically efficient manner.

Reviewer #3 (Recommendations for the authors):

This is really an impressive resource and has the potential to be widely used both in terms of the data itself and also the methodology. I have several suggestions that may improve the manuscript.

There are several species (willistoni, paulistorum, etc.) that are sequenced more than once without any reference to why. It might be useful to describe that somewhere (Table 1?). As a resource, it would be simpler to use if it was clear when one isolate per species vs. multiple were appropriate to use for analyses.

To be completely forthcoming, we ended up performing multiple assemblies for a few species without specifically intending to do so. Two D. immigrans strains were sequenced separately due to a miscommunication; Z. tsacasi car7-4 was originally misidentified as Z. tuberculatus; and the additional D. willistoni line (14030-0811.17) was originally sequenced for a grad student project in planned in South America, but this plan was disrupted by the pandemic. Separate lines of D. teissieri, D. paulistorum, and Z. indianus were sequenced because the lines appear to be reproductively isolated (observed by D. Matute).

We have tried to address any confusion the reader might experience by denoting a preferred assembly with an asterisk in Table 1 of the revised manuscript. Clarifying text regarding preferred assemblies has also been added to “Taxon Sampling.” Preferred assemblies were selected on the basis of contiguity, completeness, and whether Illumina data from the same sample was used for the assembly.

Lines 59-61 – a bit more detail about modifications here since you are writing methods last.

Added text clarifies the protocol modifications that improve the yield of ultra-long reads while improving library prep and sequencing efficiency (hence reducing costs).

Lines 107-111 – I'm concerned about the conclusions drawn about repeat content based upon the way the data was analyzed. A comprehensive analysis of repeat content is probably beyond the scope of this manuscript, but without de novo characterization of repeat sequences, I'm worried that satellites and TEs in more distantly related species may be missed if they are lineage restricted. I believe TRF would get at this to a point, but maybe not robustly. Other software like RepeatModeler might be better. However, my suggestion is not that these genomes are individually de novo annotated for repeats. It is just that the conclusions about relationships between contiguity or genome size and repeat content are presented with more caveats.

This is a good point. As mentioned, we would like other groups with more expertise in analyzing repeat content to be able to freely work on these data as that is not the focus of this work. We have added text to the “Repeat content” section to indicate that these results are conditional on what is currently in the repeat databases and that this resource is intended to be a starting point to improve repeat databases rather than a comprehensive characterization of repeats in drosophilid genomes.

Figure 2 – This figure is a bit difficult to intuit, so this is another place where more detail in the main text would be useful.

The main text has been revised to specifically state that the graph layout method is being used to create a map where BUSCOs that are connected in the assemblies will cluster together.

Figure 3 – though the authors argue that this tree is not meant as a robust measure of phylogenetic relationships, it would be nice to put some support values on the tree.

We have added local posterior probabilities (as reported by ASTRAL) to provide a measure of support for each node.

Line 336 – I think the bioawk people would appreciate a citation (though all I can find on the internet is to cite the github page).

This is our mistake, a citation to this tool should have been included. In the revised version of the manuscript, a citation to the GitHub page is included.

Line 340 – Please describe the auN statistic (what is L and what are we summing over?) in more detail. The Li github page describes nicely.

Based on Reviewer 2’s comments, we decided to de-emphasize auN as the main summary statistic for contiguity as most readers will have an immediate intuitive understanding of N50/NG50 but not auN. N50 and NG50 are now presented in Figure 1 and Figure 1—figure supplement 3, respectively.

Nevertheless, we still think auN provides the best comparison of contiguity between assemblies and plan to utilize it in future work, for instance, we plan to update these genomes with improved Nanopore basecalling algorithms. We have thus clarified the calculation of this statistic further in the Methods and still provide auN in Supplementary File 2.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Citations

Kim BY, Wang JR. 2020. Nanopore-based assembly of many drosophilid genomes. NCBI BioProject. PRJNA675888
Miller DE. 2018. Sequencing and assembly of 14 Drosophila species. NCBI BioProject. PRJNA427774
The Drosophila modENCODE Project 2011. modENCODE Drosophila reference genome sequencing (fruit flies) NCBI BioProject. 63477
Yang H. 2018. DNA-seq of sexed Drosophila grimshawi, Drosophila silvestris, and Drosophila heteroneura. NCBI BioProject. PRJNA484408
Bronski M. 2019. Drosophila montium Species Group Genomes Project. NCBI BioProject. PRJNA554346
Rane R. 2018. Invertebrate sample from Drosophila repleta. NCBI BioProject. 476692
Turissini D. 2017. Fly lines. NCBI BioProject. 395473
National Institute of Genetics [Japan] 2016. Genome sequences of 10 Drosophila species. NCBI BioProject. PRJDB4817
Ellison C. 2019. Raw genomic sequencing data from 16 Drosophila species. NCBI BioProject. PRJNA550077

Supplementary Materials

Supplementary file 1. Detailed information on both long-read and short-read data used for this project, including accession numbers if publicly available data were used for assembly.

elife-66405-supp1.xlsx^{(16.5KB, xlsx)}

Supplementary file 2. Assembly summary statistics and genome size estimates.

elife-66405-supp2.xlsx^{(27.2KB, xlsx)}

Supplementary file 3. Counts of SNPs, indels, and per-site heterozygosity estimated from both long reads and short reads.

elife-66405-supp3.xlsx^{(25.5KB, xlsx)}

Supplementary file 4. Consensus quality scores estimated with reference-free and reference-based methods.

elife-66405-supp4.xlsx^{(23.3KB, xlsx)}

Supplementary file 5. Characterization of all coding sequence indel differences between Nanopore and Release six reference D. melanogaster assemblies.

elife-66405-supp5.xlsx^{(11.7KB, xlsx)}

Supplementary file 6. Detailed sample information.

elife-66405-supp6.xlsx^{(33.6KB, xlsx)}

Transparent reporting form

elife-66405-transrepform.pdf^{(201.4KB, pdf)}

Data Availability Statement

The following dataset was generated:

Kim BY, Wang JR. 2020. Nanopore-based assembly of many drosophilid genomes. NCBI BioProject. PRJNA675888

The following previously published datasets were used:

Miller DE. 2018. Sequencing and assembly of 14 Drosophila species. NCBI BioProject. PRJNA427774

The Drosophila modENCODE Project 2011. modENCODE Drosophila reference genome sequencing (fruit flies) NCBI BioProject. 63477

Yang H. 2018. DNA-seq of sexed Drosophila grimshawi, Drosophila silvestris, and Drosophila heteroneura. NCBI BioProject. PRJNA484408

Bronski M. 2019. Drosophila montium Species Group Genomes Project. NCBI BioProject. PRJNA554346

Rane R. 2018. Invertebrate sample from Drosophila repleta. NCBI BioProject. 476692

Turissini D. 2017. Fly lines. NCBI BioProject. 395473

National Institute of Genetics [Japan] 2016. Genome sequences of 10 Drosophila species. NCBI BioProject. PRJDB4817

Ellison C. 2019. Raw genomic sequencing data from 16 Drosophila species. NCBI BioProject. PRJNA550077

[bib2] Adams M, McBroome J, Maurer N, Pepper-Tunick E, Saremi NF, Green RE, Vollmers C, Corbett-Detig RB. One fly–one genome: chromosome-scale genome assembly of a single outbred Drosophila melanogaster. Nucleic Acids Research. 2020;356:450. doi: 10.1093/nar/gkaa450. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib3] Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of Molecular Biology. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2. [DOI] [PubMed] [Google Scholar]

[bib4] Armstrong J, Hickey G, Diekhans M, Fiddes IT, Novak AM, Deran A, Fang Q, Xie D, Feng S, Stiller J, Genereux D, Johnson J, Marinescu VD, Alföldi J, Harris RS, Lindblad-Toh K, Haussler D, Karlsson E, Jarvis ED, Zhang G, Paten B. Progressive Cactus is a multiple-genome aligner for the thousand-genome era. Nature. 2020;587:246–251. doi: 10.1038/s41586-020-2871-y. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib5] Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology : a journal of computational molecular cell biology. 2012;19:455–477. doi: 10.1089/cmb.2012.0021. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib6] Bao W, Kojima KK, Kohany O. Repbase update, a database of repetitive elements in eukaryotic genomes. Mobile DNA. 2015;6:11. doi: 10.1186/s13100-015-0041-9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib7] Baym M, Kryazhimskiy S, Lieberman TD, Chung H, Desai MM, Kishony R. Inexpensive multiplexed library preparation for megabase-sized genomes. PLOS ONE. 2015;10:e0128036. doi: 10.1371/journal.pone.0128036. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib8] Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids research. 1999;27:573–580. doi: 10.1093/nar/27.2.573. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib9] Bininda-Emonds OR. transAlign: using amino acids to facilitate the multiple alignment of protein-coding DNA sequences. BMC bioinformatics. 2005;6:156. doi: 10.1186/1471-2105-6-156. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib10] Böhm JN, Berens P, Kobak D. A unifying perspective on neighbor embeddings along the Attraction-Repulsion spectrum. arXiv. 2020 http://arxiv.org/abs/2007.08902

[bib11] Bracewell R, Chatla K, Nalley MJ, Bachtrog D. Dynamic turnover of centromeres drives karyotype evolution in Drosophila. eLife. 2019;8:e49002. doi: 10.7554/eLife.49002. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib12] Bronski MJ, Martinez CC, Weld HA, Eisen MB. Whole Genome Sequences of 23 Species from the Drosophila montium Species Group (Diptera: Drosophilidae): A Resource for Testing Evolutionary Hypotheses. G3: Genes, Genomes, Genetics. 2020;10:1443–1455. doi: 10.1534/g3.119.400959. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib13] Cao MD, Nguyen SH, Ganesamoorthy D, Elliott AG, Cooper MA, Coin LJ. Scaffolding and completing genome assemblies in real-time with nanopore sequencing. Nature communications. 2017;8:14515. doi: 10.1038/ncomms14515. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib14] Chakraborty M, Emerson JJ, Macdonald SJ, Long AD. Structural variants exhibit widespread allelic heterogeneity and shape variation in complex traits. Nature communications. 2019;10:4872. doi: 10.1038/s41467-019-12884-1. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib15] Chakraborty M, Chang CH, Khost DE, Vedanayagam J, Adrion JR, Liao Y, Montooth KL, Meiklejohn CD, Larracuente AM, Emerson JJ. Evolution of genome structure in the Drosophila simulans species complex. Genome research. 2021;31:380–396. doi: 10.1101/gr.263442.120. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib16] Chen ZX, Sturgill D, Qu J, Jiang H, Park S, Boley N, Suzuki AM, Fletcher AR, Plachetzki DC, FitzGerald PC, Artieri CG, Atallah J, Barmina O, Brown JB, Blankenburg KP, Clough E, Dasgupta A, Gubbala S, Han Y, Jayaseelan JC, Kalra D, Kim YA, Kovar CL, Lee SL, Li M, Malley JD, Malone JH, Mathew T, Mattiuzzo NR, Munidasa M, Muzny DM, Ongeri F, Perales L, Przytycka TM, Pu LL, Robinson G, Thornton RL, Saada N, Scherer SE, Smith HE, Vinson C, Warner CB, Worley KC, Wu YQ, Zou X, Cherbas P, Kellis M, Eisen MB, Piano F, Kionte K, Fitch DH, Sternberg PW, Cutter AD, Duff MO, Hoskins RA, Graveley BR, Gibbs RA, Bickel PJ, Kopp A, Carninci P, Celniker SE, Oliver B, Richards S. Comparative validation of the D. melanogaster modENCODE transcriptome annotation. Genome research. 2014;24:1209–1223. doi: 10.1101/gr.159384.113. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib18] Comeault AA, Wang J, Tittes S, Isbell K, Ingley S, Hurlbert AH, Matute DR. Genetic Diversity and Thermal Performance in Invasive and Native Populations of African Fig Flies. Molecular biology and evolution. 2020;37:1893–1906. doi: 10.1093/molbev/msaa050. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib19] Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10:giab008. doi: 10.1093/gigascience/giab008. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib20] Ewing B, Hillier L, Wendl MC, Green P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome research. 1998;8:175–185. doi: 10.1101/gr.8.3.175. [DOI] [PubMed] [Google Scholar]

[bib21] Felsenstein J. Phylogenies and the Comparative Method. The American Naturalist. 1985;125:1–15. doi: 10.1086/703055. [DOI] [PubMed] [Google Scholar]

[bib22] Feng S, Stiller J, Deng Y, Armstrong J, Fang Q, Reeve AH, Xie D, Chen G, Guo C, Faircloth BC, Petersen B, Wang Z, Zhou Q, Diekhans M, Chen W, Andreu-Sánchez S, Margaryan A, Howard JT, Parent C, Pacheco G, Sinding MS, Puetz L, Cavill E, Ribeiro ÂM, Eckhart L, Fjeldså J, Hosner PA, Brumfield RT, Christidis L, Bertelsen MF, Sicheritz-Ponten T, Tietze DT, Robertson BC, Song G, Borgia G, Claramunt S, Lovette IJ, Cowen SJ, Njoroge P, Dumbacher JP, Ryder OA, Fuchs J, Bunce M, Burt DW, Cracraft J, Meng G, Hackett SJ, Ryan PG, Jønsson KA, Jamieson IG, da Fonseca RR, Braun EL, Houde P, Mirarab S, Suh A, Hansson B, Ponnikas S, Sigeman H, Stervander M, Frandsen PB, van der Zwan H, van der Sluis R, Visser C, Balakrishnan CN, Clark AG, Fitzpatrick JW, Bowman R, Chen N, Cloutier A, Sackton TB, Edwards SV, Foote DJ, Shakya SB, Sheldon FH, Vignal A, Soares AER, Shapiro B, González-Solís J, Ferrer-Obiol J, Rozas J, Riutort M, Tigano A, Friesen V, Dalén L, Urrutia AO, Székely T, Liu Y, Campana MG, Corvelo A, Fleischer RC, Rutherford KM, Gemmell NJ, Dussex N, Mouritsen H, Thiele N, Delmore K, Liedvogel M, Franke A, Hoeppner MP, Krone O, Fudickar AM, Milá B, Ketterson ED, Fidler AE, Friis G, Parody-Merino ÁM, Battley PF, Cox MP, Lima NCB, Prosdocimi F, Parchman TL, Schlinger BA, Loiselle BA, Blake JG, Lim HC, Day LB, Fuxjager MJ, Baldwin MW, Braun MJ, Wirthlin M, Dikow RB, Ryder TB, Camenisch G, Keller LF, DaCosta JM, Hauber ME, Louder MIM, Witt CC, McGuire JA, Mudge J, Megna LC, Carling MD, Wang B, Taylor SA, Del-Rio G, Aleixo A, Vasconcelos ATR, Mello CV, Weir JT, Haussler D, Li Q, Yang H, Wang J, Lei F, Rahbek C, Gilbert MTP, Graves GR, Jarvis ED, Paten B, Zhang G. Dense sampling of bird diversity increases power of comparative genomics. Nature. 2020;587:252–257. doi: 10.1038/s41586-020-2873-9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib23] Fiddes IT, Armstrong J, Diekhans M, Nachtweide S, Kronenberg ZN, Underwood JG, Gordon D, Earl D, Keane T, Eichler EE, Haussler D, Stanke M, Paten B. Comparative Annotation Toolkit (CAT)-simultaneous clade and personal genome annotation. Genome research. 2018;28:1029–1038. doi: 10.1101/gr.233460.117. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib24] Finet C, Kassner VA, Carvalho AB, Chung H, Day JP, Day S, Delaney EK, Ré FCD, Dufour HD, Dupim E, Izumitani HF, Gautério TB, Justen J, Katoh T, Kopp A, Koshikawa S, Longdon B, Loreto EL, Nunes MDS. DrosoPhyla: genomic resources for drosophilid phylogeny and systematics. bioRxiv. 2021 doi: 10.1101/2021.03.23.436709. [DOI] [PMC free article] [PubMed]

[bib25] Flynn JM, Long M, Wing RA, Clark AG. Evolutionary Dynamics of Abundant 7-bp Satellites in the Genome of Drosophila virilis. Molecular biology and evolution. 2020;37:1362–1375. doi: 10.1093/molbev/msaa010. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib26] Green P. Phrap. 2009 http://phrap.org

[bib27] Guirao-Rico S, González J. Evolutionary insights from large scale resequencing datasets in Drosophila melanogaster. Current opinion in insect science. 2019;31:70–76. doi: 10.1016/j.cois.2018.11.002. [DOI] [PubMed] [Google Scholar]

[bib28] Hales KG, Korey CA, Larracuente AM, Roberts DM. Genetics on the Fly: A Primer on the Drosophila Model System. Genetics. 2015;201:815–842. doi: 10.1534/genetics.115.183392. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib29] Hickey G, Paten B, Earl D, Zerbino D, Haussler D. HAL: a hierarchical format for storing and analyzing multiple genome alignments. Bioinformatics. 2013;29:1341–1342. doi: 10.1093/bioinformatics/btt128. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib30] Hill T, Rosales-Stephens H-L, Unckless RL. Rapid divergence of the copulation proteins in the Drosophila dunni group is associated with hybrid post-mating-prezygotic incompatibilities. bioRxiv. 2020 doi: 10.1101/2020.05.20.106724. [DOI]

[bib31] Hoskins RA, Carlson JW, Wan KH, Park S, Mendez I, Galle SE, Booth BW, Pfeiffer BD, George RA, Svirskas R, Krzywinski M, Schein J, Accardo MC, Damia E, Messina G, Méndez-Lago M, de Pablos B, Demakova OV, Andreyeva EN, Boldyreva LV, Marra M, Carvalho AB, Dimitri P, Villasante A, Zhimulev IF, Rubin GM, Karpen GH, Celniker SE. The Release 6 reference sequence of the Drosophila melanogaster genome. Genome research. 2015;25:445–458. doi: 10.1101/gr.185579.114. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib32] Hotaling S, Sproul JS, Heckenhauer J, Powell A, Larracuente AM, Pauls SU, Kelley JL, Frandsen PB. Long-reads are revolutionizing 20 years of insect genome sequencing. Genome Biology and Evolution. 2021;10:evab138. doi: 10.1101/2021.02.14.431146. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib33] Hubley R, Finn RD, Clements J, Eddy SR, Jones TA, Bao W, Smit AF, Wheeler TJ. The Dfam database of repetitive DNA families. Nucleic acids research. 2016;44:D81–D89. doi: 10.1093/nar/gkv1272. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib34] Jacomy M, Venturini T, Heymann S, Bastian M. ForceAtlas2, a continuous graph layout algorithm for handy network visualization designed for the Gephi software. PLOS ONE. 2014;9:e98679. doi: 10.1371/journal.pone.0098679. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib35] Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular biology and evolution. 2013;30:772–780. doi: 10.1093/molbev/mst010. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib36] Kim KE, Peluso P, Babayan P, Yeadon PJ, Yu C, Fisher WW, Chin CS, Rapicavoli NA, Rank DR, Li J, Catcheside DE, Celniker SE, Phillippy AM, Bergman CM, Landolin JM. Long-read, whole-genome shotgun sequence data for five model organisms. Scientific data. 2014;1:140045. doi: 10.1038/sdata.2014.45. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib37] Kim BY. Software Heritage. https://archive.softwareheritage.org/swh:1:rev:4e40d28d0bdcd1bc7e4eabb7709f301df9ad7ead: 2021. [Google Scholar]

[bib38] Kingan S, Heaton H, Cudini J, Lambert C, Baybayan P, Galvin B, Durbin R, Korlach J, Lawniczak M. A High-Quality de novo genome assembly from a single mosquito using PacBio sequencing. Genes. 2019;10:62. doi: 10.3390/genes10010062. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib39] Kolmogorov M, Yuan J, Lin Y, Pevzner PA. Assembly of long, error-prone reads using repeat graphs. Nature biotechnology. 2019;37:540–546. doi: 10.1038/s41587-019-0072-8. [DOI] [PubMed] [Google Scholar]

[bib40] Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome research. 2017;27:722–736. doi: 10.1101/gr.215087.116. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib41] Koren S, Phillippy AM, Simpson JT, Loman NJ, Loose M. Reply to 'Errors in long-read assemblies can critically affect protein prediction'. Nature biotechnology. 2019;37:127–128. doi: 10.1038/s41587-018-0005-y. [DOI] [PubMed] [Google Scholar]

[bib42] Kozlov AM, Darriba D, Flouri T, Morel B, Stamatakis A. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics. 2019;35:4453–4455. doi: 10.1093/bioinformatics/btz305. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib43] Kriventseva EV, Kuznetsov D, Tegenfeldt F, Manni M, Dias R, Simão FA, Zdobnov EM. OrthoDB v10: sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs. Nucleic acids research. 2019;47:D807–D811. doi: 10.1093/nar/gky1053. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib44] Lack JB, Lange JD, Tang AD, Corbett-Detig RB, Pool JE. A Thousand Fly Genomes: An Expanded Drosophila Genome Nexus. Molecular biology and evolution. 2016;33:3308–3313. doi: 10.1093/molbev/msw195. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib45] Le SQ, Gascuel O. An improved general amino acid replacement matrix. Molecular biology and evolution. 2008;25:1307–1320. doi: 10.1093/molbev/msn067. [DOI] [PubMed] [Google Scholar]

[bib46] Lewin HA, Robinson GE, Kress WJ, Baker WJ, Coddington J, Crandall KA, Durbin R, Edwards SV, Forest F, Gilbert MTP, Goldstein MM, Grigoriev IV, Hackett KJ, Haussler D, Jarvis ED, Johnson WE, Patrinos A, Richards S, Castilla-Rubio JC, van Sluys MA, Soltis PS, Xu X, Yang H, Zhang G. Earth BioGenome project: sequencing life for the future of life. PNAS. 2018;115:4325–4333. doi: 10.1073/pnas.1720115115. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib47] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib48] Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27:2987–2993. doi: 10.1093/bioinformatics/btr509. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib49] Li H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics. 2016;32:2103–2110. doi: 10.1093/bioinformatics/btw152. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib50] Li H. Bioawk. 2017 https://github.com/lh3/bioawk

[bib51] Li H. AuN: A New Metric to Measure Assembly Contiguity. 2020 https://lh3.github.io/2020/04/08/a-new-metric-on-assembly-contiguity

[bib52] Long E, Evans C, Chaston J, Udall JA. Genomic Structural Variations Within Five Continental Populations of Drosophila melanogaster. G3: Genes, Genomes, Genetics. 2018;8:3247–3253. doi: 10.1534/g3.118.200631. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib53] Mai D, Nalley MJ, Bachtrog D. Patterns of Genomic Differentiation in the Drosophila nasuta Species Complex. Molecular biology and evolution. 2020;37:208–220. doi: 10.1093/molbev/msz215. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib54] Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27:764–770. doi: 10.1093/bioinformatics/btr011. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib55] Miller DE, Staber C, Zeitlinger J, Hawley RS. Highly Contiguous Genome Assemblies of 15 Drosophila Species Generated Using Nanopore Sequencing. G3: Genes, Genomes, Genetics. 2018;8:3131–3141. doi: 10.1534/g3.118.200160. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib56] Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze A, Mikheenko A, Vollger MR, Altemose N, Uralsky L, Gershman A, Aganezov S, Hoyt SJ, Diekhans M, Logsdon GA, Alonge M, Antonarakis SE, Borchers M, Bouffard GG, Brooks SY, Phillippy AM. The complete sequence of a human genome. bioRxiv. 2021 doi: 10.1101/2021.05.26.445798. [DOI] [PMC free article] [PubMed]

[bib57] O'Grady PM, DeSalle R. Phylogeny of the Genus Drosophila. Genetics. 2018;209:1–25. doi: 10.1534/genetics.117.300583. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib58] Paithankar KR, Prasad KS. Precipitation of DNA by polyethylene glycol and ethanol. Nucleic Acids Research. 1991;19:1346. doi: 10.1093/nar/19.6.1346. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib59] Paradis E, Schliep K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics. 2019;35:526–528. doi: 10.1093/bioinformatics/bty633. [DOI] [PubMed] [Google Scholar]

[bib60] Paris M, Boyer R, Jaenichen R, Wolf J, Karageorgi M, Green J, Cagnon M, Parinello H, Estoup A, Gautier M, Gompel N, Prud'homme B. Near-chromosome level genome assembly of the fruit pest Drosophila suzukii using long-read sequencing. Scientific reports. 2020;10:11227. doi: 10.1038/s41598-020-67373-z. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib61] Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome research. 2010;20:110–121. doi: 10.1101/gr.097857.109. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib62] Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842. doi: 10.1093/bioinformatics/btq033. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib63] Ranz JM, Casals F, Ruiz A. How malleable is the eukaryotic genome? Extreme rate of chromosomal rearrangement in the genus Drosophila. Genome research. 2001;11:230–239. doi: 10.1101/gr.162901. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib64] Rezvykh AP, Funikov SY, Protsenko LA, Kulikova DA, Zelentsova ES, Chuvakova LN, Blumenstiel JP, Evgen'ev MB. Evolutionary Dynamics of the Pericentromeric Heterochromatin in Drosophila virilis and Related Species. Genes. 2021;12:175. doi: 10.3390/genes12020175. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib65] Rhie A, Walenz BP, Koren S, Phillippy AM. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome biology. 2020;21:245. doi: 10.1186/s13059-020-02134-9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib66] Rhie A, McCarthy SA, Fedrigo O, Damas J, Formenti G, Koren S, Uliano-Silva M, Chow W, Fungtammasan A, Kim J, Lee C, Ko BJ, Chaisson M, Gedman GL, Cantin LJ, Thibaud-Nissen F, Haggerty L, Bista I, Smith M, Haase B, Mountcastle J, Winkler S, Paez S, Howard J, Vernes SC, Lama TM, Grutzner F, Warren WC, Balakrishnan CN, Burt D, George JM, Biegler MT, Iorns D, Digby A, Eason D, Robertson B, Edwards T, Wilkinson M, Turner G, Meyer A, Kautt AF, Franchini P, Detrich HW, Svardal H, Wagner M, Naylor GJP, Pippel M, Malinsky M, Mooney M, Simbirsky M, Hannigan BT, Pesout T, Houck M, Misuraca A, Kingan SB, Hall R, Kronenberg Z, Sović I, Dunn C, Ning Z, Hastie A, Lee J, Selvaraj S, Green RE, Putnam NH, Gut I, Ghurye J, Garrison E, Sims Y, Collins J, Pelan S, Torrance J, Tracey A, Wood J, Dagnew RE, Guan D, London SE, Clayton DF, Mello CV, Friedrich SR, Lovell PV, Osipova E, Al-Ajli FO, Secomandi S, Kim H, Theofanopoulou C, Hiller M, Zhou Y, Harris RS, Makova KD, Medvedev P, Hoffman J, Masterson P, Clark K, Martin F, Howe K, Flicek P, Walenz BP, Kwak W, Clawson H, Diekhans M, Nassar L, Paten B, Kraus RHS, Crawford AJ, Gilbert MTP, Zhang G, Venkatesh B, Murphy RW, Koepfli KP, Shapiro B, Johnson WE, Di Palma F, Marques-Bonet T, Teeling EC, Warnow T, Graves JM, Ryder OA, Haussler D, O'Brien SJ, Korlach J, Lewin HA, Howe K, Myers EW, Durbin R, Phillippy AM, Jarvis ED. Towards complete and error-free genome assemblies of all vertebrate species. Nature. 2021;592:737–746. doi: 10.1038/s41586-021-03451-0. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib67] Roach MJ, Schmidt SA, Borneman AR. Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC bioinformatics. 2018;19:460. doi: 10.1186/s12859-018-2485-7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib68] Robinson GE, Hackett KJ, Purcell-Miramontes M, Brown SJ, Evans JD, Goldsmith MR, Lawson D, Okamuro J, Robertson HM, Schneider DJ. Creating a buzz about insect genomes. Science. 2011a;331:1386. doi: 10.1126/science.331.6023.1386. [DOI] [PubMed] [Google Scholar]

[bib69] Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP. Integrative genomics viewer. Nature biotechnology. 2011b;29:24–26. doi: 10.1038/nbt.1754. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib70] Roy S, Ernst J, Kharchenko PV, Kheradpour P, Negre N, Eaton ML, Landolin JM, Bristow CA, Ma L, Lin MF, Washietl S, Arshinoff BI, Ay F, Meyer PE, Robine N, Washington NL, Di Stefano L, Berezikov E, Brown CD, Candeias R, Carlson JW, Carr A, Jungreis I, Marbach D, Sealfon R, Tolstorukov MY, Will S, Alekseyenko AA, Artieri C, Booth BW, Brooks AN, Dai Q, Davis CA, Duff MO, Feng X, Gorchakov AA, Gu T, Henikoff JG, Kapranov P, Li R, MacAlpine HK, Malone J, Minoda A, Nordman J, Okamura K, Perry M, Powell SK, Riddle NC, Sakai A, Samsonova A, Sandler JE, Schwartz YB, Sher N, Spokony R, Sturgill D, van Baren M, Wan KH, Yang L, Yu C, Feingold E, Good P, Guyer M, Lowdon R, Ahmad K, Andrews J, Berger B, Brenner SE, Brent MR, Cherbas L, Elgin SC, Gingeras TR, Grossman R, Hoskins RA, Kaufman TC, Kent W, Kuroda MI, Orr-Weaver T, Perrimon N, Pirrotta V, Posakony JW, Ren B, Russell S, Cherbas P, Graveley BR, Lewis S, Micklem G, Oliver B, Park PJ, Celniker SE, Henikoff S, Karpen GH, Lai EC, MacAlpine DM, Stein LD, White KP, Kellis M, modENCODE Consortium Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science. 2010;330:1787–1797. doi: 10.1126/science.1198374. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib71] Schneider C, Woehle C, Greve C, D'Haese CA, Wolf M, Hiller M, Janke A, Bálint M, Huettel B. Two high-quality de novo genomes from single ethanol-preserved specimens of tiny metazoans (Collembola) GigaScience. 2021;10:giab035. doi: 10.1093/gigascience/giab035. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib72] Seppey M, Manni M, Zdobnov EM. In: Gene Prediction: Methods and Protocols. Kollmar M, editor. Springer; 2019. BUSCO: Assessing Genome Assembly and Annotation Completeness; pp. 227–245. [DOI] [PubMed] [Google Scholar]

[bib73] Shafin K, Pesout T, Chang P-C, Nattestad M, Kolesnikov A, Goel S, Baid G, Eizenga JM, Miga KH, Carnevali P, Jain M, Carroll A, Paten B. Haplotype-aware variant calling enables high accuracy in Nanopore long-reads using deep neural networks. bioRxiv. 2021 doi: 10.1101/2021.03.04.433952. [DOI] [PMC free article] [PubMed]

[bib74] Shumate A, Salzberg SL. Liftoff: accurate mapping of gene annotations. Bioinformatics. 2020;15:btaa1016. doi: 10.1093/bioinformatics/btaa1016. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib75] Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome research. 2005;15:1034–1050. doi: 10.1101/gr.3715005. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib76] Signor SA, New FN, Nuzhdin S. A Large Panel of Drosophila simulans Reveals an Abundance of Common Variants. Genome biology and evolution. 2018;10:189–206. doi: 10.1093/gbe/evx262. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib77] Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31:3210–3212. doi: 10.1093/bioinformatics/btv351. [DOI] [PubMed] [Google Scholar]

[bib78] Smit AFA, Hubley R, Green P. RepeatMasker Open-4.0. 2013 https://www.repeatmasker.org/

[bib79] Solares EA, Chakraborty M, Miller DE, Kalsow S, Hall K, Perera AG, Emerson JJ, Hawley RS. Rapid Low-Cost Assembly of the Drosophila melanogaster Reference Genome Using Low-Coverage, Long-Read Sequencing . G3: Genes, Genomes, Genetics. 2018;8:3143–3154. doi: 10.1534/g3.118.200162. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib80] Stark A, Lin MF, Kheradpour P, Pedersen JS, Parts L, Carlson JW, Crosby MA, Rasmussen MD, Roy S, Deoras AN, Ruby JG, Brennecke J, Hodges E, Hinrichs AS, Caspi A, Paten B, Park SW, Han MV, Maeder ML, Polansky BJ, Robson BE, Aerts S, van Helden J, Hassan B, Gilbert DG, Eastman DA, Rice M, Weir M, Hahn MW, Park Y, Dewey CN, Pachter L, Kent WJ, Haussler D, Lai EC, Bartel DP, Hannon GJ, Kaufman TC, Eisen MB, Clark AG, Smith D, Celniker SE, Gelbart WM, Kellis M, Harvard FlyBase curators. Berkeley Drosophila Genome Project Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature. 2007;450:219–232. doi: 10.1038/nature06340. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib81] Sturtevant AH, Novitski E. The homologies of the chromosome elements in the genus Drosophila. Genetics. 1941;26:517–541. doi: 10.1093/genetics/26.5.517. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib82] Suvorov A, Kim BY, Wang J, Armstrong EE, Peede D, D’Agostino ERR, Price DK, Wadell P, Lang M, Courtier-Orgogozo V, David JR, Petrov D, Matute DR, Schrider DR, Comeault AA. Widespread introgression across a phylogeny of 155 Drosophila genomes. bioRxiv. 2021 doi: 10.1101/2020.12.14.422758. [DOI] [PMC free article] [PubMed]

[bib83] Tarasov A, Vilella AJ, Cuppen E, Nijman IJ, Prins P. Sambamba: fast processing of NGS alignment formats. Bioinformatics. 2015;31:2032–2034. doi: 10.1093/bioinformatics/btv098. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib84] Threlfall J, Blaxter M. Launching the tree of life gateway. Wellcome Open Research. 2021;6:125. doi: 10.12688/wellcomeopenres.16913.1. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib85] Toda MJ. bioinfo—Taxonomic information. 2020 https://bioinfo.museum.hokudai.ac.jp/db/index.php

[bib86] Tyson J. Bead-Free Long Fragment LSK109 Library Preparation. 2020 doi: 10.17504/protocols.io.7euhjew. [DOI]

[bib87] Vaser R, Sović I, Nagarajan N, Šikić M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome research. 2017;27:737–746. doi: 10.1101/gr.214270.116. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib88] Vurture GW, Sedlazeck FJ, Nattestad M, Underwood CJ, Fang H, Gurtowski J, Schatz MC. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics. 2017;33:2202–2204. doi: 10.1093/bioinformatics/btx153. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib89] Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, Cuomo CA, Zeng Q, Wortman J, Young SK, Earl AM. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLOS ONE. 2014;9:e112963. doi: 10.1371/journal.pone.0112963. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib90] Wang JR, Holt J, McMillan L, Jones CD. FMLRC: Hybrid long read error correction using an FM-index. BMC bioinformatics. 2018;19:50. doi: 10.1186/s12859-018-2051-3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib91] Warren RL, Yang C, Vandervalk BP, Behsaz B, Lagman A, Jones SJ, Birol I. LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads. GigaScience. 2015;4:35. doi: 10.1186/s13742-015-0076-3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib92] Waterhouse RM, Seppey M, Simão FA, Manni M, Ioannidis P, Klioutchnikov G, Kriventseva EV, Zdobnov EM. BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics. Molecular biology and evolution. 2018;35:543–548. doi: 10.1093/molbev/msx319. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib93] Watson M, Warr A. Errors in long-read assemblies can critically affect protein prediction. Nature biotechnology. 2019;37:124–126. doi: 10.1038/s41587-018-0004-z. [DOI] [PubMed] [Google Scholar]

[bib94] Wick RR, Holt KE. Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Research. 2020;8:1–22. doi: 10.12688/f1000research.21782.3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib95] Yassin A. Phylogenetic classification of the Drosophilidae rondani (Diptera): the role of morphology in the postgenomic era. Systematic Entomology. 2013;38:349–364. doi: 10.1111/j.1365-3113.2012.00665.x. [DOI] [Google Scholar]

[bib96] Yin J, Zhang C, Mirarab S. ASTRAL-MP: scaling ASTRAL to very large datasets using randomization and parallelization. Bioinformatics. 2019;35:3961–3969. doi: 10.1093/bioinformatics/btz211. [DOI] [PubMed] [Google Scholar]

[bib97] Zdobnov EM, Tegenfeldt F, Kuznetsov D, Waterhouse RM, Simão FA, Ioannidis P, Seppey M, Loetscher A, Kriventseva EV. OrthoDB v9.1: cataloging evolutionary and functional annotations for animal, fungal, plant, archaeal, bacterial and viral orthologs. Nucleic acids research. 2017;45:D744–D749. doi: 10.1093/nar/gkw1119. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib98] Zoonomia Consortium. Genereux DP, Serres A, Armstrong J, Johnson J, Marinescu VD, Murén E, Juan D, Bejerano G, Casewell NR, Chemnick LG, Damas J, Di Palma F, Diekhans M, Fiddes IT, Garber M, Gladyshev VN, Goodman L, Haerty W, Houck ML. A comparative genomics multitool for scientific discovery and conservation. Nature. 2020;587:240–245. doi: 10.1038/s41586-020-2876-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Highly contiguous assemblies of 101 drosophilid genomes

Bernard Y Kim

Jeremy R Wang

Danny E Miller

Olga Barmina

Emily Delaney

Ammon Thompson

Aaron A Comeault

David Peede

Emmanuel RR D'Agostino

Julianne Pelaez

Jessica M Aguilar

Diler Haji

Teruyuki Matsunaga

Ellie E Armstrong

Molly Zych

Yoshitaka Ogawa

Marina Stamenković-Radak

Mihailo Jelić

Marija Savić Veselinović

Marija Tanasković

Pavle Erić

Jian-Jun Gao

Takehiro K Katoh

Masanori J Toda

Hideaki Watabe

Masayoshi Watada

Jeremy S Davis

Leonie C Moyle

Giulia Manoli

Enrico Bertolini

Vladimír Košťál

R Scott Hawley

Aya Takahashi

Corbin D Jones

Donald K Price

Noah Whiteman

Artyom Kopp

Daniel R Matute

Dmitri A Petrov

Roles

Abstract

Introduction

Results and discussion

Taxon sampling

Table 1. Species and strain information for all samples assembled for this work.

Near chromosome-scale assembly with ultra-long reads

Figure 1. Nanopore-based assemblies are highly contiguous and complete.

Figure 1—figure supplement 1. Nanopore-based assemblies compare favorably to representative genomes on NCBI.

Figure 1—figure supplement 2. Large improvements in assembly contiguity from an updated assembly workflow.

Figure 1—figure supplement 3. Contiguity metrics standardized by the estimated genome size.

Figure 1—figure supplement 4. Estimated genome size is similar to assembly size.

Estimates of sample diversity

Figure 2. Estimated heterozygosity in the data used for genome assembly.

Figure 2—figure supplement 1. Assembly contiguity is not related to sample heterozygosity.

Estimates of sequence quality

Figure 3. Nanopore-based Drosophila assemblies are accurate, particularly in coding regions.

Figure 3—figure supplement 1. Variation in sequence accuracy within the genome assemblies.

Figure 3—figure supplement 2. Large insertions account for nearly all differences between the Nanopore-based and reference D. melanogaster assembly.

Nanopore-based assemblies are highly accurate in coding regions

A comparative genomics resource

Figure 4. Gene content of Muller elements is conserved across drosophilids while gene order changes.

Repeat content

Figure 5. Repeat content varies greatly between drosophilid groups.

Figure 5—figure supplement 1. Assembly contiguity is not determined by repeat content.

Figure 5—figure supplement 2. The non-repetitive and repetitive portions of the genome both contribute to genome size differences between drosophilids.

Next steps

Figure 6. Highly contiguous assemblies can be obtained with lower coverage of ultra-long reads.

Reproducibility

Materials and methods

Key resources table.

Taxon sampling and sample collection

DNA extraction and nanopore sequencing

Obtaining short read datasets for polishing

Choice of long read assembly program

Assembly and long read polishing

Haplotig identification and removal

Final polishing and decontamination

Assembly decontamination