Skip to main content
GigaScience logoLink to GigaScience
. 2021 Mar 15;10(3):giab017. doi: 10.1093/gigascience/giab017

Chromosome-level genome assemblies of the malaria vectors Anopheles coluzzii and Anopheles arabiensis

Anton Zamyatin 1,#, Pavel Avdeyev 2,3,#, Jiangtao Liang 4,5,#, Atashi Sharma 6,7, Chujia Chen 8,9, Varvara Lukyanchikova 10,11,12, Nikita Alexeev 13, Zhijian Tu 14,15, Max A Alekseyev 16,17, Igor V Sharakhov 18,19,
PMCID: PMC7957348  PMID: 33718948

Abstract

Background

Anopheles coluzzii and Anopheles arabiensis belong to the Anopheles gambiae complex and are among the major malaria vectors in sub-Saharan Africa. However, chromosome-level reference genome assemblies are still lacking for these medically important mosquito species.

Findings

In this study, we produced de novo chromosome-level genome assemblies for A. coluzzii and A. arabiensis using the long-read Oxford Nanopore sequencing technology and the Hi-C scaffolding approach. We obtained 273.4 and 256.8 Mb of the total assemblies for A. coluzzii and A. arabiensis, respectively. Each assembly consists of 3 chromosome-scale scaffolds (X, 2, 3), complete mitochondrion, and unordered contigs identified as autosomal pericentromeric DNA, X pericentromeric DNA, and Y sequences. Comparison of these assemblies with the existing assemblies for these species demonstrated that we obtained improved reference-quality genomes. The new assemblies allowed us to identify genomic coordinates for the breakpoint regions of fixed and polymorphic chromosomal inversions in A. coluzzii and A. arabiensis.

Conclusion

The new chromosome-level assemblies will facilitate functional and population genomic studies in A. coluzzii and A. arabiensis. The presented assembly pipeline will accelerate progress toward creating high-quality genome references for other disease vectors.

Keywords: Anopheles arabiensis, Anopheles coluzzii, malaria mosquito, genome assembly, Oxford nanopore sequencing, Hi-C chromosome conformation capture

Introduction and Background

Malaria has a devastating global impact on public health and welfare, with the majority of the world's malaria cases occurring in tropical Africa. Anopheles mosquitoes are exclusive vectors of malaria, with species from the Anopheles gambiae complex being the deadliest African vectors. Anopheles arabiensis Patton, 1905 (NCBI:txid7173) and Anopheles coluzzii Coetzee & Wilkerson, 2013 (NCBI:txid1518534), along with A. gambiae Giles, 1902 (NCBI:txid7165), are the malaria vectors of most widespread importance in sub-Saharan Africa. A. arabiensis feeds and rests predominantly outdoors, replacing A. gambiae in some localities where there is high use of long-lasting insecticide-treated nets and indoor residual spraying [1]. Females of A. arabiensis display opportunistic feeding behavior as they seek both human and animal blood [2]. A genomic study revealed that alleles linked to the chromosomal inversions 2Rb and 3Ra of A. arabiensis may influence choice of host for blood feeding [3]. Because traits relevant to vectorial capacity have genetic determinants, genomic studies are crucial for developing novel approaches to malaria control. Interspecies crosses between A. arabiensis and other species of the A. gambiae complex produce sterile males [46]. Fertile hybrid females allow for gene flow between species. The discovery of pervasive genomic introgression between A. arabiensis and A. gambiae or A. coluzzii [7, 8] opened an opportunity to investigate how traits enhancing vectorial capacity can be acquired through an interspecific genetic exchange. Hybrids of both sexes between the closely related species A. coluzzii and A. gambiae are fertile [9, 10]. These species are highly anthropophilic and endophilic; they are often sympatric but differ in geographical range [11], larval ecology [12], mating behavior, [13], and strategies for surviving the dry season [14]. Genomic analyses have the power to infer how these different adaptations are determined and maintained. A recent study described a new taxon, designated Anopheles TENGRELA, that genetically is most similar to A. coluzzii [15]. Still undiscovered and misidentified cryptic taxa could seriously confound ongoing genomic studies of Anopheles ecology and evolution of insecticide resistance.

The quality of genome annotation and analyses of any organism highly depends on the completeness of the assembly [16, 17]. Draft genome assemblies of species with highly polymorphic genomes, such as mosquitoes, may have many gene annotation problems: genes can be missing entirely, have missing exons or gaps, or be split between scaffolds. As a consequence, it is difficult to estimate the total gene number or gene copy number, both of which may be linked to important phenotypic traits. Genes of particular interest with respect to vectorial capacity are especially prone to such errors because they often belong to gene families: aquaporins; ionotropic, odorant, and gustatory receptors; immunity genes; insecticide resistance genes; and reproduction gene clusters [1825]. A genome with missing information can also cause problems for correct analyses of transcriptome, epigenome, and population genomic data. A. gambiae, because of its epidemiological importance, was the first disease vector to have its genome sequenced by the Sanger method (2002 and updated in 2007) [26, 27]. Since then, the AgamP4 assembly remains the standard chromosome-level genome reference for species of the A. gambiae complex [2832]. Using the A. gambiae genome as a reference for functional annotation and population genomic analyses in other species comes at the expense of losing important information on species-specific genetic architectures in the sequencing data. Moreover, the AgamP4 assembly has misassembled haplotype scaffolds, large gaps, incorrect orientation of some scaffolds, and unmapped sequences [26, 27]. The 16 Anopheles mosquito species genome project included several members of the A. gambiae complex [33]. Among them, the genome of A. arabiensis was sequenced on the Illumina platform in addition to the previously sequenced genome of the A. coluzzii MALI strain by Sanger [34]. Unlike the Sanger-based chromosome-level AgamP4 genome assembly, genomes of A. arabiensis and A. coluzzii MALI are represented by numerous unmapped sequencing scaffolds. Combined bioinformatics and physical mapping approaches recently produced 20 new super-scaffolded assemblies with improved contiguities for anopheline species, including A. arabiensis and A. coluzzii [35]. In the case of A. arabiensis, the genomic scaffolds have been assigned to and ordered and oriented on 5 chromosomal arms with the help of the AgamP4 reference, thus creating the AaraD2 Illumina-based chromosome-level assembly for this species. Also, a de novo genome assembly from a single A. coluzzii Ngousso mosquito was obtained using Pacific Biosciences (PacBio) sequencing [36]. The high-quality AcolN1 assembly for the Ngousso strain was placed into chromosome context by ordering and orienting the PacBio contigs to the AgamP4 reference. Also, 40% of the unmapped sequences in AgamP4 were assigned to the appropriate chromosomal positions [36].

Although each current sequencing or scaffolding technology alone cannot provide a telomere-to-telomere genome assembly and each method has its limitations [37], a combination of complementary approaches can lead to a chromosome-scale assembly [38, 39]. The ongoing revolution in sequencing and scaffolding methods urges researchers to undertake efforts to create new genome references that satisfy the modern standards. The Oxford Nanopore sequencing technology is a single-molecule, real-time sequencing approach that uses biological membranes with extremely small holes (nanopores) and electrophoresis to measure the change in ionic current when a DNA or RNA molecule passes through the membrane [40]. This high-throughput technology generates exceptionally long reads and has been proven successful in genomic studies for a wide range of biological samples such as arboviruses [41], bacteria [42], plants [43], insects [44], and humans [45]. Hi-C is a groundbreaking technology that exploits in vivo chromatin proximity information to yield dramatically improved genome assemblies. In contrast to alternative scaffolding approaches, such as phasmid libraries or other mate-paired sequencing methods [33], the Hi-C method can produce chromosome-level genomic scaffolds [39, 4650].

In this study, we tested the pipeline for obtaining superior-quality genome assemblies for malaria mosquitoes using the following steps: (i) high-coverage Oxford Nanopore sequencing and assembly using high molecular weight genomic DNA from inbred individuals, (ii) gap filling and error correction using Illumina sequencing data, (iii) Hi-C scaffolding of Oxford Nanopore contigs to chromosomes, and (iv) evaluation and validation of completeness and contiguity of the assemblies. We developed new reference genome assemblies for A. coluzzii and A. arabiensis, which will facilitate studies for a deeper understanding of the biology and genetics of these major African malaria vectors. The presented assembly pipeline will accelerate progress toward creating high-quality genome references for other disease vectors.

Data Description

Here, we describe the pipeline for obtaining de novo chromosome-level assemblies for A. coluzzii MOPTI (AcolMOP1) and A. arabiensis DONGOLA (AaraD3). To produce the highly contiguous assemblies, we adopted and modified a strategy that was recently used for these goals in fruit flies [51, 52]. Briefly, we sequenced the genomes using the Oxford Nanopore technology, assembled contigs from long Nanopore reads, polished them with short Illumina reads, and scaffolded the contigs to the chromosome level using Hi-C proximity ligation data. Evaluation and comparison of the new assemblies with the previously released versions demonstrated substantial improvements in genome completeness and contiguity. We also provide genomic coordinates within the new references for the breakpoint regions of fixed and polymorphic inversions in A. coluzzii and A. arabiensis.

Nanopore sequencing of A. coluzzii and A. arabiensis genomes

We performed Oxford Nanopore sequencing using genomic DNA isolated from sibling males after 6 generations of inbreeding to reduce heterozygosity. Our analysis and visualization of the long Nanopore reads reported 3.3 million (M) reads of a total length of 28 Gb and 5M reads of a total length of 35 Gb for A. coluzzii and A. arabiensis, respectively. The N50 read length was 19 and 21 kb (Additional File 1), the read median length was 3.8 and 2.3 kb (Additional File 2), and the read quality was 10.3 and 10.0 (Additional File 3) for A. coluzzii and A. arabiensis, respectively. We aligned Nanopore reads from A. coluzzii and A. arabiensis to the genome of the closely related species A. gambiae (AgamP4) using minimap2 [53]. For A. coluzzii, the total number of aligned and unaligned reads was 3.3M (99%) and 0.03M (1%), respectively. In the case of A. arabiensis, the total number of aligned and unaligned reads was 4.5M (89%) and 0.56M (11%). The alignment statistics reported 100× coverage for the A. coluzzii genome and 114× coverage for the A. arabiensis genome (Additional File 4). Our contamination analysis with Kraken2 [54] identified 98.4% of A. coluzzii reads as having mosquito origin and 1.6% of reads as having bacterial origin. For A. arabiensis, 89.55% of the reads had mosquito origin, 5.65% of the reads had bacterial origin, and 4.80% of the reads had unknown origin. The reads of bacterial origin were filtered out. We retained the reads of unknown origin for a downstream analysis because they may represent novel mosquito sequences.

Assembly of the A. coluzzii and A. arabiensis genomes

Genome assembly from Nanopore sequencing data is an actively developing area of research. However, no comprehensive comparisons exist of different software on diverse genomes. To assess the performance of several available assemblers on Anopheles genomes, we ran the Nanopore reads, including wtdbg2 v1.1 [55], FLYE v2.4.1 [56], Miniasm v0.3-r179 [57], and Canu v1.8 [58]. In the case of the Canu v1.8 assembler, we obtained 2 assemblies: 1 consisting of unitigs (i.e., unambiguous reconstructions of the sequence) and the other consisting of contigs. We evaluated the contiguity of the draft assemblies using QUAST-LG [59] and estimated the genome sizes for A. coluzzii and A. arabiensis by a k-mer analysis of Illumina reads using Jellyfish [60]. For the A. coluzzii genome, the peak of the 19-mer distribution was at a depth of 54, and the genome size was estimated as 301.3 Mb. The length of single-copy genomic regions was estimated as 204.1 Mb (Additional File 5). For Illumina reads of A. arabiensis, the peak of the 19-mer distribution was at a depth of 88, and the genome size was estimated as 315.6 Mb. The size of single-copy genome regions was estimated to be 249.4 Mb (Additional File 5).

We obtained the following 5 assemblies for A. coluzzii:

  • 1,392 contigs of a total length of 267.2 Mb produced by Wtdbg2;

  • 1,618 contigs of a total length of 279.5 Mb produced by FLYE;

  • 634 contigs of a total length of 318.0 Mb produced by Miniasm;

  • 1,073 unitigs of a total length of 344.0 Mb produced by Canu;

  • 474 contigs of a total length of 314.2 Mb produced by Canu.

The total lengths for the Miniasm and Canu assemblies were closer to the A. coluzzii genome size of 301.3 Mb, estimated from Illumina reads. According to the NG50 metric (NG50 is the length for which the collection of all contigs of that length or longer covers at least half the A. gambiae AgamP4 reference genome [26, 27]), the Canu contig assembly showed better contiguity (Fig. 1A, Additional File 6). However, this assembly also featured the third-largest number of misassemblies. The Wtdbg2 assembly had the fewest misassemblies.

Figure 1:

Figure 1:

NGx curves for the Wtdbg2, Miniasm, Flye, and Canu contig and unitig assemblies. A, A. coluzzii. B. A. arabiensis.

We obtained the following 5 assemblies for A. arabiensis:

  • 1,920 contigs of a total length of 298.2 Mb produced by Wtdbg2;

  • 1,280 contigs of a total length of 289.7 Mb produced by FLYE;

  • 687 contigs of a total length of 338.8 Mb produced by Miniasm;

  • 521 unitigs of a total length of 277.2 Mb produced by Canu;

  • 211 contigs of a total length of 298.5 Mb produced by Canu.

The total length of all the assemblies except Miniasm is underestimated compared with the A. arabiensis genome size estimated from Illumina reads (315.6 Mb). The A. arabiensis NG50 values from any assembler were substantially larger than those of A. coluzzii (Fig. 1B, Additional File 6). The number of misassemblies was again larger for the Canu assemblies than for other assemblies. Thus, initial assemblies of the long-read data obtained by the Canu v1.8 software alone yielded NG50 contig lengths of 13.8 Mb for A. coluzzii and 23.7 Mb for A. arabiensis. The total assembly sizes were 314.2 and 277.2 Mb for A. coluzzii and A. arabiensis, respectively. For the Canu contig assemblies, the area under the NGx curve (auNG metric) was 13.57 and 21.99 Mb for A. coluzzii and A. arabiensis, respectively, which is higher than for the other assemblies (Fig. 1, Additional File 7).

We assessed the completeness of the A. coluzzii and A. arabiensis draft assemblies with BUSCO v3 [61, 62]. According to the BUSCO scores for both gene datasets, Canu assemblies were the best (Additional File 8), which likely indicates that the Canu assembler has more sophisticated error correction strategies than other assemblers have. The higher rate of core gene duplications for the Canu assemblies seems to indicate some haplotype separation, which was not observed in other assemblies. Low BUSCO scores for the Miniasm assemblies can be explained by the absence of error correction or polishing steps in Miniasm. For example, when we polished the A. coluzzii Miniasm assembly with 4-round Racon [63] using Nanopore reads, the resulting assembly achieves 87.9% complete Diptera genes (Additional File 9). Based on the comparison results across all the assemblies, we decided to proceed with the Canu contig and unitig assemblies for further steps in the assembly pipeline.

Polishing of the A. coluzzii and A. arabiensis assemblies

Initial assemblies of long reads from Oxford Nanopore Technologies are prone to frequent insertion and deletion errors, which usually are corrected by polishing. While there is no gold standard for polishing of Nanopore read assemblies, there are 2 commonly recommended polishing strategies. One strategy involves running several rounds of Racon [63] using raw Nanopore reads, and then running Medaka [64]. Another strategy is to run Nanopolish [65] using signal-level data additionally provided by a Nanopore sequencer. In both cases the quality of assemblies can be further improved by running Pilon [66] several times using short Illumina reads.

For polishing the A. coluzzii Canu contig assembly, we used Racon, Medaka, and Nanopolish, as well as Pilon for error correction and gap filling with Illumina reads. The FastQC quality control of the A. coluzzii Illumina reads reported 122.3M reads of a total length of 22.8 Gb and average length of 200 bp. FastQC showed that A. coluzzii Illumina reads have high per-base sequence quality (>32 on the Phred scale) and no adapter contamination. After each step in a polishing pipeline, we queried the resulting genome with conserved single-copy Diptera and Metazoa genes using the BUSCO test (Additional File 9). For the sake of brevity, we report only the BUSCO score for the Diptera single-copy gene set here. Single-copy genes usually cover only a small portion of a genome and it remains unclear how the polishing tools perform on the repeat-rich or non-coding regions. After Nanopolish corrected 283,935 substitutions, 1.6M insertions, and 51,104 deletions, the BUSCO score increased from 77.6% to 93.6% of complete genes. We then ran a 4-round Racon, which decreased the BUSCO score from 93.6% to 88.6% of complete genes. Also, after Nanopolish, we ran Pilon on the Canu contig assembly several times using Illumina reads. After the first round of Pilon, we obtained 97.9% of complete genes. After 3 rounds of Pilon, we reached 98.5% of complete genes. Because the change was insignificant for the third round, we decided to proceed with 3 rounds of Pilon. We also ran Nanopolish for the second time after the first round of Pilon, but this decreased the BUSCO score to 95.9%.

For polishing the A. arabiensis Canu contig and unitig assemblies, we used Nanopolish and Pilon. We ran Nanopolish and 3 rounds of Pilon on the A. arabiensis Canu contig assembly with Illumina reads. FastQC reported 260.6M reads for a total length 53.2 Gb with the average length being 90 bp; further filtering out of 14% of the reads left 224.8M of the A. arabiensis Illumina reads. After Nanopolish, which corrected 143,458 substitutions, 1.1M insertions, and 40,694 deletions, the BUSCO score improved from 83.0% to 94.4% of complete Diptera single-copy genes (Additional File 9). After the 3-round Pilon, we obtained the Canu contig assembly with 98.5% of complete genes. We also polished the A. arabiensis Canu unitig assembly with Nanopolish and 3-round Pilon and obtained the BUSCO score of 98.6% of complete Diptera genes.

We remark here that the BUSCO scores for the polished Canu contig assemblies of A. coluzzii and A. arabiensis were similar to the BUSCO scores for the A. gambiae PEST (AgamP4) genome (Table 1). The BUSCO scores of complete Diptera genes were 98.5% for polished contig assemblies of A. arabiensis and A. coluzzii.

Table 1:

The percentage of complete, single, duplicated, and fragmented genes computed by BUSCO from the conserved single-copy Diptera and Metazoa gene sets for the polished A. coluzzii and A. arabiensis Canu contig assemblies and the A. gambiae PEST genome

Assemblies Complete Diptera/Metazoa (%) Single Diptera/Metazoa (%) Duplicated Diptera/Metazoa (%) Fragmented Diptera/Metazoa (%)
A. coluzzii Canu 98.5/98.9 90.2/89.1 8.3/9.8 0.7/0.9
A. arabiensis Canu 98.5/98.9 93.5/91.8 5.0/7.1 0.7/0.2
A. gambiae PEST 98.3/98.7 97.7/96.8 0.6/1.9 0.8/0.3

Hi-C scaffolding of the A. coluzzii and A. arabiensis contigs

We assembled the A. coluzzii and A. arabiensis polished Canu contigs into chromosome-level scaffolds using Hi-C Illumina reads. For both mosquito genomes, FastQC showed high per-base sequence quality of Hi-C reads (>30 on the Phred scale) and detected contamination with Illumina TrueSeq adapters in 0.17% of the reads of A. coluzzii and in 1.5% of the reads for A. arabiensis. The contaminated reads were filtered out in both read sets, resulting in 231.9M and 141.9M reads for A. coluzzii and A. arabiensis, respectively. In the case of A. coluzzii, the Juicer tool [67] reported 3.7M unmapped Hi-C read pairs, 34.7M Hi-C read pairs mapped to inter contigs, and 97.5M Hi-C read pairs mapped to intra contigs. For A. arabiensis, we obtained 1.6M, 9.8M, and 66.7M Hi-C read pairs that were unmapped, mapped to inter contigs, and mapped to intra contigs, respectively (Additional File 10).

There are several Hi-C–based scaffolding tools available: DNA Triangulation [68], LACHESIS [69], GRAAL [70], HiRise [71], HiCAssembler [72], SALSA2 [73], and 3D-DNA [47]. Among these tools, only GRAAL, SALSA2, and 3D DNA code repositories are actively maintained. We chose 3D-DNA and SALSA2 for scaffolding of our draft assemblies because 3D-DNA allows manual correction while SALSA2 can use the assembly graphs produced by assemblers. Because SALSA2 was designed primarily for unitigs, we assessed the performance of both tools on the A. arabiensis unitig and contig assemblies produced by Canu. We ran SALSA2 with the corresponding assembly graph for the 2 A. arabiensis assemblies. Our experiments showed that 3D-DNA has a tendency to aggressively split contigs as the number of rounds increases. We therefore chose to run 3D-DNA with only 1 round instead of the default 3 rounds.

The QUAST-LG computation of metrics for the original and processed A. arabiensis assemblies demonstrated that SALSA2 performs better than 3D-DNA (Table 2). Strikingly, the assemblies processed by 3D-DNA had a higher number of scaffolds and lower NG50 than in the original assembly. At the same time, 3D-DNA corrected ∼700 misassemblies in the contig assembly. While SALSA2 did not show a substantial boost in contiguity for unitig assembly, it significantly improved contig assembly at the cost of introducing a small number of misassemblies (Additional File 11).

Table 2:

The QUAST-LG report for the NG50, number of misassemblies, and number of contigs for the A. arabiensis original Canu assemblies and those processed by 3D-DNA and SALSA2

QUAST-LG metrics Assemblies of the A. arabiensis genome
Canu Canu, 3D-DNA Canu, SALSA2
Contigs Unitigs Contigs, Unitigs Contigs Unitigs
NG50 (Mb) 23.7 23.7 5.1 5.9 71.4 23.9
No. misassemblies 12,308 12,860 11,650 15,415 13,155 13,271
No. scaffolds 211 521 246 870 161 366

We visually inspected the initial Hi-C contact heat maps produced by the scaffolding of the A. arabiensis assemblies (Additional File 12). The best heat map was generated by SALSA2 on the A. arabiensis contig assembly. For example, SALSA2 reconstructed the correct order of the scaffolds for the X chromosome except that 1 inversion was required to correct the orientation (upper left corner in Additional File 12a and b). On the other hand, the 3D-DNA heat maps were smoother (Additional File 12c and d) because the 3D-DNA software tends to remove repetitive sequences present in the original assemblies. We conclude that SALSA2 is better suited for scaffolding of contigs obtained from long reads. However, the results from both tools can be further improved by manual correction of scaffolds based on visual inspection of the Hi-C heat maps. Juicebox Assembly Tools (JBAT) v1.11 is the only tool currently available for manual correction of genome assemblies [74]. While 3D-DNA is designed to be loadable into JBAT for manual correction, we were unable to convert SALSA2 output to a data format that could be loaded and corrected in JBAT. Therefore, despite SALSA2 producing better scaffolding results, we decided to proceed with the 3D-DNA scaffolds obtained from the Canu contig genome assemblies for A. arabiensis (Additional File 12c) and A. coluzzii (Additional File 13). Both species assemblies required manual correction by reordering, changing orientation, splitting contig sequences, and allocating scaffold borders. The main goal of such manual correction was to obtain chromosome-level scaffolds without assembly errors, haplotype sequences, and assembly artifacts. We also tried to minimize the number of contig splits by means of manual correction.

To improve our manual correction process, we used the following additional information about contigs and scaffolds in the assemblies. All contigs were classified with PurgeHaplotigs software [75] into primary contigs, haplotigs, and assembly artifacts on the basis of the read-depth analysis as follows. Read-depth histograms were produced for the A. coluzzii and A. arabiensis draft assemblies (Additional File 14). In each read-depth histogram, we chose 3 cut-offs to capture 2 peaks of the bimodal distribution that correspond to haploid and diploid levels of coverage. The first read-depth peak resulted from the duplicated regions and corresponded to a “haploid” level of coverage. The second read-depth peak resulted from regions that are haplotype-fused and corresponded to the “diploid” level of coverage. We also aligned contigs from each draft assembly to the A. gambiae PEST (AgamP4) assembly to obtain information about distribution of the contigs across the chromosomes.

Because the Hi-C signal must be stronger for adjacent sequence regions, we manually reordered and changed the orientation of contigs in each assembly to keep the Hi-C signal strong along the diagonal. We used the PurgeHaplotigs classification and the fact that haplotig sequences lead to parallel diagonal signals for moving these contigs into debris. We also moved the contigs with low Hi-C signal and the contigs classified as assembly artifacts to debris. The remaining contigs were reordered according to the Hi-C signal. After manual correction we obtained the final Hi-C contact heat maps for the chromosome-level genome assemblies of A. coluzzii AcolMOP1 and A. arabiensis AaraD3 (Fig. 2).

Figure 2:

Figure 2:

The Hi-C contact heat maps obtained after manual correction of genome assemblies. A, A. coluzzii AcolMOP1. B, A. arabiensis AaraD3. From left to right in each heat map, chromosome X, chromosome 2 (2R+2L),  chromosome 3 (3R+3L). The heat maps were produced by JBAT.

Contigs in debris were further partitioned into several scaffolds. We performed chromosome quotient (CQ) analysis [76] to tentatively assign contigs in debris to the Y chromosome (CQ < 0.1), X chromosome (CQ = 2), and autosomes (CQ = 1). We grouped contigs with CQ values <0.1 into the “chrY” scaffolds. In addition, we removed sequences from the chrY scaffolds if they were classified as autosomal (Ag53C, Ag93) or X chromosomal (AgX367) in the previous studies [7779]. These results showed that the A. coluzzii and A. arabiensis assemblies contain 126 and 4 contigs from chromosome Y, respectively. We were unable to arrange contigs inside the chrY scaffolds into a correct order or orientation because the Hi-C signal was weak for these contigs.

For better understanding of the contig distribution among chromosomes, we also aligned known tandem repeats from the pericentromeric regions of chromosome X and autosomes to our assemblies. We retrieved the sequences from the debris contigs that belong to the pericentromeric region of chromosome X based on tandem repeats 18S ribosomal DNA (rDNA) and AgX367 and to autosomal pericentromeric regions based on satellites Ag53C and Ag93 [7779]. Because the Hi-C signal is low for these contigs due to low complexity of the corresponding genomic regions, we were unable to determine their position inside the scaffolds forming chromosomes. Therefore, we grouped these contigs into separate “X_pericentromeric” and “Autosomal_pericentromeric” scaffolds.

As a result of the manual correction, we obtained the final assemblies for A. coluzzii and A. arabiensis genomes. They include assembled chromosomes X, 2 (2R+2L),  3 (3R+3L),  and complete mitochondrion (MtDNA), as well as unordered contigs of the Y chromosome and pericentromeric sequences of autosomes and the X chromosome (including sequences from the rDNA cluster). Each species assembly consists of 7 scaffolds: chrX, chr2, chr3, chrY, X_pericentromeric, Autosomal_pericentromeric, and MtDNA. The total genome assemblies for A. coluzzii and A. arabiensis have a length of 273.4 and 256.8 Mb, respectively (Table 3). The assembly sizes are in good agreement with the experimentally determined genome size of 260 Mb for A. gambiae [80]. The resulting chromosome-level genome assemblies for A. coluzzii and A. arabiensis have a length of 242.3 and 234.6 Mb, respectively (Table 3). By comparison, the aforementioned chromosome-level genome assemblies AgamP4 for A. gambiae and AaraD2 for A. arabiensis are 230.5 and 216.6 Mb, respectively.

Table 3:

Genome sizes (bp) of the individual assembled chromosomes, unlocalized scaffolds, and mitochondrion for A. coluzzii AcolMOP1 and A. arabiensis AaraD3

Size (bp) A. coluzzii A. arabiensis
Chromosome X 27,656,523 26,913,133
Chromosome 2 (2R+2L) 114,808,232 (60,581,942 + 54,226,290) 111,988,354 (60,137,474 + 51,850,880)
Chromosome 3 (3R+3L) 99,875,506 (53,984,252 + 45,981,254) 95,710,210 (52,890,104 + 42,820,106)
Total chromosome-level assembly 242,340,261 234,611,697
Chromosome Y 21,271,033 619,478
X_pericentromeric 5,607,343 14,883,429
Autosomal_pericentromeric 4,189,909 6,693,365
Mitochondrion 15,366 15,347
Total genome assembly 273,423,912 256,823,316

Validation and quality evaluation of the genome assemblies

We validated the resulting assemblies by aligning the AgamP4.10 gene set from the AgamP4 assembly to AaraD3 and AcolMOP1 using NCBI BLAST v2.9.0 [81]. Overall, 13,036 (99.84%) and 13,031 (99.80%) genes from the total of 13,057 A. gambiae genes were mapped to the A. coluzzii and A. arabiensis assemblies, respectively (Additional File 15). Moreover, 9,442 (72.31%) and 8,971 (68.71%) genes were mapped with the alignment having a zero e-value and 90–110% of the gene length to the A. coluzzii and A. arabiensis assemblies, respectively (Table 4). The relatively smaller number of genes mapped to the X chromosome of A. arabiensis agrees with its higher divergence from the A. gambiae X chromosome [7]. The gene alignments provide supporting evidence that the obtained assemblies have the correct gene content.

Table 4:

Statistics of the genes from the A. gambiae PEST assembly aligned to the A. coluzzii and A. arabiensis assemblies

AgamP4.10 gene set
Aligned from X 2 3 Mt Y_unplaced UNKN
Aligned to total 1,063 6,603 4,897 13 2 479
chrX 584/397 564/378 5/3 4/3 0 0 11/13
chr2 5,055/4,930 9/11 4,862/4,734 5/4 0 0 180/180
chr3 3,785/3,610 6/2 10/8 3,566/3,419 0 0 203/181
MtDNA 13/13 0 0 0 13/13 0 0
chrY 4/10 0 1/3 1/0 0 2/1 0/6
Autosomal_pericentromeric 0/10 0 0/3 0/6 0 0 0/1
X_pericentromeric 1/1 0 0 1/0 0 0/1 0

In each entry (x/y), x stands for the number of genes aligned to AcolMOP1 and y stands for the number of genes aligned to AaraD3.

We validated the structural accuracy of the AaraD3 and AcolMOP1 assemblies by comparing them with the chromosome-scale reference genome of A. gambiae PEST. Using D-Genies v1.2.0 [82], we generated 3 whole-genome pairwise alignments for the following pairs of assemblies: AcolMOP1 and AgamP4, AaraD3 and AgamP4, and AcolMOP1 and AaraD3 (Fig. 3, Additional File 16). We observed that no alignment pair had inter-chromosomal rearrangements between the assemblies. Gaps in whole-genome pairwise alignments between AcolMOP1 and AgamP4 and between AaraD3 and AgamP4 indicate that all chromosomes in the new assemblies have more genomic information in the pericentromeric regions than the AgamP4 assembly. Overall, the pairwise alignments show high concordance of the AcolMOP1 and AaraD3 assemblies with the existing AgamP4 assembly.

Figure 3:

Figure 3:

Whole-genome pairwise alignment dot-plots between chromosome-level assemblies. A, AcolMOP1 and AgamP4. B, AaraD3 and AgamP4. C, AcolMOP1 and AaraD3.

To assess the completeness of the final assemblies, we searched for conserved, single-copy genes using BUSCO v4.1.4 with the dipteran complete gene set. The chromosomal scaffolds of the A. coluzzii and A. arabiensis final assemblies had BUSCO scores of 99.7% and 99.2%, respectively (Additional File 17). We used QUAST-LG to assess contiguity of the assemblies. The largest scaffolds are 114.8 and 112.0 Mb and scaffold N50s are 99.9 and 95.7 Mb for the A. coluzzii and A. arabiensis assemblies, respectively (Additional File 18).

Finally, we assessed the presence of known tandem repeats in pericentromeric regions of chromosomes 2, 3, X, and in chromosome Y of the AcolMOP1, AaraD3, AgamP4, and AcolN1 [36] assemblies. In particular, we used the presence of the putative pericentromeric tandem repeat Ag93 [78, 79] to assess the completeness of autosomal arms. We observed 27,364 and 33,460 Ag93 repeat copies in A. coluzzii and A. arabiensis assemblies, respectively, which are substantially greater than the 4,446 and 2,188 Ag93 repeat copies found in AgamP4 and AcolN1, respectively (Table 5). Moreover, while all hits in AgamP4 were located in the chromosome UNKN, 409/7,003 and 1,643/735 repeats were found in scaffolds corresponding to chromosomes 2/3 in the A. coluzzii and A. arabiensis assemblies, respectively (Additional File 19). This result indicates that the AaraD3 and AcolMOP1 chromosomes are more complete than the AgamP4 chromosomes. We also analyzed the presence of the Ag53C tandem repeat and its junction with the Tsessebe III transposable element, which are known to be located in the pericentromeric regions of the autosomes [78, 83]. The A. coluzzii assembly contains the highest number (646) of these repeats and the A. arabiensis assembly contains a comparable number of the repeats to AgamP4, while the AcolN1 assembly contains just 41 repeat copies (Table 5). It should be noted that these repeats are not located in scaffolds assembled to autosomal chromosomes and can only be found in the Autosomal_pericentromeric scaffolds of the A. coluzzii and A. arabiensis assemblies or in the chromosome UNKN of the AgamP4 assembly. Overall, the AaraD3 and AcolMOP1 assemblies contain more assembled sequences from the autosomal pericentromeric regions than the AgamP4 and AcolN1 assemblies.

Table 5:

The number of marker sequences present in the A. coluzzii MOPTI (AcolMOP1), A. arabiensis (AaraD3), A. gambiae (AgamP4), and A. coluzzii Ngousso (AcolN1) assemblies

Repeat sequence Chromosome AcolMOP1 AaraD3 AgamP4 AcolN1
Ag93 Autosomes 27,364 33,460 4,446 2,188
Ag53C Autosomes 646 93 103 41
AgX367 X chromosome 845 1,653 2 86
18S rDNA X chromosome 205 490 0 3
Ag113 X chromosome 3,146 0 765 97
AgY477 Y chromosome 7,685 0 4 0
AgY53B Y chromosome 10,569 0 387 0
zanzibar Y chromosome 100 0 0 0

For assessing completeness of the pericentromeric regions in the X chromosome and within the Y chromosome, we used AgX367 and AgY477 repeat sequences that are known to be located in the X and Y chromosomes, respectively [77, 78]. It is important to note that AgX367 and AgY477 repeat sequences share a region of high similarity. AcolMOP1 and AaraD3 contain a much higher number of AgX367 copies than AgamP4 or AcolN1 do (Table 5). All these copies are found in the X_pericentromeric scaffold only (Additional File 19). While AcolMOP1 contains 7,685 of AgY477 repeat copies in the scaffold that corresponds to the Y chromosome, they are absent in AaraD3. We also used AgY53B repeat and zanzibar retrotransposon to assess for the presence of Y chromosome contigs in the assemblies. Similar to AgY477, these sequences are found in AcolMOP1 but absent in AaraD3. AcolMOP1 contains the highest number of AgY53B and zanzibar copies among all the studied assemblies. For validating scaffolds corresponding to the X chromosome, we further used the Ag113 and 18S rDNA sequences. Remarkably, only AcolMOP1 and AgamP4 assemblies contain full-length 18S rDNA sequences. The Ag113 sequence is widely present in AcolMOP1 and AgamP4 but absent in AaraD3 (Table 5). Importanly, AcolMOP1 contains 1,941 Ag113 copies in the X chromosome while all appearances of Ag113 in AgamP4 are located in the chromosome UNKN (Additional File 19). All these results indicate that the X and Y chromosomes are better assembled in AcolMOP1 than in AgamP4 or AcolN1.

We conclude that we produced chromosome-level genome assemblies for A. coluzzii and A. arabiensis. The AaraD3 and AcolMOP1 assemblies have 98.1% of conserved single-copy Diptera genes from BUSCO and contain pericentromeric heterochromatin sequences, as well as sequences of the Y chromosomes and the rDNA cluster. Our assessments show that the AaraD3 and AcolMOP1 assemblies are of higher quality and more continuous than AgamP4 or AcolN1.

Genome rearrangements in A. coluzzii and A. arabiensis

The new assemblies of A. coluzzii MOPTI and A. arabiensis DONGOLA allowed us to identify genomic coordinates of breakpoint regions and breakpoint-flanking genes of chromosomal rearrangements. The pairwise alignments of the A. arabiensis, A. coluzzii, and A. gambiae genomes were performed and visualized using D-Genies v1.2.0 [82] (Fig. 3, Additional File 17), genoPlotR [84] (Additional File 20), and SyRi [85] (Additional File 21). The karyotypes of the incipient species A. coluzzii and A. gambiae do not differ by any known fixed rearrangements. The karyotype of A. arabiensis is known to differ from that of A. coluzzii and A. gambiae by 5 fixed overlapping X-chromosome inversions (a, b, c, d, and g) and inversion 2La. Inversions Xag are fixed in A. coluzzii and A. gambiae [86] and inversions Xbcd are fixed in A. arabiensis [86]. Inversion 2La is fixed in A. arabiensis but polymorphic in A. coluzzii and A. gambiae [86]. Our previous studies identified the X chromosome inversion breakpoints by aligning the A. gambiae PEST assembly with the Illumina-based A. arabiensis assemblies AaraD1 and AaraD2 [7, 35]. Here, we determined genomic coordinates and breakpoint-flanking genes of the Xag and Xbcd breakpoint regions in the chromosome-level assemblies of the 3 species (Additional File 22). We identified breakpoint regions and breakpoint-flanking genes of inversion 2La, which is fixed in both the A. arabiensis DONGOLA and A. coluzzii MOPTI strains (Additional Files 20–22). Inversion 2Rb is polymorphic in A. arabiensis, A. coluzzii, and A. gambiae [86], and its breakpoint regions are shared among the 3 species, indicating a single common origin of the inversion [87]. The 2Rb inversion is fixed in the DONGOLA strain and the alignments with the A. gambiae and A. coluzzii genomes identified its breakpoints in the AaraD3 assembly (Additional Files 20–22). Localization of the 2Rb breakpoints in the context of the new reference genome will advance the investigation of how this chromosomal inversion influences choice of host by A. arabiensis [3]. The genomic coordinates of breakpoint regions and breakpoint-flanking genes for inversions 2La and 2Rb found in our assemblies are in agreement with the previously described breakpoints for these inversions [87, 88].

Pair-wise alignments of the 3 chromosome-level assemblies identified new assembly-specific rearrangements, most of which are much smaller than 1 Mb (Fig. 4, Additional Files 21–23). We only considered rearrangements assembly-specific if they had the same breakpoints when aligned to both other assemblies. We found 2 small rearrangements in A. coluzzii: 3R translocation and 3L microinversion. The 3R translocation is located between genomic coordinates 37.6 and 37.7 Mb and the 3L microinversion is located between coordinates 68.4 and 68.6 Mb in AcolMOP1. We discovered a new 2R microinversion in A. arabiensis located between genomic coordinates 9.6 and 9.8 Mb in AaraD3 (Additional File 21). Finally, we found 5 AgamP4-specific inversions on 2R (59.1–59.6 Mb, 60.5–60.9 Mb), on 2L (4.0–5.0 Mb), and on 3L (0.2–0.4 Mb, 1.2–1.9 Mb). The identified structural variations between the assemblies may represent natural genome rearrangements or misassemblies. It is worth noting that these micro-rearrangements are located in euchromatin of AcolMOP1 and AaraD3, while they are located in heterochromatin of AgamP4 [89]. This observation suggests that the AgamP4-specific microinversions are likely misassemblies in the PEST heterochromatin where genome assembly is notoriously difficult [26, 27].

Figure 4:

Figure 4:

Whole-genome pairwise alignments between chromosome-level assemblies of AcolMOP1 and AgamP4. The alignment plots are produced by SyRi. The rearrangements are indicated with arrows.

To validate the new 2R microinversion in A. arabiensis, we aligned our AaraD3 assembly with AaraD2 [35], which is a super-scaffolded, Illumina-based AaraD1 assembly [33]. We found the new 2R microinversion in the AaraD2 assembly as well, confirming its presence in the DONGOLA strain (Additional File 24). We also aligned AcolMOP1 with the reference-guided scaffolded AcolN1 assembly [36] and the super-scaffolded AcolM2 [35] assemblies. These 3 assemblies were made for genomes of 3 different A. coluzzii strains: MOPTI, Ngousso, and MALI, respectively. The alignments demonstrated that the new 3R translocation and 3L microinversion are MOPTI-specific (Additional Files 25 and 26), indicating that these rearrangements are polymorphic within A. coluzzii. Incidentally, we identified an inversion and 2 large translocations in the 2R arm (possible misassemblies) of AcolM2 in its alignment with AcolMOP1 (Additional File 26).

In addition to the rearrangements identified using these 3 species assemblies, we found small rearrangements in the pericentromeric heterochromatin of the X chromosome by aligning the A. coluzzii and A. gambiae genomes (Fig. 4, Additional File 27). The A. arabiensis genome was too divergent in this region to detect rearrangements. Breakpoint regions of 2 translocations and 4 inversions were located within genomic coordinates 20.1–24.2 Mb in AgamP4 and 21.6–26.2 Mb in AcolMOP1. These rearrangements were not detected in the alignments of AcolMOP1 with either AcolN1 or AcolM2 (Additional File 27). A previous study identified 3 of these rearrangements in the 20–22 Mb region of the X chromosome (1 translocation and 2 inversions) by aligning the A. gambiae PEST and A. coluzzii Ngousso genomes [36]. Because our AcolMOP1 assembly extends farther into the heterochromatin than the AcolN1 assembly does, we were able to detect 2 times more rearrangements in the X chromosome. These misalignments could be due to order and/or orientation errors in the PEST genome assembly. However, they may also represent novel rearrangements segregating between A. gambiae and A. coluzzii. We recently described a new type of shared cytogenetic polymorphism in the incipient species, A. gambiae and A. coluzzii—an inversion of the satellite DNA location in relation to the proximal gene-free X chromosome band [83]. The findings suggest that structural variations can be common in the sex chromosome heterochromatin of mosquitoes.

Conclusion

By combining long-read sequences generated by the Oxford Nanopore technology and long-range information produced by the Hi-C approach, we obtained high-quality reference genome assemblies for A. arabiensis and A. coluzzii. We demonstrated tremendous improvement in the completeness and contiguity of these species’ genomes. Thus, these assemblies provide a valuable resource for comparative genomics, epigenetics, functional analyses, and population studies of malaria mosquitoes. To maximize the use of the data, tools, and workflows of this study, we present a pipeline for obtaining superior-quality genome assemblies for malaria mosquitoes based on Hi-C scaffolding of Oxford Nanopore sequencing contigs (Fig. 5). The pipeline illustrates successful approaches along with other approaches that we tried but discarded in the course of its development.

Figure 5:

Figure 5:

The pipeline for obtaining superior-quality genome assemblies for malaria mosquitoes based on Hi-C scaffolding of Oxford Nanopore sequencing contigs.

Sequencing of the first vector genome of A. gambiae revolutionized genetics and genomics research in medical entomology. A. arabiensis and A. coluzzii are major vectors of malaria, and functional characterization of their genomes will enable identification of the genomic determinants of epidemiologically important phenotypic and behavioral traits. Eventually, these efforts will lead to better malaria control.

Materials and Methods

Mosquito colony maintenance

The A. arabiensis DONGOLA (MRA-1235) and A. coluzzii MOPTI (MRA-763) strains were initially obtained from the Biodefense and Emerging Infections Research Resources Repository. Eggs were hatched in distilled water and incubated for 10–15 days, undergoing larval and pupal developmental stages at 27°C. After emerging from pupae, the adult males and females were maintained together in an incubator at 27°C, 75% humidity, with a 12-hour cycle of light and darkness. The 5–7-day-old adult females were blood-fed on defibrinated sheep blood using artificial blood feeders. Approximately 48–72 hours after blood feeding, egg dishes were placed for oviposition.

Mosquito sample collection for Oxford Nanopore sequencing

A single well–blood-fed female mosquito was separated from the original cage. After oviposition, F1 progeny from this single female were inbred with each other for 4 days in a 1.4-L paper popcorn cup. F1 females were given bloodmeal and, after 72 hours, eggs were collected from a single F1 female. The F2 progeny were reared under normal conditions and were inbred as before. After 6 rounds of inbreeding using this procedure, all F7 pupal progeny from a single F6 female were sorted by sex, collected, flash frozen in liquid nitrogen, and stored at –80°C.

Genomic DNA isolation

Genomic DNA was isolated from 20 inbred male pupae following a modified Qiagen Genomic Tip DNA Isolation kit (Cat. No. 10,243 and 19,060, Qiagen, Hilden, Germany) protocol. Briefly, pupae were homogenized using a Dremel motorized homogenizer for ∼30 seconds at the lowest speed. Next, >300 mAU (500 μL of >600 mAU/mL solution) Proteinase K (Cat. No. 19,131, Qiagen, Hilden, Germany) was added to the sample and incubated at 55°C for 3 hours. The homogenate was then transferred into a 15-mL conical tube and centrifuged at 5000g for 15 minutes at 4°C to remove debris. DNA was extracted following the standard Qiagen Genomic Tip protocols (Qiagen, Hilden, Germany). The purity, approximate size, and concentration of the DNA were tested using a nanodrop spectrophotometer, 0.5% agarose gel electrophoresis, and Qubit dsDNA assay, respectively.

Oxford Nanopore sequencing

Approximately 1 μg of DNA was used to generate a sequencing library according to the protocol provided for the SQK-LSK109 library preparation kit from Oxford Nanopore (Oxford Nanopore Technologies, Oxford Science Park, UK). After the DNA repair, end prep, and adapter ligation steps, SPRIselect bead suspension (Cat No. B23318, Beckman Coulter Life Sciences, Indianapolis, IN, USA) was used to remove short fragments and free adapters. Qubit dsDNA assay (ThermoFisher Scientific, Waltham, MA, USA) was used to quantify DNA and ∼300–400 ng of DNA library was loaded onto a MinION flow cell (MinION, RRID:SCR_017985) (BioProject: PRJNA634549, SRA: SRX8462258, SRX8462259).

Hi-C library preparation and sequencing

Hi-C libraries for A. arabiensis were prepared with an Arima-HiC kit (Arima Genomics, San Diego, CA, USA) using protocols provided by the company (Document part No. A160126 v00) with slight modifications. Two replicas of Hi-C libraries were prepared from 1–2-day-old virgin adults with equal proportions of each sex (1 library with 20 and 1 library with 60 adults). After the fixation step using the “Crosslinking—Small Animal" protocol, 10% of the original pulverized mosquito tissue was taken out to perform the “Determining input Amount-Small Animal" protocol. The remaining 90% of tissue comprising ≥750 ng of DNA was used to produce proximally ligated DNA fragments following the Arima-HiC protocol. The quality of proximally ligated DNA was tested by taking 75 ng of DNA through the Arima-QC1 Quality Control protocol. If the Arima-Q1 value passed, the rest of the proximally ligated DNA was used to prepare the libraries using NEBNext Ultra II DNA Library Prep Kit following the “Library Preparation" and the “Library Amplification" protocols. The libraries were sent to Novogene for sequencing. A total of 17.8 and 27.7 Gb of 2 × 150 bp reads were obtained for the 2 libraries (BioProject: PRJNA634549, SRA: SRX8462257). Three biological replicates of Hi-C data for A. coluzzii embryos (BioProject: PRJNA615337, SRA: SRS6448831) were obtained from the previous study [90].

Quality control of Nanopore reads

Analysis and visualization of the long Nanopore reads were performed with the Nanostat and Nanoplot tools of the Nanopack software (de2018nanopack) [91]. Alignment of Nanopore reads to the genome of A. gambiae (AgamP4) was done using minimap2 (Minimap2, RRID:SCR_018550) [53]. A contamination analysis was performed with Kraken2 [54] using a custom database with addition of the A. gambiae [26, 27] and A. coluzzii Ngousso [36] genomes.

Nanopore sequence assembly

Genome assemblies from the Nanopore sequencing data were obtained using wtdbg2 v1.1 (WTDBG, RRID:SCR_017225) [55], FLYE v2.4.1 (Flye, RRID:SCR_017016) [56], Miniasm v0.3-r179 [57], and Canu v1.8 (Canu, RRID:SCR_015880) [58]. In the case of the Canu v1.8 assembler, we obtained 2 assemblies: 1 consisting of unitigs (i.e., unambiguous reconstructions of the sequence) and the other consisting of contigs. For wtdbg2, the polishing step using minimap2 was performed per the developers' recommendation. The completeness and quality of the assemblies were assessed with BUSCO v2 (BUSCO, RRID:SCR_015008) [61, 62] and QUAST-LG [59]. For QUAST-LG, the A. gambiae (AgamP4) genome was used as a reference. A. gambiae is evolutionary more closely related to A. coluzzii (0.061M years divergence) than to A. arabiensis (0.509M years divergence) [8] and, thus, the reference-based metrics (such as NG50 and the number of misassemblies discussed below) were considered with caution. Using BUSCO, each assembly was queried for 2,799 conserved single-copy Diptera genes, as well as for 978 conserved single-copy Metazoa genes. A gene-recognizing model was trained for the Augustus tool [92] in the BUSCO pipeline by using the Aedes aegypti genome [48].

Genome size estimation and polishing the genome assemblies

For genome size estimation and polishing assemblies obtained from Nanopore reads, Illumina short paired-end data were used. The NCBI SRX accession numbers were SRX3832577 for A. coluzzii and SRX084275, SRX111457, SRX200218 for A. arabiensis. Quality control of the Illumina reads was performed with FastQC (FastQC, RRID:SCR_014583) [93]. Based on the FastQC analysis, reads were filtered by the quality and minimum read length, and TruSeq adapters were trimmed from reads using fastp v0.20.0 (fastp, RRID:SCR_016962). The genome sizes of A. coluzzii and A. arabiensis and sizes of single-copy genomic regions were estimated by k-mer analysis for k = 19 based on the Illumina short paired-end reads using methodology described in the genome size estimation tutorial [94]. A 19-mer distribution was separated into 3 consecutive ranges that correspond to error sequences, single-copy sequences (i.e., haploid peak), and repeat sequences. The average 19-mer single-copy coverage was estimated by finding the maximum in the haploid peak. After that, the area under the curve was calculated for the whole distribution range except the sequencing error range. To obtain genome length, the obtained area was divided by the coverage calculated in the previous step. The length of the single-copy sequences was calculated in the same manner but only for the single-copy sequence range. The formulas for calculating respective lengths are the following:

graphic file with name TM0001.gif

where Lg is a genome length, Lsc is a single-copy sequence length, Cov is an average 19-mer single-copy sequence coverage, freq is 19-mer frequency, K(freq) is the number of distinct 19-mers with frequency equal to freq (Y-axis), End is the maximal frequency value, and Lb, Rb are borders of the range for haploid peak. For example, the length of the single-copy region for the A. coluzzii genome is calculated as follows:

graphic file with name TM0002.gif

The frequency distribution of 19-mers in all high-quality short reads was computed by Jellyfish (Jellyfish, RRID:SCR_005491) [60]. Racon (Racon, RRID:SCR_017642) [63], Medaka [64], and Nanopolish (Nanopolish, RRID:SCR_016157) [65] were used to correct nucleotide substitutions, insertions, and deletions. Pilon (Pilon, RRID:SCR_014731) [66] was run several times using Illumina reads.

Scaffolding Nanopore contigs using Hi-C data

Hi-C Illumina short paired-end reads were used for genome scaffolding after their quality control was inspected with FastQC. BWA-MEM v0.7.17 [95] and Juicer v1.5.7 (Juicer, RRID:SCR_017226) [67] were run to assess the quality of Hi-C data with respect to the polished Canu contig assemblies of each genome. The 3D-DNA (3D-DNA, RRID:SCR_017227) [47] and SALSA2 [73] software were run to scaffold the Nanopore contigs. Metrics for the original and processed assemblies were computed by QUAST-LG [59]. Visual inspection of the Hi-C contact heat maps was performed. All contigs in the assemblies were classified with PurgeHaplotigs software [75] into primary contigs, haplotigs, and assembly artifacts on the basis of the read-depth analysis. Alignment of contigs to the A. gambiae PEST (AgamP4) assembly was used to obtain information about the distribution of the contigs across the chromosomes.

Chromosome quotient analysis

Because the AgamP4 assembly does not contain chromosome Y, CQ analysis was performed using Illumina reads from female and male mosquito genomes to detect the presence of contigs from the Y chromosome. According to the original definition [76], for a given sequence Si, CQ(Si) = F(Si)/M(Si), where F(Si) is the number of alignments from female sequence data to Si and M(Si) is the number of alignments from male sequence data to Si. Therefore, the CQ method allows for the differentiation of Y sequences from autosome and X sequences. CQ calculation was performed at a 1-kb window for contigs or scaffolds. If the number of male reads was <20, the CQ value of that particular 1-kb window was not used. Contigs or scaffolds with ≥15% of the 1-kb windows showing CQ values <0.1 were considered to be Y-derived and were grouped into a separate scaffold called “chrY.”

Gene mapping and pairwise alignments

The A. coluzzii MOPTI (AcolMOP1) and A. arabiensis DONGOLA (AaraD3) assemblies were validated by comparing them with the existing assembly A. gambiae PEST (AgamP4), representing the most complete chromosome-level anopheline genome assembly known to date. Using NCBI BLAST v2.9.0 (NCBI BLAST, RRID:SCR_004870) [81], a set of known genes (AgamP4.10 gene set) from the AgamP4 assembly was mapped to the new A. coluzzii and A. arabiensis assemblies. The assembly of the A. coluzzii Ngousso strain (AcolN1) from PacBio reads was also used, where appropriate, because AcolN1 consists of contigs rather than scaffolds [36]. Whole-genome pairwise alignment between these assemblies was generated using D-Genies v1.2.0 (D-GENIES, RRID:SCR_018967) [82], SyRi [85], and genoPlotR [84]. The D-Genies v1.2.0 dot-plots and SyRi plots were generated using sequence alignments. The genoPlotR visualization was done using alignments of orthologous genes identified in the A. coluzzii and A. arabiensis assemblies by BLAST of the AgamP4.10 genes.

Data Availability

The raw genomic sequence reads and both genome assemblies underlying this article are available in the NCBI under project accession PRJNA634549 and the GigaScience GigaDB database [96, 97].

Additional Files

Additional File 1. Analysis report for Oxford Nanopore reads produced by Nanostat.

Additional File 2. Histograms of read length after log normalization for Nanopore reads from (a)  A. coluzzii and (b)  A. arabiensis.

Additional File 3. Plot of the average read quality for Nanopore reads obtained from (a)  A. coluzzii and (b)  A. arabiensis.

Additional File 4. Alignment depth of Nanopore reads from A. coluzzii (left column) and A. arabiensis (right column) to the A. gambiae (AgamP4) genome.

Additional File 5. Distribution of 19-mers for A. coluzzii (left panel) and A. arabiensis (right panel) computed by Jellyfish.

Additional File 6. Analysis report generated by QUAST-LG for Oxford Nanopore draft assemblies.

Additional File 7. auNG metrics for A. coluzzii and A. arabiensis assemblies.

Additional File 8. Evaluation of draft assembly completeness with BUSCO.

Additional File 9. Evaluation of assembly completeness with BUSCO after polishing steps.

Additional File 10. Report produced by the Juicer tool when aligning the Hi-C data on the Canu contig assemblies.

Additional File 11. Analysis report generated by QUAST-LG for Hi-C scaffolding of Canu contigs.

Additional File 12. Initial heat maps of Hi-C contact information for the A. arabiensis genome assemblies obtained by (a) SALSA 2 from the Canu contig assembly, (b) SALSA 2 from the Canu unitig assembly, (c) 3D-DNA from the Canu contig assembly, and (d) 3D-DNA from the Canu unitig assembly. The heat maps were produced by JBAT.

Additional File 13. Hi-C contact heat map for the 3D-DNA scaffolds of the A. coluzzii assembly before manual correction. The heat map was produced by JBAT.

Additional File 14. Read depth histogram obtained by PurgeHaplotigs for the A. coluzzii  (a) and A. arabiensis  (b) assemblies. The cut-offs were manually selected (red arrows in the histograms): 30, 78, and 132 for A. сoluzzii and 25, 93, and 160 for A. arabiensis.

Additional File 15. Mapping the AgamP4.10 gene set to the final A. coluzzii and A. arabiensis assemblies.

Additional File 16. Whole-genome pairwise alignment dot-plots (produced by D-Genies v1.2.0) between the scaffolds corresponding to chromosomes X (a–c), 2 (d–f), and 3 (g–i). Alignments between the A. gambiae and A. coluzzii scaffolds, the A. gambiae and A. arabiensis scaffolds, and the A. coluzzii and A. arabiensis scaffolds are shown.

Additional File 17. BUSCO scores for the final assemblies.

Additional File 18. Analysis report generated by QUAST-LG for the final Hi-Cscaffolded Canu contigs.

Additional File 19. Map of the repetitive sequences for the Anopheles assemblies.

Additional File 20. Whole-genome pairwise alignments produced by genoPlotR between chromosomes of A. arabiensis, A. coluzzii,andA. gambiae. Left: AgamP4 and AcolMOP1. Middle: AgamP4 and AaraD3. Right: AcolMOP1 and AaraD3. The inversion breakpoints are shown with small letters.

Additional File 21. Whole-genome pairwise alignments produced by SyRi of A. arabiensis chromosomes (query) to A. gambiae (reference), and A. coluzzii (reference) chromosomes. Left: AgamP4 and AaraD3. Right: AcolMOP1 and AaraD3.

Additional File 22. Genomic coordinates of breakpoint regions and breakpoint-flanking genes of chromosomal rearrangements identified by aligning the AgamP4, AcolMOP1, and AaraD3 assemblies using D-Genies v1.2.0.

Additional File 23. Genomic coordinates of breakpoint regions of chromosomal rearrangements identified by aligning the AgamP4, AcolMOP1, and AaraD3 assemblies using SyRi.

Additional File 24. Pairwise dot-plot alignment between the AaraD3 and AaraD2 (AaraD1 superscaffolded) assemblies produced by D-Genies v1.2.0. Top: Whole-genome pairwise alignment. Bottom: Collinearity in the alignment of the region with the new 2R microinversion.

Additional File 25. Pairwise dot-plot alignment between the AcolMOP1 and reference-guided scaffolded AcolN1 assemblies produced by D-Genies v1.2.0. Left: Whole-genome pairwise alignment. Middle: Alignment of the region with the new 3R translocation. Right: Alignment of the region with the new 3L microinversion.

Additional File 26. Pairwise dot-plot alignment between the AcolMOP1 and AcolM2 assemblies produced by D-Genies v1.2.0. Left: Whole-genome pairwise alignment. Middle: Alignment of the region with the new 3R translocation. Right: Alignment of the region with the new 3L microinversion.

Additional File 27. Pairwise dot-plot alignment between the X chromosomes produced by D-Genies v1.2.0. Left: AcolMOP1 and AgamP4. Middle: AcolMOP1 and AcolN1. Right: AcolMOP1 and AcolM2.

Abbreviations

BLAST: Basic Local Alignment Search Tool; bp: base pairs; BUSCO: Benchmarking Universal Single-Copy Orthologs; BWA: Burrows-Wheeler Aligner; CQ: chromosome quotient; Gb: gigabase pairs; JBAT: Juicebox Assembly Tools; kb: kilobase pairs; M: million; Mb: megabase pairs; NCBI: National Center for Biotechnology Information; NAIAD: National Institute of Allergy and Infectious Diseases; PacBio: Pacific Biosciences; rDNA: ribosomal DNA; SRA: Sequence Read Archive.

Competing Interests

The authors declare that they have no competing interests.

Funding

Data production and the work of M.A.A., P.A., and I.V.S. were supported by US National Institutes of Health (NIH) grant R21AI135298. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. The work of N.A. was supported by the Government of the Russian Federation through the ITMO Fellowship and Professorship Program. The work of A.Z. was supported by JetBrains Research. I.V.S. was also supported by the Fralin Life Sciences Institute and the USDA National Institute of Food and Agriculture Hatch project 223822.

Authors’ Contributions

I.V.S. and M.A.A. conceived and coordinated the project and acquired funding for this study. N.A. secured funding for A.Z. J.L. maintained mosquito colonies and performed Hi-C protocol optimization, Hi-C experiments, and Illumina library preparation for A. arabiensis. A.S., C.C., and Z.T. did genomic DNA isolation and Oxford Nanopore sequencing. Z.T. performed CQ analysis. Under the supervision of M.A.A. and I.V.S., A.Z. and P.A. performed the genome assembly, scaffolding, and validation with the contribution from V.L. N.A. and P.A. supervised A.Z. The manuscript was prepared and approved by all the authors.

Supplementary Material

giab017_GIGA-D-20-00290_Original_Submission
giab017_GIGA-D-20-00290_Revision_1
giab017_Response_to_Reviewer_Comments_Original_Submission
giab017_Reviewer_1_Report_Original_Submission

Jacob Tennessen -- 11/12/2020 Reviewed

giab017_Reviewer_2_Report_Original_Submission

Shanlin Liu -- 11/28/2020 Reviewed

giab017_Supplemental_Files

ACKNOWLEDGEMENTS

The following reagents were obtained through the Biodefense and Emerging Infections Research Resources Repository, NIAID, NIH: A. arabiensis, Strain DONGOLA 2Ra, 2Rb/b, and 3R, MRA-1235, contributed by Ellen M. Dotson; A. coluzzii, Strain MOPTI, Eggs, MRA-763, contributed by Gregory C. Lanzaro. The authors thank Adam Kai Leung Wong for his invaluable help in installing and troubleshooting various software packages on the Colonial One high-performance computing cluster at the George Washington University. We thank Janet Webster and Kristi DeCourcy for editing the manuscript.

Contributor Information

Anton Zamyatin, Computer Technologies Laboratory, ITMO University, Kronverkskiy Prospekt 49-A, Saint Petersburg 197101, Russia.

Pavel Avdeyev, Department of Mathematics, The George Washington University, 801 22nd Street NW, Washington, DC 20052, USA; Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, 800 22nd Street NW, Washington, DC 20052, USA.

Jiangtao Liang, Department of Entomology, Virginia Polytechnic Institute and State University, 170 Drillfield Drive, Blacksburg, VA 24061, USA; Fralin Life Sciences Institute, Virginia Polytechnic Institute and State University, 360 West Campus Drive, Blacksburg, VA 24061, USA.

Atashi Sharma, Fralin Life Sciences Institute, Virginia Polytechnic Institute and State University, 360 West Campus Drive, Blacksburg, VA 24061, USA; Department of Biochemistry, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061, USA.

Chujia Chen, Fralin Life Sciences Institute, Virginia Polytechnic Institute and State University, 360 West Campus Drive, Blacksburg, VA 24061, USA; Department of Biochemistry, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061, USA.

Varvara Lukyanchikova, Department of Entomology, Virginia Polytechnic Institute and State University, 170 Drillfield Drive, Blacksburg, VA 24061, USA; Fralin Life Sciences Institute, Virginia Polytechnic Institute and State University, 360 West Campus Drive, Blacksburg, VA 24061, USA; Institute of Cytology and Genetics the Siberian Division of the Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia.

Nikita Alexeev, Computer Technologies Laboratory, ITMO University, Kronverkskiy Prospekt 49-A, Saint Petersburg 197101, Russia.

Zhijian Tu, Fralin Life Sciences Institute, Virginia Polytechnic Institute and State University, 360 West Campus Drive, Blacksburg, VA 24061, USA; Department of Biochemistry, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061, USA.

Max A Alekseyev, Department of Mathematics, The George Washington University, 801 22nd Street NW, Washington, DC 20052, USA; Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, 800 22nd Street NW, Washington, DC 20052, USA.

Igor V Sharakhov, Department of Entomology, Virginia Polytechnic Institute and State University, 170 Drillfield Drive, Blacksburg, VA 24061, USA; Fralin Life Sciences Institute, Virginia Polytechnic Institute and State University, 360 West Campus Drive, Blacksburg, VA 24061, USA.

References

  • 1. Russell  TL, Govella  NJ, Azizi  S, et al.  Increased proportions of outdoor feeding among residual malaria vector populations following increased use of insecticide-treated nets in rural Tanzania. Malaria J. 2011;10, doi: 10.1186/1475-2875-10-80. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Petrarca  V, Beier  JC. Intraspecific chromosomal polymorphism in the Anopheles gambiae complex as a factor affecting malaria transmission in the Kisumu area of Kenya. Am J Trop Med Hyg. 1992;46(2):229–37. [DOI] [PubMed] [Google Scholar]
  • 3. Main  BJ, Lee  Y, Ferguson  HM, et al.  The genetic basis of host preference and resting behavior in the major African malaria vector, Anopheles arabiensis. PLoS Genet. 2016;12(9):e1006303. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Davidson  G, Paterson  HE, Coluzzi  M, et al.  The Anopheles gambiae complex. In: Wright  JW, Pal  R, eds. Genetics of Insect Vectors of Disease. Amsterdam-London-New York: Elsevier; 1967:211–49. [Google Scholar]
  • 5. Slotman  M, Della Torre  A, Powell  JR. The genetics of inviability and male sterility in hybrids between Anopheles gambiae and An. arabiensi s. Genetics. 2004;167(1):275–87. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Bernardini  F, Galizi  R, Wunderlich  M, et al.  Cross-species Y chromosome function between malaria vectors of the Anopheles gambiae species complex. Genetics. 2017;207(2):729–40. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Fontaine  MC, Pease  JB, Steele  A, et al.  Extensive introgression in a malaria vector species complex revealed by phylogenomics. Science. 2015;347(6217):1258524. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Thawornwattana  Y, Dalquen  D, Yang  Z. Coalescent analysis of phylogenomic data confidently resolves the species relationships in the Anopheles gambiae species complex. Mol Biol Evol. 2018;35(10):2512–27. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Aboagye-Antwi  F, Alhafez  N, Weedall  GD, et al.  Experimental swap of Anopheles gambia e's assortative mating preferences demonstrates key role of X-chromosome divergence island in incipient sympatric speciation. PLoS Genet. 2015;11(4):e1005141. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Diabate  A, Dabire  RK, Millogo  N, et al.  Evaluating the effect of postmating isolation between molecular forms of Anopheles gambiae (Diptera: Culicidae). J Med Entomol. 2007;44(1):60–4. [DOI] [PubMed] [Google Scholar]
  • 11. Tene Fossog  B, Ayala  D, Acevedo  P, et al.  Habitat segregation and ecological character displacement in cryptic African malaria mosquitoes. Evol Appl. 2015;8(4):326–45. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Diabate  A, Dabire  RK, Kim  EH, et al.  Larval development of the molecular forms of Anopheles gambiae(Diptera: Culicidae) in different habitats: a transplantation experiment. J Med Entomol. 2005;42(4):548–53. [DOI] [PubMed] [Google Scholar]
  • 13. Gimonneau  G, Bouyer  J, Morand  S, et al.  A behavioral mechanism underlying ecological divergence in the malaria mosquito Anopheles gambiae. Behav Ecol. 2010;21(5):1087–92. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Dao  A, Yaro  AS, Diallo  M, et al.  Signatures of aestivation and migration in Sahelian malaria mosquito populations. Nature. 2014;516(7531):387–90. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Tennessen  J, Ingham  VA, Toe  KH, et al.  A population genomic unveiling of a new cryptic mosquito taxon within the malaria-transmitting Anopheles gambiae complex. Mol Ecol. 2021;30(3):775–90. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Lewin  HA, Larkin  DM, Pontius  J, et al.  Every genome sequence needs a good map. Genome Res. 2009;19(11):1925–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Thomma  BP, Seidl  MF, Shi-Kunne  X, et al.  Mind the gap; seven reasons to close fragmented genome assemblies. Fungal Genet Biol. 2016;90:24–30. [DOI] [PubMed] [Google Scholar]
  • 18. Greppi  C, Laursen  WJ, Budelli  G, et al.  Mosquito heat seeking is driven by an ancestral cooling receptor. Science. 2020;367(6478):681–4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Saveer  AM, Pitts  RJ, Ferguson  ST, et al.  Characterization of chemosensory responses on the labellum of the malaria vector mosquito. Sci Rep. 2018;8(1):5656. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Hill  CA, Fox  AN, Pitts  RJ, et al.  G protein-coupled receptors in Anopheles gambiae. Science. 2002;298(5591):176–8. [DOI] [PubMed] [Google Scholar]
  • 21. Mancini  E, Tammaro  F, Baldini  F, et al.  Molecular evolution of a gene cluster of serine proteases expressed in the Anopheles gambiae female reproductive tract. BMC Evol Biol. 2011;11:72. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Mancini  E, Baldini  F, Tammaro  F, et al.  Molecular characterization and evolution of a gene family encoding male-specific reproductive proteins in the African malaria vectorAnopheles gambiae. BMC Evol Biol. 2011;11:292. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Christophides  GK, Zdobnov  E, Barillas-Mury  C, et al.  Immunity-related genes and gene families in Anopheles gambiae. Science. 2002;298(5591):159–65. [DOI] [PubMed] [Google Scholar]
  • 24. Adolfi  A, Poulton  B, Anthousi  A, et al.  Functional genetic validation of key genes conferring insecticide resistance in the major African malaria vector, Anopheles gambiae. Proc Natl Acad Sci U S A. 2019;116(51):25764–72. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Liu  K, Tsujimoto  H, Cha  SJ, et al.  Aquaporin water channel AgAQP1 in the malaria vector mosquito Anopheles gambiae during blood feeding and humidity adaptation. Proc Natl Acad Sci U S A. 2011;108(15):6062–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Holt  RA, Subramanian  GM, Halpern  A, et al.  The genome sequence of the malaria mosquito Anopheles gambiae. Science. 2002;298(5591):129–49. [DOI] [PubMed] [Google Scholar]
  • 27. Sharakhova  MV, Hammond  MP, Lobo  NF, et al.  Update of the Anopheles gambiae PEST genome assembly. Genome Biol. 2007;8(1):R5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Cheng  C, White  BJ, Kamdem  C, et al.  Ecological genomics of Anopheles gambiae along a latitudinal cline: a population-resequencing approach. Genetics. 2012;190(4):1417–32. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Caputo  B, Pichler  V, Mancini  E, et al.  The last bastion? X chromosome genotyping of Anopheles gambiae species pair males from a hybrid zone reveals complex recombination within the major candidate ‘genomic island of speciation’. Mol Ecol. 2016;25(22):5719–31. [DOI] [PubMed] [Google Scholar]
  • 30. Miles  A, Harding  NJ, Botta  G, et al.  Genetic diversity of the African malaria vector Anopheles gambiae. Nature. 2017;552(7683):96–100. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Schmidt  H, Lee  Y, Collier  TC, et al.  Transcontinental dispersal of Anopheles gambiae occurred from West African origin via serial founder events. Commun Biol. 2019;2:473. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Deitz  KC, Athrey  GA, Jawara  M, et al.  Genome-wide divergence in the West-African malaria vector Anopheles melas. G3 (Bethesda). 2016;6(9):2867–79. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Neafsey  DE, Waterhouse  RM, Abai  MR, et al.  Highly evolvable malaria vectors: The genomes of 16 Anopheles mosquitoes. Science. 2015;347(6217):1258522. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Lawniczak  MK, Emrich  SJ, Holloway  AK, et al.  Widespread divergence between incipient Anopheles gambiae species revealed by whole genome sequences. Science. 2010;330(6003):512–4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Waterhouse  RM, Aganezov  S, Anselmetti  Y, et al.  Evolutionary superscaffolding and chromosome anchoring to improve Anopheles genome assemblies. BMC Biol. 2020;18(1), doi: 10.1186/s12915-019-0728-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Kingan  SB, Heaton  H, Cudini  J, et al.  A high-quality de novo genome assembly from a single mosquito using PacBio sequencing. Genes (Basel). 2019;10(1), doi: 10.3390/genes10010062. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Richards  S, Murali  SC. Best practices in insect genome sequencing: what works and what doesn't. Curr Opin Insect Sci. 2015;7, doi: 10.1016/j.cois.2015.02.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Worley  KC, Richards  S, Rogers  J. The value of new genome references. Exp Cell Res. 2017;358(2):433–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Bickhart  DM, Rosen  BD, Koren  S, et al.  Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat Genet. 2017;49(4):643–50. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Goodwin  S, Gurtowski  J, Ethe-Sayers  S, et al.  Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Res. 2015;25(11):1750–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Russell  JA, Campos  B, Stone  J, et al.  Unbiased strain-typing of arbovirus directly from mosquitoes using Nanopore sequencing: A field-forward biosurveillance protocol. Sci Rep. 2018;8(1):5417. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Peritz  A, Paoli  GC, Chen  CY, et al.  Serogroup-level resolution of the “Super-7” Shiga toxin-producing Escherichia coliusing nanopore single-molecule DNA sequencing. Anal Bioanal Chem. 2018;410(22):5439–44. [DOI] [PubMed] [Google Scholar]
  • 43. Michael  TP, Jupe  F, Bemm  F, et al.  High contiguity Arabidopsis thaliana genome assembly with a single nanopore flow cell. Nat Commun. 2018;9(1):541. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Miller  DE, Staber  C, Zeitlinger  J, et al.  Highly contiguous genome assemblies of 15 Drosophilaspecies generated using nanopore sequencing. G3 (Bethesda). 2018;8(10):3131–41. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Ebbert  MTW, Farrugia  SL, Sens  JP, et al.  Long-read sequencing across the C9orf72 ‘GGGGCC’ repeat expansion: implications for clinical use and genetic discovery efforts in human disease. Mol Neurodegener. 2018;13(1):46. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Peichel  CL, Sullivan  ST, Liachko  I, et al.  Improvement of the threespine stickleback genome using a Hi-C-based proximity-guided assembly. J Hered. 2017;108(6):693–700. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Dudchenko  O, Batra  SS, Omer  AD, et al.  De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science. 2017;356(6333):92–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Matthews  BJ, Dudchenko  O, Kingan  SB, et al.  Improved reference genome of Aedes aegypti informs arbovirus vector control. Nature. 2018;563(7732):501–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Ghurye  J, Koren  S, Small  ST, et al.  A chromosome-scale assembly of the major African malaria vector Anopheles funestus. Gigascience. 2019;8(6), doi: 10.1093/gigascience/giz063. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Compton  A, Liang  J, Chen  C, et al.  The beginning of the end: a chromosomal assembly of the New World malaria mosquito ends with a novel telomere. G3 (Bethesda). 2020;10(10):3811–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Bracewell  R, Tran  A, Chatla  K, et al.  Chromosome-level assembly of Drosophila bifasciata reveals important karyotypic transition of the X chromosome. G3 (Bethesda). 2020;10(3):891–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. CE  E, Cao  W. Nanopore sequencing and Hi-C scaffolding provide insight into the evolutionary dynamics of transposable elements and piRNA production in wild strains of Drosophila melanogaster. Nucleic Acids Res. 2020;48(1):290–303. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Li  H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094–100. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. Davis  MP, van Dongen  S, Abreu-Goodger  C, et al.  Kraken: a set of tools for quality control and analysis of high-throughput sequence data. Methods. 2013;63(1):41–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. Ruan  J, Li  H. Fast and accurate long-read assembly with wtdbg2. Nat Methods. 2020;17(2):155–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56. Kolmogorov  M, Yuan  J, Lin  Y, et al.  Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol. 2019;37(5):540. [DOI] [PubMed] [Google Scholar]
  • 57. Li  H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics. 2016;32(14):2103–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58. Koren  S, Walenz  BP, Berlin  K, et al.  Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017;27(5):722–36. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59. Gurevich  A, Saveliev  V, Vyahhi  N, et al.  QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29(8):1072–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60. Marcais  G, Kingsford  C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27(6):764–70. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61. Simao  FA, Waterhouse  RM, Ioannidis  P, et al.  BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31(19):3210–2. [DOI] [PubMed] [Google Scholar]
  • 62. Seppey  M, Manni  M, Zdobnov  EM. BUSCO: assessing genome assembly and annotation completeness. Methods Mol Biol. 2019;1962:227–45. [DOI] [PubMed] [Google Scholar]
  • 63. Vaser  R, Sovic  I, Nagarajan  N, et al.  Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 2017;27(5):737–46. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64. medaka: Sequence correction provided by ONT Research. https://github.com/nanoporetech/medaka. Accessed 15 September 2020. [Google Scholar]
  • 65. Jain  M, Koren  S, Miga  KH, et al.  Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol. 2018;36(4):338–45. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66. Walker  BJ, Abeel  T, Shea  T, et al.  Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One. 2014;9(11):e112963. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67. Durand  NC, Shamim  MS, Machol  I, et al.  Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 2016;3(1):95–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68. Kaplan  N, Dekker  J. High-throughput genome scaffolding from in vivo DNA interaction frequency. Nat Biotechnol. 2013;31(12):1143–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69. Burton  JN, Adey  A, Patwardhan  RP, et al.  Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat Biotechnol. 2013;31(12):1119–25. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70. Marie-Nelly  H, Marbouty  M, Cournac  A, et al.  High-quality genome (re) assembly using chromosomal contact data. Nat Commun. 2014;5, doi: 10.1038/ncomms6695. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71. Putnam  NH, O'Connell  BL, Stites  JC, et al.  Chromosome-scale shotgun assembly using an in vitro method for long-range linkage. Genome Res. 2016;26(3):342–50. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72. Renschler  G, Richard  G, Valsecchi  CIK, et al.  Hi-C guided assemblies reveal conserved regulatory topologies on X and autosomes despite extensive genome shuffling. Genes Dev. 2019;33(21–22):1591–612. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73. Ghurye  J, Rhie  A, Walenz  BP, et al.  Integrating Hi-C links with assembly graphs for chromosome-scale assembly. PLoS Comput Biol. 2019;15(8):e1007273. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74. Dudchenko  O, Shamim  MS, Batra  S, et al.  The Juicebox Assembly Tools module facilitates de novo assembly of mammalian genomes with chromosome-length scaffolds for under $1000. bioRxiv. 2018, 10.1101/254797. [DOI] [Google Scholar]
  • 75. Roach  MJ, Schmidt  SA, Borneman  AR. Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC Bioinformatics. 2018;19(1):460. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76. Hall  AB, Qi  Y, Timoshevskiy  V, et al.  Six novel Y chromosome genes in Anophelesmosquitoes discovered by independently sequencing males and females. BMC Genomics. 2013;14:273. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77. Hall  AB, Papathanos  PA, Sharma  A, et al.  Radical remodeling of the Y chromosome in a recent radiation of malaria mosquitoes. Proc Natl Acad Sci U S A. 2016;113(15):E2114–23. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78. Krzywinski  J, Sangare  D, Besansky  NJ. Satellite DNA from the Y chromosome of the malaria vector Anopheles gambiae. Genetics. 2005;169(1):185–96. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79. Melters  DP, Bradnam  KR, Young  HA, et al.  Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution. Genome Biol. 2013;14(1):R10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80. Besansky  NJ, Powell  JR. Reassociation kinetics of Anopheles gambiae (Diptera: Culicidae) DNA. J Med Entomol. 1992;29(1):125–8. [DOI] [PubMed] [Google Scholar]
  • 81. Altschul  SF, Gish  W, Miller  W, et al.  Basic Local Alignment Search Tool. J Mol Biol. 1990;215(3):403–10. [DOI] [PubMed] [Google Scholar]
  • 82. Cabanettes  F, Klopp  C. D-GENIES: dot plot large genomes in an interactive, efficient and simple way. PeerJ. 2018;6:e4958. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83. Sharma  A, Kinney  NA, Timoshevskiy  VA, et al.  Structural variation of the X chromosome heterochromatin in the Anopheles gambiaecomplex. Genes (Basel). 2020;11(3), doi: 10.3390/genes11030327. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84. Guy  L, Kultima  JR, Andersson  SG. genoPlotR: comparative gene and genome visualization in R. Bioinformatics. 2010;26(18):2334–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85. Goel  M, Sun  H, Jiao  WB, et al.  SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biol. 2019;20(1):277. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86. Coluzzi  M, Sabatini  A, della Torre  A, et al.  A polytene chromosome analysis of the Anopheles gambiaespecies complex. Science. 2002;298(5597):1415–8. [DOI] [PubMed] [Google Scholar]
  • 87. Lobo  NF, Sangare  DM, Regier  AA, et al.  Breakpoint structure of the Anopheles gambiae 2Rb chromosomal inversion. Malar J. 2010;9:293. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 88. Sharakhov  IV, White  BJ, Sharakhova  MV, et al.  Breakpoint structure reveals the unique origin of an interspecific chromosomal inversion (2La) in the Anopheles gambiae complex. Proc Natl Acad Sci U S A. 2006;103(16):6258–62. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 89. Sharakhova  MV, George  P, Brusentsova  IV, et al.  Genome mapping and characterization of the Anopheles gambiaeheterochromatin. BMC Genomics. 2010;11:459. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 90. Lukyanchikova  V, Nuriddinov  M, Belokopytova  P, et al.  Anophelesmosquitoes revealed new principles of 3D genome organization in insects. bioRxiv. 2020; 10.1101/2020.05.26.114017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 91. De Coster  W, D'Hert  S, Schultz  DT, et al.  NanoPack: visualizing and processing long-read sequencing data. Bioinformatics. 2018;34(15):2666–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92. Keller  O, Kollmar  M, Stanke  M, et al.  A novel hybrid gene prediction method employing protein multiple sequence alignments. Bioinformatics. 2011;27(6):757–63. [DOI] [PubMed] [Google Scholar]
  • 93. Chen  S, Zhou  Y, Chen  Y, et al.  fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34(17):i884–i90. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 94. Genome Size Estimation Tutorial.  https://bioinformatics.uconn.edu/genome-size-estimation-tutorial. Accessed 28 December 2020. [Google Scholar]
  • 95. Li  H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv  2013:1303.3997. [Google Scholar]
  • 96. Zamyatin  A, Avdeyev  P, Liang  J, et al.  Supporting data for the chromosome-level genome assembly of the malaria vector Anopheles arabiensis. GigaScience Database. 2021. 10.5524/100860. [DOI] [PMC free article] [PubMed]
  • 97. Zamyatin  A, Avdeyev  P, Liang  J, et al.  Supporting data for the chromosome-level genome assembly of the malaria vectors Anopheles coluzzii. GigaScience Database. 2021. 10.5524/100861. [DOI] [PMC free article] [PubMed]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Citations

  1. Zamyatin  A, Avdeyev  P, Liang  J, et al.  Supporting data for the chromosome-level genome assembly of the malaria vector Anopheles arabiensis. GigaScience Database. 2021. 10.5524/100860. [DOI] [PMC free article] [PubMed]
  2. Zamyatin  A, Avdeyev  P, Liang  J, et al.  Supporting data for the chromosome-level genome assembly of the malaria vectors Anopheles coluzzii. GigaScience Database. 2021. 10.5524/100861. [DOI] [PMC free article] [PubMed]

Supplementary Materials

giab017_GIGA-D-20-00290_Original_Submission
giab017_GIGA-D-20-00290_Revision_1
giab017_Response_to_Reviewer_Comments_Original_Submission
giab017_Reviewer_1_Report_Original_Submission

Jacob Tennessen -- 11/12/2020 Reviewed

giab017_Reviewer_2_Report_Original_Submission

Shanlin Liu -- 11/28/2020 Reviewed

giab017_Supplemental_Files

Data Availability Statement

The raw genomic sequence reads and both genome assemblies underlying this article are available in the NCBI under project accession PRJNA634549 and the GigaScience GigaDB database [96, 97].


Articles from GigaScience are provided here courtesy of Oxford University Press

RESOURCES