Abstract
Introduced into Hawaii in the early 1900s, the Japanese white-eye or warbling white-eye (Zosterops japonicus) is now the most abundant land bird in the archipelago. Here, we present the first Z. japonicus genome, sequenced from an individual in its invasive range. This genome provides an important resource for future studies in invasion genomics. We annotated the genome using two workflows: standalone AUGUSTUS and BRAKER2. We found that AUGUSTUS was more conservative with gene predictions when compared with BRAKER2. The final number of annotated gene models was similar between the two workflows, but standalone AUGUSTUS had over 70% of gene predictions with Blast2GO annotations versus under 30% using BRAKER2. Additionally, we tested whether using RNA-seq data from 47 samples had a significant impact on annotation quality when compared with data from a single sample, as generating RNA-seq data for genome annotation can be expensive and requires well-preserved tissue. We found that more data did not significantly change the number of annotated genes using AUGUSTUS, but with BRAKER2 the number increased substantially. The results presented here will aid researchers in annotating draft genomes of nonmodel species as well as those studying invasion success.
Keywords: genome annotation, bird genomics, invasion biology
Significance
Biological invasions pose a great threat to global biodiversity, especially as habitat continues to degrade and resources become limited. Studying species in their invasive range can help us to understand the genetic basis of colonization success. Here, we provide a high-quality genomic resource for an extremely successful avian invader, the Japanese white-eye (Zosterops japonicus) in Hawaii. The genus Zosterops comprises nearly 100 species and is found across a vast geographical area. This genome is the first Zosterops genome sequenced from a species found in Asia and from an individual in its invasive range, making it an important addition to the currently available genomic resources for the genus. Additionally, we assessed how different pipelines and RNA-seq data sets affected overall genome annotation. This is an important resource for researchers who do not have funding to generate large RNA-seq data sets for genome annotation.
Introduction
The family Zosteropidae, specifically the genus Zosterops, represents one of the most rapid diversification events in birds, and in vertebrates as a whole (Moyle et al. 2009). The genus comprises approximately 100 species and has an estimated age of approximately 2 Myr (Moyle et al. 2009), earning the title of “the great speciator” (Diamond et al. 1976). Furthermore, Zosteropidae is the only rapidly speciating avian clade that is found across a vast geographical range, spanning Africa, Australasia, and the South Pacific (Van Riper and van Balen 2020), suggesting a strong colonization ability (Cowles and Uy 2019). Here, we focus on the Japanese white-eye (Zosterops japonicus), also known as the warbling white-eye, a native East Asian bird that was introduced to Hawaii in the early 1900s. Since its introduction, Z. japonicus has become the most abundant land bird across the Hawaiian archipelago (Berger 1972; Scott et al. 1989). Zosteropidae provides a unique opportunity to study the genomic architecture of colonization success and broaden our understanding of invasion genomics. To do so, however, we must first develop high-quality genomic resources.
In this study, we present a draft genome assembly of the Japanese white-eye. Additionally, we provide a comparative framework for evaluating the efficacy of multiple genome annotation pipelines based on the number of genes predicted, the number of genes annotated, and shared orthologs. Generating RNA-seq data for genome annotation can be difficult because of cost and limited access to well-preserved tissue. To address this, we evaluated whether generating more data improved annotation quality. We compared RNA-seq data from a single individual with data from multiple individuals in genome annotation and provide recommendations for future genome sequencing and annotation projects.
Materials and Methods
DNA Sequencing
We collected blood from a male Z. japonicus individual from Hilo, Hawaii (National Zoological Park IACUC #16-28). Blood was stored in RNAlater (Ambion, Austin, TX) at –80 °C after sampling. The blood sample was sent to HudsonAlpha Institute for Biotechnology for sample preparation and sequencing. DNA was extracted and libraries were prepared using the 10× Genomics Chromium Library Prep protocol (10× Genomics, Pleasanton, CA). Libraries were paired-end 2 × 150 bp sequenced on two Illumina HiSeq X lanes (Illumina, San Diego, CA).
RNA Sequencing
Tissue was collected from 47 Z. japonicus individuals sampled from the Big Island, Hawaii, including the individual that was used for genome sequencing (National Zoological Park IACUC #16-28). Adult individuals were sampled in Hilo (near sea level) and at approximately 3,000 m on Mauna Kea, Hawaii (21 males, 26 females). Pectoral muscle and heart tissue were immediately stored in liquid nitrogen after sampling. We extracted total RNA using TRIzol, prepared cDNA libraries with TruSeq Stranded mRNA library preparation kits, and sequenced libraries on a NovaSeq S4 lane (Illumina, San Diego, CA).
Genome Assembly
We filtered out low-quality sequence reads (phred quality score < 20) and adaptor sequences using TrimGalore v0.6.4 (Krueger 2015). The filtered reads were used to generate a k-mer histogram with Jellyfish (Marçais and Kingsford 2011). Using the histogram, we estimated genome size and heterozygosity in GenomeScope (Vurture et al. 2017). We mapped filtered reads to the Z. japonicus mitochondrial genome (NCBI Reference Sequence: KT601061.1) using bwa v0.7.17 (Li and Durbin 2009), then generated a consensus sequence and annotated the mitogenome using Geneious Prime (Biomatters 2021.0.1).
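The idea behind estimating genome size from a k-mer histogram can be illustrated with a simplified calculation. The sketch below is not the GenomeScope model, which fits a mixture of distributions to the histogram; it assumes a two-column histogram of (depth, count) pairs such as the output of `jellyfish histo`, treats k-mers below an assumed depth cutoff as sequencing errors, and divides the total number of remaining k-mers by the depth of the tallest peak. The function name, the cutoff, and the toy histogram are illustrative assumptions.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Rough genome size estimate from a k-mer histogram.

    histogram: iterable of (depth, count) pairs.
    min_depth: assumed cutoff; k-mers seen fewer times are treated as errors.
    Returns (estimated_size, peak_depth).
    """
    kept = [(depth, count) for depth, count in histogram if depth >= min_depth]
    if not kept:
        raise ValueError("no k-mers at or above min_depth")
    # The depth with the most distinct k-mers approximates k-mer coverage.
    peak_depth = max(kept, key=lambda pair: pair[1])[0]
    # Total k-mer occurrences divided by coverage approximates genome size.
    total_kmers = sum(depth * count for depth, count in kept)
    return total_kmers / peak_depth, peak_depth


# Toy histogram with made-up numbers: error k-mers at depth 1-2, peak at 20.
toy_histogram = [(1, 5000), (2, 800), (18, 90), (19, 150),
                 (20, 200), (21, 140), (22, 80)]
size, peak = estimate_genome_size(toy_histogram)
print(f"peak depth = {peak}, estimated size = {size:.1f}")
```

In a highly heterozygous genome the tallest peak may correspond to heterozygous k-mers rather than homozygous ones, which is one reason a model-based tool is preferable to this peak-based shortcut.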
The nuclear genome was assembled from raw sequencing reads with Supernova 2.0 (Weisenfeld et al. 2017), and a single record was generated per scaffold (--style=pseudohap). Summary statistics were generated using assembly_stats v0.1.4 (Trizna 2020; supplementary table S1, Supplementary Material online). We used Kraken v2 (Wood and Salzberg 2014) to identify contaminants in our assembly and removed scaffolds that were classified as bacterial. Genome completeness was assessed with BUSCO v3 (Simão et al. 2015; Waterhouse et al. 2018) using the avian data set (aves_odb9, 4,915 BUSCOs).
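For readers unfamiliar with the contiguity statistics reported for assemblies, the following minimal sketch shows how N50 and L50 are conventionally defined. It is an illustration only, not the assembly_stats implementation, and the function name and example lengths are assumptions.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half of the assembly;
    L50 is the number of sequences in that set.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


# Made-up lengths: total is 30, so half is reached after the second sequence.
print(n50_l50([10, 8, 6, 4, 2]))
```

A higher N50 together with a lower L50 indicates that more of the assembly is contained in fewer, longer sequences.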
Genome Annotation
We annotated the genome using a multifaceted approach of repeat identification, ab initio gene prediction, and functional annotation (supplementary fig. S1, Supplementary Material online). Repetitive and low-complexity DNA sequences were identified and masked using RepeatMasker v4.0.6 (Smit et al. 2013–2015) with the chicken (Gallus gallus) database of repeats (RepeatMasker Library db20140131). Gene prediction was done using two approaches: a modified version of the AUGUSTUS (Stanke et al. 2006) workflow outlined in Tsuchiya et al. (2020) (workflow 1) and BRAKER2 (workflow 2; Brůna et al. 2021). BRAKER2 uses genomic and RNA-seq data and fully automated training of GeneMark-EP+ (Brůna et al. 2020) and AUGUSTUS for gene prediction. As part of both workflows, we used RNA-seq data generated from the pectoral muscle and heart tissue of the individual from which the genome was sequenced (data set 1) and also concatenated data from 47 Z. japonicus individuals (data set 2), including the individual used in data set 1. Briefly, we ran Rcorrector v1.0.4 (Song and Florea 2015) on raw reads, removed reads deemed unfixable, and quality filtered the remaining reads using TrimGalore v0.6.4 (--length 36 -q 5 --stringency 1 -e 0.1). Reads were mapped to the softmasked assembly using the two-pass method in STAR v2.7 (Dobin et al. 2013).
In workflow 1, we used AUGUSTUS to make evidence-based predictions using gene models trained on Z. japonicus during BUSCO analysis and hints generated from RepeatMasker output and RNA-seq data (using AUGUSTUS bam2wig and bam2hints scripts). The assembly was partitioned using EVidenceModeler (Haas et al. 2008), AUGUSTUS was run on each scaffold individually, and results were concatenated. In workflow 2, we inputted our softmasked assembly and bam output from STAR into BRAKER2, a fully automated gene-prediction pipeline. Each workflow was run on data set 1 and data set 2 individually.
To functionally annotate the genome, amino acid sequences of predicted gene models were searched against the nonredundant protein database using blastp (Altschul et al. 1990; nr, max_target_seqs 10, e-value 1e-4). We inputted the blastp output into OmicsBox (BioBam Bioinformatics 2019) and used InterProScan, gene ontology mapping, and Blast2GO (Götz et al. 2008) to functionally annotate gene models. We identified orthologous groups using OrthoFinder v2.4.0 (Emms and Kelly 2019) and compared results between each workflow on each data set.
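The partition, predict, and concatenate pattern used in workflow 1 can be sketched in plain Python. This is only an illustration of the pattern, not the EVidenceModeler partitioning step or the AUGUSTUS invocation used in the study: `read_fasta` splits a multi-FASTA assembly into per-scaffold records, and `concatenate_gff` joins per-scaffold GFF text while dropping repeated `##gff-version` header lines. Both function names and the toy inputs are hypothetical.

```python
def read_fasta(lines):
    """Yield (scaffold_name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the scaffold name.
            parts = line[1:].split(maxsplit=1)
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def concatenate_gff(per_scaffold_gff):
    """Join GFF texts, keeping the '##gff-version' line only from the first."""
    merged = []
    for index, text in enumerate(per_scaffold_gff):
        for line in text.splitlines():
            if index > 0 and line.startswith("##gff-version"):
                continue
            merged.append(line)
    return "\n".join(merged) + "\n"


# Toy assembly with two scaffolds; sequences are made up.
toy_assembly = [">scaffold_1 length=6", "ACGT", "AC", ">scaffold_2", "GG"]
for scaffold, sequence in read_fasta(toy_assembly):
    print(scaffold, len(sequence))
```

Running the gene predictor once per scaffold makes the job easy to parallelize on a cluster, at the cost of a final merge step such as the one sketched here.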
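The proportion of predicted gene models with a blastp match can be tallied from the search output. The sketch below assumes BLAST tabular output (`-outfmt 6`) with the default 12 columns, in which column 1 is the query ID and column 11 is the e-value; the output format used in the study is not stated, so this is an assumption, as are the function name and the toy records.

```python
def queries_with_hits(tabular_lines, max_evalue=1e-4):
    """Return the set of query IDs with at least one hit at or below max_evalue.

    Assumes tab-separated BLAST output with the default 12 columns
    (query ID in column 1, e-value in column 11).
    """
    hits = set()
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) <= max_evalue:
            hits.add(fields[0])
    return hits


# Made-up records for three query proteins; g3 has only a weak hit.
toy_output = [
    "g1\tsubjA\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-50\t400",
    "g1\tsubjB\t60.0\t180\t70\t2\t5\t185\t3\t182\t2e-10\t90",
    "g2\tsubjC\t75.0\t150\t37\t1\t1\t150\t1\t150\t5e-30\t210",
    "g3\tsubjD\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20",
]
matched = queries_with_hits(toy_output)
total_queries = 3
print(f"{len(matched)} of {total_queries} gene models have a blastp match")
```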
Results and Discussion
Genome Assembly
We sequenced 122.6 Gb (85× coverage) from two HiSeq X lanes. Genome size was estimated to be 1.1 Gb, which is similar to other species in the genus Zosterops, such as Z. lateralis (PRJNA277310) and Z. borbonicus (PRJNA530916). The final assembly had a GC content of 41.49% and comprises 73,846 contigs with a contig N50 of 38,076 bp and 29,443 scaffolds with a scaffold N50 of 9.69 Mb (fig. 1A; supplementary table S1, Supplementary Material online). Our genome assembly had the highest N50 and lowest L50 and the third lowest number of scaffolds (after Z. lateralis and Z. borbonicus) when compared with other Zosterops genomes available on GenBank (supplementary table S2, Supplementary Material online). Out of the 4,915 avian orthologs searched, we recovered 4,366 complete (88.9%) and 293 (6%) fragmented BUSCOs. A total of 256 BUSCOs were missing (5.1%). We recovered the complete mitochondrial genome (17,819 bp).
Fig. 1.
(A) Visualization of assembly stats (https://github.com/rjchallis/assembly-stats) displaying the scaffold N50 in dark orange, N90 in light orange, GC content in dark blue, and BUSCO results in green. (B) Illustration of Zosterops japonicus. (C) Venn diagram of orthologs identified by OrthoFinder comparing AUGUSTUS (workflow 1) and BRAKER (workflow 2) results from both data sets (D1 and D2), drawn with http://bioinformatics.psb.ugent.be/webtools/Venn/.
Genome Structural Contents
Genome-wide heterozygosity was high, estimated at 1.01% by GenomeScope. RepeatMasker identified 2.76% of the genome as repetitive elements (supplementary table S3, Supplementary Material online), primarily LINEs (1.24%) and simple repeats (1.08%). On average, avian genomes contain fewer repetitive elements (<10%) when compared with other tetrapod vertebrates, like mammals (Böhne et al. 2008; Zhang et al. 2014).
Genome Annotation
Workflow 1 identified 24,213 gene models from the single RNA-seq sample data set (Augustus_D1) and 21,571 gene models from the multi-sample data set (Augustus_D2; supplementary table S4, Supplementary Material online). BLAST and Blast2GO results were similar for both data sets, with blastp finding matches for 91% (22,129) of the predicted gene models in data set 1 and 92% (19,882) in data set 2. Seventy-two percent (17,489) of the gene models in data set 1 were functionally annotated by Blast2GO, and 73% (15,673) in data set 2. Workflow 2 identified 69,835 gene models using data set 1 (Braker_D1) and 74,311 gene models using data set 2 (Braker_D2; supplementary table S4, Supplementary Material online). Similar to workflow 1, BLAST and Blast2GO results were comparable for both data sets using workflow 2, with blastp finding matches for 41% (28,885) of the predicted gene models in data set 1 and 42% (30,934) in data set 2. Blast2GO functionally annotated 22% (15,029) of the gene models in data set 1 and 26% (19,644) in data set 2. Overall, we found that workflow 1, using standalone AUGUSTUS, was more conservative in gene prediction, predicting around 20,000 genes for both data sets, when compared with workflow 2 (BRAKER2), which predicted approximately 70,000 genes in both cases. Although >70% of the predicted genes were Blast2GO annotated for both data sets using workflow 1, under 30% were annotated using workflow 2. For comparison, the zebra finch genome has 18,447 gene models, of which 17,475 are protein-coding genes, predicted using RNA-seq data from seven tissues (Warren et al. 2010).
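The percentages above follow directly from the reported counts, and the short script below reproduces them as a worked check. The counts are taken from the text; the dictionary layout and helper name are illustrative choices. The last two lines compute the change in Blast2GO-annotated gene models between data sets within each workflow.

```python
# run: (predicted gene models, blastp matches, Blast2GO annotated)
counts = {
    "Augustus_D1": (24213, 22129, 17489),
    "Augustus_D2": (21571, 19882, 15673),
    "Braker_D1": (69835, 28885, 15029),
    "Braker_D2": (74311, 30934, 19644),
}


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


for run, (predicted, blast, b2g) in counts.items():
    print(f"{run}: blastp {percent(blast, predicted):.1f}%, "
          f"Blast2GO {percent(b2g, predicted):.1f}%")

# Difference in Blast2GO-annotated gene models between data sets.
augustus_diff = counts["Augustus_D1"][2] - counts["Augustus_D2"][2]
braker_diff = counts["Braker_D2"][2] - counts["Braker_D1"][2]
print(f"AUGUSTUS: {augustus_diff} fewer with D2; BRAKER2: {braker_diff} more with D2")
```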
OrthoFinder found a total of 58,326 orthologs across all analyses, of which 12,936 were shared among all four annotation runs (fig. 1C). Overall, results generated with the same workflow shared more orthologs than results generated from the same data set. The two workflow 1 runs shared 4,127 unique and 18,388 total orthologs, and the two workflow 2 runs shared 38,029 unique and 51,771 total orthologs. By contrast, the two workflows shared 31 unique and 13,821 total orthologs for data set 1, and 16 unique and 14,229 total orthologs for data set 2. This suggests that genome annotation results are greatly affected by the bioinformatics pipeline used. Further analysis is needed to identify whether certain orthogroups are over- or underrepresented in the results from each workflow.
This high-quality genome provides a useful resource for research on Zosterops and related taxa, as well as on colonization and invasion success, and can be used for comparative analyses with native white-eyes in the future. Additionally, we showed that different bioinformatic workflows produce varying results in genome annotation and that standalone AUGUSTUS is more conservative in gene prediction than BRAKER2, with a substantially higher percentage of genes predicted by AUGUSTUS receiving Blast2GO annotations. Furthermore, we found that with AUGUSTUS, RNA-seq data from multiple individuals resulted in 1,816 fewer annotated genes than data from a single individual, whereas with BRAKER2, more data resulted in substantially more (4,615) annotated genes. Therefore, we recommend using AUGUSTUS (workflow 1) for genome annotation when limited RNA-seq data are available. It is, however, important to note that BRAKER2 provides a straightforward pipeline for gene prediction that is easily accessible to researchers new to bioinformatics. The results presented in this study provide an important resource for researchers annotating draft genomes of nonmodel species as well as those studying invasion success.
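The shared-ortholog counts in a four-way comparison of this kind can be expressed as set operations. The sketch below is one reading of the "total" and "unique" categories, assumed here to mean orthogroups found in both runs of a pair regardless of the other runs, and orthogroups found in that pair and in no other run, respectively. The function names and the toy orthogroup IDs are hypothetical and do not come from the study's data.

```python
def shared_by_all(sets_by_run):
    """Orthogroups present in every run."""
    return set.intersection(*sets_by_run.values())


def shared_by_pair(sets_by_run, run_a, run_b):
    """Return (total, unique) orthogroups shared by run_a and run_b.

    total: present in both runs, regardless of the other runs.
    unique: present in both runs and absent from every other run.
    """
    both = sets_by_run[run_a] & sets_by_run[run_b]
    others = set().union(*(groups for name, groups in sets_by_run.items()
                           if name not in (run_a, run_b)))
    return both, both - others


# Toy orthogroup sets with made-up IDs.
toy = {
    "Augustus_D1": {"OG1", "OG2", "OG3"},
    "Augustus_D2": {"OG1", "OG2", "OG4"},
    "Braker_D1": {"OG1", "OG3", "OG5", "OG6"},
    "Braker_D2": {"OG1", "OG4", "OG5", "OG6"},
}
print(sorted(shared_by_all(toy)))
total, unique = shared_by_pair(toy, "Braker_D1", "Braker_D2")
print(len(total), len(unique))
```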
Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.
Acknowledgments
All computations were performed on the Smithsonian Institution High-Performance Cluster (https://doi.org/10.25572/SIHPC). We thank Kristina Paxton, Eben Paxton, Nancy McInerney, Rebecca Dikow, Vanessa Gonzalez, Lilly Parker, Alyssa Kaganer, and Michael Campana for support and advice. The project was funded by a Smithsonian Scholarly Studies Award (to R.C.F., M.V., and Z. Cheviron), National Geographic Committee on Research and Exploration Grant WW-R011-17 (to R.C.F., M.V., and Z.C.), NSF Graduate Research Fellowship (to M.V.), Smithsonian Institution predoctoral fellowship (to M.V.), and postdoctoral fellowship funded by the Smithsonian Women's Committee (to M.T.N.T.).
Data Availability
Genome assembly and raw sequencing data are available on GenBank under BioProject PRJNA625760. Mitogenome assembly is available on GenBank under accession number MW574481. Genome annotation GFFs for all workflows and data sets are available on Figshare (https://smithsonian.figshare.com/projects/Comparative_analysis_of_annotation_pipelines_using_the_first_Japanese_white-eye_Zosterops_japonicus_genome/96590).
Literature Cited
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol. 215(3):403–410.
- Berger AJ. 1972. Hawaiian birdlife. Honolulu (Hawaii): University of Hawaii Press.
- BioBam Bioinformatics 2019. OmicsBox—Bioinformatics Made Easy. Available from: https://www.biobam.com/omicsbox.
- Böhne A, Brunet F, Galiana-Arnoux D, Schultheis C, Volff JN. 2008. Transposable elements as drivers of genomic and biological diversity in vertebrates. Chromosome Res. 16(1):203–215.
- Brůna T, Hoff KJ, Lomsadze A, Stanke M, Borodovsky M. 2021. BRAKER2: automatic Eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genom Bioinform. 3(1):1–11.
- Brůna T, Lomsadze A, Borodovsky M. 2020. GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins. NAR Genom Bioinform. 2:1–14.
- Cowles SA, Uy JAC. 2019. Rapid, complete reproductive isolation in two closely related Zosterops White-eye bird species despite broadly overlapping ranges. Evolution 73(8):1647–1662.
- Diamond JM, Gilpin ME, Mayr E. 1976. Species distance relation for birds of the Solomon Archipelago, and the paradox of the great speciators. Proc Natl Acad Sci USA. 73(6):2160–2164.
- Dobin A, et al. 2013. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15–21.
- Emms DM, Kelly S. 2019. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 20(1):238.
- Götz S, et al. 2008. High-throughput functional annotation and data mining with the Blast2GO suite. Nucleic Acids Res. 36(10):3420–3435.
- Haas BJ, et al. 2008. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9(1):R7–22.
- Krueger F. 2015. A wrapper tool around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files. Available from: https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/.
- Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25(14):1754–1760.
- Marçais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6):764–770.
- Moyle RG, Filardi CE, Smith CE, Diamond J. 2009. Explosive Pleistocene diversification and hemispheric expansion of a ‘great speciator’. Proc Natl Acad Sci USA. 106(6):1863–1868.
- Scott JM, Ramsey FL, Kepler CB. 1989. Forest bird communities of the Hawaiian Islands: their dynamics, ecology, and conservation. Stud Avian Biol. 9:431.
- Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. 2015. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31(19):3210–3212.
- Smit A, Hubley R, Green P. 2013–2015. RepeatMasker Open-4.0. Available from: http://www.repeatmasker.org.
- Song L, Florea L. 2015. Rcorrector: efficient and accurate error correction for illumina RNA-seq reads. GigaScience 4(1):1–8.
- Stanke M, Schöffmann O, Morgenstern B, Waack S. 2006. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7:62–11.
- Trizna M. 2020. assembly_stats 0.1.4. Zenodo. doi:10.5281/zenodo.3968775.
- Tsuchiya MTN, et al. 2020. Whole genome sequencing of procyonids reveals distinct demographic histories in kinkajou (Potos flavus) and northern raccoon (Procyon lotor). Genome Biol Evol. 13(1):evaa255.
- Van Riper SG, van Balen B. 2020. Warbling White-eye (Zosterops japonicus), version 1.0. In: Billerman SM, editor. Birds of the World. Ithaca (NY): Cornell Lab of Ornithology. Available from: 10.2173/bow.warwhe1.01.
- Vurture GW, et al. 2017. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics 33(14):2202–2204.
- Warren WC, et al. 2010. The genome of a songbird. Nature 464(7289):757–762.
- Waterhouse RM, et al. 2018. BUSCO applications from quality assessments to gene prediction and phylogenomics. Mol Biol Evol. 35(3):543–548.
- Weisenfeld NI, Kumar V, Shah P, Church DM, Jaffe DB. 2017. Direct determination of diploid genome sequences. Genome Res. 27(5):757–767.
- Wood DE, Salzberg SL. 2014. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15(3):r46.
- Zhang G, et al. 2014. Comparative genomics reveals insights into avian genome evolution and adaptation. Science 346(6215):1311–1321.

