A high quality assembly of the Nile Tilapia (Oreochromis niloticus) genome reveals the structure of two sex determination regions

Matthew A Conte; William J Gammerdinger; Kerry L Bartie; David J Penman; Thomas D Kocher

doi:10.1186/s12864-017-3723-5

. 2017 May 2;18:341. doi: 10.1186/s12864-017-3723-5

A high quality assembly of the Nile Tilapia (Oreochromis niloticus) genome reveals the structure of two sex determination regions

Matthew A Conte ¹, William J Gammerdinger ¹, Kerry L Bartie ², David J Penman ², Thomas D Kocher ^1,^✉

PMCID: PMC5414186 PMID: 28464822

Abstract

Background

Tilapias are the second most farmed fishes in the world and a sustainable source of food. Like many other fish, tilapias are sexually dimorphic and sex is a commercially important trait in these fish. In this study, we developed a significantly improved assembly of the tilapia genome using the latest genome sequencing methods and show how it improves the characterization of two sex determination regions in two tilapia species.

Results

A homozygous clonal XX female Nile tilapia (Oreochromis niloticus) was sequenced to 44X coverage using Pacific Biosciences (PacBio) SMRT sequencing. Dozens of candidate de novo assemblies were generated and an optimal assembly (contig NG50 of 3.3Mbp) was selected using principal component analysis of likelihood scores calculated from several paired-end sequencing libraries. Comparison of the new assembly to the previous O. niloticus genome assembly reveals that recently duplicated portions of the genome are now well represented. The overall number of genes in the new assembly increased by 27.3%, including a 67% increase in pseudogenes. The new tilapia genome assembly correctly represents two recent vasa gene duplication events that have been verified with BAC sequencing. At total of 146Mbp of additional transposable element sequence are now assembled, a large proportion of which are recent insertions. Large centromeric satellite repeats are assembled and annotated in cichlid fish for the first time. Finally, the new assembly identifies the long-range structure of both a ~9Mbp XY sex determination region on LG1 in O. niloticus, and a ~50Mbp WZ sex determination region on LG3 in the related species O. aureus.

Conclusions

This study highlights the use of long read sequencing to correctly assemble recent duplications and to characterize repeat-filled regions of the genome. The study serves as an example of the need for high quality genome assemblies and provides a framework for identifying sex determining genes in tilapia and related fish species.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-017-3723-5) contains supplementary material, which is available to authorized users.

Keywords: Genome assembly, Pacific Biosciences SMRT sequencing, Transposable elements, Gene duplication, Sex chromosome, Sex determination, Aquaculture, Tilapia

Background

Aquaculture plays an increasingly important role in providing sustainable seafood products and has significantly outpaced capture fisheries in the past several decades [1]. Tilapias are among the most important farmed fishes, and tilapia production continues to expand exponentially across the globe [2]. An important aspect of commercial production is the control of sexual differentiation. Males grow to market-size earlier than females. Females also start to reproduce at a smaller size, filling production ponds with small fish [3]. It is therefore advantageous to grow-out only male fish. At one time, all-male populations were produced through interspecific crosses [4], but the strains supporting this technology have been lost or contaminated. Currently, the standard way of achieving all male or nearly all male tilapia populations is via hormonal masculinization [3, 5]. A reliable way of producing genetically all-male tilapia would allow the replacement of hormonal masculinization, which is banned in several major producing countries (although not enforced in most cases). It is therefore important to understand the genetic basis of sex determination in current aquaculture stocks.

Sex determination in tilapias is largely genetic, although environmental factors also play a role [6–8]. In Nile tilapia (Oreochromis niloticus), distinct XY sex determining loci have been identified on both linkage group (LG) 1 and LG23 [9, 10]. The closely related blue tilapia (O. aureus) segregates both an XY locus on LG1, and an epistatically dominant ZW locus on LG3 [11]. Additional sex determining loci have been identified on LGs 5, 7, 13, 18 and 20 in closely related species of East African cichlid [12–14]. As a group, tilapias and related species of other cichlid fishes are a promising model system for understanding the gene network controlling sex determination in vertebrates.

Work to identify the genes underlying each of these sex determiners has been hampered by the incomplete nature of previous draft genome assemblies, and by the discovery that many of these sex determiners are located in large blocks of highly differentiated, and sometimes repetitive sequence. To date, the molecular genetic basis for sex determination in cichlids has been determined for only the LG23 XY locus in O. niloticus [15].

Although several draft genome sequences are available for cichlids, these are mostly based on short Illumina sequencing reads [16]. The previous O. niloticus assembly was produced using ~277X coverage of Illumina reads from several libraries including a 40kbp scaffolding library. Recently duplicated and highly repetitive sequences are typically collapsed in these assemblies [17]. Indeed some of the most interesting and perhaps evolutionarily important regions of the genome may be the most difficult to assemble accurately. Recently duplicated regions are notoriously difficult to assemble due to their repeat length and high sequence identity [18]. The repetitive “dark-matter” part of the genome is vastly underrepresented in the majority of current genome assemblies [19]. Attempts to assemble these regions using only short read sequencing are futile [20]. Only long sequencing reads will produce more contiguous and complete assemblies of complex vertebrate genomes [21–25]. The importance of such high quality assemblies for downstream applications cannot be overemphasized.

Here we report a new assembly of the tilapia genome from long PacBio sequence reads. This assembly contains much of the missing sequence from previous assemblies, and is among the most contiguous vertebrate genome assemblies to date. We use this new assembly to further characterize the tilapia sex determining loci previously identified on LGs 1 and 3.

Results

Assembly overview

A homozygous clonal XX female tilapia individual [26] was chosen for genome sequencing. The individual was sequenced to 44X coverage using PacBio sequencing of 63 SMRT cells using the P6-C4 chemistry. This yielded 5,085,371 reads with a mean subread length of 8,747 bp and N50 read length of 11,366 bp.

An overview of the assembly process is outlined in Fig. 1. To summarize, 37 candidate de novo assemblies were generated using both the FALCON [22] and Canu [27] genome assembly packages. Multiple parameters were adjusted for both algorithms to tune the assemblies. The error correction steps of both algorithms include parameters that control alignment seed length, read length, overlap length and error rates (see Methods).

Evaluating assemblies

The 37 candidate assemblies were evaluated using a number of different metrics, techniques and complementary datasets. First, each of the candidate assemblies was evaluated using ALE assembly likelihood estimates [28] (which integrated read quality, mate-pair orientation, insert size, coverage and k-mer frequencies) based on alignment of the reads from four separate Illumina libraries and of the 44X PacBio dataset (see Methods). Candidate assemblies were also evaluated for completeness using CEGMA [29] and BUSCO [30] core gene sets, as well as by aligning existing O. niloticus RefSeq [31] transcripts. A set of 193,027 BAC-end sequences [32] representing ~29X physical clone coverage were used to assess the longer range accuracy of candidate assemblies. Finally, both a physical radiation-hybrid (RH) map consisting of 1,256 markers [33] and a RAD-seq genetic map consisting of 3,802 markers [34] were used to estimate the number of misassemblies present in each of the candidate assemblies. The results of these analyses are provided in Additional file 1.

Ranking assemblies

No single candidate assembly ranked the highest for all of the evaluation metrics that were computed. Principal component analysis (PCA) was used to reduce the various assembly evaluation metrics and compare the candidate assemblies. Additional file 2 shows that the Canu assemblies tend to cluster separately from the FALCON assemblies in the PCA space. The total assembly size and number of RefSeq exons mapped explained the largest amount of variance and were correlated. These two metrics did not seem like the most important metrics to base the evaluation upon since assembly parameters could be tuned to change the total size and the estimated genome size was 1.082Gbp [35].

The ALE likelihood scores explain the next largest proportion of the variance. The 37 candidate assemblies were ranked by overall ALE scores for each of the five sequencing libraries. An average of the ALE ranks was then calculated. The Canu assembly (#14) that was chosen as the best among the 37 candidate assemblies showed the best average ALE ranks. In addition, Canu assembly #14 had one of the best rates of properly mapped BAC-end sequences, and possessed among the fewest misassemblies as determined by conflicts with the RH and RAD map data (Additional file 1). These results suggest that Canu assembly #14 has the best long-range accuracy while maintaining comparable short-range accuracy.

Polishing

A relatively small number of sequence errors remained in the intermediate unpolished Canu #14 assembly. To correct these errors, first the raw 44X PacBio reads were aligned to the Canu assembly and Quiver was used to polish the assembly at 1,870,943 sites (see Methods). Quiver corrected 1,739,112 (92.95%) insertions, 88,037 (4.71%) substitutions and 43,794 (2.34%) deletions. Next, four Illumina libraries, totaling 277X coverage, were aligned to the intermediate Quiver-polished assembly. Based upon these alignments, Pilon polished an additional 1,101,609 sites. Pilon corrected 1,087,107 (98.68%) insertions, 12,402 (1.13%) substitutions, and 2,100 (0.19%) deletions.

Detection of misassemblies

The polished intermediate assembly showed high accuracy at the level of individual bases and with respect to the placement of paired-end sequences from ~150kbp BACs (Additional file 1). However, 32 putative inter-chromosomal misassemblies were identified by alignment to the RH and RAD maps. The RH and RAD maps both identified 21 of these inter-chromosomal misassemblies. The RAD map identified an additional 8 putative misassemblies that were not identified using the RH map (the RH map had no markers aligning to these regions), while the RH map identified an additional 3 misassemblies that were not identified using the RAD map (likewise, the RAD map had no markers aligning to these regions). The regions around each putative misassembly were inspected using the genomic resources already mentioned. Each had a characteristic signature consisting of a high density of variants in the 44X PacBio read alignments, as well as low or zero physical coverage of the 40kbp insert Illumina mate-pair library. An example of these misassembly signatures is shown in Additional file 3.

Genome wide analysis of the intermediate assembly for each of these characteristic signatures detected 110 regions of high-density PacBio variants and 376 regions of low physical coverage in the 40kbp mate-pair library. 41 regions had both a high-density of PacBio variants and low physical 40kbp mate-pair coverage. Nine of these regions showed correct alignment to both maps and therefore were not included in the set of putative misassemblies. However, two of these regions were identified by the PacBio variants and low 40kbp physical coverage in regions where there were no markers in either the RH or RAD map and added to the 32 map-based misassemblies giving a total of 34 sites of likely misassembly. Table 1 provides a summary of the putative misassemblies that were identified by the maps and sequence alignment methods.

Table 1.

Number of putative misassemblies identified by various methods

Evidence	Number of misassemblies detected
Both maps	21
RAD map only	8
RH map only	3
Both PacBio variants and 40kbp library	16
PacBio variants only	2
40kbp library only	16
PacBio variants and 40kbp library, but no maps	2

Open in a new tab

Analysis of the repetitive elements within these regions revealed that misassembly locations were enriched for highly repetitive interspersed and nested repeats. We examined the region ~75kbp on both sides of the likely misassembly breakpoints and found that 94.51% of these regions were classified as repetitive (see Methods). These regions were enriched for several TE families. Table 2 shows the enrichment of the most common repeats and TEs within the misassembly regions. In each of these cases, the mean length of these repeats was longer within the misassembly regions. Some of the same TE families that are abundant across the whole genome (e.g., DNA-TcMar-Tc1, LINE-L2, LINE-Rex-Babar) are also present in high frequency in the misassembly regions. However, some TE families that occurred in relatively low frequency across the whole genome (e.g., DNA-Sola, LTR-ERV1, RC-Helitron, and satellite repeats) are highly enriched in the misassembly regions.

Table 2.

Repeats in putative misassembly regions compared to the whole genome

Repetitive element		O_niloticus_UMD1 genome wide				Within 34 misassembly regions				Enrichment
Order	Family	#	Total bp	%	Mean length	#	Total bp	%	Mean length	Ratio of genome/missassem regions	Δ mean length
DNA	Sola	7,007	1,536,337	0.15%	219.3	142	53,045	1.37%	373.6	913.3%	154.3
	TcMar-Tc1	156,588	46,394,192	4.60%	296.3	846	354,606	9.18%	419.2	199.6%	122.9
	hAT	36,441	10,103,158	1.00%	277.2	289	99,016	2.56%	342.6	256.0%	65.4
	hAT-Ac	49,528	17,626,929	1.75%	355.9	445	191,054	4.95%	429.3	282.9%	73.4
	hAT-Charlie	37,049	13,709,558	1.36%	370.0	236	99,913	2.59%	423.4	190.4%	53.3
LINE	L1	10,712	8,879,041	0.88%	828.9	63	79,410	2.06%	1,261	234.1%	431.6
	L2	76,937	29,334,193	2.91%	381.3	539	269,645	6.98%	500.3	239.9%	119.0
	Penelope	28,509	7,214,522	0.71%	253.1	258	70,627	1.83%	273.7	257.7%	20.7
	Rex-Babar	38,996	19,208,630	1.90%	492.6	386	199,730	5.17%	517.4	272.1%	24.9
LTR	ERV1	10,756	6,450,995	0.64%	599.8	112	97,055	2.51%	866.6	392.2%	266.8
LTR	Gypsy	29,201	13,615,743	1.35%	466.3	274	172,231	4.46%	628.6	330.4%	162.3
RC	Helitron	3,882	2,685,111	0.27%	691.7	216	187,805	4.86%	869.5	1800.0%	177.8
Unk.	Unknown	350,007	97,456,302	9.66%	278.4	2,188	841,264	21.78%	384.5	225.5%	106.0
Satellite	Satellite	10,061	7,662,597	0.76%	761.6	93	115,309	2.99%	1,240	393.4%	478.3
Satellite	Simple	322,732	15,543,664	1.54%	48.2	1,733	173,824	4.50%	100.3	292.2%	52.1

Open in a new tab

Anchoring

Table 3 provides the anchored size of each LG, including gaps. The new O_niloticus_UMD1 assembly anchored 868.6Mbp of the total genome (86.9%), which is 211Mbp (32%) more than was anchored in the previous “Orenil1.1” assembly (657Mbp) [16]. When gaps are not counted, the amount of anchored, non-gap, sequence is 864Mbp (86.4%) compared to 606Mbp (60.6%) in the previous Orenil1.1 assembly. LG3 is the largest anchored LG (68.6Mbp), which agrees with cytogenetic studies that show LG3 as the largest and most repetitive chromosome in the O. niloticus genome [33, 36, 37]. Cytogenetic studies also indicate that LG7 is the second largest chromosome in the O. niloticus genome, and LG7 is the second largest anchored LG in the new “O_niloticus_UMD1” assembly.

Table 3.

Size of each anchored linkage group for both the previous assembly, Orenil1.1 [16] and the new assembly (O_niloticus_UMD1)

Linkage Group	Orenil1.1	O_niloticus_UMD1	Difference (%)
LG1	31,194,787	38,372,991	7,178,204 (23.0%)
LG2	25,048,291	35,256,741	10,208,450 (40.8%)
LG3	19,325,363	68,550,753	49,225,390 (254.7%)
LG4	28,679,955	38,038,224	9,358,269 (32.6%)
LG5	37,389,089	34,628,617	−2,760,472 (−7.4%)
LG6	36,725,243	44,571,662	7,846,419 (21.4%)
LG7	51,042,256	62,059,223	11,016,967 (21.6%)
LG8	29,447,820	30,802,437	1,354,617 (4.6%)
LG9	20,956,653	27,519,051	6,562,398 (31.3%)
LG10	17,092,887	32,426,571	15,333,684 (89.7%)
LG11	33,447,472	36,466,354	3,018,882 (9.0%)
LG12	34,679,706	41,232,431	6,552,725 (18.9%)
LG13	32,787,261	32,337,344	−449,917 (−1.4%)
LG14	34,191,023	39,264,731	5,073,708 (14.8%)
LG15	26,684,556	36,154,882	9,470,326 (35.5%)
LG16	34,890,008	43,860,769	8,970,761 (25.7%)
LG17	31,749,960	40,919,683	9,169,723 (28.9%)
LG18	26,198,306	37,007,722	10,809,416 (41.3%)
LG19	27,159,252	31,245,232	4,085,980 (15.0%)
LG20	31,470,686	36,767,035	5,296,349 (16.8%)
LG22	26,410,405	37,011,614	10,601,209 (40.1%)
LG23	20,779,993	44,097,196	23,317,203 (112.2%)
Total	657,350,972	868,591,263	211,240,291 (32.1%)
Total (minus gaps)	606,480,097	864,361,263	257,881,166 (42.5%)

Open in a new tab

Assembly completeness

To determine the completeness of the new O_niloticus_UMD1 assembly, the assembly was compared against two established sets of core vertebrate gene sets. Table 4 shows the number of the 248 CEGMA and the 3,023 BUSCO conserved vertebrate genes that were identified in the new assembly. The number of conserved genes identified increased for both the CEGMA and BUSCO gene sets. The number of complete single-copy BUSCOs increased by 223 (10%), while the number of complete duplicated BUSCOs increased by 26 (59%). The number of missing BUSCOs decreased by 288 (67%) in the O_niloticus_UMD1 assembly compared to the Orenil1.1 assembly.

Table 4.

Genome completeness as measured by CEGMA and BUSCO

	Orenil1.1 (LGs)	O_niloticus_UMD1 (LGs)
Complete CEGs	244 (98.39%)	245 (98.79%)
Complete + partial CEGs	247 (99.61%)	248 (100%)
Total complete CEGs including putative orthologs	333	342
Complete BUSCOs	2185 (72.28%)	2408 (79.66%)
Complete and single-copy BUSCOs	2141 (70.82%)	2338 (77.34%)
Complete and duplicated BUSCOs	44	70
Fragmented BUSCOs	411 (13.60%)	476 (15.75%)
Missing BUSCOs	427 (14.13%)	139 (4.60%)

Open in a new tab

Annotation

The O_niloticus_UMD1 assembly was annotated using the NCBI RefSeq automated eukaryotic genome annotation pipeline. This same pipeline was previously used to annotate the Orenil1.1 assembly. Several additional, new transcriptome datasets (particularly gill tissues, see Methods) were available to annotate the O_niloticus_UMD1 assembly that were not available during the Orenil1.1 annotation process. A comparison of both genome assembly annotations is provided in Table 5. The O_niloticus_UMD1 assembly contains 8,238 more gene and pseudogene annotations than the Orenil1.1 assembly (27.3% increase). Similarly, the number of mRNA annotations increased markedly by 10,374 (21.7% increase). The number of partial mRNA annotations decreased from 3,050 to 393 (87.1% decrease). CDS annotations also increased overall (21.9%). The RefSeq annotation pipeline makes corrections to CDS annotations that contain premature stop-codons, frameshifts and internal gaps that would disrupt protein sequence coding. These corrections are based on transcriptome data and corrected 743 CDSs in O_niloticus_UMD1 compared to 817 previously for Orenil1.1 (9.1% decrease). The number of non-coding RNAs more than doubled in the O_niloticus_UMD1 assembly (115.5% increase).

Table 5.

RefSeq annotation summary

Feature	Orenil1.1	O_niloticus_UMD1	Difference (%)
Genes and pseudogenes	30,174	38,412	8,238 (27.3%)
protein-coding	26,329	29,249	2,920 (11.1%)
non-coding	3,508	8,599	5,091 (145.1%)
pseudogenes	337	564	227 (67.4%)
mRNAs	47,700	58,074	10,374 (21.7%)
fully-supported	45,245	55,760	10,515 (23.2%)
partial	3,050	393	−2,657 (−87.1%)
with filled gap(s)	2,480	67	−2,413 (−97.3%)
known RefSeq (NM_)	145	178	33 (22.8%)
model RefSeq (XM_)	47,555	57,896	10,341 (21.7%)
Other RNAs	5,694	12,899	7,205 (126.5%)
fully-supported	5,071	10,881	5,810 (114.6%)
model RefSeq (XR_)	5,071	10,929	5,858 (115.5%)
CDSs	47,892	58,398	10,506 (21.9%)
fully-supported	45,245	55,760	10,515 (23.2%)
partial	2,467	401	−2,066 (−83.7%)
with major correction(s)	817	743	−74 (−9.1%)
known RefSeq (NP_)	145	178	33 (22.8%)
model RefSeq (XP_)	47,555	57,896	10,341 (21.7%)

Open in a new tab

O_niloticus_UMD1 assembly summary

Table 6 provides summary statistics for the previous O. niloticus assembly (Orenil1.1), each intermediate of the new assembly, and our new final assembly (O_niloticus_UMD1). The O_niloticus_UMD1 assembly is more contiguous, with 45% fewer contigs than the number of scaffolds in Orenil1.1. The overall size of the O_niloticus_UMD1 assembly is 1.01Gbp compared to 927Mbp of Orenil1.1. The O_niloticus_UMD1 contains only 424 gaps that were introduced in the anchoring step. These anchoring gaps amount to 4.2Mbp (0.42%) due entirely to the arbitrary 10kbp gaps placed between anchored contigs. This compares to 111.5Mbp (12.04%) of gaps in Orenil1.1. Overall, 189.5Mbp of new sequence has been assembled in O_niloticus_UMD1 that was either previously in gaps or not assembled at all in Orenil1.1.

Table 6.

Summary statistics for the various assemblies

Assembly	O_niloticus_wgs_v1 (scaffolds)	Orenil1.1 (anchored to LGs)	O. niloticus Canu (contigs)	O. niloticus Canu broken (contigs)	O_niloticus_UMD1 (anchored to LGs)
Number of contigs/scaffolds	5,900	5,677	2,960	2,989	2,566
Total size	927,725,912	927,679,487	1,003,343,259	1,005,609,889	1,009,839,889
Longest contig/scaffold/LG	13,623,339	51,042,256	20,432,727	13,936,383	62,059,223
Mean contig/scaffold/LG size	157,242	29,879,589.6	338,967	336,437	37,672,228.8
NG50 contig/scaffold/LG	2,629,658	26,684,556	3,325,464	3,110,904	37,007,722
% N	12.04	12.03	0	0	0.42
Number of gaps	71,854	72,077	0	0	424
Non gap bp	816,139,901	816,140,124	1,003,343,259	1,005,609,889	1,005,610,312
Total gap bp	111,586,011	111,539,363	0	0	4,240,000

Open in a new tab

Repeat content

The TE and repeat portion of the genome is vastly under underrepresented in most genome assemblies [19]. The use of long PacBio reads allowed for the assembly of more of the repetitive regions of the O. niloticus genome. 379Mbp (37.6%) of the total O_niloticus_UMD1 assembly was annotated as repetitive. Table 7 provides a summary of the repeat and TE families that were most abundant in the assembly. The new assembly includes an additional 146Mbp (14.6%) of repetitive sequence that was either hidden in gaps or not present at all in the previous assembly. The entire repeat catalog is provided in Additional file 4.

Table 7.

Summary of repeat families

Repetitive element		Orenil1.1			O_niloticus_UMD1			Δ from Orenil1.1
Order	Family	Total bp	%	Mean length (bp)	Total bp	%	Mean length (bp)	Total bp	Mean length (bp)
DNA	TcMar-Tc1	39,070,443	4.21%	252.0	46,394,192	4.60%	296.3	7,323,749	44.3
	hAT	3,502,443	0.38%	210.2	10,103,158	1.00%	277.2	6,600,715	67.1
	hAT-Ac	11,264,479	1.21%	259.5	17,626,929	1.75%	355.9	6,362,450	96.4
	hAT-Charlie	8,266,601	0.89%	218.1	13,709,558	1.36%	370.0	5,442,957	152.0
LINE	L1	4,469,636	0.48%	389.9	8,879,041	0.88%	828.9	4,409,405	439.0
	L1-1_AFC	1,277,360	0.14%	671.6	3,197,003	0.32%	1,686.2	1,919,643	1,014.6
	L2	20,015,588	2.16%	248.8	29,334,193	2.91%	381.3	9,318,605	132.4
	Rex-Babar	9,422,494	1.02%	276.1	19,208,630	1.90%	492.6	9,786,136	216.5
LTR	ERV1	1,872,564	0.20%	302.2	6,450,995	0.64%	599.8	4,578,431	297.6
	Gypsy	6,734,826	0.73%	415.0	13,615,743	1.35%	466.3	6,880,917	51.3
	Pao	1,745,361	0.19%	686.9	4,892,623	0.48%	833.9	3,147,262	147.0
Unknown	Unknown	58,952,108	6.36%	215.4	97,456,302	9.66%	278.4	38,504,194	63.0
Low complexity	Satellite	1,110,151	0.12%	268.4	7,662,597	0.76%	761.6	6,552,446	493.2
Low complexity	Simple	11,784,382	1.27%	42.0	15,543,664	1.54%	48.2	3,759,282	6.2
TOTAL	–	232,691,524	25.10%	183.5	379,017,551	37.56%	254.9	146,326,027	71.4

Open in a new tab

Figure 2 provides a comparison of the repeat landscape of the Orenil1.1 and O_niloticus_UMD1 assemblies. Most notably, recently inserted (~ < 5% Kimura divergence) TEs have been assembled in far greater number in the new O_niloticus_UMD1 assembly. The overall number of repetitive elements increased at all divergence levels (218,992 more elements, Additional file 4), with most at lower divergences (165,607 additional elements at < 5% Kimura divergence). Figure 2 suggests that TE insertions less than 1% diverged are still underrepresented in the assembly.

Fig. 2 — Repeat Landscape comparison. The percentage of both the Orenil1.1 and O_niloticus_UMD1 and assemblies that each TE family is represented at in particular substitution levels analogous to the age of TEs (Kimura substitution level – CpG adjusted)

Satellite regions represent one of the most highly repetitive regions of the genome and are often associated with centromeric and heterochromatic regions. Two tilapia-specific satellite repeats have been previously described. ONSATA is a 209 bp repeat unit and shows variability between related tilapiine species [38]. Only 29 copies of ONSATA (comprising a total of 2,917 bp) were assembled and annotated in the original Orenil1.1 assembly. In the new O_niloticus_UMD1 assembly, 226 regions of ONSATA comprising a total of 1,386,985 bp were assembled and annotated. Many of the ONSATA regions, the longest of which was 43,805 bp on the unanchored contig908, were composed of multiple, nested ONSATA copies. ONSATB is a 1,904 bp repeat unit that is organized in tandem arrays and appears to be more conserved and perhaps under selective constraint [39]. 48 copies of ONSATB (comprising a total of 11,036 bp) were assembled and annotated in the original Orenil1.1 assembly. In the new O_niloticus_UMD1 assembly, 1,481 copies of ONSATB (comprising a total of 2,889,496 bp) were assembled and annotated. Again, many of the ONSATB regions were composed of multiple ONSATB copies, the longest of which was 11,210 bp located near the beginning of LG12 (607,345-618,555).

TEs specific to African cichlid species have been previously sequenced and used as molecular markers to study evolutionary history and phylogenetics of African cichlids [40, 41]. Some of these African cichlid specific or “AFC” LINEs and SINEs had been previously assembled and annotated in the Orenil1.1 assembly. An additional 2.3Mbp of AFC-specific TE sequence was annotated in the new O_niloticus_UMD1 assembly. This 2.3Mbp increase was assembled across 55 fewer AFC TE copies, which resulted in longer mean length AFC TE copies. This suggests that the previous assembly contained many fragmented AFC specific TE copies.

Recently duplicated regions

Recently duplicated genes are notoriously difficult to assemble due to their high sequence identity [18]. Using short Illumina reads to assemble these regions is a difficult task even with mate-pair sequence data across multiple spatial scales. In a previous study of the tilapia vasa gene, we identified three partial gene sequences in the Orenil1.1 assembly [42]. We then screened a tilapia BAC library for vasa gene sequences and identified three BAC clones containing vasa sequences. The three clones came from separate restriction fingerprint contigs [43], and represent duplications of the ancestral vasa gene. Sanger sequencing identified a full-length vasa gene in each of these BAC clones. Figure 3a shows how the previous Orenil1.1 assembly failed to correctly assemble any of the three vasa gene copies. Figure 3b indicates how these genes were assembled from each of the BAC clones. Figure 3c details how the new O_niloticus_UMD1 assembly correctly assembles the three copies of the vasa gene corresponding to the three BAC clones.

Fig. 3 — *Vasa* gene duplication. a The top row shows the *vasa* transcript (NCBI accession number AB032467.1) aligned to Orenil1.1 assembly scaffolds with gaps shown in solid red. b The middle row shows this same *vasa* transcript aligned to the separate BAC assemblies (NCBI accession numbers AB649031-AB649033). c The bottom row shows the *vasa* transcript aligned to O_niloticus_UMD1 LGs. For each row there are three alignments corresponding to the three copies of each *vasa* transcript

Sex determination regions

The new O_niloticus_UMD1 assembly was used to study sequence differentiation across two sex-determining regions in tilapias. The first region is an XX/XY sex-determination region on LG1 found in many strains of tilapia [9, 34, 44–47]. We previously characterized this region by whole genome Illumina re-sequencing of pooled DNA from males and females [48]. We realigned these sequences to the new O_niloticus_UMD1 assembly and searched for variants that were fixed in the XX female pool and polymorphic in the XY male pool. Figure 4 shows the F_ST and the sex-patterned variant allele frequencies for the XX/XY O. niloticus comparison across the complete Orenil1.1 and O_niloticus_UMD1 assemblies, while Fig. 5 focuses on the highly differentiated ~9Mbp region on LG1 with a substantial number of sex-patterned variants, indicative of a reduction in recombination in a sex determination region that has existed for some time [48].

Fig. 4 — Whole genome *O. niloticus* sex comparison. a F_ST comparison of XX female pool versus XY male pool on Orenil1.1. b Sex-patterned variants across Orenil1.1. c F_ST comparison of XX female pool versus XY male pool on O_niloticus_UMD1. d Sex-patterned variants across O_niloticus_UMD1

Fig. 5 — LG1 *O. niloticus* sex comparison. a F_ST comparison of XX female pool versus XY male pool on LG1 of Orenil1.1. b Sex-patterned variants on LG1 of Orenil1.1. c F_ST comparison of XX female pool versus XY male pool on LG1 of O_niloticus_UMD1. Anchored contig boundaries are depicted with grey bars. d Sex-patterned variants on LG1 of O_niloticus_UMD1

The second sex comparison is for an ZZ/WZ sex-determination region on LG3 in a strain of O. aureus [11, 49]. This region has not previously been characterized using whole genome sequencing. For this comparison we identified variant alleles fixed in the ZZ male pool and polymorphic in the WZ female pool. Figure 6 shows the F_ST and the sex-patterned variant allele frequencies for this comparison across the whole O_niloticus_UMD1 assembly, while Fig. 7 focuses on the differentiated region on LG3. O. aureus LG3 contains a large ~50Mbp region of differentiated sex-patterned variants, also indicative of a reduction in recombination in the sex determination region. Figure 6 also shows this differentiation pattern on several other LGs (LG7, LG9, LG14, LG16, LG18, LG22 and LG23). It is possible that these smaller regions of sex-patterned differentiation are actually translocations in O. aureus relative to the O. niloticus genome assembly.

Fig. 6 — Whole genome *O. aureus* sex comparison. a F_ST comparison of ZW female pool versus ZZ male pool on Orenil1.1. b Sex-patterned variants across Orenil1.1. c F_ST comparison of ZW female pool versus ZZ male pool on O_niloticus_UMD1. d Sex-patterned variants across O_niloticus_UMD1

Fig. 7 — LG3 *O. aureus* sex comparison. a F_ST comparison of ZW female pool versus ZZ male pool on LG3 of Orenil1.1. b Sex-patterned variants on LG3 of Orenil1.1. c F_ST comparison of ZW female pool versus ZZ male pool on LG3 of O_niloticus_UMD1. Anchored contig boundaries are depicted with grey bars. d Sex-patterned variants on LG3 of O_niloticus_UMD1

The overall number of sex-patterned variants was markedly increased for both sex comparisons using the new assembly. Table 8 indicates this and provides the number of sex-patterned variants in each comparison across the whole genome as well as on the respective sex-determination LG. LG3 saw the largest gain of sex-patterned variants (1,445 to 24,983 variants) due to the fact that the LG3 assembly now includes 49.3Mbp of new sequence (Table 3).

Table 8.

LG1 and LG3 sex-patterned variants using both assemblies

	Orenil1.1	O_niloticus_UMD1
O. niloticus
LG1 sex-patterned variants	11,894	12,225
Non-LG1 sex-patterned variants	17,579	26,493
Total sex-patterned variants	29,473	38,718
O. aureus
LG3 sex-patterned variants	1,445	24,983
Non-LG3 sex-patterned variants	79,936	78,423
Total sex-patterned variants	81,381	103,406

Open in a new tab

Discussion

Genome assembly

We explored the parameter space of both the FALCON and Canu genome assembly packages and produced 37 candidate assemblies (Additional file 1). Since the true sequence is not known, we had to deduce which of the candidate assemblies best represented the true sequence of the homozygous clone. We elected to assess the assemblies with a variety of metrics, and to select the assembly that scored well across all of the most important metrics.

The first metric is the overall size of the assembly, which should closely match the estimated size of the genome. The size of the O. niloticus genome has been measured by both Feulgen densitometry and bulk fluorometric assay. Five separate measurements range between 0.95–1.20 picograms or ~0.929–1.174Gbp [35]. The average genome size of these five estimates is 1.082Gbp. The various assemblies ranged in size from 975.1Mbp to 1.07Gbp. The assembly that was chosen (#14) has a length of 1.01Gbp, which corresponds to 93.3% of the estimated size of the genome.

The second set of metrics we considered were the standard measures of assembly contiguity such as NG50, number of contigs, longest contig and mean contig size. The third set of metrics consisted of assembly likelihood (ALE) scores, which were calculated by aligning four Illumina libraries (fragment, 3kbp, 6–7kbp, and 40kbp – Table 9, Methods) as well as the 44X PacBio library against each candidate assembly. The fourth metric measured the accuracy of the assemblies at larger scales by aligning the contigs to a ~29X clone coverage library of ~150kbp BAC-end sequences [32] and to existing genetic and physical maps of O. niloticus [33, 34]. Alignment of the RH and RAD maps to the candidate assemblies indicated that every assembly had a relatively low and consistent number of misassemblies (Additional file 1). The fifth metric assessed the completeness of each candidate genome assembly by looking for two core eukaryotic gene sets, CEGMA [29] and BUSCO [30].

Table 9.

O. niloticus Illumina libraries used for ALE calculations and Pilon polishing

Insert size	NCBI SRA accession IDs	Combined coverage	Platform
Fragment	SRR071589, SRR071593, SRR071594, SRR071601, SRR071604, SRR071605, SRR071610, SRR071619	51.6×	Illumina Genome Analyzer II
3kbp	SRR071588, SRR071591, SRR071597, SRR071599, SRR071603, SRR071612, SRR071614	196.3×	Illumina HiSeq 2000
6–7kbp	SRR071602, SRR071607, SRR071615, SRR071616, SRR071617, SRR071620, SRR071622,	24.6×	Illumina Genome Analyzer II
40kbp	SRR071595, SRR071598, SRR071611	4.8×	Illumina Genome Analyzer II

Open in a new tab

No candidate assembly ranked the best for all of these different metrics. In order to choose a preferred assembly, we used principal component analysis to organize the several scores for each assembly. The PCA analysis showed a noticeable difference between the Canu assemblies and the FALCON assemblies (Additional file 2). All of the Canu assemblies clustered together in PCA space. The FALCON assemblies fell into two separate clusters because five of the FALCON assemblies (#17, 32, 34, 35, and 36, Additional file 1) had low ALE scores and NG50s. The other FALCON assemblies tended to show overall better ALE scores for the 44X PacBio library than did the Canu assemblies. This is due to differences in the consensus accuracy between Canu and FALCON assemblies. The 44X PacBio ALE placement and insert scores were virtually the same across all candidate assemblies, but the 44X PacBio ALE k-mer scores were lower for the Canu assemblies. This suggests a slight difference in consensus between Canu and FALCON, although it is probably not noticeable after the polishing steps.

Leaving aside the five low quality FALCON assemblies, a major tradeoff in the PCA is between the size of the assembly and the PacBio ALE score. The FALCON assemblies are all smaller than the Canu assemblies, and for the reasons discussed above, have higher ALE scores for the PacBio library. We elected to focus on the Canu assemblies, where the major tradeoff is between the quality of the assembly (ALE scores, NG50, completeness) and size of the assembly (Total size, exon bp mapped). Ultimately, we chose the assembly (Canu #14) with the best overall ALE average rank. This assembly was 28.8Mbp shorter than the longest Canu assembly (#15).

Alignment of the RH and RAD maps to the candidate assemblies indicated that every assembly had a relatively similar and low number of misassemblies (Additional file 1). To correct these misassemblies in the polished version of assembly #14, the locations of misassemblies were first narrowed using the RH and RAD map data together. This typically narrowed the location of a misassembly to a region of less than 1Mbp. From there, the region around each misassembly breakpoint was inspected using alignments of the PacBio data, Illumina data, RefSeq gene set, BAC-end sequences as well as the RepeatMasker annotations. A characteristic signal of high variation in the PacBio alignments, low physical coverage in the Illumina libraries (best characterized with the largest 40kbp Illumina library), and a high density of large and nested repeats was seen in each region of misassembly. Regions of high variation in the PacBio alignments and low 40kbp physical coverage were then calculated genome-wide to investigate whether additional misassemblies might be hidden in the assembly. When considering the PacBio highly variant regions and the low physical 40kbp coverage regions individually, both sets over-estimated the number of misassembly regions. These false-positive potential misassemblies occurred in regions where there was support for correct and continuous assembly based on both RH and RAD map alignments, which together lend stronger support. Only in two cases were there regions that had high PacBio variation, low physical 40kbp coverage and no alignment of RH or RAD map data. We decided to break the assembly at these two locations as well.