Abstract
The African wild rice species Oryza longistaminata has several beneficial traits compared to cultivated rice species, such as resistance to biotic stresses, clonal propagation via rhizomes, and increased biomass production. To facilitate breeding efforts and functional genomics studies, we de-novo assembled a high-quality, haploid-phased genome. Here, we present our assembly, with a total length of 351 Mb, of which 92.2% was anchored onto 12 chromosomes. We detected 34,389 genes and 38.1% of the genome consisted of repetitive content. We validated our assembly by a comparative linkage analysis and by examining well-characterized gene families. This genome assembly will be a useful resource to exploit beneficial alleles found in O. longistaminata. Our results also show that it is possible to generate a high-quality, functionally complete rice genome assembly from moderate SMRT read coverage by exploiting synteny in a closely related Oryza species.
Stefan Reuscher et al. assembled the genome of an African wild rice species to facilitate breeding efforts and functional genomic studies. They used SMRT sequencing, chromosomal synteny between rice species, and a linkage map to assemble the 351 Mb genome into 12 chromosomes.
Introduction
The Oryza genus in the grass family (Poaceae) contains the cultivated rice species Oryza sativa from Asia and Oryza glaberrima from Africa. In addition, the Oryza genus includes at least 20 wild species with a global distribution, which contain an enormous reservoir of untapped variation1,2. The genus contains mostly diploid (2n = 24) species with occasional tetraploids (4n = 48) and can be divided into genome types based on their ability to interbreed. Genome types range from diploid AA to tetraploid KKLL, with both commercially used species (O. sativa and O. glaberrima) belonging to the AA type. Recent advances in DNA sequencing technology have enabled the (re-)sequencing of several commercial and wild AA genome-rice varieties, although some reported assemblies do not attempt to reconstruct full chromosomes3–10.
The wild rice Oryza longistaminata is of the AA genome type. It is found in tropical regions of western Africa near fresh water and in swampy areas1. It is rarely used for human consumption, but holds a number of beneficial traits, such as resistance to bacterial blight linked to the Xa21 locus11, perennial growth, and a high biomass production. The latter two are likely associated with the ability of O. longistaminata to propagate clonally via rhizomes. Efforts have been made to transfer beneficial alleles from O. longistaminata into commercial varieties by evaluating the agricultural potential of introgressed chromosomal segments from O. longistaminata into a commercial background12,13. In addition to breeding efforts, O. longistaminata is also used to study the genetic basis and the development of rhizomes5,14–16.
The assembly of a complete plant genome provides a strong basis for functional genomics studies or for efforts to identify candidate genes through traditional mapping approaches. However, truly chromosome-complete plant genomes are still not a trivial achievement17. Among the cereals, the rice genome is more amenable to assembly because it is less repetitive and its size is generally less than 500 Mb. For large cereal genomes such as barley (ca. 5 Gb), a full chromosome assembly was achieved by a combination of ultra-high coverage of small reads, BAC libraries, manual curation, and a number of technologies that produce long-range positional information, such as optical mapping or chromosome conformation capture18. For rice genome assembly, the same technologies can be used; however, due to the smaller genome size, sequencing efforts should require less extensive resources.
The possibility to produce long reads (10–40 kb) from genomic DNA by single molecule real-time (SMRT) sequencing technology has enabled complete genome assemblies for diverse organisms, including notoriously repetitive plant genomes19,20. Despite the advantages of long-read sequencing, recent rice genome assemblies still rely on supplemental technologies to provide large-scale genomic context of contig sequences3,4. In addition, short reads are still necessary to correct single nucleotide and small indel errors in the SMRT reads.
In this work, we assembled the genome of O. longistaminata, including 12 chromosome-scale sequences with alternative parental haplotypes. We used a comparatively moderate coverage (66×) of SMRT reads and exploited gene-synteny in the Oryza genus and a previously generated genetic map for our assembly. In total, we assembled 351 Mb of which 92.2% could be placed on 12 chromosomes. We furthermore validate our genome assembly for its usefulness in possible functional genomics studies and breeding efforts.
Results
Genome assembly and annotation
We sequenced DNA extracted from young leaves of one individual plant of O. longistaminata accession IRGC110404. In total, we used 16 PacBio SMRT V3 cells generating 22.6 Gb in 2.4 million reads (average read length: 9.3 kb) (Supplementary Figure 1). The total nuclear genome size of O. longistaminata was estimated to be around 340 Mb (ref. 5), setting our average coverage to approximately 66-fold.
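The coverage estimate follows directly from the sequencing yield and the estimated genome size. The short sketch below reproduces that arithmetic; the variable names are illustrative, and the inputs are the rounded figures quoted in the text, so the derived mean read length can differ slightly from the reported 9.3 kb.

```python
# Rounded figures taken from the text; variable names are illustrative.
total_bases = 22.6e9   # total SMRT sequence yield (22.6 Gb)
n_reads = 2.4e6        # number of SMRT reads (2.4 million)
genome_size = 340e6    # estimated nuclear genome size (340 Mb)

# Average coverage = sequenced bases / genome size.
coverage = total_bases / genome_size

# Mean read length = sequenced bases / number of reads.
mean_read_length = total_bases / n_reads

print(f"average coverage: {coverage:.1f}x")
print(f"mean read length: {mean_read_length / 1e3:.1f} kb")
```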
Assembly using FALCON-UNZIP21 resulted in 1632 primary contigs with a total length of 350.56 Mb, an N50 of 554 kb and a maximum contig size of 7.29 Mb (Table 1). In addition, FALCON-UNZIP assembled 4229 contigs representing the alternative haplotype on the homologous chromosome (haplotigs). Those alternative contigs had a total combined length of 258.67 Mb (73.79% of the primary assembly) and an N50 of 148 kb. We also tried the Canu assembler22 with our raw data but found that FALCON-UNZIP performed better, possibly because of the heterozygosity found in the O. longistaminata genome (Supplementary Note 1 and Supplementary Table 1). For error-correction of the primary contigs, we first re-aligned the SMRT reads to the assembled contigs with blasr and then used quiver to correct 1.3 million insertions, 0.2 million deletions, and 0.61 million substitutions. In the next step, 18.35 million pairs of short reads (150 bp read length, 432 bp median insert size) were aligned to the contigs and an additional 0.24 million insertions and 38.7 thousand deletions were corrected. The polished contigs were arranged and oriented using a genetic map23 and exploiting gene synteny with O. sativa ssp. japonica (Supplementary Figure 2), resulting in the assembly of 12 pseudo-chromosomes with a total length of 323.95 Mb (92.2% placement rate) (Fig. 1). The completeness of the genome assembly was assessed by detecting a set of unique single copy genes in the genome assembly24. Out of 1440 unique single copy genes, 1360 (94.5%) were detected in our assembly. Gene models in the newly assembled genome were determined using a combination of computational gene prediction and expressed transcripts based on RNAseq data from eight diverse tissues. The final gene model set consisted of 34,389 genes with a median gene length of 2,700 bp. Using the Mercator annotation pipeline, MAPMAN bins (other than unassigned) could be detected for 20,121 genes25.
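For readers unfamiliar with the contig N50 statistic quoted above, the following is a minimal sketch of how it is commonly computed. The function name `n50` and the toy contig lengths are hypothetical; this is not the code used by FALCON-UNZIP. The last lines recompute the haplotig-to-primary length ratio from the two totals given in the text.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")


# Toy contig lengths (hypothetical, for illustration only).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print("toy N50:", n50(toy_contigs))

# Ratio of haplotig length to primary assembly length, using the
# totals reported in the text (in Mb).
haplotig_fraction = 100 * 258.67 / 350.56
print(f"haplotigs / primary assembly: {haplotig_fraction:.2f}%")
```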
Putative centromeric regions were identified on all chromosomes. Although we failed to identify telomeric repeat regions ([TTTAGGG]n) in our final assembly, such repeat structures were represented in the pool of error-corrected reads. The contig coverage of the final pseudomolecules tended to be less fragmented in the comparably gene-rich chromosome arms, while the centromeric regions were mostly reconstructed from shorter contigs. This might lead to inaccuracies in the assembly of such regions.
Table 1.
Genome statistic | Value |
---|---|
Total size of assembled contigs (bp) | 350,562,179 |
Total size of contigs anchored on chromosomes (bp) | 324,081,576 |
Contig placement rate (%) | 92.45 |
Number of contigs | 1632 |
Longest contig (bp) | 7,290,908 |
Contig N50 (bp) | 553,927 |
Number of gene models | 34,389 |
Median gene length (bp) | 2700 |
Genic content (%) | 40.19 |
Repeat content (%) | 38.10 |
GC content (%) | 42.71 |
Whole genome alignment
To analyze the large-scale structure of our O. longistaminata genome assembly, we performed a comparative analysis of gene synteny using the O. sativa ssp. japonica genome as a reference. A genome-wide alignment of all coding sequences (CDS) in each of the two genomes was performed, followed by the identification of syntenic pairs of orthologues (Fig. 2). In total, we identified 9976 pairs of syntenic CDS, which were used to construct the genome-wide alignment. As expected, the genomes of O. sativa and O. longistaminata appeared highly syntenic, indicated by the central diagonal in Fig. 2. A close-up inspection revealed several minor differences in the CDS order between both genomes, including small-scale (<0.1 Mb) inversions and duplications. In addition to the direct orthologues, several groups of inter-chromosomal syntenic gene pairs were also detected. The median rate of synonymous mutations (Ks) in those paralogues was 1.37 compared to 0.04 in direct orthologues. This indicated that those inter-chromosomal gene pairs are the result of an ancient whole genome duplication event that is conserved in the whole Oryza genus26. In the O. sativa genome, 24 pairs of duplicated segments were found27. In our cross-genome alignment, this known pattern of duplications was very well replicated and all major duplication blocks could be detected (Supplementary Figure 3). In summary, genome-wide alignments confirmed that our assembly shows the expected syntenic gene order in the Oryza genus, including difficult-to-assemble regions that originated from an ancient genome duplication event.
Haplotype variations
Since O. longistaminata is an outcrossing species, it is reasonable to expect higher haplotype diversity in the O. longistaminata genome as compared to the highly inbred cultivated rice species O. sativa ssp. japonica and O. sativa ssp. indica. However, this is difficult to quantify exactly, as comparable haplotype-based assemblies are missing. We first focused on SNP-based analyses and aligned the haplotigs back to the final 12 chromosomal sequences using NUCMER28 and called SNPs using NUCMER’s show-snps tool. We found that a total of 973,487 bp were different based on the alignments generated by NUCMER. We then analyzed the frequency of small variations (≤10 bp) in different genomic features for bins of 1 Mb (Supplementary Figure 4a). The lowest frequency for such variations was found in the CDS regions with a median of 1.1 variations per kb. The highest frequency was found in the regions 1 kb upstream of each locus with a median of 3.1 variations per kb. This indicates that haplotype diversity is most pronounced in the promoter regions of protein-coding genes, which might have an effect on haplotype-specific promoter activities. To analyze larger variations (>10 bp) we used the output of NUCMER together with Assemblytics29. We detected a total of 18,361 larger variations with this approach. Among those were 5,743 deletions, 6,563 insertions, 3,148 repeat contractions, 2,828 repeat expansions, 13 tandem contractions, and 66 tandem expansions. The median occurrence of those variations in bins of 1 Mb was highest in the repetitive content (0.24 variants per kb) but considerably lower in the protein-coding loci (0.018 variants per kb) (Supplementary Figure 4b). In summary, we detected a considerable feature-specific amount of variation between the two parental genomes of O. longistaminata. This highlights the complexity of a genome from an outcrossing plant species which is still largely unexplored.
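As a quick consistency check on the larger (>10 bp) variations, the six reported classes can be tallied and expressed as shares of the total. The sketch below uses only the counts given in the text; the dictionary keys are illustrative labels, not Assemblytics field names.

```python
# Counts of larger (>10 bp) variations as reported in the text.
large_variants = {
    "deletion": 5743,
    "insertion": 6563,
    "repeat contraction": 3148,
    "repeat expansion": 2828,
    "tandem contraction": 13,
    "tandem expansion": 66,
}

# The class counts should add up to the reported total of 18,361.
total = sum(large_variants.values())

# Share of each class as a percentage of all larger variations.
shares = {name: 100 * count / total for name, count in large_variants.items()}

for name, share in shares.items():
    print(f"{name:20s} {large_variants[name]:6d} {share:5.1f}%")
print(f"{'total':20s} {total:6d}")
```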
Repeat content
Since mobile genomic elements (transposons) are known drivers of genome evolution, we analyzed the transposon content of the O. longistaminata genome. Using RepeatMasker30 and a database of rice repetitive elements, we found that 38.10% of the total genome assembly consisted of interspersed repeats (Table 2). Further classification of repeat elements revealed that 13.46% of the assembly was classified as long-terminal repeat (LTR) elements and 16.83% was classified as DNA transposons. To put results for O. longistaminata into context within the Oryza genus, we also analyzed five other rice genomes using the same procedure we used for O. longistaminata. The overall amount of repeats in the O. longistaminata genome was comparable to the other analyzed rice genomes and most similar to the wild African rice species Oryza barthii. To further analyze the differences in the observed repeat sizes, especially between the two more repetitive O. sativa cultivars and O. longistaminata, we plotted the total size of the top 20 most prevalent (by size) repeat elements in O. longistaminata for all analyzed species (Fig. 3). LTR retrotransposons from the Gypsy family showed the largest absolute changes in size among the analyzed species and are the major contributor to rice genome size differences. In the O. longistaminata genome, the total size of Gypsy family transposons was most comparable to O. barthii and O. glaberrima. The distribution of repeat elements along the chromosomes followed a repeat-family specific pattern. Repeat elements belonging to the Copia, EnSpm, and Gypsy-family showed the highest density in centromeric regions, while repeat elements of the Explorer, Gaijin, Harbinger, and SINE-type were typically found at lower density in those regions (Supplementary Figure 5).
Table 2.
Species | Genome size (Mb) | Total repeat size (Mb) [% of total sequence] | Retroelements (Mb) [% of total sequence] | DNA transposons (Mb) [% of total sequence] |
---|---|---|---|---|
Oryza longistaminata | 350.6 | 133.9 [38.10%] | 53.1 [15.11%] | 59.1 [16.83%] |
Oryza barthii | 308.3 | 110.3 [36.08%] | 49.4 [16.15%] | 48 [15.68%] |
Oryza brachyantha | 260.8 | 53.3 [21.92%] | 18 [7.38%] | 31.8 [13.08%] |
Oryza glaberrima | 316.4 | 117.2 [38.65%] | 58 [19.13%] | 47.7 [15.71%] |
Oryza sativa ssp. indica | 427.0 | 184.5 [44.91%] | 106.9 [26.03%] | 61.9 [15.08%] |
Oryza sativa ssp. japonica | 373.2 | 171.1 [45.87%] | 93 [24.93%] | 63 [16.88%] |
The O. longistaminata genome and five other selected rice species were analyzed by RepeatMasker and values shown were taken directly from the .tbl file. The O. longistaminata genome used here included the unmapped contigs.
Using the O. longistaminata genome in quantitative trait locus (QTL) studies
In our recent work, we applied genotyping-by-sequencing to a population of F2 plants from a cross of O. longistaminata and O. sativa ssp. japonica cv. Nipponbare23. In that work, we used the O. sativa genome (IRGSP V1.0) as the reference in the initial read mapping step, as no high-quality O. longistaminata genome was available at the time. To test the usefulness of our O. longistaminata assembly for breeding and mapping applications, we repeated genotyping-by-sequencing and quantitative trait locus (QTL) mapping using the newly assembled O. longistaminata genome as the reference.
We used a population of 1081 F2 individuals and set the threshold for missing data per SNP marker to ≤5%. This resulted in 2357 available SNP markers when using the O. longistaminata genome as a reference compared to 2435 SNP markers for the O. sativa cv. Nipponbare (NB) genome. We proceeded to detect QTL affecting the number of tillers per plant separately for each of the two reference genomes. Using either reference genome, we detected four QTL on chromosomes 1, 3, 4 and 8 (Fig. 4). In addition, the LOD profiles were found to be very similar when the two reference genomes were compared. Based on those results, we concluded that our genome assembly is suitable to be used as a resource in breeding programs involving O. longistaminata as a parent.
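The missing-data threshold of ≤5% per SNP marker can be illustrated with a small filter. This is a sketch under assumptions: the function name `filter_markers`, the dictionary layout, and the use of `None` for a missing call are hypothetical and do not reflect the internals of the genotyping pipeline used in the study.

```python
def filter_markers(genotypes, max_missing=0.05):
    """Keep markers whose fraction of missing calls is <= max_missing.

    genotypes: dict mapping marker name -> list of genotype calls,
               one per individual, with None marking a missing call.
    """
    kept = {}
    for marker, calls in genotypes.items():
        missing = sum(call is None for call in calls) / len(calls)
        if missing <= max_missing:
            kept[marker] = calls
    return kept


# Toy data for 20 individuals (hypothetical).
toy = {
    "m1": ["A"] * 20,               # no missing calls
    "m2": ["A"] * 19 + [None],      # 1 of 20 missing (5%)
    "m3": ["B"] * 18 + [None] * 2,  # 2 of 20 missing (10%)
}
print(sorted(filter_markers(toy)))
```

With a threshold of 0.05, markers `m1` and `m2` pass and `m3` is dropped, which mirrors how a stricter or looser threshold changes the number of usable markers.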
Completeness of gene families and metabolic pathways
One major motivation to create a high-quality reference genome is to accelerate functional genomics studies. This requires complete representation of evolutionarily (gene families) or functionally (metabolic pathways) defined groups of genes. To this end, we assessed the completeness and the quality of annotations of the set of enzymes that synthesize the phytohormone gibberellic acid (GA) and the SWEET (SUGARS WILL EVENTUALLY BE EXPORTED) family of sugar transport proteins. Using BLASTP and BLASTN searches with O. sativa sequences as the query, we detected almost all analyzed genes in the expected chromosomal regions in the O. longistaminata genome with a clear 1-to-1 relationship between O. sativa and O. longistaminata (Supplementary Dataset 1). In general, O. longistaminata proteins were very similar (>95% AA identity) to O. sativa proteins. In five cases (CPS, SWEET1a, 2a, 4 and 11), two highly similar loci are present in the O. longistaminata genome, as opposed to a single locus in O. sativa. These pairs of loci were always found in close proximity to each other. It is conceivable that, e.g., highly heterozygous parts of the genome could lead to breaks in the assembly and that the observed duplication in reality represents two regional haplotypes that were not correctly picked up during haplotype phasing.
Identification of functionally enriched genomic regions
Recently, it was reported that barley chromosomes feature genomic compartments, which are characterized, among other features, by an enrichment in specific gene functions18. We could not detect such clear compartments in our O. longistaminata genome assembly, most likely because rice genomes are approximately one order of magnitude smaller compared to barley (5 Gb vs. <0.5 Gb) and are thus less compartmentalized. However, by analyzing the distribution of MAPMAN functional gene categories along the chromosomes, we discovered 153 genomic regions (0.5 Mb) in which at least one functional category of genes was significantly (P ≤ 0.05) enriched (Fig. 5). The functional categories which were enriched most often were 30:signaling (15 regions), 26:misc (14 regions), 20:stress (13 regions), and 16:secondary metabolism (12 regions). Multi-locus arrangements of very similar genes are often collapsed in assemblies based on short reads, which is highly undesirable as several agriculturally important alleles conferring resistance to biotic stresses are part of large, multi-gene clusters31–33. The 13 genomic regions in which stress-related genes were significantly enriched contained between 7 and 20 stress-related genes each. Evaluating the potential of those regions for biotic stress resistance, e.g., by a targeted breeding approach, might be one way to utilize this reference genome assembly.
Gene expression profiles
RNAseq data from eight diverse tissues (leaf, root, shoot apical meristem, tiller bud, rhizome tip, rhizome node, rhizome bud stage 1 and 2) was used to detect genes in the O. longistaminata genome in combination with computational gene prediction (Supplementary Figure 6 and Supplementary Table 2). In total, 34,389 loci were detected and their expression was quantified (Supplementary Dataset 2). A principal component analysis of gene expression data indicated similar expression profiles in related rhizomatous and non-rhizomatous tissues (Fig. 6a). Gene expression in rhizome tips was found to be most similar to the shoot apical meristem of above ground shoots, while samples from rhizome buds and tiller buds were most similar to each other. To detect patterns of gene expression and tissue-specific genes, k-means based clustering was performed with k = 14 (Fig. 6b). We found 841 genes in cluster 8 expressed primarily in the leaf. Genes involved in photosynthesis (light reaction, photorespiration, Calvin cycle) and secondary metabolism (flavonoids, phenylpropanoids, isoprenoids) were significantly enriched (P < 0.05) in that cluster based on MAPMAN functional annotations (Supplementary Dataset 3). Similarly, 758 genes in cluster 4 were primarily expressed in the roots and genes putatively encoding peroxidases and glutathione-S-transferases were enriched among those genes. The clusters 1 (22 genes), 2 (96 genes), and 6 (127 genes) contained genes which were specifically expressed in a combination of rhizome and meristematic (shoot apical meristem, tiller bud) tissues.
Conclusion
In this work, we presented a high-quality, haplotype-aware reference genome for the wild rice species O. longistaminata. We challenged our assembly’s usefulness for QTL mapping and functional genomics, and concluded that this work provides a solid basis for future efforts to understand and transfer useful traits from O. longistaminata into cultivated rice species. In comparison to other chromosome-level assemblies3,4, our approach is based on assembling whole-genome shotgun sequencing data directly followed by exploiting gene synteny and a genetic map to form pseudo-chromosomal sequences. This greatly simplified the sequencing and post-assembly workflow. Our work provides a glimpse into the possibility for semi-automated genome assemblies for genera of crop species, in which at least one high-quality reference genome is available. This would allow pan-genomic approaches for selected genera34.
Methods
Plant material
O. longistaminata (accession IRGC110404) was used for whole genome and transcriptome sequencing. Plants were cultivated in a controlled environment chamber at Nagoya University in Nagoya, Aichi Prefecture, Japan. The F2 populations used for QTL analysis were described previously23.
DNA and RNA sequencing
For long-read sequencing, DNA was extracted by the ISOPLANT method35. In total, 20 µg of DNA (106 ng µL−1) was used for sequencing on a Pacific Biosciences RS instrument using P6v2 chemistry and 16 SMRT V3 cells (Supplementary Figure 1). Sequencing was carried out by Macrogen (Seoul, South Korea). For short-read sequencing, DNA from the same individual was extracted using the same method. Libraries were prepared using the TruSeq DNA v2 kit and sequenced on an Illumina Genome Analyzer IIx platform (Illumina, San Diego, CA, USA). For transcriptome sequencing, RNA was extracted from leaves, roots, tiller buds, shoot apical meristems, and rhizome tissues including early (stage 1) and late (stage 2) buds, tips, and node regions of mature rhizomes using the QIAGEN plant RNA kit (Hilden, Germany) (Supplementary Figure 5). Paired-end reads were generated on a HiSeq2000 platform.
Genome sequence assembly
Raw SMRT sub-reads were first assembled using the FALCON assembler (https://github.com/PacificBiosciences/FALCON-integrate, release v2.1.2) to generate primary contigs. FALCON was configured as follows: genome_size = 350000000, length_cutoff = 5000, length_cutoff_pr = 5000, pa_DBsplit_option = -x500 -s200, pa_HPCdaligner_option = -v -B4 -e.70 -l1000 -s1000, falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 400, ovlp_DBsplit_option = -x500 -s200, ovlp_HPCdaligner_option = -v -B4 -h60 -e.96 -l500 -s1000, overlap_filtering_setting = --max_diff 50 --max_cov 50 --min_cov 3 --bestn 10. The primary assembly was then haplotype-phased using FALCON-UNZIP (obtained at: https://downloads.pacbcloud.com/public/falcon/falcon-2018.03.12-04.00-py2.7-ucs2.tar.gz) and default settings.
After assembly and haplotype phasing, error correction was performed in two steps. First, all SMRT reads were realigned to the assembly using quiver integrated in SMRT analysis (v.2.3.0). Quiver polishing was performed by filtering of subreads (minLength = 50, minSubReadLength = 50, readScore = 0.75), mapping of subreads to the assembly using blasr (maxHits = 10, maxDivergence = 30, minAnchorSize = 12, seed = 1, minAccuracy = 0.75, minLength = 50, algorithmOptions = -useQuality) followed by error correction. The resulting sequences were again polished by first aligning paired-end short reads using bwa-mem36 followed by error-correction using Pilon37 with the --diploid, --nostrays, and --fix indels options. The same procedure was performed for primary contigs and associated haplotype sequences (haplotigs).
The polished contigs were then arranged in a linear fashion using ALLMAPS38 with two different genome maps. The first map consisted of 301 high-confidence SNP markers23. To establish their respective positions in the polished contigs, a 200 bp genomic DNA sequence from O. sativa ssp. japonica cv. Nipponbare surrounding the SNP was mapped to the polished contigs using bwa-mem. The alignment was converted to bed format using bamtobed from bedtools v2.25.0 (ref. 39) and converted to an ALLMAPS map by a custom R script. The second map exploited gene synteny between O. longistaminata and O. sativa. Sequences representing all gene models from the IRGSP1.0 reference genome build were obtained from http://rapdb.dna.affrc.go.jp/ and aligned to the polished contigs using BLASTN. The best blast hit from the first isoform for each gene model was isolated and used to create a gene synteny-based map. Spurious blast hits (blast hits from a specific O. sativa chromosome with a run-length = 1) from contigs with three or more hits were removed. Only the start coordinates for each gene model were used, except for contigs with only 1 blast hit. In that case, the stop coordinate of the gene model was also used to allow orientation. In total, this map contained 34,097 anchor points. The combined genetic maps were first used to detect putative chimeric contigs. Using the jcvi.assembly.allmaps split command with the --chunk=4 option, 37 breakpoints were identified and the contigs were split accordingly. Both maps were remade to represent the split contigs and were finally used as input for the jcvi.assembly.allmaps path command. The resulting 12 pseudo-chromosomes together with all unmapped contigs (7.8% of the total genome, merged into one DNA sequence with 1 kb separating the contigs) are referred to as the O. longistaminata reference genome V2.0 (respecting the V1.0 assembly described earlier5). Completeness of the assembly was assessed using BUSCO 2.0.1 with the embryophyta_odb9 dataset24.
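The removal of spurious blast hits can be sketched as a run-length filter. The code below is one possible reading of the description above, not the authors' script: hits on a single contig are ordered by position, consecutive hits to the same O. sativa chromosome form a run, and runs of length 1 are dropped when the contig carries three or more hits. The function name `drop_spurious_hits` and the tuple layout are assumptions made for illustration.

```python
from itertools import groupby


def drop_spurious_hits(hits):
    """Filter the blast hits of ONE contig.

    hits: list of (contig_position, reference_chromosome) tuples.
    Contigs with fewer than three hits are returned unchanged (sorted).
    Otherwise, runs of consecutive hits to the same reference
    chromosome are formed and runs of length 1 are discarded.
    """
    ordered = sorted(hits)
    if len(ordered) < 3:
        return ordered
    kept = []
    for _, run in groupby(ordered, key=lambda hit: hit[1]):
        run = list(run)
        if len(run) > 1:
            kept.extend(run)
    return kept


# Toy contig (hypothetical): four hits to chr01, one stray hit to chr07.
toy_hits = [(100, "chr01"), (900, "chr01"), (1500, "chr07"),
            (2200, "chr01"), (3100, "chr01")]
print(drop_spurious_hits(toy_hits))
```

In this toy case the single chr07 hit is treated as spurious and removed, leaving a contig that maps consistently to one reference chromosome.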
Genotyping and QTL detection
Genotyping based on SNPs and subsequent QTL detection were performed as described in detail in our previous work23. In short, 1081 F2 plants of a cross of O. longistaminata and O. sativa ssp. japonica cv. Nipponbare were used for genotyping-by-sequencing using the TASSEL4 pipeline40. SNPs were filtered based on minor allele frequency, parental alleles, read depth, and missing data. As the last step, putative errors were corrected and missing data was imputed based on flanking alleles. Phenotyping was performed by digging up plants from the paddy field and counting all shoots for each individual plant. In addition, six replicate plants of each of the two parents were used for genotyping and phenotyping. QTL detection was performed using the R/qtl package41. A linear regression model with multiple imputations implemented in the scanone function was used. The threshold for significance was calculated from 100 permutation tests.
Genomic feature detection
The final O. longistaminata reference genome V2.0 was annotated using a combination of in silico gene prediction and transcriptome data. Gene prediction was carried out using the MEGANTE gene prediction pipeline42 using the profile for O. sativa and standard settings. In addition, RNAseq data from eight different tissues described above was used to identify expressed genes. For this, raw RNAseq reads were first cleaned using Trimmomatic-0.36 (ref. 43) with the options LEADING:3, TRAILING:3, and MINLEN:30 followed by re-pairing reads using pairfq (https://github.com/sestaton/Pairfq). Cleaned reads were then aligned separately for each tissue to the O. longistaminata reference genome V2.0 using hisat2 (ref. 44) with the --dta option. Transcripts were assembled from read alignments using stringtie45 with the -m 50 option and all eight resulting GTF-files were merged using stringtie with the --merge argument. TransDecoder (https://github.com/TransDecoder/TransDecoder) was used to derive transcript sequences and detect all ORFs longer than 50 AA (TransDecoder.LongOrfs -m 50). The likely coding region for each transcript was detected using TransDecoder.Predict supplemented by BLASTP data obtained by querying the SWISS-PROT database with the longest ORF of each transcript. Predicted and experimentally verified gene models were combined using a custom R/Bioconductor script and in case of overlaps precedence was given to the experimentally verified gene models. Repeat elements in the final O. longistaminata assembly were identified using RepeatMasker with the options: -pa 4 -x -excln -html -gff -no_is -species rice. Putative centromeric regions were identified using a 154 bp monomer isolated from the O. sativa CentO region (AY101510.1) as a query for BLASTN searches against the O. longistaminata genome.
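The rule for combining the two gene model sets (expression-supported models take precedence over overlapping predictions) can be illustrated as follows. This is a simplified sketch, not the custom R/Bioconductor script used in the study: the function names are hypothetical, gene models are reduced to `(chromosome, start, end)` tuples with closed coordinates, and strand is ignored as a simplifying assumption.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals on the same chromosome
    share at least one base (closed coordinates)."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def merge_gene_models(expressed, predicted):
    """Combine two gene model sets, giving precedence to expressed models.

    A predicted model is kept only if it overlaps no expressed model.
    """
    kept_predicted = [p for p in predicted
                      if not any(overlaps(p, e) for e in expressed)]
    return sorted(expressed + kept_predicted)


# Toy models (hypothetical coordinates).
expressed = [("chr1", 100, 500)]
predicted = [("chr1", 400, 900),    # overlaps the expressed model: dropped
             ("chr1", 1000, 1500),  # no overlap: kept
             ("chr2", 100, 500)]    # different chromosome: kept
print(merge_gene_models(expressed, predicted))
```

The pairwise comparison is quadratic in the number of models; for a full annotation an interval-based lookup (as provided by GenomicRanges in R) would be the more practical choice.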
Functional gene annotations
Functional annotations for MEGANTE-predicted genes including BLASTP hits, cDNA accession numbers from similar cDNAs, Interpro domains, and GO terms were used as reported by MEGANTE. Expressed genes were annotated using the best BLASTP hit using a protein database containing all AA sequences from UniProt with taxon ID 4527 (Oryza genus). Based on the UniProt accession number, Interpro domain IDs and GO terms were added. A short human-readable description was added by using the name of the most specific ortholog group derived from eggNOG 4.5.1 (ref. 46). In addition, all final gene models (using the protein encoded by the first reported splice variant) were assigned to a MAPMAN functional annotation using Mercator25.
Genome-wide alignments
Analyses of genome synteny were performed using the CoGe platform (https://genomevolution.org/coge/)47. Whole genome sequences and all CDS as GFF annotation from the O. longistaminata reference genome and from the O. sativa genome were uploaded to CoGe. CDS from both genomes were aligned using LastZ (--hspthresh 3000). DAGChainer (using relative gene order, -D 20 and -A 5) was used to identify chains of syntenic genes and the Quota Align option was used to merge neighboring syntenic regions. The results file including synonymous mutation rates was downloaded and parsed using a custom R script before plotting.
Haplotype analysis
For analysis of the error-corrected haploid-phased alternative contigs (haplotigs), each sequence was aligned to the final 12 chromosomes of the O. longistaminata reference genome using NUCMER version 3.1 (ref. 28) with the options: --maxmatch -l 100 -c 500. For small variations, the show-snps program from NUCMER was used with the options -Clr -x 1 -T and the resulting table was converted to a vcf file using the MUMmerSNPs2VCF.py program found here: https://github.com/liangjiaoxue/PythonNGSTools/blob/master/MUMmerSNPs2VCF.py. The resulting vcf file was parsed into R and only variations with a length of ≤10 bp in both the REF and ALT fields were used for further analysis. The output of NUCMER was also analyzed using Assemblytics29 and the resulting BED-file was parsed into R and only variations with a length of >10 bp in either the ref_gap_size or the query_gap_size field were used for further analysis.
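The two size filters described above (small variations from the vcf file, larger variations from the Assemblytics BED file) can be written as two simple predicates. The original analysis was done in R; the sketch below is a Python illustration with hypothetical function names, and it takes the field values as plain arguments rather than parsing real files.

```python
def is_small_variant(ref, alt, max_len=10):
    """Small variation: both the REF and the ALT allele are <= max_len bp."""
    return len(ref) <= max_len and len(alt) <= max_len


def is_large_variant(ref_gap_size, query_gap_size, min_len=10):
    """Larger variation: either gap size field exceeds min_len bp."""
    return ref_gap_size > min_len or query_gap_size > min_len


# Toy records (hypothetical).
print(is_small_variant("A", "G"))             # single-base substitution
print(is_small_variant("A", "ACGTACGTACGT"))  # 11 bp inserted allele
print(is_large_variant(0, 250))               # 250 bp gap in the query
print(is_large_variant(4, 7))                 # both gaps at or below 10 bp
```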
General data processing
Principal data analysis and visualization was performed using R V3.3.2 or later. Quantification of gene expression was performed using Rsubread48. K-means based clustering was performed using MBCluster.Seq49. Principal component analysis was performed using the prcomp function. Enrichment of functional categories in genomic regions or expression-based clusters was tested for by Fisher’s exact test followed by Bonferroni–Holm correction using the fisher.test and p.adjust functions, respectively. Manipulation of biological sequences and genome-based features was performed using Bioconductor V3.3 and the packages Biostrings, rtracklayer, and GenomicRanges. General data visualization was performed using ggplot2. Circular plots of genome features were created using Circos50.
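The enrichment testing described above (Fisher's exact test followed by Holm correction) can be sketched in plain Python. This is an illustration under assumptions and not the authors' R code: it computes a one-sided (over-representation) p-value from the hypergeometric distribution, whereas R's `fisher.test` is two-sided by default, and the function names `enrichment_p` and `holm` are hypothetical.

```python
from math import comb


def enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    k: genes of the category in the region or cluster
    n: all annotated genes in the region or cluster
    K: genes of the category in the whole genome
    N: all annotated genes in the whole genome
    Returns P(X >= k) when drawing n genes without replacement.
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def holm(p_values):
    """Holm step-down adjustment; returns adjusted p-values in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        value = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, value)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# Toy example (hypothetical counts): three categories tested in one region.
raw = [enrichment_p(12, 60, 900, 20000),
       enrichment_p(4, 60, 1200, 20000),
       enrichment_p(6, 60, 500, 20000)]
for p, q in zip(raw, holm(raw)):
    print(f"raw p = {p:.3g}, Holm-adjusted p = {q:.3g}")
```

A category would then be called enriched in a region when its adjusted p-value falls at or below the chosen significance level.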
Acknowledgements
We thank the National Institute of Genetics, Mishima, Japan and the Nagoya University Information and Communications Department, Nagoya, Japan for the use of their cluster computing facilities. We would also like to thank Ken Naito from the National Agriculture and Food Research Organization for helpful discussions about genome assembly.
Author contributions
S.R. performed genome assembly and all bioinformatics analyses. T.F. assisted with genome assembly, provided plant materials and DNA for genome sequencing, and performed QTL analyses. K.B.U. provided materials for the tissue transcriptome. M.C. provided resources for visualization and analysis of genome features. K.K.J. provided plant material. A.T., A.F. and N.K. provided sequence data. M.A. conceived and supervised the research.
Data availability
All reads used for genome (Bioproject PRJDB6339) and transcriptome (Bioproject PRJDB6351) assembly have been uploaded to the DNA Databank of Japan (DDBJ). A genome browser and other O. longistaminata genome-related data including all sequences and annotations can be found at http://olinfres.nig.ac.jp/
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Stefan Reuscher, Email: reuscher@agr.nagoya-u.ac.jp.
Motoyuki Ashikari, Email: ashi@agr.nagoya-u.ac.jp.
Electronic supplementary material
Supplementary Information accompanies this paper at 10.1038/s42003-018-0171-y.
References
- 1. Vaughan DA. The Wild Relatives of Rice. A Genetic Resources Handbook. Manila: International Rice Research Institute; 1994.
- 2. Marathi B, Ramos J, Hechanova SL, Oane RH, Jena KK. SNP genotyping and characterization of pistil traits revealing a distinct phylogenetic relationship among the species of Oryza. Euphytica. 2015;201:131–148. doi: 10.1007/s10681-014-1213-2.
- 3. Du H, et al. Sequencing and de novo assembly of a near complete indica rice genome. Nat. Commun. 2017;8:15324. doi: 10.1038/ncomms15324.
- 4. Zhang J, et al. Extensive sequence divergence between the reference genomes of two elite indica rice varieties Zhenshan 97 and Minghui 63. Proc. Natl. Acad. Sci. U.S.A. 2016;113:E5163–E5171. doi: 10.1073/pnas.1611012113.
- 5. Zhang Y, et al. Genome and comparative transcriptomics of African wild rice Oryza longistaminata provide insights into molecular mechanism of rhizomatousness and self-incompatibility. Mol. Plant. 2015;8:1683–1686. doi: 10.1016/j.molp.2015.08.006.
- 6. Schatz MC, et al. Whole genome de novo assemblies of three divergent strains of rice, Oryza sativa, document novel gene space of aus and indica. Genome Biol. 2014;15:506. doi: 10.1186/s13059-014-0506-z.
- 7. Wang M, et al. The genome sequence of African rice (Oryza glaberrima) and evidence for independent domestication. Nat. Genet. 2014;46:982–988. doi: 10.1038/ng.3044.
- 8. Zhang QJ, et al. Rapid diversification of five Oryza AA genomes associated with rice adaptation. Proc. Natl. Acad. Sci. U.S.A. 2014;111:E4954–E4962. doi: 10.1073/pnas.1418307111.
- 9. Chen J, et al. Whole-genome sequencing of Oryza brachyantha reveals mechanisms underlying Oryza genome evolution. Nat. Commun. 2013;4:1595. doi: 10.1038/ncomms2596.
- 10. Stein JC, et al. Genomes of 13 domesticated and wild rice relatives highlight genetic conservation, turnover and innovation across the genus Oryza. Nat. Genet. 2018;50:285–296. doi: 10.1038/s41588-018-0040-0.
- 11. Song WY, et al. A receptor kinase-like protein encoded by the rice disease resistance gene, Xa21. Science. 1995;270:1804–1806. doi: 10.1126/science.270.5243.1804.
- 12. Ramos JM, et al. Development of chromosome segment substitution lines (CSSLs) of Oryza longistaminata A. Chev. & Röhr in the background of the elite japonica rice cultivar, Taichung 65 and their evaluation for yield traits. Euphytica. 2016;210:151–163. doi: 10.1007/s10681-016-1685-3.
- 13. Gichuhi E, Himi E, Takahashi H, Maekawa M. Oryza longistaminata’s chromosome segments are responsible for agronomically important traits for environmentally smart rice. Scientific Conference Proceedings. 2013.
- 14. He R, et al. A systems-wide comparison of red rice (Oryza longistaminata) tissues identifies rhizome specific genes and proteins that are targets for cultivated rice improvement. BMC Plant Biol. 2014;14:46. doi: 10.1186/1471-2229-14-46.
- 15. Hu F, et al. Identification of rhizome-specific genes by genome-wide differential expression analysis in Oryza longistaminata. BMC Plant Biol. 2011;11:18. doi: 10.1186/1471-2229-11-18.
- 16. Hu FY, et al. Convergent evolution of perenniality in rice and sorghum. Proc. Natl. Acad. Sci. U.S.A. 2003;100:4050–4054. doi: 10.1073/pnas.0630531100.
- 17. Michael TP, VanBuren R. Progress, challenges and the future of crop genomes. Curr. Opin. Plant Biol. 2015;24:71–81. doi: 10.1016/j.pbi.2015.02.002.
- 18. Mascher M, et al. A chromosome conformation capture ordered sequence of the barley genome. Nature. 2017;544:427–433. doi: 10.1038/nature22043.
- 19. Sakai H, et al. The power of single molecule real-time sequencing technology in the de novo assembly of a eukaryotic genome. Sci. Rep. 2015;5:16780. doi: 10.1038/srep16780.
- 20. VanBuren R, et al. Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum. Nature. 2015;527:508–511. doi: 10.1038/nature15714.
- 21. Chin CS, et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods. 2016;13:1050–1054. doi: 10.1038/nmeth.4035.
- 22. Koren S, et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017;27:722–736. doi: 10.1101/gr.215087.116.
- 23. Furuta T, Ashikari M, Jena KK, Doi K, Reuscher S. Adapting genotyping-by-sequencing for rice F2 populations. G3. 2017;7:881–893. doi: 10.1534/g3.116.038190.
- 24. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31:3210–3212. doi: 10.1093/bioinformatics/btv351.
- 25. Lohse M, et al. Mercator: a fast and simple web server for genome scale functional annotation of plant sequence data. Plant Cell Environ. 2014;37:1250–1258. doi: 10.1111/pce.12231.
- 26. Paterson AH, Bowers JE, Chapman BA. Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc. Natl. Acad. Sci. U.S.A. 2004;101:9903–9908. doi: 10.1073/pnas.0307901101.
- 27. Thiel T, et al. Evidence and evolutionary analysis of ancient whole-genome duplication in barley predating the divergence from rice. BMC Evol. Biol. 2009;9:209. doi: 10.1186/1471-2148-9-209.
- 28. Kurtz S, et al. Versatile and open software for comparing large genomes. Genome Biol. 2004;5:R12. doi: 10.1186/gb-2004-5-2-r12.
- 29. Nattestad M, Schatz MC. Assemblytics: a web analytics tool for the detection of variants from an assembly. Bioinformatics. 2016;32:3021–3023. doi: 10.1093/bioinformatics/btw369.
- 30. Smit AFA, Hubley R, Green P. RepeatMasker Open-4.0. Available at http://www.repeatmasker.org (2013–2015).
- 31. Zhao Y, et al. Allelic diversity in an NLR gene BPH9 enables rice to combat planthopper variation. Proc. Natl. Acad. Sci. U.S.A. 2016;113:12850–12855. doi: 10.1073/pnas.1614862113.
- 32. Liu Y, et al. A gene cluster encoding lectin receptor kinases confers broad-spectrum and durable insect resistance in rice. Nat. Biotechnol. 2015;33:301–305. doi: 10.1038/nbt.3069.
- 33. Song WY, et al. Evolution of the rice Xa21 disease resistance gene family. Plant Cell. 1997;9:1279–1287. doi: 10.1105/tpc.9.8.1279.
- 34. Jacquemin J, Bhatia D, Singh K, Wing RA. The International Oryza Map Alignment Project: development of a genus-wide comparative genomics platform to help solve the 9 billion-people question. Curr. Opin. Plant Biol. 2013;16:147–156. doi: 10.1016/j.pbi.2013.02.014.
- 35. Zhu H, Qu F, Zhu LH. Isolation of genomic DNAs from plants, fungi and bacteria using benzyl chloride. Nucleic Acids Res. 1993;21:5279–5280. doi: 10.1093/nar/21.22.5279.
- 36. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324.
- 37. Walker BJ, et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One. 2014;9:e112963. doi: 10.1371/journal.pone.0112963.
- 38. Tang H, et al. ALLMAPS: robust scaffold ordering based on multiple maps. Genome Biol. 2015;16:3. doi: 10.1186/s13059-014-0573-1.
- 39. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842. doi: 10.1093/bioinformatics/btq033.
- 40. Glaubitz JC, et al. TASSEL-GBS: a high capacity genotyping by sequencing analysis pipeline. PLoS One. 2014;9:e90346. doi: 10.1371/journal.pone.0090346.
- 41. Broman KW, Sen Ś. A Guide to QTL Mapping with R/qtl. New York: Springer; 2009.
- 42. Numa H, Itoh T. MEGANTE: a web-based system for integrated plant genome annotation. Plant Cell Physiol. 2014;55:e2. doi: 10.1093/pcp/pct157.
- 43. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–2120. doi: 10.1093/bioinformatics/btu170.
- 44. Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods. 2015;12:357–360. doi: 10.1038/nmeth.3317.
- 45. Pertea M, et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 2015;33:290–295. doi: 10.1038/nbt.3122.
- 46. Huerta-Cepas J, et al. Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper. Mol. Biol. Evol. 2017;34:2115–2122. doi: 10.1093/molbev/msx148.
- 47. Lyons E, Freeling M. How to usefully compare homologous plant genes and chromosomes as DNA sequences. Plant J. 2008;53:661–673. doi: 10.1111/j.1365-313X.2007.03326.x.
- 48. Liao Y, Smyth GK, Shi W. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Res. 2013;41:e108. doi: 10.1093/nar/gkt214.
- 49. Si Y, Liu P, Li P, Brutnell TP. Model-based clustering for RNA-seq data. Bioinformatics. 2014;30:197–205. doi: 10.1093/bioinformatics/btt632.
- 50. Krzywinski M, et al. Circos: an information aesthetic for comparative genomics. Genome Res. 2009;19:1639–1645. doi: 10.1101/gr.092759.109.