SUMMARY
Many economically important crops have large and complex genomes, which hampers their sequencing by standard methods such as whole genome shotgun (WGS). Plant genomes contain large tracts of methylated repeats interspersed with hypomethylated gene-rich regions. Gene enrichment strategies based on methylation profiles offer an alternative for sequencing repetitive genomes. Here, we applied methyl filtration (MF) with McrBC digestion to enrich for euchromatic regions of the sugarcane genome. To verify the efficiency of MF and the assembly quality of sequences subjected to this gene-enrichment strategy, we compared assemblies of MF and unfiltered (UF) libraries. MF yielded a better assembly by filtering out 35% of the sugarcane genome and by producing 1.5 times more scaffolds and 1.7 times more assembled Mb than the UF library. The coverage of sorghum CDS by MF scaffolds was at least 36% higher than by UF scaffolds. Using MF technology, we increased the coverage of genic regions of the monoploid sugarcane genome to 134X. The MF reads assembled into scaffolds covering all genes in sugarcane BACs, 97.2% of sugarcane ESTs, 92.7% of sugarcane RNA-seq reads and 98.4% of sorghum protein sequences. Analysis of MF scaffolds encoding enzymes of the sucrose/starch pathway uncovered 291 SNPs in the wild sugarcane species S. spontaneum and S. officinarum. A large number of microRNA genes were also identified in the MF scaffolds. The information provided by the MF dataset is a valuable tool for genomic research in the genus Saccharum and for the improvement of sugarcane as a biofuel crop.
Keywords: methylation filtration, gene-rich regions, de novo assembly, sugarcane, Saccharum spp
INTRODUCTION
The remarkable improvement of sequencing methods from Sanger to next-generation sequencing (NGS) in the past few years has enabled large-scale genome sequencing and evolutionary comparisons (Zhou et al., 2010; Liu et al., 2012; Renny-Byfield et al., 2011; Zhang et al., 2012). Despite these rapid improvements, and although several initiatives were launched in the last decade to integrate genome sequencing strategies such as BAC-by-BAC and whole genome shotgun (WGS) for important crops (Feuillet et al., 2011), the number of sequenced plant genomes is growing relatively slowly because of their large size and complexity (Morrell et al., 2011; Hamilton and Buell, 2012). Interest in the sugarcane genome has grown in the last few years due to its economic impact on sustainable energy production (Cheavegatti-Gianotto et al., 2011). Commercial sugarcane hybrid cultivars were generated by interspecific crosses between Saccharum officinarum and Saccharum spontaneum. The resulting hybrids have very complex polyploid and aneuploid genomes that can show great divergence in their repetitive regions (Butterfield et al., 2001). Most of the knowledge of the sugarcane genome has been obtained through cytological studies, DNA molecular markers and breeding (Butterfield et al., 2001). The sequencing of the sorghum genome also expanded genomic studies of its relatives, including sugarcane (Wei et al., 2009). At the gene level, sorghum and sugarcane sequences are well conserved, with more than 85% similarity among orthologs (Wang et al., 2010). For sugarcane, the only valuable source of information is the EST dataset (Vettore et al., 2003), but this dataset lacks a large number of genes due to either low expression level or tissue specificity (Nelson et al., 2008).
Due to the complex genome and the high proportion of repetitive sequences in the sugarcane genome, sequencing consortia are only at the early stages of achieving enough coverage to solve some of the difficult assembly problems.
DNA cytosine methylation in plant genomes is found extensively in heterochromatin, where it inactivates transposons and other repetitive sequences, and much less intensely in euchromatic regions, including transcribed regions (Rabinowicz et al., 2005; Grativol et al., 2011). The first technology to separate gene regions from highly repetitive ones was high-C0t, which exploits the fact that a sequence reassociates at a rate proportional to the number of times it occurs in the genome (Peterson et al., 2002). A second gene enrichment technology is methylation filtration (MF), in which regions of the genome are selected based on their characteristic hypomethylation (Rabinowicz et al., 2005). In essence, repetitive regions with a high occurrence of methylated cytosine (mC) sites are recognized by the McrBC restriction enzyme and excluded from genomic shotgun libraries. The unmethylated regions, which mostly represent genes, are preserved and can be sequenced as MF libraries (Rabinowicz et al., 2005). MF selection was successfully used to sequence genic regions of maize (Palmer et al., 2003) and sorghum (Bedell et al., 2005), proving to be an efficient strategy to enrich for genes in plants with complex genomes such as monocots (Rabinowicz et al., 2005). Here, we compared assemblies of sequences derived from a pilot methyl filtration sequencing experiment with assemblies of sequences from WGS. The results showed that MF technology combined with Illumina short-read sequencing is efficient at generating a set of sequences representative of the hypomethylated portion of the sugarcane genome. MF reads increased the coverage of genic regions of the monoploid sugarcane genome to approximately 134X and allowed a de novo assembly of genomic regions within and around genes, such as promoters, microRNAs and introns. The gene similarity between sugarcane and sorghum was highlighted in the MF assembled sequences, which covered 98.4% of sorghum CDS sequences.
The information achieved by the MF dataset provides a valuable tool for genomic research in the genus Saccharum.
RESULTS
Gene enrichment in the MF library
To assess the power of gene enrichment in sugarcane using the methyl filtration technique, we assumed that gene enrichment in MF libraries is proportional to the reduction of repetitive sequences (Bedell et al., 2005). For this evaluation, we constructed and sequenced methyl filtered and unfiltered libraries. The sugarcane cultivar SP70-1143 was grown in soil and mature young leaves were collected for genomic DNA extraction. Two libraries were constructed, one using gDNA digested with McrBC endonuclease (methyl filtered, MF) and one using undigested gDNA (unfiltered, UF). MF and UF libraries containing inserts of 200 bp were sequenced on Illumina with the paired-end 50 cycle protocol. After quality filtering of the fastq files, both libraries had approximately the same number of reads (24,239,076 for MF and 25,385,446 for UF). Next, we checked whether some classes of repeats were reduced in MF reads, based on the percentage of hits of MF and UF reads to the main families of repeats. Among the recognizable repeats, the most repetitive sequences, namely ribosomal genes (2.56), centromeric repeats (1.4) and retrotransposon elements (1.18), were less abundant in MF reads (Table S1). Then, the gene enrichment factor, or FP value, was calculated by comparing the rate of gene discovery between MF and UF sequences. The proportions of matches of MF and UF sequences to a curated sorghum CDS database were compared over a range of E-values from 10^-5 to 10^-20, such that all matches better than the given E-value were tabulated (Table S2). The FP values ranged from 1.4 to 1.7 with a median of 1.54. The size of the monoploid sugarcane genome is estimated to be approximately 930 Mb (Wang et al., 2010). The genome space sampled by MF can be estimated by dividing the 930 Mb of the sugarcane genome by the FP values. The median space sampled by MF was 603 Mb; therefore 327 Mb, or 35% of the genome, was filtered out (Figure 1).
Figure 1. Gene enrichment by Methyl Filtration in sugarcane.
Methyl filtration removes the highly methylated portion of the sugarcane genome (orange), approximately 35% or 327 Mb of the monoploid genome size. The percentage and size of the genome space sampled by the MF technique are shown in black.
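The arithmetic behind these estimates can be sketched as follows. This is a minimal illustration, assuming that the sampled genome space equals the monoploid genome size divided by the FP value; only the 930 Mb genome size and the median FP of 1.54 come from the text, and the function names are hypothetical.

```python
def sampled_space_mb(genome_mb: float, fp: float) -> float:
    """Genome space sampled by MF, assuming gene enrichment is inversely
    proportional to the fraction of the genome that is retained."""
    return genome_mb / fp


def filtered_fraction(genome_mb: float, fp: float) -> float:
    """Fraction of the genome removed by methyl filtration."""
    return 1.0 - sampled_space_mb(genome_mb, fp) / genome_mb


GENOME_MB = 930.0  # monoploid sugarcane genome size quoted in the text
MEDIAN_FP = 1.54   # median gene enrichment factor quoted in the text

sampled = sampled_space_mb(GENOME_MB, MEDIAN_FP)
removed = GENOME_MB - sampled
fraction = filtered_fraction(GENOME_MB, MEDIAN_FP)
print(f"sampled: {sampled:.1f} Mb, filtered out: {removed:.1f} Mb ({fraction:.0%})")
```

Applying the same functions to the extremes of the reported FP range (1.4 and 1.7) gives the corresponding range of sampled genome space.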
Assembly of gene-rich regions
To verify whether sequences previously subjected to gene-enrichment strategies such as methyl filtration can be assembled, we first compared pilot assemblies of the MF and UF datasets. The quality-filtered reads from the digested and undigested libraries were shuffled within each paired library, and a similar depth was observed in both datasets (Figure 2a). The de novo assemblies of MF and UF reads were performed with the same parameters using the SOAPdenovo software (Luo et al., 2012). The resulting scaffolds were filtered by length (≥ 200 bp) and the assembly statistics were calculated. The main challenge in assembling plant genomes is handling the large amounts of repetitive sequences that complicate the assembly. Sequences obtained by methyl filtration may have a better representation of genes, leading to a better assembly than UF sequences. Accordingly, the assembly of MF reads produced 1.5 times more scaffolds, 1.7 times more assembled Mb and 2.5 times more scaffolds larger than 1,000 bp than the assembly of UF reads (Figure 2a). A second evaluation of the assemblies was performed by comparing the coverage of sorghum CDS. After a BLASTN search with curated sorghum CDS against MF and UF scaffolds, MF sequences hit 1.7 times more CDS than UF sequences (13,233 and 7,782 CDS, respectively). In total, 7,358 CDS had a significant alignment with both MF and UF scaffolds. Given the large difference in the total number of scaffolds and of CDS tagged by the two assemblies, we evaluated base coverage only in CDS aligned with both MF and UF scaffolds; the percentage of CDS covered in the range of 50 to 100% was counted for each dataset and compared. MF scaffolds covered on average 68% of aligned CDS, compared to only 32% for UF scaffolds (Figure 2b).
Figure 2. Comparison of assemblies of MF and UF reads.
The digested (MF) and undigested (UF) reads were assembled separately and the obtained scaffolds were evaluated by their rate of sorghum CDS coverage. a) Illumina paired-end sequences were quality analyzed and the remaining reads were shuffled on each paired library. The de novo assembly performed by SOAPdenovo resulted in MF and UF consensus sequences with great difference in their assembly statistics. b) The sorghum CDS coverage was divided in 5 categories between 50–100% and the number of CDS that fell on each specific category was retrieved for each assembly. The percentages of CDS were calculated for each category with the number of CDS covered by MF (red bars) and UF scaffolds (gray bars).
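The binning used for panel b can be sketched as below. This is an illustrative sketch only: the text states that coverage between 50 and 100% was divided into 5 categories, but the equal-width bin edges, the function names and the example coverage values are assumptions.

```python
from collections import Counter
from typing import Dict, Optional

# Five coverage categories between 50% and 100% (equal widths are an assumption).
BINS = [(50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]


def bin_label(coverage_pct: float) -> Optional[str]:
    """Return the category label for a coverage percentage, or None below 50%."""
    for low, high in BINS:
        # The last bin is closed on the right so that 100% coverage is counted.
        if low <= coverage_pct < high or (high == 100 and coverage_pct == 100):
            return f"{low}-{high}%"
    return None


def coverage_histogram(coverage_by_cds: Dict[str, float]) -> Dict[str, float]:
    """Percentage of all aligned CDS that fall in each coverage category."""
    counts = Counter()
    for pct in coverage_by_cds.values():
        label = bin_label(pct)
        if label is not None:
            counts[label] += 1
    total = len(coverage_by_cds)
    if total == 0:
        return {f"{lo}-{hi}%": 0.0 for lo, hi in BINS}
    return {f"{lo}-{hi}%": 100.0 * counts[f"{lo}-{hi}%"] / total for lo, hi in BINS}


# Hypothetical coverage values for CDS aligned by both assemblies.
mf_cov = {"cds1": 95.0, "cds2": 100.0, "cds3": 72.5, "cds4": 40.0}
uf_cov = {"cds1": 55.0, "cds2": 61.0, "cds3": 30.0, "cds4": 20.0}
mf_hist = coverage_histogram(mf_cov)
uf_hist = coverage_histogram(uf_cov)
print(mf_hist)
print(uf_hist)
```

Restricting both dictionaries to the same CDS identifiers mirrors the choice to compare only CDS aligned by both assemblies.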
In order to improve the assembly of consensus sequences, two mate-pair libraries of methyl filtered sequences, sheared to 2,000 and 5,000 bp, were prepared and sequenced. The coverage of sorghum genes by the quality-filtered MF libraries (200, 2,000 and 5,000 bp) was additionally verified. MF reads covered 95% of the exons at more than 10X, and 98% of sorghum genes (see Experimental Procedures). Next, repetitive sequences were removed from the MF libraries and the remaining reads were shuffled within each paired library. According to Butterfield et al. (2001), genes cover around 20% of the sugarcane monoploid genome (186 Mb), which is close to the gene contents of maize and rice, 24 and 17%, respectively. Over 260 million reads (25 billion MF bases) were sequenced with the Illumina technique (Table S3). Thus, the coverage of the sugarcane genic region achieved with MF reads is estimated to be 134X. Following the assembly pipeline, a range of kmer sizes was tested and the best kmer size (k=49) was chosen based on assembled lengths with and without Ns, mean contig size and number of contigs over 1,000 bp. For comparison, the ABySS software (Simpson et al., 2009) was also used as a second contig assembler with the best kmer. We scaffolded the pre-assembled contigs using the SSPACE software (Boetzer et al., 2011). The SOAPdenovo scaffolds were also used to find the optimized assembly. The workflow, with the main steps of the MF read assembly, is shown in Figure S1. The MF scaffolds produced by the three assemblers were compared (Table 1). The resulting assemblies comprised over 900,000 supercontigs containing at least 600 Mb in each group of assembled supercontigs. The number of supercontigs over 1,000 bp obtained with each assembler was 420,765 for SOAPdenovo, 83,946 for ABySS+SSPACE and 146,178 for SOAPdenovo+SSPACE. The MF scaffold coverage of the sugarcane monoploid gene space varied from 3.3 to 16.5 fold.
The combination of ABySS+SSPACE resulted in the best assembled supercontigs considering the total number of bases without “N” and the total number of supercontigs. The MF scaffolds produced by SOAPdenovo achieved the largest supercontig (bp) and N50 length (bp). Comparing the MF scaffolds obtained with the same scaffolder (SSPACE), SOAPdenovo+SSPACE achieved better results than ABySS+SSPACE in all analyzed categories except for the largest supercontig. We used the MF scaffolds obtained from all three assemblers in subsequent validations, quality assessment and comparative analysis.
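The fold-coverage figures quoted above follow from a simple ratio of sequenced or assembled bases to the estimated gene space. The sketch below is a minimal illustration, assuming a gene space of 20% of the 930 Mb monoploid genome and using the assembled totals listed in Table 1; the function and variable names are hypothetical.

```python
def fold_coverage(bases_mb: float, target_mb: float) -> float:
    """Fold coverage of a target region given the number of bases obtained."""
    return bases_mb / target_mb


GENOME_MB = 930.0
GENE_SPACE_MB = 0.20 * GENOME_MB  # about 186 Mb, as stated in the text

# 25 billion MF bases expressed in Mb.
MF_READ_BASES_MB = 25_000.0
read_coverage = fold_coverage(MF_READ_BASES_MB, GENE_SPACE_MB)
print(f"read coverage of the gene space: {read_coverage:.0f}X")

# Assembled totals (Mb) as listed in Table 1.
assembled_mb = {
    "SOAPdenovo": 3082.0,
    "ABySS+SSPACE": 622.0,
    "SOAPdenovo+SSPACE": 674.0,
}
scaffold_coverage = {
    name: fold_coverage(total, GENE_SPACE_MB) for name, total in assembled_mb.items()
}
for name, cov in scaffold_coverage.items():
    print(f"{name}: {cov:.1f}X of the gene space")
```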
Table 1.
Assembly statistics
Assembler | Total (Mb) | Total without N's (Mb) | No. of supercontigs (≥ 200 bp) a | Largest supercontig (bp) | N50 (bp) |
---|---|---|---|---|---|
SOAPdenovo | 3,082 | 1,225 | 915,918 | 120,968 | 9,486 |
ABySS + SSPACE | 622 | 458 | 1,117,449 | 45,926 | 735 |
SOAPdenovo + SSPACE | 674 | 472 | 1,109,444 | 35,917 | 1,154 |
a Only scaffolds of 200 bp or longer were used to compute the assembly statistics.
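For readers unfamiliar with the statistics in Table 1, the sketch below shows one common way to compute them from a list of scaffold lengths. It is an illustration under assumptions: the N50 definition used here (the length of the shortest scaffold in the smallest set of longest scaffolds that holds at least half of the assembled bases) is the conventional one and is not confirmed by the text, and the example lengths are invented.

```python
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Length L such that scaffolds of length >= L contain at least half
    of all assembled bases (conventional N50 definition)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def assembly_stats(lengths: List[int], min_len: int = 200) -> Dict[str, int]:
    """Summary statistics computed on scaffolds of at least min_len bp."""
    kept = [length for length in lengths if length >= min_len]
    return {
        "n_scaffolds": len(kept),
        "total_bp": sum(kept),
        "largest_bp": max(kept, default=0),
        "n50_bp": n50(kept),
        "n_over_1kb": sum(1 for length in kept if length > 1000),
    }


# Hypothetical scaffold lengths; the 150 bp scaffold is discarded by the filter.
example_lengths = [150, 200, 450, 800, 1200, 3000, 5000]
stats = assembly_stats(example_lengths)
print(stats)
```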
Gene tagging and coverage
The principle behind using the methyl filtration technique in highly repetitive genomes is to identify genes faster than WGS in a robust and efficient manner. We evaluated the efficiency with which the three different sets of MF scaffolds tagged the sugarcane EST dataset and RNA-seq reads. We used the sugarcane EST database, comprising 121,342 sequences (88 Mb), and 17 RNA-seq libraries (55,635,715 reads) comprising over 3,415 Mb (Table S4). After the alignment, the coverage (in base pairs) of reads mapped to MF scaffolds from each assembler was tabulated (Table S5). The highest percentage of covered ESTs was obtained with the SOAPdenovo+SSPACE scaffolds (97.18%). In terms of bases, only 1,732 kb of ESTs were not covered by SOAPdenovo+SSPACE scaffolds. MF scaffolds from the three assemblers covered over 3,203 Mb of RNA-seq reads. The SOAPdenovo+SSPACE scaffolds had the highest percentage and coverage of RNA-seq reads. To estimate the number of MF scaffolds tagged by EST sequences and RNA-seq reads, the unique IDs of the scaffolds were extracted and the intersection between the two sets of tagged MF scaffolds was found for each assembly. The number of scaffolds tagged by both ESTs and RNA-seq reads ranged from 194,993 for the SOAPdenovo scaffolds to 208,134 for the SOAPdenovo+SSPACE scaffolds (Figure S2). The largest number of MF scaffolds aligned with both ESTs and RNA-seq reads, and also of scaffolds uniquely matched to ESTs, was found in the SOAPdenovo+SSPACE assembly.
In addition to being tagged by large numbers of ESTs and RNA-seq reads, if MF scaffolds represent genic regions of sugarcane, the rate of matches between MF scaffolds and protein sequences should also be high. To test this hypothesis, we used sorghum and Arabidopsis protein sequences classified as “hypothetical” and “known” (Methods S1). The best distribution of matches in the three categories (total, known and hypothetical) for sorghum and Arabidopsis was found with the SOAPdenovo+SSPACE scaffolds (Figure 3). These MF scaffolds covered more than 97% of total and known sorghum proteins. In addition, MF scaffolds hit 98% of sorghum hypothetical proteins. For the Arabidopsis dataset, the highest percentage of hits by MF scaffolds was found for known proteins (84%). The distribution of hits of SOAPdenovo and ABySS+SSPACE scaffolds to the protein datasets is available in Figure S3.
Figure 3. Comparisons of tagged Arabidopsis and sorghum proteins by MF scaffolds.
The sorghum and Arabidopsis protein sequences, further classified as “hypothetical” and “known”, were compared to SOAPdenovo+SSPACE scaffolds through BLASTX. A protein was considered tagged if it had a best match ≥ 10^-8 (red bars). Proteins not supported by the alignment parameters were considered as no hit (gray bars).
To estimate the ability of MF scaffolds to capture sugarcane genes, we analyzed their coverage of a dataset of bacterial artificial chromosome (BAC) sequences from a sugarcane cultivar. We selected 20 of the 52 finished BACs in the GNPannot database (Methods S1). We extracted gene and exon sequences separately for each BAC and used a BLAST search to calculate the coverage of these genic regions by MF scaffolds. MF scaffolds from SOAPdenovo+SSPACE were chosen for this analysis due to their best performance in covering CDSs and ESTs. MF scaffolds tagged all the genes in the 20 sequenced BACs (Table 2). The average nucleotide coverage of the gene set from the sugarcane BACs, comprising 5' and 3' UTRs, exonic and intronic regions, was 90.8%. At the exon level, 93% of predicted exons were tagged by MF scaffolds, with 91.9% nucleotide coverage.
Table 2.
Estimation of coverage of genes and exons from sugarcane BAC clones tagged by MF scaffolds.
BAC ID | Length (bp) | Gene count | Exon regions | % Gene | Genes: No. of genes | Genes: % of overlapping bases | Exons: No. of exons | Exons: % of overlapping bases |
---|---|---|---|---|---|---|---|---|
Sh007C22 | 125,231 | 39 | 156 | 45 | 39 | 92.6 | 141 | 93.2 |
Sh011C13 | 137,505 | 51 | 135 | 14 | 51 | 89.1 | 126 | 89.8 |
Sh011F05 | 65,821 | 22 | 92 | 75 | 22 | 90.5 | 90 | 90.4 |
Sh013O22 | 56,351 | 18 | 65 | 80 | 18 | 87.1 | 61 | 86.3 |
Sh013O24 | 82,736 | 36 | 74 | 15 | 36 | 87.7 | 69 | 88.7 |
Sh015P19 | 101,636 | 31 | 104 | 34 | 31 | 87.2 | 97 | 89.5 |
Sh026K20 | 137,078 | 42 | 142 | 76 | 42 | 92.0 | 131 | 93.7 |
Sh029N14 | 95,398 | 23 | 59 | 66 | 23 | 91.9 | 53 | 90.2 |
Sh030H10 | 94,493 | 33 | 135 | 48 | 33 | 92.3 | 125 | 91.8 |
Sh038J02 | 96,992 | 35 | 92 | 24 | 35 | 89.8 | 91 | 92.0 |
Sh043C15 | 101,971 | 35 | 127 | 31 | 35 | 91.0 | 119 | 93.4 |
Sh045D09 | 71,852 | 18 | 66 | 82 | 18 | 90.7 | 62 | 90.4 |
Sh056J11 | 85,843 | 40 | 90 | 14 | 40 | 93.8 | 88 | 94.3 |
Sh070I10 | 142,194 | 60 | 135 | 13 | 60 | 91.6 | 124 | 93.5 |
Sh077E22 | 71,049 | 21 | 81 | 21 | 21 | 93.5 | 78 | 95.7 |
Sh077M22 | 71,034 | 19 | 76 | 72 | 19 | 93.5 | 72 | 95.5 |
Sh095E16 | 157,59 | 48 | 148 | 23 | 48 | 91.6 | 134 | 92.6 |
Sh102H05 | 112,142 | 31 | 130 | 38 | 31 | 92.4 | 116 | 92.5 |
Sh102M23 | 129,132 | 32 | 127 | 79 | 32 | 91.3 | 121 | 96.5 |
Sh109A01 | 70,182 | 21 | 66 | 74 | 21 | 86.1 | 57 | 88.1 |
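The summary values quoted in the text can be recomputed from Table 2 with a few lines of code. The sketch below assumes that the average nucleotide coverage is an unweighted mean across the 20 BACs and that the fraction of tagged exons is pooled over all BACs; these aggregation choices are assumptions, and only the exon counts and percentage columns are transcribed from the table.

```python
# Rows transcribed from Table 2:
# (BAC ID, exon regions, % gene bases overlapped, exons tagged, % exon bases overlapped)
TABLE2 = [
    ("Sh007C22", 156, 92.6, 141, 93.2),
    ("Sh011C13", 135, 89.1, 126, 89.8),
    ("Sh011F05", 92, 90.5, 90, 90.4),
    ("Sh013O22", 65, 87.1, 61, 86.3),
    ("Sh013O24", 74, 87.7, 69, 88.7),
    ("Sh015P19", 104, 87.2, 97, 89.5),
    ("Sh026K20", 142, 92.0, 131, 93.7),
    ("Sh029N14", 59, 91.9, 53, 90.2),
    ("Sh030H10", 135, 92.3, 125, 91.8),
    ("Sh038J02", 92, 89.8, 91, 92.0),
    ("Sh043C15", 127, 91.0, 119, 93.4),
    ("Sh045D09", 66, 90.7, 62, 90.4),
    ("Sh056J11", 90, 93.8, 88, 94.3),
    ("Sh070I10", 135, 91.6, 124, 93.5),
    ("Sh077E22", 81, 93.5, 78, 95.7),
    ("Sh077M22", 76, 93.5, 72, 95.5),
    ("Sh095E16", 148, 91.6, 134, 92.6),
    ("Sh102H05", 130, 92.4, 116, 92.5),
    ("Sh102M23", 127, 91.3, 121, 96.5),
    ("Sh109A01", 66, 86.1, 57, 88.1),
]

n_bacs = len(TABLE2)
total_exons = sum(row[1] for row in TABLE2)
tagged_exons = sum(row[3] for row in TABLE2)

# Unweighted means across BACs (an assumption about how the averages were taken).
mean_gene_cov = sum(row[2] for row in TABLE2) / n_bacs
mean_exon_cov = sum(row[4] for row in TABLE2) / n_bacs
# Pooled fraction of predicted exons tagged by at least one MF scaffold.
pct_exons_tagged = 100.0 * tagged_exons / total_exons

print(f"mean gene nucleotide coverage: {mean_gene_cov:.1f}%")
print(f"exons tagged: {tagged_exons}/{total_exons} ({pct_exons_tagged:.0f}%)")
print(f"mean exon nucleotide coverage: {mean_exon_cov:.1f}%")
```

Under these assumptions the results agree with the 90.8%, 93% and 91.9% reported in the text.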
Annotation of complete pathways
In order to assess the usefulness of MF scaffolds in answering evolutionary genomics questions, we searched for genes of the sucrose/starch metabolic pathway. A screen of the MF scaffolds using sucrose/starch protein sequences from Arabidopsis and maize revealed 355 sugarcane scaffolds involved in this pathway (Table S6). The average length of the sucrose/starch-related MF scaffolds was 1,546 bp. As expected, the number of sugarcane MF scaffolds related to the sucrose/starch pathway was more similar to maize than to Arabidopsis. We also examined whether we could identify well-supported SNPs in the MF scaffolds. We constructed and sequenced genomic DNA libraries from two wild sugarcane species, S. spontaneum and S. officinarum. To call SNPs in both wild species, we aligned gDNA reads to the annotated sucrose/starch metabolism scaffolds used as reference. Subsequently, 291 SNPs were found in the two main sugarcane progenitors, S. spontaneum (43) and S. officinarum (248). Of these, 113 SNPs were located in CDS regions (38%). Some important SNPs are highlighted on the sucrose/starch pathway (Figure 4). The percentages of SNPs found in S. officinarum and S. spontaneum were 86% and 14%, respectively, in agreement with results indicating that the genomes of sugarcane hybrids are composed of ~80% S. officinarum and 10 to 20% S. spontaneum (Grivet and Arruda, 2002). While S. officinarum is capable of storing sucrose up to about 17% of its fresh weight, its wild relative S. spontaneum stores only ~4% (Bull and Glasziou, 1963). Interestingly, in the gene for sucrose phosphate synthase (SPS), a key enzyme in the sucrose storage process, two non-synonymous modifications were found in both S. officinarum and S. spontaneum, but another two SNPs from exonic regions appeared to be unique to S. officinarum (Figure S4).
Figure 4. SNP distribution between main sugarcane progenitors on the sucrose/starch pathway.
The annotated sucrose/starch related genes in MF scaffolds were used as reference sequences in the search for SNPs in S. spontaneum and S. officinarum. The numbers of SNPs found in each parent are marked with orange in the main steps of the sucrose/starch pathway. S. spontaneum SNPs (left) / S. officinarum SNPs (right).
Identification of miRNA precursors
MiRNAs constitute an important class of small RNAs involved in the negative regulation of protein-coding genes at the posttranscriptional level (Vaucheret, 2006; Bartel, 2004). Plant miRNA genes are transcribed from their own transcriptional units with the involvement of transcription factors (TFs) that bind to their promoter regions, as for protein-coding genes (Bologna et al., 2013; Meng et al., 2011). The product of miRNA gene transcription is a partially self-complementary hairpin structure (pri-miRNA) that is capped and polyadenylated (Souret et al., 2005; Mishra and Mukherjee, 2007). In plants, the hairpin has several distinctive characteristics that allow its identification by computational approaches (Bartel et al., 2004). Because miRNA genes are non-protein-coding transcription units, we examined whether the MF scaffolds contained hairpin structures of known and conserved plant miRNAs. MF scaffolds contain a large number of miRNA precursors (14,623). Figure 5a shows the number of precursors identified for some conserved miRNAs. We also counted the number of miRNA sequences and families of Arabidopsis, rice, sorghum and maize that were represented in the MF scaffolds. For the species more closely related to sugarcane, over 93% of the miRNA sequences had a hairpin structure in MF scaffolds (Figure 5b). Hairpins were found in the MF scaffolds for about 50% of Arabidopsis and rice miRNA sequences. Almost all miRNA families from sorghum and maize were represented in MF scaffolds (Figure 5c). We also checked whether we could identify in the MF scaffolds the hairpins of new miRNAs recently described for sugarcane (Thiebaut et al., 2012). Of a total of 37 sugarcane miRNAs deposited in miRBase, we found 31 miRNA sequences with hairpins in MF scaffolds. The deposited secondary structures of Sof-miR160 and Sof-miR398 were compared with the hairpin structures obtained from MF scaffolds (Figure S5).
Interestingly, the other 6 miRNA hairpins, missing from the MF scaffolds, were located not in genic regions of the sorghum genome but in intergenic regions.
Figure 5. MiRNA hairpin content in MF scaffolds.
The MIRcheck pipeline was used to find and evaluate the presence of hairpins of known plant miRNAs in MF scaffolds. a) The number of precursors found in MF scaffolds using conserved miRNAs in the search; b) Comparison between the number of miRNA sequences of each plant species with supported hairpin at MF scaffolds with the number of total miRNA deposited at miRBase; c) Comparison between the number of miRNA families of each plant species with supported hairpin at MF scaffolds with the number of total miRNA families deposited at miRBase.
DISCUSSION
In order to obtain a good representation of the gene-enriched fraction of the sugarcane genome, we used methylation filtration (MF) followed by Illumina sequencing. This work can serve as a starting point for applying next-generation sequencing coupled with the methyl filtration technique to obtain high-quality sequence coverage of genic regions from plants with very complex genomes.
Whole genome duplication is a common event in dicot and monocot plants, and it has shaped the repetitive and genic content and much of the epigenetic status of their genomes by creating a high level of redundancy (Wang et al., 2012). After polyploidization, drastic reorganization of the genomic structure may occur, including amplification of repetitive sequences and reduction of low copy number DNA (Doyle et al., 2008). Even though rounds of polyploidization can create larger genomes, plant gene sets frequently do not keep pace with this increase, and occasionally some genes are lost (Roulin et al., 2013). Although there is often a good correlation between methylation levels and increases in genome size in monocot plants, this correlation was not observed in some wheat species (Rabinowicz et al., 2005). Bread wheat has an enormous gene space, 5X larger than that of other grasses such as maize and sorghum; still, low gene enrichment was achieved using methyl filtration (Rabinowicz et al., 2005). Similarly, unlike what was achieved for sorghum and maize (Palmer et al., 2003; Bedell et al., 2005), only a modest gene enrichment was obtained for sugarcane. Although highly repetitive sequences such as ribosomal genes, retrotransposons and centromeric repeats appear to be highly methylated and were recovered less often in sugarcane MF reads, the enrichment of genic regions could be lower in the MF libraries due to the presence of unmethylated or lightly methylated repeats. However, while the presence of these repetitive sequences in the set of MF reads could create problems for the assembly of genic regions (Scheibye-Alsing et al., 2009), the comparison of MF and UF assemblies showed that methyl filtration increased not only the quality of the assembly but also gene coverage.
In addition, raising the coverage of the basic genic set to 134X could help to solve some of the assembly difficulties of highly complex genomes and generate a high-quality set of sequences that represents the sugarcane gene content well. In contrast, recent reports of the draft genome sequences of Nicotiana benthamiana (Bombarely et al., 2012) and Gossypium raimondii (K. Wang et al., 2012) by WGS did not achieve such high coverage of genic regions.
While the complete sequencing of the sugarcane genome remains a valuable goal, the sequencing of smaller datasets from complex genomes, such as MF scaffolds, ESTs and RNA-seq, appears to be an important step toward obtaining gene-related information (Edwards and Batley, 2010). For reduced-representation libraries, gene discovery is expected to occur faster than with WGS, in proportion to the reduction of repeats (Young et al., 2010). Indeed, the assembled MF scaffolds identified genes with high efficiency, providing an excellent representation of sugarcane EST and RNA-seq datasets, of proteins from different species, and of genes and exons from sugarcane BACs. Recently, the ratio of EST coverage on plant genomes was reported to be a good parameter to distinguish good from poor assemblies (Shangguan et al., 2013). By generating MF scaffolds covering at least 3.6X the sugarcane gene space, it was possible to obtain as much information as, or even more than, the sorghum MF dataset with only 1X coverage (Bedell et al., 2005). Furthermore, the MF scaffolds offer a high-quality resource for the identification of genes, promoters and polymorphisms with biotechnological potential.
The sugarcane improvement process started late in the 19th century, when an innovative cross was made between the wild species S. officinarum and S. spontaneum in order to generate cultivars with high sugar content and disease resistance (Moore, 2005). After more than 100 years of sugarcane breeding, increasing sucrose production per hectare is still a major goal (Smith, 2008). However, only a few genes of the sucrose metabolic pathway have been identified in sugarcane (Zhang et al., 2013; Papini-Terzi et al., 2009). In this context, the MF scaffolds offer the possibility of identifying complete pathways, such as the genes of sucrose and starch metabolism. Besides the identification of genes, the lower sequencing error rate of MF compared to high-C0t sequences has allowed the use of this technique in the search for SNPs (Fu et al., 2004). Recently, an array-based capture with sorghum genes, followed by NGS, was used for the identification of sugarcane SNPs (Bundock et al., 2012). Although sorghum and sugarcane coding sequences are highly similar, the MF scaffolds also include regulatory non-coding sequences. Furthermore, the MF scaffolds can be used to identify SNPs among a sugarcane hybrid, S. spontaneum and S. officinarum, which have far-reaching phenotypic differences. Moreover, a large number of microRNA genes were also identified in the MF scaffolds. Previously, sugarcane pri-miRNAs were identified by searching the EST database (Zanca et al., 2010), which limits the discovery process because of their rapid processing (Voinnet, 2009). In addition, the lack of genomic sequences hinders the use of bioinformatics tools to discover new microRNAs.
So far, most of the new sugarcane miRNA precursors have been described using the sorghum genome for database searches, but much information could have been lost because the evolution rate of miRNAs differs between species and can generate many non-conserved and species-specific miRNAs (Cuperus et al., 2011).
Because of the difficulties in solving assembly problems of highly repetitive genomes, we sequenced and assembled sugarcane genic regions using methylation filtration. The high coverage of the sugarcane gene set by MF reads enabled the generation of a robust assembly capable of serving as a reference sequence, efficiently covering not only ESTs and proteins from published databases, but also the vast majority of a 17-library RNA-seq dataset. The discovered genes, regulatory sequences and non-coding RNAs are now a valuable resource for the improvement of sugarcane as a biofuel crop.
EXPERIMENTAL PROCEDURES
Plant material and nuclei DNA extraction
Sugarcane cultivar SP70-1143 was grown in soil and young leaves were collected. Genomic DNA was purified from isolated nuclei of 2 leaves as described, with modifications (nuclei isolation modified from Hamilton, Kunsch, and Temperli (1972) Anal. Biochem. 49:48–57; further modified by Tom Guilfoyle, then N. Olszewski and Eric Richards; available at http://www.protocol-online.org/cgi-bin/prot/view_cache.cgi?ID=3931). The quality of the DNA was estimated using a Thermo Scientific NanoDrop™ 2000c Spectrophotometer and then verified by electrophoresis on a 1% agarose gel.
MF library construction and sequencing
For each library, 3–5 μg of genomic DNA was sheared using a Covaris S220 Adaptive Focused Acoustics ultrasonicator. Libraries were constructed following the standard protocol using the NEBNext DNA Sample Prep Master Mix Set 1 (NEB E6040) and Illumina-compatible paired-end adaptors. Half of the library was digested with McrBC endonuclease (NEB M0272) in a 100 μl reaction containing 30 units McrBC, 1x NEBuffer 2, 200 μg/ml BSA and 1 mM GTP. After 8 hr of incubation at 37 °C, an additional 30 units of McrBC and 1 μl of GTP were added and incubation was continued for another 8 hr. After the incubation, digested (methyl filtered, MF) and control (undigested, UF) libraries were purified using the QIAquick PCR Purification kit (Qiagen 28104) and run on a 2% MetaPhor® agarose (Lonza 50108) gel. Fragments of the desired sizes were excised from the gel, purified with the QIAquick PCR Purification kit and amplified with 15 cycles of PCR using Phusion polymerase (part of the NEBNext kit) and the cycling parameters described in the NEBNext kit manual. DNA concentrations were quantified on a Bioanalyzer (Agilent), and the libraries were diluted to 10 nM and loaded on flow cells to generate clusters. The MF and UF libraries containing inserts of 200 bp were constructed and sequenced at Cold Spring Harbor Laboratory on an Illumina GAII machine using the paired-end 50 cycle protocol. Two mate-pair libraries were constructed by selecting genomic DNA sheared to 2,000 and 5,000 bp, followed by McrBC treatment as described above. The libraries containing inserts of 2 and 5 kb were sequenced at Fasteris Life Sciences SA (Plan-les-Ouates, Switzerland) on a HiSeq2000 machine using the paired-end 100 cycle protocol. Sequence quality was evaluated by requiring that at least 90% of the bases in a read have a base quality greater than or equal to 20 (Q20), using FASTX Toolkit version 0.0.6 (Hannon lab).
The sequence data from this study have been submitted to the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi) under accession no. SRP023506.
FP calculation
As reported by Bedell and co-workers (Bedell et al., 2005), we curated the sorghum dataset to eliminate repetitive sequences that could inflate the true gene content of the database (Methods S1). Because of the high similarity between sorghum and sugarcane genes, we selected the sorghum CDS sequences from JGI (ftp://ftp.jgipsf.org/pub/compgen/phytozome/v9.0/Sbicolor/annotation/Sbicolor_79_cds.fa.gz). Curation reduced the dataset by 4,006 sequences from a total of 29,448 sequences. The gene enrichment factor, or FP value, was calculated using an NCBI-BLASTN search (E-value cutoff 0.01) of quality-filtered MF and UF reads against the curated sorghum CDS database. All matches of MF and UF sequences with E-values from 10⁻⁵ to 10⁻²⁰ were tabulated. To evaluate the FP at the repeat level, we performed an NCBI-BLASTN search (E-value cutoff 0.01) of quality-filtered MF and UF reads against the repetitive sequences from TIGR_Gramineae_Repeats.v3.3_0_0 (Ouyang and Buell, 2004). All reads matching the different repeat classes were counted and the repeat filtration was calculated.
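The exact FP formula follows Bedell et al. (2005) and is not restated in this section. As a hedged illustration only, the Python sketch below assumes that the enrichment factor is the ratio of the fraction of MF reads matching the CDS set to the fraction of UF reads matching it, tabulated at several E-value cutoffs. The function names, the input layout (one best-hit E-value per matching read) and the example counts are hypothetical.

```python
def count_hits(evalues, cutoff):
    """Count reads whose best-hit E-value is at or below the cutoff."""
    return sum(1 for e in evalues if e <= cutoff)


def enrichment_by_cutoff(mf_evalues, mf_total, uf_evalues, uf_total, cutoffs):
    """Return {cutoff: enrichment}, with enrichment = (MF hit fraction) / (UF hit fraction).

    The ratio definition is an assumption; None is returned when no UF read matches.
    """
    if mf_total <= 0 or uf_total <= 0:
        raise ValueError("total read counts must be positive")
    table = {}
    for cutoff in cutoffs:
        mf_hits = count_hits(mf_evalues, cutoff)
        uf_hits = count_hits(uf_evalues, cutoff)
        if uf_hits == 0:
            table[cutoff] = None
        else:
            # Equivalent to (mf_hits / mf_total) / (uf_hits / uf_total).
            table[cutoff] = (mf_hits * uf_total) / (uf_hits * mf_total)
    return table
```

The same ratio, computed against a repeat database instead of a CDS database, would be expected to fall below 1 when repeats are depleted from the MF library.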
Assembly
The quality-filtered reads from the MF and UF datasets were shuffled (interleaved) within each paired library using an in-house Perl script. The interleaved reads were assembled into scaffolds with SOAPdenovo (Luo et al., 2012) using a previously tested k-mer size of 25. The MF and UF scaffolds were filtered to a length of ≥ 200 bp and the statistics of each assembly were calculated. An additional evaluation of the assemblies was performed by aligning the sorghum CDS, filtered as described in Methods S1, against the MF and UF scaffolds. We used a BLASTN search with best hit and an E-value cutoff of 10⁻⁵. Because of the large difference in the total number of scaffolds between MF and UF, we only evaluated the coverage of CDS that aligned with both MF and UF sequences. We counted the number of CDS with 50 to 100% of their length covered by MF and UF scaffolds.
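The in-house Perl script used for shuffling is not reproduced here. The following Python sketch shows one plausible way to interleave two mate files so that each read is immediately followed by its mate; the function names are hypothetical and the sketch assumes four-line FASTQ records in the same order in both files.

```python
from itertools import islice, zip_longest


def read_fastq(handle):
    """Yield FASTQ records as lists of four lines (header, sequence, '+', quality)."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(handle_1, handle_2, out_handle):
    """Write mate 1 and mate 2 records alternately; return the number of pairs written."""
    pairs = 0
    for rec_1, rec_2 in zip_longest(read_fastq(handle_1), read_fastq(handle_2)):
        if rec_1 is None or rec_2 is None:
            raise ValueError("mate files contain different numbers of reads")
        out_handle.writelines(rec_1 + rec_2)
        pairs += 1
    return pairs
```

The sketch does not check that the read names of the two mates agree; a production script would normally add that check.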
Next, the 3 McrBC-digested libraries (200, 2,000 and 5,000 bp) were evaluated for their gene enrichment. SAMMate was run with the MF libraries against the sorghum genome, using Bowtie with 5 mismatches allowed across the whole read and the best-alignment option enabled. The sorghum genes were used as probes and the coverage was quantified. After that, reads hitting repetitive sequences were removed using Bowtie2 2.1.0 with the settings “-L 8 -N 0 --local -k 1”; reads that aligned under these criteria were excluded. The alignment was performed against the repetitive sequences from TIGR_Gramineae_Repeats.v3.3_0_0 (Ouyang and Buell, 2004). The remaining reads were shuffled (interleaved) within each paired library with an in-house Perl script. The interleaved reads were first assembled into contigs with SOAPdenovo (Luo et al., 2012) using a range of k-mer sizes (25, 31, 37, 41, 43, 49). The best k-mer size (k=49) was chosen based on size including N, size without N, mean size and number of contigs over 500 bp. The ABySS software (Simpson et al., 2009) was used as a second contig assembler with k=49. We scaffolded the pre-assembled contigs using SSPACE_basic v.2.0 (Boetzer et al., 2011) with the settings “-x 1 -m 30”. The SOAPdenovo scaffolds were also used to find the optimized assembly. The workflow with the main steps of the assemblies is shown in Figure S1. All assemblies were performed on a server with 8 Xeon 6-core E7540 2.00 GHz HT processors and 1 TB of RAM, and the assembly statistics were then computed. The best assembly (SOAPdenovo+SSPACE) was called MF scaffolds; these are available for download and BLAST-based similarity search at http://lbmp.bioqmed.ufrj.br/genome. Additionally, we predicted the sugarcane CDS and proteins from the SOAPdenovo+SSPACE scaffolds using AUGUSTUS version 2.5.5 (Stanke et al., 2008). The existing gene model of the closest organism was used (i.e. Rhizopus oryzae) to predict genes on sugarcane scaffolds.
Sugarcane EST data from the DFCI Sugarcane Gene Index Release 3.0 were used to guide gene prediction in AUGUSTUS. To filter out false-positive genes, predicted genes with no EST match were separated. The EST annotations were transferred using the best hit from a BLAT version 35x1 (Kent, 2002) alignment between ESTs and predicted protein sequences from sugarcane scaffolds. The predicted sugarcane CDS and proteins with annotation are also available for download and BLAST-based similarity search at http://lbmp.bioqmed.ufrj.br/genome.
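The four criteria used in the Assembly section to compare k-mer sizes (size including N, size without N, mean size and number of contigs over 500 bp) can be computed as in the Python sketch below. The function name and the dictionary keys are hypothetical, and "over 500 bp" is interpreted here as strictly longer than 500 bp, which is an assumption.

```python
def assembly_stats(sequences, min_length=500):
    """Summarise a list of contig or scaffold sequences given as strings."""
    lengths = [len(seq) for seq in sequences]
    total = sum(lengths)
    # Gap characters introduced by scaffolding are counted case-insensitively.
    n_bases = sum(seq.upper().count("N") for seq in sequences)
    return {
        "size_with_n": total,
        "size_without_n": total - n_bases,
        "mean_size": total / len(lengths) if lengths else 0.0,
        "count_over_min_length": sum(1 for length in lengths if length > min_length),
    }
```

Running the function on the output of each k-mer setting gives one row of a comparison table per assembly.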
ESTs and RNA-seq reads comparisons
Sugarcane ESTs were downloaded from the DFCI Sugarcane Gene Index Release 3.0 (http://compbio.dfci.harvard.edu/cgi-bin/tgi/download.pl?ftp_dir=data&file_dir=Saccharum_officinarum), which contains 121,342 sequences (88,397,709 total bases). To compare the assemblies with the EST dataset, an NCBI-BLASTN search with an E-value cutoff of 10⁻⁵ was performed. The alignment length of each EST sequence was extracted and the coverage in base pairs was computed.
A total of 8 RNA-seq libraries from in vitro-grown sugarcane plantlets of cultivar SP70-1143 and 9 RNA-seq libraries from young leaves of 3-month-old Saccharum officinarum stalks, available at the NCBI Sequence Read Archive under project SRP032773, were used to evaluate the assemblies (Table S4). Raw reads (55,635,715) were quality-checked with the same parameters as described for the McrBC-digested libraries using FASTX Toolkit version 0.0.6. Filtered RNA-seq reads (37,990,190) were mapped using Bowtie2 2.1.0 onto the existing assemblies (SOAPdenovo, ABySS+SSPACE and SOAPdenovo+SSPACE) with the settings “-L 16 -N 1 --local -k 1”. The unique IDs of the scaffolds were extracted, and the intersection between the scaffolds matched by sugarcane ESTs and those matched by RNA-seq reads was determined for each assembly.
Protein comparisons
The sorghum and Arabidopsis protein sequence sets were downloaded from JGI. Both files were last modified on 12/13/12 and contained 29,448 and 35,386 protein sequences, respectively. The protein datasets were further classified as “repeats,” “hypothetical,” and “known” using 3 criteria (Methods S1). A classified protein was considered tagged if it had a best match with an E-value ≤ 10⁻⁸ against the assemblies in an NCBI-BLASTX search. The unique IDs of the sorghum and Arabidopsis proteins that matched scaffolds from the 3 assemblies were extracted and the numbers of hits were computed.
BAC analysis
At the time of the analysis, 52 finished BAC clones from the R570 sugarcane cultivar were available at the GNPannot portal (http://gnpannot.cirad.fr/cgi-bin/gbrowse/sugarcane/). We selected the sequences of 20 BAC clones from the GNPannot portal (Methods S1). To assess gene and exon coverage, we performed an NCBI-BLASTN search with an E-value cutoff of 10⁻⁵ between the MF scaffolds (SOAPdenovo+SSPACE) and the gene and exon regions extracted from the BACs. The alignment coordinates were merged for each gene and exon and the base coverage was calculated.
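Merging alignment coordinates before counting covered bases prevents overlapping hits from being counted twice. The Python sketch below illustrates the idea for a single gene or exon; the function names are hypothetical, and coordinates are assumed to be 1-based and inclusive, as in BLAST tabular output, with reverse-strand hits reported as start > end.

```python
def merge_intervals(intervals):
    """Merge overlapping or directly adjacent 1-based inclusive intervals."""
    merged = []
    for start, end in sorted((min(s, e), max(s, e)) for s, e in intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def covered_fraction(intervals, feature_length):
    """Return the fraction of a feature covered by at least one alignment."""
    if feature_length <= 0:
        raise ValueError("feature_length must be positive")
    covered = sum(end - start + 1 for start, end in merge_intervals(intervals))
    return covered / feature_length
```

For example, hits at 1-100 and 50-150 on the same gene contribute 150 covered bases after merging rather than 201.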
Identification of sucrose/starch metabolism orthologous genes
The genes related to sucrose/starch metabolism were identified in the MapMan (Thimm et al., 2004) mapping files from Arabidopsis (Ath_AFFY_ATH1_TAIR10_Aug2012) and maize (Zm_B73_5b_FGS_cds_2012), available at http://mapman.gabipd.org/web/guest/mapmanstore. Both files contain hierarchical functional categories assigned BIN codes, together with the genes related to each category. We selected all genes related to major carbohydrate metabolism and to sucrose transport (BIN codes 2 and 34.2.1, respectively) for further annotation. The protein sequences of the Arabidopsis and maize sucrose/starch metabolism-related genes were used as the protein database in an NCBI-BLASTX search of the MF scaffolds (SOAPdenovo+SSPACE) with an E-value cutoff of 10⁻⁸. A sequence was annotated if the match showed more than 70% identity and overlapped the protein over a region of 50 amino acids or more.
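The annotation rule above (more than 70% identity, an overlap of at least 50 amino acids, E-value cutoff of 10⁻⁸) can be applied to BLASTX results as in the Python sketch below. It assumes the standard 12-column tabular BLAST output (query, subject, percent identity, alignment length, ..., E-value, bit score), in which the alignment length of a BLASTX hit is given in amino acids; the function names and the example identifiers are hypothetical.

```python
def parse_hit(line):
    """Parse one line of 12-column tabular BLAST output (column layout is an assumption)."""
    fields = line.rstrip("\n").split("\t")
    return {
        "query": fields[0],
        "subject": fields[1],
        "identity": float(fields[2]),
        "length": int(fields[3]),
        "evalue": float(fields[10]),
    }


def annotate(lines, min_identity=70.0, min_length=50, max_evalue=1e-8):
    """Return {scaffold: protein} using the first hit per scaffold that passes all filters."""
    annotations = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        hit = parse_hit(line)
        if (hit["identity"] > min_identity
                and hit["length"] >= min_length
                and hit["evalue"] <= max_evalue):
            annotations.setdefault(hit["query"], hit["subject"])
    return annotations
```

Because tabular BLAST output lists the hits of each query from best to worst, keeping the first passing hit per scaffold approximates a best-hit annotation.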
SNPs detection
We constructed and sequenced genomic DNA libraries from young leaves of S. spontaneum and S. officinarum (Methods S1). To call SNPs in both sugarcane genotypes, we used MAQ version 0.7.1 to align the sequenced reads to the scaffolds annotated as sucrose/starch metabolism genes. The easyrun method of the MAQ command line was used to generate the cns.final.snp file, which contains the quality-filtered SNPs. The following additional filters were used to post-process the called SNPs: (1) Phred-like consensus quality higher than 20; (2) read depth of the called SNP of at least 10 reads; (3) average number of hits of reads covering the SNP position equal to 1.00; (4) mapping quality of the reads covering the SNP position higher than 60; (5) minimum consensus quality in the 3 bp flanking regions higher than 60. After that, the MF scaffolds were submitted to FGENESH (monocot plants, available at http://linux1.softberry.com/berry.phtml?topic=fgenesh&group=programs&subgroup=gfind) for gene structure prediction. Another two scaffolds encoding sucrose phosphate synthase were selected and the SNP positions found in S. spontaneum and S. officinarum were marked.
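The five post-processing filters can be expressed as a single predicate over the lines of the cns.final.snp file. The Python sketch below assumes the tab-separated column order described in the MAQ documentation (sequence name, position, reference base, consensus base, consensus quality, read depth, average number of hits, highest mapping quality, minimum flanking consensus quality, followed by further columns); this layout should be verified against the MAQ version in use, and the function names are hypothetical.

```python
def parse_snp_line(line):
    """Parse one line of a MAQ SNP file (column order is an assumption)."""
    fields = line.rstrip("\n").split("\t")
    return {
        "scaffold": fields[0],
        "position": int(fields[1]),
        "reference": fields[2],
        "consensus": fields[3],
        "consensus_quality": int(fields[4]),
        "depth": int(fields[5]),
        "mean_hits": float(fields[6]),
        "mapping_quality": int(fields[7]),
        "flank_quality": int(fields[8]),
    }


def passes_filters(snp):
    """Apply filters (1) to (5) from the text to a parsed SNP record."""
    return (snp["consensus_quality"] > 20
            and snp["depth"] >= 10
            and snp["mean_hits"] == 1.0
            and snp["mapping_quality"] > 60
            and snp["flank_quality"] > 60)


def filter_snps(lines):
    """Return the parsed SNP records that pass all five filters."""
    records = (parse_snp_line(line) for line in lines if line.strip())
    return [snp for snp in records if passes_filters(snp)]
```

Requiring an average hit count of exactly 1.00 restricts the calls to positions covered only by uniquely placed reads, which is a conservative choice for a polyploid genome.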
miRNA analysis
The mature miRNA sequences were downloaded from miRBase database Release 19 (ftp://mirbase.org/pub/mirbase/CURRENT/) and the plant miRNAs were selected. In total, 5,837 miRNA sequences were used to search for the hairpin structures characteristic of miRNAs. For this purpose, we used the MF scaffolds (SOAPdenovo+SSPACE) as the target genomic sequence to match against known miRNAs. The MIRcheck software package (Jones-Rhoades and Bartel, 2004) was used to detect the presence of hairpins as described in Methods S1. Only hairpins containing miRNAs described in Arabidopsis, rice, sorghum and maize were taken into account when counting mature sequences and miRNA families. The MIRcheck pipeline was also applied to search for hairpin structures of miRNAs recently described for sugarcane (Thiebaut et al., 2012) (Methods S1).
Data Access
The unfiltered and methyl-filtered sequence data from this study have been submitted to the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi) under accession no. SRP023506. The sequence data from sugarcane wild species S. spontaneum and S. officinarum have been submitted to the NCBI Sequence Read Archive under accession no. SRP026249. The RNA-seq data from SP70-1143 and S. officinarum have been submitted to the NCBI Sequence Read Archive under accession no. SRP032773.
ACKNOWLEDGMENTS
We are grateful to Laboratório Multiusuário de Bioinformática da Embrapa and NACAD from COPPE/UFRJ for providing additional computational infrastructure. We thank Andréia Cordeiro for technical assistance. Research in our group is supported by the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Instituto Nacional de Ciência e Tecnologia em Fixação Biológica de Nitrogênio (INCT), Financiadora de Estudos e Projetos (FINEP), Fundação de Amparo à Pesquisa do Rio de Janeiro (FAPERJ) and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES).
Footnotes
Data Access: NCBI Sequence Read Archive accession no. SRP023506, SRP026249 and SRP032773.
SUPPORTING INFORMATION Additional Supporting Information can be found in the online version of this article.
REFERENCES
- Bartel DP. MicroRNAs: Genomics, Biogenesis, Mechanism, and Function. Cell. 2004;116:281–297. doi: 10.1016/s0092-8674(04)00045-5.
- Bartel DP, Lee R, Feinbaum R. MicroRNAs: Genomics, Biogenesis, Mechanism, and Function Genomics: The miRNA Genes. 2004;116:281–297. doi: 10.1016/s0092-8674(04)00045-5.
- Bedell JA, Budiman MA, Nunberg A, et al. Sorghum genome sequencing by methylation filtration. PLoS Biol. 2005;3:e13. doi: 10.1371/journal.pbio.0030013. Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=539327&tool=pmcentrez&rendertype=abstract. [Accessed August 27, 2010].
- Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics. 2011;27:578–9. doi: 10.1093/bioinformatics/btq683. Available at: http://www.ncbi.nlm.nih.gov/pubmed/21149342. [Accessed March 1, 2013].
- Bologna NG, Schapire AL, Palatnik JF. Processing of plant microRNA precursors. Brief. Funct. Genomics. 2013;12:37–45. doi: 10.1093/bfgp/els050. Available at: http://www.ncbi.nlm.nih.gov/pubmed/23148323. [Accessed March 11, 2013].
- Bombarely A, Rosli HG, Vrebalov J, Moffett P, Mueller LA, Martin GB. A Draft Genome Sequence of Nicotiana benthamiana to Enhance Molecular Plant-Microbe Biology Research. 2012;25:1523–1530. doi: 10.1094/MPMI-06-12-0148-TA.
- Bull TA, Glasziou KT. The evolutionary significance of sugar accumulation in Saccharum. Aust. J. Biol. Sci. 1963;16:737–742.
- Bundock PC, Casu RE, Henry RJ. Enrichment of genomic DNA for polymorphism detection in a non-model highly polyploid crop plant. Plant Biotechnol. J. 2012;10:657–67. doi: 10.1111/j.1467-7652.2012.00707.x. Available at: http://www.ncbi.nlm.nih.gov/pubmed/22624722. [Accessed March 26, 2013].
- Butterfield M, D'Hont A, Berding N. The sugarcane genome: a synthesis of current understanding, and lessons for breeding and biotechnology. Proc S Afr Sug Technol Ass. 2001;75:1–5.
- Cheavegatti-Gianotto A, de Abreu HMC, Arruda P, et al. Sugarcane (Saccharum X officinarum): A Reference Study for the Regulation of Genetically Modified Cultivars in Brazil. Trop. Plant Biol. 2011;4:62–89. doi: 10.1007/s12042-011-9068-3. Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3075403&tool=pmcentrez&rendertype=abstract. [Accessed August 20, 2011].
- Cuperus JT, Fahlgren N, Carrington JC. Evolution and Functional Diversification of MIRNA Genes. Society. 2011;23:431–442. doi: 10.1105/tpc.110.082784.
- Doyle JJ, Flagel LE, Paterson AH, Rapp RA, Soltis DE, Soltis PS, Wendel JF. Evolutionary genetics of genome merger and doubling in plants. Annu. Rev. Genet. 2008;42:443–61. doi: 10.1146/annurev.genet.42.110807.091524. Available at: http://www.ncbi.nlm.nih.gov/pubmed/18983261. [Accessed July 22, 2011].
- Edwards D, Batley J. Plant genome sequencing: applications for crop improvement. Plant Biotechnol. J. 2010;8:2–9. doi: 10.1111/j.1467-7652.2009.00459.x. Available at: http://www.ncbi.nlm.nih.gov/pubmed/19906089. [Accessed March 18, 2013].
- Feuillet C, Leach JE, Rogers J, Schnable PS, Eversole K. Crop genome sequencing: lessons and rationales. Trends Plant Sci. 2011;16:77–88. doi: 10.1016/j.tplants.2010.10.005. Available at: http://www.ncbi.nlm.nih.gov/pubmed/21081278. [Accessed March 6, 2013].
- Fu Y, Hsia A, Guo L, Schnable PS. Types and Frequencies of Sequencing Errors in Methyl-Filtered and High C0t Maize Genome Survey Sequences. Genome Anal. 2004;135:2040–2045. doi: 10.1104/pp.104.041640.
- Grativol C, Hemerly AS, Ferreira PCG. Genetic and epigenetic regulation of stress responses in natural plant populations. Biochim. Biophys. Acta. 2011:13–15. doi: 10.1016/j.bbagrm.2011.08.010. Available at: http://www.ncbi.nlm.nih.gov/pubmed/21914492. [Accessed September 26, 2011].
- Grivet L, Arruda P. Sugarcane genomics: depicting the complex genome of an important tropical crop. Curr. Opin. Plant Biol. 2002;5:122–7. doi: 10.1016/s1369-5266(02)00234-0. Available at: http://www.ncbi.nlm.nih.gov/pubmed/11856607.
- Hamilton JP, Buell CR. Advances in plant genome sequencing. Plant J. 2012;70:177–90. doi: 10.1111/j.1365-313X.2012.04894.x. Available at: http://www.ncbi.nlm.nih.gov/pubmed/22449051. [Accessed March 5, 2013].
- Jones-Rhoades MW, Bartel DP. Computational identification of plant microRNAs and their targets, including a stress-induced miRNA. Mol. Cell. 2004;14:787–99. doi: 10.1016/j.molcel.2004.05.027. Available at: http://www.ncbi.nlm.nih.gov/pubmed/15200956.
- Kent WJ. BLAT - The BLAST-Like Alignment Tool. Genome Res. 2002;12:656–664. doi: 10.1101/gr.229202. Available at: http://www.genome.org/cgi/doi/10.1101/gr.229202. [Accessed March 8, 2012].
- Liu L, Li Y, Li S, Hu N, He Y, Pong R, Lin D, Lu L, Law M. Comparison of next-generation sequencing systems. J. Biomed. Biotechnol. 2012;2012:251364. doi: 10.1155/2012/251364. Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3398667&tool=pmcentrez&rendertype=abstract. [Accessed February 28, 2013].
- Luo R, Liu B, Xie Y, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience. 2012;1:18. doi: 10.1186/2047-217X-1-18. Available at: http://www.gigasciencejournal.com/content/1/1/18.
- Meng Y, Shao C, Wang H, Chen M. The regulatory activities of plant microRNAs: a more dynamic perspective. Plant Physiol. 2011;157:1583–95. doi: 10.1104/pp.111.187088. Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3327222&tool=pmcentrez&rendertype=abstract. [Accessed February 27, 2013].
- Mishra NS, Mukherjee SK. A Peep into the Plant miRNA World. Open Plant Sci. J. 2007;1:1–9. Available at: http://www.benthamopen.org/pages/content.php?TOPSJ/2007/00000001/00000001/1TOPSJ.SGM.
- Moore PH. Integration of sucrose accumulation processes across hierarchical scales: towards developing an understanding of the gene-to-crop continuum. F. Crop. Res. 2005;92:119–135. Available at: http://linkinghub.elsevier.com/retrieve/pii/S0378429005000250. [Accessed May 3, 2012].
- Morrell PL, Buckler ES, Ross-Ibarra J. Crop genomics: advances and applications. Nat. Publ. Gr. 2011;13:85–96. doi: 10.1038/nrg3097. Available at: http://dx.doi.org/10.1038/nrg3097.
- Nelson W, Luo M, Ma J, et al. Methylation-sensitive linking libraries enhance gene-enriched sequencing of complex genomes and map DNA methylation domains. BMC Genomics. 2008;9:621. doi: 10.1186/1471-2164-9-621. Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2628917&tool=pmcentrez&rendertype=abstract. [Accessed April 3, 2013].
- Ouyang S, Buell CR. The TIGR Plant Repeat Databases: a collective resource for the identification of repetitive sequences in plants. Nucleic Acids Res. 2004;32:D360–3. doi: 10.1093/nar/gkh099. Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=308833&tool=pmcentrez&rendertype=abstract. [Accessed March 5, 2012].
- Palmer LE, Rabinowicz PD, O'Shaughnessy AL, Balija VS, Nascimento LU, Dike S, de la Bastide M, Martienssen RA, McCombie WR. Maize genome sequencing by methylation filtration. Science. 2003;302:2115–7. doi: 10.1126/science.1091265. Available at: http://www.ncbi.nlm.nih.gov/pubmed/14684820. [Accessed September 3, 2010].
- Papini-Terzi FS, Rocha FR, Vêncio RZN, et al. Sugarcane genes associated with sucrose content. BMC Genomics. 2009;10:120. doi: 10.1186/1471-2164-10-120. Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2666766&tool=pmcentrez&rendertype=abstract. [Accessed March 3, 2013].
- Peterson DG, Wessler SR, Paterson AH. Efficient capture of unique sequences from eukaryotic genomes. Trends Genet. 2002;18:547–50. doi: 10.1016/s0168-9525(02)02764-6. Available at: http://www.ncbi.nlm.nih.gov/pubmed/12414178.
- Rabinowicz PD, Citek R, Budiman MA, et al. Differential methylation of genes and repeats in land plants. Genome Res. 2005;15:1431–40. doi: 10.1101/gr.4100405. Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1240086&tool=pmcentrez&rendertype=abstract. [Accessed March 2, 2013].
- Renny-Byfield S, Chester M, Kovar A, et al. Next Generation Sequencing Reveals Genome Downsizing in Allotetraploid Nicotiana tabacum, Predominantly through the Elimination of Paternally Derived Repetitive DNAs. 2011;28:2843–2854. doi: 10.1093/molbev/msr112.
- Roulin A, Auer PL, Libault M, Schlueter J, Farmer A, May G, Stacey G, Doerge RW, Jackson SA. The fate of duplicated genes in a polyploid plant genome. 2013; pp. 143–153.
- Scheibye-Alsing K, Hoffmann S, Frankel A, et al. Sequence assembly. Comput. Biol. Chem. 2009;33:121–36. doi: 10.1016/j.compbiolchem.2008.11.003. Available at: http://www.ncbi.nlm.nih.gov/pubmed/19152793. [Accessed June 11, 2011].
- Shangguan L, Han J, Kayesh E, Sun X, Zhang C, Pervaiz T, Wen X, Fang J. Evaluation of genome sequencing quality in selected plant species using expressed sequence tags. PLoS One. 2013;8:e69890. doi: 10.1371/journal.pone.0069890. Available at: http://www.ncbi.nlm.nih.gov/pubmed/23922843. [Accessed August 8, 2013].
- Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19:1117–23. doi: 10.1101/gr.089532.108. Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2694472&tool=pmcentrez&rendertype=abstract. [Accessed February 28, 2013].
- Smith AM. Prospects for increasing starch and sucrose yields for bioethanol production. Plant J. 2008;54:546–58. doi: 10.1111/j.1365-313X.2008.03468.x. Available at: http://www.ncbi.nlm.nih.gov/pubmed/18476862. [Accessed March 3, 2013].
- Souret F, Lu C, Green PJ, Meyers BC. Sweating the small stuff: microRNA discovery in plants. Curr. Opin. Biotechnol. 2005; pp. 1–8.
- Stanke M, Diekhans M, Baertsch R, Haussler D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics. 2008;24:637–44. doi: 10.1093/bioinformatics/btn013. Available at: http://www.ncbi.nlm.nih.gov/pubmed/18218656. [Accessed December 14, 2013].
- Thiebaut F, Grativol C, Carnavale-Bottino M, Rojas CA, Tanurdzic M, Farinelli L, Martienssen RA, Hemerly AS, Ferreira PC. Computational identification and analysis of novel sugarcane microRNAs. BMC Genomics. 2012;13:290. doi: 10.1186/1471-2164-13-290. Available at: http://www.ncbi.nlm.nih.gov/pubmed/22747909. [Accessed July 16, 2012].
- Thimm O, Bläsing O, Gibon Y, et al. Mapman: a User-Driven Tool To Display Genomics Data Sets Onto Diagrams of Metabolic Pathways and Other Biological Processes. Plant J. 2004;37:914–939. doi: 10.1111/j.1365-313x.2004.02016.x. Available at: http://doi.wiley.com/10.1111/j.1365-313X.2004.02016.x. [Accessed February 27, 2013].
- Vaucheret H. Post-transcriptional small RNA pathways in plants: mechanisms and regulations. Genes Dev. 2006;20:759–71. doi: 10.1101/gad.1410506. Available at: http://www.ncbi.nlm.nih.gov/pubmed/16600909.
- Vettore L, Silva FR, Kemper EL, et al. Analysis and Functional Annotation of an Expressed Sequence Tag Collection for Tropical Crop Sugarcane. Genome Res. 2003; pp. 2725–2735.
- Voinnet O. Origin, biogenesis, and activity of plant microRNAs. Cell. 2009;136:669–87. doi: 10.1016/j.cell.2009.01.046. Available at: http://www.ncbi.nlm.nih.gov/pubmed/19239888. [Accessed March 1, 2013].
- Wang J, Roe B, Macmil S, et al. Microcollinearity between autopolyploid sugarcane and diploid sorghum genomes. BMC Genomics. 2010;11:261. doi: 10.1186/1471-2164-11-261. Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2882929&tool=pmcentrez&rendertype=abstract.
- Wang K, Wang Z, Li F, et al. The draft genome of a diploid cotton Gossypium raimondii. Nat. Genet. 2012;44:1098–103. doi: 10.1038/ng.2371. Available at: http://www.ncbi.nlm.nih.gov/pubmed/22922876. [Accessed March 10, 2013].
- Wang Y, Wang X, Paterson AH. Genome and gene duplications and gene expression divergence: a view from plants. 2012;1256:1–14. doi: 10.1111/j.1749-6632.2011.06384.x.
- Wei F, Stein JC, Liang C, et al. Detailed analysis of a contiguous 22-Mb region of the maize genome. PLoS Genet. 2009;5:e1000728. doi: 10.1371/journal.pgen.1000728. Available at: http://www.ncbi.nlm.nih.gov/pubmed/19936048.
- Young AL, Abaan HO, Zerbino D, Mullikin JC, Birney E, Margulies EH. A new strategy for genome assembly using short sequence reads and reduced representation libraries. Genome Res. 2010;20:249–56. doi: 10.1101/gr.097956.109. Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2813480&tool=pmcentrez&rendertype=abstract. [Accessed July 18, 2011].
- Zanca AS, Vicentini R, Ortiz-Morea FA, Del Bem LEV, da Silva MJ, Vincentz M, Nogueira FTS. Identification and expression analysis of microRNAs and targets in the biofuel crop sugarcane. BMC Plant Biol. 2010;10:260. doi: 10.1186/1471-2229-10-260. Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3017846&tool=pmcentrez&rendertype=abstract. [Accessed March 15, 2012].
- Zhang G, Liu X, Quan Z, et al. Genome sequence of foxtail millet (Setaria italica) provides insights into grass evolution and biofuel potential. Nat. Biotechnol. 2012. doi: 10.1038/nbt.2195. Available at: http://www.nature.com/doifinder/10.1038/nbt.2195. [Accessed May 13, 2012].
- Zhang J, Arro J, Chen Y, Ming R. Haplotype analysis of sucrose synthase gene family in three Saccharum species. BMC Genomics. 2013;14:314. doi: 10.1186/1471-2164-14-314. Available at: http://www.ncbi.nlm.nih.gov/pubmed/23663250. [Accessed May 25, 2013].
- Zhou X, Ren L, Meng Q, Li Y, Yu Y, Yu J. The next-generation sequencing technology and application. Protein Cell. 2010;1:520–36. doi: 10.1007/s13238-010-0065-3. Available at: http://www.ncbi.nlm.nih.gov/pubmed/21204006. [Accessed February 28, 2013].