Abstract
Genome duplication with hybridization, or allopolyploidization, occurs in animals, fungi and plants, and is especially common in crop plants. There is an increasing interest in the study of allopolyploids because of advances in polyploid genome assembly; however, the high level of sequence similarity in duplicated gene copies (homeologs) poses many challenges. Here we compared standard RNA-seq expression quantification approaches used currently for diploid species against subgenome-classification approaches which maps reads to each subgenome separately. We examined mapping error using our previous and new RNA-seq data in which a subgenome is experimentally added (synthetic allotetraploid Arabidopsis kamchatica) or reduced (allohexaploid wheat Triticum aestivum versus extracted allotetraploid) as ground truth. The error rates in the two species were very similar. The standard approaches showed higher error rates (>10% using pseudo-alignment with Kallisto) while subgenome-classification approaches showed much lower error rates (<1% using EAGLE-RC, <2% using HomeoRoq). Although downstream analysis may partly mitigate mapping errors, the difference in methods was substantial in hexaploid wheat, where Kallisto appeared to have systematic differences relative to other methods. Only approximately half of the differentially expressed homeologs detected using Kallisto overlapped with those by any other method in wheat. In general, disagreement in low-expression genes was responsible for most of the discordance between methods, which is consistent with known biases in Kallisto. We also observed that there exist uncertainties in genome sequences and annotation which can affect each method differently. Overall, subgenome-classification approaches tend to perform better than standard approaches with EAGLE-RC having the highest precision.
Keywords: allopolyploid, homeolog, expression quantification, RNA-seq, subgenome classification
Introduction
Genome duplication, termed polyploidization, is widespread in plants with up to 35% of land plants being recent polyploids [1]. Many crop species in particular are allopolyploids [2], which involves the hybridization of two different species with genome duplication. Thus, there is much interest in the study of genome duplication and the advantages or disadvantages this phenomenon may convey. To explore the underlying mechanisms that may provide the raw genetic material for adaptation, many gene expression studies have been conducted on both natural and synthetic allopolyploid species [3–8]. Allopolyploid species have traditionally been difficult to analyze at the whole genome scale because of the large size of their genomes and the high levels of sequence similarity between duplicated chromosomes. These duplicated gene copies, called homeologs, are in general highly similar and pose challenges to gene expression analyses. However, these homeologs and bias in their expression are of great interest because they potentially contribute to adaptation in polyploid species [9–11].
Recent improvements in long sequencing read technologies and assembly strategies [12–14] have allowed for breakthroughs in polyploid genome assembly. In plant biology especially, recent allopolyploid species such as bread wheat [15] as well as many other agriculturally important plant species have benefited [16, 17]. Now that de novo genome assembly for allopolyploids is no longer as formidable as it once was, a large number of polyploid reference genomes are expected to become available in the near future to facilitate genome-wide studies. Accordingly, there is a need to evaluate expression quantification methods given the presence of homeologs in allopolyploids.
High levels of sequence similarity pose many challenges to read mapping and consequently, expression quantification as well as other types of sequence analysis. Mapping bias was shown to be an issue in numerous studies on diploid allele specific expression [18–21], which is analogous to the analysis of homozygous tetraploids. In polyploid studies, sequence ambiguity between homeologs has been shown to affect read mapping [22, 23], especially if the assembly or annotation quality is asymmetric [24].
In this study we pose the question ‘How do subgenome-classification approaches perform in comparison to traditional RNA-seq methods in allopolyploids?’ and evaluate different approaches and methods for homeolog expression quantification in allopolyploids. To evaluate methods in polyploids, analysis on genetic materials with and without subgenomes is highly valuable as a form of ground truth. Here we used synthetic allotetraploid Arabidopsis kamchatica (Fisch. ex DC.) K. Shimizu & Kudoh and performed tests with its two direct parental accessions of Arabidopsis halleri and Arabidopsis lyrata. For hexaploid wheat Triticum aestivum Chinese Spring [25] with subgenomes ABD, we performed tests with tetra-Chinese Spring (subgenomes AB) where subgenome D was experimentally removed [26] as well as Aegilops tauschii the diploid progenitor of wheat subgenome D.
Although genome-alignment tools designed for diploid species such as STAR [27], TopHat [28], LAST [29, 30] and GSNAP [31] or pseudo-alignment-based method Kallisto [32] have often been used for RNA-seq analysis in allopolyploid species [25, 33–37], bioinformatic algorithms specific for allopolyploid species have been developed. Among them, HomeoRoq [7] and PolyDog [24] represent similar subgenome-classification approaches, in which alignment quality to each subgenome is considered. PolyCat [38] and EAGLE-RC [39] identify and utilize explicit genotype differences between subgenomes. Among them, PolyCat requires picking one subgenome as the reference and constructing a homeolog SNP-index as opposed to read alignment to all subgenomes, while EAGLE-RC identifies and utilizes explicit genotype differences between subgenomes to calculate the read sequence likelihood for each subgenome alignment. HyLiTE [40] requires sequencing data from all parental species as well as the child polyploid species; thus, it was not suitable for the analysis of hexaploid wheat as the subgenome B progenitor is still unknown. A systematic comparison between subgenome-classification methods would be necessary to evaluate them, though there are commonalities in their approach, and is outside the scope of this review.
We test, to the best of our knowledge, all general approaches to quantify expression in polyploids:
(i) A standard genome-alignment-based RNA-seq analysis on the full allopolyploid reference genome with two different alignment tools STAR and LAST, with read counts obtained using featureCounts [41].
(ii) A pseudo-alignment-based method with Kallisto on the full allopolyploid transcriptome.
(iii) A subgenome-classification approach with HomeoRoq, which maps read sequences to each subgenome separately, with read counts obtained using featureCounts.
(iv) A subgenome-classification approach with EAGLE-RC, which maps read sequences to each subgenome separately and also explicitly uses genotype variations that discriminate between homeologs as constraints in analysis, with read counts obtained using featureCounts.
Our results show that EAGLE-RC had the lowest error rate (A. kamchatica, 0.40%; hexaploid wheat, 0.49%) for alignments to the correct subgenome, while Kallisto had the highest error rate (A. kamchatica, 12.43%; hexaploid wheat, 13.44%). LAST and STAR had similar error rates in A. kamchatica, but LAST was more precise in hexaploid wheat. In general, performance between methods was comparable for A. kamchatica but not for hexaploid wheat. We also observed systematic differences in low-expression genes that impacted the homeolog expression bias results. Other concerns include uncertainty in the completeness and accuracy of the genome sequence and annotation, which affected each method differently due to differences in constraints. This may be especially relevant as polyploid species have only recently begun to be sequenced and assembled in large numbers and the gene annotations are in their 1st iterations. In the face of this uncertainty, EAGLE-RC is the most precise in our evaluations.
Methods
We evaluated methods on two allopolyploid species, tetraploid A. kamchatica and hexaploid wheat T. aestivum. All codes used in this study are described in Supplemental Methods, given as a series of shell script commands.
RNA sequence data and reference genomes
The natural species A. kamchatica [42, 43] was derived from two diploid species A. halleri and A. lyrata recently [8, 44]. It is a model polyploid with a broad distribution range, self-compatibility and transformation technique [44, 45]. To construct synthetic polyploids, we used two highly homozygous parental accessions used for genome assembly: A. halleri Tada mine W302 (ver 2.2, scaffolds N50 712 kb) [46] and A. lyrata lyrpet4 (ver 2.2, scaffolds N50 1.2 Mb) [8]. The two genotypes were crossed then the genome doubling was induced by colchicine treatment. Although synthetic polyploidization may occasionally activate transposable elements or induce chromosomal rearrangements [47], the subgenomes of the synthetic polyploid were derived from the merging of the two parental genomes and are highly or completely identical, providing a unique opportunity to evaluate RNA-seq methods in allopolyploids.
To assess classification accuracy in synthetic A. kamchatica, we used data from the parental species so that the ground truth of a read’s subgenome origin is known. A. halleri subsp. gemmifera and A. lyrata subsp. patraea RNA-seq data [48] were obtained under DDBJ accession DRP003263 submission DRA004364. Briefly, this data set consists of four samples each for A. halleri and A. lyrata with 2 x 100 bp paired-end reads for a total of ~20 and ~21 Gb, respectively. For differential homeolog expression analysis, synthetic allotetraploid A. kamchatica RNA-seq data [7] was obtained under DDBJ accession DRP01140. Briefly, this data set consists of three biological replicates of A. kamchatica before and after cold stress with 2 x 100 bp paired-end reads for a total of ~12 and ~10 Gb, respectively.
The whole genome assembly and annotation for hexaploid wheat T. aestivum with subgenomes ABD were obtained from the International Wheat Genome Sequencing Consortium [25] (assembly ver 1.0 and annotation ver 1.0, N50 22.8 Mb). The assembly quality is at the chromosome level and the reference genome was split into subgenomes A, B and D allowing for separate read mapping in subgenome-classification methods.
To assess classification accuracy in wheat, we generated new tetra-Chinese Spring and A. tauschii data to use in our evaluations. We utilized tetra-Chinese Spring in which the subgenomes AB were extracted by removing subgenome D by repeated backcrossing [26]. Thus, the genome sequence of tetra-Chinese Spring must be very close to the subgenomes AB of hexaploid Chinese Spring. Because there is no genetic material of extracted D genomes from hexaploid wheat, we used A. tauschii KU-2076 (resource in Kyoto University, collected in Iran). Subgenome D of hexaploid wheat is known to be derived from this species, thus it should be highly similar though there is divergence due to within-species variations [49]. We constructed RNA-seq data sets of tetra-Chinese Spring and A. tauschii in triplicate for a total of 2.1 and 2.8 Gb, respectively, under DDBJ run accessions DRR155268–DRR155273. For differential homeolog expression analysis, T. aestivum RNA-seq data were obtained from National Center for Biotechnology Information (BioProject PRJEB12358) with SRA accessions ERR1201752-ERR1201754 and ERR1201770-ERR1201772 describing samples, in triplicate, 24 h after inoculation of fungal pathogen Fusarium graminearum and mock inoculation for a total of 17.3 Gb and 16.1 GB, respectively.
Plant growth, RNA isolation and sequencing
Tetra-Chinese Spring and A. tauschii were grown at 16°C in 8 h light/16 h dark cycle with 60% relative humidity for 2 weeks, and leaf tissues were harvested. RNA was extracted from each tissue using RNeasy Plant Mini Kits (QIAGEN, Hilden, Germany) in combination with DNase I treatment (QIAGEN). Illumina sequencing libraries were made by TruSeq Stranded mRNA Library Prep Kit. RNA-seq was conducted using Illumina HiSeq 4000 at the Functional Genomics Centre, Zurich.
Homeolog identification
To annotate homeologs in A. kamchatica, we constructed RNA transcripts from the gene models in the A. halleri and A. lyrata gene annotations using gffread (v0.9.12). Homeologs were then identified based on reciprocal best hit for each subgenome’s transcripts using LAST (v809). We required hits to have an E-value less than 10−10 with at least 200 aligned bases in both transcripts, resulting in 24329 homeolog pairs identified. This method of identifying homeologs is simple but efficient and also allows for the enumeration of genotype differences between subgenomes, a necessary resource for EAGLE-RC.
To annotate homeologs in T. aestivum, we constructed RNA transcripts from high-confidence gene models belonging to subgenomes A, B and D, including the UTR regions. We then identified pair-wise homeologs through reciprocal best hit for combinations AB, AD and BD (26556, 27360 and 27311, respectively) and then triple copy homeologs by checking for genes in AB that share the same hit for D in their respective AD and BD hits, resulting in 21196 triple copy homeologs identified. Though these gene models were not used in this study, for reference, we discovered 3977, 4781 and 4814 two-copy homeologs in AB, AD and BD, respectively. For singletons, there were 13 459, 14 093 and 11 551 out of 44 796, 45 463 and 43 754 total gene models in subgenomes A, B and D, respectively.
Standard RNA-seq analysis
We tested a standard genome-alignment-based RNA-seq expression quantification approach (Figure 1A) aimed at differential expression analysis [50] that is often used for diploids:
(i) Map reads to the allopolyploid reference genome—STAR (v2.5.2b) [27] and LAST (v809) [29, 30].
(ii) Count reads using featureCounts (v1.5.1) [41] at the transcript level.
(iii) Extract homeolog specific read counts.
Other miscellaneous sequence-processing tasks such as indexing and sorting were done using samtools (v1.5.2).
To construct the reference genome of A. kamchatica, we concatenated the reference genomes of its two parental species, A. halleri and A. lyrata, to obtain an allopolyploid reference. For LAST, we also filtered out read alignments with MAPQ scores less than 20, while STAR MAPQ scores are not as useful for thresholding. In T. aestivum, we excluded all reads that mapped to chrUn.
We tested a pseudo-alignment RNA expression quantification workflow (Figure 1B) using Kallisto (v0.43.1) [32], which is also often used for diploids. The built-in expression quantification in Kallisto was used to count reads (est_counts) at the transcript level. When evaluating classification accuracy, we used the pseudobam option in Kallisto to output read assignments and evaluate the proportion which were misassigned.
Subgenome-classification analysis
We tested a subgenome-classification approach where, in contrast to the standard approach, the sequencing reads are mapped separately to each subgenome of an allopolyploid’s reference genome (Figures 1C and D) using STAR. Then we utilized a read classifier to assign reads to their subgenome origin, if possible. We tested read classifiers methods HomeoRoq (last updated 11 August 2014) and EAGLE-RC (v1.0.0).
HomeoRoq [7] (Figure 1C) classifies reads based on the number of mismatches, up to a maximum of 10, between the read and the genome sequence of each subgenome, where reads must be mappable to both subgenomes to be considered. In contrast with EAGLE-RC, which requires computing the subgenome-discriminating variants, HomeoRoq does not require comparative analysis between different subgenomes in advance of classification.
The basic EAGLE model [39] is a generative model for read sequences used to calculate the likelihood of a read given a reference genotype hypothesis and an alternative genotype hypothesis. In EAGLE-RC (Figure 1D), the basic model was extended to perform read classification using the variants that discriminate between homeologs. During homeolog identification, subgenome-specific genotype differences (i.e. variants) that discriminate between homeologs are determined. For A. kamchatica, reads are mapped to the reference genomes of A. halleri (H), and A. lyrata (L), separately. EAGLE-RC then calculates the probability for read r given each subgenome as the reference hypothesis and the other subgenome as the alternative hypothesis :
where the classification is determined by the reference with the highest probability, requiring the winning hypothesis to be at least probability 0.95 with marginal probability at least 0.51; otherwise it is ‘unknown’, where the marginal probability is the proportion of the winning hypothesis over the sum of the probabilities for all subgenome reference hypotheses.
For hexaploid T. aestivum subgenome classification, we performed a bottom-up workflow using a series of pair-wise classifications (Figure 2). For HomeoRoq, we determined, if possible, the consensus classification from pair-wise classifications. For example, a read is classified as A if there is a consensus A classification in both AB and AD pair-wise classification comparisons.
For hexaploid subgenome classification with EAGLE-RC, the pair-wise likelihoods per read were calculated, where for each reference hypothesis , there are two alternative genome hypotheses, and . The probability of a read belonging to a given reference is then as follows:
where the classification is determined by the reference with the highest probability, requiring the winning hypothesis to be at least probability 0.95 with marginal probability at least 0.51, otherwise it is ‘unknown’.
Differentially expressed homeologs
To identify differentially expressed homeologs (DEHs) in A. kamchatica, we used DESeq2 (v1.20.0) [51] in the R environment (v3.5.1) on the read count data for the A. halleri-derived subgenome and A. lyrata-derived subgenome separately. We required a homeolog to be differentially expressed with 0.05 or better false discovery rate (Benjamini–Hochberg FDR) in at least one of the subgenomes. We then performed Fisher’s exact test in R on the combined read counts to determine significant differential homeolog expression ratio (P < 0.05 and fold change ≥ 2). This definition of differential expression requires significant change in gene expression in at least one subgenome according to DESeq2 along with significant change in the homeolog expression ratio between subgenomes according to Fisher’s exact test, under different conditions. There may be a multiple-testing issue with the Fisher’s exact tests for selected homeologs but we did not perform further FDR corrections.
To find DEH in hexaploid T. aestivum, we used DESeq2 on the read count data for each of subgenomes A, B and D separately, requiring a gene to be differentially expressed with 0.05 or better FDR in at least one of the subgenomes across three pair-wise tests. We then performed a series of Fisher’s exact tests on the combined read counts (A versus BD, B versus AD and D versus AB) to test for differential homeolog expression ratio. A homeolog is differentially expressed if at least one test has P < 0.05/3 (Bonferroni correction) and fold change ≥ 2. Similar to the tetraploid analysis, we did not perform further FDR corrections.
Results
A. kamchatica read classification
Generally, many types of sequence analysis rely on how accurately read sequences can be mapped to a reference. In the case of allopolyploids, read alignment must deal with a higher degree of repetitiveness than in diploids due to homeologs.
We tested the standard genome alignment approach using the widely used RNA-seq read mapping tool, STAR. Though this tool is often used in diploid read mapping, it is unclear whether it will be suitable for allopolyploids. Unfortunately, the mapping quality score from STAR is not suitable for thresholding ``uniqueness” due to how it assigns scores in an almost binary manner. Thus, we also tested LAST, a general sequence alignment tool that can estimate the probability that an alignment represents the genomic source of the read. For example, a given read aligns to a single location with no mismatches, but aligns to five other locations with one mismatch each. This read may be deemed a unique best hit, but there may be a reasonable probability, depending on read length, that it came from any of the other alignments with a single base-calling error. LAST allows us to set a degree of uniqueness (i.e. 0.05 mismap probability) as a cut-off threshold that is convenient for handling this type of uncertainty.
We examined tetraploid A. kamchatica first because it may be less complicated than hexaploid wheat, which we describe later in this study. Here, we evaluated the accuracy of each tool and each approach by how well they assigned reads to the correct subgenome. For A. kamchatica, the ground truth is known by testing with pure A. halleri and A. lyrata RNA-seq data. Table 1 shows the classification performance of each approach for A. kamchatica. One point to note is that the mapping rate is affected by alignments to non-homeologs, thus we also showed the number of homeologs with detected expression in our evaluations.
Table 1.
Classification error | A. halleri | A. lyrata | |||
---|---|---|---|---|---|
Mapped, % | H to L, % | L to H, % | Homeologs, % | Homeologs, % | |
STAR | 93.40 | 11.09 | 11.40 | 90.34 | 90.55 |
LAST | 71.59 | 2.87 | 2.50 | 90.10 | 90.39 |
Kallisto | 92.22 | 12.42 | 12.44 | 95.34 | 95.50 |
HomeoRoq | 93.67 | 1.64 | 1.23 | 89.28 | 80.78 |
EAGLE-RC | 93.67 | 0.48 | 0.33 | 87.90 | 88.60 |
We calculated the precision by using classification error rate as the criterion. It is clear that subgenome-classification approaches (HomeoRoq and EAGLE-RC) performed better than the standard alignment-based approaches (STAR, LAST and Kallisto) using a concatenated genome or pseudo-alignment on a concatenated transcriptome. In the standard alignment-based approach using a concatenated reference genome, LAST’s mismap probability model was seen to be beneficial to read classification showing a much lower classification error rate than STAR, with only a slightly lower number of expressed homeologs detected.
Kallisto showed the lowest precision among all methods, though it showed the highest number of expressed homeologs detected. That Kallisto’s performance was the most affected by the presence of homeologs is perhaps due to a reduction in the number of unique kmers, relative to diploid analysis, which is essential for the method to find unique read-to-gene associations. To quantify the reduction in the number of unique kmers in a tetraploid reference relative to diploid reference, we performed a simulation with 1000 trials of 100 randomly selected A. halleri genes with randomly assigned SNPs at varying degrees of sequence divergence (Supplementary data, Table S1). This analysis shows that there is a large reduction in the number of unique kmers given one extra gene copy depending on the pair-wise sequence divergence. For a point of reference, A. kamchatica is estimated to have ~2–3% divergence between homeologs [7].
EAGLE-RC showed the highest precision in read classification though it had a fewer expressed homeologs detected. HomeoRoq was less precise than EAGLE-RC while having a similar number of expressed homeologs detected. The main difference in methodology between EAGLE-RC and HomeoRoq is that EAGLE-RC utilizes genotype information explicitly while HomeoRoq relies on comparing the number of mismatches, implicitly comparing genotype differences. However, simply comparing the number of mismatches is susceptible to spurious mappings, because a read alignment with 9 versus 10 mismatches favors the one with 9 mismatches to the same degree as it would favor an alignment with 0 versus 1 mismatch.
Another point to consider is that the quality of the reference may differ between subgenomes and there may be uncertainty from missing regions or erroneous annotations. HomeoRoq requires reads to be mappable to both references, which constrains read counting to genome regions that exist in both subgenomes. However, this comes with the disadvantage that more divergent regions may be excluded due to reads being mappable to only one homeolog. EAGLE-RC requires reads to be mapped to regions that are different between homeologs and thus can be classified based on those differences. This constrains analysis to regions in the homeologs’ gene models that can be pair-wise aligned to compute the genotype differences between homeologs. We performed a simulation analysis where reads were simulated using ART [52] from annotated gene models with no divergence from the reference and calculated the classification performance (Supplementary data, Table S2). The results show that even in this ideal scenario, there were reads which could not be classified with certainty by LAST, HomeoRoq and EAGLE-RC due to reads mapping equally well to both subgenomes. It also shows that in ideal conditions, STAR and Kallisto, despite higher error rates, were excellent in their true positive rates. However, the ideal condition of data with no divergence from the reference, no sequences outside of annotated genes, and perfectly reflective of gene models is not realistic in practice.
A. kamchatica homeolog expression quantification
Read mapping is the 1st step in many types of sequence analysis and is the foundation for many methods that rely on processing read alignments. However, read-counting methods and differential expression analysis may potentially be able to correct for artifacts or ambiguity in read alignments. We revisited classification error using the quantified read counts for homeologs in A. kamchatica (Table 2), although we suggest some caution in the interpretation of the number of reads quantified and thus the classification error rate in this result as HomeoRoq and EAGLE-RC have additional expression quantification processes to count subgenome-common reads. For Kallisto, we used the estimated read count from its output while for all other methods, we obtained read counts using featureCounts.
Table 2.
Classification error | |||
---|---|---|---|
H to L, % | L to H, % | Reads quantified | |
STAR | 1.16 | 1.36 | 74149312 |
LAST | 0.90 | 1.26 | 67015253 |
Kallisto | 0.90 | 1.21 | 87667382 |
HomeoRoq | 1.32 | 1.64 | 60946109 |
EAGLE-RC | 0.54 | 0.80 | 57254486 |
Here, error is represented as a proportion of quantified reads rather than the proportion of mappable reads in the earlier analysis. Our results show that error decreased in STAR, LAST and Kallisto as the quantification process accounted for ambiguously mapped reads in some fashion (dropped by featureCounts by default and distributed by Kallisto). HomeoRoq’s and EAGLE-RC’s error rates increased slightly due to these methods’ discarding reads that are deemed unclassifiable. However, a large number of unclassifiable reads are due to equivalent alignments to both subgenomes, deemed subgenome common, which can be used to estimate expression levels such as RPKM by distributing proportionally to each subgenome. This has been demonstrated previously for HomeoRoq [48] and is also applicable to EAGLE-RC. Thus, the number of reads quantified for subgenome-classification approaches in Table 2 may not be directly comparable to standard genome alignment approaches. It turns out that for tetraploid A. kamchatica, all methods had comparable error rates, where EAGLE-RC was the most precise.
Next, we examined the ratio of homeologous pairs. We are interested in any shifts in expression between homeologs across conditions, so we analyzed RNA-seq data from A. kamchatica and quantified the A. halleri read count proportion as follows:
While we do not know the ground truth expression levels, we can evaluate the concordance between the different methods to describe how results might differ depending on which approach was used. To compare the results between different methods, we calculated the pair-wise root–mean–squared distance (RMSD) and coefficient of determination (r2) between the results for all methods. The RMSD is a measure of the average distance between two sets X and Y:
The r2 describes how well one variable X can be used to predict another variable Y by calculating the proportion of variability that can be explained:
Though both measures describe the similarity between the results of different methods, the RMSD quantifies the magnitude of difference with units while the r2 quantifies the proportion of similar elements.
We examined the ratio of homeologous pairs using RNA-seq data from A. kamchatica samples before and after cold stress. The r2 (Table 3) and RMSD (Supplementary data, Table S3) for show that in general, the results from the standard genome-alignment-based approach were concordant (average r2 = 0.9505, RMSD of 6.96%). HomeoRoq was also concordant (average r2 = 0.9084) with STAR and LAST. This is consistent with the error rates in the classification results.
Table 3.
STAR | LAST | Kallisto | HomeoRoq | EAGLE-RC | |
---|---|---|---|---|---|
STAR | — | ||||
LAST | 0.9505 | — | |||
Kallisto | 0.6886 | 0.6852 | — | ||
HomoeRoq | 0.9214 | 0.8953 | 0.6472 | — | |
EAGLE-RC | 0.8201 | 0.8003 | 0.5791 | 0.8325 | — |
The r2 values between methods showed that though read mapping precision can vary greatly, downstream read counting methods were somewhat able to correct for artifacts or ambiguity in read alignment to arrive at a similar quantified expression. Kallisto was discordant from all other methods with an average RMSD of 18.88%. This may be due to its built-in read-counting method using expectation–maximization (EM) whereas all other methods used featureCounts, which does not consider multi-mapped reads. There were also issues with low-expression genes where 420 and 447 A. halleri and A. lyrata homeologs, respectively, were reported to have near-zero expression (<10 reads) by all other methods in all samples but Kallisto reported more than 100 reads. The homeolog expression ratio scatter plots reflect this potential artifact (Supplementary data, Figure S1) where a slight clustering at the edges of the plots can be seen for Kallisto. Though the cause is unclear, Kallisto tends to overestimate some low-expression genes compared to other methods.
EAGLE-RC also showed less concordance in r2 and RMSD than HomeoRoq to standard genome-alignment approaches, where it reported some high-expression genes as low expression compared to other methods. More detailed examination revealed that because EAGLE-RC required reads to cross genotype differences between homeologs, it excluded reads in exons which were not pair-wise aligned using LAST during the homeolog identification process. There may exist uncertainty in the annotation, as there are 552 homeologs that had at least a 40% difference in the proportion of the gene model aligned between A. halleri and A. lyrata. The majority of these cases were A. halleri homeologs having a smaller proportion of alignment than A. lyrata homeologs due to being much longer. The exclusion of regions which were not aligned between homeologs accounted for a large portion of the difference in EAGLE-RC. For example, the homeolog annotated as AT4G25110 in A. kamchatica with 12 exons in A. halleri and 5 exons in A. lyrata and an aligned region between homeologs of 46 and 61% in A. halleri and A. lyrata, respectively, had almost all (99.89%) of the reads assigned to A. halleri mapped to regions that were not aligned between homeologs (Supplementary data, Figure S2), leading to a large difference between EAGLE-RC and other methods (Supplementary data, Table S4).
The subgenome-classification approach with HomeoRoq or EAGLE-RC tries to account for potentially missing reference genome regions in one subgenome, through consideration of read mappability or explicitly utilizing genotype variation, respectively. For example, the homeolog annotated as AT5G45850 in A. kamchatica had reads that mapped to the A. lyrata reference that did not map at all to the A. halleri reference and was thus discarded by HomeoRoq. Similarly, missing reference genome regions in homeologs cannot be pair-wise aligned, thus EAGLE-RC discarded reads that mapped to these regions. In other methods, a naive counting of reads could be reference biased (Supplementary data, Table S5). In this case, the average for a sample with the subgenome-classification approach was 0.90 compared to 0.44 with STAR, LAST and Kallisto.
Next, we examined DEH using RNA-seq data from A. kamchatica samples before and after cold stress. The results of different methods (Figure 3) showed that, in general, there was a high overlap between the different methods tested where Kallisto was the least concordant among the methods tested. This is consistent with the results when we examined the ratio of homeologous pairs.
Intuitively, genes with lower expression are expected to be more affected by uncertainties due to read count differences having more of an effect on the proportion (Table 4). Indeed, our results show that the concordance between methods was lower for genes with lower expression with a significant increase (paired t-test P-value, 6.767 × 10−06) in RMSD. This may be due to a higher proportion of reads being ambiguous and each method’s difference in handling these ambiguously mapped reads. This may have large implications if the goal is to determine homeolog expression bias.
Table 4.
STAR | LAST | Kallisto | HomeoRoq | EAGLE-RC | |
---|---|---|---|---|---|
STAR | — | 0.0213 | 0.1008 | 0.0324 | 0.0457 |
LAST | 0.1092 | — | 0.1008 | 0.0380 | 0.0499 |
Kallisto | 0.2301 | 0.2301 | — | 0.1026 | 0.1076 |
HomoeRoq | 0.1329 | 0.1529 | 0.2460 | — | 0.0380 |
EAGLE-RC | 0.1897 | 0.2010 | 0.2725 | 0.1867 | — |
Hexaploid wheat read classification
For hexaploid wheat T. aestivum, the three homeologous copies further complicate read mapping and also requires a more complex workflow for subgenome classification. To test classification performance, we used RNA-seq data of a tetraploid wheat line (tetra-Chinese Spring) with subgenomes AB and a diploid A. tauschii line (accession KU-2076) with subgenome D as the ground truth. Table 5 shows the classification performance of each approach.
Table 5.
Classification error | chrA | chrB | chrD | |||
---|---|---|---|---|---|---|
Mapped, % | AB to D, % | D to AB, % | Homeologs, % | Homeologs, % | Homeologs, % | |
STAR | 75.83 | 9.12 | 14.85 | 53.62 | 53.47 | 59.11 |
LAST | 77.21 | 1.62 | 2.78 | 54.28 | 53.96 | 60.61 |
Kallisto | 78.17 | 10.17 | 16.71 | 69.80 | 70.05 | 75.64 |
HomeoRoq | 79.73 | 0.97% | 1.54 | 48.24 | 48.20 | 53.43 |
EAGLE-RC | 79.73 | 0.33 | 0.66 | 48.83 | 48.52 | 55.51 |
Our results show that the misclassification rates of all methods were strikingly similar for allotetraploid A. kamchatica and allohexaploid wheat (Figure 4). This suggests that it was mainly the difference in methods rather than taxon-specific features that affected the error rate where subgenome-classification methods showed a higher precision than standard methods using the concatenated genome. The highest precision was obtained by EAGLE-RC followed by HomeoRoq, LAST, STAR and Kallisto in that order. Kallisto detected the most expressed homeologs by count, which may be spurious because Kallisto tended to overestimate low-expression genesrelative to other methods as discussed above. Among the other four methods, the subgenome-classification methods showed a slightly lower number of expressed homeologs detected than the standard genome alignment approach, though we do not know which one is closer to the ground truth. EAGLE-RC still maintained an error rate under 1%, the best among all methods, though it deemed an average of 20% of mapped reads as unclassified due to low confidence resulting from ambiguity between homeolog pairs. There was also a higher cost in computation time for the higher precision (Supplementary data, Table S6).
Another point of interest is that the more divergent D diploid line had much higher error rates than the AB line directly extracted from Chinese Spring, which is the reference genome line (Table 5). In this case, STAR and Kallisto performed quite poorly compared to other methods. Divergence from the reference genome appears to be a multiplier for error rate, thus less precise methods will be more affected.
Hexaploid wheat homeolog expression quantification
We evaluated the classification error using the quantified read counts for homeologs in T. aestivum (Table 6). As described in the previous section, note that the number of reads quantified is not directly comparable among different methods. Again, for Kallisto we used the estimated read count from its output while for all other methods we obtained read counts using featureCounts.
Table 6.
Classification error | |||
---|---|---|---|
AB to D, % | D to AB, % | Reads quantified | |
STAR | 2.83 | 5.13 | 2632463 |
LAST | 1.46 | 1.80 | 3324633 |
Kallisto | 1.45 | 3.07 | 4922150 |
HomeoRoq | 1.50 | 2.26 | 1929797 |
EAGLE-RC | 0.67 | 1.32 | 1892448 |
As expected, the error rate was higher in hexaploid wheat than in A. kamchatica though with similar trends. In AB classification error, STAR had over 200% higher rate of error, LAST and Kallisto had 60% higher error and HomeoRoq and EAGLE-RC were the most robust with 20% higher error. Similar to the alignment error analysis, the more distant A. tauschii (D) reads had much higher error rates, propagated from errors in the alignment. Again, STAR and Kallisto were the most affected by the increased distance in these samples.
Next, we examined homeolog expression shifts across conditions and quantified the homeolog expression ratio in T. aestivum by calculating the proportion of subgenome A reads as follows:
We examined the homeolog expression ratio using RNA-seq data from T. aestivum samples 24 h after fungal inoculation and after mock inoculation. The r2 (Table 7) and RMSD (Table S7) show that, relative to other methods, Kallisto exhibited a large drop in concordance in hexaploid wheat compared to A. kamchatica. The homeolog expression ratios (Figure 5a–d) show that the discordance was largely due to systematic differences in low-expression homeologs, which was also a trend for other methods (Figure 5e and f and Supplementary data, Figures S3–S5) though not to the degree of Kallisto. Also similar to A. kamchatica, there was some discordance between EAGLE-RC and other methods. In hexaploid wheat, there are 1689 AB, 1549 BD and 1661 AD pair-wise homeologs that have at least a 40% difference in the proportion of the gene model aligned, which may account for much of the discordance between EAGLE-RC and other methods.
Table 7.
STAR | LAST | Kallisto | HomeoRoq | EAGLE-RC | |
---|---|---|---|---|---|
STAR | — | ||||
LAST | 0.9174 | — | |||
Kallisto | 0.3257 | 0.3342 | — | ||
HomoeRoq | 0.8612 | 0.8107 | 0.2787 | — | |
EAGLE-RC | 0.8310 | 0.7931 | 0.2814 | 0.7958 | — |
Then, we examined differentially expressed homeologs using RNA-seq data from T. aestivum samples 24 h after fungal inoculation and after mock inoculation. The DEH results for hexaploid wheat between different methods (Figure 6) show that there was less concordance between methods compared to tetraploid A. kamchatica. Notably, only 51% of 304 DEHs detected by Kallisto were supported by any of the other four methods (47% by STAR, 45% by LAST, 38% by HomeoRoq and 41% by EAGLE-RC). The results here reiterate that discordance was systematic, where Kallisto’s tendency to overestimate low-expression genes led to significant differences in the homeolog expression ratio.
Discussion
Recent improvements in sequencing technology have reduced the difficulty in constructing allopolyploid reference genomes [12–14] and there should be a corresponding increase in genome-wide studies for allopolyploids [3–8]. As such, there is a need to evaluate expression quantification methods given the presence of homeologs in allopolyploids. Unlike expression quantification in diploids, homeolog expression quantification evaluates multiple highly similar gene copies concurrently and it would be ideal if all copies are at a similar level of completeness. This includes the genome annotation, which may be a non-trivial source of uncertainty [53, 54].
It is well known that the presence of repetitive sequences in diploids (paralogs) present technical challenges for read alignment [55, 56] and can bias RNA-seq expression quantification [57]. For polyploids, with the presence of homeologs, there are even more repetitive sequences due to an increase in the number of gene copies. In this study, we saw that applying a standard diploid RNA-seq workflow to allopolyploids may have issues in terms of assigning reads to the wrong subgenome, particularly in hexaploid wheat. In general, low-expression genes accounted for most of the discordance between methods. Kallisto, especially, often overestimated the number of reads in low-expression genes which has been observed in previous studies [58]. If the goal is to determine homeolog expression bias, then accurate quantification of low-expression genes is important because small errors can result in large shifts in the expression ratio between homeologs. In addition, the accuracy of expression levels in such genes is especially important if we wish to identify ‘on’ and ‘off’ states of gene expression in response to stimuli in time course analyses.
There has been a long discussion and dispute about the bias of expression between homeologs and subgenomes because different species show different patterns [59]. In this study we found that different methods showed different biases, depending on the types of uncertainty they consider. In particular, gene annotation can affect the detection of homeolog expression bias because the exon regions of homeologs are typically annotated using RNA-seq reads on each copy separately and thus may be annotated differently, especially when the expression level of one of the copies is low. To obtain general conclusions on the biased expression in polyploids, we would suggest that analysis be performed with comparable and accurate methods in corresponding exonic regions. EAGLE-RC attempts to do this by considering only corresponding exonic regions between homeologs.
In our evaluations, we presented results both at the read mapping stage and after expression quantification. It is difficult to be completely fair when evaluating the accuracy of read mapping due to differences in each of the four approaches. Kallisto does not attempt to identify primary versus secondary alignments, thus a primary alignment only evaluation is not suitable. The standard genome alignment approach may have secondary alignments due to the presence of homeologs, which can be indistinguishable from the primary alignment in terms of alignment score, in which case a primary alignment is picked randomly. The subgenome-classification approach of mapping reads to each subgenome separately has inherently fewer secondary alignments, which is an advantage of this approach. Thus, we evaluated the classification accuracy with all alignments, as secondary alignment errors are a factor in all approaches though to different degrees. The results after read counting then present the performance of each approach after potentially accounting for ambiguous alignments. In this way we show the performance at both major steps in the RNA-seq expression quantification process, though there are many other read counting methods that we were not able test. Read counting methods may also benefit from the higher precision of the subgenome-classification approach by utilizing classified reads in lieu of uniquely mapped reads and distributing ambiguous reads through EM.
Conclusion
In this study, we evaluated methods for homeolog expression quantification in tetraploid A. kamchatica and hexaploid wheat T. aestivum using RNA-seq. We examined the standard genome alignment based approach with STAR and LAST, the subgenome-classification approach with HomeoRoq and EAGLE-RC and a pseudo-alignment approach with Kallisto.
The presence of homeologs had the largest effect on STAR and Kallisto, resulting in higher read classification error. We observed that discordance was systematic, occurring mostly in low-expression genes and can result in large shifts in the homeolog expression ratio. The explicit use of genotype differences between homeologs in EAGLE-RC seems to be a factor in reducing uncertainties in the reference genome and annotation. Our results show that EAGLE-RC was the most precise method in both tetraploid A. kamchatica and hexaploid wheat.
Supplementary Material
Acknowledgments
The Aegilops tauschii seeds were kindly supplied by the National BioResource Project-Wheat (Japan, http://www.nbrp.jp). The tetra-Chinese Spring seeds were kindly supplied by Prof. Hisashi Tsujimoto (Tottori Univ.). Dr Yoshihiro Matsuoka (Fukui Prefectural University, Japan), Dr Hiroyuki Tsuji (Yokohama City University) and Ms Akina Mitsuhashi (Yokohama City University) provided generous support for Aegilops tauschii cultivation and propagation. We thank Rie Shimizu-Inatsugi for plant experiments, Catharine Aquino and the Functional Genomics Center Zurich for sequencing and Thomas Wicker, Beat Keller and Cristobal Uauy for valuable discussions.
Tony Kuo is a researcher in bioinformatics and genomic sequence analysis.
Masaomi Hatakeyama is a researcher in bioinformatics and genomic sequence analysis.
Toshiaki Tameshige is a researcher in evolutionary genomics, ecological genomics and genome duplication (polyploidization).
Kentaro K Shimizu is a researcher in evolutionary genomics, ecological genomics and genome duplication (polyploidization).
Jun Sese is a researcher in bioinformatics and genomic sequence analysis.
Author Contributions
T.K. executed the study and drafted the manuscript. M.H. contributed to data analysis and simulation studies. T.T. and K.K.S. contributed to genetic material and data collection. K.K.S. and J.S. initiated the study, initiated the analysis framework and contributed to drafting the manuscript.
Key Points
We evaluated standard RNA-seq expression quantification methods and compared against subgenome-classification approaches which map reads to each subgenome separately.
We used previous experimentally added (synthetic allotetraploid A. kamchatica) and newly generated reduced RNA-seq data (allohexaploid wheat T. aestivum versus extracted allotetraploid) as ground truth.
The standard approaches showed higher error rates (>10% using pseudo-alignment with Kallisto) while subgenome-classification approaches showed much lower error rates (<1% using EAGLE-RC and <2% using HomeoRoq).
Our results show that, in general, disagreement in low-expression genes was responsible for most of the discordance between methods.
Overall, subgenome-classification approaches perform better than standard approaches with EAGLE-RC having the highest precision.
Funding
JST CREST (JPMJCR16O3); KAKENHI (15H01717, 16H06279, 16H06469, 16H06464 and 16K21727); Japan Human Frontier Science Program; Indo-Swiss Collaboration in Biotechnology; Swiss National Science Foundation.
References
- 1. Wood TE, Takebayashi N, Barker MS, et al. . The frequency of polyploid speciation in vascular plants. Proc Natl Acad Sci USA. 2009;106(33):13875–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Renny-Byfield S, Wendel JF. Doubling down on genomes: polyploidy and crop plants. Am J Bot 2014;101(10):1711–25. [DOI] [PubMed] [Google Scholar]
- 3. Comai L, Tyagi AP, Winter K, et al. . Phenotypic instability and rapid gene silencing in newly formed Arabidopsis allotetraploids. Plant Cell 2000;12(9):1551–68. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Grover CE, Gallagher JP, Szadkowski EP, et al. . Homoeolog expression bias and expression level dominance in allopolyploids. New Phytol 2012;196(4):966–71. [DOI] [PubMed] [Google Scholar]
- 5. Buggs RJA. Unravelling gene expression of complex crop genomes. Heredity 2013;110:97–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Yoo MJ, Szadkowski E, Wendel JF. Homoeolog expression bias and expression level dominance in allopolyploid cotton. Heredity 2013;110(2):171–80. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Akama S, Shimizu-Inatsugi R, Shimizu KK, et al. . Genome-wide quantification of homeolog expression ratio revealed nonstochastic gene regulation in synthetic allopolyploid Arabidopsis. Nucleic Acids Res 2014;42(6):e46. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Paape T, Briskine R, Halstead-Nussloch G, et al. . Patterns of polymorphism and selection in the subgenomes of the allopolyploid Arabidopsis kamchatica. Nat Commun 2018;9:3909. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Comai L. The advantages and disadvantages of being polyploid. Nat Rev Genet 2005;6(11):836–46. [DOI] [PubMed] [Google Scholar]
- 10. Wang J, Tian L, Lee HS, et al. . Genomewide nonadditive gene regulation in Arabidopsis allotetraploids. Genetics 2006;172(1):507–17. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Chang PL, Dilkes BP, McMahon M, et al. . Homoeolog-specific retention and use in allotetraploid Arabidopsis suecica depends on parent of origin and network partners. Genome Biol 2010;11(12):R125. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Lu F, Romay MC, Glaubitz JC, et al. . High-resolution genetic mapping of maize pan-genome sequence anchors. NatCommun 2015;6: doi: 10.1038/ncomms7914. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Ming R, Wai CM. Assembling allopolyploid genomes: no longer formidable. Genome Biol 2015;16:27. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Jiao WB, Schneeberger K. The impact of third generation genomic technologies on plant genome assembly. Curr Opin Plant Biol 2017;36:64–70. [DOI] [PubMed] [Google Scholar]
- 15. Krasileva KV, Vasquez-Gross HA, Howell T, et al. . Uncovering hidden variation in polyploid wheat. Proc Natl Acad Sci USA 2017;114(6):E913–21. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Yuan Y, Bayer PE, Batley J, Edwards D. Improvements in genomic technologies: application to crop genomics. Trends Biotechnol. 2017;35(6):547–58. doi: 10.1016/j.tibtech.2017.02.009. [DOI] [PubMed] [Google Scholar]
- 17. Hatakeyama M, Aluri S, Balachadran MT, et al. . Multiple hybrid de novo genome assembly of finger millet, an orphan allotetraploid crop. DNA Res 2018;25(1):39–47. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Degner JF, Marioni JC, Pai AA, et al. . Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics 2009;25(24):3207–12. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Heap GA, Yang JH, Downes K, et al. . Genome-wide analysis of allelic expression imbalance in human primary cells by high-throughput transcriptome resequencing. Hum Mol Genet 2010;19(1):122–34. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Nothnagel M, Wolf A, Herrmann A, et al. . Statistical inference of allelic imbalance from transcriptome data. Hum Mutat 2011;32(1):98–106. [DOI] [PubMed] [Google Scholar]
- 21. León-Novelo LG, McIntyre LM, Fear JM, et al. . A flexible Bayesian method for detecting allelic imbalance in RNA-seq data. BMC Genomics 2014;15:920. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Boatwright JL, McIntyre LM, Morse AM, et al. . A robust methodology for assessing differential homeolog contributions to the transcriptomes of allopolyploids. Genetics. 2018;210(3):883–94. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Gerard D, Ferrão LFV, Garcia AAF, Stephens M. Genotyping polyploids from messy sequencing data. Genetics. 2018;210(3):789–807. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Page JT, Udall JA. Methods for mapping and categorization of DNA sequence reads from allopolyploid organism. BMC Genet 2015;16(suppl 2):S4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25. International Wheat Genome Sequencing Consortium, IWGSC RefSeq principal investigators, Appels R, et al. . Shifting the limits in wheat research and breeding using a fully annotated reference genome. Science 2018;361(6403). doi: 10.1126/science.aar7191. [DOI] [PubMed] [Google Scholar]
- 26. Yang YF, Furuta Y, ichi Nagata S, et al. . Tetra Chinese Spring with AABB genomes extracted from the hexaploid common wheat, Chinese Spring. Genes Genet Syst. 1999;74:67–70. [Google Scholar]
- 27. Dobin A, Davis CA, Schlesinger F, et al. . STAR: ultrafast universal RNA-seq aligner. Bioinformatics 2013;29(1):15–21. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 2009;25(9):1105–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29. Shrestha AM, Frith MC. An approximate Bayesian approach for mapping paired-end DNA reads to a reference genome. Bioinformatics 2013;29(8):965–72. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30. Frith MC, Kawaguchi R. Split-alignment of genomes finds orthologies more accurately. Genome Biol 2015;16:106. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 2010;26(7):873–81. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32. Bray NL, Pimentel H, Melsted P, et al. . Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol 2016;34:525–7. [DOI] [PubMed] [Google Scholar]
- 33. Chalhoub B, Denoeud F, Liu S, et al. . Early allopolyploid evolution in the post-Neolithic Brassica napus oilseed genome. Science 2014;345(6199):950–3. [DOI] [PubMed] [Google Scholar]
- 34. Sierro N, Battey JND, Ouadi S, et al. . The tobacco genome sequence and its comparison with those of tomato and potato. Nat Commun 2014;5:3833. doi: 10.1038/ncomms4833. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35. Zhang T, Hu Y, Jiang W, et al. . Sequencing of allotetraploid cotton (Gossypium hirsutum L. acc. TM-1) provides a resource for fiber improvement. Nat Biotechnol 2015;33:531–7. [DOI] [PubMed] [Google Scholar]
- 36. Yang J, Liu D, Wang X, et al. . The genome sequence of allopolyploid Brassica juncea and analysis of differential homoeolog gene expression influencing selection. Nat Genet 2016;48:1225–32. [DOI] [PubMed] [Google Scholar]
- 37. Ramírez-González RH, Borrill P, Lang D, et al. . The transcriptional landscape of polyploid wheat. Science 2018;361. doi: 10.1126/science.aar6089. [DOI] [PubMed] [Google Scholar]
- 38. Page JT, Gingle AR, Udall JA. PolyCat: a resource for genome categorization of sequencing reads from allopolyploid organisms. G3 (Bethesda) 2013;3(3):517–25. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39. Kuo T, Frith MC, Sese J, et al. . explicit alternative genome likelihood evaluator. BMC Med Genomics 2018;11(suppl 2):28. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40. Duchemin W, Dupont PY, Campbell MA, et al. . HyLiTE: accurate and flexible analysis of gene expression in hybrid and allopolyploid species. BMC Bioinformatics. 2015;16:8. doi: 10.1186/s12859-014-0433-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 2014;30(7):923–30. [DOI] [PubMed] [Google Scholar]
- 42. Shimizu KK, Fujii S, Marhold K, et al. . Arabidopsis kamchatica (Fisch. ex DC.) K. Shimizu & Kudoh and A. kamchatica subsp. kawasakiana (Makino) K. Shimizu & Kudoh, new combinations. Acta Phytotaxonomica et Geobotanica 2005;56(2):163–72. [Google Scholar]
- 43. Shimizu-Inatsugi R, Lihová J, Iwanaga H, et al. . The allopolyploid Arabidopsis kamchatica originated from multiple individuals of Arabidopsis lyrata and Arabidopsis halleri. Mol Ecol 2009;18(19):4024–48. [DOI] [PubMed] [Google Scholar]
- 44. Tsuchimatsu T, Kaiser P, Yew CL, et al. . Recent loss of self-incompatibility by degradation of the male component in allotetraploid Arabidopsis kamchatica. PLoS Genet 2012;8(7): e1002838. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45. Yew CL, Kakui H, Shimizu KK. Agrobacterium-mediated floral dip transformation of the model polyploid species Arabidopsis kamchatica. J Plant Res 2018;131(2):349–58. [DOI] [PubMed] [Google Scholar]
- 46. Briskine RV, Paape T, Shimizu-Inatsugi R, et al. . Genome assembly and annotation of Arabidopsis halleri, a model for heavy metal hyperaccumulation and evolutionary ecology. Mol Ecol Resour 2017;17(5):1025–36. [DOI] [PubMed] [Google Scholar]
- 47. Soltis PS, Marchant DB, Peer YV, et al. . Polyploidy and genome evolution in plants. Curr Opin Genet Dev 2015;35:119–25. [DOI] [PubMed] [Google Scholar]
- 48. Paape T, Hatakeyama M, Shimizu-Inatsugi R, et al. . Conserved but attenuated parental gene expression in allopolyploids: constitutive zinc hyperaccumulation in the allotetraploid Arabidopsis kamchatica. Mol Biol Evol 2016;33(11):2781–800. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49. Mizuno N, Yamasaki M, Matsuoka Y, et al. . Population structure of wild wheat D-genome progenitor Aegilops tauschii Coss.: implications for intraspecific lineage diversification and evolution of common wheat. Mol Ecol 2010;19(5):999–1013. [DOI] [PubMed] [Google Scholar]
- 50. Conesa A, Madrigal P, Tarazona S, et al. . A survey of best practices for RNA-seq data analysis. Genome Biol 2016;17:13. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014;15(12):550. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52. Huang W, Li L, Myers JR, et al. . ART: a next-generation sequencing read simulator. Bioinformatics 2012;28(4):593–4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53. Triant DA, Pearson WR. Most partial domains in proteins are alignment and annotation artifacts. Genome Biol. 2015;16:99. doi: 10.1186/s13059-015-0656-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54. Promponas VJ, Iliopoulos I, Ouzounis CA. Annotation inconsistencies beyond sequence similarity-based function prediction—phylogeny and genome structure. Stand Genomic Sci 2015;10:108. doi: 10.1186/s40793-015-0101-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55. Treangen TJ, Salzberg SL. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet 2011;13(1):36–46. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56. Finotello F, Lavezzo E, Fontana P, et al. . Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data. Brief Bioinform 2012;13(3):269–80. [DOI] [PubMed] [Google Scholar]
- 57. Finotello F, Camillo BD. Measuring differential gene expression with RNA-seq: challenges and strategies for data analysis. Brief Funct Genomics 2015;14(2):130–42. [DOI] [PubMed] [Google Scholar]
- 58. Everaert C, Luypaert M, Maag J, et al. . Benchmarking of RNA-sequencing analysis workflows using whole-transcriptome RT-qPCR expression data. Sci Rep 2017;7(1):1559. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59. Yoo MJ, Liu X, Pires JC, et al. . Nonadditive gene expression in polyploids. Annu Rev Genet 2014;48:485–517. [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.