Integrated modeling of protein-coding genes in the Manduca sexta genome using RNA-Seq data from the biochemical model insect

Xiaolong Cao; Haobo Jiang

doi:10.1016/j.ibmb.2015.01.007

Author manuscript; available in PMC: 2016 Jul 1.

Published in final edited form as: Insect Biochem Mol Biol. 2015 Jan 20;62:2–10. doi: 10.1016/j.ibmb.2015.01.007

PMCID: PMC4476934 NIHMSID: NIHMS657019 PMID: 25612938

Abstract

The genome sequence of Manduca sexta was recently determined using 454 technology. Cufflinks and MAKER2 were used to establish gene models in the genome assembly based on the RNA-Seq data and other species' sequences. Aided by the extensive RNA-Seq data from 50 tissue samples at various life stages, annotators around the world (including the present authors) have manually confirmed and improved a small percentage of the models after months of effort. While such collaborative efforts are highly commendable, many of the predicted genes still have problems that may hamper future research on this insect species. As a biochemical model representing lepidopteran pests, M. sexta has been used extensively to study insect physiological processes for over five decades. In this work, we assembled Manduca datasets Cufflinks 3.0, Trinity 4.0, and Oases 4.0 to assist the manual annotation efforts and development of Official Gene Set (OGS) 2.0. To further improve annotation quality, we developed methods to evaluate gene models in the MAKER2, Cufflinks, Oases and Trinity assemblies and selected the best ones to constitute MCOT 1.0 after thorough crosschecking. MCOT 1.0 has 18,089 genes encoding 31,166 proteins: 32.8% match OGS 2.0 models perfectly or near perfectly, 11,747 differ considerably, and 29.5% are absent in OGS 2.0. Future automation of this process is anticipated to greatly reduce human efforts in generating comprehensive, reliable models of structural genes in other genome projects where extensive RNA-Seq data are available.

Keywords: gene annotation, de novo assembly, tobacco hornworm, automated gene modeling, arthropod genomics

1. Introduction

With five larval instars, a large body size and hemolymph volume, and a simple larval body structure, the tobacco hornworm Manduca sexta has been widely employed as a model organism to study basic physiological processes in insects, such as cuticle formation, neural transmission, hormonal regulation, nutrient transport, intermediary metabolism, and immune responses (Hopkins et al., 2000; Shield and Hildebrand, 2001; Riddiford et al., 2003; Kanost et al., 1990; Arrese and Soulages, 2010; Jiang et al., 2010). Acquired knowledge of the molecular mechanisms underlying these processes would lead to new means of pest control, because M. sexta may be a good representative of some serious agricultural pests in the order of Lepidoptera. Several transcriptome analyses have yielded sequences and expression patterns of genes related to immunity, digestion, and olfaction (Zou et al., 2008; Pauchet et al., 2010; Zhang et al., 2011; Grosse-Wilde et al., 2011; Gunaratna and Jiang, 2013), but the potential of this model species is far from fulfillment partly due to the lack of its genome sequence. The shortage of complete protein sequences based on correctly modeled genes substantially hampers proteomic studies, for instance, of the immune complex formed around entomopathogens.

Recently, the genomic DNA isolated from a single male pupa of M. sexta was pyrosequenced at >20-fold coverage and assembled into Manduca Genome Assembly 1.0 (Msex 1.0) using Newbler with Atlas-GapFill (X et al., 2014). Sixty cDNA libraries, representing mRNA samples of whole larvae, organs and tissues at various developmental stages, were sequenced using Illumina technology, yielding >350 gigabyte data. Some of these RNA-Seq datasets and other known M. sexta cDNA sequences were aligned to the reference genome to generate Manduca Cufflinks Assembly 1.0 and 1.0b using Bowtie, TopHat, and Cufflinks. Aided by the available sequence data from M. sexta and other arthropod species, approximately 18,000 genes in Msex 1.0 were predicted by MAKER2 generating the Manduca Official Gene Set 1.0 (OGS 1.0). Some of the OGS 1.0 models were examined by annotators to detect errors using Manduca Cufflinks 1.0/1.0b, Trinity 3.0, and Oases 3.0 sequences. The latter two sets of gene transcripts, assembled solely based on the RNA-Seq datasets, were extensively used along with Cufflinks 1.0/1.0b to improve annotation quality. Over a period of more than one year, 2,498 structural genes were successfully curated by approximately 70 researchers (X et al., 2014). PASA2 (http://pasa.sourceforge.net/) was then used to select the best models from the MAKER2, Cufflinks, Trinity, Oases, and manual assemblies to generate Manduca OGS 2.0 (X et al., 2014).

During the course of gene cross-examination, we came to realize that some of the lessons learned can be valuable to future genome projects. For example, as extensive RNA-Seq data are becoming the norm, genome-dependent and genome-independent assemblies are critically important in the validation and perfection of MAKER2 gene models. Because each of the programs used to produce OGS 2.0 has limitations (Table 1), an integration of their outputs using computer programs may greatly reduce human efforts in sequence cross-examination and considerably increase the percentage of crosschecked gene models. To achieve these goals, we have developed methods to evaluate models in the MAKER2, Cufflinks, Oases and Trinity assemblies. As proof of principle, a reliable, nearly complete set of protein sequences (MCOT 1.0) was generated to facilitate proteomic research in this model insect. In the following, we report the generation of Cufflinks 3.0, Oases 4.0 and Trinity 4.0 gene models, discuss their advantages, shortcomings and integration, and describe how MCOT 1.0 was developed and compared with OGS 2.0.

Table 1. Comparison of the four gene prediction programs.

Program	Algorithm	Advantages	Disadvantages
Cufflinks	map reads to the reference genome with TopHat and Bowtie to identify splice sites, and then use outputs of TopHat to create gene models	most sensitive; accurate splicing sites; GTF file for gene annotation; fast, less computation; more tolerant to low quality reads	carry errors in the genome scaffolds (gaps, NNNs, misassembling, etc.); many isoforms from closely located and related genes do not exist
MAKER2	align EST and protein sequences to genome to produce ab initio gene predictions; can use RNA-Seq data to improve the prediction	less redundant; model genes poorly represented in the RNA-Seq datasets; GTF file for gene annotation	low quality of predictions, such as extra or skipped exons, inaccurate splicing junctions, and merging of adjacent genes; biased on proteins
Trinity	De novo assemble transcripts using RNA-Seq data	not influenced by problems in the genome assembly	single hash level (k: 25); less sensitive than Cufflinks; redundant transcripts; no GTF file; SNPs etc.
Oases	De novo assemble transcripts using RNA-Seq data, and use Velvet for contig assembling	accurate, not influenced by problems in the genome assembly, multiple hash levels to improve quality of transcript assembly	less sensitive than Cufflinks, redundant transcripts; intense computation for large datasets; no GTF file; SNPs and other variations

2. Materials and Methods

2.1. Data and program acquisition

Manduca Genome Assembly 1.0 (Msex 1.0) and gene models in Manduca Official Gene Sets 1.0 (OGS 1.0, Table S1) and 2.0 (OGS 2.0) and Cufflinks Assembly 1.0 (Cufflinks 1.0) (X et al., 2014) were downloaded from Manduca Base (ftp://ftp.bioinformatics.ksu.edu/pub/Manduca/). Universal protein sequences in UniProtKB Arthropoda (Table S1) were downloaded from ftp://ftp.ebi.ac.uk/pub/databases/fastafiles/uniprot/. The RNA-Seq datasets (X et al., 2014) were acquired from Dr. Gary Blissard at Cornell University. SAMtools (0.1.19) (Li et al., 2009), Bowtie2 (2.2.1) (Langmead and Salzberg, 2012), TopHat (2.0.11) (Trapnell et al., 2009), Cufflinks (2.1.1) (Trapnell et al., 2012; Roberts et al., 2011), Trinity (20131110) (Grabherr et al., 2011), Oases (0.2.08) (Schulz et al., 2012), and BLAST+ (2.2.29) (Camacho et al., 2009) were downloaded from http://samtools.sourceforge.net/, http://bowtie-bio.sourceforge.net/bowtie2/index.shtml, http://ccb.jhu.edu/software/tophat/index.shtml, http://cufflinks.cbcb.umd.edu/, http://trinityrnaseq.sourceforge.net/, https://www.ebi.ac.uk/~zerbino/oases/, and ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/, respectively, and installed on a local supercomputer according to their manuals.

2.2. Generation of Cufflinks 3.0

The 60 RNA-Seq datasets were aligned to Msex 1.0 using TopHat at settings for three different read types: single end, paired end, and strand specific. “--read-realign-edit-dist 0” was selected to increase accuracy of read alignments. Cufflinks was used to convert the accepted hits generated by TopHat into separate GTF files, with the “-u” option enabled to allow more accurate handling of reads mapped to multiple regions. Cuffmerge was employed to combine GTF files of all the libraries to make the final GFF file (see scripts in the Supplemental Materials), from which transcript sequences were extracted using gffread to form the Cufflinks 3.0 dataset (Table S1).
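
The workflow in this section can be sketched as command lines assembled in Python. This is only an illustration under stated assumptions: the Bowtie index name, read files, output directories, and the helpers `build_library_commands` and `build_merge_commands` are hypothetical, read-type-specific settings are omitted, and the commands are printed rather than executed. The scripts actually used are those in the Supplemental Materials.

```python
import shlex


def build_library_commands(library, reads, index="Msex1_index"):
    """Return TopHat and Cufflinks command lines for one library (not executed)."""
    tophat_dir = f"tophat_{library}"
    tophat = ["tophat", "--read-realign-edit-dist", "0",
              "-o", tophat_dir, index, *reads]
    # "-u" turns on the multi-read correction mentioned in the text.
    cufflinks = ["cufflinks", "-u", "-o", f"cufflinks_{library}",
                 f"{tophat_dir}/accepted_hits.bam"]
    return [tophat, cufflinks]


def build_merge_commands(gtf_list="assemblies.txt", genome="Msex1.fa"):
    """Return Cuffmerge and gffread command lines (hypothetical file names)."""
    cuffmerge = ["cuffmerge", "-s", genome, "-o", "merged", gtf_list]
    gffread = ["gffread", "-w", "cufflinks3_transcripts.fa",
               "-g", genome, "merged/merged.gtf"]
    return [cuffmerge, gffread]


if __name__ == "__main__":
    commands = build_library_commands("lib01", ["lib01_1.fq", "lib01_2.fq"])
    commands += build_merge_commands()
    for cmd in commands:
        print(shlex.join(cmd))
```

Keeping each command as a list makes it straightforward to pass to `subprocess.run` once the placeholder paths are replaced with real ones.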

2.3. Reads treatment, normalization, and de novo assembling

Paired end reads were trimmed to 80 bp using FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/index.html), with the forward reads combined in one file and the reverse ones in another. To handle the RNA-Seq data with the 256 GB RAM of the supercomputer, the number of reads was reduced according to Haas et al. (2013). The Perl scripts provided in Trinity were used to perform in silico read normalization with maximum coverage set to 500. The single end and strand-specific reads were combined in one file for normalization at the same maximum coverage. After all normalized reads were pooled, Trinity was used to assemble the reads as paired end reads, generating Trinity 4.0 (Table S1). For Oases, four hash lengths (k: 25, 27, 29, 31) were chosen to assemble the reads as single end reads in four separate runs. Scaffolding was not allowed, which prevented stretches of Ns in the assembled transcripts. The transcript files were then merged according to the Oases manual, generating Oases 4.0 (Table S1). In addition, reads that could not be aligned to Msex 1.0 by TopHat were combined, trimmed to 80 bp, and assembled as paired end reads using Trinity. This new assembly (Trinity 4.0b, Table S1) was later used to identify unmapped genes, some of which may reside on the unsequenced W chromosome.
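
The trimming step can be illustrated with a minimal Python sketch. It is not the FASTX-Toolkit implementation; it only shows the operation, assuming standard four-line FASTQ records, and the function name `trim_fastq` is hypothetical.

```python
def trim_fastq(lines, length=80):
    """Yield FASTQ lines with sequence and quality strings cut to `length`."""
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        # In a four-line record, line 2 is the sequence and line 4 the quality.
        if i % 4 in (1, 3):
            line = line[:length]
        yield line


if __name__ == "__main__":
    record = ["@read1", "ACGT" * 25, "+", "I" * 100]
    for out in trim_fastq(record):
        print(out)
```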

2.4. Gene translation and sequence comparison

Gene transcripts in Trinity 4.0 and Oases 4.0 were translated to polypeptide sequences using TransDecoder in Trinity (http://transdecoder.sourceforge.net/) (Haas et al., 2013), with minimum protein length set at 60. For removing redundant sequences in the de novo assemblies, identical proteins were identified in one batch using Python scripts and only one of each group was kept in the final set of unique protein sequences. To identify the best sequences in the comparisons between assemblies, the BLOSUM62 scoring matrix in the BLAST source code was changed to -100 for all 190 non-identical residue pairs. As such, only identical or near identical sequences would be detected by BLASTP with a positive score of alignment. The gap opening penalty was set to the maximum (32,767) to avoid gapped matches. A batchwise BLASTP comparison of the two sets of translated sequences was performed, with the tabular outputs (e.g. match length, query length, subject length) exported to Excel for further analysis. Cufflinks 3.0 translations were used as queries to search Trinity 4.0 or Oases 4.0 translations.
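
The effect of the modified matrix can be illustrated with a short Python sketch. It is not the BLAST source modification itself: it builds an identity-only table for the 20 standard residues (diagonal values taken from the standard BLOSUM62 matrix, -100 elsewhere) and scores an ungapped alignment. The helper names are hypothetical, and ambiguity codes such as B, Z and X are ignored.

```python
from itertools import combinations

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
# Self-match (diagonal) scores of the standard BLOSUM62 matrix.
BLOSUM62_DIAGONAL = dict(zip(
    AMINO_ACIDS,
    [4, 5, 6, 6, 9, 5, 5, 6, 8, 4, 4, 5, 5, 6, 7, 4, 5, 11, 7, 4],
))
MISMATCH = -100


def identity_only_matrix():
    """Keep the diagonal scores and set every non-identical pair to -100."""
    matrix = {(a, a): score for a, score in BLOSUM62_DIAGONAL.items()}
    for a, b in combinations(AMINO_ACIDS, 2):  # 190 unordered pairs
        matrix[(a, b)] = matrix[(b, a)] = MISMATCH
    return matrix


def ungapped_score(seq1, seq2, matrix):
    """Sum the pair scores of two aligned, equal-length segments."""
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    m = identity_only_matrix()
    print(ungapped_score("MKLV", "MKLV", m))  # identical segments
    print(ungapped_score("MKLV", "MKIV", m))  # one Leu/Ile mismatch
```

A single mismatch costs far more than any identical pair can contribute, so an alignment is only extended through a mismatch when it is flanked by long identical stretches.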

2.5. Cross-examination and selection of protein sequences from different assemblies

As illustrated in the flowchart (Fig. 1), the BLASTP results from comparisons of the unique protein sequences in Cufflinks 3.0, Trinity 4.0, and Oases 4.0 were examined by two methods to establish Selections 1 and 2. The results from one method were then cross-examined by the other to yield a dataset COT, later becoming a major part of MCOT 1.0.

In the length-based method, the Cufflinks-Trinity comparison resulted in pairs with match lengths (TMLs, T for Trinity) and Cufflinks lengths (CLs), and their ratios were used to determine whether or not the Trinity hits would be kept (Fig. 1A). If TML/CL was ≤ 0.7, the Trinity hits were ignored and the corresponding Cufflinks sequences were further processed: for ones without ambiguous residues (Xs), their lengths (CLs) were directly used as CL*s; for the others with Xs, 70% of the CL values were used as CL*s. On the other hand, if TML/CL was > 0.7, the Trinity sequences were considered in the next step. The same procedure was carried out to compare Cufflinks and Oases translations and select the Oases ones (OML/CL > 0.7) for further consideration, together with the selected Cufflinks and Trinity sequences. The ones with the largest values (CL*, TL, or OL) were kept in Selection 1. If the values were equal, retention priority was given to the sequences in the following order: Cufflinks, Trinity, and then Oases.
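
The length-based rule can be sketched in Python as follows. This is one reading of the description rather than the authors' script: it assumes that the Cufflinks protein always competes with its CL* value and that a de novo protein competes only when its match covers more than 70% of CL. The function name and argument layout are hypothetical.

```python
def select_length_based(cl, has_x, trinity=None, oases=None, cutoff=0.7):
    """Pick the source for Selection 1.

    cl      -- length of the Cufflinks protein
    has_x   -- True if the Cufflinks protein contains ambiguous residues (Xs)
    trinity -- (match_length, protein_length) of the Trinity hit, or None
    oases   -- (match_length, protein_length) of the Oases hit, or None
    """
    cl_star = 0.7 * cl if has_x else cl
    candidates = [("Cufflinks", cl_star)]
    for name, hit in (("Trinity", trinity), ("Oases", oases)):
        if hit is None:
            continue
        match_length, protein_length = hit
        if match_length / cl > cutoff:
            candidates.append((name, protein_length))
    # max() returns the first maximal item, so ties follow Cufflinks > Trinity > Oases.
    return max(candidates, key=lambda item: item[1])[0]


if __name__ == "__main__":
    print(select_length_based(300, False, trinity=(290, 320), oases=(295, 310)))
    print(select_length_based(300, True, trinity=(280, 250)))
```

In the second call the Cufflinks protein contains Xs, so its effective length drops to 210 and the 250-residue Trinity protein is retained.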

In the ratio-based method, Cufflinks 3.0 translations were used as queries to search arthropod universal/UniProt (U) sequences using BLASTP with the original BLOSUM62 matrix (Fig. 1B). Results were kept if identity >35% and ML/QL >0.7 or ML >200. When several regions were matched, ML equals the sum of match lengths between the same query and subject sequences. Up to five top hits were used to calculate UL (for UniProt length: mean ± SD), and the ID of the best match was kept. Lengths (CL, TL, OL and UL) of the Cufflinks 3.0, Trinity, Oases, and UniProt proteins, correlated by the BLASTP searches, were used to calculate similarity ratios CUS, TUS, and OUS. For example, TUS (i.e. similarity ratio of lengths in a T-U comparison) was defined as TL/UL or UL/TL, whichever is between 0 and 1, so that a TUS close to 1 indicates high similarity between this Trinity-UniProt pair. Depending on the absence or presence of Xs in the Cufflinks translations, CUS was directly used or adjusted to 70% as CUS*. The proteins with the highest ratios (CUS*, TUS or OUS) were kept in Selection 2 and, if the values were equal, the priority order of C > T > O was used to determine which ones to retain.
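
A minimal Python sketch of the ratio-based rule is given below. It assumes that the UniProt hit lengths are supplied in rank order and that UL is their plain mean; the function names are hypothetical and the code is an illustration, not the authors' implementation.

```python
from statistics import mean


def similarity_ratio(length_a, length_b):
    """Return the length ratio that lies between 0 and 1."""
    return min(length_a, length_b) / max(length_a, length_b)


def select_ratio_based(cl, has_x, uniprot_lengths, tl=None, ol=None):
    """Pick the source for Selection 2 from the CUS*, TUS and OUS ratios."""
    ul = mean(uniprot_lengths[:5])  # up to five top hits
    cus = similarity_ratio(cl, ul)
    scores = [("Cufflinks", 0.7 * cus if has_x else cus)]
    if tl is not None:
        scores.append(("Trinity", similarity_ratio(tl, ul)))
    if ol is not None:
        scores.append(("Oases", similarity_ratio(ol, ul)))
    # max() keeps the first maximal item, so ties follow C > T > O.
    return max(scores, key=lambda item: item[1])[0]


if __name__ == "__main__":
    print(select_ratio_based(200, False, [400, 410, 390], tl=395, ol=300))
```

In this example UL is 400, so the 395-residue Trinity protein has a ratio much closer to 1 than the 200-residue Cufflinks protein and is selected.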

To cross-examine the two selections, the length (L) and match length (ML) of sequence Y (C or T or O) in Selection 2 (S2), the UL of its correlated UniProt sequence, and the L and ML of its correlated sequence in Selection 1 (S1) were used to calculate YUS_S2 – YUS_S1 (Fig. 1C). YUS = L/UL or UL/L, whichever is between 0 and 1. Sequences in S1 were kept if their YUS_S2 – YUS_S1 < 0.3, ML_S1/CL > 0.95, or ML_S1/CL > 0.8 when the Cufflinks sequence contains Xs (route 1). Sequences in S2 were retained if their YUS_S2 – YUS_S1 > 0.5 and L_S2/CL > 0.7 (route 2). The remaining sequence pairs in the two selections were manually scrutinized to determine which ones to keep (route 3). In most cases, S1 and S2 were identical (YUS_S2 = YUS_S1).
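
The three routes can be sketched in Python as below. The wording of the criteria leaves some room for interpretation; this sketch assumes that the route 1 conditions are alternatives (any one suffices) and that route 1 is tested before route 2. The function name and return strings are hypothetical.

```python
def cross_examine(yus_s1, yus_s2, ml_s1, l_s2, cl, has_x):
    """Assign a pair of correlated S1/S2 sequences to route 1, 2 or 3."""
    diff = yus_s2 - yus_s1
    # A lower match-length cutoff applies when the Cufflinks sequence has Xs.
    ml_cutoff = 0.8 if has_x else 0.95
    if diff < 0.3 or ml_s1 / cl > ml_cutoff:
        return "route 1: keep S1"
    if diff > 0.5 and l_s2 / cl > 0.7:
        return "route 2: keep S2"
    return "route 3: manual check"


if __name__ == "__main__":
    print(cross_examine(0.90, 0.90, 300, 300, 300, False))
    print(cross_examine(0.30, 0.95, 200, 290, 300, False))
    print(cross_examine(0.50, 0.90, 200, 290, 300, False))
```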

2.6. Classification of sequence comparison results

If the lengths of a query sequence (QL), subject sequence (SL), and match (ML) were identical (QL = SL = ML), the match was considered “P” (for perfect). If (ML/QL)×(ML/SL) > 0.95 (e.g. when ML = QL, ML/SL > 0.95), the match was “N” (for near perfect). The third and fourth categories, “O” (for okay) and “B” (for bad), were separated based on the match length index (MLI), defined as (ML/QL)/0.7 + ML/200. If MLI was ≥ 1, the match was “O”. In other words, a match of 200 or more residues is significant even if QL is much greater than ML, and a match covering 70% or more of QL is considerable even when ML/200 is small. If MLI was < 1, the match was “B”. In the last category, “W” (for worst), the query sequences had no match. When the OGS 1.0 and Cufflinks 3.0 datasets were compared, OGS 1.0 IDs with “B” and “W” matches were recorded.
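
The classification rules translate directly into a few lines of Python. The sketch below follows the thresholds given above; the function names are hypothetical, and a missing or zero match length is taken to mean that the query had no hit.

```python
def match_length_index(ml, ql):
    """MLI = (ML/QL)/0.7 + ML/200."""
    return (ml / ql) / 0.7 + ml / 200


def classify(ql, sl=None, ml=0):
    """Return P, N, O, B or W for one query and its best match."""
    if not ml:
        return "W"  # no match at all
    if ql == sl == ml:
        return "P"
    if (ml / ql) * (ml / sl) > 0.95:
        return "N"
    return "O" if match_length_index(ml, ql) >= 1 else "B"


if __name__ == "__main__":
    for args in [(100, 100, 100), (100, 102, 100), (1000, 300, 250), (100, 400, 40), (100,)]:
        print(args, classify(*args))
```

The third example shows why the ML/200 term matters: only a quarter of the 1,000-residue query is matched, yet the 250-residue match alone lifts the MLI above 1.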

2.7. Identification of proteins present only in OGS 1.0

Although accuracy of the gene models in OGS 1.0 is relatively low, some are unique (Table 1). Since Cufflinks is more sensitive than Trinity and Oases (Yandell and Ence, 2012), MAKER2 proteins were used as queries to search the Cufflinks 3.0 translations using BLASTP with the modified scoring matrix, according to Section 2.4. Based on the results, those sequences in the categories of “B” or “W” were stored as “M” (for MAKER2 unique proteins), later incorporated into MCOT 1.0.

2.8. Identification of unmapped genes in Trinity 4.0b

Since a male pupa was used for genome sequencing, genes located on the W chromosome are not present in Msex 1.0. In addition, the genome assembly probably lacks genes or gene pieces on other chromosomes, as gaps between scaffolds or NNN regions. Trinity 4.0b was used to uncover transcripts of such unmapped genes. Based on results of the MCOT-Trinity 4.0b comparison, Trinity 4.0b protein sequences in the categories of “B” or “W” were kept for BLASTP search of arthropod UniProt sequences using the original BLOSUM62 scoring matrix. Hits with ML > 100, identity > 35%, and minimum/maximum of ML, QL and SL > 0.7 were combined with the proteins in “M” (Section 2.7) and “COT” (Section 2.5) to generate MCOT 1.0 (Table S1) after redundant sequences were removed. The redundant ones were identical sequences or shorter sequences (with zero or three residues trimmed off from both ends) identical to a part of longer ones.
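
The hit filter and the redundancy rule can be sketched in Python as follows. This is an illustration under assumptions: identity is expressed in percent, a sequence is treated as redundant only with respect to strictly longer sequences, and both function names are hypothetical.

```python
def good_uniprot_hit(ml, ql, sl, identity):
    """ML > 100, identity > 35%, and min/max of (ML, QL, SL) > 0.7."""
    lengths = (ml, ql, sl)
    return ml > 100 and identity > 35 and min(lengths) / max(lengths) > 0.7


def remove_redundant(sequences, trims=(0, 3)):
    """Drop duplicates and sequences that are part of a longer one.

    A sequence is also dropped if it matches part of a longer sequence
    after `trims` residues are removed from both of its ends.
    """
    kept = []
    for seq in sorted(set(sequences), key=len, reverse=True):
        cores = [seq[t:len(seq) - t] for t in trims]
        if any(core and core in longer
               for core in cores
               for longer in kept if len(longer) > len(seq)):
            continue
        kept.append(seq)
    return kept


if __name__ == "__main__":
    long_seq = "MKTAYIAKQRQISFVKSHFSRQ"
    pool = [long_seq, long_seq, "AYIAKQRQIS", "GGGAYIAKQRQISPPP", "WWWWWWWWWW"]
    print(remove_redundant(pool))
```

In the example, the exact duplicate, the internal fragment, and the sequence that differs only in its three terminal residues on each side are all removed.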

3. Results and discussion

3.1. Manduca Genome Assembly 1.0

Shotgun sequencing of M. sexta genomic DNA fragments by the 454 technology resulted in a dataset at >20-fold of the genome size (422 ± 12 Mb), which was then assembled into Msex 1.0 (X et al., 2014). The genome assembly consists of 20,891 scaffolds (Table 2) with N₅₀ at 664 kb, much longer than the size of a typical lepidopteran insect gene. While this sequence set is good enough for gene modeling, other features may complicate the process: 1) 50.5% and 41.0% of the scaffolds are <1 kb and 1 kb to 10 kb, accounting for 1.79% and 4.05% of the 419 Mb assembly size, respectively (Fig. 2A); 2) over 17,000 undetermined nucleotide (NNN) regions (average: 1,118 bp; range: 1-124,308 bp) (Fig. 2B) may contain genes or gene elements, even though they only account for 4.71% of the entire assembly; 3) conserved and novel repetitive elements, accounting for 25% of Msex 1.0 (X et al., 2014), and other highly similar sequences may cause errors in this assembly (Cao et al., 2014). Consequently, gene modeling can be a challenge in some cases.
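
For readers unfamiliar with the N₅₀ statistic, the following Python sketch shows the usual definition: the length of the scaffold at which the cumulative size, counted from the longest scaffold down, first reaches half of the assembly. It is a generic illustration with toy numbers, not the calculation used for Msex 1.0, and the function name is hypothetical.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold lengths."""
    if not lengths:
        raise ValueError("no scaffold lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```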

Table 2. Summary statistics of M. sexta scaffolds in Msex 1.0 (data from X et al., 2014).

size range	number	% of total number	length	% of total length	NNN number	NNN length	% of NNN length
<1 ×10³	10,543	50.5	7,516,906	1.8	13	13	0.00
10³-10⁴	8,572	41.0	16,986,901	4.1	340	551,049	3.24
10⁴-10⁵	1,083	5.2	40,475,711	9.6	3,568	4,970,857	12.28
10⁵-10⁶	604	2.9	209,932,343	50.0	9,576	10,185,979	4.85
>10⁶	89	0.4	144,530,018	34.5	4,188	4,061,178	2.80
total	20,891	100	419,441,879	100	17,685	19,769,076	4.71

Fig. 2 — Length distributions of Scaffolds and NNN regions. A) Percentage of scaffold numbers and sizes; B) lengths of NNN regions and corresponding scaffolds.

3.2. Manduca Cufflinks Assembly 3.0

Cufflinks uses RNA-Seq data to model genes in a genome assembly (Table 1) (Trapnell et al., 2012; Roberts et al., 2011). We took advantage of Msex 1.0 and all 60 RNA-Seq datasets (X et al., 2014) to generate a new assembly, namely Cufflinks 3.0. As an update of Manduca Cufflinks 1.0, which was assembled using 33 of the 60 libraries, Cufflinks 3.0 contains 36,027 genes and 62,497 transcripts (Table 3). Cufflinks 1.0 has 37,281 genes and 64,301 transcripts. Perhaps the lack of RNA-Seq support for scarcely expressed genes split some genes and their transcripts into two or more pieces in Cufflinks 1.0. Analysis of the Cufflinks 3.0 dataset indicates that 75% of the genes have one transcript form and 16% have 2 or 3 splicing alternates (Fig. 3). Thus, alternative splicing appears to be a minor concern for the genes predicted automatically. In comparison, 96% of the MAKER2 gene models in OGS 1.0 have no splice variant, indicating that this program is not good at predicting such variations.

Table 3. Numbers of genes, transcripts, and proteins predicted by different programs.

program	assembly	genes	transcripts	proteins	unique proteins
MAKER2	OGS 1.0	18,750	20,317	22,310	22,310
Cufflinks	Cufflinks 3.0	36,027	62,497	53,102	37,316
Trinity	Trinity 4.0	193,161	317,062	155,825	57,593
Oases	Oases 4.0	88,397	552,733	304,367	130,474

3.3. Trinity and Oases assemblies

Based on the same reference genome, Cufflinks and MAKER2 may incorrectly predict genes if there are flaws in their corresponding genomic regions (Table 1, Section 3.1). To discover and repair this problem, we de novo assembled transcripts using the 60 RNA-Seq datasets. In total, 317,062 transcripts corresponding to 193,161 genes were established using Trinity and 552,733 from 88,397 genes by Oases (Table 3). Due to characteristics of the Trinity and Oases programs (Table 1), the transcript numbers were 5.1 to 27.2-fold higher than those in Cufflinks 3.0 and OGS 1.0. The percentages of short transcripts (< 512 bp) were 48% in Trinity and 30% in Oases, much higher than 14% in Cufflinks 3.0 or OGS 1.0 (Fig. 4A, Table S2). Many of the short contigs in the genome-independent assemblies were probably caused by how these different programs handle problems such as single nucleotide polymorphisms, low quality reads, and posttranscriptional modifications. While Oases allows multiple hash levels, merging them does not necessarily produce a better assembly than Trinity did. The Oases gene number was 88,397 or 45.8% of the Trinity models, but the protein number (total: 304,367, unique: 130,474) was 1.95- and 2.27-fold of the Trinity proteins (total: 155,825, unique: 57,593) (Table 3). Nonetheless, the numbers of transcripts and unique proteins in different size ranges (Fig. 4, A and B) did indicate that the RNA-Seq datasets were large and diverse enough for modeling a majority of the active genes and, in some cases, their splicing variants, all based on experimental evidence.

Fig. 4 — Size distributions of transcripts (A) and unique proteins (B) predicted by the four programs.

3.4. Translation of the gene model sets

We focus our efforts on structural genes to make M. sexta amenable to proteomic studies in the future. By translating their transcripts and setting the size limit to > 60 residues, we expect to detect antimicrobial peptides (e.g. cecropins) but not some neuropeptides that are too small to tell apart from the noise of short open reading frames (ORFs). Some of the transcripts contain two or more ORFs, in most cases due to the merging of adjacent genes. As an extreme example, MAKER2 merged eleven adjacent genes into one coding for a gigantic “polyprotease”. While the transcript numbers in Trinity and Oases are 5.1 and 8.8 times that in Cufflinks, the numbers of unique proteins are just 1.5 and 3.5 times, respectively (Table 3, Fig. 4B), suggesting that differences in the non-coding regions may also be responsible for the high transcript counts. Based on the protein size distribution (Table S3), Cufflinks outperforms the other three programs in modeling proteins longer than 2,048 residues, owing to its high sensitivity and reliance on Msex 1.0 (Table 1). The unique proteins shorter than 2,048 residues are most numerous in Oases 4.0, followed by Trinity 4.0, Cufflinks 3.0, and finally OGS 1.0 (Fig. 4B). Although part of this could be an artifact caused by Oases and, to a lesser extent, Trinity, the de novo assemblies complement the other two assemblies well by closing the gaps in Msex 1.0 (Table 1). MAKER2, primarily designed to model structural genes, has generated OGS 1.0. Albeit the smallest, this assembly contains unique genes. These genes are either scarcely expressed in the 52 tissue samples or expressed in unsampled tissues or stages so that they are not detected even by Cufflinks. In summary, an integration of the assemblies is necessary to generate a reliable, concise, and complete set of structural genes.

3.5. Comparison of proteins in OGS 1.0 and Cufflinks 3.0

To facilitate comparison among the four M. sexta assemblies, we modified the scoring matrix of BLASTP so that all non-identical residue pairs (e.g. Leu and Ile) score -100 (Section 2.4). Consequently, unless there is a long stretch of identical or near identical amino acid sequence in a query and a subject, the comparison always yields a negative score, allowing us to ignore the less-than-perfect matches that cause complications. After the proteins in OGS 1.0 and Cufflinks 3.0 were compared, 17,672 of the 17,907 pairs were 100% identical, 226 were 98.0 to 99.9% identical, and these two groups together accounted for 99.95% of the total matches (Table 4). In this way, match length (ML) in the query (Q) and subject (S) were directly used to calculate (ML/QL)×(ML/SL) and (ML/QL)/0.7 + ML/200 (i.e. MLI or match length index), without any concern about the exact percentage identity. The ML, QL, SL, (ML/QL)×(ML/SL) and MLI values were then used to categorize the matches into “P”, “N”, “O”, “B”, and “W” (Section 2.6). Among the 22,310 unique proteins from the MAKER2 models, 6,481 perfectly and 2,245 near perfectly matched those from Cufflinks 3.0 (Table 5). Together, they account for 39.1% of the total. Another 39.1% fall into the “O” category. Proteins in the categories “B” (678) and “W” (4,177) are considered to be unique, as they are not modeled by Cufflinks, Trinity, or Oases. The latter two are less sensitive than Cufflinks (Table 1).

Table 4. Distribution of numbers of matched proteins over sequence identity in the BLASTP comparison of the protein sequences in OGS 1.0 and Cufflinks 3.0.

Identity (%)	Count	% of total counts
96-97	1	0.01
97-98	8	0.04
98-99	58	0.32
99-100	168	0.94
100	17,672	98.69

Table 5. BLASTP comparison of OGS 1.0 and Cufflinks 3.0 models.

category	count	% of total counts
P (perfect)	6,481	29.05
N (near perfect)	2,245	10.06
O (okay)	8,729	39.13
B (bad)	678	3.04
W (worst)	4,177	18.72

total	22,310	100

3.6. Comparison of proteins in Trinity 4.0, Oases 4.0, and Cufflinks 3.0

Using the same method, we separately compared proteins in Cufflinks 3.0 with Trinity 4.0 and Oases 4.0 translations. Because translations of the MAKER2 models (Section 3.5), de novo assemblies, and arthropod UniProt sequences (Section 3.7) were all compared with translations of Cufflinks 3.0, identifications of the Cufflinks hits from these BLASTP searches serve as a liaison for all these datasets. The correlated protein sequences can then be evaluated to find the best model (Fig. 1).

In the comparison of Cufflinks 3.0 with Oases 4.0 translations, for example, 67.8% of the total matched sequences had ML/QL > 0.95 (Fig. 5A). The rest of the hits fell into the ranges of 0.95-0.7 (19.7%) and 0.7-0 (12.5%). We arbitrarily set the ML/QL threshold at 0.7 to identify Q and S sequences representing the same gene and kept the longer ones in Selection 1 (Fig. 1A). Likewise, we found that 39.9% of the total O-C matches had (ML/QL)×(ML/SL) > 0.95 (Fig. 5B); 1.7% of the total had (ML/QL)/0.7 + ML/200 (i.e. match length indices or MLIs) less than one (Fig. 5C). Using cutoff values of 1.0 for ML/QL, 0.95 for (ML/QL)×(ML/SL), and 1 for MLI, we categorized the matches into “P”, “N”, “O”, “B” or “W”. By correlating the results from T-C (Trinity 4.0 vs. Cufflinks 3.0) and O-C comparisons (Fig. 1A), we found 5,516 and 968 of the proteins in Cufflinks 3.0 perfectly and near perfectly matched both Trinity and Oases models (Table 6), respectively. Among the 37,316 total hits, 26,702 (71.6%) fell into the same categories (P, N, O, B or W) from the comparisons, indicating that Trinity and Oases models are consistent in the protein-coding region at least. While 7,094 or 19.0% of the proteins were highly reliable (PP, NP, PN, and NN), 1,944 or 5.2% (BB, BW, WB, and WW) were probably modeled by Cufflinks only due to its high sensitivity (Table 1). The P/N/O proteins distributed normally over a broad size range; 68.7% of the B/W were short (<128 residues) (Table S4 and Fig. 6). Possibly the short proteins came from untranslated regions of some genes, noncoding RNAs, or small protein genes expressed but undetected. In contrast to these extreme categories, 18,509 or 49.6% of the 37,316 total hits belong to the OO comparison and further efforts were made to select useful information from these sequences.
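
The summary counts in this paragraph can be recomputed from the cross-tabulation in Table 6. The short Python sketch below is offered as an arithmetic check; the variable and function names are hypothetical, and the counts are copied from the table (rows: Trinity category, columns: Oases category).

```python
CATEGORIES = "PNOBW"
# Counts from Table 6; rows are Trinity categories, columns are Oases categories.
TABLE6 = [
    [5516, 407, 3490, 178, 228],
    [203, 968, 1511, 39, 22],
    [1592, 824, 18509, 796, 361],
    [22, 6, 213, 325, 89],
    [151, 21, 315, 146, 1384],
]


def cell_sum(trinity_cats, oases_cats):
    """Sum the cells whose row and column categories are in the given sets."""
    return sum(TABLE6[CATEGORIES.index(t)][CATEGORIES.index(o)]
               for t in trinity_cats for o in oases_cats)


total = cell_sum(CATEGORIES, CATEGORIES)
same_category = sum(TABLE6[i][i] for i in range(len(CATEGORIES)))
highly_reliable = cell_sum("PN", "PN")   # PP, PN, NP, NN
cufflinks_only = cell_sum("BW", "BW")    # BB, BW, WB, WW

if __name__ == "__main__":
    for label, count in [("same category", same_category),
                         ("highly reliable", highly_reliable),
                         ("Cufflinks only", cufflinks_only)]:
        print(f"{label}: {count} ({100 * count / total:.1f}% of {total})")
```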

Table 6. BLASTP comparison of Cufflinks 3.0, Trinity 4.0, and Oases 4.0 models.

Cufflinks		Oases

		P	N	O	B	W
Trinity	P	5,516	407	3,490	178	228
	N	203	968	1,511	39	22
	O	1,592	824	18,509	796	361
	B	22	6	213	325	89
	W	151	21	315	146	1,384

Fig. 6 — Size distributions of unique Cufflinks proteins in the P/N/O (red) and B/W (gray) categories after comparison with the *de novo* assemblies.

3.7. Comparison of proteins in UniProtKB Arthropoda and Cufflinks 3.0

Reliable proteins from other arthropods are useful for validating gene models. Therefore, we used the BLASTP algorithm and the original BLOSUM62 matrix to compare query (Q) proteins in Cufflinks 3.0 translations with UniProtKB Arthropoda (i.e. UniProt or U) as described in Section 2.5. Of the 37,316 unique proteins in Cufflinks 3.0, 30,313 or 81.2% had one to five matches; 7,003 had no match and may be unique to M. sexta. The length distributions were approximately normal for the proteins with 1 to 5 matches, but not for those with no match (Table S5, Fig. 7): 3,149 or 45.0% of the latter were shorter than 128 residues. Some of the small proteins may not exist, and it is also possible that BLASTP at the default settings is biased against short proteins. Nonetheless, assuming the sequence lengths of orthologous proteins are similar, we can exploit the links among UniProt, Cufflinks, Trinity, and Oases datasets to choose models by the ratio-based method to generate Selection 2 (Fig. 1B).

Fig. 7 — Size distributions of unique Cufflinks proteins with 0, 1, 2, 3, 4, and ≥ 5 UniProt hits.

3.8. Model selection among Cufflinks 3.0, Trinity 4.0, and Oases 4.0

For all hits with ML/CL > 0.7, we chose the longest models for Selection 1 (S1, Fig. 1A, Section 2.5). When Xs (caused by NNNs) were present in the Cufflinks translations, the use of CL* (i.e. 0.7CL), instead of CL, allowed the de novo proteins to survive and replace the ambiguous Cufflinks models. To complement S1, lengths of the Trinity, Oases, Cufflinks, and UniProt (U) proteins, correlated through Cufflinks IDs from the T-C, O-C, and C-U comparisons, were used to calculate the similarity ratios TUS, CUS* and OUS (Section 2.5, Fig. 1B). The models with ratios closest to 1.0 were kept in Selection 2 (S2). The correlated proteins in S1 and S2 were then cross-examined by ratio comparison (YUS_S2 – YUS_S1) (Fig. 1C). Of the 36,317 proteins without Xs, 29,612 had the same S1 and S2 result and a further 6,593 kept S1, so that 36,205 were retained via route 1; 35 kept S2 (route 2), and only 77 needed manual checking (route 3). Of the 999 sequences with Xs, 996 were selected via route 1 and three via route 2.

3.9. Generation of MCOT 1.0

During the comparison of OGS 1.0 and Cufflinks 3.0 translations (Section 3.5), we found that 4,855 B/W proteins in OGS 1.0 were not properly modeled by Cufflinks, possibly due to the limitation of detection sensitivity or scope. However, after these sequences were used as queries to search the de novo datasets with the length-based method, only 2,230 had B/W matches in both Trinity 4.0 and Oases 4.0 translations; the other ones were P/N/O. Because some of the P/N/O proteins were detected in the Cufflinks transcripts by TBLASTN, we realized that, due to its settings, TransDecoder filtered out 2,625 proteins, accounting for 4.94% of the 53,102 Cufflinks 3.0 proteins. These 4,855 B/W proteins in OGS 1.0 were compared with translations of Trinity 4.0 and Oases 4.0 and model selection was performed as per the Cufflinks 3.0 translations.

After comparing Trinity 4.0 and Oases 4.0 translations with Cufflinks 3.0 translations (Section 3.8), we selected the best model for each of the 37,316 Cufflinks proteins (COT). Pooling these with the 4,855 MAKER2 models (M) that had B/W matches to Cufflinks resulted in 42,171 IDs, some of which were selected more than once. After removing the repeated selections, 35,567 IDs remained. We then removed 2,036 redundant sequences and eliminated 2,763 and 764 sequences that were 100% identical to a part of another sequence after removal of 0 and 3 residues from each end, respectively, which left 30,004 protein sequences.
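
The elimination of sequences contained in another sequence can be sketched as follows. This is a simplified illustration rather than the procedure actually used: the function name `drop_contained` and the toy sequences are hypothetical, the search is quadratic, and each sequence is only tested against longer sequences that have already been retained.

```python
def drop_contained(proteins, trim=0):
    """Drop sequences that, after trimming `trim` residues from each end,
    occur inside a longer sequence that has already been retained."""
    # Longest first, so fragments are tested against the sequences that may contain them.
    ordered = sorted(proteins.items(), key=lambda kv: len(kv[1]), reverse=True)
    kept = {}
    for name, seq in ordered:
        core = seq[trim:len(seq) - trim] if trim else seq
        # An empty core (sequence too short to trim) is never treated as contained.
        if core and any(core in other for other in kept.values()):
            continue
        kept[name] = seq
    return kept


# Toy sequences (hypothetical).
proteins = {
    "full": "MKTLLVAGGSPQRWDE",
    "frag": "LLVAGGSPQ",         # exact part of "full"
    "ends": "GSHTLLVAGGSPAAA",   # part of "full" only after trimming 3 residues per end
    "other": "MSDNNPWWHC",       # unrelated
}
pass_0 = drop_contained(proteins, trim=0)
pass_3 = drop_contained(pass_0, trim=3)
print(sorted(pass_0), sorted(pass_3))
```

Running the 0-residue pass before the 3-residue pass mirrors the order in which the two counts are reported above.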

The intermediate BAM files generated by TopHat indicated that 20 to 30% of the RNA-Seq reads were not mapped to Msex 1.0 and may represent: 1) exons in the gaps, NNNs and W chromosome, 2) mitochondrial RNAs, or 3) others (e.g. polyA, mRNA of symbionts). To identify unmapped nuclear genes of M. sexta, we generated Trinity 4.0b using the unmapped reads (Section 2.3) and adopted relatively strict standards to scrutinize the Trinity sequences (Section 2.8). Of the 39,809 unique proteins (> 60 residues) translated from Trinity 4.0b, 10,534 had no match (W) with the 30,004 proteins and 212 had bad matches (B). Of these 10,746 B/W proteins, only 1,183 (1,162 unique) had good UniProt matches. Some of the other 9,563 came from bacteria. The 1,162 proteins were combined with the 30,004 to generate MCOT 1.0. Of the 31,166 protein sequences in MCOT 1.0, 1,162 are from Trinity 4.0b, 7,118 from Trinity 4.0, 2,559 from Oases 4.0, 3,715 from OGS 1.0 and 16,612 from Cufflinks 3.0. In other words, 31% of the MCOT 1.0 proteins (those from Trinity 4.0 or Oases 4.0) were updated from their original versions in the Cufflinks 3.0 translations, and 3.7% (those from Trinity 4.0b) were newly added, representing genes in unsequenced genome regions including the W chromosome.
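
As a quick consistency check, the counts in this paragraph can be tallied in a few lines of Python. The variable names are illustrative only; all numbers are taken from the text above.

```python
# Trinity 4.0b filtering: proteins with no match (W) or a bad match (B) to the 30,004 set.
no_match, bad_match = 10534, 212
bw_total = no_match + bad_match            # B/W proteins
good_uniprot = 1183                        # of which 1,162 are unique
without_good_hit = bw_total - good_uniprot

# Composition of MCOT 1.0 by final sequence source.
sources = {
    "Trinity 4.0b": 1162,
    "Trinity 4.0": 7118,
    "Oases 4.0": 2559,
    "OGS 1.0": 3715,
    "Cufflinks 3.0": 16612,
}
total = sum(sources.values())
de_novo_share = 100 * (sources["Trinity 4.0"] + sources["Oases 4.0"]) / total
unmapped_share = 100 * sources["Trinity 4.0b"] / total

print(bw_total, without_good_hit, total)
print(round(de_novo_share, 1), round(unmapped_share, 1))
```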

3.10. Comparison of MCOT 1.0 with OGS 2.0

There are 31,166 protein sequences in MCOT 1.0. Since they originated from MAKER2, Cufflinks or Trinity 4.0b models, we traced them back to their gene IDs and found that 18,089 protein-coding genes gave rise to 28,449 transcripts after model selection and 31,166 proteins (Table 7). In comparison, there are 14,165 genes, 18,979 transcripts, and 20,888 proteins in OGS 2.0 after filtering by the same method used for MCOT 1.0. Counted by the same standard, OGS 2.0 therefore has 21.7% fewer genes, 33.3% fewer transcripts and 33.0% fewer proteins than MCOT 1.0. We then used the protein sequences in MCOT 1.0 as queries to search OGS 2.0 using BLASTP and the modified scoring matrix. The results showed 8,034 P, 2,178 N, 11,747 O, 996 B, and 8,211 W, indicating that 32.8% were P/N, 37.7% were O, and 29.5% were B/W (Table 8). The differences between the two assemblies are substantial, and MCOT 1.0 is more inclusive than OGS 2.0 in terms of protein coverage. To facilitate the use of MCOT 1.0 in proteomic studies, we developed a naming system that provides information on the source, identification, and matching quality of each sequence, with reference to OGS 2.0 (Fig. 8).

Table 7. Summary statistics of MCOT 1.0 and OGS 2.0.

	MCOT 1.0	OGS 2.0
gene #	18,089	14,165
transcript #	28,449	18,979
final protein #	31,166	20,888

Table 8. Comparison of MCOT 1.0 and OGS 2.0.

Query to Subject	P	N	O	B	W
MCOT1.0 to OGS2.0	8,034	2,178	11,747	996	8,211
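
The percentages discussed in this section follow directly from Tables 7 and 8. The short Python sketch below recomputes them; the helper name `percent_fewer` is illustrative, and the counts are copied from the two tables.

```python
mcot = {"genes": 18089, "transcripts": 28449, "proteins": 31166}   # Table 7, MCOT 1.0
ogs2 = {"genes": 14165, "transcripts": 18979, "proteins": 20888}   # Table 7, OGS 2.0


def percent_fewer(reference, other):
    """Percentage by which `other` falls short of `reference`."""
    return 100 * (reference - other) / reference


fewer = {key: percent_fewer(mcot[key], ogs2[key]) for key in mcot}

# Table 8: match classes of MCOT 1.0 proteins searched against OGS 2.0.
classes = {"P": 8034, "N": 2178, "O": 11747, "B": 996, "W": 8211}
total = sum(classes.values())
shares = {
    "P/N": 100 * (classes["P"] + classes["N"]) / total,
    "O": 100 * classes["O"] / total,
    "B/W": 100 * (classes["B"] + classes["W"]) / total,
}

for key, value in {**fewer, **shares}.items():
    print(f"{key}: {value:.1f}%")
```

The class counts in Table 8 sum to the 31,166 proteins of MCOT 1.0, so the three shares account for every query sequence.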

Fig. 8. Naming of MCOT 1.0 sequences. In gene “MCOT.X#”, X stands for “M” (MAKER2), “C” (Cufflinks 3.0) or “W” (Trinity 4.0b) to indicate its original source (before BLAST search and model selection), and # is the 5-digit ID (e.g. 02367). Transcripts are named “MCOT.X#.#”, where the second # (1, 2, …) stands for the 1st/2nd/… transcript from the same gene. Likewise, proteins are named “MCOT.X#.#.#.XYZ#V”, where the third # represents the 1st/2nd/… protein from the same transcript. If one gene generates one transcript and then one protein, the second and third #'s are marked as “0”. Multicistronic genes are rare, but do exist in insects. The 2nd X indicates the final sequence source: “M”, “C”, “T” (Trinity 4.0), “O” (Oases 4.0), or “W” (unmapped, including those on the W chromosome). Y and Z are the quality of matching with Trinity 4.0 and Oases 4.0, respectively: “P” (perfect), “N” (near perfect), “O” (okay), “B” (bad), “W” (worst), or “X” (data unavailable). The fourth # is the number of kept UniProt hits (0 to 5, or X for data unavailable). V marks the quality of matching with OGS 2.0: if “P”, “N” or “O”, the corresponding OGS 2.0 ID is added next to “|”; otherwise “X” is added to indicate no good match.
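
For readers who want to process these names programmatically, the scheme can be expressed as a regular expression. The Python sketch below is one reading of the caption, not an official parser: the group names are invented, the “|” suffix is treated as optional, and the example name (including the OGS 2.0 ID after “|”) is a made-up placeholder rather than a real MCOT 1.0 entry.

```python
import re

# One interpretation of "MCOT.X#.#.#.XYZ#V" plus an optional "|<OGS 2.0 ID>" suffix.
MCOT_PROTEIN = re.compile(
    r"^MCOT\.(?P<origin>[MCW])(?P<gene>\d{5})"            # original source and 5-digit gene ID
    r"\.(?P<transcript>\d+)\.(?P<protein>\d+)"            # transcript and protein numbers
    r"\.(?P<source>[MCTOW])"                              # final sequence source
    r"(?P<trinity>[PNOBWX])(?P<oases>[PNOBWX])"           # match quality vs Trinity 4.0 / Oases 4.0
    r"(?P<uniprot_hits>[0-5X])"                           # number of kept UniProt hits
    r"(?P<ogs2>[PNOX])"                                   # match quality vs OGS 2.0
    r"(?:\|(?P<ogs2_id>\S+))?$"                           # OGS 2.0 ID, if any
)


def parse_mcot_protein_name(name):
    """Split an MCOT 1.0 protein name into its fields; raise ValueError if it does not fit."""
    match = MCOT_PROTEIN.match(name)
    if match is None:
        raise ValueError(f"not a recognised MCOT protein name: {name!r}")
    return match.groupdict()


# Hypothetical name built from the rules in the caption.
fields = parse_mcot_protein_name("MCOT.C02367.0.0.TPN5O|Msex2.00001")
print(fields["origin"], fields["gene"], fields["source"], fields["ogs2_id"])
```

Names that do not fit this reading of the caption raise a `ValueError`, which makes any mismatch with the released files visible instead of silently mis-parsing them.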

3.11. Additional information from Cufflinks 3.0

A major part of MCOT 1.0 is refined from the Cufflinks 3.0 models, which include 36,027 genes and 62,497 transcripts (Table 3). Using TransDecoder, we found that 20,289 of the Cufflinks genes were not translated to proteins (based on the definition in Section 2.4), suggesting that most of them are noncoding. In total, 22,615 of the Cufflinks genes are absent from MCOT 1.0; the difference of 2,326 indicates that some of them may have been correctly merged during MCOT 1.0 generation. Of the 20,289 noncoding genes, the most complex one (4,000 bp in length, 71.5% A/T) has 33 alternative splicing forms and could be a long noncoding RNA. Length distributions of the coding and noncoding transcripts in Cufflinks 3.0 (Fig. 9) were strikingly different: the coding ones are much longer. Surprisingly, 4,144 noncoding genes are 2,049 to 8,192 bp long and 183 are > 8,192 bp. While MCOT 1.0 focuses on structural genes, the noncoding genes remain to be explored in the future.

Fig. 9. Size distributions of the coding and noncoding transcripts in Cufflinks 3.0.

3.12. Summary

We developed an integrated approach to select the best models based on BLASTP comparison of the Cufflinks dataset with sequences in OGS 1.0 and the de novo assemblies. The modified scoring matrix greatly simplified the sequence comparison by keeping pairs with >98% identity. Correlated by Cufflinks IDs, the models in different assemblies (Trinity 4.0, Oases 4.0, OGS 1.0, and UniProt) were compared and chosen based on length-derived parameters. By incorporating unique sequences in OGS 1.0 and unmapped genes in Trinity 4.0b, we generated MCOT 1.0, which has about 49% more proteins than OGS 2.0 (31,166 vs. 20,888). As extensive RNA-Seq data are now available for most genome projects, automation of our procedures should help produce comprehensive models of protein-coding genes in the future.

Supplementary Material

Table S1. Datasets generated or used in this study and their descriptions

Table S2. Transcript length distribution of different modeling programs

Table S3. Unique protein length distribution of different modeling programs

Table S4. Length distribution of the Cufflinks 3.0 proteins with P/N/O or B/W matches

Table S5. Length distribution of Cufflinks 3.0 proteins with different numbers of hits in UniProt

NIHMS657019-supplement.docx (87.1 KB, docx)

Highlights.

Generated genome-dependent and independent assemblies to support manual gene annotation
Developed methods to compare and select structural gene models for MCOT 1.0
Validated 5,933 OGS 2.0 models, found differences in 6,820 models, and discovered 5,336 new ones

Acknowledgments

This study was supported by NIH grant GM58634. We thank Drs. Ulrich Melcher and Jamie Walters for their critical comments on the manuscript. The Manduca Genome Project, which provided Msex 1.0, OGS 1.0, OGS 2.0, Cufflinks 1.0, and RNA-Seq datasets, was funded by DARPA (Gary Blissard, Boyce Thompson Institute) and NIH grant GM41247 (Michael Kanost, Kansas State University). This work was approved for publication by the Director of the Oklahoma Agricultural Experiment Station, and supported in part under project OKLO2450 (to H. Jiang). Computation for this project was performed at the OSU High Performance Computing Center, supported in part through NSF grant OCI-1126330.

Abbreviations

OGS: official gene set
ORF: open reading frame
L: length
ML: match length
QL: query length
SL: subject length
M: MAKER
C: Cufflinks
T: Trinity
O: Oases
U: UniProt Arthropoda
Y: C/T/O
S: similarity ratio of lengths
MLI: match length index
S1/S2: Selection 1 or 2
“P”: perfect
“N”: near perfect
“O”: okay
“B”: bad
“W”: worst

Footnotes

The sequence files of MCOT 1.0 transcripts and proteins are available to download at ftp://ftp.bioinformatics.ksu.edu/pub/Manduca/OGS2/OSU_files/. BLAST search of the two datasets can be performed at http://agripestbase.org/manduca/?q=blast.

References

Arrese EL, Soulages JL. Insect fat body: energy, metabolism, and regulation. Ann Rev Entomol. 2010;55:207–225. doi: 10.1146/annurev-ento-112408-085356.
Camacho C, Coulouris G, Avagyan V, Ma N. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421. doi: 10.1186/1471-2105-10-421.
Cao X, He Y, Hu Y, Zhang X, Wang Y, Zou Z, Chen Y, Blissard GW, Kanost MR, Jiang H. Sequence conservation, phylogenetic relationships, and expression profiles of nondigestive serine proteases and serine protease homologs in Manduca sexta. 2014 doi: 10.1016/j.ibmb.2014.10.006. in review.
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29:644–652. doi: 10.1038/nbt.1883.
Grosse-Wilde E, Kuebler LS, Bucks S, Vogel H, Wicher D, Hansson BS. Antennal transcriptome of Manduca sexta. Proc Natl Acad Sci USA. 2011;108:7449–7454. doi: 10.1073/pnas.1017963108.
Gunaratna R, Jiang H. A comprehensive analysis of Manduca sexta immunotranscriptome. Dev Com Immunol. 2013;39:388–398. doi: 10.1016/j.dci.2012.10.004.
Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, Couger MB, Eccles D, Li B, Lieber M, Macmanes MD, Ott M, Orvis J, Pochet N, Strozzi F, Weeks N, Westerman R, William T, Dewey CN, Henschel R, Leduc RD, Friedman N, Regev A. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc. 2013;8:1494–1512. doi: 10.1038/nprot.2013.084.
Hopkins T, Krchma L, Ahmad S, Kramer K. Pupal cuticle proteins of Manduca sexta: characterization and profiles during sclerotization. Insect Biochem Mol Biol. 2000;30:19–27. doi: 10.1016/s0965-1748(99)00091-0.
Jiang H, Vilcinskas A, Kanost MR. Immunity in lepidopteran insects. In “Invertebrate Immunity”. In: Söderhäll K, editor. Adv Exp Med Biol. Vol. 708. 2010. pp. 181–204.
Kanost MR, Kawooya JK, Law JH, Ryan RO, Van Heusden MC, Ziegler R. Insect hemolymph proteins. Adv Insect Physiol. 1990;22:299–396.
Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–359. doi: 10.1038/nmeth.1923.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. The Sequence Alignment/Map format and SAM tools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
Pauchet Y, Wilkinson P, Vogel H, Nelson DR, Reynolds SE, Heckel DG, ffrench-Constant RH. Pyrosequencing the Manduca sexta larval midgut transcriptome: messages for digestion, detoxification and defence. Insect Mol Biol. 2010;19:61–75. doi: 10.1111/j.1365-2583.2009.00936.x.
Riddiford L, Hiruma K, Zhou X, Nelson CA. Insights into the molecular basis of the hormonal control of molting and metamorphosis from Manduca sexta and Drosophila melanogaster. Insect Biochem Mol Biol. 2003;33:1327–1338. doi: 10.1016/j.ibmb.2003.06.001.
Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 2011;12:R22. doi: 10.1186/gb-2011-12-3-r22.
Schulz MH, Zerbino DR, Vingron M, Birney E. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics. 2012;28:1086–1092. doi: 10.1093/bioinformatics/bts094.
Shields V, Hildebrand JG. Recent advances in insect olfaction, specifically regarding the morphology and sensory physiology of antennal sensilla of the female sphinx moth Manduca sexta. Microsc Res Tech. 2001;55:307–329. doi: 10.1002/jemt.1180.
Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25:1105–1111. doi: 10.1093/bioinformatics/btp120.
Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012;7:562–578. doi: 10.1038/nprot.2012.016.
X et al., 2014.
Yandell M, Ence D. A beginner's guide to eukaryotic genome annotation. Nat Rev Genet. 2012;13:329–342. doi: 10.1038/nrg3174.
Zhang S, Gunaratna RT, Zhang X, Najar F, Wang Y, Roe B, Jiang H. Pyrosequencing-based expression profiling and identification of differentially regulated genes from Manduca sexta, a lepidopteran model insect. Insect Biochem Mol Biol. 2011;41:733–746. doi: 10.1016/j.ibmb.2011.05.005.
Zou Z, Najar F, Wang Y, Roe B, Jiang H. Pyrosequence analysis of expressed sequence tags for Manduca sexta hemolymph proteins involved in immune responses. Insect Biochem Mol Biol. 2008;38:677–682. doi: 10.1016/j.ibmb.2008.03.009.
