Abstract
SilkDB is an open-access database for the genome biology of the silkworm (Bombyx mori). Since the draft sequence was completed and SilkDB was first released 5 years ago, we have collaborated with other groups to make considerable progress in silkworm genome research, including the completion of a new high-quality assembly of the silkworm genome sequence and the construction of a genome-wide microarray to survey gene expression profiles. To accommodate these new genomic data and house more comprehensive genomic information, we have rebuilt the SilkDB database with new web interfaces. In the new version (v2.0) of SilkDB, we have updated the genomic data, including the genome assembly, gene annotation, chromosomal mapping and orthologous relationships, as well as experimental data such as microarray expression data, Expressed Sequence Tags (ESTs) and corresponding references. Several new tools, including SilkMap, the Silkworm Chromosome Browser (SCB) and BmArray, have been developed to provide convenient access to silkworm genomic data. SilkDB is publicly available at its new URL, http://www.silkdb.org.
INTRODUCTION
The silkworm, Bombyx mori, is one of the most economically important insects and was domesticated for silk production ∼5000 years ago. It still plays an important role in raising farmers' incomes in many developing countries, such as China and India (1,2). The silkworm is also one of the best-characterized models for biochemical, molecular genetic and genomic studies of lepidopteran insects (3,4).
In 2004, 6× (5) and 3× (6) draft genome sequences of the silkworm were completed by Chinese and Japanese teams, respectively. Subsequently, SilkDB (v1.0) was constructed to provide access to the 6× genome data (7). Since the first version of SilkDB was released, the database has been accessed more than 230 000 times. This open resource has greatly facilitated functional genomics research on the silkworm and some other insects.
Recently, the Chinese and Japanese groups cooperated to integrate the silkworm genomic data, including the two sets of draft sequences and the paired-end sequences from fosmids and BACs, and completed a refined silkworm genome assembly with 8.5× coverage. The quality of the new assembly is significantly improved: the N50 scaffold size reaches ∼3.7 Mb for the 432 Mb genome (8). In addition, more than 377.5 Mb (87.4%) of the genome sequence could be assigned to the 28 chromosomes by integrating a high-density single nucleotide polymorphism (SNP) linkage map (1755 markers) (9). Our group has also designed and constructed a genome-wide microarray with 22 987 probes covering the genes predicted from the 6× draft genome sequence, and used it to survey gene expression profiles in multiple silkworm tissues (10).
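For readers unfamiliar with the statistic, the N50 scaffold size quoted above is the length L such that scaffolds of length L or longer contain at least half of the assembled bases. The following is a minimal Python sketch of that calculation, not the procedure used by the consortium; the function name and the toy lengths are illustrative assumptions and are not SilkDB data.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold lengths (in bp)."""
    if not lengths:
        raise ValueError("no scaffold lengths given")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest scaffold downwards until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy lengths for illustration only; these are not SilkDB data.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

With the toy input, the two longest scaffolds already cover half of the 300 bp total, so the N50 is the length of the second one.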
To accommodate the new genomic data and house more comprehensive genomic information, we have rebuilt the SilkDB database with new user-friendly interfaces. Several tools were also developed to provide convenient access to the data and are linked from the new version. Here, we report the progress made in the new version (v2.0) of SilkDB and describe the updated datasets and the improvements in performance.
DATA UPDATE AND INTEGRATION
Genome data and repeat sequences
The new silkworm genome assembly consists of 43 622 scaffolds spanning ∼432 Mb. The genome sequence is significantly more contiguous than the previous version: the 109 scaffolds longer than 1 Mb account for ∼390.3 Mb. In addition, a total of 1668 repeat sequences were identified by the de novo repeat annotation strategy of ReAS (11); together with 17 known silkworm transposable elements in GenBank, these account for ∼43.6% of the silkworm genome. This indicates that the silkworm genome contains a larger proportion of repeat sequences than those of other insects, such as Anopheles gambiae (16%) (12), Apis mellifera (1%) (13) and Drosophila melanogaster (2.7–25%) (14). The complete new assembly of the silkworm genome sequence has been integrated into SilkDB (v2.0).
Gene dataset and gene functional annotation
To obtain a precise gene dataset, a variety of strategies were used (8). A consensus nonredundant dataset of 14 623 protein-coding genes was built by merging the different gene datasets using GLEAN (http://sourceforge.net/projects/glean-gene). This GLEAN gene dataset was used as the reference dataset and has been integrated into the updated SilkDB (v2.0). The predicted noncoding genes, including 206 miRNAs, 147 rRNAs and 498 tRNAs, were also integrated into the database.
The functions of all protein-coding genes were annotated using several methods. First, because genes with similar sequences may have similar functions, all genes were searched with BLAST against the nonredundant databases downloaded from NCBI to find homologs. A total of 12 246 genes (83.7%) had corresponding homologs at an E-value threshold of 1E-5. Second, because protein domains provide clues to gene function, all silkworm genes were queried against the InterPro database (15). As a result, 8522 genes (58.2%) were found to contain 2509 kinds of known domains. Based on the domain assignments, 5971 genes could be classified by Gene Ontology (GO) terms, a controlled vocabulary for describing the molecular function, biological process and cellular component of gene products (16). Third, gene families were identified among B. mori, D. melanogaster, Aedes aegypti, A. gambiae, A. mellifera, Homo sapiens, Gallus gallus, Fugu rubripes and Caenorhabditis elegans using the TreeFam strategy (17). A total of 6669 silkworm genes are distributed in 1779 gene families. Four hundred families appear to be insect specific, of which 245 are silkworm specific. These genes may have been selected during evolution to serve insect-specific or silkworm-specific biological functions. All of the above gene function annotations have been integrated into the updated version of SilkDB.
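The homology step above amounts to keeping, for each gene, any hit at or below the stated E-value threshold. A minimal Python sketch follows. It assumes the hits are available in the standard 12-column BLAST tabular layout (query ID in column 1, E-value in column 11); the text does not state which output format was actually used, and the function name and all record IDs here are invented for illustration.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-5  # threshold stated in the text


def genes_with_homologs(tabular_handle, cutoff=E_VALUE_CUTOFF):
    """Return the set of query IDs with at least one hit at or below the cutoff.

    Assumes the standard 12-column BLAST tabular layout, in which
    column 1 is the query ID and column 11 is the E-value.
    """
    hits = set()
    for row in csv.reader(tabular_handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        if float(row[10]) <= cutoff:
            hits.add(row[0])
    return hits


# Hypothetical records for illustration; the gene and subject IDs are invented.
example = io.StringIO(
    "geneA\tsubj1\t85.0\t200\t30\t0\t1\t200\t5\t204\t1e-50\t180\n"
    "geneB\tsubj2\t40.0\t60\t36\t0\t1\t60\t10\t69\t0.5\t30\n"
)
annotated = genes_with_homologs(example)
print(sorted(annotated))
```

Dividing the size of the resulting set by the total number of genes gives the kind of proportion reported above (12 246 of 14 623 genes).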
Experimental information
We have also focused on integrating experimental data into SilkDB. Currently, we have collected 184 509 Expressed Sequence Tags (ESTs) and full-length cDNAs, which contain useful information on gene expression and function. A total of 9056 genes are supported by ESTs under the threshold of ‘alignment length >100 and identity >80%’. The silkworm microarray data are also included in SilkDB. As reported previously, we designed and constructed a genome-wide microarray with 22 987 silkworm gene probes covering the genes predicted from the 6× draft genome sequence (10). The microarray was used to monitor gene expression profiles in 10 representative samples from Day 3 fifth-instar larvae, and a total of 10 393 active transcripts were detected. The results provide a rich data resource on the expression profiles and functions of silkworm genes, especially the 1642 tissue-specific genes, which showed strong relevance to the physiological functions of the corresponding tissues (10). In addition, references related to gene functions were collected and integrated into SilkDB.
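The EST support criterion quoted above can be read as two strict inequalities applied to each gene-to-EST alignment. The sketch below shows one way to express it in Python. The `Alignment` record, its field names and the example values are assumptions made for illustration; they do not describe the SilkDB pipeline or its data.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    gene_id: str
    est_id: str
    aligned_length: int      # length of the aligned region
    percent_identity: float  # on a 0-100 scale


def est_supported_genes(alignments, min_length=100, min_identity=80.0):
    """Return genes with at least one EST alignment exceeding both thresholds."""
    return {
        a.gene_id
        for a in alignments
        if a.aligned_length > min_length and a.percent_identity > min_identity
    }


# Invented alignments for illustration only.
toy = [
    Alignment("gene1", "est1", 350, 98.2),  # passes both thresholds
    Alignment("gene2", "est2", 90, 99.0),   # too short
    Alignment("gene3", "est3", 400, 75.0),  # identity too low
    Alignment("gene3", "est4", 100, 95.0),  # length must be strictly > 100
]
print(est_supported_genes(toy))
```

Because a gene needs only one qualifying alignment, additional weak alignments to the same gene do not affect the result.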
DATABASE ACCESS
Along with the update of the silkworm genomic data, the way these data are managed and accessed has been redesigned. The entire silkworm genome, gene dataset, gene annotation, experimental data and reference information are stored in a MySQL (http://www.mysql.org/) database. All of this information is browsed through GBrowse (18), which replaces the previous tool, MapView (7). GBrowse is one of the most popular genome viewers for manipulating and displaying genome annotations, and has been widely used in databases for a variety of model organisms, such as FlyBase (19), WormBase (20) and SGD (21). Using GBrowse, users can easily browse any region of interest in the silkworm genome. For a given position on a scaffold, a variety of feature tracks can be displayed, including protein-coding genes, noncoding genes, GC content, frame usage, restriction sites and repetitive sequences (Figure 1D).
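Displaying the tracks for a region comes down to retrieving every stored feature whose interval overlaps the requested window on a scaffold. The Python sketch below illustrates that overlap query only. It uses an in-memory SQLite database as a stand-in for MySQL, and the table name, column names and rows are hypothetical; neither the SilkDB schema nor the GBrowse storage layer is described here.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL server; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feature (name TEXT, track TEXT, scaffold TEXT, "
    "feat_start INTEGER, feat_end INTEGER)"
)
conn.executemany(
    "INSERT INTO feature VALUES (?, ?, ?, ?, ?)",
    [
        ("geneA", "protein_coding", "scaffold_1", 1000, 4000),
        ("repeatX", "repeat", "scaffold_1", 3500, 3800),
        ("geneB", "protein_coding", "scaffold_1", 9000, 12000),
        ("geneC", "protein_coding", "scaffold_2", 1500, 2500),
    ],
)


def features_in_region(conn, scaffold, start, end):
    """Return (name, track) for features overlapping [start, end] on a scaffold."""
    # Two closed intervals overlap when each one starts before the other ends.
    rows = conn.execute(
        "SELECT name, track FROM feature "
        "WHERE scaffold = ? AND feat_start <= ? AND feat_end >= ? "
        "ORDER BY feat_start",
        (scaffold, end, start),
    )
    return rows.fetchall()


print(features_in_region(conn, "scaffold_1", 3000, 5000))
```

The overlap test uses closed intervals, so a feature that merely touches the window boundary is still returned.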
Clicking a gene in the protein-coding gene track in GBrowse opens its Gene Page, which is the heart of the updated SilkDB (Figure 1E). A Gene Page is available for each gene and contains all related information, including the gene symbol, position, definition, EST evidence, corresponding microarray probes, domain assignment, GO annotation, gene family, RefSeq ID, reference information (title, author and PubMed ID), BLAST homologs, genome sequence, CDS sequence and deduced protein sequence. It also provides hyperlinks where cross-references to related database entries are available. For example, clicking on a microarray probe opens BmArray, our newly developed web-based viewer for displaying microarray data; GO terms are linked to the EMBL database (22); and RefSeq IDs are linked to GenBank (23) (Supplementary Figure S1).
IMPROVED DATABASE USABILITY
To facilitate data analysis, the updated SilkDB provides user-friendly interfaces, generated by Pise (24), for a variety of common tools. One of the most useful is BLAST. Users can run BLAST searches against the scaffolds, genes and ESTs, as well as the genomes and genes of other insects. On the result page of a BLAST search, each hit is linked to the GBrowse view of the sequence. Two other tools, the Silkworm Chromosome Browser (SCB) and SilkMap, were developed to help users exploit the chromosomal information that is newly available for the current genome assembly. SCB shows the positions of scaffolds on the 28 chromosomes (Figure 1C), which enables users to access any chromosomal region of interest. SilkMap anchors nucleotide or protein sequences on the silkworm chromosomes and provides a graphical view of the sequence locations (Figure 1A and B). Through SilkMap, users can learn not only the location of the query sequence on a chromosome, but also the copies of the query sequence present in the silkworm genome. The subject position is linked to the detailed GBrowse view. In addition, a silkworm Gene Ontology Browse tool was developed to let users retrieve silkworm genes by GO term.
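Anchoring a hit on a chromosome requires translating a position on a scaffold into a position on the chromosome that carries that scaffold. The Python sketch below shows that coordinate conversion under simple assumptions: each scaffold has a known leftmost chromosome coordinate and an orientation. The `Placement` record, its fields and the example coordinates are hypothetical and do not reflect how SilkMap or SCB store their data.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    chromosome: int  # chromosome number, 1 to 28
    chr_start: int   # leftmost 1-based chromosome coordinate occupied by the scaffold
    length: int      # scaffold length in bp
    strand: str      # "+" if placed forward, "-" if reverse-complemented


def to_chromosome(placement, scaffold_pos):
    """Convert a 1-based scaffold position to a 1-based chromosome position."""
    if not 1 <= scaffold_pos <= placement.length:
        raise ValueError("position lies outside the scaffold")
    if placement.strand == "+":
        return placement.chr_start + scaffold_pos - 1
    # Reverse-placed scaffold: its last base sits at chr_start.
    return placement.chr_start + placement.length - scaffold_pos


# Hypothetical placements; real coordinates would come from the linkage map.
placements = {
    "scaffold_A": Placement(chromosome=1, chr_start=1, length=1000, strand="+"),
    "scaffold_B": Placement(chromosome=1, chr_start=1101, length=500, strand="-"),
}

hit = ("scaffold_B", 1)  # e.g. the start of a BLAST hit on scaffold_B
p = placements[hit[0]]
print(p.chromosome, to_chromosome(p, hit[1]))
```

A query with several copies in the genome would simply yield several such converted positions, one per hit.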
FUTURE DIRECTIONS
We will continue to improve the quality of the assembly and annotation of the silkworm genome sequence. Updated data will be added to SilkDB as soon as they become available. At the same time, we will manually curate the information in the database. Users are encouraged to submit corrections or additional information on predicted genes or the genome sequence to SilkDB by e-mail. Several silkworm research projects are currently ongoing, such as microarray surveys of gene expression profiles at different developmental stages, microRNA expression experiments and the silkworm SNP project. We plan to annotate these data to uncover their biological meaning, and to integrate the experimental data and analysis results into the database in the future.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
FUNDING
National Basic Research Program of China (No. 2005CB121000); Program for Changjiang Scholars and Innovative Research Team in University of China (No. IRT0750); Programme of Introducing Talents of Discipline to Universities (No. B07045); National Natural Science Foundation of China (No. 30800804); National Hi-Tech Research and Development Program of China (No. 2006AA10A118). Funding for open access charge: National Basic Research Program of China (No. 2005CB121000).
Conflict of interest statement. None declared.
ACKNOWLEDGEMENTS
We wish to thank members of the International Silkworm Genome Consortium for their efforts to improve the quality of the silkworm genome sequence and annotations.
REFERENCES
1. Goldsmith MR, Shimada T, Abe H. The genetics and genomics of the silkworm, Bombyx mori. Annu. Rev. Entomol. 2005;50:71–100. doi: 10.1146/annurev.ento.50.071803.130456.
2. Prasad MD, Muthulakshmi M, Arunkumar KP, Madhu M, Sreenu VB, Pavithra V, Bose B, Nagarajaram HA, Mita K, Shimada T, et al. SilkSatDb: a microsatellite database of the silkworm, Bombyx mori. Nucleic Acids Res. 2005;33:D403–D406. doi: 10.1093/nar/gki099.
3. Papanicolaou A, Gebauer-Jung S, Blaxter ML, Owen McMillan W, Jiggins CD. ButterflyBase: a platform for lepidopteran genomics. Nucleic Acids Res. 2008;36:D582–D587. doi: 10.1093/nar/gkm853.
4. Arunkumar KP, Tomar A, Daimon T, Shimada T, Nagaraju J. WildSilkbase: an EST database of wild silkmoths. BMC Genomics. 2008;9:338. doi: 10.1186/1471-2164-9-338.
5. Xia Q, Zhou Z, Lu C, Cheng D, Dai F, Li B, Zhao P, Zha X, Cheng T, Chai C, et al. A draft sequence for the genome of the domesticated silkworm (Bombyx mori). Science. 2004;306:1937–1940. doi: 10.1126/science.1102210.
6. Mita K, Kasahara M, Sasaki S, Nagayasu Y, Yamada T, Kanamori H, Namiki N, Kitagawa M, Yamashita H, Yasukochi Y, et al. The genome sequence of silkworm, Bombyx mori. DNA Res. 2004;11:27–35. doi: 10.1093/dnares/11.1.27.
7. Wang J, Xia Q, He X, Dai M, Ruan J, Chen J, Yu G, Yuan H, Hu Y, Li R, et al. SilkDB: a knowledgebase for silkworm biology and genomics. Nucleic Acids Res. 2005;33:D399–D402. doi: 10.1093/nar/gki116.
8. International Silkworm Genome Consortium. The genome of a lepidopteran model insect, the silkworm Bombyx mori. Insect Biochem. Mol. Biol. 2008;38:1036–1045. doi: 10.1016/j.ibmb.2008.11.004.
9. Yamamoto K, Nohata J, Kadono-Okuda K, Narukawa J, Sasanuma M, Sasanuma SI, Minami H, Shimomura M, Suetsugu Y, Banno Y, et al. A BAC-based integrated linkage map of the silkworm Bombyx mori. Genome Biol. 2008;9:R21. doi: 10.1186/gb-2008-9-1-r21.
10. Xia Q, Cheng D, Duan J, Wang G, Cheng T, Zha X, Liu C, Zhao P, Dai F, Zhang Z, et al. Microarray-based gene expression profiles in multiple tissues of the domesticated silkworm, Bombyx mori. Genome Biol. 2007;8:R162. doi: 10.1186/gb-2007-8-8-r162.
11. Li R, Ye J, Li S, Wang J, Han Y, Ye C, Wang J, Yang H, Yu J, Wong GK, et al. ReAS: recovery of ancestral sequences for transposable elements from the unassembled reads of a whole genome shotgun. PLoS Comput. Biol. 2005;1:e43. doi: 10.1371/journal.pcbi.0010043.
12. Holt RA, Subramanian GM, Halpern A, Sutton GG, Charlab R, Nusskern DR, Wincker P, Clark AG, Ribeiro JM, Wides R, et al. The genome sequence of the malaria mosquito Anopheles gambiae. Science. 2002;298:129–149. doi: 10.1126/science.1076181.
13. Honeybee Genome Sequencing Consortium. Insights into social insects from the genome of the honeybee Apis mellifera. Nature. 2006;443:931–949. doi: 10.1038/nature05260.
14. Clark AG, Eisen MB, Smith DR, Bergman CM, Oliver B, Markow TA, Kaufman TC, Kellis M, Gelbart W, Iyer VN, et al. Evolution of genes and genomes on the Drosophila phylogeny. Nature. 2007;450:203–218. doi: 10.1038/nature06341.
15. Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Das U, Daugherty L, Duquenne L, et al. InterPro: the integrative protein signature database. Nucleic Acids Res. 2009;37:D211–D215. doi: 10.1093/nar/gkn785.
16. Gene Ontology Consortium. The Gene Ontology project in 2008. Nucleic Acids Res. 2008;36:D440–D444. doi: 10.1093/nar/gkm883.
17. Li H, Coghlan A, Ruan J, Coin LJ, Heriche JK, Osmotherly L, Li R, Liu T, Zhang Z, Bolund L, et al. TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res. 2006;34:D572–D580. doi: 10.1093/nar/gkj118.
18. Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, et al. The generic genome browser: a building block for a model organism system database. Genome Res. 2002;12:1599–1610. doi: 10.1101/gr.403602.
19. Tweedie S, Ashburner M, Falls K, Leyland P, McQuilton P, Marygold S, Millburn G, Osumi-Sutherland D, Schroeder A, Seal R, et al. FlyBase: enhancing Drosophila Gene Ontology annotations. Nucleic Acids Res. 2009;37:D555–D559. doi: 10.1093/nar/gkn788.
20. Rogers A, Antoshechkin I, Bieri T, Blasiar D, Bastiani C, Canaran P, Chan J, Chen WJ, Davis P, Fernandes J, et al. WormBase 2007. Nucleic Acids Res. 2008;36:D612–D617. doi: 10.1093/nar/gkm975.
21. Hong EL, Balakrishnan R, Dong Q, Christie KR, Park J, Binkley G, Costanzo MC, Dwight SS, Engel SR, Fisk DG, et al. Gene Ontology annotations at SGD: new data sources and annotation methods. Nucleic Acids Res. 2008;36:D577–D581. doi: 10.1093/nar/gkm909.
22. Stoesser G, Baker W, van den Broek A, Camon E, Garcia-Pastor M, Kanz C, Kulikova T, Leinonen R, Lin Q, Lombard V, et al. The EMBL Nucleotide Sequence Database. Nucleic Acids Res. 2002;30:21–26. doi: 10.1093/nar/30.1.21.
23. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2009;37:D26–D31. doi: 10.1093/nar/gkn723.
24. Letondal C. A web interface generator for molecular biology programs in Unix. Bioinformatics. 2001;17:73–82. doi: 10.1093/bioinformatics/17.1.73.