Abstract
The suppression subtractive hybridization (SSH) approach, a PCR based approach which amplifies differentially expressed cDNAs (complementary DNAs), while simultaneously suppressing amplification of common cDNAs, was employed to identify immuneinducible genes in insects. This technique has been used as a suitable tool for experimental identification of novel genes in eukaryotes as well as prokaryotes; whose genomes have been sequenced, or the species whose genomes have yet to be sequenced. In this article, I have proposed a method for in silico functional characterization of immune-inducible genes from insects. Apart from immune-inducible genes from insects, this method can be applied for the analysis of genes from other species, starting from bacteria to plants and animals. This article is provided with a background of SSH-based method taking specific examples from innate immune-inducible genes in insects, and subsequently a bioinformatics pipeline is proposed for functional characterization of newly sequenced genes. The proposed workflow presented here, can also be applied for any newly sequenced species generated from Next Generation Sequencing (NGS) platforms.
Keywords: SSH, NGS, Immunity, Insects, Functional annotation, Bioinformatics
Background
Immune-related genes in insects:
Insects have been remarkably successful in evolution. Current estimates are that they account for major species richness dominating any other class of biota [1], and relatively account for 80% of all animal species to date [2]. To fight against infection or wounding, insect's immune system employ cellular and humoral immune system, although they do not possess adaptive immunity as present in case of higher vertebrates [3–4]. Despite of this fact, insects' innate immune system is one of the ancient immune system constituting first line of defense, and conserved in different organisms, like in higher vertebrate, human.
Review on research progress using SSH and Bioinformatics:
During these years, research on insects (physiology, biochemistry, and immunology) has produced voluminous information on the genome or transcriptome sequences of several insects, and bioinformatics has played a major role in the analysis, and management of the sequence data. There is remarkable progress in genome sequencing from insects after the first genome sequence of the fruit fly, Drosophila melanogaster in the year 2000 [5]. SSH-based sequencing technology has been the sequencing method used since 1960s [6–7] as a low cost alternatives for whole genome/ or transcriptome sequencing methods. The suppression subtractive hybridization (SSH) approach, a PCR based method which amplifies differentially expressed cDNAs, while simultaneously suppressing amplification of common cDNAs, was employed to identify immune-inducible genes in insects. This technique has been used as a suitable tool for experimental identification of novel genes in model insects whose genome has been sequenced such as the red flour beetle, Tribolium castaneum [8], or the pea aphid, Acyrthosiphon pisum [9], or the silkworm, Bombyx mori [10]. In addition, and despite issues related to the sensitivity of this method, the SSH approach has been proven to be useful in targeted screening for immunerelated genes in ancient insect, Thermobia domestica, for which no genomic data are available [11], or in insects which are adapted to habitats contaminated with high loads of microbes such as the rat-tailed maggots of the drone fly Eristalis tenax [12] capable of living in urine and cesspools, or the medicinal maggots of the green blow fly Lucilia sericata which reproduce in septic wounds or in carrions [13], or the burying beetle Nicrophorus vespilloides [4] capable of living in cadaver rich with microbes and fungi. By the use of SSH-based approach, there are several other projects undertaken on different insects [14–20]. Table 1 (see supplementary material) represents a list of projects undertaken by different research groups in identifying immune-related genes in insects using SSH-based approach.
Suppression subtractive hybridization (SSH):
In principle, the aim of the SSH-based sequencing approach is to build unbiased cDNAs library. The cDNA library will be the target of the subtraction (the tester population), which is denatured, and hybridized to another library that is present in the excess (the driver cDNAs population). Fragments common to both are annealed to each other, whereas tester specific products remain single stranded. Sequencing steps in PCRselect cDNA subtraction: The cDNA is synthesized from RNA generated from two types of cells being compared (tester and driver cDNAs); the tester and driver cDNAs are digested with RsaI to produce blunt ends; The tester DNA is divided into two parts, and each part is ligated with a different cDNA adaptor. As the ends of the adaptor do not contain a phosphate group, only one end of the adaptor attaches to the 5´ ends of the cDNA. Furthermore, the two adaptors contain a portion of identical sequence in order to allow annealing of the PCR primer once the recessed ends have been filled in. Two hybridization steps are performed. In the first step, an excess of driver is added to each sample of tester cDNA. “The samples are then heat denatured and allowed for annealing process and at last generating a, b, c, d types of molecules from each sample” [7]. The concentration among high and low abundance sequences are equalized among type a molecules and at the same time type a molecules are significantly enriched with differentially expressed sequences, while cDNAs that are not differentially expressed in type c molecules hybridize with driver. In the second hybridization step, the two primarily hybridization samples are mixed together without denaturing. Through this, only the remaining equalized and subtracted single stranded tester cDNAs can hybridize, and only for type e molecule. This new hybrid is a double stranded tester molecule with different ends corresponding to adaptor1 and adaptor 2R. Again, fresh denatured cDNA is added without denaturing the subtraction mix to enrich the fraction with differentially expressed cDNAs. After this event, the differentially expressed type e molecules are filled with DNA polymerase to make different annealing sites for the nested primers on their 5´ and 3´ ends. The whole population of molecules is subjected to PCR to amplify the differentially expressed sequences. During this process, type a and type d molecules cannot be amplified because they lack primer annealing sites. Due to the suppression effect, type b molecules cannot undergo the amplification process. Type c molecules have only one primer site and can only be amplified linearly. Only the type e molecules are equalized and differentially sequenced with two adaptors and can be amplified exponentially. (Figure 1) represents a schematic diagram of PCR-select cDNA subtraction based on SSH method. The overview of SSH method described above is adopted from [7], therefore the readers are requested to read the above article for more details to set up the SSH-based experiment.
Figure 1.

Schematic diagram of PCR-select cDNA subtraction. Tester and driver cDNAs were prepared from the total RNA of each treatment (control versus treatment). Suppression subtractive hybridization was performed using PCR-Select cDNA Subtraction kit (Clontech Laboratories, Inc.). Forward and reverse subtraction libraries were constructed using cDNA samples of control versus treatment. The subtracted cDNAs were subjected to two rounds of PCR to normalize and enrich cDNA populations. The PCR products were sub-cloned into respective vector. Type a and type d molecules, devoid of primer annealing sites, cannot be amplified, type b molecules, being suppressed, cannot undergo the amplification process, type c molecules have only one primer site and can only be amplified linearly, and type e is differentially expressed, having only two adapters. This figure is adopted from [7].
Bioinformatics methods for analysis of sequence data:
Expressed Sequence Tags (ESTs) analysis:
Expressed Sequence Tags (ESTs) are single-pass reads of approximately 200-800 base pairs (bp) generated from randomly selected cDNA clones. Since they represent the expressed portion of a genome, ESTs have proven to be extremely useful for gene identification and verification of gene predictions. They therefore represent a low-cost alternative to full genome sequences. (Figure 2) represents how the Expressed Sequence Tags (ESTs) are generated by SSH-based sequencing method.
Figure 2.

Graphical representation of Expressed Sequence Tags (ESTs). Expressed sequence tags, or ESTs, are single DNA sequencing reads made from complementary DNAs (cDNAs).Cellular mRNA is extracted from the tissue, it is reverse transcribed to produce DNA complementary to the initial mRNA, and incorporated into a plasmid, with appropriate vectors and linkers.
Transcriptome assembly:
Bioinformatics tools are required to produce reliable, high quality data devoid of unwanted sequences in the preprocessing stage of EST projects. Pre-processing includes removing low-quality sequences, contaminant sequences (from vector to any other artifacts), and special features (like Poly-A tail or adaptors). Bioinformatics tools used in this process are LUCY2 [21], SeqClean ( http://compbio.dfci.harvard.edu/tgi) and PreGap4 [22]. Most of the transcriptomic studies for nonmodel organisms use the sequence assembly as a first step to generate contiguous sequences (contigs) that consist of overlapping reads to provide a consensus-based full length transcript. Multiple algorithms for de novo alignment have been developed, including: CAP3 [23], PHRAP ( http: //www.phrap.org/phredphrap/phrap.html), VELVET [24] and MIRA [25]. Many assembler programs are available, which differ in details in their algorithmic development process. For example: CAP3 is based on Overlap-Layout Consensus (OLC) strategy, MIRA uses a hybrid-based method, whereas VELVET uses a graphbased approach. In terms of sensitivity, accuracy and memory requirement, CAP3 is memory intensive due to n2 complexity of OLC-based method, whereas VELVET is less memory intensive as based on graph models [26]. Assembly programs are also available commercially; a few of them are as follows: ‘CLC Genomics Workbench’, ‘DNASTAR’, ‘AVADIS’, etc. Transcriptome are assembled from shorter reads that vary in size, depending on the sequencing technology used. Contigs are created from these shorter reads by comparing all reads against each other. If the sequence identity and overlap length pass a certain threshold value, they are grouped together into a contig. For example, the following parameters have been used for processing transcriptomic datasets for the burying beetle, Nicrophorus vespilloides [4]. Vector clipping, quality trimming and sequence assembly using stringent conditions (e.g. high quality sequence trimming parameters, 95% sequence identity cutoff, 25 bp overlap) was done with the Lasergene software package (DNASTAR Inc., Madison, WI; USA). (Figure 3) represents an overview of the EST analysis, which generates the consensus sequences.
Figure 3.

An overview of the EST analysis. EST pre-processing can be achieved by contamination check, vector sequence and poly-A removal, masking the repeats in the transcripts, followed by clustering and assembly of ESTs to generate contigs. Expressed Sequence Tags in FASTA format can be submitted to phase I for EST pre-processing. SeqClean (http://compbio.dfci.harvard.edu/tgi/) accepts ESTs in FASTA format, and performs vector removal, poly-A removal, trimming of low quality segments at the 5′ and 3′ ends and cleaning of low complexity regions. The output from SeqClean is processed by RepeatMasker (http://www.repeatmasker.org/) to mask repeats. CAP3 [23] then accepts repeat-masked high quality EST sequences and performs clustering and assembly into contigs. Alternatively, these pre-processing of ESTs can be achieved with DNASTAR (Medison USA) or CLC Genomics Workbench. The assembled ESTs may be submitted for functional annotation (phases II and III) either locally or through BLAST2GO suite tool.
Homology based detection:
Homology based search is the best method for determining function by using NCBI Blastx algorithm, either installed locally to consume less time for large datasets, or achieved through the use of the NCBI online server [27]. The Blastx program and associated databases may be downloaded for local Blast sequence searches (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/). Blast2GO offers a comprehensive suite tool for Blasting, with an advanced functional annotation [28]. SwissProt protein database provides a high level of annotations, a minimal level of redundancy and a high level of integration with other databases, and thus is preferred for functional annotation. After a Blastx search, sequences may be compared to other nucleotide sequences with the program Blastn, or translated and compared to translate sequences using tBlastn to identify other unique sequences where Blastx did not work. However, Blastx is the first choice, since amino acid sequence is more conserved than nucleotide sequences. The statistical significant expectation value (E-value) predicts the probability that two sequences are related by chance. It is an important statistical parameter in performing Blast because setting E-value too low may create a false relationship, while setting high Evalue presents the chance of excluding real sequences. As a consequence, by increasing the sequence length, the probability of finding significant Blast hits also increases. In real practice, performing Blast at low E-value and small sequence overlap length initially, and then filtering the results based on the distribution of hits obtained, may improve the obtainment of significant results from homology based approach. In the case of the Nicrophorus transcriptome analysis, first, sequences were searched against the NCBI non-redundant (nr) protein database using an E-value cut-off of 10-3, with predicted polypeptides of a minimum length of 15 amino acids. Second, sequences retrieving no BLASTx hit were searched again by BLASTn, against an NCBI nr nucleotide database using an E-value cut-off of 10−10 [4].
Gene Ontology based functional annotation:
Gene Ontology (GO) is a bioinformatics initiative that provides structural and controlled vocabulary of genes and genes products, to describe their cellular phenomena in terms of ‘Biological Process’, ‘Molecular Function’, and ‘Cellular Components’. Along with this, the GO Consortium provides tools for easy access to all aspects of the data provided by the project [29]. The interesting part of GO analysis is that GO vocabulary terms do not directly describe the gene or protein; rather, they describe the phenomenon of the gene or product of the gene. Gene Ontology annotation program is integrated with Blast2GO or can be used directly by Gene Ontology database, which can be accessed at http://www.geneontology.org/. Enzyme classification codes, and KEGG (Kyoto Encyclopedia of Genes and Genomes) metabolic pathway annotations, were generated from the direct mapping of GO terms onto their enzyme code equivalents by Blast2GO software program. (Figure 3) describes in details about the functional annotation of genes, which can be applied for functional characterization of genes from prokaryotes as well as eukaryotes.
Discussion
Transcriptome of an organism is defined as its complete repertoire of transcripts, including its splice variants. SSHbased gene identification method was introduced as a costeffective approach for rapid discovery and characterization of novel genes. mRNA expression varies- some are highly expressed in a cell while others are found in tiny amount. In whole transcriptomic sequencing project, a very large number of transcripts has to be sequenced, and to achieve this kind of goal, EST-project would be costly from an economic point of view. Thus for small scale EST-projects, SSH-based method is being used by different research laboratories around the globe. On the other hand, there are disadvantages in using SSH-based approach; like chances of missing rare transcripts, limited for gene expression studies, contigs generated are partial in length etc. In order to avoid missing rare transcript, normalization techniques have been devised to decrease the relative representation of the abundant transcript while increasing that of rare transcripts. SSH-based method is used to reduce the representation of transcripts already surveyed in previous libraries, as well as to enrich the sequences that are differentially expressed among specific tissues, cell types, etc. However, the possibility of missing rare transcripts by SSHbased method cannot be excluded, as in case of Nicrophorus vespilloides, we did not able to identify lysozyme [4]. In recent times due to the popularity of high throughput sequencing techniques, SSH may be replaced by NGS technologies. The benefits of using RNA-Seq include high resolution, high dynamic range of expression (>8000 fold), low background noise, and the ability to identify allele specific expression and different isoforms [30]. The use of sequence data from NGS technologies is considered as a powerful tool for differential gene expression analysis as well as for the fully comprehensive measurements of lower-abundance-class RNAs [31–32]. Nevertheless, it is advisable to perform SSH-based approach for generating small scale cDNAs before investing into NGS technologies for whole transcriptome sequencing. To have a better understanding of the molecular and cellular mechanisms behind disease progression or biological development, the identification of differentially expressed genes has been important. So in recent times, the SSH-based method, have been used to identify the differential expressed genes linked to diseases or developmental process or to identify immuneinducible genes in insects upon challenged with microbes. Of note, the workflow described in this article is also applied for analysis of data derived from NGS technologies.
Supplementary material
Acknowledgments
I want to acknowledge my family, especially my father Mr. S. C. Badapanda, for his constant support and encouragement during my PhD Work. Also my special thanks to my friend Subhanjan Mohanty, who has supported me on this journey in so many different ways and on so many different levels.
Footnotes
Citation:Badapanda, Bioinformation 9(4): 216-221 (2013)
References
- 1.Purvis A, Hector A. Nature. 2000;405:212. doi: 10.1038/35012221. [DOI] [PubMed] [Google Scholar]
- 2.Boman HG, et al. Annu Rev Immunol. 1995;13:61. doi: 10.1146/annurev.iy.13.040195.000425. [DOI] [PubMed] [Google Scholar]
- 3.Zou Z, et al. Genome Biol. 2007;8:R177. doi: 10.1186/gb-2007-8-8-r177. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Vogel H, et al. Insect Mol Biol. 2011;20:787. doi: 10.1111/j.1365-2583.2011.01109.x. [DOI] [PubMed] [Google Scholar]
- 5.Celniker SE, Rubin GM. Annu Rev Genomics Hum Genet. 2003;4:89. doi: 10.1146/annurev.genom.4.070802.110323. [DOI] [PubMed] [Google Scholar]
- 6.Bautz EK, Reilly E. Science. 1966;151:328. doi: 10.1126/science.151.3708.328. [DOI] [PubMed] [Google Scholar]
- 7.Ghorbel MT, Murphy D. Methods Mol Biol. 2011;789:237. doi: 10.1007/978-1-61779-310-3_15. [DOI] [PubMed] [Google Scholar]
- 8.Altincicek B, et al. Dev Comp Immunol. 2008;32:585. doi: 10.1016/j.dci.2007.09.005. [DOI] [PubMed] [Google Scholar]
- 9.Altincicek B, et al. Insect Mol Biol. 2008;17:711. doi: 10.1111/j.1365-2583.2008.00835.x. [DOI] [PubMed] [Google Scholar]
- 10.Bao YY, et al. Genomics. 2009;94:138. doi: 10.1016/j.ygeno.2009.04.003. [DOI] [PubMed] [Google Scholar]
- 11.Altincicek B, Vilcinskas A. Insect Biochem Mol Biol. 2007;37:726. doi: 10.1016/j.ibmb.2007.03.012. [DOI] [PubMed] [Google Scholar]
- 12.Altincicek B, Vilcinskas A. BMC Genomics. 2007;8:326. doi: 10.1186/1471-2164-8-326. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Altincicek B, Vilcinskas A. Insect Mol Biol. 2009;18:119. doi: 10.1111/j.1365-2583.2008.00856.x. [DOI] [PubMed] [Google Scholar]
- 14.Rinehart JP, et al. J Insect Physiol. 2010;56:603. doi: 10.1016/j.jinsphys.2009.12.007. [DOI] [PubMed] [Google Scholar]
- 15.Irles P, et al. BMC Genomics. 2009;10:206. doi: 10.1186/1471-2164-10-206. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Dixit R, et al. Int J Infect Dis. 2009;13:636. doi: 10.1016/j.ijid.2008.07.027. [DOI] [PubMed] [Google Scholar]
- 17.Zhao L, et al. J Med Entomol. 2009;46:490. doi: 10.1603/033.046.0312. [DOI] [PubMed] [Google Scholar]
- 18.Zhu Y, et al. Insect Biochem Mol Biol. 2003;33:541. doi: 10.1016/s0965-1748(03)00028-6. [DOI] [PubMed] [Google Scholar]
- 19.Botha AM, et al. Plant Cell Rep. 2006;25:41. doi: 10.1007/s00299-005-0001-9. [DOI] [PubMed] [Google Scholar]
- 20.Wang Y, et al. Mol Plant Microbe Interact. 2008;21:122. doi: 10.1094/MPMI-21-1-0122. [DOI] [PubMed] [Google Scholar]
- 21.Li S, Chou HH. Bioinformatics. 2004;20:2865. doi: 10.1093/bioinformatics/bth302. [DOI] [PubMed] [Google Scholar]
- 22.Bonfield JK, et al. Nucleic Acid Res. 1995;23:4992. doi: 10.1093/nar/23.24.4992. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Huang X, Madan A. Genome Res. 1999;9:868. doi: 10.1101/gr.9.9.868. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Zerbino DR, Birney E. Genome Res. 2008;18:821. doi: 10.1101/gr.074492.107. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Chevreux B, et al. Genome Res. 2004;14:1147. doi: 10.1101/gr.1917404. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Kumar S, Blaxter ML. BMC Genomics. 2012;11:571. doi: 10.1186/1471-2164-11-571. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Altschul SF, et al. J Mol Biol. 1990;215:403. doi: 10.1016/S0022-2836(05)80360-2. [DOI] [PubMed] [Google Scholar]
- 28.Conesa A, Götz S. Int J Plant Genomics. 2008;2008:619832. doi: 10.1155/2008/619832. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Gene Ontology Consortium. Nucleic Acids Res. 2010;38:D331. doi: 10.1093/nar/gkp1018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Deng X, et al. BMC Bioinformatics. 2011;12:267. doi: 10.1186/1471-2105-12-267. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Mortazavi A, et al. Nat Methods. 2008;5:621. doi: 10.1038/nmeth.1226. [DOI] [PubMed] [Google Scholar]
- 32.Wilhelm BT, Landry JR. Methods. 2009;48:249. doi: 10.1016/j.ymeth.2009.03.016. [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
