Abstract
Background
Studying a process in a new species often relies on focusing our attention on a candidate gene encoding a protein similar to one with a known function. Not all such choices seem to be prudent.
Scope
This Viewpoint gives an overview of issues encountered during research on candidate genes: defining a match for a gene of interest, and deciding whether variation in ESTs or RNAseq data for a certain transcript represents more than one gene. The problem of incorrect annotation of genes due to incorrect in-silico splicing is also mentioned. The author's humble opinion on how to deal with these issues is provided.
Conclusions
The vast amount of new sequence data provides us with great possibilities for giant leaps in our understanding. Still, we cannot afford to skip over the tedious steps required to confirm that we are indeed studying the correct gene, and to be sure that the complex expression pattern we observe is not a composite of several genes.
Keywords: Homology, orthology, MADS box, FLOWERING LOCUS T, FLC, candidate gene, citrus, Citrus clementina, apple, Malus domestica
INTRODUCTION
One rationale behind studying ‘model’ plants (MPS) such as arabidopsis is that different plant species are likely to use conserved mechanisms to control development, metabolism and other processes. By learning from one species that is easy to study, we might extrapolate these findings to different, more complex species, which are either more important to agriculture, or simply our ‘favourite’ species (OFS). For example, while flowering of different plant species is mediated by different internal and external cues, it seems to be triggered by a common protein, FLOWERING LOCUS T (FT), which functions as a universal ‘florigen’ (Turck et al., 2008). The MADS box proteins that confer organ identity in flowers (termed the ABC model) are another example of conserved function (Causier et al., 2010). It is obvious that not everything will be the same among species, and the differences help explain why flowering plants are so diversified.
Once we decide to study a certain event in OFS, we might begin by presuming nothing and start our studies with a clean slate. Many choose to also use a parallel approach in which they study ‘candidate’ genes that encode proteins that were functionally characterized in MPS. This Viewpoint discusses problems that might arise in the course of picking such candidate genes.
If X and Y are identical twin dogs and someone scanned all dogs on the planet for the one that is most similar (in genome sequence) to X, they should end up with Y. If X was in a room with one more dog (Z) and several humans, a scan limited to the room would probably identify Z as most similar to X. The depth of the search clearly depends on the available dataset. If I want to identify, in OFS, a gene encoding a protein most similar in sequence to protein X from MPS, I would need to have access to the full genome of OFS, to identify the best match. Still, I might be lucky and identify a gene encoding a protein orthologous to X from a small dataset.
FINDING THE PERFECT MATCH
When can I assume that the gene I identified indeed encodes a protein that has a similar role to the one described for protein X in MPS? There is no straightforward answer to this question and there are different points of view regarding how rigorous the analysis should be before making such a statement. If XMPS and XOFS share extended homology, and mutations in each of the genes cause similar phenotypes, one might assume that these are orthologous genes which have originated from the same gene in their common ancestor and share the same function. Reaching this result would depend on how easy it is to obtain loss of function mutations in both species, and whether or not there are redundant genes that might mask the phenotype of loss of function. In many species the genome has gone through duplications or has become polyploid. Thus, reaching a loss of function phenotype becomes even more difficult. Without loss of function phenotypes, similar expression profiles can also strengthen a choice of an orthologue among paralogues (Grigoryev et al., 2004). Clearly, homology within functional domains is a better indicator than overall similarity, since a loss of an active site would make a paralogue into a pseudogene (Yamagami et al., 2003).
FINDING THE WRONG MATCH
This Viewpoint is aimed not at defining a maximal standard but at discussing a minimal one. It arises from reading peer-reviewed papers (Muñoz-Fambuena et al., 2011, 2012) in which the authors call a gene from their OFS ‘X’ or ‘X-like’, while a careful yet simple analysis suggests that in some cases the name is completely misleading (Table 1). This can be more common in cases where there is a gene family in each genome, e.g. the MADS box transcription factors. In arabidopsis, 45 of the 104 MADS box genes belong to the MIKCc clade (Martinez-Castilla and Alvarez-Buylla, 2003); 38 genes in grape (Diaz-Riquelme et al., 2009) and in rice (Arora et al., 2007), and 64 in poplar, encode proteins belonging to the same clade. Among these genes are flowering repressors, flowering enhancers as well as proteins required for flower-organ identity. FLOWERING LOCUS C (FLC), a flowering repressor in arabidopsis (Searle et al., 2006), belongs to this group, and is part of a subgroup (MAF) in arabidopsis containing five other members. Similar proteins are found in close relatives of arabidopsis (D'Aloia et al., 2008; Wang et al., 2009), yet FLC cannot be found in all plant species (Arora et al., 2007; Reeves et al., 2007), and this clade is relatively divergent (Martinez-Castilla and Alvarez-Buylla, 2003; Leseberg et al., 2006; Diaz-Riquelme et al., 2009). If we are looking for a gene encoding an FLC protein in a new species, and in the dataset there are only ten MADS box-encoding genes, among them Y, it could be correct to state that, of all the genes in the dataset, gene Y encodes the protein most similar to FLC. Is this enough supporting evidence to call it FLC or FLC-like? In my view it is definitely not enough, and publishing such nomenclature could hamper scientific progress.
In recent publications (Muñoz-Fambuena et al., 2011, 2012), the gene chosen to represent FLC encodes a protein most similar to the class B organ identity protein PISTILLATA from arabidopsis (Table 1 and Supplementary Data Fig. S1). The gene chosen to represent the flowering enhancer SUPPRESSOR OF CONSTANS 1 (SOC1) encodes a protein most similar to AGAMOUS LIKE 82 (Table 1 and Supplementary Data Fig. S2). Thus, the authors might be reporting the expression of a class B flower organ-identity gene while under the impression that they are reporting the expression of a gene that encodes a flowering repressor (Muñoz-Fambuena et al., 2011, 2012). One can imagine reading a discussion that includes a hypothesis explaining why, in this unique species, a flowering repressor should be up-regulated in flower primordia.
Table 1.
List of primers used by Muñoz-Fambuena et al. (2011, 2012) to study expression of flowering time genes in sweet orange
| Arabidopsis protein (1) | EST name given in manuscript (2) | EST most homologous to (3) | Position of protein of interest (4) | Best match for arabidopsis protein in PHYTOZOME C. clementina database (5) | Forward primer listed in manuscript recognizes (6) | Reverse primer listed in manuscript recognizes (7) | Problem in manuscript (8) |
|---|---|---|---|---|---|---|---|
| LFY | aC34107CO6EF_c | Wrong gene | – | 0·9_030511m | 0·9_030511m | 0·9_030511m | Wrong EST |
| TFL1 | aCL6873Contig1 | MFT | #2 | 0·9_033458m | 0·9_033458m | 0·9_033458m | Wrong EST |
| AP1 | aCL9055Contig1 | SEPALATA1 | #7 | 0·9_020838m.g | 0·9_020838m.g | Incorrect | Wrong EST and incorrect reverse primer |
| FT | aCL6275Contig1 | FT | #1 | 0·9_023420m | 0·9_023420m | 0·9_023420m | Primers recognize two of three genes |
| | | | | 0·9_033594m | | | |
| | | | | 0·9_023363m | | | |
| SOC1 | ACL2263Contig1 | AGL82 | #15 | 0·9_021293m | 0·9_020900m | 0·9_020900m | Primers and EST are for an AGL82 homologue |
| | | | | 0·9_021261m | | | |
| | | | | 0·9_021283m | | | |
| | | | | 0·9_021297m | | | |
| FLC | aCL8484Contig1 | PI | #33 | There is no homologue for FLC in citrus genome | Incorrect | Incorrect | Complete mistake |
The same table was published in both manuscripts. For each arabidopsis protein (column 1), I provide the EST code listed in the original table (2); the arabidopsis gene that I identified as most similar to the citrus EST mentioned (3); the ranking (4) of the original arabidopsis protein (1) when the sequence of the EST provided (2) was compared, using BLASTX, to the arabidopsis proteins in TAIR10; the name of the C. clementina gene I identified, using PHYTOZOME (5), as encoding a protein most similar to the original arabidopsis protein (1); the sequence that the primers listed in the manuscript actually recognize (6–7); and a summary of mistakes in the table in the manuscripts (8). In a couple of cases (LFY, TFL1) the primers listed are correct, while the EST name is incorrect. In ‘AP1’ the EST name is incorrect, and only one of the primers listed recognizes the correct gene. In ‘FT’ the EST name and the primers are correct, yet the primers recognize more than one gene (see Supplementary Data Table S1). In SOC1 the EST and primers are for a gene encoding AGL82 and not SOC1. In FLC the EST encodes a PI-like protein and the primers are incorrect. I could not detect a citrus gene encoding FLC. Because of this inconsistency within the table, it remains unclear whether the figures of gene expression are based on the listed primers, the listed ESTs or some other combination.
The damage might not be confined to that specific OFS or to researchers studying that developmental question and citing these data. The sequence with the false name given by the authors could appear in official databases, such as NCBI. If someone else identifies a new gene (Q), similar to PISTILLATA, in their OFS, and searches public databases for genes that are similar to Q, the best hit might be Y from the above story, named as FLC. If not careful, this new scientist might join the ‘runaway train’ and call the new gene FLC or FLC-like, since that is the official name in the public database of the gene most similar to Q.
FINDING A REASONABLE MATCH
In my opinion, and in what is probably usual practice for most researchers, before stating that a gene encodes a protein like X from arabidopsis, one should at least go through some suggested steps (Fig. 1). The easiest test, ‘reciprocal BLASTP’ or ‘reciprocal best hit (RBH)’ (Chervitz et al., 1998; Rivera et al., 1998), is to take the nucleotide sequence of the gene (Y), go to TAIR and perform a BLASTX: NT query to AA (TAIR10 Protein) Database, at http://www.arabidopsis.org/wublast/index2.jsp. I use WU-BLAST2 with no filter. The aim is to find the protein in arabidopsis that is most similar to the protein likely encoded by gene Y. Assuming that the answers are in the order of P-value, the first protein on the list is likely to be the one we can call the most similar. If X is not number 1 then, in my opinion, we cannot call our protein X-like. In some cases, X is only the 10th most similar protein, while the first nine are of unknown function or less written about in the literature. Can this justify calling our gene X-like? In my opinion: no.
Fig. 1.
Suggested steps in identifying our gene of interest. See text for detailed explanation.
If we are afraid that the protein database is incomplete (due to incorrect splicing by computer programs), we can do a more thorough check (TBLASTX: NT query to NT database, six frames) using the whole genome as a database.
In some cases, we might find that the most similar protein is indeed X yet the homology between proteins is relatively low. For example, if we searched the complete rice genome for a gene encoding a protein similar to FLC, the highest hit is LOC_Os12g10540·3, also termed MADS 13. So, we indeed used the whole dataset and chose the most homologous protein in the rice genome. Still, if we now ask what is the most similar protein in arabidopsis to that encoded by LOC_Os12g10540·3, the first hit would be SEEDSTICK (AGL11) (Pinyopich et al., 2003). This exercise shows again that rice does not contain a protein that is similar to FLC (Arora et al., 2007), and we should look for flowering repressors using other approaches.
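The reciprocal check described in this section can be sketched in a few lines of Python. This is only a minimal illustration, assuming that the search results have already been parsed into (query, subject, E-value) tuples; the function names, the identifier `Os_MADS13`, the placeholder `other_MADS_protein` and all E-values are hypothetical and are not real search output.

```python
from typing import Dict, List, Optional, Tuple

# One parsed search hit: (query_id, subject_id, e_value)
Hit = Tuple[str, str, float]


def best_hits(hits: List[Hit]) -> Dict[str, str]:
    """Return, for every query, the subject with the lowest E-value."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def rank_of(hits: List[Hit], query: str, subject: str) -> Optional[int]:
    """Return the 1-based rank of `subject` among the hits of `query`."""
    ranked = sorted((h for h in hits if h[0] == query), key=lambda h: h[2])
    for position, (_, found, _) in enumerate(ranked, start=1):
        if found == subject:
            return position
    return None  # subject was not among the hits at all


def is_reciprocal_best_hit(forward: List[Hit], reverse: List[Hit],
                           x: str, y: str) -> bool:
    """True only if Y is the best hit of X and X is the best hit of Y."""
    return best_hits(forward).get(x) == y and best_hits(reverse).get(y) == x


# Hypothetical numbers mimicking the rice example in the text:
# FLC searched against rice, then the rice hit searched against arabidopsis.
forward = [("FLC", "Os_MADS13", 1e-20), ("FLC", "Os_other", 1e-15)]
reverse = [
    ("Os_MADS13", "SEEDSTICK", 1e-60),
    ("Os_MADS13", "other_MADS_protein", 1e-50),
    ("Os_MADS13", "FLC", 1e-20),
]

flc_like = is_reciprocal_best_hit(forward, reverse, "FLC", "Os_MADS13")
flc_rank = rank_of(reverse, "Os_MADS13", "FLC")
```

With these placeholder values the forward search alone would suggest an ‘FLC-like’ rice gene, whereas the reverse search shows that FLC is not the top arabidopsis hit, so the name would not be justified under the minimal standard argued for here.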
MORE THAN ONE GENE?
When we search OFS for transcripts that encode X, we might identify several transcripts (ESTs or RNAseq reads) that are not completely identical to each other: XOFS-1, XOFS-2. Assuming that each one passed the criteria described above, what could be the cause of this variation? Unlike arabidopsis ecotypes, many crops are heterozygous, likely containing two different alleles of the same gene. If polyploid, the number of different alleles could be larger. Variation could also be due to alternative splicing of the same gene. Another possibility is that there are two genes encoding similar proteins, possibly arising from duplication, and that the expression of each gene is different (Arora et al., 2007).
Without a genome sequence, one way to determine whether XOFS-1 and XOFS-2 are encoded by the same locus is by Southern blot: under high-stringency hybridization, different bands would be expected for different loci. Another method would be to design specific primers for each transcript and use them to capture introns or 5′ or 3′ regions of each gene. With alleles we would expect non-coding regions to be more similar to each other than non-coding regions from two paralogous genes. Once the full genome is sequenced, answers are clearer and simpler to obtain. One excellent web site for plant genome data is http://www.phytozome.net/.
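The comparison of non-coding regions mentioned above could be quantified roughly as follows. This is a sketch under stated assumptions: the sequences have already been aligned to equal length with `-` marking gaps, a gap opposite a base is counted as a difference, and the three short ‘intron’ strings are invented for illustration. No cut-off value is proposed, since the text gives none.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical columns in two pre-aligned sequences.

    Columns that are gaps in both sequences are ignored; a gap opposite
    a base counts as a difference.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Invented intron fragments captured with transcript-specific primers.
intron_x1 = "GTAAGTTTCA-TTGCAG"
intron_x2_allele_like = "GTAAGTTTCAATTGCAG"    # differs by a single indel
intron_x2_paralogue_like = "GTACGATTCTATCGCAG"  # several substitutions

allele_like = percent_identity(intron_x1, intron_x2_allele_like)
paralogue_like = percent_identity(intron_x1, intron_x2_paralogue_like)
```

A markedly higher identity for one pair than for the other would be consistent with alleles of one locus rather than paralogues, although such a comparison supports, rather than replaces, the Southern blot or genome-based evidence.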
As an example, I will spotlight recent papers describing expression of genes encoding FT from citrus (Muñoz-Fambuena et al., 2011, 2012; Nishikawa et al., 2007, 2009, 2010). A close look at the haploid genome of Citrus clementina (Aleza et al., 2009) reveals three loci encoding FT-like proteins: clementine 0·9_023420m, 0·9_033594m and 0·9_023363m (Supplementary Data Figs S3 and S4). One work (Nishikawa et al., 2007) indeed describes expression of three citrus FT genes based on three ESTs, CiFT1, CiFT2 and CiFT3, using specific primers. According to my analysis, it appears that CiFT1 and CiFT2 are encoded by the same gene, 0·9_023420m (Supplementary Data Fig. S3). CiFT3 is encoded by 0·9_033594m, and the third FT-encoding gene, 0·9_023363m, does not seem to be represented in the EST databases. The authors also used a set of primers meant to recognize all three ESTs, and these may also recognize 0·9_023363m, but to a lesser degree (Supplementary Data Table S1). Since primer recognition is not 100 % for all three genes, the primers would not be likely to amplify the different transcripts at a similar rate. This makes the interpretation of results obtained with these ‘universal’ primers complex. Another group (Muñoz-Fambuena et al., 2011, 2012) recently claimed to be studying expression of one gene they term ‘CiFT’, yet the primers used for quantitative real-time RT-PCR, according to my analysis (Supplementary Data Table S1), recognize all three FT-encoding genes to some extent, but not perfectly. Again, this makes it difficult to interpret the expression patterns they describe.
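One simple way to see how well a primer recognizes each member of a small gene family is to count mismatches at its best ungapped position in every gene. The sketch below is an illustration only and is not the procedure used for Supplementary Data Table S1; it ignores gaps, melting temperature and the position of mismatches relative to the 3′ end, and the primer and gene sequences are short invented strings, not citrus sequences.

```python
from typing import Dict


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def min_mismatches(primer: str, target: str) -> int:
    """Fewest mismatches of `primer` at any ungapped site on either strand.

    Returns len(primer) if the target is shorter than the primer.
    """
    primer = primer.upper()
    best = len(primer)
    for strand in (target.upper(), reverse_complement(target)):
        for start in range(len(strand) - len(primer) + 1):
            window = strand[start:start + len(primer)]
            mismatches = sum(a != b for a, b in zip(primer, window))
            best = min(best, mismatches)
    return best


def primer_report(primer: str, genes: Dict[str, str]) -> Dict[str, int]:
    """Best-site mismatch count of one primer against each gene."""
    return {name: min_mismatches(primer, seq) for name, seq in genes.items()}


# Invented sequences: gene_a carries a perfect site, gene_b a site with
# one substitution.
primer = "ATGGCTCGTGAT"
genes = {
    "gene_a": "CCC" + "ATGGCTCGTGAT" + "GGG",
    "gene_b": "CCC" + "ATGGCACGTGAT" + "GGG",
}
report = primer_report(primer, genes)
```

A primer that matches one paralogue perfectly and another with a few mismatches may still amplify both, but not at the same rate, which is the situation that complicates interpretation of the ‘universal’ primers discussed above.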
Without knowledge of the genome, there is no fault in designing primers that at the time of publication seem to be specific, yet end up being non-specific. Still, it would be a great service to readers if authors and journals were encouraged to use online tools to update published manuscripts with notes of caution regarding problems that might arise in interpretation of data due to the use of non-specific primers. I also think reviewers and editors should always ask for primer sequences and alignments in the supplementary data.
A recent paper describing several flowering-time genes in apple presents what seems to be a thorough analysis of the genes within the apple genome (Guitton et al., 2012). The apple genome contains two genes encoding FT-like proteins, MDP0000132050 (on MDC021142·191, also known as MdFT1, AB458506) and MDP0000139278 (on MDC009937·172, also known as MdFT2, AB458505·1) (Kotoda et al., 2010; Guitton et al., 2012). Expression of the two FT-encoding genes was analysed using gene-specific primers (Kotoda et al., 2010). Others reported diurnal expression of an apple FT gene (Trankner et al., 2010), and once they were aware that the primers they used could not distinguish between the two FT-encoding genes, they helpfully notified the community (Trankner et al., 2011).
I also mention the apple FT genes to point out an example in which researchers were careful in their analysis, yet computer-based splicing of the MdFT2 gene in the PHYTOZOME database is incorrect and misleading (Supplementary Data Figs S5 and S6). An algorithm that decides how a genomic sequence will be spliced to form a mature transcript could be imperfect and, if the simulated mature transcript is not verified with EST and RNAseq data, it should be used with caution. When performing a BLASTP search using FT protein on the Malus domestica genome at Phytozome, the MDP0000139278 protein comes up only as the 9th best hit. The reason is that the gene was not spliced correctly by the computer program, and the resultant protein is less homologous to FT (Supplementary Data Fig. S6). Better gene prediction in new genomes can be achieved by cross-species comparisons (Liu et al., 2008).
SUMMARY
Although not all plant biologists are experts in bioinformatics, this Viewpoint suggests that they should still be urged to use extra caution in stating homologies, relying on other people's primers, and believing the computer annotations of genes. A more stringent approach will relieve us from feeling uncomfortable about things we have published, and keep the scientific community focused on understanding real nature.
SUPPLEMENTARY DATA
NOTE ADDED IN PROOF
Current citrus genomic data found at: http://www.citrusgenomedb.org/. Current apple genomic data found at http://www.rosaceae.org/species/apple.
LITERATURE CITED
- Aleza P, Juarez J, Hernandez M, Pina J, Ollitrault P, Navarro L. Recovery and characterization of a Citrus clementina Hort. ex Tan. ‘Clemenules’ haploid plant selected to establish the reference whole Citrus genome sequence. BMC Plant Biology. 2009;9:110. doi: 10.1186/1471-2229-9-110.
- Arora R, Agarwal P, Ray S, et al. MADS-box gene family in rice: genome-wide identification, organization and expression profiling during reproductive development and stress. BMC Genomics. 2007;8:242. doi: 10.1186/1471-2164-8-242.
- Causier B, Schwarz-Sommer Z, Davies B. Floral organ identity: 20 years of ABCs. Seminars in Cell & Developmental Biology. 2010;21:73–79. doi: 10.1016/j.semcdb.2009.10.005.
- Chervitz SA, Aravind L, Sherlock G, et al. Comparison of the complete protein sets of worm and yeast: orthology and divergence. Science. 1998;282:2022–2028. doi: 10.1126/science.282.5396.2022.
- D'Aloia M, Tocquin P, Perilleux C. Vernalization-induced repression of FLOWERING LOCUS C stimulates flowering in Sinapis alba and enhances plant responsiveness to photoperiod. New Phytologist. 2008;178:755–765. doi: 10.1111/j.1469-8137.2008.02404.x.
- Diaz-Riquelme J, Lijavetzky D, Martinez-Zapater JM, Carmona MJ. Genome-wide analysis of MIKCC-type MADS box genes in grapevine. Plant Physiology. 2009;149:354–369. doi: 10.1104/pp.108.131052.
- Grigoryev D, Ma S-F, Irizarry R, Ye S, Quackenbush J, Garcia J. Orthologous gene-expression profiling in multi-species models: search for candidate genes. Genome Biology. 2004;5:R34. doi: 10.1186/gb-2004-5-5-r34.
- Guitton B, Kelner J-J, Velasco R, Gardiner SE, Chagne D, Costes E. Genetic control of biennial bearing in apple. Journal of Experimental Botany. 2012;63:131–149. doi: 10.1093/jxb/err261.
- Kotoda N, Hayashi H, Suzuki M, et al. Molecular characterization of FLOWERING LOCUS T-like genes of apple (Malus domestica Borkh.). Plant and Cell Physiology. 2010;51:561–575. doi: 10.1093/pcp/pcq021.
- Leseberg CH, Li A, Kang H, Duvall M, Mao L. Genome-wide analysis of the MADS-box gene family in Populus trichocarpa. Gene. 2006;378:84–94. doi: 10.1016/j.gene.2006.05.022.
- Liu Q, Crammer K, Pereira F, Roos D. Reranking candidate gene models with cross-species comparison for improved gene prediction. BMC Bioinformatics. 2008;9:433. doi: 10.1186/1471-2105-9-433.
- Martinez-Castilla LP, Alvarez-Buylla ER. Adaptive evolution in the Arabidopsis MADS-box gene family inferred from its complete resolved phylogeny. Proceedings of the National Academy of Sciences of the USA. 2003;100:13407–13412. doi: 10.1073/pnas.1835864100.
- Muñoz-Fambuena N, Mesejo C, González-Mas MC, Primo-Millo E, Agustí M, Iglesias DJ. Fruit regulates seasonal expression of flowering genes in alternate-bearing ‘Moncada’ mandarin. Annals of Botany. 2011;108:511–519. doi: 10.1093/aob/mcr164.
- Muñoz-Fambuena N, Mesejo C, González-Mas M, Iglesias D, Primo-Millo E, Agustí M. Gibberellic acid reduces flowering intensity in sweet orange Citrus sinensis (L.) Osbeck by repressing CiFT gene expression. Journal of Plant Growth Regulation. 2012; in press. doi: 10.1007/s00344-012-9263-y.
- Nishikawa F, Endo T, Shimada T, et al. Increased CiFT abundance in the stem correlates with floral induction by low temperature in satsuma mandarin (Citrus unshiu Marc.). Journal of Experimental Botany. 2007;58:3915–3927. doi: 10.1093/jxb/erm246.
- Nishikawa F, Endo T, Shimada T, Fujii H, Shimizu T, Omura M. Differences in seasonal expression of flowering genes between deciduous trifoliate orange and evergreen satsuma mandarin. Tree Physiology. 2009;29:921–926. doi: 10.1093/treephys/tpp021.
- Nishikawa F, Endo T, Shimada T, et al. Transcriptional changes in CiFT-introduced transgenic trifoliate orange (Poncirus trifoliata L. Raf.). Tree Physiology. 2010;30:431–439. doi: 10.1093/treephys/tpp122.
- Pinyopich A, Ditta G, Savidge B, et al. Assessing the redundancy of MADS-box genes during carpel and ovule development. Nature. 2003;424:85–88. doi: 10.1038/nature01741.
- Reeves PA, He Y, Schmitz RJ, Amasino RM, Panella LW, Richards CM. Evolutionary conservation of the FLOWERING LOCUS C-mediated vernalization response: evidence from the sugar beet (Beta vulgaris). Genetics. 2007;176:295–307. doi: 10.1534/genetics.106.069336.
- Rivera MC, Jain R, Moore JE, Lake JA. Genomic evidence for two functionally distinct gene classes. Proceedings of the National Academy of Sciences of the USA. 1998;95:6239–6244. doi: 10.1073/pnas.95.11.6239.
- Searle I, He Y, Turck F, et al. The transcription factor FLC confers a flowering response to vernalization by repressing meristem competence and systemic signaling in Arabidopsis. Genes & Development. 2006;20:898–912. doi: 10.1101/gad.373506.
- Trankner C, Lehmann S, Hoenicka H, et al. Over-expression of an FT-homologous gene of apple induces early flowering in annual and perennial plants. Planta. 2010;232:1309–1324. doi: 10.1007/s00425-010-1254-2.
- Trankner C, Lehmann S, Hoenicka H, et al. Note added in proof to: Over-expression of an FT-homologous gene of apple induces early flowering in annual and perennial plants. Planta. 2011;233:217–218. doi: 10.1007/s00425-010-1254-2.
- Turck F, Fornara F, Coupland G. Regulation and identity of florigen: FLOWERING LOCUS T moves center stage. Annual Review of Plant Biology. 2008;59:573–594. doi: 10.1146/annurev.arplant.59.032607.092755.
- Wang R, Farrona S, Vincent C, et al. PEP1 regulates perennial flowering in Arabis alpina. Nature. 2009;459:423–427. doi: 10.1038/nature07988.
- Yamagami T, Tsuchisaka A, Yamada K, Haddon WF, Harden LA, Theologis A. Biochemical diversity among the 1-amino-cyclopropane-1-carboxylate synthase isozymes encoded by the Arabidopsis gene family. Journal of Biological Chemistry. 2003;278:49102–49112. doi: 10.1074/jbc.M308297200.