Abstract
Gephebase is a manually curated database compiling our accumulated knowledge of the genes and mutations that underlie natural, domesticated and experimental phenotypic variation in all Eukaryotes (mostly animals, plants and yeasts). Gephebase aims to compile studies where the genotype–phenotype association (based on linkage mapping, association mapping or a candidate gene approach) is relatively well supported. Human clinical traits and aberrant mutant phenotypes in laboratory organisms are not included and can be found in other databases (e.g. OMIM, OMIA, Monarch Initiative). Gephebase contains more than 1700 entries. Each entry corresponds to an allelic difference at a given gene and its associated phenotypic change(s) between two species or two individuals of the same species, and is enriched with molecular details, taxonomic information, and bibliographic information. Users can easily browse entries and perform searches at various levels using boolean operators (e.g. transposable elements, snakes, carotenoid content, Doebley). Data can be exported in spreadsheet format. This database allows users to perform meta-analyses to extract global trends about the living world and the research fields. Gephebase should also help breeders, conservationists and others to identify promising target genes for crop improvement, parasite/pest control, bioconservation and genetic diagnostics. It is freely available at www.gephebase.org.
INTRODUCTION
Mutations form the raw bulk of heritable variation upon which traits evolve. Identifying the DNA sequence modifications that drive phenotypic changes is a primary goal of modern genetics, and could greatly improve our understanding of the mechanisms behind biodiversity and adaptation. However, this research program would be most successful if it reached comparative capacity, for instance by allowing us to detect trends across the Tree of Life (1–3). Advances in genome sequencing and editing are accelerating the discovery of the loci of evolution, making data integration increasingly challenging, and it is now crucial to develop a universal, single resource integrating this body of knowledge. As of today, compilations of genotype–phenotype relationships are available for a limited number of species in taxon-specific databases, for example OMIA for animals (4), OMIM for humans (5), TAIR for Arabidopsis (6), FlyBase for Drosophila (7), or the Monarch Initiative across the main laboratory animal model species (8,9). To date, no database consolidates genotype–phenotype relationships related to natural evolutionary cases across all Eukaryotes. For example, evolutionary changes in tigers, butterflies, monkeyflowers, or any non-traditional model organism are missing from existing genotype–phenotype databases, preventing comparative insights into the diversity and similarities of sequence modifications that fuel the generation of observable differences in the living world.
To fill this gap, we developed Gephebase, a manually curated database that gathers published data about the genes and the mutations responsible for evolutionary changes in all Eukaryotes (mostly animals, yeasts and plants) into a single website. The content of Gephebase was developed over the past 10 years, with previous versions of the dataset published as supplementary spreadsheet files associated with two review articles, which successively compiled 331 entries (1), and 1008 entries (2). These datasets have been used by various authors to highlight several trends regarding the genetic basis of natural variation. For example, based on these compilations it was found (a) that the mutations responsible for long-term evolution have properties distinct from those of the mutations responsible for short-term evolution (1,10), (b) that certain types of mutations are more likely to be fixed than others during the course of evolution (11), (c) that independent evolution of similar traits in distant lineages often involves mutations in the same orthologous gene (2), (d) that current data are biased towards a limited number of model organisms (12) and (e) that the cis-regulatory tinkering of signaling ligand genes is a recurring mode of morphological evolution (13).
We have now created an online version of the Gephebase database, accessible at www.gephebase.org, and we describe here its various features.
SNAPSHOT SUMMARY
In short, Gephebase is a searchable, manually curated knowledge base of the genetic loci of phenotypic variation. Each entry is a pair of alleles associated with a trait variation, be it naturally existing (inter- or intraspecific), selected by breeders (domestication), or occurring during a bout of experimental evolution in the lab. For instance, forward genetic studies have determined that independently derived null mutations of the Oca2 gene have caused an amelanic phenotype in at least two subterranean populations of cavefish (14), and in a breed of corn snake that has been selected for the pet trade (15). A Gephebase search for the Oca2 gene name reveals these findings, accessible in summary tables (Figure 1) or in a more detailed output (entry view, and CSV spreadsheet format). Gephebase also indicates that some Oca2 allelic variants have been identified by Genome-Wide Association Studies of pigment variation. Importantly, the focus of Gephebase is always on genetic variations that emerge naturally; it never includes laboratory variants that were generated by random or directed mutagenesis. Thus the Oca2 CRISPR knockout phenotypes that have been generated in frogs (16) do not have a dedicated Gephebase entry; the cavefish Oca2 CRISPR/TALEN knockout phenotypes (17,18) do not have a dedicated entry either, but are used as Additional References to support the functionality of the two natural Oca2 null alleles in Gephebase. This makes Gephebase complementary to the Monarch Initiative database, which compiles gene-to-phenotype relationships in humans, as well as in laboratory organisms and mutants generated by reverse genetics, but does not include non-model species such as cavefishes and corn snakes (8,9).
DATABASE CURATION AND STRUCTURE
Criteria for inclusion in Gephebase
Gephebase includes cases of domestication, experimental evolution and natural evolution but no human clinical phenotypes. Gene expression levels (eQTL) and DNA methylation patterns are not included. All kinds of traits above this level, whether morphological, physiological or behavioral, are included. For example, we include ‘Recombination rate’, ‘Telomere length’, ‘Hematopoiesis’, ‘Hybrid incompatibility’.
Cases of genomic regions associated with a trait for which the underlying gene(s) is unclear are not included in Gephebase. Cases where the gene has been identified, but not the exact mutation, are included. Stringent inclusion criteria are used so that Gephebase compiles only studies where a given genotype–phenotype association is well supported or understood. Association Mapping studies are included only if there is additional experimental support for the given gene. Candidate Gene studies require conclusive functional assays for inclusion in Gephebase. Overall, gene-to-phenotype links identified by Linkage Mapping with resolutions <500 kb have priority in the dataset. Multiple types of experimental evidence can lead to the discovery of a relationship between a genetic mutation and a phenotypic change. For the sake of simplicity and efficiency, each gene-phenotype association is attributed only one type of Experimental Evidence among three possibilities: ‘Association Mapping’, ‘Linkage Mapping’, or ‘Candidate Gene’ (Figure 2). When several methods were used, the least biased one is chosen by the curator (Table 1). When new evidence emerges, it is added to the entry.
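The inclusion rules above can be read as a simple decision procedure. The following Python sketch is only an illustration of that reading: the `CandidateStudy` record and its field names are hypothetical and are not part of Gephebase, and the priority given to Linkage Mapping with resolutions <500 kb is not modeled.

```python
from dataclasses import dataclass

# Controlled vocabulary for Experimental Evidence, as quoted in the text.
EVIDENCE_TYPES = ("Linkage Mapping", "Association Mapping", "Candidate Gene")


@dataclass
class CandidateStudy:
    """Hypothetical record a curator might fill in while screening a paper."""
    gene_identified: bool                      # False if only a genomic region is known
    evidence: str                              # one of EVIDENCE_TYPES
    extra_experimental_support: bool = False   # required for Association Mapping
    conclusive_functional_assay: bool = False  # required for Candidate Gene


def meets_inclusion_criteria(study: CandidateStudy) -> bool:
    """Return True if the study passes the inclusion rules described above."""
    if study.evidence not in EVIDENCE_TYPES:
        raise ValueError(f"unknown evidence type: {study.evidence!r}")
    if not study.gene_identified:
        return False  # region-only associations are excluded
    if study.evidence == "Association Mapping":
        return study.extra_experimental_support
    if study.evidence == "Candidate Gene":
        return study.conclusive_functional_assay
    return True  # Linkage Mapping with an identified gene
```

Note that a study with an identified gene but an unknown causal mutation still passes, in line with the criteria stated above.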
Table 1.
Curation protocol
Searches for relevant papers to be included in the database are done manually by our team of curators. We screen major journals in evolutionary genetics, perform keyword searches using online search tools, and pay particular attention to citations in primary research articles as well as in review papers. The ‘Suggest an article’ button in the top bar menu allows users to suggest articles to our curation team. Of note, our curation efforts were maximal until 2013 and were then relaxed due to our inability to support a full-time curator. Following our inclusion criteria, we estimate that the database is close to comprehensive for studies published prior to 2013, and up to 30% complete for the 2014–2019 period.
Technical overview of the database and the web interface
Gephebase was developed using the Symfony framework (v2.8) and PHP (v5.6, compatible with v7). MariaDB (v10) is used to store data. The database consists of 33 tables, including user management and logs. The main table links genotypic change, phenotypic change, references and validation information. Most fields of other tables are automatically retrieved from NCBI databases. The import procedure uses the NCBI E-utility interface with XML to fill the corresponding tables. Gephebase entries of the main table can be imported and exported through a CSV file. For convenience, fields retrieved automatically can be present in the CSV file even though they are fetched and stored in other tables.
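To give a sense of what retrieving taxonomic fields through the NCBI E-utilities in XML can look like, here is a minimal Python sketch. It is not the Gephebase import code (which is written in PHP and is not shown here); the helper names `efetch_url` and `parse_taxa` are hypothetical, and the sample XML is hand-written in the general shape of a taxonomy EFetch response rather than fetched live.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities EFetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(db: str, ids: list[str]) -> str:
    """Build an EFetch URL requesting XML records for the given IDs."""
    query = urlencode({"db": db, "id": ",".join(ids), "retmode": "xml"})
    return f"{EFETCH}?{query}"


def parse_taxa(xml_text: str) -> dict[str, str]:
    """Map Taxon ID -> scientific name from a taxonomy EFetch response."""
    root = ET.fromstring(xml_text)
    # Only direct <Taxon> children are read; lineage entries are nested deeper.
    return {
        taxon.findtext("TaxId"): taxon.findtext("ScientificName")
        for taxon in root.findall("Taxon")
    }


# Hand-written sample (an assumption about the response shape, not live data).
SAMPLE_XML = """<TaxaSet>
  <Taxon><TaxId>9606</TaxId><ScientificName>Homo sapiens</ScientificName></Taxon>
</TaxaSet>"""

names = parse_taxa(SAMPLE_XML)
```

In a real import, the URL would be requested over HTTP and the returned XML passed to the parser; network access and error handling are left out of this sketch.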
The project code was put under version control (git) from its inception. The code is available in the GitHub repository https://github.com/Biol4Ever/Gephebase-database under GPL (GNU General Public License) version 3.
DATABASE CONTENT AND WEB INTERFACE
Organization of the data into entries
This database currently comprises >1700 entries (Supplementary Table S1). One entry corresponds to a single mutation, or a group of linked mutations within a single gene, either between two closely related species or between two individuals of the same species, and its associated phenotypic change (Figure 3). For cases of repeated evolution (2), we use the following conventions. When several mutations are found within the same gene in a given individual, with each mutation affecting the trait of interest (i.e. several causative mutations within a haplotype, an intralineage hotspot (2)), all are grouped into a single entry. In contrast, when independent mutations occur in the same gene in distinct individuals of the same species, leading to similar phenotypic changes (intraspecific parallel evolution, convergent evolution), we create a separate entry for each lineage-specific haplotype. In cases where a genetic variant arose once and then spread into multiple branches of the gene pool, via Incomplete Lineage Sorting (ILS), secondary hybridization (introgression among organisms that are not completely reproductively isolated) or horizontal transfer, a single entry is created and multiple taxa with the derived trait are reported in the entry.
The various fields of a Gephebase entry
A Gephebase entry (Figure 3) comprises 29 manually curated fields regarding bibliographical information, molecular details and taxonomic information; some are free-text and others rely on controlled vocabulary (Table 1). In addition, for each entry, 20 fields are automatically fetched based on manually curated data, from NCBI Taxonomy using the Taxon ID (17), from UniProtKB using the UniProtKB ID (18), and from NCBI PubMed using the PubMed ID (19). Two fields are also automatically computed within Gephebase: ‘Related Genes’, which corresponds to the other genes in Gephebase associated with the same phenotypic trait in the same group of species, and ‘Related Haplotypes’, which displays the other mutations in Gephebase that are found in the same gene and that occurred in other lineage branches in the same group of species.
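The two computed fields can be illustrated with a short Python sketch. This is only one plausible reading of the definitions above, not the Gephebase implementation: the dictionary keys (`id`, `gene`, `trait`, `taxon_group`) and the placeholder values are hypothetical, and how Gephebase actually delimits a "group of species" is not specified here.

```python
def related_genes(entries, entry):
    """Other genes linked to the same trait in the same group of species."""
    return sorted({
        e["gene"] for e in entries
        if e["trait"] == entry["trait"]
        and e["taxon_group"] == entry["taxon_group"]
        and e["gene"] != entry["gene"]
    })


def related_haplotypes(entries, entry):
    """IDs of other entries for the same gene in the same group of species."""
    return [
        e["id"] for e in entries
        if e["gene"] == entry["gene"]
        and e["taxon_group"] == entry["taxon_group"]
        and e["id"] != entry["id"]
    ]


# Toy entries with placeholder names (not real Gephebase records).
ENTRIES = [
    {"id": 1, "gene": "geneA", "trait": "Coloration", "taxon_group": "group1"},
    {"id": 2, "gene": "geneB", "trait": "Coloration", "taxon_group": "group1"},
    {"id": 3, "gene": "geneA", "trait": "Coloration", "taxon_group": "group1"},
    {"id": 4, "gene": "geneC", "trait": "Coloration", "taxon_group": "group2"},
]
```

For entry 1 in this toy set, the related gene is `geneB` and the related haplotype is entry 3; entry 4 is ignored because it belongs to a different group of species.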
A single entry can include several traits if a mutation is pleiotropic. Taxon A represents the taxon(s) inferred to bear the ancestral phenotypic state and Taxon B the derived state. If the direction of change cannot be inferred, the field ‘Ancestral State’ is ‘Unknown’ and the two compared taxa are assigned arbitrarily to Taxon A and Taxon B. In most cases, Taxon A and Taxon B correspond to taxa at the species level. In cases of named breeds, cultivars, strains or geographically restricted populations, additional information about the Taxon A/B can be found in the field ‘Taxon A/B Description’. The phenotypic states are described in ‘Trait State in Taxon A/B’.
Exploration tools
Gephebase is designed for interactive exploration and analysis of the genotype–phenotype relationships across species and populations. First-time users can find help on the Frequently Asked Questions page, in tutorials available on the Documentation page and via ‘contextual tips’, small boxes providing information when the cursor hovers over an item. Data can be queried using boolean operators via the Search page, via SQL line or via custom tools after downloading the dataset of interest as a CSV file. The entire dataset can be downloaded as a CSV file by searching for the wild card * in the top bar panel, clicking on ‘Select all’ in the top left corner of the results table and then clicking on ‘Complete Export’. A Browse page, accessible from the main menu, displays all the Trait names, Species names and also compiles the genes with the highest number of mutations reported in Gephebase.
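As an example of a custom tool applied to a downloaded CSV file, the Python sketch below loads rows with the standard `csv` module and applies a simple AND/OR/NOT keyword filter. The column headers and rows in the sample are placeholders chosen for illustration; the actual headers of the Gephebase export should be checked in the downloaded file, and this filter is not the search logic used by the website.

```python
import csv
import io


def load_rows(csv_text):
    """Parse CSV text into a list of dictionaries keyed by column header."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def matches(row, all_of=(), any_of=(), none_of=()):
    """Case-insensitive boolean keyword match across all fields of a row."""
    text = " ".join(value or "" for value in row.values()).lower()

    def has(term):
        return term.lower() in text

    return (
        all(has(t) for t in all_of)                        # AND terms
        and (not any_of or any(has(t) for t in any_of))    # OR terms
        and not any(has(t) for t in none_of)               # NOT terms
    )


# Placeholder export with hypothetical headers and rows.
SAMPLE_CSV = """Gene,Trait,Taxon
geneA,Coloration,snake sp.
geneB,Carotenoid content,plant sp.
geneC,Coloration,fish sp.
"""

rows = load_rows(SAMPLE_CSV)
hits = [r["Gene"] for r in rows if matches(r, all_of=["coloration"], none_of=["snake"])]
```

With a real export, `load_rows` would be given the contents of the downloaded file instead of the inline sample.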
Gephebase comprises two main views, the View-Entry page which displays a single entry (Figure 3) and the Search/Results page, which shows the results of a given search in a table format (Figure 1). Results of a search are displayed as a table and there are four view options. In the default view, one line of the table corresponds to one Gephebase entry. Under the option ‘Split Mutations’, one line corresponds to one mutation. Under the option ‘Group Haplotypes’, one line corresponds to all haplotypes of a given gene for a given pair of Taxon A and Taxon B. Under the option ‘Group Genes’, one line corresponds to all the genes associated with the same phenotypic trait in the same Taxon A and Taxon B. The number of lines in the Results Table is indicated above the table. Clicking on one line of the Results Table will display all the corresponding Gephebase entries if the line corresponds to several Gephebase entries and will lead to the corresponding View-Entry page if the line corresponds to one entry.
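The ‘Group Haplotypes’ and ‘Group Genes’ options amount to grouping entries under different keys. The Python sketch below illustrates this under stated assumptions: the field names and placeholder values are hypothetical, and the code is a reading of the description above rather than the logic of the web interface.

```python
from collections import defaultdict


def group_haplotypes(entries):
    """One line per (gene, Taxon A, Taxon B): collect the entry IDs."""
    groups = defaultdict(list)
    for e in entries:
        groups[(e["gene"], e["taxon_a"], e["taxon_b"])].append(e["id"])
    return dict(groups)


def group_genes(entries):
    """One line per (trait, Taxon A, Taxon B): collect the gene names."""
    groups = defaultdict(set)
    for e in entries:
        groups[(e["trait"], e["taxon_a"], e["taxon_b"])].add(e["gene"])
    return dict(groups)


# Toy entries with placeholder names (not real Gephebase records).
VIEW_ENTRIES = [
    {"id": 1, "gene": "geneA", "trait": "Coloration", "taxon_a": "sp1", "taxon_b": "sp2"},
    {"id": 2, "gene": "geneA", "trait": "Coloration", "taxon_a": "sp1", "taxon_b": "sp2"},
    {"id": 3, "gene": "geneB", "trait": "Coloration", "taxon_a": "sp1", "taxon_b": "sp2"},
]

haplotype_lines = group_haplotypes(VIEW_ENTRIES)
gene_lines = group_genes(VIEW_ENTRIES)
```

In this toy set, the three entries collapse into two lines under ‘Group Haplotypes’ and into a single line under ‘Group Genes’.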
Extensive links to external databases (UniProtKB, NCBI Taxonomy, NCBI PubMed) and to Gephebase itself allow in-depth analysis of curated data. Users can provide feedback using the Feedback section on each View-Entry Page (Figure 3) and can suggest new articles for curation in Gephebase using the ‘Suggest an article’ button in the top bar menu.
DISCUSSION AND CONCLUSION
Gephebase contains data for more than 450 eukaryote species and more than 900 distinct genes (Supplementary Table S1, Data S1). In Gephebase, physiological traits represent 67% of the entries, morphological traits 31%, behavioral traits 1% and mixed morphological/physiological/behavioral traits 1% (Figure 4A,C). The most represented traits are Xenobiotic Resistance (21% of the entries) and Coloration (18%). The most represented taxa are Vertebrates (40% of the entries), Green Plants (30%) and Arthropods (17%) (Figure 4D), with a large contribution from traditional model organisms (12). Most data correspond to intraspecific changes (48% of the entries) and domesticated cases (30%) whereas interspecific cases correspond to 10% of the entries (Figure 4E). The three categories of Experimental Evidence are relatively evenly distributed among entries (Figure 4F). Gephebase contains a higher number of coding mutations (63% of the entries) than cis-regulatory changes (18% of the entries, Figure 4G). While a significant fraction of Gephebase corresponds to cases where the exact mutation has not been identified (23% of entries with Aberration Type ‘Unknown’, Figure 4H), most mapped mutations are single nucleotide changes (47%) and indels (26%).
Gephebase stands out compared to the other current databases of genotype–phenotype relationships in that it compiles genotype–phenotype data across all Eukaryotes. We consider our dataset to be highly complementary to other available databases, which are more species-specific and which usually include more detailed information about genotype–phenotype relationships. Gephebase can be used in various ways: as a powerful bibliographic tool, as a place to formulate hypotheses (Figure 1), as a list of potential targets for breeders interested in transferring traits of interest to new species, as an extensive compilation for broad meta-analyses on the genetic loci of evolution, and also as a resource for epistemologists interested in biases and sociological aspects in the field of genetic evolution. Moving forward, we invite the community of scientists interested in comparative genetics and genotype–phenotype associations to join us in our efforts to curate and synthesize accumulating data.
DATA AVAILABILITY
Gephebase is freely available at gephebase.org. The code is available on GitHub. The entire dataset is freely available for download by searching for ‘*’ and clicking on ‘Complete Export’.
ACKNOWLEDGEMENTS
We thank the twenty participants of the ‘Loci Of Evolution Workshop’ (Paris, September 2016) for their enthusiasm and encouragement. We are indebted to the team of AtoutLibre (France) and especially Kyle Ratteree for developing the software and website behind Gephebase. Nathalie Vessilier drew the illustrations featured in Figure 2.
Notes
Present address: Stéphane R. Prigent, Institut de Systématique, Evolution, Biodiversité, ISYEB, Muséum national d’Histoire naturelle, CNRS, Sorbonne Université, EPHE, Université des Antilles, Paris, France.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
FUNDING
John Templeton Foundation in 2014–2017 [JTF 43903 to A.M. and V.C.]; European Research Council Starting grant ROBUST [FP7/2007–2013 337579 to V.C.-O.]. Funding for open access charge: NSF grant IOS-1923147 to A.M.
Conflict of interest statement. None declared.
REFERENCES
- 1. Stern D.L., Orgogozo V. The loci of evolution: how predictable is genetic evolution? Evolution. 2008; 62:2155–2177.
- 2. Martin A., Orgogozo V. The loci of repeated evolution: a catalog of genetic hotspots of phenotypic variation. Evolution. 2013; 67:1235–1250.
- 3. Rockman M.V. The QTN program and the alleles that matter for evolution: all that's gold does not glitter. Evolution. 2012; 66:1–17.
- 4. Nicholas F.W. Online Mendelian Inheritance in Animals (OMIA): a comparative knowledgebase of genetic disorders and other familial traits in non-laboratory animals. Nucleic Acids Res. 2003; 31:275–277.
- 5. Hamosh A., Scott A.F., Amberger J., Bocchini C., Valle D., McKusick V.A. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2002; 30:52–55.
- 6. Rhee S.Y., Beavis W., Berardini T.Z., Chen G., Dixon D., Doyle A., Garcia-Hernandez M., Huala E., Lander G., Montoya M. The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res. 2003; 31:224–228.
- 7. Gramates L.S., Marygold S.J., dos Santos G., Urbano J.-M., Antonazzo G., Matthews B.B., Rey A.J., Tabone C.J., Crosby M.A., Emmert D.B. FlyBase at 25: looking to the future. Nucleic Acids Res. 2016; 45:D663–D671.
- 8. Mungall C.J., McMurry J.A., Köhler S., Balhoff J.P., Borromeo C., Brush M., Carbon S., Conlin T., Dunn N., Engelstad M. The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Res. 2016; 45:D712–D722.
- 9. McMurry J.A., Köhler S., Washington N.L., Balhoff J.P., Borromeo C., Brush M., Carbon S., Conlin T., Dunn N., Engelstad M. Navigating the phenotype frontier: the monarch initiative. Genetics. 2016; 203:1491–1495.
- 10. Stern D.L., Orgogozo V. Is genetic evolution predictable? Science. 2009; 323:746–751.
- 11. Streisfeld M.A., Rausher M.D. Population genetics, pleiotropy, and the preferential fixation of mutations during adaptive evolution. Evolution. 2011; 65:629–642.
- 12. Arnoult L.A. La marche génétique de l’évolution. Biol. Aujourdhui. 2014; 208:237–249.
- 13. Martin A., Courtier-Orgogozo V. Morphological evolution repeatedly caused by mutations in signaling ligand genes. Diversity and Evolution of Butterfly Wing Patterns - An Integrative Approach. 2017; Singapore: Springer.
- 14. Protas M.E., Hersey C., Kochanek D., Zhou Y., Wilkens H., Jeffery W.R., Zon L.I., Borowsky R., Tabin C.J. Genetic analysis of cavefish reveals molecular convergence in the evolution of albinism. Nat. Genet. 2006; 38:107.
- 15. Saenko S.V., Lamichhaney S., Barrio A.M., Rafati N., Andersson L., Milinkovitch M.C. Amelanism in the corn snake is associated with the insertion of an LTR-retrotransposon in the OCA2 gene. Sci. Rep. 2015; 5:17118.
- 16. Sakane Y., Iida M., Hasebe T., Fujii S., Buchholz D.R., Ishizuya-Oka A., Yamamoto T., Ken-ichi T.S. Functional analysis of thyroid hormone receptor beta in Xenopus tropicalis founders using CRISPR-Cas. Biol. Open. 2018; 7:bio030338.
- 17. Federhen S. The NCBI taxonomy database. Nucleic Acids Res. 2011; 40:D136–D143.
- 18. Boutet E., Lieberherr D., Tognolli M., Schneider M., Bansal P., Bridge A.J., Poux S., Bougueleret L., Xenarios I. UniProtKB/Swiss-Prot, the manually annotated section of the UniProt KnowledgeBase: how to use the entry view. Plant Bioinformatics. 2016; Springer; 23–54.
- 19. Geer L.Y., Marchler-Bauer A., Geer R.C., Han L., He J., He S., Liu C., Shi W., Bryant S.H. The NCBI biosystems database. Nucleic Acids Res. 2009; 38:D492–D496.