Version Changes
Revised. Amendments from Version 1
We have made a number of small improvements to the online database as a result of the reviewers’ comments. In addition, we have made minor changes to the text to clarify meaning. The most significant change is that the domain-phenotype enrichment analysis has been repeated, now associating phenotypes with InterPro families as opposed to individual signatures. This does not significantly impact the results, but makes them easier to interpret. Figures 3 and 4 have been updated to reflect additional phenotype data added since submission.
Abstract
Two decades after the first Plasmodium transfection, attempts have been made to disrupt more than 3,151 genes in malaria parasites, across five Plasmodium species. While results from rodent malaria transfections have been curated and systematised, empowering large-scale analysis, phenotypic data from human malaria parasite transfections currently exists as individual reports scattered across the literature. To facilitate systematic analysis of published experimental genetic data across Plasmodium species, we have built PhenoPlasm (http://www.phenoplasm.org), a database of phenotypes generated by transfection experiments in all Plasmodium parasites. The site provides a simple interface linking citation-backed Plasmodium reverse-genetic phenotypes to gene IDs. The database has been populated with phenotypic data on 367 P. falciparum genes, curated from 176 individual publications, as well as existing data on rodent Plasmodium species from RMgmDB and PlasmoGEM. This is the first time that all available data on P. falciparum transfection experiments has been brought together in a single place. These data are presented using ortholog mapping to allow a researcher interested in a gene in one species to see results across other Plasmodium species. The collaborative nature of the database enables any researcher to add new phenotypes as they are discovered. As an example of database utility, we use the currently available datasets to identify RAP (RNA-binding domain abundant in Apicomplexa)-domain containing proteins as crucial to parasite survival.
Keywords: Malaria, Plasmodium, phenotype, genetic, database, knock-out
Introduction
The increasing use of experimental genetics in Plasmodium spp. has provided numerous insights into the biology of the malaria parasite (de Koning-Ward et al., 2015). Nevertheless, to date such results for P. falciparum transfection experiments are scattered across a range of journals, with no unified or queryable interface. This means that a researcher whose experiment or analysis identifies a set of genes of interest must devote considerable time to reviewing all available literature if they are to understand what is already known about these genes from previous knockout or other genetic manipulation experiments. To facilitate rapid functional profiling using already established phenotypes, we set out to build a database to contain this information.
There were three key functional requirements for such a database:
Systematic, and synergistic with existing resources
To allow for automated bioinformatic analyses, it is crucial that the database have a defined, machine-comprehensible schema for recording phenotypes. It is also important that this schema is compatible with existing resources. The rodent malaria genetically modified parasite database (RMgmDB, http://pberghei.eu; Khan et al., 2013) provides a powerful curated resource for the rodent Plasmodium species, and contains curated data on disruption attempts for over 500 genes from individual studies, making it the largest manually curated database for Plasmodium experimental genetic data. However, this database does not contain any data for human-infecting Plasmodium species. While some human Plasmodium parasite genes lack rodent orthologs, nearly 75% have such orthologs, and integrating human and rodent Plasmodium phenotypes is likely to be highly informative. To allow such integration, any new database schema must be compatible with that of RMgmDB, which is broken into 6 different stages at which phenotypes can occur (asexual, gametocyte/gamete, fertilization & ookinete, oocyst, sporozoite and liver). RMgmDB also distinguishes cases in which a modification was not successful, which may imply a role for the gene in asexual growth; we decided to call this quality mutant viability, though of course failure to obtain a mutant might also result from a technical failure. The database must also import phenotypes from the largest source of blood stage Plasmodium phenotyping data available to date, PlasmoGEM barcode-sequencing experiments (Bushell et al., 2017).
Orthology-based retrieval
The use of model systems has strongly facilitated attempts to understand human malaria parasites (Zuzarte-Luis et al., 2014). These systems are valuable both because rodent malaria parasites are less technically challenging to genetically modify, and because their in vivo nature has allowed the study of some aspects of parasite biology in a more physiological setting than provided by in vitro culture. Critically, rodent models allow the recapitulation of the complete lifecycle with few technical hurdles, and therefore the vast majority of known non-blood-stage phenotypes come from rodent Plasmodium experiments. Nevertheless, rodent parasites do not contain orthologs of every P. falciparum gene, so these studies alone cannot provide a complete view of the parasite genetics causing human disease. Even where orthologs exist, phenotypes may not always be conserved, although available comparative data does suggest a high level of conservation (Bushell et al., 2017).
We felt that an optimal approach would allow a researcher to search for gene IDs from any species and to see at a glance results both for that organism and for orthologous genes in other Plasmodium species. The database should also contain records for emerging new genetic models, such as P. knowlesi (Kocken et al., 2002; Moon et al., 2013), as well as P. vivax which, though currently genetically intractable, is of key medical importance, and where gene functions may be interpreted through orthology.
Community contribution
The role for “crowd-sourcing” in biological databases is contentious (Good & Su, 2013; Karp, 2016). It is clear that community contributions cannot wholly replace curation for these types of datasets, but on the other hand manual curation is not easy to support through application to funding agencies, and suffers the problem of scale: a single person is unlikely to identify every single phenotype in the Plasmodium literature. An example of successful community contribution in parasite genetics comes from the EuPathDB databases (which include PlasmoDB; Aurrecoechea et al., 2009). These have had thousands of user comments, many of which are now incorporated into annotations. We felt it important to provide a mechanism whereby a motivated researcher can add any missing phenotypes to the database. This requires the creation of an intuitive and easy interface for the entry of data, and means that the primary data source must be publications, with each phenotype backed up with a PubMed ID.
Methods
Database schema
The database comprises 4 main data tables. Phenotype data is stored in a “Phenotypes” table, containing the stage at which the phenotype is described, the phenotype itself (selected from a growing and defined taxonomy), a referenceable citation, details of the genetic system used to obtain the phenotype and any additional notes. Here, the gene itself is a reference to the “Genes” table, containing gene name and product data imported from PlasmoDB (Aurrecoechea et al., 2009). Genes are linked to previous aliases by an “Aliases” table. They are also linked to one another by the “Orthology” table, which contains links between genes in which both OrthoMCL group and synteny are conserved.
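The four-table layout described above can be sketched as follows. This is only an illustration using SQLite from Python: the four table names come from the text, but every column name, the gene IDs (`GENE_A`, `GENE_B`, `OLD_ID_A`) and the placeholder values are assumptions, and the real PhenoPlasm schema may well differ. For brevity the sketch stores each orthology link in one direction only.

```python
import sqlite3

# Hypothetical columns; only the four table names are taken from the text.
SCHEMA = """
CREATE TABLE Genes (
    gene_id TEXT PRIMARY KEY,
    species TEXT NOT NULL,
    name    TEXT,
    product TEXT
);
CREATE TABLE Aliases (
    gene_id TEXT NOT NULL REFERENCES Genes(gene_id),
    alias   TEXT NOT NULL
);
CREATE TABLE Orthology (
    gene_id     TEXT NOT NULL REFERENCES Genes(gene_id),
    ortholog_id TEXT NOT NULL REFERENCES Genes(gene_id)
);
CREATE TABLE Phenotypes (
    gene_id        TEXT NOT NULL REFERENCES Genes(gene_id),
    stage          TEXT NOT NULL,   -- e.g. asexual, oocyst, liver
    phenotype      TEXT NOT NULL,   -- term from the phenotype taxonomy
    citation       TEXT NOT NULL,   -- PubMed ID or other reference
    genetic_system TEXT,
    notes          TEXT
);
"""


def build_demo_db():
    """Create an in-memory database holding a few placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO Genes VALUES (?, ?, ?, ?)",
        [
            ("GENE_A", "species 1", "geneA", "placeholder product"),
            ("GENE_B", "species 2", "geneB", "placeholder product"),
        ],
    )
    conn.execute("INSERT INTO Aliases VALUES ('GENE_A', 'OLD_ID_A')")
    conn.execute("INSERT INTO Orthology VALUES ('GENE_A', 'GENE_B')")
    conn.execute(
        "INSERT INTO Phenotypes VALUES "
        "('GENE_B', 'asexual', 'placeholder phenotype', "
        "'PMID:placeholder', 'knock-out', NULL)"
    )
    return conn


def phenotypes_for(conn, query):
    """Resolve a current ID or alias, then return phenotypes recorded for
    the gene itself and for its orthologs (flagged in the last column)."""
    row = conn.execute(
        "SELECT gene_id FROM Genes WHERE gene_id = ? "
        "UNION SELECT gene_id FROM Aliases WHERE alias = ?",
        (query, query),
    ).fetchone()
    if row is None:
        return []
    gene_id = row[0]
    return conn.execute(
        "SELECT p.gene_id, p.stage, p.phenotype, p.citation, "
        "       p.gene_id != ? AS via_ortholog "
        "FROM Phenotypes p "
        "WHERE p.gene_id = ? "
        "   OR p.gene_id IN (SELECT ortholog_id FROM Orthology WHERE gene_id = ?) "
        "ORDER BY via_ortholog, p.gene_id",
        (gene_id, gene_id, gene_id),
    ).fetchall()
```

Calling `phenotypes_for(build_demo_db(), "OLD_ID_A")` resolves the historic alias to `GENE_A` and returns the phenotype recorded for its ortholog `GENE_B`, flagged as inferred through orthology.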
Display
The database can be queried either for a set of genes (Figure 1) or a single gene (Figure 2). The former provides a table with one line per gene, while the latter provides referencing for each claim and displays any additional notes. The search box on each page can be flexibly queried with a gene ID, symbol or description; but there is also an advanced search facility which allows the retrieval of, for example, only phenotypes backed up by evidence from conditional systems.
Figure 1. The results of a search for ‘kinase’ genes, showing phenotype data, both from P. falciparum experiments and those in rodent models across multiple lifecycle stages.
Green ticks indicate mutant viability, and circled green ticks indicate wild-type phenotype. Red crosses indicate failure to disrupt the gene, and red exclamation marks indicate a phenotype different from wildtype. The icons are either shown in full opacity (indicating they apply to the gene in the species queried) or semi-transparent (indicating they refer to orthologous genes in other species).
Figure 2. The phenotype page for the P. vivax CRK4 gene.
Though no experimental data is available directly from P. vivax, published results are shown from P. berghei and P. falciparum, with references to the original datasets from which likely data in P. vivax could be inferred. This gene is essential and has therefore been refractory to all attempts to disrupt it by classical reverse genetics, but a conditional system has also been recently applied in P. falciparum, allowing a more detailed phenotype to be assigned to the gene from our taxonomy.
Literature review
A scan was made of the Plasmodium literature using Google Scholar (which provides full-text search for a large proportion of publications) to identify reported attempts at P. falciparum gene disruption for curation. Terms for which complete literature scans were made included ‘“attempts to disrupt” falciparum’, ‘“gene disruption of” falciparum’ ‘“gene deletion construct” falciparum’ and ‘“gene disruption construct” falciparum’. Numerous additional terms were used, and the first 10 pages of results for each search were manually curated and added to the database. In addition, genes with a suggested role in erythrocyte invasion were systematically curated by searching for all references to any version of their gene IDs, as discussed below.
One challenge when conducting literature searches into Plasmodium proteins is that the numerous iterative improvements made to Plasmodium genome assemblies mean that a gene could be referred to by any of numerous current or historic gene IDs. To assist with this, the PhenoPlasm page for each gene contains a link to conduct a custom boolean search on Google Scholar, searching article full text for any of the identifiers which have historically been used to refer to the gene, as provided by PlasmoDB (Aurrecoechea et al., 2009). Links are also provided to other databases, including searches for pathways and localization images on the Malaria Metabolic Pathways Site.
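A boolean search link of this kind could be assembled roughly as below. This is a sketch rather than the code used by PhenoPlasm: the function name and the placeholder identifiers are hypothetical, and the only assumption about Google Scholar is that it accepts a quoted `OR` query in the `q` parameter of its standard search URL.

```python
from urllib.parse import urlencode


def scholar_search_url(identifiers):
    """Build a Google Scholar URL matching any of the given gene identifiers."""
    # Drop blanks and duplicates while preserving the original order.
    unique = list(dict.fromkeys(i.strip() for i in identifiers if i.strip()))
    if not unique:
        raise ValueError("at least one identifier is required")
    # Quote each ID so it is matched as an exact token, then OR them together.
    query = " OR ".join(f'"{i}"' for i in unique)
    return "https://scholar.google.com/scholar?" + urlencode({"q": query})


# Placeholder identifiers standing in for a current ID and a historic alias.
print(scholar_search_url(["CURRENT_ID", "OLD_ID"]))
```

Quoting each identifier keeps the search engine from splitting it into separate terms, which matters for IDs containing punctuation.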
In addition to this curation, scripts were developed to regularly import data from RMgmDB and PlasmoGEM and transform it to the PhenoPlasm schema.
Enrichment analysis
Genome-scale phenotyping data provides opportunities to integrate diverse genome-wide data sets and investigate how they relate to gene functionality. The PlasmoGEM dataset, which currently contains data for >50% of P. berghei genes, has been used to identify essential metabolic pathways and investigate the relationship between transcriptomics, evolution and phenotype (Bushell et al., 2017).
To supplement these analyses, and further illustrate the utility of genome-scale phenotype data, we sought to identify protein domains whose presence in a gene was predictive of essentiality or dispensability. We downloaded from PhenoPlasm the phenotypes relating to all P. falciparum genes, both directly assayed and inferred from orthologs, and added annotation information for InterPro signatures (Aurrecoechea et al., 2009). We then used hypergeometric testing to identify signatures with members significantly enriched in essential or dispensable genes (Supplementary File S1).
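For readers unfamiliar with hypergeometric testing, the calculation for a single signature can be sketched as follows. The authors' actual analysis is the R-Markdown document in Supplementary File S1; this Python version is an independent illustration, and the one-sided upper tail, the choice of background set and the absence of any multiple-testing correction are assumptions made here. The function and parameter names are hypothetical, and the numbers in the usage lines are toy values, not data from the paper.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """Return P(X >= k) for a hypergeometric variable X.

    N: genes with a phenotype call (the background)
    K: background genes called essential
    n: genes carrying the signature of interest
    k: signature-carrying genes called essential
    """
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("need 0 <= K <= N and 0 <= n <= N")
    # X can only take values between these bounds.
    lowest = max(0, n - (N - K))
    highest = min(n, K)
    start = max(k, lowest)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(start, highest + 1)
    )
    return favourable / comb(N, n)


# Toy example: 10 genes, 5 essential, a signature found in 3 genes,
# all 3 of which are essential.
p = hypergeom_upper_tail(k=3, N=10, K=5, n=3)
print(f"p = {p:.4f}")
```

Testing for enrichment among dispensable genes uses the same function with `K` and `k` counting viable mutants instead of essential genes.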
Results
The extent of data now available on PhenoPlasm
At the time of manuscript preparation, some form of phenotyping information is available for 3,188 genes (Figure 3). Of these, 2,790 are from rodent malaria parasites, and so represent data imported from RMgmDB and PlasmoGEM. The remaining 398 are human parasite genes with phenotypes systematised by our curation, and brought together in a searchable format for the first time. For posterity, the complete data has been additionally deposited on Figshare (https://doi.org/10.6084/m9.figshare.5114017), and will be updated at least yearly.
Figure 3. The number of genes with phenotyping data available in PhenoPlasm for each Plasmodium species, and the source of these annotations.
There are 7,274 total phenotype datapoints (i.e. data for one life-stage, for one gene knock-out, in one study). The majority of the non-blood stage phenotype datapoints are from the rodent parasites, since relatively few P. falciparum genes have phenotyping data reported beyond the blood stage.
Given that some genes have been phenotyped in multiple Plasmodium species, the number of ortholog groups covered in at least one species is 2,778. This represents 60% of the core Plasmodium genome.
Since every phenotype in PhenoPlasm for the human malaria parasites is linked to a PubMed ID or other citation, we were able to informatically extract the dates associated with these publications and plot how the number of genes phenotyped in P. falciparum has increased over time (Figure 4). While this analysis has limitations (a portion of citations are review papers), it does reveal that progress remains slow in human-infecting parasites, with no major acceleration apparent in the last decade. This illustrates the current importance of rodent models, and raises the question of the technologies needed to create a step-change in the rate of phenotype discovery for human malaria species.
Figure 4. A timeline of PubMed dates associated with publications reporting knock-out phenotypes for P. falciparum genes, from the first gene disruption in 1997 to mid-2017.
The values shown are cumulative from all previous years. Around 25 genes per year have had disruption attempts reported since the year 2000. The spikes that occur in 2008 and 2011 largely represent two individual publications systematically knocking out exported genes (Maier et al., 2008) and kinases (Solyakov et al., 2011) respectively.
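A cumulative timeline of this kind can be derived from (gene, publication year) pairs along the following lines. The sketch assumes that each gene is counted once, in the year of its earliest associated citation; the text does not state how genes with several citations were handled, so that rule, the function name and the toy records below are all assumptions.

```python
from collections import Counter
from itertools import accumulate


def cumulative_genes_by_year(records):
    """Map each year to the number of distinct genes phenotyped up to then.

    records: iterable of (gene_id, publication_year) pairs.
    """
    # Keep only the earliest publication year seen for each gene.
    first_year = {}
    for gene, year in records:
        if gene not in first_year or year < first_year[gene]:
            first_year[gene] = year
    if not first_year:
        return {}
    per_year = Counter(first_year.values())
    # Include years with no new genes so the running total has no gaps.
    years = range(min(per_year), max(per_year) + 1)
    totals = accumulate(per_year.get(y, 0) for y in years)
    return dict(zip(years, totals))


# Toy records with made-up gene IDs; g1 appears in two publications.
demo = [("g1", 1997), ("g2", 1999), ("g1", 2001), ("g3", 1999)]
print(cumulative_genes_by_year(demo))
```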
Applying the data to investigate genome-wide relationships between protein structure and knock-out phenotype reveals the importance of RAP-domain containing proteins
The InterPro signatures most enriched in genes producing viable mutants included the 6-cysteine domain (12/12 viable, p=7.46E-05), the MSP7 C-terminal domain (8/8 viable, p=0.0018) and the MFS transporter superfamily (21/24 viable, p=2.19E-05). These data confirm previous inferences, but do so in a systematic way for the first time (Supplementary Table S1).
The InterPro signatures enriched in apparently essential genes include expected results, such as the OB-fold nucleic acid domain and ribosomal protein S5 domain-2 type folds, but also identify the RAP (RNA-binding domain abundant in Apicomplexa) domain as functionally highly significant (Table 1, Supplementary Table S1). The RAP domain was named for its dramatic expansion in the Apicomplexa, as compared to other parts of the tree of life (Lee & Hong, 2004). Every one of the ten genes containing this domain for which knockouts have been attempted to date appears to be essential (there are 18–19 in most Plasmodium genomes). As a result of this analysis, we looked at the other apicomplexan taxon for which genome-scale data is available and found that all eleven proteins containing this domain in Toxoplasma gondii also have suggestions of essentiality in CRISPR-screening data (Sidik et al., 2016). This functional data, combined with recent experimental observations of these proteins binding mRNA (Bunnik et al., 2016), suggests these proteins as prominent candidates for future studies which may uncover a new realm of Apicomplexa-specific biology.
Table 1. InterPro signatures most enriched in essential genes.
| InterPro Family | Description | Viable mutant | Essential gene | Proportion of genes essential | p-value |
|---|---|---|---|---|---|
| IPR012340 | Nucleic acid-binding, OB-fold | 3 | 24 | 89% | 0.0002 |
| IPR020568 | Ribosomal protein S5 domain 2-type fold | 2 | 18 | 90% | 0.0008 |
| IPR013584 | RAP domain | 0 | 10 | 100% | 0.0023 |
| IPR006073 | GTP binding domain | 0 | 9 | 100% | 0.0043 |
| IPR005225 | Small GTP-binding protein domain | 4 | 19 | 83% | 0.0049 |
Discussion
A rapid growth in available genetic tools, coupled with decreasing costs of gene synthesis and sequencing, means that Plasmodium experimental genetics is reaching the genome-level scale for the first time. Phenotypic data from these studies has the potential to shed light on the importance of the novel genes found in these early-branching eukaryotes. The development of large scale genetic modification programmes in Plasmodium species (Bronner et al., 2016; Bushell et al., 2017; Gomes et al., 2015) is now shedding light on a large portion of the genome’s functional importance in the asexual blood stage.
Nevertheless, no single approach is likely to reach saturation for some time, and exploring the complete parasite lifecycle in any system is likely to take even longer. In addition, the lack of a non-homologous end-joining pathway in Plasmodium parasites prevents the use of the conventional CRISPR-Cas9 screens, which have revolutionized genetics in other organisms (Sidik et al., 2016). For these reasons, a complete view of the reverse-genetic landscape for a gene or pathway will require bringing together multiple datasets with the individual gene-by-gene studies that have characterized decades of research.
Until now, retrieving phenotyping data for a set of 80 genes in Plasmodium might have involved perhaps a day of work, requiring a separate literature search for each gene’s P. falciparum ID, and all of its historic identifiers. To be comprehensive, the set of genes would additionally have to be transformed by orthology into each of the five available Plasmodium genetic systems, with further searches conducted. With the development of PhenoPlasm, all this data is available in a single batch search.
We hope that this database will assist in the prioritization of future large-scale studies, eliminating duplication of existing efforts, and allowing a focus on the portion of the genome which remains wholly unexplored by previous reverse genetic approaches. The availability of systematized data should allow the Plasmodium phenome to be bioinformatically queried in the same sort of routine way that transcriptomic data is used today to understand gene function. The application we present here, identifying RAP domain proteins as crucial for parasite survival, is just a hint of the wealth of information that well-organized phenotypic data can reveal at scale.
Data availability
The data referenced by this article are under copyright with the following copyright statement: Copyright: © 2017 Sanderson T and Rayner JC
The database is accessible at http://phenoplasm.org/. Facilities are provided for download of batch searches in CSV form, and of the entire dataset. In addition, snapshots are available from Figshare (https://doi.org/10.6084/m9.figshare.5114017) which will be updated yearly.
Funding Statement
This work was supported by the Wellcome Trust [098051].
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
[version 2; referees: 2 approved]
Supplementary material
Supplementary Table S1: Full table of InterPro signatures, and the number of viable and non-viable mutants for proteins containing them, with p-values indicating statistical enrichment.
Supplementary File S1: Reproducible analysis. The analytical methods used to calculate the results presented here are preserved in an R-Markdown document to allow full reproduction of the results.
References
- Aurrecoechea C, Brestelli J, Brunk BP, et al.: PlasmoDB: a functional genomic database for malaria parasites. Nucleic Acids Res. 2009;37(Database issue):D539–43. 10.1093/nar/gkn814
- Bronner IF, Otto TD, Zhang M, et al.: Quantitative insertion-site sequencing (QIseq) for high throughput phenotyping of transposon mutants. Genome Res. 2016;26(7):980–989. 10.1101/gr.200279.115
- Bunnik EM, Batugedara G, Saraf A, et al.: The mRNA-bound proteome of the human malaria parasite Plasmodium falciparum. Genome Biol. 2016;17(1):147. 10.1186/s13059-016-1014-0
- Bushell E, Gomes AR, Sanderson T, et al.: Functional profiling of a Plasmodium genome shows a high incidence of essential genes in an intracellular parasite. Cell. 2017.
- de Koning-Ward TF, Gilson PR, Crabb BS: Advances in molecular genetic systems in malaria. Nat Rev Microbiol. 2015;13(6):373–387. 10.1038/nrmicro3450
- Gomes AR, Bushell E, Schwach F, et al.: A genome-scale vector resource enables high-throughput reverse genetic screening in a malaria parasite. Cell Host Microbe. 2015;17(3):404–413. 10.1016/j.chom.2015.01.014
- Good BM, Su AI: Crowdsourcing for bioinformatics. Bioinformatics. 2013;29(16):1925–1933. 10.1093/bioinformatics/btt333
- Karp PD: Crowd-sourcing and author submission as alternatives to professional curation. Database (Oxford). 2016;2016: pii: baw149. 10.1093/database/baw149
- Khan SM, Kroeze H, Franke-Fayard B, et al.: Standardization in generating and reporting genetically modified rodent malaria parasites: the RMgmDB database. Methods Mol Biol. 2013;923:139–50. 10.1007/978-1-62703-026-7_9
- Kocken CH, Ozwara H, van der Wel A, et al.: Plasmodium knowlesi provides a rapid in vitro and in vivo transfection system that enables double-crossover gene knockout studies. Infect Immun. 2002;70(2):655–660. 10.1128/IAI.70.2.655-660.2002
- Lee I, Hong W: RAP--a putative RNA-binding domain. Trends Biochem Sci. 2004;29(11):567–570. 10.1016/j.tibs.2004.09.005
- Maier AG, Rug M, O'Neill MT, et al.: Exported proteins required for virulence and rigidity of Plasmodium falciparum-infected human erythrocytes. Cell. 2008;134(1):48–61. 10.1016/j.cell.2008.04.051
- Moon RW, Hall J, Rangkuti F, et al.: Adaptation of the genetically tractable malaria pathogen Plasmodium knowlesi to continuous culture in human erythrocytes. Proc Natl Acad Sci U S A. 2013;110(2):531–6. 10.1073/pnas.1216457110
- Sidik SM, Huet D, Ganesan SM, et al.: A Genome-wide CRISPR Screen in Toxoplasma Identifies Essential Apicomplexan Genes. Cell. 2016;166(6):1423–1435.e12. 10.1016/j.cell.2016.08.019
- Solyakov L, Halbert J, Alam MM, et al.: Global kinomic and phospho-proteomic analyses of the human malaria parasite Plasmodium falciparum. Nat Commun. 2011;2:565. 10.1038/ncomms1558
- Zuzarte-Luis V, Mota MM, Vigário AM: Malaria infections: What and how can mice teach us. J Immunol Methods. 2014;410:113–122. 10.1016/j.jim.2014.05.001




