Author manuscript; available in PMC: 2019 Nov 1.
Published in final edited form as: Proteomics. 2018 Oct 10;18(21-22):e1800083. doi: 10.1002/pmic.201800083

Identification of Moonlighting Proteins in Genomes Using Text Mining Techniques

Aashish Jain 1,2, Hareesh Gali 2, Daisuke Kihara 1,2,3,*
PMCID: PMC6404977  NIHMSID: NIHMS1521265  PMID: 30260564

Abstract

Moonlighting proteins, an emerging concept in the study of protein function, are proteins with two or more independent and distinct functions. An increasing number of moonlighting proteins have been reported in the past years; however, a systematic study of the topic has been hindered because the secondary functions of proteins are usually found serendipitously by experiments. Toward systematic identification and study of moonlighting proteins, we have developed computational methods for identifying moonlighting proteins from several different information sources: database entries, literature, and large-scale omics data. In this work, we give an overview of the pipeline we have developed for finding moonlighting proteins. Then, we apply the literature-mining method, DextMP, to find new moonlighting proteins in three genomes, Arabidopsis thaliana, Caenorhabditis elegans, and Drosophila melanogaster. Potential moonlighting proteins identified by DextMP were further examined by a two-step manual literature checking procedure, which finally yielded thirteen new moonlighting proteins. Identified moonlighting proteins were categorized into two classes based on how clearly the two functions of the proteins are distinct. A few cases of the identified moonlighting proteins are described in detail. Further directions for improving the DextMP algorithm are also discussed.

Introduction

Most molecular level studies in modern biology concern the functions of proteins and the mechanisms of how proteins carry out those functions. Thus, function annotation of proteins serves as fundamental information for biological studies. Algorithms for protein function prediction are extensively studied in bioinformatics [1].

In function annotation, it is generally assumed that a protein has a single function, and the possibility that the protein has an additional function is not extensively examined. However, over the past decade, an increasing number of proteins have been found to perform two independent and distinct functions. A classic example is an enzyme, L-arginosuccinate arginine-lyase, which was found to function as a lens structural protein, delta-crystallin, as well [2]. Proteins that have two functions have been referred to in several different ways, such as bi-, dual-, or multi-functional proteins, multitasking proteins, gene sharing, promiscuous enzymes [3], and moonlighting proteins [4], but the latter two have rather specific definitions. A promiscuous enzyme is a protein that catalyzes a side reaction in addition to its main reaction. Moonlighting proteins perform two or more independent and distinct functions. In its original strict definition by Constance Jeffery, who coined the term [4], the multiple functionalities are not due to gene fusions, multiple domains, multiple splice variants, proteolytic fragments, families of homologous proteins, or pleiotropic effects. Some mechanisms identified to be responsible for the switch between two functions include different cellular localization of the protein, expression in different cell types, ligand binding sites, oligomerization states, and ligand concentration [4]. Many known moonlighting proteins were originally identified as enzymes and were later found to have an additional function, such as acting as transcription factors.

Moonlighting proteins that exhibit multiple functions can provide a competitive advantage to an organism from an evolutionary standpoint, especially in prokaryotes, where growth and reproductive rates are directly associated with the number of genes translated and replicated [4]. Moonlighting proteins are also known to manage cellular activities by providing a coordinated framework, either by self-regulation, e.g. thymidylate synthase, an enzyme that can bind to its own mRNA and inhibit its translation [5], or by regulating other similarly functioning proteins, e.g. cystic fibrosis transmembrane conductance regulator, a chloride channel that also regulates the epithelial sodium channel [6]. It was found that several moonlighting proteins play important roles in cellular activities that are involved in cancer and other diseases [7]. Thus, moonlighting proteins may be interesting drug targets to effectively suppress disease development if both functions of the proteins are involved in the target disease. On the other hand, blocking the activity of a moonlighting protein needs extra caution so that drugs only affect the desired function of the protein. Understanding the functional mechanisms of moonlighting proteins may lead to novel ideas for the artificial design of proteins with dual functions. It will also provide a foundation for avoiding unexpected toxicity of artificially designed proteins and of proteins artificially placed in a different cellular environment. With moonlighting proteins in the picture, our understanding of the functional interplay of proteins in a cell would need a major and fundamental update [8].

Most of the currently known moonlighting proteins have been found serendipitously, where researchers identify a known protein as having a different function in an unrelated biological context. Jeffery’s lab manually compiled a list of known moonlighting proteins from the literature in a database named MoonProt [9]. Multiple functions for the proteins in this database were reviewed by the authors based on published biochemical, mutagenic, and other evidence. There is another database, MultitaskProtDB-II [10], where the authors curated a list of proteins that were found in PubMed with keywords indicating multiple functions: moonlight proteins, moonlighting proteins, multitask protein, multitasking proteins, moonlight enzymes, moonlighting enzymes, and gene sharing. Considering that we still only know a small number of moonlighting proteins, it is important to develop computational approaches that can systematically identify moonlighting proteins [11]. One approach examined whether moonlighting proteins exhibit sequence similarity to protein families of different functions [12]. A second approach is to determine if there is a correlation between disordered regions and multi-functionality of proteins, as disordered regions are often responsible for binding different proteins [13]. Another approach is to use protein-protein interaction (PPI) data, based on the idea that moonlighting proteins tend to interact with proteins with different functions or pathways, reflecting their dual functionality [14]. Recently, Cheng et al. developed MoonFinder, which finds moonlighting long non-coding RNAs using the subcellular location and function annotation of proteins that interact with long non-coding RNAs [15].

Previously, our group has developed a framework of three methods for identifying potential moonlighting proteins based on the different types of information available about the proteins (Fig. 1). Identifying moonlighting proteins on a large scale is a challenge even when the two or more functions of the proteins and their mechanisms are well known, because the UniProtKB database [16] does not label such proteins with a specific keyword, e.g. moonlighting proteins or dual functional proteins. Thus, the right branch of the diagram in Figure 1 deals with cases where the dual functionality of proteins is known. When a protein’s function is well studied, documented, and annotated with the Gene Ontology (GO) [17] in its UniProtKB entry, we can directly compute the number of distinct functions of the protein by classifying its annotated GO terms. GO is a pre-defined vocabulary organized in a hierarchical fashion. Thus, the similarity of GO terms can be objectively defined and computationally measured. In our earlier work [18], we developed a procedure for clustering GO terms to identify moonlighting proteins and applied it to the E. coli genome.

Figure 1.


Different methods for identifying moonlighting proteins in a genome. When a protein is annotated, clustering GO terms based on their similarity can identify multifunctional proteins. When literature or a functional description of the protein is available, the text mining tool, DextMP, can be used. The omics-data-based method MPFit is useful when a protein is not annotated but several other types of data, such as protein-protein interactions and expression profiles, are available.

The second branch in Figure 1 handles proteins that have associated literature but no GO annotation. One can read the literature on candidate proteins, but this is time-consuming if the literature for many proteins needs to be examined. To overcome this issue, we have developed a machine learning based method, DextMP, which predicts if a protein moonlights or not based on text information, such as titles and abstracts of publications associated with the protein, or the functional information available in the UniProtKB database [19]. DextMP uses recent computational natural language processing techniques to encode the text information, which is then fed into several machine learning classifiers to identify potential moonlighting proteins.

Lastly, when the function of a protein is not well studied yet but several large-scale omics data are available for it, we can analyze the omics data to find characteristic patterns of moonlighting proteins in them. This is what the MPFit algorithm [20] is designed for, which corresponds to the left branch of Figure 1. MPFit is based on a simple and intuitive idea about moonlighting proteins. Since moonlighting proteins play a role in two (or more) different functions, they probably tend to interact with proteins from two (or more) different functional groups or pathways, and show correlated expression patterns and phylogenetic patterns [21] with proteins from two functional groups/pathways. Therefore, MPFit considers the number of functional clusters of interacting proteins in PPI, co-expression, and phylogenetic profile networks as features of a query protein and feeds them to a machine learning method (random forest) to predict whether the protein is moonlighting.

It should be noted that these three methods are currently used for screening potential moonlighting proteins, and further manual verification, such as a careful reading of related literature, is needed to reach a final conclusion. This is because these methods do not examine the strict original definition of moonlighting proteins mentioned earlier and because the semantic distance of GO terms does not always capture the distinctiveness of functions well. For example, we occasionally encounter cases in which a protein has GO terms that are distant in the GO hierarchy (so that the protein is judged as a potential moonlighting protein) but these terms are somewhat related from a biological point of view. The latter problem is difficult to fix because it originates from the hierarchical graph structure of the current GO.

In this work, we ran DextMP on three genomes, Arabidopsis thaliana, Caenorhabditis elegans, and Drosophila melanogaster. In the original paper of DextMP [19], the prediction accuracy was benchmarked on known moonlighting proteins in E. coli, human, and mouse, stored in MoonProtDB. Since we considered the observed accuracy sufficiently high for further application (an F-score, the harmonic mean of recall and precision, of over 0.9), here we applied it to three genomes whose moonlighting proteins are not well studied.
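As a reminder of how the F-score mentioned above is computed, the following is a minimal sketch. The function name and the counts are hypothetical and chosen only to illustrate the arithmetic; they are not taken from the DextMP benchmark.

```python
def f_score(true_positives, false_positives, false_negatives):
    """Return the F-score, the harmonic mean of precision and recall."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts (not from the paper), used only as an illustration.
example = f_score(true_positives=90, false_positives=10, false_negatives=5)
```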

After screening text information of proteins in the three genomes, we performed a two-step manual literature check. During the process, we identified a problem caused by “hub publications”, papers that are associated with several proteins. Generally, such papers report large-scale genomics and proteomics experiments. Hub publications tend to cause false positives, because multiple proteins, and thus multiple functions, are mentioned in the text. We removed hub publications to reduce false positives and thus to reduce the burden of the downstream manual literature check steps. We identified 13 new moonlighting proteins, which we classified into two classes depending on the confidence level. Finally, the four proteins in the high confidence class are discussed individually.

Materials and Methods

First we describe the genomes we analyzed and the text information of proteins used for detecting moonlighting proteins. Then, we explain the overall procedure used to identify moonlighting proteins in the genome, and then describe the algorithm of DextMP.

Genome dataset

We ran DextMP on proteins in three genomes, Arabidopsis thaliana, Caenorhabditis elegans, and Drosophila melanogaster. We applied three criteria for choosing these genomes. First, we wanted to analyze model organisms, because they are relatively well studied and have abundant publications in PubMed. Second, among popular model organisms, we excluded human, yeast, and Xenopus laevis from our choice, because they were analyzed in the original paper of DextMP. Third, we avoided the E. coli and mouse genomes, because moonlighting proteins of these two genomes were abundant in MoonProtDB and were used for the parameter training of DextMP.

Text information of proteins

For each protein, three types of textual information were extracted. The first is the title of each publication of the protein, obtained from the list of “PUBLICATIONS” in its UniProtKB entry. The second is the abstract of each publication, extracted from the PubMed database (https://www.ncbi.nlm.nih.gov/pubmed/) using the PubMed ID of the publication as the search key. The third is the functional description text of the protein, obtained from the function subsection in the “FUNCTION” section of its UniProtKB entry. The titles and function descriptions were downloaded from https://www.uniprot.org/downloads.

Overall procedure of identifying moonlighting proteins

Our procedure consists of five steps (Fig. 2A): 1) Proteins in a genome are obtained from UniProtKB Swiss-Prot. 2) For each protein, three different types of text information are obtained: literature titles, literature abstracts, and the function summary description from UniProtKB. Hub publications, i.e. publications associated with more than three proteins, are omitted; consequently, proteins that have only hub publications are removed. 3) DextMP is used to predict if a protein is moonlighting or not from the publication abstracts of the protein. 4) Predicted moonlighting proteins are manually examined by quickly checking publication titles and the functional description in UniProtKB (Manual Checking-1). Both types of text can indicate whether two different functions are associated with the protein. 5) Proteins that pass Step 4 undergo Manual Checking-2, which is an in-depth literature review of the protein. This is the final step, where we confirm by reading the literature that the two functions of the protein are independent of each other.
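The hub-publication filter in Step 2 can be sketched as follows. This is a minimal illustration rather than the code used in this work: the function name, the dictionary layout, and the toy identifiers are assumptions, while the threshold of more than three proteins follows the text.

```python
from collections import defaultdict

HUB_THRESHOLD = 3  # a publication linked to more than three proteins is a hub


def remove_hub_publications(protein_to_pubs):
    """Drop hub publications, then drop proteins left without any publication."""
    # Collect the set of proteins that reference each publication.
    proteins_per_pub = defaultdict(set)
    for protein, pubs in protein_to_pubs.items():
        for pub in pubs:
            proteins_per_pub[pub].add(protein)

    hubs = {
        pub for pub, proteins in proteins_per_pub.items()
        if len(proteins) > HUB_THRESHOLD
    }

    filtered = {}
    for protein, pubs in protein_to_pubs.items():
        kept = [pub for pub in pubs if pub not in hubs]
        if kept:  # proteins that only had hub publications are removed
            filtered[protein] = kept
    return filtered


# Toy input with made-up identifiers.
toy = {
    "P1": ["pmid_A", "pmid_HUB"],
    "P2": ["pmid_HUB"],
    "P3": ["pmid_HUB", "pmid_B"],
    "P4": ["pmid_HUB"],
}
result = remove_hub_publications(toy)
```

In this toy input, the publication shared by four proteins is treated as a hub, so only the two proteins that also have a non-hub publication remain.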

Figure 2.


Procedure of identifying moonlighting proteins used in this work. A, Overall procedure. Proteins in a genome were obtained from UniProtKB Swiss-Prot. Three types of literature information about the proteins were extracted: publication titles, publication abstracts, and the UniProtKB functional summary. DextMP predicted if a protein is a potential moonlighting protein or not based on publication abstracts. Predicted proteins underwent a quick Manual Checking-1, and those that passed were checked again in Manual Checking-2 by careful reading of the literature to finalize the list of moonlighting proteins. B, The DextMP algorithm. There are four steps in the algorithm: 1) Each abstract was cleaned and processed, which involved removal of stop words, punctuation, and special symbols, followed by stemming and lemmatization. 2) Each of two language models, a deep learning model and TFIDF, converted the cleaned text into a feature vector. 3) The feature vector (representing one abstract) was predicted as 1 (moonlighting) or 0 (non-moonlighting). 4) A majority vote was applied to the predictions made for all abstracts of a protein to predict if the protein is moonlighting or non-moonlighting.

The DextMP algorithm

DextMP takes textual information of proteins to predict if a protein is moonlighting or not using machine learning methods. There are four steps in the DextMP algorithm (Fig. 2B).

First, the input text of a query protein undergoes data clean-up. In the original work of DextMP [19], three different types of text information were tested, which were publication titles, publication abstracts, and the UniProtKB functional summary. Among the three input types, publication abstracts showed the highest accuracy [19]. Hence, in this work we used abstracts as the protein text information. Cleaning of the text data (abstracts) involved removal of stop words, punctuation, and symbols. Next, stemming and lemmatization were performed using the nltk package (a natural language analysis toolkit).
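The clean-up step can be illustrated with the short sketch below. It is only an approximation under stated assumptions: the stop list is a tiny hand-written stand-in for the much longer list shipped with nltk, the function name is hypothetical, and the stemming and lemmatization performed with nltk in this work are left out so that the example has no external dependencies.

```python
import re

# Tiny illustrative stop list (an assumption); nltk provides a longer one.
STOP_WORDS = {"a", "an", "and", "as", "in", "is", "of", "that", "the", "to", "with"}


def clean_abstract(text):
    """Lowercase the text, strip punctuation and symbols, and drop stop words."""
    text = text.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [token for token in text.split() if token not in STOP_WORDS]


tokens = clean_abstract("The enzyme binds to its own mRNA, inhibiting translation.")
```

The resulting token list is what a language model such as TFIDF would take as input in the next step.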

In the second step, the cleaned text (abstract) is converted into a k-dimensional feature vector using a statistical language model. Based on the accuracy reported in the original DextMP paper [19], we used the two best language models, Term Frequency Inverse Document Frequency (TFIDF) [22] and a deep neural network named the paragraph vector [23], to construct the feature vector. TFIDF is a vector computed from the count of each word in the abstract relative to the frequency of words observed in the text corpus, which is a dictionary of words taken from all abstracts in a dataset. As the text corpus we used a dataset of abstracts for 263 moonlighting proteins and 162 non-moonlighting proteins, which were collected in the original DextMP paper. The deep neural network language model maps a text (an abstract) into a vector space using a neural network, which is trained in such a way that semantically similar texts appear close together in the vector space [23]. Thus, intuitively, a vector constructed by the deep neural network captures similarities between abstracts.
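To make the TFIDF representation concrete, the sketch below builds TFIDF vectors for a toy corpus. The weighting used here (relative term frequency multiplied by the logarithm of the inverse document frequency) is one common textbook variant and is an assumption; the exact weighting used in DextMP is not specified in this text. The function name and toy documents are likewise hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return the sorted vocabulary and one TFIDF vector per document."""
    vocab = sorted({token for doc in tokenized_docs for token in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocab
        ])
    return vocab, vectors


# Toy corpus of two already-cleaned abstracts (made-up content).
docs = [
    ["enzyme", "binds", "mrna"],
    ["enzyme", "chaperone", "activity"],
]
vocab, vectors = tfidf_vectors(docs)
```

A word that occurs in every document, such as "enzyme" in this toy corpus, receives a weight of zero, whereas words specific to one abstract receive positive weights.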

Subsequently (Step 3), each of four machine learning methods, logistic regression (LR), support vector machine (SVM), random forest (RF), and the gradient boosted machine (GBM), takes the input feature vector and classifies it as moonlighting or non-moonlighting. The prediction is binary, and each abstract associated with a query protein is predicted as 1 (moonlighting) or 0 (non-moonlighting).

In the last step, the predictions made for the individual abstracts of a protein are summarized by majority vote to make the final prediction. Combinations of the two language models (TFIDF or the deep learning model) and the four machine learning classifiers (LR, SVM, RF, or GBM) resulted in eight final predictions for a protein. If a protein was predicted to be a moonlighting protein by at least one of the language model-classifier combinations, we considered the protein a candidate for moonlighting and passed it to the manual checking steps (Fig. 2A).
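The voting logic of this last step can be sketched as follows. The function names and the toy labels are hypothetical, and the handling of ties (counted here as non-moonlighting) is an assumption, since the text does not state how ties are resolved.

```python
def majority_vote(abstract_predictions):
    """Return 1 if more than half of the per-abstract labels are 1, else 0.

    Ties are resolved as 0 here; this tie rule is an assumption.
    """
    return int(2 * sum(abstract_predictions) > len(abstract_predictions))


def is_candidate(predictions_by_combination):
    """True if at least one model-classifier combination votes moonlighting."""
    return any(
        majority_vote(predictions) == 1
        for predictions in predictions_by_combination.values()
    )


# Made-up per-abstract labels for one protein with three abstracts.
# Only three of the eight combinations are shown to keep the example short.
toy_protein = {
    ("TFIDF", "LR"): [0, 0, 1],
    ("TFIDF", "RF"): [1, 1, 0],
    ("paragraph_vector", "SVM"): [0, 0, 0],
}
candidate = is_candidate(toy_protein)
```

Accepting a protein when any single combination votes for it favors recall over precision, which is consistent with leaving the final decision to the manual checking steps.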

The parameters of the language models and the machine learning methods were trained on the same dataset that was used in the original DextMP paper [19]. The accuracy of DextMP on the training dataset ranged from 0.716 to 0.936 for different language model-classifier combinations, which is comparable to the values reported in the original paper. The program can be downloaded from http://kiharalab.org/DextMP.

Results

Identifying moonlighting proteins

We ran our procedure (Fig. 2) to identify moonlighting proteins on the three genomes. Table 1 shows the number of proteins that were selected at each step of the procedure. In the A. thaliana, C. elegans, and D. melanogaster genomes, 7,045, 1,600, and 2,663 proteins, respectively, had at least one publication after removing hub publications, i.e. those that appear as a reference for more than three proteins. Among them, 1,917, 1,193, and 286 proteins, respectively, were predicted as moonlighting proteins by DextMP. Manual Checking-1 reduced the potential moonlighting proteins significantly, to 1.2%, 1.3%, and 6.6% of the DextMP predictions for the three genomes. In this step, we only checked the titles of publications and the UniProt function summary of proteins, because paper abstracts had already been used as input data for DextMP in the previous step. Manual Checking-2, which involves careful and thorough reading of the literature, finally yielded 13 new moonlighting proteins. A summary of the proteins is provided in Table 2.

Table 1.

The number of proteins in each genome selected by each step of the procedure.

Genome | After removing hub publications | Predicted as moonlighting by DextMP | Passed Manual Checking-1 | Passed Manual Checking-2
A. thaliana | 7,045 | 1,917 | 23 | 7
C. elegans | 1,600 | 1,193 | 16 | 3
D. melanogaster | 2,663 | 286 | 19 | 2
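
The fractions of DextMP predictions that passed Manual Checking-1 follow directly from the counts in Table 1. The short calculation below reproduces them; the variable names are illustrative only.

```python
# Counts taken from Table 1.
predicted = {"A. thaliana": 1917, "C. elegans": 1193, "D. melanogaster": 286}
passed_check1 = {"A. thaliana": 23, "C. elegans": 16, "D. melanogaster": 19}

# Percentage of DextMP predictions that passed Manual Checking-1.
pass_rate = {
    genome: round(100 * passed_check1[genome] / predicted[genome], 1)
    for genome in predicted
}
```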

Table 2.

List of identified moonlighting proteins.

Name | UniProtKB ID | Function 1 (F1) | Function 2 (F2) | Ref. for F1 | Ref. for F2 | Conf. class | Number of Pfam domains
Arabidopsis thaliana
Leucine aminopeptidase 2, chloroplastic | Q944P7 | Molecular chaperones | Leucine aminopeptidase activity, role in insect defense | [27] | [27] | 1 | 1
BTB/POZ and TAZ domain-containing protein 2 | Q94BN0 | component of the TAC1-mediated telomerase activation pathway | mediating diverse hormone, stress, and metabolic responses | [36] | [37] | 2 | 2
Chromophore lyase CRL, chloroplastic | Q9FI46 | Required for plastid division, and involved in cell differentiation and regulation of the cell division plane | Confers sensitivity to cabbage leaf curl virus, probably by hindering its movement | [38] | [39] | 2 | 1
Twinkle homolog protein | B5X582 | DNA helicase | DNA primase | [35] | [35] | 1 | 2
Alkaline/neutral invertase C, mitochondrial | B9DFA8 | Mitochondrial invertase that cleaves sucrose into glucose and fructose | Regulation of aerial tissue development | [40] | [41] | 2 | 1
Actin-depolymerizing factor 9 | O49606 | Stabilize and cross-link actin filaments | Controls expression of Flowering Locus C gene via controlling chromatin remodeling | [28] | [29] | 1 | 1
NEDD8-activating enzyme E1 regulatory subunit AXR1 | P42744 | Regulatory subunit ECR1-AXR1 E1 enzyme | Regulates the chromosomal localization of meiotic recombination by crossovers and subsequent synapsis, probably through the activation of a CRL4 complex | [42] | [43] | 2 | 1
D. melanogaster
Dihydrofolate reductase | P17719 | Interact with vestigial, this interaction may be important for cell proliferation and survival | Dihydrofolate reductase activity | [32] | [30] | 1 | 1
Lon protease homolog, mitochondrial | Q7KUT2 | ATP dependent serine protease | Chaperone function in assembly of inner membrane protein complexes | [44] | By similarity [45] | 2 | 3
C. elegans
Exchange factor for Arf-6 | G5EET6 | guanine nucleotide exchange factor for ARF6 | Limit microtubule growth independent of arf-6, inhibit axon regrowth | By similarity [46] | [47] | 2 | 2
Matrix metalloproteinase-B | O44836 | embryogenesis/adult development | pathogen resistance | [48] | [48] | 2 | 1
Caveolin-2 | Q18879 | Scaffolding protein within caveolar membrane | uptake of lipids and proteins in intestinal cells | By similarity [49] | [50] | 2 | 1
Mitochondrial 2-oxoglutarate/malate carrier protein | P90992 | Transfer alpha ketoglutarate across inner mitochondrial membrane | Control apoptosis through LIN-35/RB-like protein pathway | By similarity [51] | [52] | 2 | 1

In general, there were two reasons why a protein that passed Manual Checking-1 was not judged to be a moonlighting protein in Manual Checking-2, in which we read the abstract and the main text of papers. The first case is that the papers made it clear that the protein was not moonlighting. For example, matrix metalloproteinase-2 in Drosophila (UniProtKB ID: Q8MPP3) has several publications indicating multiple unrelated functions such as tissue remodeling, motor neuron contraction, and re-epithelialization during wound healing. However, when we investigated the mechanism of these functions, we found that in all the biological processes mentioned, the protein performs the same proteinase activity. The second case is that there is not enough information available to conclude that a protein is a moonlighting protein. Tyrosine-protein kinase csk-1 (C-terminal Src kinase) (UniProtKB ID: G5ECJ6) was such an example. csk-1 is known to regulate Src family tyrosine kinases (SFKs). In C. elegans, two SFKs, src-1 and src-2, have been identified, and it has been shown that csk-1 specifically targets the C-terminal tyrosine of both src-1 and src-2, negatively regulating their activities [24]. We found a paper that showed csk-1 is important for pharyngeal muscle organization, independent of src-1 and src-2 [25]. This piqued our interest; however, we found that src-2 is important for larval and pharynx development, and thus csk-1 affects pharynx development both with and without the involvement of the SFKs. Further, the authors suggest that csk-1 might interact with another, unknown protein to control pharynx development by phosphorylating a tyrosine of that protein. Thus, we concluded that csk-1 probably only has the kinase function and might not perform a second function in pharyngeal muscle organization.

Case studies

In Table 2, we classified the predicted proteins into two classes based on the confidence in the independence of the two functions of the proteins. For Class 1 proteins, there is a clear indication in the literature that the two functions are independent of each other or that the functions are performed in different locations in the organism. Proteins that seemingly have two separate functions, but whose independence is not well established by current knowledge, are categorized as Class 2. In the table, we also show the number of domains defined in the Pfam database [26], as protein multifunctionality attributed to either a gene fusion event or the presence of multiple domains is generally not considered moonlighting in the original definition. Below, we discuss the four Class 1 potential moonlighting proteins.

Chloroplastic leucine aminopeptidase 2

The first example is leucine aminopeptidase 2 (LAP2) from Arabidopsis (UniProtKB ID: Q944P7). It is a di-zinc metallopeptidase that catalyzes the cleavage of amino acids from the N-terminus of various peptides. In the paper by Scranton et al., the peptidase activity of LAP2 was demonstrated on a model substrate, leucine-amino methyl coumarin [27]. In the same paper, LAP2 was shown to have chaperone activity as well. Chaperones are proteins that assist other proteins in folding and unfolding. The authors discovered that LAP2 possesses chaperone activity by observing that LAP2 prevented the thermal inactivation of two tested proteins, Luc and Ndel. It was further shown, by mutating the amino acids responsible for the peptidase function, that the chaperone function of LAP2 is independent of its peptidase function [27]. This is a relatively easy example to detect from the literature because the title and the abstract of this paper used the word “moonlighting”.

Actin-depolymerizing factor 9

The second protein is actin-depolymerizing factor 9 (ADF9) in A. thaliana (UniProtKB ID: O49606). It stabilizes actin filaments and acts as an antagonist to other ADFs. In the presence of ADF9, the actin filaments organize themselves into actin bundles, which is similar to the action of other actin bundling proteins. This function has been confirmed in vitro as well as in vivo [28]. ADF9 is also important for the expression of the flowering locus C (FLC) gene, which is responsible for flowering, indicating that ADF9 is a potential moonlighting protein. The adf9 mutation decreased the level of histone H3 at multiple sites of the FLC promoter region, indicating that ADF9 helps keep intact the chromatin remodeling machinery that regulates FLC expression [29].

Dihydrofolate reductase

Dihydrofolate reductase (DHFR) (UniProtKB ID: P17719) is an important enzyme in the folate biosynthesis pathway, where it synthesizes 5,6,7,8-tetrahydrofolate from 7,8-dihydrofolate [30]. DHFR is already a known moonlighting protein in humans, where aside from its enzymatic activity, it also possesses the ability to bind RNA. Human DHFR binds to DHFR mRNA, thus regulating its own synthesis [31]. The DHFR of Drosophila is a moonlighting protein as well, as it also plays a role in cell survival by interacting with another protein, known as vestigial protein, and controlling gene expression [32]. Vestigial protein (vg) regulates the formation of wings by interacting with nuclear regulatory proteins and controlling gene expression in the wing region. It has been observed that vg regulates DHFR expression at the D/V boundary in the wing disc of Drosophila. Also, a decrease in DHFR (and vg) leads to caspase-mediated cell death and wing margin defects [32]. Thus, the second function of Drosophila DHFR is different from that of human DHFR. As we see in this example, it is not uncommon for a homolog of a moonlighting protein either to lack a secondary function [33] or to have a different secondary function [34].

Twinkle homolog protein

The last Class 1 moonlighting protein is the twinkle homolog protein (UniProtKB ID: B5X582), which acts as both a DNA helicase and a DNA primase. A DNA helicase is a protein that unwinds the double-stranded DNA helix into separate strands, opening the DNA for use as a template for DNA replication. A DNA primase, on the other hand, is an enzyme that catalyzes the synthesis of a short single-stranded RNA that helps in DNA replication. Generally, these two functions are performed by two separate enzymes. The gp4 protein of T7 bacteriophage, however, is a multi-functional protein that has both helicase and primase activity [35]. The twinkle homolog protein is homologous to the gp4 protein of T7 bacteriophage. Such homologs are present in several eukaryotes, where they function only as DNA helicases, having lost their DNA primase activity. The twinkle protein in A. thaliana was found to possess both DNA helicase and primase activity, making it a dual-functional protein. It is present in chloroplasts and mitochondria, where its primase function produces RNA primers, which may help in organelle DNA replication [35].

The two functions of this protein are performed by different domains, as shown in Table 2: the primase function is carried out by the toprim domain (Pfam ID: PF13662, Toprim_4), while the helicase activity is performed by the DnaB-like helicase C-terminal domain (Pfam ID: PF03796, DnaB_C). Note that the two-domain structure may disqualify this protein from being a moonlighting protein by the original definition, because that definition considers only cases where bi-functionality is not due to multiple domains or gene fusion events.

Class 2 moonlighting proteins

Table 2 includes nine Class 2 moonlighting proteins. For Class 2 proteins, two functions are described in the literature, but due to the lack of experimental evidence, it is unclear whether one of the functions is an outcome of the other. The Class 2 category also includes cases in which one of the functions is assumed from sequence similarity to a homologous protein. Since homologous proteins do not always share a moonlighting function, the assumed functions need to be verified by experiments.

Discussion

Moonlighting proteins are shedding new light on functional studies of genomes and proteomes. The increasing number of identified moonlighting proteins suggests that the multiplicity of protein functions always needs to be considered in functional studies. Information on the multiple functions associated with a protein is listed in UniProt, but is only indicated in the functional description. As moonlighting in proteins is a fairly new concept, the database has not provided a specific label that indicates moonlighting, or more generally, bi-functionality, which makes a systematic study difficult.

The most accurate approach to identifying moonlighting proteins is to manually read the published literature, that is, to search for proteins that have been experimentally confirmed to perform two or more functions. However, going through a huge number of publications is a daunting and prohibitively time-consuming task. Our group previously developed a text-mining tool, DextMP, which takes text information from publications or functional descriptions in UniProtKB and predicts whether a protein moonlights. DextMP can computationally screen the literature and database entries of thousands of proteins in a genome and provide a short list of potential moonlighting proteins, significantly reducing the burden on users of checking the literature. In this work, we performed genome-wide moonlighting protein identification for three genomes. To the short list provided by DextMP, we applied a two-step manual literature and database check to find promising moonlighting proteins. The first manual screening, Manual Checking-1, i.e., checking literature titles and the UniProtKB functional summary, was introduced for efficiency, and it indeed significantly sped up the entire manual checking process. On the other hand, it is highly possible that some genuine moonlighting proteins were missed at this step. In practice, there is a trade-off between the time required and the number of moonlighting proteins found by a careful and thorough reading of the literature. Manual Checking-2 is a thorough analysis of the publications related to each protein. Specifically, we looked for evidence that inhibition of one function does not affect the second function and vice versa, confirming that the two functions are independent. With further improvement of the accuracy of DextMP, we aim to substantially reduce the manual post-processing step, possibly even removing the manual steps entirely. Below, we discuss several directions for improving DextMP.

While running DextMP, we discovered that hub publications, i.e., papers that are linked to several proteins, mislead the program into classifying those proteins as moonlighting proteins. A preprocessing step that removes such papers is crucial for reducing the number of false positives in the predictions.
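The hub-publication filter described above can be illustrated with a minimal sketch. It is not the preprocessing code used with DextMP: the function name `remove_hub_publications`, the input layout (a mapping from protein IDs to lists of paper IDs), and the cutoff `max_proteins_per_paper` are all assumptions made for illustration, and the threshold value in particular would need to be chosen empirically.

```python
from collections import defaultdict


def remove_hub_publications(protein_to_papers, max_proteins_per_paper=10):
    """Drop papers that are linked to more than `max_proteins_per_paper` proteins.

    `protein_to_papers` maps a protein ID to a list of paper IDs.
    The cutoff is a hypothetical parameter, not a value from the study.
    Returns the filtered mapping and the set of papers treated as hubs.
    """
    # Invert the mapping: for each paper, collect the proteins it is linked to.
    paper_to_proteins = defaultdict(set)
    for protein, papers in protein_to_papers.items():
        for paper in papers:
            paper_to_proteins[paper].add(protein)

    # A paper is a hub if it is linked to too many distinct proteins.
    hubs = {
        paper
        for paper, proteins in paper_to_proteins.items()
        if len(proteins) > max_proteins_per_paper
    }

    # Keep every protein, but remove hub papers from its literature list.
    filtered = {
        protein: [paper for paper in papers if paper not in hubs]
        for protein, papers in protein_to_papers.items()
    }
    return filtered, hubs


# Toy example with made-up identifiers.
links = {
    "P1": ["pmid_a", "pmid_hub"],
    "P2": ["pmid_b", "pmid_hub"],
    "P3": ["pmid_hub"],
}
filtered, hubs = remove_hub_publications(links, max_proteins_per_paper=2)
```

In this toy example, `pmid_hub` is linked to three proteins and is removed, which leaves `P3` with no associated papers; how such proteins should be handled downstream is a separate design decision.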

We found that another source of false positives is the different levels at which functions are described in the literature. For example, protein functions are often discussed at both the molecular and the biological level. Molecular-level descriptions cover a protein's interacting partners, the biological pathways the protein belongs to, or its active site residues, whereas biological-level descriptions cover how the protein acts at the cell or organism level, for example in development, proliferation, and embryogenesis. Currently, DextMP cannot distinguish these two types of function descriptions. Therefore, whenever both levels of information are present, the algorithm identifies them as two independent functions and classifies the publication as containing moonlighting protein information. This was observed during manual analyses of DextMP predictions. Being able to identify the two classes of functions mentioned in a paper would greatly improve the specificity of the model.

As shown in Figure 2, DextMP makes a prediction for each publication associated with a protein separately, and these predictions are then combined by a majority vote to make a final prediction. Therefore, a moonlighting protein will be missed if only one function is mentioned in each individual paper. In practice, this did not seem to be a large problem, because a newer paper reporting a novel secondary function usually mentions the original function of the protein in its abstract. To consider all papers for a protein together, we will need to introduce a way to judge the similarity or dissimilarity of the functions mentioned across papers (i.e., whether they are different and thus potentially moonlighting functions), which is an interesting technical challenge.
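The aggregation step and the limitation noted above can be made concrete with a small sketch. This is an illustration only, not DextMP's implementation: the function name `majority_vote` is hypothetical, per-paper predictions are assumed to be booleans, and the handling of ties and of proteins without papers (both treated as non-moonlighting here) is an assumption, since the text does not specify it.

```python
def majority_vote(paper_predictions):
    """Combine per-paper predictions into one protein-level prediction.

    `paper_predictions` is a list of booleans, one per publication, where
    True means the paper was predicted to describe a moonlighting protein.
    Ties and empty lists are treated as non-moonlighting (an assumption).
    """
    if not paper_predictions:
        return False
    positives = sum(1 for prediction in paper_predictions if prediction)
    # Strict majority: more than half of the papers must be positive.
    return positives > len(paper_predictions) / 2


# Two of three papers are predicted positive, so the protein is called positive.
mostly_positive = majority_vote([True, True, False])

# Each paper mentions only one function, so every per-paper prediction is
# negative and the protein is missed, even if the papers describe
# different functions.
one_function_per_paper = majority_vote([False, False, False])
```

Because the vote only sees per-paper labels, it cannot recover a moonlighting protein whose two functions are reported in separate papers, which is the case the paragraph above describes.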

Finally, analyzing full papers instead of only abstracts will provide more information and should contribute to more accurate predictions, so long as useful information for classification can be effectively extracted from the larger body of text.

Natural language processing (NLP) is a fast-growing area of artificial intelligence research. By introducing new NLP techniques, we hope to further improve DextMP and contribute to deciphering the complex functional interplay of proteins in the cell.

Statement of Significance of the Study

An increasing number of proteins, called moonlighting proteins, have been found to exhibit two distinct and independent functions. Moonlighting proteins have been attracting attention recently because the concept requires us to update our fundamental understanding of protein functions. Moonlighting proteins also have strong implications for drug development and artificial protein design. In this article, we introduce our computational methods for the systematic identification of moonlighting proteins in genomes. We applied one of the methods, which mines moonlighting proteins from the literature, to three genomes and identified thirteen new moonlighting proteins.

Acknowledgment

We are thankful to Md. Mansurul Bhuiyan for his technical help in running DextMP. We are also thankful to Samarth Mathur, Myson C. Burch, and Lyman Monroe for proofreading the manuscript. This study is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the Army Research Office (ARO) under Cooperative Agreement Number W911NF-17-2-0105. DK also acknowledges support from the National Institute of General Medical Sciences of the NIH (R01GM123055) and the National Science Foundation (DMS1614777).

Abbreviations

GO: gene ontology

TFIDF: term frequency inverse document frequency

LR: linear regression

SVM: support vector machine

RF: random forest

GBM: gradient boosted machine

LAP2: leucine aminopeptidase 2

ADF9: actin-depolymerizing factor 9

FLC: flowering locus C

DHFR: dihydrofolate reductase

Footnotes

Conflict of Interest

The authors declare that there is no conflict of interest regarding the publication of this article.

References

  • [1]. Hawkins T, Chitale M, Luban S, Kihara D, Proteins 2009, 74, 566; Khan IK, Wei Q, Chitale M, Kihara D, Bioinformatics 2015, 31, 271; Chitale M, Hawkins T, Park C, Kihara D, Bioinformatics 2009, 25, 1739; Wei Q, McGraw J, Khan I, Kihara D, Methods Mol Biol 2017, 1611, 1; Jiang Y, Oron TR, Clark WT, Bankapur AR, D’Andrea D, Lepore R, Funk CS, Kahanda I, Verspoor KM, Ben-Hur A, Koo da CE, Penfold-Brown D, Shasha D, Youngs N, Bonneau R, Lin A, Sahraeian SM, Martelli PL, Profiti G, Casadio R, Cao R, Zhong Z, Cheng J, Altenhoff A, Skunca N, Dessimoz C, Dogan T, Hakala K, Kaewphan S, Mehryary F, Salakoski T, Ginter F, Fang H, Smithers B, Oates M, Gough J, Toronen P, Koskinen P, Holm L, Chen CT, Hsu WL, Bryson K, Cozzetto D, Minneci F, Jones DT, Chapman S, Bkc D, Khan IK, Kihara D, Ofer D, Rappoport N, Stern A, Cibrian-Uhalte E, Denny P, Foulger RE, Hieta R, Legge D, Lovering RC, Magrane M, Melidoni AN, Mutowo-Meullenet P, Pichler K, Shypitsyna A, Li B, Zakeri P, ElShal S, Tranchevent LC, Das S, Dawson NL, Lee D, Lees JG, Sillitoe I, Bhat P, Nepusz T, Romero AE, Sasidharan R, Yang H, Paccanaro A, Gillis J, Sedeno-Cortes AE, Pavlidis P, Feng S, Cejuela JM, Goldberg T, Hamp T, Richter L, Salamov A, Gabaldon T, Marcet-Houben M, Supek F, Gong Q, Ning W, Zhou Y, Tian W, Falda M, Fontana P, Lavezzo E, Toppo S, Ferrari C, Giollo M, Piovesan D, Tosatto SC, Del Pozo A, Fernandez JM, Maietta P, Valencia A, Tress ML, Benso A, Di Carlo S, Politano G, Savino A, Rehman HU, Re M, Mesiti M, Valentini G, Bargsten JW, van Dijk AD, Gemovic B, Glisic S, Perovic V, Veljkovic V, Veljkovic N, Almeida ESDC, Vencio RZ, Sharan M, Vogel J, Kansakar L, Zhang S, Vucetic S, Wang Z, Sternberg MJ, Wass MN, Huntley RP, Martin MJ, O’Donovan C, Robinson PN, Moreau Y, Tramontano A, Babbitt PC, Brenner SE, Linial M, Orengo CA, Rost B, Greene CS, Mooney SD, Friedberg I, Radivojac P, Genome biology 2016, 17, 184.
  • [2]. Piatigorsky J, O’Brien WE, Norman BL, Kalumuck K, Wistow GJ, Borras T, Nickerson JM, Wawrousek EF, Proc Natl Acad Sci U S A 1988, 85, 3479.
  • [3]. Pandya C, Farelli JD, Dunaway-Mariano D, Allen KN, The Journal of biological chemistry 2014, 289, 30229.
  • [4]. Jeffery CJ, Trends in biochemical sciences 1999, 24, 8.
  • [5]. Chu E, Koeller DM, Casey JL, Drake JC, Chabner BA, Elwood PC, Zinn S, Allegra CJ, Proceedings of the National Academy of Sciences of the United States of America 1991, 88, 8977.
  • [6]. Stutts MJ, Canessa CM, Olsen JC, Hamrick M, Cohn JA, Rossier BC, Boucher RC, Science 1995, 269, 847.
  • [7]. Jeffery CJ, IUBMB life 2011, 63, 489; Sriram G, Martinez JA, McCabe ER, Liao JC, Dipple KM, American journal of human genetics 2005, 76, 911.
  • [8]. Jeffery CJ, Frontiers in genetics 2015, 6, 211.
  • [9]. Chen C, Zabad S, Liu H, Wang W, Jeffery C, Nucleic Acids Res 2018, 46, D640.
  • [10]. Franco-Serrano L, Hernandez S, Calvo A, Severi MA, Ferragut G, Perez-Pons J, Pinol J, Pich O, Mozo-Villarias A, Amela I, Querol E, Cedano J, Nucleic acids research 2018, 46, D645.
  • [11]. Khan IK, Kihara D, Biochemical Society transactions 2014, 42, 1780.
  • [12]. Gomez A, Domedel N, Cedano J, Pinol J, Querol E, Bioinformatics 2003, 19, 895; Hernandez S, Franco L, Calvo A, Ferragut G, Hermoso A, Amela I, Gomez A, Querol E, Cedano J, Front Bioeng Biotechnol 2015, 3, 90; Khan I, Chitale M, Rayon C, Kihara D, BMC proceedings 2012, 6 Suppl 7, S5.
  • [13]. Hernandez S, Amela I, Cedano J, Pinol J, Perez-Pons J, Mozo-Villarias A, Querol E, Proteomics Bioinform J 2012, 5, 262; Tompa P, Szasz C, Buday L, Trends in biochemical sciences 2005, 30, 484; Dyson HJ, Quarterly reviews of biophysics 2011, 44, 467.
  • [14]. Gomez A, Hernandez S, Amela I, Pinol J, Cedano J, Querol E, Mol Biosyst 2011, 7, 2379; Pritykin Y, Ghersi D, Singh M, PLoS computational biology 2015, 11, e1004467; Chapple CE, Robisson B, Spinelli L, Guien C, Becker E, Brun C, Nature communications 2015, 6, 7412.
  • [15]. Cheng L, Leung KS, Bioinformatics 2018.
  • [16]. Pundir S, Martin MJ, O’Donovan C, Methods Mol Biol 2017, 1558, 41.
  • [17]. Consortium GO, Nucleic acids research 2015, 43, D1049.
  • [18]. Khan I, Chen Y, Dong T, Hong X, Takeuchi R, Mori H, Kihara D, Biology direct 2014, 9, 30.
  • [19]. Khan IK, Bhuiyan M, Kihara D, Bioinformatics 2017, 33, i83.
  • [20]. Khan I, McGraw J, Kihara D, Methods Mol Biol 2017, 1611, 45; Khan IK, Kihara D, Bioinformatics 2016, 32, 2281.
  • [21]. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO, Proc Natl Acad Sci USA 1999, 96, 4285.
  • [22]. Manning CD, Raghavan P, Schutze H, Introduction to Information Retrieval, Cambridge University Press, 2008.
  • [23]. Le Q, Mikolov T, Proceedings of the 31st International Conference on Machine Learning, PMLR 2014, 32, 1188.
  • [24]. Hirose T, Koga M, Ohshima Y, Okada M, FEBS Lett 2003, 534, 133.
  • [25]. Takata N, Itoh B, Misaki K, Hirose T, Yonemura S, Okada M, Genes Cells 2009, 14, 381.
  • [26]. Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A, Nucleic acids research 2016, 44, D279.
  • [27]. Scranton MA, Yee A, Park SY, Walling LL, The Journal of biological chemistry 2012, 287, 18408.
  • [28]. Tholl S, Moreau F, Hoffmann C, Arumugam K, Dieterle M, Moes D, Neumann K, Steinmetz A, Thomas C, FEBS letters 2011, 585, 1821.
  • [29]. Burgos-Rivera B, Ruzicka DR, Deal RB, McKinney EC, King-Reid L, Meagher RB, Plant molecular biology 2008, 68, 619.
  • [30]. Hao H, Tyshenko MG, Walker VK, The Journal of biological chemistry 1994, 269, 15179.
  • [31]. Chu E, Takimoto CH, Voeller D, Grem JL, Allegra CJ, Biochemistry 1993, 32, 4756.
  • [32]. Delanoue R, Legent K, Godefroy N, Flagiello D, Dutriaux A, Vaudin P, Becker JL, Silber J, Cell death and differentiation 2004, 11, 110.
  • [33]. Ozimek P, Kotter P, Veenhuis M, van der Klei IJ, FEBS letters 2006, 580, 46.
  • [34]. Banerjee S, Nandyala AK, Raviprasad P, Ahmed N, Hasnain SE, Journal of bacteriology 2007, 189, 4046; Chen XJ, Wang X, Kaufman BA, Butow RA, Science 2005, 307, 714; Tang Y, Guest JR, Microbiology 1999, 145 (Pt 11), 3069.
  • [35]. Diray-Arce J, Liu B, Cupp JD, Hunt T, Nielsen BL, BMC plant biology 2013, 13, 36.
  • [36]. Ren S, Mandadi KK, Boedeker AL, Rathore KS, McKnight TD, The Plant cell 2007, 19, 23.
  • [37]. Mandadi KK, Misra A, Ren S, McKnight TD, Plant physiology 2009, 150, 1930.
  • [38]. Asano T, Yoshioka Y, Kurei S, Sakamoto W, Machida Y, The Plant journal : for cell and molecular biology 2004, 38, 448.
  • [39]. Trejo-Saavedra DL, Vielle-Calzada JP, Rivera-Bustamante RF, Virology journal 2009, 6, 169.
  • [40]. Ji X, Van den Ende W, Van Laere A, Cheng S, Bennett J, Journal of molecular evolution 2005, 60, 615.
  • [41]. Martin ML, Lechner L, Zabaleta EJ, Salerno GL, Planta 2013, 237, 813.
  • [42]. Hotton SK, Eigenheer RA, Castro MF, Bostick M, Callis J, Plant molecular biology 2011, 75, 515.
  • [43]. Jahns MT, Vezon D, Chambon A, Pereira L, Falque M, Martin OC, Chelysheva L, Grelon M, PLoS biology 2014, 12, e1001930.
  • [44]. Matsushima Y, Goto Y, Kaguni LS, Proceedings of the National Academy of Sciences of the United States of America 2010, 107, 18410.
  • [45]. Goto-Yamada S, Mano S, Nakamori C, Kondo M, Yamawaki R, Kato A, Nishimura M, Plant & cell physiology 2014, 55, 482.
  • [46]. Casanova JE, Traffic 2007, 8, 1476.
  • [47]. O’Rourke SM, Christensen SN, Bowerman B, Nature cell biology 2010, 12, 1235.
  • [48]. Altincicek B, Fischer M, Luersen K, Boll M, Wenzel U, Vilcinskas A, Developmental and comparative immunology 2010, 34, 1160.
  • [49]. Williams TM, Lisanti MP, Genome biology 2004, 5, 214.
  • [50]. Parker S, Walker DS, Ly S, Baylis HA, Molecular biology of the cell 2009, 20, 1763.
  • [51]. Iacobazzi V, Palmieri F, Runswick MJ, Walker JE, DNA sequence : the journal of DNA sequencing and mapping 1992, 3, 79.
  • [52]. Gallo M, Park D, Luciani DS, Kida K, Palmieri F, Blacque OE, Johnson JD, Riddle DL, PloS one 2011, 6, e17827.
