Significance
Elucidation of neuropeptide–receptor pairs is essential for the investigation of peptidergic signalling processes. Although sequence alignment and molecular phylogenetic analysis can easily predict G protein-coupled receptors for homologous neuropeptides, these methods cannot predict receptors for novel peptides, so many neuropeptide–receptor pairs remain to be identified. We used our original machine-learning system, peptide descriptor-incorporated support vector machine, to predict multiple neuropeptide–receptor pairs of the vertebrate sister group, Ciona robusta. The Ciona-specific neuropeptide–receptor pairs were validated with cell-based pharmacological assays, showing biological roles for the neuropeptides in a model protochordate. Because of the critical phylogenetic position of Ciona, the present study also elucidates the evolutionary processes underlying neuropeptidergic systems in chordates.
Keywords: machine learning, peptide descriptor, deorphanization, neuropeptide, G protein-coupled receptor
Abstract
Neuropeptides play pivotal roles in various biological events in the nervous, neuroendocrine, and endocrine systems, and are correlated with both physiological functions and unique behavioral traits of animals. Elucidation of functional interactions between neuropeptides and receptors is a crucial step for the verification of their biological roles and evolutionary processes. However, most receptors for novel peptides remain to be identified. Here, we report the identification of multiple G protein-coupled receptors (GPCRs) for species-specific neuropeptides of the vertebrate sister group, Ciona intestinalis Type A, by combining machine learning and experimental validation. We developed an original peptide descriptor-incorporated support vector machine and used it to predict 29 neuropeptide–GPCR pairs. Of note, signaling assays of the predicted pairs identified 1 homologous and 11 Ciona-specific neuropeptide–GPCR pairs for a 41% hit rate: the respective GPCRs for Ci-GALP, Ci-NTLP-2, Ci-LF-1, Ci-LF-2, Ci-LF-5, Ci-LF-6, Ci-LF-7, Ci-LF-8, Ci-YFV-1, and Ci-YFV-3. Interestingly, molecular phylogenetic tree analysis revealed that these receptors, excluding the Ci-GALP receptor, were evolutionarily unrelated to any other known peptide GPCRs, confirming that these GPCRs constitute unprecedented neuropeptide receptor clusters. Altogether, these results verified the neuropeptide–GPCR pairs in the protochordate and evolutionary lineages of neuropeptide GPCRs, and pave the way for investigating the endogenous roles of novel neuropeptides in the closest relatives of vertebrates and the evolutionary processes of neuropeptidergic systems throughout chordates. In addition, the present study indicates the versatility of the machine-learning–assisted strategy for the identification of novel peptide–receptor pairs in various organisms.
Ascidians (tunicates) are invertebrate chordates and the phylogenetically closest living relatives of vertebrates (1–3). Such a critical phylogenetic position sheds light on the significance of investigating the evolutionary process and diversity of biological systems throughout the chordates, including the nervous, neuroendocrine, and endocrine systems (1, 4). Neuropeptides play various pivotal roles in these systems as multifunctional signaling molecules, and the majority of cognate receptors for neuropeptides belong to the G protein-coupled receptor (GPCR) superfamily (5, 6). Thus, the elucidation of specific neuropeptide–GPCR pairs is a primary step in the investigation of the biological roles of neuropeptides, their underlying regulatory mechanisms, and their evolutionary history. In the cosmopolitan species of ascidians, Ciona intestinalis Type A (Ciona robusta), many major neuropeptides (∼40) have so far been characterized by purification, cDNA cloning, and peptidomic approaches (7–13). These neuropeptides are classified into two groups. The first group includes homologs or prototypes of vertebrate neuropeptides: cholecystokinin, calcitonin, gonadotropin-releasing hormones (GnRHs), galanin-like peptides (GALP), tachykinin, and vasopressin (7–13). The molecular characterization of Ciona neuropeptides substantiated that this invertebrate chordate conserves a greater number of neuropeptide homologs than protostomes (e.g., Caenorhabditis elegans and Drosophila melanogaster) and other invertebrate deuterostomes (7–13), confirming the evolutionary and phylogenetic relatedness of ascidians to vertebrates. The second group includes Ciona-specific novel neuropeptides, namely Ci-NTLPs, Ci-LFs, and Ci-YFV/Ls (SI Appendix, Fig. S1 and Table S1), which share neither consensus motifs nor sequence similarity with any other peptides (8, 9). 
The presence of both homologous and species-specific neuropeptides highlights this phylogenetic relative of vertebrates as a prominent model organism for studies of molecular and functional conservation and specialization in neuropeptidergic systems during chordate evolution. To date, ∼160 GPCRs have been predicted and categorized into five major groups (glutamate, rhodopsin, adhesion, frizzled, and secretin) in Ciona (14). Furthermore, GPCRs for Ciona tachykinins (Ci-TKs) (10), GnRHs (11), cholecystokinin (12), and vasopressin (13) have been elucidated based on the similarity of their sequences to vertebrate homologs. These findings are in good agreement with the principle that GPCRs for homologous neuropeptides possess sequence similarity to homologous GPCRs conserved in other species. In contrast, GPCRs for novel neuropeptides cannot be predicted based on sequence similarity, which has hampered the identification of GPCRs for these neuropeptides. Indeed, no GPCRs for the aforementioned novel neuropeptides (Ci-NTLPs, Ci-LFs, and Ci-YFV/Ls) have ever been identified because these neuropeptides share neither consensus motifs nor sequence similarity with any other peptide. Thus, their cognate GPCRs cannot be predicted by multiple-sequence alignment-based molecular phylogenetic analyses. Similarly, although recent advances in transcriptomes and peptidomes have led to the discovery of numerous putative highly conserved and novel neuropeptides and their cognate receptor candidates (8, 15), many novel GPCRs still remain to be deorphanized.
To date, reverse-pharmacology techniques have been employed for the elucidation of novel ligand–GPCR pairs (16). However, the reverse-pharmacology strategy for deorphanization of GPCRs is not systematic: it is time-consuming, costly, and dependent on serendipity. Additionally, limited information regarding GPCR tertiary structures and variations in ligand–receptor binding modes has hampered tertiary structure-based prediction, including homology modeling, and virtual screening of peptide ligands for orphan GPCRs. Indeed, only a few low molecular-weight molecules, but not peptides, have been characterized as novel ligands for GPCRs (17–20). These shortcomings indicate the need for a new general and systematic approach for the identification of various novel peptide–GPCR pairs.
Statistical machine learning has been used to predict various ligand–receptor pairs (21–24). In the chemical genomics-based strategy, known ligand–receptor pair information is encoded as numerical vectors (descriptors) or kernels representing amino acid sequences or physicochemical properties, which are input to a machine-learning system, such as a support vector machine (SVM). Indeed, machine-learning systems were used to predict multiple novel ligand–protein pairs using integrated pattern recognition of chemical properties and sequence information of ligands and receptors (25). We previously predicted low molecular-weight drug candidates for human GPCRs using this machine learning system (21, 26). These findings demonstrate the potential of machine learning in the prediction of Ciona peptide–GPCR pairs. However, no peptide descriptors (PDs) are available for machine learning for the reliable and efficient prediction of neuropeptide–GPCR pairs (21, 26).
In this study, we identified 12 (11 Ciona-specific and 1 homologous) neuropeptide–GPCR pairs by a combination of an originally developed machine-learning system, PD-incorporated SVM, and experimental evidence for specific signaling by the predicted neuropeptide–GPCR pairs, and verified unprecedented phylogenetic relatedness of GPCRs for neuropeptides.
Results
CPI Data Collection.
A total of 1,352 compound–protein interactions (CPIs) were collected from IUPHAR, GPCR-SARfari, and UniProt annotations and literature and used as the training dataset. These were composed of 531 human, 310 mouse, 379 vertebrate (vertebrates other than humans and mice), and 132 invertebrate (nonascidian invertebrates) CPIs (Dataset S1). Subsequently, collected GPCRs or peptides were converted into descriptors for machine learning (21, 26, 27). Molecular descriptors for low molecular-weight compounds (28, 29) and proteins (30, 31) have been available for machine-learning–based prediction of CPIs (21, 26). However, chemical descriptors are limited to low molecular-weight compounds due to the computational burden imposed by larger compounds, and protein descriptors cannot be used with short amino acid peptides due to the sparse information available for them for machine learning. To develop PDs possessing peptide physicochemical and biological properties, we initially designed PDs composed of regular expressions (Fig. 1A and SI Appendix, Table S2), which are 1- to 5-aa sequences comprising any amino acid and their physicochemical properties defined by PROFEAT categories (32). The PDs generated 25,935,478-dimensional bit (0, 1) vectors, which represent the absence and presence of subsequences matching the regular expressions. GPCRs were encoded with a transmembrane (TM) z-scale descriptor according to our previous study (21).
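As a rough illustration of the descriptor idea described above, the following minimal sketch encodes a peptide as a bit vector whose elements record whether a regular expression matches any subsequence. It is not the descriptor set used in this study: the three property classes are hypothetical stand-ins for the PROFEAT categories, the function names are illustrative, and patterns are limited to 1 to 2 positions (rather than 1 to 5) to keep the vector small.

```python
import re
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical property classes for illustration only
# (NOT the PROFEAT categories used in the study).
CLASSES = {
    "hydrophobic": "AVLIMFWC",
    "polar": "STNQYGPH",
    "charged": "DEKR",
}

def build_patterns(max_len=2):
    """Enumerate regexes in which each position is a literal residue,
    a property class, or any residue ('.')."""
    tokens = list(AMINO_ACIDS)
    tokens += [f"[{members}]" for members in CLASSES.values()]
    tokens.append(".")
    patterns = []
    for length in range(1, max_len + 1):
        for combo in product(tokens, repeat=length):
            patterns.append("".join(combo))
    return patterns

def encode(peptide, patterns):
    """Return a 0/1 vector: 1 if the pattern matches somewhere in the peptide."""
    return [1 if re.search(p, peptide) else 0 for p in patterns]

patterns = build_patterns(max_len=2)
vector = encode("MMLGPGIL", patterns)  # Ci-NTLP-2 sequence from the text
```

With 24 tokens per position, this toy setting yields 24 + 24² = 600 patterns; extending to five positions and a richer set of property classes is what makes the dimensionality grow so quickly.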
Subsequently, we estimated the distribution of similarity scores (SSs) between each peptide or GPCR and the most similar sample within its own subset or within the other subsets, as previously described (21). Tanimoto coefficients (33) of TM z-scale descriptors and the aforementioned original PDs were used for the estimation of the SSs of the GPCRs and peptides, respectively. The average GPCR SSs of humans, mice, vertebrates, and invertebrates with themselves were 0.849 ± 0.020, 0.860 ± 0.020, 0.905 ± 0.016, and 0.762 ± 0.026, respectively (Fig. 2A and SI Appendix, Fig. S2A). The average GPCR SSs of humans, mice, vertebrates, and invertebrates with other subsets were 0.855 ± 0.015, 0.906 ± 0.012, 0.912 ± 0.012, and 0.436 ± 0.014, respectively (Fig. 2B and SI Appendix, Fig. S2B). Although the GPCR SSs of humans, mice, and vertebrates with other subsets were all higher than 0.8, that of invertebrates was only 0.436, indicating that the collected invertebrate GPCRs are dissimilar to those of the other subsets. The average peptide SSs of humans, mice, vertebrates, and invertebrates with themselves were 0.800 ± 0.025, 0.751 ± 0.029, 0.823 ± 0.024, and 0.314 ± 0.031, respectively (Fig. 2C and SI Appendix, Fig. S2C). The average peptide SSs of humans, mice, vertebrates, and invertebrates with other subsets were 0.676 ± 0.029, 0.778 ± 0.027, 0.822 ± 0.021, and 0.268 ± 0.017, respectively (Fig. 2D and SI Appendix, Fig. S2D). Similar to the GPCR SSs, the invertebrate peptide SS (0.268) was extremely small compared with the SSs of humans, mice, and vertebrates (>0.6). This invertebrate-specific distribution of SSs (Fig. 2 and SI Appendix, Fig. S2) reflects the sequence diversity of invertebrate GPCRs and peptides, whereas vertebrate GPCRs and peptides contain more orthologs. To estimate the prediction performance for species-specific CPIs, we evaluated the performance using leave-one-species-out (LOSO) validation (21, 26, 28).
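The similarity score described above can be sketched as a nearest-neighbor Tanimoto coefficient. The form below, a·b / (|a|² + |b|² − a·b), reduces to the usual shared-bits-over-union definition for 0/1 vectors; whether this exact form was applied to the real-valued TM z-scale descriptors is an assumption, and the function names are illustrative.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length numeric vectors.
    For 0/1 vectors this equals |A and B| / |A or B|."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a)
    norm_b = sum(y * y for y in b)
    denom = norm_a + norm_b - dot
    return dot / denom if denom else 0.0

def nearest_similarity(query, references):
    """Similarity score of a sample: its highest Tanimoto coefficient
    against a reference subset (which should exclude the query itself)."""
    return max(tanimoto(query, ref) for ref in references)

def mean_similarity(queries, references):
    """Average nearest-neighbor similarity of one subset against another."""
    scores = [nearest_similarity(q, references) for q in queries]
    return sum(scores) / len(scores)
```

A low average for one subset against the others, as reported for the invertebrate CPIs, means that even the closest training example lies far from the query in descriptor space.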
PD-Incorporated SVM Prediction of Ciona Neuropeptide–Receptor Pairs.
PDs encoding peptides and a TM z-scale descriptor encoding GPCRs (Fig. 1B) were utilized for the encoding of 1,352 CPIs (Dataset S1) and the same number of generated noninteraction pairs (Materials and Methods). The resulting CPIs and noninteraction pairs were in turn utilized as training sets for SVMs (Fig. 1B). The prediction performances of trained SVMs were evaluated by LOSO internal validation using the predicted CPIs and noninteraction pairs as test sets (21, 26, 28). Because the CPI datasets partitioned into respective subsets for peptide–GPCR interactions in humans, mice, other vertebrates, and invertebrates were predicted using models containing the other datasets in a LOSO analysis, species-wide prediction performance was evaluated by LOSO cross-validation. The LOSO analysis using the PD-incorporated SVM produced values for leave-humans-, mice-, vertebrates-, and invertebrates-out of 0.949 ± 0.003, 0.977 ± 0.001, 0.988 ± 0.001, and 0.592 ± 0.032 for the area under the receiver operating characteristic curve (AUC) and 0.884 ± 0.010, 0.937 ± 0.010, 0.971 ± 0.003, and 0.501 ± 0.101 for accuracy (ACC) (Fig. 3A and SI Appendix, Table S3).
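The LOSO scheme can be sketched as follows: each subset (human, mouse, other vertebrate, invertebrate) is held out in turn, a model is trained on the remaining subsets, and AUC and ACC are computed on the held-out subset. This is a minimal illustration with hypothetical record and function names; the SVM training step itself is omitted, and the 0.5 accuracy threshold is an assumption.

```python
from collections import defaultdict

def loso_splits(records):
    """Yield (held_out_group, train, test), holding out one group at a time.
    Each record is a tuple (group, features, label)."""
    by_group = defaultdict(list)
    for rec in records:
        by_group[rec[0]].append(rec)
    for group, test in by_group.items():
        train = [r for g, recs in by_group.items() if g != group for r in recs]
        yield group, train, test

def auc(scores, labels):
    """AUC as the probability that a random positive scores higher than
    a random negative (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(scores, labels, threshold=0.5):
    """Fraction of pairs whose thresholded score matches the label."""
    correct = sum((s > threshold) == (y == 1) for s, y in zip(scores, labels))
    return correct / len(labels)
```

Because the held-out subset never contributes to training, a poor leave-invertebrates-out score directly reflects how well the model extrapolates to species whose peptides and receptors are dissimilar to the training data.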
To confirm the prediction performance of the present PDs, the peptide–receptor prediction performance using other descriptors—specifically, 5–0, 5–1, and 5–2 mismatch descriptors (30, 34), a class of string kernels that compare sequence strings representing k-mer subsequences—were also evaluated by LOSO. LOSO analysis using the 5–0 mismatch descriptors for leave-humans-, mice-, vertebrates-, and invertebrates-out yielded 0.800 ± 0.011, 0.875 ± 0.003, 0.921 ± 0.012, and 0.436 ± 0.006 for the AUC and 0.743 ± 0.031, 0.852 ± 0.005, 0.861 ± 0.037, and 0.443 ± 0.020 for ACC (SI Appendix, Fig. S4A and Table S3). LOSO analysis using the 5–1 mismatch descriptors for leave-humans-, mice-, vertebrates-, and invertebrates-out yielded 0.867 ± 0.011, 0.925 ± 0.004, 0.962 ± 0.003, and 0.473 ± 0.012 for the AUC and 0.820 ± 0.021, 0.890 ± 0.012, 0.924 ± 0.017, and 0.496 ± 0.022 for ACC (SI Appendix, Fig. S4A and Table S3). LOSO analysis using the 5–2 mismatch descriptors for leave-humans-, mice-, vertebrates-, and invertebrates-out yielded 0.792 ± 0.008, 0.848 ± 0.011, 0.898 ± 0.009, and 0.497 ± 0.010 for the AUC and 0.737 ± 0.031, 0.815 ± 0.018, 0.861 ± 0.024, and 0.493 ± 0.020 for ACC (SI Appendix, Fig. S4A and Table S3). These data indicate that the scores of our developed PDs were higher than those of 5–0, 5–1, and 5–2 mismatch descriptors, confirming high prediction performance of the developed PDs. Consequently, we employed our PDs for the following analysis. However, the prediction performance for leave-invertebrates-out was still lower (0.592 ± 0.032 for the AUC) than that for vertebrates (leave-humans-, mice-, and vertebrates-out). To improve the prediction performance, we optimized the PDs using two rounds of genetic algorithm-based feature selection (GAFS) (Fig. 1B; also see SI Appendix, Supplemental Methods).
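For comparison, the idea behind a k–m mismatch descriptor can be sketched as follows: every k-mer in a sequence contributes to all k-mers that differ from it at no more than m positions, and the kernel value of two sequences is the inner product of the resulting count vectors. This is a naive, explicit-enumeration illustration with illustrative function names, not the implementation cited in refs. 30 and 34.

```python
from collections import Counter
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def neighbors(kmer, m):
    """All strings within Hamming distance <= m of the given k-mer."""
    out = {kmer}
    for d in range(1, m + 1):
        for positions in combinations(range(len(kmer)), d):
            choices = [[a for a in AMINO_ACIDS if a != kmer[i]] for i in positions]
            for subs in product(*choices):
                chars = list(kmer)
                for i, a in zip(positions, subs):
                    chars[i] = a
                out.add("".join(chars))
    return out

def mismatch_features(seq, k=5, m=1):
    """Sparse feature map: counts of k-mers reachable with <= m mismatches."""
    feats = Counter()
    for i in range(len(seq) - k + 1):
        for n in neighbors(seq[i:i + k], m):
            feats[n] += 1
    return feats

def mismatch_kernel(x, y, k=5, m=1):
    """Inner product of the two mismatch feature maps."""
    fx, fy = mismatch_features(x, k, m), mismatch_features(y, k, m)
    return sum(count * fy[key] for key, count in fx.items() if key in fy)
```

Short peptides contain only a handful of 5-mers (an 8-aa peptide has four), which is one way to see why such descriptors carry sparse information for neuropeptides.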
After GAFS, the optimized PDs displayed leave-humans-, mice-, and vertebrates-out values of 0.955 ± 0.002, 0.972 ± 0.004, and 0.986 ± 0.002 for the AUC and 0.889 ± 0.011, 0.926 ± 0.013, and 0.959 ± 0.006 for ACC. Furthermore, the AUC and ACC for leave-invertebrates-out improved to 0.813 ± 0.006 and 0.847 ± 0.076, respectively. These scores indicated high prediction performance of the PD-incorporated SVM for neuropeptide–GPCR pairs across the species subsets examined. Subsequently, we examined the prediction accuracy of the PD-incorporated SVM, trained with all 1,352 CPIs (Dataset S1) and the noninteraction pairs, for eight known CPIs between Ciona peptides and their cognate receptors (Dataset S1) that were not included in the LOSO analysis. As shown in Fig. 3C, Ci-TK-I, Ci-TK-II, and cionin (Ciona cholecystokinin homolog) were predicted to interact specifically with cognate receptors Ci-TK-R, CioR1, and CioR2, respectively, by machine learning. These outputs were completely consistent with the previous experimental evidence for their specific interactions (10, 12). Similarly, t-GnRH-3, t-GnRH-5, and t-GnRH-6 were predicted to interact with their cognate receptors with somewhat low ligand selectivity, as previously reported (11). Thus, the present PD-incorporated SVM was found to predict all eight known Ciona peptide–GPCR pairs with an accuracy of 80.95%. In contrast, no positive Ciona peptide–GPCR pairs were predicted with machine-learning models with 5–0, 5–1, and 5–2 mismatch descriptors (30, 34), which agrees with their low leave-invertebrates-out validation scores (Fig. 3B and SI Appendix, Fig. S4 B–E). Collectively, the LOSO evaluation (Fig. 3A) and prediction accuracy for datasets of Ciona neuropeptide–receptor pairs (Fig. 3C) demonstrate that the PD-incorporated SVM model detects neuropeptide–GPCR pairs in both vertebrates and invertebrates.
To the best of our knowledge, this is unique as a machine-learning model that can predict peptide-GPCR pairs of any animal species with high accuracy.
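The GAFS step can be pictured with a generic genetic algorithm over feature masks. The sketch below is only a schematic of the technique: the actual procedure is described in SI Appendix, Supplemental Methods, and the population size, tournament selection, single-point crossover, mutation rate, and elitism used here are all assumptions. In practice, the fitness function would be a validation score (for example, a leave-invertebrates-out AUC) of an SVM trained on the selected descriptor elements.

```python
import random

def ga_feature_selection(n_features, fitness, pop_size=20, generations=30,
                         mutation_rate=0.05, seed=0):
    """Search for a 0/1 feature mask that maximizes fitness(mask).
    Requires n_features >= 2 and pop_size >= 3."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        next_pop = [best[:]]  # elitism: the best mask always survives
        while len(next_pop) < pop_size:
            # Tournament selection of two parents.
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            # Single-point crossover followed by bit-flip mutation.
            cut = rng.randrange(1, n_features)
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_pop.append(child)
        pop = next_pop
        best = max(pop, key=fitness)
    return best

# Toy fitness for illustration: agreement with a hypothetical "ideal" mask.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]

def toy_fitness(mask):
    return sum(int(m == t) for m, t in zip(mask, TARGET))

selected = ga_feature_selection(len(TARGET), toy_fitness)
```

Because of elitism, the best fitness in the population never decreases from one generation to the next, which makes the search monotone even though individual offspring may be worse than their parents.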
Using the PD-incorporated SVM trained with all 1,352 CPIs (rows 2–1,353 in Dataset S1) and the noninteraction pairs, we predicted the interactions between 19 Ciona neuropeptides (SI Appendix, Fig. S1 and Table S1) identified by our previous peptidomics study of the central nervous system (8) and 140 putative Ciona GPCRs (Dataset S2) extracted from the Ghost database (35) by GPCRalign (36). Each GPCR ID was abbreviated by omitting the splicing variant information (SI Appendix, Table S4). The prediction values for each pair ranged from 1 (absolute interaction) to 0 (absolute noninteraction). A total of 2,660 Ciona peptide–GPCR pairs [19 Ciona peptides (SI Appendix, Table S1) × 140 Ciona GPCRs (Dataset S2)] were subjected to PD-incorporated SVM prediction, and a total of 13 putative peptide–GPCR pairs with prediction scores higher than 0.7 were obtained for Ciona galanin-like peptide (Ci-GALP), Ci-NTLP-2, Ci-NTLP-3, Ci-LF-2, Ci-LF-3, Ci-LF-8, Ci-YFV-1, and Ci-YFV-3 (Fig. 4A and SI Appendix, Table S5).
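The screening step amounts to scoring every peptide–GPCR combination and keeping those above the cutoff. A minimal sketch, in which `score_fn` is a hypothetical stand-in for the trained model's prediction value and the identifiers are placeholders:

```python
from itertools import product

def candidate_pairs(peptides, gpcrs, score_fn, threshold=0.7):
    """Score all peptide x GPCR combinations and return those whose
    prediction value exceeds the threshold, highest score first."""
    scored = [(p, g, score_fn(p, g)) for p, g in product(peptides, gpcrs)]
    hits = [item for item in scored if item[2] > threshold]
    return sorted(hits, key=lambda item: item[2], reverse=True)

# Placeholder identifiers matching the numbers in the text:
# 19 peptides x 140 GPCRs = 2,660 pairs.
peptides = [f"peptide_{i}" for i in range(19)]
gpcrs = [f"gpcr_{j}" for j in range(140)]
n_pairs = len(peptides) * len(gpcrs)
```

Only pairs that pass the cutoff are carried forward to cell-based assays, which is what keeps the experimental workload small relative to the 2,660 possible combinations.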
Identification of 12 Neuropeptide–GPCR Pairs by Experimental Validation of the Predicted Pairs.
We predicted and evaluated neuropeptide–GPCR pairs in two stages using a self-training strategy for semisupervised learning (37). For the first-stage evaluation, we experimentally assessed seven pairs (Ci-GALP-KH.C3.660; Ci-NTLP-2-KH.C9.683 and KH.C3.920; Ci-LF-2-KH.C2.127, KH.L172.28, and KH.C2.1132; and Ci-YFV-1-KH.C1.745) that had high prediction values (SI Appendix, Table S5) in the aforementioned model. Each promiscuous Gαq16-fused GPCR was transiently expressed in Sf9 cells, and intracellular Ca2+ mobilization was assessed in the presence of various concentrations of the peptide ligands. The cell-based signaling assay demonstrated that Ci-GALP, Ci-NTLP-2, Ci-LF-2, and Ci-YFV-1 induced Ca2+ mobilization in cells transfected with KH.C3.660 (Fig. 4B), KH.C9.683 (Fig. 4C), KH.C2.1132 (Fig. 4D), and KH.C1.745 (Fig. 4E), respectively, with nanomolar potency (Table 1). In contrast, dose-dependent responses were not observed with cells expressing the other receptors. For the second-stage validation, the PD-incorporated SVM was provided with data for the four experimentally validated Ciona GPCR–neuropeptide pairs as positive examples and the three other pairs as negative examples, in line with the self-training strategy (37). The feature set for training and prediction was not changed from the PD-incorporated feature set used above, and the additional datasets were expected to update the discriminant functions (weight vectors) for the estimation of peptide–receptor interactions, leading to the prediction of more peptide–receptor pairs. As shown in Fig. 5A, the updated PD-incorporated SVM with additional training data output 22 putative peptide–GPCR pairs for Ci-NTLP-4, Ci-LF-1, Ci-LF-2, Ci-LF-5 to -8, Ci-YFV-1 to -3, and Ci-YFL-1 (SI Appendix, Table S6). Ca2+-mobilization assays also verified specific (nanomolar potency) interactions of KH.C4.122 with Ci-LF-1 and Ci-LF-6 (Fig. 5 B and C); of KH.C2.1037 with Ci-LF-1, Ci-LF-5, and Ci-LF-6 (Fig. 5 D–F); of KH.C2.878 with Ci-LF-7 (Fig. 5G); of KH.C2.212 with Ci-LF-8 (Fig. 5H); and of KH.C8.781 with Ci-YFV-3 (Fig. 5I and Table 1). In contrast, none of the above neuropeptides induced Ca2+ mobilization at the other GPCRs with prediction scores higher than 0.7. Altogether, these results provided evidence for the identification of a Ci-GALP receptor (Ci-GALP-R), Ci-NTLP-2 receptor (Ci-NTLP-2-R), Ci-LF-1 receptor (Ci-LF-1-R), Ci-LF-2 receptor (Ci-LF-2-R), Ci-LF-5/6 receptor (Ci-LF-5/6-R), Ci-LF-7 receptor (Ci-LF-7-R), Ci-LF-8 receptor (Ci-LF-8-R), Ci-YFV-1 receptor (Ci-YFV-1-R), and Ci-YFV-3 receptor (Ci-YFV-3-R) (Table 1). Although Ci-LF-1-R and Ci-LF-5/6-R were weakly activated by Ci-LF-6 and Ci-LF-1 (Table 1), respectively, Ci-LF-1-R exhibited a 42-fold selectivity for Ci-LF-1 relative to Ci-LF-6, while Ci-LF-5/6-R exhibited a 91-fold selectivity for Ci-LF-6 relative to Ci-LF-1. Consequently, Ca2+-mobilization assays for a total of 29 predicted pairs (7 pairs from the first-stage evaluation and 22 from the second-stage evaluation) resulted in a 41% hit rate (12 experimentally validated pairs).
Table 1.
Ghost database ID for receptor gene | Receptor gene name | Ligand | EC50 (nM) |
KH.C3.660 | Ci-GALP-R | Ci-GALP | 1.29 |
KH.C9.683 | Ci-NTLP-2-R | Ci-NTLP-2 | 11.05 |
KH.C4.122 | Ci-LF-1-R | Ci-LF-1 | 5.25 |
KH.C4.122 | Ci-LF-1-R | Ci-LF-6 | 223.87 |
KH.C2.1037 | Ci-LF-5/6-R | Ci-LF-1 | 141.25 |
KH.C2.1037 | Ci-LF-5/6-R | Ci-LF-5 | 4.78 |
KH.C2.1037 | Ci-LF-5/6-R | Ci-LF-6 | 1.55 |
KH.C2.1132 | Ci-LF-2-R | Ci-LF-2 | 0.71 |
KH.C2.878 | Ci-LF-7-R | Ci-LF-7 | 2.04 |
KH.C2.212 | Ci-LF-8-R | Ci-LF-8 | 1.35 |
KH.C1.745 | Ci-YFV-1-R | Ci-YFV-1 | 24.55 |
KH.C8.781 | Ci-YFV-3-R | Ci-YFV-3 | 1.98 |
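The selectivity values and hit rate quoted in the text follow directly from Table 1. The short calculation below reproduces them, taking fold selectivity as the ratio of the two EC50 values at the same receptor (an assumption about how the quoted figures were derived); the dictionary and function names are illustrative.

```python
# EC50 values (nM) copied from Table 1 for the two cross-reactive receptors.
EC50_NM = {
    ("Ci-LF-1-R", "Ci-LF-1"): 5.25,
    ("Ci-LF-1-R", "Ci-LF-6"): 223.87,
    ("Ci-LF-5/6-R", "Ci-LF-1"): 141.25,
    ("Ci-LF-5/6-R", "Ci-LF-5"): 4.78,
    ("Ci-LF-5/6-R", "Ci-LF-6"): 1.55,
}

def fold_selectivity(receptor, preferred, other):
    """Ratio of EC50 values: how much more potent the preferred ligand is."""
    return EC50_NM[(receptor, other)] / EC50_NM[(receptor, preferred)]

lf1r_selectivity = fold_selectivity("Ci-LF-1-R", "Ci-LF-1", "Ci-LF-6")     # about 42.6
lf56r_selectivity = fold_selectivity("Ci-LF-5/6-R", "Ci-LF-6", "Ci-LF-1")  # about 91.1

# Hit rate: 12 validated pairs out of 7 + 22 = 29 tested predictions.
hit_rate = 12 / (7 + 22)  # about 0.414
```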
Molecular Phylogenetic Tree Analysis of Identified Ciona Neuropeptide GPCRs.
To evaluate the presence of known receptors closely related to the identified Ciona GPCRs, gene trees were estimated by collecting similar bilaterian sequences (Fig. 6). A Basic Local Alignment Search Tool (BLAST) search using the Ci-GALP-R sequence as a query detected similar sequences in genome data representing all deuterostome lineages. Among them, Ci-GALP-R displayed 37–42% sequence identity (SI Appendix, Fig. S5A) to eight vertebrate galanin or GALP receptors (38) and 35–44% sequence identity (SI Appendix, Fig. S5A) to nine putative cephalochordate galanin or GALP receptors, indicating sequence similarity of Ci-GALP-R to other galanin/GALP receptor family GPCRs. Molecular phylogenetic tree analysis demonstrated that urochordate GALP-Rs were positioned outside of both the vertebrate and cephalochordate galanin/GALP receptors (SI Appendix, Fig. S5A), indicating that urochordate GALP-Rs evolved in unique ways. However, the deuterostome GALP-R clade including Ci-GALP-R was consistently supported by both neighbor-joining (NJ) and maximum-likelihood (ML) analysis (Fig. 6A and SI Appendix, Fig. S5 A, 1–3), revealing that Ci-GALP-R shares a common ancestor with the vertebrate galanin receptor proteins.
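Percent sequence identity, as quoted above, is computed from a pairwise alignment. The sketch below counts identical residues over the aligned columns in which neither sequence has a gap; the choice of denominator varies between tools and is not specified in the text, so this particular definition is an assumption.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity of two pre-aligned sequences of equal length,
    ignoring columns where either sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# Toy alignment (hypothetical sequences): 4 identical residues in 5 compared columns.
example = percent_identity("MKT-LV", "MRTALV")
```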
A BLAST search using the Ci-NTLP-2-R sequence as a query identified similar deuterostome sequences, including eight vertebrate adhesion GPCRs (20–24% identity) (SI Appendix, Fig. S5B). However, among these BLAST hits, phylogenetic analyses did not identify any nonurochordate sequence closely related to the Ci-NTLP-2-R sequence (Fig. 6B and SI Appendix, Fig. S5B). In addition, the sequence alignment showed that the N terminus of Ci-NTLP-2-R is shorter than that of other GPCRs (SI Appendix, Fig. S5 B, 4). Some adhesion GPCRs are known as receptors for high molecular-weight protein ligands, such as collagen (adhesion GPCR G6; ENST00000394143.5) (39, 40) and FLRT3 (adhesion GPCR L1; ENST00000340736.10) (41). Notably, the ligand of Ci-NTLP-2-R, Ci-NTLP-2 (8 aa, MMLGPGIL) (SI Appendix, Table S1), is far shorter than collagens (>1,000 aa) and FLRT3 (>600 aa). Given that no significant sequence identity was found between Ci-NTLP-2 and these proteins, Ci-NTLP-2-R is considered to be a GPCR for a short neuropeptide rather than for an adhesion-related protein.
A BLAST search using the Ci-LF-1-R sequence as a query identified similar deuterostome sequences (SI Appendix, Fig. S5C). Ci-LF-Rs showed 21–26% sequence identity to 10 vertebrate GPCRs, including class A GPCRs for small molecular-weight transmitters (cannabinoid receptors, adenosine receptors, and adrenergic receptors) (SI Appendix, Fig. S5C). Phylogenetic analyses indicated that all Ci-LF-Rs belong to a Ciona-specific clade (Fig. 6C) and that this clade is deeply nested within the urochordate LF-R clade consisting of presumable LF-R sequences of Botryllus schlosseri and Oikopleura dioica. The urochordate LF-R clade, however, did not have any closely related sequence from nonurochordate deuterostomes (SI Appendix, Fig. S5C). This result suggests that, after the split from other tunicate lineages, the Ci-LF-Rs evolved within the Ciona lineage as paralogs via gene multiplication, and is in good agreement with the finding that the Ci-LF-Rs share little sequence homology with any hitherto known GPCR for peptides.
A BLAST search using the Ci-YFV-1-R sequence as a query identified similar sequences from urochordates but not from other deuterostomes (Fig. 6D). Phylogenetic analyses (SI Appendix, Fig. S5D) demonstrated that Ci-YFV-Rs were grouped with sequences of probable YFV-Rs of Ciona and B. schlosseri. This result suggests that Ci-YFV-Rs were generated within the urochordate lineage. Combined with the experimental evidence for specific neuropeptide–GPCR pairs (Figs. 4 and 5), these molecular phylogenetic tree analyses suggest that Ci-NTLP-2-R, Ci-LF-Rs, and Ci-YFV-Rs are not closely related to any other known peptide GPCRs.
Expression of Ci-GALP-R, Ci-NTLP-2-R, Ci-LF-R, and Ci-YFV-R Genes in Various Tissues.
Real-time PCR revealed the expression patterns of the identified GPCRs. For example, Ci-LF-Rs, except Ci-LF-8-R, were shown to be expressed specifically in the oral and atrial siphons (Fig. 7), suggesting some biological roles of Ci-LF-1 to -7 in feeding behavior. Ci-GALP-R, Ci-YFV-1-R, and Ci-YFV-3-R were more highly expressed in the neural complex than the other identified GPCRs (Fig. 7). These results demonstrate the unique expression profiles of these GPCRs and suggest that their peptide ligands have diverse biological functions.
Discussion
Neuropeptides play multiple biological roles upon binding to their cognate receptors expressed in various tissues and cells. Thus, identification of neuropeptide–GPCR pairs, namely, deorphanization of GPCRs, is a crucial step in the elucidation of their endogenous roles. Moreover, both novel and homologous neuropeptides have been characterized in various organisms, highlighting the significance of neuropeptidergic signaling systems in molecular and functional evolution and diversification in the animal kingdom. However, elucidation of nonhomologous neuropeptide–receptor pairs remains a severe bottleneck in a wide range of biological sciences, because prediction and identification of the receptors for novel peptides is one of the most time-consuming and serendipity-dependent tasks in biology due to low sequence similarities and poor molecular phylogenetic correlations, even in human and model organisms. Although reverse-pharmacological strategies generally require multiyear trial-and-error testing to elucidate one ligand-receptor pair, the identification of receptors for novel ligands, including species-specific nonhomologous peptides, still depends on this strategy (42). A large-scale combinatorial reverse-pharmacological method identified 19 invertebrate neuropeptide GPCRs (16), but this strategy requires multistep experiments for numerous peptide–receptor pair candidates, and most of the identified peptide–receptors were homologs of other species. In this study, we efficiently and systematically identified multiple neuropeptide–GPCR pairs of the phylogenetically closest relative of vertebrates, C. intestinalis Type A, with the assistance of an original machine learning-based approach. Of particular significance is that we succeeded in elucidating 1 homologous and 11 Ciona-specific neuropeptide–GPCR pairs during validation of 29 predicted peptide–receptor pairs. 
This represents a 41% hit rate using only 1,352 CPIs, namely, data for known endogenous peptide–GPCR pairs. Examination of these 29 predicted interactions and elucidation of 12 (11 Ciona-specific and 1 homologous) neuropeptide–GPCR pairs were completed within only 9 mo after the first-round prediction of Ciona neuropeptide–GPCR pairs (Fig. 4A). This is an obviously higher throughput than that of reverse-pharmacological strategies. Consequently, the present study illustrates the effectiveness of combining PD-incorporated SVM with cell-based experimental validation for the identification of neuropeptide–GPCR pairs.
Combined with previously identified homologous neuropeptide–GPCR pairs, this study led to the elucidation of a total of 26 neuropeptide–GPCR pairs in Ciona, which is comparable to the numbers in conventional protostomian model organisms, such as Drosophila and C. elegans (5). Previously, only a few biological roles of Ciona neuropeptides had been elucidated: regulation of vitellogenic follicles by Ci-TK (9, 43) and metamorphosis by GnRH (9, 44). Thus, the present identification of multiple neuropeptide–GPCR pairs (Figs. 4 and 5) and localization of the GPCR gene expression (Fig. 7) should facilitate the elucidation of neuropeptidergic molecular mechanisms and networks underlying various biological events regulated by the nervous, neuroendocrine, and endocrine systems in Ciona. Furthermore, because Ciona is the closest living relative of vertebrates, this study is also expected to contribute a great deal to the exploration of the common and species-specific evolution of the nervous, neuroendocrine, and endocrine systems throughout the Chordata phylum.
We verified that Ci-LF-1, -2, -5, -6, -7, and -8 and Ci-YFV-1 and -3 exhibited prominent selectivity for their receptors (Figs. 4 and 5), whereas receptors for Ci-LF-3 and -4, Ci-YFV-2, and Ci-YFL-1 have yet to be elucidated. This is mainly due to the failure of expression of the most probable receptor candidate proteins in expression systems, including mammalian cells, insect cells, and Xenopus oocytes, rather than implicit prediction of peptide–receptor systems. Replacement of the N-terminal regions of Ciona GPCRs with those of mammals or insects may enable functional expression, leading to the experimental validation of predicted peptide–GPCR pairs.
Recently, molecular phylogenetic approaches have provided some insight into evolutionary aspects and classification of invertebrate peptides, GPCRs, and peptide–GPCR pairs (5, 45). For example, integrative molecular phylogenetic analyses identified 29 categories of peptide and GPCR subfamilies based on position-specific scoring matrices of GPCRs and peptide precursors, followed by prediction of peptide–GPCR pairs (5, 45). However, these methods were limited to the prediction of known homologous peptide–GPCR pairs. Of particular significance is that Ci-NTLP-2-R, Ci-LF-Rs, and Ci-YFV-Rs constitute unique clades with orphan GPCRs or GPCRs for nonpeptide endogenous ligands, not with hitherto known GPCRs for peptides, indicating that these genes were generated in a species-specific lineage (Fig. 6 and SI Appendix, Fig. S5). The existence of such Ciona-specific evolutionarily unrelated neuropeptide GPCR genes is compatible with a rapid evolutionary rate of the Ciona genome and species-specific gene multiplication (46). In other words, the present molecular phylogenetic trees (Fig. 6 and SI Appendix, Fig. S5) strongly suggest that novel neuropeptide GPCRs also constitute unique clades with GPCRs for nonpeptidic ligands in other species, including humans, supporting the view that methods based on sequence similarity or molecular phylogenetic relatedness have not been useful for predicting novel peptide–GPCR pairs. In contrast, unprecedented molecular mechanisms and evolutionary processes of peptide–GPCR interactions have a high likelihood of being recognized by the PD-incorporated SVM, suggesting that the present machine-learning approach will lead to the exploration of new phylogenetically unrelated GPCR repertoires in a wide range of species, including humans.
Machine-learning methods have provided predictive models or simulations of ligand–receptor interactions (24, 47–50). However, the experimental evidence for these has been limited to nonendogenous small compounds (26). Moreover, to the best of our knowledge, this prediction of peptide–receptor pairs using machine learning, enabled by the development of original PDs, is unique (Fig. 1 and SI Appendix, Table S2). Collectively, the present study shows identification of cognate endogenous peptide–receptor pairs using a sequential combination of machine learning and experimental validation. Additionally, the aforementioned hit rate of the PD-incorporated SVM (41%) was much higher than those reported for in silico virtual screening of small nonendogenous compounds against GPCRs, such as structure-based (20) and other chemical genomic models (26).
The LOSO validation, which enabled the evaluation of species-wide prediction performance, contributed to estimating the prediction performance of the species-specific CPIs. We also estimated the prediction performances using fivefold cross-validation (5-CV) (24, 26). As shown in SI Appendix, Fig. S6, 5-CV showed prediction performance of AUCs higher than 0.85 for all descriptors, including 5–0, 5–1, and 5–2 mismatch descriptors and PDs, whereas no known Ciona peptide–GPCR pairs were predicted (SI Appendix, Fig. S4 B–D). In contrast, despite low performance of the original SVM with any descriptors validated by leave-invertebrates-out analysis (AUC < 0.6), GAFS-optimized PD-incorporated SVM validated by leave-invertebrates-out analysis showed higher prediction performance (AUC of 0.813) and, indeed, correctly predicted all known Ciona peptide–GPCR pairs (Fig. 3 B and C) and led to the elucidation of 12 Ciona peptide–GPCR pairs (Figs. 4 and 5). Collectively, these results indicated that 5-CV overestimated the prediction performance compared with leave-invertebrates-out analysis. These gaps between validation scores and actual prediction accuracy are likely to result from the difference in distribution of orthologous GPCRs among species. As shown in Fig. 2, a total of 1,220 human, mouse, and vertebrate CPIs include numerous orthologous peptides and receptors with high SS from a single phylum (Vertebrata), whereas invertebrate CPIs include various species-specific peptides and GPCRs with low SS from a wide range of phyla (e.g., Nematoda, Arthropoda, and Mollusca) despite their small number (132 invertebrate CPIs). These features of CPIs are thought to cause the overestimation of the prediction performance by the 5-CV (SI Appendix, Fig. S6) and leave-humans-, mice-, and vertebrates-out methods (Fig. 3A and SI Appendix, Fig. S6).
Also of interest is that self-training using experimentally validated data (CPIs) (Fig. 4) facilitated the identification of additional peptide–GPCR pairs (Fig. 8), and some negative CPIs were also generated (Fig. 5). These results provide evidence that validated data feedback to the PD-incorporated SVM improves the prediction accuracy and then verifies an unprecedented mode of ligand–GPCR interaction; in brief, the SVM has become more “intelligent” by acquiring new knowledge. Novel GPCRs have also been found in other species using next-generation sequencer-based genome or transcriptome analyses (51, 52), whereas the cognate ligands of most of such GPCRs have yet to be identified. In this context, the present study indicates that our PD-incorporated SVM (Fig. 8) can identify numerous peptide–GPCR pairs in various organisms via self-training, leading to the elucidation of molecular mechanisms underlying peptide–GPCR recognition and net evolutionary processes of peptide–GPCR interactions. Overall, these findings highlight the current prediction ability of the PD-incorporated SVM using limited amounts of CPI data and indicate the potential for further prediction system development for novel human peptide–GPCR pairs, including artificial peptidic drug candidates.
In conclusion, we have efficiently and systematically elucidated multiple neuropeptide–GPCR pairs in a phylogenetically critical invertebrate chordate, C. intestinalis Type A, using a combination of machine learning and experimental validation. This study not only contributes to the investigation of molecular mechanisms for various nervous, neuroendocrine, and endocrine systems of Ciona, but also sheds light on the versatility of PD-incorporated SVM in the identification of multiple peptide–receptor pairs.
Materials and Methods
CPI Data.
CPI pairs with peptide ligands were collected from the IUPHAR Database (53) and the UniProt Knowledgebase (UniProtKB) (54). From these databases, we utilized 261, 183, 169, 1, 13, and 10 CPI pairs for humans, mice, rats, opossums, zebrafish, and chickens, respectively. The information about the GPCR and peptide sequences was obtained from the UniProtKB (54). Additionally, we collected data for noninteraction pairs and invertebrate peptide–GPCR interaction pairs from the literature. All of the collected interactions and references are listed in Dataset S1. The 531 human interactions (rows 2–532 in Dataset S1), 310 mouse interactions (rows 533–842 in Dataset S1), 379 vertebrate interactions (rows 843–1,221 in Dataset S1), and 132 invertebrate interactions (rows 1,222–1,353 in Dataset S1) were used for training datasets as positive pairs. To generate the same number of negative pairs, we collected the reported noninteraction pairs and generated the randomly selected negative pairs as previously reported (21, 29). A total of 3 reported noninteraction pairs (rows 1,354–1,356 in Dataset S1) and 528 randomly selected negative pairs for humans, 310 randomly selected negative pairs for mice, 7 reported noninteraction pairs (rows 1,357–1,363 in Dataset S1) and 372 randomly selected negative pairs for vertebrates, and 82 reported noninteraction pairs (rows 1,364–1,445 in Dataset S1) and 50 randomly selected negative pairs for invertebrates were used for training datasets.
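The balancing of positive and negative pairs described above can be illustrated with a minimal sketch. The function name, the toy identifiers, and the fixed random seed are assumptions made for illustration; the sampling procedure actually used follows refs. 21 and 29 and may differ in detail.

```python
import random

def balance_negatives(positives, reported_negatives, peptides, gpcrs, seed=0):
    """Return as many negative pairs as there are positive pairs.

    Reported noninteraction pairs are used first; the remainder is drawn at
    random from peptide-GPCR combinations that appear in neither set.
    """
    rng = random.Random(seed)
    known = set(positives) | set(reported_negatives)
    candidates = [(p, g) for p in peptides for g in gpcrs if (p, g) not in known]
    n_random = len(positives) - len(reported_negatives)
    if n_random < 0 or n_random > len(candidates):
        raise ValueError("cannot balance the dataset with the given pairs")
    return list(reported_negatives) + rng.sample(candidates, n_random)

# Toy data with hypothetical identifiers (not taken from Dataset S1).
peptides = ["pepA", "pepB", "pepC"]
gpcrs = ["R1", "R2", "R3"]
positives = [("pepA", "R1"), ("pepB", "R2"), ("pepC", "R3")]
reported = [("pepA", "R2")]
negatives = balance_negatives(positives, reported, peptides, gpcrs)
```

In the human subset, for example, the same arithmetic gives 3 reported plus 528 random pairs, matching the 531 positive pairs.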
Peptide Kernels.
We constructed the PDs with regular expression-based high-resolution representations, which encode the existence or absence of regular expression-represented 5-aa motifs. The descriptors were calculated in three steps (Fig. 1A). First, we collected the 51 regular expression elements to match amino acids, which consist of 21-bit representations of PROFEAT (32), 3 repeats, N-terminus and C-terminus marks of peptide sequences, and 25 single residues (SI Appendix, Table S2). For example, the regular expression element of [KR] (13th element of SI Appendix, Table S2) matches a single residue of lysine or arginine. In the second step, all of the permutations and combinations of 5 of these 51 regular expression elements were generated. For example, pHW[GASDT]Y matches the peptide sequences possessing pyroglutamic acid, followed by histidine; followed by tryptophan; followed by glycine, alanine, serine, aspartic acid, or threonine; and followed by tyrosine. The expression ^N.Y{1,5} matches the peptide sequences possessing asparagine at the N terminus, followed by any amino acid, and followed by one- to five-length repetitions of tyrosine. Third, the peptide sequences were encoded with bit (0, 1) vectors, which represent each regular expression match (= 1) or nonmatch (= 0). Then, to unify redundant regular expressions, if there was a pair of regular expressions appearing in the same compound set, the regular expression showing the narrower range was removed. The inner products of these bit vector pairs were calculated as the kernels for each peptide pair.
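The encoding and kernel steps can be sketched as follows. This is a toy illustration only: it uses the two patterns quoted above plus one short pattern built from the [KR] element, whereas the actual descriptor set enumerates 5-element combinations of all 51 elements. The function names and the two test sequences are assumptions, and the lowercase "p" denotes pyroglutamic acid as in the example pattern.

```python
import re

# Two patterns quoted in the text and one toy pattern built from [KR].
PATTERNS = ["pHW[GASDT]Y", "^N.Y{1,5}", "[KR][KR]"]
COMPILED = [re.compile(p) for p in PATTERNS]

def encode(peptide):
    """Bit vector: 1 if the pattern matches anywhere in the peptide, else 0."""
    return [1 if rx.search(peptide) else 0 for rx in COMPILED]

def linear_kernel(u, v):
    """Inner product of two bit vectors (number of shared matching patterns)."""
    return sum(a * b for a, b in zip(u, v))

# Toy sequences chosen only to exercise the patterns.
a = encode("pHWSYGLRPG")  # matches pHW[GASDT]Y only
b = encode("NAYYKRL")     # matches ^N.Y{1,5} and [KR][KR]
k_ab = linear_kernel(a, b)
```

Because each position of the vector corresponds to one regular expression, the kernel value counts the motifs that two peptides have in common.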
For comparison with our proposed regular expression-based descriptors, we also calculated the mismatch descriptor. The mismatch kernel is a class of string kernels that compares sequences through their k-mer subsequences while allowing for mutations between the subsequences. Specifically, the mismatch kernel is calculated based on shared occurrences of (k, m)-patterns in the data, where the (k, m)-patterns consist of all k-length subsequences that differ from a fixed k-length sequence pattern by at most m mismatches. The inner products of bit vector pairs were calculated as the mismatch kernels for each peptide pair.
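A minimal sketch of a presence/absence mismatch kernel is given below, assuming the standard 20-letter amino acid alphabet and a brute-force enumeration of neighboring patterns. The function names are illustrative, and published implementations use faster data structures than this direct enumeration.

```python
from itertools import combinations, product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def neighbors(kmer, m):
    """All strings of the same length within Hamming distance <= m of kmer."""
    out = {kmer}
    for d in range(1, m + 1):
        for positions in combinations(range(len(kmer)), d):
            for letters in product(ALPHABET, repeat=d):
                s = list(kmer)
                for pos, letter in zip(positions, letters):
                    s[pos] = letter
                out.add("".join(s))
    return out

def mismatch_features(seq, k, m):
    """Set of k-length patterns matched by seq with at most m mismatches."""
    feats = set()
    for i in range(len(seq) - k + 1):
        feats |= neighbors(seq[i:i + k], m)
    return feats

def mismatch_kernel(x, y, k=5, m=1):
    """Inner product of the two bit vectors, i.e., the number of shared patterns."""
    return len(mismatch_features(x, k, m) & mismatch_features(y, k, m))
```

With k = 5 and m = 0 the kernel counts exact shared 5-mers; raising m to 1 or 2 lets peptides that differ by point substitutions share patterns.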
GPCR Kernels.
TM z-scale descriptors were employed for representations of GPCRs, as previously described (21). Briefly, seven TM sequences were directly substituted with the z-scale vectors that represent five leading principal components obtained from 26 measured and computed physicochemical properties of amino acids. The 935-dimensional descriptors were generated by concatenating 5D vectors (z1−z5) for each of the 187 residues of TMs in GPCRs. The inner products of GPCR descriptor pairs were calculated as GPCR kernels for each GPCR pair.
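The construction of the 935-dimensional descriptor (187 residues × 5 z-scales) can be sketched as below. The numerical table is a placeholder filled with arbitrary values, not the published z-scale scores, and the function names are assumptions; the real z1−z5 values would have to be substituted before any use.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

# PLACEHOLDER values only: these are NOT the published z1-z5 scores.
Z_SCALE = {aa: [0.1 * (i + 1) * (j + 1) for j in range(5)]
           for i, aa in enumerate(ALPHABET)}

N_TM_RESIDUES = 187  # aligned residues across the seven TM segments

def tm_descriptor(tm_residues):
    """Concatenate the 5D vector of each TM residue into one flat vector."""
    if len(tm_residues) != N_TM_RESIDUES:
        raise ValueError("expected %d aligned TM residues" % N_TM_RESIDUES)
    vec = []
    for aa in tm_residues:
        vec.extend(Z_SCALE[aa])
    return vec

def gpcr_kernel(desc_a, desc_b):
    """Inner product of two GPCR descriptors."""
    return sum(a * b for a, b in zip(desc_a, desc_b))

# Toy input: a string of 187 residues (not a real receptor sequence).
toy = tm_descriptor("ACDEFGHIKLMNPQRSTVWY" * 9 + "ACDEFGH")
```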
Similarity Scores.
The SSs of the GPCRs and peptides were defined as Tanimoto coefficients (33) of their top 1% most similar GPCRs and peptides, respectively, as described in our previous study (21). For calculation of Tanimoto coefficients, we utilized TM z-scale descriptors and regular expression-based descriptors for GPCRs and peptides, respectively.
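As a hedged illustration, the Tanimoto coefficient can be written in its general vector form, which reduces to the familiar shared-bits ratio for binary descriptors. Averaging over the top 1% most similar entries is one possible reading of the definition above and is an assumption of this sketch, as are the function names.

```python
import math

def tanimoto(x, y):
    """Tanimoto coefficient x.y / (|x|^2 + |y|^2 - x.y)."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return dot / denom if denom else 0.0

def similarity_score(query, references, top_fraction=0.01):
    """Mean Tanimoto coefficient over the most similar fraction of references
    (assumed aggregation; at least one reference is always used)."""
    scores = sorted((tanimoto(query, r) for r in references), reverse=True)
    n_top = max(1, math.ceil(top_fraction * len(scores)))
    return sum(scores[:n_top]) / n_top
```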
CPI Pair Kernels and SVM Prediction.
We utilized kernel methods to incorporate CPI data into SVMs (55) for constructing prediction models, as previously described (21). Here, kernels for CPI pairs were represented as the products of linear kernels for PDs and GPCR descriptors. Parameters of the SVM regularization were optimized using a grid search. All of the training and test CPIs are included in Dataset S1.
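The product form of the pair kernel can be sketched in a few lines. The function names and toy vectors are illustrative assumptions; the resulting Gram matrix is the kind of input accepted by SVM implementations that support precomputed kernels.

```python
def dot(u, v):
    """Linear kernel (inner product) of two descriptor vectors."""
    return sum(a * b for a, b in zip(u, v))

def pair_kernel(pair_a, pair_b):
    """Kernel between two (peptide descriptor, GPCR descriptor) pairs:
    the product of the peptide kernel and the GPCR kernel."""
    (pep_a, gpcr_a), (pep_b, gpcr_b) = pair_a, pair_b
    return dot(pep_a, pep_b) * dot(gpcr_a, gpcr_b)

def gram_matrix(pairs):
    """Symmetric matrix of pair-kernel values for a list of CPI pairs."""
    return [[pair_kernel(a, b) for b in pairs] for a in pairs]

# Toy descriptors (arbitrary values, far shorter than the real vectors).
pairs = [([1, 0, 1], [0.5, -1.0]),
         ([1, 1, 0], [2.0, 0.0]),
         ([0, 1, 1], [-1.0, 1.0])]
gram = gram_matrix(pairs)
```

Multiplying the two linear kernels is equivalent to a linear kernel on the outer product of the peptide and GPCR descriptors, so every peptide feature is paired with every receptor feature without building that large vector explicitly.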
Performance Evaluation.
The prediction performance of our proposed model was evaluated using LOSO internal validation, as previously reported (21, 26, 28). In the present LOSO validation, the CPIs and noninteraction pairs partitioned into human data, mouse data, vertebrate data, and invertebrate data were predicted using models containing the other CPIs and noninteraction pairs. For example, for the leave-humans-out validation, mouse, vertebrate, and invertebrate CPIs (rows 533–1,353 in Dataset S1) and noninteraction pairs were used for SVM training, and prediction performances (ACC and AUC) were calculated using the prediction results for human CPIs and noninteraction pairs. The performance of the internal validation was measured by ACC = (TP+TN)/(TP+TN+FP+FN), where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. To further confirm the prediction performance of CPIs using LOSO analysis, we also measured the performance of internal validation using the AUC (56), which is an index independent of the decision threshold of the prediction model and class probability distributions of predicted data. SEMs of ACCs and AUCs were estimated by five repeated experiments with independently generated negative data. Differences between AUCs and ACCs were evaluated using a Student’s t test as appropriate, with P < 0.05 considered as significant.
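The evaluation quantities above can be sketched as follows. The ACC function follows the formula given in the text; the AUC is computed here in its rank-based form (the probability that a positive pair scores higher than a negative pair, with ties counted as one half), and the leave-one-group-out splitter is a simplified stand-in for the LOSO partitioning. All function names and toy inputs are assumptions.

```python
def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

def auc(labels, scores):
    """Threshold-independent AUC from labels (1 or 0) and decision scores."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def leave_one_group_out(records):
    """Yield (held_out_group, training_items, test_items) for each group.

    records is a list of (group, item) tuples, e.g., group = "human".
    """
    for held in sorted({g for g, _ in records}):
        train = [x for g, x in records if g != held]
        test = [x for g, x in records if g == held]
        yield held, train, test

# Toy records with hypothetical item labels.
records = [("human", "h1"), ("mouse", "m1"), ("invertebrate", "i1"), ("human", "h2")]
splits = {g: (tr, te) for g, tr, te in leave_one_group_out(records)}
```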
Peptide Synthesis.
The peptide sequences we utilized are listed in SI Appendix, Table S1. All peptides were synthesized using an ABI 430A solid-phase peptide synthesizer (Applied Biosystems) and the FastMoc method, according to the manufacturer’s instructions.
Gαq16-Fused C. intestinalis GPCRs.
Each GPCR ID in the Ghost database (35) was indicated by abbreviated IDs without splicing variant information. The full-length IDs are listed in SI Appendix, Table S4. C. intestinalis putative full-length GPCRs—KH.C3.660, KH.C9.683, KH.C4.122, KH.C2.1132, KH.C2.1037, KH.C2.878, KH.C2.212, KH.C1.745, and KH.C8.781—were cloned from the central nervous system and were C-terminally fused with human Gαq16 protein, which was coupled with GPCRs and triggered intracellular calcium mobilization upon binding of a specific ligand (57). The human Gαq16 ORF clone (OriGene) was amplified (SI Appendix, Table S7) and ligated into the XbaI site of a pFastbacI plasmid (Invitrogen). Then, KH.C3.660, KH.C9.683, KH.C4.122, KH.C2.1132, KH.C2.1037, KH.C2.878, KH.C2.212, KH.C1.745, and KH.C8.781 were cloned into the NotI/XbaI site of the Gαq16-ligated pFastbacI plasmids, respectively. Transformation of competent cells with the Ciona GPCR-Gαq16-pFastbacI plasmid and the resulting bacmid isolation was performed according to the manufacturer’s instructions for the Bac-to-Bac system (Thermo Fisher Scientific).
Calcium Accumulation Assay.
Sf9 cells (Thermo Fisher Scientific) were grown in Sf900 II (Thermo Fisher Scientific) containing 10% FBS (Sigma) at 28 °C. Ciona GPCR-Gαq16-recombinant baculoviruses were generated in Sf9 cells transfected with the above bacmids using Cellfectin II, titrated, isolated, and transiently transfected into Sf9 cells using the Bac-to-Bac system according to the manufacturer’s instructions (Thermo Fisher Scientific). Forty-eight hours after transfection, Sf9 cells were loaded for 30 min with 2.5 μM of Fluo-8 AM (AAT Bioquest) diluted in loading buffer [HBSS supplemented with 1.25 mM of probenecid and 0.04% (wt/vol) of pluronic F-127]. Expression of each Ciona GPCR-fused human Gαq16 at the cell membrane was confirmed by immunostaining using the anti-Gαq16 antibody (OriGene TA318890). Various concentrations of peptides were administered to Sf9 cells in a FlexStation II-automated apparatus (Molecular Devices). Real-time fluorescent kinetics for Fluo-8 were observed at excitation/emission wavelengths of 490/514 nm. The calcium accumulation data were analyzed using Prism v6 (GraphPad) to fit to a sigmoidal concentration-response curve, and the means ± SEMs of EC50 were calculated.
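For orientation, a sigmoidal concentration-response curve is commonly written as a four-parameter logistic function. The sketch below assumes that form; the exact model and constraints used in the Prism analysis are not stated in the text, and the function name, parameter names, and toy values are illustrative.

```python
def four_parameter_logistic(conc, bottom, top, ec50, hill=1.0):
    """Response at agonist concentration conc (same units as ec50, conc > 0).

    The response rises from `bottom` to `top` and reaches the midpoint
    when conc equals ec50.
    """
    return bottom + (top - bottom) / (1.0 + (ec50 / conc) ** hill)

# Toy parameters: responses in percent, EC50 of 10 nM expressed in molar.
midpoint = four_parameter_logistic(1e-8, 0.0, 100.0, 1e-8)
low = four_parameter_logistic(1e-10, 0.0, 100.0, 1e-8)
high = four_parameter_logistic(1e-6, 0.0, 100.0, 1e-8)
```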
Real-Time PCR.
Total RNA (2 μg) extracted from various tissues of Ciona was reverse-transcribed using SuperScript III (Invitrogen) and oligo(dT)20 primer. Real-time PCR was performed using the CFX96 Real-time System and SsoAdvanced Universal SYBR Green Supermix (Bio-Rad Laboratories). Total volume of reaction mixtures was 20 µL, consisting of 100 ng of template cDNA, 500 nM of each primer, and 10 µL of SYBR Green Master Mix solution. PCR was performed with an initial step at 95 °C for 30 s, followed by 44 cycles at 95 °C for 15 s and at 60 °C for 30 s. A melting-curve analysis was performed to confirm the absence of primer dimers. Ct values for GAPDH and identified GPCR genes were calculated according to the manufacturer’s instructions. The mean ± SEM of GAPDH-normalized ΔCt values were estimated from three replicates. Sequences of the primers used for the real-time PCR are listed in SI Appendix, Table S8.
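The normalization can be sketched as follows, assuming the common convention ΔCt = Ct(target) − Ct(GAPDH) and an SEM computed from the sample standard deviation. The Ct values are invented for illustration and the function names are assumptions.

```python
import math
import statistics

def delta_ct(ct_target, ct_reference):
    """Per-replicate Ct of the target gene minus Ct of the reference gene."""
    return [t - r for t, r in zip(ct_target, ct_reference)]

def mean_sem(values):
    """Mean and standard error of the mean (sample SD / sqrt(n))."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem

# Hypothetical Ct values from three replicates.
dct = delta_ct([25.0, 26.0, 27.0], [18.0, 18.0, 18.0])
mean, sem = mean_sem(dct)
```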
Molecular Phylogenetic Analysis for Ciona GPCRs.
GPCR sequences similar to the identified Ciona GPCR sequences were extracted by ORTHOSCOPE 1.0.1 (58). To implement a BLAST search in ORTHOSCOPE, coding sequences from Ci-GALP-R, Ci-NTLP-2-R, Ci-LF-1-R, and Ci-YFV-1-R were used as queries against gene models of vertebrates (Homo sapiens and Gallus gallus), urochordates (C. intestinalis, Ciona savignyi, B. schlosseri, and O. dioica), cephalochordates (Branchiostoma floridae and Branchiostoma belcheri), echinoderms (Acanthaster planci and Strongylocentrotus purpuratus), hemichordates (Ptychodera flava and Saccoglossus kowalevskii), and protostomes (D. melanogaster, C. elegans, and Lingula anatina). The BLAST hit sequences were screened using an E-value cut-off of <10⁻³, and the top five hits were used for the subsequent phylogenetic analyses. The protein sequences retrieved by the ORTHOSCOPE analyses were aligned using MAFFT (59). Multiple sequence alignments were trimmed by removing poorly aligned regions using TRIMAL 1.2 (60) with the option “gappyout.” Corresponding coding sequences were forced onto the amino acid alignment using PAL2NAL (61) to generate nucleotide alignments for following analyses.
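The hit-screening step can be illustrated with a short sketch. It assumes that hits are available as (identifier, E-value) tuples and that the "top" hits are those with the smallest E-values; ORTHOSCOPE performs this step internally, so the function name and the ranking criterion are assumptions.

```python
def select_hits(hits, evalue_cutoff=1e-3, top_n=5):
    """Keep hits with E-value below the cutoff and return the top_n best.

    hits is a list of (subject_id, evalue) tuples.
    """
    passing = [h for h in hits if h[1] < evalue_cutoff]
    return sorted(passing, key=lambda h: h[1])[:top_n]

# Hypothetical hit list.
hits = [("a", 1e-50), ("b", 0.5), ("c", 1e-10), ("d", 1e-4),
        ("e", 1e-3), ("f", 1e-20), ("g", 1e-7), ("h", 1e-30)]
selected = select_hits(hits)
```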
Gene phylogenetic trees were estimated using ML and NJ methods with the first and second codon positions and bootstrap analyses of genes encoding full-length sequences (for NJ and ML analyses) and TM domains (for ML analysis) of GPCRs based upon 100 replicates. Codon-partitioned ML analyses were performed with RAxML 8.2.12 (62), which invokes a rapid bootstrap analysis and searches for the best-scoring ML tree with the general time-reversible with gamma (GTRGAMMA) (63, 64) model. NJ analyses were conducted using the software package Ape in R using the TN93 model (65) with γ-distributed rate heterogeneity (64). The sequences for ligand-identified GPCRs in this paper are presented in Dataset S3. The molecular phylogenetic trees of full-length sequences (for NJ and ML analyses) and TM domains (for ML analysis) of GPCRs were constructed using the MEGA software (v7) (66). Each schematic of gene trees was constructed by focusing on gene clades consistently supported by the three molecular phylogenetic trees (SI Appendix, Fig. S5) using the ORTHOSCOPE, as previously reported (58).
Supplementary Material
Acknowledgments
We thank Prof. Shigetada Nakanishi for his fruitful comments on the manuscript. Ciona intestinalis was raised and supplied by the National Bio-resource Project of Ciona (MEXT, Japan). This work was supported in part by the Japan Society for the Promotion of Science Grant 16K07430 (to H.S.).
Footnotes
The authors declare no conflict of interest.
This article is a PNAS Direct Submission. T.P.S. is a guest editor invited by the Editorial Board.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1816640116/-/DCSupplemental.
References
- 1. Delsuc F, Brinkmann H, Chourrout D, Philippe H. Tunicates and not cephalochordates are the closest living relatives of vertebrates. Nature. 2006;439:965–968. doi: 10.1038/nature04336.
- 2. Denoeud F, et al. Plasticity of animal genome architecture unmasked by rapid evolution of a pelagic tunicate. Science. 2010;330:1381–1385. doi: 10.1126/science.1194167.
- 3. Satoh N, Rokhsar D, Nishikawa T. Chordate evolution and the three-phylum system. Proc Biol Sci. 2014;281:20141729. doi: 10.1098/rspb.2014.1729.
- 4. Satoh N, Levine M. Surfing with the tunicates into the post-genome era. Genes Dev. 2005;19:2407–2411. doi: 10.1101/gad.1365805.
- 5. Mirabeau O, Joly JS. Molecular evolution of peptidergic signaling systems in bilaterians. Proc Natl Acad Sci USA. 2013;110:E2028–E2037. doi: 10.1073/pnas.1219956110.
- 6. Hewes RS, Taghert PH. Neuropeptides and neuropeptide receptors in the Drosophila melanogaster genome. Genome Res. 2001;11:1126–1142. doi: 10.1101/gr.169901.
- 7. Satake H, Kawada T. Neuropeptides, hormones, and their receptors in ascidians: Emerging model animals. In: Satake H, editor. Invertebrate Neuropeptides and Hormones: Basic Knowledge and Recent Advances. Transworld Research Network; Kerala, India: 2006. pp. 253–276.
- 8. Kawada T, et al. Peptidomic analysis of the central nervous system of the protochordate, Ciona intestinalis: Homologs and prototypes of vertebrate peptides and novel peptides. Endocrinology. 2011;152:2416–2427. doi: 10.1210/en.2010-1348.
- 9. Matsubara S, et al. The significance of Ciona intestinalis as a stem organism in integrative studies of functional evolution of the chordate endocrine, neuroendocrine, and nervous systems. Gen Comp Endocrinol. 2016;227:101–108. doi: 10.1016/j.ygcen.2015.05.010.
- 10. Satake H, et al. Tachykinin and tachykinin receptor of an ascidian, Ciona intestinalis: Evolutionary origin of the vertebrate tachykinin family. J Biol Chem. 2004;279:53798–53805. doi: 10.1074/jbc.M408161200.
- 11. Tello JA, Rivier JE, Sherwood NM. Tunicate gonadotropin-releasing hormone (GnRH) peptides selectively activate Ciona intestinalis GnRH receptors and the green monkey type II GnRH receptor. Endocrinology. 2005;146:4061–4073. doi: 10.1210/en.2004-1558.
- 12. Sekiguchi T, Ogasawara M, Satake H. Molecular and functional characterization of cionin receptors in the ascidian, Ciona intestinalis: The evolutionary origin of the vertebrate cholecystokinin/gastrin family. J Endocrinol. 2012;213:99–106. doi: 10.1530/JOE-11-0410.
- 13. Kawada T, Sekiguchi T, Itoh Y, Ogasawara M, Satake H. Characterization of a novel vasopressin/oxytocin superfamily peptide and its receptor from an ascidian, Ciona intestinalis. Peptides. 2008;29:1672–1678. doi: 10.1016/j.peptides.2008.05.030.
- 14. Kamesh N, Aradhyam GK, Manoj N. The repertoire of G protein-coupled receptors in the sea squirt Ciona intestinalis. BMC Evol Biol. 2008;8:129. doi: 10.1186/1471-2148-8-129.
- 15. Hauser F, Cazzamali G, Williamson M, Blenau W, Grimmelikhuijzen CJ. A review of neurohormone GPCRs present in the fruitfly Drosophila melanogaster and the honey bee Apis mellifera. Prog Neurobiol. 2006;80:1–19. doi: 10.1016/j.pneurobio.2006.07.005.
- 16. Bauknecht P, Jékely G. Large-scale combinatorial deorphanization of platynereis neuropeptide GPCRs. Cell Rep. 2015;12:684–693. doi: 10.1016/j.celrep.2015.06.052.
- 17. Reynolds KA, Katritch V, Abagyan R. Identifying conformational changes of the beta(2) adrenoceptor that enable accurate prediction of ligand/receptor interactions and screening for GPCR modulators. J Comput Aided Mol Des. 2009;23:273–288. doi: 10.1007/s10822-008-9257-9.
- 18. Kobilka BK, Deupi X. Conformational complexity of G-protein-coupled receptors. Trends Pharmacol Sci. 2007;28:397–406. doi: 10.1016/j.tips.2007.06.003.
- 19. Schwartz TW, Frimurer TM, Holst B, Rosenkilde MM, Elling CE. Molecular mechanism of 7TM receptor activation—A global toggle switch model. Annu Rev Pharmacol Toxicol. 2006;46:481–519. doi: 10.1146/annurev.pharmtox.46.120604.141218.
- 20. Huang XP, et al. Allosteric ligands for the pharmacologically dark receptors GPR68 and GPR65. Nature. 2015;527:477–483. doi: 10.1038/nature15699.
- 21. Shiraishi A, Niijima S, Brown JB, Nakatsui M, Okuno Y. Chemical genomics approach for GPCR-ligand interaction prediction and extraction of ligand binding determinants. J Chem Inf Model. 2013;53:1253–1262. doi: 10.1021/ci300515z.
- 22. Klabunde T. Chemogenomic approaches to drug discovery: Similar receptors bind similar ligands. Br J Pharmacol. 2007;152:5–7. doi: 10.1038/sj.bjp.0707308.
- 23. Weill N, Rognan D. Development and validation of a novel protein-ligand fingerprint to mine chemogenomic space: application to G protein-coupled receptors and their ligands. J Chem Inf Model. 2009;49:1049–1062. doi: 10.1021/ci800447g.
- 24. Hamanaka M, et al. CGBVS-DNN: Prediction of compound-protein interactions based on deep learning. Mol Inform. 2017;36. doi: 10.1002/minf.201600045.
- 25. Jacob L, Vert JP. Protein-ligand interaction prediction: An improved chemogenomics approach. Bioinformatics. 2008;24:2149–2156. doi: 10.1093/bioinformatics/btn409.
- 26. Yabuuchi H, et al. Analysis of multiple compound-protein interactions reveals novel bioactive molecules. Mol Syst Biol. 2011;7:472. doi: 10.1038/msb.2011.5.
- 27. Niijima S, Yabuuchi H, Okuno Y. Cross-target view to feature selection: Identification of molecular interaction features in ligand-target space. J Chem Inf Model. 2011;51:15–24. doi: 10.1021/ci1001394.
- 28. Mauri A, Consonni V, Pavan M, Todeschini R. Dragon software: An easy approach to molecular descriptor calculations. Match (Mulh) 2006;56:237–248.
- 29. Rogers D, Hahn M. Extended-connectivity fingerprints. J Chem Inf Model. 2010;50:742–754. doi: 10.1021/ci100050t.
- 30. Leslie CS, Eskin E, Cohen A, Weston J, Noble WS. Mismatch string kernels for discriminative protein classification. Bioinformatics. 2004;20:467–476. doi: 10.1093/bioinformatics/btg431.
- 31. Saigo H, Vert JP, Akutsu T. Optimizing amino acid substitution matrices with a local alignment kernel. BMC Bioinformatics. 2006;7:246. doi: 10.1186/1471-2105-7-246.
- 32. Rao HB, Zhu F, Yang GB, Li ZR, Chen YZ. Update of PROFEAT: A web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res. 2011;39:W385–W390. doi: 10.1093/nar/gkr284.
- 33. Martin EJ, et al. Measuring diversity: Experimental design of combinatorial libraries for drug discovery. J Med Chem. 1995;38:1431–1436. doi: 10.1021/jm00009a003.
- 34. Leslie C, Kuang R. Fast kernels for inexact string matching. In: Schölkopf B, Warmuth M, editors. Proceedings of the 16th Annual Conference on Learning Theory and Kernel Workshop. Springer; Heidelberg, Germany: 2003. pp. 114–128.
- 35. Satou Y, Satoh N. Cataloging transcription factor and major signaling molecule genes for functional genomic studies in Ciona intestinalis. Dev Genes Evol. 2005;215:580–596. doi: 10.1007/s00427-005-0016-9.
- 36. Bissantz C, Logean A, Rognan D. High-throughput modeling of human G-protein coupled receptors: Amino acid sequence alignment, three-dimensional model building, and receptor library screening. J Chem Inf Comput Sci. 2004;44:1162–1176. doi: 10.1021/ci034181a.
- 37. Triguero I, García S, Herrera F. Self-labeled techniques for semi-supervised learning: Taxonomy, software and empirical study. Knowl Inf Syst. 2015;42:245–284.
- 38. Kim DK, et al. Coevolution of the spexin/galanin/kisspeptin family: Spexin activates galanin receptor type II and III. Endocrinology. 2014;155:1864–1873. doi: 10.1210/en.2013-2106.
- 39. Luo R, Jin Z, Deng Y, Strokes N, Piao X. Disease-associated mutations prevent GPR56-collagen III interaction. PLoS One. 2012;7:e29818. doi: 10.1371/journal.pone.0029818.
- 40. Paavola KJ, Sidik H, Zuchero JB, Eckart M, Talbot WS. Type IV collagen is an activating ligand for the adhesion G protein-coupled receptor GPR126. Sci Signal. 2014;7:ra76. doi: 10.1126/scisignal.2005347.
- 41. Boucard AA, Ko J, Südhof TC. High affinity neurexin binding to cell adhesion G-protein-coupled receptor CIRL1/latrophilin-1 produces an intercellular adhesion complex. J Biol Chem. 2012;287:9399–9413. doi: 10.1074/jbc.M111.318659.
- 42. Laschet C, Dupuis N, Hanson J. The G protein-coupled receptors deorphanization landscape. Biochem Pharmacol. 2018;153:62–74. doi: 10.1016/j.bcp.2018.02.016.
- 43. Aoyama M, et al. A novel biological role of tachykinins as an up-regulator of oocyte growth: Identification of an evolutionary origin of tachykininergic functions in the ovary of the ascidian, Ciona intestinalis. Endocrinology. 2008;149:4346–4356. doi: 10.1210/en.2008-0323.
- 44. Kamiya C, et al. Nonreproductive role of gonadotropin-releasing hormone in the control of ascidian metamorphosis. Dev Dyn. 2014;243:1524–1535. doi: 10.1002/dvdy.24176.
- 45. Jékely G. Global view of the evolution and diversity of metazoan neuropeptide signaling. Proc Natl Acad Sci USA. 2013;110:8702–8707. doi: 10.1073/pnas.1221833110.
- 46. Satoh N. Chordate Origins and Evolution: The Molecular Evolutionary Road to Vertebrates. Elsevier; Boston: 2016.
- 47. Gawehn E, Hiss JA, Schneider G. Deep learning in drug discovery. Mol Inform. 2016;35:3–14. doi: 10.1002/minf.201501008.
- 48. Yuriev E, Holien J, Ramsland PA. Improvements, trends, and new ideas in molecular docking: 2012-2013 in review. J Mol Recognit. 2015;28:581–604. doi: 10.1002/jmr.2471.
- 49. König C, Alquézar R, Vellido A, Giraldo J. Systematic analysis of primary sequence domain segments for the discrimination between class C GPCR subtypes. Interdiscip Sci. 2018;10:43–52. doi: 10.1007/s12539-018-0286-3.
- 50. Sagawa T, et al. Logistic regression of ligands of chemotaxis receptors offers clues about their recognition by bacteria. Front Bioeng Biotechnol. 2018;5:88. doi: 10.3389/fbioe.2017.00088.
- 51. Li C, et al. Comparative genomic analysis and evolution of family-B G protein-coupled receptors from six model insect species. Gene. 2013;519:1–12. doi: 10.1016/j.gene.2013.01.061.
- 52. Chen N, et al. Identification of a nematode chemosensory gene family. Proc Natl Acad Sci USA. 2005;102:146–151. doi: 10.1073/pnas.0408307102.
- 53. Southan C, et al. NC-IUPHAR The IUPHAR/BPS guide to PHARMACOLOGY in 2016: Towards curated quantitative interactions between 1300 protein targets and 6000 ligands. Nucleic Acids Res. 2016;44:D1054–D1068. doi: 10.1093/nar/gkv1037.
- 54. UniProt Consortium UniProt: A hub for protein information. Nucleic Acids Res. 2015;43:D204–D212. doi: 10.1093/nar/gku989.
- 55. Vapnik VN. Statistical Learning Theory. John Wiley & Sons; New York: 1998.
- 56. Ling CX, Huang J, Zhang H. AUC: A statistically consistent and more discriminating measure than accuracy. Proc IJCAI. 2003;3:519–524.
- 57. Tabata K, Baba K, Shiraishi A, Ito M, Fujita N. The orphan GPCR GPR87 was deorphanized and shown to be a lysophosphatidic acid receptor. Biochem Biophys Res Commun. 2007;363:861–866. doi: 10.1016/j.bbrc.2007.09.063.
- 58. Inoue J, Satoh N. ORTHOSCOPE: An automatic web tool for phylogenetically inferring bilaterian orthogroups with user-selected taxa. Mol Biol Evol. 2019;36:621–631. doi: 10.1093/molbev/msy226.
- 59. Katoh K, Kuma K, Toh H, Miyata T. MAFFT version 5: Improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005;33:511–518. doi: 10.1093/nar/gki198.
- 60. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: A tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25:1972–1973. doi: 10.1093/bioinformatics/btp348.
- 61. Suyama M, Torrents D, Bork P. PAL2NAL: Robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 2006;34:W609–W612. doi: 10.1093/nar/gkl315.
- 62. Stamatakis A. RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30:1312–1313. doi: 10.1093/bioinformatics/btu033.
- 63. Yang Z. Estimating the pattern of nucleotide substitution. J Mol Evol. 1994;39:105–111. doi: 10.1007/BF00178256.
- 64. Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J Mol Evol. 1994;39:306–314. doi: 10.1007/BF00160154.
- 65. Tamura K, Nei M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol. 1993;10:512–526. doi: 10.1093/oxfordjournals.molbev.a040023.
- 66. Kumar S, Stecher G, Tamura K. MEGA7: Molecular evolutionary genetics analysis version 7.0 for bigger datasets. Mol Biol Evol. 2016;33:1870–1874. doi: 10.1093/molbev/msw054.