Properties of Protein Drug Target Classes

Simon C Bull; Andrew J Doig

doi:10.1371/journal.pone.0117955

PLoS One. 2015 Mar 30;10(3):e0117955.

Editor: Yoshihiro Yamanishi

PMCID: PMC4379170 PMID: 25822509

Abstract

Accurate identification of drug targets is a crucial part of any drug development program. We mined the human proteome to discover properties of proteins that may be important in determining their suitability for pharmaceutical modulation. Data was gathered concerning each protein’s sequence, post-translational modifications, secondary structure, germline variants, expression profile and drug target status. The data was then analysed to determine features for which the target and non-target proteins had significantly different values. This analysis was repeated for subsets of the proteome consisting of all G-protein coupled receptors, ion channels, kinases and proteases, as well as proteins that are implicated in cancer. Machine learning was used to quantify the proteins in each dataset in terms of their potential to serve as a drug target. This was accomplished by first inducing a random forest that could distinguish between its targets and non-targets, and then using the random forest to quantify the drug target likeness of the non-targets. The properties that can best differentiate targets from non-targets were primarily those that are directly related to a protein’s sequence (e.g. secondary structure). Germline variants, expression levels and interactions between proteins had minimal discriminative power. Overall, the best indicators of drug target likeness were found to be the proteins’ hydrophobicities, in vivo half-lives, propensity for being membrane bound and the fraction of non-polar amino acids in their sequences. In terms of predicting potential targets, datasets of proteases, ion channels and cancer proteins were able to induce random forests that were highly capable of distinguishing between targets and non-targets. The non-target proteins predicted to be targets by these random forests comprise the set of the most suitable potential future drug targets, and should therefore be prioritised when building a drug development programme.

Introduction

The vast majority of the targets of approved drugs are proteins [1,2]. Knowledge of which proteins are the targets of approved drugs enables the division of the human proteome into two classes: approved drug targets and non-targets. A protein is an approved drug target if it is the target of an approved drug, and a non-target otherwise.

In order for a protein to have any potential as a drug target it must be druggable. A druggable protein is one that possesses folds that favour interactions with small drug-like molecules, be they endogenous or extraneous, and therefore is one that contains a binding site [1,3]. These binding sites are expected to have certain characteristics that enable high affinity site-specific binding by the drug-like molecule. As with all drug targets, a potential protein drug target must be linked to a disease process.

Currently there is a lack of knowledge about both the number of proteins that modern pharmaceuticals act on and the number of potentially druggable proteins. Drews proposed one of the first counts of the number of human protein targets, and determined that there were only 417 protein drug targets (excluding anti-infectives acting on bacteria, viruses or parasites) [4]. More recent estimates for the number of protein drug targets have included 218 [5]; a consensus number of 324 [6]; 399, reduced to 120 when only approved drug targets are considered [1], and 435 [7]. In terms of potential drug targets, an analysis by Russ and Lampel [8] identified between 2000 and 3000 proteins that are druggable. Using a purely bioinformatics approach, Bakheet and Doig were able to identify 668 proteins that are not currently approved drug targets, but that have target-like properties [9]. These latter estimates lend credence to the belief that, although the estimate of the number of currently targeted proteins is in the hundreds, the number of proteins that are druggable is substantially larger [5].

While knowledge of the number of proteins that may be amenable to pharmaceutical modulation is valuable, it is also useful to consider the families to which these proteins belong. Rask-Andersen et al. found that G-protein-coupled receptors (GPCRs) make up 44% of human drug targets, enzymes 29% and transporter proteins 15% [7]; Overington et al. found that over 50% of drugs target GPCRs, nuclear receptors or ion channels [6]; Hopkins and Groom found that enzymes comprise 47% of launched targets, while GPCRs account for 30% [1]; and Zheng et al. found that enzymes make up 50% of approved targets [10]. One very evident trend in these findings is the prominence of enzymes and GPCRs in the set of approved drug target proteins. Using the estimate of Fredriksson et al. that there are approximately 800 GPCRs coded for by the human genome [11], and the knowledge that there are just over 20,000 human proteins [12], we can estimate that roughly 4% of human proteins are GPCRs (800/20,000). The fraction of GPCRs in the set of approved drug targets is therefore vastly greater than would be expected if the set’s composition were proportional to that of the human proteome. Potential reasons for this discrepancy include the frequency with which proteins from specific families, such as GPCRs and ion channels, are involved in human diseases; the nature of the diseases that affect developed countries; and the potential difficulty of identifying and exploiting other families of proteins.

In this paper, we investigate properties of major types of drug target proteins, in order to identify rules to predict novel future targets. Target classes were selected based on their sizes and importance.

Target Types Investigated

Antineoplastic

Targeted cancer therapies seek to modulate the activity of specific molecular targets that are believed to have a critical role in tumour growth and/or cancer progression. While these targets may be present in non-cancerous cells, they are often overexpressed or altered in cancerous cells, thereby giving targeted therapies increased selectivity and reduced toxicity over conventional cytotoxic treatments [13,14]. By targeting specific proteins, rather than indiscriminately killing proliferating cells, targeted therapies can be used to interfere with specific aspects of cancer progression. For example, the immortalisation of cancer cells could be attacked via the targeting of telomerase, as it is both specific to cancerous cells and necessary for their survival [15]; molecular alterations that deregulate growth can be corrected, as with Imatinib’s targeting of the BCR-ABL protein in chronic myelogenous leukaemia; or the tumour’s blood supply can be cut off by preventing angiogenesis, as done by the drug Bevacizumab’s inhibition of vascular endothelial growth factor A [16]. Due to their importance in modulating growth factors, tyrosine kinases are an especially useful group of targets [17], with drugs such as Imatinib, Gefitinib, Erlotinib and Sunitinib targeting them. Other important targets include growth factors and proteasomes, inhibition of which can potentially slow a tumour’s proliferation by inhibiting growth/angiogenesis or increasing apoptosis, respectively.

GPCRs

The prominent role that GPCRs play in many physiological processes means that GPCRs make up a large fraction of the targets of approved drugs [18,19]. One approach to modulating the activity of a GPCR pharmacologically is to develop a drug that competes with the receptor’s endogenous ligand for access to its orthosteric site. However, in order for a drug to effectively modulate a GPCR’s activity in this manner it must out-compete its endogenous ligand, which necessitates that the drug have a high affinity for the specific GPCR and be maintained at a sufficiently high concentration [20]. Alternatively, a drug can modulate the GPCR’s activity allosterically by binding to a location topographically distinct from the endogenous ligand’s binding site. These allosteric modulators can benefit not only from the increased selectivity due to the often less conserved nature of their binding sites, but also from the fact that the endogenous ligand can still bind to the orthosteric site [21,22].

Ion Channels

Ion channels are popular targets for pharmacological intervention due to their key roles in human physiology, localisation in the membrane and pattern of distribution throughout the body [23,24]. The drugs that target them alter their permeability by changing the probability that the channel will be in a given state, often by preferentially binding to and stabilising a particular channel conformation [25]. Pharmacological modulation of ion channels is generally achieved by interacting with the channel’s pore or altering its gating [26]. Pore modulators are primarily inhibitors that exert their effect by binding to the pore and physically or electrostatically blocking the flow of ions [26,27], predominantly by occluding the pore or stabilising a closed or inactive state of the channel. Gating modulators bind to the channel and change the kinetics of the gating process [26]. They are therefore allosteric in nature, and can be designed to modulate the normal conductance of a channel, either positively or negatively, or to exert their effect independently of the channel’s gating stimulus [28].

Kinases

Kinase activity plays a key role in many cellular processes, such as cell cycle progression, apoptosis, differentiation and signal transduction [29]. Eukaryotic protein kinases are related by a homologous catalytic domain of approximately 250–300 amino acids [30] and can be grouped into the serine/threonine and tyrosine kinases, which are responsible for phosphorylating the hydroxyl oxygen of their respective amino acids. Due to the pivotal role of kinases in the regulation of many cellular processes, aberrant kinase activity has been associated with a variety of diseases and the majority of human cancers [31]. Pharmacological interventions targeting kinases have historically been focussed on the inhibition of malfunctioning kinases, and therefore on preventing irregular kinase activity rather than promoting or enhancing normal activity [31]. These inhibitors can be classified based on the state of the kinase they target (active or inactive) and whether they bind to the active site, an allosteric site or both. The majority of kinase inhibitors developed to date compete directly with ATP for its binding pocket [32,33]. Type I inhibitors rely on the availability of a kinase’s active site, and therefore its active state, while type II inhibitors target the inactive form of the kinase, which can display more structural variation as it is not constrained by the need to catalyse the phosphorylation reaction [34,35].

Proteases

Eukaryotic proteases can be divided into ones that perform non-covalent (aspartic and metallo proteases) or covalent (cysteine, serine and threonine proteases) catalysis. Commensurate with their biological importance, deficient or abnormal protease function is present in many pathological conditions. Pharmacological modulation of their activity is therefore a potentially important therapeutic option for treating disease, with an estimated 5–10% of all drugs under development targeting proteases [36]. The therapeutic modulation of protease activity is generally achieved using small molecule reversible or irreversible inhibitors, with the most common approach being to develop a drug that mimics the structure of a protease’s substrate and competes with it for the protease’s active site [37]. Although non-competitive inhibition of protease activity is possible in principle, no non-competitive inhibitors have been approved for sale nor reached the advanced stages of development [38].

Methods

Cleaning and Collation of Protein Data

Protein Accession and Name

The UniProt accessions and name of each human protein were extracted from an XML file containing all reviewed human proteins from UniProt release 2012_05, hereafter referred to as the UniProt XML file. For each protein <entry> element in the file, the accessions were extracted from its <accession> child elements, and the protein’s name from its <name> child element. The first <accession> element encountered in the record for a protein was taken to be the protein’s representative accession. A mapping between non-representative and representative accessions was produced to enable cross referencing with external databases that may use non-representative accessions. Complete lists of proteins in each set are in S1 Supplementary Information.
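The accession handling described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the toy XML, its accessions and the function name `collect_accessions` are hypothetical, and the toy document omits the XML namespace that real UniProt files declare.

```python
import xml.etree.ElementTree as ET

# Toy stand-in for the UniProt XML file (hypothetical accessions, no namespace).
TOY_XML = """
<uniprot>
  <entry>
    <accession>P00001</accession>
    <accession>Q00001</accession>
    <name>PROT1_HUMAN</name>
  </entry>
  <entry>
    <accession>P00002</accession>
    <name>PROT2_HUMAN</name>
  </entry>
</uniprot>
"""


def collect_accessions(xml_text):
    """Return (names, alias_map).

    names maps each representative accession to the protein name;
    alias_map maps every accession to its representative accession.
    """
    root = ET.fromstring(xml_text)
    names, alias_map = {}, {}
    for entry in root.findall("entry"):
        accessions = [acc.text for acc in entry.findall("accession")]
        representative = accessions[0]  # first <accession> is representative
        names[representative] = entry.findtext("name")
        for acc in accessions:
            alias_map[acc] = representative
    return names, alias_map


names, alias_map = collect_accessions(TOY_XML)
```

The alias map is what allows an identifier from an external database that uses a non-representative accession to be resolved to the representative one.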

Simple Sequence Properties

Each protein’s sequence was extracted from the <sequence> child element of its <entry> element in the UniProt XML file. Following the extraction of the sequence, its length was determined by counting the number of amino acid residues in it. Information about the presence or absence of a signal peptide was extracted from the <feature> child elements of a protein’s <entry> element in the UniProt XML file. Any protein with a <feature> element where the value of the type attribute was "signal peptide" was deemed to contain a signal peptide.
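As an illustrative sketch of these two features, the snippet below derives the sequence length and a binary signal peptide flag from a single toy `<entry>` element. The toy entry and the function name `sequence_features` are assumptions made for the example, and the XML namespace used by real UniProt files is omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry; real UniProt entries carry many more child elements.
TOY_ENTRY = """
<entry>
  <feature type="signal peptide"/>
  <feature type="chain"/>
  <sequence>MKTLLVAGA</sequence>
</entry>
"""


def sequence_features(entry):
    """Return the sequence length and a 0/1 signal peptide indicator."""
    # Remove any whitespace or line breaks stored inside the <sequence> text.
    sequence = "".join(entry.findtext("sequence").split())
    has_signal = any(
        feature.get("type") == "signal peptide"
        for feature in entry.findall("feature")
    )
    return {"length": len(sequence), "signal_peptide": int(has_signal)}


features = sequence_features(ET.fromstring(TOY_ENTRY))
```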

The number of PEST motifs in each protein was calculated using epestfind (http://emboss.bioinformatics.nl/cgi-bin/emboss/epestfind) which returns potential, poor and invalid PEST motifs. Only potential PEST motifs were counted. The number of PEST motifs returned by epestfind was summed to get the total number of PEST motifs for the protein. The program was run with the default parameters.

The number of low complexity regions was calculated using segmasker [39]. The number of low complexity intervals returned by segmasker was summed to get the total number of low complexity regions for a protein. The program was run with the default parameters.

The hydrophobicity of a protein was calculated to be the mean of the hydrophobicity values, as determined by the Kyte and Doolittle index [40], of the amino acids in its sequence. This was calculated by summing the hydrophobicity values of all the amino acids in the sequence, and then dividing by the sequence length.
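The calculation amounts to an average of per-residue values, as in the following sketch. The table holds the Kyte and Doolittle hydropathy values for the twenty standard amino acids; the function name is hypothetical, and how non-standard residues were treated is not stated in the text, so this sketch simply raises an error for them.

```python
# Kyte-Doolittle hydropathy index for the twenty standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydrophobicity(sequence):
    """Sum the per-residue hydropathy values and divide by the length.

    Raises KeyError for residues outside the twenty standard amino acids
    (an assumption of this sketch, not a documented behaviour).
    """
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)
```

Positive means indicate predominantly hydrophobic sequences, negative means predominantly hydrophilic ones.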

The isoelectric point of each protein was calculated using the pepstats program (http://emboss.sourceforge.net/apps/cvs/emboss/apps/pepstats.html). The program was run using the -auto parameter.

Amino Acid Composition

Following the extraction of the sequence, the number of occurrences of each of the twenty standard amino acids in the sequence was determined. Ambiguous amino acid codes (B, J and Z) were handled by incrementing the occurrence count for each of their corresponding amino acids (D/N for B, I/L for J and E/Q for Z) by 0.5. From these occurrence counts, the frequency with which each amino acid occurs in the protein’s sequence was determined by dividing the count for the amino acid by the sequence length. Amino acids were also grouped into eight categories: tiny (A, C, G, S and T), small (A, C, D, G, N, P, S, T and V), aliphatic (I, L and V), aromatic (F, H, W and Y), non-polar (A, C, F, G, I, L, M, P, V, W and Y), charged (D, E, H, K and R), positively charged (H, K and R) and negatively charged (D and E). For each protein, the fraction of the amino acids in its sequence that belong to each of the categories was calculated. This was done by summing the occurrence counts for each of the amino acids in the category, and then dividing by the length of the sequence.
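A minimal sketch of the composition features follows, assuming the standard IUPAC meanings of the ambiguity codes (B is D or N, J is I or L, Z is E or Q). The function and variable names are hypothetical, and residues that are neither standard nor one of these three codes are ignored here, which is an assumption of the sketch.

```python
from collections import Counter

STANDARD = "ACDEFGHIKLMNPQRSTVWY"
# Each ambiguous code contributes 0.5 to both of its candidate residues.
AMBIGUOUS = {"B": "DN", "J": "IL", "Z": "EQ"}
CATEGORIES = {
    "tiny": "ACGST",
    "small": "ACDGNPSTV",
    "aliphatic": "ILV",
    "aromatic": "FHWY",
    "non_polar": "ACFGILMPVWY",
    "charged": "DEHKR",
    "positively_charged": "HKR",
    "negatively_charged": "DE",
}


def composition(sequence):
    """Return (residue frequencies, category fractions) for a sequence."""
    counts = dict.fromkeys(STANDARD, 0.0)
    for residue, n in Counter(sequence).items():
        if residue in counts:
            counts[residue] += n
        elif residue in AMBIGUOUS:
            for candidate in AMBIGUOUS[residue]:
                counts[candidate] += 0.5 * n
    length = len(sequence)
    frequencies = {aa: count / length for aa, count in counts.items()}
    fractions = {
        name: sum(counts[aa] for aa in members) / length
        for name, members in CATEGORIES.items()
    }
    return frequencies, fractions


frequencies, fractions = composition("ADBZ")
```

For the toy sequence `ADBZ`, the B adds half a count each to D and N, and the Z adds half a count each to E and Q, so D ends with 1.5 of the 4 positions. Because the categories overlap (for example, A is both tiny and small), the category fractions do not sum to one.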

Protein Family

Proteins were classified as being a GPCR, ion channel, kinase, protease or other. Protein family membership was determined using multiple UniProt sources. The first source was the <keyword> child elements of each protein’s <entry> element in the UniProt XML file. A protein was determined to be a GPCR if the value of the id attribute of a <keyword> element was "KW-0297"; an ion channel if the value was one of "KW-1071", "KW-0851", "KW-0107", "KW-0869", "KW-0407", "KW-0631" or "KW-0894"; a kinase if the value was one of "KW-0418", "KW-0723" or "KW-0829"; and a protease if the value was one of "KW-0031", "KW-0064", "KW-0121", "KW-0224", "KW-0482", "KW-0645", "KW-0720", "KW-0788" or "KW-0888". A protein was also determined to be a GPCR, kinase or protease if it appeared in the GPCR (http://www.uniprot.org/docs/7tmrlist accessed May 14th 2012), kinase (http://www.uniprot.org/docs/pkinfam accessed May 14th 2012) or protease (http://www.uniprot.org/docs/peptidas accessed May 14th 2012) files respectively.
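The keyword-based part of this classification can be sketched as a lookup against the keyword identifiers listed above. The function name `families` and the `list_memberships` argument (standing in for membership of the GPCR, kinase or protease list files) are hypothetical. Returning a set, so that a protein may belong to more than one family, is an assumption of this sketch rather than something the text specifies.

```python
FAMILY_KEYWORDS = {
    "GPCR": {"KW-0297"},
    "IonChannel": {"KW-1071", "KW-0851", "KW-0107", "KW-0869",
                   "KW-0407", "KW-0631", "KW-0894"},
    "Kinase": {"KW-0418", "KW-0723", "KW-0829"},
    "Protease": {"KW-0031", "KW-0064", "KW-0121", "KW-0224", "KW-0482",
                 "KW-0645", "KW-0720", "KW-0788", "KW-0888"},
}


def families(keyword_ids, list_memberships=()):
    """Return the set of families a protein belongs to, or {"Other"}.

    keyword_ids: id attributes of the protein's <keyword> elements.
    list_memberships: families assigned via the separate UniProt list files.
    """
    keyword_ids = set(keyword_ids)
    found = {
        family
        for family, ids in FAMILY_KEYWORDS.items()
        if ids & keyword_ids
    }
    found |= set(list_memberships)
    return found or {"Other"}
```

For example, `families(["KW-0297"])` yields `{"GPCR"}`, while a protein with no matching keyword and no list membership falls into `{"Other"}`.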

For the purposes of this work, a cancer protein is one that is implicated in causing cancer or is the target of an antineoplastic drug. Cancer proteins were determined using two sources: the Cancer Gene Census (CGC) [41] and the FDA’s database of approved drugs. The CGC dataset (accessed on June 15th 2012) was parsed in order to determine the NCBI Gene IDs of genes that are causally implicated in cancer. These were then mapped to representative UniProt human protein accessions.

The FDA’s Drugs@FDA database was downloaded (http://www.fda.gov/downloads/Drugs/InformationOnDrugs/UCM054599.zip accessed April 2013), and processed to determine the set of approved antineoplastic drugs. All drugs approved by the FDA through March 2013 were manually evaluated for evidence of being indicated for antineoplastic use. For each drug, the approved indications for it were determined based on the label data stored by the FDA, or using DrugBank [42] and the Therapeutic Target Database (TTD) [43] if no label data was available. Drugs approved for supportive care (e.g. antiemetics and analgesics), adjunct treatment or non-cancerous cellular proliferation (e.g. actinic keratosis) were excluded from the list, while those approved for precancerous conditions (e.g. myelodysplastic syndrome) were included. Once the final set of approved antineoplastic drugs was created, the DrugBank and TTD Drug IDs of the drugs were determined. The targets of these drugs, as recorded by DrugBank and the TTD, were then determined and converted to representative UniProt accessions.

Posttranslational Modifications

Information about the glycosylation and phosphorylation sites of a protein was extracted from the <feature> child elements of the protein’s <entry> element in the UniProt XML file. Information about a glycosylation site was extracted from a <feature> element when the value of its type attribute was "glycosylation site". The element’s description attribute was used to determine whether the glycosylation was N-linked or O-linked. Information about a phosphorylation site on the protein was extracted from a <feature> element when the value of its type attribute was "modified residue". The element’s description attribute was used to determine whether a serine, threonine or tyrosine was phosphorylated. For each protein, the number of each of the five types of posttranslational modification site (O-glycosylation, N-glycosylation, phosphoserine, phosphothreonine and phosphotyrosine) was calculated. The data on phosphorylation sites extracted from UniProt in this manner was also used to calculate the total number of phosphorylation sites, of any type, for each protein.
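The counting of modification sites might look like the sketch below. It assumes that the description attribute begins with "N-linked" or "O-linked" for glycosylation sites and with "Phosphoserine", "Phosphothreonine" or "Phosphotyrosine" for phosphorylation sites; the toy entry, those prefixes and the function name are assumptions made for illustration, and the XML namespace of real UniProt files is omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry with a mix of feature types.
TOY_ENTRY = """
<entry>
  <feature type="glycosylation site" description="N-linked (GlcNAc...)"/>
  <feature type="glycosylation site" description="O-linked (GalNAc...)"/>
  <feature type="modified residue" description="Phosphoserine"/>
  <feature type="modified residue" description="Phosphoserine; by PKA"/>
  <feature type="modified residue" description="Phosphotyrosine"/>
  <feature type="modified residue" description="N6-acetyllysine"/>
</entry>
"""

PHOSPHO_TYPES = ("Phosphoserine", "Phosphothreonine", "Phosphotyrosine")


def count_ptm_sites(entry):
    """Count the five site types plus the total number of phosphosites."""
    counts = {"N-glycosylation": 0, "O-glycosylation": 0}
    counts.update({name: 0 for name in PHOSPHO_TYPES})
    for feature in entry.findall("feature"):
        kind = feature.get("type")
        description = feature.get("description") or ""
        if kind == "glycosylation site":
            if description.startswith("N-linked"):
                counts["N-glycosylation"] += 1
            elif description.startswith("O-linked"):
                counts["O-glycosylation"] += 1
        elif kind == "modified residue":
            for name in PHOSPHO_TYPES:
                if description.startswith(name):
                    counts[name] += 1
    counts["total_phosphorylation"] = sum(counts[n] for n in PHOSPHO_TYPES)
    return counts


ptm_counts = count_ptm_sites(ET.fromstring(TOY_ENTRY))
```

Modified residues that are not phosphorylation sites, such as the acetyllysine in the toy entry, are left out of every count.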

Secondary Structure

NetSurfP [44] was used to predict the fraction of residues in each protein that participate in exposed α-helices, buried α-helices or β-strands. Although accurate secondary structure information could be obtained from crystal structures, this information is unavailable for the majority of proteins.

Information about the number of α-helical transmembrane regions of each protein was extracted from the <feature> child elements of the protein’s <entry> element in the UniProt XML file. A helical transmembrane region is recorded in a <feature> element when its type attribute is "transmembrane region" and the description attribute is present and contains 'Helical' (without quotes) as its first characters.
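The rule for recognising helical transmembrane regions translates directly into a filter over `<feature>` elements. The following sketch uses a hypothetical toy entry and function name, and omits the XML namespace of real UniProt files.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry: only the first two features satisfy both conditions.
TOY_ENTRY = """
<entry>
  <feature type="transmembrane region" description="Helical"/>
  <feature type="transmembrane region" description="Helical; Name=2"/>
  <feature type="transmembrane region" description="Beta stranded"/>
  <feature type="transmembrane region"/>
  <feature type="topological domain" description="Helical"/>
</entry>
"""


def count_tm_helices(entry):
    """Count features of type 'transmembrane region' whose description
    attribute is present and starts with 'Helical'."""
    return sum(
        1
        for feature in entry.findall("feature")
        if feature.get("type") == "transmembrane region"
        and (feature.get("description") or "").startswith("Helical")
    )


n_helices = count_tm_helices(ET.fromstring(TOY_ENTRY))
```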

Protein-Protein Interactions

The protein-protein interaction (PPI) information for a protein was extracted from the <comment> child elements of the protein’s <entry> element when the value of the type attribute was "interaction". PPIs recorded in UniProt can be binary or unary, and can record interactions between human and non-human proteins. For each protein, the number of unique human proteins that participate in a binary interaction with the protein was calculated.

External Database References

Cross-references between UniProt accessions and external database identifiers were extracted from Ensembl [45] using an automated BioMart [46] XML query. The query returned the NCBI Gene IDs, Ensembl Gene IDs, Ensembl Transcript IDs, Ensembl Peptide IDs and UniGene cluster IDs associated with each representative UniProt human protein accession. Ensembl variant data was taken from the sources listed at http://www.ensembl.org/info/genome/variation/sources_documentation.html#homo_sapiens, followed by quality control to remove unreliable records (http://www.ensembl.org/info/genome/variation/data_description.html#quality_control).

UniGene Expression Clusters

UniGene [47] was used to extract data relating to the expression profile of the human proteome. Individual transcripts in UniGene are grouped into clusters that are believed to come from the same locus. The expression profile of a cluster is then determined by counting the number of expressed sequence tags (ESTs) in it for each of the body sites and developmental stages recorded in UniGene. The external cross-references extracted from UniProt were used to map UniProt accessions to UniGene cluster IDs from UniGene build #232. A protein’s expression in an individual body site or developmental stage was taken to be the sum of the ESTs in that body site or developmental stage across all UniGene clusters cross-referenced with the protein. In addition to the raw expression values, a derived feature was created that records the number of body sites in which the protein is expressed. This feature was calculated for each protein as the number of body sites in which the expression level was not 0.
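The aggregation of EST counts across clusters, and the derived body-site feature, can be illustrated with the sketch below. The cluster IDs, body sites and counts are invented for the example, and the function name is hypothetical.

```python
def expression_profile(cluster_ids, cluster_est_counts):
    """Sum EST counts per body site over all clusters mapped to a protein.

    cluster_ids: UniGene cluster IDs cross-referenced with the protein.
    cluster_est_counts: {cluster ID: {body site: EST count}}.
    Returns (per-site totals, number of sites with non-zero expression).
    """
    profile = {}
    for cluster_id in cluster_ids:
        for site, n_ests in cluster_est_counts.get(cluster_id, {}).items():
            profile[site] = profile.get(site, 0) + n_ests
    n_sites_expressed = sum(1 for total in profile.values() if total != 0)
    return profile, n_sites_expressed


# Invented EST counts for two hypothetical clusters.
toy_counts = {
    "Hs.1": {"brain": 3, "liver": 0},
    "Hs.2": {"brain": 2, "kidney": 5},
}
profile, n_sites_expressed = expression_profile(["Hs.1", "Hs.2"], toy_counts)
```

The same summation applies to developmental stages; only body sites feed the derived count.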

Ensembl

Ensembl was used to extract information about the alternative transcripts, paralogues and germline variants of UniProt proteins. Details are given in S2 Supplementary Information.

Protein Drug Targets

The protein drug targets were determined using the TTD version 4.3.02 [43] and DrugBank version 3 [42]. Details on how UniProt accession numbers were obtained are given in S2 Supplementary Information. The final number of proteins determined to be the target of an approved small molecule drug was 1324, of which 1249 were found in DrugBank and 313 in the TTD. 238 of the proteins were common to both sources, while 1011 were unique to DrugBank and 75 unique to the TTD.
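The reported counts are mutually consistent, which can be checked with simple set arithmetic. The sketch below uses synthetic integer IDs in place of real accessions, chosen only so that the set sizes match the figures quoted above; the function name is hypothetical.

```python
def overlap_summary(source_a, source_b):
    """Summarise how two sets of target accessions overlap."""
    return {
        "total": len(source_a | source_b),
        "common": len(source_a & source_b),
        "only_a": len(source_a - source_b),
        "only_b": len(source_b - source_a),
    }


# Synthetic stand-ins: 1249 "DrugBank" IDs and 313 "TTD" IDs sharing 238.
drugbank = set(range(0, 1249))
ttd = set(range(1011, 1324))
summary = overlap_summary(drugbank, ttd)
```

With these sizes, the union contains 1011 + 238 + 75 = 1324 proteins, matching the final number of targets.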

Machine Learning

Datasets Generated

The following 105 features were used in the construction of the protein datasets:

Amino acid composition
- Twenty amino acid frequencies
- Eight amino acid category frequencies
Simple sequence properties
- Sequence length
- The number of PEST motifs
- The number of low complexity regions
- The hydrophobicity of the protein
- The isoelectric point
- The presence of a signal peptide
Posttranslational modifications
- The number of O- and N-glycosylated sites
- The number of phosphorylated serine, threonine and tyrosine sites
- The total number of phosphorylated sites of any type
Secondary structures
- The number of α-helical transmembrane regions
- The percentage of residues predicted to participate in an exposed α-helix
- The percentage of residues predicted to participate in a buried α-helix
- The percentage of residues predicted to participate in a β-strand
Germline variants
- The number of 3’ untranslated region, 5’ untranslated region, nonsynonymous coding and synonymous coding variants
Inter-protein relationships
- The number of binary PPIs
- The number of alternative transcripts
- The number of paralogues
Expression levels
- Seven developmental stage expression levels
- Forty-five body site expression levels
- Derived feature recording the number of body sites the protein is expressed in

Six categories were created from the annotated human proteins. Within each category the proteins can be considered to be either positive or negative, positive proteins being those proteins that are approved drug targets and negative proteins those that are not. However, not all positive proteins will have been identified as such yet. Therefore, the set of negative proteins will contain both proteins that will never be the target of an approved drug and those that are not currently but will be in the future. The categories were therefore divided into positive and unlabelled proteins, rather than positive and negative, where the unlabelled proteins contain both negative and nominally mislabelled positive proteins. Each protein in the human proteome was evaluated against a set of criteria to determine which of the categories it belongs in, and then evaluated against a separate criterion for each category to determine whether it is a positive protein in that specific category. The six categories, along with their criteria, can be seen in Table 1.

Table 1. Dataset inclusion criteria.

Category Name	Criterion for Inclusion in Category	# Proteins in Class	Criterion for Inclusion in Positive Class	# Positive Proteins
AllTargets	All proteins are included.	20243	The protein must be a target protein.	1324
Cancer	The protein must be a cancer protein.	831	The protein must be the target of an antineoplastic drug.	387
GPCR	The protein must be a GPCR.	827	The protein must be a target protein.	115
IonChannel	The protein must be an ion channel.	320	The protein must be a target protein.	155
Kinase	The protein must be a kinase.	661	The protein must be a target protein.	94
Protease	The protein must be a protease.	531	The protein must be a target protein.	59
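The criteria in Table 1 can be expressed as a small labelling function. In this sketch a protein is represented by a dictionary whose keys (`families`, `is_cancer`, `is_target`, `is_antineoplastic_target`) are hypothetical names chosen for the example.

```python
FAMILY_DATASETS = ("GPCR", "IonChannel", "Kinase", "Protease")


def assign_datasets(protein):
    """Return {dataset name: 'positive' or 'unlabelled'} for one protein.

    A protein appears only in the datasets whose inclusion criterion it
    meets; within each, it is positive if it meets the positive criterion.
    """
    is_positive = {"AllTargets": protein["is_target"]}
    if protein["is_cancer"]:
        is_positive["Cancer"] = protein["is_antineoplastic_target"]
    for family in FAMILY_DATASETS:
        if family in protein["families"]:
            is_positive[family] = protein["is_target"]
    return {
        dataset: "positive" if positive else "unlabelled"
        for dataset, positive in is_positive.items()
    }


# Hypothetical kinase that is a cancer protein and an approved target,
# but not the target of an antineoplastic drug.
example = assign_datasets({
    "families": {"Kinase"},
    "is_cancer": True,
    "is_target": True,
    "is_antineoplastic_target": False,
})
```

The example shows that one protein can be positive in one dataset and unlabelled in another, since the Cancer dataset uses a stricter positive criterion.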

Threshold	Non-redundant Observations (Pos/Unl)	Non-redundant Dataset G Mean	Entire Dataset TP	Entire Dataset FP	Entire Dataset TN	Entire Dataset FN	Entire Dataset G Mean
20%	403 (178/225)	0.84	293	50	394	94	0.82
30%	519 (236/283)	0.83	309	53	391	78	0.84
40%	625 (285/340)	0.83	326	66	378	61	0.85
50%	695 (316/379)	0.84	328	67	377	59	0.85
60%	742 (343/399)	0.84	313	52	392	74	0.85
70%	785 (367/418)	0.84	312	53	391	75	0.84
80%	806 (379/427)	0.84	318	59	385	69	0.84
90%	818 (385/433)	0.85	318	55	389	69	0.85
100%	831 (387/444)	0.85	332	71	373	55	0.85
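The G mean column in the table above can be reproduced from the confusion matrix counts if the G mean is taken to be the geometric mean of sensitivity and specificity. That definition is an assumption here, since it is not stated in this section, but it agrees with the tabulated values.

```python
from math import sqrt


def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    sensitivity = tp / (tp + fn)  # fraction of positives recovered
    specificity = tn / (tn + fp)  # fraction of unlabelled kept unlabelled
    return sqrt(sensitivity * specificity)


# Counts taken from the 20% and 100% threshold rows of the table.
g_20 = round(g_mean(tp=293, fp=50, tn=394, fn=94), 2)
g_100 = round(g_mean(tp=332, fp=71, tn=373, fn=55), 2)
```

Unlike plain accuracy, this measure falls sharply if either class is predicted poorly, which makes it suitable for datasets in which the positive and unlabelled classes differ in size.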

Feature	P-value	PS	Positive Median	Unlabelled Median	Feature	P-value	PS	Positive Median	Unlabelled Median
Alanine *	3.47 × 10⁻⁰⁴	0.53	0.07	0.07	Positively Charged *	7.98 × 10⁻²³	0.42	0.13	0.14
Arginine *	1.28 × 10⁻¹³	0.44	0.05	0.06	Sequence Length *	2.13 × 10⁻¹⁴	0.56	474	410
Asparagine *	1.33 × 10⁻¹⁵	0.57	0.04	0.03	PEST Motifs *	2.66 × 10⁻¹³	0.45	0	0
Aspartic Acid *	5.90 × 10⁻⁰⁸	0.54	0.05	0.05	Low Complexity Regions *	1.83 × 10⁻⁰⁸	0.45	2	2
Cysteine	1.53 × 10⁻⁰¹	0.49	0.02	0.02	Hydrophobicity *	3.28 × 10⁻⁹³	0.67	-0.19	-0.38
Glutamic Acid *	3.71 × 10⁻¹⁹	0.43	0.06	0.07	Isoelectric Point	1.31 × 10⁻⁰¹	0.49	7.31	7.47
Glutamine *	2.57 × 10⁻⁶⁵	0.36	0.04	0.04	Signal Peptide *	8.10 × 10⁻¹¹	0.53	0	0
Glycine *	2.19 × 10⁻¹⁰	0.55	0.07	0.06	O-glycosylation Sites *	3.62 × 10⁻⁰⁴	0.51	0	0
Histidine *	1.35 × 10⁻⁰⁵	0.46	0.02	0.02	N-glycosylation Sites *	1.35 × 10⁻⁶⁴	0.60	0	0
Isoleucine *	1.10 × 10⁻⁷²	0.65	0.05	0.04	Phosphoserine Sites	7.02 × 10⁻⁰¹	0.50	0	0
Leucine *	3.33 × 10⁻⁰⁵	0.53	0.10	0.10	Phosphothreonine Sites	3.02 × 10⁻⁰²	0.51	0	0
Lysine	1.80 × 10⁻⁰¹	0.49	0.05	0.05	Phosphotyrosine Sites *	1.66 × 10⁻²⁵	0.54	0	0
Methionine *	1.31 × 10⁻³³	0.60	0.02	0.02	Total Phosphorylation Sites *	1.98 × 10⁻⁰⁴	0.53	0	0
Phenylalanine *	5.31 × 10⁻⁷⁸	0.65	0.04	0.04	Transmembrane α-helices *	3.16 × 10⁻⁶²	0.60	0	0
Proline *	9.94 × 10⁻¹²	0.44	0.05	0.06	Exposed α-helices *	1.92 × 10⁻⁰⁵	0.54	0.13	0.12
Serine *	1.37 × 10⁻⁶⁰	0.37	0.07	0.08	Buried α-helices *	2.47 × 10⁻⁸⁹	0.66	0.22	0.14
Threonine	2.87 × 10⁻⁰³	0.52	0.05	0.05	β Strands *	2.40 × 10⁻¹²	0.56	0.12	0.09
Tryptophan *	4.64 × 10⁻²⁴	0.58	0.01	0.01	3’ Untranslated	7.32 × 10⁻⁰¹	0.50	1	1
Tyrosine *	1.61 × 10⁻⁵²	0.63	0.03	0.03	5’ Untranslated	3.41 × 10⁻⁰¹	0.51	0	0
Valine *	7.98 × 10⁻⁶⁴	0.64	0.07	0.06	Nonsynonymous Coding *	6.66 × 10⁻¹⁶	0.57	15	11
Aliphatic *	6.09 × 10⁻⁷⁰	0.65	0.22	0.20	Synonymous Coding *	2.50 × 10⁻¹⁰	0.54	0	0
Aromatic *	6.68 × 10⁻⁵⁶	0.63	0.12	0.10	Binary PPIs *	5.02 × 10⁻¹⁴	0.56	1	0
Charged *	1.61 × 10⁻¹³¹.61 × 10⁻²³	0.42	0.24	0.26	Alternative Transcripts *	2.44 × 10⁻¹⁸	0.57	3	2
Negatively Charged *	2.05 × 10⁻⁰⁶	0.46	0.11	0.11	Paralogues *	5.73 × 10⁻⁰⁷	0.53	0	0
Non-polar *	1.24 × 10⁻⁷²	0.65	0.56	0.53	Body Sites Expressed In *	5.31 × 10⁻¹²	0.56	27	26

Feature	P-value	PS	Positive Median	Unlabelled Median	Feature	P-value	PS	Positive Median	Unlabelled Median
Alanine	1.46 × 10⁻⁰¹	0.53	0.07	0.07	Positively Charged *	2.63 × 10⁻¹⁵	0.34	0.13	0.14
Arginine *	8.88 × 10⁻⁰⁵	0.42	0.05	0.05	Sequence Length	4.15 × 10⁻⁰¹	0.48	505	557
Asparagine	5.26 × 10⁻⁰²	0.54	0.04	0.04	PEST Motifs *	1.06 × 10⁻⁰⁸	0.40	0	1
Aspartic Acid	1.22 × 10⁻⁰²	0.45	0.05	0.05	Low Complexity Regions *	1.17 × 10⁻⁰⁹	0.38	2	4
Cysteine *	4.34 × 10⁻⁰⁹	0.62	0.02	0.02	Hydrophobicity *	1.41 × 10⁻⁵⁷	0.82	-0.19	-0.57
Glutamic Acid *	3.73 × 10⁻¹²	0.36	0.06	0.07	Isoelectric Point	1.56 × 10⁻⁰¹	0.53	7.04	6.81
Glutamine *	2.43 × 10⁻²⁵	0.29	0.04	0.05	Signal Peptide *	1.11 × 10⁻¹⁵	0.61	0	0
Glycine	8.61 × 10⁻⁰¹	0.50	0.06	0.06	O-glycosylation Sites	7.76 × 10⁻⁰¹	0.50	0	0
Histidine	1.80 × 10⁻⁰³	0.44	0.02	0.02	N-glycosylation Sites *	2.81 × 10⁻³⁸	0.71	1	0
Isoleucine *	4.15 × 10⁻²⁷	0.72	0.05	0.04	Phosphoserine Sites *	8.17 × 10⁻¹²	0.37	0	1
Leucine *	4.44 × 10⁻¹⁶	0.66	0.10	0.09	Phosphothreonine Sites *	2.13 × 10⁻⁰⁶	0.42	0	0
Lysine	1.29 × 10⁻⁰³	0.44	0.05	0.06	Phosphotyrosine Sites	3.30 × 10⁻⁰³	0.54	0	0
Methionine *	1.25 × 10⁻⁰⁵	0.59	0.02	0.02	Total Phosphorylation Sites *	3.31 × 10⁻⁰⁷	0.40	1	2
Phenylalanine *	1.80 × 10⁻⁴²	0.77	0.04	0.03	Transmembrane α-helices *	5.21 × 10⁻⁴⁵	0.73	1	0
Proline *	2.04 × 10⁻¹³	0.35	0.05	0.07	Exposed α-helices	2.28 × 10⁻⁰¹	0.52	0.12	0.11
Serine *	7.94 × 10⁻¹⁰	0.38	0.07	0.08	Buried α-helices *	1.65 × 10⁻³⁵	0.75	0.22	0.10
Threonine	8.66 × 10⁻⁰³	0.55	0.05	0.05	β Strands *	2.67 × 10⁻⁰⁶	0.59	0.10	0.06
Tryptophan *	6.38 × 10⁻²⁷	0.72	0.02	0.01	3’ Untranslated *	2.55 × 10⁻¹⁹	0.32	0	3
Tyrosine *	1.11 × 10⁻¹⁵	0.66	0.03	0.02	5’ Untranslated *	1.36 × 10⁻¹⁶	0.34	0	2
Valine *	2.70 × 10⁻³⁰	0.73	0.07	0.05	Nonsynonymous Coding *	3.62 × 10⁻⁰⁹	0.38	14	29
Aliphatic *	5.83 × 10⁻⁴³	0.78	0.22	0.19	Synonymous Coding *	2.04 × 10⁻⁰⁴	0.44	0	0
Aromatic *	1.85 × 10⁻³⁵	0.75	0.12	0.09	Binary PPIs *	6.83 × 10⁻⁰⁵	0.42	1	2
Charged *	5.95 × 10⁻¹⁶	0.34	0.24	0.26	Alternative Transcripts	1.08 × 10⁻⁰²	0.45	3	4
Negatively Charged *	3.09 × 10⁻¹⁰	0.37	0.11	0.12	Paralogues	5.50 × 10⁻⁰³	0.45	0	0
Non-polar *	1.26 × 10⁻³³	0.74	0.55	0.51	Body Sites Expressed In *	3.93 × 10⁻¹⁶	0.34	24	32

Feature	P-value	PS	Positive Median	Unlabelled Median	Feature	P-value	PS	Positive Median	Unlabelled Median
Alanine *	4.07 × 10⁻⁰⁸	0.66	0.08	0.06	Positively Charged *	1.75 × 10⁻¹³	0.71	0.11	0.10
Arginine *	9.88 × 10⁻²²	0.77	0.05	0.04	Sequence Length *	1.97 × 10⁻³⁰	0.81	408	320
Asparagine	1.27 × 10⁻⁰³	0.59	0.04	0.03	PEST Motifs *	2.31 × 10⁻⁰⁷	0.59	0	0
Aspartic Acid *	5.45 × 10⁻⁰⁸	0.66	0.03	0.03	Low Complexity Regions *	2.78 × 10⁻⁰⁷	0.64	2	1
Cysteine	3.63 × 10⁻⁰³	0.42	0.03	0.03	Hydrophobicity *	1.06 × 10⁻³³	0.17	0.31	0.68
Glutamic Acid *	6.09 × 10⁻¹⁴	0.71	0.03	0.03	Isoelectric Point *	5.97 × 10⁻⁰⁶	0.63	9.02	8.52
Glutamine	6.78 × 10⁻⁰³	0.58	0.03	0.03	Signal Peptide	1.38 × 10⁻⁰³	0.55	0	0
Glycine *	2.50 × 10⁻⁰⁴	0.61	0.05	0.05	O-glycosylation Sites	1.92 × 10⁻⁰²	0.51	0	0
Histidine *	1.52 × 10⁻¹⁶	0.27	0.02	0.03	N-glycosylation Sites *	6.47 × 10⁻¹²	0.68	2	1
Isoleucine *	7.68 × 10⁻⁰⁶	0.37	0.07	0.08	Phosphoserine Sites *	1.43 × 10⁻⁰⁶	0.57	0	0
Leucine *	1.86 × 10⁻¹⁵	0.28	0.12	0.14	Phosphothreonine Sites *	6.08 × 10⁻⁰⁶	0.54	0	0
Lysine *	8.30 × 10⁻⁰⁴	0.60	0.04	0.03	Phosphotyrosine Sites	5.49 × 10⁻⁰³	0.52	0	0
Methionine *	1.17 × 10⁻¹³	0.29	0.02	0.03	Total Phosphorylation Sites *	7.87 × 10⁻⁰⁸	0.59	0	0
Phenylalanine *	1.93 × 10⁻¹⁷	0.26	0.05	0.07	Transmembrane α-helices	9.85 × 10⁻⁰¹	0.50	7	7
Proline *	4.39 × 10⁻¹¹	0.69	0.05	0.04	Exposed α-helices	3.35 × 10⁻⁰¹	0.53	0.09	0.09
Serine	5.60 × 10⁻⁰²	0.44	0.08	0.08	Buried α-helices *	6.29 × 10⁻¹⁹	0.25	0.47	0.58
Threonine *	7.45 × 10⁻⁰⁴	0.40	0.06	0.06	β Strands *	1.86 × 10⁻¹²	0.30	0.03	0.04
Tryptophan *	3.41 × 10⁻²¹	0.76	0.02	0.01	3’ Untranslated	1.81 × 10⁻⁰²	0.53	0	0
Tyrosine *	1.64 × 10⁻⁰⁸	0.34	0.03	0.04	5’ Untranslated	5.56 × 10⁻⁰²	0.53	0	0
Valine	2.47 × 10⁻⁰¹	0.47	0.08	0.08	Nonsynonymous Coding *	3.92 × 10⁻¹⁶	0.70	2	0
Aliphatic *	8.59 × 10⁻²⁴	0.22	0.26	0.30	Synonymous Coding	7.95 × 10⁻⁰²	0.52	0	0
Aromatic *	3.23 × 10⁻¹⁷	0.26	0.12	0.15	Binary PPIs	6.93 × 10⁻⁰³	0.54	0	0
Charged *	7.68 × 10⁻²²	0.77	0.18	0.15	Alternative Transcripts *	6.68 × 10⁻¹⁸	0.72	1	0
Negatively Charged *	9.12 × 10⁻¹⁸	0.74	0.07	0.05	Paralogues	3.72 × 10⁻⁰²	0.52	0	0
Non-polar *	2.45 × 10⁻¹²	0.30	0.61	0.65	Body Sites Expressed In *	5.85 × 10⁻²⁷	0.79	12	5

Table 11. Results of the feature analysis for the GPCR_NO dataset.
Feature	P-value	PS	Positive Median	Unlabelled Median	Feature	P-value	PS	Positive Median	Unlabelled Median
Alanine	4.75 × 10⁻⁰²	0.56	0.08	0.07	Positively Charged	1.92 × 10⁻⁰¹	0.54	0.11	0.11
Arginine	5.63 × 10⁻⁰³	0.59	0.05	0.05	Sequence Length	2.87 × 10⁻⁰³	0.59	408	373
Asparagine	1.77 × 10⁻⁰¹	0.54	0.04	0.04	PEST Motifs	2.40 × 10⁻⁰¹	0.53	0	0
Aspartic Acid *	7.06 × 10⁻⁰⁴	0.61	0.03	0.03	Low Complexity Regions	4.71 × 10⁻⁰¹	0.48	2	2
Cysteine	4.56 × 10⁻⁰¹	0.48	0.03	0.03	Hydrophobicity *	1.84 × 10⁻⁰⁴	0.38	0.31	0.43
Glutamic Acid	5.66 × 10⁻⁰²	0.56	0.03	0.03	Isoelectric Point	1.81 × 10⁻⁰¹	0.54	9.02	8.68
Glutamine	4.25 × 10⁻⁰¹	0.47	0.03	0.03	Signal Peptide	5.96 × 10⁻⁰¹	0.48	0	0
Glycine	4.89 × 10⁻⁰¹	0.52	0.05	0.05	O-glycosylation Sites	7.97 × 10⁻⁰²	0.51	0	0
Histidine *	9.56 × 10⁻⁰⁹	0.32	0.02	0.02	N-glycosylation Sites	2.04 × 10⁻⁰¹	0.54	2	2
Isoleucine	4.73 × 10⁻⁰¹	0.52	0.07	0.06	Phosphoserine Sites	7.94 × 10⁻⁰²	0.53	0	0
Leucine *	1.44 × 10⁻⁰⁴	0.38	0.12	0.13	Phosphothreonine Sites	5.07 × 10⁻⁰³	0.53	0	0
Lysine	7.18 × 10⁻⁰²	0.56	0.04	0.04	Phosphotyrosine Sites	2.78 × 10⁻⁰¹	0.51	0	0
Methionine	4.69 × 10⁻⁰¹	0.52	0.02	0.02	Total Phosphorylation Sites	3.51 × 10⁻⁰²	0.55	0	0
Phenylalanine	2.43 × 10⁻⁰³	0.40	0.05	0.06	Transmembrane α-helices	9.58 × 10⁻⁰¹	0.50	7	7
Proline	1.77 × 10⁻⁰³	0.60	0.05	0.04	Exposed α-helices	5.36 × 10⁻⁰²	0.44	0.09	0.10
Serine	1.82 × 10⁻⁰¹	0.46	0.08	0.08	Buried α-helices	6.55 × 10⁻⁰²	0.44	0.47	0.51
Threonine	6.27 × 10⁻⁰¹	0.48	0.06	0.06	β Strands	4.84 × 10⁻⁰²	0.44	0.03	0.03
Tryptophan	7.86 × 10⁻⁰¹	0.49	0.02	0.02	3’ Untranslated	2.41 × 10⁻⁰¹	0.48	0	0
Tyrosine	4.28 × 10⁻⁰¹	0.47	0.03	0.03	5’ Untranslated	1.85 × 10⁻⁰¹	0.47	0	0
Valine	5.84 × 10⁻⁰¹	0.48	0.08	0.08	Nonsynonymous Coding	2.98 × 10⁻⁰¹	0.53	2	1
Aliphatic	3.39 × 10⁻⁰³	0.41	0.26	0.27	Synonymous Coding	4.03 × 10⁻⁰¹	0.49	0	0
Aromatic *	6.54 × 10⁻⁰⁶	0.36	0.12	0.14	Binary PPIs	3.70 × 10⁻⁰¹	0.48	0	0
Charged	2.63 × 10⁻⁰³	0.60	0.18	0.17	Alternative Transcripts	6.69 × 10⁻⁰³	0.58	1	1
Negatively Charged	1.64 × 10⁻⁰³	0.60	0.07	0.06	Paralogues	7.19 × 10⁻⁰¹	0.50	0	0
Non-polar	3.51 × 10⁻⁰²	0.43	0.61	0.63	Body Sites Expressed In	1.73 × 10⁻⁰¹	0.54	12	11

Table 12. Results of the feature analysis for the IonChannel dataset.
Feature	P-value	PS	Positive Median	Unlabelled Median	Feature	P-value	PS	Positive Median	Unlabelled Median
Alanine	5.21 × 10⁻⁰²	0.44	0.06	0.07	Positively Charged	8.46 × 10⁻⁰¹	0.51	0.13	0.13
Arginine	5.91 × 10⁻⁰¹	0.52	0.06	0.05	Sequence Length *	4.85 × 10⁻⁰⁴	0.61	613	509
Asparagine *	9.78 × 10⁻⁰⁴	0.61	0.04	0.03	PEST Motifs	6.64 × 10⁻⁰¹	0.51	0	0
Aspartic Acid	1.57 × 10⁻⁰³	0.60	0.05	0.04	Low Complexity Regions	9.72 × 10⁻⁰²	0.55	3	3
Cysteine	1.36 × 10⁻⁰¹	0.45	0.02	0.02	Hydrophobicity	1.90 × 10⁻⁰¹	0.46	-0.11	-0.08
Glutamic Acid	2.63 × 10⁻⁰¹	0.46	0.06	0.06	Isoelectric Point	8.49 × 10⁻⁰¹	0.51	7.38	7.56
Glutamine	1.91 × 10⁻⁰³	0.40	0.03	0.04	Signal Peptide *	1.68 × 10⁻¹⁰	0.66	0	0
Glycine	2.50 × 10⁻⁰¹	0.46	0.06	0.06	O-glycosylation Sites	NA	NA	0	0
Histidine	7.60 × 10⁻⁰¹	0.49	0.02	0.02	N-glycosylation Sites *	1.02 × 10⁻⁰⁸	0.68	2	1
Isoleucine	1.54 × 10⁻⁰²	0.58	0.06	0.06	Phosphoserine Sites	2.58 × 10⁻⁰³	0.58	0	0
Leucine *	2.13 × 10⁻⁰⁶	0.35	0.10	0.11	Phosphothreonine Sites	2.22 × 10⁻⁰¹	0.52	0	0
Lysine	2.39 × 10⁻⁰¹	0.54	0.05	0.05	Phosphotyrosine Sites	1.59 × 10⁻⁰¹	0.53	0	0
Methionine	2.36 × 10⁻⁰²	0.57	0.03	0.02	Total Phosphorylation Sites	9.02 × 10⁻⁰³	0.57	0	0
Phenylalanine	2.79 × 10⁻⁰¹	0.46	0.05	0.05	Transmembrane α-helices	5.26 × 10⁻⁰¹	0.48	4	5
Proline	3.47 × 10⁻⁰¹	0.53	0.05	0.05	Exposed α-helices	2.85 × 10⁻⁰²	0.43	0.14	0.16
Serine	2.59 × 10⁻⁰¹	0.54	0.08	0.07	Buried α-helices	3.53 × 10⁻⁰³	0.41	0.27	0.32
Threonine *	2.16 × 10⁻⁰⁴	0.62	0.05	0.05	β Strands	1.26 × 10⁻⁰³	0.60	0.11	0.06
Tryptophan	9.73 × 10⁻⁰¹	0.50	0.02	0.02	3’ Untranslated	6.21 × 10⁻⁰¹	0.49	0	0
Tyrosine	7.06 × 10⁻⁰¹	0.49	0.03	0.03	5’ Untranslated	2.74 × 10⁻⁰¹	0.47	0	0
Valine	6.60 × 10⁻⁰³	0.59	0.07	0.06	Nonsynonymous Coding	8.18 × 10⁻⁰²	0.56	4	3
Aliphatic	2.54 × 10⁻⁰¹	0.46	0.23	0.23	Synonymous Coding	1.45 × 10⁻⁰³	0.45	0	0
Aromatic	2.67 × 10⁻⁰¹	0.46	0.12	0.13	Binary PPIs	2.89 × 10⁻⁰¹	0.47	0	0
Charged	8.33 × 10⁻⁰¹	0.51	0.23	0.23	Alternative Transcripts	4.44 × 10⁻⁰²	0.56	3	2
Negatively Charged	5.43 × 10⁻⁰¹	0.52	0.11	0.10	Paralogues	7.27 × 10⁻⁰²	0.54	0	0
Non-polar	1.59 × 10⁻⁰²	0.42	0.55	0.57	Body Sites Expressed In	4.14 × 10⁻⁰¹	0.53	15	15

Table 13. Results of the feature analysis for the Kinase dataset.
Feature	P-value	PS	Positive Median	Unlabelled Median	Feature	P-value	PS	Positive Median	Unlabelled Median
Alanine	1.19 × 10⁻⁰²	0.42	0.06	0.07	Positively Charged	3.66 × 10⁻⁰³	0.41	0.14	0.15
Arginine	6.76 × 10⁻⁰²	0.44	0.06	0.06	Sequence Length	1.21 × 10⁻⁰¹	0.55	682	587
Asparagine	3.38 × 10⁻⁰³	0.59	0.04	0.03	PEST Motifs	4.16 × 10⁻⁰²	0.44	0	0
Aspartic Acid	2.39 × 10⁻⁰²	0.57	0.05	0.05	Low Complexity Regions	3.05 × 10⁻⁰¹	0.47	2	2
Cysteine	3.63 × 10⁻⁰³	0.59	0.02	0.02	Hydrophobicity	4.61 × 10⁻⁰²	0.56	-0.35	-0.38
Glutamic Acid	4.14 × 10⁻⁰¹	0.47	0.07	0.07	Isoelectric Point	4.59 × 10⁻⁰³	0.41	6.87	7.12
Glutamine	1.32 × 10⁻⁰²	0.42	0.04	0.04	Signal Peptide *	2.74 × 10⁻¹⁰	0.63	0	0
Glycine	1.67 × 10⁻⁰¹	0.54	0.07	0.06	O-glycosylation Sites	2.64 × 10⁻⁰¹	0.50	0	0
Histidine	4.38 × 10⁻⁰¹	0.48	0.03	0.03	N-glycosylation Sites *	3.84 × 10⁻¹²	0.64	0	0
Isoleucine	8.21 × 10⁻⁰²	0.56	0.05	0.05	Phosphoserine Sites	1.78 × 10⁻⁰¹	0.54	2	1
Leucine	8.32 × 10⁻⁰¹	0.49	0.10	0.10	Phosphothreonine Sites	2.94 × 10⁻⁰²	0.56	1	0
Lysine	2.25 × 10⁻⁰¹	0.46	0.06	0.06	Phosphotyrosine Sites *	3.49 × 10⁻¹⁸	0.74	2	0
Methionine	1.25 × 10⁻⁰¹	0.55	0.02	0.02	Total Phosphorylation Sites *	4.05 × 10⁻⁰⁸	0.67	8	3
Phenylalanine	2.42 × 10⁻⁰¹	0.54	0.04	0.04	Transmembrane α-helices *	4.34 × 10⁻⁰⁸	0.62	0	0
Proline	2.82 × 10⁻⁰¹	0.47	0.06	0.06	Exposed α-helices *	9.94 × 10⁻⁰⁶	0.36	0.10	0.13
Serine	3.86 × 10⁻⁰²	0.43	0.07	0.07	Buried α-helices	9.53 × 10⁻⁰²	0.45	0.13	0.15
Threonine	6.87 × 10⁻⁰²	0.56	0.05	0.05	β Strands *	5.12 × 10⁻¹⁰	0.70	0.17	0.13
Tryptophan *	7.12 × 10⁻⁰⁵	0.63	0.01	0.01	3’ Untranslated	2.16 × 10⁻⁰¹	0.54	2	1
Tyrosine *	3.69 × 10⁻⁰⁴	0.61	0.03	0.03	5’ Untranslated	2.41 × 10⁻⁰¹	0.54	2	1
Valine	8.16 × 10⁻⁰²	0.56	0.06	0.06	Nonsynonymous Coding	5.29 × 10⁻⁰³	0.59	24	17
Aliphatic	1.44 × 10⁻⁰¹	0.55	0.21	0.21	Synonymous Coding	4.58 × 10⁻⁰¹	0.52	0	0
Aromatic	1.83 × 10⁻⁰³	0.60	0.11	0.10	Binary PPIs	1.20 × 10⁻⁰²	0.58	2	1
Charged	6.80 × 10⁻⁰²	0.44	0.26	0.27	Alternative Transcripts	2.12 × 10⁻⁰¹	0.54	4	3
Negatively Charged	6.31 × 10⁻⁰¹	0.52	0.12	0.12	Paralogues	9.48 × 10⁻⁰¹	0.50	0	0
Non-polar	7.77 × 10⁻⁰²	0.56	0.53	0.53	Body Sites Expressed In	8.89 × 10⁻⁰³	0.58	33	31

Table 14. Division of positive and unlabelled kinases by type.
	Serine/Threonine	Tyrosine	Atypical	Unknown
Entire Dataset	390 (59%)	90 (14%)	27 (4%)	154 (23%)
Unlabelled Proteins	355 (63%)	50 (9%)	26 (5%)	136 (24%)
Unlabelled Proteins With Positive Similarity >0.5	56 (52%)	33 (31%)	1 (1%)	17 (16%)
Unlabelled Proteins With Positive Similarity ≥0.75	16 (33%)	26 (53%)	0 (0%)	7 (14%)
Positive Proteins	35 (37%)	40 (43%)	1 (1%)	18 (19%)

Table 15. Comparison of the feature effect sizes across the three datasets of kinases.
Feature	Kinase	Kinase_NTK	Kinase_TK
Phosphotyrosine Sites	* 0.24	0.07	* 0.48
β Strands	* 0.20	0.08	* 0.35
Total Phosphorylation Sites	* 0.17	0.08	* 0.30
Exposed α-helices	* -0.14	0.03	* -0.37
N-Glycosylation Sites	* 0.14	0.03	* 0.29
Signal Peptide	* 0.13	0.02	* 0.27
Tryptophan	* 0.13	0.03	* 0.25
Transmembrane α-helices	* 0.12	0.01	* 0.26
Tyrosine	* 0.11	0.03	* 0.22
Aromatic	0.10	0.08	0.12
Arginine	0.09	0.04	* 0.16
Cysteine	0.09	0.05	0.15
Positively Charged	-0.09	0.00	* -0.22
Isoelectric Point	-0.09	-0.04	* -0.16
Nonsynonymous Coding	0.09	0.11	0.06
Body Sites Expressed In	0.08	0.11	0.05
Alanine	-0.08	-0.09	-0.07
Glutamine	-0.08	-0.03	-0.15
Binary PPIs	0.08	0.12	0.02
Aspartic Acid	0.07	* 0.14	-0.01

Table 16. Results of the feature analysis for the Protease dataset.
Feature	P-value	PS	Positive Median	Unlabelled Median	Feature	P-value	PS	Positive Median	Unlabelled Median
Alanine	8.99 × 10⁻⁰²	0.57	0.07	0.06	Positively Charged	6.29 × 10⁻⁰¹	0.52	0.14	0.13
Arginine	2.25 × 10⁻⁰²	0.59	0.06	0.05	Sequence Length	6.61 × 10⁻⁰¹	0.48	478	497
Asparagine	7.89 × 10⁻⁰¹	0.49	0.04	0.04	PEST Motifs	8.80 × 10⁻⁰²	0.44	0	0
Aspartic Acid *	5.34 × 10⁻⁰⁵	0.66	0.06	0.05	Low Complexity Regions	4.34 × 10⁻⁰¹	0.47	1	2
Cysteine *	3.71 × 10⁻⁰⁵	0.34	0.01	0.03	Hydrophobicity	2.23 × 10⁻⁰²	0.41	-0.39	-0.30
Glutamic Acid	1.70 × 10⁻⁰¹	0.45	0.05	0.06	Isoelectric Point	6.88 × 10⁻⁰¹	0.48	6.97	7.06
Glutamine *	3.56 × 10⁻⁰⁴	0.36	0.04	0.04	Signal Peptide *	8.10 × 10⁻⁰⁴	0.62	1	0
Glycine	1.23 × 10⁻⁰¹	0.56	0.08	0.08	O-glycosylation Sites *	6.08 × 10⁻⁰⁸	0.57	0	0
Histidine	3.58 × 10⁻⁰¹	0.46	0.03	0.03	N-glycosylation Sites	1.81 × 10⁻⁰²	0.59	1	0
Isoleucine	2.03 × 10⁻⁰¹	0.45	0.04	0.05	Phosphoserine Sites	2.21 × 10⁻⁰¹	0.47	0	0
Leucine	8.30 × 10⁻⁰³	0.40	0.09	0.09	Phosphothreonine Sites	3.58 × 10⁻⁰¹	0.48	0	0
Lysine	2.98 × 10⁻⁰¹	0.46	0.05	0.05	Phosphotyrosine Sites	7.18 × 10⁻⁰¹	0.49	0	0
Methionine	9.92 × 10⁻⁰¹	0.50	0.02	0.02	Total Phosphorylation Sites	5.74 × 10⁻⁰¹	0.48	0	0
Phenylalanine *	1.47 × 10⁻⁰⁴	0.65	0.04	0.04	Transmembrane α-helices	7.23 × 10⁻⁰¹	0.51	0	0
Proline	3.20 × 10⁻⁰¹	0.54	0.06	0.06	Exposed α-helices	4.04 × 10⁻⁰¹	0.47	0.07	0.09
Serine *	1.43 × 10⁻⁰⁵	0.33	0.06	0.07	Buried α-helices	9.19 × 10⁻⁰¹	0.50	0.11	0.11
Threonine	1.17 × 10⁻⁰¹	0.56	0.05	0.05	β Strands	9.23 × 10⁻⁰²	0.57	0.21	0.17
Tryptophan	1.32 × 10⁻⁰¹	0.56	0.02	0.02	3’ Untranslated	5.15 × 10⁻⁰¹	0.48	0	0
Tyrosine *	2.83 × 10⁻⁰⁶	0.68	0.04	0.03	5’ Untranslated	1.78 × 10⁻⁰¹	0.45	0	0
Valine	2.80 × 10⁻⁰³	0.38	0.06	0.06	Nonsynonymous Coding	3.23 × 10⁻⁰¹	0.54	13	12
Aliphatic *	2.41 × 10⁻⁰⁵	0.33	0.19	0.21	Synonymous Coding	2.53 × 10⁻⁰²	0.57	0	0
Aromatic *	3.36 × 10⁻⁰⁶	0.68	0.13	0.12	Binary PPIs	8.77 × 10⁻⁰¹	0.51	0	0
Charged	5.48 × 10⁻⁰¹	0.52	0.25	0.25	Alternative Transcripts	3.39 × 10⁻⁰¹	0.46	2	2
Negatively Charged	1.69 × 10⁻⁰¹	0.55	0.11	0.11	Paralogues	1.22 × 10⁻⁰¹	0.46	0	0
Non-polar	1.33 × 10⁻⁰¹	0.56	0.56	0.55	Body Sites Expressed In	5.70 × 10⁻⁰¹	0.52	25	25

Positive Total	TPs	FNs	Sensitivity	Unlabelled Total	TNs	FPs	Specificity	G Mean
387	334	53	0.86	444	380	64	0.86	0.86
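
The sensitivity, specificity and G mean columns in the prediction tables can be recomputed from the raw counts. The following is a minimal Python sketch, assuming the G mean is the geometric mean of sensitivity and specificity (an assumption that is consistent with the tabulated values, not a definition stated here). The function name `g_mean_from_counts` is illustrative and does not come from the paper; the counts are those of the row directly above.

```python
from math import sqrt


def g_mean_from_counts(tp, fn, tn, fp):
    """Return (sensitivity, specificity, G mean) from confusion-matrix counts.

    Assumes G mean = sqrt(sensitivity * specificity).
    """
    sensitivity = tp / (tp + fn)  # fraction of positive proteins predicted positive
    specificity = tn / (tn + fp)  # fraction of unlabelled proteins predicted unlabelled
    return sensitivity, specificity, sqrt(sensitivity * specificity)


# Counts taken from the table row directly above.
sens, spec, g = g_mean_from_counts(tp=334, fn=53, tn=380, fp=64)
print(f"{sens:.2f} {spec:.2f} {g:.2f}")  # prints "0.86 0.86 0.86"
```

Because the G mean multiplies the two rates, a classifier that does well on only one class scores poorly, which is useful when positive and unlabelled proteins are present in very different numbers.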

Table 20. Random Forest predicted classifications for GPCR proteins.
Positive Total	TPs	FNs	Sensitivity	Unlabelled Total	TNs	FPs	Specificity	G Mean
115	104	11	0.90	712	613	99	0.86	0.88

Table 21. A comparison of the predictions of the non-odorant GPCRs.
Dataset Trained On	Positive Total	TPs	FNs	Sensitivity	Unlabelled Total	TNs	FPs	Specificity	G Mean
GPCR	115	104	11	0.90	291	241	99	0.71	0.74
GPCR_NO	115	88	27	0.77	291	241	50	0.83	0.80

Positive Total	TPs	FNs	Sensitivity	Unlabelled Total	TNs	FPs	Specificity	G Mean
155	133	22	0.86	165	144	21	0.87	0.87

Table 17. Comparison of the feature effect sizes across the three datasets of proteases.
Feature	Protease	Protease_NMP	Protease_MP
Tyrosine	* 0.18	0.12	* 0.22
Aromatic	* 0.18	-0.02	* 0.31
Serine	* -0.17	-0.02	* -0.27
Aliphatic	* -0.17	-0.10	* -0.21
Cysteine	* -0.16	0.11	* -0.34
Aspartic Acid	* 0.16	0.04	* 0.23
Phenylalanine	* 0.15	-0.06	* 0.29
Glutamine	* -0.14	-0.14	-0.14
Valine	-0.12	-0.04	* -0.17
Signal Peptide	* 0.12	0.07	* 0.15
Leucine	-0.10	-0.13	-0.09
Hydrophobicity	-0.09	0.00	-0.15
Arginine	0.09	0.14	0.06
N-Glycosylation Sites	0.09	0.06	0.10
O-Glycosylation Sites	* 0.07	* 0.17	0.01
Synonymous Coding	0.07	0.09	0.07
Alanine	0.07	0.02	0.10
β Strands	0.07	0.16	0.01
Threonine	0.06	0.14	0.01
Glycine	0.06	0.14	0.01

Table 25. Division of positive and unlabelled proteases by type.
	Aspartic	Cysteine	Metallo	Serine	Threonine
Entire Dataset	31 (6%)	135 (25%)	167 (31%)	161 (30%)	19 (4%)
Unlabelled Proteins	29 (6%)	133 (28%)	131 (28%)	148 (31%)	13 (3%)
Unlabelled Proteins With Positive Similarity >0.5	4 (8%)	9 (17%)	18 (34%)	14 (26%)	5 (9%)
Unlabelled Proteins With Positive Similarity ≥0.75	0 (0%)	2 (14%)	7 (50%)	3 (21%)	2 (14%)
Positive Proteins	2 (3%)	2 (3%)	36 (61%)	13 (22%)	6 (10%)

Dataset	All Pairs	Pairs of Two Positive Proteins	Pairs of Two Unlabelled Proteins	Pairs of One Unlabelled and One Positive Protein
AllTargets	0.47%	0.92%	0.49%	0.28%
Cancer	1.23%	2.59%	1.29%	0.60%
GPCR	33.41%	35.47%	39.17%	15.38%
GPCR_NO	20.20%	35.47%	16.81%	21.46%
IonChannel	5.27%	9.69%	4.40%	3.66%
Kinase	31.45%	42.44%	30.67%	32.89%
Protease	6.54%	18.81%	6.41%	6.26%

		Positive		Unlabelled
		Metallo	Non-metallo	Metallo	Non-metallo
Positive	Metallo	257	2	457	12
Positive	Non-metallo		74	42	1233
Unlabelled	Metallo			583	129
Unlabelled	Non-metallo				6442

Abstract

Introduction

Target Types Investigated

Antineoplastic

GPCRs

Ion Channels

Kinases

Proteases

Methods

Cleaning and Collation of Protein Data

Protein Accession and Name

Simple Sequence Properties

Amino Acid Composition

Protein Family

Posttranslational Modifications

Secondary Structure

Protein Protein Interactions

External Database References

UniGene Expression Clusters

Ensembl

Protein Drug Targets

Machine Learning

Datasets Generated

Table 1. Dataset inclusion criteria.

Random Forest Parameter Optimisation

Feature Selection

Sequence Identity Comparison

Identification of Targets and Their Properties

Results

Sequence Identity Comparison

Table 2. Comparison of RFs induced using non-redundant subsets of the Cancer dataset.

Table 3. Comparison of RFs induced using non-redundant subsets of the GPCR dataset.

Table 4. Comparison of RFs induced using non-redundant subsets of the IonChannel dataset.

Table 5. Comparison of RFs induced using non-redundant subsets of the Kinase dataset.

Table 6. Comparison of RFs induced using non-redundant subsets of the Protease dataset.

Table 7. Fraction of the number of proteins in the entire dataset in each non-redundant dataset.

Target Properties

All Proteins

Table 8. Results of the feature analysis for the AllTargets dataset.

Cancer Proteins

Table 9. Results of the feature analysis for the Cancer dataset.

GPCRs

Table 10. Results of the feature analysis for the GPCR dataset.

Table 11. Results of the feature analysis for the GPCR_NO dataset.

Ion Channels

Table 12. Results of the feature analysis for the IonChannel dataset.

Kinases

Table 13. Results of the feature analysis for the Kinase dataset.

Table 14. Division of positive and unlabelled kinases by type.

Table 15. Comparison of the feature effect sizes across the three datasets of kinases.

Proteases

Table 16. Results of the feature analysis for the Protease dataset.

Table 17. Comparison of the feature effect sizes across the three datasets of proteases.

Target Predictions

All Proteins

Fig 1. Weighted predictions of the proteins in the AllTargets dataset.

Table 18. Random Forest predicted classifications for all proteins.

Cancer Proteins

Fig 2. Weighted predictions of the proteins in the Cancer dataset.

Table 19. Random Forest predicted classifications for Cancer proteins.

GPCRs

Fig 3. Weighted predictions of the proteins in the GPCR dataset.

Table 20. Random Forest predicted classifications for GPCR proteins.

Table 21. A comparison of the predictions of the non-odorant GPCRs.

Fig 4. Weighted predictions of the proteins in the GPCR_NO dataset.

Ion Channels

Fig 5. Weighted predictions of the proteins in the IonChannel dataset.

Table 22. Random Forest predicted classifications for Ion Channel proteins.

Kinases

Fig 6. Weighted predictions of the proteins in the Kinase dataset.

Table 23. Random Forest predicted classifications for Kinases.

Proteases

Fig 7. Weighted predictions of the proteins in the Protease dataset.

Table 24. Random Forest predicted classifications for Proteases.

Table 25. Division of positive and unlabelled proteases by type.