Network- and Attribute-Based Classifiers Can Prioritize Genes and Pathways for Autism Spectrum Disorders and for Intellectual Disability

Yan Kou; Catalina Betancur; Huilei Xu; Joseph D Buxbaum; Avi Ma’ayan

doi:10.1002/ajmg.c.31330

Author manuscript; available in PMC: 2013 May 15.

Published in final edited form as: Am J Med Genet C Semin Med Genet. 2012 Apr 12;160C(2):130–142. doi: 10.1002/ajmg.c.31330

PMCID: PMC3505691 NIHMSID: NIHMS363397 PMID: 22499558

Abstract

Autism spectrum disorders (ASD) are a group of related neurodevelopmental disorders with significant combined prevalence (~1%) and high heritability. Dozens of individually rare genes and loci associated with high risk for ASD have been identified, which overlap extensively with genes for intellectual disability (ID). However, studies indicate that there may be hundreds of genes that remain to be identified. The advent of inexpensive massively parallel nucleotide sequencing can reveal the genetic underpinnings of heritable complex diseases, including ASD and ID. However, whole exome sequencing (WES) and whole genome sequencing (WGS) provide an embarrassment of riches, where many candidate variants emerge. It has been argued that genetic variation for ASD and ID will cluster in genes involved in distinct pathways and protein complexes. For this reason, computational methods that prioritize candidate genes based on additional functional information such as protein-protein interactions or association with specific canonical or empirical pathways, or other attributes, can be useful. In this study we applied several supervised learning approaches to prioritize ASD or ID disease gene candidates based on curated lists of known ASD and ID disease genes. We implemented two network-based classifiers and one attribute-based classifier to show that we can rank and classify known, and predict new, genes for these neurodevelopmental disorders. We also show that ID and ASD share common pathways that perturb an overlapping synaptic regulatory subnetwork, and that features relating to neuronal phenotypes in mouse knockouts can help in classifying neurodevelopmental genes. Our methods can be applied broadly to other diseases, helping to prioritize newly identified genetic variation that emerges from disease gene discovery based on WES and WGS.

Keywords: High-throughput sequencing, massively parallel sequencing, gene discovery, networks, pathways, neurodevelopmental disorders, classifiers, support vector machine

Introduction

ASD and ID are complex, multifactorial neurodevelopmental disorders with high heritability, which share overlapping risk factors (Betancur 2010; El-Fishawy and State 2009; Topper et al. 2011). Great progress has been made in the past years in identifying rare variants of major effect in both ASD and ID (Betancur 2010; Topper et al. 2011). However, the genetic underpinning of these disorders remains mostly unknown. For example, a specific genetic etiology can currently be identified in about 15% of patients with ASD. Similarly, although dozens of high-risk ASD genes and loci have been identified (Betancur 2010), ongoing studies estimate that 60–80% of ASD and ID genes and loci remain to be discovered [see (Sanders et al. 2011) and (Topper et al. 2011)]. The recent advances in massively parallel DNA sequencing bring the promise that genetic variation identified in individuals affected with a neurodevelopmental disorder will add to our understanding of the etiology of these disorders. However, as sequencing data accumulate, vast amounts of genetic variation are being discovered. This presents the challenge of deciding which variants lead to the phenotype and which are coincidental. To address this challenge, it is useful to have computational approaches that place gene products harboring known variation within networks and place newly identified variation in the same context. For example, Gilman et al. (2011) developed a weighted functional background network, which when seeded with genes found within CNVs associated with high risk for ASD yielded a subnetwork enriched with neuronal motility, synaptic development and axonal guidance gene products. Their resultant subnetwork was also enriched in genes previously associated with ID phenotypes. Similarly, Voineagu et al. (2011) developed a gene co-expression network from ASD and normal brain samples to find a differentially expressed subnetwork of genes enriched in neuronal and immune functions as well as glial markers. In another study, Ziats and Rennert (2011) tracked the expression levels of ASD-associated genes during development using published microarrays. They showed that co-expressed subnetworks seeded with ASD genes form modules that are enriched in genes known to play a role in immunity (Ziats and Rennert 2011). The ability of such approaches to discover new mechanisms in ASD suggests that functional molecular interactomes may be useful for linking the complex human phenotypes of ASD and ID to variation in genes (Gilman et al. 2011).

Many computational approaches have been developed to construct background networks in which lists of disease genes can be placed in order to build functional disease neighborhoods that connect the seed disease genes (Berger et al. 2007; Chen et al. 2009; Kann 2009; Navlakha and Kingsford 2010; Oti et al. 2006; Zhang et al. 2011). For instance, by calculating the shortest path between seed genes using a protein interaction network, it was shown that the mean path length between eight syndromic ASD proteins is much shorter than the mean shortest path between random proteins (Sakai et al. 2011). This observation indicates a close connectivity among some known ASD-related proteins. An alternative method, the mean-first-passage-time (MFPT), uses diffusion-based random walks on networks instead of shortest paths. MFPT is the average number of steps a random walker takes to reach a specific node from a given node in the background network. In a comparison of different methods for classifying and recovering disease genes with background protein interaction networks, the MFPT approach appeared to outperform most other methods (Navlakha and Kingsford 2010). Berger et al. (2010) implemented an MFPT-based ranking system to identify a distinct disease gene neighborhood by exploring the relationship between known long-QT syndrome (LQTS) genes using a human protein interactome. Such network-based classifiers can be used to rank disease genes and candidate disease genes based on their proximity to the disease subnetwork locus. An alternative and related approach is to classify and rank disease genes based on the attributes of known disease genes. For example, the Support Vector Machine (SVM) (Byvatov and Schneider, 2003) is a popular supervised learning method that has been applied to classify genes based on their shared functional attributes (Xu et al. 2010). Additionally, Li et al. (2009) developed an SVM classifier trained with features that include protein interactions, protein domains and enriched GO terms of known cancer genes to prioritize putative cancer genes; more recently, a set of DNA repair genes was predicted by an SVM classifier trained with gene expression data (Jiang and Ching 2011). Similarly, we developed an SVM classifier for prioritizing pluripotency stem cell regulators from RNAi screens using microarray and ChIP-seq data (Xu et al. 2010). The SVM strategy could be applied to classify ASD and ID genes based on attributes extracted from heterogeneous data sources. Here we developed three supervised learning methods to classify and prioritize ASD and ID disease genes. Two of the classifiers are network-based and one is attribute-based. We find that such methods show promise in predicting and ranking ASD and ID genes and that there is significant overlap between these two disorders as well as in the top genes used for the classification. Furthermore, we identified subnetworks that connect the most informative genes to potentially point to the disease molecular loci. The use of such approaches in ongoing WES and WGS projects will help with gene and pathway identification in neurodevelopmental disorders and can be applied to other complex disorders as well.

APPROACH

ASD and ID gene lists

We made use of manually curated lists of genes implicated in ASD. We focused on genes where there was prior evidence of an etiological role in ASD (i.e., genes of major effect for ASD). We began with a carefully curated list of 103 such genes implicated in ASD, with or without intellectual disability (ID), from a recent review by one of us (CB) (Betancur 2010). In that study, an extensive literature search was conducted looking for articles describing genetic disorders in patients with autism, ASD, pervasive developmental disorder, Asperger syndrome, or autistic/autistic-like traits/features/behavior, using PubMed and Google Scholar, as well as follow-up of references cited in the papers thus identified. This list is meant to be as exhaustive as possible, and has therefore been routinely updated by the author using the same criteria, so that 11 additional genes have been added since the published report (BBS10, DPYD, FOLR1, GNS, GRIN2B, HEPACAM, HGSNAT, KCNJ11, NAGLU, SCN2A and STXBP1). The final list of 114 genes implicated in ASD (ASD114) is shown in Table I. Since most high-risk ASD genes were identified by unbiased genetic approaches (e.g., characterization of translocation breakpoints, recurrent copy number variants, X-linked genes first identified by linkage, etc.), ASD114 represents a largely unbiased list of such genes. A similar list was developed by the same author to include a very diverse, but not exhaustive, group of genes implicated in ID (n=223), which provided a means to assess the behavior of the classifiers against a separate list of neurodevelopmental genes. This gene list was developed in an analogous manner to the ASD gene list, and included genes implicated in ID that were not already in the ASD gene list (note that many genes on the ASD list are also considered genes for ID and many genes first identified in ID have since been shown to contribute to ASD; see Figure 1 in (Betancur, 2010) for many examples on the X chromosome). The ID list is shown in Table II. All gene lists were prepared and frozen before the start of the analyses described here.

Table 1. Seed lists of 114 ASD genes.

ACSL4, GRIN2B, PAH, ADSL, GUCY2D, PCDH19, AFF2, HEPACAM, PHF6, AGTR2, HGSNAT, PHF8, AHI1, HOXA1, POMGNT1, ALDH5A1, HRAS, POMT1, ALDH7A1, IGF2, PQBP1, AP1S2, IL1RAPL1, PRSS12, ARHGEF6, IQSEC2, PTCHD1, ARX, JARID1C, PTEN, ATRX, KCNJ11, PTPN11, BBS10, KIAA2022, RAB39B, BRAF, KRAS, RAI1, BTD, L1CAM, RNF135, CACNA1C, L2HGDH, RPE65, CACNA1F, LAMP2, RPGRIP1L, CASK, MAP2K1, SATB2, CDKL5, MBD5, SCN1A, CEP290, MECP2, SCN2A, CHD7, MED12, SGSH, CNTNAP2, MEF2C, SHANK2, CREBBP, MID1, SHANK3, DCX, MKKS, SLC6A8, DHCR7, NAGLU, SLC9A6, DMD, NDP, SMC1A, DMPK, NF1, STXBP1, DPYD, NFIX, SYN1, EHMT1, NHS, SYNGAP1, FGD1, NIPBL, TBX1, FGFR2, NLGN3, TSC1, FMR1, NLGN4X, TSC2, FOLR1, NPHP1, UBE3A, FOXG1, NRXN1, UPF3B, FOXP1, NSD1, VPS13B, FTSJ1, OCRL, YWHAE, GAMT, OPHN1, ZNF674, GATM, OTC, ZNF81, GNS, PAFAH1B1, GRIA3

Shortest path distance D_i (left) and MFPT score S_j (right) were computed for each node in the PPI network. The number of ID or ASD genes identified in each neighborhood within the specified cutoff range is shown on the left and the leave-one-out cross validation (LOOCV) of the seed gene lists is shown on the right.

Table 2. Seed lists of 223 ID genes.

ABCD1, CA8, FKTN, KIF7, PRPS1, ST3GAL3, AGA, CBL, FLNA, KIRREL3, PVRL1, STIL, AIPL1, CBS, FUCA1, KLF8, QDPR, SUMF1, ALG12, CC2D1A, GAD1, LAMA2, RAB18, SUOX, ALG3, CC2D2A, GALC, LARGE, RAB3GAP1, SYP, ALG6, CDH15, GALE, LCA5, RAB3GAP2, SYT14, ALG8, CDK5RAP2, GCH1, LRAT, RAF1, TBC1D24, ALG9, CENPJ, GDI1, MAGT1, RD3, TBCE, ANKH, CEP152, GFAP, MAN1B1, RDH12, TCF4, ANKRD11, CHKB, GK, MAOA, RECQL4, TGFBR1, AP4B1, COG1, GLB1, MAP2K2, RELN, TGFBR2, AP4E1, COG8, GNPTAB, MCOLN1, RPGRIP1, TIMM8A, AP4M1, COL4A1, GNPTG, MCPH1, RPS6KA3, TMEM216, AP4S1, CRB1, GPC3, MED17, SETBP1, TMEM67, ARFGEF2, CRBN, GPR56, MED23, SHOC2, TRAPPC9, ARG1, CRX, GPSN2, MGAT2, SHROOM4, TRIM32, ARHGEF9, CTSA, GRIK2, MKS1, SIL1, TSEN2, ARL13B, CUL4B, GRIN2A, MLL2, SLC12A6, TSEN54, ARL6, CYB5R3, GTF2H5, MOCS1, SLC16A2, TSPAN7, ASPM, DAG1, GUSB, MOCS2, SLC17A5, TTC8, ASXL1, DBT, HCCS, MPDU1, SLC1A1, TUBA1A, ATP6AP2, DIP2B, HDAC4, MYCN, SLC25A15, TUBB2B, ATP6V0A2, DKC1, HPRT1, NDUFA1, SLC25A22, TUSC3, ATP7A, DLD, HSD17B10, NEU1, SLC2A1, UBE2A, ATR, DLG3, HUWE1, NRAS, SLC35C1, VLDLR, AVPR2, DNMT3B, IDS, NSDHL, SLC46A1, VRK1, BBS1, DPM1, IDUA, OFD1, SLC4A4, WDR62, BBS12, EP300, IER3IP1, PAK3, SMC3, WDR81, BBS2, ERCC1, IGBP1, PAX6, SMS, ZC3H14, BBS4, ERCC2, IGF1, PCNT, SNAP29, ZDHHC9, BBS5, ERCC3, IKBKG, PDHA1, SOBP, ZEB2, BBS7, ERCC5, IMPDH1, PEX7, SOS1, ZNF41, BBS9, ERCC6, INPP5E, PGK1, SOX3, ZNF711, BCKDHA, ERCC8, KCNJ10, PLP1, SPATA7, BCKDHB, ERLIN2, KCNK9, PMM2, SPRED1, BCOR, FANCB, KIAA0226, PNKP, SPTAN1, BRWD3, FH, KIAA1033, POMT2, SRD5A3, C7ORF11, FKRP, KIAA1279, PORCN, SRPX2

Mammalian protein-protein interaction network

We collected protein-protein interaction (PPI) data from the following databases and papers: BioGrid (Stark et al., 2006), HPRD (Peri et al., 2004), InnateDB (Lynn et al., 2008), IntAct (Hermjakob et al., 2004), KEGG (Kanehisa et al. 2008), KEA (Lachmann and Ma'ayan, 2009), MINT (Chatr-aryamontri et al., 2007), MIPS (Mewes et al., 2004), DIP (Xenarios et al., 2000), BIND (Bader et al., 2003), BioCarta, PDZBase (Beuming et al., 2005), PPID, Yu et al. (2011), Stelzl et al. (2005), Ewing et al. (2007), Rual et al. (2005) and Ma’ayan et al. (2005). Gene/protein IDs were converted to Entrez gene symbols. To increase the confidence of the protein interaction dataset, we filtered the interaction table by removing interactions reported under PubMed identifiers (PMIDs) that are associated with more than 10 interactions. The final PPI network is fully connected and consists of 14,191 nodes and 64,741 non-redundant, high-confidence interactions. From the two gene lists, 82 genes from the ASD list and 158 from the ID gene list were found in the final PPI network. The filtered network can be found in Supplemental eTable 1 (see Supporting Information online).
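The publication-based filter described above can be illustrated with a minimal sketch. The tuple layout `(gene_a, gene_b, pmid)`, the function name, and the choice to count table rows per PMID are assumptions made for illustration; the source only states that interactions from PMIDs with more than 10 interactions were removed.

```python
from collections import Counter

def filter_by_publication(interactions, max_per_pmid=10):
    """Keep interactions whose source PMID reports at most max_per_pmid rows.

    interactions: list of (gene_a, gene_b, pmid) tuples (hypothetical layout).
    Returns a set of undirected, non-redundant gene pairs.
    """
    # Count how many interaction rows each publication contributes.
    rows_per_pmid = Counter(pmid for _, _, pmid in interactions)
    kept = set()
    for gene_a, gene_b, pmid in interactions:
        if rows_per_pmid[pmid] <= max_per_pmid:
            # Sort the pair so that (A, B) and (B, A) collapse to one edge.
            kept.add(tuple(sorted((gene_a, gene_b))))
    return kept
```

Sorting each pair is one simple way to obtain the non-redundant edge list mentioned in the text; the original pipeline may have deduplicated differently.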

Control and comparison gene lists

Six types of control gene lists were generated for statistical tests: completely random, degree-matched, brain expressed, gene-ontology biological process matched, gene-ontology molecular function matched, and gene-ontology cellular component matched. All of the control gene lists contained the same number of genes as the seed lists, which is 82 for ASD and 158 for ID. Random lists are made of randomly selected genes picked from the background network. To construct the degree-matched control lists, the connectivity degrees of all genes in the network were distributed into bins. Genes were then picked from the same bins as the seed genes. Since the genes in the ASD and ID lists are likely biased toward brain expression, we also took advantage of a dataset of brain-expressed genes (Kang et al. 2011) and randomly selected genes from this dataset. For that dataset, brain regions were dissected from 57 clinically unremarkable postmortem brains of donors ranging from 6 post-conceptual weeks to 82 years, which were divided into 15 periods based on age, and the expression levels of 17,565 protein-coding genes within each sample were assayed using the Affymetrix GeneChip Human Exon 1.0 ST Array platform. A list of “brain-expressed” genes (graciously provided by Drs. Stephan J Sanders and Kyle Meyer) included genes having a log2-transformed signal intensity ≥6 in at least one sample and a mean DABG P<0.01 in at least one brain region of at least one period. The GO-matched control lists were created using GO Slim. Using the Jaccard similarity score to assess the overlap of GO terms between pairs of genes, we randomly picked genes having at least a 0.4 Jaccard similarity score when compared to each of the original seed nodes.
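As a rough illustration of two of the control-list constructions, the sketch below draws degree-matched controls and computes the Jaccard similarity used for GO matching. The bin width, the sampling without replacement, and all function names are assumptions; the source does not specify how the degree bins were defined.

```python
import random
from collections import defaultdict

def degree_matched_controls(degree, seeds, bin_width=5, rng=None):
    """Pick one non-seed gene from the same degree bin as each seed gene.

    degree: {gene: connectivity degree}; bin_width is an assumed parameter.
    """
    rng = rng or random.Random(0)
    seed_set = set(seeds)
    bins = defaultdict(list)
    for gene, deg in degree.items():
        if gene not in seed_set:
            bins[deg // bin_width].append(gene)
    controls, used = [], set()
    for seed in seeds:
        candidates = [g for g in bins[degree[seed] // bin_width] if g not in used]
        if not candidates:
            continue  # no unused gene left in this bin
        pick = rng.choice(candidates)
        used.add(pick)
        controls.append(pick)
    return controls

def jaccard(terms_a, terms_b):
    """Jaccard similarity between two collections of GO terms."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

A GO-matched control would then be any randomly drawn gene whose `jaccard` score against a seed gene is at least 0.4, the threshold stated in the text.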

Shortest path algorithm for defining distance to seed gene list

The distance (D_i) from a given node to the seed gene list, i.e., the ASD or ID gene list, is defined as the average shortest path along the PPI network from the node to all genes in the seed list. The pair-wise shortest path length was obtained using Johnson’s algorithm in MATLAB. Dijkstra’s algorithm was implemented to obtain the specific nodes along the shortest paths. All genes in the PPI network were ranked according to their D_i to the seed gene list. We conducted leave-one-out cross validation (LOOCV) for the seed gene lists by leaving one seed gene out and computing the D_i from this gene to the rest of the seed genes. Receiver operating characteristic (ROC) curves were derived by gradually increasing the D_i cutoff. True positive rate (TPR) was defined as the proportion of genes from the input list with D_i shorter than an arbitrary cutoff, whereas false positive rate (FPR) was the proportion of genes with D_i shorter than the cutoff but not in the input list. Fifty lists were generated for each control list type for comparison, and the mean FPR and TPR for the 50 control lists were used to plot the ROC curves.
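The distance D_i and its leave-one-out variant can be sketched as follows. This is not the authors' MATLAB code: it swaps in breadth-first search, which gives the same shortest path lengths on an unweighted network, and it assumes a connected graph stored as an adjacency dictionary. Function names are illustrative.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def mean_distance_to_seeds(adj, node, seeds):
    """D_i: average shortest path from node to all seed genes other than itself."""
    dist = bfs_distances(adj, node)
    targets = [s for s in seeds if s != node]
    return sum(dist[s] for s in targets) / len(targets)

def loocv_distances(adj, seeds):
    """Leave each seed out in turn and score it against the remaining seeds."""
    return {s: mean_distance_to_seeds(adj, s, seeds) for s in seeds}
```

Because `mean_distance_to_seeds` excludes the query node from the targets, the same function serves both for ranking non-seed genes and for the leave-one-out evaluation of seed genes.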

Mean-first-passage-time to identify genes in a PPI neighborhood

Mean-first-passage-time (MFPT) is the average number of steps a random walker takes to reach a specific node from a given seed node and provides an alternative way to quantify the distance between pairs of genes. To explore the neighborhood of a list of seed nodes, we defined a module distance score S_j as the difference of MFPT steps starting from non-seed nodes in the background network, compared with starting from seed nodes, normalized by the average MFPT steps a random walker takes to reach the same node from a random start, as follows (Berger et al. 2010):

$S_{j} = \frac{\frac{\sum_{i \in n} \langle T_{ij} \rangle}{N_{n}} - \frac{\sum_{i \in s} \langle T_{ij} \rangle}{N_{s}}}{\frac{\sum_{i} \langle T_{ij} \rangle}{N_{s} + N_{n}}}$ (Equation 1)

where the sums over i ∈ s and i ∈ n run over the seed nodes and the non-seed nodes of the background network, respectively (matching the definition of S_j given before the equation), N_s and N_n are the numbers of nodes in those two sets, and T_ij is the matrix containing the pair-wise MFPT computed for the background network. A score S_j above zero therefore indicates that, on average, the target node is located closer to the seed genes than to other randomly selected genes in the background network. All nodes in the PPI network were ranked according to their S_j score obtained using ASD or ID genes as seed nodes.
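To make Equation 1 concrete, here is a small sketch that computes MFPT on a toy graph and then the score S_j. It assumes an unbiased random walk on an unweighted, connected network and solves the hitting-time equations T_ij = 1 + Σ_k P_ik T_kj exactly with a dense linear solver, which is only practical for small graphs. Excluding the target node itself from the averages is also an assumption, and the function names are illustrative rather than taken from the source.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]

def mfpt_to_target(adj, target):
    """Expected steps for a random walker to first reach target from each node."""
    nodes = [u for u in adj if u != target]
    index = {u: k for k, u in enumerate(nodes)}
    n = len(nodes)
    a = [[0.0] * n for _ in range(n)]
    for u in nodes:
        i = index[u]
        a[i][i] = 1.0
        for v in adj[u]:
            if v != target:  # steps into the target end the walk
                a[i][index[v]] -= 1.0 / len(adj[u])
    t = solve_linear(a, [1.0] * n)
    times = {u: t[index[u]] for u in nodes}
    times[target] = 0.0
    return times

def module_distance_score(adj, target, seeds):
    """S_j: (mean MFPT from non-seeds - mean MFPT from seeds) / mean MFPT overall."""
    times = mfpt_to_target(adj, target)
    seed_set = set(seeds)
    from_seeds = [times[u] for u in adj if u in seed_set and u != target]
    from_others = [times[u] for u in adj if u not in seed_set and u != target]
    mean = lambda xs: sum(xs) / len(xs)
    return (mean(from_others) - mean(from_seeds)) / mean(from_seeds + from_others)
```

On a five-node path with a single seed at one end, the node next to the seed receives a positive score and the node at the far end a negative one, in line with the interpretation that positive S_j marks nodes closer to the seed genes.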

Support Vector Machine (SVM) classifiers for predicting and ranking genes

We utilized 11 gene-set libraries to generate features/attributes for all genes or gene products from the network. The gene-set libraries were previously created by us for the program Lists2Networks (Lachmann and Ma'ayan, 2010) or downloaded from open online sources. These gene-set libraries include: GO biological processes, GO cellular components, and GO molecular functions (libraries 1–3); Transcription factor binding sites from TRANSFAC (Matys et al., 2003) or ChEA (Lachmann et al., 2010) (libraries 4–5); Metabolites associated with gene-lists (library 6); Knockout mouse phenotypes from the MGI-MP browser (Gkoutos et al., 2004) (7); microRNA targets from TargetScan (8); Structural domains (9); Hub proteins (10); and, gene signatures from GeneSigDB (Culhane et al., 2010) (11). In total we collected 8986 features represented as binary vectors each corresponding to a row entry in the gene-set library.
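The binary feature representation can be sketched as below, where every gene set (one row of a gene-set library) becomes one 0/1 feature. The nested-dictionary layout and the function name are assumptions for illustration, not the format used by Lists2Networks.

```python
def build_feature_matrix(libraries, genes):
    """Turn gene-set libraries into binary feature vectors, one per gene.

    libraries: {library_name: {term: set_of_member_genes}} (hypothetical layout).
    Returns (feature_names, rows) with rows[gene] a 0/1 list over all terms.
    """
    feature_names = [
        (lib, term) for lib in sorted(libraries) for term in sorted(libraries[lib])
    ]
    rows = {
        gene: [1 if gene in libraries[lib][term] else 0 for lib, term in feature_names]
        for gene in genes
    }
    return feature_names, rows
```

Applied to the 11 libraries listed above, this construction would yield one column per gene set, corresponding to the 8986 features reported in the text.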

To set up the SVM we created negative and positive gene list sets of the same size. The negative sets were randomly generated based on the various criteria described above: randomly chosen, degree matched, GO term matched, and brain expressed. Positive examples are always ASD or ID genes. For each type of control gene list, 10 lists were created for training and testing and 10 classifiers were generated.

For each classifier, features were ranked, and the top 200 features were selected based on mutual information computed between the feature and the class calculated as follows:

$MI(f, c) = H(c) + H(f) - H(c, f)$ (Equation 2)

where f is the feature, c is the class, and H denotes entropy. We selected 200 features after evaluating the performance of using 10, 50, 100, 150, or 200 features. The SVM classifiers map the data from the input space to a high-dimensional feature space in which classification can be performed by locating data points with respect to a hyperplane that separates, in our case, binary classes. The projection from the input space to the high-dimensional feature space is achieved by a kernel function, which is used to transform the data for optimization of the classification. In this study, a standard linear kernel was used since it performed best after trying several other types of kernel functions. Each classifier was subjected to 10-fold cross-validation: in each of the 10 rounds, nine tenths of the examples were used for training the SVM classifier and the remaining tenth was left out for testing the performance of the learned SVM classifier.
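The feature ranking in Equation 2 can be sketched directly from empirical entropies. The code below assumes discrete (here binary) features and class labels and uses base-2 logarithms; the function names and the dictionary layout for feature columns are illustrative assumptions.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def mutual_information(feature, labels):
    """MI(f, c) = H(c) + H(f) - H(c, f) for two equally long discrete sequences."""
    joint = list(zip(feature, labels))
    return entropy(labels) + entropy(feature) - entropy(joint)

def top_features(columns, labels, k=200):
    """Rank features by mutual information with the class and keep the top k.

    columns: {feature_name: list of 0/1 values, one per gene}.
    """
    scores = {name: mutual_information(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A feature that perfectly tracks a balanced binary class reaches 1 bit, while a feature independent of the class scores 0, so sorting by this value puts the most class-informative gene sets first.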

Five different scores were used to evaluate the classifiers’ performance.

Matthews correlation coefficient (MCC):
$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) (TP + FN) (TN + FP) (TN + FN)}}$ (Equation 3)
In this equation, TP is the number of true positives; TN is the number of true negatives; FP is the number of false positives; FN is the number of false negatives.
Accuracy:
$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$ (Equation 4)
Sensitivity:
$Sensitivity = \frac{TP}{TP + FN}$ (Equation 5)
Specificity:
$Specificity = \frac{TN}{TN + FP}$ (Equation 6)
The area under the ROC curve (AUC)

The AUC scores were computed using MATLAB’s built-in function ‘perfcurve’ from the Statistics Toolbox and the results are provided in Tables III and IV.
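For reference, Equations 3 to 6 translate into a few lines of code. The AUC helper below is not MATLAB's perfcurve: it is a substitute that uses the equivalent rank interpretation of the AUC (the probability that a positive example scores higher than a negative one). Returning 0 for the MCC when its denominator is zero is a common convention and an assumption here.

```python
from math import sqrt

def classification_scores(tp, tn, fp, fn):
    """MCC, accuracy, sensitivity and specificity from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def auc_from_scores(pos_scores, neg_scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties = 0.5)."""
    wins = sum((p > q) + 0.5 * (p == q) for p in pos_scores for q in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

The pairwise AUC is quadratic in the number of examples, which is acceptable for seed lists of the size used here (82 and 158 genes).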

Table 3. ASD or ID genes retrieved with attributes-based SVM classifiers.

Genes listed in the table are the intersection of the genes retrieved by all six SVM classifiers trained with different types of control gene lists.

ASD genes retrieved from ID genes classifier

AGTR2, MECP2, ALDH5A1, MEF2C, ALDH7A1, MID1, AP1S2, MKKS, ARHGEF6, NDP, ATRX, NFIX, BRAF, NIPBL, CACNA1C, NLGN3, CACNA1F, NLGN4X, CASK, NPHP1, CEP290, NRXN1, CHD7, NSD1, CNTNAP2, OCRL, CREBBP, PAFAH1B1, DCX, PAH, DMD, PQBP1, DMPK, PTEN, EHMT1, PTPN11, FGD1, RAI1, FGFR2, RNF135, FMR1, SATB2, FOXG1, SCN1A, FOXP1, SHANK2, GNS, SLC9A6, GRIN2B, SMC1A, HOXA1, STXBP1, HRAS, SYN1, IGF2, SYNGAP1, IL1RAPL1, TBX1, KCNJ11, TSC1, L1CAM, TSC2, LAMP2, UBE3A, MAP2K1, UPF3B, MBD5, YWHAE

ID genes retrieved from ASD genes classifier

ABCD1, DAG1, MCOLN1, COL4A1, AGA, DBT, MCPH1, CRB1, AIPL1, DKC1, MED17, CRBN, ALG8, DLD, MKS1, CRX, ALG9, DLG3, MLL2, CTSA, ANKRD11, DNMT3B, MYCN, CUL4B, AP4B1, DPM1, NEU1, CYB5R3, AP4E1, EP300, NRAS, TRAPPC9, ARFGEF2, ERCC1, OFD1, TRIM32, ARG1, ERCC2, PAK3, TTC8, ARHGEF9, ERCC3, PAX6, TUBA1A, ARL6, ERCC5, PDHA1, TUBB2B, ASPM, ERCC6, PEX7, ZNF41, ASXL1, ERCC8, PGK1, TIMM8A, ATP6AP2, FANCB, PLP1, KCNJ10, ATP6V0A2, FLNA, PNKP, KIAA0226, ATP7A, GAD1, PORCN, KIAA1033, ATR, GCH1, PRPS1, KIRREL3, AVPR2, GDI1, PVRL1, KLF8, BBS1, GFAP, QDPR, LAMA2, BBS2, GLB1, RAB3GAP1, MAOA, BBS4, GPC3, RAF1, TUSC3, BBS7, GRIK2, RECQL4, UBE2A, BCKDHA, GRIN2A, RELN, VLDLR, BCOR, GTF2H5, RPGRIP1, VRK1, CA8, HCCS, RPS6KA3, WDR62, CBL, HDAC4, SETBP1, ZEB2, CBS, HPRT1, SHOC2, SMS, CC2D1A, HUWE1, SIL1, SNAP29, CDH15, IDUA, SLC12A6, SOS1, CDK5RAP2, IGBP1, SLC1A1, SPRED1, CENPJ, IGF1, SLC2A1, SPTAN1, COG1, IKBKG, SLC4A4, ST3GAL3, COG8, IMPDH1, SMC3, STIL, TGFBR1, TBCE, SYP, SUMF1, TGFBR2, TCF4

Table 4. SVM classification results.

Matthews correlation coefficient (MCC), accuracy, sensitivity, specificity, and area under the curve (AUC) computed for all SVM classifiers.

ASD classifier	MCC	Accu.	Sens.	Spec.	AUC MEAN	AUC STD	Ret. ID	Ret. % ID	Ret. Rand.	Ret. % Ran.	P-value
BrExp	0.69	0.84	0.79	0.89	0.94	0.05	68	0.43	48.2	0.31	1.29E-28
Degree	0.80	0.90	0.84	0.95	0.97	0.04	64	0.41	34.5	0.22	3.04E-40
GO BP	0.72	0.85	0.76	0.95	0.95	0.07	64	0.41	42.3	0.27	1.02E-32
GO CC	0.67	0.84	0.78	0.89	0.96	0.05	64	0.41	37.6	0.24	5.72E-36
GO MF	0.79	0.89	0.82	0.96	0.97	0.04	78	0.49	50.1	0.32	3.22E-42
random	0.84	0.92	0.89	0.95	0.97	0.04	68	0.43	34.2	0.22	2.74E-39
ID classifier	MCC	Accu.	Sens.	Spec.	AUC MEAN	AUC STD	Ret. ID	Ret. % ID	Ret. Rand.	Ret. % Ran.	P-value
BrExp	0.52	0.76	0.75	0.78	0.86	0.07	49	0.60	35.7	0.44	1.64E-28
Degree	0.48	0.74	0.70	0.77	0.84	0.05	58	0.71	28.9	0.35	5.71E-40
GO BP	0.57	0.78	0.75	0.82	0.85	0.08	9	0.11	3.58	0.04	9.38E-24
GO CC	0.49	0.75	0.72	0.77	0.84	0.06	62	0.76	29.0	0.35	7.01E-43
GO MF	0.61	0.80	0.74	0.87	0.90	0.07	36	0.44	19.2	0.23	9.92E-33
random	0.53	0.77	0.71	0.82	0.87	0.07	52	0.63	22.7	0.28	2.87E-42

RESULTS

We made use of two non-overlapping lists of 114 known ASD and 223 ID genes reflecting rare, high-risk genes (Tables I and II). These two gene lists were collected from diverse sources using various methods, and the genes within those two lists, except for being enriched in neuronal functions and being brain expressed, do not have widespread functional relationships that are immediately apparent. Since the inclusion of the genes within these lists is imperfect, and many more genes are likely to be bona fide ASD or ID disease genes, the questions that we aim to address in our analysis here are as follows: For any gene that is not identified as an ASD or ID disease gene, can we predict whether the gene is likely to be an ASD or ID disease gene, i.e., can we rank genes by their likelihood of being an ASD or ID gene based on the known ASD/ID genes; and, can we prioritize and group the already known ASD and ID genes such that we can find functional relationships that connect known ID and ASD disease genes?

To address these questions we utilized prior knowledge about known mammalian PPI as well as functional annotations of human genes and their protein products. We first developed two related PPI network-based classifiers. These classifiers assume that the known ASD and ID genes form loci within the human interactome that reflect the dysregulation of molecular protein complexes in the human brain that lead to the associated phenotypes. If this is the case, and we have enough accurate information about binary protein interactions, we should be able to identify such loci and define the distance from such loci as a probability for genes to be found to contain variation that leads to the ASD or ID phenotype. To measure such distance we implemented two complementary approaches, the shortest path and MFPT classifiers. These two methods define an average distance between the seed disease genes and the rest of the genes within the PPI network. Note that of these two classifiers, the MFPT approach was reported previously to perform better because it reduces the influence of hub nodes and can reach nodes that do not necessarily fall on shortest paths (Navlakha and Kingsford, 2010).

To evaluate the ability of such network-based classifiers to predict and rank ASD and ID disease genes we implemented a leave-one-out cross validation (LOOCV) analysis, drawing neighborhood plots (Figs. 1–2) and plotting ROC curves (Fig. 3) to evaluate the performance of the two network-based classifiers. In this analysis we created several control lists and comparison lists to examine whether the classifiers outperform the misclassification of lists of genes with similar GO terms or lists of genes that are brain expressed. A first finding from our analysis is that the ASD and ID genes are closer to each other in PPI space than expected by chance (Table V) and that ID genes are significantly found in the ASD gene neighborhood (Figs. 1–3). Moreover, the classifiers correctly classify ID genes using ASD-gene derived classifiers and ASD genes using ID-gene derived classifiers more specifically than they misclassify control lists. In the case of brain-expressed genes, the MFPT ASD classifier can correctly classify ID genes with 75% accuracy and misclassifies brain-expressed genes on average only 63% of the time as potential ID genes (Fig. 3). The shortest-path-based ASD classifier performs slightly worse in classifying ID genes (70%). On the other hand the ID classifiers, MFPT or shortest-path-based, do not discriminate well between ASD genes and other genes, suggesting that the ID genes are more spread out randomly within the human interactome and are not good enough together to classify ASD genes.

The green frames show D_i and S_j scores chosen arbitrarily as disease neighborhoods.

The D_i of each gene to the seed list was calculated and the ROC curve was plotted by increasing the cutoff distance in steps of 0.05, starting from the minimum distance of all genes in the network. True positive rate (TPR) was defined as the number of genes from the query list with D_i shorter than the cutoff distance divided by the total number of genes in the list, and false positive rate (FPR) as the number of genes with D_i shorter than the cutoff distance but not in the query list divided by the total number of genes not in the list. Fifty lists were generated for each control type for comparison, as shown in different colors. The mean FPR and TPR for the 50 control lists were used to plot the ROC curve. In the AUC section, a t-test was performed with the null hypothesis that the AUC of ASD/ID gene identification can be achieved with random gene lists of each type. P values <0.0001 are indicated with double stars (**), and <0.01 with a single star (*). The ROC curve of S_j was plotted in the same way by increasing the cutoff rank by one gene at a time.

Table 5. Shortest path averages and standard deviations between all genes/proteins in the PPI network, or between ID, ASD, ASD+ID genes.

	All pairs	ID genes	ASD genes	ASD+ID genes
Avg	4.4817	4.0323	3.5265	3.8867
stddev	0.031	0.0491	0.0651	0.041

Next we developed an SVM classifier by collecting and combining gene-set libraries and setting each row from these libraries as a potential feature vector for classification. After sampling for various sizes of feature sets (Fig. 4C), we chose the top 200 features using mutual information (see Approach) to create SVM classifiers for ASD and ID genes. In our datasets, our SVM classifiers are capable of discriminating between ASD or ID genes and other genes with ~80%–98% accuracy (Fig. 4 and Tables III and IV), performing better than the network-based classifiers. Complete statistics of the performance of the SVM classifier are provided in Table IV. It is interesting to see that the ASD classifiers perform better than the ID classifiers, consistent with the network-based classifiers, further suggesting that the ID gene list contains a broader and less discriminative list of genes/gene-products. Looking at the top features that contribute most toward correct classification, we repeatedly observe that neuronal-related knockout mouse phenotypes associated with a given gene contributed important information for correct classification (Supplemental Table II – See Supporting Information online). For example, 7 of the top 10 features contributing to the ability of the SVM to distinguish between brain-expressed genes and ASD genes relied on knockout mouse phenotypes. Interestingly, the knockout mouse phenotypes included those associated with morphological abnormalities as well as abnormalities in nervous system function and behavior, all of which are associated with ASD and ID.

The classifiers were trained and tested by 10-fold cross-validation using seed genes and different types of control gene lists of the same size. An average ROC curve over the 10 folds is plotted for each classifier. Inset plots show the average AUC with standard deviation for each classifier.
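The evaluation protocol in this legend can be sketched generically. This is a hedged illustration rather than the authors' pipeline: the `fit` and `score` callables stand in for any classifier (an SVM in the paper), the stratified fold assignment is an assumption, and the AUC is computed with the rank-based (Mann-Whitney) formulation.

```python
import random
from statistics import mean, stdev


def auc_from_scores(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(y, k=10, seed=0):
    """Assign indices to k folds so that each class is spread evenly across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(members)
        for j, i in enumerate(members):
            folds[j % k].append(i)
    return folds


def cross_validated_auc(X, y, fit, score, k=10, seed=0):
    """Return (mean AUC, standard deviation of AUC) over k folds."""
    aucs = []
    for fold in stratified_folds(y, k, seed):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores = [score(model, X[i]) for i in fold]
        labels = [y[i] for i in fold]
        if 0 < sum(labels) < len(labels):  # AUC needs both classes in the fold
            aucs.append(auc_from_scores(scores, labels))
    return mean(aucs), (stdev(aucs) if len(aucs) > 1 else 0.0)


# Toy data: 20 controls and 20 seed genes separated by a single feature.
X = [[i] for i in range(40)]
y = [0] * 20 + [1] * 20
result = cross_validated_auc(X, y, fit=lambda Xtr, ytr: None,
                             score=lambda model, x: x[0])
print(result)
```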

As a final analysis, we attempted to integrate the results from all three classifiers by overlapping the top-ranked genes and identifying the functional connections between them. For this we took the genes within a certain cutoff from the network-based classifiers and the genes retrieved as positive by all SVM classifiers and examined their overlap (Fig. 5A–B). We then connected the genes that overlap among all three ASD or ID classifiers using protein-protein interactions and shared functional annotations: an edge was drawn if the two genes/gene products directly interact, or if the pair of genes shares a significant number of annotations as defined by the gene-set libraries we used for the SVM classifier. This resulted in four distinct clusters (Fig. 5E), consistent with the accumulating evidence that core pathways, common to ASD and ID, are perturbed in a recurrent manner in these related disorders.

Shortest path distances of 3.95 and 3.65 (shown in Fig. 2) were applied as cutoffs for the identification of (A) ASD genes and (B) ID genes, respectively. The number of SVM-retrieved genes is the intersection of genes retrieved by all six classifiers trained with different types of control gene lists. The 39 ASD genes and 59 ID genes identified by all three classifiers, as well as the combined 39+59 genes, were connected using functional association networks with the software Genes2FANs (http://actin.pharm.mssm.edu/genes2FANs). Direct interactions are shown in (C) for the 39 ASD genes, (D) for the 59 ID genes, and (E) for the 39+59 combined genes.
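The integration step (intersecting the gene sets retrieved by each classifier, then linking the consensus genes) can be sketched as below. This is not how Genes2FANs works internally. The function names are hypothetical, and the fixed `min_shared` threshold is a placeholder for the significance criterion on shared annotations, which is not specified in this section.

```python
from itertools import combinations


def consensus_genes(*gene_sets):
    """Genes retrieved by every classifier."""
    return set.intersection(*(set(s) for s in gene_sets))


def connect_genes(genes, ppi_edges, annotations, min_shared=2):
    """Link consensus genes by direct interaction or by shared annotations.

    ppi_edges:   iterable of (gene, gene) pairs from the PPI network
    annotations: dict mapping gene -> set of gene-set library terms
    min_shared:  placeholder threshold standing in for a significance test
    """
    ppi = {frozenset(edge) for edge in ppi_edges}
    edges = []
    for a, b in combinations(sorted(genes), 2):
        shared = len(annotations.get(a, set()) & annotations.get(b, set()))
        if frozenset((a, b)) in ppi:
            edges.append((a, b, "ppi"))
        elif shared >= min_shared:
            edges.append((a, b, "shared_annotation"))
    return edges


# Placeholder gene identifiers; not genes from the study.
distance_hits = {"G1", "G2", "G3", "G4"}
rank_hits = {"G1", "G2", "G3"}
svm_hits = {"G1", "G2", "G3", "G5"}
core = consensus_genes(distance_hits, rank_hits, svm_hits)
edges = connect_genes(
    core,
    ppi_edges=[("G1", "G2")],
    annotations={"G1": {"t9"}, "G2": {"t1", "t2", "t3"}, "G3": {"t2", "t3"}},
)
print(sorted(core), edges)
```

Clusters such as those in Fig. 5 would then correspond to connected groups in the resulting edge list.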

DISCUSSION

In this study we developed two PPI-based classifiers and one attribute-based SVM classifier to discriminate between ASD or ID disease genes and other genes. All three classifiers perform better than random classifiers; however, the PPI-based classifiers perform only slightly better than would be expected for classifying sets of genes with similar functional categories. In contrast, the SVM classifier performs well, likely because it relies on more data points. However, all classifiers report a relatively high rate of false positives. Nevertheless, all three classifiers point to a highly overlapping core of ASD and ID disease gene loci that organize into four clusters.

The use of these classifiers in ongoing gene and pathway discovery in neurodevelopmental disorders will facilitate discovery and the identification of high-value therapeutic targets. In addition, the hub genes and networks identified (e.g., Fig. 5C–E) can be experimentally perturbed in mouse models to observe their effects on phenotype and to be used for understanding of pathophysiology.

In this study gene expression data were not considered. However, such information can potentially be integrated within the classifiers. Gene expression can be added as attributes for the SVM classifier, or the differentially expressed genes between ASD or ID post-mortem brains, as compared to normal controls, can add confidence to genes within the disease protein interaction neighborhoods. In addition, we can apply the methods presented here to better define the distances and shared mechanisms between other complex diseases with genetic underpinning.

While the complex enigma of pathways and networks in neurodevelopmental disorders is not resolved by these analyses, organizing the accumulated knowledge about ASD and ID genes within a supervised framework is likely to contribute towards a better understanding of the genetic and biological underpinning of this family of complex disorders.

Supplementary Material

Supplemental Table I. PPI network.

Filtered PPI background network used to build the network-based classifiers.

NIHMS363397-supplement-Supp_Table_S1.xlsx (2.1 MB, xlsx)

Supplemental Table II. Top features for the ASD SVM classifiers.

Feature vectors from gene-set libraries that distinguish ASD and ID genes for the different ASD SVM classifiers.

NIHMS363397-supplement-Supp_Table_S2.xlsx (147.6 KB, xlsx)

ACKNOWLEDGMENTS

This work was supported by the Seaver Foundation and by NIH grants P50GM071558-03, R01DK088541-01A1, RC2LM010994-01 to AM. YK is a Seaver Graduate Fellow. We would like to thank Ruth Dannenfelser and Seth Berger for technical help and useful discussions.

Biographies

Yan Kou, M.Sc., is a graduate student at Mount Sinai School of Medicine, New York, working under the supervision of Drs. Ma’ayan and Buxbaum. She is interested in developing and applying systems biological methods to complex human disorders. Yan Kou is a Seaver Graduate Fellow.

Catalina Betancur, M.D., Ph.D., is director of research at the INSERM U952, CNRS UMR 7224, Université Pierre et Marie Curie, in Paris, France. Her work is focused on the elucidation of the genetic bases of autism spectrum disorders.

Huilei Xu, B.Sc., is a PhD graduate student at Mount Sinai School of Medicine, New York, working under the supervision of Drs. Ma’ayan and Ihor Lemischka. Huilei Xu is interested in developing and applying data mining methods in computational systems biology with the focus on understanding embryonic stem cell pluripotency and early differentiation.

Joseph D Buxbaum, M.Sc., Ph.D., is a Professor in the Departments of Psychiatry, Neuroscience, and Genetics and Genomic Sciences, at the Mount Sinai School of Medicine in New York. His interests are in understanding the causes of, and developing targeted treatment for, neuropsychiatric disorders. Dr. Buxbaum is the Director of the Seaver Autism Center for Research and Treatment as well as Chief for the Center of Excellence in Neurodevelopmental Disorders of the Friedman Brain Institute, both at Mount Sinai.

Avi Ma’ayan, M.Sc., Ph.D., is an Assistant professor in the Department of Pharmacology and Systems Therapeutics. His interests are in applying graph theory, machine learning and dimensionality reduction methods for integrating omics datasets collected from mammalian sources to better understand biological regulation on a global scale. Dr. Ma’ayan is the Director of the Bioinformatics and Network Analysis Core of the Systems Biology Center New York.

REFERENCES

Bader GD, Betel D, Hogue CWV. BIND: the Biomolecular Interaction Network Database. Nucl Acids Res. 2003;31(1):248–250. doi: 10.1093/nar/gkg056.
Berger S, Posner J, Ma'ayan A. Genes2Networks: connecting lists of gene symbols using mammalian protein interactions databases. BMC Bioinformatics. 2007;8(1):372. doi: 10.1186/1471-2105-8-372.
Berger SI, Ma'ayan A, Iyengar R. Systems pharmacology of arrhythmias. Sci Signal. 2010;3(118):ra30. doi: 10.1126/scisignal.2000723.
Betancur C. Etiological heterogeneity in autism spectrum disorders: More than 100 genetic and genomic disorders and still counting. Brain Research. 2010;1380(0):42–77. doi: 10.1016/j.brainres.2010.11.078.
Beuming T, Skrabanek L, Niv MY, Mukherjee P, Weinstein H. PDZBase: a protein-protein interaction database for PDZ-domains. Bioinformatics. 2005;21(6):827–828. doi: 10.1093/bioinformatics/bti098.
Byvatov E, Schneider G. Support vector machine applications in bioinformatics. Appl Bioinformatics. 2003;2(2):67–77.
Chatr-aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G. MINT: the Molecular INTeraction database. Nucl Acids Res. 2007;35(suppl_1):D572–D574. doi: 10.1093/nar/gkl950.
Chen J, Aronow B, Jegga A. Disease candidate gene identification and prioritization using protein interaction networks. BMC Bioinformatics. 2009;10(1):73. doi: 10.1186/1471-2105-10-73.
Culhane AC, Schwarzl T, Sultana R, Picard KC, Picard SC, Lu TH, Franklin KR, French SJ, Papenhausen G, Correll M, Quackenbush J. GeneSigDB a curated database of gene expression signatures. Nucleic Acids Research. 2010;38(suppl 1):D716–D725. doi: 10.1093/nar/gkp1015.
El-Fishawy P, State MW. The Genetics of Autism: Key Issues, Recent Findings, and Clinical Implications. The Psychiatric clinics of North America. 2009;33(1):83–105. doi: 10.1016/j.psc.2009.12.002.
Ewing RM, Chu P, Elisma F, Li H, Taylor P, Climie S, McBroom-Cerajewski L, Robinson MD, O'Connor L, Li M, Taylor R, Dharsee M, Ho Y, Heilbut A, Moore L, Zhang S, Ornatsky O, Bukhman YV, Ethier M, Sheng Y, Vasilescu J, Abu-Farha M, Lambert JP, Duewel HS, Stewart II, Kuehl B, Hogue K, Colwill K, Gladwish K, Muskat B, Kinach R, Adams SL, Moran MF, Morin GB, Topaloglou T, Figeys D. Large-scale mapping of human protein-protein interactions by mass spectrometry. Mol Syst Biol. 2007;3. doi: 10.1038/msb4100134.
Gilman SR, Iossifov I, Levy D, Ronemus M, Wigler M, Vitkup D. Rare de novo variants associated with autism implicate a large functional network of genes involved in formation and function of synapses. Neuron. 2011;70(5):898–907. doi: 10.1016/j.neuron.2011.05.021.
Gkoutos G, Green E, Mallon A-M, Hancock J, Davidson D. Using ontologies to describe mouse phenotypes. Genome Biology. 2004;6(1):R8. doi: 10.1186/gb-2004-6-1-r8.
Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, Margalit H, Armstrong J, Bairoch A, Cesareni G, Sherman D, Apweiler R. IntAct: an open source molecular interaction database. Nucl Acids Res. 2004;32(suppl_1):D452–D455. doi: 10.1093/nar/gkh052.
Jiang H, Ching WK. Classifying DNA repair genes by kernel-based support vector machines. Bioinformation. 2011;7(5):257–263. doi: 10.6026/97320630007257.
Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T, Yamanishi Y. KEGG for linking genomes to life and the environment. Nucleic Acids Research. 2008;36(suppl 1):D480–D484. doi: 10.1093/nar/gkm882.
Kang HJ, Kawasawa YI, Cheng F, Zhu Y, Xu X, Li M, Sousa AMM, Pletikos M, Meyer KA, Sedmak G, Guennel T, Shin Y, Johnson MB, Krsnik Z, Mayer S, Fertuzinhos S, Umlauf S, Lisgo SN, Vortmeyer A, Weinberger DR, Mane S, Hyde TM, Huttner A, Reimers M, Kleinman JE, Sestan N. Spatio-temporal transcriptome of the human brain. Nature. 2011;478(7370):483–489. doi: 10.1038/nature10523.
Kann MG. Advances in translational bioinformatics: computational approaches for the hunting of disease genes. Brief Bioinform. 2009;11(1):96–110. doi: 10.1093/bib/bbp048.
Lachmann A, Ma'ayan A. KEA: kinase enrichment analysis. Bioinformatics. 2009;25(5):684–686. doi: 10.1093/bioinformatics/btp026.
Lachmann A, Ma'ayan A. Lists2Networks: Integrated analysis of gene/protein lists. BMC Bioinformatics. 2010;11(1):87. doi: 10.1186/1471-2105-11-87.
Lachmann A, Xu H, Krishnan J, Berger SI, Mazloom AR, Ma'ayan A. ChEA: transcription factor regulation inferred from integrating genome-wide ChIP-X experiments. Bioinformatics. 2010;26(19):2438–2444. doi: 10.1093/bioinformatics/btq466.
Li L, Zhang K, Lee J, Cordes S, Davis DP, Tang Z. Discovering cancer genes by integrating network and functional properties. BMC Med Genomics. 2009;2:61. doi: 10.1186/1755-8794-2-61.
Lynn DJ, Winsor GL, Chan C, Richard N, Laird MR, Barsky A, Gardy JL, Roche FM, Chan THW, Shah N, et al. InnateDB: facilitating systems-level analyses of the mammalian innate immune response. Mol Syst Biol. 2008;4. doi: 10.1038/msb.2008.55.
Ma'ayan A, Jenkins SL, Neves S, Hasseldine A, Grace E, Dubin-Thaler B, Eungdamrong NJ, Weng G, Ram PT, Rice JJ, Kershenbaum A, Stolovitzky GA, Blitzer RD, Iyengar R. Formation of Regulatory Patterns During Signal Propagation in a Mammalian Cellular Network. Science. 2005;309(5737):1078–1083. doi: 10.1126/science.1108876.
Matys V, Fricke E, Geffers R, Gößling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, Kloos DU, Land S, Lewicki-Potapov B, Michael H, Münch R, Reuter I, Rotert S, Saxel H, Scheer M, Thiele S, Wingender E. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Research. 2003;31(1):374–378. doi: 10.1093/nar/gkg108.
Mewes H, Amid C, Arnold R, Frishman D, Guldener U, Mannhaupt G, Munsterkotter M, Pagel P, Strack N, Stumpflen V, Warfsmann J, Ruepp A. MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res. 2004;32:D41–D44. doi: 10.1093/nar/gkh092.
Navlakha S, Kingsford C. The power of protein interaction networks for associating genes with diseases. Bioinformatics. 2010;26(8):1057–1063. doi: 10.1093/bioinformatics/btq076.
Oti M, Snel B, Huynen MA, Brunner HG. Predicting disease genes using protein–protein interactions. Journal of Medical Genetics. 2006;43(8):691–698. doi: 10.1136/jmg.2006.041376.
Peri S, Navarro JD, Kristiansen TZ, Amanchy R, Surendranath V, Muthusamy B, Gandhi TKB, Chandrika KN, Deshpande N, Suresh S, Rashmi BP, Shanker K, Padma N, Niranjan V, Harsha HC, Talreja N, Vrushabendra BM, Ramya MA, Yatish AJ, Joy M, Shivashankar HN, Kavitha MP, Menezes M, Choudhury DR, Ghosh N, Saravana R, Chandran S, Mohan S, Jonnalagadda CK, Prasad CK, Kumar-Sinha C, Deshpande KS, Pandey A. Human protein reference database as a discovery resource for proteomics. Nucl Acids Res. 2004;32(suppl_1):D497–D501. doi: 10.1093/nar/gkh070.
Rual J-F, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, Gibbons FD, Dreze M, Ayivi-Guedehoussou N, Klitgord N, Simon C, Boxem M, Milstein S, Rosenberg J, Goldberg DS, Zhang LV, Wong SL, Franklin G, Li S, Albala JS, Lim J, Fraughton C, Llamosas E, Cevik S, Bex C, Lamesch P, Sikorski RS, Vandenhaute J, Zoghbi HY, Smolyar A, Bosak S, Sequerra R, Doucette-Stamm L, Cusick ME, Hill DE, Roth FP, Vidal M. Towards a proteome-scale map of the human protein-protein interaction network. Nature. 2005;437(7062):1173–1178. doi: 10.1038/nature04209.
Sakai Y, Shaw CA, Dawson BC, Dugas DV, Al-Mohtaseb Z, Hill DE, Zoghbi HY. Protein interactome reveals converging molecular pathways among autism disorders. Sci Transl Med. 2011;3(86):86ra49. doi: 10.1126/scitranslmed.3002166.
Sanders Stephan J, Ercan-Sencicek AG, Hus V, Luo R, Murtha Michael T, Moreno-De-Luca D, Chu Su H, Moreau Michael P, Gupta Abha R, Thomson Susanne A, Mason CE, Bilguvar K, Celestino-Soper PB, Choi M, Crawford EL, Davis L, Wright NR, Dhodapkar RM, DiCola M, DiLullo NM, Fernandez TV, Fielding-Singh V, Fishman DO, Frahm S, Garagaloyan R, Goh GS, Kammela S, Klei L, Lowe JK, Lund SC, McGrew AD, Meyer KA, Moffat WJ, Murdoch JD, O'Roak BJ, Ober GT, Pottenger RS, Raubeson MJ, Song Y, Wang Q, Yaspan BL, Yu TW, Yurkiewicz IR, Beaudet AL, Cantor RM, Curland M, Grice DE, Günel M, Lifton RP, Mane SM, Martin DM, Shaw CA, Sheldon M, Tischfield JA, Walsh CA, Morrow EM, Ledbetter DH, Fombonne E, Lord C, Martin CL, Brooks AI, Sutcliffe JS, Cook EH, Jr, Geschwind D, Roeder K, Devlin B, State MW. Multiple Recurrent De Novo CNVs, Including Duplications of the 7q11.23 Williams Syndrome Region, Are Strongly Associated with Autism. Neuron. 2011;70(5):863–885. doi: 10.1016/j.neuron.2011.05.002.
Stark C, Breitkreutz B-J, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general repository for interaction datasets. Nucl Acids Res. 2006;34(suppl_1):D535–D539. doi: 10.1093/nar/gkj109.
Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner M, Schoenherr A, Koeppen S, Timm J, Mintzlaff S, Abraham C, Bock N, Kietzmann S, Goedde A, Toksöz E, Droege A, Krobitsch S, Korn B, Birchmeier W, Lehrach H, Wanker EE. A Human Protein-Protein Interaction Network: A Resource for Annotating the Proteome. Cell. 2005;122(6):957–968. doi: 10.1016/j.cell.2005.08.029.
Topper S, Ober C, Das S. Exome sequencing and the genetics of intellectual disability. Clinical Genetics. 2011;80(2):117–126. doi: 10.1111/j.1399-0004.2011.01720.x.
Voineagu I, Wang X, Johnston P, Lowe JK, Tian Y, Horvath S, Mill J, Cantor RM, Blencowe BJ, Geschwind DH. Transcriptomic analysis of autistic brain reveals convergent molecular pathology. Nature. 2011;474(7351):380–384. doi: 10.1038/nature10110.
Xenarios I, Rice DW, Salwinski L, Baron MK, Marcotte EM, Eisenberg D. DIP: the Database of Interacting Proteins. Nucl Acids Res. 2000;28(1):289–291. doi: 10.1093/nar/28.1.289.
Xu H, Lemischka IR, Ma'ayan A. SVM classifier to predict genes important for self-renewal and pluripotency of mouse embryonic stem cells. BMC Syst Biol. 2010;4:173. doi: 10.1186/1752-0509-4-173.
Yu H, Tardivo L, Tam S, Weiner E, Gebreab F, Fan C, Svrzikapa N, Hirozane-Kishikawa T, Rietman E, Yang X, Sahalie J, Salehi-Ashtiani K, Hao T, Cusick ME, Hill DE, Roth FP, Braun P, Vidal M. Next-generation sequencing to generate interactome datasets. Nat Meth. 2011;8(6):478–480. doi: 10.1038/nmeth.1597.
Zhang W, Sun F, Jiang R. Integrating multiple protein-protein interaction networks to prioritize disease genes: a Bayesian regression approach. BMC Bioinformatics. 2011;12(Suppl 1):S11. doi: 10.1186/1471-2105-12-S1-S11.
Ziats MN, Rennert OM. Expression profiling of autism candidate genes during human brain development implicates central immune signaling pathways. PLoS One. 2011;6(9):e24691. doi: 10.1371/journal.pone.0024691.

