Heritable genotype contrast mining reveals novel gene associations specific to autism subgroups

Matt Spencer; Nicole Takahashi; Sounak Chakraborty; Judith Miles; Chi-Ren Shyu

doi:10.1016/j.jbi.2017.11.016

Author manuscript; available in PMC: 2019 Jan 1.

Published in final edited form as: J Biomed Inform. 2017 Nov 29;77:50–61. doi: 10.1016/j.jbi.2017.11.016

Matt Spencer ^a, Nicole Takahashi ^b, Sounak Chakraborty ^c, Judith Miles ^b,^d, Chi-Ren Shyu ^a,^e,^f,^*

PMCID: PMC5788310 NIHMSID: NIHMS926019 PMID: 29197649

Abstract

Though the genetic etiology of autism is complex, our understanding can be improved by identifying genes and gene-gene interactions that contribute to the development of specific autism subtypes. Identifying such gene groupings will allow individuals to be diagnosed and treated according to their precise characteristics. To this end, we developed a method to associate gene combinations with groups with shared autism traits, targeting genetic elements that distinguish patient populations with opposing phenotypes. Our computational method prioritizes genetic variants for genome-wide association, then utilizes Frequent Pattern Mining to highlight potential interactions between variants. We introduce a novel genotype assessment metric, the Unique Inherited Configuration support, which accounts for inheritance patterns observed in the nuclear family while estimating the impact of genetic variation on phenotype manifestation at the individual level. High-contrast variant combinations are tested for significant subgroup associations. We apply this method by contrasting autism subgroups defined by severe or mild manifestations of a phenotype. Significant associations connected 286 genes to the subgroups, including 193 novel autism candidates. 71 pairs of genes have joint associations with subgroups, presenting opportunities to investigate interacting functions. This study analyzed 12 autism subgroups, but our informatics method can explore other meaningful divisions of autism patients, and can further be applied to reveal precise genetic associations within other phenotypically heterogeneous disorders, such as Alzheimer’s disease.

Keywords: Data Mining, Autistic Disorder, Genetics, Frequent Pattern Mining

1 Introduction

Autism is defined by the presence of a core set of symptoms involving behavioral, social, and cognitive deficits.¹ These phenotypes are measured during diagnosis, often employing sub-scores specific to the individual phenotypes. Diagnostic sub-scores differ greatly between individuals with autism, demonstrating the disorder’s extensive phenotypic variation. The diversity among individuals bearing the same diagnosis is concerning, and suggests that autism is too broad a classification.² Overly broad grouping is problematic for association studies, as testing for associations over a diverse population severely reduces the power of the test. Thus, the autism research community has endeavored to study subgroups of children with shared attributes.^2–9 In this work, we further emphasize this focus by partitioning children into pairs of opposing subgroups and specifically searching for the major genetic differences between them.

The complex genetic etiology of autism rivals its phenotypic variability, supporting the contemporary consensus that autism is a collection of etiologically distinctive disorders that cause a consistently recognizable phenotype. De novo mutations,^10–12 inherited variation,^13–15 and environmental factors^16–18 have all been linked to autism development. Familiar genetic disorders (e.g., Fragile X syndrome) appear in approximately 10% of autism cases.⁶ This multiplicity suggests that the onset of autism, and even the onset of specific autism subtypes, is likely due to a combination of factors, possibly including both genetic and environmental elements.^{6, 19, 20} Our knowledge about interactions between risk factors during the development of autism is insufficient, stressing the need for further examination of the interactions of multiple genotypes.²¹

It is estimated that around half of the genetic contribution of autism development stems from common variants.²² Furthermore, gene-gene interactions are often responsible for the complexity of disease susceptibility, to the point where gene interactions are considered to be more important than the independent effects of the genes in some cases.²³ Associations approximating these genetic interactions are often measured at the nucleotide level, which can be accomplished by examining the joint association of Single Nucleotide Polymorphisms (SNPs) to a disease. However, SNP microarrays often comprise millions of genotypes, making it impossible to examine every combination of these SNPs. The important question becomes: how will we decide which combinations of SNPs (denoted as “SNP-sets”) to test?

Most prior attempts to address this challenge have concentrated on the selection of SNPs based on prior knowledge, such as SNPs from genes closely related to an autism phenotype.^24–30 These studies tend to consider a limited number of loci (<100 SNPs) and are hypothesis-specific, reducing the potential for novel discoveries. Existing methods, such as the popular genome association analysis toolkit PLINK,³¹ are capable of testing epistasis effects, but are limited to pairs of SNPs and test them exhaustively, which is inefficient and time-consuming. By searching all genes for significant interactions, we can discover novel autism candidate genes and interactions between novel and known autism candidate genes. This was previously impossible due to the sheer number of SNPs within genes. Fortunately, recent advances in computational power using distributed computing techniques make it possible for the autism research community to have a more comprehensive view of SNP-sets associated with the disorder.

To discover genotype combinations relevant to the differentiation of specific autism subgroups while avoiding the limitations of manually selecting genotypes to combine, we developed a novel procedure called Heritable Genotype Contrast Mining (HGCM). This multi-disciplinary method integrates data mining techniques with traditional bioinformatics strategies to address the combinatorial problem of testing combinations of SNPs while searching for associations. HGCM avoids the limitations of pre-selecting SNPs by utilizing genome-wide SNP prioritization, facilitating the discovery of novel associations with autism. The data are partitioned into autism subgroups early in the process, allowing our method to highlight combinations of SNPs that are abundant in specific subgroups using a Frequent Pattern Mining algorithm. Opposing subgroups are then contrasted to reveal differentiating SNP-sets and identify genetic associations specifically relevant to the subgroups. HGCM was developed with focuses on testing combinations of SNPs and comparing disease subgroups, and can be applied to investigate the etiology of any disease with complex heritable genetic contribution and subtype structures.

2 Material and methods

The data analyzed in this study were obtained from the Simons Foundation Autism Research Initiative (SFARI) Simons Simplex Collection (SSC).³² SSC contains copious data from 2,591 simplex families, where simplex refers to families with exactly one child diagnosed with autism (the “proband”). This amounts to SNP microarray genotypes and phenotypic data describing 11,560 individuals, including probands age 4–17 and their parents and unaffected siblings. SSC includes families from the USA and Canada, but otherwise has no specific geographical restriction.

The SSC genotype dataset was too large to be analyzed by normal means, particularly since our aim was to examine the combinatorial search space of many SNPs. To address this, we utilized a research computing environment comprising 192 cores and 960GB of memory spread over eight compute nodes. Data storage was managed by Apache Hadoop,³³ a framework that supports data distribution and analysis over multiple machines. Resource-demanding computations were executed using Apache Spark,³⁴ an in-memory distributed-computing framework.

2.1 Procedure overview

We describe the full HGCM procedure (Figure 1) with details about each step in the following sections. As a preprocessing step, missing genotypes are imputed. Pairs of opposite subgroups are formed using existing autism subtype classifications or autism behavior scores from the widely used Social Responsiveness Scale.³⁵ SNPs are tested for primary association with each subgroup using a genome-wide prioritization procedure, and the most significant SNPs are selected. Our extended Frequent Pattern Mining implementation identifies combinations of these selected SNPs that are prevalent in the subgroups and evaluates these prevalent SNPs for their potential to contribute to autism development in the specific genetic context of the individual families. The prevalence of SNP combinations is contrasted within opposite subgroups, and high-contrast genotype combinations are tested for association with the corresponding subgroup.

Figure 1. Overview of the Heritable Genotype Contrast Mining procedure. The major operations are indicated in boxed groups, numbered in order of occurrence in the workflow. Abbreviations: SFARI=Simons Foundation Autism Research Initiative, SNP=Single Nucleotide Polymorphism, SRSP=Social Responsiveness Scale – Parent Report, FBAT=Family-Based Association Test.

2.2 Missing genotype imputation

SSC data include genotypes measured by three genotyping arrays.³⁶ The number of loci consistent with dbSNP build 147³⁷ measured by each array is shown in Figure 1. Genotypes were preprocessed using a step described by Verma et al.³⁸ to standardize the genotype measurements from the different arrays, a process common to meta-analyses called missing genotype imputation.³⁹ The bioinformatics tool Beagle (version 4.1)⁴⁰ was used to infer genotypes that were missing due to the differences in array measurements, resulting in 2,950,235 SNP genotypes for all individuals.

2.3 Frequent Pattern Mining

Frequent Pattern Mining (FPM) is a data mining technique that excels at identifying combinations of features that occur repeatedly (i.e. frequent patterns).^{41, 42} For this study, our goal was to utilize FPM to ascertain the prevalence of SNPs and SNP-sets within autism populations. FPM requires data to be translated into binary “items”, with the two states indicating the presence or absence of the item in a person. To satisfy this constraint, bi-allelic SNPs, which have three states (homozygous for the major allele, heterozygous, or homozygous for the minor allele), must be condensed into a two-state representation. HGCM does this by combining genotypes containing the putative major allele into the “absent” state, while the “present” state contains only the homozygous minor allele genotype.
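The binary item construction described above can be illustrated with a minimal sketch. It assumes genotypes are coded as minor-allele counts (0, 1, or 2), which is a common convention but not one the text specifies; the function names and SNP identifiers are hypothetical.

```python
def genotype_to_item(minor_allele_count):
    """Map a bi-allelic genotype to a binary item.

    Assumed coding: 0 = homozygous major, 1 = heterozygous, 2 = homozygous minor.
    Only the homozygous minor genotype yields the "present" state (True).
    """
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("expected a minor-allele count of 0, 1, or 2")
    return minor_allele_count == 2


def person_items(genotypes):
    """Return the set of SNP identifiers whose item is present in one person."""
    return frozenset(
        snp for snp, count in genotypes.items() if genotype_to_item(count)
    )


# Hypothetical genotypes for one individual.
example_genotypes = {"snp_A": 2, "snp_B": 1, "snp_C": 0, "snp_D": 2}
example_items = person_items(example_genotypes)
print(sorted(example_items))
```

Heterozygous and homozygous major genotypes both collapse into the "absent" state, so only `snp_A` and `snp_D` become items for this individual.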

This binary construction accounts fully for the case where a genotype has a recessive effect. It will still account for dominant and additive effects of the minor allele, since both of these cases would lead to an enrichment of the homozygous genotype (and thus the present item state). Although these genotypes are handled less elegantly, their effects are likely to be detected. Furthermore, the consequences of this strategy are mitigated because the resulting SNP-sets are subsequently tested for association in a manner that makes no assumptions about the most likely genetic model utilized by any variant.

Once genotypes are converted into items, the population prevalence of each item is calculated; this is called the “support” of the item (Figure 2A). This metric can be extended to groups of items, or SNP-sets: the support of a SNP-set is the proportion of people with all the items in the set. The support of a SNP-set will never exceed the support of its subsets. Thus, FPM automatically rejects supersets of SNP-sets that do not meet a specified support threshold, or “minimum support”. This allows for drastic reductions in the number of SNP-sets that must be examined (Figure 2B).
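The support calculation and the pruning rule can be sketched as a small level-wise (Apriori-style) search. This is a single-machine illustration under simplifying assumptions, not the Spark-based implementation used in the study; the function names and the toy population are hypothetical.

```python
from itertools import combinations


def support(snp_set, people):
    """Proportion of people who carry every item in snp_set."""
    carriers = sum(1 for items in people if snp_set <= items)
    return carriers / len(people)


def frequent_snp_sets(people, min_support, max_size=2):
    """Level-wise search that only extends SNP-sets meeting min_support."""
    singles = sorted({snp for items in people for snp in items})
    level = [frozenset([snp]) for snp in singles]
    frequent = {}
    size = 1
    while level and size <= max_size:
        kept = []
        for candidate in level:
            sup = support(candidate, people)
            if sup >= min_support:
                frequent[candidate] = sup
                kept.append(candidate)
        size += 1
        next_level = set()
        for a, b in combinations(kept, 2):
            union = a | b
            # Keep a candidate only if every subset one item smaller is
            # frequent; an infrequent subset rules out all of its supersets.
            if len(union) == size and all(
                frozenset(sub) in frequent
                for sub in combinations(union, size - 1)
            ):
                next_level.add(union)
        level = sorted(next_level, key=sorted)
    return frequent


# Toy population: each person is the set of items they carry.
toy_people = [frozenset("AB"), frozenset("ABC"), frozenset("A"), frozenset("BC")]
toy_frequent = frequent_snp_sets(toy_people, min_support=0.5, max_size=3)
for snp_set, sup in sorted(toy_frequent.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(snp_set), sup)
```

In this toy population the pair {A, C} falls below the minimum support, so the triple {A, B, C} is never evaluated.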

Figure 2. Overview of Frequent Pattern Mining and Contrast Mining. (A) Example calculation of the support of a SNP-set. (B) When a SNP-set is infrequent (support is less than the specified minimum support, indicated by red shading), so are all its supersets. One single-SNP set “A” being infrequent guarantees that all SNP-sets including “A” are also infrequent, significantly reducing the combinatorial search space. (C) A SNP-set that is eliminated from a FPM analysis of the whole population may pass the Min Sup threshold when examining a specific subgroup. Abbreviations: Sup=support, Min Sup=minimum support.

The memory requirement of storing many combinations of items is a challenge of FPM, as the number of combinations grows exponentially when analyzing tens of thousands of items. We impose a heavy burden on the FPM algorithm by including SNPs that are representative of the entire genome, to the point where the analysis exceeds the capabilities of typical computing systems. Furthermore, we aimed to develop a system that incorporates as much of the available data as possible while searching for associations, leading to our creation of the Unique Inherited Configuration support metric described below. Thus, for the HGCM procedure we developed a customized Spark-enabled implementation of FPM, integrating our novel metric into the algorithm and utilizing a distributed in-memory computing environment capable of analyzing the large dataset.

2.4 Population division

FPM tools avoid examining all possible combinations of items by utilizing minimum support thresholds (Figure 2B). Although this is computationally beneficial, it is limiting from a knowledge discovery standpoint. When a small homogeneous subgroup is diluted in a large diverse population, patterns specific to the subgroup are often eliminated by FPM. However, examining the subgroup in isolation allows these patterns to emerge (see Figure 2C). Thus, HGCM features a procedure known as Contrast Mining^43–45 which divides the autism cohort into subgroups, performs FPM on each subgroup individually, and compares opposite subgroups to identify SNP-sets which appear frequently in one subgroup but rarely in the other.

SSC families were divided into subgroup pairs based on characteristics of the proband (Table 1). One subgroup pair was formed using a previously defined subtype classification characterized by morphological categories. The Autism Dysmorphology Measure classifies probands as “dysmorphic” or “nondysmorphic” (equivalently, complex or essential) according to a decision tree based on the presence of explicit physical abnormalities.⁷ Five more subgroup pairs were formed representing the range of severities for the sub-scores of the Social Responsiveness Scale (SRS): Parent Report³⁵ (chosen due to high response rate). These sub-scores measure five categories: the ability to notice social cues (awareness), the ability to interpret social cues (cognition), the ability to communicate expressively (communication), the motivation to engage in social behavior (motivation), and the expression of stereotypical behaviors or restricted interests (mannerisms). Appendix A details subgroup inclusion criteria using SSC terminology.

Table 1.

Names and descriptions for the examined subgroups, displayed as opposing subgroup pairs.

Subgroup 1				Subgroup 2
Name	Description	Size		Name	Description	Size
dysmorphic	Significant dysmorphology	79	vs	nondysmorphic	No significant dysmorphology	462
high-severity awareness	High SRS-Parent Report Social Awareness sub-score	1129	vs	low-severity awareness	Low SRS-Parent Report Social Awareness sub-score	374
high-severity cognition	High SRS-Parent Report Social Cognition sub-score	1860	vs	low-severity cognition	Low SRS-Parent Report Social Cognition sub-score	171
high-severity communication	High SRS-Parent Report Social Communication sub-score	1786	vs	low-severity communication	Low SRS-Parent Report Social Communication sub-score	201
high-severity mannerism	High SRS-Parent Report Autistic Mannerisms sub-score	1912	vs	low-severity mannerism	Low SRS-Parent Report Autistic Mannerisms sub-score	202
high-severity motivation	High SRS-Parent Report Social Motivation sub-score	1286	vs	low-severity motivation	Low SRS-Parent Report Social Motivation sub-score	398

2.5 Genome-wide SNP prioritization

We return to the question: how do we decide which SNP-sets to test? FPM algorithms answer this by utilizing the minimum support threshold: SNP-sets will be filtered according to their prevalence in the affected population. However, this approach fails to account for linkage disequilibrium, a known phenomenon causing associations between SNPs. Frequently co-occurring SNPs will often be those that have a physical association, rather than an association with autism, and most of these SNP-sets will have no association with autism when comparing cases and controls. However, the FPM algorithm does not account for controls: the analysis examines a group in isolation (note that cases and controls are compared in other stages of HGCM).

To overcome this, HGCM identifies SNPs with some evidence of a primary effect for the disorder using Bioconductor’s GWASTools⁴⁶ package. Minor allele frequencies within probands and unaffected family member controls are used to perform logistic regression analyses, determining the association of each SNP with the affected population. Generally, it is important to choose a p-value cutoff that corrects for the testing of millions of loci.⁴⁷ However, our purpose for finding the strength of single-locus associations in HGCM is not to make statistical claims, but rather to select SNPs to combine into SNP-sets. Thus, HGCM selects the 30,000 most significant SNPs, as this roughly corresponds to a p-value cutoff of 0.05 for most subgroups (as opposed to the stringent Bonferroni threshold of 1.67e-8 typically used for GWAS).
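The selection step can be sketched as a simple top-k ranking of per-SNP p-values. The snippet below assumes the p-values have already been computed elsewhere (for example by a logistic regression); the function names and sample values are hypothetical. It also reproduces the quoted Bonferroni threshold under the assumption that it corresponds to roughly three million tests, a divisor the text does not state explicitly.

```python
import heapq


def select_top_snps(p_values, k):
    """Return the k SNP identifiers with the smallest p-values, best first."""
    best = heapq.nsmallest(k, p_values.items(), key=lambda kv: kv[1])
    return [snp for snp, _ in best]


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Hypothetical p-values for four SNPs; HGCM would use k = 30,000.
toy_p_values = {"snp_A": 0.20, "snp_B": 0.001, "snp_C": 0.04, "snp_D": 0.50}
toy_selected = select_top_snps(toy_p_values, k=2)
print(toy_selected)

# Assumed divisor of about three million tests.
print(f"{bonferroni_threshold(0.05, 3_000_000):.2e}")
```

A fixed k keeps the input to Frequent Pattern Mining at a predictable size, whereas a fixed p-value cutoff would let the number of selected SNPs vary by subgroup.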

2.6 Contrast Mining utilizing the UICsup

The direct application of FPM in this context would consider only the data gathered for the autism probands, but the data additionally include valuable information within the genotypes of immediate family members. Many of the autistic probands with a prevalent genotype have unaffected family members with the same genotype, providing evidence against the genotype’s contribution to autism development in the specific proband. Such a comparison between the genotypes of probands and their unaffected family members allows genotypes to be considered for their potential contribution to autism development in the context of the unmeasured genetic landscape of the family and in relatively similar environmental conditions.

To generate clinically relevant association candidates, it is necessary to consider the available information provided by the genotypes of close family members. Thus, we extended the Frequent Pattern Mining algorithm to calculate not only the support of the SNP-sets, but also an adjusted version of the support that accounts for inheritance patterns. We call this novel metric the “Unique Inherited Configuration support” (UICsup) because it calculates the proportion of probands that are the only member of their nuclear family bearing all the items in the SNP-set, thus the proband has a unique configuration of the inherited variants. Our incorporation of this new metric into the data mining procedure allows us to generate SNP-set candidates with higher potential for strong disease association, since UICsup is stricter than the unmodified support and accounts for more of the available data.

Figure 3A depicts some patterns of SNP inheritance where all the displayed probands are included in the support for the depicted SNP-set, but not all of them would contribute to UICsup. HGCM integrates this novel metric accounting for heritable genotypes into the traditional FPM procedure by calculating the UICsup of each SNP-set (Figure 3B) along with the support of the SNP-set. UICsup and the unaltered support are both considered during the Contrast Mining calculations to identify the SNP-sets with high-contrast between opposing subgroups. SNP-sets with a major increase of prevalence in either the support or the UICsup from one subgroup to another are highlighted as high-contrast candidates that will be tested for significant association.

Figure 3. Demonstration of the Unique Inherited Configuration support (UICsup) metric. (A) Various examples of SNP inheritance patterns from mothers (“M”) and fathers (“F”) to probands (“P”) and their unaffected siblings (“S”). Bars across chromatids represent a SNP minor allele. Circles highlight when SNPs are homozygous for all the minor alleles of the SNP-set in question, equivalently indicating when a person would be assigned the item for the SNP-set. Note that all these probands have the depicted items. The right column indicates whether the proband has a unique configuration of inherited variants, and therefore is included in the UICsup. Families 1 and 2 show single-SNP inheritance patterns; families 3 and 4 show how the metric is applied for 2-SNP inheritance patterns. Family 1 and 3 probands do not have a unique configuration of the inherited variants because a non-proband family member has the item. Family 2 and 4 probands do have unique configurations of the inherited variants because they are the only family members with all the items in the SNP-set. (B) Example calculation of the support and UICsup of an SNP-set.

In our application, candidates are generated when the difference in prevalence between subgroups exceeds 50%, to ensure that contrasting genes have sufficient discriminatory power between subgroups. The specific value chosen for this threshold presents a choice between increasing the burden on the statistical testing and increasing the precision of the generated candidates. Reducing the threshold will generate more candidates, but a higher proportion of them will be deemed insignificant by subsequent testing. Furthermore, we consider it important that any discovered results apply to a meaningful proportion of the examined groups. In our experience with the SSC dataset, a prevalence difference of 50% generated enough candidates to yield interesting results while preventing the need for unjustified burden on the statistical testing. This value can be adjusted in future applications to accommodate the situation.
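The contrast step can be sketched as a comparison of per-subgroup prevalence tables. The sketch makes two assumptions that the text does not confirm: the 50% criterion is read as an absolute difference in proportion, and a SNP-set absent from one subgroup's table (for example because it fell below the minimum support there) is treated as having zero prevalence. All names and values are hypothetical.

```python
def high_contrast_candidates(stats_a, stats_b, min_difference=0.5):
    """Return SNP-sets whose support or UICsup differs strongly between subgroups.

    stats_a, stats_b: dicts mapping a SNP-set to a (support, UICsup) tuple.
    A SNP-set qualifies when either metric differs by more than min_difference.
    """
    candidates = {}
    for snp_set in set(stats_a) | set(stats_b):
        sup_a, uic_a = stats_a.get(snp_set, (0.0, 0.0))
        sup_b, uic_b = stats_b.get(snp_set, (0.0, 0.0))
        difference = max(abs(sup_a - sup_b), abs(uic_a - uic_b))
        if difference > min_difference:
            candidates[snp_set] = difference
    return candidates


# Hypothetical prevalence tables for a pair of opposing subgroups.
set_x = frozenset({"snp_A", "snp_B"})
set_y = frozenset({"snp_C"})
toy_subgroup_1 = {set_x: (0.8, 0.6), set_y: (0.40, 0.3)}
toy_subgroup_2 = {set_x: (0.2, 0.1), set_y: (0.35, 0.3)}
toy_candidates = high_contrast_candidates(toy_subgroup_1, toy_subgroup_2)
print([sorted(s) for s in toy_candidates])
```

Lowering `min_difference` admits more candidates into statistical testing, which is the trade-off the paragraph above describes.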

2.7 FBAT statistical testing

SNP-set candidates are tested for statistical association using the Family-Based Association Test (FBAT).⁴⁸ We adjust for the testing of multiple hypotheses using the Benjamini-Hochberg correction.⁴⁹ With a simplex family sample, individuals within each family have much stronger genetic relationships than the average relationship between sampled individuals. In this case, the standard association tests used for unrelated individuals become biased. FBAT avoids this bias due to mixed relatedness by using within-family comparisons compatible with several different family structures. Only the SNP-sets containing SNPs on different chromosomes are tested, as this statistical procedure does not adequately account for physical linkage. Genes are considered to be associated with a subgroup when at least one SNP within the gene is significantly associated with the subgroup.
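The multiple-testing adjustment can be illustrated with a plain implementation of the Benjamini-Hochberg step-up procedure. This sketch covers only the correction applied to a list of p-values; it does not implement FBAT itself, and the input p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value never exceeds the adjusted value of a larger p-value.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for four SNP-set candidates.
toy_raw = [0.01, 0.04, 0.03, 0.005]
toy_adjusted = benjamini_hochberg(toy_raw)
toy_significant = [p < 0.05 for p in toy_adjusted]
print(toy_adjusted, toy_significant)
```

A candidate would then be retained when its adjusted p-value is below 0.05, matching the "adj. p<0.05" criterion reported in the results.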

2.8 Examination of discoveries

We examined the discovered genes by seeing how many genes are selected during the HGCM procedure steps, and by comparing with previous autism literature. The two major selection criteria are (1) high-contrast SNPs with major differences in subgroup prevalence and (2) SNPs significant using the FBAT. Genes containing at least one selected SNP were considered to pass the HGCM selection process. We compared with previous literature to determine which genes are already believed to be relevant to autism. The genes passing the selection processes were cross-referenced with AutDB 3.0⁵⁰, a reference for genes associated with autism also known as SFARI Gene, which contains candidate genes with varying levels of confidence from studies with varying specific research settings. We also implemented a broad search to identify HGCM genes that could be found in NCBI PubMed abstracts related to autism research using the search criteria “(autism OR asd) AND (gene AND <gene name>)”, where the name of each gene was inserted.
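The PubMed search criteria can be generated per gene with a small helper. The sketch only builds the query strings from the template quoted above; how the queries were submitted to PubMed is not described in the text, so no submission step is shown. The function name is hypothetical, and the gene symbols are ones mentioned elsewhere in this paper.

```python
def pubmed_query(gene_name):
    """Fill the gene name into the search template quoted in the text."""
    return f"(autism OR asd) AND (gene AND {gene_name})"


toy_queries = [pubmed_query(gene) for gene in ["DMD", "HS6ST2", "PCDH11X"]]
for query in toy_queries:
    print(query)
```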

3 Results

3.1 Associations with autism subgroups

Significant genetic associations were found within 10 autism subgroups, including a maximum of 172 genes associated with the dysmorphic subgroup (Figure 4, blue bars). In total, 286 distinct genes are associated with at least one autism subgroup (adj. p<0.05). Of these, 193 are potentially novel candidate autism genes (red bars), as they were not present in AutDB or found in the PubMed abstract search. For each subgroup contrast, multiple novel genes distinguished each subgroup from its counterpart.

Figure 4. The number of genes HGCM found to be significantly associated with each subgroup. Subgroup pairs are aligned vertically. Gene counts are partitioned according to the presence of the gene in the AutDB gene database and PubMed abstract search (known) and absence from prior literature (novel). Bars labeled “Pre-UICsup” indicate the number of genes that would have been found without the implementation of the UICsup metric, whereas “UICsup” bars indicate the additional genes that were found due to the incorporation of UICsup. Abbreviations: high=high-severity, low=low-severity, comm=communication, cog=cognition, manner=mannerisms, aware=awareness, motiv=motivation.

This results section primarily focuses on the 286 associated protein-coding genes, as these have direct functional implications. Additionally, 84 non-coding RNAs (predominantly lincRNAs) have significant associations, and 56.8% of the significant SNPs (689 SNPs) are not within any known genes.

3.2 Comparison with previous literature

We calculated the number of genes within which SNPs passed the major selection processes of HGCM; these are listed in Table 2. The 894 genes with high subgroup contrast were selected from a pool of 11,339 genes that were prevalent in at least one subgroup. Thus, contrast mining eliminated 92% of the pool of genes; of the selected genes, 32% survived the FBAT. This indicates that the contrast mining procedure accomplished its intended purpose of identifying strong candidate genes and gene combinations from the combinatorial search space of many SNPs.

Table 2.

Categorized AutDB genes remaining after HGCM selection operations

HGCM major selection criteria	Genes	Genes present in AutDB	Genes found in PubMed search	Novel autism candidate genes
High subgroup contrast (difference in prevalence > 50%)	894	118 (13.2%)	214 (23.9%)	646 (72.3%)
Significant via FBAT (adj. p-value < 0.05)	286	49 (17.1%)	82 (28.7%)	193 (67.5%)
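The percentages in Table 2 and the selection rates quoted in the text follow directly from the reported gene counts. The short check below recomputes them; the variable and function names are illustrative only.

```python
def percent(part, whole, digits=1):
    """Percentage of whole represented by part, rounded to the given digits."""
    return round(100 * part / whole, digits)


pool_genes = 11_339        # genes prevalent in at least one subgroup
contrast_genes = 894       # genes with high subgroup contrast
fbat_genes = 286           # genes significant via FBAT

table2_contrast = [percent(n, contrast_genes) for n in (118, 214, 646)]
table2_fbat = [percent(n, fbat_genes) for n in (49, 82, 193)]
eliminated_by_contrast = percent(pool_genes - contrast_genes, pool_genes, digits=0)
survived_fbat = percent(fbat_genes, contrast_genes, digits=0)

print(table2_contrast, table2_fbat, eliminated_by_contrast, survived_fbat)
```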

We searched AutDB and PubMed for previous documentation of associations with the significant genes. The majority of the AutDB genes were not selected due to similar prevalence in opposing subgroups, presumably because they are relevant to autism in general and less specific to autism subtypes. Recall that AutDB is a broad collection of autism-related work with varying degrees of certainty from unsupported to high-confidence, and tracks genes under investigation for their potential relevance to autism, so it is not expected that any one method would reproduce a majority of these genes. 49 of our associated genes are present in AutDB – these are genes previously found to be related to autism in general that we found to be specifically relevant to autism subgroups. We note that the results of our method included none of the AutDB genes labelled as “unsupported” and only one labelled as “high-confidence”. Most of these two categories are significant (or not) regardless of the subgroup being examined, so they do not reveal genetic distinctions between subgroups. Most of our significant findings that overlap with AutDB likely have intermediate levels of support due to previous attempts to associate these genes with a broad population of autism, rather than the more homogeneous subgroups examined in this study.

The broader PubMed abstract search found that 44 additional genes, identified by HGCM but not included in AutDB, have been reported in publications related to autism (see Table 2 and Figure 4, green bars). Thus, the remaining 193 HGCM genes are considered novel for autism studies (Figure 4, red bars). We note that the proportion of previously known significant gene candidates is larger than the proportion of previously known high-contrast gene candidates. This suggests that the FBAT further improved the quality of selected genes, highlighting associations that are both statistically significant and clinically relevant for autism subgroup distinctions.

It is important to note that the PubMed search is limited by the fact that not all autism studies report significant findings in the abstract, particularly when many associations are found. A notable example is Yuen et al.’s recent study,⁵¹ which identified 18 new candidate genes. A comparison with the full text revealed that our HGCM method also found one of these genes, PCDH11X, to be associated with the dysmorphic, high-severity cognition, high-severity communication, and high-severity motivation subgroups.

3.3 Method validation

We examine specific procedure elements for their contribution to the HGCM method. Assessing the contribution of the genotype imputation, we notice that although only 18.2% of the SNPs were directly measured by all three microarrays (Figure 1), 54.3% of the significant SNPs identified by the HGCM method were from this non-imputed fraction. However, 51.0% of HGCM-identified genes were supported by at least one significant imputed SNP, and 36.7% of the genes were discovered strictly due to the inclusion of imputed genotypes, showing an important contribution from this procedural step.

To similarly evaluate the contribution of the UICsup metric, we separated the significant genes that would not have been identified if not for UICsup, shown in Figure 4 (dark bar segments). UICsup was responsible for identifying 49.7% of HGCM genes, including 52.7% of HGCM genes classified as “known”. Crucially, UICsup was responsible for over 80% of the genes associated with the SRS diagnostic sub-score subgroups.

3.4 Comparison with existing method

We demonstrate the contribution of our method by comparing to the capabilities of PLINK.³¹ HGCM has several advantages over PLINK from a theoretical standpoint. PLINK supports testing for epistasis by testing all pairs of included SNPs in a case-control comparison. As this is an exhaustive examination of the combinations (and since the tool is not parallelized), this process takes longer than our FPM-based method, but many comparisons can still be completed in a reasonable timeframe. However, this limits potential epistatic effects tested to pairwise interactions, whereas our method has no methodological restriction on the number of SNPs that can form combinations. Additionally, PLINK is primarily designed for population-based samples and supports limited integration of family information. Basic family-based association testing for disease traits can be performed, but this option is not compatible with epistasis testing. Finally, PLINK lacks functionality which would contrast opposing subgroups as we do in this work.

To the best of our ability given the differences in capabilities between the applications, we performed an analogous analysis using PLINK to identify SNP combinations associated with the dysmorphic subgroup. Starting with the 30,000 most significant SNPs identified by our genome-wide SNP prioritization step, PLINK’s epistasis test was used to identify SNP pairs associated with the dysmorphic subgroup in a case-control comparison (note that the analysis does not utilize pedigree information). The significant SNP pairs were measured in dysmorphic and nondysmorphic subgroups, and the prevalence of these SNP pairs was contrasted using the same criterion employed by HGCM: at least 50% prevalence difference. PLINK identified eight gene pairs specific to the dysmorphic subgroup, compared with the 69 gene pairs highlighted by HGCM. Three of the genes in PLINK associations overlap with genes found by HGCM. This, in addition to the unique results found by the inclusion of UICsup, demonstrates the importance of considering pedigree data in genome association analyses when it is available.

3.5 Gene pair associations

The HGCM method is designed to produce candidate SNP-sets representing interactions between any number of genes. However, the FBAT statistical test found all larger SNP-sets to be insignificant, resulting in single SNPs and pairs of SNPs associated with subgroups. HGCM gene pairs were generated when at least one SNP in each gene composed a significant SNP-set. In total, 71 gene pairs were associated with a subgroup; 55 distinct genes were part of at least one such gene pair. Most of these gene pairs were supported by one or two SNP-sets, but 11 gene pairs were supported by at least 10 SNP-sets connecting the two genes (Figure 5). In fact, these highly supported gene pairs all included the HS6ST2 gene and were associated with the dysmorphic subgroup. Variation in HS6ST2 has been observed in previous autism studies, but it is not known to be a significant contributing gene^52–54. Two gene pairs were associated with the high-severity motivation subgroup, and the rest with the dysmorphic subgroup; this is unsurprising, as the dysmorphic subgroup also has the largest number of individually significant genes.

Figure 5.

The number of significant SNP-sets supporting each gene pair found to be associated with a subgroup by HGCM. Only pairs of protein-coding genes are shown. The asterisks (*) indicate gene pairs associated with the high-severity motivation subgroup – all other gene pairs are associated with the dysmorphic subgroup. The upper and lower triangles of the matrix are equivalent.
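The support counts summarized in Figure 5 can be thought of as a tally of significant SNP-sets per gene pair. The sketch below shows one way such a tally could be computed; the SNP-to-gene mapping, the gene names, and the choice to ignore SNP-sets that fall within a single gene are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Tuple


def gene_pair_support(
    snp_sets: Iterable[Tuple[str, ...]],
    snp_to_gene: Dict[str, str],
) -> Counter:
    """Count how many SNP-sets connect each unordered pair of distinct genes."""
    support = Counter()
    for snp_set in snp_sets:
        # Map SNPs to genes; SNPs without a gene annotation are skipped.
        genes = {snp_to_gene[s] for s in snp_set if s in snp_to_gene}
        # A SNP-set within a single gene yields no pair and is ignored here.
        for pair in combinations(sorted(genes), 2):
            support[pair] += 1
    return support


# Hypothetical annotation and significant SNP-sets.
snp_to_gene = {"snp1": "GENE_A", "snp2": "GENE_A", "snp3": "GENE_B", "snp4": "GENE_C"}
significant_sets = [
    ("snp1", "snp3"),
    ("snp2", "snp3"),
    ("snp1", "snp2"),  # both SNPs in GENE_A, so no gene pair
    ("snp3", "snp4"),
]

support = gene_pair_support(significant_sets, snp_to_gene)
print(dict(support))
```

Sorting the gene names before pairing keeps (GENE_A, GENE_B) and (GENE_B, GENE_A) in the same cell, mirroring the symmetric matrix in the figure.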

3.6 Case study for deeper understanding

Last, we examined the significant SNPs within the DMD gene. This gene was notable due to its association with several subgroups including the dysmorphic, nondysmorphic, high-severity communication, and low-severity communication subgroups – opposing sides of two subgroup contrasts. Such a situation can arise only with a method like ours, which focuses on differences between paired subgroups. To understand this phenomenon, we visualized the positions of the associated SNPs in the context of the DMD exons (Figure 6). The SNPs localized to two regions: 5 SNPs in a 100kb region towards the start of the gene (ChrX:32950000–33050000), and 9 SNPs in a 500kb region towards the end (ChrX:31600000–32100000). The localization of SNPs associated with more severe phenotypes suggests the existence of two regions within the gene which, when damaged, lead to severe phenotypes. SNPs associated with less severe phenotypes are near these regions, but somewhat isolated, indicating that variation further away from these critical regions affects the phenotype less seriously.

Figure 6.

The location of SNPs within the DMD gene (ENSG00000198947) associated with autism subgroups divided into more and less severe categories. Note that the dysmorphic/nondysmorphic subgroup pair is based on the presence or absence of specific physical features rather than high/low severity of a phenotype; however, patients classified as dysmorphic tend to have higher severity in behavioral measures than nondysmorphic probands. Black bars represent DMD exons, obtained using Ensembl BioMart (GRCh38.p7). Note that DMD is transcribed from the reverse strand, from right to left in this figure. The top panel magnifies the chromosomal region ChrX:31550000–32070000 and the bottom panel magnifies ChrX:32925000–33090000. The asterisk (*) indicates two adjacent SNPs (79bp apart), both associated with the high-severity communication subgroup. Abbreviations: high=high-severity, low=low-severity, manner=mannerisms, comm=communication, cog=cognition.
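The grouping of DMD SNPs into two regions can be expressed as a simple interval lookup. In the sketch below the region boundaries are the ChrX coordinates quoted in the case study, while the region labels, the function name, and the example positions are hypothetical and do not correspond to SNPs reported in this study.

```python
from typing import Dict, Optional, Tuple

# ChrX intervals quoted in the text; DMD lies on the reverse strand,
# so the start of the gene is at the higher coordinates.
DMD_REGIONS: Dict[str, Tuple[int, int]] = {
    "start_region": (32_950_000, 33_050_000),  # 100 kb
    "end_region": (31_600_000, 32_100_000),    # 500 kb
}


def assign_region(position: int) -> Optional[str]:
    """Return the name of the region containing position, or None if outside both."""
    for name, (lo, hi) in DMD_REGIONS.items():
        if lo <= position <= hi:
            return name
    return None


# Hypothetical positions used only to exercise the lookup.
example_positions = [33_000_000, 31_800_000, 32_500_000]
labels = [assign_region(p) for p in example_positions]
print(labels)  # ['start_region', 'end_region', None]
```

Counting the labels per subgroup would reproduce the kind of summary given in the text (for example, how many associated SNPs fall in each region).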

3.7 Availability of discovered genes

Our informatics approach provides many candidate autism genes for future investigation and highlights genes that are potentially relevant to specific subgroups and autism phenotypes. As a guide to improving the understanding of these subgroups, we provide a summary of the most significant findings for each examined subgroup in Table 3. A comprehensive dataset detailing our findings, including the discovered genes associated with subgroups, the involved SNPs, and a summary of the prevalence of these SNPs in the examined subgroup pairs, is provided in Appendices B and C. These appendices also contain details about significant findings regarding non-protein-coding genes and SNPs in non-genic DNA to facilitate the growth of knowledge of these genetic regions.

Table 3.

The three most significant genes and gene pairs associated with each subgroup

Subgroup	Most Significant Genes (adj. p-value)	Subgroup	Most Significant Genes (adj. p-value)

	Most Significant Gene Pairs (adj. p-value)		Most Significant Gene Pairs (adj. p-value)

dysmorphic	DYNC2H1 (0.0046) ^*	nondysmorphic	MXRA5Y (2.0e-06) ^nc
	AC008271.1 (0.0116) ^nc		IL1RAPL2 (0.0001) ^a,^p
	IGF2BP1 (0.0135) ^*		MAMLD1 (0.0003) ^*

	DYNC2H1 \| HMGB1P32 (0.0135) ^* \| ^nc		HS6ST2 \| OFD1P6Y (0.0080) ^* \| ^nc
	NRAP \| HS6ST2 (0.0148) ^* \| ^*		USP26 \| OFD1P6Y (0.0111) ^* \| ^nc
	CSNK1G3 \| HMGB1P32 (0.0148) ^* \| ^nc		-

high-severity awareness	NR6A1 (0.0027) ^*	low-severity awareness	-
	ZNF559 (0.0027) ^a		-
	ZNF559-ZNF177 (0.0027) ^*		-

	-		-
	-		-
	-		-

high-severity cognition	DRP2 (1.0e-48) ^p	low-severity cognition	UPF3B (0.0368) ^a,^p
	PTCHD1-AS (1.0e-48) ^nc		SHROOM4 (0.0368) ^p
	PCDH11X (3.26e-16) ^p		-

	GRIA3 \| TTTY5 (0.0079) ^p \| ^nc		-
	PTCHD1-AS \| TTTY5 (0.0119) ^nc \| ^nc		-
	DRP2 \| TTTY5 (0.0197) ^p \| ^nc		-

high-severity communication	HUWE1 (1.0e-48) ^a,^p	low-severity communication	DMD (0.0349) ^a,^p
	PCDH11Y (7.0e-47) ^p		CA5B (0.0484) ^*
	RP11-158M9.1 (7.6e-18) ^nc		-

	DMD \| TTTY5 (0.0042) ^a,^p \| ^nc		-
	RP13-126P21.2 \| TTTY5 (0.0044) ^nc \| ^nc		-
	SLC25A14 \| TTTY5 (0.0071) ^a,^p \| ^nc		-

high-severity mannerism	NLGN4X (2.5e-11) ^a,^p	low-severity mannerism	-
	GRP173 (1.2e-10) ^*		-
	RP11-268G12.1 (1.2e-10) ^nc		-

	PTCHD1-AS \| TTTY5 (0.0012) ^nc \| ^nc		-
	TMEM164 \| TTTY5 (0.0039) ^* \| ^nc		-
	TMEM164 \| TTTY11 (0.0244) ^* \| ^nc		-

high-severity motivation	LYPD5 (1.3e-11) ^*	low-severity motivation	TAF7L (4.8e-07) ^*
	TTTY5 (3.1e-07) ^nc		PDK3 (0.0003) ^*
	TTTY11 (5.7e-06) ^nc		DRP2 (0.0003) ^p

	LYPD5 \| IL1RAPL2 (0.0058) ^* \| ^a,^p		-
	LYPD5 \| PCDH11X (0.0148) ^* \| ^p		-
	-		-


^* gene novel to this study (not present in AutDB or PubMed abstract search)

^a gene present in AutDB

^p gene found in PubMed abstract search

^nc non-coding gene – various RNAs that are expressed but not translated

4 Discussion

Our Heritable Genotype Contrast Mining procedure is a data-driven, de novo method for detecting gene-subgroup associations that reveals many novel autism candidates and correlates known autism factors with specific phenotypes. HGCM exclusively selects high-contrast genotypes, emphasizing the discovery of genes that may make a critical distinction in the development of precise autism subtypes. We utilized this method to reveal many genetic associations with autism patients grouped by explicit shared characteristics.

Hundreds of genes are estimated to be involved in autism development, and our informatics study reveals many candidates that can be studied with specific focuses on the associated subgroups and phenotypes. In the results of this study (Appendices B and C), we provide abundant information to prioritize these candidates in different ways, such as by the highest subgroup contrast or the most significant p-values. Our future prioritization strategy will involve identifying important pathways containing these associated genes and investigating groups of candidates related to specific functions for deeper understanding, as several key insights have been revealed from the linking of genes to functionality in autism.^{55, 56}
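As a small illustration of the two prioritization orderings mentioned above, the sketch below ranks candidate records by subgroup contrast and by adjusted p-value. The record layout, field names, gene labels, and values are all hypothetical placeholders and do not reflect the actual format of Appendices B and C.

```python
from operator import itemgetter
from typing import Dict, List


def rank(candidates: List[Dict], field: str, descending: bool = False) -> List[str]:
    """Return gene labels ordered by the chosen field."""
    ordered = sorted(candidates, key=itemgetter(field), reverse=descending)
    return [c["gene"] for c in ordered]


# Placeholder records (not values from this study).
candidates = [
    {"gene": "GENE_1", "contrast": 0.62, "adj_p": 0.0135},
    {"gene": "GENE_2", "contrast": 0.80, "adj_p": 0.0300},
    {"gene": "GENE_3", "contrast": 0.55, "adj_p": 0.0046},
]

by_contrast = rank(candidates, "contrast", descending=True)  # largest contrast first
by_p_value = rank(candidates, "adj_p")                       # smallest p-value first

print(by_contrast)  # ['GENE_2', 'GENE_1', 'GENE_3']
print(by_p_value)   # ['GENE_3', 'GENE_1', 'GENE_2']
```

The two orderings can disagree, which is why it may be useful to keep both fields available when selecting candidates for follow-up.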

Many SNP-sets prevalent in each low-severity subgroup are also prevalent in the corresponding high-severity subgroup, and were not selected for association testing due to our focus on high-contrast subgroup differences. This leads to the consistent enrichment of significant SNP-sets in the high-severity subgroups over their low-severity counterparts, seen in Figure 4. This enrichment pattern is explained by the existence of baseline SNP-sets that induce the low-severity phenotype and additional SNPs that combine with these SNP-sets to increase the phenotype severity. This supports our hypothesis that combinations of SNPs are a driving force for autism subtype etiology. It explains the pattern seen in the five subgroup pairs defined by SRS phenotype severity scores, but recall that the dysmorphic and nondysmorphic subgroups are defined by explicit physical traits, and not by phenotype severity. The two subgroups in this pair are more phenotypically distinct than the other five subgroup pairs, leading to less overlap in SNP-sets prevalent in the two subgroups. In fact, we believe the higher phenotypic distinction between the dysmorphic and nondysmorphic subgroups leads to the larger quantity of significant SNP-sets separating these groups.

The major strength of this method is the ability to highlight genetic distinctions between cohort subgroups. We demonstrate this strength here by comparing broad categories of autism patients defined by a single distinction, regardless of how diverse the resulting subgroups are. However, HGCM is also capable of examining differences between much more specific pairs of subgroups, such as ones that have nearly identical characteristics other than a single-trait distinction. This would provide genetic associations that are much more specific to the state of the targeted trait, but would suffer from a reduction in sample size and, thus, statistical power. Our preliminary attempts at understanding more precise traits in this manner involve dividing and examining one of these analyzed subgroups. We have noticed that the more specific subgroups can have just as many associated genes as the original subgroup before it was partitioned, but the specific subgroups have surprisingly few gene associations in common with their corresponding broad subgroups.

While this study successfully identified several significant combinations of genes associated with autism subgroups, it does not sufficiently account for the interactions of multiple genotypes. We expect that many autism subtypes have contributory gene-gene interactions, but our method only identified significant multi-gene associations with two subgroups. We attribute this limitation to the FPM algorithm, which imposes a stringent mechanism for choosing the combinations of SNPs to test for association: the minor alleles for specific SNPs in each gene must be enriched in the subgroups. FPM also generated very few gene combinations with more than two genes, though we suspect that there are gene trios and larger gene combinations that are important for autism development in some subgroups. The few larger gene combinations that were generated as candidates were insignificant using the FBAT statistical test. A more forgiving method would utilize haplotypes or consider multiple SNPs in a region while searching for multi-gene associations. The discovered SNP-set associations must be quite significant indeed to have fulfilled this method’s precise candidate generation requirements.
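The stringency of frequent-pattern candidate generation can be illustrated with a minimal level-wise (Apriori-style) miner. This is a generic textbook sketch under a simple minimum-support threshold, not the distributed FPM implementation or the UICsup-based criterion used by HGCM; the function name and the toy data are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def frequent_itemsets(
    transactions: List[Set[str]], min_support: float, max_size: int = 3
) -> Dict[FrozenSet[str], float]:
    """Level-wise mining: a set is a candidate only if all its subsets are frequent."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    result: Dict[FrozenSet[str], float] = {}
    size = 1
    while current and size <= max_size:
        # Support = fraction of transactions containing the whole candidate.
        support = {c: sum(c <= t for t in transactions) / n for c in current}
        frequent = {c: s for c, s in support.items() if s >= min_support}
        result.update(frequent)
        # Build next-level candidates from pairs of frequent sets.
        size += 1
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == size and all(
                frozenset(sub) in frequent for sub in combinations(union, size - 1)
            ):
                candidates.add(union)
        current = list(candidates)
    return result


# Toy data: each set lists the variants carried by one individual.
people = [{"A", "B", "C"}, {"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B"}]
found = frequent_itemsets(people, min_support=0.5)
print(sorted(tuple(sorted(s)) for s in found))
```

Here the trio (A, B, C) is never even tested because the pair (B, C) falls below the threshold, which mirrors how few larger combinations reach the association-testing stage.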

Several genes are associated with multiple subgroups. This is evidence that these genes are broadly associated with autism, and that this association extends to the subgroups. It is more difficult to interpret situations in which a gene is associated with each member of a subgroup pair, such as the DMD example shown in Figure 6. It is tempting to disregard these results as anomalies, but in this case our closer examination revealed a potential explanation, demonstrating why seemingly contradictory associations should not be immediately dismissed.

Our HGCM method provides quality candidates for autism subgroup association, and these results should be taken into consideration during the genotype measurement for future studies so that the data exist to confirm or reject the highlighted genes as contributing to the development of autism. Additionally, the HGCM method is not disorder-specific and can be applied to other disorders with potential subgroups, in addition to contrasting additional autism subgroup pairs.

5 Conclusions

In this paper, we describe Heritable Genotype Contrast Mining (HGCM), a novel method for the discovery of genes and gene-pairs associated with disorder subgroups. HGCM integrates bioinformatics tools, distributed data mining algorithms, and a novel family inheritance metric to generate candidate SNPs and combinations of SNPs in an a priori fashion, and these are tested for association with the subgroups. We utilized this method to contrast six autism subgroup pairs, including one pair of previously characterized subgroups defined by morphological features and five pairs defined by the scores of a widely-used diagnostic test. 286 genes were discovered to be associated with a subgroup, including 193 potentially novel autism candidates. We conclude that HGCM produced valuable associations between genes and autism subgroups, and can improve precision medicine practices by identifying genetic associations within other disorders that may comprise distinct subtypes. In particular, recent discoveries about novel subtyping of patients with Alzheimer’s make this disease a suitable target for this method.⁵⁷ The autism candidate genes identified in this study should be incorporated into future data collection to verify the significance of these associations and to identify the mechanisms through which these genes contribute to the development of autism subtypes.

Supplementary Material

NIHMS926019-supplement-1.docx (13.6 KB, docx)

NIHMS926019-supplement-2.csv (69.5 KB, csv)

NIHMS926019-supplement-3.csv (51.9 KB, csv)

Highlights.

Novel procedure to identify significant genetic differences between autism subgroups
Data mining techniques allow testing of combinations of genes
Data-driven procedure for SNP prioritization reduces experimental bias
286 genes associated with specific autism subgroups – 193 novel autism candidates

Acknowledgments

Funding

This work was supported by the National Institutes of Health [grant numbers 5T32GM008396, 5T32LM012410-02]; the Shumaker Endowment for Biomedical Informatics; the National Science Foundation [grant number CNS-1429294]; and the Simons Foundation [award number #26021565-08C000066].

The authors would like to thank Dr. Stephen Kanne for many discussions on the interpretation and use of behavior data scored in the Simons collection, Dr. Zohreh Talebizadeh for the fruitful discussions on subgroup-focused research and potential follow-up directions for this study, and Dr. Michael Phinney for sharing his data mining software and for abundant technical advice.

Footnotes

Competing Interests

None.

Contributors

MS and CS designed the informatics procedure and methods to compare results to prior work. MS performed data preprocessing, analysis, and visualization. SC provided statistical expertise during the experimental design and while determining an appropriate statistical test. NT assisted in the process of requesting the data from the Simons Foundation and provided insights in the interpretation and parsing of the data sets. JM provided expertise on the morphological subgroups and genetics perspectives during the interpretation of results. MS drafted the manuscript with substantial contributions from all co-authors. CS supervised the project development and manuscript writing. All authors approved the final manuscript and agree to all aspects of the work.

Declaration of Interests

None.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

1. American Psychiatric Association. American Psychiatric Association Diagnostic and Statistical Manual of Mental Disorders. Washington, DC: p. 1664.
2. Wong VC, Fung CK, Wong PT. Use of dysmorphology for subgroup classification on autism spectrum disorder in Chinese children. Journal of autism and developmental disorders. 2014;44(1):9–18. doi: 10.1007/s10803-013-1846-3.
3. Hu VW, Sarachana T, Kim KS, et al. Gene expression profiling differentiates autism case–controls and phenotypic variants of autism spectrum disorders: Evidence for circadian rhythm dysfunction in severe autism. Autism research. 2009;2(2):78–97. doi: 10.1002/aur.73.
4. Alexander AL, Lee JE, Lazar M, et al. Diffusion tensor imaging of the corpus callosum in Autism. Neuroimage. 2007;34(1):61–73. doi: 10.1016/j.neuroimage.2006.08.032.
5. Hu VW, Addington A, Hyman A. Novel autism subtype-dependent genetic variants are revealed by quantitative trait and subphenotype association analyses of published GWAS data. Plos One. 2011;6(4):e19067. doi: 10.1371/journal.pone.0019067.
6. Miles JH. Autism spectrum disorders—a genetics review. Genetics in Medicine. 2011;13(4):278–94. doi: 10.1097/GIM.0b013e3181ff67ba.
7. Miles JH, Takahashi TN, Hong J, et al. Development and validation of a measure of dysmorphology: useful for autism subgroup classification. American Journal of Medical Genetics Part A. 2008;146(9):1101–16. doi: 10.1002/ajmg.a.32244.
8. Ozgen H, Hellemann G, de Jonge M, et al. Predictive value of morphological features in patients with autism versus normal controls. Journal of autism and developmental disorders. 2013;43(1):147–55. doi: 10.1007/s10803-012-1554-4.
9. Tager-Flusberg H. Defining language impairments in a subgroup of children with autism spectrum disorder. Science China Life Sciences. 2015;58(10):1044–52. doi: 10.1007/s11427-012-4297-8.
10. Iossifov I, O’Roak BJ, Sanders SJ, et al. The contribution of de novo coding mutations to autism spectrum disorder. Nature. 2014;515(7526):216–21. doi: 10.1038/nature13908.
11. Sebat J, Lakshmi B, Malhotra D, et al. Strong association of de novo copy number mutations with autism. Science. 2007;316(5823):445–9. doi: 10.1126/science.1138659.
12. Sanders SJ, Murtha MT, Gupta AR, et al. De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature. 2012;485(7397):237–41. doi: 10.1038/nature10945.
13. Risch N, Spiker D, Lotspeich L, et al. A genomic screen of autism: evidence for a multilocus etiology. American journal of human genetics. 1999;65(2):493–507. doi: 10.1086/302497.
14. Ozonoff S, Young GS, Carter A, et al. Recurrence risk for autism spectrum disorders: a Baby Siblings Research Consortium study. Pediatrics. 2011;128(3):e488–e95. doi: 10.1542/peds.2010-2825.
15. Weiner DJ, Wigdor EM, Ripke S, et al. Polygenic transmission disequilibrium confirms that common and rare variation act additively to create risk for autism spectrum disorders. Nature genetics. 2017 doi: 10.1038/ng.3863.
16. Landrigan PJ. What causes autism? Exploring the environmental contribution. Current opinion in pediatrics. 2010;22(2):219–25. doi: 10.1097/MOP.0b013e328336eb9a.
17. Rodier PM, Hyman SL. Early environmental factors in autism. Mental Retardation and Developmental Disabilities Research Reviews. 1998;4(2):121–8.
18. Rossignol D, Genuis S, Frye R. Environmental toxicants and autism spectrum disorders: a systematic review. Translational psychiatry. 2014;4(2):e360. doi: 10.1038/tp.2014.4.
19. Carter M, Scherer S. Autism spectrum disorder in the genetics clinic: a review. Clinical genetics. 2013;83(5):399–407. doi: 10.1111/cge.12101.
20. Deth R, Muratore C, Benzecry J, et al. How environmental and genetic factors combine to cause autism: A redox/methylation hypothesis. Neurotoxicology. 2008;29(1):190–201. doi: 10.1016/j.neuro.2007.09.010.
21. Yuen RK, Thiruvahindrapuram B, Merico D, et al. Whole-genome sequencing of quartet families with autism spectrum disorder. Nat Med. 2015;21(2):185–91. doi: 10.1038/nm.3792.
22. Gaugler T, Klei L, Sanders SJ, et al. Most genetic risk for autism resides with common variation. Nature genetics. 2014;46(8):881–5. doi: 10.1038/ng.3039.
23. Moore JH. The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Human heredity. 2003;56(1–3):73–82. doi: 10.1159/000073735.
24. Anderson B, Schnetz-Boutaud N, Bartlett J, et al. Examination of association of genes in the serotonin system to autism. Neurogenetics. 2009;10(3):209–16. doi: 10.1007/s10048-009-0171-7.
25. Anderson B, Schnetz-Boutaud N, Bartlett J, et al. Examination of association to autism of common genetic variation in genes related to dopamine. Autism Research. 2008;1(6):364–9. doi: 10.1002/aur.55.
26. Ashley-Koch AE, Jaworski J, Mei H, et al. Investigation of potential gene–gene interactions between APOE and RELN contributing to autism risk. Psychiatric genetics. 2007;17(4):221–6. doi: 10.1097/YPG.0b013e32809c2f75.
27. Bowers K, Li Q, Bressler J, et al. Glutathione pathway gene variation and risk of autism spectrum disorders. Journal of neurodevelopmental disorders. 2011;3(2):132. doi: 10.1007/s11689-011-9077-4.
28. Campbell DB, Li C, Sutcliffe JS, et al. Genetic evidence implicating multiple genes in the MET receptor tyrosine kinase pathway in autism spectrum disorder. Autism Research. 2008;1(3):159–68. doi: 10.1002/aur.27.
29. Kim SJ, Brune CW, Kistner EO, et al. Transmission disequilibrium testing of the chromosome 15q11–q13 region in autism. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics. 2008;147(7):1116–25. doi: 10.1002/ajmg.b.30733.
30. Ma D, Whitehead P, Menold M, et al. Identification of significant association and gene-gene interaction of GABA receptor subunit genes in autism. The American Journal of Human Genetics. 2005;77(3):377–88. doi: 10.1086/433195.
31. Purcell S, Neale B, Todd-Brown K, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics. 2007;81(3):559–75. doi: 10.1086/519795.
32. Fischbach GD, Lord C. The Simons Simplex Collection: a resource for identification of autism genetic risk factors. Neuron. 2010;68(2):192–5. doi: 10.1016/j.neuron.2010.10.006.
33. Shvachko K, Kuang H, Radia S, et al., editors. The hadoop distributed file system. Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on; 2010; IEEE.
34. Zaharia M, Chowdhury M, Franklin MJ, et al., editors. Spark: cluster computing with working sets. Proceedings of the 2nd USENIX conference on Hot topics in cloud computing; 2010.
35. Constantino JN, Gruber CP. Social responsiveness scale (SRS) Western Psychological Services; Los Angeles, CA: 2007.
36. Sanders SJ, He X, Willsey AJ, et al. Insights into autism spectrum disorder genomic architecture and biology from 71 risk loci. Neuron. 2015;87(6):1215–33. doi: 10.1016/j.neuron.2015.09.016.
37. Sherry ST, Ward M-H, Kholodov M, et al. dbSNP: the NCBI database of genetic variation. Nucleic acids research. 2001;29(1):308–11. doi: 10.1093/nar/29.1.308.
38. Verma SS, De Andrade M, Tromp G, et al. Imputation and quality control steps for combining multiple genome-wide datasets. Frontiers in genetics. 2014;5:370. doi: 10.3389/fgene.2014.00370.
39. Bush WS, Moore JH. Genome-wide association studies. PLoS computational biology. 2012;8(12):e1002822. doi: 10.1371/journal.pcbi.1002822.
40. Browning SR, Browning BL. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. The American Journal of Human Genetics. 2007;81(5):1084–97. doi: 10.1086/521987.
41. Agrawal R, Srikant R, editors. Fast algorithms for mining association rules. Proc 20th int conf very large data bases, VLDB; 1994.
42. Hipp J, Güntzer U, Nakhaeizadeh G. Algorithms for association rule mining—a general survey and comparison. ACM sigkdd explorations newsletter. 2000;2(1):58–64.
43. Bay SD, Pazzani MJ. Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery. 2001;5(3):213–46.
44. Dong G, Li J, editors. Efficient mining of emerging patterns: Discovering trends and differences. Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining; 1999; ACM.
45. Novak PK, Lavrač N, Webb GI. Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining. J Mach Learn Res. 2009;10(Feb):377–403.
46. Gogarten SM, Bhangale T, Conomos MP, et al. GWASTools: an R/Bioconductor package for quality control and analysis of Genome-Wide Association Studies. Bioinformatics. 2012;28(24):3329–31. doi: 10.1093/bioinformatics/bts610.
47. Johnson RC, Nelson GW, Troyer JL, et al. Accounting for multiple comparisons in a genome-wide association study (GWAS) BMC genomics. 2010;11(1):724. doi: 10.1186/1471-2164-11-724.
48. Horvath S, Xu X, Laird NM. The family based association test method: strategies for studying general genotype-phenotype associations. European journal of human genetics: EJHG. 2001;9(4):301. doi: 10.1038/sj.ejhg.5200625.
49. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the royal statistical society Series B (Methodological) 1995:289–300.
50. Basu SN, Kollu R, Banerjee-Basu S. AutDB: a gene reference resource for autism research. Nucleic acids research. 2009;37(suppl 1):D832–D6. doi: 10.1093/nar/gkn835.
51. Yuen RK, Merico D, Bookman M, et al. Whole genome sequencing resource identifies 18 new candidate genes for autism spectrum disorder. Nature Neuroscience. 2017 doi: 10.1038/nn.4524.
52. Bremer A, Giacobini M, Eriksson M, et al. Copy number variation characteristics in subpopulations of patients with autism spectrum disorders. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics. 2011;156(2):115–24. doi: 10.1002/ajmg.b.31142.
53. French L, Pavlidis P. Relationships between gene expression and brain wiring in the adult rodent brain. PLoS computational biology. 2011;7(1):e1001049. doi: 10.1371/journal.pcbi.1001049.
54. Piton A, Gauthier J, Hamdan F, et al. Systematic resequencing of X-chromosome synaptic genes in autism spectrum disorder and schizophrenia. Molecular psychiatry. 2011;16(8):867–80. doi: 10.1038/mp.2010.54.
55. Gilman SR, Iossifov I, Levy D, et al. Rare de novo variants associated with autism implicate a large functional network of genes involved in formation and function of synapses. Neuron. 2011;70(5):898–907. doi: 10.1016/j.neuron.2011.05.021.
56. Krishnan A, Zhang R, Yao V, et al. Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder. Nature neuroscience. 2016;19(11):1454–62. doi: 10.1038/nn.4353.
57. Bredesen DE. Metabolic profiling distinguishes three subtypes of Alzheimer’s disease. Aging (Albany NY) 2015;7(8):595–600. doi: 10.18632/aging.100801.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

NIHMS926019-supplement-1.docx^{(13.6KB, docx)}

NIHMS926019-supplement-2.csv^{(69.5KB, csv)}

NIHMS926019-supplement-3.csv^{(51.9KB, csv)}

[R1] 1.American Psychiatric Association. American Psychiatric Association Diagnostic and Statistical Manual of Mental Disorders. Washington, DC: p. 1664. [Google Scholar]

[R2] 2.Wong VC, Fung CK, Wong PT. Use of dysmorphology for subgroup classification on autism spectrum disorder in Chinese children. Journal of autism and developmental disorders. 2014;44(1):9–18. doi: 10.1007/s10803-013-1846-3. [DOI] [PubMed] [Google Scholar]

[R3] 3.Hu VW, Sarachana T, Kim KS, et al. Gene expression profiling differentiates autism case–controls and phenotypic variants of autism spectrum disorders: Evidence for circadian rhythm dysfunction in severe autism. Autism research. 2009;2(2):78–97. doi: 10.1002/aur.73. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] 4.Alexander AL, Lee JE, Lazar M, et al. Diffusion tensor imaging of the corpus callosum in Autism. Neuroimage. 2007;34(1):61–73. doi: 10.1016/j.neuroimage.2006.08.032. [DOI] [PubMed] [Google Scholar]

[R5] 5.Hu VW, Addington A, Hyman A. Novel autism subtype-dependent genetic variants are revealed by quantitative trait and subphenotype association analyses of published GWAS data. Plos One. 2011;6(4):e19067. doi: 10.1371/journal.pone.0019067. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] 6.Miles JH. Autism spectrum disorders—a genetics review. Genetics in Medicine. 2011;13(4):278–94. doi: 10.1097/GIM.0b013e3181ff67ba. [DOI] [PubMed] [Google Scholar]

[R7] 7.Miles JH, Takahashi TN, Hong J, et al. Development and validation of a measure of dysmorphology: useful for autism subgroup classification. American Journal of Medical Genetics Part A. 2008;146(9):1101–16. doi: 10.1002/ajmg.a.32244. [DOI] [PubMed] [Google Scholar]

[R8] 8.Ozgen H, Hellemann G, de Jonge M, et al. Predictive value of morphological features in patients with autism versus normal controls. Journal of autism and developmental disorders. 2013;43(1):147–55. doi: 10.1007/s10803-012-1554-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] 9.Tager-Flusberg H. Defining language impairments in a subgroup of children with autism spectrum disorder. Science China Life Sciences. 2015;58(10):1044–52. doi: 10.1007/s11427-012-4297-8. [DOI] [PubMed] [Google Scholar]

[R10] 10.Iossifov I, O’Roak BJ, Sanders SJ, et al. The contribution of de novo coding mutations to autism spectrum disorder. Nature. 2014;515(7526):216–21. doi: 10.1038/nature13908. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] 11.Sebat J, Lakshmi B, Malhotra D, et al. Strong association of de novo copy number mutations with autism. Science. 2007;316(5823):445–9. doi: 10.1126/science.1138659. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] 12.Sanders SJ, Murtha MT, Gupta AR, et al. De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature. 2012;485(7397):237–41. doi: 10.1038/nature10945. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] 13.Risch N, Spiker D, Lotspeich L, et al. A genomic screen of autism: evidence for a multilocus etiology. American journal of human genetics. 1999;65(2):493–507. doi: 10.1086/302497. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] 14.Ozonoff S, Young GS, Carter A, et al. Recurrence risk for autism spectrum disorders: a Baby Siblings Research Consortium study. Pediatrics. 2011;128(3):e488–e95. doi: 10.1542/peds.2010-2825. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] 15.Weiner DJ, Wigdor EM, Ripke S, et al. Polygenic transmission disequilibrium confirms that common and rare variation act additively to create risk for autism spectrum disorders. Nature genetics. 2017. doi: 10.1038/ng.3863. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] 16.Landrigan PJ. What causes autism? Exploring the environmental contribution. Current opinion in pediatrics. 2010;22(2):219–25. doi: 10.1097/MOP.0b013e328336eb9a. [DOI] [PubMed] [Google Scholar]

[R17] 17.Rodier PM, Hyman SL. Early environmental factors in autism. Mental Retardation and Developmental Disabilities Research Reviews. 1998;4(2):121–8. [Google Scholar]

[R18] 18.Rossignol D, Genuis S, Frye R. Environmental toxicants and autism spectrum disorders: a systematic review. Translational psychiatry. 2014;4(2):e360. doi: 10.1038/tp.2014.4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] 19.Carter M, Scherer S. Autism spectrum disorder in the genetics clinic: a review. Clinical genetics. 2013;83(5):399–407. doi: 10.1111/cge.12101. [DOI] [PubMed] [Google Scholar]

[R20] 20.Deth R, Muratore C, Benzecry J, et al. How environmental and genetic factors combine to cause autism: A redox/methylation hypothesis. Neurotoxicology. 2008;29(1):190–201. doi: 10.1016/j.neuro.2007.09.010. [DOI] [PubMed] [Google Scholar]

[R21] 21.Yuen RK, Thiruvahindrapuram B, Merico D, et al. Whole-genome sequencing of quartet families with autism spectrum disorder. Nat Med. 2015;21(2):185–91. doi: 10.1038/nm.3792. [DOI] [PubMed] [Google Scholar]

[R22] 22.Gaugler T, Klei L, Sanders SJ, et al. Most genetic risk for autism resides with common variation. Nature genetics. 2014;46(8):881–5. doi: 10.1038/ng.3039. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] 23.Moore JH. The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Human heredity. 2003;56(1–3):73–82. doi: 10.1159/000073735. [DOI] [PubMed] [Google Scholar]

[R24] 24.Anderson B, Schnetz-Boutaud N, Bartlett J, et al. Examination of association of genes in the serotonin system to autism. Neurogenetics. 2009;10(3):209–16. doi: 10.1007/s10048-009-0171-7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R25] 25.Anderson B, Schnetz-Boutaud N, Bartlett J, et al. Examination of association to autism of common genetic variation in genes related to dopamine. Autism Research. 2008;1(6):364–9. doi: 10.1002/aur.55. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] 26.Ashley-Koch AE, Jaworski J, Mei H, et al. Investigation of potential gene–gene interactions between APOE and RELN contributing to autism risk. Psychiatric genetics. 2007;17(4):221–6. doi: 10.1097/YPG.0b013e32809c2f75. [DOI] [PubMed] [Google Scholar]

[R27] 27.Bowers K, Li Q, Bressler J, et al. Glutathione pathway gene variation and risk of autism spectrum disorders. Journal of neurodevelopmental disorders. 2011;3(2):132. doi: 10.1007/s11689-011-9077-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R28] 28.Campbell DB, Li C, Sutcliffe JS, et al. Genetic evidence implicating multiple genes in the MET receptor tyrosine kinase pathway in autism spectrum disorder. Autism Research. 2008;1(3):159–68. doi: 10.1002/aur.27. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R29] 29.Kim SJ, Brune CW, Kistner EO, et al. Transmission disequilibrium testing of the chromosome 15q11–q13 region in autism. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics. 2008;147(7):1116–25. doi: 10.1002/ajmg.b.30733. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R30] 30.Ma D, Whitehead P, Menold M, et al. Identification of significant association and gene-gene interaction of GABA receptor subunit genes in autism. The American Journal of Human Genetics. 2005;77(3):377–88. doi: 10.1086/433195. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R31] 31.Purcell S, Neale B, Todd-Brown K, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics. 2007;81(3):559–75. doi: 10.1086/519795. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R32] 32.Fischbach GD, Lord C. The Simons Simplex Collection: a resource for identification of autism genetic risk factors. Neuron. 2010;68(2):192–5. doi: 10.1016/j.neuron.2010.10.006. [DOI] [PubMed] [Google Scholar]

[R33] 33.Shvachko K, Kuang H, Radia S, et al., editors. The hadoop distributed file system. Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on; 2010; IEEE; [Google Scholar]

[R34] 34.Zaharia M, Chowdhury M, Franklin MJ, et al., editors. Spark: cluster computing with working sets. Proceedings of the 2nd USENIX conference on Hot topics in cloud computing; 2010. [Google Scholar]

[R35] 35.Constantino JN, Gruber CP. Social responsiveness scale (SRS). Los Angeles, CA: Western Psychological Services; 2007. [Google Scholar]

[R36] 36.Sanders SJ, He X, Willsey AJ, et al. Insights into autism spectrum disorder genomic architecture and biology from 71 risk loci. Neuron. 2015;87(6):1215–33. doi: 10.1016/j.neuron.2015.09.016. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R37] 37.Sherry ST, Ward M-H, Kholodov M, et al. dbSNP: the NCBI database of genetic variation. Nucleic acids research. 2001;29(1):308–11. doi: 10.1093/nar/29.1.308. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R38] 38.Verma SS, De Andrade M, Tromp G, et al. Imputation and quality control steps for combining multiple genome-wide datasets. Frontiers in genetics. 2014;5:370. doi: 10.3389/fgene.2014.00370. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R39] 39.Bush WS, Moore JH. Genome-wide association studies. PLoS computational biology. 2012;8(12):e1002822. doi: 10.1371/journal.pcbi.1002822. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R40] 40.Browning SR, Browning BL. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. The American Journal of Human Genetics. 2007;81(5):1084–97. doi: 10.1086/521987. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R41] 41.Agrawal R, Srikant R, editors. Fast algorithms for mining association rules. Proc 20th int conf very large data bases, VLDB; 1994. [Google Scholar]

[R42] 42.Hipp J, Güntzer U, Nakhaeizadeh G. Algorithms for association rule mining—a general survey and comparison. ACM sigkdd explorations newsletter. 2000;2(1):58–64. [Google Scholar]

[R43] 43.Bay SD, Pazzani MJ. Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery. 2001;5(3):213–46. [Google Scholar]

[R44] 44.Dong G, Li J, editors. Efficient mining of emerging patterns: Discovering trends and differences. Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining; 1999; ACM; [Google Scholar]

[R45] 45.Novak PK, Lavrač N, Webb GI. Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining. J Mach Learn Res. 2009;10(Feb):377–403. [Google Scholar]

[R46] 46.Gogarten SM, Bhangale T, Conomos MP, et al. GWASTools: an R/Bioconductor package for quality control and analysis of Genome-Wide Association Studies. Bioinformatics. 2012;28(24):3329–31. doi: 10.1093/bioinformatics/bts610. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R47] 47.Johnson RC, Nelson GW, Troyer JL, et al. Accounting for multiple comparisons in a genome-wide association study (GWAS). BMC genomics. 2010;11(1):724. doi: 10.1186/1471-2164-11-724. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R48] 48.Horvath S, Xu X, Laird NM. The family based association test method: strategies for studying general genotype-phenotype associations. European journal of human genetics: EJHG. 2001;9(4):301. doi: 10.1038/sj.ejhg.5200625. [DOI] [PubMed] [Google Scholar]

[R49] 49.Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the royal statistical society Series B (Methodological). 1995:289–300. [Google Scholar]

[R50] 50.Basu SN, Kollu R, Banerjee-Basu S. AutDB: a gene reference resource for autism research. Nucleic acids research. 2009;37(suppl 1):D832–D6. doi: 10.1093/nar/gkn835. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R51] 51.Yuen RK, Merico D, Bookman M, et al. Whole genome sequencing resource identifies 18 new candidate genes for autism spectrum disorder. Nature Neuroscience. 2017. doi: 10.1038/nn.4524. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R52] 52.Bremer A, Giacobini M, Eriksson M, et al. Copy number variation characteristics in subpopulations of patients with autism spectrum disorders. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics. 2011;156(2):115–24. doi: 10.1002/ajmg.b.31142. [DOI] [PubMed] [Google Scholar]

[R53] 53.French L, Pavlidis P. Relationships between gene expression and brain wiring in the adult rodent brain. PLoS computational biology. 2011;7(1):e1001049. doi: 10.1371/journal.pcbi.1001049. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R54] 54.Piton A, Gauthier J, Hamdan F, et al. Systematic resequencing of X-chromosome synaptic genes in autism spectrum disorder and schizophrenia. Molecular psychiatry. 2011;16(8):867–80. doi: 10.1038/mp.2010.54. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R55] 55.Gilman SR, Iossifov I, Levy D, et al. Rare de novo variants associated with autism implicate a large functional network of genes involved in formation and function of synapses. Neuron. 2011;70(5):898–907. doi: 10.1016/j.neuron.2011.05.021. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R56] 56.Krishnan A, Zhang R, Yao V, et al. Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder. Nature neuroscience. 2016;19(11):1454–62. doi: 10.1038/nn.4353. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R57] 57.Bredesen DE. Metabolic profiling distinguishes three subtypes of Alzheimer’s disease. Aging (Albany NY). 2015;7(8):595–600. doi: 10.18632/aging.100801. [DOI] [PMC free article] [PubMed] [Google Scholar]
