Abstract
Analysis of de novo mutations (DNMs) from sequencing data of nuclear families has identified risk genes for many complex diseases, including multiple neurodevelopmental and psychiatric disorders. Most of these efforts have focused on mutations in protein-coding sequences. Evidence from genome-wide association studies (GWASs) strongly suggests that variants important to human diseases often lie in non-coding regions. Extending DNM-based approaches to non-coding sequences is challenging, however, because the functional significance of non-coding mutations is difficult to predict. We propose a statistical framework for analyzing DNMs from whole-genome sequencing (WGS) data. This method, TADA-Annotations (TADA-A), is a major advance of the TADA method we developed earlier for DNM analysis in coding regions. TADA-A is able to incorporate many functional annotations such as conservation and enhancer marks, to learn from data which annotations are informative of pathogenic mutations, and to combine both coding and non-coding mutations at the gene level to detect risk genes. It also supports meta-analysis of multiple DNM studies, while adjusting for study-specific technical effects. We applied TADA-A to WGS data of ∼300 autism-affected family trios across five studies and discovered several autism risk genes. The software is freely available for all research uses.
Keywords: autism, psychiatric disorders, de novo mutations, statistical model, noncoding sequences, epigenomics
Introduction
De novo mutations (DNMs) arise spontaneously in offspring and are often detected by sequencing families with disease occurrences, usually trios of parents and affected children (trio-sequencing). Researchers can identify risk genes by searching for genes that harbor more de novo mutations in affected offspring than expected by chance. This approach has been highly successful in studying a range of developmental and psychiatric disorders including autism, intellectual disability, schizophrenia, epilepsy, and congenital heart disease.1, 2, 3, 4, 5, 6 DNMs tend to have larger effects than standing variants because they have not yet been acted on by natural selection. The DNM approach may be particularly helpful for early-onset diseases because standing risk variants for these phenotypes are rare and hard to identify with GWASs.
Most existing work on DNMs focuses on mutations in protein-coding regions. Even when whole-genome sequence (WGS) data are available, researchers often analyze only the coding portion of the genome due to a lack of analytic tools for non-coding mutations.7, 8, 9 However, the majority of disease-associated variants identified by GWASs are located in non-coding sequences, potentially affecting gene regulation rather than protein function. This suggests that non-coding DNMs represent a large, currently unexplored source of genetic variation that can aid gene discovery. Knowledge of non-coding disease variants would provide additional benefits. Because the activity of regulatory elements tends to be cell type specific, the analysis of DNMs could offer clues as to which cell types are most relevant to disease etiology. A key research challenge is thus to provide an analytic framework for DNM data that incorporates non-coding mutations from WGS studies of disease-affected families.
Current tools for DNM analysis perform some kind of “burden test,” which evaluates whether the number of mutations in a gene is larger than would be expected by chance. He et al. proposed the method TADA for DNM analysis, which effectively performs a weighted Bayesian burden analysis.10 TADA divides all mutations into categories, such as nonsense and missense mutations. Mutations in each category are weighted according to how damaging they are expected to be, with the weights for each category learned from the data. Another method, FitDNM, similarly performs weighted burden analysis, but the weights are assumed to be known (from an external source) instead of being estimated from data.11
Unlike for protein-coding sequences, there is no simple genetic code that lets researchers predict the functional effects of non-coding mutations, and it is thus difficult to assign them to simple categories. Instead, we can describe non-coding mutations using a number of overlapping genomic annotations such as tissue-specific epigenomic marks and cross-species conservation. Which annotations are relevant to a particular disease is not known a priori. Additionally, each annotation may be only weakly informative of pathogenic variants, so we may need to combine multiple annotations. Existing tests developed for de novo coding mutations cannot handle such complications: TADA can handle only disjoint categories of DNMs and has been used with only a small number of mutational categories; FitDNM is designed for exome-sequencing data, assuming that the probability of a variant affecting protein function is known (from PolyPhen-2).11
In this work, we present a statistical framework for analysis of DNMs, which we call TADA-Annotations (or TADA-A). TADA-A uses a probabilistic model of mutation counts for each position in the genome (Figure 1A). Specifically, we model the mutation counts as following a Poisson distribution, and the background mutation rates depend on covariates such as types of nucleotide changes and local GC content. We expect the mutation rates in positions assigned to a disease-associated gene to be elevated compared with background rates, and the fold increases depend on functional annotations in a log-linear model. TADA-A offers several features important for WGS-based DNM studies. First, the model can take an arbitrary number of possibly overlapping annotations and learn from the data which annotations are enriched for causal mutations. No arbitrary weighting scheme or variant filtering is needed. Second, the method predicts risk genes by combining information in both coding and non-coding regions. In addition, the information from coding mutations may come from an independent study, allowing a WGS study to borrow strength from published whole-exome sequencing (WES) studies. Finally, TADA-A supports meta-analysis of multiple WGS studies. It adjusts for possible differences in technical factors across studies by fitting a different background mutation model for each study.
We apply TADA-A to study the contribution of non-coding sequences in autism spectrum disorder (ASD). WES studies using DNMs in autism-affected families have identified 65 ASD risk genes, highlighting the importance of DNMs in the study of autism.1, 12, 13, 14 Recently, efforts have been expanded to whole-genome sequencing of ASD-affected families. Two studies reported modest enrichment of functional non-coding DNMs near known ASD genes in autistic children, compared with control subjects or unaffected siblings.8, 15 However, none of the published work has utilized non-coding DNMs to map specific risk genes or functional elements. We use TADA-A to analyze a collection of five whole-genome DNM datasets, leveraging a number of genomic annotations. We find that brain enhancers marked by H3K27ac, conserved brain enhancers marked by H3K27ac and high GERP scores (GERP > 2), and regions predicted to affect splicing have increased rates of DNMs in ASD-affected case subjects. Our conservative estimates suggest that regulatory non-coding mutations contribute to about a third of de novo autism risk (i.e., autism risk attributable to all DNMs). Using the DNMs from WGS data as well as a published WES study, we were able to identify four ASD risk genes at an FDR < 0.1. Multiple lines of evidence support the possible roles of these genes in ASD.
Material and Methods
TADA-A Model
TADA-A works in two stages: first, it calibrates the background mutation rates (mutation model); second, it learns which functional annotations are predictive of causal mutations and infers the risk genes (functional model). In the mutation model step, we assume that we have un-calibrated base-level mutation rates (summation over all possible allele-specific mutation rates at each base) from external data, e.g., from human-chimp comparison. We used the trinucleotide-based mutation rates table from Samocha et al. as our baseline rates.16 These baseline mutation rates are solely based on the intergenic divergence between humans and chimps. For a particular study, observed mutation rates may differ from these un-calibrated rates as a result of study-specific technical factors. For example, lower sequencing depth reduces the number of called DNMs. Mutation rates may also depend on local genomic features such as GC content. To account for this variability, we calibrate the background mutation rates for each study. To simplify the computation, TADA-A collapses DNMs in a 50-bp genomic window into a single count. We model these mutation counts with a Poisson generalized linear model (GLM):
$$y_i \sim \mathrm{Pois}\left(N \mu_i^{0} \exp\left(\sum_k \alpha_k x_{ik}\right)\right)$$
where $y_i$ is the DNM count of window $i$, $N$ is the number of individuals, $\mu_i^{0}$ is the un-calibrated baseline mutation rate of window $i$ (summing up the mutation rates of bases in window $i$), and the exponential term represents the deviation of the actual mutation rate of window $i$ from $\mu_i^{0}$. The variable $x_{ik}$ is the $k$th mutation-related feature of window $i$, and $\alpha_k$ represents the effect of mutation feature $k$. The mutation features may include GC content, whether a sequence is transcribed, etc. TADA-A uses glm() in R to estimate the coefficients of the genomic features. After fitting the model, the calibrated allele-specific mutation rate at each base is the un-calibrated allele-specific mutation rate multiplied by a factor, $\exp\left(\sum_k \hat{\alpha}_k x_{ik}\right)$, which is calculated for the window containing that base.
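The calibration step can be illustrated with a small sketch. TADA-A itself uses glm() in R; the Python below is only an illustration that fits a Poisson regression with an offset by Newton's method, assuming an intercept plus a single mutation feature. The function name, the toy windows, and all numeric values are hypothetical and do not come from the study.

```python
import math

def fit_poisson_offset(counts, exposure, feature, n_iter=50):
    """Fit counts[i] ~ Poisson(exposure[i] * exp(a + b * feature[i])).

    exposure[i] stands for N times the un-calibrated rate of window i;
    a is an intercept and b the effect of one mutation feature.
    """
    # Start the intercept at the log of the overall observed/expected ratio.
    a, b = math.log(sum(counts) / sum(exposure)), 0.0
    for _ in range(n_iter):
        ga = gb = haa = hab = hbb = 0.0
        for y, e, x in zip(counts, exposure, feature):
            m = e * math.exp(a + b * x)  # fitted mean of window i
            ga += y - m                  # gradient w.r.t. a
            gb += (y - m) * x            # gradient w.r.t. b
            haa += m                     # information matrix entries
            hab += m * x
            hbb += m * x * x
        det = haa * hbb - hab * hab
        # Newton step: solve the 2x2 system (information) * step = gradient.
        da = (hbb * ga - hab * gb) / det
        db = (haa * gb - hab * ga) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-12:
            break
    return a, b

# Toy windows (hypothetical). Counts are set to their expected values,
# which are not integers, purely so the fit can be checked exactly.
exposure = [0.5, 1.0, 2.0, 1.5, 0.8, 1.2]
gc = [0.30, 0.40, 0.50, 0.60, 0.70, 0.45]
true_a, true_b = 0.2, -0.5
counts = [e * math.exp(true_a + true_b * x) for e, x in zip(exposure, gc)]

a_hat, b_hat = fit_poisson_offset(counts, exposure, gc)
# Per-window factor that multiplies the un-calibrated rates.
calibration_factor = [math.exp(a_hat + b_hat * x) for x in gc]
```

With real data the counts are integers and several features enter the model, so a general GLM routine is the more practical choice; the sketch only shows where the offset and the calibration factor come from.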
In the functional model, we model the dependency of allele-specific mutation rates on the gene status (risk gene or not) and functional genomic annotations. We make the functional model allele aware because many annotations are allele specific. For example, a de novo SNV could be nonsynonymous or synonymous, depending on what the mutant allele is. Noncoding annotations, such as CADD and SPIDEX scores, are also dependent on the genotypes of mutant alleles. As described in the main text, we assume that all mutations have been uniquely assigned to genes. Let $Z_g$ be a binary indicator of whether gene $g$ is a risk gene or not. When gene $g$ is a risk gene ($Z_g = 1$), the number of DNMs mutating to allele $t$ at base $i$, $y_{it}$, aggregated from affected individuals follows a Poisson distribution
$$y_{it} \mid Z_{g(i)} = 1 \sim \mathrm{Pois}\left(N \mu_{it} \exp\left(\sum_k \beta_k A_{itk}\right)\right)$$
where $g(i)$ is the gene that base $i$ belongs to, $N$ is the number of affected individuals, $\mu_{it}$ is the calibrated mutation rate to allele $t$ of base $i$ from the previous step, $A_{itk}$ is the $k$th genomic annotation of base $i$ if mutated to allele $t$, and $\beta_k$ is the effect of the $k$th annotation. Note that we consider annotations related to function at this step, such as conservation and enhancer activity. Since the annotations are binary, $\exp(\beta_k)$ is the relative risk of the $k$th annotation, i.e., the fold increase of mutation rates in positions with that annotation versus those without the annotation. If gene $g$ is a non-risk gene ($Z_g = 0$), the number of DNMs mutating to allele $t$ at base $i$ simply follows $\mathrm{Pois}(N \mu_{it})$.
Let $Y_g$ be the DNM counts assigned to gene $g$. Assuming that DNM events are independent, the likelihood of $Y_g$ given $Z_g$ and $\beta$ is simply the product of the probabilities of DNM counts at all bases over all possible mutant alleles, according to the equations above. Letting $\pi_g$ be the prior probability of gene $g$ being a risk gene, we have the full likelihood over all genes:
$$P(Y \mid \beta, \pi) = \prod_g \left[\pi_g P(Y_g \mid Z_g = 1, \beta) + (1 - \pi_g) P(Y_g \mid Z_g = 0)\right]$$
TADA-A implements two options for estimating parameters, both based on maximum likelihood. In the first option, $\pi_g$ is the same for every gene, and we estimate its value by maximum likelihood jointly with $\beta$. In the second option, we use informative priors for $\pi_g$ of all genes from external data, and we estimate only $\beta$. The confidence intervals of the parameters are based on standard asymptotic approximations using the Fisher information matrix. When we have multiple annotations in the model, TADA-A uses a standard feature selection protocol to choose annotations. Specifically, it first fits a model with each single annotation and selects those whose coefficients are significantly different from 0 (i.e., whose 95% confidence intervals exclude 0). It then refits the model jointly with the selected features.
Once we estimate the parameters, we compute the Bayes factor (BF) of a gene $g$ as
$$B_g = \frac{P(Y_g \mid Z_g = 1, \hat{\beta})}{P(Y_g \mid Z_g = 0)}$$
where the probabilities are evaluated at the MLE of parameter values. In our ASD analysis, we further multiply the BFs from non-coding analysis with the BFs from previous results based on coding mutations to obtain final BFs for all genes. We control for multiple testing using the Bayesian FDR control procedure.17
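To make the Bayes factor concrete, the sketch below assumes that a site's count is Poisson with mean N times the calibrated rate times the relative risk for a risk gene, and N times the calibrated rate for a non-risk gene. Under that assumption the factorial terms cancel in the ratio, and each site contributes y·log(γ) − N·μ·(γ − 1) to the log BF. The function names and the toy sites are hypothetical; only the sample size of 314 is taken from the text.

```python
import math

def relative_risk(beta, annotations):
    """Relative risk exp(sum_k beta_k * A_k) for binary annotations A_k."""
    return math.exp(sum(b * a for b, a in zip(beta, annotations)))

def log_bayes_factor(sites, n, beta):
    """Log BF of one gene: risk-gene model versus non-risk model.

    sites: iterable of (count, calibrated_rate, annotations), one entry
           per base and mutant allele assigned to the gene.
    n:     number of affected individuals.
    """
    log_bf = 0.0
    for y, mu, annot in sites:
        gamma = relative_risk(beta, annot)
        # Ratio of Pois(n*mu*gamma) to Pois(n*mu) probabilities, in logs.
        log_bf += y * math.log(gamma) - n * mu * (gamma - 1.0)
    return log_bf

# Hypothetical gene with three allele-specific sites and two annotations.
beta = [math.log(3.0), math.log(2.0)]
sites = [
    (1, 1e-8, [1, 0]),  # one DNM at a site carrying annotation 1
    (0, 2e-8, [0, 0]),  # unannotated site: relative risk 1, contributes 0
    (0, 1e-8, [1, 1]),  # no DNM at a site carrying both annotations
]
example_log_bf = log_bayes_factor(sites, n=314, beta=beta)
```

Sites without any DNM still contribute a small negative term, which is why the likelihood has to range over all bases and alleles rather than only the mutated ones.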
Because our likelihood is defined over all bases and all possible mutant alleles, including those possessing no DNM events, naive parameter estimation is computationally expensive. To alleviate this computational burden, we reformulate the likelihood function by collapsing mutations over all bases sharing the same set of annotations, assuming all annotations are discrete. This strategy greatly reduces the computation time (Supplemental Methods). TADA-A software is available at GitHub (Web Resources).
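The idea of collapsing can be sketched as follows. The actual reformulation is given in the Supplemental Methods, which are not reproduced here, so this is only an illustration under the assumption that each site contributes y·log(γ) − N·μ·(γ − 1) to a gene's log likelihood ratio. Sites with the same annotation pattern share the same γ, so only the summed counts and summed rates per pattern are needed. All names and numbers are hypothetical.

```python
import math
from collections import defaultdict

def log_lr_per_site(sites, n, beta):
    """Reference computation: loop over every (count, rate, annotations)."""
    total = 0.0
    for y, mu, annot in sites:
        log_gamma = sum(b * a for b, a in zip(beta, annot))
        total += y * log_gamma - n * mu * (math.exp(log_gamma) - 1.0)
    return total

def log_lr_collapsed(sites, n, beta):
    """Same quantity after pooling sites that share an annotation pattern."""
    pooled = defaultdict(lambda: [0, 0.0])  # pattern -> [count sum, rate sum]
    for y, mu, annot in sites:
        entry = pooled[tuple(annot)]
        entry[0] += y
        entry[1] += mu
    total = 0.0
    for pattern, (y_sum, mu_sum) in pooled.items():
        log_gamma = sum(b * a for b, a in zip(beta, pattern))
        total += y_sum * log_gamma - n * mu_sum * (math.exp(log_gamma) - 1.0)
    return total

beta = [0.8, 1.5]
sites = [
    (0, 1.0e-8, [1, 0]), (1, 1.5e-8, [1, 0]), (0, 2.0e-8, [0, 0]),
    (0, 0.5e-8, [1, 1]), (1, 1.2e-8, [1, 1]), (0, 3.0e-8, [0, 0]),
]
per_site = log_lr_per_site(sites, 314, beta)
collapsed = log_lr_collapsed(sites, 314, beta)
```

The pooled sums do not depend on the annotation coefficients, so in this sketch they could be computed once and reused at every step of the optimization.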
DNMs from Whole-Genome Sequencing Data
The detailed information for each WGS dataset is summarized in Table S1. To remove erroneously called de novo SNVs, we excluded 8 individuals with more than 140 (2 times more than the median of ASD DNMs per individual) DNMs and removed all recurrent DNMs (i.e., exactly the same mutation in multiple individuals). Our unpublished DNM data are from WGS of 32 ASD trios of Han Chinese ancestry (EMBL-EBI: PRJEB14713; data URL provided in Web Resources, details in Supplemental Methods). These filtered data were used for all the analyses in this manuscript.
We also tried filtering out DNMs with a high allele frequency in GnomAD or BRAVO databases, as this could be one way of filtering sequencing errors. There are 167 mutations in case subjects and 18 in control subjects that have allele frequency more than 0.01 in either GnomAD or BRAVO. These mutations are not found in any of the ASD-associated genes we identified. We found that removing these mutations did not change the model parameters (Figure S4), so this filter has little impact on our results.
Non-coding Annotations Used in Analyzing ASD Data
For histone modifications, we used H3K27ac sites in fetal and adult brains to define cis-regulatory regions. Fetal brain sites from human cortex at embryonic stages 7, 8.5, and 12 p.c.w. were obtained from a recent study.18 For each stage, only peak regions consistent between two biological replicates were selected. Adult brain H3K27ac sites were obtained from Roadmap Epigenomics Project.19 They include regions from human angular gyrus, anterior caudate, cingulate gyrus, middle hippocampus, inferior temporal lobe, mid-frontal lobe, and substantia nigra. We used MACS2 to call peaks from raw data and kept only peak regions consistent between two biological replicates for each brain region. We used BEDtools20 to merge H3K27ac sites from fetal and adult brain.
For DNase I hypersensitivity sites (DHSs), fetal brain DNase I sites were downloaded from Roadmap Epigenomics (male and female fetal brain) and adult brain DHS data were downloaded from ENCODE (Cerebrum_frontal_OC, Frontal_cortex_OC, and Cerebellum_OC).
For conservation scores, we used ANNOVAR to obtain GERP++ scores for all mutations.21, 22 We binarized GERP (a base is considered to be conserved if GERP is greater than 2).
For CADD, we downloaded publicly available CADD scores (default parameters, v1.3) and binarized the scores (deleterious if one allele has a CADD score greater than 15).
For splicing scores, we used results from SPIDEX, a deep learning-based approach to annotate variants that may affect splicing.23 An SNV is considered a splicing SNV if its delta-psi score is less than −1.416, which is the 10th percentile of all positions with SPIDEX scores.
Meta-analysis Strategy and Applying TADA-A to ASD
In the ASD study, we first calibrate mutation rates of each study separately using the mutation model of TADA-A (Poisson regression), as described in Results. We then fit the functional model of TADA-A with non-coding annotations listed in the previous section. Since the calibrated mutation rates for any base may differ between studies, we calculated the likelihood of all genes for each study separately, and then multiplied the likelihoods over all studies to get a total likelihood. Note that the coefficients of the functional annotations are shared among multiple studies. We then estimate parameters via maximum likelihood. We take advantage of a previous autism study to set the prior probabilities of risk genes, $\pi_g$.2 Specifically, we convert the Bayes factors reported in that study to posterior probabilities (assuming each gene has a 6% chance of being an ASD gene) and use these probabilities as $\pi_g$. For computational reasons, we used the top 1,000 genes, ranked by $\pi_g$, in estimating the annotation parameters, since these genes are the most informative of parameters. After the first round of feature selection using all 12 annotations, only brain H3K27ac, brain H3K27ac + GERP > 2, and splicing effects had significant effect sizes, so we refit the model with only these features jointly. In the feature selection step, we define the search space of log(relative risk) to be from −1 to 10 to cover a wide range of possible effect sizes. Some annotations have the log(relative risk) estimated as −1 due to this boundary limitation.
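The conversion from a Bayes factor to a posterior probability follows from Bayes' rule: with prior π and Bayes factor B, the posterior is πB / (πB + 1 − π). The sketch below applies this with the 6% prior stated above and keeps the top-ranked genes. The gene names and Bayes factors are hypothetical placeholders, and the function names are ours.

```python
def posterior_from_bf(bf, prior=0.06):
    """Posterior probability of being a risk gene given a Bayes factor."""
    return prior * bf / (prior * bf + 1.0 - prior)

def top_genes_by_prior(coding_bfs, n_top, prior=0.06):
    """Convert coding BFs to per-gene priors and keep the n_top largest."""
    priors = {g: posterior_from_bf(bf, prior) for g, bf in coding_bfs.items()}
    ranked = sorted(priors, key=priors.get, reverse=True)
    return {g: priors[g] for g in ranked[:n_top]}

# Hypothetical coding Bayes factors for four placeholder genes.
coding_bfs = {"GENE_A": 500.0, "GENE_B": 1.0, "GENE_C": 20.0, "GENE_D": 0.2}
selected = top_genes_by_prior(coding_bfs, n_top=2)
```

A Bayes factor of 1 leaves the prior unchanged at 0.06, which is a quick sanity check on the formula.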
To validate our relative risk estimation results, we used two different sets of informative priors. The first set is based on the FDRs of predicted ASD genes using a human brain-specific interaction network.24 For each gene, we derived its prior as 1 – FDR (we used the top 1,000 genes for estimating relative risks). The second set is based on a collection of 2,601 genes implicated in neuropsychiatric disorders (see the definition of “neuropsychiatric genes” in the next section for details). We assigned each neuropsychiatric gene a prior of 0.431 to make the expected number of ASD genes consistent with the estimate that 0.06 of 18,665 protein-coding genes are ASD risk genes.10, 25
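The per-gene prior for the neuropsychiatric gene set follows from the numbers in the paragraph above, on the reading that the expected number of ASD risk genes is spread evenly over the 2,601 genes. The variable names are ours.

```python
n_protein_coding = 18665    # protein-coding genes
risk_fraction = 0.06        # estimated fraction that are ASD risk genes
n_neuropsychiatric = 2601   # genes in the neuropsychiatric set

# Expected number of ASD risk genes genome-wide.
expected_risk_genes = risk_fraction * n_protein_coding

# Prior per neuropsychiatric gene so that the set as a whole carries
# the same expected number of risk genes.
prior_per_gene = expected_risk_genes / n_neuropsychiatric
```

Rounded to three decimals, this gives the 0.431 used in the text.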
To identify specific ASD risk genes, we first derived the noncoding BF of each gene using data from each WGS dataset, then multiplied these BFs to get a total noncoding BF for each gene. This BF for a gene is multiplied with the published coding BF based on WES studies.2 We then estimated q-values as mentioned previously.17 We used $\pi = 0.06$ as the fraction of ASD risk genes in this step.10, 25 We call a gene a “novel ASD gene” (at a particular q-value cutoff) if its final q-value falls below the cutoff and its coding q-value from the previous study is above the cutoff. A gene that has no noncoding DNM will not be considered a new finding (even if a gene has no evidence from non-coding mutations, its q-value could still change).
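The gene-level step can be sketched as below. The q-value calculation shown is one common form of Bayesian FDR control, in which the q-value of a gene is the average posterior null probability among all genes ranked at least as high. We assume, without confirmation from the text, that this matches the cited procedure. The Bayes factors are hypothetical and the function names are ours.

```python
import math

def combined_bf(coding_bf, noncoding_bfs):
    """Multiply the coding BF by the noncoding BFs from each WGS dataset."""
    return coding_bf * math.prod(noncoding_bfs)

def bayesian_q_values(bfs, pi=0.06):
    """Bayesian FDR q-values from Bayes factors and a risk-gene fraction."""
    posterior = [pi * b / (pi * b + 1.0 - pi) for b in bfs]
    order = sorted(range(len(bfs)), key=lambda i: posterior[i], reverse=True)
    q_values = [0.0] * len(bfs)
    running_null = 0.0
    for rank, i in enumerate(order, start=1):
        running_null += 1.0 - posterior[i]  # posterior probability of null
        q_values[i] = running_null / rank   # mean over the top `rank` genes
    return q_values

# Hypothetical genes: (coding BF, noncoding BFs from five studies).
genes = [(40.0, [5.0, 1.0, 5.0, 1.0, 1.0]), (1.0, [1.0] * 5), (0.5, [0.2] + [1.0] * 4)]
total_bfs = [combined_bf(c, nc) for c, nc in genes]
q_values = bayesian_q_values(total_bfs)
```

Because each q-value averages over all higher-ranked genes, adding noncoding evidence for some genes can shift the q-values of genes that have no noncoding DNMs at all, consistent with the remark in the paragraph above.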
Definition of Gene Lists Used in the Analysis
Known ASD genes (194 genes) include genes with q-value < 0.3 from a combined analysis of CNVs, indels, and WES data using TADA,2 SFARI category I (high confidence), and SFARI category II (strong candidate) genes. Neuropsychiatric genes (2,601 genes) are a larger set of genes likely involved in neuropsychiatric disorders, including genes with TADA q-value < 0.5,2 SFARI genes (category I high confidence, category II strong evidence), AutismKB genes,26 ASD risk genes summarized in a previous study,27 intellectual disability genes,28 the union of gene sets enriched with SCZ de novo coding mutations,29 high-confidence postsynaptic density genes,30 and FMRP targets.31 The set of nonASD genes consists of the 1,000 genes with the highest TADA q-values.2 Intolerant genes include genes with top 5% RVIS32 and haploinsufficient genes obtained from two sources, one using copy number variations (genes with predicted haploinsufficient probability greater than 0.95)33 and the other using estimated mutation rate.34 To define tolerant genes, we started with genes with RVIS scores in the bottom 10%,32 genes with haploinsufficient probability smaller than 0.1,33 and genes that were used as control genes for LoF-deficient genes.34 We then removed any genes that were in the intolerant gene set. To define gene groups based on their expression levels, we used the average expression level for each gene across all developing brain tissues from BrainSpan data.
Burden Analysis of Different Types of De Novo Coding Mutations
In our burden analysis, we accounted for the difference in mutation rates between ASD-affected subjects (∼60/individual) and control subjects (∼39/individual) using “background sequences” (sequences/mutations not expected to have function). Specifically, to assess the burden of nonsynonymous DNMs in ASD-affected subjects versus control subjects, we used the numbers of synonymous SNVs in ASD-affected subjects and control subjects as background.
We tested whether nonsynonymous DNMs were enriched in ASD-affected subjects versus control subjects using Fisher’s exact test, and the burden was defined as the odds ratio (OR) from the 2 by 2 test.
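The burden test above can be sketched with a small pure-Python version of Fisher's exact test on a 2 by 2 table of nonsynonymous versus synonymous DNM counts in case and control subjects. The text does not say whether the test was one- or two-sided; the one-sided (enrichment) form is used here as an assumption. The counts are made up for illustration and the function names are ours.

```python
from math import comb

def odds_ratio(a, b, c, d):
    """Sample odds ratio of the table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

def fisher_one_sided(a, b, c, d):
    """P(top-left cell >= a) with all table margins held fixed.

    Rows: case, control. Columns: nonsynonymous, synonymous (background).
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    k_max = min(row1, col1)
    # math.comb returns 0 when the lower index exceeds the upper one, so
    # impossible tables contribute nothing to the sum.
    tail = sum(comb(row1, k) * comb(row2, col1 - k) for k in range(a, k_max + 1))
    return tail / denom

# Made-up counts: nonsynonymous and synonymous DNMs in cases and controls.
case_nonsyn, case_syn = 3, 1
ctrl_nonsyn, ctrl_syn = 1, 3
burden_or = odds_ratio(case_nonsyn, case_syn, ctrl_nonsyn, ctrl_syn)
burden_p = fisher_one_sided(case_nonsyn, case_syn, ctrl_nonsyn, ctrl_syn)
```

Using synonymous counts as the second column means a study-wide difference in DNM detection rates between cases and controls scales both columns alike and therefore does not inflate the odds ratio.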
Assessing Contribution of DNMs to ASD Risk
We treated ASD liability (risk) as a continuous trait and estimated the percentage of variance in ASD liability explained by five types of mutations. The variation of ASD liability explained by the $j$th type of mutations is expressed as $a_j^2 p_j (1 - p_j)$, where $a_j$ is the effect size of the $j$th type of mutations at the liability scale and $p_j$ is the probability that an individual carries a mutation of type $j$ (see Supplemental Methods). Note that only causal mutations contribute to ASD liability, so both $a_j$ and $p_j$ are defined for mutations affecting causal genes. We calculated $a_j$ from the relative risk of the $j$th type of mutation using standard quantitative genetic calculations. To obtain $p_j$, we calculated the total mutation rate of type $j$ mutations and then multiplied this by 0.06 (fraction of ASD risk genes) to obtain the rate of causal mutations of type $j$.
Network Analysis of Candidate ASD Genes
We used two tools, DAWN and GeneMania, to analyze the connectivity pattern of our candidate ASD genes in gene networks. The DAWN (detecting association with networks) algorithm35, 36 is a guilt-by-association-based gene prediction algorithm. Its fundamental assumption is that risk genes tend to be functionally related with each other, and thus tightly connected in gene networks. A gene has a high posterior risk probability if it has a high prior risk probability, interacts in a network with other risk genes, or both. The prior risk probabilities came from published WES results.25 For the underlying network, we constructed partial co-expression networks for two spatial-temporal windows: the mid-fetal prefrontal cortex (PFC) and the infancy mediodorsal cerebellar cortex (MD-CBC), which have been implicated as high-risk windows for ASD.37 The BrainSpan microarray dataset was used as the source for spatial-temporal gene expression data. DAWN was run separately for each of these networks. We used regularization parameter (lambda) = 0.12, p value cutoff = 0.1, and correlation thresholds of 0.7 for PFC and 0.85 for MD-CBC. In Table 2, posterior risk scores (q-values) are shown for the candidate genes. A dash means that the corresponding gene is not co-expressed with other risk genes in any of the spatial-temporal windows.
Table 2.
Gene Name | NRXN1ᵃ | APBB1 | TANC2 | PNPLA7 | Enrichment p Value |
---|---|---|---|---|---|
LoF | 1 | 1 | 1 | 1 | – |
Mis3 | 1 | 0 | 1 | 1 | – |
Regulatory SNV | 0 | 1 | 0 | 0 | – |
Conserved regulatory SNV | 0 | 2 | 0 | 0 | – |
Splicing SNV | 1 | 0 | 1 | 1 | – |
HI | Y | Y | Y | N | 2.93 × 10⁻³ |
RVIS (%) | 2.25 | 19.93 | 0.67 | 64.97 | 0.037 |
ExAC zscore (%) | 3.32 | 14.25 | 1.32 | 76.73 | 0.042 |
FMRP targets | Y | Y | Y | N | 3.20 × 10⁻⁴ |
BrainSpan expression (%) | 12.03 | 3.41 | 18.09 | 69.90 | 0.049 |
DAWN | 0.001 | – | – | – | 0.30 |
In the evidence rows, Y means overlap with a gene set and N otherwise. Lower RVIS and ExAC z score percentiles correspond to higher constraint. Lower BrainSpan percentiles correspond to higher brain expression. Enrichment p values were calculated by hypergeometric tests. For RVIS, ExAC z score, and BrainSpan, we tested the enrichment of “novel ASD genes” in genes in the lower quartiles. In DAWN analysis, we tested the enrichment of “novel ASD genes” in genes with DAWN q-value < 0.05. The DAWN q-value for each gene in the table is the minimum of the q-values of that gene in two brain regions, mid-fetal prefrontal cortex (PFC) and infancy mediodorsal cerebellar cortex (MD-CBC).
ᵃ NRXN1 was not identified as a significant ASD gene with WES de novo SNV data by Sanders et al.,2 but it was identified with the inclusion of small deletions.
GeneMania38 is a tool for studying interactions among genes in a network using various types of information, such as gene co-expression and protein-protein interactions (PPIs). We studied the connections between our candidate genes and high-confidence ASD genes (genes with coding TADA FDR < 0.1 and genes in SFARI categories I and II) using co-expression data. The significance of the number of connections is assessed by randomly sampling gene sets of the same size as the candidate gene set.
Enhancers with Recurrent De Novo SNVs
We used all brain H3K27ac regions not overlapping with exons (not limited to sequences within 10 kb of TSS). We observed 25 enhancers with at least two de novo SNVs in ASD samples, and we performed simulations to assess significance. In each simulation, we randomly re-distributed de novo SNVs of all brain enhancers, following a multinomial distribution. The multinomial probability of an enhancer is the ratio between the calibrated mutation rate of that enhancer and the sum of calibrated mutation rates across all enhancers. (For each study, we first calibrated the trinucleotide-based mutation rates of all enhancers, accounting for sample size, GC content, and local 1-Mb human-macaque divergence. We then added up these study-specific calibrated mutation rates across the five studies to get the total calibrated mutation rate for each enhancer.) We performed simulations 10,000 times and obtained the distribution of the number of enhancers with recurrent SNVs.
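The simulation described above can be sketched as follows. Each simulated SNV is assigned to an enhancer with probability proportional to that enhancer's calibrated mutation rate, and we count enhancers hit at least twice. The text reports only that the simulated distribution was obtained, so the empirical p value below (the fraction of simulations at least as extreme as the observation) is our assumption. The rates, counts, seed, and function names are all hypothetical.

```python
import random
from collections import Counter

def simulate_recurrent(rates, n_snvs, n_sim, seed=0):
    """Number of enhancers with >= 2 SNVs in each of n_sim simulations."""
    rng = random.Random(seed)
    enhancers = range(len(rates))
    results = []
    for _ in range(n_sim):
        # Multinomial draw: each SNV lands on enhancer i with probability
        # rates[i] / sum(rates); choices() normalizes the weights itself.
        hits = Counter(rng.choices(enhancers, weights=rates, k=n_snvs))
        results.append(sum(1 for c in hits.values() if c >= 2))
    return results

def empirical_p(simulated, observed):
    """Fraction of simulations with at least `observed` recurrent enhancers."""
    return sum(s >= observed for s in simulated) / len(simulated)

# Hypothetical calibrated rates for 200 enhancers and 30 observed SNVs.
rates = [1.0 + (i % 5) for i in range(200)]
simulated = simulate_recurrent(rates, n_snvs=30, n_sim=1000)
p_value = empirical_p(simulated, observed=6)
```

Drawing in proportion to the calibrated rates means that long or highly mutable enhancers are expected to show recurrence by chance more often, which is what the comparison needs to control for.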
Power Analysis
We generated de novo mutation data for all genes in the human genome (∼18,700 genes) using the TADA-A model (see Supplemental Methods for details). In brief, we performed five simulations for each sample size, defined as the number of trios. For each iteration, we randomly assigned genes to ASD risk genes with a probability of 0.06, based on previous estimates.10, 25 For each risk gene, we sampled DNMs from each category (LoF, Mis3, less conserved regulatory SNVs, conserved regulatory SNVs, and splicing SNVs) according to allele-specific mutation rates and the average relative risks of these mutational categories, based on the TADA-A model. For non-risk genes, we set the relative risk at 1 for all mutational categories. We then used TADA-A to assess the evidence for each gene, using either coding mutations (WES approach) or all types of mutations (WGS approach). We then identified ASD risk genes with q-values < 0.1. To study cost effectiveness between WES and WGS, we translated trio sample size into budget (WES: $500/sample; WGS: $1,000/sample) and compared the number of identified ASD risk genes at each budget level.
TADs with Recurrent De Novo SNVs
For each TAD region, we calculated the regulatory mutation rate as the sum of per-base calibrated mutation rates of brain H3K27ac sites within the TAD. (For each study, we first calibrated mutation rates of these H3K27ac sites and their 2.5 kb flanking regions, accounting for sample size, GC content, and 1-Mb human-macaque divergence. We then added up the calibrated mutation rates across the five studies.) Under the null hypothesis, the count of regulatory SNVs follows a Poisson distribution, whose rate is the regulatory mutation rate.10, 16 We then calculated the p value of each TAD region using the Poisson test and used the Benjamini-Hochberg procedure to control FDR.
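The per-TAD test can be sketched as below: an upper-tail Poisson p value for each TAD, followed by Benjamini-Hochberg adjustment. We assume a one-sided test for excess SNVs and that the expected count already incorporates sample size, as the calibration description suggests. The counts, expected values, and function names are hypothetical.

```python
import math

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for j in range(1, k):
        term *= lam / j    # P(X = j) from P(X = j - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def benjamini_hochberg(pvals):
    """BH-adjusted p values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical TADs: (observed regulatory SNVs, expected count under the null).
tads = [(5, 1.2), (1, 0.9), (3, 2.5), (0, 0.4)]
p_values = [poisson_upper_tail(k, lam) for k, lam in tads]
fdr = benjamini_hochberg(p_values)
```

Subtracting the cumulative probability from 1 loses precision for extremely small p values; a production analysis would use a dedicated survival-function routine instead.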
Results
Overview of TADA-Annotations (TADA-A)
TADA-A consists of two steps. In the first step, study-specific background mutation rates are estimated at each position. We use an initial estimate based solely on a trinucleotide mutation rates table from the literature.16 These mutation rates are derived from the divergence in intergenic regions between humans and chimps, which are subject to less natural selection compared with coding regions. We then adjust for genomic and technical covariates such as sequencing depth, local GC content, and 1-Mb local divergence scores between humans and macaques. In the second and main step, we use DNMs and annotation information to identify risk genes. We assume that we can assign each DNM to one gene, but we could also analyze at the level of genomic regions, in which case the DNM-to-gene assignment is not strictly necessary (see Discussion). TADA-A takes as input the number of DNMs at each genomic position, summing over all affected subjects, and a set of possibly overlapping genomic annotations (Figure 1A, upper panel). Genomic annotations might include cell-specific histone modifications or evolutionary conservation. TADA-A produces two main outputs: (1) the annotations that are informative of causal mutations and their effect sizes and (2) the predictions of specific susceptibility genes of the disease of interest (Figure 1A, bottom panel). We measure the effect of an annotation by its relative risk, i.e., the fold increase of disease risk for a variant carrying that annotation versus a variant without that annotation (assuming the annotation is binary). The model of TADA-A is general enough that it can analyze either coding or non-coding mutations. If both types of mutations are analyzed, the results can be easily combined by multiplying the resulting Bayes factors (BFs).
The intuition behind TADA-A is that, in affected individuals, disease-causing mutations should appear at higher rates than expected from the baseline mutation rate. Our model can be written as $y_{it} \sim \mathrm{Pois}(N \mu_{it} \gamma_{it})$, where $y_{it}$ is the observed number of de novo mutations at position $i$ mutating to allele $t$ ($y_{it}$ is usually 0 or 1), $\mu_{it}$ is the expected background mutation rate estimated in the first step, $N$ is the sample size, and $\gamma_{it}$ is the relative risk of a DNM at position $i$ mutating to allele $t$ (greater than 1 for risk mutations). To model the relative risk, $\gamma_{it}$, we define a binary (unobserved) variable for each gene indicating whether it is a risk gene or not. For a non-risk gene, all its positions have relative risk equal to 1. For risk genes, we model $\log \gamma_{it}$ as a linear function of the genomic annotations. Each gene has a prior probability of being a risk gene. TADA-A offers the option of using informative prior probabilities, e.g., a likely risk gene from previous WES studies would have high prior probability. Intuitively, this allows us to put more emphasis on highly plausible risk genes to estimate the parameters of annotations, while discounting the unlikely disease-associated genes. This is important when statistical signals in the annotations are weak.
We estimate model parameters (mainly the relative risk of each annotation) using maximum likelihood. Since annotations could be partially redundant (e.g., an enhancer may be associated with multiple annotations such as open chromatin and H3K27ac) and not all annotations are informative, we implement a feature selection protocol to first select annotations that are informative for predicting pathogenic mutations and then jointly estimate the relative risks of these selected annotations. Once we have estimates of all the parameters, we predict whether a gene is a risk gene or not using a Bayes factor (BF), combining information in all its associated DNMs. Similar to the original TADA method, we test each gene separately and contrast the null model where the relative risk is always 1 with the alternative model described above for risk genes (relative risks dependent on annotations). To use TADA-A in a meta-analysis setting that combines multiple studies with possibly different rates and patterns of DNMs (e.g., in relation to GC content), we fit a different background mutation model for each study, but estimate a common set of parameters related to functional annotations.
TADA-A can be used to answer several questions about genetics of a complex disease, ASD in our case. What annotations are associated with causative mutations? Based on this knowledge, can we learn about the genetic architecture of the disease, especially about the relative contribution of coding versus non-coding DNMs to the disease liability? Finally, can we identify specific disease-associated genes? We present below our results in answering these questions for ASD.
Whole-Genome Sequencing Data of ASD and Mutation Rate Calibration
We analyzed DNM data from five WGS studies of ASD trios or quartets, with a total of 314 affected subjects (Table S1). Mutation data are limited to de novo SNVs. The validation rate of de novo SNVs based on Sanger sequencing ranges from 85% to 94% in the five studies.39, 40, 41 The number of DNMs per subject ranges from 57 to 63. Additionally, we collected the control data from a cohort of ∼700 non-ASD-affected subjects. While TADA-A does not need control data, we use this additional dataset to perform burden analysis often employed in DNM studies, comparing the rate of DNMs in affected subjects with the rate in control subjects (see Material and Methods). Our main results are limited to sequences close to genes, including protein-coding sequences, non-coding sequences within ±10 kb of TSSs, and potential splicing-regulatory regions that are not covered by these two categories. In a later section, we present results based on distal sequences.
To account for technical differences among studies that may affect observed mutation rates, we use published mutation rates from Samocha et al. as an initial estimate16 and adjust for covariates using a Poisson regression model, separately for each of the five datasets (see Material and Methods). We perform the analysis at the level of non-overlapping 50-bp windows and consider four covariates for estimating baseline mutation rates: (1) whether a window is in coding regions (transcribed sequences may have lower mutation rates because of transcription-coupled repair); (2) whether the window is in promoter regions (CpG (de)methylation in promoters might affect mutation rates); (3) the percent GC in the window, which may correlate with sequencing depth and hence DNM detectability;8 and (4) the divergence score between humans and macaques of the 1-Mb region around the window, which is used to capture the local deviation from the trinucleotide-based mutation rates. The effects of these covariates are summarized in Table S2. In all five studies, the intercept in the regression model is significantly different from 0 (p < 0.05), suggesting a systematic departure of average mutation rates from the published rates. In three of the five ASD studies, GC content has a significant negative effect on the observed DNM rate (p < 0.05). The features indicating whether a sequence is in a coding or promoter region have a relatively large effect in specific studies (e.g., the coding feature in Jiang et al.7 was significant at p = 0.003). As expected, local divergence scores have a positive effect on the observed mutation rates in all five studies, though the effect is small and not statistically significant. The results from our mutation rate modeling thus support the importance of accounting for differences among studies in meta-analyses of DNM datasets.
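The calibration step can be illustrated with a small Poisson regression in which the published expected count enters as an offset and the covariates rescale it. The sketch below is a simplification for illustration only: it uses a single binary covariate (coding or not) rather than the four covariates listed above, fits by Newton-Raphson in plain Python, and uses hypothetical aggregated counts.

```python
import math

def fit_poisson_offset(y, offset, x, n_iter=50):
    """Fit log E[y] = offset + b0 + b1 * x by Newton-Raphson.

    y      : observed DNM counts per bin
    offset : log of the expected count from the published rates
    x      : one covariate per bin (here a 0/1 coding indicator)
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        lam = [math.exp(o + b0 + b1 * xi) for o, xi in zip(offset, x)]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - li for yi, li in zip(y, lam))
        g1 = sum((yi - li) * xi for yi, li, xi in zip(y, lam, x))
        # Fisher information (negative Hessian), a 2x2 matrix.
        h00 = sum(lam)
        h01 = sum(li * xi for li, xi in zip(lam, x))
        h11 = sum(li * xi * xi for li, xi in zip(lam, x))
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < 1e-12:
            break
    return b0, b1

# Hypothetical bins: published rates predict 5 DNMs in each bin.
y = [7, 5, 4, 2]
offset = [math.log(5.0)] * 4
is_coding = [0, 0, 1, 1]
b0, b1 = fit_poisson_offset(y, offset, is_coding)
print(math.exp(b0), math.exp(b1))
```

In this toy fit, exp(b0) is the study-wide scaling of the published rates in non-coding bins and exp(b1) is the additional multiplicative effect of being coding; a nonzero intercept corresponds to the systematic departure from published rates noted above.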
We notice that although some studies have a small sample size, the numbers of DNMs are still much larger than the number of parameters. For example, even for the dataset with the smallest sample size (the data of Michaelson et al.82 has a sample size of 10), we still have 79 DNMs in the 50-bp windows that are included in our model. In addition, we fit mutation rate parameters separately for each study, so a study with small sample size will not impact the estimates for a larger study.
Risk-Increasing Mutations in ASD Are Associated with Active Enhancer Mark and Damaging Effects on Splicing
We first assessed the quality of data using coding DNMs. We performed a simple burden analysis of protein-coding sequences in probands versus control subjects. We adjusted for the difference in baseline mutation rates in the ASD studies and control subjects using synonymous mutations (whose true mutation rates should be the same across studies, see Material and Methods). As expected, we found that the average rate of non-synonymous mutations per subject is about 1.2-fold higher in ASD-affected subjects versus control subjects (Figure 1B), in line with previous estimates.2, 12, 42 We also observed an increased rate of non-synonymous mutations in gene sets enriched with ASD risk genes, including known ASD genes, genes likely involved in neuropsychiatric disorders (dubbed neuropsychiatric genes), genes intolerant to mutations, and genes highly expressed in the brain (Figure 1B). Only the burden in mutation-intolerant genes is statistically significant (p < 0.01). Synonymous mutations have recently been reported to be enriched in ASD-affected case subjects as they may disrupt transcriptional regulatory processes, such as splicing.8 Thus we think that using synonymous mutations only makes our results more conservative: if there is indeed enrichment in synonymous mutations, the burden of nonsynonymous mutations would be under-estimated as a result of adjusting mutation rates using synonymous mutations.
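One simple way to express a synonymous-adjusted burden is as a ratio of ratios. The exact adjustment used in this study is described in Material and Methods; the snippet below is only a sketch of the general idea, with hypothetical counts chosen for illustration.

```python
def adjusted_burden(nonsyn_case, syn_case, nonsyn_ctrl, syn_ctrl):
    """Non-synonymous burden in cases versus controls, normalized by
    synonymous counts to absorb study-level differences in DNM detection."""
    return (nonsyn_case / syn_case) / (nonsyn_ctrl / syn_ctrl)

# Hypothetical counts, not data from the study.
burden = adjusted_burden(nonsyn_case=240, syn_case=80,
                         nonsyn_ctrl=500, syn_ctrl=200)
print(burden)
```

If synonymous mutations were themselves enriched in cases, the denominator for cases would be inflated and the ratio would shrink, which is why this normalization is conservative.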
For TADA-A analysis, we use a total of 12 functional annotations (Figure 1C). Some annotations measure the regulatory function of variants, including fetal and adult brain H3K27ac18, 19 and fetal19 and adult brain43 DNase hypersensitive sites (DHS). H3K27ac is a mark of active enhancers and DHS is a mark of open chromatin, often suggestive of regulatory functions. We also use a conservation score, GERP,22 an aggregate variant score, CADD,44 and composite annotations of regulatory regions and GERP or CADD. Splicing has been shown to be important for many human diseases, so we include the splicing effects predicted by SPIDEX.23 We choose the 10th percentile of SPIDEX scores as a cutoff. SNVs with a SPIDEX score smaller than this cutoff were classified as affecting splicing. These SNVs are enriched within 20 bp around exon/intron junctions (Figure S1). We limit our analysis to sequences within 10 kb of transcription start sites (TSSs) of protein-coding genes (not including UTRs) and potential splicing-regulating regions (which could be far away from TSSs). To increase the power of TADA-A to detect predictive annotations, we take advantage of existing WES studies. For each gene, we summarize the findings of previous WES studies as the probability of being an ASD risk gene,2 which is then used as the prior probability of being a risk gene in the TADA-A model. This step allows us to put large weights on known ASD risk genes, whose probabilities are close to 1, compared with about 0.06 for an average gene.
In our initial analysis of feature selection using TADA-A, we found that among the 12 annotations, only brain H3K27ac, brain H3K27ac + GERP > 2, and the SPIDEX score, when estimated separately, make marginally significant contributions (p < 0.05, Figure 1C). We therefore retrain the model using only these three features and jointly estimate their relative risks at 1.54, 3.42, and 3.22, respectively (Table 1 and Table S3). In the following analysis, we refer to de novo SNVs in H3K27ac regions within 10 kb of genes as regulatory SNVs (those with GERP > 2 as conserved regulatory SNVs), and de novo SNVs predicted by SPIDEX to affect splicing as splicing SNVs. To study whether the splicing signal is robust, we also used another simple but commonly used way to predict splicing mutations: we classified any SNV within 20 bp of an exon/intron junction as a splicing SNV. The resulting estimate of Log(Relative risk) is very close to that obtained with the SPIDEX prediction, though it is less significant (logRR estimate is 1.09, lower bound is −0.18, and upper bound is 2.35). This difference may be due to the fact that many of the bases within 20 bp of exon/intron junctions do not regulate splicing. We also tried using two other sets of informative priors to analyze the 12 annotations: one based on a genome-wide prediction of ASD genes in the context of a human brain-specific gene interaction network24 and the other based on a neuropsychiatric disorder gene set (see Material and Methods for details). The resulting estimates for functional annotations are largely similar, suggesting that our estimates are quite robust (Figures S2A and S2B).
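The percentile-based splicing annotation can be sketched as follows. The scores here are hypothetical placeholders rather than real SPIDEX output, and percentile conventions differ between tools, so the exact cutoff from another implementation may differ slightly.

```python
import statistics

def splicing_flags(scores, percentile_cuts=10):
    """Flag SNVs whose score falls below the first decile of all scores."""
    # statistics.quantiles with n=10 returns the 9 decile cut points;
    # the first one is the 10th percentile.
    cutoff = statistics.quantiles(scores, n=percentile_cuts)[0]
    return cutoff, [s < cutoff for s in scores]

# Hypothetical scores for 20 SNVs.
scores = [float(s) for s in range(-10, 10)]
cutoff, flags = splicing_flags(scores)
print(cutoff, sum(flags))
```

The resulting boolean vector plays the role of a binary annotation that can be supplied to the relative risk model alongside the enhancer and conservation annotations.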
Table 1.
Mutation Class | Mutation Frequency | Relative Mutational Exposure (%) | Relative Risk | Variance of ASD Liability Explained (%) |
---|---|---|---|---|
Mis3 | 0.0175 | 12.5 | 4.70 | 0.83 |
Loss-of-function | 0.00405 | 2.90 | 20.0 | 1.08 |
H3K27ac SNV (GERP ≤ 2) | 0.0839 | 59.9 | 1.54 | 0.24 |
H3K27ac SNV (GERP > 2) | 0.0164 | 11.7 | 3.42 | 0.46 |
Splicing SNV | 0.0183 | 13.1 | 3.22 | 0.46 |
In the analyses above, we borrowed priors from other studies to increase our power to detect non-coding signals which are generally weaker than coding signals. When using a uniform prior of 0.06 to perform relative risk estimation, we found that while the sign of several annotations, including brain H3K27ac + GERP > 2 and SPIDEX, remain the same (Figure S2C), the strength of statistical evidence is much weaker. This is consistent with our expectation and underscores the advantage of using informative priors to increase the sensitivity of signal detection.
To demonstrate that the signal discovered by TADA-A is specific to ASD, we run TADA-A using the control data. All the annotations now have relative risk estimates close to or smaller than 1, except for one feature (GERP > 2), though the lower bound of Log(Relative risk) is very close to 0 (0.024) (Figure 1C). In addition, combining this feature with other epigenomic features did not increase the effect size as it did when analyzing ASD data. So we believe that the result of this annotation is likely due to noise.
Enhancer and Splicing Mutations Make Substantial Contributions to the De Novo Risk of Autism
A fundamental question in genetics is how the risk variants are distributed among various functional classes, such as protein-coding sequences, enhancer sequences, non-coding RNAs, etc. This question has been studied recently using common variants.45, 46 It was found that even though variants in protein-coding regions are highly enriched with risk variants, they explain only a small fraction of total disease risk. The results from the previous TADA-A analysis allow us to address this “risk partition” problem from a different angle, using DNMs. Based on TADA-A results, we considered three types of non-coding de novo SNVs: less conserved regulatory SNVs (GERP ≤ 2), conserved regulatory SNVs (GERP > 2), and splicing SNVs. We considered these in addition to two classes of coding mutations: LoF and probably damaging missense (predicted by PolyPhen-2, denoted as Mis3). We quantify the contribution of a mutation type as liability variance explained (LVE), taking into account both the frequency of this mutation type and its average relative risk (see Material and Methods). For coding mutations (LoF and Mis3), the relative risks were obtained from published TADA estimates in WES studies.10 For non-coding SNVs, we used the relative risks estimated by TADA-A.
The relative risks of regulatory SNVs and splicing SNVs are lower than those of coding SNVs (Table 1). Despite a lower risk per variant, regulatory SNVs are much more frequent than other classes of mutations, making the total contribution of regulatory SNVs comparable to LoF or missense coding mutations. Each class of mutation explains only a small fraction of estimated total ASD genetic risk (Table 1), consistent with the conclusion of an earlier study.47 Considering only the risk due to de novo mutations, we found that non-coding SNVs (including less conserved regulatory SNVs, conserved regulatory SNVs, and splicing SNVs) explain 38% of the de novo risk (Figure 1D). This estimate, however, is likely very conservative (Discussion).
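The 38% figure can be checked directly from the last column of Table 1. The short calculation below uses only the liability variance values reported in the table; the dictionary keys are our own labels.

```python
# Variance of ASD liability explained (%), copied from Table 1.
lve = {
    "Mis3": 0.83,
    "LoF": 1.08,
    "H3K27ac GERP<=2": 0.24,
    "H3K27ac GERP>2": 0.46,
    "Splicing": 0.46,
}
noncoding_classes = ["H3K27ac GERP<=2", "H3K27ac GERP>2", "Splicing"]

noncoding_total = sum(lve[c] for c in noncoding_classes)
de_novo_total = sum(lve.values())
noncoding_share = noncoding_total / de_novo_total
print(f"non-coding: {noncoding_total:.2f}%, "
      f"all SNV classes: {de_novo_total:.2f}%, "
      f"share: {noncoding_share:.1%}")
```

The non-coding classes sum to 1.16% out of 3.07% for all five SNV classes, which rounds to the reported 38%.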
TADA-A Identifies ASD Risk Genes by Combining Coding and Noncoding Mutations
A recent WES study (∼3,500 samples) identified 58 ASD risk genes at FDR < 0.1 using de novo SNVs.2 Applying TADA-A to the WGS data and combining them with the WES results, we discovered 4 “novel ASD genes” at FDR < 0.1 (Table 2) and 12 at FDR < 0.3 (Table S4). Each of the four genes at FDR < 0.1 has at least one LoF or Mis3 mutation, and the evidence for these genes is strengthened by the presence of regulatory or splicing SNVs. We found extensive evidence supporting the plausibility of these genes as ASD risk genes. APBB1, NRXN1, and TANC2 are targets of the neuronal RNA-binding protein FMRP, whose loss of function causes fragile X syndrome and autistic features (Tables 2 and S5, hypergeometric test, p = 0.00032). These three genes have been identified as haploinsufficient genes33, 34 (Tables 2 and S5, hypergeometric test, p = 0.00293) and are highly expressed in the brain (Tables 2 and S5, hypergeometric test, top 25% of all genes, p = 0.049). The genes also tend to be evolutionarily constrained as measured by either RVIS (Tables 2 and S5, hypergeometric test, RVIS top 25% genes, p = 0.037) or another metric based on tolerance of LoF variants in ExAC (Tables 2 and S5, hypergeometric test, ExAC top 25% genes, p = 0.042). Evolutionary constraint in the human population has been shown to be a strong predictor of autism genes.16
We performed network analyses to further establish the link of candidate genes to autism. DAWN35, 36 is a recently developed method that predicts autism risk genes by virtue of the genes’ association with known ASD genes in co-expression networks of the early developing brain. A gene receives a high DAWN score if it is highly connected with other likely ASD genes. We found that NRXN1 has a DAWN q-value < 0.05 in at least one of the two critical spatial-temporal developmental windows for ASD37 (Tables 2 and S5; fold enrichment 2.92; the hypergeometric test is not significant at p = 0.30, but its power is low because there are only four “novel” genes). Using GeneMania,38 we found that our candidate genes were highly connected to high-confidence ASD genes in the gene co-expression network constructed from multi-tissue gene expression data (91 co-expression links between the two gene sets, p = 0.04, Figure 2A). These analyses, using different analytic tools and genomic data, thus support that our identified genes are functionally related to known ASD genes.
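The enrichment tests quoted above are one-sided hypergeometric tests. A minimal standard-library sketch is given below; the background of 20,000 genes is a hypothetical round number chosen for illustration, so the resulting p value is not expected to reproduce the reported values exactly.

```python
from math import comb

def hypergeom_sf(k, total_genes, annotated_genes, drawn_genes):
    """P(X >= k): probability that at least k of the drawn genes carry the
    annotation when sampling without replacement from the background."""
    upper = min(annotated_genes, drawn_genes)
    favorable = sum(
        comb(annotated_genes, i)
        * comb(total_genes - annotated_genes, drawn_genes - i)
        for i in range(k, upper + 1)
    )
    return favorable / comb(total_genes, drawn_genes)

# Hypothetical background: 20,000 genes, of which the top 25% (5,000) are
# annotated; 3 of the 4 candidate genes fall in the annotated set.
p_value = hypergeom_sf(3, 20000, 5000, 4)
print(round(p_value, 4))
```

With only four genes drawn, even three hits in a top-25% set gives a p value near 0.05, which illustrates why tests on the four-gene set have limited power.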
Expanding our analysis to the 12 “novel ASD genes” at FDR < 0.3, we observed significant enrichment of multiple gene annotations (Table S4), including haploinsufficient genes (Table S5, hypergeometric test, p = 0.00035), constrained genes (Table S5, hypergeometric test, p = 0.0012 using RVIS and p = 0.0096 using variant frequency in ExAC), and genes significantly co-expressed with known ASD genes from DAWN analysis (Table S5, hypergeometric test, p = 0.0022). The 12 “novel ASD genes” are also significantly enriched in genes predicted to be ASD risk genes (FDR < 0.1) by a recently developed machine learning approach that utilizes a brain-specific functional gene network (Table S5, hypergeometric test, p = 0.028).24 Literature inspection provides further support of the roles of most of these genes in ASD (Table S6).
Distal Enhancers and TADs with Multiple De Novo Mutations Implicate Additional Risk Genes
Our TADA-A analyses were performed at the gene level and considered only enhancers within 10 kb of TSSs. Applying TADA-A to distal enhancers is challenging largely because of the uncertainty of assigning these enhancers to their target genes. Various studies have shown that distal enhancers target their nearest genes in only 10%–30% of cases. We use a different approach in this section to test whether distal enhancers may play some role in autism. Our idea is that the probability of multiple DNMs occurring in a single enhancer by chance is very low. We found 25 H3K27ac enhancers with 2 or more SNVs in ASD-affected case subjects, significantly more than expected at random based on simulations (Figure 2B, p = 0.0014). We predicted the likely target genes of recurrent enhancers based on cross-tissue correlation between enhancer activity and gene expression from Roadmap Epigenomics (Table S7). We found a recurrent enhancer putatively targeting ZMIZ1, more than 250 kb away (Figure 2C). The region contains two other DNMs in two enhancers, one of which also has activity correlated with that of the ZMIZ1 promoter. A target of FMRP, ZMIZ1 is highly expressed in the brain and interacts with the neuron-specific chromatin remodeling complex (nBAF), which is important in regulating synaptic functions.48, 49 Several nBAF members have been linked to autism, such as ARID1B and BCL11A.50 The pathogenic potential of ZMIZ1 is further supported by the observation of a de novo gene-disrupting translocation in an individual with intellectual disability.51 These results strongly support the role of ZMIZ1 in autism and also highlight the mechanism by which DNMs may increase ASD risk by disrupting distal regulatory elements.
We applied a similar idea of recurrent DNM analysis at the level of topologically associating domains (TADs).52 These are megabase-sized chromatin interaction domains that are stable across cell types and have been proposed to demarcate transcriptional regulatory units.52 Based on estimated mutation rates, we found two TADs with a significant (at FDR < 0.3) number of regulatory SNVs (Figure S3 and Table S8). In both TADs, there are only two or three genes, and we conjecture that SRBD1 and MSRA are likely the underlying ASD genes in the two TADs (see Discussion).
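The recurrence test can be illustrated with a small permutation-style simulation: mutations are scattered across enhancers in proportion to their mutability, and the number of enhancers hit at least twice is compared with the observed count. The simulation design, enhancer weights, and counts below are assumptions made for illustration and do not reproduce the procedure or data behind Figure 2B.

```python
import random

def count_recurrent(n_mutations, weights, rng):
    """Scatter mutations over enhancers and count enhancers hit >= 2 times."""
    hits = rng.choices(range(len(weights)), weights=weights, k=n_mutations)
    per_enhancer = {}
    for h in hits:
        per_enhancer[h] = per_enhancer.get(h, 0) + 1
    return sum(1 for c in per_enhancer.values() if c >= 2)

def empirical_p(observed, n_mutations, weights, n_sim=2000, seed=0):
    """One-sided empirical p value with the usual +1 correction."""
    rng = random.Random(seed)
    as_extreme = sum(
        count_recurrent(n_mutations, weights, rng) >= observed
        for _ in range(n_sim)
    )
    return (as_extreme + 1) / (n_sim + 1)

# Hypothetical setting: 5,000 enhancers of equal mutability, 300 DNMs,
# and 25 enhancers observed with two or more DNMs.
weights = [1.0] * 5000
p_recurrent = empirical_p(25, 300, weights)
print(p_recurrent)
```

In practice the weights would come from the calibrated per-base mutation rates summed over each enhancer, so that long or highly mutable enhancers are not mistaken for recurrent ones.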
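A test of this kind can be sketched as a Poisson upper-tail probability per TAD followed by a multiple-testing correction. The use of the Benjamini-Hochberg procedure here is our assumption for illustration; the source states only that significance was assessed at FDR < 0.3 based on estimated mutation rates. The expected and observed counts are hypothetical.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    cdf_below_k = sum(math.exp(-lam) * lam ** i / math.factorial(i)
                      for i in range(k))
    return max(0.0, 1.0 - cdf_below_k)

def bh_qvalues(pvals):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        qvals[idx] = running_min
    return qvals

# Hypothetical TADs: (observed regulatory SNVs, expected count under the null).
tads = [(6, 1.0), (2, 1.5), (1, 1.0), (3, 2.5)]
pvals = [poisson_sf(obs, exp) for obs, exp in tads]
qvals = bh_qvalues(pvals)
print([round(q, 4) for q in qvals])
```

The expected count for each TAD would be the sum of calibrated mutation rates over its regulatory bases multiplied by the number of subjects.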
Power of Mapping ASD Risk Genes with WGS and WES
Informed by the de novo genetic architecture of ASD (Table 1), we used simulations to address how the power of a DNM-focused WES or WGS study depends on its sample size and sequencing budget. We randomly sampled ASD risk genes using a prior probability of 0.06, based on previous estimates of the total number of ASD risk genes;1, 53 randomly sampled mutations according to mutation rates and the TADA-A model (causal genes tend to have more deleterious coding and non-coding mutations than expected); and then applied TADA-A to identify risk genes at q-value < 0.1. We found that the power of the simulated WGS design is about 50%–120% higher than that of the WES design (Figure 3A). The gain in power from WGS is more pronounced when the sample size is smaller. We next investigated whether the additional power gained from WGS is justifiable on the basis of cost. At the current per-sample cost level (WES: $500 and WGS: $1,000), we found that WES is still more cost effective than WGS (Figure 3B).
Discussion
Analyzing DNMs from exome-sequencing data has been shown to be a powerful paradigm for mapping risk genes of developmental and psychiatric disorders. Extending this to the non-coding genome is the natural next step and has the potential to transform our understanding of these complex disorders. In this work, we present a comprehensive statistical framework to support such analysis. Our described method, TADA-A, is able to leverage multiple functional genomic annotations to better detect and prioritize risk-predisposing mutations. More importantly, TADA-A is able to combine information of all DNMs of a gene, in both coding and non-coding regions, to maximize the power to detect risk genes. The results of our meta-analysis of autism WGS datasets demonstrate the effectiveness of TADA-A. We show that de novo non-coding mutations make substantial contributions to the risk of autism (comparable to de novo LoF or missense mutations) and identified several promising ASD risk genes. We note that the method can be applied to any units, other than genes, such as regulatory elements or sequence windows, though analysis at the gene level has the benefit of permitting us to borrow external information from previous WES studies.
A common strategy for analyzing DNM data is the burden analysis, which contrasts the rates of DNMs in affected individuals, often limited to likely functional mutations, with the expected rates due to chance alone. When researchers have no matched sibling or control data, the burden analysis can be confounded by technical factors such as sequencing depth. The burden analysis in non-coding regions is even more challenging because the statistical signal is considerably weaker than the coding signal, as reported by recent publications as well as our own study (Table 1). TADA-A greatly improves the standard burden analysis in several ways. Our mutation model, based on Poisson regression, incorporates covariates known to influence background mutation rates. In our ASD analysis, while we do not have access to genome-wide sequencing depth information, we used GC content as a proxy. Incorporating prior information of which genes are likely risk genes is critical for estimating parameters of annotations, while using a uniform prior largely lost the signals (Figure S2C).
One of the main challenges in making use of noncoding mutations in risk gene mapping is that we do not know a priori, from many possible noncoding annotations, which ones are disease relevant. TADA-A provides a convenient way to tackle this challenge. It allows users to analyze as many annotations as possible and learn which ones are informative of pathological mutations. In the application of TADA-A to ASD WGS data, we found that mutations with H3K27ac marks or with possible splicing effects contribute to ASD risk. These findings are consistent with previous research implicating a role of transcriptional mis-regulation in ASD etiology: chromatin remodeling and histone modification have been implicated in genes with ASD-associated DNMs;1, 54 trans-acting splicing modulators, such as FMRP, have been identified as syndromic ASD genes;55 and atypical splicing patterns of synaptic genes have been observed in individuals with autism.56, 57, 58 While we think enhancers (as marked by H3K27ac) and splicing regulations are involved in possibly most complex diseases, the exact annotations that are informative of disease variants may differ from our findings. For instance, it is possible that enhancers in only specific tissues, which are not known a priori, may be relevant to a given disease. And one may need to intersect H3K27ac with other annotations, e.g., conservation or open chromatin, to better identify functionally active enhancers. TADA-A provides an automatic way of learning such annotations (and their combinations).
Previous knowledge of the role of non-coding variants in diseases comes mostly from GWASs. The challenge with GWASs is that regulatory elements are much shorter (∼1 kb) than regions of linkage disequilibrium (LD, hundreds of kb on average). It is thus not straightforward to assess the contribution of non-coding variants or to identify specific regulatory elements from GWASs.45 Indeed, the estimated contribution of DHS sites to heritability of complex diseases ranges widely from 79% to 25% in literature,45, 59 largely because of LD. By using DNMs, our work provides independent estimation of the contribution of both coding and non-coding variants to the risk of complex diseases.45 We estimated a modest average relative risk of about 1.5 for de novo mutations in less conserved brain H3K27ac enhancers and 3.4 in conserved brain H3K27ac enhancers, compared to 4–5 for missense and 20 for LoF mutations. We were not able to detect signal in evolutionarily conserved sequences (GERP, if not combined with tissue-specific enhancers) or putative deleterious variants (CADD, trained mostly from non-brain tissues). These results suggest that evolutionary constraint is only weakly correlated with pathogenicity in ASD32 and that regulatory variants of ASD probably act in a tissue- and time-specific manner.60 Compared to a previous study,47 our estimate of ASD risk attributable to coding mutations is somewhat higher (1.9% versus 1.1%), mainly due to a significant contribution from missense DNMs (0.83% versus the previous estimate of 0.04%). We believe this difference is due to our different modeling assumptions: we treated all mutations in a category as a mixture of causal and non-causal mutations, whereas the previous study treated all mutations in a category equally (see Material and Methods).47 We estimated that de novo coding (1.9%), non-coding (1.16%), and copy number variants (1.46%, estimated by a previous study47) together contribute 4.5% of ASD risk. 
We think that we significantly under-estimated the contribution of non-coding mutations to ASD risk for several reasons. First, we considered only enhancers within 10 kb of TSSs, which constitute about 36% of all enhancers in our data. Second, our dataset contains only regulatory sequences active early in development (7, 8.5, and 12 weeks after conception) or in the adult brain. Third, larger genomic alterations, such as indels, potentially have larger effect sizes and are expected to increase the power for risk gene prediction. However, the false positive rates of calling de novo indels are much higher than SNVs. Besides, there is no good de novo mutation rate model for indels, which makes it difficult to model indels and estimate the relevant parameters. Thus de novo indels were not considered in this study. We also did not include de novo CNVs because of the difficulty of estimating mutation rates and attributing the contribution of a CNV to a risk gene.
Iossifov et al. used the ascertainment differential, defined as the difference in DNM rates between probands and unaffected siblings, to measure the contribution of DNMs to ASD risk.14 Based on higher de novo nonsynonymous mutation rates in probands, they estimated that DNMs contribute to about 21% of case subjects. We note, however, that this does not mean that DNMs explain all of these case subjects, since DNMs are rarely fully penetrant. A better approach would estimate the contribution of DNMs to the disease liability, similar to the widely used heritability analysis, by taking into account the effect sizes of variants. Using this approach, both Gaugler et al. and our method reach similar estimates that DNMs contribute a few percent of the ASD risk.47 To appreciate the difference between the approach of Iossifov et al.14 and the liability approach, consider a two-hit model in which an individual has high ASD liability from inherited variants and one DNM with a small effect pushes the individual above the liability threshold. In this case, the DNM certainly contributes, but its effect is small. The ascertainment differential approach will not give us a correct picture of the true impact of DNMs in this scenario.
We identified four “novel ASD genes,” three of which are strongly supported by other evidence. APBB1 is an adaptor protein localized in the nucleus. It is downregulated in ASD cerebellum compared to control cerebellum,61 and its microexons are mis-regulated in the brains of ASD-affected individuals.56 NRXN1 belongs to a group of presynaptic cell adhesion molecules that control synapse development.62 It has been implicated as a top candidate gene for neurodevelopmental and neuropsychiatric conditions.63 Interestingly, alternative splicing of Nrxn1 has been reported to cause defects of synaptic formation in the hippocampus region in a mouse model (the gene is supported by a splicing SNV, Table 2).64 TANC2 is a member of the postsynaptic scaffold proteins. It is highly expressed in the brain and plays roles in the regulation of dendritic spines and excitatory synapses.65 One WES study reported TANC2 as a candidate intellectual disability gene.66
Most of the ASD genes at FDR < 0.3 are supported by functional or association studies (summarized in Table S6). JUP is a member of the catenin/cadherin superfamily, which has important roles in neuron connections and interactions.67 It is strongly expressed in the primate prefrontal cortex and hippocampus.68 Dll1 is expressed in most of the neural tube during CNS development in mice.69 Studies of Dll1-deficient mice suggest that Dll1 plays an important role in the expansion and differentiation of mesencephalic dopaminergic neural precursor cells into neurons.70 PPM1D has recently been identified as a risk gene for intellectual disability.71 MSL2, DLL1, SMARCC2, ARHGAP44, and GAPVD1 are predicted as autism risk genes by a recently developed machine-learning approach that utilizes a brain-specific functional interaction network24 (q-values 0.038, 0.0186, 0.0319, 0.0859, and 0.08, respectively).
We also found that the two TAD regions with excess regulatory SNVs in ASD are supported by CNV studies. In one TAD, recurrent, rare CNVs (chr2: 45455651–45984915) spanning the entire SRBD1 gene (the only protein coding gene disrupted by the CNVs within this TAD) were reported in ASD-affected subjects.72 In a later independent study, CNVs in this TAD region were found to be enriched in ASD-affected case subjects versus control subjects.72 These results suggest that SRBD1 is likely the risk gene in this TAD. In the other TAD region, ASD-associated duplication of 8p23.1–8p23.2 introduces a breakpoint between MSRA and RP1L1.73 MSRA is a member of the methionine-sulfoxide reductase system whose function is to alleviate oxidative stress. Increased exposure to oxidative stress plays an important role in the pathogenesis of ASD.74 In addition, GWASs have established associations of MSRA with schizophrenia75 and bipolar disorder.76
One caveat of using multiple datasets for meta-analysis with TADA-A is that DNM load is likely to differ between simplex and multiplex families. Ideally, we would like to treat the data from simplex and multiplex families differently, but in practice, this would reduce sample size and make the estimates less reliable. Several lines of reasoning suggest that the extent of the difference may be limited. (1) When choosing simplex families, it is hard to exclude families with high genetic risks, because the family sizes are often small. A high-risk family could by chance give rise to two affected siblings or to one affected and one unaffected sibling. The former would be classified as multiplex and the latter as simplex. It is estimated that more than 85% of such high-risk families with two children, at least one with autism, would be included in the Simons Simplex Collection.77 (2) The de novo CNV burden of simplex and multiplex families was found to be similar. In Pinto et al., the rate of de novo CNVs is 5.9% in simplex and 5.8% in multiplex families.78 (3) The burden of nonsynonymous mutations in our data (Figure 1B, OR about 1.2), in which multiplex families from Yuen et al. make up the majority, is quite similar to the burden based on simplex families.14 (4) Yuen et al. found that identical DNMs existed in 19% of the sibling pairs of the multiplex families they investigated.79 This observation suggests that, even in multiplex families, DNMs derived from germline mosaic mutations could play a significant role in increasing ASD risk. These “mosaic” DNMs are thus similar to DNMs in simplex families, in a sense. As more data from simplex and multiplex families become available, accounting for this difference would be a future direction.
We believe that TADA-A can be further developed along several directions. The baseline mutation model of TADA-A is relatively simple, and recent studies demonstrate that broader sequence context and additional genomic features can be highly correlated with mutation rates.80 Additionally, in the current analysis, we focus on regulatory sequences close to genes. However, a large fraction of regulatory sequences are distal to transcription start sites. The challenge is that the target genes of these sequences are often unknown. We plan to integrate chromatin interaction data (e.g., Hi-C) in the future to better analyze mutations in distal enhancers. Finally, TADA-A uses a linear model for predicting effects of mutations from annotations. A more powerful method may use a non-linear model such as deep neural networks.81
Acknowledgments
This work was supported by a National Institutes of Health grant (1R01MH110531) and a Simons Foundation award (SFARI Award ID 385027) to X.H.
Published: May 10, 2018
Footnotes
Supplemental Data include four figures, eight tables, and Supplemental Methods and can be found with this article online at https://doi.org/10.1016/j.ajhg.2018.03.023.
Contributor Information
Zhong Sheng Sun, Email: sunzs@mail.biols.ac.cn.
Xin He, Email: xinhe@uchicago.edu.
Web Resources
Cross-tissue enhancer/promoter correlation, http://khuranalab.med.cornell.edu/roadmap_stringent_enhancers.txt
Developing Human Brain, https://developinghumanbrain.org
SFARI, https://sfari.org/
TADA-A, https://github.com/TADA-A/TADA-A
WGS data on EMBL, http://wwwdev.ebi.ac.uk/eva/?eva-study=PRJEB14713
References
- 1. De Rubeis S., He X., Goldberg A.P., Poultney C.S., Samocha K., Cicek A.E., Kou Y., Liu L., Fromer M., Walker S., DDD Study, Homozygosity Mapping Collaborative for Autism, UK10K Consortium. Synaptic, transcriptional and chromatin genes disrupted in autism. Nature. 2014;515:209–215. doi: 10.1038/nature13772.
- 2. Sanders S.J., He X., Willsey A.J., Ercan-Sencicek A.G., Samocha K.E., Cicek A.E., Murtha M.T., Bal V.H., Bishop S.L., Dong S., Autism Sequencing Consortium. Insights into autism spectrum disorder genomic architecture and biology from 71 risk loci. Neuron. 2015;87:1215–1233. doi: 10.1016/j.neuron.2015.09.016.
- 3. Lelieveld S.H., Reijnders M.R.F., Pfundt R., Yntema H.G., Kamsteeg E.-J., de Vries P., de Vries B.B.A., Willemsen M.H., Kleefstra T., Löhner K. Meta-analysis of 2,104 trios provides support for 10 new genes for intellectual disability. Nat. Neurosci. 2016;19:1194–1196. doi: 10.1038/nn.4352.
- 4. Fromer M., Pocklington A.J., Kavanagh D.H., Williams H.J., Dwyer S., Gormley P., Georgieva L., Rees E., Palta P., Ruderfer D.M. De novo mutations in schizophrenia implicate synaptic networks. Nature. 2014;506:179–184. doi: 10.1038/nature12929.
- 5. Allen A.S., Berkovic S.F., Cossette P., Delanty N., Dlugos D., Eichler E.E., Epstein M.P., Glauser T., Goldstein D.B., Han Y., Epi4K Consortium, Epilepsy Phenome/Genome Project. De novo mutations in epileptic encephalopathies. Nature. 2013;501:217–221. doi: 10.1038/nature12439.
- 6. Homsy J., Zaidi S., Shen Y., Ware J.S., Samocha K.E., Karczewski K.J., DePalma S.R., McKean D., Wakimoto H., Gorham J. De novo mutations in congenital heart disease with neurodevelopmental and other congenital anomalies. Science. 2015;350:1262–1266. doi: 10.1126/science.aac9396.
- 7. Jiang Y.-H., Yuen R.K.C., Jin X., Wang M., Chen N., Wu X., Ju J., Mei J., Shi Y., He M. Detection of clinically relevant genetic variants in autism spectrum disorder by whole-genome sequencing. Am. J. Hum. Genet. 2013;93:249–263. doi: 10.1016/j.ajhg.2013.06.012.
- 8. Yuen R.K., Merico D., Cao H., Pellecchia G., Alipanahi B., Thiruvahindrapuram B., Tong X., Sun Y., Cao D., Zhang T. Genome-wide characteristics of de novo mutations in autism. NPJ Genom. Med. 2016;1:16027. doi: 10.1038/npjgenmed.2016.27.
- 9. Yuen R.K.C., Merico D., Bookman M., Howe J.L., Thiruvahindrapuram B., Patel R.V., Whitney J., Deflaux N., Bingham J., Wang Z. Whole genome sequencing resource identifies 18 new candidate genes for autism spectrum disorder. Nat. Neurosci. 2017;20:602–611. doi: 10.1038/nn.4524.
- 10. He X., Sanders S.J., Liu L., De Rubeis S., Lim E.T., Sutcliffe J.S., Schellenberg G.D., Gibbs R.A., Daly M.J., Buxbaum J.D. Integrated model of de novo and inherited genetic variants yields greater power to identify risk genes. PLoS Genet. 2013;9:e1003671. doi: 10.1371/journal.pgen.1003671.
- 11. Jiang Y., Han Y., Petrovski S., Owzar K., Goldstein D.B., Allen A.S. Incorporating functional information in tests of excess de novo mutational load. Am. J. Hum. Genet. 2015;97:272–283. doi: 10.1016/j.ajhg.2015.06.013.
- 12. Iossifov I., Ronemus M., Levy D., Wang Z., Hakker I., Rosenbaum J., Yamrom B., Lee Y.-H., Narzisi G., Leotta A. De novo gene disruptions in children on the autistic spectrum. Neuron. 2012;74:285–299. doi: 10.1016/j.neuron.2012.04.009.
- 13. Neale B.M., Kou Y., Liu L., Ma’ayan A., Samocha K.E., Sabo A., Lin C.-F., Stevens C., Wang L.-S., Makarov V. Patterns and rates of exonic de novo mutations in autism spectrum disorders. Nature. 2012;485:242–245. doi: 10.1038/nature11011.
- 14. Iossifov I., O’Roak B.J., Sanders S.J., Ronemus M., Krumm N., Levy D., Stessman H.A., Witherspoon K.T., Vives L., Patterson K.E. The contribution of de novo coding mutations to autism spectrum disorder. Nature. 2014;515:216–221. doi: 10.1038/nature13908.
- 15. Turner T.N., Hormozdiari F., Duyzend M.H., McClymont S.A., Hook P.W., Iossifov I., Raja A., Baker C., Hoekzema K., Stessman H.A. Genome sequencing of autism-affected families reveals disruption of putative noncoding regulatory DNA. Am. J. Hum. Genet. 2016;98:58–74. doi: 10.1016/j.ajhg.2015.11.023.
- 16. Samocha K.E., Robinson E.B., Sanders S.J., Stevens C., Sabo A., McGrath L.M., Kosmicki J.A., Rehnström K., Mallick S., Kirby A. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 2014;46:944–950. doi: 10.1038/ng.3050.
- 17. Newton M.A., Noueiry A., Sarkar D., Ahlquist P. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics. 2004;5:155–176. doi: 10.1093/biostatistics/5.2.155.
- 18. Reilly S.K., Yin J., Ayoub A.E., Emera D., Leng J., Cotney J., Sarro R., Rakic P., Noonan J.P. Evolutionary genomics. Evolutionary changes in promoter and enhancer activity during human corticogenesis. Science. 2015;347:1155–1159. doi: 10.1126/science.1260943.
- 19. Kundaje A., Meuleman W., Ernst J., Bilenky M., Yen A., Heravi-Moussavi A., Kheradpour P., Zhang Z., Wang J., Ziller M.J., Roadmap Epigenomics Consortium. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518:317–330. doi: 10.1038/nature14248.
- 20. Quinlan A.R., Hall I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842. doi: 10.1093/bioinformatics/btq033.
- 21. Yang H., Wang K. Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR. Nat. Protoc. 2015;10:1556–1566. doi: 10.1038/nprot.2015.105.
- 22. Davydov E.V., Goode D.L., Sirota M., Cooper G.M., Sidow A., Batzoglou S. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput. Biol. 2010;6:e1001025. doi: 10.1371/journal.pcbi.1001025.
- 23. Xiong H.Y., Alipanahi B., Lee L.J., Bretschneider H., Merico D., Yuen R.K.C., Hua Y., Gueroussov S., Najafabadi H.S., Hughes T.R. RNA splicing. The human splicing code reveals new insights into the genetic determinants of disease. Science. 2015;347:1254806. doi: 10.1126/science.1254806.
- 24. Krishnan A., Zhang R., Yao V., Theesfeld C.L., Wong A.K., Tadych A., Volfovsky N., Packer A., Lash A., Troyanskaya O.G. Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder. Nat. Neurosci. 2016;19:1454–1462. doi: 10.1038/nn.4353.
- 25. Sanders S.J., Murtha M.T., Gupta A.R., Murdoch J.D., Raubeson M.J., Willsey A.J., Ercan-Sencicek A.G., DiLullo N.M., Parikshak N.N., Stein J.L. De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature. 2012;485:237–241. doi: 10.1038/nature10945.
- 26. Xu L.M., Li J.R., Huang Y., Zhao M., Tang X., Wei L. AutismKB: an evidence-based knowledgebase of autism genetics. Nucleic Acids Res. 2012;40:D1016–D1022. doi: 10.1093/nar/gkr1145.
- 27. Betancur C. Etiological heterogeneity in autism spectrum disorders: more than 100 genetic and genomic disorders and still counting. Brain Res. 2011;1380:42–77. doi: 10.1016/j.brainres.2010.11.078.
- 28. Pinto D., Delaby E., Merico D., Barbosa M., Merikangas A., Klei L., Thiruvahindrapuram B., Xu X., Ziman R., Wang Z. Convergence of genes and cellular pathways dysregulated in autism spectrum disorders. Am. J. Hum. Genet. 2014;94:677–694. doi: 10.1016/j.ajhg.2014.03.018.
- 29. Purcell S.M., Moran J.L., Fromer M., Ruderfer D., Solovieff N., Roussos P., O’Dushlaine C., Chambert K., Bergen S.E., Kähler A. A polygenic burden of rare disruptive mutations in schizophrenia. Nature. 2014;506:185–190. doi: 10.1038/nature12975.
- 30. Bayés A., Collins M.O., Croning M.D.R., van de Lagemaat L.N., Choudhary J.S., Grant S.G.N. Comparative study of human and mouse postsynaptic proteomes finds high compositional conservation and abundance differences for key synaptic proteins. PLoS ONE. 2012;7:e46683. doi: 10.1371/journal.pone.0046683.
- 31. Darnell J.C., Van Driesche S.J., Zhang C., Hung K.Y., Mele A., Fraser C.E., Stone E.F., Chen C., Fak J.J., Chi S.W. FMRP stalls ribosomal translocation on mRNAs linked to synaptic function and autism. Cell. 2011;146:247–261. doi: 10.1016/j.cell.2011.06.013.
- 32. Petrovski S., Wang Q., Heinzen E.L., Allen A.S., Goldstein D.B. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 2013;9:e1003709. doi: 10.1371/journal.pgen.1003709.
- 33. Huang N., Lee I., Marcotte E.M., Hurles M.E. Characterising and predicting haploinsufficiency in the human genome. PLoS Genet. 2010;6:e1001154. doi: 10.1371/journal.pgen.1001154.
- 34. Petrovski S., Gussow A.B., Wang Q., Halvorsen M., Han Y., Weir W.H., Allen A.S., Goldstein D.B. The intolerance of regulatory sequence to genetic variation predicts gene dosage sensitivity. PLoS Genet. 2015;11:e1005492. doi: 10.1371/journal.pgen.1005492.
- 35. Liu L., Lei J., Sanders S.J., Willsey A.J., Kou Y., Cicek A.E., Klei L., Lu C., He X., Li M. DAWN: a framework to identify autism genes and subnetworks using gene expression and genetics. Mol. Autism. 2014;5:22. doi: 10.1186/2040-2392-5-22.
- 36. Liu L., Lei J., Roeder K. Network assisted analysis to reveal the genetic basis of autism. Ann. Appl. Stat. 2015;9:1571–1600. doi: 10.1214/15-AOAS844.
- 37. Willsey A.J., Sanders S.J., Li M., Dong S., Tebbenkamp A.T., Muhle R.A., Reilly S.K., Lin L., Fertuzinhos S., Miller J.A. Coexpression networks implicate human midfetal deep cortical projection neurons in the pathogenesis of autism. Cell. 2013;155:997–1007. doi: 10.1016/j.cell.2013.10.020.
- 38. Mostafavi S., Ray D., Warde-Farley D., Grouios C., Morris Q. GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biol. 2008;9(Suppl 1):S4. doi: 10.1186/gb-2008-9-s1-s4.
- 39. Roach J.C., Glusman G., Smit A.F.A., Huff C.D., Hubley R., Shannon P.T., Rowen L., Pant K.P., Goodman N., Bamshad M. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science. 2010;328:636–639. doi: 10.1126/science.1186802.
- 40. Kong A., Frigge M.L., Masson G., Besenbacher S., Sulem P., Magnusson G., Gudjonsson S.A., Sigurdsson A., Jonasdottir A., Jonasdottir A. Rate of de novo mutations and the importance of father’s age to disease risk. Nature. 2012;488:471–475. doi: 10.1038/nature11396.
- 41. Campbell C.D., Chong J.X., Malig M., Ko A., Dumont B.L., Han L., Vives L., O’Roak B.J., Sudmant P.H., Shendure J. Estimating the human mutation rate using autozygosity in a founder population. Nat. Genet. 2012;44:1277–1281. doi: 10.1038/ng.2418.
- 42. Li J., Cai T., Jiang Y., Chen H., He X., Chen C., Li X., Shao Q., Ran X., Li Z. Genes with de novo mutations are shared by four neuropsychiatric disorders discovered from NPdenovo database. Mol. Psychiatry. 2016;21:290–297. doi: 10.1038/mp.2015.40.
- 43. Dunham I., Kundaje A., Aldred S.F., Collins P.J., Davis C.A., Doyle F., Epstein C.B., Frietze S., Harrow J., Kaul R., ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247.
- 44. Kircher M., Witten D.M., Jain P., O’Roak B.J., Cooper G.M., Shendure J. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 2014;46:310–315. doi: 10.1038/ng.2892.
- 45. Finucane H.K., Bulik-Sullivan B., Gusev A., Trynka G., Reshef Y., Loh P.-R., Anttila V., Xu H., Zang C., Farh K., ReproGen Consortium, Schizophrenia Working Group of the Psychiatric Genomics Consortium, RACI Consortium. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 2015;47:1228–1235. doi: 10.1038/ng.3404.
- 46. Gusev A., Lee S.H., Trynka G., Finucane H., Vilhjálmsson B.J., Xu H., Zang C., Ripke S., Bulik-Sullivan B., Stahl E., Schizophrenia Working Group of the Psychiatric Genomics Consortium, SWE-SCZ Consortium. Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. Am. J. Hum. Genet. 2014;95:535–552. doi: 10.1016/j.ajhg.2014.10.004.
- 47. Gaugler T., Klei L., Sanders S.J., Bodea C.A., Goldberg A.P., Lee A.B., Mahajan M., Manaa D., Pawitan Y., Reichert J. Most genetic risk for autism resides with common variation. Nat. Genet. 2014;46:881–885. doi: 10.1038/ng.3039.
- 48. Li X., Zhu C., Tu W.H., Yang N., Qin H., Sun Z. ZMIZ1 preferably enhances the transcriptional activity of androgen receptor with short polyglutamine tract. PLoS ONE. 2011;6:e25040. doi: 10.1371/journal.pone.0025040.
- 49. Wu J.I., Lessard J., Olave I.A., Qiu Z., Ghosh A., Graef I.A., Crabtree G.R. Regulation of dendritic development by neuron-specific chromatin remodeling complexes. Neuron. 2007;56:94–108. doi: 10.1016/j.neuron.2007.08.021.
- 50. Vogel-Ciernia A., Wood M.A. Neuron-specific chromatin remodeling: a missing link in epigenetic mechanisms underlying synaptic plasticity, memory, and intellectual disability disorders. Neuropharmacology. 2014;80:18–27. doi: 10.1016/j.neuropharm.2013.10.002.
- 51. Córdova-Fletes C., Domínguez M.G., Delint-Ramirez I., Martínez-Rodríguez H.G., Rivas-Estilla A.M., Barros-Núñez P., Ortiz-López R., Neira V.A. A de novo t(10;19)(q22.3;q13.33) leads to ZMIZ1/PRR12 reciprocal fusion transcripts in a girl with intellectual disability and neuropsychiatric alterations. Neurogenetics. 2015;16:287–298. doi: 10.1007/s10048-015-0452-2.
- 52. Dixon J.R., Selvaraj S., Yue F., Kim A., Li Y., Shen Y., Hu M., Liu J.S., Ren B. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012;485:376–380. doi: 10.1038/nature11082.
- 53. Sanders S.J., Ercan-Sencicek A.G., Hus V., Luo R., Murtha M.T., Moreno-De-Luca D., Chu S.H., Moreau M.P., Gupta A.R., Thomson S.A. Multiple recurrent de novo CNVs, including duplications of the 7q11.23 Williams syndrome region, are strongly associated with autism. Neuron. 2011;70:863–885. doi: 10.1016/j.neuron.2011.05.002.
- 54. Krumm N., O’Roak B.J., Shendure J., Eichler E.E. A de novo convergence of autism genetics and molecular neuroscience. Trends Neurosci. 2014;37:95–105. doi: 10.1016/j.tins.2013.11.005.
- 55. Didiot M.C., Tian Z., Schaeffer C., Subramanian M., Mandel J.L., Moine H. The G-quartet containing FMRP binding site in FMR1 mRNA is a potent exonic splicing enhancer. Nucleic Acids Res. 2008;36:4902–4912. doi: 10.1093/nar/gkn472.
- 56. Irimia M., Weatheritt R.J., Ellis J.D., Parikshak N.N., Gonatopoulos-Pournatzis T., Babor M., Quesnel-Vallières M., Tapial J., Raj B., O’Hanlon D. A highly conserved program of neuronal microexons is misregulated in autistic brains. Cell. 2014;159:1511–1523. doi: 10.1016/j.cell.2014.11.035.
- 57. Sadakata T., Washida M., Iwayama Y., Shoji S., Sato Y., Ohkura T., Katoh-Semba R., Nakajima M., Sekine Y., Tanaka M. Autistic-like phenotypes in Cadps2-knockout mice and aberrant CADPS2 splicing in autistic patients. J. Clin. Invest. 2007;117:931–943. doi: 10.1172/JCI29031.
- 58. Talebizadeh Z., Lam D.Y., Theodoro M.F., Bittel D.C., Lushington G.H., Butler M.G. Novel splice isoforms for NLGN3 and NLGN4 with possible implications in autism. J. Med. Genet. 2006;43:e21. doi: 10.1136/jmg.2005.036897.
- 59. Speed D., Cai N., Johnson M.R., Nejentsev S., Balding D.J., UCLEB Consortium. Reevaluation of SNP heritability in complex human traits. Nat. Genet. 2017;49:986–992. doi: 10.1038/ng.3865.
- 60. Akbarian S., Liu C., Knowles J.A., Vaccarino F.M., Farnham P.J., Crawford G.E., Jaffe A.E., Pinto D., Dracheva S., Geschwind D.H., PsychENCODE Consortium. The PsychENCODE project. Nat. Neurosci. 2015;18:1707–1712. doi: 10.1038/nn.4156.
- 61. Zeidán-Chuliá F., de Oliveira B.-H.N., Salmina A.B., Casanova M.F., Gelain D.P., Noda M., Verkhratsky A., Moreira J.C.F. Altered expression of Alzheimer’s disease-related genes in the cerebellum of autistic patients: a model for disrupted brain connectome and therapy. Cell Death Dis. 2014;5:e1250. doi: 10.1038/cddis.2014.227.
- 62. Craig A.M., Kang Y. Neurexin-neuroligin signaling in synapse development. Curr. Opin. Neurobiol. 2007;17:43–52. doi: 10.1016/j.conb.2007.01.011.
- 63. Béna F., Bruno D.L., Eriksson M., van Ravenswaaij-Arts C., Stark Z., Dijkhuizen T., Gerkes E., Gimelli S., Ganesamoorthy D., Thuresson A.C. Molecular and clinical characterization of 25 individuals with exonic deletions of NRXN1 and comprehensive review of the literature. Am. J. Med. Genet. B. Neuropsychiatr. Genet. 2013;162B:388–403. doi: 10.1002/ajmg.b.32148.
- 64. Traunmüller L., Gomez A.M., Nguyen T.M., Scheiffele P. Control of neuronal synapse specification by a highly dedicated alternative splicing program. Science. 2016;352:982–986. doi: 10.1126/science.aaf2397.
- 65. Han S., Nam J., Li Y., Kim S., Cho S.-H., Cho Y.S., Choi S.-Y., Choi J., Han K., Kim Y. Regulation of dendritic spines, spatial memory, and embryonic development by the TANC family of PSD-95-interacting proteins. J. Neurosci. 2010;30:15102–15112. doi: 10.1523/JNEUROSCI.3128-10.2010.
- 66. de Ligt J., Willemsen M.H., van Bon B.W.M., Kleefstra T., Yntema H.G., Kroes T., Vulto-van Silfhout A.T., Koolen D.A., de Vries P., Gilissen C. Diagnostic exome sequencing in persons with severe intellectual disability. N. Engl. J. Med. 2012;367:1921–1929. doi: 10.1056/NEJMoa1206524.
- 67. Takeichi M. The cadherin superfamily in neuronal connections and interactions. Nat. Rev. Neurosci. 2007;8:11–20. doi: 10.1038/nrn2043.
- 68. Smith A., Bourdeau I., Wang J., Bondy C.A. Expression of Catenin family members CTNNA1, CTNNA2, CTNNB1 and JUP in the primate prefrontal cortex and hippocampus. Brain Res. Mol. Brain Res. 2005;135:225–231. doi: 10.1016/j.molbrainres.2004.12.025.
- 69. Bettenhausen B., Gossler A. Efficient isolation of novel mouse genes differentially expressed in early postimplantation embryos. Genomics. 1995;28:436–441. doi: 10.1006/geno.1995.1172.
- 70. Trujillo-Paredes N., Valencia C., Guerrero-Flores G., Arzate D.-M., Baizabal J.-M., Guerra-Crespo M., Fuentes-Hernández A., Zea-Armenta I., Covarrubias L. Regulation of differentiation flux by Notch signalling influences the number of dopaminergic neurons in the adult brain. Biol. Open. 2016;5:336–347. doi: 10.1242/bio.013383.
- 71. Lelieveld S.H., Reijnders M.R.F., Pfundt R., Yntema H.G., Kamsteeg E.-J., de Vries P., de Vries B.B.A., Willemsen M.H., Kleefstra T., Löhner K. Meta-analysis of 2,104 trios provides support for 10 new genes for intellectual disability. Nat. Neurosci. 2016;19:1194–1196. doi: 10.1038/nn.4352.
- 72. Matsunami N., Hensel C.H., Baird L., Stevens J., Otterud B., Leppert T., Varvil T., Hadley D., Glessner J.T., Pellegrino R. Identification of rare DNA sequence variants in high-risk autism families and their prevalence in a large case/control population. Mol. Autism. 2014;5:5. doi: 10.1186/2040-2392-5-5.
- 73. Glancy M., Barnicoat A., Vijeratnam R., de Souza S., Gilmore J., Huang S., Maloney V.K., Thomas N.S., Bunyan D.J., Jackson A., Barber J.C. Transmitted duplication of 8p23.1-8p23.2 associated with speech delay, autism and learning difficulties. Eur. J. Hum. Genet. 2009;17:37–43. doi: 10.1038/ejhg.2008.133.
- 74. Rossignol D.A., Frye R.E. Evidence linking oxidative stress, mitochondrial dysfunction, and inflammation in the brain of individuals with autism. Front. Physiol. 2014;5:150. doi: 10.3389/fphys.2014.00150.
- 75. Ma X., Deng W., Liu X., Li M., Chen Z., He Z., Wang Y., Wang Q., Hu X., Collier D.A., Li T. A genome-wide association study for quantitative traits in schizophrenia in China. Genes Brain Behav. 2011;10:734–739. doi: 10.1111/j.1601-183X.2011.00712.x.
- 76. Ni P., Ma X., Lin Y., Lao G., Hao X., Guan L., Li X., Jiang Z., Liu Y., Ye B. Methionine sulfoxide reductase A (MsrA) associated with bipolar I disorder and executive functions in A Han Chinese population. J. Affect. Disord. 2015;184:235–238. doi: 10.1016/j.jad.2015.06.004.
- 77. Levy D., Ronemus M., Yamrom B., Lee Y.H., Leotta A., Kendall J., Marks S., Lakshmi B., Pai D., Ye K. Rare de novo and transmitted copy-number variation in autistic spectrum disorders. Neuron. 2011;70:886–897. doi: 10.1016/j.neuron.2011.05.015.
- 78. Pinto D., Pagnamenta A.T., Klei L., Anney R., Merico D., Regan R., Conroy J., Magalhaes T.R., Correia C., Abrahams B.S. Functional impact of global rare copy number variation in autism spectrum disorders. Nature. 2010;466:368–372. doi: 10.1038/nature09146.
- 79. Yuen R.K.C., Thiruvahindrapuram B., Merico D., Walker S., Tammimies K., Hoang N., Chrysler C., Nalpathamkalam T., Pellecchia G., Liu Y. Whole-genome sequencing of quartet families with autism spectrum disorder. Nat. Med. 2015;21:185–191. doi: 10.1038/nm.3792.
- 80. Carlson J., Scott L.J., Locke A.E., Flickinger M., Levy S., Myers R.M., Boehnke M., Kang H.M., Li J.Z., Zöllner S. Extremely rare variants reveal patterns of germline mutation rate heterogeneity in humans. bioRxiv. 2017. doi: 10.1038/s41467-018-05936-5.
- 81. Quang D., Chen Y., Xie X. DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics. 2015;31:761–763. doi: 10.1093/bioinformatics/btu703.
- 82. Michaelson J.J., Shi Y., Gujral M., Zheng H., Malhotra D., Jin X., Jian M., Liu G., Greer D., Bhandari A. Whole-genome sequencing in autism identifies hot spots for de novo germline mutation. Cell. 2012;151:1431–1442. doi: 10.1016/j.cell.2012.11.019.