Abstract
Allele-specific expression (ASE) concerns divergent expression quantity of alternative alleles and is measured by RNA sequencing. Multiple studies show that ASE plays a role in hereditary diseases by modulating penetrance or phenotype severity. However, genome diagnostics is based on DNA sequencing and therefore neglects gene expression regulation such as ASE. To take advantage of ASE in the absence of RNA sequencing, it must be predicted using only DNA variation. We have constructed ASE models from BIOS (n = 3432) and GTEx (n = 369) that predict ASE using DNA features. These models are highly reproducible and comprise many different feature types, highlighting the complex regulation that underlies ASE. We applied the BIOS-trained model to population variants in three genes in which ASE plays a clinically relevant role: BRCA2, RET and NF1. This resulted in predicted ASE effects for 27 variants, of which 10 were known pathogenic variants. We demonstrated that ASE can be predicted from DNA features using machine learning. Future efforts may improve sensitivity and translate these models into a new type of genome diagnostic tool that prioritizes candidate pathogenic variants or regulators thereof for follow-up validation by RNA sequencing. All code and machine learning models used are available on GitHub and Zenodo.
Subject terms: Computational models, Genetics research
Introduction
Allele-specific expression (ASE) concerns the divergent expression quantity of alternative allelic copies1,2. ASE can be the result of X-chromosome inactivation3, imprinting4, stochasticity5, nonsense-mediated decay6, or genomic regulation7. ASE is heritable8 and typically measured by quantifying RNA expression differences between haplotypes at heterozygous loci of diploid organisms.
ASE has been implicated in disease etiology, even though the underlying mechanisms are not yet fully understood. Around one-third of all non-synonymous single nucleotide polymorphisms are allelically imbalanced and nonsense variants are consistently lower expressed than control sites9, establishing a clear link between pathogenic DNA variation and ASE. Specifically, ASE likely plays a role in pathogenesis or phenotype modulation of many diseases, including autism10, colorectal cancer11, leukemia12, breast cancer13, Hirschsprung disease14, frontotemporal lobar degeneration15, asthma16, neurofibromatosis type 117 and Silver–Russell syndrome18. Interestingly, ASE provides protection against autosomal dominant retinitis pigmentosa19, underscoring its complex role in both causing and preventing disease, and thus overall medical relevance.
ASE is measured by RNA sequencing, while DNA sequencing has become the standard for routine genetic testing20. RNA sequencing holds great promise for molecular diagnostics21–26, but it is not a part of the current diagnostic genetic testing routine27 because of many challenges concerning analytical validity, clinical validity and clinical utility28.
In the absence of RNA measurements, we must resort to predicting ASE effects to inform genome diagnostics. Computationally estimated ASE effects could help to identify or reject candidate pathogenic variants, including coding variants that cause nonsense-mediated decay detected as ASE29, and cis-acting non-coding variants that regulate transcription of pathogenic alleles30. For cis-acting variants, there are two possibilities to consider. First, heterozygous pathogenic variants in recessive disease genes could be prioritized if the ASE effect of a cis-acting variant is predicted to silence the ‘healthy’ allele. Second, when testing for pathogenic variants in families, incomplete penetrance may be explained if the ASE effect of a cis-acting variant is predicted to silence the pathogenic allele, causing a rescue effect. RNA sequencing or other biochemical tests such as PCR can then be performed on the suspected functional defect to reach a final molecular diagnosis.
Here, we present a feasibility study for predicting ASE effects using genomic annotations of autosomal DNA variation. While many studies have used machine learning on genomes to predict gene expression and other phenotypes31–40, to our knowledge, we are the first to predict allele-specific expression specifically. This was accomplished by constructing a machine learning model that predicts whether a DNA variant occurs together with ASE or not. To test the reproducibility of this model, we trained an additional model with the same DNA features on an independent cohort. Using both models, we carried out cross prediction to find out how much of their predictive power remains under new circumstances. We also examined the DNA features of both models to find the main contributors to predicting ASE, and compared feature importance. Furthermore, we tested whether the predictive models have any bias towards gene molecular function by comparing enrichment profiles of predicted ASE against randomly sampled ASE. Finally, we evaluated the potential role of ASE as a modifier for disease. Genetic modifiers are known to affect the penetrance and modulation of rare Mendelian disease41. To achieve this, we applied the ASE prediction model to clinical genes with substantial numbers of population variants where ASE is linked to disease penetrance in case of BRCA213 and RET14, or phenotype modulation in case of NF117 (Fig. 1).
Results
BIOS model ASE predictions
We trained a machine learning model on the BIOS cohort to recognize the difference between DNA sites where ASE was occurring versus sites without ASE. Figure 2A shows that this model achieved an average Area Under the Receiver Operating Characteristic curve (AUROC) of 0.806 with a standard deviation of 0.003 on the independent BIOS test dataset. At a threshold of 0.5, we find a positive predictive value (PPV) of 0.73, a negative predictive value (NPV) of 0.91, a sensitivity of 0.29, and a specificity of 0.99. See Table 1.
Table 1.
Train | Test | Truth | Predicted ASE (thr. 0.5) | Predicted non-ASE (thr. 0.5) |
---|---|---|---|---|
BIOS (90%) | BIOS (10%) | ASE | 95 | 231 |
BIOS (90%) | BIOS (10%) | Non-ASE | 35 | 2414 |
BIOS (90%) | GTEx (full) | ASE | 882 | 2140 |
BIOS (90%) | GTEx (full) | Non-ASE | 518 | 22,249 |
GTEx (90%) | BIOS (full) | ASE | 1242 | 2101 |
GTEx (90%) | BIOS (full) | Non-ASE | 667 | 23,739 |
GTEx (90%) | GTEx (10%) | ASE | 77 | 220 |
GTEx (90%) | GTEx (10%) | Non-ASE | 17 | 2265 |
BIOS versus GTEx cross prediction
To find out whether predicting ASE effects is also possible for a different cohort, we trained a machine learning model on the GTEx dataset under equal conditions. As shown in Fig. 2B, this model achieved an average AUROC of 0.793 with a standard deviation of 0.002 on an independent GTEx test dataset with a PPV of 0.82, an NPV of 0.91, a sensitivity of 0.26, and a specificity of 0.99.
To evaluate to what degree the ASE prediction models are specific to their training dataset of origin, we applied the BIOS model to the GTEx dataset, and vice versa. The BIOS model achieved an average AUROC of 0.802 with a standard deviation of 0.002 on the full GTEx dataset (Fig. 2C) with a PPV of 0.63, an NPV of 0.91, a sensitivity of 0.29, and a specificity of 0.98. Lastly, the GTEx model achieved an average AUROC of 0.812 with a standard deviation of 0.0005 on the full BIOS dataset (Fig. 2D) with a PPV of 0.65, an NPV of 0.92, a sensitivity of 0.37, and a specificity of 0.97. All performance metrics are calculated at a threshold of 0.5. A confusion matrix of all test predictions is shown in Table 1.
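The threshold-dependent metrics follow directly from the confusion-matrix counts in Table 1. The sketch below recomputes them; the function name `confusion_metrics` and the dictionary layout are illustrative choices of ours and are not taken from the published code base.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Return PPV, NPV, sensitivity and specificity from confusion-matrix counts."""
    return {
        "ppv": tp / (tp + fp),          # predicted ASE that is truly ASE
        "npv": tn / (tn + fn),          # predicted non-ASE that is truly non-ASE
        "sensitivity": tp / (tp + fn),  # true ASE that is recovered
        "specificity": tn / (tn + fp),  # true non-ASE that is recovered
    }

# Counts copied from Table 1, keyed by (training cohort, test cohort):
# (ASE predicted ASE, ASE predicted non-ASE, non-ASE predicted ASE, non-ASE predicted non-ASE)
TABLE_1 = {
    ("BIOS", "BIOS"): (95, 231, 35, 2414),
    ("BIOS", "GTEx"): (882, 2140, 518, 22249),
    ("GTEx", "BIOS"): (1242, 2101, 667, 23739),
    ("GTEx", "GTEx"): (77, 220, 17, 2265),
}

for (train, test), counts in TABLE_1.items():
    metrics = confusion_metrics(*counts)
    print(train, "->", test, {name: round(value, 2) for name, value in metrics.items()})
```

Recomputing the metrics from the raw counts in this way is a quick consistency check between the table and the values quoted in the text.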
Feature importance comparison
We examined the relative importance of DNA features to identify the strongest contributors for predicting ASE and elucidate any differences between the BIOS and GTEx models. Figure 3 shows the feature importance according to the BIOS model along with the corresponding GTEx feature importance. The GerpN feature (neutral evolution score defined by GERP++) is the most important in both models. Upon inspection we find that low GerpN scores, indicating a high tolerance to substitution, correspond to positive ASE predictions. High substitution tolerance means that spontaneous mutations at low GerpN loci are most likely under low selection pressure and therefore have a chance to be established as SNVs in a population. This makes sense since ASE can neither be detected nor predicted without the presence of heterozygous DNA variation to distinguish the expressed alleles. The features that follow in highest importance are a mixture of various evolutionary, functional and epigenetic features, such as bStatistic (background selection score), Dist2Mutation (distance between the closest gnomAD SNV up and downstream), cDNApos (base position from transcription start), MinDistTSE (distance to closest transcribed sequence end), cHmmReprPCWk (proportion of cell types in weak repressed polycomb chromatin state) and cHmmQuies (proportion of cell types in quiescent chromatin state). Overall, most features contribute a significant amount of predictive power to both models, and except for a few differences, their relative feature importance is comparable.
Model bias test
We compared gene enrichment profiles of predicted ASE-SNVs, i.e. observed, versus random ASE-SNVs, i.e. expected. We first obtained the profile of the 116 genes belonging to 806 BIOS-unique ASE-SNVs that were correctly predicted by the GTEx-trained model in the complete set of 2092 BIOS-unique ASE-SNVs in 1039 genes. This profile was then compared to profiles of genes belonging to 806 randomly sampled BIOS-unique ASE-SNVs. Figure 4A shows the top-10 gene enrichment terms of this profile including expected-by-chance distributions from tenfold random resampling. Evidence of bias would present itself if the observed ranks, shown as red X’s, were to deviate strongly and consistently from the expected ranks, shown as black violins. Conversely, if the observed ranks overlap with or lie close to the expected ranks, there is no evidence of bias.
The cohorts are reversed for the second analysis. We obtained the gene enrichment profile of the 107 genes belonging to 341 GTEx-unique ASE-SNVs that were correctly predicted by the BIOS-trained model in the complete set of 1582 GTEx-unique ASE-SNVs in 727 genes. This profile was then compared to profiles of genes belonging to 341 randomly sampled GTEx-unique ASE-SNVs. Figure 4B shows the top-10 gene enrichment terms of this profile including expected-by-chance distributions from tenfold random resampling.
Application to clinical genes
We have applied the BIOS model to gnomAD population variants from three clinical genes, BRCA2, RET and NF1, in which ASE plays a role in disease penetrance or modulation. Out of 8957 SNVs tested in total, 27 were predicted to undergo ASE effects: 8 out of 3316 for BRCA2, 8 out of 1700 for RET and 11 out of 3941 for NF1. All predicted ASE-SNVs have very low minor allele frequencies, and all except two are either intronic or stop gained variants. Of the 27 variants, 12 have been described in ClinVar, of which 10 are classified as Pathogenic.
Being able to predict ASE effects for these particular genes may help to elucidate the variable disease penetrance of pathogenic BRCA213 and RET14 mutations. It may also help to explain the high variation of disease severity in NF1 patients, which is observed even in familial cases, where all affected members carry the same mutation17. See Table 2 for a complete overview of these variants.
Table 2.
Gene | RsID/GRCh37 | MAF | Conseq. | ClinVar |
---|---|---|---|---|
BRCA2 | rs748508287 | 3.99E−06 | Stop gained | P*** |
BRCA2 | rs80358556 | 4.01E−06 | Stop gained | P*** |
BRCA2 | rs80358851 | 3.99E−06 | Stop gained | P*** |
BRCA2 | rs766337502 | 4.60E−06 | Intronic | – |
BRCA2 | rs753979600 | 4.56E−06 | Intronic | – |
BRCA2 | rs779588681 | 4.69E−06 | Intronic | – |
BRCA2 | rs80359003 | 7.95E−06 | Stop gained | P*** |
BRCA2 | rs776353983 (C>A) | 3.98E−06 | Stop gained | P*** |
NF1 | rs764079291 | 4.00E−06 | Stop gained | P** |
NF1 | rs1316926587 | 4.00E−06 | Stop gained | P* |
NF1 | rs761199437 | 0 | Stop gained | – |
NF1 | rs1282299543 | 0 | Stop gained | P* |
NF1 | rs376576925 (C>A) | 1.59E−05 | Synonymous | LB/VUS* |
NF1 | rs376576925 (C>T) | 3.98E−06 | Stop gained | P** |
NF1 | 17:29576138G>A | 3.98E−06 | Splice donor | P** |
NF1 | rs748461474 | 8.04E−06 | Intronic | – |
NF1 | rs776167625 | 4.02E−06 | Intronic | – |
NF1 | rs1481561333 | 4.02E−06 | Intronic | – |
NF1 | rs756300767 | 8.32E−06 | Intronic | – |
RET | rs754967305 | 3.12E−05 | Intronic | LB** |
RET | 10:43596200T>C | 0 | Intronic | – |
RET | rs1452567543 | 4.38E−05 | Intronic | – |
RET | rs1198523793 | 0 | Intronic | – |
RET | rs979417275 | 3.67E−05 | Intronic | – |
RET | rs1471253713 | 0 | Intronic | – |
RET | rs1476675800 | 0 | Stop gained | – |
RET | rs775711017 | 0 | Stop gained | – |
The ClinVar classifications shown are: P for Pathogenic, LB for Likely Benign, and VUS for Variant of Unknown Significance. The asterisks indicate the ClinVar review status, where zero is the worst and four is the best. The MAF (Minor Allele Frequency) values are taken from gnomAD exomes r2.1.1. A MAF of zero means the variant was detected but no high-confidence genotype calls were made. The RS identifiers are supplemented with base changes in ambiguous cases. GRCh37 coordinates are used if no RS identifier exists for an SNV.
Discussion
We have shown that ASE can be predicted from DNA features using machine learning models, with high specificity, albeit with low sensitivity. These models were benchmarked on independent test sets and further validated by applying the BIOS model on the GTEx dataset, and vice versa. All benchmarks result in similar performance in terms of AUROC, PPV, NPV, sensitivity and specificity. Also, the feature importance of both models is comparable. Therefore, we conclude that it is indeed feasible to reproducibly predict ASE effects using genomic annotations of DNA variation. The fact that many different types of features are used to make these predictions seems to highlight the complex regulation that underlies ASE.
We evaluated potential bias towards gene molecular function in the prediction models by comparing gene enrichment profiles. If the profiles of predicted ASE-SNVs significantly deviated from the profiles of randomly sampled ASE-SNVs, there would be evidence for a prediction bias. Despite a few deviations, overall agreement is high, therefore no evidence for a prediction bias was found.
When applying the BIOS-trained model to variants in three clinical genes, we predict ASE effects for 27 variants. Most of the stop gained variants have been classified as Pathogenic (9 out of 12), and are undergoing ASE most likely due to nonsense-mediated decay, especially since none are located within the last exon of their transcript. The other variants, including 12 unclassified intronic variants, are potentially ASE regulators via other mechanisms and present interesting candidates for further elucidation of disease etiology.
The benchmark achieved relatively high values for PPV, NPV and specificity, though performance in terms of sensitivity is low. These metrics were obtained by applying an arbitrary probability threshold of 0.5. This stringent threshold may be suitable in circumstances where certainty is preferred over recall, e.g. when only limited capacity for functional follow-ups is available. These metrics can of course be optimized for different purposes by adjusting the probability threshold. In addition, model performance can most likely be further improved by adding more genomic features of different types. This is exemplified by the fact that we manually added pLI_score as a feature, which turned out to be a significant contributor to the model.
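To illustrate how the probability threshold trades certainty against recall, the following sketch evaluates sensitivity and PPV at two thresholds. The probabilities and labels are hypothetical toy values chosen for illustration only, and `metrics_at_threshold` is a helper name of ours.

```python
def metrics_at_threshold(probs, labels, threshold):
    """Sensitivity and PPV when ASE is called for probabilities >= threshold."""
    calls = [p >= threshold for p in probs]
    tp = sum(1 for c, y in zip(calls, labels) if c and y)
    fp = sum(1 for c, y in zip(calls, labels) if c and not y)
    fn = sum(1 for c, y in zip(calls, labels) if not c and y)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv

# Hypothetical predicted ASE probabilities and true labels (1 = ASE, 0 = non-ASE)
probs = [0.9, 0.8, 0.7, 0.45, 0.4, 0.35, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

for thr in (0.5, 0.3):
    sens, ppv = metrics_at_threshold(probs, labels, thr)
    print(f"threshold={thr}: sensitivity={sens:.2f}, PPV={ppv:.2f}")
```

In this toy example, lowering the threshold from 0.5 to 0.3 raises sensitivity from 0.50 to 0.75 while PPV drops from 0.67 to 0.50, which is the kind of trade-off a diagnostic laboratory would weigh against its follow-up capacity.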
While we did not find a prediction bias, the resampling analysis did reveal a striking pattern. The top-3 ranking terms for both BIOS and GTEx ASE-SNVs gene enrichment are serine-type endopeptidase activity (GO:0004252), immunoglobulin receptor binding (GO:0034987) and serine-type peptidase activity (GO:0008236). None of these terms are enriched (Adj.P-val < 0.05) in the full set of blood expressed genes in either BIOS (6275) or GTEx (7941). A potential explanation is that immunoglobulin genes are subject to strong ASE mechanisms such as allelic exclusion42,43. We further hypothesize that this effect may also apply to genes involved in serine proteases which are also key components of the human immune system44,45.
There are a number of limitations to our current approach that must be acknowledged.
First, the models we constructed here are based on variants within expressed transcripts. As a consequence, their predictions are probably not informative for variants outside of genes, and neither is such a model capable of predicting ASE effects on a whole-gene level. Our approach could be complemented with whole-genome sequencing (WGS) data so that the learning procedure can be informed by variants that are not part of expressed transcripts. Furthermore, variants can be phased using WGS data, enabling the prediction of whole-gene ASE as well as allelic direction of these effects.
Second, we used whole-blood derived bulk transcriptomics in which we detected SNVs from 6275 expressed genes covering 33% of clinical genes (1374/4122) in the BIOS cohort. Adding additional tissue types and using single-cell sequencing will further inform ASE predictors of tissue-specific46 and even cell type-specific47 gene expression, enabling tailored predictions that may be more informative for anatomically localized-acting diseases.
We have demonstrated that predicting ASE using machine learning models is indeed feasible. A number of obstacles must be addressed before such models can be translated into practical tools, including performing clinical validation and providing implementation guidelines. Nevertheless, we are convinced that ASE predictors would complement existing in silico tools that infer all kinds of information from DNA variation, for example, tools that predict splicing48, evolutionary pressure49 or estimate pathogenicity35. Such tools are already an established part of diagnostic variant interpretation50. ASE predictions represent an additional piece of the diagnostic puzzle that is crucial in choosing the most informative functional follow-up test after DNA sequencing is done, to increase overall testing effectiveness.
Methods
RNA isolation and genotyping
We reused data from the Biobank-Based Integrative Omics Studies (BIOS) and Genotype-Tissue Expression (GTEx) cohorts, which we describe below. The BIOS Consortium (BBMRI-NL, https://www.bbmri.nl/acquisition-use-analyze/bios) hosts genetic and transcriptomic data on approximately 4000 individuals from 6 Dutch biobanks: CODAM (Cohort on Diabetes and Atherosclerosis Maastricht), LIFELINES (multigenerational cohort study of the northern Dutch population), LLS_PARTOFFS (Leiden Longevity Study, Offspring and their partners), PAN (Prospective ALS study the Netherlands), RS (Rotterdam Study) and VUNTR (Netherlands Twin Register). RNA was extracted from whole blood of 3432 Dutch individuals collected in the BIOS cohort, available from the European Genome-phenome Archive (EGA) under accession number EGAC00001000277. Isolation and sequencing of RNA material was performed as described by Zhernakova et al.51. Alignment, read mapping, genotype calling and quality control were performed on genome build GRCh37 as described by De Klein et al.52. Phasing information was absent because whole-genome sequencing was not available for the majority of samples, so the first and second most common allele were taken as reference allele and alternative allele, respectively. For the BIOS dataset in total, we identified 111,959 heterozygous loci with exactly two alleles in autosomal exonic regions. These SNVs (Single-Nucleotide Variants) were located in 6275 genes. To assess how many clinical genes were covered, we compared these 6275 genes to the Clinical Genomic Database53 containing 4122 genes in the 15 Oct 2020 release, resulting in an overlap of 1374 genes.
We also requested and downloaded allelic reads from 369 whole blood samples collected in the GTEx Project, available from the database of Genotypes and Phenotypes (dbGaP) under accession number phs000424.v8.p2. The GTEx Project collected blood samples from around 900 individuals within 24 h after death for WGS genotyping and quantification of gene expression through RNA sequencing54. The procedure for data processing and genotype calling was performed as described by the GTEx Project55. In total, we identified 89,022 heterozygous loci with exactly two alleles in autosomal exonic regions for the GTEx dataset. These SNVs are located in 7941 unique genes, of which 4877 overlap with the 6275 genes found in BIOS. We did not consider allosomal reads in order to capture mechanisms other than X-inactivation, which has been studied extensively56, including in the BIOS57 and GTEx58 cohorts.
ASE effect calling
We assessed the number of uniquely mapped reads per sample at each locus. The probability of identifying an alternative allele at a given locus was modelled based on the beta-binomial distribution. Maximum likelihood estimation was used to aggregate all expression information for each heterozygous locus in the cohort, followed by performing a log-likelihood ratio test to determine the difference between the null model, i.e. loci without ASE-SNV effects, and the alternative model, i.e. loci with ASE-SNV effects. To control errors, p-values were adjusted using FDR (False Discovery Rate). Only loci with an FDR lower than 0.05 were considered to show an ASE effect. Out of all BIOS SNVs, 27,749 SNVs were found in 5 or more individuals, and of those, 3343 SNVs were identified as sites undergoing ASE. These ASE-SNVs were located in 1477 genes.
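The calling procedure above can be sketched as a per-locus likelihood ratio test on beta-binomial read counts followed by FDR adjustment. The text does not specify the parameterisation or the FDR method, so the sketch makes several assumptions that should be read as ours: the null model is a symmetric beta-binomial (a = b, i.e. an expected alternative-allele fraction of 0.5), the alternative leaves both shape parameters free, the test statistic is compared against a chi-squared distribution with one degree of freedom, and the Benjamini-Hochberg procedure is used for FDR. All function names and the toy read counts are hypothetical.

```python
import numpy as np
from scipy import optimize, stats

LOG_BOUNDS = (-5.0, 10.0)  # bounds on log(a) and log(b), keeps the optimiser stable

def _neg_loglik(log_params, alt_counts, totals):
    """Negative beta-binomial log-likelihood of alternative-allele read counts."""
    a, b = np.exp(log_params)  # optimise on the log scale so a and b stay positive
    return -stats.betabinom.logpmf(alt_counts, totals, a, b).sum()

def ase_lrt_pvalue(alt_counts, totals):
    """LRT for one locus: balanced (a == b) versus free (a, b) beta-binomial."""
    alt_counts, totals = np.asarray(alt_counts), np.asarray(totals)
    null = optimize.minimize(
        lambda p: _neg_loglik([p[0], p[0]], alt_counts, totals),
        x0=[np.log(10.0)], method="L-BFGS-B", bounds=[LOG_BOUNDS],
    )
    alt = optimize.minimize(
        _neg_loglik, x0=[null.x[0], null.x[0]], args=(alt_counts, totals),
        method="L-BFGS-B", bounds=[LOG_BOUNDS] * 2,
    )
    lr_stat = max(0.0, 2.0 * (null.fun - alt.fun))
    return stats.chi2.sf(lr_stat, df=1)  # one extra free parameter (assumption)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(pvalues, dtype=float)
    order = np.argsort(p)
    scaled = p[order] * len(p) / np.arange(1, len(p) + 1)
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]  # enforce monotonicity
    out = np.empty_like(p)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out

# Toy read counts for two loci, ten heterozygous individuals each, 40 reads per individual
totals = [40] * 10
balanced = [20, 19, 21, 20, 22, 18, 20, 21, 19, 20]
imbalanced = [32, 30, 33, 31, 34, 29, 32, 33, 31, 32]
pvals = [ase_lrt_pvalue(balanced, totals), ase_lrt_pvalue(imbalanced, totals)]
adjusted = benjamini_hochberg(pvals)
print(adjusted < 0.05)  # loci called as ASE at FDR < 0.05
```

The published implementation may differ in any of these choices; the sketch only shows how reads from all heterozygous individuals at a locus can be pooled into one likelihood before testing.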
To identify ASE effects in the GTEx dataset, reads were quantified and analyzed using the exact same statistical methods and criteria as applied for the BIOS cohort. Out of all GTEx SNVs, 25,789 SNVs were found in 5 or more individuals and of those, 3022 SNVs were identified as sites undergoing ASE.
Between BIOS (3343) and GTEx (3022), there is an overlap of 777 ASE-SNVs. The GTEx ASE-SNVs are located in 1387 genes, of which 513 overlap with the 1477 genes found in BIOS. The SNVs shared between BIOS and GTEx and their ASE effects are plotted in Fig. 1. Overlap between BIOS and GTEx is limited in terms of the number of matching ASE-SNVs and genes, presumably due to many intrinsic differences. However, the ASE effect distribution of both cohorts appears quite similar in Fig. 1, perhaps implying that genomic ‘ASE hotspots’ are nonetheless maintained.
It should be noted that there are around 130 well-established imprinted genes59 that were not detectable, because in our experimental setup, genotype calling was performed on expressed transcripts only. When only one allele is expressed as a result of monoallelic silencing through imprinting, only homozygous genotypes are called, to which ASE by definition does not apply.
ASE prediction model samples and features
The target variable for prediction is the probability of a variant undergoing ASE as part of a transcript. The number of training SNVs for BIOS is 27,749, of which 24,406 SNVs do not have ASE and 3343 SNVs have ASE. For GTEx, the number of training SNVs is 25,789, of which 22,767 SNVs do not have ASE and 3022 SNVs have ASE. Ten percent of the SNVs for both BIOS and GTEx was left out to serve as independent test sets.
These training examples are annotated with features to allow the learning process to construct a predictor. A total of 109 genomic features were considered, 107 from Combined Annotation Dependent Depletion (CADD)49 v1.4 for GRCh37 plus pLI_score from ExAC r0.360 and gnomAD_AF from gnomAD Genomes r2.0.261. The pLI_scores represent the tolerance of a given gene to loss of function, and the gnomAD_AF is the allele frequency calculated for variants genotyped in 15,708 whole-genomes from the Genome Aggregation Database (gnomAD). Details on the CADD features can be found at https://cadd.gs.washington.edu. We evaluated all features on missing values, their functional role in the genome, and potential correlation with ASE detectability. Removing the latter prevents the model from being biased towards ASE effects that are easier to detect due to higher expression or allele frequency. After evaluation, 39 features were removed and 70 features were used in training the final model. The removed features were: (1) Non-functional features: Chrom, Pos, Length, ConsScore, ConsDetail, motifEName, FeatureID, GeneID, GeneName, CCDS, Intron, Exon. (2) Features with over 40% missing values: motifECount, motifEHIPos, motifEScoreChng, Dst2Splice, Dst2SplType, targetScan, mirSVR-Score, mirSVR-E, mirSVR-Aln, TFBS, TFBSPeaks, TFBSPeaksMax, tOverlapMotifs, motifDist, dbscSNV-ada_score, dbscSNV-rf_score. (3) Features that potentially correlate with ASE detectability: EncExp, gnomAD_AF, Freq100bp, Rare100bp, Sngl100bp, Freq1000bp, Rare1000bp, Sngl1000bp, Freq10000bp, Rare10000bp, Sngl10000bp. Missing values of selected features were imputed using the empirical value according to the CADD v1.4 release notes. Non-numerical annotations were encoded as categorical or binary variables.
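The feature selection steps can be sketched with pandas as follows. This is a minimal illustration under stated assumptions: the helper `select_features` is a name of ours, the toy table contains made-up values, one-hot encoding stands in for the categorical encoding mentioned above, and imputation is omitted because the CADD default values are not listed in the text.

```python
import pandas as pd

def select_features(df, drop_columns, max_missing=0.40):
    """Drop listed columns and columns with too many missing values,
    then one-hot encode the remaining non-numerical columns."""
    df = df.drop(columns=[c for c in drop_columns if c in df.columns])
    missing_fraction = df.isna().mean()          # per-column fraction of missing values
    df = df.loc[:, missing_fraction <= max_missing]
    return pd.get_dummies(df)                    # encodes object/category columns only

# Toy annotation table: column names follow CADD, values are invented for illustration
toy = pd.DataFrame({
    "Chrom": [1, 1, 2, 2, 3],
    "GerpN": [1.2, 3.4, 0.5, 2.2, 4.1],
    "motifDist": [None, None, 0.1, None, None],  # 80% missing, exceeds the 40% cut-off
    "Consequence": ["INTRONIC", "STOP_GAINED", "INTRONIC", "SYNONYMOUS", "INTRONIC"],
})

selected = select_features(toy, drop_columns=["Chrom"])
print(list(selected.columns))
```

Features suspected of correlating with ASE detectability, such as gnomAD_AF, would simply be passed in `drop_columns` alongside the non-functional ones.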
ASE prediction model construction
A machine learning model was constructed using numpy v1.15.3, scipy v1.1.0, pandas v0.23.4, matplotlib v3.0.0, scikit-learn v0.20.0, imbalanced-learn v0.4.0, and prince v0.6.0 for Python 3.5.1. To discover which approach worked best for predicting ASE, we built models using multiple ensemble classifiers including random forest (AUROC = 0.796, BIOS), balanced random forest (AUROC = 0.778, BIOS), adaptive boosting (AUROC = 0.775, BIOS) and gradient boosting (highest AUROC, see “Results” section). These models were all constructed with default parameters and similar training strategies. All built models are available via Zenodo as Python pickle files (PKL, see “Data availability”).
The gradient boosting62 approach was chosen for the following reasons: (1) it allows a mixture of discrete and continuous features, (2) it is less prone to over-fitting or under-fitting, (3) it allows interpretation of feature importance, in contrast to algorithms such as support vector machines, (4) it is computationally efficient by exploiting multiple threads, and (5) it showed the best performance in terms of AUROC. Gradient boosting combines multiple weak learners, i.e. decision trees in our case; tenfold cross-validation was used to prevent overfitting. The final machine learning procedure was configured with 100 iterations, 6-fold inner cross-validation and 10-fold outer cross-validation, and was applied equally to the BIOS and GTEx datasets. When the resulting models are supplied with a set of input DNA features for a locus, they calculate a probability P between 0 and 1 that an ASE effect will occur at that locus, and conversely 1 − P that ASE will not occur.
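A nested cross-validation set-up of this kind can be sketched with scikit-learn as shown below. Several elements are assumptions of ours rather than details confirmed by the text: we read the 100 iterations as random-search draws, the hyperparameter search space is hypothetical, the data are synthetic, and the demonstration call uses much smaller settings than the defaults so that it runs quickly.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, cross_val_score

def nested_cv_auroc(X, y, n_iter=100, inner_folds=6, outer_folds=10, seed=0):
    """Inner loop tunes hyperparameters by random search; outer loop estimates AUROC."""
    param_distributions = {  # hypothetical search space, not the published one
        "learning_rate": uniform(0.01, 0.19),
        "max_depth": randint(2, 6),
        "subsample": uniform(0.6, 0.4),
    }
    search = RandomizedSearchCV(
        GradientBoostingClassifier(random_state=seed),
        param_distributions,
        n_iter=n_iter,
        scoring="roc_auc",
        cv=StratifiedKFold(inner_folds, shuffle=True, random_state=seed),
        random_state=seed,
    )
    outer = StratifiedKFold(outer_folds, shuffle=True, random_state=seed)
    return cross_val_score(search, X, y, scoring="roc_auc", cv=outer)

# Synthetic, imbalanced stand-in for the SNV feature matrix (about 12% positives)
X, y = make_classification(n_samples=200, n_features=10, weights=[0.88], random_state=0)

# Reduced settings for a quick demonstration
scores = nested_cv_auroc(X, y, n_iter=2, inner_folds=3, outer_folds=3)
print("outer-fold AUROC:", scores.round(3))

# A fitted model returns P(non-ASE) and P(ASE) per locus; the two sum to 1
model = GradientBoostingClassifier(random_state=0).fit(X, y)
p_non_ase, p_ase = model.predict_proba(X[:1])[0]  # column order follows model.classes_
print(p_ase, 1 - p_ase)
```

Keeping hyperparameter tuning inside the inner folds means the outer-fold AUROC is computed on data that played no part in model selection.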
ASE prediction model evaluation
Gini importance was chosen as a measure for feature importance because it is simple and fast to compute63. In scikit-learn, Gini importance is implemented as the impurity importance when using the Gini index as the splitting criterion in classification trees64. It is calculated as the decrease of node impurity, i.e. the gain in label homogeneity, weighted by the proportion of samples that reach a certain node, averaged over all classification trees. To evaluate overall model performance, we use Area Under the Receiver Operating Characteristic curve (AUROC), allowing for an unbiased overview of the trade-off between true positive rate (TPR) and false positive rate (FPR) at all decision thresholds. Furthermore, we calculated positive predictive value (PPV), negative predictive value (NPV), sensitivity (i.e. true positive rate or recall) and specificity (i.e. true negative rate or selectivity) as additional metrics to show model behaviour at specific thresholds.
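As a worked illustration of the weighted impurity decrease described above, the sketch below computes it for a single split of one classification tree. The labels are toy values and the function names are ours; a feature's importance would be obtained by summing such decreases over all nodes that split on that feature and averaging over trees.

```python
def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of samples reaching the node."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return (n / n_total) * (gini(parent) - children)

# Toy node: 3 ASE (1) and 5 non-ASE (0) SNVs, split into a mixed and a pure child
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]

decrease = weighted_impurity_decrease(parent, left, right, n_total=len(parent))
print(decrease)  # 0.28125
```

Here the parent impurity is 0.46875, the weighted child impurity is 0.1875, and all samples reach the node, giving a decrease of 0.28125.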
Model bias test
To test if the prediction models have any bias in terms of gene molecular function, we predicted BIOS ASE-SNVs with the GTEx model, and vice versa. We only considered ASE-SNVs unique to a cohort to allow independent back-prediction. We then compared gene enrichment profiles of predicted ASE-SNVs to profiles of randomly sampled ASE-SNVs from the same set. A gene enrichment profile is a list of ranked GO Molecular Function gene annotation terms, for which the term at rank 1 has the strongest overrepresentation in a given set of genes. If these profiles look exactly or about the same, it means that the predictions resemble random draws, and thus have no bias. We obtained the gene enrichment profiles by supplying lists of genes to the Enrichr webtool65,66, set to ‘GO Molecular Function 2018’, selecting ‘Table’ output, and downloading the results using ‘Export entries to table’.
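The resampling step that produces the expected-by-chance gene lists can be sketched as follows. The SNV identifiers, gene names and helper functions are hypothetical, and the enrichment analysis itself is not reproduced here because it was carried out through the Enrichr web interface.

```python
import random

def genes_of(snvs, snv_to_gene):
    """Sorted unique genes harbouring the given SNVs."""
    return sorted({snv_to_gene[s] for s in snvs})

def expected_gene_lists(unique_ase_snvs, snv_to_gene, sample_size, n_resamples=10, seed=0):
    """Gene lists for random draws of ASE-SNVs, each draw matching the size
    of the set of correctly predicted ASE-SNVs."""
    rng = random.Random(seed)
    pool = sorted(unique_ase_snvs)
    return [
        genes_of(rng.sample(pool, sample_size), snv_to_gene)
        for _ in range(n_resamples)
    ]

# Toy data: 50 cohort-unique ASE-SNVs spread over 7 hypothetical genes
snv_to_gene = {f"snv{i}": f"GENE{i % 7}" for i in range(50)}
correctly_predicted = [f"snv{i}" for i in range(0, 50, 5)]  # 10 SNVs

observed = genes_of(correctly_predicted, snv_to_gene)
expected = expected_gene_lists(snv_to_gene, snv_to_gene, sample_size=len(correctly_predicted))
print(observed)
print(len(expected), "resampled gene lists")
```

Each resulting gene list would then be submitted for enrichment analysis, after which the rank of every term in the observed profile can be compared with its ranks across the resampled profiles.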
Application to clinical genes
For our exploration of population variant ASE in clinical genes, we obtained lists of variants from gnomAD exomes release 2.1.161 using the following hg19/b37 coordinates, and retaining only SNVs: BRCA2 at chr 13 from 32,889,617 to 32,973,809 (3316 variants), RET at chr 10 from 43,572,517 to 43,625,797 (1700 variants), and NF1 at chr 17 from 29,421,945 to 29,704,695 (3941 variants). For each of these variants we predicted whether or not they are undergoing ASE by applying the BIOS-trained model using a probability threshold of 0.5. Any SNVs with positive ASE predictions were queried in ClinVar67, accessed 8 Oct 2020.
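The region filter and thresholding can be sketched as below. The coordinates are those listed above; the function names, the predicted probabilities and the use of a greater-than-or-equal comparison at the threshold are assumptions made for illustration.

```python
# GRCh37 (hg19/b37) regions as listed in the text: gene -> (chromosome, start, end)
REGIONS = {
    "BRCA2": ("13", 32_889_617, 32_973_809),
    "RET": ("10", 43_572_517, 43_625_797),
    "NF1": ("17", 29_421_945, 29_704_695),
}

def gene_for_variant(chrom, pos):
    """Return the target gene whose region contains the variant, or None."""
    for gene, (g_chrom, start, end) in REGIONS.items():
        if chrom == g_chrom and start <= pos <= end:
            return gene
    return None

def call_ase(variants, probabilities, threshold=0.5):
    """Keep in-region variants whose predicted ASE probability reaches the threshold."""
    calls = []
    for (chrom, pos), prob in zip(variants, probabilities):
        gene = gene_for_variant(chrom, pos)
        if gene is not None and prob >= threshold:
            calls.append((gene, chrom, pos, prob))
    return calls

# Two positions from Table 2 plus one off-target position; probabilities are hypothetical
variants = [("17", 29_576_138), ("10", 43_596_200), ("13", 1_000)]
probabilities = [0.81, 0.42, 0.95]
print(call_ase(variants, probabilities))
```

Only variants that pass both the region filter and the threshold would be carried forward to the ClinVar lookup.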
Acknowledgements
We thank the UMCG Genomics Coordination Center, the UMCG Research IT programme, the UG Center for Information Technology and their sponsors BBMRI-NL & TarGet for storage and compute infrastructure. We thank the Biobank-Based Integrative Omics Studies (BIOS) Consortium, funded by the Biobanking and Biomolecular Research Infrastructure Netherlands (BBMRI-NL), a research infrastructure financed by the Netherlands Organization for Scientific Research (NWO) under Award Number 184.021.007. The BIOS Consortium members are listed in Supplementary Data S1. We thank the Genotype-Tissue Expression (GTEx) Project, supported by the Common Fund of the Office of the Director of the National Institutes of Health (commonfund.nih.gov/GTEx). Additional funds were provided by the NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. Donors were enrolled at Biospecimen Source Sites funded by NCI Leidos Biomedical Research, Inc. subcontracts to the National Disease Research Interchange (10XS170), Roswell Park Cancer Institute (10XS171), and Science Care, Inc. (X10S172). The Laboratory, Data Analysis, and Coordinating Center (LDACC) was funded through a contract (HHSN268201000029C) to The Broad Institute, Inc. Biorepository operations were funded through a Leidos Biomedical Research, Inc. subcontract to Van Andel Research Institute (10ST1035). Additional data repository and project management were provided by Leidos Biomedical Research, Inc. (HHSN261200800001E). The Brain Bank was supported by supplements to University of Miami Grant DA006227. Statistical Methods development grants were made to the University of Geneva (MH090941 & MH101814), the University of Chicago (MH090951, MH090937, MH101825, & MH101820), the University of North Carolina - Chapel Hill (MH090936), North Carolina State University (MH101819), Harvard University (MH090948), Stanford University (MH101782), Washington University (MH101810), and to the University of Pennsylvania (MH101822).
Author contributions
Z.Z., K.J.V. and M.A.S. conceived the project. Z.Z. and K.J.V. performed the experimental work and wrote the manuscript with critical input and revisions from F.D., N.K., M.E.G., L.H.F., R.J.S. and M.A.S. All authors reviewed the manuscript.
Funding
Z.Z. is supported by a joint fellowship from the University Medical Center Groningen and China Scholarship Council (CSC201706350277). F.D. is supported by the Netherlands CardioVascular Research Initiative: “the Dutch Heart Foundation, Dutch Federation of University Medical Centres, the Netherlands Organisation for Health Research and Development and the Royal Netherlands Academy of Sciences” for the GENIUS project “Generating the best evidence-based pharmaceutical targets for atherosclerosis” (CVON2011-19). This project has received funding from the Netherlands Organisation for Scientific Research NWO under VIDI Grant Number 917.164.455. In addition, we acknowledge support from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No. 779257 (Solve-RD) and 825575 (European Joint Programme on Rare Diseases).
Data availability
The datasets used for the analyses described in this manuscript were obtained from the European Genome-phenome Archive (EGA) at https://www.ebi.ac.uk/ega through accession number EGAC00001000277 for BIOS, and from the database of Genotypes and Phenotypes (dbGaP) at http://www.ncbi.nlm.nih.gov/gap through accession number phs000424.v8.p2 for GTEx. All code and dependencies used are available on GitHub at https://github.com/zhenhua-zhang/asep. The codebase is also available as an archive at https://zenodo.org/record/4301755. The constructed machine learning models are available at https://zenodo.org/record/4700237.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-021-89904-y.
References
- 1. Knight JC. Allele-specific gene expression uncovered. Trends Genet. 2004;20:113–116. doi: 10.1016/j.tig.2004.01.001.
- 2. Raghupathy N, et al. Hierarchical analysis of RNA-seq reads improves the accuracy of allele-specific expression. Bioinformatics. 2018;34:2177–2184. doi: 10.1093/bioinformatics/bty078.
- 3. Plath K, Mlynarczyk-Evans S, Nusinow DA, Panning B. Xist RNA and the mechanism of X chromosome inactivation. Annu. Rev. Genet. 2002;36:233–278. doi: 10.1146/annurev.genet.36.042902.092433.
- 4. Daelemans C, et al. High-throughput analysis of candidate imprinted genes and allele-specific gene expression in the human term placenta. BMC Genet. 2010;11:25. doi: 10.1186/1471-2156-11-25.
- 5. Tang F, et al. Deterministic and stochastic allele specific gene expression in single mouse blastomeres. PLoS ONE. 2011;6:e21208. doi: 10.1371/journal.pone.0021208.
- 6. Tian L, et al. Genome-wide comparison of allele-specific gene expression between African and European populations. Hum. Mol. Genet. 2018;27:1067–1077. doi: 10.1093/hmg/ddy027.
- 7. Lo HS, et al. Allelic variation in gene expression is common in the human genome. Genome Res. 2003;13:1855–1862. doi: 10.1101/gr.1006603.
- 8. Yan H. Allelic variation in human gene expression. Science. 2002;297:1143. doi: 10.1126/science.1072545.
- 9. Kukurba KR, et al. Allelic expression of deleterious protein-coding variants across human tissues. PLoS Genet. 2014;10:e1004304. doi: 10.1371/journal.pgen.1004304.
- 10. Lee C, Kang EY, Gandal MJ, Eskin E, Geschwind DH. Profiling allele-specific gene expression in brains from individuals with autism spectrum disorder reveals preferential minor allele usage. Nat. Neurosci. 2019;22:1521–1532. doi: 10.1038/s41593-019-0461-9.
- 11. Valle L, et al. Germline allele-specific expression of TGFBR1 confers an increased risk of colorectal cancer. Science. 2008;321:1361–1365. doi: 10.1126/science.1159397.
- 12. de la Chapelle A. Genetic predisposition to human disease: allele-specific expression and low-penetrance regulatory loci. Oncogene. 2009;28:3345–3348. doi: 10.1038/onc.2009.194.
- 13. Maia A-T, et al. Effects of BRCA2 cis-regulation in normal breast and cancer risk amongst BRCA2 mutation carriers. Breast Cancer Res. 2012. doi: 10.1186/bcr3169.
- 14. Emison ES, et al. Differential contributions of rare and common, coding and noncoding RET mutations to multifactorial Hirschsprung disease liability. Am. J. Hum. Genet. 2010;87:60–74. doi: 10.1016/j.ajhg.2010.06.007.
- 15. Finch N, et al. TMEM106B regulates progranulin levels and the penetrance of FTLD in GRN mutation carriers. Neurology. 2011;76:467–474. doi: 10.1212/wnl.0b013e31820a0e3b.
- 16. Berlivet S, et al. Interaction between genetic and epigenetic variation defines gene expression patterns at the asthma-associated locus 17q12-q21 in lymphoblastoid cell lines. Hum. Genet. 2012;131:1161–1171. doi: 10.1007/s00439-012-1142-x.
- 17. Jentarra GM, et al. Skewed allele-specific expression of the NF1 gene in normal subjects. J. Child Neurol. 2011;27:695–702. doi: 10.1177/0883073811423439.
- 18. Gicquel C, et al. Epimutation of the telomeric imprinting center region on chromosome 11p15 in Silver-Russell syndrome. Nat. Genet. 2005;37:1003–1007. doi: 10.1038/ng1629.
- 19. Rose AM, et al. Dominant PRPF31 mutations are hypostatic to a recessive CNOT3 polymorphism in retinitis pigmentosa: a novel phenomenon of “linked trans-acting epistasis”. Ann. Hum. Genet. 2013;78:62–71. doi: 10.1111/ahg.12042.
- 20. Adams DR, Eng CM. Next-generation sequencing to diagnose suspected genetic disorders. N. Engl. J. Med. 2018;379:1353–1362. doi: 10.1056/nejmra1711801.
- 21. Saeidian AH, Youssefian L, Vahidnezhad H, Uitto J. Research techniques made simple: whole-transcriptome sequencing by RNA-seq for diagnosis of monogenic disorders. J. Investig. Dermatol. 2020;140:1117–1126.e1. doi: 10.1016/j.jid.2020.02.032.
- 22. Li D, Tian L, Hakonarson H. Increasing diagnostic yield by RNA-sequencing in rare disease–bypass hurdles of interpreting intronic or splice-altering variants. Ann. Transl. Med. 2018;6:126. doi: 10.21037/atm.2018.01.14.
- 23. Kremer LS, et al. Genetic diagnosis of Mendelian disorders via RNA sequencing. Nat. Commun. 2017. doi: 10.1038/ncomms15824.
- 24. Hamanaka K, et al. RNA sequencing solved the most common but unrecognized NEB pathogenic variant in Japanese nemaline myopathy. Genet. Med. 2018;21:1629–1638. doi: 10.1038/s41436-018-0360-6.
- 25. Volk AE, Kubisch C. The rapid evolution of molecular genetic diagnostics in neuromuscular diseases. Curr. Opin. Neurol. 2017;30:523–528. doi: 10.1097/wco.0000000000000478.
- 26. Mohammadi P, et al. Genetic regulatory variation in populations informs transcriptome analysis in rare disease. Science. 2019;366:351–356. doi: 10.1126/science.aay0256.
- 27. Marco-Puche G, Lois S, Benítez J, Trivino JC. RNA-seq perspectives to improve clinical diagnosis. Front. Genet. 2019. doi: 10.3389/fgene.2019.01152.
- 28. Byron SA, Van Keuren-Jensen KR, Engelthaler DM, Carpten JD, Craig DW. Translating RNA sequencing into clinical diagnostics: opportunities and challenges. Nat. Rev. Genet. 2016;17:257–271. doi: 10.1038/nrg.2016.10.
- 29. Miller JN, Pearce DA. Nonsense-mediated decay in genetic disease: friend or foe? Mut. Res. Rev. Mut. Res. 2014;762:52–64. doi: 10.1016/j.mrrev.2014.05.001.
- 30. Rao X, et al. Allele-specific expression and high-throughput reporter assay reveal functional genetic variants associated with alcohol use disorders. Mol. Psychiatry. 2019. doi: 10.1038/s41380-019-0508-z.
- 31. Höllerer S, et al. Large-scale DNA-based phenotypic recording and deep learning enable highly accurate sequence-function mapping. Nat. Commun. 2020. doi: 10.1038/s41467-020-17222-4.
- 32. Mahendran N, Durai Raj Vincent PM, Srinivasan K, Chang C-Y. Machine learning based computational gene selection models: a survey, performance evaluation, open issues, and future research directions. Front. Genet. 2020. doi: 10.3389/fgene.2020.603808.
- 33. Wani AH, et al. The impact of psychopathology, social adversity and stress-relevant DNA methylation on prospective risk for post-traumatic stress: a machine learning approach. J. Affect. Disord. 2021;282:894–905. doi: 10.1016/j.jad.2020.12.076.
- 34. Pataki BA, et al. Understanding and predicting ciprofloxacin minimum inhibitory concentration in Escherichia coli with machine learning. Sci. Rep. 2020. doi: 10.1038/s41598-020-71693-5.
- 35. Li S, et al. CAPICE: a computational method for consequence-agnostic pathogenicity interpretation of clinical exome variations. Genome Med. 2020. doi: 10.1186/s13073-020-00775-w.
- 36. Kopp W, Monti R, Tamburrini A, Ohler U, Akalin A. Deep learning for genomics using Janggu. Nat. Commun. 2020. doi: 10.1038/s41467-020-17155-y.
- 37. Nielsen AAK, Voigt CA. Deep learning to predict the lab-of-origin of engineered DNA. Nat. Commun. 2018. doi: 10.1038/s41467-018-05378-z.
- 38. Eraslan G, Avsec Z, Gagneur J, Theis FJ. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 2019;20:389–403. doi: 10.1038/s41576-019-0122-6.
- 39. Zhang X, Xiao W, Xiao W. DeepHE: accurately predicting human essential genes based on deep learning. PLOS Comput. Biol. 2020;16:e1008229. doi: 10.1371/journal.pcbi.1008229.
- 40. Zrimec J, et al. Deep learning suggests that gene expression is encoded in all parts of a co-evolving interacting gene regulatory structure. Nat. Commun. 2020. doi: 10.1038/s41467-020-19921-4.
- 41. Rahit KMTH, Tarailo-Graovac M. Genetic modifiers and rare Mendelian disease. Genes. 2020;11:239. doi: 10.3390/genes11030239.
- 42. Brady BL, Steinel NC, Bassing CH. Antigen receptor allelic exclusion: an update and reappraisal. J. Immunol. 2010;185:3801–3808. doi: 10.4049/jimmunol.1001158.
- 43. Vettermann C, Schlissel MS. Allelic exclusion of immunoglobulin genes: models and mechanisms. Immunol. Rev. 2010;237:22–42. doi: 10.1111/j.1600-065x.2010.00935.x.
- 44. Patel S. A critical review on serine protease: key immune manipulator and pathology mediator. Allergol. Immunopathol. 2017;45:579–591. doi: 10.1016/j.aller.2016.10.011.
- 45. Bestle D, et al. TMPRSS2 and furin are both essential for proteolytic activation of SARS-CoV-2 in human airway cells. Life Sci. Alliance. 2020;3:e202000786. doi: 10.26508/lsa.202000786.
- 46. Lee J-H, et al. A robust approach to identifying tissue-specific gene expression regulatory variants using personalized human induced pluripotent stem cells. PLoS Genet. 2009;5:e1000718. doi: 10.1371/journal.pgen.1000718.
- 47. Aguirre-Gamboa R, et al. Deconvolution of bulk blood eQTL effects into immune cell subpopulations. BMC Bioinform. 2020. doi: 10.1186/s12859-020-03576-5.
- 48. Jagadeesh KA, et al. S-CAP extends pathogenicity prediction to genetic variants that affect RNA splicing. Nat. Genet. 2019;51:755–763. doi: 10.1038/s41588-019-0348-4.
- 49. Kircher M, et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 2014;46:310–315. doi: 10.1038/ng.2892.
- 50. Richards S, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 2015;17:405–423. doi: 10.1038/gim.2015.30.
- 51. Zhernakova DV, et al. Identification of context-dependent expression quantitative trait loci in whole blood. Nat. Genet. 2016;49:139–145. doi: 10.1038/ng.3737.
- 52. de Klein N, van Dijk F, Deelen P, Urzua CG, Claringbould A, Võsa U, Verlouw JAM, Monajemi R, ’t Hoen PAC, Sinke RJ, Swertz MA, Franke L, BIOS Consortium. Imbalanced expression for predicted high-impact, autosomal-dominant variants in a cohort of 3,818 healthy samples. bioRxiv. 2020. doi: 10.1101/2020.09.19.300095.
- 53. Solomon BD, Nguyen A-D, Bear KA, Wolfsberg TG. Clinical genomic database. Proc. Natl. Acad. Sci. 2013;110:9851–9855. doi: 10.1073/pnas.1302575110.
- 54. Lonsdale J, et al. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 2013;45:580–585. doi: 10.1038/ng.2653.
- 55. GTEx Consortium. Genetic effects on gene expression across human tissues. Nature. 2017;550:204–213. doi: 10.1038/nature24277.
- 56. Riggs A. X inactivation, differentiation, and DNA methylation. Cytogenet. Genome Res. 1975;14:9–25. doi: 10.1159/000130315.
- 57. Shvetsova E, et al. Skewed X-inactivation is common in the general female population. Eur. J. Hum. Genet. 2018;27:455–465. doi: 10.1038/s41431-018-0291-3.
- 58. Tukiainen T, et al. Landscape of X chromosome inactivation across human tissues. Nature. 2017;550:244–248. doi: 10.1038/nature24265.
- 59. DeVeale B, van der Kooy D, Babak T. Critical evaluation of imprinted gene expression by RNA-seq: a new perspective. PLoS Genet. 2012;8:e1002600. doi: 10.1371/journal.pgen.1002600.
- 60. Lek M, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536:285–291. doi: 10.1038/nature19057.
- 61. Karczewski KJ, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434–443. doi: 10.1038/s41586-020-2308-7.
- 62. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann. Stat. 2001;29:1189–1232. doi: 10.1214/aos/1013203451.
- 63. Nembrini S, König IR, Wright MN. The revival of the Gini importance? Bioinformatics. 2018;34:3711–3718. doi: 10.1093/bioinformatics/bty373.
- 64. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. CRC Press; 1984.
- 65. Chen EY, et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinform. 2013;14:128. doi: 10.1186/1471-2105-14-128.
- 66. Kuleshov MV, et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucl. Acids Res. 2016;44:W90–W97. doi: 10.1093/nar/gkw377.
- 67. Landrum MJ, et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucl. Acids Res. 2013;42:D980–D985. doi: 10.1093/nar/gkt1113.
- 68. Hunter JD. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 2007;9:90–95. doi: 10.1109/mcse.2007.55.
- 69. Van Rossum G, Drake FL Jr. Python Reference Manual. Centrum voor Wiskunde en Informatica Amsterdam; 1995.
- 70. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2016.
- 71. Wickham H. ggplot2: Elegant graphics for data analysis. Springer; 2016.