Cell Reports Methods. 2023 Sep 12;3(9):100580. doi: 10.1016/j.crmeth.2023.100580

Atlas of primary cell-type-specific sequence models of gene expression and variant effects

Ksenia Sokolova 1,2, Chandra L Theesfeld 2, Aaron K Wong 3, Zijun Zhang 3,4, Kara Dolinski 2, Olga G Troyanskaya 1,2,3,5,∗∗
PMCID: PMC10545936  PMID: 37703883

Summary

Human biology is rooted in highly specialized cell types programmed by a common genome, 98% of which is outside of genes. Genetic variation in the enormous noncoding space is linked to the majority of disease risk. To address the problem of linking these variants to expression changes in primary human cells, we introduce ExPectoSC, an atlas of modular deep-learning-based models for predicting cell-type-specific gene expression directly from sequence. We provide models for 105 primary human cell types covering 7 organ systems, demonstrate their accuracy, and then apply them to prioritize relevant cell types for complex human diseases. The resulting atlas of sequence-based gene expression and variant effects is publicly available in a user-friendly interface and readily extensible to any primary cell types. We demonstrate the accuracy of our approach through systematic evaluations and apply the models to prioritize ClinVar clinical variants of uncertain significance, verifying our top predictions experimentally.

Keywords: deep learning, gene expression prediction, functional genomics, human disease, variant effects

Highlights

  • Model atlas for sequence-based gene expression predictions for primary cell types

  • Predicts effects of genomic variants on expression of any gene in a given cell type

  • Prioritization of relevant primary cell types carrying risk in human diseases

  • Available through an interactive interface at https://humanbase.io/expectosc

Motivation

Our work addresses a critical challenge for personalized medicine and basic biology—that of interpreting the genome, focusing on the 98% of the human genome that encodes regulation. Linking this variation to the underlying functional consequences is extremely difficult because of the scale of the mutation space and the impossibility of culturing many primary human cell types in the lab. We developed a computational framework capable of predicting cell-type-specific expression directly from genomic sequence, which allows us to computationally interrogate expression effects of millions of variants across the diversity of cell types and organ systems.


Sokolova et al. introduce ExPectoSC, a deep-learning-based atlas of sequence-based primary cell-type gene expression models. This approach is capable of predicting effects of millions of variants, while an interactive interface (https://humanbase.io/expectosc) and the ability to extend the atlas make ExPectoSC a resource for the global biomedical community.

Introduction

Deciphering the regulatory code of gene expression includes understanding genomic signals for spatial, temporal, and cell-type specificity in gene expression and elucidating the impact of genetic variation on expression dynamics. This regulatory code underpins fundamental biological processes, including development from a single cell into a complex organism and cellular specialization in function and morphology. Furthermore, precision medicine’s promise of improved, more precisely tailored treatments relies on interpreting an individual’s genome and mutation effects in the context of specialized cell types (e.g., specific neurons or heart fibroblasts). Because the human genome is three billion bases long, with four possible “letters” at each position, experimentally determining regulatory architecture across every mutation, cell type, and environment is intractable, and it is therefore necessary to develop computational methods for ab initio prediction of gene expression based on sequence.

Deep-learning models have recently emerged as a promising approach for prediction of gene expression from sequence.1,2,3,4,5 However, existing models predict gene expression in whole tissues or in lab-derived cell lines from bulk RNA sequencing (RNA-seq) data. Whole-tissue resolution cannot identify cell types contributing to disease and thus does not adequately model the etiology of human disease. Furthermore, cell lines represent a small portion of the emerging specialized cell types defined by recent single-cell gene expression studies and often diverge substantially in gene expression from cells in an organism. For example, brain tissue is composed of specialized neurons, immune cells, and structural support cells. Such primary cell types are difficult to study because the vast majority cannot be experimentally cultured. Thus, there is a critical need for a method capable of predicting sequence-based gene expression and variant effects in primary cell types (for example, to determine whether a given variant changes expression of a neural development gene in autism-relevant neurons or shuts down extracellular matrix proteins in heart cells, contributing to fibrosis).

Here, we introduce ExPectoSC, an atlas consisting of 105 primary cell-type models across seven organ systems. We show that these models are capable of predicting gene expression in primary human cell types based only on the DNA sequence. The atlas can be readily extended to include any cell type characterized by single-cell RNA-seq (scRNA-seq), including newly discovered human cell types. Each of the cell-type models integrates a convolutional neural network with a regularized linear model to predict gene expression effects of variants. We demonstrate that ExPectoSC is accurate and can prioritize causal disease variants.

Results

A deep-learning framework for building cell-type-specific models

To understand how genetic variation contributes to cell-type-specific gene expression and disease phenotypes, we developed a modular deep-learning approach that is scalable to all genes of the human genome and extendable to every cell type captured in single-cell experiments. Our approach, ExPectoSC, leverages our ExPecto2 framework, which previously provided only the tissue-level resolution available from bulk RNA-seq. ExPectoSC addresses this limitation by enabling users to train cell-type-specific models to get a precise understanding of the biological effects for the full diversity of primary human cell types. The framework trains on the reference genome, without any variant information, to learn sequence rules predictive of chromatin patterns and then to associate these patterns with single-cell expression (Figure 1). First, a neural network module learns interactions between DNA sequence and the genome-wide binding of biochemical regulators of gene expression. Specifically, a convolutional neural network (CNN)2,6 serves as an encoder that accepts a 2 kb DNA sequence and predicts 2,002 chromatin profiling features from bulk tissues and cell lines, including histone marks, transcription factor binding, and DNA accessibility. Predictions are accumulated for the 40 kbp region centered around the gene transcription start site (TSS) using a sliding window approach with a 200 bp step (STAR Methods). Then, a linear model module uses these predictions to fit a regularized linear model for each cell type to infer how the predicted biochemical effects at each TSS correspond to gene expression levels. After training, ExPectoSC requires a sequence ±20 kbp around the TSS to predict cell-type-specific expression. Moreover, any single-cell dataset can be used, allowing users to train customizable models.

Figure 1.

Architecture of the ExPectoSC framework and application to defining cell-type-specific effects of genetic variation in human health and disease

ExPectoSC implements a modular approach to predict cell-type-specific expression directly from sequence using module 1: a convolutional neural network (CNN) module to learn biochemical regulatory features; followed by module 2: regularized linear models with clustered single-cell RNA-seq data to provide information about cell-type-specific expression during training. The outputs can be applied to interrogate the effect of any genomic variant list at cell-type resolution.

See also Table S1.

We trained ExPectoSC single-cell gene expression models for over 100 cell types drawn from seven diverse organ systems: kidney,7 liver,8 lung,9 heart,10 pancreas,11 brain,12 and retina13 (Figure 2A; Table S1). ExPectoSC models are trained using cell clustering, cell identities, and gene expression values as defined in the original studies. ExPectoSC directly uses output data from scRNA-seq experiments and is therefore straightforward to extend to new datasets.

Figure 2.

ExPectoSC accurately models single-cell gene expression for >100 cell types in seven organs

(A) Schematic overview of the number of cell types in organ systems present in the article.

(B) Distribution of AUC ROC scores for predicting high and low expressed genes (top and bottom 25th percentiles) in each cell type, grouped by organ.

(C) Comparing AUC ROC for the cell-type-specific and whole-organ models for the cell-type-specific genes. Cell-type-specific models achieve considerably better results, with 10 models more than doubling the AUC score and 14 models moving up from under 0.5 AUC.

See also Figures S1 and S2, Table S1, and Data S1.

ExPectoSC accurately models cell-type-specific gene expression

First, we evaluated ExPectoSC performance on cell-type-specific gene expression prediction using the test dataset, consisting of genes on chromosomes 8 and 9, with a total of 1,104 genes. Importantly, these chromosomes were left out of all steps of training, both for the neural network and for the regularized linear models. All cell-type-specific models achieve over 0.8 area under the receiver operating characteristic curve (AUC ROC) (Figures 2B and S1) on the task of predicting variably expressed genes within the cell type (STAR Methods), indicating that ExPectoSC models accurately predict single-cell gene expression across the range of gene expressions observed in diverse organ tissues in single-cell experiments. These results are consistent with evaluation done using 5-fold cross-validation (Figure S2).

Cell-type-specific models are necessary for accurate single-cell gene expression prediction

To assess whether primary cell-type resolution is necessary for the prediction of cell-type-specific expression, we compare prediction performance of our cell-specific models with identical models aggregated to whole-organ resolution. We observe that for 93% of the models (98/105), cell-type-specific predictions have higher AUC ROC scores than the whole-organ models (Figure S2; Data S1).

Importantly, the difference is strongest for cell-type-specific genes: those that are expressed in the top quartile in at least one cell type and in the bottom quartile of expression in a different cell type in the same organ (STAR Methods; Figures 2C and S2). These cell-type-specific genes may be important to cell-type identity. For such cell-type-specific genes, cell-type specificity of models is especially essential, with nearly all (103/105) models improving performance by 20% on average. For many cell types, whole-organ predictions were inadequate (AUC < 0.5), while cell-type-specific models showed strong performance. For example, cell-type-specific expression in microglial cells in the retina fails to be predicted by the whole-organ model (AUC = 0.36), whereas the cell-type-specific model demonstrates 0.79 AUC. In fact, the five models with lowest whole-organ AUCs (<0.4 AUC) all showed over 50% improvements. Importantly, ExPectoSC predictions are not confined to highly expressed genes and are accurate for genes expressed at both high and low levels, providing insight for genes at all levels of expression.

ExPectoSC prediction of variant effects identifies disease-specific cell-type dysregulation

We then applied the ExPectoSC framework to the study of genetic variation in disease from the single-cell perspective. Our models can predict the effect of any TSS-proximal variant on single-cell expression, thus enabling the study of variants that may be associated with human diseases or traits. Importantly, because ExPectoSC predicts the effect of a DNA sequence change, it enables inference of causality of gene variant impacts on gene expression. Additionally, since models were trained without any variant data and since the ExPectoSC approach is not limited to a preset list of mutations, it can be applied to common and rare variants and even to variants that have not yet been observed in sequencing studies. In this section, we consider applications of the model outputs: first using ExPectoSC scores with genome-wide association studies (GWASs) to infer the impact of common variants associated with human disease risk and then applying ExPectoSC to mostly rare variants in ClinVar14 to identify variants with high cell-type-specific gene expression impact.

ExPectoSC-enabled identification of cell-type contribution to disease risk

GWASs statistically associate a set of genomic loci with specific traits and diseases. Any one study can yield many candidate variants in each region due to linkage disequilibrium, and even with advanced fine-mapping methods, identification of causal variants from other closely linked variants is uncertain. Since the associated loci are thought to act through dysregulation of gene expression, ExPectoSC can address the challenge by prioritizing candidate variants based on their impact on gene expression in relevant cell types. To enable this application, we applied ExPectoSC to a large, genome-wide collection of over 2 million 1000 Genomes Project (1000G) variants within ±20 kb of any TSS for each of the 105 cell types in ExPectoSC models, generating a resource for the community.

The set of ExPectoSC predictions for the 1000G variants can be used to understand high-resolution cell-type-specific effects that contribute to inherited disease risk. To this end, we leveraged stratified linkage disequilibrium score regression (sLDSC).15 LD score regression models the LD structure between SNPs and enables comparisons of SNP effects across different cell types and GWASs. We applied sLDSC to the ExPectoSC-predicted gene expression dysregulation across cell types for 102 GWASs to assess how the predicted SNP impacts (per cell type) contribute to disease risk heritability while also modeling the LD of nearby SNPs. This analysis directly identified significant enrichment of inherited risk for cell types modeled in ExPectoSC for each of 102 GWAS diseases and traits.

In general, we found good correspondence between cell types and tissues known to be involved clinically in disease and the ExPectoSC cell-type annotations enriched in disease risk. For example, we observed a significant (p < 2.2e−16) enrichment of immune cell types for the autoimmune disease GWAS, independent of the subset of diseases included (Figures 3A and S3).

Figure 3.

ExPectoSC functional SNP predictions can partition risk for heritability of traits and diseases to specific and appropriate cell types

(A) Enrichment of immune cell types across all organs for the autoimmune trait GWASs (see also Figure S3). Significantly enriched cell types are shown, sorted by p value.

(B) Differential sLDSC enrichment of the cell types for BMI and waist-hip ratio (WHR) adjusted for BMI, sorted by the relative p value and only showing significant cell types. BMI is enriched for brain cell types, but WHR (conditioned on BMI, which accounts for distribution of adipose tissues) shows the highest enrichment of cell types outside of the brain, as expected.

(C) Brain cell-type enrichments for BMI and adjusted WHR. Size of the dot shows relative enrichment, while the color represents p value. GABAergic neurons have the greatest enrichment.

Notably, the use of cell types from single-cell data allows a higher-resolution analysis of contribution to disease and traits. For example, use of ExPectoSC cell-type annotations in sLDSC demonstrated specific and distinct enrichment differences between cell types in risk heritability for BMI and BMI-adjusted waist-hip ratio (WHR, a surrogate for abdominal fat measurement). BMI risk was enriched in brain cell types (p = 1.84e−05), but risk for WHR adjusted for BMI was not (p = 0.0005) (Figure 3B). Through examining gene-level associations, BMI is documented to be connected to the brain regions and development,16 while adjusted WHR is associated with an enrichment of genes expressed in adipose tissue.17 Within ExPectoSC predictions for brain cell types for BMI, GABAergic interneurons had the greatest enrichment (Figure 3C). GABAergic neurons produce gamma-aminobutyric acid (GABA), which is involved in metabolism and is associated with feeding behaviors and body weight.18,19,20 Therefore, the risks associated with ExPectoSC-predicted cell types (based on noncoding variant impacts) are consistent with tissue signals from gene-level analyses, identifying similar tissue types with specific contributions to morphological traits. However, ExPectoSC goes further by enabling the disease risk to be partitioned to high-resolution intra-tissue-specific cell types.

ExPectoSC inquiry of clinically relevant mutations

One of the major challenges of precision medicine is classifying variants from clinical genomes with respect to their impact on disease state. In the ClinVar database, over 40% of variants identified in the course of clinical genome sequencing are classified as variants of uncertain significance (VUSs). We used ExPectoSC to predict the cell-type gene expression impacts for >60,000 ClinVar noncoding variants in 105 cell types that are within the ±20 kb region of every protein-coding gene TSS (Data S2; STAR Methods). We first compared ExPectoSC predictions for those variants annotated as pathogenic versus benign and observed that pathogenic variants had significantly stronger expression impacts (Mann-Whitney U test p = 2.921e−07; see Data S1 for per-cell-type values).

Then, we examined ExPectoSC predictions to identify potentially disease-associated variants among the VUSs. Among the highest predicted gene-expression-altering variants were six variants located in the promoter of PTEN, a regulator of the PI3K/Akt/mTORC1 signaling pathway, a tumor-suppressor gene, and an autism risk gene. ExPectoSC predicted a strong decrease in expression of PTEN in several brain cell types, including neuronal stem cells—a cell type implicated in autism dysregulation during development and explicitly regulated by PTEN (Figure 4A).21,22,23,24 We experimentally tested the transcriptional regulatory activity of top ClinVar variants in the characterized PTEN promoter25 with transcriptional reporter assays and found that the region produced robust transcriptional activity in neuroblastoma cells (Figure 4B). The alternate alleles at three positions with the highest predicted impact decreased this activity, while the alternate allele with low predicted impact did not. Based on the biochemical effects predicted by module 1 of ExPectoSC, we found that these variants likely affect expression through disruption of YY1 binding. Together, these predictions and validation experiments demonstrate how ExPectoSC can be used to identify potentially clinically relevant variants based on their impact on the expression of disease-relevant genes in specific primary human cell types.

Figure 4.

ExPectoSC-predicted and experimentally confirmed cell-type-specific effects of ClinVar variants in the PTEN promoter

(A) The plot shows the predicted impact of ClinVar variants in the region, showing variants of high predicted ExPectoSC impact near each other. Module 1 epigenetic predictions for these variants pointed to high impact on YY1 binding. Experimentally tested variants are additionally labeled for clarity.

(B) The PTEN promoter shows robust transcriptional activity in BE(2)-C neuroblastoma cells using a luciferase assay. The top three highest-predicted-impact ClinVar variants in the region (89623405G>C, 89623406C>G, and 89623408A>G) and one of the lowest-predicted-effect variants at 89623412T>C were tested (Data S3). The y axis shows the magnitude of transcriptional activation normalized to the activity for the empty vector and the reference allele. Significance levels were computed on the basis of a t test (two sided with unequal variance; ∗∗∗∗p < 0.0001). Central values of the boxplot represent the median of normalized magnitude, the box extends from the 25th to the 75th percentile, and whiskers extend to 1.5 times the interquartile range (IQR).

See also Data S1, S2, and S3.

Discussion

ExPectoSC cell-type-specific expression prediction offers an important insight into disease pathology at an unprecedented resolution, including prioritizing variants at the level of primary cell-type effects. Much like RNA-seq set a foundation for the development of scRNA-seq, our new method, ExPectoSC, builds on the previous work of predicting tissue gene expression to provide high-resolution primary cell-type-resolved expression effects. The modular structure of the framework allows us to use the deep learner to extract the transcriptional activity information and use it to predict the cell-type-resolved gene expression. This also enables future innovations separately in each module, for example updating module 1, the genome activity encoder, with newer larger models. In the future, we plan to train cell-type-specific models on additional organs, with a goal of expanding the model atlas to cover all major cell systems in the body. We envision that this would further our ability to interrogate variant effects at the cell-type level.

ExPectoSC predicts expression disruptions for primary human cell types by leveraging the module 1 predictions, which have cell line and tissue resolution. As demonstrated by accurate performance, ExPectoSC is able to learn the relationship between the relevant tissue- and cell-line-epigenetic signals in module 1 and the primary cell-type expression predicted with module 2. This mapping is not direct and may be leveraged to gain biological insight. For example, this information could be used to guide biomedical researchers in choosing appropriate and experimentally tractable cell lines for a given primary cell type26,27 and could possibly reveal new connections between epigenetic factors and gene expression regulatory contexts.

To allow users to query the impact of variants in >100 primary human cell types, we provide a user-friendly web server (https://humanbase.io/expectosc) and code that makes it easy to run ExPectoSC using a user-specified single-cell dataset. ExPectoSC enables biomedical researchers to create focused hypotheses for the role of variants in gene expression dysregulation and disease, meeting a critical need for guiding biomedical research into clinical mutation effects.

Limitations of the study

ExPectoSC is trained on the available single-cell-resolved gene expression data, and therefore cell-type definitions are limited to those available in the datasets. As these data compendia increase, the framework will be readily extendable to additional cell types and organ systems.

STAR★Methods

Key resources table

REAGENT or RESOURCE SOURCE IDENTIFIER
Chemicals, peptides, and recombinant proteins

Lipofectamine 3000 ThermoFisher Cat# L3000008
One-Glo luminescence reagent Promega E6110

Experimental models: Cell lines

BE(2)-C ATCC Cat# CRL-2268, RRID:CVCL_0529
HEPG2 ATCC Cat# HB-8065, RRID:CVCL_0027

Recombinant DNA

See Data S3 for all plasmid insert sequences (PTEN promoter variants) N/A
pGL4.23[luc2/minP] Promega E8411

Software and algorithms

Python 3.6 N/A
ExPectoSC This work https://github.com/ksenia007/ExPectoSC; https://doi.org/10.5281/zenodo.8234261
ExPectoSC server This work https://humanbase.io/expectosc

Resource availability

Lead contact

Further information requests should be directed to and will be fulfilled by the lead contact, Olga G. Troyanskaya (ogt@cs.princeton.edu).

Materials availability

Plasmids generated in this study are available upon request.

Experimental model and study participant details

Cell culture

Human neuroblastoma BE(2)-C cells (ATCC CRL-2268) were used at low passage after purchase from ATCC and were cultured in 10 cm plates with 1:1 EMEM:Ham’s F10 and 10% FBS at 37°C, 5% CO2. Cells were passaged at ∼70% confluence and diluted 1:4 before replating.

Promoter construct cloning

For the PTEN promoter, we cloned 605 nucleotides of genomic sequence reported as the “Sheng promoter”.25 Variants of this promoter were synthesized by Genewiz and fragments were cut with KpnI and BglII restriction enzymes for cloning into pGL4.23 (Promega) cut with the same enzymes. Data S3 contains sequences.

Luciferase reporter assays

Luciferase reporter assays were performed using 2 × 10^4 human neuroblastoma BE(2)-C cells/well plated in white 96-well plates with clear bottoms (Corning #3903). Twenty-four hours post-plating, cells were transfected, with each well receiving 0.2 µl Lipofectamine 3000 and 0.2 µl P3000 reagent (L3000-015, Thermofisher Scientific) and 75 ng of Promega pGL4.23 firefly luciferase vector containing the 605 nt promoter region in 8 µl of Opti-MEM (Thermofisher Scientific, 51985-034). Forty-two hours after transfection, luminescence was detected with the One-Glo assay system (Promega E6110) and a BioTek Synergy plate reader. Four replicates per variant were tested in each experiment, and variants were tested in at least three separate experiments.

Method details

Training models

The ExPectoSC models are composed of two modules: a deep-learning module that was trained to predict epigenetic information from the genome and a linear regression module that translates these effects into gene expression predictions. The first, deep-learning module of ExPectoSC used the pre-trained BELUGA convolutional neural network, detailed previously,2 to encode epigenetic signals from more than 2,000 ChIP-seq experiments with tissue and cell-line resolution. This module was run on the sequence within ±20 kb of the TSS of every protein-coding gene. To capture sequence context, ExPectoSC used a sliding window with a 200 bp step, each step taking a 2,000 bp sequence as input. The TSS for each gene was defined using CAGE peaks where possible4,28 (downloaded from https://fantom.gsc.riken.jp/5/datafiles/latest/extra/CAGE_peaks/ and accessed on April 26th, 2020; p1 peaks were used when multiple were available), or else the Gencode TSS29 (Release 19, GRCh37.p13).
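
The windowing described above can be sketched in a few lines. This is only an illustration: the function name `window_coordinates` is hypothetical, and it assumes that window centers run from 20 kb upstream to 20 kb downstream of the TSS in 200 bp increments, which may differ from the offset convention in the released code.

```python
def window_coordinates(tss, flank=20_000, step=200, window=2_000):
    """Return (start, end) half-open intervals of the windows tiled around a TSS.

    Assumption: window centers run from tss - flank to tss + flank in `step`
    increments; the real implementation may use a different offset convention.
    """
    half = window // 2
    centers = range(tss - flank, tss + flank + 1, step)
    return [(c - half, c + half) for c in centers]


# Hypothetical TSS position, chosen only to make the arithmetic easy to read.
windows = window_coordinates(tss=1_000_000)
print(len(windows))  # 201
print(windows[0], windows[-1])
```

Under this assumption, 40,000 / 200 + 1 = 201 windows are produced per gene, which matches the 201 sliding-window steps quoted for the feature matrix.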

The resulting matrix $U$ had shape $2002 \times 201$ (2,002 epigenetic markers, 201 sliding-window steps) for each gene. We then reduced its dimensions using exponential functions, downsampling from $2002 \times 201$ to $2002 \times 10$ input matrices. The exponential functions gave higher weight to predictions closer to the TSS by applying a weight matrix $W \in \mathbb{R}^{201 \times 10}$ to the windows, with entries $w_{i,j} = e^{-\gamma_j |d_i|}$, where $\gamma = [0.01, 0.02, 0.05, 0.01, 0.2, 0.2, 0.1, 0.05, 0.02, 0.01]$ and $d_i$ is the distance of sliding window $i$ from the TSS. The resulting matrix was $X = U \times W$. See Figure S4 for a visual example for a sample gene. The selection of the function was done in previous work.2
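
A minimal NumPy sketch of this downsampling step follows. It assumes the weights take the form exp(-gamma_j * |d_i|), so that they decay with distance from the TSS as the text describes, and that d_i is measured in window steps (-100 to 100); the units of d_i are not stated in the text. The matrix `U` is filled with random placeholder values rather than real chromatin predictions.

```python
import numpy as np

# Decay rates as listed in the text (one per output column).
gamma = np.array([0.01, 0.02, 0.05, 0.01, 0.2, 0.2, 0.1, 0.05, 0.02, 0.01])

# Assumption: d_i is the signed window offset from the TSS in units of steps
# (-100 ... 100); the text only calls it "the sliding window distance".
d = np.arange(-100, 101)

# W has shape (201, 10): one decaying weight profile per gamma value.
W = np.exp(-np.abs(d)[:, None] * gamma[None, :])

# U stands in for the (2002, 201) matrix of chromatin predictions for one gene;
# random values are used here purely as a placeholder.
rng = np.random.default_rng(0)
U = rng.random((2002, 201))

X = U @ W          # reduced feature matrix
print(X.shape)     # (2002, 10)
```

Each column of `X` is a distance-weighted sum of the window predictions, with larger decay rates concentrating the weight more tightly around the TSS.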

The second module translated the deep learning epigenetic predictions to the cell type specific expression predictions using linear regression models (Ridge regression), with separate models for each of the cell types.

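$$\log(g_{i,j} + \mathrm{pseudocount}) = X^{T}\beta_j + \varepsilon_i$$
$$\mathrm{Loss}_j = \sum_i w_i \log\big(L(g_{i,j}, X^{T}\beta_j)\big)$$
$$\beta_j = \operatorname{argmin}\big(\mathrm{Loss}_j + \alpha \lVert \beta_j \rVert^2\big)$$

where $g_{i,j}$ is the expression of the $i$-th gene in the $j$-th cell type, $\beta_j$ is the coefficient vector for the $j$-th cell type, $\varepsilon_i$ is the residual error, and $w_i$ is the weight assigned to the gene.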
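
To make the role of the gene weights and the regularization strength concrete, the sketch below fits a weighted ridge regression in closed form. It is not the authors' implementation: it swaps in a plain weighted squared-error loss for the per-gene loss term, omits any intercept, and uses synthetic data with toy dimensions. The function name `fit_weighted_ridge` is hypothetical.

```python
import numpy as np

def fit_weighted_ridge(X, y, w, alpha):
    """Solve min_beta sum_i w_i * (y_i - x_i @ beta)**2 + alpha * ||beta||^2."""
    Xw = X * w[:, None]                      # rows of X scaled by gene weights
    A = X.T @ Xw + alpha * np.eye(X.shape[1])
    b = Xw.T @ y
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
n_genes, n_features = 200, 20                # toy sizes, not the real feature count
X = rng.normal(size=(n_genes, n_features))
true_beta = rng.normal(size=n_features)
expression = np.exp(X @ true_beta * 0.1)     # synthetic, strictly positive "expression"
y = np.log(expression + 0.01)                # log-transformed label with a pseudocount
w = np.ones(n_genes)                         # uniform weights for this toy case

beta = fit_weighted_ridge(X, y, w, alpha=1.0)
print(beta.shape)
```

One such coefficient vector would be fitted per cell type, with `y` taken from that cell type's cluster-averaged expression.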

During training, the single-cell RNA-seq expression datasets were used as the labels gi,j for gene expression outputs. No additional imputation was performed. While we use scRNAseq to refer to all datasets used, ExPectoSC works with either scRNAseq or snRNAseq. The cell clustering, cell identities, scaling, and gene expression values were kept as they were defined in the original studies. Further quality control was applied: only cell clusters with more than 100 cells were used, and any clusters composed of “unclassified” cells or “doublets” were removed. During training, genes located on the X or Y chromosomes were excluded at all stages. For each cell type and each gene, expression was averaged across cells belonging to the same cluster.

To reduce the impact of non-cell type specific genes, we assigned weights (wi) to genes during training based on the variation across cell types. Genes for which standard deviation across the cell types per organ was high (above 70th percentile) were weighted higher (weight of 2), while genes with low standard deviation across the cell types (std below 10th percentile) were weighted lower (weight of 0.5). Genes between those ranges had a weight of 1. The final prediction scores were then log-transformed, with a pseudocount of 0.01.
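
The weighting rule above can be sketched as follows. The function name `gene_weights` is hypothetical, the input is random placeholder data, and the sketch assumes the standard deviation is taken directly on the per-cluster mean expression values; the text does not say whether it is computed before or after the log transform.

```python
import numpy as np

def gene_weights(expr, high_pct=70, low_pct=10):
    """expr: (n_genes, n_cell_types) array of per-cluster mean expression for one organ."""
    sd = expr.std(axis=1)                       # variation of each gene across cell types
    hi, lo = np.percentile(sd, [high_pct, low_pct])
    w = np.ones(expr.shape[0])
    w[sd > hi] = 2.0                            # variable genes count double
    w[sd < lo] = 0.5                            # near-constant genes count half
    return w

rng = np.random.default_rng(1)
expr = rng.gamma(shape=2.0, scale=1.0, size=(1000, 8))   # placeholder expression matrix
w = gene_weights(expr)
print({v: int((w == v).sum()) for v in (0.5, 1.0, 2.0)})
```

The resulting vector plays the role of w_i in the training loss, so that genes varying across cell types contribute more to the fit.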

The reference genome GRCh37 was used for the sequence information. The test set was composed of all the genes on chromosomes 8 and 9. The rest of the genes were used for training and validation. For each model, 5-fold cross-validation was performed 5 times, each time leaving out different genes for validation. Separation into folds was random. This allowed us to obtain a large validation dataset without compromising the amount of available training data. Additionally, a grid search was performed over the parameter space, which included the regularization strength alpha and an option to shift the labels to have zero mean in the log space. Final models used in downstream analyses were trained on all chromosomes (excluding chromosomes X and Y).
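
The data split can be illustrated with the standard library alone. The gene names and the helper functions `split_genes` and `repeated_kfold` are hypothetical. Whether "performed 5 times" refers to the five folds or to five independent repeats is not fully clear from the text, so the sketch exposes `repeats` as a parameter.

```python
import random

def split_genes(gene_to_chrom, test_chroms=("chr8", "chr9"), excluded=("chrX", "chrY")):
    """Separate held-out test genes from the training/validation pool."""
    test = [g for g, c in gene_to_chrom.items() if c in test_chroms]
    pool = [g for g, c in gene_to_chrom.items()
            if c not in test_chroms and c not in excluded]
    return pool, test

def repeated_kfold(genes, k=5, repeats=5, seed=0):
    """Yield (train, validation) gene lists for `repeats` random k-fold partitions."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = genes[:]
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        for i, val in enumerate(folds):
            train = [g for j, fold in enumerate(folds) if j != i for g in fold]
            yield train, val

# Toy annotation: 24 chromosomes with 10 hypothetical genes each.
chroms = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]
gene_to_chrom = {f"gene_{c}_{n}": c for c in chroms for n in range(10)}

pool, test = split_genes(gene_to_chrom)
splits = list(repeated_kfold(pool))
print(len(pool), len(test), len(splits))
```

Holding out whole chromosomes keeps the test genes unseen by every training step, while the random folds are drawn only from the remaining autosomal genes.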

Test dataset evaluations

First, we evaluated the models on the test data. We defined two subgroups of genes: high/low expressed genes, and on/off genes. High expressed genes have expression above 75th percentile for the cell type, and low expressed genes were below the 25th percentile for the cell type in the ground truth dataset. Therefore, this definition was based on the relative gene expression within the cell type. When operating on the ground truth, the high expressed genes were labeled as 1, and low expressed as 0. For the predicted expression values, we ranked and scaled them by the number of genes in total, thus converting ranks into the range (0, 1].
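
As a rough sketch of this evaluation, the function below labels top- and bottom-quartile genes from the observed expression, converts predictions to scaled ranks in (0, 1], and computes AUC ROC as the fraction of high/low gene pairs that are ordered correctly. The function name `high_low_auc` and the simulated data are illustrative only; the study itself may have used a library routine for the AUC.

```python
import numpy as np

def high_low_auc(observed, predicted):
    """AUC for separating top-quartile from bottom-quartile genes in one cell type."""
    q25, q75 = np.percentile(observed, [25, 75])
    keep = (observed > q75) | (observed < q25)        # drop the middle 50% of genes
    labels = (observed[keep] > q75).astype(int)       # 1 = high, 0 = low

    # Rank all predictions and scale by the number of genes -> values in (0, 1].
    ranks = predicted.argsort().argsort() + 1
    scores = (ranks / len(predicted))[keep]

    # AUC = probability that a random high gene outscores a random low gene.
    pos, neg = scores[labels == 1], scores[labels == 0]
    return (pos[:, None] > neg[None, :]).mean()

rng = np.random.default_rng(2)
observed = rng.normal(size=400)                       # simulated "ground truth"
predicted = observed + rng.normal(scale=1.0, size=400)  # noisy simulated predictions
print(round(float(high_low_auc(observed, predicted)), 3))
```

Because AUC depends only on the ordering of scores, the rank scaling does not change the result; it simply puts predictions from different cell types on a common scale.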

To further assess cell-type specificity of gene expression, differentially expressed genes (on/off group) were defined for each of the organ systems. A gene was included in this subset when it was labeled as high in one cell type and low in a different cell type for each of the organs, and assigned a label of 1. The remaining genes were labeled as 0. This allowed us to evaluate the performance of the model on the differentially expressed genes. Variance explained scores were computed using the sklearn30 Python library, and Spearman scores were computed using scipy.31
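
The on/off definition can be written compactly with per-cell-type quartiles. This is a sketch under the assumption that "high" and "low" use the same 75th and 25th percentile thresholds as the high/low evaluation; `on_off_genes` is a hypothetical name and the small matrix is made up for illustration.

```python
import numpy as np

def on_off_genes(expr):
    """expr: (n_genes, n_cell_types) expression for one organ.

    Returns a boolean vector marking genes that are in the top quartile of at
    least one cell type and in the bottom quartile of a different cell type.
    """
    q25 = np.percentile(expr, 25, axis=0)     # per-cell-type thresholds
    q75 = np.percentile(expr, 75, axis=0)
    high = expr > q75                          # (genes, cell types)
    low = expr < q25
    # A gene cannot be high and low in the same cell type, so "any high" and
    # "any low" together imply two different cell types.
    return high.any(axis=1) & low.any(axis=1)

# Made-up organ with 8 genes and 2 cell types; genes 0 and 7 flip between extremes.
expr = np.array([
    [1.0, 8.0],
    [2.0, 2.0],
    [3.0, 3.0],
    [4.0, 4.0],
    [5.0, 5.0],
    [6.0, 6.0],
    [7.0, 7.0],
    [8.0, 1.0],
])
print(on_off_genes(expr))
```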

ExPectoSC variant effect predictions

To infer the effect of genetic variation, we compared predicted gene expression values produced by the reference and alternative sequences. To stabilize the resulting scores, we exponentiated the predictions to obtain the predicted gene expression level, added a pseudocount of 1 to the predicted gene expression of both the reference and alternative sequences, as conventionally used in gene expression analysis, and then took the log-fold change between them as the final ExPectoSC variant effect prediction, such that the final score was computed as $s = \log_2\left(\frac{\exp(\mathrm{pred}_{alt}) + 1}{\exp(\mathrm{pred}_{ref}) + 1}\right)$, where $\mathrm{pred}_{ref}$ is the prediction for the reference allele and $\mathrm{pred}_{alt}$ is the prediction for the alternate allele. We ran the ExPectoSC framework on the 1000 Genomes data (∼2 million variants used in sLDSC32) and on ClinVar14 variants, for the variants that are within ±20 kbp of any gene TSS (Data S2). When more than one gene’s TSS was close, we selected the closest one. For ClinVar (downloaded from https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/ on Nov 10, 2020), we used variants that are labeled as 5 prime UTR variant, intron, or 3 prime UTR. For the sLDSC analysis, the absolute value of the effect (s) for predictions for 1000 Genomes was used. The strand of the variant was based on the gene that is considered to be the closest. To compare the predictions between the cell types, we normalized predictions of variant sets to those of 1000 Genomes variants by using Z scores computed per cell type, such that the effect for the given variant is $(s - \mathrm{mean}(s_1, \ldots, s_m)) / \mathrm{std}(s_1, \ldots, s_m)$, where $(s_1, \ldots, s_m)$ are all 1000 Genomes variant predictions for the given cell type and $s$ is the unscaled variant effect prediction.
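
The score and its per-cell-type normalization can be sketched with the standard library. The function names are hypothetical, the input predictions are made-up numbers, and the use of the population standard deviation is an assumption, since the text only says "std".

```python
import math
from statistics import mean, pstdev

def variant_effect(pred_ref, pred_alt):
    """log2 fold change between predicted expression levels, with pseudocount 1."""
    return math.log2((math.exp(pred_alt) + 1) / (math.exp(pred_ref) + 1))

def z_normalize(s, background):
    """Scale a score by the per-cell-type distribution of background scores."""
    return (s - mean(background)) / pstdev(background)

# Made-up log-scale predictions: a background of small effects and one larger drop.
background = [variant_effect(0.5, 0.5 + d) for d in (-0.2, -0.05, 0.0, 0.03, 0.1)]
s = variant_effect(0.5, 0.1)
print(s, z_normalize(s, background))
```

A variant whose alternate allele lowers the predicted expression gets a negative score; the Z score then expresses how unusual that effect is relative to the background variants in the same cell type.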

Genome-wide heritability partitioning

To evaluate how ExPectoSC predicted expression for noncoding genetic variants contributing to disease risk, we used stratified LD-score (s-LDSC) regression to partition the heritability of disease risk to genetic variants. We applied ExPectoSC to compute, for each cell type, the effect of all single-nucleotide variants in the 1000 Genomes data that are within +/− 20 kb of any gene. When a variant was within range of more than one gene, the gene closest to the variant was chosen. The ExPectoSC predictions per cell type per SNP were used as SNP annotations in place of the LDSC baseline annotations. The LDSC15,32 implementation was used for both LD score conversion and LDSC regression. Briefly, genome-wide variant effect predictions were converted to LD scores on a per-cell-type basis using the absolute value of the predicted effect (s), considering LD blocks of European ancestry. LDSC regression was performed in the multivariate mode by jointly regressing the LD scores of all 105 cell types against GWAS summary statistics. We performed the regression for 102 GWAS studies, where enrichment p values were computed by LDSC and corrected using Benjamini-Hochberg FDR. GWAS studies were removed if none of the cell types was significantly enriched (corrected p value cutoff of 0.05) or if at least one cell type labeled as significant had an enrichment standard deviation greater than half of the enrichment itself, leaving 60 studies.
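The study-level filtering described above can be sketched as below. This is an illustrative reading of the text, not the authors' pipeline: the function names are hypothetical, the Benjamini-Hochberg correction is assumed to be applied across the cell types of one study, and significance is assumed to mean a corrected p value strictly below 0.05. The enrichment values and their standard deviations would come from LDSC output.

```python
import numpy as np


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


def keep_study(adj_p, enrichment, enrichment_sd, alpha=0.05):
    """Keep a GWAS study only if it has at least one significant cell type
    and no significant cell type whose enrichment SD exceeds half its enrichment."""
    adj_p = np.asarray(adj_p, dtype=float)
    enrichment = np.asarray(enrichment, dtype=float)
    enrichment_sd = np.asarray(enrichment_sd, dtype=float)
    significant = adj_p < alpha
    if not significant.any():
        return False
    noisy = enrichment_sd[significant] > 0.5 * enrichment[significant]
    return not bool(noisy.any())


# Toy p values for four cell types in one study (illustrative only).
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])

kept = keep_study([0.02, 0.5], [4.0, 1.0], [1.0, 3.0])          # stable enrichment
dropped_noisy = keep_study([0.02, 0.5], [4.0, 1.0], [2.5, 0.1])  # SD > enrichment / 2
dropped_null = keep_study([0.2, 0.5], [4.0, 1.0], [1.0, 0.1])    # nothing significant
```

The second criterion only inspects cell types already called significant, so a noisy estimate in a non-significant cell type does not remove a study.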

Quantification and statistical analysis

Statistical tests used in the computational analyses are explained in each subsection of the STAR Methods. All luciferase assay experiments were performed at least 3 separate times (three biological replicates with starting cells taken from separate cell culture passages). Four replicates per variant were tested in each experiment (Data S3). For each sequence tested, the firefly luminescence was normalized to the empty vector (pGL4.23 with no insert). Statistics were calculated from the fold-over-empty-vector values in each biological replicate by two-sided t-test, assuming unequal variance. Data from all three experiments were combined for the t-test. Sample size estimation power analyses were not performed before experiments. Researchers were not blinded to samples for experiments.
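A two-sided t-test with unequal variance (Welch's test) on pooled fold-over-empty-vector values could look like the sketch below. The luminescence readings are invented placeholders, not data from this study, and the helper names are hypothetical. It also assumes that normalization means dividing by the mean empty-vector signal of the same experiment and that a variant construct is compared against a reference construct; the text does not spell out either detail.

```python
import numpy as np
from scipy.stats import ttest_ind


def fold_over_empty(firefly, empty_vector):
    """Divide firefly luminescence by the mean empty-vector signal (assumed normalization)."""
    return np.asarray(firefly, dtype=float) / np.mean(empty_vector)


def welch_test(reference_folds, variant_folds):
    """Two-sided t-test that does not assume equal variances."""
    result = ttest_ind(reference_folds, variant_folds, equal_var=False)
    return result.statistic, result.pvalue


# Invented placeholder readings: 3 experiments x 4 replicates per construct.
experiments = [
    {"empty": [100, 110, 90, 100], "ref": [820, 790, 805, 815], "var": [400, 420, 390, 410]},
    {"empty": [120, 115, 125, 120], "ref": [950, 990, 970, 940], "var": [500, 480, 510, 470]},
    {"empty": [95, 105, 100, 100], "ref": [780, 800, 810, 790], "var": [410, 395, 405, 415]},
]

# Normalize within each experiment, then pool all three experiments for the test.
ref_folds = np.concatenate([fold_over_empty(e["ref"], e["empty"]) for e in experiments])
var_folds = np.concatenate([fold_over_empty(e["var"], e["empty"]) for e in experiments])
t_stat, p_value = welch_test(ref_folds, var_folds)
```

Normalizing within each experiment before pooling removes between-experiment differences in overall signal, so the pooled test compares constructs rather than experiment batches.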

Acknowledgments

This work is supported by NIH grants R01HG005998, U54HL117798, and R01GM071966; HHS grant HHSN272201000054C; and Simons Foundation grant 395506 to O.G.T. The authors acknowledge all members of the Troyanskaya laboratory at Princeton University and the Flatiron Institute for helpful discussions. We also thank the Simons Foundation and the Scientific Computing Core of the Flatiron Institute. We are pleased to acknowledge that the work reported on in this paper was substantially performed using the Princeton Research Computing resources at Princeton University, which is a consortium of groups led by the Princeton Institute for Computational Science and Engineering (PICSciE) and the Office of Information Technology’s Research Computing. The graphical abstract was created with BioRender.com.

Author contributions

K.S., Z.Z., and O.G.T. conceived and designed the computational research. K.S. performed method development and computational analyses with contributions from C.L.T., K.D., Z.Z., and A.K.W. C.L.T. designed and executed the experiments. A.K.W. developed the web interface. K.S., C.L.T., and O.G.T. wrote the manuscript with input from Z.Z., K.D., and A.K.W.

Declaration of interests

The authors declare no competing interests.

Published: September 12, 2023

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.crmeth.2023.100580.

Contributor Information

Chandra L. Theesfeld, Email: chandrat@princeton.edu.

Olga G. Troyanskaya, Email: ogt@cs.princeton.edu.

Supplemental information

Document S1. Figures S1–S4 and Table S1
mmc1.pdf (931KB, pdf)
Data S1. Evaluation and ClinVar variant set results, related to Figures 2 and 4

A: AUC scores for the test and cross-validation datasets, related to Fig 2; B: Mann-Whitney U cell-type level comparison of benign vs pathogenic variants score for ClinVar variants, related to Fig 4

mmc2.xlsx (17.9KB, xlsx)
Data S2. ClinVar predictions for the noncoding variants, related to Figure 4
mmc3.csv (114.4MB, csv)
Data S3. Information for the experimental validation of the PTEN promoter variants, related to Figure 4

A: Luciferase data for experimentally tested PTEN promoter variants, related to Fig 4. B: DNA sequences for experimentally tested PTEN promoter variants, related to Fig 4

mmc4.xlsx (13KB, xlsx)
Document S2. Article plus supplemental information
mmc5.pdf (4.6MB, pdf)

Data and code availability

  • Data: See Supplementary Data. In addition, ExPectoSC annotations for the sLDSC pipeline and Module 1 model weights are available for download at humanbase.io/expectosc.

  • Code: All original code has been deposited at https://github.com/ksenia007/ExPectoSC. The DOI is listed in the key resources table.

  • Any additional information required to reanalyze the data reported in this paper is available from the Lead Contact upon request.

References

  • 1.Agarwal V., Shendure J. Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks. Cell Rep. 2020;31 doi: 10.1016/j.celrep.2020.107663.
  • 2.Zhou J., Theesfeld C.L., Yao K., Chen K.M., Wong A.K., Troyanskaya O.G. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet. 2018;50:1171–1179. doi: 10.1038/s41588-018-0160-6.
  • 3.Kelley D.R., Reshef Y.A., Bileschi M., Belanger D., McLean C.Y., Snoek J. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 2018;28:739–750. doi: 10.1101/gr.227819.117.
  • 4.Avsec Ž., Agarwal V., Visentin D., Ledsam J.R., Grabska-Barwinska A., Taylor K.R., Assael Y., Jumper J., Kohli P., Kelley D.R. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods. 2021;18:1196–1203. doi: 10.1038/s41592-021-01252-x.
  • 5.Koido M., Hon C.-C., Koyama S., Kawaji H., Murakawa Y., Ishigaki K., Ito K., Sese J., Parrish N.F., Kamatani Y., Carninci P., Terao C. Prediction of the cell-type-specific transcription of non-coding RNAs from genome sequences via machine learning. Nat. Biomed. Eng. 2022;7:830–844. doi: 10.1038/s41551-022-00961-8.
  • 6.Zhou J., Troyanskaya O.G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods. 2015;12:931–934. doi: 10.1038/nmeth.3547.
  • 7.Stewart B.J., Ferdinand J.R., Young M.D., Mitchell T.J., Loudon K.W., Riding A.M., Richoz N., Frazer G.L., Staniforth J.U.L., Vieira Braga F.A., et al. Spatiotemporal immune zonation of the human kidney. Science. 2019;365:1461–1466. doi: 10.1126/science.aat5031.
  • 8.MacParland S.A., Liu J.C., Ma X.-Z., Innes B.T., Bartczak A.M., Gage B.K., Manuel J., Khuu N., Echeverri J., Linares I., et al. Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations. Nat. Commun. 2018;9:4383. doi: 10.1038/s41467-018-06318-7.
  • 9.Madissoon E., Wilbrey-Clark A., Miragaia R.J., Saeb-Parsy K., Mahbubani K.T., Georgakopoulos N., Harding P., Polanski K., Huang N., Nowicki-Osuch K., et al. scRNA-seq assessment of the human lung, spleen, and esophagus tissue stability after cold preservation. Genome Biol. 2019;21:1. doi: 10.1186/s13059-019-1906-x.
  • 10.Litviňuková M., Talavera-López C., Maatz H., Reichart D., Worth C.L., Lindberg E.L., Kanda M., Polanski K., Heinig M., Lee M., et al. Cells of the adult human heart. Nature. 2020;588:466–472. doi: 10.1038/s41586-020-2797-4.
  • 11.Baron M., Veres A., Wolock S.L., Faust A.L., Gaujoux R., Vetere A., Ryu J.H., Wagner B.K., Shen-Orr S.S., Klein A.M., et al. A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure. Cell Syst. 2016;3:346–360.e4. doi: 10.1016/j.cels.2016.08.011.
  • 12.Habib N., Avraham-Davidi I., Basu A., Burks T., Shekhar K., Hofree M., Choudhury S.R., Aguet F., Gelfand E., Ardlie K., et al. Massively parallel single-nucleus RNA-seq with DroNc-seq. Nat. Methods. 2017;14:955–958. doi: 10.1038/nmeth.4407.
  • 13.Lukowski S.W., Lo C.Y., Sharov A.A., Nguyen Q., Fang L., Hung S.S., Zhu L., Zhang T., Grünert U., Nguyen T., et al. A single-cell transcriptome atlas of the adult human retina. EMBO J. 2019;38 doi: 10.15252/embj.2018100811.
  • 14.Landrum M.J., Lee J.M., Benson M., Brown G.R., Chao C., Chitipiralla S., Gu B., Hart J., Hoffman D., Jang W., et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2018;46:D1062–D1067. doi: 10.1093/nar/gkx1153.
  • 15.Finucane H.K., Bulik-Sullivan B., Gusev A., Trynka G., Reshef Y., Loh P.-R., Anttila V., Xu H., Zang C., Farh K., et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 2015;47:1228–1235. doi: 10.1038/ng.3404.
  • 16.Hamer M., Batty G.D. Association of body mass index and waist-to-hip ratio with brain structure: UK Biobank study. Neurology. 2019;92:e594–e600. doi: 10.1212/WNL.0000000000006879.
  • 17.Shungin D., Winkler T.W., Croteau-Chonka D.C., Ferreira T., Locke A.E., Mägi R., Strawbridge R.J., Pers T.H., Fischer K., Justice A.E., et al. New genetic loci link adipose and insulin biology to body fat distribution. Nature. 2015;518:187–196. doi: 10.1038/nature14132.
  • 18.Turenius C.I., Htut M.M., Prodon D.A., Ebersole P.L., Ngo P.T., Lara R.N., Wilczynski J.L., Stanley B.G. GABA(A) receptors in the lateral hypothalamus as mediators of satiety and body weight regulation. Brain Res. 2009;1262:16–24. doi: 10.1016/j.brainres.2009.01.016.
  • 19.Xu Y., O’Brien W.G., 3rd, Lee C.-C., Myers M.G., Jr., Tong Q. Role of GABA release from leptin receptor-expressing neurons in body weight regulation. Endocrinology. 2012;153:2223–2233. doi: 10.1210/en.2011-2071.
  • 20.Tong Q., Ye C.-P., Jones J.E., Elmquist J.K., Lowell B.B. Synaptic release of GABA by AgRP neurons is required for normal regulation of energy balance. Nat. Neurosci. 2008;11:998–1000. doi: 10.1038/nn.2167.
  • 21.Butler M.G., Dasouki M.J., Zhou X.-P., Talebizadeh Z., Brown M., Takahashi T.N., Miles J.H., Wang C.H., Stratton R., Pilarski R., Eng C. Subset of individuals with autism spectrum disorders and extreme macrocephaly associated with germline PTEN tumour suppressor gene mutations. J. Med. Genet. 2005;42:318–321. doi: 10.1136/jmg.2004.024646.
  • 22.Buxbaum J.D., Cai G., Chaste P., Nygren G., Goldsmith J., Reichert J., Anckarsäter H., Rastam M., Smith C.J., Silverman J.M., et al. Mutation screening of the PTEN gene in patients with autism spectrum disorders and macrocephaly. Am. J. Med. Genet. B Neuropsychiatr. Genet. 2007;144B:484–491. doi: 10.1002/ajmg.b.30493.
  • 23.Kwon C.-H., Luikart B.W., Powell C.M., Zhou J., Matheny S.A., Zhang W., Li Y., Baker S.J., Parada L.F. Pten regulates neuronal arborization and social interaction in mice. Neuron. 2006;50:377–388. doi: 10.1016/j.neuron.2006.03.023.
  • 24.Luikart B.W., Schnell E., Washburn E.K., Bensen A.L., Tovar K.R., Westbrook G.L. Pten knockdown in vivo increases excitatory drive onto dentate granule cells. J. Neurosci. 2011;31:4345–4354. doi: 10.1523/JNEUROSCI.0061-11.2011.
  • 25.Sheng X., Koul D., Liu J.L., Liu T.J., Yung W. Promoter Analysis of Tumor Suppressor Gene PTEN: Identification of Minimum Promoter Region. Biochem. Biophys. Res. Commun. 2002;292:422–426. doi: 10.1006/bbrc.2002.6662.
  • 26.Lopes-Ramos C.M., Paulson J.N., Chen C.-Y., Kuijjer M.L., Fagny M., Platig J., Sonawane A.R., DeMeo D.L., Quackenbush J., Glass K. Regulatory network changes between cell lines and their tissues of origin. BMC Genom. 2017;18:723–813. doi: 10.1186/s12864-017-4111-x.
  • 27.Deng L., Pollmeier L., Zhou Q., Bergemann S., Bode C., Hein L., Lother A. Gene expression in immortalized versus primary isolated cardiac endothelial cells. Sci. Rep. 2020;10:2241–2249. doi: 10.1038/s41598-020-59213-x.
  • 28.Bertin N., Mendez M., Hasegawa A., Lizio M., Abugessaisa I., Severin J., Sakai-Ohno M., Lassmann T., Kasukawa T., Kawaji H., et al. Linking FANTOM5 CAGE peaks to annotations with CAGEscan. Sci. Data. 2017;4:170147–170148. doi: 10.1038/sdata.2017.147.
  • 29.Harrow J., Frankish A., Gonzalez J.M., Tapanari E., Diekhans M., Kokocinski F., Aken B.L., Barrell D., Zadissa A., Searle S., et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012;22:1760–1774. doi: 10.1101/gr.135350.111.
  • 30.Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011;12:2825–2830.
  • 31.Virtanen P., Gommers R., Oliphant T.E., Haberland M., Reddy T., Cournapeau D., Burovski E., Peterson P., Weckesser W., Bright J., et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods. 2020;17:261–272. doi: 10.1038/s41592-019-0686-2.
  • 32.Bulik-Sullivan B.K., Loh P.-R., Finucane H.K., Ripke S., Yang J., Schizophrenia Working Group of the Psychiatric Genomics Consortium, Patterson N., Daly M.J., Price A.L., Neale B.M. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 2015;47:291–295. doi: 10.1038/ng.3211.

Articles from Cell Reports Methods are provided here courtesy of Elsevier
