eLife. 2020 Oct 21;9:e59929. doi: 10.7554/eLife.59929

Gene expression variability in human and chimpanzee populations share common determinants

Benjamin Jung Fair 1, Lauren E Blake 2, Abhishek Sarkar 2, Bryan J Pavlovic 3, Claudia Cuevas 2, Yoav Gilad 1,2
Editors: Hunter B Fraser 4, Detlef Weigel 5
PMCID: PMC7644215  PMID: 33084571

Abstract

Inter-individual variation in gene expression has been shown to be heritable and is often associated with differences in disease susceptibility between individuals. Many studies have focused on mapping associations between genetic and gene regulatory variation, yet much less attention has been paid to the evolutionary processes that shape the observed differences in gene regulation between individuals in humans or any other primate. To begin addressing this gap, we performed a comparative analysis of gene expression variability and expression quantitative trait loci (eQTLs) in humans and chimpanzees, using gene expression data from primary heart samples. We found that expression variability in both species is often determined by non-genetic sources, such as cell-type heterogeneity. However, we also provide evidence that inter-individual variation in gene regulation can be genetically controlled, and that the degree of such variability is generally conserved in humans and chimpanzees. In particular, we found a significant overlap of orthologous genes associated with eQTLs in both species. We conclude that gene expression variability in humans and chimpanzees often evolves under similar evolutionary pressures.

Research organism: Human

Introduction

Variation in gene expression underlies many phenotypic differences between and within species. Gene expression itself is a quantitative phenotype that is subject to both random drift and natural selection. A deeper understanding of how natural selection shapes gene expression across primates is central to our understanding of human evolution, and may also elucidate the mechanistic basis for variation in quantitative traits and disease risk within species.

Several studies have used a comparative transcriptomics approach across species and tissues to identify genes whose expression patterns are consistent with the action of natural selection (Barbosa-Morais et al., 2012; Brawand et al., 2011; Merkin et al., 2012). For example, a pattern of highly similar gene expression levels in all primates may be consistent with the action of stabilizing selection on gene regulation, and potentially, negative selection on the corresponding regulatory elements. Indeed, many studies have found that the expression of most genes evolves slower than expected under neutrality (Chan et al., 2009; Khaitovich et al., 2006; Khan et al., 2013; Lemos et al., 2005; Merkin et al., 2012; Romero et al., 2012). Conversely, genes that show a reduced or elevated expression level exclusively in the human lineage may indicate directional selection on gene expression in humans, potentially resulting in positive selection for particular regulatory variants (Blekhman et al., 2008; Gilad et al., 2006; Perry et al., 2012). However, it is often difficult to determine whether lineage-specific expression changes are due to inter-species environmental differences or to natural selection on certain regulatory variants.

A complementary approach to understanding gene regulation and associated selection pressures utilizes within-species variation to identify genetic variants that affect gene expression levels. Such variants are referred to as expression quantitative trait loci, or eQTLs. Overall, there is some evidence that eQTLs are evolving under weak negative selection. For example, the magnitude of eQTL effect size on expression levels is weakly anti-correlated with eQTL allele frequency (Battle et al., 2014; Glassberg et al., 2019). Additionally, the set of human genes associated with an eQTL (eGenes) tend to be slightly depleted for genes relevant to universally conserved and essential cellular processes (Popadin et al., 2014; Tung et al., 2015; Ward and Gilad, 2019), and for genes at the center of protein interaction networks (Battle et al., 2014; Mähler et al., 2017). These observations are consistent with negative selection purging strong regulatory variants within species, particularly variants that regulate the expression of genes whose precise regulation is essential.

The eQTL mapping approach allows us to connect genetic variants to the genes they regulate, and provides a mechanistic explanation for a portion of the heritability of gene expression phenotypes (Price et al., 2011; Wright et al., 2014). However, not all gene expression variation can be explained by genetic variation, and of the genetic contribution, only about 20% of heritability can be explained by locally acting variants (referred to as cis eQTLs; Albert et al., 2018; Price et al., 2011). We assume that the remaining 80% of heritable expression variation is determined by distally acting regulatory variants (trans eQTLs). However, because trans eQTLs can be located anywhere in the genome, they are difficult to pinpoint, even in studies with large sample sizes (Battle et al., 2014; Westra et al., 2013). Therefore, regarding the identification of genes undergoing stabilizing selection within or between closely related species, the insights gained by cis eQTL approaches may be limited.

A third approach to understanding the evolutionary forces on gene expression is to directly quantify the degree of total gene expression variation (i.e. the variability) within or between populations or species. The variability of gene expression in a population is the sum of variability introduced by local and distal genetic regulation, as well as other sources, including epigenetic (Bashkeel et al., 2019) and environmental effects. The variability in gene expression observed in a given population is the result of all of these effects as well as any technical variability that was introduced during experimental data collection. While variability measurements alone cannot help disentangle the genetic component of variability from the non-genetic component, we do expect that natural selection will minimize the regulatory variation of genes with dosage-sensitive functions. Fortunately, it is possible to obtain relatively stable measurements of population variability for every expressed gene using a moderate sample size of just tens of individuals (de Jong et al., 2019).

The quantification of regulatory variability may provide a window into distinct biological phenomena that are difficult to ascertain by studying inter-species differences in mean expression levels. For example, the adaptability of a population in response to new stresses may be a general function of gene expression variability, as has been demonstrated in yeast populations (Bódi et al., 2017; Wang and Zhang, 2011; Zhang et al., 2009). Furthermore, identification of hypervariable genes, regardless of the genetic or non-genetic source of variability, may help identify genes and pathways which confer differences in disease susceptibilities and treatment responses (Ho et al., 2008; Knowles et al., 2018; Simonovsky et al., 2019). Direct measurements of variability may therefore be a useful complement to the analysis of differences in mean gene expression levels or to eQTL mapping, and could contribute to a better understanding of gene expression evolution and complex traits.

A previous analysis of gene expression variability within and between human populations found higher regulatory variation in genes associated with disease susceptibility (Li et al., 2010). This study did not otherwise identify any particular functional classes or features of genes that showed significant inter-population differences in expression variability, although given the relative genetic similarity, short evolutionary timescale, and migrations between human populations, we might not expect different signatures of selection to be apparent. Nonetheless, this study found that regulatory variability across genes correlates with the levels of genetic variability at nearby loci (Li et al., 2010), thereby providing some measure of support to the notion that differences in regulatory variability between genes are at least partly genetically encoded.

We sought to better understand selection pressures on gene regulation variability by collecting gene expression data from humans and chimpanzees (Pan troglodytes). The availability of suitable samples from chimpanzees has notoriously been a limitation for comparative functional genomic studies. Here, we were able to collect the largest population sample of chimpanzee primary tissues to date, allowing not only for sensitive assessment of differences in mean expression levels between species, but also differences in variability within species. To better isolate the genetic component of variability, we complemented the analysis of gene expression levels with a comparative eQTL mapping approach.

Results

We analyzed RNA-seq data from postmortem primary heart tissue samples of 39 human and 39 chimpanzee individuals (Figure 1—source data 1). Data from 11 of these humans and 18 of the chimpanzees were previously collected in our lab and published (Pavlovic et al., 2018). We obtained additional human data from the GTEx consortium (post-mortem heart, left ventricle), filtering for high quality samples with sufficient read depth (Figure 1A–B, see Materials and methods for filtering criteria). To complement the published data from chimpanzees, we generated new RNA-seq data from primary heart samples of 21 additional chimpanzee individuals (see Materials and methods). Because the data were collected at different times, and some of the data from humans were collected in different labs (the GTEx data), our overall study design introduces clear technical batch effects that are partly confounded with species (Figure 1B). Our focus, however, is on inter-species comparisons of gene expression variability and eQTLs, which are estimated and identified using the within-species data. The batch effects are therefore expected to have only a minimal impact on the reported results.

Figure 1. Summary of RNA-seq datasets.

(A) Number of reads mapped to expressed genes for each sample. To obtain a balanced set of 39 samples per species, the 10 samples outlined in gray were excluded from differential expression and variability analysis. Nine of these samples are among the lowest read count samples sourced from GTEx. The remaining sample was excluded on the basis of inspection of (B) unsupervised hierarchical clustering of RNA-seq samples by Pearson correlation matrix of expression log(CPM). (C) Principal component analysis shows samples separating by species along the first principal component. Only samples used in DE and variability analyses are shown in (C). Source data for (A) and other metadata for RNA-seq datasets used in this study are in Figure 1—source data 1.

Figure 1—source data 1. RNA-seq datasets used in this study.

Figure 1—figure supplement 1. Effect size and significance of differential expression (DE) analysis.

(A) Volcano plot of DE genes between human and chimpanzee heart tissue. Inset focuses on the red boxed area to highlight the relationship between expression fold change and DE gene classification at various FDR thresholds. (B) The number of DE genes identified under any given fold change threshold at various FDR thresholds. Full DE results are available in Figure 1—figure supplement 1—source data 1.
Figure 1—figure supplement 1—source data 1. Full DE results.

Figure 1—figure supplement 2. Evaluation of the contribution of sample size and read depth to differential expression analysis between chimpanzee and human.

(A) The number of DE genes identified at varying thresholds after randomly subsampling (with replacement) the number of individuals in each species at the indicated sample size per species (hereafter referred to as a single bootstrap replicate). Dashed lines indicate the number of DE genes identified from the full dataset of 39 human and 39 chimpanzee individuals. Box-whisker plots depict quantiles among 100 bootstrap replicates. (B) An empirical estimate of FDR was obtained by calculating the fraction of DE genes in each subsample which were not identified at FDR < 0.01 in the full dataset. (C) Receiver operating characteristic (ROC) curves indicate the sensitivity and specificity of DE gene classification at varying significance thresholds. For each sample size, the solid line represents the median ROC sensitivity among 100 bootstrap replicates, while the dashed lines represent the 0.05 and 0.95 quantiles. (D) The ability to significantly detect small effect size DE genes increases as sample size increases. At each indicated subsample size, box-whisker plot indicates effect size of true significant DE genes (the effect size and true classifications of significant DE genes are defined as those at FDR < 0.01 with the full dataset). (E) Winner’s curse effects decrease with increasing sample size. The distribution of median differences between the estimated effect size of DE genes and the true effect sizes estimated from the full dataset among 100 bootstrap replicates. (F–J) Same as (A–E) but each sample was subsampled at the level of RNA-seq reads to a depth of 25 million mapped reads. (K–O) Same as (A–E) but each sample was subsampled to 10 million mapped reads. All box-whisker plots depict 0.05, 0.25, 0.5, 0.75, and 0.95 quantiles among 100 bootstrap replicates.

Figure 1—figure supplement 3. Contribution of genetic relatedness to differential expression analysis between chimpanzee and human.

(A) A mean-centered genetic relatedness matrix of the chimpanzee individuals used in DE analysis. Samples are hierarchically clustered and further grouped into k = 7 kinship clusters (row colors). The colors used to represent each cluster are consistent throughout the figure. These clusters were further manually filtered into three clusters of size n = 4 individuals with relatively high inter-relatedness and one cluster of size n = 13 individuals with relatively low inter-relatedness (column colors). (B) The distribution of pairwise intra-cluster relatedness coefficients. The contribution of these cluster annotations as factors which explain gene expression and DE power is explored in (C–F). (C) A clustered gene expression (log(RPKM)) Pearson correlation matrix. Row colors represent RNA extraction batch for RNA-seq. Column colors represent kinship cluster. (D) Variance partitioning analysis using a linear mixed model with each term as a mixed effect was used to quantify the contribution of various explanatory factors to expression of each gene. Boxplots indicate the fraction of variance explained by each factor for 0.05, 0.25, 0.5, 0.75, and 0.95 quantiles across all expressed genes. As there is only one replicate per individual, a model term for individual was not appropriate, and individual effects are captured in the residual. (E) DE analysis was performed using n = 4 individuals each of chimpanzee and human. The effect of inter-related chimpanzee samples was assessed by drawing all four chimpanzee individuals from one of the three highly inter-related clusters, or by drawing a combination of four individuals (without replacement) from the lowly inter-related cluster. The distribution of the number of DE genes (E) across the resampled DE analyses is shown, as well as an empirical estimate of FDR based on the full n = 39 dataset (F). DE analyses containing outlier samples Little_R or 537, originating from the same technical batch, have a much larger effect on results than drawing from inter-related samples. Kinship matrix plotted in (A) is available as Figure 1—figure supplement 3—source data 1.
Figure 1—figure supplement 3—source data 1. Kinship matrix of chimpanzees in this study.

To analyze the RNA-seq data, we employed a uniform processing pipeline that only considers reads mapping to human-chimpanzee orthologous exonic regions (Pavlovic et al., 2018). Using these mapped reads, we estimated gene expression levels for each individual in each species. We excluded lowly expressed genes (see Materials and methods), and retained data from 13,432 expressed genes for further analysis. Principal component analysis (Figure 1C) and unsupervised clustering (Figure 1B) of the data show that, as expected, samples primarily separate by species, and not by source or batch, although as we pointed out, some technical batches are partly confounded with species in this study.

Using a linear model framework (see Materials and methods), we identified 8,880 differentially expressed (DE) genes between species (FDR < 0.05), including 6,409 DE genes with effect sizes smaller than a twofold change (Figure 1—figure supplement 1B). As our comparative sample size is unusually large, we have an opportunity to comment on the robustness of observations that are made using study designs with smaller sample sizes, which are more typical for comparative studies in primates (Khan et al., 2013; Pavlovic et al., 2018; Perry et al., 2012). To do so, we subsampled our data and repeated the DE analysis for various sample sizes and read depths. To benchmark the results, we used the effect size estimates and DE gene classifications of the full dataset as an ad hoc gold standard reference.

As expected, we found that the number of individuals, not sequencing depth (within the range of 10M to >25M reads per sample), is the primary driver of power to detect inter-species differences in gene expression (Figure 1—figure supplement 2F–O). Our results indicate that while comparative studies with small sample sizes are underpowered, their reports of differentially expressed genes are generally robust. For example, when we used a sample size of only four chimpanzee individuals and four human individuals, we identified a median of 1,373 (869–2,131 interquartile range across 100 resamples) DE genes (FDR < 0.05), or just 15% (10–24% IQR) of the genes identified as DE in the analysis of the full set of samples (Figure 1—figure supplement 2A). Furthermore, a study with only four individuals from each species is particularly underpowered to identify subtle inter-species differences (Figure 1—figure supplement 2D), only capturing a median of 5% (2–12% IQR) of the DE genes with effect sizes smaller than a twofold change. In addition, of the DE genes identified at this sample size, the estimated effect size magnitudes are typically upwardly biased due to winner’s curse effects (Göring et al., 2001; Ioannidis, 2008), which particularly affect underpowered study designs (Figure 1—figure supplement 2E). However, even at this small sample size, the false positive rate associated with the classification of DE genes is well calibrated; when we used an FDR of 5% to classify genes as DE between the species, we empirically estimated a median of 2% (1–5% IQR) false discoveries based on the gold standard reference. Similar analyses for study designs with different sample sizes are available in Figure 1—figure supplement 2.
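
The benchmarking described above reduces to set comparisons between the DE genes called in a subsample and those called in the full dataset. The following is a minimal Python sketch of that comparison, not the code used in the study; the function name `benchmark_de_calls` and the toy gene labels are hypothetical and serve only as illustration.

```python
def benchmark_de_calls(subsample_de, reference_de):
    """Compare DE calls from a subsample against the full-dataset reference.

    Returns (empirical_fdr, fraction_recovered):
      empirical_fdr      = share of subsample DE genes absent from the reference
      fraction_recovered = share of reference DE genes found in the subsample
    """
    subsample_de = set(subsample_de)
    reference_de = set(reference_de)
    if not reference_de:
        raise ValueError("reference set must not be empty")
    true_hits = subsample_de & reference_de
    false_hits = subsample_de - reference_de
    # With no calls at all there are no false discoveries to count.
    empirical_fdr = len(false_hits) / len(subsample_de) if subsample_de else 0.0
    fraction_recovered = len(true_hits) / len(reference_de)
    return empirical_fdr, fraction_recovered


# Toy example with made-up gene labels (hypothetical, not real results).
reference = {"g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"}
subsample = {"g1", "g2", "g3", "g9"}
fdr, recovered = benchmark_de_calls(subsample, reference)
print(f"empirical FDR = {fdr:.3f}, fraction recovered = {recovered:.3f}")
```

In a resampling design such as the one described here, these two quantities would be computed once per resample and then summarized by their median and interquartile range.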

Considering that some of the chimpanzees we collected data from are first-degree relatives, we wondered if the presence of highly related chimpanzee samples plays a meaningful role in the interspecies DE comparisons. To examine that, we identified subsample sets of chimpanzees with varying degrees of inter-relatedness (Figure 1—figure supplement 3A–B). We then estimated the proportion of variance in gene expression that may be explained by relatedness. We found that technical factors, such as RNA extraction batch, play a larger role in explaining gene expression variance (Figure 1—figure supplement 3C–D) than the inter-relatedness. Furthermore, we repeated the DE analysis procedures with subsamples of only the inter-related chimpanzee individuals, and did not find any meaningful difference in number of DE genes or estimated false discovery rate compared to subsamples of less related individuals (Figure 1—figure supplement 3E–F). We conclude that the presence of inter-related samples plays a minimal role in altering interspecies DE analyses. It is not clear how far these analyses can be extrapolated to guide interspecies study designs that differ in other factors, such as tissue type and species. Yet, we reason that tissue types that are more homogeneous, with less variability within species, will naturally require fewer samples to detect similar effect sizes. Conversely, comparisons between more highly diverged species will likely have larger true effect size differences, and may require fewer samples to acquire meaningful results.

Characterizing variability in gene expression

The main focus of our study was to assess and compare the inter-individual variability in gene expression between the two species. To do so, we estimated overdispersion for each gene in chimpanzees and humans separately (see Materials and methods). Briefly, we assumed that the measurement process is captured by Poisson sampling, and that true gene expression follows a Gamma distribution. These assumptions are supported by theory (Pachter, 2011) and empirical data (Marioni et al., 2008). Accordingly, we used a negative binomial regression model to fit the RNA-seq data (22,23). Fitting the model to the data from each gene yields estimates of the mean and biological variance of gene expression, which result in the overdispersion of the observed read counts relative to a Poisson distribution.
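
Under the Poisson and Gamma assumptions described above, read counts follow a negative binomial distribution with variance μ + ϕμ², where μ is the mean and ϕ is the overdispersion (ϕ = 0 recovers the Poisson case). The sketch below illustrates that relationship with a simple method-of-moments estimate. This is a swapped-in simplification for illustration only: the study fits a negative binomial regression model, and this sketch ignores differences in sequencing depth between samples. The function name `moment_overdispersion` and the toy counts are hypothetical.

```python
from statistics import mean, variance


def moment_overdispersion(counts):
    """Method-of-moments estimate of phi in Var = mu + phi * mu**2.

    Assumes all samples have comparable sequencing depth (an assumption of
    this sketch; a regression model would normally account for library size).
    """
    mu = mean(counts)
    if mu <= 0:
        raise ValueError("mean count must be positive")
    var = variance(counts)  # unbiased sample variance (n - 1 denominator)
    # Variance at or below the Poisson expectation implies no overdispersion.
    return max((var - mu) / mu ** 2, 0.0)


# Toy counts for two hypothetical genes with the same mean of 10.
stable_gene = [8, 10, 12]    # variance 4, below the Poisson variance of 10
variable_gene = [0, 20]      # variance 200, far above the Poisson variance
print(moment_overdispersion(stable_gene))
print(moment_overdispersion(variable_gene))
```

The two toy genes share the same mean but differ in overdispersion, which is the kind of contrast the variability analysis is designed to capture.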

Consistent with previous findings (Ecker et al., 2017; Eling et al., 2018; Robinson et al., 2010), we observed that overdispersion is strongly correlated with mean expression, due to reasons which may be technical or biological in nature. To understand the properties of gene expression variability that are independent of mean gene expression level, we regressed out the mean from the overdispersion estimates, to obtain a gene-wise mean-corrected summary of variability, hereafter referred to as ‘dispersion’ (Figure 2A–B). A dispersion greater than 0 indicates more variability than expected given the gene’s expression level. Conceptually, our approach is similar to a method recently devised to identify differentially variable genes using single cell gene expression data (Eling et al., 2018). Using this approach, one can identify genes whose inter-individual variance in expression is different, irrespective of their mean or median expression levels. For example, while the genes TERF2 and SNORD14E have similar median expression levels, TERF2 has a lower dispersion than SNORD14E, in both species (Figure 2B). We used a bootstrap test (Materials and methods) to assess the stability of dispersion estimates and identified genes whose estimated dispersion is significantly different between species. For example, ZNF514 has higher dispersion in humans than chimpanzees, despite being similarly expressed in both species (Figure 2B). In total, we identified 2658 inter-species differentially dispersed genes (FDR < 0.1). Inter-species differences in mean expression levels cannot explain this finding, as differentially dispersed genes are generally not differentially expressed (Figure 2—figure supplement 1).
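
The mean correction described above can be sketched as a residual from a smooth trend fitted to overdispersion as a function of mean expression. The Python sketch below uses a simplified LOESS-style smoother (tricube-weighted local linear regression, without the robustness iterations of full LOESS) rather than any particular library implementation. The function names, the `frac` parameter, and the use of log-scaled inputs are assumptions made for illustration.

```python
import math


def local_linear_fit(x, y, frac=0.5):
    """LOESS-style smoother: tricube-weighted local linear fit at each x[i]."""
    n = len(x)
    k = min(n, max(3, math.ceil(frac * n)))  # neighbours used per local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to k-th nearest point
        w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def mean_corrected_dispersion(log_mean, log_overdispersion, frac=0.5):
    """Residual of overdispersion after removing its trend with mean expression."""
    trend = local_linear_fit(log_mean, log_overdispersion, frac)
    return [y - t for y, t in zip(log_overdispersion, trend)]


# Toy data: overdispersion falls linearly with mean, except gene 5,
# which is more variable than similarly expressed genes.
log_mean = [float(i) for i in range(10)]
log_phi = [2.0 - 0.5 * m for m in log_mean]
log_phi[5] += 1.0
residuals = mean_corrected_dispersion(log_mean, log_phi)
print([round(r, 3) for r in residuals])
```

A positive residual marks a gene that is more variable than expected for its expression level, matching the interpretation of a dispersion greater than 0 given above.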

Figure 2. Gene variance independent of expression is correlated across species.

(A) To estimate dispersion, RNA-seq counts for 39 human heart tissue samples were used to estimate gene-wise mean (μ) and overdispersion (ϕ) parameters. Across all genes, overdispersion is correlated with mean expression (left) in the hexbin scatter plot. We regressed out this correlation, using the residual of a LOESS fitted line (blue) as a metric (dispersion) of the variability of a gene’s expression across a population relative to similarly expressed genes (a.u., arbitrary units). (B) Dispersion estimates and the underlying expression in each sample for three similarly expressed genes in human and chimpanzee. Error bars represent bootstrapped standard error. Q-value for ZNF514 represents an estimate of FDR after genome-wide multiple hypothesis testing correction. (C) Dispersion estimates across all genes are correlated across human and chimpanzee, despite identification of thousands of differentially dispersed genes in red. R and p-value correspond to Pearson’s correlation. Full dispersion estimates and differential dispersion results available as Figure 2—source data 1.

Figure 2—source data 1. Gene-wise dispersion estimates and differential testing.

Figure 2—figure supplement 1. Interspecies dispersion estimates are largely independent from interspecies mean expression changes.

Across all genes, the interspecies difference in dispersion, colored by significance, is plotted versus difference in mean expression.

Despite the identification of thousands of differentially dispersed genes between the species, gene-wise dispersion estimates are well correlated in humans and chimpanzees overall (R = 0.60, Figure 2C), suggesting similar determinants of variability in the two species. We asked what other gene properties may be associated with the degree of gene expression variability. Specifically, we hypothesized that highly conserved essential genes, whose coding regions evolve under strong negative selection, would also be less variable in their expression across individuals. Indeed, when we ranked and grouped genes by degree of protein coding conservation (assessed by percent amino acid identity between human and chimpanzee; Figure 3A), or by the ratio of nonsynonymous to synonymous codon changes (dN/dS) across mammals (Figure 3B), we found that lower dispersion in expression levels is associated with higher protein coding conservation. This trend is most significant for genes in functional categories (Gene Ontology; GO categories) related to immune function (Figure 3—figure supplement 1), consistent with immune-related genes being targets of rapid evolution by diversifying selection pressures, which may increase dispersion. To further establish the relationship between evolutionary coding conservation and dispersion, we utilized the prevalence of genetic diversity data from humans to categorize genes as loss-of-function-tolerant (ExAC pLI score <0.1) or loss-of-function-intolerant (ExAC pLI score >0.9) (Lek et al., 2016). We found that the expression of loss-of-function-intolerant genes is associated with lower dispersion (Figure 3C). Unsurprisingly, these features of low dispersion genes similarly hold in chimpanzee (Figure 3—figure supplement 2) as in human (Figure 3).
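
The association between coding conservation and dispersion is assessed with rank-based statistics (Spearman correlation, as noted in the Figure 3 legend). The sketch below shows how such a rank correlation can be computed, with average ranks for ties. The function names and the toy values are hypothetical and do not come from the study's data.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("rank variance is zero; correlation is undefined")
    return cov / (sx * sy)


# Toy values for four hypothetical genes: dispersion rises with dN/dS.
dn_ds = [0.05, 0.10, 0.30, 0.80]
dispersion = [-0.9, -0.2, 0.1, 1.4]
print(spearman_rho(dn_ds, dispersion))
```

Because only ranks enter the calculation, the statistic is insensitive to the scale on which either quantity is measured.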

Figure 3. Gene features correlated with expression variability.

(A) Protein coding genes with high coding divergence (defined by amino acid identity between chimpanzee and human) generally have higher variability than genes with low coding divergence. The distribution of dispersion estimates is plotted as the empirical cumulative distribution function (ECDF) for the top and bottom decile genes by percent identity. (B) Same as (A) but defining coding divergence based on ratio of non-synonymous to synonymous substitution rates (dN/dS) across mammals. (C) Loss-of-function tolerant (LoF tolerant) genes, defined by pLI score (Lek et al., 2016), generally have higher variability than loss-of-function intolerant (LoF intolerant) genes. (D) TATA box genes generally show higher variability. P-values and ρ correlation coefficient provided for (A) and (B) represent Spearman correlation across all quantiles, rather than just the upper and lower decile, which are plotted for similar visual interpretation as (C) and (D), where the P-values provided represent a two-sided Mann-Whitney U-test. (E) Gene set enrichment analysis (GSEA) of genes ordered by human dispersion estimates. Only the top and bottom three most enriched significant categories (Adjusted p-value<0.05) are shown for each ontology set for space. Full GSEA results available as Figure 3—source data 1.

Figure 3—source data 1. Full GSEA results based on human dispersion levels.

Figure 3—figure supplement 1. Correlation of coding conservation and dispersion across gene categories.

(A) The correlation between dN/dS across mammals and dispersion (human-chimpanzee dispersion estimate mean) when only considering genes in three example GO categories. (B) Histogram of the distribution of Spearman correlation test p-values across GO categories. (C) The three most significant (Adjusted p-value<0.05) gene categories for each effect direction for each ontology group. Spearman test results for all GO categories are available as Figure 3—figure supplement 1—source data 1.
Figure 3—figure supplement 1—source data 1. dN/dS correlation with dispersion by GO category.
Figure 3—figure supplement 2. Gene features correlated with expression variability (chimpanzee).

(A) Protein coding genes with high coding divergence (defined by amino acid identity between chimpanzee and human) generally have higher variability than genes with low coding divergence. The distribution of chimpanzee dispersion estimates is plotted as the empirical cumulative distribution function (ECDF) for the top and bottom decile genes by percent identity. (B) Same as (A) but defining coding divergence based on ratio of non-synonymous to synonymous substitution rates (dN/dS) across mammals. (C) Loss-of-function tolerant (LoF tolerant) genes, defined by pLI score (Lek et al., 2016), generally have higher variability than loss-of-function intolerant (LoF intolerant) genes. (D) TATA box genes generally show higher variability. p-Values and ρ correlation coefficient provided for (A) and (B) represent Spearman correlation across all quantiles, rather than just the upper and lower decile, which are plotted for similar visual interpretation as (C) and (D), where the P-values provided represent a two-sided Mann-Whitney U-test. (E) Gene set enrichment analysis of genes ordered by chimpanzee dispersion estimates. Only the top and bottom three most enriched significant categories (Adjusted p-value<0.05) are shown for each ontology set for space.

Next, we asked what functional GO categories are enriched among the most or least dispersed genes. To do this, we ordered genes by their dispersion estimate in humans (Figure 3E) or chimpanzees (Figure 3—figure supplement 2E) and used gene set enrichment analysis (GSEA) (Subramanian et al., 2005) to identify GO categories enriched at the top or bottom of the list. We found that genes with housekeeping functions universal to all cell-types, such as genes related to transcription initiation and tRNA modification, are among the most enriched gene categories associated with low dispersion (Figure 3E, Figure 3—figure supplement 2E). Conversely, genes with high dispersion are enriched for categories like fibrinolysis (the breakdown of blood clots), oxygen sensing, and other cardiovascular related functions (Figure 3E, Figure 3—figure supplement 2E). Finally, consistent with previous reports (de Jong et al., 2019; Hagai et al., 2018; Ravarani et al., 2016), we found that genes with TATA boxes are associated with higher dispersion (Figure 3D), potentially because TATA boxes are associated with higher transcriptional noise at the molecular level (Blake et al., 2006; Raser and O'Shea, 2004; Ravarani et al., 2016). However, a possible technical explanation for this observation is that TATA box genes are enriched among cell-type-specific genes (Schug et al., 2005) and cell-type heterogeneity between individuals or samples could contribute to the observed dispersion.

To investigate the possibility that differences in cell composition between samples contribute to our observations of gene expression dispersion, we first asked whether dispersion is associated with the degree of cell-type-specificity of expression. We hypothesized that genes with high inter-individual dispersion are more likely to have cell-type-specific gene expression signatures in single-cell RNA-seq datasets of heart tissue. To examine this, we turned to the Tabula Muris dataset, a comprehensive single-cell transcriptomics dataset which includes single-cell RNA-seq data from adult mouse heart tissue (Tabula Muris Consortium et al., 2018). Qualitatively, we observed that the most highly dispersed genes in our bulk chimpanzee and human samples have more cell-type-specific expression among the nine heart cell types identified in Tabula Muris (Figure 4A). Conversely, the least dispersed genes are more evenly expressed across cell types (Figure 4B). More generally, when we summarized the level of cell-type-specific expression for each gene as a single summary statistic, τ (Kryuchkova-Mostacci and Robinson-Rechavi, 2016; Yanai et al., 2005), we found that dispersion is significantly correlated with cell-type-specificity (R = 0.32, p < 2 × 10⁻¹⁶; Figure 4C). Notably, the degree of cell-type-specificity is derived from data from mouse hearts, which may have diverged cell type expression profiles compared with primates. The correlation between cell-type-specificity and dispersion may therefore be downwardly biased as a result of error in estimating the true degree of cell-type-specificity of genes expressed in primate hearts.
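
The τ statistic summarizes how unevenly a gene is expressed across cell types. In its commonly stated form (Yanai et al., 2005), τ = Σᵢ (1 − xᵢ / max(x)) / (n − 1), where xᵢ is the expression in cell type i and n is the number of cell types. The sketch below is a minimal Python illustration of that formula. It assumes the input is one non-negative summary expression value per cell type; how single cells are aggregated per cell type is not specified here, and the function name `tau_specificity` is hypothetical.

```python
def tau_specificity(expression):
    """Cell-type-specificity index tau for one gene.

    expression: one non-negative expression value per cell type.
    Returns 0 when expression is equal in all cell types and 1 when the
    gene is expressed in exactly one cell type.
    """
    n = len(expression)
    if n < 2:
        raise ValueError("need at least two cell types")
    if min(expression) < 0:
        raise ValueError("expression values must be non-negative")
    x_max = max(expression)
    if x_max == 0:
        raise ValueError("gene is not expressed in any cell type")
    return sum(1 - x / x_max for x in expression) / (n - 1)


# Toy profiles across four hypothetical cell types.
print(tau_specificity([3, 3, 3, 3]))   # evenly expressed gene
print(tau_specificity([0, 0, 7, 0]))   # gene restricted to one cell type
print(tau_specificity([10, 5, 0]))     # intermediate case
```

Genes with τ near 1 are the ones whose bulk expression would be most sensitive to differences in cell composition between samples.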

Figure 4. High dispersion genes are expressed in a cell-type-specific manner.

(A) The 30 most dispersed genes (ranked by the mean of the human and chimpanzee dispersion estimates) are shown in a mouse heart single cell RNA-seq dataset (Tabula Muris Consortium et al., 2018). Scatterplot inset on the left shows the dispersion estimate of the 30 genes in chimpanzee and human (red points) compared to all other genes (gray). Each row in the heatmap is a gene. Each column is a single cell, grouped by cell type (colors at top of columns). Normalized expression is colored in the body of the heatmap. The τ statistic, colored on the right of the heatmap, summarizes for each gene how cell-type-specific the gene is, ranging from 0 (equally expressed among all cell types) to 1 (expressed exclusively in a single cell type). (B) The same as (A) but for the 30 least dispersed genes. (C) Across all genes, a hexbin scatterplot shows the correlation between cell-type-specificity (τ) estimated from mouse single-cell RNA-seq data, and dispersion (mean of human and chimpanzee dispersion for each gene) estimated from the bulk RNA-seq data. (D) The difference in dispersion between chimpanzee and human is only weakly correlated with cell-type-specificity. R and p-value for (C) and (D) represent Pearson’s correlation.

Figure 4—figure supplement 1. Expression and dispersion estimates that correct for cell type composition.

(A) The cell type composition estimates of chimpanzee and human bulk RNA-seq samples used in dispersion estimation. Height of colored bars represents estimates of the proportion of each heart cell type. (B) A heatmap of the Pearson correlation between the principal components of the chimpanzee samples used in eQTL mapping and their cell composition estimates. Benjamini-Hochberg corrected P-values for the correlation are shown as text in each cell. (C) Correlation matrix of cell-type-specific population mean expression estimates derived from deconvoluted bulk data. Hierarchical clustering reveals a clear clustering of cell types (row colors, same color key as in (A), with bulk expression profiles colored as black) over species (column colors). (D) Individual cell-type expression estimates were used to estimate dispersion with the same approach as used in bulk data: Cell-type-specific dispersion is defined as the residual of the depicted LOESS fit over the mean-overdispersion trend in each cell type in each species. (E) Correlation matrix of cell-type-specific dispersion estimates reveals a mixed hierarchical clustering pattern with respect to cell-type and species (rows and columns colored same as C). Full cell-type-specific expression and dispersion estimates used to create Pearson correlation matrices in (C) and (E) available as Figure 4—figure supplement 1—source data 1.
Figure 4—figure supplement 1—source data 1. Cell-type-specific expression and dispersion estimates.
Figure 4—figure supplement 2. Cell type heterogeneity has a strong non-technical, individual-specific component.

(A) Cell type proportion deconvolution (Donovan et al., 2020) for left ventricle and atrial appendage tissue sections from matched individuals. Height of colored bars represents estimates of the proportion of each heart cell type. Individuals are ordered identically in the left ventricle and atrial appendage sub-plots as indicated by the x-axis colorbar. (B) Variance partitioning analysis using a linear mixed model with each term as a mixed effect was used to quantify the contribution of anatomic tissue section and individual to proportion of each cell type.
Figure 4—figure supplement 3. GSEA of genes with different levels of dispersion between species.

(A) GSEA of genes ordered by difference in dispersion estimates. Genes with more variable expression in chimpanzee often relate to immune response ontology terms. Genes with more variable expression in human often relate to heart function related ontology terms. Only the top three and bottom three most enriched significant terms for each sub-ontology category are shown for space. Full GSEA results available as Figure 4—figure supplement 3—source data 1.
Figure 4—figure supplement 3—source data 1. Full GSEA results based on interspecies dispersion differences.

Given the correlation between dispersion and the extent of cell-type-specific regulation, we sought to estimate the proportions of different cell types amongst our bulk RNA-seq samples for both chimpanzee and human. We applied BayesPrism cell type deconvolution and expression estimation (Chu and Danko, 2020) to the bulk RNA-seq profiles using reference cell type profiles derived from Tabula Muris (Materials and methods). As expected, cell type proportions between chimpanzee and human hearts are qualitatively similar, although much inter-individual variation exists in both species for particular cell types, such as cardiac muscle cells and myofibroblasts (Figure 4—figure supplement 1A). Furthermore, using estimates of expression profiles within each cell type (Figure 4—figure supplement 1C), we calculated dispersion for human and chimpanzee within each cell type (Figure 4—figure supplement 1D–E). We found that the deconvoluted expression profiles for different cell types cluster tightly by cell type rather than species, as expected. Interestingly, the dispersion estimates cluster in a complex pattern that is more strongly influenced by species. This is consistent with the idea that genetic variation, which nearly completely segregates by species, plays a meaningful role in explaining dispersion when cell composition variation is corrected for. Put together, these analyses indicate that the levels of inter-individual dispersion we observed in humans and chimpanzees, and the high similarity in dispersion observed across species, are driven by genetic differences as well as cell type heterogeneity between samples. The cell type heterogeneity across samples may be due partly to technical differences between sample preparations and partly to biological differences between individuals.

To investigate the extent to which the cell composition differences that drive dispersion are biologically relevant, as opposed to technical differences in tissue dissection and sample preparation, we turned to cell type deconvolution profiles of anatomically different heart samples from matched GTEx individuals (Donovan et al., 2020). We reasoned that if intentionally anatomically different heart sections (left ventricle versus atrial appendage) from the same individual correlate better than matched tissue samples from different individuals, then the cell type composition differences across our chimpanzee samples are also likely driven by individual level differences, rather than technical differences in sample acquisition.

We found that the estimated fraction of cardiac muscle cells, the most common cell type in both tissue sections, is highly correlated between atrial appendage and left ventricle samples from matched individuals (Figure 4—figure supplement 2A). We used a linear mixed model (Hoffman and Schadt, 2016) to quantify the contribution of individual level versus tissue level factors to explain cell type composition estimates, and found that the individual level factor generally explains more variance (Figure 4—figure supplement 2B). We interpret this as strong evidence that the sample-to-sample differences captured by our dispersion estimates are driven largely by true differences between individuals, rather than random technical differences in sample acquisition and dissection.

Having more confidence that the dispersion estimates reflect true biological variability, we sought to characterize the differences in dispersion between chimpanzee and human. Importantly, we find that the interspecies difference in dispersion is much less likely to be driven by cell type heterogeneity, although there is a relatively small but significant correlation with τ (R = 0.07, p=2.2×10−12; Figure 4D). We next asked what gene categories are enriched among differentially dispersed genes.

We performed GSEA (Subramanian et al., 2005), ranking genes by the polarized significance level of the chimpanzee-human difference in dispersion estimates. We found that genes more variable in human are enriched for mitotic regulators (Figure 4—figure supplement 3). Given evidence that ischemia may induce mitosis in adult mammalian cardiac cells (Kajstura et al., 1998; Kimura et al., 2017; Nakada et al., 2017), the enrichment for mitotic regulators may reflect the highly variable life histories in GTEx samples, a large fraction of which are sourced from organ donors with ischemic cardiovascular disease. Conversely, we found that genes related to immune function are more variable in chimpanzee. We note that some of our chimpanzee individuals were sourced from laboratory settings and have been challenged with HBV or HCV viral infections. However, our GSEA results are robust to the exclusion of these samples (Figure 4—figure supplement 3—source data 1, Supplementary file 1).

Within-species genetic variation contributes to inter-species differences in variability

Although potentially technical in nature, inter-individual differences in cellular composition provide a partial explanation for our observation of similar dispersion estimates across species. We looked for evidence that genetic diversity also drives dispersion. We asked whether inter-species differences in dispersion, which are much less likely to be explained by cellular composition, are associated with corresponding differences in selection pressures between the species. We reasoned that if inter-species differences in dispersion are partially driven by inter-species differences in selection pressures, we may see differences between the species in genetic signatures that are consistent with natural selection near the differentially dispersed genes. More specifically, we expect that genes with particularly low dispersion in human compared to chimpanzee will also display more constraint at the coding level in human than in chimpanzee. To assess this, we analyzed genotype data from our chimpanzee and human cohorts. We obtained human genotype data from the GTEx consortium, and chimpanzee data by performing high-coverage whole genome sequencing on the 39 chimpanzee samples used in this study (>30X genome coverage obtained in all samples, Supplementary file 2); these data represent roughly a 50% increase in the number of high-coverage Pan troglodytes genomes currently available (de Manuel et al., 2016). We used the sequencing data to identify nearly 2.9 million novel chimpanzee SNPs with a minor allele frequency (MAF) greater than 10%.

Using the genotype data, we asked if differences in expression dispersion are associated with differences in evolutionary constraint on protein coding regions in the human and chimpanzee lineages. For each gene, we calculated the ratio of non-synonymous polymorphisms (Pn) scaled to synonymous polymorphisms (Ps) within each species. This Pn/Ps metric may be used to assess purifying and diversifying selection pressures acting on coding sequences within species (Fuller et al., 2015; Huguet et al., 2014; Tanaka and Nei, 1989), which thus may correspond to our measurements of within-species expression dispersion. In contrast, the more often-utilized dN/dS metric is based on fixed coding differences between species and therefore not well suited to identify the signatures of selection that uniquely confine variability within a species. As expected, we found that loss-of-function tolerant genes have higher Pn/Ps than loss-of-function intolerant genes (Figure 5A). When we compared Pn/Ps between humans and chimpanzees, we found that the inter-species ratio of Pn/Ps is positively correlated with the difference in gene expression dispersion between species (Figure 5B). That is, on average, genes with higher dispersion in chimpanzees than in humans have a higher abundance of non-synonymous polymorphisms (Pn/Ps) in chimpanzees compared to the orthologous genes in humans. This suggests that inter-species differences in selection pressures at polymorphic loci play a role in the observed inter-species differences in expression dispersion.
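As a worked illustration of the comparison described above, the sketch below computes a per-gene Pn/Ps and the log2 chimpanzee-to-human ratio from polymorphism counts. The pseudocount is an assumption of this sketch, added only to keep the ratio defined for genes with no synonymous polymorphisms; it is not taken from the study, and any additional scaling the study applied is not reproduced. The function names and counts are hypothetical.

```python
import math


def pn_ps(n_nonsyn, n_syn, pseudocount=0.5):
    """Ratio of nonsynonymous to synonymous polymorphism counts for one gene.

    The pseudocount is an assumption of this sketch; it avoids division by
    zero when a gene has no synonymous polymorphisms.
    """
    return (n_nonsyn + pseudocount) / (n_syn + pseudocount)


def log2_interspecies_ratio(chimp_counts, human_counts, pseudocount=0.5):
    """log2 of chimpanzee Pn/Ps over human Pn/Ps for one orthologous gene.

    Each argument is a (Pn, Ps) tuple of polymorphism counts.
    Positive values indicate relatively more nonsynonymous polymorphism
    in chimpanzee than in human.
    """
    chimp = pn_ps(*chimp_counts, pseudocount=pseudocount)
    human = pn_ps(*human_counts, pseudocount=pseudocount)
    return math.log2(chimp / human)


# Made-up counts for one gene: (nonsynonymous, synonymous) per species.
print(log2_interspecies_ratio(chimp_counts=(6, 4), human_counts=(2, 4)) > 0)  # True
```

Under the reported correlation, genes with a positive log2 ratio would tend to be those with higher dispersion in chimpanzee than in human.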

Figure 5. Interspecies differences in dispersion correlate to interspecies differences in coding constraint.

(A) The number of nonsynonymous polymorphisms scaled to synonymous polymorphisms (Pn/Ps) for each gene was calculated in the GTEx human population. Loss-of-function tolerant (LoF tolerant) genes, defined by pLI score (Lek et al., 2016), generally have higher Pn/Ps than loss-of-function intolerant (LoF intolerant) genes, as shown in the ECDF plot. (B) Pn/Ps was calculated for both human and chimpanzee. The distribution of the chimpanzee Pn/Ps to human Pn/Ps ratio is plotted as an ECDF, grouped by whether the gene has a higher dispersion estimate in chimpanzee than in human. p-Value indicates a Mann-Whitney U-test. Gene-wise Pn/Ps statistics available in Figure 5—source data 1.

Figure 5—source data 1. Gene-wise Pn/Ps statistics for chimpanzee and human.

Genes with eQTLs are shared across species more often than expected by chance

The correlation between the degree of evolutionary constraint on coding sequences and dispersion of expression at the gene level suggests that differences in cellular composition are not the only explanation for differences in dispersion. Motivated by this notion, we searched for further evidence for genetic regulation of dispersion by identifying genes associated with eQTLs (eGenes) in both humans and chimpanzees. In humans, we obtained a list of 11,682 heart left ventricle eGenes identified by the GTEx consortium (with a sample size of 386 individuals). In chimpanzee, we used the genotype and expression data we collected to map cis eQTL SNPs within 250 kb of each of the 13,545 expressed genes. We included 10 principal components as covariates (Figure 6—figure supplement 1) and accounted for genetic relatedness between individuals and population structure, which we inferred from the genotype data (Materials and methods and Figure 6—figure supplement 2). Because gene expression principal components are correlated with cellular heterogeneity (Figure 4—figure supplement 1B), their inclusion as covariates in the eQTL linear modeling helps correct for this heterogeneity and additional unobserved technical sources of variation. Using this approach, we identified 310 eGenes in chimpanzee hearts (FDR < 0.1; Figure 6—figure supplement 1A–B, Supplementary file 3). Consistent with previous eQTL studies in primates (Jasinska et al., 2017; Pickrell et al., 2010; Tung et al., 2015; Veyrieras et al., 2008), we found that the chimpanzee eQTL SNPs are enriched near transcription start sites (Figure 6—figure supplement 1C).

We considered the overlap of eGenes in humans and chimpanzees. When considering only one-to-one orthologs tested for eGenes in both species, there is no significant overlap of eGenes in the two species (Odds Ratio = 1.03, p=0.46, hypergeometric test, Figure 6—figure supplement 3A). However, this comparison is affected by the substantial difference in power to detect eQTLs between the large human sample used in GTEx and the relatively small chimpanzee sample we collected. To address this, we iteratively subsampled the human cohort to sample sizes comparable to our chimpanzee cohort, and re-mapped human eQTLs. When we compared lists of eGenes identified in humans and chimpanzees using similar sample sizes, we found a greater overlap of eGenes in the two species than expected by chance (Figure 6—figure supplement 3B). This observation is robust even if we use the eQTL results of the full GTEx dataset, as long as we perform the comparison by using the largest-effect eQTLs (those that can be identified as significant even in smaller sample sizes). Specifically, of the top 500 significant GTEx eGenes (Figure 6A, Figure 6—figure supplement 3C), 21 are also found to be eGenes in chimpanzee (FDR < 0.1), a significant enrichment (Odds Ratio = 2.05, p=0.003, hypergeometric test). Importantly, we found that species-specific eGenes have higher species-specific dispersion (Figure 6B). That is, eGenes identified in chimpanzees but not humans have higher dispersion in chimpanzees and vice versa (p=3.7×10−11; Figure 6B). Furthermore, within each species, eGenes tend to have higher dispersion than non-eGenes (Figure 6—figure supplement 4). These observations are consistent with a genetic contribution to inter-species differences in gene expression dispersion. Furthermore, the observation that more eGenes are shared among humans and chimpanzees than expected by chance suggests that the regulation of these genes evolves under less evolutionary constraint.
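The overlap tests above can be illustrated with a short sketch of a one-sided hypergeometric test on two gene sets drawn from a common universe of tested orthologs. The counts in the example are invented for illustration and are not the study's values; the odds ratio here is the simple sample odds ratio from the 2×2 table, which may differ from the estimator used in the study. The function name `overlap_enrichment` is hypothetical.

```python
from math import comb


def overlap_enrichment(n_universe, n_set_a, n_set_b, n_shared):
    """Sample odds ratio and one-sided hypergeometric p-value for an overlap.

    n_universe: genes tested in both species (for example, one-to-one orthologs).
    n_set_a, n_set_b: eGene counts in each species within that universe.
    n_shared: genes that are eGenes in both species.
    """
    # P(X >= n_shared) when n_set_b genes are drawn without replacement
    # from a universe that contains n_set_a eGenes of the other species.
    upper = min(n_set_a, n_set_b)
    tail = sum(
        comb(n_set_a, k) * comb(n_universe - n_set_a, n_set_b - k)
        for k in range(n_shared, upper + 1)
    )
    p_value = tail / comb(n_universe, n_set_b)
    # 2x2 table cells: shared, A only, B only, neither.
    a_only = n_set_a - n_shared
    b_only = n_set_b - n_shared
    neither = n_universe - n_set_a - n_set_b + n_shared
    odds_ratio = (n_shared * neither) / (a_only * b_only)
    return odds_ratio, p_value


# Invented counts, chosen only to show the mechanics of the test.
odds_ratio, p_value = overlap_enrichment(
    n_universe=1000, n_set_a=100, n_set_b=50, n_shared=10
)
print(round(odds_ratio, 2), p_value < 0.05)
```

Restricting one list to its most significant entries, as done with the top 500 GTEx eGenes, changes `n_set_a` and `n_shared` but leaves the test itself unchanged.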

Figure 6. Species-sharing and dispersion of eGenes.

(A) eGenes were classified by a 10% FDR threshold in chimpanzee and considering only the top 500 eGenes by FDR in human (GTEx). (B) ECDF of the difference in dispersion of genes between chimpanzee and human. Chimpanzee-specific eGenes are more dispersed in chimpanzee; human-specific eGenes are more dispersed in human. p-Values provided for one-sided Mann-Whitney U-test with the noted alternative hypothesis.

Figure 6—figure supplement 1. eQTL mapping in chimpanzee samples.

(A) The number of significant variant:gene pairs and eGenes is plotted as a function of the number of principal components included as covariates. (B) QQ-plot of p-values for all variant:gene pairs tested shows inflation compared to sample permutation control. (C) For eGenes (FDR < 0.1), the distribution of distances of the top eQTL variant to the transcription start site, compared to the distribution of all test variants in eGenes.
Figure 6—figure supplement 2. Population structure and relatedness of chimpanzee cohort based on whole genome sequencing genotyping.

(A) Scatter plot of the expected and observed genetic relatedness of each pair of chimpanzees for which we have pedigree information. We expected eight first-degree relationships (relatedness = 0.25), five second-degree relationships, and three third-degree relationships. (B) The chimpanzees sequenced in this study are generally Western chimpanzees, although recent admixture with other subspecies is prevalent for many of these captive-born chimpanzees. Principal component analysis of genotypes of newly sequenced and previously sequenced (de Manuel et al., 2016) chimpanzees. Individuals colored by subspecies for previously sequenced chimpanzees (Eastern, Pan troglodytes schweinfurthii; Western, Pan troglodytes verus; Central, Pan troglodytes troglodytes; Nigeria-Cameroon, Pan troglodytes ellioti). Donald is a previously sequenced captive born chimpanzee with known Western-Central admixture (de Manuel et al., 2016). Only newly sequenced chimpanzees with signs of admixture are labelled. (C) Admixture analysis (K = 4) of newly sequenced and previously sequenced chimpanzees. Height of bars indicates group membership. Admixture source data available as Figure 6—figure supplement 2—source data 1.
Figure 6—figure supplement 2—source data 1. Admixture group membership of chimpanzees in this study.
Figure 6—figure supplement 3. Significant overlap of eGenes between chimpanzee and human is observed when studies are similarly powered.

(A) Overlap of eGenes identified in chimpanzee (FDR < 0.1, n = 38) and human (FDR < 0.1, sourced from GTEx; n = 386). Only genes that are one-to-one orthologs and passed expression filters for testing were considered. (B) The overlap of eGenes is more than expected by chance (assessed by odds ratio >1) when the overpowered GTEx dataset is randomly subsampled to sizes more comparable to the chimpanzee dataset and genes are retested for eGene activity in the subsample. (C) Similar to (B): when only the top X genes (ranked by FDR using the full GTEx dataset) are considered human eGenes, the overlap becomes significant as eGene classification becomes more stringent. Shaded region represents 95% confidence interval of odds ratio.
Figure 6—figure supplement 4. eGenes have higher dispersion than non-eGenes.

ECDF of the dispersion estimates of eGenes versus non-eGenes in chimpanzee and human. Genes are classified as eGene or non-eGene separately for chimpanzee and humans. To make cross species comparisons roughly comparable, only the top 500 eGenes by FDR in human (GTEx) are considered eGenes in the human case. p-Values provided for two-sided Mann Whitney U-test.

Shared and species-specific eGenes may evolve under different selection pressures

To further examine whether shared and species-specific eGenes may evolve under different selection pressures than non-eGenes, we examined other indicators of selection for each of these eGene groups. We found that eGenes have higher inter-species differences in expression levels than non-eGenes (Figure 7A), and eGenes identified in both species have even larger differences in mean expression levels between species than eGenes identified only in chimpanzees or humans. These observations are consistent with the notion that the regulation of genes associated with eQTLs tends to evolve under less evolutionary constraint. Furthermore, eGenes tend to have lower levels of coding conservation in both species, as measured by amino acid identity between human and chimpanzee (Figure 7B) or dN/dS across mammals (Figure 7C).

Figure 7. Characteristics of eGenes are consistent with less constraint on eGene expression.

(A) eGenes are more differentially expressed between species than non-eGenes, with eGenes detected in both species being even more differentially expressed. The distribution of the inter-species differential expression effect size is plotted for each eGene group as an ECDF. (B) eGenes are more diverged at amino acid level than non-eGenes. (C) Human-specific eGenes and shared eGenes are more divergent than expected under neutrality. The analogous test for chimpanzee-specific eGenes displayed a shift that was not statistically significant, although our classification of eGenes in chimpanzee may be underpowered. p-Values provided for one-sided Mann-Whitney U-tests with the noted alternative hypothesis.

Figure 7—figure supplement 1. Gene categories of human-chimpanzee shared and chimpanzee-specific eGenes.

(A) Gene ontology categories enriched amongst the 21 eGenes identified in both species (foreground), compared to the 734 eGenes identified in only chimpanzee or only human (universe/background). Among the significant categories (adjusted p-value < 0.05), only the top five most enriched gene categories for each ontology set are shown for space. (B) Gene ontology categories enriched amongst the 148 eGenes identified only in chimpanzee, compared to all 6797 eGenes identified in either species using the full GTEx dataset (universe/background). All significant categories are shown. Full GO enrichment results for (A) available as Figure 7—figure supplement 1—source data 1 and for (B) as Figure 7—figure supplement 1—source data 2.
Figure 7—figure supplement 1—source data 1. Full GO enrichment results of species-shared eGenes.
Figure 7—figure supplement 1—source data 2. Full GO enrichment results of chimpanzee-specific eGenes.

We next performed GO enrichment analysis (hypergeometric test) to ask which functional classes are identified as eGenes in both human and chimpanzee. We reasoned that the 21 genes identified as shared eGenes may have high levels of genetically regulated variability, which may indicate expression evolution at a neutral rate or faster. We found that these genes are strongly enriched for immune response genes, including major histocompatibility complex (MHC) genes (Figure 7—figure supplement 1A). This observation is consistent with previous reports that immune genes evolve under strong directional and balancing selection pressures across vertebrates, in part to respond to ever-evolving pathogen challenges (Ejsmond and Radwan, 2015; Hagai et al., 2018; Lam et al., 2017; Shultz and Sackton, 2019).

Given the evidence that highly dispersed genes and eGenes are associated with relaxed evolutionary constraint, we next asked which gene classes are enriched among chimpanzee-specific eGenes. This set of genes may be subject to stronger stabilizing selection in the human lineage. We approached this question by considering all eGenes discovered in the full GTEx dataset (FDR < 0.1) as human eGenes, as this is the most stringent way to classify eGenes as chimpanzee-specific. We identified 148 chimpanzee-specific eGenes, which we found to be significantly enriched for transcriptional regulation terms (Figure 7—figure supplement 1B).

Effects of trans-species polymorphisms on gene expression

Genetically driven variability in gene expression may also arise due to overdominant or frequency-dependent selection on gene regulation, which maintains polymorphisms over evolutionary time through balancing selection (Croze et al., 2016; Těšický and Vinkler, 2015). MHC and other immune genes are well known targets of these modes of selection, as host immune systems are under constant evolutionary pressure to diversify in response to quickly evolving pathogens. As such, these genes sometimes contain trans-species polymorphisms maintained through evolutionary time by balancing selection (Croze et al., 2016; Těšický and Vinkler, 2015).

A previous study identified 125 trans-species polymorphic haplotypes outside of the MHC region that are shared between chimpanzee and human, all but two of which are in noncoding regions (Leffler et al., 2013). Whether these polymorphisms are maintained by balancing selection because of their potential for regulatory effects on gene expression is not clear. We found that the set of genes nearest to these trans-species SNPs have higher median levels of dispersion than distal genes, though the effect is small and may be due to chance (p=0.053, Figure 8—figure supplement 1). If these trans-species SNPs have conserved regulatory activity, which diversifies the expression levels of nearby genes, we would additionally expect to see similar eQTL effects in both human and chimpanzee. To test this, we remapped eQTLs for these SNPs in both species with a uniform pipeline (see Methods). We detected 37/192 trans-species SNPs with a clear eQTL signal (FDR < 0.1) in the well-powered human dataset, such as rs257899, which associates with SLC27A6 expression. However, we did not identify any significant cis eQTL activity for this SNP in chimpanzee (Figure 8A). More generally, we did not find any inter-species correlation of regulatory effect size estimates among the 12 trans-species haplotypes with an eQTL in human (FDR < 0.1; Figure 8B) that were also tested in chimpanzee (Methods). Considering all trans-species SNPs, we did not find any evidence for their regulatory effects in chimpanzee, compared to a set of control SNPs (Figure 8C). While we found these SNPs to have measurable regulatory activity in the human dataset, it was not significantly different than that of control SNPs (Figure 8D).

Figure 8. Trans-species polymorphisms do not detectably regulate gene expression.

(A) Boxplot of SLC27A6 expression stratified by species and genotype of the trans-species SNP rs257899, the most significant eQTL of the trans-species polymorphisms tested in human heart (left ventricle) in GTEx. eQTL effect size estimates (β) and nominal p-values are provided. (B) There is not a general correlation of gene regulation effects of trans-species SNPs between chimpanzee and human. For each trans-species polymorphic region previously identified (Leffler et al., 2013), the most significant SNP:gene pair in human is shown with the effect size estimate in both human and chimpanzee. Only the 48 regions where the strongest human SNP:gene association was also a one-to-one ortholog and the same SNP:gene pair was also tested in chimpanzee were plotted. Labeled SNP:gene pairs indicate FDR < 0.1 in human. One sided P-value provided for Pearson correlation, under the alternative hypothesis that effect sizes should be positively correlated between species. Only SNP:gene pairs FDR < 0.1 in human were considered for this test. (C) The trans-species polymorphisms do not have detectable cis eQTL activity in chimpanzee. QQ-plot of p-values of cis eQTL activity of the trans-species polymorphisms, compared to a sample permutation control, and to a control set of SNPs. (D) Same as (C) but testing cis eQTL activity in human. p-Values provided for (C) and (D) represent one-sided Mann-Whitney U-test with the alternative hypothesis that trans-species polymorphisms have smaller cis eQTL p-values than the control SNPs.

Figure 8—figure supplement 1. Genes closest to trans-species polymorphisms exhibit similar dispersion.

ECDF of dispersion estimates in chimpanzee and human of these genes (defined as closest protein coding gene to trans-species SNP and <100 kb away). p-Values represent a two-sided Mann-Whitney U-test for each species.
Figure 8—figure supplement 2. Trans-species polymorphisms do not regulate gene expression differently than control SNPs across all GTEx tissues.

(A) The distribution of the smallest eQTL P-value in GTEx tissues for each trans-species polymorphism, compared to the matched control set of SNPs (<100 kb from trans-species polymorphism but unlinked and matched for allele frequency; see Materials and methods). p-Value represents one-sided Mann-Whitney test, with the alternative hypothesis that trans-species polymorphisms have smaller cross-tissue minimum p-values. (B) The distribution of the GTEx tissues in which the smallest eQTL p-value was observed for each trans-species polymorphism, compared to the matched control set of SNPs.

Finally, we asked whether these SNPs may be under selection due to eQTL effects in tissues other than heart. To this end, we identified the most significant eQTL P-value for each of these SNPs across all GTEx tissues and found that these eQTL effects are not statistically different from that of control SNPs (Figure 8—figure supplement 2A). Furthermore, the tissues where the eQTL minimum p-value was identified are similar to those of control SNPs (Figure 8—figure supplement 2B), suggesting these trans-species SNPs in general do not have specific regulatory activity in any particular tissue. In summary, we found no compelling evidence that these trans-species polymorphisms have strong regulatory activity in either species.

Discussion

We set out to understand the properties that are associated with different levels of gene expression variability in human and chimpanzee populations. Because we know that gene expression variation in humans is often associated with genetic variation (in the form of eQTLs), we hypothesized that the degree of population variability in gene expression levels may reflect the evolutionary constraint on gene regulation. We reasoned that a comparison of regulatory variation in humans and chimpanzees, and a comparative eQTL study, may provide evidence to support said hypothesis and further identify inter-species similarities and differences in the selective pressures on gene expression.

We found that inter-individual expression variability is highly correlated in humans and chimpanzees. At first glance, this seems to support the notion that regulatory variation evolves under similar selective pressures in both species. However, we were unable to exclude a technical explanation for this observation. It was difficult to disentangle the genetic and non-genetic contributions of this variability because we used primary tissue samples that include multiple cell types. We found that, across genes, cell type heterogeneity is a major driver of the degree to which gene expression varies in the population. Because orthologous genes in human and chimpanzee are expected to have similar expression patterns, this finding can potentially explain the observation of high correlation in expression dispersion in the two species. Though this technical explanation may be intuitive, the degree of the association between population variability and the cellular specificity of gene expression may have been overlooked without the use of single-cell data.

Cell type heterogeneity has likely affected previous comparative studies of gene regulation that used primary tissue samples, including studies from our own lab. We and others have commented on this property of primary tissue comparisons in the past (Avila Cobos et al., 2018; Blekhman et al., 2008; Newman et al., 2015; Selewa et al., 2020), but without single cell data it was impossible to effectively assess the magnitude of this effect. Our findings further underscore the need for single cell measurements to disentangle sources of variation in bulk RNA-seq data, especially from primary tissues. Future work examining population variability should take this into account, possibly by collecting single cell data, to separate cell-type heterogeneity from heterogeneity within a particular cell type. It is important to account for cellular composition not only in comparative studies of variation in gene expression, but also in studies that focus on a single species. Indeed, cellular heterogeneity may itself have a genetic component, which will be confounded with regulatory differences within a cell type (Donovan et al., 2020; Marderstein et al., 2020).

Notwithstanding these complications, our observations do indicate that natural selection has played a role in shaping inter-species similarities and differences in gene expression variability. Differences in cellular composition cannot explain the observed correlation between dispersion and measures of nucleotide divergence and diversity. This was the first observation in our study that provided some measure of support for the hypothesis that regulatory variation may be a genetic trait. Although the inference of selection at the genomic level indicates selection on coding regions (not gene regulation), the correlation with the degree of variation in expression suggests that the regulation of functionally important genes is also a selected trait.

Encouraged by this finding, we were able to find more evidence to support our hypothesis by carrying out a comparative eQTL analysis. We identified eQTLs in humans by using the GTEx data, which sampled hundreds of individuals. In chimpanzee, we identified eQTLs by using our 39 samples. It is quite difficult to obtain chimpanzee primary tissue samples and although this sample size is modest, it is probably the largest population of chimpanzee primary tissue samples ever reported. The difference in sample size between the human and chimpanzee eQTL discovery panels means that observations of human-specific eGenes are quite expected and can often be explained by the fact that we have more power to detect eQTLs in humans. In contrast, the observation of chimpanzee-specific eGenes is quite meaningful because it is much less likely that the human sample was underpowered to detect eQTLs for said genes if they existed.

With that in mind, our observation that species-specific eGenes have higher variability in the species in which the eQTL was detected is significant, because it directly points to a genetic basis for differences in expression dispersion between humans and chimpanzees. This observation is also consistent with the notion that genetic variation within species contributes to the overall inter-species divergence in gene regulation. This is a critical piece of evidence supporting our hypothesis, although we acknowledge that under some complicated scenarios, one could invoke cellular composition as a potential explanation for this observation as well. We argue, however, that this would require inter-species differences in cellular composition to segregate with dozens of genotypes in just one species, and for these genotypes to appear as cis eQTLs for genes that have higher divergence due to cellular composition. It is extremely unlikely that the genotypes would both correlate with the difference in cellular composition and lie in proximity to the specific set of eGenes that would satisfy the divergence requirement, although we cannot offer empirical data to entirely exclude this possibility. That said, we also observed stronger signals of coding selection for non-eGenes than eGenes in both species, further suggesting selection on the eGenes themselves (this is not expected if our observations are to be explained by differences in cellular composition).

Collectively, our observations suggest that, across species, eQTLs may be a subtle indicator of dosage insensitivity for relatively neutrally evolving genes. Thus, we believe that an inter-species analysis of population variability in gene expression may be a relatively simple way to complement existing methods to assess differences in selection pressures between lineages. Interestingly, the eGenes we identified only in chimpanzee, but not in human, are slightly enriched for transcription regulatory processes, reminiscent of a previous observation that transcription factors seem to be enriched among the genes positively selected for in the human lineage (Blekhman et al., 2008; Gilad et al., 2006).

In both human and chimpanzee, we found that genes involved in immune response are among the most variably expressed and strongly enriched among species-shared eGenes. This is consistent with a body of literature (Croze et al., 2016; Ejsmond and Radwan, 2015; Hagai et al., 2018; Shultz and Sackton, 2019) that points to diversifying selective pressure on immune related cell-surface receptor genes to identify and combat diverse and ever-evolving pathogens. A prime example of this is the abundance of MHC complex genes among the most variable genes in both species, and their strong enrichment in the subset of shared eGenes. However, the extent to which quantitative regulation of gene expression is functionally important to pathogen defense and thus the target of selection is unclear. Alternatively, these regulatory variants may be hitchhiking with functionally important and tightly linked coding variants under strong positive or balancing selection (D'Antonio et al., 2019; Meyer et al., 2018; Shiina et al., 2009). If non-coding trans-species polymorphisms are targets of long lived balancing selection on gene regulation (Johnsen et al., 2009; Leffler et al., 2013), we expect these polymorphisms to display similar eQTL effects in both species. However, we failed to identify generally conserved regulatory effects of trans-species polymorphisms. Although we found clear instances of regulatory effects from some of these trans-species SNPs in humans, we note that this human dataset is sufficiently powered to detect similar regulatory effects even from random control SNPs. The cumulative regulatory effects of trans-species polymorphisms are not significantly different from those of the control SNPs. We acknowledge that we may be underpowered to detect subtle conserved regulatory effects in chimpanzee.
Moreover, some of these trans-species polymorphisms may have important regulatory functions, although these functions may be tissue-, cell-type-, or context-dependent and not captured in our assessment of heart and other GTEx bulk tissues.

In summary, we performed a comparative assessment of expression variability and eQTL mapping and found signatures of stabilizing selection on gene regulation in both species. A deeper understanding of differences in selection on gene expression may be gained by further assessing mean differences, variability, and eQTL contributions in various tissue types across primate groups. Such studies may benefit from single-cell techniques, as we find strong contributions of cell-type heterogeneity in our analysis of variability, which may be biological or technical in nature.

Materials and methods

Novel data generation

In total, 39 post-mortem heart tissue biopsies were collected from captive born chimpanzees, 18 of which have been previously described (Pavlovic et al., 2018). A partial pedigree suggests at least eight first degree relationships among these individuals. Other metadata, including sex, age at death, and primate research center source of tissue, are detailed in Figure 1—source data 1. DNA and RNA were extracted from frozen tissues using kits (QIAGEN Cat No. 74104) or Trizol extraction. RNA-seq and whole genome sequencing libraries were prepared according to manufacturer’s protocols: RNA-seq libraries with PolyACapture followed by the TruSeq v2 RNA Library prep kit (the same RNA-seq protocol used by the GTEx consortium), and whole genome sequencing libraries with the Nextera DNA Flex Library prep kit. Sequencing of RNA-seq libraries was performed by University of Chicago sequencing facilities on HiSeq 4000 using 75 bp single-end sequencing chemistry. The 10 RNA-seq libraries previously described (Pavlovic et al., 2018) were re-sequenced for additional depth, along with the 29 new libraries. Whole genome sequencing for all 39 chimpanzee samples was performed on NovaSeq using 300+300 paired end sequencing chemistry.

RNA-seq and differential expression power analysis

39 RNA-seq fastq files for human left ventricle were chosen at random from GTEx v7 (Figure 1—source data 1, see Acknowledgements). Additionally, the 10 human and 18 chimpanzee RNA-seq libraries previously generated (Pavlovic et al., 2018), and all the novel chimpanzee RNA-seq libraries generated in this study are described in Figure 1—source data 1. Fastq files for samples that were sequenced on multiple lanes were combined after confirmation that all gene expression profiles for all fastq files cluster primarily by sample and not by sequencing lane. GTEx-derived fastq files were trimmed to 75 bp single-end reads to match the non-GTEx sequencing data. Reads were aligned to the appropriate annotated genome (GRCh38.p13 or Pan_tro_3.0 from Ensembl release 95; Zerbino et al., 2018) using STAR aligner (Dobin et al., 2013) default parameters. Only chromosomal contigs were considered for read alignment throughout this work. That is, unplaced contigs were excluded from the reference genome. Gene counts were obtained with subread featureCounts (Liao et al., 2014) using a previously described annotation file of human-chimpanzee orthologous exons (Pavlovic et al., 2018). The gene expression matrix was converted to CountsPerMillion (CPM) with edgeR (Robinson et al., 2010) to normalize reads to library size. The mean-variance trend was estimated using limma-voom (Ritchie et al., 2015). Genes with less than 6 CPM in all samples were excluded from further analysis. This cutoff was chosen based on visual inspection of the voom mean-variance trend to identify where the trend becomes unstable. To normalize differences in orthologous exonic gene size between human and chimpanzee, we converted the log(CPM) matrix to log(RPKM) based on the species-specific length of orthologous exonic regions. We then visually inspected PCA plots and hierarchical clustering to identify potential outliers and batch effects (Figure 1). 
We note that although all the chimpanzee RNA isolation and sequencing were prepared separately from GTEx heart samples, PCA and clustering analysis suggest that the inter-species differences vastly outweigh the technical batch effects (based on the inter-species but within-batch samples sourced from Pavlovic et al., 2018). The dataset of 49 human samples and 39 chimpanzee samples was culled to 39 human and 39 chimpanzee samples. First, 5 obvious human outliers, which did not cluster with the rest and had among the lowest read depths, were excluded (Figure 1A–B). The remaining human samples excluded to reach a balanced set of 39 human and 39 chimpanzee samples were the remaining GTEx samples with the lowest mapped read depths. Differential expression was tested using limma (Ritchie et al., 2015), using the eBayes function with default parameters and applying Benjamini-Hochberg FDR estimation. For Figure 1—figure supplement 2, this process was repeated at varying sample sizes (sampling with replacement) and at various read depths. Sequencing depth subsampling analysis was performed at the level of bam files to obtain matched numbers of mapped reads across samples, and the differential expression analysis was repeated.
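The normalization arithmetic described above can be sketched in a few lines. This is a minimal illustration, not the edgeR or limma-voom implementation: the function names are hypothetical, the counts are made up, and the prior counts that edgeR adds before taking logarithms are omitted for brevity.

```python
import math

def counts_per_million(counts):
    """Scale one sample's raw gene counts so that they sum to one million."""
    library_size = sum(counts)
    return [c * 1e6 / library_size for c in counts]

def passes_expression_filter(cpm_across_samples, min_cpm=6.0):
    """A gene is excluded only if it is below min_cpm in all samples."""
    return any(value >= min_cpm for value in cpm_across_samples)

def log2_cpm_to_log2_rpkm(log2_cpm, exonic_length_bp):
    """RPKM = CPM / (length in kb), so subtract log2(length in kb)."""
    return log2_cpm - math.log2(exonic_length_bp / 1000.0)

# Toy example: one sample with three genes (made-up counts).
toy_cpm = counts_per_million([1, 1, 2])

# A gene with log2(CPM) = 10 and 2 kb of orthologous exonic sequence.
toy_log2_rpkm = log2_cpm_to_log2_rpkm(10.0, 2000)
```

Because the length term is species-specific, a gene with identical CPM in both species can still receive different RPKM values when its orthologous exonic length differs between the two genomes.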

Contribution of inter-relatedness among chimpanzees to differential expression analysis

A centered genetic relatedness matrix of kinship coefficients (the same as used in chimpanzee cis-eQTL mapping) was used to cluster the 39 chimpanzee samples into groups which have varying levels of inter-relatedness. Clusters were determined using 'hclust' and 'cutree' functions in R with k = 7 clusters and defaults for other parameters. This resulted in the seven clusters depicted in Figure 1—figure supplement 3, which were further culled into three clusters of comparable size (each of size n = 4) with relatively high inter-relatedness, and one cluster (n = 13) with relatively low inter-relatedness. Specifically, sample 529 in the purple cluster was dropped as it had the lowest mean intra-cluster pairwise kinship coefficient. Conversely, six samples were dropped from the red cluster based on the presence of high intra-cluster pairwise kinship coefficients, which indicate first degree relatives. The variancePartition R package (Hoffman and Schadt, 2016) was used to implement a linear mixed model to assess the contribution of the cluster annotations (which we interpret as a proxy for inter-relatedness), RNA extraction batch, and sex (each as random effects) to explaining logRPKM gene expression for each gene. We also used the cluster groupings to empirically evaluate the degree to which highly inter-related subsamples contribute to DE power. More specifically, the same power analysis procedure described above was used to empirically assess the number of DE genes and false positive rate using four human samples (samples 63145, 62606, 59167, 59263, which were all derived from the same technical batch), and four chimpanzee samples, each drawn without replacement from within a cluster of varying degrees of inter-relatedness. The full distribution of DE results from all 715 possible combinations of four chimpanzees drawn from the lowly related n = 13 cluster was used as a baseline.
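The baseline of 715 combinations is the number of ways to draw four samples without replacement from the 13-sample cluster. A short sketch, with hypothetical sample labels, enumerates them:

```python
from itertools import combinations
from math import comb

# Hypothetical labels for the 13 samples in the low-relatedness cluster.
low_relatedness_cluster = [f"chimp_{i:02d}" for i in range(1, 14)]

# Every unordered draw of four samples without replacement.
subsets = list(combinations(low_relatedness_cluster, 4))

# Sanity check against the closed-form binomial coefficient C(13, 4).
assert len(subsets) == comb(13, 4)
```

Each tuple in `subsets` would correspond to one run of the power analysis against the same four human samples.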

Expression variability estimation

To estimate the mean and variance of gene expression, we assume

$$x_{ij} \mid x_{i+}, \lambda_{ij} \sim \mathrm{Poisson}(x_{i+} l_j \lambda_{ij}), \qquad \lambda_{ij} \sim \mathrm{Gamma}(\phi_j, \phi_j / \mu_j)$$

where $x_{ij}$ is the number of reads mapping to gene $j$ in sample $i$ ($i = 1, \ldots, n$; $j = 1, \ldots, p$), $x_{i+} = \sum_j x_{ij}$ is the total number of reads observed in sample $i$, $l_j$ is the effective length (Pachter, 2011) of gene $j$, and $\lambda_{ij}$ is the true relative gene expression of gene $j$ in sample $i$. The effective length for each gene, $l_j$, was calculated separately for chimpanzee and human as the length of orthologous exonic regions from which aligned reads were summed to create a count matrix. Under this model, true gene expression values for gene $j$, $\lambda_{1j}, \ldots, \lambda_{nj}$, follow a Gamma distribution with mean $\mu_j$ and variance $\mu_j^2 / \phi_j$, implying that the observed counts $x_{1j}, \ldots, x_{nj}$ follow a Negative Binomial distribution with mean $x_{i+} l_j \mu_j$ and overdispersion $1 / \phi_j$. This model corresponds to a generalized linear model (Hilbe, 2014), which we fit by maximizing the likelihood using the 'glm.nb' function in the R package MASS.
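The step from the Gamma-Poisson hierarchy to the stated Negative Binomial mean and overdispersion can be made explicit with the laws of total expectation and total variance. The sketch below assumes the shape-rate parameterization of the Gamma distribution (shape equal to the first parameter, rate equal to the second), which is the one consistent with the stated mean and variance. The shorthand $m_{ij}$ for the expected count is introduced here for illustration and does not appear in the original text.

```latex
\begin{align*}
m_{ij} &\equiv \mathbb{E}[x_{ij}]
  = \mathbb{E}\big[\mathbb{E}[x_{ij} \mid \lambda_{ij}]\big]
  = x_{i+} l_j \, \mathbb{E}[\lambda_{ij}]
  = x_{i+} l_j \mu_j, \\
\operatorname{Var}(x_{ij})
  &= \mathbb{E}\big[\operatorname{Var}(x_{ij} \mid \lambda_{ij})\big]
   + \operatorname{Var}\big(\mathbb{E}[x_{ij} \mid \lambda_{ij}]\big) \\
  &= x_{i+} l_j \mu_j + (x_{i+} l_j)^2 \, \frac{\mu_j^2}{\phi_j}
   = m_{ij} + \frac{m_{ij}^2}{\phi_j}.
\end{align*}
```

The quadratic term is scaled by the overdispersion: as the Gamma shape parameter grows, that term vanishes and the counts approach a Poisson distribution.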

We fit a LOESS trend using the 'loess.fit' function in R (with degree = 1) to the mean-overdispersion trend across all genes and considered the residual from the trend as the gene’s mean-corrected dispersion. We estimated standard errors for each dispersion estimate by bootstrapping with replacement 1000 times. We estimated bootstrap p-values to test the alternative hypothesis that the absolute difference in dispersion between chimpanzee and human is greater than zero, bootstrapping with replacement 10,000 times. More specifically, we estimated the distribution of the absolute difference in dispersion under the null by performing 10,000 iterations of resamples (n = 39 individuals) from a joint count matrix containing all human and chimpanzee individuals. p-Values for each gene are then defined as the fraction of resamples with an absolute difference greater than what is observed. False discovery rates were estimated using Storey’s q-value (Storey and Tibshirani, 2003).
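The resampling p-value described above can be illustrated with a simplified sketch. Several parts are assumptions made for illustration only: the function names are hypothetical, the population variance of made-up log-expression values stands in for the mean-corrected dispersion, and both groups are resampled with replacement from the pooled individuals, which is one plausible reading of the procedure rather than a confirmed reproduction of it.

```python
import random
from statistics import pvariance

def resampling_p_value(observed_abs_diff, null_abs_diffs):
    """Fraction of null resamples with an absolute difference greater than observed."""
    n_greater = sum(1 for d in null_abs_diffs if d > observed_abs_diff)
    return n_greater / len(null_abs_diffs)

def simulate_null_abs_diffs(pooled, n_per_group, statistic, n_iter=10_000, seed=0):
    """Resample two groups with replacement from the pooled individuals and
    record |statistic(group_a) - statistic(group_b)| in each iteration."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_iter):
        group_a = rng.choices(pooled, k=n_per_group)
        group_b = rng.choices(pooled, k=n_per_group)
        diffs.append(abs(statistic(group_a) - statistic(group_b)))
    return diffs

# Toy data: made-up log-expression values for one gene in two species.
species_a = [5.0, 5.1, 4.9, 5.2, 4.8, 5.0]
species_b = [3.0, 7.0, 4.0, 6.5, 2.5, 7.5]

observed = abs(pvariance(species_a) - pvariance(species_b))
null = simulate_null_abs_diffs(species_a + species_b, 6, pvariance, n_iter=2000)
p_value = resampling_p_value(observed, null)
```

Pooling the individuals before resampling removes the species labels, so the resulting distribution approximates the differences expected if both species shared the same dispersion.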

Dispersion and cell-type heterogeneity

Single-cell RNA-seq data were downloaded from the Tabula Muris mouse single cell atlas (Tabula Muris Consortium et al., 2018). This dataset contains both FACS-based and droplet-based single-cell RNA-seq datasets for adult mouse heart. Only the FACS based heart data were used, as the droplet based data are much sparser by comparison (Tabula Muris Consortium et al., 2018). Data were analyzed with Seurat (Butler et al., 2018) using the published cell type labels (Tabula Muris Consortium et al., 2018). After subsetting cells that contain at least 1000 genes with nonzero counts, the scTransform function was used to obtain a normalized count matrix used for plotting Figure 4A–B. A cell-type-specificity score, τ (Kryuchkova-Mostacci and Robinson-Rechavi, 2016), was calculated for each gene (only considering one-to-one mouse/human orthologs) by utilizing the nine cell type labels assigned by Tabula Muris to sum raw read counts from each cell type to create a pseudo-bulk count matrix. The pseudo-bulk count matrix was converted to CountsPerMillion and subsequently used to calculate τ with open source software (doi:10.5281/zenodo.3558708).
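For readers unfamiliar with the specificity score, the commonly used definition of τ can be sketched as follows. This is an illustration under assumptions and not the cited software: the function name is hypothetical, the input is taken to be one gene's pseudo-bulk CPM values across cell types, and whether values are log-transformed beforehand depends on the implementation.

```python
def tau(expression_by_cell_type):
    """Specificity index for one gene.

    0 means equal expression in every cell type; 1 means expression
    in a single cell type only.
    """
    n = len(expression_by_cell_type)
    x_max = max(expression_by_cell_type)
    if n < 2 or x_max <= 0:
        raise ValueError("need at least two cell types and nonzero expression")
    # Sum of (1 - normalized expression), divided by n - 1.
    return sum(1 - x / x_max for x in expression_by_cell_type) / (n - 1)

# Made-up CPM values across three cell types.
uniform_gene = tau([5.0, 5.0, 5.0])
specific_gene = tau([10.0, 0.0, 0.0])
intermediate_gene = tau([10.0, 5.0, 0.0])
```

With nine cell type labels, as in the pseudo-bulk matrix described above, each gene's input list would contain nine values.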

BayesPrism (Chu and Danko, 2020) was used to estimate cell types in the bulk RNA-seq datasets used for the RNA-seq dispersion and power analyses. The intersection of genes that are one-to-one mouse-human orthologs and used in DE and bulk dispersion estimation was used to filter the cell-type-labeled mouse scRNA-seq reference gene expression matrix for deconvolution ('run.Ted' function with default parameters) of the 39 chimpanzee and 39 human bulk samples used in DE analysis. This process yielded per-individual cellular proportion estimates and expression estimates for each cell type. The expression estimates were converted to log(CPM) based on the library size of the bulk count matrix. A cell-type-specific dispersion estimate was obtained similarly to the bulk procedure: a LOESS trend was fit to the population mean expression versus the log(variance) trend across all genes, and the residual was considered as the cell-type-specific dispersion estimate. Standard errors for cell-type-specific dispersion estimates were obtained by bootstrapping 1000 samples from the BayesPrism estimated cell-type-specific expression matrices, and as such, the reported standard error (Figure 4—figure supplement 1—source data 1) does not incorporate error in cell type deconvolution or expression estimation.

Chimpanzee genome sequencing

For whole genome sequencing data processing, we followed general guidelines for read alignment and de novo variant calling as previously described (Li, 2014). More precise steps are as follows: Sequencing adapters were trimmed with cutadapt (Martin, 2011) and aligned to Pan_tro_3.0 (Ensembl) using bwa-aligner (Li and Durbin, 2009) with default parameters. The average genome coverage in every sample was >30X. Sample-specific statistics for basic data processing steps, including read alignment and variant calling, are summarized in Supplementary file 2. PCR duplicates were removed via Picard tools. Low-complexity regions of the genome were determined using dustmasker (Morgulis et al., 2006) with default settings and excluded from variant calling. Sites where any sample had coverage (after PCR duplicate removal) outside of d+3√d and d+4√d coverage (where d is the sample average fold coverage across the genome) were also excluded, as these regions are enriched for duplicated or paralogous regions which are prone to false heterozygous variant calls (Li, 2014). The resulting callable sites span 2,544,417,587 out of 2,967,125,077 bases on the contiguous chromosomal genome. Variants were called in all samples jointly using freebayes (Garrison and Marth, 2012) with the following parameters: {--min-coverage 3 --max-coverage 150 k --standard-filters -n 2 --report-genotype-likelihood-max} and subsequently filtered for Phred-scaled quality score >30 to generate a VCF file (see Data Availability). Due to memory constraints, variant calling was executed in 2.5 megabase chunks which were later merged. In total, 19,789,407 single nucleotide variants passed variant calling filters, yielding a transition/transversion ratio of 2.08.
Variants in the accompanying VCF and throughout this work were left ‘clumped’, meaning that completely linked SNPs within 3 bp (the default setting of freebayes) are combined into a single variant in the VCF, and tested as a single variant during eQTL calling.
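Two of the summary quantities reported above, the callable fraction of the genome and the transition/transversion ratio, are simple to compute. The sketch below is illustrative only: the function name and the toy variant list are hypothetical, and it assumes biallelic single nucleotide variants given as (reference, alternate) base pairs.

```python
# Transitions are purine<->purine (A/G) or pyrimidine<->pyrimidine (C/T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ts_tv_ratio(snvs):
    """Transition/transversion ratio for an iterable of (ref, alt) base pairs.

    Raises ZeroDivisionError if the input contains no transversions.
    """
    n_ts = n_tv = 0
    for ref, alt in snvs:
        if (ref.upper(), alt.upper()) in TRANSITIONS:
            n_ts += 1
        else:
            n_tv += 1
    return n_ts / n_tv

# Toy variant list: two transitions and one transversion.
toy_ratio = ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")])

# Callable fraction from the base counts reported in the text (about 86%).
callable_fraction = 2_544_417_587 / 2_967_125_077
```

A genome-wide ratio near 2 is commonly used as a rough sanity check on SNV call sets, which is the role the reported value of 2.08 plays here.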

For admixture analysis, a VCF file of previously sequenced wild-born chimpanzees (de Manuel et al., 2016) was lifted over to Pan_tro_3.0 and merged with the VCF file described above, keeping only variants present in both sets. Variants in high LD were pruned using plink (Purcell et al., 2007) with parameters {--indep-pairwise 50 5 0.5}. The resulting genotypes were analyzed for population structure with PCA using 'prcomp' function in R. Additionally, we utilized Admixture software (Alexander et al., 2009) with K = 4 clusters, as there are four recognized distinct chimpanzee subspecies, all of which were represented in the wild-born cohort (de Manuel et al., 2016).

Cis eQTL mapping (chimpanzee)

RNA-seq reads were re-aligned with STAR aligner (Dobin et al., 2013) using Pan_tro_3.0 gene annotations from Ensembl version 95 (Zerbino et al., 2018). Gene counts were quantified with STAR --quantMode GeneCounts to compile a gene expression matrix. Genes were filtered to require at least eight read counts in 75% of samples, leaving 13,545 genes for further analysis. The resulting read count matrix was converted to log(CPM), standardized across individuals, and quantile normalized to a normal distribution across genes as previously described (Degner et al., 2012). Principal component analysis of the normalized matrix identified significant associations between various observed technical factors and some of the first 10 principal components, including RNA library prep batch and sex, and as such, principal components were included later as covariates during eQTL calling.
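The two-step normalization, standardizing each gene across individuals and then mapping values to normal quantiles, can be sketched with the Python standard library. This is a minimal illustration under assumptions and not the cited procedure's code: the function names are hypothetical, the population standard deviation is used for standardization, quantiles are assigned as (rank − 0.5)/n, and ties are broken by input order.

```python
from statistics import NormalDist, mean, pstdev

def standardize(values):
    """Center to mean 0 and scale to unit (population) standard deviation.

    Fails with a ZeroDivisionError if all values are identical.
    """
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def quantile_normalize_to_normal(values):
    """Replace each value by the standard-normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    normal = NormalDist()  # mean 0, standard deviation 1
    out = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        out[idx] = normal.inv_cdf((rank - 0.5) / n)
    return out

# Toy example with made-up log(CPM) values.
z_scores = standardize([1.0, 2.0, 3.0])
normal_scores = quantile_normalize_to_normal([3.0, 1.0, 2.0])
```

The rank-based step preserves the ordering of the values while removing the influence of outlying magnitudes on a linear-model fit.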

Variants were filtered for MAF >0.1 and Hardy-Weinberg equilibrium (nominal $P > 1 \times 10^{-7.5}$, hardy function in plink) to filter out rare variants and genotyping errors. The resulting 5,957,179 variants were each tested for association with expression of each local cis gene (cis window defined as within 250 kb of gene body). MatrixEQTL (Shabalin, 2012) was used to implement a linear mixed model for each cis variant:gene pair to estimate the effect of the variant genotype on normalized gene expression. We supplied MatrixEQTL with a genetic relatedness matrix (GRM) to account for heteroskedastic errors generated from underlying population structure and genetic relatedness amongst individuals. The standardized GRM was produced by GEMMA (Zhou and Stephens, 2012) using variants pruned for LD as described for admixture analysis. Additionally, between 0 and 15 gene expression principal components (PCs) were tested as covariates to the linear model and 10 PCs were included in the final model as this maximizes the number of eQTLs (Figure 6—figure supplement 1A). Manual inspection of normalized expression boxplots stratified by genotype for the top eQTLs revealed many of the nominally strongest associations were driven by a single expression outlier point for a single homozygous individual, MD_And. Further inspection revealed this individual has among the highest levels of homozygosity genome-wide among our chimpanzee cohort (Supplementary file 2), possibly due to inbreeding. Given that this individual is the only chimpanzee sourced from MD Anderson primate research center, we felt justified excluding this individual from eQTL calling to minimize false associations.
After excluding this individual and re-performing eQTL mapping (n = 38), visual inspection of a QQ-plot of p-values compared to permuted null data (where the sample labels for expression and covariates were randomly assigned to genotype) indicates inflation of small p-values, and that p-values are well calibrated under the permuted null (Figure 6—figure supplement 1B). To obtain gene-level p-values that test whether a gene contains an eQTL (eGenes) we used EigenMT, a method which approximates permutation testing procedures to account for multiple testing of linked SNPs (Davis et al., 2016).
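The variant-level bookkeeping that precedes association testing, computing the minor allele frequency and deciding which variants fall in a gene's cis window, can be sketched as follows. The function names and toy values are hypothetical, genotypes are assumed to be diploid alternate-allele dosages (0, 1, or 2), and the variant and gene are assumed to be on the same chromosome.

```python
def minor_allele_frequency(dosages):
    """Minor allele frequency from diploid alternate-allele dosages (0/1/2)."""
    alt_freq = sum(dosages) / (2 * len(dosages))
    return min(alt_freq, 1 - alt_freq)

def in_cis_window(variant_pos, gene_start, gene_end, window_bp=250_000):
    """True if the variant lies within window_bp of the gene body."""
    return gene_start - window_bp <= variant_pos <= gene_end + window_bp

# Toy genotypes for four individuals: 5 alternate alleles out of 8.
toy_maf = minor_allele_frequency([0, 1, 2, 2])

# Toy gene spanning 300,000-400,000 bp.
near = in_cis_window(100_000, 300_000, 400_000)   # 200 kb upstream
far = in_cis_window(10_000, 300_000, 400_000)     # 290 kb upstream
```

Only variant:gene pairs that pass both checks (MAF above the threshold and position inside the window) would be carried forward to the association test.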

Cis eQTL mapping (human)

We downloaded gene-level (eGene) summary statistics for Heart_Left_ventricle GTEx v8 from GTEx portal (https://gtexportal.org/). The summary statistics from this mapping pipeline only consider expressed genes, which are defined as >0.1 TPM in at least 20% of samples and ≥6 reads in at least 20% of samples. The analysis in Figure 6—figure supplement 3B required more than summary statistics. We downloaded normalized phenotypes, covariates, and genotypes for GTEx v8 data for Heart_Left_ventricle and remapped eGenes at varying sample sizes using a mapping pipeline nearly identical to GTEx. Namely, we used FastQTL (Ongen et al., 2016) on the supplied data with randomly selected individuals corresponding to sample sizes (n) of 40, 60, 80, 100, 120, 160, 200. Similar to guidelines described by GTEx (https://gtexportal.org/), we included only 10 PEER factor covariates for n = 40, 15 for 40 < n < 150, or 30 for n > 150.
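The covariate schedule for the subsampling analysis can be written as a small lookup. This sketch reads the intermediate range as sample sizes strictly between 40 and 150. The function name is hypothetical, and the handling of sample sizes below 40 or exactly 150, which the text does not specify, is an assumption.

```python
def n_peer_factors(n_samples):
    """Number of PEER factor covariates for a given eQTL sample size."""
    if n_samples <= 40:
        return 10  # text specifies n = 40; smaller n is an assumption
    if n_samples < 150:
        return 15
    return 30      # text specifies n > 150; n = 150 itself is an assumption

# The sample sizes used for remapping eGenes.
sample_sizes = (40, 60, 80, 100, 120, 160, 200)
schedule = {n: n_peer_factors(n) for n in sample_sizes}
```

None of the tested sample sizes falls on an unspecified boundary, so the assumptions do not affect the schedule for this analysis.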

Gene-wise conservation statistics

Gene-wise amino acid percent identity between chimpanzee and human was obtained from BioMart (Kinsella et al., 2011). Gene-wise pan-mammal dN/dS for each gene was obtained from a previous study that used alignments from 29 mammals (Lindblad-Toh et al., 2011). Pn/Ps was calculated from all GTEx v8 genotype data and the union of all Pan troglodytes genotypes available in this study and de Manuel et al., 2016. Specifically, Ensembl vep (McLaren et al., 2016) was used to annotate coding variants with MAF >0.1 as synonymous or non-synonymous to tabulate Pn (number of polymorphic non-synonymous sites) and Ps (number of polymorphic synonymous sites) for each gene within each species. After requiring that genes have at least one polymorphism in both species, a pseudocount of 0.5 was added to both Pn and Ps for each gene for both species to avoid division-by-zero errors. TATA box genes were classified as genes with a TATA motif within 35 bp of a transcription initiation site from published transcription initiation sites (Abugessaisa et al., 2019).
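The pseudocount-adjusted Pn/Ps ratio is straightforward to compute. In the sketch below the function names are hypothetical, and "at least one polymorphism" is interpreted as at least one synonymous or non-synonymous polymorphic site per species, which is an assumption about the exact filter.

```python
def pn_ps(pn, ps, pseudocount=0.5):
    """Ratio of non-synonymous to synonymous polymorphic sites with a pseudocount."""
    return (pn + pseudocount) / (ps + pseudocount)

def has_polymorphism_in_both(pn_human, ps_human, pn_chimp, ps_chimp):
    """Assumed filter: each species has at least one coding polymorphism in the gene."""
    return (pn_human + ps_human) >= 1 and (pn_chimp + ps_chimp) >= 1

# Toy counts for one gene.
ratio_typical = pn_ps(3, 7)       # (3 + 0.5) / (7 + 0.5)
ratio_no_synonymous = pn_ps(2, 0) # finite even though Ps is zero
```

Without the pseudocount, the second example would divide by zero; with it, genes lacking synonymous polymorphisms still receive a finite value.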

eQTL-mapping of shared polymorphisms

263 SNPs among 125 regions that are trans-species polymorphisms between chimpanzee and humans were obtained from Leffler et al., 2013. Each trans-species polymorphic region contains at least two trans-species SNPs to ensure regions are identical by descent rather than recurrent mutation of an isolated SNP (Leffler et al., 2013). Only SNPs with MAF >0.1 in our datasets were further utilized, leaving 192 SNPs in chimpanzee, 196 SNPs in human (GTEx), and 144 SNPs (amongst 76 regions) tested for eQTL activity in both species. For each test SNP, a matched control SNP was randomly chosen for each species with the criteria that it should have a matching allele frequency (±5% MAF), lie within 100 kb of the test SNP, and be unlinked to the test SNP (R2 <0.2, LD calculated with plink). We used MatrixEQTL to retest these SNPs for cis eQTL activity (1 Mb window) with the same normalized gene expression matrix, GRM (for chimpanzee only), and covariates described above for chimpanzee and human cis eQTL mapping. The effect sizes between species occasionally had to be re-polarized to relate to the same allele, as the effect sizes obtained from eQTL testing software are often polarized by minor allele or by reference vs non-reference allele, and the minor allele and/or reference allele at these trans-species polymorphisms is not always the same between species.
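The control-matching criteria and the re-polarization of effect sizes can be expressed compactly. The sketch below is illustrative: the function names and dictionary fields are hypothetical, the SNPs are assumed to be biallelic so that switching the effect allele simply flips the sign of the effect size, and the LD value is assumed to be precomputed.

```python
def is_valid_control(test_snp, candidate, r2_with_test,
                     maf_tolerance=0.05, max_distance_bp=100_000, max_r2=0.2):
    """Check the three matching criteria for a candidate control SNP."""
    return (
        candidate["chrom"] == test_snp["chrom"]
        and abs(candidate["maf"] - test_snp["maf"]) <= maf_tolerance
        and abs(candidate["pos"] - test_snp["pos"]) <= max_distance_bp
        and r2_with_test < max_r2
    )

def polarize(beta, effect_allele, target_allele):
    """Express an effect size relative to target_allele (biallelic SNP assumed)."""
    return beta if effect_allele == target_allele else -beta

# Toy SNPs with made-up coordinates and frequencies.
test_snp = {"chrom": "chr1", "pos": 1_000_000, "maf": 0.30}
candidate = {"chrom": "chr1", "pos": 1_050_000, "maf": 0.27}

accepted = is_valid_control(test_snp, candidate, r2_with_test=0.05)
rejected_for_ld = is_valid_control(test_snp, candidate, r2_with_test=0.5)

# An effect reported for allele "A", re-expressed relative to allele "G".
flipped = polarize(0.4, effect_allele="A", target_allele="G")
```

After polarizing both species' effect sizes to the same allele, their signs and magnitudes can be compared directly.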

Gene set enrichment analysis

Gene set enrichment and gene ontology overlap analysis was performed with the clusterProfiler R package (Yu et al., 2012). The GSEA test was performed with an ordered gene list using the 'gseGO' function with 1,000,000 permutations. As ordering genes was not applicable to inter-species eGene classifications, the 'enrichGO' function was used to perform gene ontology overlap analysis (hypergeometric test) with foreground and background gene sets based on eGene classifications. To quantify the correlation between dN/dS and dispersion for different GO sets, we obtained dN/dS annotations as described above (Lindblad-Toh et al., 2011) and gene set annotations from MSigDB v7.2 (Liberzon et al., 2011). We tested all gene sets with dN/dS and dispersion estimates for at least five genes in the gene set using Spearman’s test, adjusting for multiple testing with Storey’s q-value (Storey and Tibshirani, 2003).
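The hypergeometric overlap test underlying this kind of enrichment analysis can be sketched directly from its definition. This is not the clusterProfiler implementation: the function and argument names are hypothetical, and the foreground is assumed to be a subset of the background.

```python
from math import comb

def hypergeom_enrichment_p(k_overlap, n_foreground, n_annotated, n_background):
    """Upper-tail probability P(X >= k_overlap).

    X is the number of annotated genes in a foreground of size n_foreground
    drawn without replacement from n_background genes, of which n_annotated
    carry the annotation.
    """
    total = comb(n_background, n_foreground)
    upper = min(n_foreground, n_annotated)
    tail = sum(
        comb(n_annotated, i) * comb(n_background - n_annotated, n_foreground - i)
        for i in range(k_overlap, upper + 1)
    )
    return tail / total

# Toy example: 10 background genes, 4 annotated, foreground of 3, overlap of 2.
toy_p = hypergeom_enrichment_p(2, 3, 4, 10)
```

In the toy example the tail sums the cases of two and three annotated genes in the foreground, (36 + 4) / 120, which is one third.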

Acknowledgements

We thank Natalia Gonzales, Michelle Ward, Ittai Eres, and other members of the Gilad lab for helpful analysis discussions and comments on the manuscript. This work was supported by NIH grant R35GM131726 as well as the Yerkes National Primate Research Center Base Grant ORIP/OD P51OD011132 and RR00165. Computational resources were provided by the University of Chicago Research Computing Center.

The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. The data used for the analyses described in this manuscript were obtained from: the GTEx Portal on 10/08/19 and/or dbGaP accession number phs000424.v7.p2 on 09/17/19.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Benjamin Jung Fair, Email: bjf79@uchicago.edu.

Yoav Gilad, Email: gilad@uchicago.edu.

Hunter B Fraser, Stanford University, United States.

Detlef Weigel, Max Planck Institute for Developmental Biology, Germany.

Funding Information

This paper was supported by the following grant:

  • National Institute of General Medical Sciences R35GM131726 to Yoav Gilad.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Data curation, Formal analysis, Visualization, Methodology, Writing - original draft, Writing - review and editing.

Resources, Data curation, Writing - review and editing, Discussion and interpretation of results.

Methodology, Writing - review and editing, Discussion and interpretation of results.

Resources, Data curation, Investigation.

Investigation.

Conceptualization, Supervision, Funding acquisition, Writing - original draft, Writing - review and editing.

Additional files

Supplementary file 1. Full GSEA results based on interspecies dispersion differences after excluding virally challenged chimpanzees.
elife-59929-supp1.txt (70.7KB, txt)
Supplementary file 2. Whole genome sequencing sample summary statistics.
elife-59929-supp2.txt (3.6KB, txt)
Supplementary file 3. Chimpanzee eGene summary statistics.
elife-59929-supp3.txt (1.6MB, txt)
Transparent reporting form

Data availability

RNA-Seq data available under GEO accession number GSE151397. Raw whole genome sequencing data under SRA accession PRJNA635393. Processed whole genome sequencing data available as variant calls at European variation archive, EVA accession PRJEB39475.

The following datasets were generated:

Fair BJ, Blake LE, Chavarria C, Sarkar A, Pavlovic BJ, Gilad YY. 2020. Gene expression variability in human and chimpanzee populations share common determinants. NCBI Gene Expression Omnibus. GSE151397

Fair BJ, Blake LE, Chavarria C, Sarkar A, Pavlovic BJ, Gilad YY. 2020. Whole genome sequencing of 39 captive born chimpanzees. NCBI BioProject. PRJNA635393

Fair BJ. 2020. Whole genome sequencing of 39 captive born chimpanzees. EBI European Variation Archive. PRJEB39475

The following previously published datasets were used:

Pavlovic BJ, Blake LE, Chavarria C, Gilad Y. 2018. A Comparative Assessment of iPSC Derived Cardiomyocytes with Heart Tissues in Humans and Chimpanzees. NCBI Gene Expression Omnibus. GSE110471

The GTEx Consortium. 2019. GTEx Analysis V8. dbGaP. phs000424.v8.p2

Decision letter

Editor: Hunter B Fraser
Reviewed by: Charles G Danko

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

Individuals show heritable variation in gene expression, which is associated with disease susceptibility. In contrast to many other studies that have focused on mapping associations between genetic and gene regulatory variation, the current work addresses group dispersion/variance of gene expression among samples as well as the evolutionary processes that shape differences in gene expression between individuals, in both humans and chimpanzees. Using computational deconvolution, the authors demonstrate that cell-type heterogeneity is an important component of expression variability, with significant overlap of orthologous genes associated with eQTLs in both species. The conclusion is that gene expression variability in humans and chimpanzees often evolves under similar evolutionary pressures.

Decision letter after peer review:

Thank you for submitting your article "Gene expression variability in human and chimpanzee populations share common determinants" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Detlef Weigel as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Charles G Danko (Reviewer #3).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

We would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). Specifically, when editors judge that a submitted work as a whole belongs in eLife but that some conclusions require a modest amount of additional new data, as they do with your paper, we are asking that the manuscript be revised to either limit claims to those supported by data in hand, or to explicitly state that the relevant conclusions require additional supporting data.

Our expectation is that the authors will eventually carry out the additional experiments and report on how they affect the relevant conclusions either in a preprint on bioRxiv or medRxiv, or if appropriate, as a Research Advance in eLife, either of which would be linked to the original paper.

Summary:

This is a solid study, with a large sample size, identifying expression quantitative trait loci (eQTLs) in humans and chimpanzees, using gene expression data from primary heart samples. The authors complemented the analysis of gene expression with comparative eQTL mapping, as opposed to relying on mean expression levels, as most comparative studies like this one do. Also unlike many studies focused on mapping associations between genetic and gene regulatory variation, the authors paid attention to the group dispersion/variance of gene expression among samples as well as the evolutionary processes that shape the differences in gene regulation between individuals. The calculation of power for discovering differentially expressed genes as a function of sample size at the beginning of the paper is a thoughtful analysis that is useful to many in the community. All of the analyses are extremely thorough and well-executed. The statistical tests are appropriate and rigorous. Results are interpreted in a conservative fashion.

The main limitation is that the authors are not able to conclusively disambiguate between different causes of dispersion. Genetics, cell type, and technical variation may all contribute to dispersion. The authors state this very clearly throughout the manuscript. In part, this may reflect the authors' underselling their results somewhat. But in part, this really does reflect reality: Cell type is a major confounder that may provide false signals in other analyses.

Revisions for this paper:

The reviewers suggested a number of potential additions to clarify current results or build upon them. I will leave it up to the authors to decide which are worth including in their revision.

1) The first test authors conducted is to identify differentially variable (DV) genes. A total of 2658 DV genes were identified. The problem with the result is that almost equal numbers of up- and down-regulated DV genes are symmetrically distributed around DV=0. Often, this is an indication of a lack of biological signals in data analysis. This might be due to the pooling of gene groups with diverse functionality together. Therefore, this reviewer suggests that authors should break down genes into subgroups to detail the up and down-regulatory patterns with the hope that some of the gene groups give interpretable results.

2) The second test is to correlate the higher coding sequence conservation with lower dispersion. Again, the positive result is not unexpected. There are many indirect and/or confounding factors that may explain the effect. This reviewer, however, understands it is impossible to control them all (also authors have attempted to address some of them in the next few tests). However, here it is better to add exploratory analyses for genes in different functional groups and also give examples of outlier genes that do not follow the rule.

3) The third test is to examine the correlation between gene expression variability with single-cell type heterogeneity of samples. Authors first used Tabula Muris dataset to show dispersion is strongly correlated with cell-type specificity/diversity. If this is true, then the point that authors really wanted to demonstrate is, in fact, hampered. Authors might really want to show the "true" single-cell variability (see, for example, PMID: 31861624) is correlated with the level of group variance of gene expression.

4) The fourth test authors conducted is to show that dN/dS and Pn/Ps ratios of genes are correlated with gene expression variability (variance). However, because of the existence of heterogeneity of cell-type composition in samples, any correlation observed may be utterly biased by this single uncontrollable confounding factor. Furthermore, heart tissues contain an over-abundant expression of genes encoded in the mitochondrial genome. The expression level of these mt-genes may vary substantially between samples and reflect the health status of primary sample donors. PEER normalization may have to take this into account as a covariate.

5) Several other tests authors performed are around eQTLs (eGene overlap and eSNP overlap) between the two species. These are typical tests evolutionary biologists usually try to do whenever data are available. However, the issues with these types of tests are the low power in general. More importantly, in order to be consistent with previous tests which are all around the explanation of gene expression variance, this part should address the overlap between expression vQTLs in humans and chimps.

6) I would like to see more discussion about the inter-relatedness of the chimpanzees in the analysis of gene expression. Is that contributing to the power of the DE analysis, which has really high numbers of DE genes? That may certainly be due to the large sample size, but should be addressed. Related to that, the support that the gene-wise dispersion estimates are well correlated in humans and chimpanzees overall (Figure 1C, and Figure 2—figure supplement 1) seems qualitative. It looks like the chimpanzees might have less dispersion overall?

7) What do the authors think these findings mean for study systems outside of humans and captive chimpanzees? Both on the technical level (e.g. sample size), and for how their approach could be helpful outside of these species. Generalizing this approach would broaden the impact and audience of the paper.

8) Did the authors test directly whether eQTLs were enriched in genes with a high dispersion? I could not find this going back through the paper. This seems almost trivially likely to be true. I may have missed this result? Or did the authors worry this is too likely to be confounded with cell type? Either way, this seems like a result that may be useful to show even if the authors did acknowledge that it was likely to be confounded.

9) Did the authors consider looking for cell-type QTLs? They state several times in the paper the possibility that genetic factors may influence cell types. They have enough data – at least in human – to obtain QTLs for specific cell types, as others have done (Marderstein et al., 2020; Donovan et al., 2020). If these cell type QTLs were enriched near genes with a high dispersion, this may bolster the author's argument that genetic factors underlie dispersion by affecting cell type composition.

10) The scRNA-seq reference used for estimating cell types in heart tissue was derived from mice. Could this lead the authors to underestimate the degree to which cell types drive dispersion in genes that are variable between human and chimp? Genes that are variable between human/ chimp may also be more likely to be variable between either species and mouse, and perhaps this variability has led to them becoming more/ less of a marker of a specific cell population (and hence their dispersion in primates does not correlate with cell type specificity in mouse).

11) Have the authors tried estimating dispersion on top of what is expected based on differences in cell type? There are several strategies that might work for this: There are new strategies for estimating a posterior of cell type specific expression from a bulk sample, conditional on scRNA-seq data as prior information (Chu and Danko, 2020). These cell type specific expression estimates could then be analyzed for dispersion. Alternatively, it may also work to regress the estimated proportion of each cell type out of the dispersion estimates. While there are certainly a lot of pitfalls with using these strategies, especially in the setting shown here (all of this would work better if there were species matched reference data), they might provide an avenue for depleting the contribution of cell type differences from dispersion estimates.

12) Can the authors add a dotted line to show the shape of the distribution for genes with low dispersion, or where dispersion is shared in both human and chimpanzee, in Figure 4B? Is this different from genes that are dispersed in either chimp or human?

eLife. 2020 Oct 21;9:e59929. doi: 10.7554/eLife.59929.sa2

Author response


Revisions for this paper:

The reviewers suggested a number of potential additions to clarify current results or build upon them. I will leave it up to the authors to decide which are worth including in their revision.

1) The first test authors conducted is to identify differentially variable (DV) genes. A total of 2658 DV genes were identified. The problem with the result is that almost equal numbers of up- and down-regulated DV genes are symmetrically distributed around DV=0. Often, this is an indication of a lack of biological signals in data analysis. This might be due to the pooling of gene groups with diverse functionality together. Therefore, this reviewer suggests that authors should break down genes into subgroups to detail the up and down-regulatory patterns with the hope that some of the gene groups give interpretable results.

We thank the reviewers for their helpful comments, which ultimately improved the manuscript.

Regarding this point, it is true that symmetric distributions are expected when there is a lack of biological signal, but we also often see such distributions when there is abundant biological signal, especially when we compare quantitative traits between species. For example, it is well established that among inter-species differentially expressed genes, in practically any comparison done to date, about half the genes have elevated expression levels and about half have decreased expression levels.

That said, we provide the analysis suggested by the reviewer (revised manuscript Figure 4—figure supplement 3). We performed GSEA enrichment to identify which gene categories are preferentially DV-up or DV-down between chimpanzee and human. We found immune related genes with higher dispersion in chimpanzee, and mitosis related genes with higher dispersion in humans. As described in the manuscript, our interpretation is that this partly reflects differences in life histories in the human population, from which many individuals have suffered cardiac ischemia and may have differentially expressed mitosis related genes as a response.

2) The second test is to correlate the higher coding sequence conservation with lower dispersion. Again, the positive result is not unexpected. There are many indirect and/or confounding factors that may explain the effect. This reviewer, however, understands it is impossible to control them all (also authors have attempted to address some of them in the next few tests). However, here it is better to add exploratory analyses for genes in different functional groups and also give examples of outlier genes that do not follow the rule.

This is a good suggestion. In the revised version, we provide a new analysis (revised manuscript Figure 3—figure supplement 1) in which we test, on a GO category basis, for a correlation between higher coding sequence conservation (dN/dS metric) and lower dispersion. Consistent with the narrative throughout this manuscript, we find that this correlation is strongest for immune related genes.

3) The third test is to examine the correlation between gene expression variability with single-cell type heterogeneity of samples. Authors first used Tabula Muris dataset to show dispersion is strongly correlated with cell-type specificity/diversity. If this is true, then the point that authors really wanted to demonstrate is, in fact, hampered. Authors might really want to show the "true" single-cell variability (see, for example, PMID: 31861624) is correlated with the level of group variance of gene expression.

This is an important question that we pondered ourselves as we wrote this manuscript. This comment is similar to reviewer point 11, and we describe a new analysis that addresses it in our response to point 11 (please see below).

4) The fourth test authors conducted is to show that dN/dS and Pn/Ps ratios of genes are correlated with gene expression variability (variance). However, because of the existence of heterogeneity of cell-type composition in samples, any correlation observed may be utterly biased by this single uncontrollable confounding factor. Furthermore, heart tissues contain an over-abundant expression of genes encoded in the mitochondrial genome. The expression level of these mt-genes may vary substantially between samples and reflect the health status of primary sample donors. PEER normalization may have to take this into account as a covariate.

We do not understand the first concern – that dN/dS and Pn/Ps may be confounded by the presence of cell type composition heterogeneity. Obviously, dN/dS and Pn/Ps are based on genotypes, and as such are orthogonal to gene expression based measurements from tissues (because all cell types from the same individual have the same genotypes). Given this, how can these measurements be biased by cell composition?

The second point – that mt-genes may be important markers of cellular or organismal fitness that might be important to correct for – is an interesting possibility, and we investigated it further. Here (Author response image 1), we show that the first 10 gene expression principal components, which we include as covariates in the eQTL linear models, explain a nearly identical fraction of the total variance whether or not we include mt-genes in the gene expression matrix (the exact comparison is PCs from a gene expression matrix in which MT-genes are included, versus one where the median expression of MT-genes is set to the median across all samples, simulating zero variance contributed by MT-genes). Therefore, we believe our original principal component covariates are sufficient to capture the mitochondrial gene expression components that may reflect health status, and we did not alter the eQTL model or results from those in our original submission.

Author response image 1.


5) Several other tests authors performed are around eQTLs (eGene overlap and eSNP overlap) between the two species. These are typical tests evolutionary biologists usually try to do whenever data are available. However, the issues with these types of tests are the low power in general. More importantly, in order to be consistent with previous tests which are all around the explanation of gene expression variance, this part should address the overlap between expression vQTLs in humans and chimps.

We agree that eGene overlap analyses are limited by power, as eQTL analyses often require much larger sample sizes to detect modest eQTL effects. While we are fundamentally limited by the data available to us, we do address the overlap between dispersion and eGenes (revised Figure 6B, Figure 6—figure supplement 4). Unfortunately, addressing any potential inter-species overlap of variance QTLs (vQTLs) would require sample sizes larger than we can obtain. For example, [Sarkar et al., PLoS Genet. 2019] suggests that sample sizes in the thousands are required to identify genetic variants that explain variance independent of mean expression.

6) I would like to see more discussion about the inter-relatedness of the chimpanzees in the analysis of gene expression. Is that contributing to the power of the DE analysis, which has really high numbers of DE genes? That may certainly be due to the large sample size, but should be addressed. Related to that, the support that the gene-wise dispersion estimates are well correlated in humans and chimpanzees overall (Figure 1C, and Figure 2—figure supplement 1) seems qualitative. It looks like the chimpanzees might have less dispersion overall?

Thank you for the good suggestion regarding the contribution of inter-related chimpanzees to DE analysis. We performed a set of new analyses to address this (revised manuscript Figure 1—figure supplement 3). First, we clustered our chimpanzee samples based on genetic relatedness, identifying groups of inter-related chimpanzees and unrelated chimpanzees as test and control groups to investigate further. We note that qualitatively, the genome-wide expression measurements of chimpanzee individuals that are more closely related are not more similar than those of unrelated individuals, likely due to confounding factors in the data, such as the batch effect of RNA isolation or other technical effects. We further used the VariancePartition R package to quantify the effects of inter-relatedness and technical batch effects, finding that technical batch effects likely have a greater effect on gene expression and DE analysis. Finally, we addressed this question empirically by repeating the DE analysis with the test (inter-related chimpanzee samples) and control (unrelated samples) groups and quantifying the number of DE genes and the estimated false positive rate. We do not find any meaningful differences that would point to inter-related samples having effects of similar magnitude to the technical batch effects.

7) What do the authors think these findings mean for study systems outside of humans and captive chimpanzees? Both on the technical level (e.g. sample size), and for how their approach could be helpful outside of these species. Generalizing this approach would broaden the impact and audience of the paper.

Referring to our analysis of DE power, this is an excellent question, but answering it empirically is outside the scope of this paper, so we can only speculate. We reason that our findings may depend on the species comparison (species which have diverged more may have greater differences, and thus a similarly sized study may identify more differences), and the tissue type (tissues with high inter-individual variability due to technical or environmental factors may have less power). We have added a discussion of this in the revised manuscript. Further, we provide an analysis of GTEx tissues which quantifies the variability (gene-wise overdispersion parameter estimate measured from a negative binomial) across GTEx samples for different tissues (Author response image 2, top). We note that the median amount of population overdispersion within a tissue negatively correlates with the number of eGenes detected in GTEx (independent of GTEx sample size, Author response image 2 bottom left, bottom right). We believe this may serve as a reference as to which tissues may have more power in DE analysis or eQTL analysis due to the degree of inter-individual variability. We have decided not to include this analysis in the main manuscript, as we feel it does not easily fit the existing narrative's focus on inter-species differences in variability and eQTLs.

Author response image 2.
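
To make the idea of a gene-wise overdispersion estimate concrete, here is a minimal sketch. It is not the procedure behind Author response image 2: it assumes the negative binomial parameterization Var = mu + phi * mu^2 and swaps in a simple method-of-moments estimator on raw counts, with no library-size normalization. The function names and the toy counts are hypothetical.

```python
from statistics import mean, median, variance


def nb_overdispersion(counts):
    """Method-of-moments estimate of phi in Var = mu + phi * mu**2.

    Assumption for illustration only: counts are raw read counts for one
    gene across samples, already comparable between samples.
    """
    mu = mean(counts)
    if mu <= 0:
        return 0.0
    var = variance(counts)  # sample variance (n - 1 denominator)
    # Clip at zero: a gene no more variable than Poisson gets phi = 0.
    return max((var - mu) / mu ** 2, 0.0)


def median_tissue_overdispersion(gene_counts):
    """Median of the gene-wise estimates for one tissue.

    gene_counts maps a gene name to its list of counts across samples.
    """
    return median(nb_overdispersion(c) for c in gene_counts.values())


# Hypothetical toy counts for two genes across five samples.
toy_tissue = {
    "geneA": [10, 10, 10, 10, 10],  # no variability, phi is clipped to 0
    "geneB": [2, 8, 20, 5, 15],     # more variable than Poisson
}
tissue_summary = median_tissue_overdispersion(toy_tissue)
```

A per-tissue summary like `tissue_summary` could then be compared against the number of eGenes detected in each tissue, which is the kind of relationship the response describes.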


8) Did the authors test directly whether eQTLs were enriched in genes with a high dispersion? I could not find this going back through the paper. This seems almost trivially likely to be true. I may have missed this result? Or did the authors worry this is too likely to be confounded with cell type? Either way, this seems like a result that may be useful to show even if the authors did acknowledge that it was likely to be confounded.

Yes. As the reviewer hypothesizes, eQTL-containing genes (eGenes) have higher dispersion than non-eGenes. We have added a new figure (revised manuscript Figure 6—figure supplement 4) to show this. Related to this point, in the initial submission we showed that genes that are eGenes specifically in chimpanzee and not in human have higher dispersion in chimpanzee than in human, and vice versa (revised manuscript Figure 6B).

9) Did the authors consider looking for cell-type QTLs? They state several times in the paper the possibility that genetic factors may influence cell types. They have enough data – at least in human – to obtain QTLs for specific cell types, as others have done (Marderstein et al., 2020; Donovan et al., 2020). If these cell type QTLs were enriched near genes with a high dispersion, this may bolster the author's argument that genetic factors underlie dispersion by affecting cell type composition.

Thank you for the good suggestion. We performed the reviewer’s suggested analysis (Author response image 3). We used cell composition estimates from [Donovan et al., 2020] on GTEx heart left ventricle samples and performed a genome-wide association study in an attempt to identify variants that associate with cell type composition (similar to [Marderstein et al., 2020]). We used a linear mixed model with a genetic relatedness matrix to account for ancestry, included sex as a covariate, and used the quantile normalized proportion of cardiac muscle cells as the response variable. No loci achieved the standard stringent genome-wide significance threshold of 5E-8 (Author response image 3, top), but we nonetheless asked whether the top 100 loci (estimated FDR<0.5) are closest to highly dispersed or lowly dispersed genes (Author response image 3, lower left). Furthermore, under the hypothesis that eQTLs may reflect direct effects on cell type composition, we asked whether eQTLs (summary statistics obtained from GTEx) have inflated P-values for association with cell type composition (Author response image 3, lower right). We did not find any meaningful effect, which suggests that the GTEx mapping pipeline does a good job of accounting for cell type heterogeneity through PEER, that we lack power to detect such cell type QTLs, or both.

Author response image 3.
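
One common way to quantile normalize a response variable before association testing is a rank-based inverse normal transform. The sketch below assumes that reading, since the response does not specify the exact transform, and pairs it with a filter at the 5E-8 threshold quoted above. The function names, the `offset` parameter, and the toy values are hypothetical, and the mixed-model association test itself is not shown.

```python
from statistics import NormalDist

GENOME_WIDE_THRESHOLD = 5e-8  # threshold quoted in the response


def inverse_normal_transform(values, offset=0.5):
    """Map values to standard-normal quantiles by rank.

    Ties receive the average rank of their group. The offset of 0.5 is one
    conventional choice and is an assumption here.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - offset) / n) for r in ranks]


def genome_wide_hits(p_values, threshold=GENOME_WIDE_THRESHOLD):
    """Return the loci whose association P-value is below the threshold.

    p_values maps a locus identifier to its P-value.
    """
    return [locus for locus, p in p_values.items() if p < threshold]


# Hypothetical cardiac muscle cell proportions for five samples.
proportions = [0.62, 0.48, 0.71, 0.55, 0.66]
response = inverse_normal_transform(proportions)
```

The transform keeps the ordering of samples but removes the influence of the original scale and of outlying proportions, which is why it is often applied before fitting a linear model.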


Our analysis in response to this point led us to an additional analysis which is not directly related to this reviewer’s point, but nonetheless we have included in the revised manuscript (revised manuscript Figure 4—figure supplement 2) and we will briefly summarize here: The cell type composition estimates from [Donovan et al., 2020] include all GTEx heart samples (from both left ventricle, and atrial appendage tissues, often from the same set of individuals). We used this as an opportunity to examine whether the gene expression variability due to cell type composition may have a strong genetic or individual component, as opposed to being a completely technical artifact of inconsistencies in tissue dissection and sample acquisition. We reasoned that if intentionally anatomically different heart sections (left ventricle, versus atrial appendage) from the same individual correlate better than matched tissue samples from different individuals, then the cell type composition differences across our chimpanzee samples likely are driven by individual level differences, rather than technical differences in sample acquisition. In Figure 4—figure supplement 1-A we show qualitatively that there is an obvious individual level correlation between the atrial appendage samples and left ventricle samples for matched individuals. We used a linear mixed model (implemented with VariancePartition R package) to quantify the contribution of individual level versus tissue level factors to cell type composition estimates, and find that the individual level factor generally explains more variance. We interpret this as strong evidence that the sample to sample differences captured by our dispersion estimates are driven largely by true differences between individuals, rather than random technical differences in sample acquisition and dissection.
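
The individual-versus-tissue comparison described above can be illustrated with a much simpler stand-in for the linear mixed model. The sketch below is not the VariancePartition analysis: it assumes a balanced design (every individual has one estimate per tissue) and uses method-of-moments variance components from a two-way layout without replication. The function name and the toy proportions are hypothetical.

```python
from statistics import mean


def variance_fractions(table):
    """Fraction of variance attributed to individual, tissue, and residual.

    table is a list of rows, one per individual; each row holds one cell
    type proportion estimate per tissue, in the same tissue order.
    """
    n, t = len(table), len(table[0])
    if n < 2 or t < 2:
        raise ValueError("need at least two individuals and two tissues")
    grand = mean(v for row in table for v in row)
    row_means = [mean(row) for row in table]
    col_means = [mean(row[j] for row in table) for j in range(t)]

    ss_ind = t * sum((m - grand) ** 2 for m in row_means)
    ss_tis = n * sum((m - grand) ** 2 for m in col_means)
    ss_tot = sum((v - grand) ** 2 for row in table for v in row)
    ss_err = ss_tot - ss_ind - ss_tis

    ms_ind = ss_ind / (n - 1)
    ms_tis = ss_tis / (t - 1)
    ms_err = ss_err / ((n - 1) * (t - 1))

    # Method-of-moments components, clipped at zero.
    comps = {
        "individual": max((ms_ind - ms_err) / t, 0.0),
        "tissue": max((ms_tis - ms_err) / n, 0.0),
        "residual": max(ms_err, 0.0),
    }
    total = sum(comps.values())
    if total == 0:
        raise ValueError("no variance in the input")
    return {name: value / total for name, value in comps.items()}


# Hypothetical proportions: rows are individuals, columns are
# (left ventricle, atrial appendage).
toy_table = [
    [0.50, 0.62],
    [0.61, 0.70],
    [0.70, 0.79],
    [0.80, 0.91],
]
fractions = variance_fractions(toy_table)
```

In this toy table the two tissues from the same individual track each other closely, so the individual component receives the largest share, which mirrors the qualitative pattern the response reports.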

10) The scRNA-seq reference used for estimating cell types in heart tissue was derived from mice. Could this lead the authors to underestimate the degree to which cell types drive dispersion in genes that are variable between human and chimp? Genes that are variable between human/ chimp may also be more likely to be variable between either species and mouse, and perhaps this variability has led to them becoming more/ less of a marker of a specific cell population (and hence their dispersion in primates does not correlate with cell type specificity in mouse).

Good point! We agree that there is likely a general downward bias (i.e., regression dilution) in the estimated effect of cell type specificity on dispersion in primates, because the cell type specificity estimates are based on mouse single cell data. We added this point in the revised manuscript.

11) Have the authors tried estimating dispersion on top of what is expected based on differences in cell type? There are several strategies that might work for this: There are new strategies for estimating a posterior of cell type specific expression from a bulk sample, conditional on scRNA-seq data as prior information (Chu and Danko, 2020). These cell type specific expression estimates could then be analyzed for dispersion. Alternatively, it may also work to regress the estimated proportion of each cell type out of the dispersion estimates. While there are certainly a lot of pitfalls with using these strategies, especially in the setting shown here (all of this would work better if there were species matched reference data), they might provide an avenue for depleting the contribution of cell type differences from dispersion estimates.

Thanks for the helpful suggestion! Though we agree that there are naturally a lot of caveats to such an analysis (estimating cell type specific expression and dispersion from bulk based on a mouse reference), in the revised manuscript we describe a new analysis (Figure 4—figure supplement 1) using the suggested methodology of [Chu and Danko, 2020]: After estimating cell type composition and expression on a per individual basis, we then estimated dispersion on a per cell type basis in both human and chimpanzee. We present the results as a correlation matrix with hierarchical clustering. We find that while cell type specific expression measurements cluster by cell type before species, cell type specific dispersion estimates cluster in a complex pattern that is more strongly influenced by species. We find this consistent with the idea that dispersion across a population of bulk samples is meaningfully influenced by both cell type composition and genetic effects. In other words, when you are able to correct for the cell type composition effects, or rather, to estimate dispersion within a controlled cell type, the influence of genetic effects is made more obvious. As such, the chimpanzee and human cell specific dispersion estimates cluster partly by species, as the variation due to genetic variants is expected to almost completely segregate by species.

Furthermore, as we felt the cell type decomposition step of this new analysis was redundant with the cell type decomposition estimates we previously presented using the CIBERSORT algorithm, we replaced the original figure and methods description referencing CIBERSORT cell decompositions with the cell decomposition estimates from the methodology of [Chu and Danko, 2020] (revised manuscript Figure 4—figure supplement 2). The cell type proportion estimates from the two methods are generally well correlated (R² = 0.88 across all individuals and cell types), and the small discrepancies do not alter any primary conclusions of the paper.
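
An agreement statistic of this kind can be computed by pooling the estimates over all individuals and cell types and squaring the Pearson correlation. The following is a minimal sketch under that assumption; the response does not state exactly how the value was computed, and the function names, dictionary layout, and toy values are hypothetical.

```python
def pearson_r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy * sxy / (sxx * syy)


def pooled_agreement(method_a, method_b):
    """R^2 across all (individual, cell type) pairs present in both methods.

    Each argument maps an (individual, cell_type) tuple to a proportion.
    """
    shared = sorted(set(method_a) & set(method_b))
    return pearson_r_squared(
        [method_a[key] for key in shared],
        [method_b[key] for key in shared],
    )


# Hypothetical estimates from two deconvolution methods.
estimates_a = {
    ("ind1", "cardiomyocyte"): 0.60,
    ("ind1", "fibroblast"): 0.25,
    ("ind2", "cardiomyocyte"): 0.70,
    ("ind2", "fibroblast"): 0.20,
}
estimates_b = {
    ("ind1", "cardiomyocyte"): 0.55,
    ("ind1", "fibroblast"): 0.30,
    ("ind2", "cardiomyocyte"): 0.72,
    ("ind2", "fibroblast"): 0.18,
}
agreement = pooled_agreement(estimates_a, estimates_b)
```

Pooling across cell types tends to give a high value whenever both methods agree on which cell types are abundant, so a per cell type breakdown can be a useful complement.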

12) Can the authors add a dotted line to show the shape of the distribution for genes with low dispersion, or where dispersion is shared in both human and chimpanzee, in Figure 4B? Is this different from genes that are dispersed in either chimp or human?

We added a scatter plot inset to Figure 4A and Figure 4B of the revised manuscript to show the distribution of dispersion estimates for the plotted genes in both human and chimpanzee.

Associated Data


    Data Citations

    1. Fair BJ, Blake LE, Chavarria C, Sarkar A, Pavlovic BJ, Gilad YY. 2020. Gene expression variability in human and chimpanzee populations share common determinants. NCBI Gene Expression Omnibus. GSE151397
    2. Fair BJ, Blake LE, Chavarria C, Sarkar A, Pavlovic BJ, Gilad YY. 2020. Whole genome sequencing of 39 captive born chimpanzees. NCBI BioProject. PRJNA635393
    3. Fair BJ. 2020. Whole genome sequencing of 39 captive born chimpanzees. EBI European Variation Archive. PRJEB39475
    4. Pavlovic BJ, Blake LE, Chavarria C, Gilad Y. 2018. A Comparative Assessment of iPSC Derived Cardiomyocytes with Heart Tissues in Humans and Chimpanzees. NCBI Gene Expression Omnibus. GSE110471
    5. The GTEx Consortium 2019. GTEx Analysis V8. dbGaP. phs000424.v8.p2

    Supplementary Materials

    Figure 1—source data 1. RNA-seq datasets used in this study.
    Figure 1—figure supplement 1—source data 1. Full DE results.
    Figure 1—figure supplement 3—source data 1. Kinship matrix of chimpanzees in this study.
    Figure 2—source data 1. Gene-wise dispersion estimates and differential testing.
    Figure 3—source data 1. Full GSEA results based on human dispersion levels.
    Figure 3—figure supplement 1—source data 1. dN/dS correlation with dispersion by GO category.
    Figure 4—figure supplement 1—source data 1. Cell-type-specific expression and dispersion estimates.
    Figure 4—figure supplement 3—source data 1. Full GSEA results based on interspecies dispersion differences.
    Figure 5—source data 1. Gene-wise Pn/Ps statistics for chimpanzee and human.
    Figure 6—figure supplement 2—source data 1. Admixture group membership of chimpanzees in this study.
    Figure 7—figure supplement 1—source data 1. Full GO enrichment results of species-shared eGenes.
    Figure 7—figure supplement 1—source data 2. Full GO enrichment results of chimpanzee-specific eGenes.
    Supplementary file 1. Full GSEA results based on interspecies dispersion differences after excluding virally challenged chimpanzees.
    elife-59929-supp1.txt (70.7KB, txt)
    Supplementary file 2. Whole genome sequencing sample summary statistics.
    elife-59929-supp2.txt (3.6KB, txt)
    Supplementary file 3. Chimpanzee eGene summary statistics.
    elife-59929-supp3.txt (1.6MB, txt)
    Transparent reporting form

    Data Availability Statement

    RNA-Seq data are available under GEO accession number GSE151397. Raw whole genome sequencing data are available under SRA accession PRJNA635393. Processed whole genome sequencing data are available as variant calls at the European Variation Archive, EVA accession PRJEB39475.

    The following datasets were generated:

    Fair BJ, Blake LE, Chavarria C, Sarkar A, Pavlovic BJ, Gilad YY. 2020. Gene expression variability in human and chimpanzee populations share common determinants. NCBI Gene Expression Omnibus. GSE151397

    Fair BJ, Blake LE, Chavarria C, Sarkar A, Pavlovic BJ, Gilad YY. 2020. Whole genome sequencing of 39 captive born chimpanzees. NCBI BioProject. PRJNA635393

    Fair BJ. 2020. Whole genome sequencing of 39 captive born chimpanzees. EBI European Variation Archive. PRJEB39475

    The following previously published datasets were used:

    Pavlovic BJ, Blake LE, Chavarria C, Gilad Y. 2018. A Comparative Assessment of iPSC Derived Cardiomyocytes with Heart Tissues in Humans and Chimpanzees. NCBI Gene Expression Omnibus. GSE110471

    The GTEx Consortium 2019. GTEx Analysis V8. dbGaP. phs000424.v8.p2


    Articles from eLife are provided here courtesy of eLife Sciences Publications, Ltd
