Abstract
Motivation
Accurately detecting tissue specificity (TS) in genes helps researchers understand tissue functions at the molecular level. The Genotype-Tissue Expression project is one of the publicly available data resources, providing large-scale gene expression across multiple tissue types. Multiple tissue comparisons and heterogeneous tissue expression make it challenging to accurately identify tissue-specific gene expression. Distinguishing inlier expression from outlier expression is important for building population-level information and, in turn, quantifying TS. A robust and data-adaptive TS method that accounts for the heterogeneity of the data is still lacking.
Results
We found that the key to identifying tissue-specific gene expression is to properly define the concept of an expression population. For the linear regression problem, we developed a novel data-adaptive robust estimation approach (AdaReg) based on density-power weights under an unknown outlier distribution and a non-vanishing outlier proportion. For identifying TS, we considered the Gaussian-population mixture model. We took into account the heterogeneity of gene expression and applied the robust data-adaptive procedure to estimate the population parameters. With the well-estimated population parameters, we constructed the AdaTiSS algorithm.
AdaTiSS profiled TS for each gene and each tissue, standardizing gene expression in terms of TS. We provide a new robust and powerful tool for defining TS.
Availability and implementation
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
In the analysis of gene expressions, one important task is to detect tissue specificity (TS) of genes, i.e. whether the genes are housekeeping, or are significantly differentially expressed in one or multiple tissues. Accurately detecting TS in genes helps researchers understand tissue functions at the molecular level. Understanding TS at the molecular level helps us further identify disease mechanisms and discover tissue-specific therapeutic targets (Greller and Tobin, 1999; Kim et al., 2018).
The development of high-throughput technologies greatly improves the study of TS in genes. The Genotype-Tissue Expression (GTEx) project (GTEx Consortium, 2015) is an ongoing effort to build a comprehensive public resource to study tissue-specific gene expression and regulation. Based on the RNA-sequencing (RNA-seq) technology, up to version 7, the GTEx project has generated RNA expressions for 18 777 human protein-coding genes from 11 688 samples and 53 tissues. Other gene expression databases include the Tissue-specific Gene Expression and Regulation (TiGER) (Liu et al., 2008), the Human Protein Atlas portal (HPA) (Uhlén et al., 2015) and the Tissue-specific Gene DataBase in cancer (TissGDB) (Kim et al., 2018).
Large-scale databases provide valuable resources but also make the task of quantifying TS more challenging. If there were only two tissues, we could apply the t-test to test for differential expression. But now we have more than 50 tissue types in the GTEx project. Pairwise or triplet comparison is computationally inefficient and infeasible, and creates a burden of multiple hypothesis testing (Cavalli et al., 2011). Moreover, different tissues have different compositions and contextures. Under highly heterogeneous tissue expression, distinguishing inliers from outliers becomes challenging. We expect that a good TS measurement should not only identify overall TS, but also sensitively and robustly measure expression specificity for each tissue.
In the literature, several methods have defined TS from different angles. We summarized them in a workflow presented in Figure 1. Basically, there are two branches. One measures overall specificity without identifying the specific tissues, while the other is capable of identifying the specific tissue(s). Previous work (Kryuchkova-Mostacci and Robinson-Rechavi, 2017) reviewed several metrics in these two branches. Here, we listed a few overall-level metrics, including tau, Gini and Hg, reviewed in Kryuchkova-Mostacci and Robinson-Rechavi (2017). At the tissue level, we grouped the various methods into three categories. The proportion-based category includes the Preferential Expression Measure (PEM) (Huminiecki et al., 2003), the Expression Enrichment (EE) (Yu et al., 2006) and the Specificity Measurement (SPM) (Xiao et al., 2010). These measurements rely on a constraint on the total sum (or norm) of tissue expression. As a result, the presence or absence of a single extremely high expression can change the TS measurements dramatically.
Fig. 1.
Flowchart of TS score
The z-score relaxes such a constraint and is easy to interpret. In large-scale data analysis, when we measure many tissues, some tissues having related functions could have similarly high expression. These outliers can make up an unexpectedly large proportion. If we ignore the effect of outliers and simply estimate the mean and variance from all the samples, the TS based on the traditional z-score can be too conservative. Recent work tried to tackle this challenge by identifying inliers and outliers in a more robust way. Along the line of the z-score, Jiang et al. (2019) proposed the Robust regression z-score (REZ), which robustly estimates the sample mean and variance by Huber’s estimation, taking into account technical variations.
Another category is the fold-change-based approach. This approach includes methods of HPA criterion (Uhlén et al., 2015) and the Specificity Index (SI) (Dougherty et al., 2010). The work of Uhlén et al. (2015) took a direct way to cluster inlier and outlier tissues based on the fold-change of tissue expression for each gene. Their HPA criterion differentiated the highly expressed tissues (outliers, with at least 5-fold-change) from the rest of the tissues (inliers), in the comparison of tissue expressions in terms of the transcripts per kilobase million (TPM) or the reads per kilobase million (RPKM) from RNA-sequencing technology. Based on the inlier and outlier configuration, they classified genes into six fine categories, but it is still a categorical method. For a particular gene, the highest tissue might be expressed as a 10-fold-change or even a 20-fold-change compared to the rest of the tissues, but it is only characterized as ‘enriched’ without a continuous score. More importantly, the specificity based on the fold-change is not comparable across genes. For one gene, an enriched tissue having a 10-fold change compared to the rest does not mean that tissue is more specific in another gene having a 5-fold-change compared to the rest. In the work of Dougherty et al. (2010), they also considered fold-change to identify TS but they converted the values of fold-changes to their ranks across all the genes. They took the significance of the average rank of the fold-changes between one target sample and the rest of the samples as their TS score.
Our work pursues the line of z-score-based approach by proposing a new robust estimation to identify tissue specific expression. With the well-estimated robust TS scores, we can quantify specificity for each tissue in each gene and compare TS across genes, which builds standardized scores for future comparison across -omics. Such standardized scores also provide quantitative TS comparison in GO terms and pathways, refining the biology functional analysis to a precise level.
The outline of the rest of this article is as follows. In Section 2, we introduced the key concept of population in the problem of defining TS, and then proposed a data-adaptive and robust estimation method to get TS scores (AdaTiSS) under Gaussian-population mixture model. The algorithm to obtain the AdaTiSS was developed. In Section 3, we summarized the simulation results on estimating population parameters. Then the performance of several methods was compared in the application on the gene expression from the GTEx project. Finally, in Section 4, we concluded the article, and discussed some limitations and future work.
2 Materials and methods
2.1 Population
From previous works, we can see the first task to quantify TS is to distinguish inlier and outlier tissues, where the outliers can be the tissue-specific expressions. The outlier expressions vary from gene to gene. Some genes are highly expressed (enriched) in only one tissue, while some are highly expressed in multiple tissues. Among enriched tissues, some expression is moderately high and some is extremely high. Due to the complexity of outliers, our contribution is to focus on the inliers, defining the concept of expression population. Inside the population, some tissue is highly expressed while some is not, but there is still a balance. Meanwhile, the tissue-specific expressions that go outside such balance become outliers. In this work, we focused mainly on the outliers in high expression but not in low expression.
After introducing the concept of population, the next question is what variations are inside the population. Most of the previous works summarized the tissue information in medians, then defined TS by comparing tissue medians, which only considered the variation between tissues. We considered our population as containing not only between-tissue variation but also within-tissue variation. We allowed biological replicates in each tissue to incorporate within-tissue variation. When comparing sample expressions from multiple tissues, we considered the main effect in the population to be the tissue effect, which is confirmed in several studies from the analysis of tissue sample clustering (Jiang et al., 2020; Melé et al., 2015). In our own work of Jiang et al. (2020), we provided a website resource TSomics (http://snyderome.stanford.edu/TSomics.html), presenting RNA and protein expressions for the quantified human protein-coding genes. From there, for the majority of genes, the abundance distribution across tissues appears as one major density bump for the main population with some specific tissues outside the population. In biology, it is conventional to take the logarithm transformation assuming Gaussian noise in the analysis of expression data (Hill et al., 2008; Melé et al., 2015). Based on our experience and other previous works, here, we considered the population distribution as Gaussian but we did not assume a particular outlier distribution.
Once we can identify such a Gaussian population, it is natural to quantify TS in terms of the robust z-score:
$$\text{robust z-score} = \frac{\text{tissue median} - \text{population mean}}{\text{population standard deviation}},$$
measuring the distance from the tissue median to the population mean, standardized by the population standard deviation, for each tissue. In contrast to the traditional z-score, the mean and variance in our robust z-score are from the population and thus the score is more sensitive to the TS outliers. Importantly, when such tissue balances are similar in all the genes, then the robust z-scores are comparable across genes.
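The contrast between the traditional and the robust z-score can be illustrated with a toy example. This is only a sketch: the expression values are made up, and the population mean and standard deviation are replaced by the median and a scaled MAD as stand-in robust estimates, which is an assumption for illustration and not the AdaTiSS estimator.

```python
import statistics

def z_scores(tissue_medians, center, scale):
    """Standardize each tissue median by a given center and scale."""
    return {t: (m - center) / scale for t, m in tissue_medians.items()}

# Hypothetical log2 expression medians for one gene: four tissues that
# form the population and one highly expressed tissue.
tissue_medians = {"lung": 3.0, "liver": 3.2, "colon": 2.8, "skin": 3.1, "testis": 8.0}
values = list(tissue_medians.values())

# Traditional z-score: mean and SD over all tissues, outlier included.
naive = z_scores(tissue_medians, statistics.mean(values), statistics.stdev(values))

# Stand-in robust estimates (assumption): median and 1.4826 * MAD.
center = statistics.median(values)
mad = statistics.median(abs(v - center) for v in values)
robust = z_scores(tissue_medians, center, 1.4826 * mad)

print(round(naive["testis"], 2), round(robust["testis"], 2))
```

Because the outlying tissue inflates the overall mean and SD, its traditional z-score stays modest, while the score based on the population estimates is much larger.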
The remaining question is how to estimate the population given the presence of various outlier expressions. In statistics, we formulated this question as robust estimation for the population in a mixture model. The literature of robust estimation is very rich, such as median and the median absolute deviation (MAD), Efron's method of empirical null distribution (Efron, 2005) and Tukey's biweight fitting, etc. In the cases where all the samples can be modeled as a mixture model of multiple Gaussian components, one can use the Bayesian method from the EM algorithm (Cavalli et al., 2011). They built a pipeline in the method of SpeCond to determine which Gaussian component belongs to the population, and which to the outliers.
If the outliers are in a large proportion, and if their distribution is complicated and unknown, the problem becomes challenging. A robust, automatically data-adaptive method to estimate the population information was still lacking. Recently, for a general linear regression problem, we developed a novel data-adaptive robust estimation based on density-power-weight under unknown outlier distribution and non-vanishing outlier proportion (Wang et al., 2019). For quantifying TS, we restricted the multivariable model analyzed in Wang et al. (2019) to a univariate model in the Gaussian population, and robustly estimated the population information to get our data-adaptive quantitative TS in the form of a robust z-score (AdaTiSS). Our AdaTiSS took into account gene heterogeneities under various outlier proportions and magnitudes. It achieves robustness and data-adaptiveness by selecting a tuning parameter based on the data to optimize the population estimation. We summarize the procedure and the algorithm in the following subsections, and more statistical analysis can be found in Wang et al. (2019).
2.2 Data-adaptive robust TS scores
Suppose matrix is the gene expression in logarithm scale for genes from samples. The samples come from tissues and each tissue could have multiple biological replicates. The gene expression we worked on was after logarithm transformation. The transformed data is more symmetric and more Gaussian-distributed.
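Our AdaTiSS score for gene $i$ and tissue $t$ is defined by
$$\mathrm{AdaTiSS}_{it} = \frac{\operatorname{median}_{j \in t}(x_{ij}) - \hat{\mu}_i}{\hat{\sigma}_i}, \qquad (1)$$
where $x_{ij}$ is the log-scale expression of gene $i$ in sample $j$, the median is taken over the samples $j$ from tissue $t$, and $\hat{\mu}_i$ and $\hat{\sigma}_i$ are the estimated population mean and population standard deviation for gene $i$.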
We estimated population mean and population standard deviation for each gene and summarized tissue expression in the median level for each tissue in the same gene. The score was defined in the term of robust z-score. It measured the deviation of the tissue median expression from the population mean divided by the population standard deviation.
The in (1) are the estimates for the population mean and population standard deviation under data-selected . Our estimation enjoyed both the properties of robustness and data-adaptiveness. To achieve the goal of robustness, we applied a robust approach of down-weighting the outliers by density-power weight (Windham, 1995). To achieve the goal of data-adaptiveness, we applied data-adaptive procedure in our own work of Wang et al. (2021) to select the tuning parameter to optimize the estimation for each gene. Below we elucidated statistical details on how the estimates obtain such properties.
We formulated the gene expression as a mixture model with a Gaussian population distribution, i.e.
$$f_i(x) = \pi_i\,\phi(x;\theta_i) + (1-\pi_i)\,h_i(x). \qquad (3)$$
Here $f_i$ is the mixture density for gene $i$, $\phi(\cdot;\theta_i)$ is the Gaussian population (the null) density parameterized by $\theta_i = (\mu_i, \sigma_i^2)$ with mean $\mu_i$ and variance $\sigma_i^2$, and $h_i$ is the unknown outlier density. The $\pi_i$ is the population proportion, and $1-\pi_i$ is the outlier proportion. The model allows heterogeneous genes having population distributions with different $(\mu_i, \sigma_i^2)$'s. Here, we assumed a Gaussian distribution for the population expression under the logarithm transformation. In contrast to a traditional method assuming the full model on all the samples, our method relaxed the assumption on the outlier distribution. In our setting, we did not assume that the outlier proportion $1-\pi_i$ vanishes to zero as the sample size increases. In real data, the outlier proportion may not be small, especially for a gene highly expressed in multiple tissues. Under the unknown outlier distribution and non-vanishing outlier proportion, our goal was to estimate the null parameters $\theta_i$ for each gene.
We took the technique of weighting the mixture density by a power $\gamma$ of the population density to down-weight the outliers and therefore purify the reweighted mixture model. The rationale is that, under a proper $\gamma$, the outliers go to the tails of the reweighted null density and thus do not contribute much to the population estimation. The technique of density-power weighting has played an important role in robust estimation (Basu, 1998; Fujisawa, 2013; Fujisawa and Eguchi, 2008; Windham, 1995).
For the Gaussian-null mixture model, the estimates for the null parameters are summarized in Proposition 1. The estimate for the population mean and variance is the same as the one based on the $\gamma$-cross entropy in Fujisawa and Eguchi (2008) and the estimating equation in Windham (1995). Our estimator for the population proportion agrees with the result from minimizing the density power score in Kanamori and Fujisawa (2015). The estimation can be done in an iterative fashion. We illustrated the iterations of the population fitting in Figure 2. Along the iterations, the estimated population densities approach the underlying population density.
Fig. 2.

Histogram of a simulated data from Gaussian mixture model, i.e. . The blue bars indicate the samples from the population distribution and the green ones for the outliers. The black curve is the underlying mixture density. The curves in gradient colors from purple to blue are the fitted population densities along iterations and the blue curve is the final fitted density
Proposition 1.
For gene , consider the expressions where is the unknown outlier distribution. Our robust weighted log-likelihood function is
The robust estimates for and the estimate for satisfy
where
and is the Gaussian density.
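The iterative fitting can be sketched as follows. This is a minimal illustration, assuming Windham-type density-power-weighted updates for a Gaussian (weighted mean, and weighted variance inflated by one plus the exponent) and a proportion estimate obtained from the density power score; the exact expressions in Proposition 1 may differ in detail. The function name `dpw_fit`, the exponent value 0.5 and the simulated data are all hypothetical choices for the example.

```python
import math
import random

def gaussian_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def dpw_fit(x, gamma, tol=1e-6, max_iter=500):
    """Fit a Gaussian population by density-power weighting (assumed updates).

    weights: w_j = phi(x_j; mu, var) ** gamma
    mu  <- weighted mean
    var <- (1 + gamma) * weighted variance
    """
    n = len(x)
    # Initialize with the sample mean and variance, as in Algorithm 1.
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    for _ in range(max_iter):
        w = [gaussian_pdf(v, mu, var) ** gamma for v in x]
        total = sum(w)
        new_mu = sum(wi * v for wi, v in zip(w, x)) / total
        new_var = (1.0 + gamma) * sum(wi * (v - new_mu) ** 2 for wi, v in zip(w, x)) / total
        converged = abs(new_mu - mu) < tol and abs(new_var - var) < tol
        mu, var = new_mu, new_var
        if converged:
            break
    # Proportion estimate: mean of phi^gamma over the integral of phi^(1 + gamma).
    integral = (2.0 * math.pi * var) ** (-gamma / 2.0) / math.sqrt(1.0 + gamma)
    pi_hat = sum(gaussian_pdf(v, mu, var) ** gamma for v in x) / n / integral
    return mu, math.sqrt(var), pi_hat

# Simulated gene: 80% population N(0, 1) and 20% outliers N(6, 1).
random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(1600)] + [random.gauss(6.0, 1.0) for _ in range(400)]
mu, sd, pi_hat = dpw_fit(x, gamma=0.5)
print(mu, sd, pi_hat)
```

With the exponent set to zero all weights are equal and the fit reduces to the sample mean and variance, so the outliers pull the estimates; a positive exponent shrinks their weights toward zero.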
Based on the density-power weight to down-weigh the outliers, our estimation for the population parameters obtained the robustness property. To achieve the goal of data-adaptiveness, notice that the tuning parameter , the exponent of the density-power weight function, essentially controls the balance of bias and variation of the estimation (Windham, 1995, Fujisawa and Eguchi, 2008). Since different genes have various TS patterns, they selected different . Figure 3 shows a simulation result on fitting the population density under various ’s. When , the fitting is affected by the highly expressed outliers. When , the fitting becomes locally trapped such that it only locally fits well on a small set of null points. From the density fittings shown in Figure 3, what really matters is the goodness-of-fit (GOF) of the fitted population density to the mixture density on the population points. Hence, we evaluated the GOF in the measure of population distribution by
| (4) |
where is the local false discovery rate introduced in Efron (2005). Comparing the estimated (4) to an approximate value one gives our selection criterion for ,
| (5) |
Fig. 3.

Histogram of 1000 's from in the simulations where the points from are colored in blue and the ones from in green. The black curve is the underlying density for the mixture model. The blue curve is the fitted partial density of under where is selected from our data-adaptive selection procedure
Our data-adaptive robust estimation is the under the selected from (5). The selected varies from gene to gene, trying to optimize the estimation for each gene. Wang et al. (2021) provided more statistical analysis on the data-adaptive procedure. With the well-estimated population parameters , our AdaTiSS provided an approach to define the TS in terms of robust z-score in (1).
2.3 Algorithm of AdaTiSS
In this section, we detailed the steps to obtain our data-adaptive TS scores (AdaTiSS). For the intensity data from the microarray or the mass spectrometry platform, the Gaussian distribution is usually used as a convention after taking the log transformation. One can directly apply our robust fitting method there. For the data from RNA-seq, one can work on the log-transformed standardized TPM or RPKM. When comparing multiple tissues, it could happen that in some lowly expressed or unexpressed tissues, the TPM (or the RPKM) values are zero. There are several ways to deal with zeroes when taking the log transformation. One way is to add a small positive perturbation to the TPM near zero. Another way is to ignore the expression differences when the TPM is less than one. Here, we took the approach of setting any TPM less than one to one. Since there could be a density peak at zero on the log scale, we made adjustments in defining the population and provided criteria to determine whether such a zero peak contributes to the population. Another possible approach to dealing with zero expression is to take the square root or cube root, so that the variation at zero is under control. Since we focused more on high expression here, we preferred the log transformation, which better controls the variation in high expression. We summarized the procedure to implement AdaTiSS in Algorithm 1.
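The zero-handling rule described above (TPM values below one are set to one before the log transformation) can be sketched in a few lines. The function name `log2_tpm_floor` and the input values are hypothetical; base 2 follows the input description of Algorithm 1.

```python
import math

def log2_tpm_floor(tpm, floor=1.0):
    """Return log2(max(tpm, floor)); with floor = 1, low values map to 0."""
    return math.log2(max(tpm, floor))

raw_tpm = [0.0, 0.4, 1.0, 8.0, 1024.0]
log_tpm = [log2_tpm_floor(v) for v in raw_tpm]
print(log_tpm)  # [0.0, 0.0, 0.0, 3.0, 10.0]
```

All values at or below one collapse to zero on the log scale, which is the source of the zero peak that the later adjustment step has to account for.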
Algorithm 1: Implementation of AdaTiSS.
Input: Preprocessed expression matrix in logarithm scale at base 2, error tolerance (default ), step index .
Output: AdaTiSS for each gene and each tissue and fitted population parameters.
Initialization: take sample mean and sample variance to initialize for gene and do the same for all the genes
To obtain population parameters for each gene.
For in the sequence from 0 to 3 with increment 0.1
While or ,
based on the Proposition 1,
update from ;
update from ;
End while
Calculate based on the fitted according to Proposition 1.
Evaluate the estimate for in (4) from under each
End for
Select based on (5) and obtain .
For zero raw expression, to adjust the overall population parameters (optional, depending on the preprocessing step)
To obtain AdaTiSS for each gene and each tissue.
Based on population mean and standard deviation, obtain AdaTiSS defined in (1).
To diagnose the genes with low fitted population proportion.
Step 1: Preprocessing. First, normalization or standardization and batch-effect removal were done, following convention. Since this article does not discuss normalization further, we assumed the data were well normalized or standardized before applying AdaTiSS. Biological replicates were allowed in the data. For data having technical replicates, the average expression within the tissue from the same subject was calculated. Low expression was filtered.
Step 2: Fitting the population distribution. The population parameters were estimated from Proposition 1 under the selected tuning parameter based on the criterion defined in (5). For the TPM or RPKM data, the negative expressions in the log scale were concentrated at zero. All the positive expression was modeled in the mixture model stated in (3). The overall population was adjusted in the next step.
Step 3: Calculating TS scores. For the log transformed intensity expression, with the fitted population parameters from Step 2, their TS scores were directly obtained from (1) to (2). For the log transformed TPM or RPKM expressions, more adjustments were needed to determine the overall population. There were two complementary cases. In Case (i): the gene has fitted from Step 2 and the zero peak lies outside three standard deviations (SD) away from the fitted mean, as an example shown in the Panel (A) in Figure 4. For this case, the overall population is defined by the fitted Gaussian bump from the positive expressions. In Case (ii): the gene has fitted or the zero peak lies inside three SDs away from the fitted mean, which is the complement to Case (i), as an example shown in the Panel (B) in Figure 4. The overall population is then set based on the samples inside the SDs away from the fitted mean and also includes the zeroes. Then the population information is obtained by taking the mean, standard deviation and the proportion of these inliers.
Fig. 4.

Two examples of tissue-specific genes. Panel (A) is for a testis-specific gene (GPANK1) in Case (i) defined in Step 3 and panel (B) is for a brain-group-specific gene (NRG3) in Case (ii). The histograms are from gene sample expressions in log(TPM) from the GTEx dataset in version 7. In the histograms, the curves are proportional to the robustly fitted density; the vertical dashed line is the threshold at three standard deviations from the fitted mean in the right tail; the small bars at the bottom indicate tissue medians; the gray partially colored bar in the left tail is for the zero log(TPM). The barplot in each panel indicates the AdaTiSS for each tissue. The dashed lines are at 3 standard deviations from the fitted mean
Step 4: Diagnosis. Since the real data may be more complicated, one diagnosis is added to check whether the fitted population proportion is too small. Such a situation indicates that the fitting is locally trapped. If a gene has fitted population proportion less than , we marked this gene and further checked its fitting plot.
2.4 Methods in comparison
We summarized several commonly used methods for method comparison, including the standard z-score (Cheadle et al., 2003), Robust regression z-score (REZ) (Jiang et al., 2019), Preferential expression measure (PEM) (Huminiecki et al., 2003), Specificity Index (SI) (Dougherty et al., 2010) and HPA criterion (Uhlén et al., 2015).
In notation, matrix is the gene expression in logarithm scale with genes in rows and samples in columns. Before taking the log, the low expression (TPM or RPKM ) were set to 1. Let matrix be the average expression (TPM or RPKM) of the samples from the same tissue. For , .
Z-score. The standard z-score (Cheadle et al., 2003) for tissue in gene is defined by
where is the mean of ’s across tissues for gene and is their standard deviation
Preferential expression measure (PEM). The PEM score was to estimate how the expression of a gene is deviated from the expected expression under the assumption that the gene is uniformly expressed in all the tissues (Huminiecki et al., 2003). From the method comparison in the work of Kryuchkova-Mostacci and Robinson-Rechavi (2017), PEM score performed relatively well. The PEM for gene from tissue is defined by
where . The argument of the log function was modified by 1 for the small values since we here mainly cared about high expression and also avoided the case of taking log of non-positive values.
Robust regression z-score (REZ). The work of Jiang et al. (2019) considered another robust version of z-scores to evaluate the selective expression of genes. They measured the selective expression based on the robust regression z-score (REZ) whose population mean and standard deviation were calculated from Huber’s robust estimation with the weight proportional to inverse of technical variation. The REZ for gene from tissue is defined by
where are estimated from Huber linear regression of the sorted (in increasing order) regressed on the orders from 1 to , the regression weight associated to is proportional to the inverse of the standard deviation of technical replicates from tissue and the weights for the same gene are summed to 1.
Specificity Index (SI). The work of Dougherty et al. (2010) proposed an approach of identifying genes in specific cells based on the average rank of the fold-change between one target sample and the rest of the samples. In our setting, the SI for gene from tissue is defined by
where the rank is across all the genes. The low expression ( in TPM or RPKM by default in the pSI package) has been filtered as NA. As mentioned in Dougherty et al. (2010), the raw SI values are not directly comparable across different samples. They used the P-value () associated with each SI to indicate gene enrichment. To make the in the comparable scale of z-score, we took normal quantile transformation , where is the normal quantile function. The NAs in (P-value < 0.1) were set to 0 in the z-transformed .
The scores of AdaTiSS, z-score, REZ and z-transformed were then constrained in the range of , i.e. .
HPA criterion. The work of Uhlén et al. (2015) proposed an approach of classifying genes into different categories based on the fold-change to define tissue enrichment. The HPA criterion defined genes into six categories: tissue enriched (one tissue is at least five-fold-change higher than the rest of the tissues), group tissues (multiple tissues are all at least five-fold-change higher than the rest of the tissues), tissue enhanced (at least one tissue is at least five fold-change higher than the average expression of all the tissues), expressed in all (the gene is detected in all tissues), mixed and not detected. Since all other methods provided TS for each tissue and each gene, we further defined the enrichment 0-1 indicator matrix for HPA criterion. If a tissue expression was at least five-fold-change higher than the rest of the tissues for the same gene, that tissue contributed 1 to the indicator matrix otherwise it was zero.
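The 0-1 enrichment indicator can be sketched as below. This is an illustration under a stated assumption: "the rest of the tissues" is read as the largest TPM among the other tissues, which covers the single-tissue enriched case only; a group-enriched rule would need a different comparison. The function name and the example values are hypothetical.

```python
def hpa_enrichment_indicator(tissue_tpm, fold=5.0):
    """Flag a tissue with 1 when its TPM is at least `fold` times every other tissue.

    Assumption: the comparison is against the maximum TPM among the other
    tissues. Requires at least two tissues.
    """
    flags = {}
    for tissue, value in tissue_tpm.items():
        rest_max = max(v for t, v in tissue_tpm.items() if t != tissue)
        flags[tissue] = 1 if value > 0 and value >= fold * rest_max else 0
    return flags

example = {"testis": 60.0, "lung": 8.0, "liver": 2.0}
flags = hpa_enrichment_indicator(example)
print(flags)
```

The indicator is binary by construction, so a 5-fold and a 20-fold enrichment receive the same value, which is the limitation of categorical criteria noted in Section 1.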
To make all the scores from different methods comparable in the same scale, the score for gene , tissue was rescaled to for each method.
3 Results
In this section, we first carried out simulation studies on estimating the population parameters. We then compared the performance of AdaTiSS to the methods listed in Section 2.4. As we summarized in Section 1, there are several methods to define TS from different angles. Here, we applied them to the gene expression data generated from RNA-seq in the GTEx project (the dataset is from the release in version 7, downloaded from http://www.gtexportal.org). For the method comparison, a systematic correlation analysis between pairwise tissues was performed and the significance analysis of various GO terms among identified tissue-enriched genes was further applied.
3.1 Simulation studies in estimating population parameters
In the setting of identifying tissue-specific genes, distinguishing inliers from outliers is the key, and thus estimating the population parameters becomes important. We compared our data-adaptive robust population estimation introduced in Section 2.2 to several other methods in simulations. Their performances were evaluated by the mean squared error (MSE). The methods under comparison included the fixed robust estimation, Tukey's biweight estimation, the EM algorithm under two Gaussian mixtures and Efron’s local fdr estimation. In the simulation studies, the outlier distribution was considered to be Gaussian or -distribution under various mixture models. The comparison results were summarized in Supplementary Figures S1–S9 in Supplementary Materials. We concluded that the -robust estimation methods performed better than the other methods in various situations. The EM algorithm only worked well when the mixtures were well separated. The local fdr method was mainly designed under light outliers. Our data-adaptive estimation performed better overall than the fixed estimation, especially when the outliers were in a large proportion. Moreover, it was more adaptive to various outlier distributions.
In Supplementary Materials, following the similar analysis in Gaussian-population mixture model, we developed a data-adaptive procedure for the -population mixture model. We found that our data-adaptive procedure preferred light-tailed population density. In the real practice when applying our data-adaptive procedure, we suggested taking log transformation or other transformation to make the data more Gaussian distributed in order to maintain the detection power of our procedure.
3.2 Tissue expression correlation comparison
For the RNA expression data in the GTEx project, the tissue level expression was summarized as median value of TPMs from the tissue samples. Low-expressed genes were removed from the following analysis if all the tissue medians were less than one. In total, 17 719 protein-coding genes were left for our comparison analysis.
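The tissue-level summary and the low-expression filter described above can be sketched for a single gene as follows. The sample identifiers, tissue labels and function names are hypothetical, and the threshold of one TPM follows the text.

```python
import statistics
from collections import defaultdict

def tissue_medians(sample_tpm, sample_tissue):
    """Median TPM per tissue for one gene."""
    by_tissue = defaultdict(list)
    for sample, tpm in sample_tpm.items():
        by_tissue[sample_tissue[sample]].append(tpm)
    return {tissue: statistics.median(v) for tissue, v in by_tissue.items()}

def keep_gene(medians, min_tpm=1.0):
    """Keep a gene unless every tissue median is below min_tpm."""
    return any(m >= min_tpm for m in medians.values())

sample_tissue = {"s1": "lung", "s2": "lung", "s3": "liver", "s4": "liver", "s5": "liver"}
expressed = tissue_medians({"s1": 0.2, "s2": 0.4, "s3": 5.0, "s4": 7.0, "s5": 9.0}, sample_tissue)
silent = tissue_medians({"s1": 0.0, "s2": 0.1, "s3": 0.3, "s4": 0.2, "s5": 0.9}, sample_tissue)
print(keep_gene(expressed), keep_gene(silent))  # True False
```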
We calculated Pearson correlation between tissues based on the scores from the methods mentioned in Section 2.4. In Figure 5A–B, the correlations from the original TPM (0.91 in median) showed much higher values than those from AdaTiSS (0.01 in median), which agreed with the result in Jiang et al. (2019). Based on the scores from AdaTiSS, all the brain sub-tissues and pituitary were clustered together. The correlation analysis for other methods was summarized in Supplementary Figure S10 in Supplementary Materials.
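The tissue-by-tissue correlation can be sketched as below, where each tissue is represented by its vector of per-gene scores. The scores shown are invented for illustration, and the helper names are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length score vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def tissue_correlations(scores):
    """scores: tissue -> list of per-gene scores, genes in the same order."""
    tissues = sorted(scores)
    return {(s, t): pearson(scores[s], scores[t])
            for s in tissues for t in tissues if s < t}

# Hypothetical TS scores for four genes in three tissues.
scores = {
    "brain": [3.0, 0.1, -0.2, 2.5],
    "pituitary": [2.8, 0.0, 0.1, 2.0],
    "liver": [-0.1, 3.2, 0.0, 0.2],
}
corr = tissue_correlations(scores)
print(corr[("brain", "pituitary")], corr[("brain", "liver")])
```

When the scores isolate tissue-specific signal, related tissues such as brain and pituitary remain correlated while unrelated pairs drop toward zero or below, in contrast to raw TPM, where baseline expression drives high correlations everywhere.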
Fig. 5.

Tissue-wise Pearson correlation comparison. (A–B) Heatmap of Pearson correlation between tissue samples based on original tissue median TPM (A) and AdaTiSS (B). (C–D) Barplot of Pearson correlations comparison in the groups of Adipose—Breast (3 samples), Artery (3 subtypes), Cells (2 subtypes), Cervix—Uterus (3 samples), Esophagus Mucosa—Skin (3 samples) in (C) and the comparison in the groups of brain (13 subtypes), heart (2 subtypes) and muscle samples in (D). The error bar in (D) represents standard deviation from the average value
Physiologically similar tissues were expected to have high correlations. We considered several groups of tissues for comparison: the Adipose—Breast group [Adipose—Subcutaneous, Adipose—Visceral (Omentum), Breast—Mammary Tissue], the Artery group (Artery—Aorta, Artery—Coronary, Artery—Tibial), the Esophagus Mucosa—Skin group [Esophagus—Mucosa, Skin—Not Sun Exposed (Suprapubic), Skin—Sun Exposed (Lower leg)], the Cervix—Uterus group (Cervix—Ectocervix, Cervix—Endocervix, Uterus) and the Cells group (Cells—EBV-transformed lymphocytes, Cells—Transformed fibroblasts). Figure 5C shows the average correlation within each physiologically similar tissue group based on TS scores. The correlations based on AdaTiSS are comparable to those based on the standard z-score and REZ in the Adipose—Breast, Artery, Cells and Esophagus Mucosa—Skin groups. In the Cervix—Uterus group, AdaTiSS still preserves a relatively high correlation.
Since brain, heart and muscle are the major energy organs, they are expected to have highly correlated gene expression. The Pearson correlations were also compared across the brain (13 subtypes), heart (2 subtypes) and muscle samples. Figure 5D shows the comparisons of brain versus brain, heart versus heart, heart versus muscle and muscle-heart versus brain. The average correlations based on AdaTiSS are higher than those based on the other scores, especially for the muscle-heart group versus brain.
3.3 Tissue enriched genes comparison
Following the definition of Kryuchkova-Mostacci and Robinson-Rechavi (2017), the overall TS at the gene level was defined as the maximum (rescaled) score over all the tissue types for each gene. The closer the score is to 1, the more tissue-specific the gene expression is. Figure 6 shows the densities of the gene-level TS scores from each method. Different methods have different distribution patterns, which reflect different sensitivity levels. All the methods except PEM have a density peak at 1, while PEM has a small peak at around 0.87. The highest peak of AdaTiSS lies between those of PEM and the standard z-score. The gene-level scores for the HPA criterion were based on the 0-1 enrichment indicator matrix. Hence, its density has two peaks, one at 0 and the other at 1. Interestingly, AdaTiSS and z-pSI have an additional small peak between 0.625 and 0.75, which may indicate that a small group of genes shows moderate TS. As different methods have different sensitivity levels, it may not be reasonable to set a common threshold across methods to define tissue-specific genes. Hence, we took a data-driven approach and set the threshold from the data itself for each method. The specificity threshold was set as the median of the gene-level tissue-specificity scores, shown as the vertical lines in Figure 6. The specific tissues were those with scores higher than the threshold.
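The gene-level score and the median-based threshold described above can be sketched in a few lines. The example assumes the per-tissue scores have already been rescaled to [0, 1] (the rescaling itself is not shown), and the gene names, tissue names and values are made up.

```python
from statistics import median


def gene_level_scores(rescaled):
    """rescaled: gene -> tissue -> rescaled score; returns the max per gene."""
    return {gene: max(by_tissue.values()) for gene, by_tissue in rescaled.items()}


def specific_tissues(rescaled):
    """Threshold at the median gene-level score and list tissues above it."""
    threshold = median(gene_level_scores(rescaled).values())
    calls = {
        gene: sorted(t for t, s in by_tissue.items() if s > threshold)
        for gene, by_tissue in rescaled.items()
    }
    return threshold, calls


# Toy rescaled scores (made-up numbers).
toy_rescaled = {
    "G1": {"Testis": 0.95, "Liver": 0.10},
    "G2": {"Testis": 0.30, "Liver": 0.35},
    "G3": {"Testis": 0.20, "Liver": 0.80},
}
threshold, calls = specific_tissues(toy_rescaled)
print(threshold, calls)
```

Because the threshold is the median of the gene-level maxima, roughly half of the genes end up with no tissue above it, which is the intended effect of a per-method, data-driven cutoff.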
Fig. 6.

Density of gene-level tissue-specificity scores. The vertical lines are at the median values of the scores from each method
Similar to the work of Kryuchkova-Mostacci and Robinson-Rechavi (2017), based on the tissue-enriched gene lists from each method, we compared the significance of tissue enrichment for Gene Ontology (GO) terms related to tissue-specific or ubiquitous gene functions. The GO terms considered here were: GO0007283 spermatogenesis (expected to be specific to testis), GO0050877 neurological system process (expected to be specific to brain and other neural tissues), GO0006805 xenobiotic metabolic process (expected to be specific to liver and kidney), GO0006457 protein folding (expected to be ubiquitous), GO0061024 membrane organization (expected to be ubiquitous) and GO0008380 RNA splicing (expected to be ubiquitous). Tissue enrichment analysis was performed using Fisher’s exact test (Jain and Tuteja, 2019). However, unlike Jain and Tuteja (2019), who relied only on the HPA criterion, we defined the tissue enrichment criteria from the different methods. Figure 7 shows the significance of tissue enrichment across tissues for the six GO term gene lists under the different tissue enrichment criteria. Overall, the significance from all the methods agreed with prior knowledge from the GO term biological function annotation. Only the results based on the standard z-score did not reach a sufficient significance level for the GO term neurological system process in a few brain subtypes.
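For readers who want to see the enrichment calculation spelled out, the sketch below computes a one-sided Fisher's exact test P-value as an upper-tail hypergeometric probability, which is an equivalent formulation. The choice of a one-sided test, the function name and the toy counts are assumptions for illustration; the text does not specify these details.

```python
import math


def enrichment_p_value(n_total, n_enriched, n_term, n_overlap):
    """P(overlap >= n_overlap) when n_term genes are drawn from n_total genes,
    of which n_enriched are enriched in the tissue of interest."""
    upper = min(n_enriched, n_term)
    total = math.comb(n_total, n_term)
    tail = sum(
        math.comb(n_enriched, i) * math.comb(n_total - n_enriched, n_term - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy counts (made-up): 1000 genes, 100 enriched in a tissue,
# a GO term with 50 genes, 20 of which are enriched in that tissue.
p = enrichment_p_value(n_total=1000, n_enriched=100, n_term=50, n_overlap=20)
print("-log10(P) =", -math.log10(p))
```

The value of `-log10(P)` can then be compared with a reference line such as the one at 3 used in Figure 7.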
Fig. 7.

Significance of tissue enrichment across tissues for six GO term gene lists under different tissue enrichment criteria. The six GO terms are GO0007283 spermatogenesis, GO0050877 neurological system process, GO0006805 xenobiotic metabolic process, GO0006457 protein folding, GO0061024 membrane organization, GO0008380 RNA splicing. In each panel, the horizontal dashed line is at -log10(P-value) = 3
4 Discussion and conclusion
To define TS, we proposed a new method, AdaTiSS, which focuses on population estimation instead of tangling with heterogeneous outlier expression. AdaTiSS is based on our data-adaptive method AdaReg (Wang et al., 2021), which robustly estimates the population distribution and thereby yields robust and data-adaptive TS scores. Our TS scores quantify the TS in each tissue for each gene, and the scores are comparable across genes. We provide a new robust and powerful tool to the literature on TS identification. However, the proposed method still has some limitations, and more research is needed.
In the current work, we assumed that, when comparing samples from different tissues, the main effect is the tissue effect, as confirmed by several sample-clustering studies (Jiang et al., 2020; Melé et al., 2015). We therefore modeled the population distribution with only one covariate. Other effects, such as gender and age, could also affect the population. Note that such effects can be associated with the tissue effect; for example, some tissues are gender specific. Removing the gender effect may therefore reduce the sensitivity of TS detection. Incorporating more covariates in the population model requires careful consideration of model selection when the samples contain outliers. Thus, we used a simple univariate model for the population here and left model selection for more complex models as future work.
When considering samples from multiple tissues, we observed a balance within the population of tissue expressions, and highly expressed outliers may indicate TS. Therefore, we modeled the population with the symmetric Gaussian distribution. In Supplementary Materials, following procedures similar to those for the Gaussian population, we also developed a data-adaptive procedure with a t-distribution as the population. We found that when the population has heavy tails, our algorithm cannot distinguish which samples in the population tails are outliers and which are inliers, so the algorithm becomes unstable. In such cases, the concept of ‘population’ may not be valid. We think it is necessary to predefine the density of the population; otherwise all the samples can be in the population. Hence, we considered the population to be Gaussian here. A Gaussian population has only one density mode. However, the data could have two or more modes, or the tissue samples might not form any concentrated cluster. To address this concern, we added a diagnostic step to the current method: reporting the estimated population proportion. If the proportion was less than 70%, we flagged the case and checked its fit. In future work, we could provide another option for a population modeled as a mixture of multiple Gaussian components. Additional research is also needed to determine the number of components in the population.
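The diagnostic step mentioned above amounts to a simple filter on the estimated population proportion. In the sketch below, the 70% cutoff comes from the text, while the input layout, function name and proportions are hypothetical.

```python
def flag_low_population(pop_proportion, min_proportion=0.70):
    """Return the genes whose estimated population proportion is below the cutoff."""
    return sorted(
        gene for gene, prop in pop_proportion.items() if prop < min_proportion
    )


# Toy estimated population proportions per gene (made-up numbers).
toy_props = {"G1": 0.92, "G2": 0.65, "G3": 0.70, "G4": 0.41}
print(flag_low_population(toy_props))
```

Flagged genes are candidates for a manual check of the fit, for example to see whether the expression has more than one mode.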
Our data-adaptive procedure works best for a light-tailed population such as the Gaussian distribution. For data from microarray or mass spectrometry platforms, the Gaussian distribution is conventionally used after a log transformation. For RNA-seq data, we worked on the standardized expressions in TPM or RPKM. Several works have analyzed the log of TPM or RPKM for RNA data. In the analysis of transcriptome variation, Melé et al. (2015) used a mixed-effects model with Gaussian noise, and Li et al. (2017) detected individual outliers within a particular tissue based on the traditional z-score estimated from the normalized log of TPM. For our problem of quantifying TS, we made several adjustments for analyzing the standardized RNA data in Section 2. First, we took the log transformation to make the expressions more symmetric. Because we focused on high rather than low expressions, we concentrated the low expressions (TPM < 1) at 0 on the log scale to mitigate the inflation from the log transformation. Then, we modified the population concept for cases where the low expressions are not a small proportion. Finally, we established a criterion to determine whether the low expressions count toward the population.
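One simple way to realize the first adjustment is to floor the TPM at 1 before taking the log, so that all low expressions map to 0. The sketch below assumes a base-2 logarithm and this particular flooring rule; the text does not state the base, so both are illustrative assumptions.

```python
import math


def log_floor_tpm(tpm):
    """Log2-transform a TPM value, mapping every TPM below 1 to 0."""
    return math.log2(max(tpm, 1.0))


# Toy TPM values (made-up numbers).
for tpm in [0.0, 0.5, 1.0, 8.0]:
    print(tpm, "->", log_floor_tpm(tpm))
```

Flooring avoids the large negative values that a plain log would assign to TPMs near zero, which would otherwise stretch the lower tail of the distribution.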
Another approach to analyzing RNA expression works on the count data using the negative binomial distribution. Brechtmann et al. (2018) developed a method for detecting aberrantly expressed genes using an autoencoder. As they pointed out, their method does not work well if the outliers are not a small proportion, since the autoencoder cannot distinguish the outlier effect from the expected covariates. In our problem, where the outliers can make up a large proportion, fitting a negative binomial population in the presence of outliers is more challenging than fitting a Gaussian population, since it can be hard to distinguish overdispersion from true outliers. More effort is needed to develop a data-adaptive robust estimation for count data.
In the analysis pipeline, our proposed method is applied after preprocessing normalization and/or batch effect removal. The robustness of the preprocessing steps could affect the TS scores. One direction for future work is to combine the preprocessing steps with TS quantification to better detect outliers.
Funding
The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH and NINDS. The gene expression data used for the analyses described in this manuscript was obtained from the GTEx Portal in version 7 (https://gtexportal.org/home/datasets). This work has been supported by the GTEx grant [5U01HL13104203] and the CEGS grant [2RM1HG00773506].
Conflict of Interest: M.P.S. is a cofounder and is on the scientific advisory board of Personalis, Filtircine, SensOmics, Qbio, January, Mirvie, Oralome and Proteus. He is also on the scientific advisory board (SAB) of Genapsys and Jupiter. The other authors declare no competing interests.
Author contributions
M.W. developed the method. L.J. and M.P.S. helped analyze the data. M.P.S. supervised the project. All authors wrote and revised the article and contributed to the discussion.
Acknowledgements
The authors acknowledge the discussions with Dr. Hua Tang and Dr. Robert Tibshirani at Stanford and thank Inanc Birol, Associate Editor and three reviewers for the insightful comments that greatly improved the article.
Contributor Information
Meng Wang, Department of Genetics, Stanford University, Stanford, CA 94305, USA.
Lihua Jiang, Department of Genetics, Stanford University, Stanford, CA 94305, USA.
Michael P Snyder, Department of Genetics, Stanford University, Stanford, CA 94305, USA.
References
- Basu A. (1998) Robust and efficient estimation by minimising a density power divergence. Biometrika, 85, 549–559.
- Brechtmann F. et al. (2018) OUTRIDER: a statistical method for detecting aberrantly expressed genes in RNA sequencing data. Am. J. Hum. Genet., 103, 907–917.
- Cavalli F.M. et al. (2011) SpeCond: a method to detect condition-specific gene expression. Genome Biol., 12, R101.
- GTEx Consortium. (2015) The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science, 348, 648–660.
- Dougherty J.D. et al. (2010) Analytical approaches to RNA profiling data for the identification of genes enriched in specific cells. Nucleic Acids Res., 38, 4218–4230.
- Efron B. (2005) Local false discovery rates. Citeseer.
- Fujisawa H. (2013) Normalized estimating equation for robust parameter estimation. Electronic J. Stat., 7, 1587–1606.
- Fujisawa H., Eguchi S. (2008) Robust parameter estimation with a small bias against heavy contamination. J. Multivariate Anal., 99, 2053–2081.
- Greller L.D., Tobin F.L. (1999) Detecting selective expression of genes and proteins. Genome Res., 9, 282–296.
- Hill E.G. et al. (2008) A statistical model for iTRAQ data analysis. J. Proteome Res., 7, 3091–3101.
- Huminiecki L. et al. (2003) Congruence of tissue expression profiles from Gene Expression Atlas, SAGEmap and TissueInfo databases. BMC Genomics, 4, 31.
- Jain A., Tuteja G. (2019) TissueEnrich: tissue-specific gene enrichment analysis. Bioinformatics (Oxford, England), 35, 1966–1967.
- Jiang L. et al. (2020) A quantitative proteome map of the human body. Cell, 0092–8674.
- Jiang L. et al. (2019) DESE: estimating driver tissues by selective expression of genes associated with complex diseases or traits. Genome Biol., 20, 1–19.
- Kanamori T., Fujisawa H. (2015) Robust estimation under heavy contamination using unnormalized models. Biometrika, 102, 559–572.
- Kim P. et al. (2018) TissGDB: tissue-specific gene database in cancer. Nucleic Acids Res., 46, D1031–D1038.
- Kryuchkova-Mostacci N., Robinson-Rechavi M. (2017) A benchmark of gene expression tissue-specificity metrics. Brief. Bioinf., 18, 205–214.
- Li X. et al. (2017) The impact of rare variation on gene expression across tissues. Nature, 550, 239–243.
- Liu X. et al. (2008) TiGER: a database for tissue-specific gene expression and regulation. BMC Bioinformatics, 9, 271.
- Melé M. et al. ; GTEx Consortium. (2015) The human transcriptome across tissues and individuals. Science, 348, 660–665.
- Uhlén M. et al. (2015) Tissue-based map of the human proteome. Science, 347, 1260419.
- Wang M. et al. (2021) AdaReg: data adaptive robust estimation in linear regression with application in GTEx Gene Expressions. Stat. Appl. Genet. Mol. Biol., 2021, 20200042.
- Windham M.P. (1995) Robustifying model fitting. J. R. Stat. Soc. Ser. B (Methodological), 57, 599–609.
- Xiao S.-J. et al. (2010) TiSGeD: a database for tissue-specific genes. Bioinformatics (Oxford, England), 26, 1273–1275.
- Yu X. et al. (2006) Computational analysis of tissue-specific combinatorial gene regulation: predicting interaction between transcription factors in human tissues. Nucleic Acids Res., 34, 4925–4936.