GTEx pro enables accurate multi-tissue gene expression analysis using robust normalization and batch correction

D Jothi

doi:10.1038/s41598-025-20697-0

2025 Sep 23;15:32684. doi: 10.1038/s41598-025-20697-0


PMCID: PMC12457653 PMID: 40987889

Abstract

The Genotype-Tissue Expression (GTEx) project provides a valuable resource for investigating gene regulation across various human tissues. However, its cross-sectional design introduces technical artifacts and batch effects related to donor demographics and tissue processing. These confounders obscure biological signals and distort multi-tissue analyses. We present GTEx_Pro, a Nextflow-based pipeline for preprocessing GTEx v8 transcriptomic data, enhancing multi-tissue comparability. It integrates TMM + CPM normalization and SVA batch effect correction to improve biological signal recovery while reducing systematic variations across 54 GTEx tissues. Designed for scalability and reproducibility, GTEx_Pro facilitates accurate multi-tissue transcriptomic analysis, and a similar framework can be adapted to other large-scale transcriptome datasets.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-025-20697-0.

Subject terms: Computational biology and bioinformatics, Data processing

Introduction

Integrating and analyzing large-scale biological datasets is crucial for understanding complex biological processes and disease mechanisms. Public resources such as GTEx provide extensive gene expression data across diverse human tissues, enabling studies on gene expression trajectories, tissue-specific gene correlations, and expression quantitative trait loci (eQTL)^1,2. However, the GTEx dataset is cross-sectional: the tissue samples are derived from different individuals aged 20–79 from varying locations, which introduces technical artifacts and batch effects related to donor demographics and tissue/sample processing. GTEx_Pro, a preprocessing pipeline (Fig. 1), addresses these challenges by integrating Trimmed Mean of M-values (TMM) normalization, Counts Per Million (CPM) scaling, and Surrogate Variable Analysis (SVA) batch correction^3,4. TMM corrects for library size differences and compositional biases, ensuring that comparisons across samples are not skewed by sequencing depth or highly expressed genes. CPM then scales the adjusted counts to a per-million basis, making gene expression values comparable across samples. SVA further mitigates latent sources of variation, such as batch effects, improving the reliability of downstream gene expression analysis. Our results demonstrate enhanced biological signal recovery and reduced technical artifacts compared to other processing methods. Leveraging Nextflow and Docker, GTEx_Pro ensures automation, scalability, and reproducibility across computational environments. Its open-source design facilitates applications in diverse transcriptomic studies. Hence, GTEx_Pro supports more accurate multi-tissue gene expression analysis and downstream research applications by improving data cleaning and removing batch effects.

Fig. 1 — GTEx_Pro Preprocessing Workflow for Downstream Analysis of GTEx Tissues. This diagram (created in BioRender.com) illustrates the preprocessing steps performed in GTEx_Pro to analyze GTEx tissue datasets. The workflow begins with downloading the raw read count matrix, which undergoes data manipulation and filtering to pre-select a specific set of genes for further processing. Furthermore, quality control is performed by imputing missing, invalid values (Inf/NaN) and removing any potential outliers manually, followed by normalization using the TMM and CPM methods. Batch effects are then addressed using SVA with sex as a covariate. PCA is employed to explore principal component variance and tissue clustering quality. The pre-processed data can be subsequently used for various downstream analyses, such as tissue-specific expression studies.

Results and discussion

3D Principal Component Analysis (PCA) of the raw count data revealed substantial overlap between samples from different tissues, with clustering primarily driven by technical variation rather than biological signal (Fig. 2A). Following preprocessing by the TMM + CPM pipeline, 3D PCA showed a pronounced enhancement in tissue-specific clustering (Fig. 2A). Samples from the same tissue type, such as Heart-Atrial Appendage and Heart-Left Ventricle, or Skin-Sun Exposed and Skin-Not Sun Exposed, grouped together, reflecting an improved biological signal. Notably, biologically distinct tissues such as the Brain, Liver, Heart, and Skin clustered separately, as expected. However, the PC1 variance in TMM-normalized CPM values (TMM + CPM) was lower than that of raw read counts. This reduction is likely due to the scaling down of highly expressed genes during normalization, which adjusts for differences in sequencing depth and gene expression distribution. Normalization can occasionally lower principal component variance by minimizing the influence of highly expressed genes, thereby enhancing biologically relevant signals and improving tissue-based separation⁵. Furthermore, following SVA batch correction, distinct tissue clusters were further apart than after TMM + CPM processing alone. This is evident from the increased Euclidean distance⁶ between the tissue clusters, as shown in Fig. 2A. More specifically, the average Euclidean distance between tissue clusters increased after SVA batch correction across all 54 GTEx tissues (Fig. 2B, Supplementary Fig. 1), suggesting the effectiveness of the batch correction method. In addition, the variance explained by the first two principal components increased by 1.5% (Fig. 2C).

Fig. 2 — Comparison of clustering quality of the pre-processing pipeline TMM + CPM + SVA. (A) 3D PCA plot displaying the separation of samples by tissue type before normalization, after normalization, and after batch correction. The tissue clusters of Brain-Cortex, Heart-Atrial Appendage, Heart-Left Ventricle, Liver, Skin-Sun Exposed, and Skin-Not Sun Exposed are shown as representatives, and the Euclidean distance between PCA tissue clusters is denoted in red text. The 3D plots can be visualized using the links (https://dhana2403.github.io/3D_plots/3d_pca_plot_all_tissues_tmm.html) and (https://dhana2403.github.io/3D_plots/3d_pca_plot_all_tissues_sva.html). (B) The bar graph compares the average Euclidean distance after the TMM + CPM and TMM + CPM + SVA processing steps across 54 GTEx tissues. The centroid for each tissue is calculated as the mean of PC1, PC2, and PC3. (C) The bar graph illustrates the percentage of variance explained after applying TMM + CPM normalization and SVA batch correction to gene expression data. The percentage of variance corresponding to each principal component on the x-axis is shown at the top of each bar. (D) The bar graph compares the Davies-Bouldin Index (DBI) for tissue clusters after TMM + CPM normalization and SVA batch correction. DBI is calculated based on the first three principal components (PC1, PC2, and PC3). Lower DBI values indicate better separated and more compact clusters. (E) The plot compares the expression trajectory of the brain tissue-specific gene SNAP25 after the TMM + CPM and TMM + CPM + SVA processing steps across 54 tissue types. The distribution of expression values in brain tissue samples is highlighted in the purple box.

Furthermore, tissue clustering quality was estimated using the Davies-Bouldin index (DBI)^7,8. DBI measures the average similarity ratio of each tissue cluster with its most similar cluster, where a lower DBI indicates better clustering. After SVA batch correction, the DBI decreased, suggesting better clustering following the batch correction method (Fig. 2D). Additionally, a trajectory plot of the brain-tissue-specific gene SNAP25 was generated using the implemented workflow to assess the pipeline’s efficacy for downstream analysis. The distribution of expression values for brain tissue samples became more stable and consistent, with reduced variability, following the SVA batch correction method (Fig. 2E). This pattern was also observed in other tissue-specific genes, such as albumin (ALB) in the Liver, Pancreas, and Whole Blood, and keratin (KRT1) in Skin-SunExposed and Skin-NotSunExposed tissues (Supplementary Fig. 2). Moreover, these subtle improvements in PCA clustering quality lead to substantial alterations in the gene-gene correlation heatmap analysis (Supplementary Fig. 5). After SVA batch correction, the gene-gene correlations tend to be more positive than after TMM + CPM processing alone. This suggests that the stabilization of expression values in tightly regulated, tissue-specific genes strengthens the overall correlations within the collective gene group. Altogether, these findings indicate that even marginal enhancements in Euclidean distance and PCA clustering quality can substantially impact the reliability and interpretability of downstream gene expression analyses.

For benchmarking, traditionally used GTEx normalization methods, such as TPM, were compared to the TMM + CPM processing method, along with SVA batch correction or quantile normalization⁹. The comparison demonstrated that SVA batch correction plays a critical role in enhancing the pipeline’s effectiveness. Notably, the TMM + CPM + SVA and TPM + SVA approaches outperformed TPM + quantile normalization and TMM + CPM + quantile in enhancing the Euclidean distance between PCA tissue clusters (Supplementary Fig. 3A). To further assess the impact of the GTEx_Pro preprocessing pipeline on downstream gene expression analysis, gene expression trajectories across 54 GTEx tissues were evaluated for all four processing methods. The results indicated that both TMM + CPM + SVA and TPM + SVA displayed stable and consistent gene expression values compared to other groups. However, gene expression values derived from the TMM + CPM + SVA pipeline were consistently higher than those from TPM + SVA, suggesting that TMM + CPM normalization enhances expression estimates while maintaining distributional consistency (Supplementary Fig. 3B). Additionally, the pipelines were cross-validated using an alternative set of genes, including AKT1, GSK3B, GDF11, FOXO1, SESN2, ULK1, PGC, PINK1, PDPK1, BCL2, HMOX1, FIS1, TNF, and PARP1. The results revealed that, upon SVA batch correction, the pipeline consistently improved the Euclidean distances between tissue clusters for most tissues, mirroring the findings from the previous analysis. This demonstrates the robustness and effectiveness of the pipeline in enhancing tissue-specific expression profiles for most of the GTEx data (Supplementary Fig. 4).

Although the current pipeline is specifically designed for GTEx data, the underlying framework of TMM + CPM + SVA can be applied broadly to other RNA-seq preprocessing workflows. We anticipate that this approach will enhance downstream gene expression analysis and facilitate accurate tissue-specific drug discovery efforts¹⁰. Future pipeline developments will include integrating machine learning-based batch correction¹¹ methods to assess their effectiveness in further preserving biological signals, as well as development of an interactive web application to make the pipeline accessible to non-programmers. A key limitation of the current pipeline is that it applies only to the GTEx dataset. Future updates will focus on addressing this limitation to expand the applicability of the framework for all types of tissue-specific analysis.

Conclusion

Overall, the pipeline effectively integrates existing normalization and batch correction methodologies in a novel way while highlighting critical yet previously overlooked clustering variations that substantially impact downstream gene expression analysis. By addressing these factors, the pipeline improves the accuracy and reliability of gene expression analysis, thereby enhancing the robustness of subsequent biological interpretations.

Methods

Data acquisition and processing

The GTEx project is a public resource providing gene expression data across a diverse range of human tissues. In this study, we utilized GTEx v8 RNA sequencing data, comprising 54 tissues with a total of 17,235 samples. The dataset includes RNA-seq data from individuals of varying ethnic backgrounds and sexes. The GTEx v8 data can be accessed through the GTEx Portal (https://gtexportal.org) and are available in the form of raw read counts as well as normalized gene expression metrics (e.g., TPM, FPKM). To facilitate automated data retrieval, the GTEx_Pro pipeline directly downloads the raw read count file (GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct) along with the associated metadata files (GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt and GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt) from the GTEx portal.

Metadata processing

Following data acquisition, sample metadata was processed in R using the data.table, tidyverse, and dplyr libraries. Key metadata attributes were extracted and cleaned, including sample ID, tissue type, batch information, RNA integrity number (RIN), and ischemic time. Subject phenotype data were processed separately, with categorical variables converted into appropriate factor levels. The metadata files were merged based on the subject ID, and unique records were retained after filtering.

Gene expression data processing

Raw gene expression counts were imported using fread(), and gene identifiers were standardized by removing version numbers. The dataset was filtered to retain only the genes of interest as defined by the user. Infinite or missing values were imputed with the tissue-specific median expression for each gene, ensuring no rows were excluded from downstream analysis. Additionally, columns (samples) exhibiting zero variance were discarded to eliminate low-information data.
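The two cleaning steps described above (removing version suffixes from gene identifiers and imputing invalid values with a tissue-specific median) can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions and not the pipeline's R code: the function names `strip_version` and `impute_with_median` are hypothetical, identifiers are assumed to carry the version after the first dot, and each call to `impute_with_median` is assumed to receive one gene's values within a single tissue.

```python
import math
from statistics import median


def strip_version(gene_id):
    """Drop a trailing version suffix: 'ENSG00000000003.14' -> 'ENSG00000000003'."""
    return gene_id.split(".", 1)[0]


def impute_with_median(values):
    """Replace NaN/Inf entries with the median of the finite entries.

    `values` is assumed to hold one gene's expression across the samples
    of one tissue, so the median is tissue-specific.
    """
    finite = [v for v in values if math.isfinite(v)]
    if not finite:
        return list(values)  # no finite value to impute from; leave unchanged
    fill = median(finite)
    return [v if math.isfinite(v) else fill for v in values]
```

For example, `impute_with_median([1.0, float("nan"), 3.0, float("inf")])` fills both invalid entries with the median of the two finite values, so no row has to be dropped.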

Tissue-specific filtering and data storage

Samples were grouped by tissue type, ensuring that each group contained a minimum of two replicates. The processed expression data for each GTEx tissue was stored in .rds format in the directory data/processed/expression/readcounts_all/. The metadata was saved separately as attphe_all.rds for further downstream analyses.

Data normalization by TMM

To normalize gene expression data, we applied the TMM normalization method using the edgeR package. TMM normalization corrects for differences in sequencing depth and composition biases across samples, ensuring comparability of gene expression levels. Each tissue-specific read count file was loaded along with its corresponding metadata. Samples flagged as outliers (if specified) were excluded. The dataset was then filtered to retain only samples present in both the metadata and read count matrices. For each tissue, the normalization factors were calculated using calcNormFactors(), which applies TMM normalization¹². The normalized values were then scaled using the cpm() function. The final normalized values were saved in .rds format in the output directory. A summary table containing the sample counts for each tissue was compiled and saved as a CSV file (sample_counts.csv) (Table 1).
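The CPM scaling step can be sketched in a few lines of Python. This is a simplified illustration, not edgeR itself: it assumes the per-sample TMM normalization factors have already been estimated (in the pipeline that is the job of calcNormFactors()) and only shows how counts are divided by the effective library size, that is, the library size multiplied by the normalization factor. The function name `cpm` and its arguments are chosen here for illustration.

```python
def cpm(counts, norm_factors=None):
    """Counts per million using effective library sizes.

    counts: list of gene rows, each a list of per-sample raw counts.
    norm_factors: one scaling factor per sample (assumed to be precomputed,
    e.g. by TMM); defaults to 1.0 for every sample.
    """
    n_samples = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    if norm_factors is None:
        norm_factors = [1.0] * n_samples
    # Effective library size = raw library size x normalization factor
    effective = [size * f for size, f in zip(lib_sizes, norm_factors)]
    return [[row[j] * 1e6 / effective[j] for j in range(n_samples)]
            for row in counts]
```

With all factors equal to 1.0, two samples that differ only in sequencing depth (for example counts of 10 and 90 versus 20 and 180) receive identical CPM values, which is the depth correction described above.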

Table 1.

Sample count values for all the GTEx tissues generated in the normalization step

Tissue	Sample count
Adipose-Subcutaneous.rds	651
Adipose-Visceral.rds	541
AdrenalGland.rds	257
Artery-Aorta.rds	431
Artery-Coronary.rds	240
Artery-Tibial.rds	650
Bladder.rds	21
Brain-Amygdala.rds	152
Brain-Anteriorcingulatecortex.rds	176
Brain-Caudate.rds	246
Brain-CerebellarHemisphere.rds	215
Brain-Cerebellum.rds	241
Brain-Cortex.rds	255
Brain-FrontalCortex.rds	209
Brain-Hippocampus.rds	197
Brain-Hypothalamus.rds	202
Brain-Nucleusaccumbens.rds	246
Brain-Putamen.rds	205
Brain-Spinalcord.rds	159
Brain-Substantianigra.rds	139
Breast-MammaryTissue.rds	459
Cells-Culturedfibroblasts.rds	474
Cells-EBV-transformedlymphocytes.rds	165
Cervix-Ectocervix.rds	8
Cervix-Endocervix.rds	8
Colon-Sigmoid.rds	373
Colon-Transverse.rds	404
Esophagus-GastroesophagealJunction.rds	375
Esophagus-Mucosa.rds	553
Esophagus-Muscularis.rds	514
FallopianTube.rds	8
Heart-AtrialAppendage.rds	429
Heart-LeftVentricle.rds	430
Kidney-Cortex.rds	85
Kidney-Medulla.rds	4
Liver.rds	224
Lung.rds	576
MinorSalivaryGland.rds	162
Muscle-Skeletal.rds	790
Nerve-Tibial.rds	610
Ovary.rds	178
Pancreas.rds	327
Pituitary.rds	283
Prostate.rds	244
Skin-NotSunExposed.rds	604
Skin-SunExposed.rds	691
SmallIntestine-TerminalIleum.rds	187
Spleen.rds	240
Stomach.rds	357
Testis.rds	361
Thyroid.rds	651
Uterus.rds	141
Vagina.rds	154
WholeBlood.rds	733


Quantile normalization

To ensure the comparability of expression values across samples and mitigate technical variation, quantile normalization was applied to transcript-level expression matrices on a per-tissue basis. The input data consisted of pre-normalized matrices (e.g., TPM or TMM-normalized CPM) derived from RNA-seq datasets. Both sample metadata and expression matrices were loaded into R, and outlier samples identified a priori were excluded. Only samples present in both the expression and metadata files were retained. Quantile normalization was performed using the normalize.quantiles() function from the preprocessCore package¹³. This method aligns the empirical distribution of expression values across samples by sorting the values within each sample, averaging the values at each rank across samples, and reassigning these averages such that all columns share an identical distribution. Following normalization, the gene and sample identifiers were restored, and the normalized matrices were saved in .rds format for downstream analyses. The number of samples retained per tissue after filtering was recorded. In this study, quantile normalization was benchmarked within two distinct normalization pipelines: (1) TMM + CPM + Quantile Normalization and (2) TPM + Quantile Normalization. Although quantile normalization is more commonly applied alongside TPM, we also assessed its performance when combined with TMM-normalized CPM values to evaluate whether it provides consistent results in mitigating technical variation and improving the comparability of gene expression levels across different RNA-seq normalization strategies.
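The sort, average, and reassign procedure can be written out as a short Python sketch. It is a simplified stand-in for normalize.quantiles() rather than a reimplementation: tied values are not given special treatment here and simply receive neighbouring reference values in sort order. The function name `quantile_normalize` is chosen for illustration.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a genes x samples matrix (list of rows).

    After normalization every column (sample) contains the same set of
    values; only their assignment to genes differs between samples.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    # 1. Sort the values of each sample independently.
    sorted_cols = [sorted(matrix[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
    # 2. Reference distribution: mean across samples at each rank.
    reference = [sum(col[r] for col in sorted_cols) / n_cols
                 for r in range(n_rows)]
    # 3. Give each gene the reference value matching its rank in the sample.
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        order = sorted(range(n_rows), key=lambda i: matrix[i][j])
        for rank, i in enumerate(order):
            result[i][j] = reference[rank]
    return result
```

Because every column ends up with exactly the reference distribution, between-sample differences in overall scale are removed, while the within-sample ranking of genes is preserved.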

Batch correction by SVA

To correct for batch effects and hidden confounders in gene expression data, we employed Surrogate Variable Analysis (SVA) and the removeBatchEffect() function. The number of surrogate variables (SVs) was estimated using the num.sv() function with the ‘be’ method¹⁴, a technique based on parallel analysis. For sex-specific tissues, SVA was omitted because the SVA model incorporates sex as a covariate to account for sex-related expression differences¹⁵, and sex is constant within these tissues. In these cases, batch effects were corrected using removeBatchEffect(), with batch1 included as a covariate. Tissues with fewer than 20 samples were excluded from batch effect correction due to the insufficient sample size for reliable adjustment. The batch-adjusted gene expression data for all tissues were saved in .rds format within the adjusted_sva_all directory.
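The idea of removing a known, additive batch covariate can be illustrated with a small Python sketch. It is deliberately simplified and makes several assumptions: it handles one gene at a time, treats the batch as a single categorical label, and includes no other covariates. It does not estimate surrogate variables, so it is not SVA, and it is not the removeBatchEffect() implementation; the function name `center_batches` is hypothetical.

```python
from collections import defaultdict


def center_batches(values, batches):
    """Remove additive batch shifts from one gene's expression values.

    Each batch mean is moved to the average of all batch means, so that
    systematic offsets between batches disappear while within-batch
    differences between samples are kept.
    """
    groups = defaultdict(list)
    for value, batch in zip(values, batches):
        groups[batch].append(value)
    batch_means = {b: sum(vs) / len(vs) for b, vs in groups.items()}
    target = sum(batch_means.values()) / len(batch_means)
    return [value - (batch_means[batch] - target)
            for value, batch in zip(values, batches)]
```

For values `[1, 3, 11, 13]` measured in batches `["a", "a", "b", "b"]`, the offset of ten units between the batches is removed and both batches end up centered on the same mean.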

Principal component analysis (PCA)

Interactive 3D plot visualization

Normalized gene expression data were obtained from multiple tissues, each represented as an individual file. Negative values, if present, were replaced with a small positive constant (1e-9) to ensure numerical stability. Data matrices from all tissues were then merged into a single dataset. PCA was performed using the prcomp() function in R¹⁶, with centering and scaling applied to standardize gene expression values across tissues. The variance explained by each principal component (PC) was computed and reported. A three-dimensional PCA plot was generated using plotly¹⁷ to visualize sample clustering across tissues. Principal components 1, 2, and 3 were selected, and the variance explained by each was annotated accordingly. Specific colors were assigned to predefined key tissues (e.g., Brain-Cortex, Heart-AtrialAppendage, Liver) for representation, while colors for the remaining tissues were randomly assigned for distinct visualization. The interactive PCA plot was saved as an HTML file.

Quantification of tissue similarity using cluster distance metrics

To assess tissue similarity and the effect of batch correction on gene expression variation, we computed the Euclidean distances between tissue clusters in principal component (PC) space, using both uncorrected (TMM + CPM) and batch-corrected datasets (TMM + CPM + SVA). Tissue cluster centroids were computed as the mean coordinates of PC1, PC2, and PC3 for each tissue. Pairwise Euclidean distances between tissue centroids were calculated to quantify tissue similarity for TMM + CPM and TMM + CPM + SVA approaches. The distance matrices were converted into the long format and merged for comparative visualization. The average inter-tissue distance was computed for each tissue across both datasets. A bar plot was generated using ggplot2 to compare the average distance between tissue clusters under the two approaches. These distances demonstrate how “separate” the tissues are from each other in the PCA space. For instance, if tissues A and B have centroids that are very close to each other in PCA space, the Euclidean distance between A and B will be small, indicating that the two tissues are more similar in terms of the principal components. If tissues A and C are farther apart in PCA space, the Euclidean distance will be larger, indicating greater dissimilarity.
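The centroid and distance computation described above can be sketched in Python as follows. This is an illustrative sketch, assuming the PCA scores are already available as (PC1, PC2, PC3) tuples with one tissue label per sample; the function names `tissue_centroids` and `mean_centroid_distance` are chosen here and are not taken from the pipeline.

```python
import math
from collections import defaultdict
from itertools import combinations


def tissue_centroids(pc_scores, tissues):
    """Mean PC coordinates per tissue; pc_scores holds one tuple per sample."""
    groups = defaultdict(list)
    for point, tissue in zip(pc_scores, tissues):
        groups[tissue].append(point)
    return {t: tuple(sum(axis) / len(points) for axis in zip(*points))
            for t, points in groups.items()}


def mean_centroid_distance(centroids):
    """Average Euclidean distance from each tissue centroid to all others."""
    if len(centroids) < 2:
        raise ValueError("at least two tissues are needed")
    totals = defaultdict(float)
    for a, b in combinations(centroids, 2):
        d = math.dist(centroids[a], centroids[b])
        totals[a] += d
        totals[b] += d
    n_others = len(centroids) - 1
    return {t: totals[t] / n_others for t in centroids}
```

Running both functions on the scores from the uncorrected and the batch-corrected data gives one average distance per tissue and per approach, which is the quantity compared in the bar plot.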

Variance computation

For each approach, normalized values were log-transformed (with negative values replaced by a small positive constant, 1e-9). PCA was then performed using the prcomp() function in R, with both centering and scaling enabled. The percentage of variance explained by each principal component (PC) was computed from the eigenvalues of the covariance matrix as:

$$\text{Variance explained}_i\,(\%) = \frac{\sigma_i^2}{\sum_{j} \sigma_j^2} \times 100$$

where $\sigma_i^2$ represents the variance of the ith principal component, and the denominator corresponds to the sum of all PC variances. The variance explained by the first 10 PCs was extracted for visualization.
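As a worked illustration of this ratio, the following Python sketch converts a list of per-component standard deviations (such as the `sdev` element returned by prcomp()) into percentages of variance explained. The function name `percent_variance_explained` is hypothetical, and the input values in the example are made up.

```python
def percent_variance_explained(sdev):
    """Percentage of total variance carried by each principal component.

    sdev: standard deviations of the principal components; their squares
    are the PC variances (the eigenvalues).
    """
    variances = [s ** 2 for s in sdev]
    total = sum(variances)
    return [100.0 * v / total for v in variances]
```

For standard deviations of 3 and 1, the variances are 9 and 1, so the first component accounts for 90% of the total variance and the second for 10%.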

Clustering quality assessment using the Davies-Bouldin index

To evaluate the impact of batch correction on clustering quality, we computed the Davies-Bouldin Index (DBI) for transcriptomic data processed with two different approaches. For each approach, normalized values were log-transformed, and PCA was performed using prcomp(). The first three principal components (PC1–PC3) were extracted for clustering assessment. Clustering quality was evaluated using the Davies-Bouldin Index (DBI), a measure of intra-cluster similarity and inter-cluster separation.

The DBI was computed as:

$$\mathrm{DBI} = \frac{1}{N} \sum_{i=1}^{N} \max_{j \neq i} \frac{S_i + S_j}{d_{ij}}$$

where N is the number of clusters, $S_i$ and $S_j$ represent the within-cluster scatter (average distance of points to their cluster centroid) of clusters i and j, and $d_{ij}$ denotes the inter-cluster distance between their centroids. Lower DBI values indicate better clustering quality, implying greater separation between tissue clusters. Clusters were assigned based on tissue labels, and DBI values were computed using the index.DB() function from the clusterSim R package¹⁸.
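The definition above translates directly into a short Python sketch. It assumes Euclidean distances and centroid-based scatter, as in the description, and is not a reimplementation of index.DB(), whose options are not reproduced here. The function name `davies_bouldin` is chosen for illustration.

```python
import math
from collections import defaultdict


def davies_bouldin(points, labels):
    """Davies-Bouldin index for labelled points (lower is better).

    points: coordinate tuples (e.g. PC1-PC3 scores); labels: cluster label
    per point (e.g. tissue). At least two clusters are required.
    """
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    keys = list(groups)
    centroids = {k: tuple(sum(axis) / len(groups[k]) for axis in zip(*groups[k]))
                 for k in keys}
    # Within-cluster scatter: mean distance of members to their centroid.
    scatter = {k: sum(math.dist(p, centroids[k]) for p in groups[k]) / len(groups[k])
               for k in keys}
    total = 0.0
    for i in keys:
        # Similarity of cluster i to its most similar (worst-case) neighbour.
        total += max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                     for j in keys if j != i)
    return total / len(keys)
```

Moving two equally compact clusters further apart lowers the index, which matches the interpretation that a lower DBI reflects better separated tissue clusters.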

Gene expression trajectory analysis

To investigate the impact of the batch correction on downstream analysis, we analyzed the expression of tissue-specific genes like albumin (ALB), keratin (KRT1), and Synaptosomal-associated protein 25 (SNAP25) (Fig. 2E, Supplementary Fig. 2) in both batch-corrected and uncorrected datasets. For each approach, tissue-specific .rds files were loaded, and the expression values of the gene of interest were extracted and log10-transformed. To compare the impact of batch correction on gene expression values across multiple tissues, a line plot was generated using ggplot2¹⁹. The tissue types were used as categorical variables on the x-axis, while the log-transformed gene expression values were plotted on the y-axis. Lines connect mean expression values per tissue for each normalization method. Points represent individual tissue-specific expression values for both TMM + CPM and TMM + CPM + SVA processing methods. The results demonstrate a direct comparison of gene expression trends across tissues of processing methods TMM + CPM and TMM + CPM + SVA, highlighting potential shifts introduced by SVA correction in tissue-specific genes.

Correlation analysis

Gene-gene correlation analysis was performed on tissue-specific gene expression datasets processed using the TMM + CPM and TMM + CPM + SVA pipelines. For each tissue, gene expression data were filtered to remove genes with zero variance. Spearman correlation was computed between gene pairs using the cor() function²⁰ with the “spearman” method and “pairwise.complete.obs” to handle missing values. Missing correlation values were replaced with zero. Heatmaps were generated with hierarchical clustering of genes and samples using the pheatmap package²¹. The resulting heatmaps were saved in the respective directory.
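A Spearman correlation between two genes is a Pearson correlation computed on ranks, which the following Python sketch spells out. It is an illustration under assumptions rather than the cor() implementation: ties receive average ranks, missing values are assumed to have been removed beforehand, and an undefined correlation (zero variance) is returned as zero to mirror the replacement of missing correlation values described above. All function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks enter the calculation, a monotonic but non-linear relationship between two genes still yields a correlation of 1, which makes the measure less sensitive to extreme expression values than a Pearson correlation on the raw values.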

Pipeline implementation by nextflow

To automate the three core modules, we utilized Nextflow 24.10.3 (https://www.nextflow.io), an open-source workflow manager widely adopted in bioinformatics²², to construct a scalable and reproducible pipeline for transcriptomic data preprocessing. Nextflow serves as the backbone of the pipeline, organizing the execution of discrete tasks and managing data flow between them. The core components of Nextflow, including channels, processes, workflows, and executors, enable the smooth orchestration and execution of the pipeline. Each step of the pipeline is encapsulated within its own Nextflow process, ensuring modularity and a clear definition of system resource requirements.

The data acquisition process handles the extraction of raw gene expression data and associated metadata, based on a predefined list of genes of interest given by the user. This process is executed by an R script that retrieves and stores the data in a specified output directory. The second step in the pipeline, data normalization, applies TMM + CPM normalization to the raw gene expression data. In this process, an R script performs the normalization, using the previously acquired raw expression data. The output from this step includes the normalized gene expression values and sample count data. The final step in the pipeline is SVA batch correction, which removes unwanted sources of variation in the gene expression data due to batch effects. This process ensures that the data used for downstream analysis is adjusted for technical variation that could otherwise confound the results. As with the normalization process, SVA batch correction is implemented through an R script. The input consists of the normalized expression data, and the output is the adjusted gene expression values, ready for further analysis.

The Nextflow workflow component integrates these processes into a coherent pipeline, managing the flow of data between them. The workflow begins with the data acquisition process, followed by data normalization, and concludes with SVA batch correction. This data flow ensures that each step receives the necessary inputs and generates the required outputs, which are passed seamlessly to the next step.

Once the pipeline is defined, Nextflow’s executor component manages the execution across the computational infrastructure. The workflow can be executed across various platforms, including local machines, cloud environments, or HPC clusters.

To run the pipeline, users must navigate to the directory containing the Dockerfile and workflow.nf file and execute the following command in the bash:

Pipeline execution

The pipeline’s overall execution time for the subset of genes was approximately 3 min and 48 s, with a total CPU usage of 0.1 CPU hours. This highlights the pipeline’s computational efficiency for evaluating a subset of high-dimensional data and performing complex preprocessing tasks such as data acquisition, normalization, and batch correction.

Validation & benchmarking

Cross-validation

Cross-validation techniques were employed to assess the generalizability and robustness of the preprocessing pipeline. To ensure independent validation, a distinct set of genes (AKT1, GSK3B, GDF11, FOXO1, SESN2, ULK1, PGC1, PINK1, PDPK1, BCL2, HMOX1, FIS1, TNF, and PARP1), which were not initially used, was introduced. The resulting data were subsequently visualized using PCA, and Euclidean distance between tissue clusters was compared (Supplementary Fig. 4) to examine the consistency of the pipeline’s performance across new gene sets.

Benchmarking

Benchmarking was performed by obtaining TPM values from the GTEx dataset, which were then processed to eliminate technical variation either by SVA batch correction or quantile normalization. The Euclidean distance between tissue clusters was calculated and compared across different preprocessing groups to evaluate clustering quality. Additionally, tissue-specific expression trajectories were assessed across multiple preprocessing strategies, including: (1) TMM + CPM + SVA, (2) TPM + SVA, (3) TMM + CPM + Quantile Normalization, and (4) TPM + Quantile Normalization, to determine the impact of normalization and batch effect correction methods on clustering and trajectory patterns.


Acknowledgements

I would like to thank Friedrich Schiller University Jena for enabling open-access funding for my project. The data used for the analysis described in this manuscript were obtained from the GTEx Portal (https://gtexportal.org/) on 23/08/24.

Author contributions

Dhanalakshmi Jothi: conceptualization, coding, and manuscript writing.

Funding

Open Access funding enabled and organized by Projekt DEAL.

Data availability

The GTEx raw data can be directly downloaded from https://storage.googleapis.com/adult-gtex/bulk-gex/v8/rna-seq/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct.gz and metadata from https://storage.googleapis.com/adult-gtex/annotations/v8/metadata-files/GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt, https://storage.googleapis.com/adult-gtex/annotations/v8/metadata-files/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt.

Code availability

The R code for the preprocessing pipeline and guidelines for executing the Nextflow workflow are available at https://github.com/dhana2403/GTEx_Pro. The interactive web application for non-programmers is under development and can be accessed via https://gtexprov1.streamlit.app/.

Declarations

Competing interests

The author declares no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Lonsdale, J. et al. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45(6), 580–585 (2013).
2. Kosti, I. et al. Cross-tissue analysis of gene and protein expression in normal and cancer tissues. Sci. Rep. 6(1), 24799 (2016).
3. Leek, J. T. et al. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28(6), 882–883 (2012).
4. Robinson, M. D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11(3), R25 (2010).
5. Cuevas-Diaz Duran, R., Wei, H. & Wu, J. Data normalization for addressing the challenges in the analysis of single-cell transcriptomic datasets. BMC Genom. 25(1), 444 (2024).
6. Elmore, K. L. & Richman, M. B. Euclidean distance as a similarity metric for principal component analysis. Mon. Weather Rev. 129(3), 540–549 (2001).
7. Yao, F., Coquery, J. & Lê Cao, K.-A. Independent principal component analysis for biologically meaningful dimension reduction of large biological data sets. BMC Bioinform. 13(1), 24 (2012).
8. Davies, D. L. & Bouldin, D. W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1(2), PAMI (1979).
9. Zhao, Y., Wong, L. & Goh, W. W. B. How to do quantile normalization correctly for gene expression data analyses. Sci. Rep. 10(1), 15534 (2020).
10. Ryaboshapkina, M. & Hammar, M. Tissue-specific genes as an underutilized resource in drug discovery. Sci. Rep. 9(1), 7233 (2019).
11. Danino, R., Nachman, I. & Sharan, R. Batch correction of single-cell sequencing data via an autoencoder architecture. Bioinf. Adv. 4(1) (2023).
12. Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1), 139–140 (2010).
13. Bolstad, B. preprocessCore: A collection of pre-processing functions. R package version 1.48.0. (2023).
14. Buja, A. & Eyuboglu, N. Remarks on parallel analysis. Multivar. Behav. Res. 27(4), 509–540 (1992).
15. Oliva, M. et al. The impact of sex on gene expression across human tissues. Science 369(6509), eaba3066 (2020).
16. Blighe, K. L.A., PCAtools: Everything Principal Components Analysis. R package version 2.18.0. (2024).
17. Inc., P. T. Collaborative Data Science (Plotly Technologies Inc., 2015).
18. Walesiak, M. D.A., The choice of variable normalization method in cluster analysis. Int. Bus. Inform. Manage. Association (IBIMA) 325–340 (2020).
19. Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer International Publishing, 2016).
20. Makowski, D. B. S. et al. Methods and algorithms for correlation analysis in R. J. Open Source Softw. 5, 2306 (2020).
21. Kolde, R. pheatmap: Pretty Heatmaps. R package version 1.0.12. (2018).
22. Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35(4), 316–319 (2017).

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Material 1 (1.6 MB, PDF)
