Skip to main content
Scientific Reports logoLink to Scientific Reports
. 2022 Dec 8;12:21242. doi: 10.1038/s41598-022-25524-4

A tensor decomposition-based integrated analysis applicable to multiple gene expression profiles without sample matching

Y-h Taguchi 1,, Turki Turki 2
PMCID: PMC9732005  PMID: 36481877

Abstract

The integrated analysis of multiple gene expression profiles previously measured in distinct studies is problematic since missing both sample matches and common labels prevent their integration in fully data-driven, unsupervised training. In this study, we propose a strategy to enable the integration of multiple gene expression profiles among multiple independent studies with neither labeling nor sample matching using tensor decomposition unsupervised feature extraction. We apply this strategy to Alzheimer’s disease (AD)-related gene expression profiles that lack precise correspondence among samples, including AD single-cell RNA sequence (scRNA-seq) data. We were able to select biologically reasonable genes using the integrated analysis. Overall, integrated gene expression profiles can function analogously to prior- and/or transfer-learning strategies in other machine-learning applications. For scRNA-seq, the proposed approach significantly reduces the required computational memory.

Subject terms: Computational biology and bioinformatics, Gene regulatory networks

Introduction

The integrated analysis of gene expression profiles is difficult to accomplish14. Its primary purpose is to compensate for small sample sizes by integrating gene expression profiles measured from multiple studies because the process is generally not straightforward, while matching samples is rare. In this study, sample matching refers to correspondence between independent studies. For instant, the straightest sample matching is control and treatment taken from the same individuals, e.g., younger age and older age. Or more weakly, two individuals with the same properties other than treatment, e.g., have the same weight, age, or sex. Sharing labeling is not sample matching, yet similar. For example, when healthy controls and disease patients are collected from two different hospitals, it is simple label sharing rather than matching. In this study, we considered the cases where even label sharing is missing; in this sense, removing batch effect is out of scope in this study. Batch effect refers to removing bias between two set of samples with the same labeling. For example, in two hospitals, A and B, healthy controls and disease patients should be collected. The difference between the two hospitals can be larger than that between healthy controls and patients in individual hospitals. This is referred to as “batch effect”, which must be removed before investigating the difference between patients and healthy controls common in two hospitals. Since we do not consider the cases that share labels, the batch effect is therefore not considered in this study. If sample-matching information is missing, and samples are not associated with common labeling (e.g., healthy controls or patients), the requirements of this process are not always fulfilled, even if we simply group gene expression profiles using their labels within individual sets. Hence, the establishment of general frameworks for integrating multiple gene expression profiles without sample matching and common sample labeling would be of considerable benefit5. Owing to these basic requirements, such methods must be unsupervised, since the integration of multiple gene expression profiles that share nothing other than the genes is impossible. In this study, we employ tensor decomposition (TD) for this purpose6. Integrating two independent studies without either sample matching or label sharing remains to be accomplished. Nevertheless, this kind of expectation often exist behind biological studies. For example, genes with different expression between healthy controls and patients can be compared with those whose expression vary with aging if the considered disease are related to aging. Directly integrating two gene expression profiles of comparison between healthy controls and patients and that of difference with aging would be promising, which is the primary aim of this study.

Results

Figure S1 shows the workflow of analyses performed in this study.

Integrated analysis of datasets 1, 2, and 3

First, to determine whether this simple idea can integrate multiple gene expression profiles lacking sample matching, we attempted to integrate datasets 1, 2, and 3. Datasets 1 and 3 comprised cell lines modeled as either healthy controls or ADs, whereas dataset 2 comprised cell lines aiming to mutate AD-associated genes indirectly related to AD-only. After obtaining a singular value vector, u1i, as described in Eq. (8), we first investigated whether they were associated with classifications. Singular value vectors attributed to the jkth sample, v1jkk,1k3, were computed from u1i using Eq. (10), and the coincidence of v1jkk with the classifications of datasets 1, 2, or 3 was investigated (Table S1 and Fig. S2).

v1jkk,115 were significantly correlated with classifications in any of datasets 1, 2, or 3. Next, we tried to determine u1i, which is used for gene selection, and we selected u1i,115 (see Table S5). Pi are attributed to i using Eq. (9) while taking Ω1={1|115}. The obtained Pi were corrected using the BH criterion6, and 565 i (genes) associated with adjusted Pi less than 0.01 were selected. To validate the selected 565 genes biologically, gene symbols were uploaded to Enrichr7. We observed various enrichments (Data S1). For example, the “KEGG 2021 Human” category listed six neurodegenerative disease-related pathways within the top-10 pathways (Table S2A). Similarly, the “Jensen Diseases” category listed four neurodegenerative diseases within the top-10(see Table S2A). Additionally, “Disease Perturbations from GEO down” and “Disease Perturbations from GEO up” listed similar neurodegenerative diseases within the top-ranked diseases (Data S1). We noted the “Allen Brain Atlas up” and “Allen Brain Atlas down” categories in which many AD-related brain regions were highly ranked (Data S1). Enrichment analyses using databases other than Enrichr also reported convincing results. The top- and the third-ranked terms in the “GAD_DISEASE_CLASS” category of DAVID8,9 were PSYCH and NEUROLOGICAL, respectively (Table S2A). The top-10 ranked terms in the “GAD_DISEASE” category of DAVID included five neurodegenerative diseases (Table S2A). g:Profiler10 listed 16 KEGG pathways as significantly enriched with gene symbols associated with 565 selected genes; 16 pathways included six neurodegenerative disease-related pathways (Table S2A). Thus, TD-based unsupervised feature extraction (FE) was used to integrate three gene expression profiles lacking sample matching, genes of expressions associated with classifications in three independent studies were identified, and genes selected were enriched by various neurodegenerative disease-related terms from three biological databases. This suggests that TD-based unsupervised FE is a promising method for integrating gene expression profiles without sample matching.

Drug repositioning using the tensor obtained with datasets 1, 2, and 3

We have previously demonstrated11 that integrated analyses of gene expression profiles between model animals treated by various drugs and patients were useful to determining which drug compounds are effective against diseases. There, we integrated only one disease gene expression profile with that of the drug treatment. In the present study, we integrated more gene expression profiles of diseases with those of drug treatment because of the novel framework introduced herein. We employed the gene expression profile (i.e., the integration of datasets 1, 2, and 3, xik,1k3), and dataset 4 as a gene expression profile of drug treatment. We obtained xi4, as described in Materials and Methods, to be integrated with xik,1k3, as obtained previously.  Higher-order singular value decomposition (HOSVD6) was applied to the obtained xik,1k4.

First, we sought to validate the singular value vectors, v1jkk, associated with classifications (Table S1 and Fig. S3). It is obvious that they are coincident with the classifications of individual datasets. Then, we selected u1i,115 as those associated with u2,124 by investigating G(1,2,3) in Eq. (8) while fixing 124 (see Table S6). Pi are attributed to i using Eq. (9) while taking Ω1={1|115}. The obtained Pi were corrected using the BH criterion and 544 i genes associated with adjusted Pi less than 0.01, as selected. To validate the selected 544 genes biologically, their associated gene symbols were uploaded to Enrichr. We then observed various enrichments (Data S2). For example, the “KEGG 2021 Human” category listed six neurodegenerative disease pathways within the top-10 pathways (Table S2B). Similarly, the “Jensen Diseases” category listed four neurodegenerative diseases within the top-10 diseases as well (Table S2B). In addition, “Disease Perturbations from GEO down” and “Disease Perturbations from GEO up” listed similar neurodegenerative diseases within the top-ranked diseases (Data S2). Upon examining the “Allen Brain Atlas up” and “Allen Brain Atlas down” categories, we observed that many AD-related brain regions were highly ranked (Data S2). Enrichment analyses using databases other than Enrichr also reported convincing results. The first- and fourth-ranked terms in the “GAD_DISEASE_CLASS” category of DAVID were PSYCH and NEUROLOGICAL, respectively (Table S2B). The top-10 ranked terms in the “GAD_DISEASE” category of DAVID included four neurodegenerative diseases (Table S2B). The g:Profiler listed 13 KEGG pathways as significantly enriched with gene symbols associated with 544 selected genes; 13 pathways included six neurodegenerative disease-related pathways (Table S2B). Hence, the TD-based unsupervised FE integrated four gene expression profiles lacking sample matching; it identified genes whose expressions were associated with classifications in three independent studies simultaneously; it selected genes that were enriched by various neurodegenerative disease-related terms from three biological databases.

Next, we attempted to identify how this strategy ranked drugs based upon gene expression profiles in dataset 4. After computing v1j4[1]j4[2]j4[3]4R60617×94×4×3 by Eq. (11), HOSVD was applied to v1j4[1]j4[2]j4[3]4 with fixed 1, and we obtained

v1j4[1]j4[2]j4[3]4=1~=1942~=143~=13G~[1]~1,~2,~3u~~1j4[1][1]u~~2j4[2][1]u~~3j4[3][1], 1

where u~~1j4[1][1]R94×94 was a singular value vector attributed to drugs, u~~2j4[2][1]R4×4 was a singular value vector attributed to dose density, and u~~3j4[3][1]R3×3 was a singular value vector attributed to biological replicates. We checked the drugs having larger ~1=15u~~1j4[1][1]2, and Table S3 shows the top-five drugs for 114. Because most were regarded as effective in the original study12, drug repositioning seems to have been successful.

Transfer Learning  using the tensor obtained with datasets 1, 2, and 3

One possible application of the integrated analysis of multiple gene expressions lacking sample matching is analogous to transfer learning (TL)13, where pre-trained machine-learning models are used to achieve better performance, even with smaller sample sizes. Although deep learning (DL) is often used for TL, DL architectures are not suitable for gene expression profile processing. For example, Yifei et al.14 found that DL achieved better performance than linear regression (LR) only by less than 7 %. This can be attributed to the lack of structure for gene expression profiles. In contrast to images and sentences (or documents) to which DL was applied successfully, gene expressions are orderless vectors of real numbers. In contrast, TD was successfully applied to gene expression profiles6 because tensor similarly does not consider the order of real numbers at all. Thus, TD is a suitable architecture for processing gene expression profiles.

To determine whether the integrated analysis of multiple gene expression profiles obtained using datasets 1, 2, and 3 can be used as a pre-trained system in TL, we combined dataset 5 with xik,1k3 as a pre-trained system. Dataset 5 comprises gene expression profiles created by the overexpression of the ABCC1 gene, which was recently recognized as an Alzheimer’s disease (AD) therapy target15. Usually, only gene expression profiles caused by overexpression of ABCC1 gene are analyzed, and genes whose expression are altered by ABCC1 over-expression are selected. Then, selected genes are compared to those whose expressions are known to have been altered by AD. Nevertheless, this procedure is somewhat indirect because no sample matching between gene expression profiles of AD and ABCC1 overexpression experiments exist. Our strategy is more suitable for comparison between gene expression profiles of AD and ABCC1 over-expressions since these expression profiles can be directly compared.

After obtaining the TD as in Eq. (8) for dataset 5, v1j5[1]j5[2]j5[3]5 was computed by Eq. (11); the correlation between six categories in dataset 5 and v1j5[1]j5[2]j5[3]5 was then computed (Table S1), and the results are shown in Fig. S4. Although we have considered scenarios up to 1=5, v1j5[1]j5[2]j5[3]5 were coincident with classification. Thus, as expected, the data show correlations between gene expression profiles of AD and ABCC1 overexpression through TD.

We selected u1i,115 by investigating G(1,2,3) in Eq. (8) and fixing 2{1,3,4,5} (see Table S7). Pi were attributed to i using Eq. (9) while taking Ω1={1|115}. The obtained Pi were corrected with the BH criterion and 660 i (genes) associated with adjusted Pi less than 0.01 that were selected. To validate the selected 660 genes biologically, we uploaded gene symbols associated with genes to Enrichr and observed various enrichments thereof (Data S3). For example, the “KEGG 2021 Human” category listed six neurodegenerative disease-related (Table S2C) pathways within the top-10. Similarly, the “Jensen Diseases” category listed four neurodegenerative diseases (Table S2C) within the top-10. “Disease Perturbations from GEO down” and “Disease Perturbations from GEO down” listed similar neurodegenerative diseases within the top-ranked diseases (Data S3). In “Allen Brain Atlas up” and “Allen Brain Atlas down” categories, many AD-related brain regions were highly ranked (Data S3). Enrichment analyses using databases other than Enrichr also reported convincing results. The top-three terms in the “GAD_DISEASE_CLASS” category of DAVID were PSYCH, NEUROLOGICAL, and AGING (Table S2C). The top-10 terms in the “GAD_DISEASE” category of DAVID included four neurodegenerative diseases (Table S2C). The g:Profiler listed 16 KEGG pathways as significantly enriched with gene symbols associated with 633 selected genes; 16 pathways included six neurodegenerative disease-related pathways (Table S2C). Thus, we successfully selected genes whose expressions were simultaneously altered by both AD and ABCC1 over-expressions and biological properties evaluated by enrichment analysis.

Integrated analysis of single-cell RNA-sequence (scRNA-seq) data

In this subsection, we applied our strategy to the integrated analysis of the scRNA-seq dataset. Because individual experiments with scRNA-seq included 104 cells, it was unclear how to integrate multiple scRNA-seq profiles. After obtaining the TD, as shown in Eq. (8), where k was replaced with an integer, 1c25, representing the cth RNA-seq measurement described in “Integrated analysis of scRNA-seq data” section, we focused on u3cR25×25. Variable c was divided into four groups, including either AD or healthy controls in either hippocampus or cortex brain regions. Next, we determined which u3c was associated with these four classes (Fig. S5). We found that only u6c was associated with these four classes; categorical regression was applied to u3c for each 3 separately, and P values were computed and corrected by BH criterion. Only 3=6 was associated with adjusted P values less than 0.05. The results showed that 2G(626)2 was the largest among 2G(126)2, where a summation was taken over 2 since there is no reason to select one specific 2 (Table S8). Pi were attributed to i using Eq. (9) with Ω1={1|6}. Pi were corrected, and 177 i (genes) were associated with an adjusted Pi of less than 0.01. Gene symbols associated with these 177 i were uploaded to Enrichr (Data S4). The results were very distinct from those in Table S2. No neurodegenerative disease terms were detected within the top-10 ranked terms in either “KEGG 2021 HUMAN” or “JENSEN DISEASES” categories. Instead, the top-10 ranked terms in the “Human Gene Atlas” category contained several brain tissues (Table S4). In contrast, the “Disease Perturbations from GEO down” and “Disease Perturbations from GEO down” categories included many diseases related to the brain (Table S4).

Although these are only few examples, notwithstanding the simplicity of our strategy, it successfully listed genes associated with tissue specificity as well as AD. Therefore, our proposed strategy can deal with scRNA-seq very easily, even without directly considering the massive number of cells in individual scRNA-seq measurements because we consider only the top-10 singular value vectors within individual scRNA-seq gene expression profiles, including up to 104 cells.

Discussion

Advantages of the proposed implementation

It is evident that our implementation has at least two unique advantages: First, it enables the integration of multiple gene expression profiles without sample matching. Datasets 1, 2, and 3 had sample sizes of 9, 23, and 8, respectively. Replacing these numbers of samples with those of singular value vectors (i.e., sample size of eight), we obtained a single tensor, xik, to which the HOSVD was easily applied. It is also possible to evaluate the consistency of singular value vectors, v1jkk, attributed to samples in individual datasets by computing it from u1i with Eq. (10) after applying HOSVD to xik with classifications of individual datasets. Because this strategy is applicable to the general number of samples in individual gene expression profiles, it is very useful. Its second advantage is a reduction in the computational memory required for analysis. When we applied this implementation to scRNA-seq data, we employed only the highest-ranked 10 singular value vectors computed by applying singular value decomposition (SVD) to individual scRNA-seq data, rather than considering as many as 104 cells. This drastically reduced the required memory by a factor of 100. Nevertheless, we successfully selected genes associated with various significantly enriched biological terms related to tissue specificity as well as diseases, as expected. Because more cells are expected to be sequenced in individual experiments via scRNA-seq in the future, our strategy is recommended for a wide range of scRNA-seq analysis applications.

Visualization of relation between samples without sample matching

Figures S2, S3, and S4 show that the proposed strategy can relate gene expression profiles lacking sample matching with one another. For example, v1jkk are always coincident with classifications regardless of k or datasets considered. Furthermore, they look very similar among distinct integration (i.e., among Figs. S2, S3, and S4). Hence, profiles seen in v1j11 that represent distinctions between AD and control in dataset 1 correspond to those seen in v1j22 that represent the distinction between WT and the other three treated cell lines in dataset 2. This is a reasonable coincidence because the distinction between WT and treated cell lines is supposed to correspond to that between control and AD. Thus, our strategy can successfully relate gene expression profiles without apparent sample matching. Meanwhile, if we consider v1j33, the situation differs a bit. Although we expect that v1j33 represent distinctions between control and two AD cell lines, they represent the distinction between AD1 and the other two cell lines: AD2 and control. This suggests the possibility that AD2 fails to represent some property of AD. Usually, this kind of investigation is impossible, because we cannot compare two gene expression profiles without sample matching. Nevertheless, our strategy enables us to figure out the discrepancy in dataset 3 if it is compared with datasets 1 and 2. Although this is just an example, detailed comparisons of v1jkk between distinct k (i.e. datasets) allows for comparison of gene expression profiles even without sample matching.

Comparisons with previous works

To demonstrate the superior performance of our proposed strategy, we compared the performance to some conventional methods that integrate multiple matrices. Most methods have been designed for genomic science require sample matching. For example, all nine methods listed in reference16 require sample matching. intNMF17 is also limited to datasets attributed to the same individuals. In contrast to these studies, although a review18 comprehensively reported on integrated multi-view analyses, samples were implicitly assumed to be shared, and no integrated methods aimed to integrate multiple matrices or tensors that did not share samples; only features were included. Although MINT19 was proposed to integrate gene expression profiles sharing genes rather than samples, the method assumed common labeling among multiple studies in a supervised learning framework. Conversely, its unsupervised framework required the common quantitative variables associated with all samples included in the studies; thus, this method cannot be applied to the present task of investigating the correlations between latent variables and classifications not shared between individual studies, as shown in Table S2. We could not find any implementations designed to integrate multiple gene expression profiles formatted as matrices or tensors sharing only genes and not samples. Thus, we sought suitable methods outside the genomic field. First, we applied CMF20, implemented as a CMF package in R21, to the integration of datasets 1, 2, and 3. We computed four latent variables assuming a Poisson distribution for gene expression. None of the obtained latent variables were significantly correlated with classification in datasets 1, 2, and 3 (Table S9). Then, we tried GFA22, which was also implemented as a GFA package in R. Although we computed five latent variables, none were correlated with the classification in dataset 1, although some were correlated with the classifications in datasets 2 and 3 (Table S9). Hence, these two advanced methods, CMF and GFA, failed to identify latent variables correlated with classifications of all three datasets using integrated analysis. Finally, we tried simple concatenation. Ironically, SVD applied to a 60617×40 contracted matrix comprising xij1R60617×9, xij2R60617×23 and xij3R60617×8 provided the first singular value vector, which was correlated with all classifications of the three datasets. However, the P values for the correlation with classification of dataset 1 were only very slightly significant (P=0.05, Table S9). We then selected 147 genes using Eq. (9) by replacing u1i with the first singular-value vectors correlated with all classifications of datasets 1, 2, and 3. To confirm the inferiority of simple concatenations toward TD-based unsupervised FE, although we evaluated their enrichment by uploading 147 genes to Enrichr, DAVID, and g:Profiler, their enrichments were clearly poorer than those of genes selected by TD-based unsupervised FE (Table S2D and Data S5). Thus, we could not identify other methods comparable with or superior to TD-based unsupervised FE. Considering that the two advanced methods, CMF and GFA, were inferior to the SVD applied to simple concatenated matrices, the methods developed for this purpose (i.e., integrated analysis of multiple matrices) seem to be inapplicable to the present purpose. Hence, more advanced methods must be developed to be suitable for the purpose of the present work (i.e., integrated analysis of gene expression profiles that lack common matching samples).

Treatment of missing values

An additional advantage of the proposed strategy may be noted as a by-product. scRNA-seq data is known to include many missing values, and matrix representation was previously employed to resolve this problem23. In the proposed strategy, although the employment of SVD and HOSVD did not aim to fill missing values, it should naturally function as a mechanism to do so. This may be the reason the present strategy also works for scRNA-seq data as well with neither any specific additional modifications nor implementations toward the treatment of scRNA-seq data set. Notably our proposed strategy successfully provided a platform to integrate typical gene expression profile measurements with small samples using scRNA-seq, which includes massive numbers of single cells, since both can be represented as a tensor form, xijkRN×L×K, no matter how large the number of cells included in the scRNA-seq. Because we can process scRNA-seq with L=10 in this study, this suggests the possibility that scRNA-seq does not provide the expected large amount of information according to the number of single cells.

Comparisons with the original individual studies

Although we assessed original studies to evaluate how coincident individual studies are with the present integrated one, it was not easy since how the original study treated samples differed from the present study. For dataset 1, since original study24 did not selected genes whose expression differs between NDC and others, we could not compare their results with ours. For dataset 2, since original study25 did not simultaneously compare four groups of genes, but compared them only pairwisely, we could not compare their results with ours. For dataset 3, since original study26 did not distinguish between AD1 and AD2, we could not compare their results with ours. For datasets 4 and 5, since original studies12,27 compared treated cell lines with controls, which we did not try, we could not compare their results with ours. For dataset 6, since original studies28 did not compare AD samples and controls directly, but in only tissue specific manner, we could not compare their results with ours. In general, since the individual studies have their own purposes and compared samples along these specific lines while our methods are fully data driven, how samples are compared with hardly matches between individual studies and the present integrated study.

Classification

In general, classification performance is less likely to be good, since by considering all datasets together, specificity to individual datasets is expected to decrease. Table S10 shows the classification performance achieved by v1jkk for integrated analysis of datasets 1, 2 and 3 in Table S1. As expected, the performance was not good. Thus, integrated analysis proposed in this study less likely achieves good classification performance.

Comparison with principal component analysis applied to individual data sets

In order to determine how better integrated analysis can select common genes than separated analysis, we separately applied principal component analysis (PCA) to datasets 1, 2 and 3 so that PC scores are attributed to individual genes. Then, 100 top ranked genes with larger absolute PC scores are selected; Fig. S6 shows the Venn diagram. Although the number of genes selected by TD and shared with PCA for the first, second, and third PCs is not small, there are very few genes shared with PCA for the fourth and the fifth PCs. Since the fourth and the fifth components are often coincident with classification for more than one dataset with high significance (see Table S1), they are very important: the integrated analysis clearly allows for the identification of more common genes for datasets 1, 2, and 3 than separated analysis. In addition to this, PC scores attributed to genes by PCA are neither well correlated with one another nor correlated with u1i computed by TD (Table S11). This suggests that not only top ranked genes but also over all gene expression profiles are hardly similar with one another if PCA is applied to individual datasets. Thus integrated analysis using TD that allows us to have unique u1i valid for all three datasets is useful.

Conclusion

In this study, we proposed a new strategy that integrates multiple gene expression profiles that lack both sample matching and common labeling. This strategy successfully integrated AD gene expression profiles that lacked sample matching and AD scRNA-seq datasets. The former can be used for drug repositioning and TL. Massive amounts of computational memory can be saved with the latter. The proposed strategy appears to be useful for integrating gene expression profiles, even those lacking both sample matching and common labeling among them.

Methods

Mathematical formulations

Integration of multiple gene expression profiles with TD

Here, we consider cases in which the number of genes, N, is much larger than the number of samples in the kth gene expression profile, Mk, as NMk. Given K,(1kK) gene expression profiles represented as a matrix,

xijkRN×Mk, 2

or a (S+1)-mode tensor,

xijk[1]jk[2]jk[S]RN×Mk[1]×Mk[2]××Mk[S], 3

where 1sS stands for the sth experimental condition in the kth gene expression profile. Then, SVD or HOSVD6 is applied to obtain

xijk=ui[k]λ[k]vjk[k], 4

where ui[k]RMk×N are eigenmatrices attributed to genes . vjk[k]RMk×Mk are eigenmatrices attributed to samples, λ[k] are eigenvalues. The reason why we did not write vj but vjk[k] is because vj usually means that vjRL×M, i.e., M is independent of k, but in our case, M=Mk, which has k dependence. Therefore, vj is not vjk[k] since j is jk with k dependence. Alternatively,

xijk[1]jk[2]jk[S]=12SS+1G12SS+1u1jk[1][k]u2jk[2][k]uSjk[S][k]uS+1i[k], 5

where G12SS+1RMk[1]×Mk[2]××Mk[S]×N is a core tensor representing a weight of products, u1jk[1][k] u2jk[2][k] uSjk[S][k]uS+1i[k], usjk[s][k]RMk[s]×Mk[s], and uS+1i[k]RN×N are singular value orthogonal matrices. Thus, we derive the reduced matrices as either

xik=jk=1Mkxijkvjk[k],1L, 6

or

xik=jk[1]=1Mk[1]jk[S]=1Mk[S]xijk[1]jk[2]jk[S]u1jk[1][k]uSjk[S][k],1sLk[s](Mk[s]),1L=s=1SLk[s]. 7

This implementation involves a problem of note. For example, in Eq. (4), simultaneous changes, ui[k]-ui[k] and vjk[k]-vjk[k], also satisfy Eq. (4). Hence, we cannot fix the signs of ui[k] and vjk[k]. This does not present a problem if only individual gene expression profiles are investigated. However, if we need to compare multiple profiles, challenges might arise. For example, we compare multiple profiles in scRNA-seq data analysis in this study. Thus, we fix signs of vjk[k] such that the correlation coefficients between ui[1] and ui[k],k>1 have positive values prior to the computations in Eq. (6).

HOSVD is applied to xikRN×L×K to obtain

xik=1=1N2=1L3=1KG(123)u1iu2u3k, 8

where G(123)RN×L×K is a core tensor, and u1iRN×N. u2RL×L and u3kRK×K are singular value matrices that are orthogonal.

Gene selection

We can also select genes, i, by attributing P values to the ith genes while assuming u1i obeys a Gaussian distribution,

Pi=Pχ2>1Ω1u1iσ12, 9

where Pχ2[>x] is the cumulative χ2 distribution where the argument is larger than x, σ1 is the standard deviation, and the summation of 1 is taken over a set of 1, Ω1, which is a set of 1 selected as having larger ,3G(13)2. Pi are corrected via the Benjamini-Hochberg (BH) criterion6, and i is associated with adjusted P values less than 0.01.

Projection of individual gene expression profiles onto the space spanned by u1i

To determine how samples jk within each K gene expression profile are located in the space spanned by the obtained singular value vectors attributed to genes, u1i, we project individual gene expression profiles onto the space spanned by u1i as

v1jkk=i=1Nu1ixijkk 10

or

v1jk[1]jk[2]jk[S]k=i=1Nu1ixijk[1]jk[2]jk[S], 11

where v1jkkRN×Mk×K and v1jk[1]jk[2]jk[S]kRN×M1××MS×K are regarded as coordinates of individual samples, jk or jk[s], in the space spanned by u1i.

Gene expression profiles

Herein, we specifically focused on AD. All gene expression profiles included in the six studies listed below are selected from the gene expression omnibus (GEO). Individual profiles are normalized to have zero mean and a standard deviation of one within individual profiles. Thus, when gene expression profiles are formatted as a matrix, xijk,RN×Mk representing expression of the ith gene at the jkth sample,

i=1Nxijk=0, 12
i=1Nxijk2=N, 13

and when gene expression profiles are formatted as a tensor, xijk[1]jk[S]RN×Mk[1]××Mk[S], representing the expression of the ith gene at the sample annotated by indices j1,,jS, and

i=1Nxijk[1]jk[S]=0, 14
i=1Nxijk[1]jk[S]2=N. 15

Dataset 1: GSE160224

The dataset24 denoted as 1 (i.e., k=1) in this study comprises as few as nine samples, including nondemented controls (NDC), three APP duplications, and three isogenically corrected induced pluripotent stem-cell lines. In this study, three control samples are treated as controls, and the other six are treated samples when the coincidence to singular value vectors or latent variables are investigated, and genes whose expression are altered between these two classes are intended to be selected. Gene expression profiles are measured using RNA-seq technology, while genes are annotated using the Ensembl gene ID. The genes whose expressions are measured numbers as many as 58,302 in contrast to the small number of samples. As a result, gene expression profiles are formatted as a matrix, xij1R58302×9, representing expressions of the ith gene of the j1th sample.

Dataset 2: GSE155567

The dataset25 denoted as 2 (i.e., k=2) in this study comprises four classes, including THP1 macrophages after the knockout of CD33 and/or the knockdown (silencing) of PTPN6 (six samples per class). Thus, in all, it includes as few as 24 samples. In this study, we do not specifically assume which classes are considered controls, but they are regarded as four categorical classes when the coincidence with singular value vectors or latent variables are investigated. For unknown reasons, although gene expression profiles of one sample are missing, it is unlikely to affect the outcome because it includes almost all (23 out of 24) samples. Gene expression profiles are measured using RNA-seq technology, and genes are annotated using the Ensembl gene ID. The number of genes whose expressions are measured amount to as many as 60,617 in contrast to the small number of samples. As a result, gene expression profiles are formatted as a matrix, xij2R60617×23, representing the expression of the ith gene of the j2th sample.

Dataset 3: GSE162873

The dataset26 denoted as 3 (i.e., k=3) in this study comprises eight samples, four of which are AD cell lines, and the other four are normal. Because four AD cell lines comprise two sets of two samples using two distinct cell lines, this dataset includes three categorical classes when the coincidence with singular value vectors or latent variables is investigated. Two samples are taken from the first AD cell line, two samples are taken from the second AD cell line, and four samples are taken from normal cell lines. Gene expression profiles are measured by RNA-seq technology, and genes are annotated using the Ensembl gene ID. The number of genes whose expressions are measured amounts to as many as 47,749 in contrast to the small number of samples. As a result, gene expression profiles are formatted as a matrix, xij3R47749×8, representing the expression of the ith gene of the j3th sample.

Dataset 4: GSE164788

The dataset12 denoted as 4 (i.e., k=4) in this study is an in-vitro differentiated mixture of neuron and glial cells derived from the ReNcell VM neural progenitor cell line treated with 80 different compounds; mRNA levels are measured using RNA-seq. Among the 80 compounds, we select 94 drugs and combinations from which at least two doses are tested. For each dose, three biological replicates are provided. When more than three are used, we randomly select three of them. Gene expression profiles are measured by RNA-seq technology, and genes are annotated using the Ensembl gene ID. The genes whose expressions numbers as many as 28,044. As a result, gene expression profiles are formatted as a tensor, xij4[1]j4[2]j4[3]R28044×94×4×3, representing the expression of the ith gene at the j1th drug combination, the j2th dose, and the j3th biological replicates.

Dataset 5: GSE164642

The dataset27 denoted as 5 (i.e., k=5) in this study comprises three sets of six samples; each includes two classes corresponding to ABCC1 activated cells by distinct RNA or control cells. They include three biological replicates (18 total samples). Gene expression profiles are measured using RNA-seq technology, and genes are annotated using the Ensembl gene ID. Thus, they are treated as six categorical classes when the coincidence with singular value vectors or latent variables is investigated. The genes whose expressions are measured numbers as many as 58,003 in contrast to the small number of samples. As a result, gene expression profiles are formatted as a tensor, xij5[1]j5[2]j5[3]R58003×3×2×3, representing the expression of the ith gene at those treated by the j5[1]th RNA, the j5[2]th treatment (j5[2]=1:control, j5[2]=2:ABCC1 activated,) and the j5[3]th biological replicates.

Dataset 6: GSE163577

This is an scRNA-seq dataset28 denoted as 6 (i.e., k=6) in this study, including both AD and healthy controls from 25 hippocampus and superior frontal cortex samples across 17 control and eight AD patients. 25 individual datasets are formatted as a matrix, xij6[c]R33538×M6[c],1c25, representing the expression of the ith gene at the j6[c]th cell within the cth scRNA-seq profile. The number of cells, M6[c], in individual datasets vary and is roughly 104.

Integrated analysis of gene expression profiles

Integrated analysis of datasets 1, 2, and 3

When we integrate the datasets 1, 2, and 3, we apply SVD to them to get

xij1=ui[1]λ[1]vj1[1] 16
xij2=ui[2]λ[2]vj2[2] 17
xij3=ui[3]λ[3]vj3[3] 18

Then, we compute xik using Eq. (6) while setting L=8 because M1=9,M2=24,M3=8, and L cannot exceed Mk. Because the number of genes whose expressions are measured differs among datasets 1, 2, and 3, we employ N=60617 as the number of genes whose expressions are measured in dataset 2 and the largest number of genes measured among datasets 1, 2, and 3. Then, these three gene expression profiles are formatted as a tensor,

xikR60617×8×3, 19

where the missing expressions in datasets 1 and 3 caused by the smaller number of genes than in dataset 2 are filled with zero. HOSVD is applied to xik as in Eq. (8).

Drug repositioning using the tensor obtained with datasets 1, 2, and 3

HOSVD is applied to xij4[1]j4[2]j4[3]R28044×94×4×3. Then, xi4 is computed using Eq. (7) with L4[1]=4,L4[2]=2, and L4[3]=1. The motivations of this choice are as follows. Biological replicates are expected to have common gene expression, and the first singular value vector, u1j4[3]], is expected to have constant value regardless of j4[3], based upon previous studies. Thus, it is sufficient to consider only u1j4[3] for biological replicates. Then, two choices remain, including L4[1]=4,L4[2]=2 or L4[1]=2,L4[2]=4. Clearly, the former is reasonable because the number of drug combinations, at 94, is much larger than that of dose density, which is four. Missing values are filled with zero. In addition to xik,1k3, obtained in the previous subsection, HOSVD is applied to xikR60617×8×4 as in Eq. (8).

TL using the tensor obtained with datasets 1, 2, and 3

To evaluate the performance of TL, HOSVD is applied to xij5[1]j5[2]j5[3]R58003×3×2×3. Then, xi5 is computed using Eq. (7) with L5[s]=2,1s3. HOSVD is applied to xikR60617×8×4, obtained with Eq. (8) after xi4 is replaced with xi5.

Integrated analysis of scRNA-seq data

SVD is applied to xij6[c],1c25 one-by-one. Then, xic is computed by Eq. (6) while replacing k with c and L=10. Then, HOSVD is applied to xicR33538×10×25.

Methods to be compared

In the following, we tested three methods that assume the latent variables attributed to genes, i, which are also common among datasets 1, 2, and 3.

Collective matrix factorization (CMF)

To perform CMF, we used xIj1R60617×9, xij2R60617×23, and xij3R60617×8, such that they shared the same number of genes. Missing values were filled with zero and were normalized to have zero mean and a standard deviation of one, as denoted in the beginning of this section. The structure of assumed modeling is given as

xijk==1Luiujk[k]+bi[k]+bjk[k]+εijk[k], 20

where uiRL×60617,ujk[k]RL×Mk, and bi[k]R60617, bjk[k]RMk, εijk[k]R60617×MK. We also employed an option wherein xijk obeys a Poisson distribution because the negative signed binary distribution usually assumed for the RNA-seq dataset is not provided as an option. Before applying this model to xijk, some constants were added so that they do not take negative values, because Poisson distributions will not accept them. L=4 because TD-based unsupervised FE can identify a singular value vector correlated with classification in datasets 1, 2, and 3 within the top four.

Group factor analysis (GFA)

The datasets used were the same as those used in the trials using CMF. The structure of assumed model is given as

xijk==1Luiujk[k]+εijk[k]. 21

The primary difference from CMF, which employs Bayesian inferencing, is that GFA does not assume the distribution of xijk. L=5 was assumed because it was employed in the example in the GFA tutorial and is larger than the number of singular value vectors computed with TD-based unsupervised FE correlated with classifications in datasets 1, 2, and 3.

Simple concatenation

A matrix, xijR60617×40, where 40=k=13MK, is generated by concatenating three matrices, xij1, xij2, and xij3, to share the row number:

xij=xij1,j1=j,1j9xij2,j2=j-9,10j32xij3,j3=j-32,33j40 22

Then, SVD was applied to xij to obtain

xij=uiλvj. 23

vj for 1j9 were used for the latent variables attributed to nine samples in dataset 1, vj for 10j32 were used for the latent variables attributed to 23 samples in dataset 2, and vj for 33j40 were used for the latent variables attributed to eight samples in dataset 3.

Classification

Evaluation of classification performances were tested via linear discriminant analysis (LDA) using +lda+ function in MASS package in R21. Labels are treated as factors and prior probabilities of individual labels are set to be equal. Leave one out cross validation was employed with setting “CV=T” option.

Source code

Sample R21 source code is available as supplementary material (R version 4.1.3).

Supplementary Information

Acknowledgements

This work was supported by KAKENHI [Grant Numbers 19H05270, 20H04848, and 20K12067] to YHT. Also, this research work was funded by Institutional Fund Project under grant no (IFPIP: 924-611-1442). Therefore, authors gratefully acknowledge technical and financial support from the Ministry of Education and King Abdulaziz University, DSR, Jeddah, Saudi Arabia.

Author contributions

Y.H.T. planned the research and performed analyses. Y.H.T. and T.T. evaluated the results, discussions, and outcomes and wrote and reviewed the manuscript.

Data availability

All data sets used in this study can be obtained via the NIH/NCBI Gene Expression Omnibus (GEO) repository using accession numbers GSE160224, GSE155567, GSE162873, GSE164788, GSE164642, and GSE163577.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-022-25524-4.

References

  • 1.Huang C, et al. Integrated analysis of multiple gene expression profiling datasets revealed novel gene signatures and molecular markers in nasopharyngeal carcinoma. Cancer Epidemiol. Prev. Biomark. 2012;21:166–175. doi: 10.1158/1055-9965.EPI-11-0593. [DOI] [PubMed] [Google Scholar]
  • 2.Hu P, et al. Integrative analysis of multiple gene expression profiles with quality-adjusted effect size models. BMC Bioinform. 2005;6:128. doi: 10.1186/1471-2105-6-128. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Kyoon Choi J, et al. Integrative analysis of multiple gene expression profiles applied to liver cancer study. FEBS Lett. 2004;565:93–100. doi: 10.1016/j.febslet.2004.03.081. [DOI] [PubMed] [Google Scholar]
  • 4.Yang Z-Y, et al. Multi-view based integrative analysis of gene expression data for identifying biomarkers. Sci. Rep. 2019 doi: 10.1038/s41598-019-49967-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Wu M, Yi H, Ma S. Vertical integration methods for gene expression data analysis. Brief. Bioinform. 2020 doi: 10.1093/bib/bbaa169. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Taguchi Y-H. Unsupervised Feature Extraction Applied to Bioinformatics. Springer; 2020. [Google Scholar]
  • 7.Kuleshov MV, et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucl. Acids Res. 2016;44:W90–W97. doi: 10.1093/nar/gkw377. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Huang DW, et al. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 2008;4:44–57. doi: 10.1038/nprot.2008.211. [DOI] [PubMed] [Google Scholar]
  • 9.Huang DW, et al. Bioinformatics enrichment tools: Paths toward the comprehensive functional analysis of large gene lists. Nucl. Acids Res. 2008;37:1–13. doi: 10.1093/nar/gkn923. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Raudvere U, et al. g: Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update) Nucl. Acids Res. 2019;47:W191–W198. doi: 10.1093/nar/gkz369. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Taguchi Y-H. Identification of candidate drugs using tensor-decomposition-based unsupervised feature extraction in integrated analysis of gene expression between diseases and DrugMatrix datasets. Sci. Rep. 2017 doi: 10.1038/s41598-017-13003-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Rodriguez S, et al. Machine learning identifies candidates for drug repurposing in alzheimer’s disease. Nat. Commun. 2021 doi: 10.1038/s41467-021-21330-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Weiss K, et al. A survey of transfer learning. J. Big Data. 2016 doi: 10.1186/s40537-016-0043-6. [DOI] [Google Scholar]
  • 14.Chen Y, et al. Gene expression inference with deep learning. Bioinformatics. 2016;32:1832–1839. doi: 10.1093/bioinformatics/btw074. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.ElAli A, Rivest S. The role of ABCB1 and ABCA1 in beta-amyloid clearance at the neurovascular unit in alzheimer’s disease. Front. Physiol. 2013;4:45. doi: 10.3389/fphys.2013.00045. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Cantini L, et al. Benchmarking joint multi-omics dimensionality reduction approaches for the study of cancer. Nat. Commun. 2021 doi: 10.1038/s41467-020-20430-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Chalise P, Fridley BL. Integrative clustering of multi-level ‘omic data based on non-negative matrix factorization algorithm. PLoS ONE. 2017;12:1–18. doi: 10.1371/journal.pone.0176278. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Li Y, et al. A review on machine learning principles for multi-view biological data integration. Brief. Bioinform. 2016;19:325–340. doi: 10.1093/bib/bbw113. [DOI] [PubMed] [Google Scholar]
  • 19.Rohart F, et al. MINT: a multivariate integrative method to identify reproducible molecular signatures across independent experiments and platforms. BMC Bioinform. 2017 doi: 10.1186/s12859-017-1553-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Klami, A. et al. Group-sparse embeddings in collective matrix factorization. arXiv:1312.5921 (2014).
  • 21.R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2020).
  • 22.Leppäaho E, et al. GFA: exploratory analysis of multiple data sources with group factor analysis. J. Mach. Learn. Res. 2017;18:1–5. [Google Scholar]
  • 23.Hu Y, et al. WEDGE: Iimputation of gene expression values from single-cell RNA-seq datasets using biased matrix decomposition. Brief. Bioinform. 2021 doi: 10.1093/bib/bbab085. [DOI] [PubMed] [Google Scholar]
  • 24.Ye T, et al. Efficient manipulation of gene dosage in human iPSCs using CRISPR/cas9 nickases. Commun Biol. 2021 doi: 10.1038/s42003-021-01722-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Wißfeld J, et al. Deletion of Alzheimer’s disease-associated CD33 results in an inflammatory human microglia phenotype. Glia. 2021;69:1393–1412. doi: 10.1002/glia.23968. [DOI] [PubMed] [Google Scholar]
  • 26.Hanna R, et al. G-quadruplexes originating from evolutionary conserved l1 elements interfere with neuronal gene expression in Alzheimer’s disease. Nat. Commun. 2021 doi: 10.1038/s41467-021-22129-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Jepsen WM, et al. Adenosine triphosphate binding cassette subfamily c member 1 (ABCC1) overexpression reduces APP processing and increases alpha- versus beta-secretase activity, in vitro. Biol. Open. 2020 doi: 10.1242/bio.054627. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Yang AC, et al. A human brain vascular atlas reveals diverse cell mediators of Alzheimer’s disease risk. bioRxiv. 2021 doi: 10.1101/2021.04.26.441262. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Data Availability Statement

All data sets used in this study can be obtained via the NIH/NCBI Gene Expression Omnibus (GEO) repository using accession numbers GSE160224, GSE155567, GSE162873, GSE164788, GSE164642, and GSE163577.


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group

RESOURCES