PLOS One. 2022 Sep 29;17(9):e0275472. doi: 10.1371/journal.pone.0275472

Projection in genomic analysis: A theoretical basis to rationalize tensor decomposition and principal component analysis as feature selection tools

Y-h Taguchi 1,*, Turki Turki 2
Editor: Chi-Hua Chen3
PMCID: PMC9521941  PMID: 36173994

Abstract

Identifying differentially expressed genes is difficult because the number of available samples is small compared with the number of genes. Conventional gene selection methods employing statistical tests have the critical problem that P-values depend heavily on sample size. Although the recently proposed principal component analysis (PCA)- and tensor decomposition (TD)-based unsupervised feature extraction (FE) has often outperformed these statistical test-based methods, the reason why it works so well has been unclear. In this study, we aim to understand this reason in the context of projection pursuit (PP), which was proposed long ago to address the problem of dimensionality; we can relate the space spanned by singular value vectors to that spanned by the optimal cluster centroids obtained from K-means. Thus, the success of PCA- and TD-based unsupervised FE can be understood through this equivalence. In addition, the empirical threshold of 0.01 for adjusted P-values, computed under the null hypothesis that singular value vectors attributed to genes obey a Gaussian distribution, empirically corresponds to a threshold of 0.1 for adjusted P-values when the null distribution is generated by gene order shuffling. For this purpose, we newly applied PP to the three data sets to which PCA- and TD-based unsupervised FE had previously been applied; these data sets concern two topics, biomarker identification for kidney cancers (the first two) and drug discovery for COVID-19 (the third one). We found that the genes selected by PP coincide well with those selected by PCA- or TD-based unsupervised FE. The shuffling procedures described above were also successfully applied to these three data sets. These findings thus rationalize the success of PCA- and TD-based unsupervised FE for the first time.

Introduction

In genomic sciences, selecting a limited number of differentially expressed genes (DEGs) among as many as several tens of thousands of genes is a critical problem. Unfortunately, this is a very difficult task as the number of genes, N, is usually much larger than the number of available samples, M. However, as this is not a mathematically solved problem, it has most frequently been tackled empirically using statistical test-based feature selection strategies [1, 2]. Despite huge efforts along this direction, these statistical test-based feature selection strategies cannot be said to work well.

Selection of biologically informative genes including DEGs is essentially performed as follows (for simplicity, we assume $\sum_i x_{ij} = 0$ and $\sum_i x_{ij}^2 = N$, and that the M samples are composed of multiple classes having an equal number of samples). Suppose that we have properties $\mathbf{y} \in \mathbb{R}^M$ attributed to M samples. We would like to relate a matrix form of some omics data, e.g., gene expression profiles, $X \in \mathbb{R}^{N \times M}$, to $\mathbf{y}$. The overall purpose is to derive $\mathbf{b} \in \mathbb{R}^N$ whose components have absolute values, $|b_i|$, that represent the importance of the ith gene. The first strategy, and the most popular one outside genomic sciences, is a regression strategy that requires minimization of

$$(\mathbf{y} - \mathbf{b}X)^2 \tag{1}$$

resulting in

$$\mathbf{b} = \mathbf{y}X^T (XX^T)^{-1}. \tag{2}$$

The regression approach, Eq (2), is less popular in genomic sciences than in other scientific fields, possibly because $N \gg M$, which always results in exactly $(\mathbf{y} - \mathbf{b}X)^2 = 0$ for infinitely many $\mathbf{b}$. Thus, it is useless for selecting a limited number of important features among the total N features. Although adding an L2-norm regularization term to Eq (1) as

$$(\mathbf{y} - \mathbf{b}X)^2 + \lambda \mathbf{b}^2 \tag{3}$$

with the positive constant λ > 0 enables selection of a unique $\mathbf{b}$ by minimizing Eq (3) as

$$\mathbf{b} = \mathbf{y}X^T (XX^T + \lambda I)^{-1}, \tag{4}$$

it is not an ideal solution because it no longer satisfies $(\mathbf{y} - \mathbf{b}X)^2 = 0$. Although the solution using the Moore-Penrose pseudoinverse [3], $X^{\dagger}$,

$$\mathbf{b} = \mathbf{y}X^{\dagger} \tag{5}$$

might be better as it satisfies $(\mathbf{y} - \mathbf{b}X)^2 = 0$ under the condition of $\min_{\mathbf{b}} \mathbf{b}^2$, it is unclear whether $\min_{\mathbf{b}} \mathbf{b}^2$ is a good constraint from the biological viewpoint. Adding an L1-norm regularization term [4] to Eq (1)

$$(\mathbf{y} - \mathbf{b}X)^2 + \lambda |\mathbf{b}| \tag{6}$$

can select at most M variables, which is not effective when $N \gg M$, because more than M variables might be biologically informative and should not be neglected. Moreover, adding the L1 norm is known to be a poor strategy when X is not composed of independent vectors, which is very common in genomic science.
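The behaviour of Eqs (4) and (5) when N is much larger than M can be checked numerically. The following is a minimal sketch on synthetic data; the matrix sizes, the random seed, the value of λ, and the function names `ridge_b` and `pinv_b` are illustrative assumptions and are not taken from this study.

```python
import numpy as np

def ridge_b(X, y, lam):
    """Eq (4): b = y X^T (X X^T + lam I)^{-1} for X of shape (N, M), y of shape (M,)."""
    N = X.shape[0]
    # The matrix is symmetric, so solving (X X^T + lam I) b = X y gives the same b.
    return np.linalg.solve(X @ X.T + lam * np.eye(N), X @ y)

def pinv_b(X, y):
    """Eq (5): minimum-norm b satisfying b X = y, via the Moore-Penrose pseudoinverse."""
    return y @ np.linalg.pinv(X)

rng = np.random.default_rng(0)
N, M = 200, 10                      # hypothetical sizes with N >> M
X = rng.standard_normal((N, M))
y = rng.standard_normal(M)

b_pinv = pinv_b(X, y)
b_ridge = ridge_b(X, y, lam=1.0)

res_pinv = np.sum((y - b_pinv @ X) ** 2)    # numerically zero: exact fit
res_ridge = np.sum((y - b_ridge @ X) ** 2)  # strictly positive: no exact fit
rank_X = np.linalg.matrix_rank(X)           # at most M, so X X^T (N x N) is singular
```

In this toy setting the pseudoinverse solution reproduces y exactly while the ridge solution does not, and the rank of X is M, which is why the inverse in Eq (2) does not exist when N exceeds M.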

The second strategy is a projection strategy

$$\mathbf{b} = \mathbf{y}X^T \tag{7}$$

that is equivalent to the maximization of

$$\mathbf{y} \cdot \mathbf{b}X - \frac{1}{2}\mathbf{b}^2 \tag{8}$$

and is employed in PCA- and TD-based unsupervised FE (see below). In the concept of projection pursuit [5] (PP), one seeks the projection vector $\mathbf{b}$ that maximizes the interestingness

$$H(\mathbf{b}X) \tag{9}$$

which is Eq (8) in this study. As $H(\mathbf{b}X)$ is a function of $\mathbf{b}$, it is also denoted as $I(\mathbf{b})$, which is called the projection index. $I(\mathbf{b})$ can be any other function, but it should be chosen such that the biologically most meaningful results are obtained. Upon obtaining the $\mathbf{b}$ that maximizes $I(\mathbf{b})$, we can select the i having larger absolute $b_i$ as mentioned above. In the framework of PP, in a high dimensional system, almost all $\mathbf{b}$ have finite projections [6]. Thus, the only question is whether a given projection is accidental or biologically meaningful.
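The equivalence between Eq (7) and the maximization of Eq (8) can be verified by writing the projection index in components and setting its gradient to zero. The short derivation below uses only the symbols defined above; the explicit summation limits are added for readability.

```latex
I(\mathbf{b}) = \mathbf{y}\cdot\mathbf{b}X - \frac{1}{2}\mathbf{b}^2
  = \sum_{i=1}^{N} b_i \sum_{j=1}^{M} x_{ij} y_j - \frac{1}{2}\sum_{i=1}^{N} b_i^2 ,
\qquad
\frac{\partial I}{\partial b_i} = \sum_{j=1}^{M} x_{ij} y_j - b_i = 0
\;\Longrightarrow\;
b_i = \sum_{j=1}^{M} x_{ij} y_j ,
\quad \text{i.e.} \quad \mathbf{b} = \mathbf{y}X^T ,
\qquad
\frac{\partial^2 I}{\partial b_i \, \partial b_{i'}} = -\delta_{ii'} .
```

Because the second derivative is the negative identity, the index is strictly concave and the stationary point is its unique maximum.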

In genomic science, the projection strategy, Eq (7), is also unpopular. Although the reason for this unpopularity is unclear, it may be explained by the fact that the projection ignores the contribution perpendicular to $\mathbf{y}$, $|(\mathbf{x}_i \cdot \hat{\mathbf{y}})\hat{\mathbf{y}} - \mathbf{x}_i|$, where $\mathbf{x}_i \in \mathbb{R}^M$ is the ith row of X and $\hat{\mathbf{y}}$ is a unit vector parallel to $\mathbf{y}$, defined as $\mathbf{y}/|\mathbf{y}|$. Nevertheless, in contrast to the regression strategy, which requires the computation of $(XX^T)^{-1}$, Eq (7) is always computable even if $N \gg M$, which is a great advantage of the projection strategy when compared with the regression strategy.

Instead of these two strategies, feature selection based on statistical tests [1, 2] is more popular in genomic sciences, as mentioned above. Such methods try to identify genes whose expression is significantly distinct between classes. Despite its popularity, feature selection based on statistical tests has critical problems; in particular, significance is heavily dependent on sample size, M. Even in the case of a small distinction, more significant results are obtained when more samples are considered; this is biologically unreasonable because whether gene expression differs significantly between classes should not be a function of sample size. To compensate for this heavy dependence of significance on sample size, other criteria such as fold change between classes are often employed. Thus, feature selection based on statistical tests is, at best, the best among the worst approaches. If better strategies can be employed, there will be no reason to employ strategies based on statistical tests.

Despite the unpopularity of the projection strategy, it has sometimes been evaluated as more effective [7, 8] than the standard feature selection strategy based on statistical tests. Thus, it is a candidate strategy to replace feature selection based on statistical tests. In this paper, we try to understand why PCA-based unsupervised FE and TD-based unsupervised FE [3] are effective as feature selection based on the projection strategy, since PCA-like as well as TD-like methods have been successfully applied in other fields, too [9-11]. We consider the cases of biomarker identification for kidney cancer [12] and the SARS-CoV-2 infection problem [13]; in these studies, despite unsuccessful results obtained by conventional feature selection based on statistical tests, TD-based unsupervised FE identified biologically reasonable genes (for more details about how PCA- and TD-based unsupervised FE are superior to statistical test-based feature selection tools in these specific examples, see these previous studies [12, 13]).

Materials and methods

Sample R code is available at https://github.com/tagtag/peoj.

Expression profiles

mRNA, miRNA, and gene expression profiles in the first, second, and third data sets can be downloaded from TCGA as well as GEO. Their availability is described in detail in previous studies [12, 13].

Excluding low expressed miRNAs, mRNAs, and genes

To draw Figs 1(B), 2(B) and 3(B), miRNAs, mRNAs, and genes with low expression were screened out. For this, we ranked mRNAs, miRNAs, and genes using $\sum_j |x_{ij}|$, $\sum_j |x_{kj}|$, and $\sum_{jkm} |x_{ijkm}|$, respectively, and selected only the top-ranked ones.

Fig 1. Histogram of raw P-values computed using the null distribution generated by shuffling when miRNAs in the first data set were considered. (A) All miRNAs. (B) Top 500 most expressive miRNAs.

Fig 2. Histogram of raw P-values computed using the null distribution generated by shuffling when the mRNAs in the first data set were considered. (A) All mRNAs. (B) Top 3000 most expressive mRNAs.

Fig 3. Histogram of raw P-values computed using the null distribution generated by shuffling when genes in the third data set were considered. (A) All genes. (B) Top 2780 most expressive genes.

QQplot

QQplot [14] was used to visualize the coincidence between two distributions that do not always have the same number of elements. The qqplot function implemented in R [15] was employed to draw the QQplots (Figs 4 and 5) in this study.
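When two samples differ in size, their quantiles must be matched before they can be plotted against each other. The sketch below shows one common way to do this, by linearly interpolating the longer sorted sample down to the length of the shorter one. It is an illustrative stand-in written in Python, not the R implementation used in this study, and the function name `qq_pairs` is hypothetical.

```python
import numpy as np

def qq_pairs(x, y):
    """Return matched quantiles of x and y; the longer sample is interpolated down."""
    sx = np.sort(np.asarray(x, dtype=float))
    sy = np.sort(np.asarray(y, dtype=float))
    n = min(sx.size, sy.size)

    def shrink(s):
        if s.size == n:
            return s
        # n equally spaced positions along the sorted sample, linearly interpolated.
        positions = np.linspace(0, s.size - 1, n)
        return np.interp(positions, np.arange(s.size), s)

    return shrink(sx), shrink(sy)
```

Plotting the first returned array against the second gives the QQplot; points close to the diagonal indicate that the two sets of P-values are similarly distributed.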

Fig 4. QQplot between P-values computed by TD-based unsupervised FE and projection. (A) mRNA in the first data set. (B) miRNA in the first data set. (C) mRNA in the second data set. (D) miRNA in the second data set.

Fig 5. QQplot of P-values between TD-based unsupervised FE and PP (the third data set).

Null distribution

The null distributions used for computing P-values in Figs 1-3 and 6 were generated by gene order shuffling as follows. First, the order of i was shuffled within each sample of $x_{ij}$ or $x_{ijkm}$, and that of k was shuffled within each sample of $x_{kj}$. Thus, the order of mRNAs, miRNAs, and genes was shuffled such that it differed between samples. Then SVD or TD was applied to the shuffled $x_{ijk}$ or $x_{ijkm}$ to generate $u_{2i}$ and $u_{2k}$ from SVD and $u_{5i}$ from TD; this was repeated one hundred times. The null distributions were composed of the generated singular value vectors, and P-values were computed from them.
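The shuffling procedure can be sketched as follows for the simplest case of a single genes-by-samples matrix. This is a simplified illustration under stated assumptions: it applies SVD directly to one matrix rather than to the product or tensor used in this study, it compares absolute values because the sign of a singular vector is arbitrary, and it uses the common "+1" correction for empirical P-values. All function names are hypothetical.

```python
import numpy as np

def left_singular_vector(X, ell):
    """Return the ell-th (zero-based) left singular vector of X, one entry per row (gene)."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, ell]

def shuffle_within_samples(X, rng):
    """Permute the row (gene) order independently inside every column (sample)."""
    return np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

def shuffled_null(X, ell, n_shuffles=100, seed=0):
    """Pool the ell-th singular vector over n_shuffles shuffled copies of X."""
    rng = np.random.default_rng(seed)
    return np.concatenate(
        [left_singular_vector(shuffle_within_samples(X, rng), ell)
         for _ in range(n_shuffles)]
    )

def empirical_p_values(u, null):
    """Fraction of null values at least as extreme as |u_i| (with a +1 correction)."""
    null_abs = np.sort(np.abs(null))
    # Index of the first null value >= |u_i|; everything from there on is counted.
    n_ge = null_abs.size - np.searchsorted(null_abs, np.abs(u), side="left")
    return (n_ge + 1) / (null_abs.size + 1)
```

A typical use would be `empirical_p_values(left_singular_vector(X, 1), shuffled_null(X, 1))`, where index 1 selects the second singular vector; the resulting P-values would then be adjusted for multiple testing.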

Fig 6. Histogram of raw P-values computed using the null distribution generated by shuffling when the second data set was considered. (A) All miRNAs. (B) All mRNAs.

Results

Fig 7 shows the work flow of this study. In PP, the projection direction is predefined by $\mathbf{y}$ in a supervised manner, whereas if we do not want to set projection directions in advance, we can use those determined by PCA or TD, which we call unsupervised FE. PCA and TD have some advantages that are not shared with PP. For example, projection directions not related to the label $\mathbf{y}$ may carry additional information; in that case, PCA and TD can capture what PP cannot. PCA and TD are also applicable even if a predefined $\mathbf{y}$ is not provided. Thus, PCA and TD have the potential to be applied to a wider range of data sets than PP.

Fig 7. Discussion of work flow used in this study. Tensor decomposition (HOSVD) was applied to tensors and, using the obtained singular value vectors assumed to obey a Gaussian distribution, P-values were attributed to genes. The genes associated with adjusted P-values less than 0.01 were selected. P-values were also computed by shuffling, and the genes associated with adjusted P-values less than 0.1 coincided well with the genes selected by HOSVD. The correspondence between singular value vectors and K-means applied to unfolded matrices is also discussed.

PCA-based unsupervised FE

Before starting to rationalize PCA- and TD-based unsupervised FE, we briefly summarize how they work. The purpose of PCA- and TD-based unsupervised FE is to select biologically sound features (typically genes) based on the given omics data such as gene expression profiles, in an unsupervised manner. In this subsection, we introduce PCA-based unsupervised FE; TD-based unsupervised FE is an advanced version of PCA-based unsupervised FE and will be introduced in the next subsection.

Suppose that we have gene expression data in a matrix form, $X \in \mathbb{R}^{N \times M}$, for N genes measured across M samples. First, we need to standardize X as $\sum_i x_{ij} = 0$ and $\sum_i x_{ij}^2 = N$, as we will attribute principal component (PC) scores to genes whereas PC loadings will be attributed to samples. The ℓth PC score attributed to the ith gene, $u_{\ell i}$, can be obtained as the ith component of the ℓth eigenvector, $\mathbf{u}_\ell \in \mathbb{R}^N$, of the Gram matrix $XX^T \in \mathbb{R}^{N \times N}$, where $X^T$ is the transpose of X, as

$$XX^T \mathbf{u}_\ell = \lambda_\ell \mathbf{u}_\ell \tag{10}$$

where $\lambda_\ell$ is the ℓth eigenvalue. Further, the ℓth PC loading attributed to the jth sample, $v_{\ell j}$, can be obtained as the jth component of the vector $\mathbf{v}_\ell \in \mathbb{R}^M$ defined as

$$\mathbf{v}_\ell = X^T \mathbf{u}_\ell. \tag{11}$$

Notably, $\mathbf{v}_\ell$ is also an eigenvector of the covariance matrix, $X^TX \in \mathbb{R}^{M \times M}$, because

$$X^TX \mathbf{v}_\ell = X^TXX^T \mathbf{u}_\ell = X^T \lambda_\ell \mathbf{u}_\ell = \lambda_\ell \mathbf{v}_\ell. \tag{12}$$

PCA-based unsupervised FE works as follows. First, we need to identify the $\mathbf{v}_\ell$ of interest, which depends on the problem. It might be the one coincident with the sample clusters, or the one with monotonic dependence on some external parameter such as time. After identifying the $\mathbf{v}_\ell$ of interest, we attribute P-values to genes assuming that the components of the corresponding $\mathbf{u}_\ell$ follow a normal distribution,

$$P_i = P_{\chi^2}\left[> \left(\frac{u_{\ell i}}{\sigma_\ell}\right)^2\right] \tag{13}$$

where $P_{\chi^2}[> x]$ is the cumulative $\chi^2$ distribution, i.e., the probability that the argument is larger than x, and $\sigma_\ell$ is the standard deviation. Computed P-values are adjusted based on the Benjamini-Hochberg (BH) criterion [3], and features associated with adjusted P-values less than a specified threshold value can be selected. The reason why such a simple procedure works properly is explained later.
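The steps above (standardization, decomposition, P-values from Eq (13), BH adjustment, and thresholding) can be sketched as follows. This is a minimal illustration, not the authors' R code. It assumes a $\chi^2$ distribution with one degree of freedom, for which the upper tail equals erfc(sqrt(x/2)), and it obtains the eigenvectors of the Gram matrix through SVD. The function names and the zero-based index `ell` are hypothetical.

```python
import math
import numpy as np

def standardize_columns(X):
    """Scale each sample (column) so that sum_i x_ij = 0 and sum_i x_ij^2 = N."""
    Xc = X - X.mean(axis=0, keepdims=True)
    return Xc / Xc.std(axis=0, keepdims=True)

def chi2_sf_1df(x):
    """Upper tail of the chi-squared distribution with one degree of freedom (assumed)."""
    return np.array([math.erfc(math.sqrt(v / 2.0)) for v in np.asarray(x, dtype=float)])

def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest P-value downwards.
    adj_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(n)
    adj[order] = np.minimum(adj_sorted, 1.0)
    return adj

def pca_unsupervised_fe(X, ell, threshold=0.01):
    """Select rows (genes) whose ell-th PC score has a small adjusted P-value, Eq (13)."""
    Xs = standardize_columns(X)
    U, _, _ = np.linalg.svd(Xs, full_matrices=False)
    u = U[:, ell]                              # PC scores attributed to genes
    p = chi2_sf_1df((u / u.std()) ** 2)
    adj = bh_adjust(p)
    return np.flatnonzero(adj < threshold), adj
```

In practice `ell` is chosen by inspecting the sample-side vectors, as described above; the sketch leaves that choice to the caller.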

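Finally, we would like to emphasize the equivalence between singular value decomposition (SVD) and PCA. Suppose we have the SVD of X as

$$x_{ij} = \sum_{\ell=1}^{\min(N,M)} \lambda_\ell u_{\ell i} v_{\ell j}. \tag{14}$$

It is straightforward to show

$$XX^T \mathbf{u}_\ell = \lambda_\ell \mathbf{u}_\ell \tag{15}$$
$$X^TX \mathbf{v}_\ell = \lambda_\ell \mathbf{v}_\ell \tag{16}$$

where $\mathbf{u}_\ell = (u_{\ell 1}, u_{\ell 2}, \cdots, u_{\ell N})^T$ and $\mathbf{v}_\ell = (v_{\ell 1}, v_{\ell 2}, \cdots, v_{\ell M})^T$ (strictly, the eigenvalue appearing in Eqs (15) and (16) is the square of the singular value appearing in Eq (14)). Thus, SVD and PCA are mathematically equivalent problems.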
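The equivalence can be confirmed numerically on a small random matrix. The sketch below is illustrative only (the matrix size and seed are arbitrary); it also makes explicit that the eigenvalues of the two Gram matrices are the squares of the singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 4))                  # toy matrix: N = 6, M = 4
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Eigenvalues of the two Gram matrices (eigvalsh returns them in ascending order).
evals_gene = np.linalg.eigvalsh(X @ X.T)[::-1][: s.size]
evals_sample = np.linalg.eigvalsh(X.T @ X)[::-1]

# Left and right singular vectors are eigenvectors of X X^T and X^T X, respectively.
gene_ok = np.allclose(X @ X.T @ U, U * s**2)
sample_ok = np.allclose(X.T @ X @ Vt.T, Vt.T * s**2)
```

Both `gene_ok` and `sample_ok` evaluate to True, and both sets of eigenvalues agree with `s**2`.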

TD-based unsupervised FE

TD-based unsupervised FE works quite similarly to PCA-based unsupervised FE. Instead of PCA, we apply TD to $x_{ijk} \in \mathbb{R}^{N \times M \times K}$, that is, for example, the expression of the ith gene measured in the kth tissue of the jth person (even though we consider a three-mode tensor here, extension to higher mode tensors is straightforward). To obtain TD, we employ the higher-order singular value decomposition [3] (HOSVD) as

$$x_{ijk} = \sum_{\ell_1} \sum_{\ell_2} \sum_{\ell_3} G(\ell_1 \ell_2 \ell_3) u_{\ell_1 j} u_{\ell_2 k} u_{\ell_3 i} \tag{17}$$

where $G(\ell_1 \ell_2 \ell_3) \in \mathbb{R}^{M \times K \times N}$ is a core tensor, and $u_{\ell_1 j} \in \mathbb{R}^{M \times M}$, $u_{\ell_2 k} \in \mathbb{R}^{K \times K}$, and $u_{\ell_3 i} \in \mathbb{R}^{N \times N}$ are singular value matrices. After identifying the $u_{\ell_1 j}$ and $u_{\ell_2 k}$ of interest, for instance, those representing the distinction between healthy controls and patients as well as tissue-specific expression, we seek the $\ell_3$ associated with the $G(\ell_1 \ell_2 \ell_3)$ having the largest absolute value given $\ell_1$ and $\ell_2$. Then, using the identified $\ell_3$, we attribute P-values to the ith feature as in the case of PCA-based unsupervised FE,

$$P_i = P_{\chi^2}\left[> \left(\frac{u_{\ell_3 i}}{\sigma_{\ell_3}}\right)^2\right] \tag{18}$$

where $\sigma_{\ell_3}$ is the standard deviation. Computed P-values are adjusted based on the BH criterion, and features associated with adjusted P-values less than a specified threshold value can be selected. The reason why such a simple procedure works properly is explained later.
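HOSVD can be sketched with plain NumPy by applying SVD to each unfolding of the tensor and then projecting the tensor onto the resulting bases to obtain the core tensor. The code below is a minimal illustration, not the implementation used in this study. It assumes that the tensor axes are ordered (j, k, i) so that the core is indexed as in Eq (17), and all function names are hypothetical.

```python
import numpy as np

def unfold(T, mode):
    """Matricize T so that axis `mode` indexes the rows."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(T):
    """Return (core, factors); factors[n] holds the singular vectors of mode n as columns."""
    factors = []
    for mode in range(T.ndim):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U)
    core = T
    for mode, U in enumerate(factors):
        # Mode product with U^T: contract axis `mode` of the tensor with the rows of U.
        core = np.moveaxis(np.tensordot(U.T, core, axes=(1, mode)), 0, mode)
    return core, factors

def reconstruct(core, factors):
    """Invert hosvd(): apply every factor matrix back onto the core tensor."""
    T = core
    for mode, U in enumerate(factors):
        T = np.moveaxis(np.tensordot(U, T, axes=(1, mode)), 0, mode)
    return T

def select_gene_component(core, l1, l2):
    """Zero-based l3 with the largest |G(l1, l2, l3)| for the given l1 and l2."""
    return int(np.argmax(np.abs(core[l1, l2, :])))
```

Given the selected component, the corresponding column of the gene-mode factor matrix plays the role of $u_{\ell_3 i}$ in Eq (18).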

Rationalization of PCA- and TD-based unsupervised FE

To explain why PCA- and TD-based unsupervised FE work rather well, we consider two recent works [12, 13], in which the superiority of PCA- and/or TD-based unsupervised FE over conventional statistical methods was shown; in these studies, conventional statistical test-based methods failed to select a reasonable number of genes whereas TD-based unsupervised FE successfully selected a biologically reasonable restricted number of genes.

In the first study [12], two independent sets of data including the mRNA and miRNA expression of kidney cancer and normal kidney were analyzed in an integrated manner using PCA as well as TD-based unsupervised FE.

The first data set

The first data set comprised M = 324 samples including 253 kidney tumors and 71 normal kidney tissues. The expression of N mRNAs and K miRNAs was formatted as matrices $x_{ij} \in \mathbb{R}^{N \times M}$ and $x_{kj} \in \mathbb{R}^{K \times M}$, respectively. The three-mode tensor $x_{ijk} \in \mathbb{R}^{N \times M \times K}$ was generated as

$$x_{ijk} = x_{ij} x_{kj}. \tag{19}$$

As the data were too large to be loaded into the memory available in a standard stand-alone server, it was impossible to obtain TD

$$x_{ijk} = \sum_{\ell_1} \sum_{\ell_2} \sum_{\ell_3} G(\ell_1 \ell_2 \ell_3) u_{\ell_1 i} u_{\ell_2 j} u_{\ell_3 k}. \tag{20}$$

Instead, we generated

$$x_{ik} = \sum_j x_{ijk} \tag{21}$$

and SVD was applied to $x_{ik}$ as

$$x_{ik} = \sum_{\ell = \ell_1 = \ell_3}^{\min(N,K)} \lambda_\ell u_{\ell_1 i} u_{\ell_3 k} \tag{22}$$

to obtain $u_{\ell_1 i}$ and $u_{\ell_3 k}$ approximately. Missing singular value vectors attributed to mRNA and miRNA samples were approximately recovered using the equations

$$u^{\mathrm{mRNA}}_{\ell_1 j} = \sum_{i=1}^{N} x_{ij} u_{\ell_1 i} \tag{23}$$
$$u^{\mathrm{miRNA}}_{\ell_3 j} = \sum_{k=1}^{K} x_{kj} u_{\ell_3 k} \tag{24}$$

respectively. Although we do not intend to insist that these approximations are precise enough, we decided to employ them since they turned out to work well empirically. After investigating the obtained $u^{\mathrm{mRNA}}_{\ell_1 j}$ and $u^{\mathrm{miRNA}}_{\ell_3 j}$, we realized that $\ell_1 = \ell_3 = 2$ are coincident with the distinction between tumors and normal tissues; therefore, we attributed P-values to mRNAs and miRNAs using $u_{2i}$ and $u_{2k}$, respectively, with the equations

$$P_i = P_{\chi^2}\left[> \left(\frac{u_{2i}}{\sigma_2}\right)^2\right] \tag{25}$$
$$P_k = P_{\chi^2}\left[> \left(\frac{u_{2k}}{\sigma_2}\right)^2\right]. \tag{26}$$

These P-values were corrected by the BH criterion and we selected 72 mRNAs and 11 miRNAs associated with adjusted P-values less than 0.01, respectively.
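The reason the reduced matrix is tractable is that summing the product tensor over samples is simply a matrix product, so the full tensor never needs to be stored. The sketch below illustrates this on toy data; the sizes, seed, and variable names are assumptions chosen for illustration and are far smaller than the real data.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, M = 50, 8, 12                     # hypothetical numbers of mRNAs, miRNAs, samples
X_mrna = rng.standard_normal((N, M))    # x_ij
X_mirna = rng.standard_normal((K, M))   # x_kj

# Eq (21) without materializing the N x M x K tensor of Eq (19).
X_ik = X_mrna @ X_mirna.T

# Reference computation through the explicit tensor, feasible only for toy sizes.
X_ijk = np.einsum("ij,kj->ijk", X_mrna, X_mirna)
same = np.allclose(X_ik, X_ijk.sum(axis=1))

# Eq (22): SVD of the reduced matrix.
U, lam, Vt = np.linalg.svd(X_ik, full_matrices=False)

# Eqs (23) and (24): approximate sample-side vectors, one column per component.
u_mrna_samples = X_mrna.T @ U
u_mirna_samples = X_mirna.T @ Vt.T
```

The columns of `u_mrna_samples` and `u_mirna_samples` are what one would inspect to find the component that separates tumors from normal tissues.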

The second data set

The second data set comprised M = 34 samples including 17 kidney tumors and 17 normal kidney tissues. The same procedures applied to the first data set were also applied to the second data set, and we selected 209 mRNAs and 3 miRNAs associated with adjusted P-values less than 0.01, respectively. Although various biological evaluations were performed for mRNAs and miRNAs selected using the first data set, the most remarkable achievement was that all three miRNAs selected using the second data set were included in the 11 miRNAs selected using the first data set, and there were as many as 11 common mRNAs selected between the first and second data sets. If we consider that there are as many as several hundred miRNAs and a few tens of thousands of mRNAs available, these overlaps are a great achievement, as these two data sets are completely independent of each other.

Comparisons with PP

To understand why such simple procedures can work well in the framework of PP, we replaced the singular value vectors attributed to samples with projections. For this, we applied PP as mentioned above, with the labels defined as

$$y_j = \begin{cases} -\dfrac{M}{M_N}, & j \le M_N \\ \dfrac{M}{M_T}, & j > M_N \end{cases} \tag{27}$$

where $M_N$ and $M_T$ are the numbers of normal tissues and cancer samples, respectively, and $M_N + M_T = M$. Then we applied PP as

$$b_i = \sum_{j=1}^{M} x_{ij} y_j \tag{28}$$
$$b_k = \sum_{j=1}^{M} x_{kj} y_j. \tag{29}$$

Since $b_i$ and $b_k$ are expected to play the roles of $u_{2i}$ and $u_{2k}$ in Eqs (25) and (26), respectively, we used the absolute values of $b_i$ and $b_k$ to select mRNAs and miRNAs that are presumably coincident with the distinction between tumors and normal tissues. P-values are attributed to mRNAs and miRNAs as

$$P_i = P_{\chi^2}\left[> \left(\frac{b_i}{\sigma_b}\right)^2\right] \tag{30}$$
$$P_k = P_{\chi^2}\left[> \left(\frac{b_k}{\sigma_b}\right)^2\right]. \tag{31}$$

These P-values were corrected by the BH criterion and we selected 78 mRNAs and 13 miRNAs for the first data set and 194 mRNAs and 3 miRNAs for the second data set, associated with adjusted P-values less than 0.01, respectively.
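The projection step itself is a single matrix-vector product. The sketch below illustrates Eqs (27) to (31) under stated assumptions: samples are ordered with normal tissues first, the label values are read as plain ratios of sample counts, and the $\chi^2$ distribution has one degree of freedom. The multiple-testing correction is omitted, and the function names are hypothetical.

```python
import math
import numpy as np

def class_labels(m_normal, m_tumor):
    """Eq (27): -M/M_N for normal samples, +M/M_T for tumor samples (normal first)."""
    M = m_normal + m_tumor
    return np.concatenate(
        [np.full(m_normal, -M / m_normal), np.full(m_tumor, M / m_tumor)]
    )

def projection_p_values(X, y):
    """Eqs (28) and (30): project every row of X onto y and attribute P-values."""
    b = X @ y
    z2 = (b / b.std()) ** 2
    # Upper tail of the chi-squared distribution with one degree of freedom (assumed).
    p = np.array([math.erfc(math.sqrt(v / 2.0)) for v in z2])
    return b, p
```

With 71 normal and 253 tumor samples, `class_labels(71, 253)` gives a label vector of length 324 whose entries sum to zero, so a gene with identical expression in all samples receives a projection of zero.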

We then estimated the coincidence of genes between TD and PP; Tables 1-4 list the comparisons of genes between TD-based unsupervised FE and PP, Eqs (30) or (31), and demonstrate a high coincidence with each other. Fig 4 shows the comparisons of $P_i$ and $P_k$ between TD-based unsupervised FE and PP, Eqs (30) or (31). It is obvious that the smaller P-values used for gene selection as well as the overall distributions of P-values are coincident between TD-based unsupervised FE and PP, Eqs (30) or (31).

Table 1. Confusion matrix of selected mRNAs between TD-based unsupervised FE and PP in the first data set. P-value computed by Fisher's exact test is $1.90 \times 10^{-149}$.

| TD-based unsupervised FE | PP: adjusted $P_i$ > 0.01 | PP: adjusted $P_i$ < 0.01 |
|---|---|---|
| adjusted $P_i$ > 0.01 | 19447 | 17 |
| adjusted $P_i$ < 0.01 | 11 | 61 |

Table 2. Confusion matrix of selected miRNAs between TD-based unsupervised FE and PP in the first data set. P-value computed by Fisher's exact test is $2.76 \times 10^{-23}$.

| TD-based unsupervised FE | PP: adjusted $P_k$ > 0.01 | PP: adjusted $P_k$ < 0.01 |
|---|---|---|
| adjusted $P_k$ > 0.01 | 812 | 2 |
| adjusted $P_k$ < 0.01 | 0 | 11 |

Table 3. Confusion matrix of selected mRNAs between TD-based unsupervised FE and PP in the second data set. P-value computed by Fisher's exact test is 0.0 within numerical accuracy (i.e., smaller than the smallest number representable at the given numerical accuracy).

| TD-based unsupervised FE | PP: adjusted $P_i$ > 0.01 | PP: adjusted $P_i$ < 0.01 |
|---|---|---|
| adjusted $P_i$ > 0.01 | 33781 | 8 |
| adjusted $P_i$ < 0.01 | 23 | 186 |

Table 4. Confusion matrix of selected miRNAs between TD-based unsupervised FE and PP in the second data set. P-value computed by Fisher's exact test is $1.87 \times 10^{-7}$.

| TD-based unsupervised FE | PP: adjusted $P_k$ > 0.01 | PP: adjusted $P_k$ < 0.01 |
|---|---|---|
| adjusted $P_k$ > 0.01 | 316 | 0 |
| adjusted $P_k$ < 0.01 | 0 | 3 |
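The significance of such an overlap can be illustrated with a small exact computation based on the hypergeometric distribution. The sketch below uses a one-sided (upper-tail) Fisher's exact test; the sidedness and the function name are assumptions, since the text does not specify how the reported P-values were computed.

```python
from math import comb

def fisher_exact_upper(n00, n01, n10, n11):
    """One-sided Fisher's exact test for [[n00, n01], [n10, n11]].

    n11 counts features selected by both methods; the returned value is the
    probability of an overlap of at least n11 with the table margins held fixed.
    """
    n = n00 + n01 + n10 + n11
    row = n10 + n11          # selected by the first method
    col = n01 + n11          # selected by the second method
    total = comb(n, row)
    tail = sum(
        comb(col, x) * comb(n - col, row - x)
        for x in range(n11, min(row, col) + 1)
    )
    return tail / total

# Counts of the miRNA confusion matrix for the second data set.
p_mirna_second = fisher_exact_upper(316, 0, 0, 3)
```

For these counts the result is 1/5359519, approximately $1.87 \times 10^{-7}$, in agreement with the value reported for the miRNAs of the second data set.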

Equivalence between K-means and PCA

To understand these excellent and unexpected coincidences between TD-based unsupervised FE and PP, we first consider the relationship between PCA and PP and later relate it to TD. PCA is known to be equivalent to K-means [3]; the space spanned by the centroids of optimal sample clusters can be reproduced by the PC scores attributed to the features. Suppose that we have $x_{ij} \in \mathbb{R}^{N \times M}$, which is the value of the ith feature of the jth sample. The M samples are supposed to be clustered into S clusters. The centroid of the sth cluster, $\mathbf{m}_s \in \mathbb{R}^N$, is defined as

$$\mathbf{m}_s = \frac{1}{n_s} \sum_{j \in C_s} \mathbf{x}_j \tag{32}$$

where $\mathbf{x}_j = (x_{1j}, x_{2j}, \cdots, x_{Nj})^T \in \mathbb{R}^N$, $C_s$ is the set of js that belong to the sth cluster, and $n_s$ is the size of the sth cluster. Here we define the projection of any vector $\mathbf{x} \in \mathbb{R}^N$ onto the centroid subspace as

$$S_b \mathbf{x} = \sum_{s=1}^{S} n_s (\mathbf{m}_s^T \cdot \mathbf{x}) \mathbf{m}_s \tag{33}$$

where

$$S_b = \sum_{s=1}^{S} n_s \mathbf{m}_s \otimes \mathbf{m}_s^T \in \mathbb{R}^{N \times N} \tag{34}$$

and ⊗ is the Kronecker product. $S_b$ is also known to be represented as

$$S_b = \sum_{s=1}^{S} X \mathbf{h}_s \mathbf{h}_s^T X^T = X \left( \sum_{s=1}^{S} \mathbf{h}_s \mathbf{h}_s^T \right) X^T \tag{35}$$

where $\mathbf{h}_s \in \mathbb{R}^M$ is

$$h_{js} = \begin{cases} \dfrac{1}{\sqrt{n_s}}, & j \in C_s \\ 0, & j \notin C_s, \end{cases} \tag{36}$$

which takes non-zero values only when the jth sample belongs to the sth cluster. K-means is an algorithm to find clusters that minimize

$$J_S = \sum_{s=1}^{S} \sum_{j \in C_s} (\mathbf{x}_j - \mathbf{m}_s)^2. \tag{37}$$

Minimization of $J_S$ is known to be equivalent to the maximization of $\mathrm{Tr}\,S_b$, the trace of the matrix $S_b$. It is known that

$$\min_{\{\mathbf{h}_s\}} S_b = \sum_{\ell=1}^{S-1} \lambda_\ell \mathbf{u}_\ell \mathbf{u}_\ell^T \tag{38}$$

where $\mathbf{u}_\ell \in \mathbb{R}^N$ is the vector whose components are the ℓth PC scores attributed to the features and which is an eigenvector of the Gram matrix as

$$XX^T \mathbf{u}_\ell = \lambda_\ell \mathbf{u}_\ell. \tag{39}$$

If we compare Eq (35) with Eq (38), we notice that $\sum_{s=1}^{S} X \mathbf{h}_s \mathbf{h}_s^T X^T$ corresponds to $\sum_{\ell=1}^{S-1} \lambda_\ell \mathbf{u}_\ell \mathbf{u}_\ell^T$, and PCA can give us an optimal centroid subspace, $S_b$, even without realizing the clusters by K-means, i.e., in a fully unsupervised manner.

First, when the clusters are the solution of K-means, the centroid subspace can be represented by the PC scores, which can also be expressed by $X\mathbf{h}_s$. $\mathbf{h}_s$ is clearly coincident with $y_j$ defined in Eq (27). This means that PP employing $\mathbf{u}_\ell$ as $\mathbf{b}$ should result in projection onto the centroid subspace when $y_j$ is coincident with the clusters. Here we define $y_j$, Eq (27), such that it represents the distinction between tumors and normal tissues, which should be detected by K-means. This explains why TD-based unsupervised FE works well and why PP can be replaced with TD-based unsupervised FE. To our knowledge, this is the first rationalization of why TD- and PCA-based unsupervised FE work well.
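The identities relating the centroids, the indicator vectors, and the K-means objective can be checked numerically. The sketch below assumes the normalization $h_{js} = 1/\sqrt{n_s}$ for members of cluster s, which is the choice that makes the centroid form and the indicator form of $S_b$ agree; the function names and the toy cluster labels are hypothetical.

```python
import numpy as np

def indicator_vectors(labels, n_clusters):
    """Columns h_s with h_js = 1/sqrt(n_s) if sample j is in cluster s, else 0."""
    labels = np.asarray(labels)
    H = np.zeros((labels.size, n_clusters))
    for s in range(n_clusters):
        members = labels == s
        H[members, s] = 1.0 / np.sqrt(members.sum())
    return H

def between_cluster_scatter(X, labels, n_clusters):
    """S_b = sum_s n_s m_s m_s^T computed directly from the cluster centroids."""
    labels = np.asarray(labels)
    Sb = np.zeros((X.shape[0], X.shape[0]))
    for s in range(n_clusters):
        cols = X[:, labels == s]
        m = cols.mean(axis=1)
        Sb += cols.shape[1] * np.outer(m, m)
    return Sb

def kmeans_objective(X, labels, n_clusters):
    """J_S: summed squared distances of samples (columns) to their cluster centroid."""
    labels = np.asarray(labels)
    J = 0.0
    for s in range(n_clusters):
        cols = X[:, labels == s]
        J += np.sum((cols - cols.mean(axis=1, keepdims=True)) ** 2)
    return J
```

For any labelling, `X @ H @ H.T @ X.T` equals the centroid-based $S_b$, and $J_S$ equals the trace of $XX^T$ minus the trace of $S_b$. Since the first trace does not depend on the clustering, minimizing $J_S$ is the same as maximizing $\mathrm{Tr}\,S_b$.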

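One might wonder whether the above explanation, which concerns PCA, is applicable given that TD was applied to the first and second data sets. This gap can be explained as follows. The tensor $x_{ijk}$ was generated as the product of $x_{ij}$ and $x_{kj}$. Suppose these two are decomposed as

$$x_{ij} = \sum_{\ell} \lambda_\ell u_{\ell i} v_{\ell j} \tag{40}$$
$$x_{kj} = \sum_{\ell'} \lambda'_{\ell'} u'_{\ell' k} v'_{\ell' j}. \tag{41}$$

If $v_{\ell j} = v'_{\ell j}$, then

$$x_{ik} = \sum_j x_{ij} x_{kj} = \sum_j \sum_{\ell} \lambda_\ell u_{\ell i} v_{\ell j} \sum_{\ell'} \lambda'_{\ell'} u'_{\ell' k} v'_{\ell' j} \tag{42}$$
$$= \sum_{\ell} \sum_{\ell'} \lambda_\ell \lambda'_{\ell'} u_{\ell i} u'_{\ell' k} \sum_j v_{\ell j} v'_{\ell' j} \tag{43}$$
$$= \sum_{\ell} \sum_{\ell'} \lambda_\ell \lambda'_{\ell'} u_{\ell i} u'_{\ell' k} \delta_{\ell \ell'} = \sum_{\ell} \lambda_\ell \lambda'_{\ell} u_{\ell i} u'_{\ell k}. \tag{44}$$

This means that if $v_{\ell j} = v'_{\ell j}$, the SVD of $x_{ik}$ gives the $u_{\ell i}$ and $u'_{\ell k}$ that are obtained when SVD is applied to $x_{ij}$ and $x_{kj}$. Here $u^{\mathrm{mRNA}}_{2j}$ is highly correlated with $u^{\mathrm{miRNA}}_{2j}$ [12]. This is coincident with the requirement $v_{\ell j} = v'_{\ell j}$. As SVD is equivalent to PCA, this might explain why TD-based unsupervised FE works well even though the above rationalization applies only to PCA.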
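The argument can be illustrated with two synthetic matrices that share their sample-side singular vectors. The sketch below constructs such matrices explicitly; the sizes, singular values, seed, and variable names are arbitrary assumptions chosen so that the products of singular values are distinct and already in descending order.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, M, r = 7, 5, 6, 3

# Orthonormal columns: gene-side vectors of each matrix and shared sample-side vectors.
U = np.linalg.qr(rng.standard_normal((N, r)))[0]
Up = np.linalg.qr(rng.standard_normal((K, r)))[0]
V = np.linalg.qr(rng.standard_normal((M, r)))[0]

lam = np.array([5.0, 3.0, 1.0])
lam_p = np.array([4.0, 2.0, 1.5])

X = U @ np.diag(lam) @ V.T        # Eq (40)
Xp = Up @ np.diag(lam_p) @ V.T    # Eq (41) with identical sample-side vectors

X_ik = X @ Xp.T                   # sum over samples j, as in Eq (42)
U2, s2, Vt2 = np.linalg.svd(X_ik)

# Singular vectors are defined up to sign, so absolute inner products are compared.
overlap_u = np.abs(U2[:, :r].T @ U)
overlap_up = np.abs(Vt2[:r] @ Up)
```

The leading singular values of the product matrix equal the element-wise products of the two sets of singular values, and both overlap matrices are identity matrices, i.e., the SVD of the product recovers the gene-side vectors of the two original matrices.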

The third data set

Next, we would like to extend the above discussion to TD. Therefore, we consider a third data set analyzed in another study [13], where we performed in silico drug discovery for SARS-CoV-2 by applying TD-based unsupervised FE to the gene expression profiles of human cell lines infected with SARS-CoV-2. The third data set comprises five cell lines infected with either mock (control) or SARS-CoV-2, including three biological replicates. It is formatted as a tensor, $x_{ijkm} \in \mathbb{R}^{N \times 5 \times 2 \times 3}$, that represents the expression of the ith gene of the jth cell line from the infected (k = 1) or control (k = 2) group in the mth biological replicate. HOSVD was applied to $x_{ijkm}$ and we got

$$x_{ijkm} = \sum_{\ell_1=1}^{5} \sum_{\ell_2=1}^{2} \sum_{\ell_3=1}^{3} \sum_{\ell_4=1}^{N} G(\ell_1 \ell_2 \ell_3 \ell_4) u_{\ell_1 j} u_{\ell_2 k} u_{\ell_3 m} u_{\ell_4 i}. \tag{45}$$

In this study, we selected $\ell_1 = 1$, $\ell_2 = 2$, $\ell_3 = 1$ based on biological discussions. We then realized that G(5, 1, 2, 1), i.e., the element with $\ell_4 = 5$, has the largest absolute value given $\ell_1 = 1$, $\ell_2 = 2$, $\ell_3 = 1$. Thus, $u_{5i}$ was used to attribute P-values to gene i using

$$P_i = P_{\chi^2}\left[> \left(\frac{u_{5i}}{\sigma_5}\right)^2\right] \tag{46}$$

and the obtained P-values were corrected using the BH criterion; further, 163 genes associated with adjusted P-values less than 0.01 were selected. We now relate TD to the above discussion about PCA. Because of the HOSVD algorithm, $u_{\ell_4 i}$ can also be obtained by applying SVD to the unfolded matrix, $X \in \mathbb{R}^{N \times 30}$, whose 30 columns correspond to the 30 combinations of j, k, m. Here we select $\ell_1 = 1$, $\ell_2 = 2$, $\ell_3 = 1$ so that the gene expression is independent of the cell lines and biological replicates and has opposite signs between the control and infected cells. Thus, two clusters are expected, which correspond to the control and infected cell lines, respectively. The reason why $\ell_4 = 5$ is selected is simply that $u_{5i}$ spans the centroid subspace coincident with these two clusters. Thus, in this sense, the above discussion about PCA can be directly applied to this result.

To confirm this, $y_{jkm}$ was taken to be

$$y_{jkm} = \alpha_j \beta_k \gamma_m \tag{47}$$
$$\alpha_j = 1 \tag{48}$$
$$\beta_k = (-1)^k \tag{49}$$
$$\gamma_m = 1 \tag{50}$$

such that it represented the distinction between k = 1 and k = 2 (i.e., that between infected and control cell lines), where $y_{jkm} \in \mathbb{Z}^{5 \times 2 \times 3}$, $\alpha_j \in \mathbb{Z}^5$, $\beta_k \in \mathbb{Z}^2$, and $\gamma_m \in \mathbb{Z}^3$. Then PP was performed as

$$b_i = \sum_{j,k,m} x_{ijkm} y_{jkm}. \tag{51}$$

P-values were attributed to genes as

$$P_i = P_{\chi^2}\left[> \left(\frac{b_i}{\sigma_b}\right)^2\right] \tag{52}$$

and 155 genes associated with corrected P-values less than 0.01 were selected, where $b_i$ is expected to play the role of $u_{5i}$ in Eq (46). Table 5 shows the high coincidence of selected genes between TD-based unsupervised FE and PP. Fig 5 shows the overall coincidence of the distributions of P-values between TD-based unsupervised FE and PP. Thus, the reason why TD-based unsupervised FE works well is explained by the ability of singular value vectors to generate a centroid subspace of clusters coincident with control and infected cell lines.
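For a tensor, the projection is a contraction over all sample-side indices at once. The sketch below illustrates the label tensor and the projection on random toy data; the number of genes, the seed, and the variable names are assumptions, and zero-based array index 0 along the k axis stands for k = 1 (infected) and index 1 for k = 2 (control).

```python
import numpy as np

rng = np.random.default_rng(4)
N = 40                                          # hypothetical number of genes
X = rng.standard_normal((N, 5, 2, 3))           # x_ijkm

alpha = np.ones(5)                              # no dependence on the cell line j
beta = np.array([(-1) ** k for k in (1, 2)])    # -1 for infected (k=1), +1 for control (k=2)
gamma = np.ones(3)                              # no dependence on the replicate m

# Label tensor as an outer product of the three vectors.
y = np.einsum("j,k,m->jkm", alpha, beta, gamma)

# Projection: contract the expression tensor with the label tensor.
b = np.einsum("ijkm,jkm->i", X, y)

# Equivalent direct form: control total minus infected total for every gene.
b_direct = X[:, :, 1, :].sum(axis=(1, 2)) - X[:, :, 0, :].sum(axis=(1, 2))
```

The two computations agree, which makes explicit that this projection measures, for each gene, the difference between control and infected expression summed over cell lines and replicates.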

Table 5. Confusion matrix of selected genes between TD-based unsupervised FE and PP in the third data set. P-value computed by Fisher's exact test is $1.40 \times 10^{-241}$.

| TD-based unsupervised FE | PP: adjusted $P_i$ > 0.01 | PP: adjusted $P_i$ < 0.01 |
|---|---|---|
| adjusted $P_i$ > 0.01 | 21582 | 52 |
| adjusted $P_i$ < 0.01 | 60 | 103 |

One might wonder why TD is needed if $u_{\ell_4 i}$ can be computed by applying SVD to the unfolded matrix. To understand this, we compared $y_{jkm}$ with $v_{5(jkm)}$, which is obtained by applying SVD to the unfolded matrix and corresponds to $u_{5i}$, as well as with $u_{1j}u_{2k}u_{1m}$. While $u_{1j}u_{2k}u_{1m}$ is well coincident with $y_{jkm}$, $v_{5(jkm)}$ is not (Fig 8). Thus, we need to apply TD to $x_{ijkm}$ to obtain singular value vectors attributed to samples that are coincident with the two clusters; these cannot be obtained when SVD is applied to an unfolded matrix.

Fig 8. Comparisons between $y_{jkm}$ and either $v_{5(jkm)}$ or $u_{1j}u_{2k}u_{1m}$. Red straight lines indicate linear regressions.

Rationalization of threshold P-values

As we have successfully shown that TD as well as PCA are equivalent to PP that aims to maximize projection onto the centroid subspace of clusters coincident with the desired distinction (cancer vs. normal tissue or control vs. infected cell lines), we would next like to rationalize the P-values computed by the $\chi^2$ distribution and the threshold value of P = 0.01, which have long been employed to select DEGs with PCA- and TD-based unsupervised FE. Because the distribution of projections in the limit of an infinite number of samples is proven to be always Gaussian [6], this null hypothesis might seem reasonable. Nonetheless, the distribution of individual gene expression is far from Gaussian and is rather close to a negative binomial distribution, and when the number of samples is not large enough, the distribution of projections does not converge to a Gaussian distribution at all. Thus, a more straightforward rationalization is needed. Therefore, we generated a null distribution by shuffling i in each sample and recomputed the singular value vectors, $u_{\ell_1 i}$ (for mRNAs in the first and the second data sets), $u_{\ell_3 k}$ (for miRNAs in the first and the second data sets), and $u_{5i}$ (for genes in the third data set). Then P-values were recomputed using the generated null distribution and were corrected using the BH criterion to obtain genes associated with significant adjusted P-values. In the following, we apply the shuffling to the first, second, and third data sets and select genes using the P-values obtained by shuffling. The coincidence of selected genes and of the distributions of P-values between PCA or TD and shuffling is then estimated. These evaluations enable us to discuss the suitability of the threshold P-values.

Fig 1(A) shows the histogram of raw P-values computed using the null distribution generated by shuffling one hundred times when the miRNAs in the first data set were considered. As there are obviously too many P-values near 1, we excluded some miRNAs with low expression to obtain a P-value distribution more coincident with the null distribution. Fig 1(B) shows the histogram of raw P-values when the analysis is restricted to the top 500 most expressive miRNAs; this seems more coincident with the null distribution. We then found that twelve miRNAs are associated with adjusted P-values less than 0.1. Table 6 lists the comparison of selected miRNAs between TD-based unsupervised FE and the null distribution generated by shuffling. Although the threshold P-values differ between the two, the selected miRNAs are quite coincident. A threshold P-value of 0.01 was empirically employed for PCA- and TD-based unsupervised FE as it often gave us biologically reasonable results. P = 0.01 when a Gaussian distribution is assumed as the null hypothesis thus corresponds to P = 0.1 when the null distribution is generated by shuffling. Although this discrepancy must be resolved in the future, we conclude that their performances are quite similar.

Table 6. Confusion matrix of selected miRNAs between TD-based unsupervised FE and shuffling in the first data set. P-value computed by Fisher's exact test is $1.28 \times 10^{-21}$.

| TD-based unsupervised FE | Shuffling: adjusted $P_k$ > 0.1 | Shuffling: adjusted $P_k$ < 0.1 |
|---|---|---|
| adjusted $P_k$ > 0.01 | 488 | 1 |
| adjusted $P_k$ < 0.01 | 0 | 11 |

Fig 2(A) shows the histogram of raw P-values computed using the null distribution generated by shuffling one hundred times when mRNAs in the first data set were considered. As there are obviously too many P-values near 1, we excluded some mRNAs with low expression to obtain a P-value distribution more coincident with the null distribution. Fig 2(B) shows the histogram of raw P-values when the analysis is restricted to the top 3000 most expressive mRNAs; this seems more coincident with the null distribution. We then found that 69 mRNAs are associated with adjusted P-values less than 0.1. Table 7 lists the comparison of selected mRNAs between TD-based unsupervised FE and the null distribution generated by shuffling. Although the threshold P-values differ between the two, the selected mRNAs are quite coincident. A threshold P-value of 0.01 was empirically employed for PCA- and TD-based unsupervised FE as it often gave us biologically reasonable results. P = 0.01 when a Gaussian distribution is assumed as the null hypothesis thus corresponds to P = 0.1 when the null distribution is generated by shuffling. Although this discrepancy must be resolved in the future, we conclude that their performances are quite similar.

Table 7. Confusion matrix of selected mRNAs between TD-based unsupervised FE and shuffling in the first data set.

P-value computed by Fisher’s exact test is 2.69 × 10^−137.

| TD-based unsupervised FE | shuffling: adjusted P_i > 0.1 | shuffling: adjusted P_i < 0.1 |
|---|---|---|
| adjusted P_i > 0.01 | 2928 | 0 |
| adjusted P_i < 0.01 | 3 | 69 |

Fig 6(A) shows the histogram of raw P-values computed using the null distribution generated by shuffling one hundred times when miRNAs in the second data set were considered. Because significant P-values were unlikely to be obtained from this distribution, we did not select any miRNAs. Fig 6(B) shows the histogram of raw P-values computed for mRNAs in the second data set; there are no peaks around P = 1. We then found that 262 mRNAs are associated with adjusted P-values less than 0.1. Table 8 compares the mRNAs selected by TD-based unsupervised FE with those selected using the null distribution generated by shuffling. Although the threshold P-values differ between the two, the selected mRNAs coincide well. A threshold P-value of 0.01 was empirically employed for PCA- and TD-based unsupervised FE because it often gave biologically reasonable results. Thus, P = 0.01 under the null hypothesis of a Gaussian distribution corresponds to P = 0.1 when the null distribution is generated by shuffling. Although this discrepancy must be explained in the future, we conclude that the performances of the two criteria are quite similar.

Table 8. Confusion matrix of selected mRNAs between TD-based unsupervised FE and shuffling in the second data set.

P-value computed by Fisher’s exact test is 0.0 within numerical accuracy (i.e., smaller than the possible smallest number given numerical accuracy).

| TD-based unsupervised FE | shuffling: adjusted P_i > 0.1 | shuffling: adjusted P_i < 0.1 |
|---|---|---|
| adjusted P_i > 0.01 | 33736 | 53 |
| adjusted P_i < 0.01 | 0 | 209 |

Fig 3(A) shows the histogram of raw P-values computed using the null distribution generated by shuffling one hundred times when considering the genes in the third data set. Because there were too many P-values less than 0.2, we excluded mRNAs with low expression values to obtain a P-value distribution more consistent with the null distribution. Fig 3(B) shows the histogram of raw P-values when the analysis is restricted to the 2780 most highly expressed mRNAs; this distribution is more consistent with the null distribution. We then found that 48 mRNAs are associated with adjusted P-values less than 0.1. Table 9 compares the mRNAs selected by TD-based unsupervised FE with those selected using the null distribution generated by shuffling. Although the threshold P-values differ between the two, the selected mRNAs coincide well. A threshold P-value of 0.01 was empirically employed for PCA- and TD-based unsupervised FE because it often gave biologically reasonable results. Thus, P = 0.01 under the null hypothesis of a Gaussian distribution corresponds to P = 0.1 when the null distribution is generated by shuffling. Although this discrepancy must be explained in the future, we conclude that the performances of the two criteria are quite similar.

Table 9. Confusion matrix of selected genes between TD-based unsupervised FE and shuffling in the third data set.

P-value computed by Fisher’s exact test is 5.00 × 10^−63.

| TD-based unsupervised FE | shuffling: adjusted P_i > 0.1 | shuffling: adjusted P_i < 0.1 |
|---|---|---|
| adjusted P_i > 0.01 | 2617 | 0 |
| adjusted P_i < 0.01 | 115 | 48 |
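
The association reported for each confusion matrix in Tables 6 to 9 can be examined with a small Fisher's exact test routine. The sketch below is illustrative only: it assumes a one-sided (upper-tail) test on the cell counting features selected by both criteria, which the text does not specify, and the function name `fisher_exact_upper_tail` is hypothetical. The computed values therefore need not reproduce the reported ones digit for digit.

```python
from math import comb


def fisher_exact_upper_tail(table):
    """One-sided Fisher's exact test for a 2x2 table ((a, b), (c, d)).

    Returns P(X >= d) under the hypergeometric null, where d counts the
    features selected by both criteria. The sidedness is an assumption.
    """
    (a, b), (c, d) = table
    n_total = a + b + c + d
    row_selected = c + d  # selected by the first criterion
    col_selected = b + d  # selected by the second criterion
    denominator = comb(n_total, row_selected)
    numerator = 0
    for x in range(d, min(row_selected, col_selected) + 1):
        numerator += comb(col_selected, x) * comb(n_total - col_selected,
                                                  row_selected - x)
    # Exact integer arithmetic; converted to float only at the end.
    return numerator / denominator


# Counts copied from Tables 6 to 9 (rows: TD-based unsupervised FE,
# columns: shuffling; first row/column = not selected).
TABLES = {
    "Table 6": ((488, 1), (0, 11)),
    "Table 7": ((2928, 0), (3, 69)),
    "Table 8": ((33736, 53), (0, 209)),
    "Table 9": ((2617, 0), (115, 48)),
}

p_values = {name: fisher_exact_upper_tail(t) for name, t in TABLES.items()}
for name, p in p_values.items():
    print(f"{name}: P = {p:.3g}")
```

Integer binomial coefficients keep the computation exact until the final division. For Table 8 the result is expected to underflow double precision, in line with the note that the P-value is 0.0 within numerical accuracy.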

Discussion

In the previous section, we explained why PCA- and TD-based unsupervised FE work well: singular value vectors correspond to projection onto the centroid subspace obtained by K-means. We also showed that the criterion of selecting genes associated with adjusted P-values less than 0.01, computed under the null hypothesis that singular value vectors obey a Gaussian distribution, empirically coincides with the criterion of selecting genes associated with adjusted P-values less than 0.1, computed using the null distribution generated by shuffling.

Several points remain to be discussed. In the above examples, we dealt only with the case in which two clusters could be distinguished in a one-dimensional space (i.e., with only one singular value vector). Cases with more clusters might be challenging: projections onto the centroid subspace do not have a one-to-one correspondence with singular value vectors, because the coincidence holds only between the spaces spanned by them, not between the individual vectors. Despite this, TD- and PCA-based unsupervised FE applied to more than two classes is known to work as well as in the case with only two clusters [16].
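
For the two-cluster, one-dimensional case, the correspondence between the leading singular value vector and the centroid subspace can be checked numerically. The following sketch uses synthetic data and a plain power iteration; the cluster centers, noise level, sample sizes, and all names are assumptions chosen for illustration, and true labels are used in place of a K-means run to compute the centroids.

```python
import math
import random


def center_columns(rows):
    """Subtract the mean of every feature (column) from a sample-by-feature matrix."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def leading_direction(rows, n_iter=100):
    """Leading right singular vector of `rows` by power iteration on X^T X."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(n_iter):
        scores = [sum(x * w for x, w in zip(row, v)) for row in rows]  # X v
        v_new = [sum(s * row[k] for s, row in zip(scores, rows))
                 for k in range(dim)]                                  # X^T (X v)
        norm = math.sqrt(sum(c * c for c in v_new))
        v = [c / norm for c in v_new]
    return v


def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


# Two synthetic clusters of 20 samples each, placed at +CENTER and -CENTER.
rng = random.Random(0)
CENTER = [3.0, 1.5, 0.0, -1.5, 2.0]
labels = [0] * 20 + [1] * 20
samples = [[(c if lab else -c) + rng.gauss(0.0, 1.0) for c in CENTER]
           for lab in labels]

centered = center_columns(samples)
pc1 = leading_direction(centered)

# Direction joining the two cluster centroids (the one-dimensional centroid subspace).
c0 = centroid([row for row, lab in zip(centered, labels) if lab == 0])
c1 = centroid([row for row, lab in zip(centered, labels) if lab == 1])
diff = [b - a for a, b in zip(c0, c1)]
cosine = abs(sum(p * d for p, d in zip(pc1, diff))) / math.sqrt(
    sum(d * d for d in diff))

# The sign of the projection onto pc1 splits the samples into two groups.
projection = [sum(x * w for x, w in zip(row, pc1)) for row in centered]
match = sum((s > 0) == bool(lab) for s, lab in zip(projection, labels)) / len(labels)
agreement = max(match, 1.0 - match)  # the sign of pc1 is arbitrary

print(f"|cos(pc1, centroid difference)| = {cosine:.3f}, agreement = {agreement:.2f}")
```

With more than two clusters, the analogous check would compare the subspaces spanned by several singular value vectors and by the centroids, rather than individual vectors.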

On the other hand, although we could discuss only cases with a finite number of clusters, PCA- and TD-based unsupervised FE are also known to work in detecting parameter dependence, e.g., time development [17, 18]. Extending the present discussion to regression analysis without any clusters will be the next step.

One might also wonder whether we need TD if the singular value vectors attributed to genes are common between TD and PCA. First, in the integrated analysis of mRNA and miRNA, TD-based unsupervised FE outperformed PCA-based unsupervised FE [12]. Similarly, TD-based unsupervised FE outperformed PCA-based unsupervised FE in the integrated analysis of gene expression and DNA methylation [19]. Thus, TD-based unsupervised FE is required when integrated analysis is targeted. Even when no integrated analysis is targeted, TD-based unsupervised FE can give singular value vectors that are more coincident with biological clusters (Fig 8). Thus, despite the apparent equality of the singular value vectors attributed to genes between TD and PCA, TD-based unsupervised FE is a more useful strategy than PCA-based unsupervised FE.

Although we did not state this explicitly, conventional gene selection strategies based on statistical tests are known to fail when applied to the first, second, and third data sets [12, 13]; they always selected too many or too few genes, mRNAs, and miRNAs, in contrast to TD-based unsupervised FE, which could always select a restricted number of genes, from tens to hundreds.

One might also wonder why we did not employ the null distribution generated by shuffling, instead of the unjustified Gaussian distribution, with PCA- and TD-based unsupervised FE. As can be seen above, employing the null distribution generated by shuffling is not straightforward; in some cases, e.g., the first and the third data sets, we needed to exclude lowly expressed genes manually, whereas this was not required for the second data set. Moreover, with the null distribution generated by shuffling, no miRNAs were detected as distinctly expressed between controls and cancers in the second data set. In addition, the number of lowly expressed genes to be removed cannot be decided uniquely. In contrast, the criterion of selecting genes associated with adjusted P-values less than 0.01, under the null hypothesis that singular value vectors obey a Gaussian distribution, is more robust; it can often give a restricted number of genes without excluding lowly expressed genes. Although why this works so well must be explored in the future, it is empirically a more useful strategy than the null distribution generated by shuffling.

One may also wonder why we did not employ the centroid subspace, S_b, instead of singular value vectors, if the two are equivalent for optimal clusters and the centroid subspace is easier to interpret than singular value vectors. First, using the centroid subspace requires applying K-means, which often fails on unbalanced data sets composed of clusters with very different numbers of samples. Second, K-means always identifies the primary cluster. In the case of SARS-CoV-2 (the third data set), however, the distinction between infected and control cell lines was detected using the fifth singular value vectors, which K-means would probably neglect because of their small contribution. In addition, singular value vectors can be computed in a fully unsupervised manner that does not require any labeling. Considering these advantages, it is reasonable to use singular value vectors instead of the centroid subspace despite the latter's apparent usefulness. Further, because the y_j used to compute the projection b is decided manually, b can be computed even if the biological features that y_j assumes, such as clusters, do not exist. This might lead to wrong conclusions. In contrast, if there are no clusters at all, no singular value vectors attributed to samples and coincident with y_j are obtained, which gives us an opportunity to notice the misunderstanding. Thus, using singular value vectors rather than the projection b might be advantageous.

One might also wonder why other, more frequently used TDs, such as CP decomposition [3], were not employed instead of HOSVD. This might be understood as follows. In the above description, we could relate the singular value vectors obtained by HOSVD to the centroid subspace because the singular value vectors attributed to genes are common between HOSVD and PCA. This equivalence is broken if HOSVD is replaced with other TDs. When we invented TD-based unsupervised FE, we also tested other TDs [3], but HOSVD always outperformed them when used for feature selection. The equivalence of HOSVD and PCA might explain why HOSVD could outperform other popular TDs as a feature selection tool.

Another possible concern is that shuffling was performed only one hundred times for the computations in Figs 1 to 3, whereas we considered P-values equal to 0.01; nevertheless, this is not problematic, for the following two reasons. First, the P-values we considered were not raw P-values but corrected P-values. The total number of P-values computed is therefore much larger than one hundred: since as many P-values are computed as there are mRNAs or miRNAs, their number is as large as 10^3 or 10^4. Thus, the number of shuffles, one hundred, is not directly related to the P-value of 0.01. Second, individual P-values do not depend on the number of shuffles; what we did was to generate as many P-values as there are miRNAs or mRNAs, i.e., 10^3 or 10^4. Individual P-values can therefore take values much smaller than 0.01, say 10^−3 and 10^−4 for miRNAs and mRNAs, respectively. Increasing or decreasing the number of shuffles does not affect the magnitude of the P-values. The number of shuffles affects only reproducibility: P-values computed from a single shuffle might fluctuate heavily, whereas P-values averaged over one hundred shuffles are expected to be more stable. We averaged over one hundred shuffles simply to stabilize the outcome. The apparent relationship between P = 0.01 and one hundred shuffles is coincidental. In conclusion, using P = 0.01 as a threshold with one hundred shuffles is not a problem.
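
The role of the number of shuffles can be illustrated with a small simulation. In this sketch, standard Gaussian draws stand in for the null statistics produced by one shuffle, and a P-value is the fraction of those draws that reach a fixed observed value; both choices, and all names, are assumptions made for illustration rather than the procedure used in the study. The point is only that averaging over more shuffles narrows the spread of the estimate without changing its magnitude.

```python
import random
import statistics


def mean_p_value(observed, n_features, n_shuffles, rng):
    """Average of per-shuffle P-values.

    Each per-shuffle P-value is the fraction of n_features null draws that
    are >= observed, so its resolution is 1 / n_features whatever the
    number of shuffles.
    """
    per_shuffle = []
    for _ in range(n_shuffles):
        exceed = sum(1 for _ in range(n_features)
                     if rng.gauss(0.0, 1.0) >= observed)
        per_shuffle.append(exceed / n_features)
    return statistics.mean(per_shuffle)


rng = random.Random(1)
OBSERVED = 2.5      # fixed observed statistic (hypothetical)
N_FEATURES = 1000   # number of null values generated per shuffle
N_REPEATS = 10      # independent repetitions used to measure the spread

single = [mean_p_value(OBSERVED, N_FEATURES, 1, rng) for _ in range(N_REPEATS)]
averaged = [mean_p_value(OBSERVED, N_FEATURES, 100, rng) for _ in range(N_REPEATS)]

print(f"1 shuffle:    mean = {statistics.mean(single):.4f}, "
      f"sd = {statistics.pstdev(single):.5f}")
print(f"100 shuffles: mean = {statistics.mean(averaged):.4f}, "
      f"sd = {statistics.pstdev(averaged):.5f}")
```

Both settings estimate the same tail probability of a few thousandths, well below 0.01; only the repetition-to-repetition spread differs.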

Based on the studies presented above, we recommend the use of PCA- or TD-based unsupervised FE, because we generally do not know in advance onto which direction the data sets should be projected. PCA and TD turned out to be able to give the directions of projection in an unsupervised manner. When the projection directions are trivial, e.g., the distinction between two classes, PCA and TD can correctly give us the directions. Even if the data sets are more complicated, we can employ higher-mode tensors to tackle them. PCA- and TD-based unsupervised methods are thus promising.

Data Availability

All the data sets and source code are available in the GitHub repository https://github.com/tagtag/peoj.

Funding Statement

This work was supported by Japan Society for the Promotion of Science (http://dx.doi.org/10.13039/501100001691) KAKENHI [grant numbers 19H05270, 20H04848, and 20K12067] to Professor Y-h. Taguchi. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Fang Z, Martin J, Wang Z. Statistical methods for identifying differentially expressed genes in RNA-Seq experiments. Cell & Bioscience. 2012;2(1):26. doi: 10.1186/2045-3701-2-26
  • 2. Chen JJ, Wang SJ, Tsai CA, Lin CJ. Selection of differentially expressed genes in microarray data analysis. The Pharmacogenomics Journal. 2006;7(3):212–220. doi: 10.1038/sj.tpj.6500412
  • 3. Taguchi YH. Unsupervised Feature Extraction Applied to Bioinformatics. Springer International Publishing; 2020. Available from: 10.1007/978-3-030-22456-1.
  • 4. Tibshirani R. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society, Series B. 1994;58:267–288.
  • 5. Huber PJ. Projection Pursuit. The Annals of Statistics. 1985;13(2):435–475. doi: 10.1214/aos/1176349519
  • 6. Bickel PJ, Kur G, Nadler B. Projection pursuit in high dimensions. Proceedings of the National Academy of Sciences. 2018;115(37):9151–9156. doi: 10.1073/pnas.1801177115
  • 7. Ospina L, López-Kleine L. Identification of differentially expressed genes in microarray data in a principal component space. SpringerPlus. 2013;2(1):60. doi: 10.1186/2193-1801-2-60
  • 8. Clark NR, Hu KS, Feldmann AS, Kou Y, Chen EY, Duan Q, et al. The characteristic direction: a geometrical approach to identify differentially expressed genes. BMC Bioinformatics. 2014;15(1):79. doi: 10.1186/1471-2105-15-79
  • 9. Shahbazi A, Monfared MS, Thiruchelvam V, Ka Fei T, Babasafari AA. Integration of knowledge-based seismic inversion and sedimentological investigations for heterogeneous reservoir. Journal of Asian Earth Sciences. 2020;202:104541. doi: 10.1016/j.jseaes.2020.104541
  • 10. Khayer K, Kahoo AR, Monfared MS, Tokhmechi B, Kavousi K. Target-Oriented Fusion of Attributes in Data Level for Salt Dome Geobody Delineation in Seismic Data. Natural Resources Research. 2022. doi: 10.1007/s11053-022-10086-z
  • 11. Khayer K, Roshandel-Kahoo A, Soleimani-Monfared M, Kavoosi K. Combination of seismic attributes using graph-based methods to identify the salt dome boundary. Journal of Petroleum Science and Engineering. 2022;215:110625. doi: 10.1016/j.petrol.2022.110625
  • 12. Ng KL, Taguchi YH. Identification of miRNA signatures for kidney renal clear cell carcinoma using the tensor-decomposition method. Scientific Reports. 2020;10(1). doi: 10.1038/s41598-020-71997-6
  • 13. Taguchi Yh, Turki T. A new advanced in silico drug discovery method for novel coronavirus (SARS-CoV-2) with tensor decomposition-based unsupervised feature extraction. PLOS ONE. 2020;15(9):1–16. doi: 10.1371/journal.pone.0238907
  • 14. Dodge Y. Q-Q Plot (Quantile to Quantile Plot). In: The Concise Encyclopedia of Statistics. New York, NY: Springer New York; 2008. p. 437–439.
  • 15. R Core Team. R: A Language and Environment for Statistical Computing; 2019. Available from: https://www.R-project.org/.
  • 16. Ding C, He X. K-Means Clustering via Principal Component Analysis. In: Proceedings of the Twenty-First International Conference on Machine Learning. ICML’04. New York, NY, USA: Association for Computing Machinery; 2004. p. 29. Available from: 10.1145/1015330.1015408.
  • 17. Taguchi YH. Principal component analysis based unsupervised feature extraction applied to budding yeast temporally periodic gene expression. BioData Mining. 2016;9(1). doi: 10.1186/s13040-016-0101-9
  • 18. Taguchi Yh, Turki T. Tensor Decomposition-Based Unsupervised Feature Extraction Applied to Single-Cell Gene Expression Analysis. Frontiers in Genetics. 2019;10:864. doi: 10.3389/fgene.2019.00864
  • 19. Taguchi YH. Tensor decomposition-based and principal-component-analysis-based unsupervised feature extraction applied to the gene expression and methylation profiles in the brains of social insects with multiple castes. BMC Bioinformatics. 2018;19(S4). doi: 10.1186/s12859-018-2068-7

Decision Letter 0

Chi-Hua Chen

23 Aug 2022

PONE-D-22-20332

Projection in genomic analysis: A theoretical basis to rationalize tensor decomposition and principal component analysis as feature selection tools

PLOS ONE

Dear Dr. Taguchi,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Oct 07 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Chi-Hua Chen, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please update your submission to use the PLOS LaTeX template. The template and more information on our requirements for LaTeX submissions can be found at http://journals.plos.org/plosone/s/latex.

3. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. 

When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

4. Thank you for stating the following in the Acknowledgments Section of your manuscript: 

"This work was supported by KAKENHI [grant numbers 19H05270, 20H04848, and 20K12067] to YT and Institutional Fund Project (IFPIP) from the Ministry of Education and King Abdulaziz University (DSR), Jeddah, Saudi Arabia [grant number IFPIP: 924-611-1442] to TT."

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. 

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: 

"The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

5. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

6. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

********** 

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

********** 

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

********** 

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

********** 

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The paper analyzes the reason why the recently proposed principal component analysis (PCA) and tensor decomposition (TD)-based unsupervised feature extraction (FE) has often outperformed these statistical test-based methods in the context of projection pursuit that was proposed a long time ago. Some findings in this paper rationalize the success of PCA- and TD-based unsupervised FE for the first time. I have the following suggestions for this manuscript. Other comments can see the attached file.

Reviewer #2: General Comments:

Is the paper new, technically correct, and relevant?

Yes, the paper is new and technically sounds. Results somehow does support the methodology, but needed to be more cleared by the author in case of properties of the data.

Is the paper well organized?

The paper is properly organized, good literature review, suitable motivation and clear explanation on results are positive points to that.

Is the abstract concise?

Yes, but I think it needs to be rephrased after revision to add some comments about any artifacts or negative points in the method, if exist.

Is the introduction motivating?

Yes, Introduction section is motivating.

Are the methodology, results, and conclusions completely developed?

No, they need to be modified and developed according to the technical comments.

Are there language, mathematics, reference, or style errors? There is no mathematical, reference or style error.

Technical Comments:

Are the codes available for this research? As I found, there is no code available for this study, e. g. in Github. If the authors could make the codes available, the manuscript could be much better evaluated, not only for reviewers, but also for possible readers. When it is not possible to upload the code for public access, such as in Github, could they be provided for reviewer for better assessment of the study?

The study is comprehensive and requires large time to be read carefully and being reviewed. The theoretical background has been well explained in details, and the experiments and related models are presented and the algorithm in Fig. 1 is also well presented. I think more explanation about the steps and the parameters in Fig. 1 is required.

The result comparison parts are well organized and presented. The display way is good. But quantitative evaluation is somehow too much that one can get lost in that. I think it would be better that you add more explanation to that.

How did you evaluate the final result? How did you consider to finally selection a methodology for the most complicate problem?

What about when the models are more complex?

The introduction section is a nice one. It is architected very beautifully, while written fully academic and comprehend. I assume that any change in the introduction section is not necessary, but one of the important tasks after publishing a study is to increase its chance to be seen by the most possible number of researchers, so I would like to give two recommendations. First, to get your published study in the list of searched for papers based on keywords, I propose to increase variety of your keywords. In my viewpoint, they do not cover the whole topic of the study and are not widely searched words. I propose to add at least the keyword “data analysis”. Second, one of the methods in the publisher’s website that brings a publication on to the researchers, is based on the similar publications that they have read before. So, the more you cite similar publication, the more the chance that the search engine in the publisher website propose your paper to the researcher. Besides of that, it will also complete your introduction section. As another advantage, it rises new ideas to the researchers by combining various methods, or resolving drawback of one seen paper by reading the similar one, or extending the methodology to a fully automatic one. So, based on these points, I would like to ask to cite to the following similar publication in the manuscript which used PCA and feature selection for deep learning, but in different field of study. The first proposed publication is: Shahbazi, A., Soleimani Monfared, M., Thiruchelvam, V., Ka Fei, T., Babasafari, A.A., (2020). Integration of knowledge-based seismic inversion and sedimentological investigations for heterogeneous reservoir. Journal of Asian Earth Sciences. The second publication for citation is: Khayer, K., Kahoo, A.R., Soleimani Monfared, M., Tokhmechi, B., and Kavousi, K., (2022). Target-Oriented Fusion of Attributes in Data Level for Salt Dome Geobody Delineation in Seismic Data. 
Natural resource research, and the other publication could be: Khayer, K., Kahoo, A.R., Soleimani Monfared, M., and Kavouosi, K., (2022). Combination of seismic attributes using graph-based methods to identify the salt dome boundary. Journal of Petroleum Science and Engineering. 215, Part A, 110625,

The abstract focusses mainly on the general problem and ignores the other items of the abstract such as the methodology, good introduction, results and conclusion.

The authors should explain what limitations did they find out about the proposed method.

Best Regard

********** 

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: PONE-D-22-20332_reviewer.docx

Attachment

Submitted filename: Comments-PONE-D-22-20332.pdf

Decision Letter 1

Chi-Hua Chen

19 Sep 2022

Projection in genomic analysis: A theoretical basis to rationalize tensor decomposition and principal component analysis as feature selection tools

PONE-D-22-20332R1

Dear Dr. Taguchi,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Chi-Hua Chen, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The revision has enriched the content of the article and improved its readability, but some small problems remain.

1. The paragraphs throughout the paper should be justified (aligned at both margins), which may improve the article’s appearance.

2. In line 201 on page 8, u3k does not exist in (26).

3. In line 284 on page 11, a single sentence contains two main verbs: “P-values were attributed to genes as... 155 genes associated with corrected P-values less than 0.01 were selected, bi is expected to play a role of u5i in eq. (46).”

4. Please check the references carefully, for example references [3], [10], [12], [14], [15], [16], and [19].

Reviewer #2: Dear Authors;

I have carefully read your response and the revised manuscript, and I am pleased with your answers and with how the research and the manuscript have been developed.

I have no further comments.

Best Regards

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

Acceptance letter

Chi-Hua Chen

20 Sep 2022

PONE-D-22-20332R1

Projection in genomic analysis: A theoretical basis to rationalize tensor decomposition and principal component analysis as feature selection tools

Dear Dr. Taguchi:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Professor Chi-Hua Chen

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: PONE-D-22-20332_reviewer.docx

    Attachment

    Submitted filename: Comments-PONE-D-22-20332.pdf

    Attachment

    Submitted filename: Replies_to_reviewers.docx

    Data Availability Statement

    All the data sets and source code are available in the GitHub repository https://github.com/tagtag/peoj.


    Articles from PLoS ONE are provided here courtesy of PLOS
