Scientific Reports. 2024 Nov 5;14:26801. doi: 10.1038/s41598-024-77080-8

A tied-weight autoencoder for the linear dimensionality reduction of sample data

Sunhee Kim 1, Sang-Ho Chu 2, Yong-Jin Park 2, Chang-Yong Lee 1
PMCID: PMC11538474  PMID: 39501008

Abstract

Dimensionality reduction is a method used in machine learning and data science to reduce the dimensions in a dataset. While linear methods are generally less effective at dimensionality reduction than nonlinear methods, they can provide a linear relationship between the original data and the dimensionality-reduced representation, leading to better interpretability. In this research, we present a tied-weight autoencoder as a dimensionality reduction model with the merit of both linear and nonlinear methods. Although the tied-weight autoencoder is a nonlinear dimensionality reduction model, we approximate it to function as a linear model. This is achieved by removing the hidden layer units that are largely inactivated by the input data, while preserving the model’s effectiveness. We evaluate the proposed model by comparing its performance with other linear and nonlinear models using benchmark datasets. Our results show that the proposed model performs comparably to the nonlinear model of a similar autoencoder structure to the proposed model. More importantly, we show that the proposed model outperforms the linear models in various metrics, including the mean square error, data reconstruction, and the classification of low-dimensional projections of the input data. Thus, our study provides general recommendations for best practices in dimensionality reduction.

Keywords: Tied-weight autoencoder, Dimensionality reduction, Data reconstruction, Code size, Mean square error

Subject terms: Computational science, Information technology

Introduction

Dimensionality reduction1,2 refers to a technique that aims to reduce the number of dimensions in a dataset while preserving as much of the original information as possible. The primary goal is to minimize the dimensionality of the dataset so that essential parameters can effectively describe the dataset. Dimensionality reduction finds applications in various domains such as feature selection, computational efficiency, efficient data storage, and noise reduction in data visualization. Moreover, it is instrumental in addressing statistical modeling and inference challenges that arise in the “large p, small n” problem, where the sample dimension p significantly exceeds the sample size n. For example, dimensionality reduction becomes necessary in cases such as microarray data analysis3 and genomic selection4, where there are a large number of genetic variants (typically in the tens of thousands), but the number of samples available for each gene is relatively limited. As a crucial aspect of data preprocessing, dimensionality reduction methods are of great importance in the fields of machine learning and data science5,6.

Techniques for dimensionality reduction can be broadly categorized into linear and nonlinear methods. Methods such as principal component analysis (PCA)7 and independent component analysis (ICA)8 fall into the linear category, while Isomap (Isometric Mapping)9 and neural networks10 stand out as widely used nonlinear approaches. Both linear and nonlinear methods have pros and cons. Nonlinear methods offer advantages over their linear counterparts in that they can capture complex nonlinear relationships present in the data, resulting in more informative representations. Thus, nonlinear methods have generally demonstrated superior performance over linear methods in dimensionality reduction.

While the linear methods are generally less effective at dimensionality reduction than the nonlinear methods, they offer greater interpretability than the nonlinear methods. This interpretability comes from the linear nature of these methods, which allows the dimensionality-reduced representation (or low-dimensional projection) to be expressed as a linear combination of the original data and vice versa. Thus, it is possible to understand how each component of the dimensionality-reduced result contributes to the original data. Consequently, linear methods make interpreting the results and extracting insights from the data easier. From this perspective, it would be desirable to have a linear method that outperforms current linear methods in the effectiveness of dimensionality reduction while maintaining similar interpretability to current linear methods. That is, we would like to have a method that combines the advantages of both linear and nonlinear methods.

In this study, we introduce a tied-weight autoencoder (AE) that has the advantages of both linear and nonlinear methods. An AE is a neural network model designed for learning compressed representations of unlabeled data11. An AE aims to produce an efficient encoding by training on a dataset and is often used for dimensionality reduction purposes12–14. It consists of an encoder and a decoder. The encoder converts the input data into a dimensionality-reduced representation, while the decoder reconstructs the original data from this encoded form. The tied weight in the proposed AE is implemented by mirroring the weights in the decoder to match those in the encoder. This model is designed to function as an approximately linear method, creating a linear transformation that connects the original data to the reduced-dimensional representation. It not only outperforms other linear methods in dimensionality reduction but also, with the tied weight, ensures that the inverse of the linear transformation is simply the transpose of the matrix representing the linear transformation. Thus, the proposed model provides similar interpretability and more effective dimensionality reduction than other linear models. As an unsupervised learning technique, a typical AE is classified as a nonlinear dimensionality reduction method. However, the proposed model provides a linear transformation in an approximate sense by collecting units in the hidden layer that are activated by the dataset.

We demonstrate the effectiveness of our proposed model by comparing its performance with other linear and nonlinear models using widely known datasets. We compare the dimensionality reduction result of the proposed model with that of the linear models of PCA and ICA, and the nonlinear models of stacked autoencoder (SAE)15, variational autoencoder (VAE)16, locally linear embeddings (LLE)17, and Isomap9. The experimental datasets used are image datasets from MNIST18, Fashion-MNIST (or FMNIST)19, SVHN20, and CIFAR1021 in addition to non-image datasets from the Breast Cancer22 and Wine23 datasets.

We demonstrate the effectiveness of the proposed model by classifying the dimensionally reduced outputs into the classes of the original data. We use a support vector machine (SVM) classifier to evaluate how well the dimensionally reduced outputs capture the characteristics of the original data. Our evaluation shows that SVMs trained on the dimensionally reduced outputs of the proposed model achieve higher F1 scores and balanced accuracy than those trained on the outputs of the linear methods of PCA and ICA. These results highlight the superiority of the proposed model over PCA and ICA and establish it as a valuable linear approach for dimensionality reduction. Our further comparison focuses on evaluating performance in terms of the mean square error and input image data reconstruction. The proposed model showed comparable performance to other AEs. However, the proposed model offers a distinct advantage over other AEs by providing a linear transformation between the original data and the low-dimensional representation, a feature lacking in the AEs. The proposed model also achieves better performance than the linear methods of PCA and ICA. These results show that the proposed model is superior to the other linear methods in dimensionality reduction. Based on these results, our study provides general recommendations for best practices in dimensionality reduction.

The proposed model, SAE, and VAE were implemented using the Keras library (https://keras.io). LLE and Isomap were implemented using the scikit-learn library (https://scikit-learn.org/stable/index.html). Both Keras and scikit-learn were run on Python 3.6.8. PCA and ICA implementations and classification experiments were carried out in R 4.3.1. The source code is publicly available at https://github.com/infoLab204/tw_autoencoder.

Proposed autoencoder model of linear transformation

Basic structure of proposed model

The schematic structure of the proposed model is shown in Fig. 1. The proposed model is a neural network designed to autonomously learn data features by replicating the input as the output. The model includes three hidden layers in addition to the input and output layers. In a multilayer autoencoder, the central hidden layer encodes the original data. We refer to this central layer as the “code layer” to distinguish it from other hidden layers. In essence, the code layer is a special type of hidden layer. When an autoencoder has only one hidden layer, in addition to the input and output layers, this hidden layer also functions as the code layer. The hidden layers exhibit symmetry, with the layers adjacent to the code layer sharing the same architecture. The encoder is responsible for mapping input data to the code layer, and a decoder is tasked with reconstructing input data from the code layer. The code layer serves as the dimensionality-reduced representation of input data and should have fewer units than the input layer, while the flanking hidden layers are not subject to such constraints. Due to its AE nature, both input and output layers have identical structures.

Fig. 1.

Schematic description of the proposed model. $x_{\alpha i}$ and $\tilde{x}_{\alpha i}$ represent input and output, respectively; $h_{\alpha j}$ and $\tilde{h}_{\alpha j}$ represent outputs of the hidden layer in the encoder and the decoder, respectively; $z_{\alpha k}$ represents output of the code layer. $w^{(1)}_{ij}$ and $\phi^{(1)}$ denote the weights and the ReLU activation function of the hidden layer, respectively, in the encoder; $w^{(2)}_{jk}$ and $\phi^{(2)}$ denote the weights and the identity activation function of the code layer, respectively. As a tied-weight model, the weights in the decoder are the transpose of the weights in the encoder. We omit the biases for simplicity.

The rectified linear unit (ReLU) ($\phi^{(1)}$ in Fig. 1) is used as the activation function for the hidden layers flanking the code layer; the identity function ($\phi^{(2)}$ in Fig. 1) is adopted for both the code and output layers. The proposed model incorporates tied weights, where the decoder weights mirror the encoder weights. That is, the weights ($w^{(1)}_{ji}$ and $w^{(2)}_{kj}$ in Fig. 1) in the decoder are the transpose of the weights ($w^{(1)}_{ij}$ and $w^{(2)}_{jk}$ in Fig. 1) in the encoder. In Fig. 1, the indices $\alpha$, $i$, $j$, and $k$ run over the $n$ observations, the $m$ dimensions of the data, the $q$ units in the hidden layer, and the $p$ units in the code layer, respectively.

This approach of weight sharing reduces the number of parameters estimated in the model and reduces the risk of overfitting24. More importantly, in the tied-weight model, the inverse linear transformation from the dimensionality-reduced result to the original input data becomes just the transpose of the matrix representing the linear transformation. Besides a simple construction of the inverse transformation, the tied-weight autoencoder improves the learning efficiency of the model compared to models without tied weights25 and mimics the structure of a Restricted Boltzmann Machine (RBM), which inherently has tied weights26.
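As a rough illustration of the tied-weight structure, the following NumPy sketch runs one forward pass in which the decoder reuses the transposed encoder weights, and compares the number of weights with and without tying. The function name `tied_forward`, the separate decoder biases `c1` and `c2`, the toy layer sizes, and the random initialization are assumptions made for this example; the authors' Keras implementation may differ.

```python
import numpy as np


def relu(a):
    """Rectified linear unit applied element-wise."""
    return np.maximum(a, 0.0)


def tied_forward(X, W1, b1, W2, b2, c1, c2):
    """Forward pass of a tied-weight autoencoder (illustrative sketch).

    Encoder: input -> ReLU hidden layer -> identity code layer.
    Decoder: reuses W2.T and W1.T; only the biases c1 and c2 are separate
    (the separate decoder biases are an assumption of this sketch).
    """
    H = relu(X @ W1 + b1)        # encoder hidden layer, shape (n, q)
    Z = H @ W2 + b2              # code layer, identity activation, shape (n, p)
    H_dec = relu(Z @ W2.T + c1)  # decoder hidden layer, shape (n, q)
    X_rec = H_dec @ W1.T + c2    # output layer, identity activation, shape (n, m)
    return Z, X_rec


# Toy sizes: n observations, m input dimensions, q hidden units, p code units.
n, m, q, p = 4, 6, 10, 2
rng = np.random.default_rng(0)
X = rng.random((n, m))
W1 = rng.normal(scale=0.1, size=(m, q))
W2 = rng.normal(scale=0.1, size=(q, p))
b1, b2 = np.zeros(q), np.zeros(p)
c1, c2 = np.zeros(q), np.zeros(m)

Z, X_rec = tied_forward(X, W1, b1, W2, b2, c1, c2)

# Weight counts only (biases excluded): tying halves the number of weights.
tied_weight_count = m * q + q * p
untied_weight_count = 2 * (m * q + q * p)
```

Because the decoder holds no weight matrices of its own, any update to `W1` or `W2` changes the encoder and the decoder at the same time.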

Construction of linear transformation

To derive the linear relationship between input data and its dimensionality-reduced representation, we start with the expression of units in the code layer in terms of units in the input layer and the weights. Using notations in Fig. 1, it is straightforward to express $z_{\alpha k}$, the $k$th unit in the code layer for an observation (or sample) $\alpha$, in terms of input data $x_{\alpha i}$ and the weights $w^{(1)}_{ij}$ and $w^{(2)}_{jk}$.

$$z_{\alpha k} = \sum_{i=1}^{m} x_{\alpha i} W^{\alpha}_{ik} + \Delta^{(2)}_{\alpha k}, \quad \text{where} \quad W^{\alpha}_{ik} \equiv \sum_{j \in A_\alpha} w^{(1)}_{ij} w^{(2)}_{jk} \quad \text{and} \quad \Delta^{(2)}_{\alpha k} \equiv \sum_{j \in A_\alpha} w^{(1)}_{0j} w^{(2)}_{jk} + w^{(2)}_{0k}, \tag{1}$$

where $m$ is the dimension of input data. In addition, $w^{(1)}_{0j}$ and $w^{(2)}_{0k}$ are the biases of the hidden and the code layers, respectively; $\Delta^{(2)}_{\alpha k}$ is a collective expression of different biases. A detailed derivation of Eq. (1) is given in the “Methods” section.

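In Eq. (1), the set $A_\alpha$ is a subset of the units in the hidden layer given as

$$A_\alpha \equiv \{\, j \mid h_{\alpha j} > 0 \,\} \quad \text{or} \quad A_\alpha \equiv \Big\{\, j \;\Big|\; \sum_{i=0}^{m} x_{\alpha i} w^{(1)}_{ij} > 0 \,\Big\} \quad \text{for } j = 0, 1, \ldots, q. \tag{2}$$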

Because of the ReLU activation function, $A_\alpha$ is a collection of units in the hidden layer that output non-zero contribution to the code layer. It depends on $x_{\alpha i}$ of each observation $\alpha$ and the weights connecting the input and the hidden layers. By absorbing the bias terms $\Delta^{(2)}_{\alpha k}$ into $z_{\alpha k}$, Eq. (1) can be expressed as

$$z_{\alpha k} - \Delta^{(2)}_{\alpha k} \;\rightarrow\; z_{\alpha k} = \sum_{i=1}^{m} x_{\alpha i} W^{\alpha}_{ik} \quad \text{or} \quad \mathbf{z}_\alpha = \mathbf{x}_\alpha \cdot W^{\alpha}, \tag{3}$$

where $\mathbf{z}_\alpha$ and $\mathbf{x}_\alpha$ are row vectors of $p$ and $m$ components, respectively. Note that the $m \times p$ matrix $W^{\alpha}$ defined in Eq. (1) linearly transforms $m$ units in the input layer into $p$ units in the code layer. From Eq. (3), we find that the hidden layer is effectively integrated out, and the units $\mathbf{z}_\alpha$ in the code layer are linearly related to the units $\mathbf{x}_\alpha$ in the input layer with weight matrix $W^{\alpha}$.
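The identity in Eq. (1) can be checked numerically for a single observation. The sketch below is a minimal NumPy check under assumed toy sizes and random weights (none of these values come from the paper): it computes the code layer through the ReLU hidden layer and then again through the observation-dependent matrix built only from the active units.

```python
import numpy as np

rng = np.random.default_rng(0)
m, q, p = 6, 8, 2                      # toy sizes: input, hidden, code units

W1 = rng.normal(size=(m, q))           # input-to-hidden weights
b1 = rng.normal(size=q)                # hidden-layer biases
W2 = rng.normal(size=(q, p))           # hidden-to-code weights
b2 = rng.normal(size=p)                # code-layer biases
x = rng.normal(size=m)                 # one observation

# Standard forward pass through the ReLU hidden layer.
h = np.maximum(x @ W1 + b1, 0.0)
z = h @ W2 + b2

# Set of active hidden units for this observation (h > 0).
active = h > 0

# Observation-dependent linear map and collected bias term, as in Eq. (1).
W_alpha = W1[:, active] @ W2[active, :]          # shape (m, p)
delta = b1[active] @ W2[active, :] + b2          # shape (p,)

z_linear = x @ W_alpha + delta
```

The two results agree because each active ReLU unit passes its pre-activation through unchanged, while each inactive unit contributes nothing to the code layer.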

The matrix $W^{\alpha}$ in Eq. (1) is an observation-dependent linear transformation: because $A_\alpha$ in Eq. (2) depends on the observation $\alpha$, the matrix varies from observation to observation. To establish a linear relationship between the input data and its dimensionality-reduced representation (i.e., the units in the code layer), a linear transformation $W$ should be constructed that can be applied to all observations in common. For this purpose, we modify the structure of the proposed model so that the constructed $W$ differs as little as possible from $W^{\alpha}$ and therefore has a similar effectiveness to $W^{\alpha}$. Similar to $W^{\alpha}$, the matrix $W$ is constructed using the weights associated with the units in the hidden layer.

To ensure that W minimally differs from Wα, we identify the units in the hidden layer activated (i.e., outputs a non-zero value) by a larger set of observations. This identified set of units forms a subset of all units in the hidden layer, and the weights associated with the units within this subset are used to construct W. The more observations that activate a unit in the hidden layer, the more likely the unit will be included in constructing W. Consequently, if a unit is in the subset, then any unit activated by a larger number of observations than the unit must also be part of that subset. This means that there is a minimum number of observations for a unit to be included in the subset.

To estimate the minimum number of observations, we require that $W$ shows similar performance to the method using $W^{\alpha}$. One way to ensure similar performance is to minimize the difference between two outputs: one using all hidden layer units (i.e., $W^{\alpha}$) and the other using only the units in the subset (i.e., $W$). Thus, estimating the minimum number of observations becomes an optimization problem. Once the minimum number is estimated, we can construct the optimal subset of units activated by the minimum number of observations. The optimal subset is a subset of $\bigcup_{\alpha=1}^{n} A_\alpha$, the union of $A_\alpha$ over all observations. Using the units in the optimal subset and their associated weights, we obtain a linear transformation $W$ similar to $W^{\alpha}$.

To construct the optimal subset, we train the model to update the weights and to identify Aα, a collection of units in the hidden layer activated by an observation α. Training the proposed model is realized by minimizing the loss function, the mean square error between target (i.e., input data) and output (i.e., reconstructed input). Once we have Aα for each observation α, we construct a matrix Mαj which represents Aα quantitatively across all observations. We set Mαj=1, if observation α in the input data activates the jth unit in the hidden layer (i.e., the jth unit is an element of Aα); we set Mαj=0, otherwise. The matrix Mαj, with dimensions n×q, indicates whether observation α activates the jth unit in the hidden layer. Each row of Mαj reflects the units activated by each observation, while each column denotes the observations that activate each unit in the hidden layer.

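Using $M_{\alpha j}$, we can count the number of observations that activate the $j$th unit in the hidden layer, denoted by $M_j$. That is, for $\alpha = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, q$, we have

$$M_j \equiv \sum_{\alpha=1}^{n} M_{\alpha j}, \quad \text{where} \quad M_{\alpha j} = \begin{cases} 1 & \text{if } j \in A_\alpha \\ 0 & \text{if } j \notin A_\alpha \end{cases} \quad \text{and} \quad 0 \le M_j \le n. \tag{4}$$

As $M_j$ becomes larger, more observations activate the $j$th unit in the hidden layer.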
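The indicator matrix and the per-unit counts of Eq. (4) follow directly from the hidden-layer outputs. The short sketch below uses made-up hidden activations for six observations and five hidden units (the same sizes as the example in Fig. 2, but the numbers themselves are hypothetical).

```python
import numpy as np

# Hypothetical hidden-layer outputs: rows are observations, columns are units.
H = np.array([
    [0.3, 0.0, 1.2, 0.0, 0.0],
    [0.1, 0.0, 0.7, 0.0, 0.4],
    [0.0, 0.0, 0.9, 0.0, 0.0],
    [0.5, 0.2, 0.4, 0.0, 0.0],
    [0.8, 0.0, 0.0, 0.0, 0.6],
    [0.2, 0.0, 1.1, 0.0, 0.0],
])

# Indicator matrix: entry (alpha, j) is 1 when observation alpha activates unit j.
M = (H > 0).astype(int)

# Column sums give, for each unit j, the number of observations that activate it.
M_counts = M.sum(axis=0)
```

In this toy case the fourth unit is never activated and the second unit is activated by a single observation, so both would be the first candidates for removal as the threshold grows.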

Using $M_j$ and an integer $\theta$, we find $C_\theta$, the set of units in the hidden layer that are activated by at least $\theta$ observations:

$$C_\theta = \{\, j \mid M_j \ge \theta \,\}, \quad \text{where} \quad C_\theta \subseteq \bigcup_{\alpha=1}^{n} A_\alpha. \tag{5}$$

The set $C_\theta$ is a subset of $\bigcup_{\alpha=1}^{n} A_\alpha$ and consists of units in the hidden layer activated by at least $\theta$ observations. Finding the optimal subset $C_\theta$ consists of estimating an integer $\theta$ that minimizes the discrepancy between the outputs using all units in the hidden layer and using the units in the subset $C_\theta$. We express the discrepancy in terms of the mean square error:

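$$\Delta(\theta) = \frac{1}{2nm} \sum_{\alpha=1}^{n} \sum_{i=1}^{m} \left( \tilde{x}_{\alpha i} - \tilde{x}^{\theta}_{\alpha i} \right)^2, \tag{6}$$

where $\tilde{x}_{\alpha i}$ is the output using all units in the hidden layer and $\tilde{x}^{\theta}_{\alpha i}$ is the output using the units in $C_\theta$.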

Thus, estimating the optimal $\theta$ amounts to finding the $\hat{\theta}$ that minimizes $\Delta(\theta)$, which can be formulated as

$$\hat{\theta} = \arg\min_{\theta} \Delta(\theta). \tag{7}$$
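Equations (5) to (7) can be sketched as a small search over candidate thresholds. The code below is an illustrative NumPy version, not the authors' implementation: the helper names `reconstruct` and `discrepancy`, the random stand-in weights, the separate decoder biases, and in particular the candidate list for θ are all assumptions, since this section does not state the range over which θ is searched.

```python
import numpy as np


def relu(a):
    return np.maximum(a, 0.0)


def reconstruct(X, W1, b1, W2, b2, c1, c2, keep):
    """Tied-weight forward pass using only the hidden units flagged in `keep`.

    The same mask is applied to the encoder and decoder hidden layers so that
    the mirror symmetry of the tied-weight model is preserved.
    """
    H = relu(X @ W1[:, keep] + b1[keep])
    Z = H @ W2[keep, :] + b2
    H_dec = relu(Z @ W2[keep, :].T + c1[keep])
    return H_dec @ W1[:, keep].T + c2


def discrepancy(X, params, M_counts, theta):
    """Delta(theta) of Eq. (6): full-model output vs. output restricted to C_theta."""
    W1, b1, W2, b2, c1, c2 = params
    all_units = np.ones(W1.shape[1], dtype=bool)
    keep = M_counts >= theta                      # C_theta of Eq. (5)
    full = reconstruct(X, W1, b1, W2, b2, c1, c2, all_units)
    part = reconstruct(X, W1, b1, W2, b2, c1, c2, keep)
    n, m = X.shape
    return np.sum((full - part) ** 2) / (2 * n * m)


# Random stand-ins for trained parameters (hypothetical values).
rng = np.random.default_rng(1)
n, m, q, p = 50, 8, 20, 3
X = rng.random((n, m))
params = (rng.normal(size=(m, q)), rng.normal(size=q),
          rng.normal(size=(q, p)), np.zeros(p),
          np.zeros(q), np.zeros(m))

# Activation counts M_j from the encoder hidden layer, as in Eq. (4).
H = relu(X @ params[0] + params[1])
M_counts = (H > 0).sum(axis=0)

# Assumed candidate thresholds; Eq. (7) picks the one with the smallest Delta.
candidates = [5, 10, 20, 30, 40]
deltas = [discrepancy(X, params, M_counts, t) for t in candidates]
theta_hat = candidates[int(np.argmin(deltas))]
```

The units outside the selected subset would then be removed from both hidden layers before retraining, as described in the text.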

Once the optimal subset $C_{\hat{\theta}}$ is obtained, we exclude the units not in $C_{\hat{\theta}}$ from the model. Since the proposed AE includes tied weights, we are required to maintain the symmetry of the hidden layers. This means that we also exclude the units in the decoding hidden layer that do not belong to $C_{\hat{\theta}}$. As the units in the hidden layers have been changed from the full units to the units in $C_{\hat{\theta}}$, we need to train the proposed model again.

The proposed model is a sparse, tied-weight autoencoder designed for linear dimensionality reduction. Sparsity is introduced by selectively removing units in the hidden layers in such a way as to minimize Eq. (6). As shown in Eq. (27) of “Methods” Section, the tied-weight structure allows the inverse linear transformation from dimensionality-reduced results to the original data to be simply the transpose of the matrix representing the linear transformation. With the tied weights, the original data can similarly be reconstructed from the reduced representation. Without the tied weights, finding the inverse of the matrix representing the linear transformation becomes difficult, especially for data with high dimensionality. It is important to note that the sparsity of the model must satisfy the tied-weight condition. Since the model uses tied weights, the symmetry of the hidden layers must be maintained. Specifically, the sparsity pattern in the decoder is a mirror structure of that in the encoder.

The proposed model is re-trained to minimize the loss function of the reconstruction error using the units in $C_{\hat{\theta}}$. The loss function is generally expressed as the mean square error (or L2-norm) between input data and the output from the low-dimensional projection. That is,

$$L(X, \tilde{X}; \hat{\theta}) = \|X - \tilde{X}\|^2 = \frac{1}{2nm} \sum_{\alpha=1}^{n} \sum_{i=1}^{m} \left( x_{\alpha i} - \tilde{x}^{\hat{\theta}}_{\alpha i} \right)^2, \tag{8}$$

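where $\tilde{x}^{\hat{\theta}}_{\alpha i}$ is the output from the input $x_{\alpha i}$ using the optimal subset $C_{\hat{\theta}}$. With the optimal subset $C_{\hat{\theta}}$, the $k$th unit in the code layer $z_{\alpha k}$ in Eqs. (1) and (3) can be expressed approximately as

$$z_{\alpha k} \approx \sum_{i=1}^{m} x_{\alpha i} W_{ik}, \quad \text{where} \quad W_{ik} \equiv \sum_{j \in C_{\hat{\theta}}} w^{(1)}_{ij} w^{(2)}_{jk}. \tag{9}$$

Equation (9) can be rewritten in a matrix form as

$$Z_p \approx X W_p, \tag{10}$$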

where the subscript $p$ denotes the number of units in the code layer to emphasize the degree of dimensionality reduction. In addition, $Z_p$, $X$, and $W_p$ are $n \times p$, $n \times m$, and $m \times p$ matrices, respectively. Since the model uses tied weights, we can explicitly express the output in terms of the input and the linear transformation matrix $W_p$. That is, we have

$$\tilde{X} = Z_p W_p^{T}, \tag{11}$$

where the superscript $T$ stands for the matrix transpose. In addition, $\tilde{X}$, $Z_p$, and $W_p^{T}$ are $n \times m$, $n \times p$, and $p \times m$ matrices, respectively. A detailed derivation of $\tilde{X}$ in Eq. (11) is given in the “Methods” section.
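Once the subset of hidden units is fixed, Eqs. (8) to (11) reduce to a few matrix products. The sketch below illustrates the shapes involved; the index array `C`, the random weights, and the toy sizes are hypothetical, and the biases are taken as absorbed, as in Eq. (3).

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, q, p = 5, 6, 12, 2               # observations, input dim, hidden units, code units
X = rng.random((n, m))
W1 = rng.normal(scale=0.1, size=(m, q))   # stand-in for retrained input-to-hidden weights
W2 = rng.normal(scale=0.1, size=(q, p))   # stand-in for retrained hidden-to-code weights

# Hypothetical indices of the hidden units kept in the optimal subset.
C = np.array([0, 3, 4, 7, 9])

# Eq. (9): sum over the kept hidden units gives one m x p linear map.
W_p = W1[:, C] @ W2[C, :]

# Eq. (10): low-dimensional projection of all observations at once.
Z_p = X @ W_p

# Eq. (11): reconstruction uses the transpose of the same matrix.
X_rec = Z_p @ W_p.T

# Eq. (8): mean square error with the 1/(2nm) normalization.
loss = np.sum((X - X_rec) ** 2) / (2 * n * m)
```

Each column of `W_p` gives the contribution of every input dimension to one code unit, which is what makes the projection directly interpretable.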

Finding the optimal subset Cθ^ is morphologically similar to, but not the same as, the dropout technique27 and the sparse (or k-sparse) AE28. As a regularization technique, the dropout is used to mitigate the problem of overfitting. The difference between the proposed method and the dropout is twofold. First, while the dropout uses the dropout probability as a hyper-parameter to randomly remove units in a layer, the proposed method systematically selects the units to be removed by minimizing the discrepancy given in Eq. (6). The purpose is to establish a linear relationship between the original data and the dimensionality-reduced result. In this approach, Eq. (6) serves as a guide for the dropout process. Second, the dropout of units can occur in any hidden layers, while the proposed method removes units only in the hidden layers adjacent to the code layer. While the proposed model and the sparse AE are similar in that both models consider inactive units, they are conceptually and structurally different. The sparse AE imposes the sparsity condition on the code layer with a given probability, while the proposed model makes the hidden layers (not the code layer) sparse. In addition, the sparse AE uses either probability or hyper-parameter k to eliminate the units in the code layer, while the proposed method specifies the number and content of the units in the hidden layers by optimizing the discrepancy given in Eq. (6).

The procedure for constructing the linear transformation matrix W in Eq. (9) is schematically described in Fig. 2 and can be summarized as follows.

  1. Learn the proposed model using each observation α to obtain Aα.

  2. Construct a matrix $M_{\alpha j}$ and calculate $M_j$ using $A_\alpha$.

  3. Estimate θ^ to find the subset Cθ^ by minimizing Δ(θ) given in Eq. (6).

  4. Exclude the units not in Cθ^ from the model.

  5. Learn again the model with the units in Cθ^ to update the weights.

Fig. 2.

The procedure for obtaining the linear transformation matrix W. (1) Learn the model to update the weights. (2) Evaluate Mαj and Mj. For example, we have 6 observations (n=6) and five units in the hidden layer (q=5). (3) Estimate the minimum number of observations θ by minimizing Δ(θ) to construct the optimal subset Cθ^. (4) Eliminate the units in the hidden layer that do not belong to the set Cθ^. (5) Re-learn the model with the units in Cθ^ to update the weights.

Model comparison

We compare the performance of the proposed model with that of linear and nonlinear models. Specifically, the proposed model is compared in its performance with linear models of PCA and ICA; with autoencoder models of SAE and VAE; and with other nonlinear models of LLE and Isomap.

PCA can be thought of as creating a set of new variables that are the linear combinations of the current variables in terms of the eigenvectors of the correlation matrix from the data. PCA transforms a data matrix $X$ of $n$ observations with dimensionality $m$ into an $n \times p$ matrix $Z_p$ with reduced dimensionality $p$ ($p \le m$). The transformation between $X$ and $Z_p$ is defined by an $m \times p$ matrix $V_p$, which is a set of eigenvectors called principal components:

$$Z_p = X V_p. \tag{12}$$

With $Z_p$, we can reconstruct the input data $X$ using Eq. (12) as

$$\tilde{X} = Z_p V_p^{T} = X V_p V_p^{T}, \tag{13}$$

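where $V_p^{T}$ is the transpose of $V_p$. Note that when $p = m$, $V_m V_m^{T} = I$ and $\tilde{X} = X$. With the output $\tilde{X}$, the mean square error, which is equivalent to the loss function in the proposed model, is given by

$$L(X, \tilde{X}) = \|X - \tilde{X}\|^2 = \|X - Z_p V_p^{T}\|^2 = \frac{1}{2nm} \sum_{\alpha=1}^{n} \sum_{i=1}^{m} \left( x_{\alpha i} - \tilde{x}_{\alpha i} \right)^2, \quad \text{where} \quad \tilde{x}_{\alpha i} = \sum_{j=1}^{p} (Z_p)_{\alpha j} \left( V_p^{T} \right)_{ji}. \tag{14}$$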
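For comparison, Eqs. (12) to (14) can be sketched with an eigendecomposition of the correlation matrix. This is a minimal NumPy illustration rather than the R implementation used in the study; the function name `pca_reconstruct`, the column standardization step, and the synthetic data are assumptions of this example.

```python
import numpy as np


def pca_reconstruct(X, p):
    """Project onto the top-p eigenvectors of the correlation matrix and reconstruct."""
    # Standardize each column (an assumption matching the use of the correlation matrix).
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    R = np.corrcoef(Xs, rowvar=False)              # m x m correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]              # indices sorted by descending eigenvalue
    V_p = eigvecs[:, order[:p]]                    # m x p matrix of principal components
    Z_p = Xs @ V_p                                 # Eq. (12)
    X_rec = Z_p @ V_p.T                            # Eq. (13)
    n, m = Xs.shape
    mse = np.sum((Xs - X_rec) ** 2) / (2 * n * m)  # Eq. (14)
    return Z_p, X_rec, mse


# Synthetic correlated data (hypothetical): 100 observations, 5 dimensions.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))

Z_1, _, mse_1 = pca_reconstruct(X, 1)
Z_2, _, mse_2 = pca_reconstruct(X, 2)
Z_full, X_full, mse_full = pca_reconstruct(X, 5)
```

With all five components retained, the eigenvector matrix is orthogonal and the reconstruction error vanishes up to floating-point rounding, in line with the remark that $\tilde{X} = X$ when $p = m$.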

ICA is a method for decomposing multivariate data into distinct non-Gaussian components. ICA finds a linear representation of non-Gaussian data such that the components are statistically independent. ICA focuses on independent sources unlike principal component analysis, which attempts to maximize the variance of the data points. As a method for dimensionality reduction, ICA is similar to PCA in that it takes a set of data and transforms it into a set of features. The difference is that while PCA tries to maximize variance, ICA assumes the dataset is a mixture of independent sources and tries to identify features that contribute independently to the dataset. In this study, we adopt FastICA algorithm29, which is known to be the most popular and the fastest algorithm to perform ICA. Given n×m input data matrix of X, ICA reduces the dimensionality by finding an orthonormal un-mixing matrix W such that

$$S_p = X K_p W, \tag{15}$$

where $K_p$ is the pre-whitening matrix that projects $X$ onto the largest $p$ principal components and $S_p$ is the estimated source matrix. Once $K_p$ and $W$ are evaluated, we can reconstruct the input data as follows.

$$\tilde{X} = S_p W^{-1} K_p^{T}, \tag{16}$$

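where $W^{-1}$ is the inverse of $W$ and $K_p^{T}$ is the transpose of $K_p$. Similar to PCA, we can quantify the mean square error as

$$L(X, \tilde{X}) = \|X - \tilde{X}\|^2 = \frac{1}{2nm} \sum_{\alpha=1}^{n} \sum_{i=1}^{m} \left( x_{\alpha i} - \tilde{x}_{\alpha i} \right)^2, \quad \text{where} \quad \tilde{x}_{\alpha i} = \sum_{j=1}^{p} \sum_{l=1}^{p} (S_p)_{\alpha j} \left( W^{-1} \right)_{jl} \left( K_p^{T} \right)_{li}. \tag{17}$$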

SAE consists of stacking a basic autoencoder (BAE) in hidden layers. BAE is the simplest autoencoder, composed of one code layer flanked by input and output layers. BAE is forced to learn features from the input data in the sense that fewer code layer units are required to reconstruct the input data. In BAE, the activation functions of the code layer units are usually determined by the model selection. In a SAE, the code layer of one BAE serves as the input for the next. This allows SAE to learn hierarchical representations of the data, with each layer capturing more complex features. The SAE uses the greedy layer-wise unsupervised learning algorithm with fine tuning for deep belief networks30,31. The SAE has a similar loss function to the proposed model given in Eq. (8), except for the optimal subset of the units:

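$$L(X, \tilde{X}) = \|X - \tilde{X}\|^2 = \frac{1}{2nm} \sum_{\alpha=1}^{n} \sum_{i=1}^{m} \left( x_{\alpha i} - \tilde{x}_{\alpha i} \right)^2, \tag{18}$$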

where x~αi is the output from the input xαi. For a fair comparison, we use an SAE of the three hidden layers constructed by stacking two BAEs, so that the proposed model and the SAE have the same number of units and hidden layers. We prefer to use SAE over BAE because it is known that SAE usually outperforms BAE in dimensionality reduction24. While primarily a generative model, the VAE can also be viewed as a dimensionality reduction method similar to traditional autoencoders. The VAE reduces the dimensionality of the input data by encoding it into a lower-dimensional latent space, but with a probabilistic framework that offers several advantages over purely deterministic methods. VAE has the same loss function as SAE of Eq. (18).

LLE is a nonlinear dimensionality reduction method that attempts to preserve the local structure of the data in the lower dimensional embedding. It assumes that each data point and its neighbors lie on a locally linear patch of a high-dimensional manifold. For a dataset of $n$ samples, LLE computes the $n \times n$ weight matrix $U_{ij}$ by minimizing the square difference between each original data point and its expression as a linear combination of its neighbors under the condition $\sum_{j=1}^{n} U_{ij} = 1$. LLE then obtains the dimensionality-reduced result by minimizing an embedding cost function with $U_{ij}$ held fixed. Isomap is a nonlinear dimensionality reduction method that focuses on preserving the global geometric properties of the data. It extends classical multi-dimensional scaling (MDS) to nonlinear manifolds by computing geodesic distances between data points. While classical MDS requires a matrix of pairwise distances between all data points, Isomap assumes that the pairwise distances are only known between adjacent data points and estimates the geodesic distance between any other pair as the shortest path through the graph of adjacent points. This effectively yields the full matrix of pairwise geodesic distances between all the data points. Isomap then uses classical MDS to compute the dimensionality-reduced result. It is effective for manifolds that are globally nonlinear but locally linear.
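Both methods are available in scikit-learn, the library the study reports using for them. The sketch below applies `LocallyLinearEmbedding` and `Isomap` to a small synthetic curved sheet; the data, the target dimensionality of two, and the random seeds are assumptions of this example, while the choice of five neighbors is used here only for illustration.

```python
import numpy as np
from sklearn.manifold import Isomap, LocallyLinearEmbedding

# Synthetic data (hypothetical): a 12 x 12 grid on a half-cylinder in 3D,
# with a small jitter so that neighbor distances are not exactly tied.
rng = np.random.default_rng(3)
u, v = np.meshgrid(np.linspace(0.0, np.pi, 12), np.linspace(0.0, 2.0, 12))
u, v = u.ravel(), v.ravel()
X = np.column_stack([np.cos(u), np.sin(u), v])
X = X + rng.normal(scale=0.01, size=X.shape)

# Locally linear embedding with five neighbors per point.
lle = LocallyLinearEmbedding(n_neighbors=5, n_components=2, random_state=0)
Z_lle = lle.fit_transform(X)

# Isomap with the same neighborhood size.
iso = Isomap(n_neighbors=5, n_components=2)
Z_iso = iso.fit_transform(X)
```

Neither method returns an explicit linear map from the input to the embedding, which is the interpretability gap the linear methods and the proposed model are meant to close.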

Results and discussion

Data acquisition and model selection

We conducted experiments on the proposed model and compared it with the linear models of PCA and ICA, and the nonlinear models of SAE, VAE, LLE, and Isomap. For the performance comparison, we used both image and non-image datasets: the image datasets MNIST, FMNIST, SVHN, and CIFAR10, and the non-image datasets Breast Cancer and Wine.

MNIST, FMNIST, SVHN, and CIFAR10 datasets are commonly used for training various image processing systems, including classification in machine learning. MNIST is a dataset of handwritten digit images (0–9) of 28×28 pixels and consists of 70,000 images. FMNIST is a dataset of item images from Zalando comprising 70,000 gray-scale images. Each image has 28×28 pixels associated with a label from 10 classes. For MNIST and FMNIST, we have normalized the original pixel intensities, which range from 0 to 255, to be in the interval [0, 1] for training purposes. MNIST and FMNIST can be downloaded at http://yann.lecun.com/exdb/mnist/ and https://github.com/zalandoresearch/fashion-mnist, respectively.

The SVHN is a dataset of color images obtained from house numbers in Google Street View images. We used images of 32×32 pixels centered on a single number. It consists of 99,289 digit images, from which we randomly selected 90,000 images so that the data could be partitioned into equal-sized subsets. It can be downloaded from http://ufldl.stanford.edu/housenumbers/. The CIFAR10 dataset consists of color images of 32×32 pixels in 10 mutually exclusive classes, with 6,000 images per class. Thus, there are 60,000 images in total. The dataset can be downloaded at https://www.cs.toronto.edu/~kriz/cifar.html. Because the color images require excessive computation, we converted the color images to binary images using adaptive mean thresholding32 for training purposes.

We partitioned each dataset into subsets of 10,000 images each. For instance, MNIST and FMNIST (70,000 images for each dataset) are divided into seven subsets. Similarly, SVHN and CIFAR10 with 90,000 and 60,000 images were divided into nine and six subsets, respectively. We then applied the proposed and other models to each subset with varying degrees of dimensionality reduction. We repeated the experiment with all subsets to estimate statistics such as the mean and standard error. For example, we repeated the experiment seven times for MNIST and FMNIST to obtain the average and standard deviation of the results.
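The partitioning and the per-subset summary can be sketched as follows. The helper names `partition_indices` and `mean_and_standard_error`, the use of a random shuffle to form the subsets, and the placeholder metric values are assumptions of this example; the text does not state how images were assigned to subsets.

```python
import numpy as np


def partition_indices(n_total, subset_size, seed=0):
    """Shuffle sample indices and split them into equal, disjoint subsets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_total)
    n_subsets = n_total // subset_size
    return [perm[i * subset_size:(i + 1) * subset_size] for i in range(n_subsets)]


def mean_and_standard_error(values):
    """Sample mean and standard error of the mean across subsets."""
    values = np.asarray(values, dtype=float)
    return values.mean(), values.std(ddof=1) / np.sqrt(values.size)


# MNIST-sized example: 70,000 samples split into seven subsets of 10,000.
subsets = partition_indices(70_000, 10_000)

# Placeholder per-subset values standing in for a metric such as the mean square error.
placeholder_metric = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0]
metric_mean, metric_se = mean_and_standard_error(placeholder_metric)
```

Each model would be fitted once per subset, and the resulting metric values summarized in this way.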

In the Breast Cancer dataset, 30 features (or dimensions) are computed from a digitized image of a breast mass. The dataset contains 569 samples which are classified into two classes: malignant and benign. It can be downloaded from https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic. The Wine dataset describes the chemical analysis of wines grown in the same region but from three different varieties. The dataset consists of 178 samples, each characterized by 13 continuous variables (or dimensions) and classified into three classes. It is useful for small-scale dimensionality reduction experiments and can be downloaded from https://archive.ics.uci.edu/dataset/109/wine.

For each of the four image datasets, a subset of 10,000 images was used to generate 10,000 low-dimensional projections. The experiment used five-fold cross-validation (or rotation estimation) for model validation. The 10,000 projections were randomly divided into five partitions (or folds). In each iteration, four partitions were used to train the SVM classifier, while the remaining partition was used for testing. This process was repeated five times, ensuring each partition was used once as test data. The test results from all rounds were then averaged to estimate the predictive performance of the models. This cross-validation method helps to reduce the variability of the results. For each non-image dataset, 569 samples for Breast Cancer and 178 for Wine were divided into five and three subsets, respectively. Each subset is then used to generate low-dimensional projections. Similar to the image datasets, the low-dimensional projections in each set were randomly divided into four partitions for the four-fold cross-validation. The results from all rounds were then averaged.
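The cross-validation loop can be illustrated in Python, although the study itself ran the classification experiments in R. In the sketch below, the Wine data bundled with scikit-learn stands in for a dataset, a three-component PCA stands in for the low-dimensional projections, and the default RBF kernel of `SVC`, the macro-averaged F1 score, and the fixed random seed are all assumptions of this example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_validate
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in low-dimensional projections: standardized Wine data reduced to 3 dimensions.
X, y = load_wine(return_X_y=True)
Z = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(X))

# Randomly divide the projections into five folds; each fold is the test set once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
result = cross_validate(SVC(), Z, y, cv=cv,
                        scoring=["f1_macro", "balanced_accuracy"])

# Average the test scores over the five rounds.
mean_f1 = result["test_f1_macro"].mean()
mean_balanced_accuracy = result["test_balanced_accuracy"].mean()
```

Changing `n_splits` to four gives the four-fold variant described for the non-image datasets.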

For model selection in LLE and Isomap, we found that setting the number of neighboring data points, the only hyper-parameter in these models, to five gave optimal performance. In addition, performance was relatively insensitive to changes in this hyper-parameter. We optimized the hyper-parameters in the neural networks of the proposed model, SAE, and VAE through experimentation, exploring different optimizers and learning rates. We found that Stochastic Gradient Descent (SGD) for the proposed model and Adam for SAE as well as VAE consistently produced a more stable reduction in the loss function than RMSprop and Adagrad. In addition, SGD and Adam performed best with learning rates around 0.1 and 0.001, respectively. We observed that RMSprop and Adam converged faster than the other optimizers in the mean square error. We refer to a review33 for an overview of the optimizers.

We tried different sizes for the hidden layers. For the proposed model, the behavior of the loss function was little affected by the number of units in the hidden layers as long as it exceeded the number of units in the input layer. We used 2,000 units in the hidden layers. For the SAE and VAE, we found that 400 and 128 units, respectively, outperformed the other sizes, although the difference in performance was not significant. We tested different methods for initializing weights and biases and found no significant differences in the loss function between initialization methods, as long as normal or uniform distributions were used. We also found that the loss function was insensitive to the mini-batch size when a sufficiently large number of epochs was used, while a mini-batch of 100 performed better when the number of epochs was limited. With optimal initial weights, biases, optimizers, and number of units in the hidden layer, there was no significant discrepancy between different numbers of epochs, provided they exceeded 100.

To avoid overfitting, we used a validation dataset, a subset of the training data, to fine-tune the hyper-parameters of the proposed model, SAE, and VAE. We divided each dataset into 90% for training and 10% for validation. We then trained the models on the training data and monitored the loss function on the validation dataset. We mitigated overfitting by “early stopping”34, a simple technique that is often more effective than other regularization methods.
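A minimal sketch of an early-stopping rule is given below. The `patience` and `min_delta` parameters and the example loss curve are assumptions made for illustration; the stopping criterion actually used in the experiments is not specified in the text.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the index of the epoch whose weights would be kept.

    Training is considered stopped once the validation loss has failed to
    improve by more than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs: stop training
    return best_epoch


# Hypothetical validation-loss curve: it improves and then rises (overfitting).
losses = [1.0, 0.8, 0.6, 0.55, 0.56, 0.58, 0.60, 0.65, 0.70]
best = early_stopping_epoch(losses, patience=3)
```

In practice the loop body would wrap one training epoch, and the weights saved at the returned epoch would be restored after stopping.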

Characteristics of the optimal subset of activating units

The construction of the subset of activating units $C_\theta$, as described in Eq. (5), relies on the number of observations that activate each unit in the hidden layer. To explore the property of these activating observations, we consider the set $\{M_j \mid j = 1, 2, \ldots, q\}$, where $M_j$, defined in Eq. (4), represents the number of observations that activated unit $j$. We plot this set in ascending order of $M_j$ for different code sizes, with the results shown in Fig. 3a for the MNIST dataset. As shown in Fig. 3a, the number of observations increases approximately linearly, especially in the mid-range, as the units are ordered by increasing number of observations. This pattern is more or less consistent across all code sizes, suggesting that the activation of hidden layer units is independent of the degree of dimensionality reduction. A similar trend was observed in other datasets, with different rates of linear increases, as shown in Supplementary Information Fig. S1.
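The counting of activating observations can be illustrated with a small sketch. It assumes that an observation activates a unit when the ReLU pre-activation is strictly positive, which is our reading of Eq. (4); the function name `activation_counts` and the toy weights are hypothetical.

```python
def activation_counts(X, W1, b1):
    """Count, for each hidden unit j, the observations whose ReLU pre-activation is positive.

    X  : n observations, each a list of m inputs
    W1 : m x q weight matrix (W1[i][j] connects input i to hidden unit j)
    b1 : q biases
    """
    q = len(b1)
    counts = [0] * q
    for x in X:
        for j in range(q):
            pre = b1[j] + sum(x_i * W1[i][j] for i, x_i in enumerate(x))
            if pre > 0:  # the ReLU outputs a non-zero value
                counts[j] += 1
    return counts


# Toy example: 3 observations, 2 inputs, 3 hidden units (illustrative values only).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W1 = [[1.0, -1.0, 0.5],
      [-1.0, 1.0, 0.5]]
b1 = [0.5, 0.0, -0.75]
M = activation_counts(X, W1, b1)
order = sorted(range(len(M)), key=lambda j: M[j])  # unit indices in ascending order of M_j
```

Plotting `M[j]` along `order` would give a curve of the kind described above.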

Fig. 3.

(a) Plot of the number of observations versus the unit indices in ascending order of their observations for the MNIST dataset. The number of units used and the total number of observations are 2,000 and 10,000, respectively. Each plot is drawn with different code sizes: 4, 8, 12, 16, and 20. The red dotted line represents a slope as a guide to a linear relationship. (b) A plot of Δ(θ) versus θ for the MNIST dataset. Each plot is drawn with different code sizes: 4, 8, 12, 16, and 20.

To demonstrate the existence of an optimal $\hat{\theta}$ in Eq. (7), we plot $\Delta(\theta)$ versus $\theta$ using the MNIST dataset as an example. Figure 3b shows the behavior of $\Delta(\theta)$ versus $\theta$ for various code sizes (i.e., the number of units in the code layer) in the MNIST dataset. From Fig. 3b, we observe that there is an optimal $\hat{\theta}$ at which $\Delta(\theta)$ reaches its minimum for all code sizes. Additionally, the discrepancy $\Delta(\theta)$ decreases as the code size increases, indicating the degree of dimensionality reduction. This is expected because a larger code size simplifies the dimensionality reduction, resulting in a lower mean square error. These characteristics are consistent across all types of datasets. Results for other datasets are presented in Supplementary Information Fig. S2.

To quantify the contribution of hidden layer units to the optimal subset for a linear transformation, we analyzed the ratio of activated units (those that output non-zero values) to the total number of units in the hidden layer. Table 1 shows this ratio for all datasets. As shown in Table 1, about 50–60% of the units are included in the linear transformation, regardless of code size and dataset. In addition, the non-image datasets have a lower ratio than the image datasets.

Table 1.

The ratio of activated units in the hidden layer averaged over independent trials for each dataset.

| Dataset | Code size 4 | Code size 8 | Code size 12 | Code size 16 | Code size 20 |
|---|---|---|---|---|---|
| MNIST | 0.54 (0.002) | 0.53 (0.002) | 0.54 (0.002) | 0.56 (0.002) | 0.57 (0.002) |
| FMNIST | 0.55 (0.003) | 0.57 (0.003) | 0.59 (0.003) | 0.60 (0.003) | 0.60 (0.003) |
| SVHN | 0.57 (0.003) | 0.57 (0.003) | 0.58 (0.003) | 0.58 (0.003) | 0.56 (0.003) |
| CIFAR10 | 0.59 (0.003) | 0.60 (0.004) | 0.57 (0.004) | 0.56 (0.004) | 0.55 (0.004) |
| BC | 0.48 (0.005) | 0.49 (0.005) | 0.50 (0.005) | 0.50 (0.005) | 0.51 (0.005) |
| WINE | 0.43 (0.006) | 0.47 (0.005) | 0.50 (0.005) | 0.51 (0.005) | 0.51 (0.005) |

The corresponding standard deviations are given in parentheses. The “BC” stands for the Breast Cancer dataset, and the code sizes for the Wine dataset are 2, 4, 6, 8, and 10.

Signature for classification

The proposed method enhances interpretability by employing a linear transformation that allows the dimensionality-reduced results to be expressed as a linear combination of the original data, and vice versa. This allows us to understand the contribution of each component from the original data to the reduced-dimensional result. Consequently, the level of interpretability is tied to how accurately the reduced-dimensional results reflect the original data. In this framework, the performance of different models can be evaluated based on how effectively their low-dimensional projections (or dimensionality-reduced results) represent the original data.

This evaluation involves training a support vector machine (SVM) classifier using the low-dimensional projections as input to predict the classes of the original data. We used the low-dimensional projection with its corresponding class (or label) to determine how well the low-dimensional projection captures and represents the original data in the classification task. Higher performance of the low-dimensional projection would indicate a better representation of the original data. We evaluated the performance of SVM using metrics such as F1 score and balanced accuracy. A detailed explanation of SVM and the metrics can be found in the “Methods” section.

The SVM used cross-validation to learn from the proposed and the comparison models, aiming to predict the class (or label) of the test data. We evaluated all models based on the confusion matrix generated by the SVM classifier. For the datasets in this study, the models were tasked with identifying the optimal separating hyperplane (or regression curve) to classify observations into $c$ different classes: $c = 10$ for the four image datasets and $c = 2$ or $c = 3$ for the Breast Cancer and Wine datasets, respectively. The classification results were summarized in a $c \times c$ confusion matrix, denoted as $D = \{D_{ij} \mid i, j = 1, 2, \ldots, c\}$, where $c$ is the number of classes. Using the confusion matrix, we evaluated the performance of the classifiers using the F1 score and balanced accuracy. The confusion matrix was analyzed using the R package “caret”35.

Figure 4 illustrates the performance of the SVM classifier using the F1 score when the code size is four. While the F1 scores show fluctuations across class labels, the proposed method and nonlinear methods produce higher scores implying more accurate label predictions than the linear methods of PCA and ICA. For the image datasets, we find that the F1 scores of MNIST and FMNIST datasets from all models are higher than those of SVHN and CIFAR10 datasets; the F1 scores of the CIFAR10 dataset are the lowest over all datasets. In particular, the F1 scores of the CIFAR10 dataset are less than 0.5, which implies that the low-dimensional projections of the CIFAR10 dataset are hard to classify. For the non-image datasets of Breast Cancer and Wine, both the proposed and nonlinear models generally perform better than the linear models of PCA and ICA, though the performance difference is insignificant. This is because the dimensionality of the non-image datasets is much lower than that of the image datasets, making it easier to reduce dimensionality. This assertion is supported by the higher F1 scores of the non-image datasets than those of the image datasets.

Fig. 4.

Plots of F1 scores for each class label when the number of codes (or number of components) was four. The F1 score was averaged over the cross-validation and the error bars represent the corresponding standard errors estimated from independent trials. Some error bars are too small to be identified. (a) MNIST dataset, (b) FMNIST dataset, (c) SVHN dataset, (d) CIFAR10 dataset, (e) Breast Cancer dataset, and (f) Wine dataset.

We have found similar results with different code sizes in the datasets as provided in Supplementary Information Figs. S3 and S4, from which we found that the CIFAR10 results showed higher F1 scores as the code size increased. The difference in the F1 scores between different methods decreased as the number of codes (or principal components) increased. The results using the balanced accuracy show similar characteristics to the case of the F1 score as shown in Fig. 5 when the code size is four. We provide the results with different code sizes and datasets in Supplementary Information Figs. S5 and S6. Comparing the F1 score and the balanced accuracy across the four image datasets, it is clear that MNIST and FMNIST are more amenable to dimensionality reduction than SVHN and CIFAR10. Of these, the CIFAR10 dataset poses the greatest challenge to dimensionality reduction, making it the least feasible for a given code size. We also find that the F1 score and the balanced accuracy for non-image datasets show the same trend.

Fig. 5.

Plots of balanced accuracies for each class label when the number of codes (or number of components) was four. The balanced accuracy was averaged over the cross-validation and the error bars represent the corresponding standard errors estimated from independent trials. Some error bars are too small to be identified. (a) MNIST dataset, (b) FMNIST dataset, (c) SVHN dataset, (d) CIFAR10 dataset, (e) Breast Cancer dataset, and (f) Wine dataset.

Mean square error and image reconstruction

Besides the classification, we compared the performance of the proposed model with that of PCA, ICA, SAE, and VAE in terms of the mean square error (MSE) and input image reconstruction. Specifically, we use the MSE between the original and reconstructed data as a performance metric. The experiment was performed for each model using subsets of 10,000 images for the image datasets, and five and three subsets for the non-image datasets of Breast Cancer and Wine, respectively. We repeated the experiment independently for all subsets to obtain the mean and standard deviation of the MSEs. The proposed model, SAE, and VAE learned each subset to obtain the dimensionality-reduced result for a given number of units in the code layer. The dimensionality reduction of PCA and ICA was performed by Eqs. (12) and (15), respectively. The number of the largest principal components (or statistically independent components) used to obtain the dimensionality-reduced result in PCA (or ICA) corresponds to the code size (i.e., the number of units in the code layer) in the proposed model, SAE, and VAE. We implemented PCA and ICA using the function prcomp() in the stats R package36 and the fastICA algorithm in the fastICA R package37, respectively.
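For concreteness, the MSE between original and reconstructed data can be sketched as below. Averaging the squared differences over all observations and all dimensions is an assumption on our part, since the exact normalization is not restated here; `mean_square_error` is a hypothetical helper name.

```python
def mean_square_error(X, X_rec):
    """Mean of the squared element-wise differences between original and reconstructed data."""
    total = 0.0
    count = 0
    for row, row_rec in zip(X, X_rec):
        for value, value_rec in zip(row, row_rec):
            total += (value - value_rec) ** 2
            count += 1
    return total / count


# Toy data: two observations with two dimensions each (illustrative values only).
X = [[0.0, 1.0], [1.0, 0.0]]
X_rec = [[0.0, 0.5], [1.0, 0.5]]
mse = mean_square_error(X, X_rec)
```

A perfect reconstruction gives an MSE of zero, and larger values indicate a greater loss of information in the low-dimensional representation.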

The experimental results are shown in Fig. 6. It can be seen that as the code size increases, the MSE decreases for all models. This indicates that dimensionality reduction becomes more effective with larger code sizes. Comparing the MSEs across the four image datasets, MNIST and FMNIST have lower MSEs than SVHN and CIFAR10, and CIFAR10 has the highest mean square error. This indicates that MNIST and FMNIST are more amenable to dimensionality reduction than SVHN and CIFAR10. The CIFAR10 dataset poses the greatest challenge to dimensionality reduction, making it the least feasible for a given code size. These results are consistent with the findings in the classification shown in Figs. 4 and 5. Namely, MNIST and FMNIST have higher F1 scores and balanced accuracy than SVHN and CIFAR10, and CIFAR10 has the lowest F1 score and balanced accuracy. When comparing the MSEs of the two non-image datasets, the Breast Cancer dataset has lower MSEs than the Wine dataset. For the Breast Cancer dataset, all models except the VAE perform similarly (Fig. 6e). In contrast, for the Wine dataset, the proposed model and the SAE outperform the other models (Fig. 6f). In particular, the VAE performs the worst in two datasets, suggesting that VAE may not be well suited for non-image datasets.

Fig. 6.

Plots of the mean square error versus the code size (or number of principal and independent components). The mean square error was averaged over different independent trials from the datasets, and the error bars represent corresponding standard errors estimated from independent trials. They are too small to be identified. (a) MNIST dataset, (b) FMNIST dataset, (c) SVHN dataset, (d) CIFAR10 dataset, (e) Breast Cancer dataset, and (f) Wine dataset.

More importantly, regardless of the code size, the proposed model and the autoencoders of SAE and VAE consistently showed lower MSEs than PCA and ICA. This indicates that the proposed model and the autoencoders achieved more effective dimensionality reduction compared to PCA and ICA. The superior performance of the autoencoders was expected due to their nonlinear nature, which is generally considered more powerful than linear methods. In general, nonlinear methods are more likely to outperform linear methods because they can better capture any nonlinearity in the original data. Notably, the linear approach of the proposed model outperformed other linear methods such as PCA and ICA, and showed comparable performance to the nonlinear autoencoders, except for the CIFAR10 dataset. This suggests that the proposed linear model has an advantage over PCA and ICA in dimensionality reduction. In addition, the proposed model provides a better interpretation of the dimensionality-reduced representation than the autoencoders, while maintaining similar performance as evidenced by the mean square error.

We also analyzed the MSE over different image classes. As shown in Fig. 7a–d, the proposed model and the autoencoders consistently achieved lower MSEs than PCA and ICA for all image classes, with the autoencoders performing best. We found some variability in the MSE between different classes, which can be attributed to the different degrees of complexity of the images. As shown in Fig. 7e–f, the proposed model and SAE consistently achieved lower MSEs than PCA and ICA for all non-image classes, while VAE performs the worst. These results are consistent with the behavior of the MSEs in Fig. 6e–f. These results indicate that the superior performance of the proposed model over PCA and ICA is consistent regardless of the datasets. Furthermore, these trends remained stable across different code sizes, as shown in Supplementary Information Figs. S7 and S8.

Fig. 7.

Plots of the mean square error for each class label when the number of codes (or number of components) was four. The mean square error was averaged over different independent trials from the datasets, and the error bars represent the corresponding standard errors estimated from independent trials. Error bars are too small to be identified. The number of independent trials for each dataset is the same as in Fig. 6. (a) MNIST dataset, (b) FMNIST dataset, (c) SVHN dataset, (d) CIFAR10 dataset, (e) Breast Cancer dataset, and (f) Wine dataset.

We validated the performance of the proposed model by reconstructing input images, as shown in Fig. 8. Upon visual inspection, the image reconstructions produced by the proposed model and other AE models were superior to those produced by PCA and ICA. This visual assessment is corroborated by the lower mean square errors achieved by the proposed model and SAE compared to PCA and ICA, as shown in Fig. 7. These trends remained consistent across image reconstructions with different code sizes, as detailed in Supplementary Information Fig. S9. An exception was the CIFAR10 dataset, where reconstructed images from different models were not easily distinguishable. This may be due to the higher difficulty of dimensionality reduction in the CIFAR10 dataset, as shown in the mean square error result of Fig. 6.

Fig. 8.

The reconstructed input images when the number of codes (or principal components) was four. From top to bottom: original test images; reconstructions by the proposed model, PCA, ICA, SAE, VAE. (a) MNIST, (b) FMNIST, (c) SVHN, and (d) CIFAR10.

The experimental results can be summarized as follows: First, while the proposed model, PCA, and ICA all provide linear transformations, the proposed model outperforms PCA and ICA in terms of the classification of the low-dimensional projections, the mean square error, and input image reconstruction. Consequently, the proposed model shows superior performance in dimensionality reduction compared to PCA and ICA. Second, the proposed model performs comparably to other nonlinear models in all datasets except CIFAR10. Although not as effective as other nonlinear models in some cases, the proposed model provides a linear transformation between the input data and the low-dimensional representation, a feature that the nonlinear models lack.

Conclusion

In this study, we proposed a tied-weight autoencoder model for dimensionality reduction. The proposed model consisted of two hidden layers flanking the code layer with the input and output layers, which effectively reduced the dimensionality of the data. Using the ReLU activation function for the hidden layers, the proposed model could provide an approximate linear transformation connecting the original data and the dimensionality-reduced representation. From the performance analyses, we demonstrated that the proposed model preserves some features of interest in the original high-dimensional data by producing a low-dimensional linear mapping. Considering that nonlinear dimensionality reduction methods do not provide a linear relationship between the original data and its low-dimensional projection, the linear transformation of the proposed model is certainly an advantage. In this sense, the proposed model has the advantage of both linear and nonlinear models.

Since an autoencoder is mainly applied to dimensionality reduction, we showed that the proposed AE can learn a low-dimensional projection, which is better than conventional linear methods. We conducted a comparative analysis between the proposed model and other linear models, which showed that our proposed approach outperforms other linear methods such as PCA and ICA in terms of loss function optimization and input data reconstruction. In addition, our proposed model showed superior accuracy in classifying original features using dimensionality-reduced representations. Furthermore, we employed an SAE with an identical structure to our proposed model and demonstrated comparable performance between the two. This suggests that our model offers improved interpretability through linear transformation, unlike the SAE, which relies on nonlinear methods and may not have the same interpretative advantage.

The proposed method offers a solution to the problem of handling high-dimensional data by effectively reducing it to lower-dimensional features. This capability is valuable, particularly in addressing the curse of dimensionality, especially in scenarios where the dimensionality of the data far exceeds the sample size. Given that addressing these challenges requires novel approaches and theories in statistics and machine learning, our proposed model represents a promising avenue for dimensionality reduction. For example, it can be applied to estimating model parameters in regression analyses of high-dimensional datasets. Here, the approach estimates covariates with reduced dimensionality and transforms them back to their original variables through linear transformations, allowing for more manageable and interpretable analyses.

Regarding the proposed model as the basic model, we can construct a generalized model by stacking it. In the stacked model, the code layer in the outer basic model plays the role of the input layer in the inner basic model. Pre-training consists of learning each basic model individually, after which the pre-trained basic models are “unrolled” to create a stacked model. The stacked model is then fine-tuned. This layer-by-layer learning can be repeated as many times as desired. The stacked model may provide better performance than the basic model while maintaining the linear transformation.

In addition to these advantages, the stacked model has some potential disadvantages. Training stacked autoencoders can be computationally expensive, often requiring layer-wise pre-training followed by fine-tuning. Following this learning strategy, the time complexity of training the model would be proportional to the number of basic models stacked. The stacked model often has many hyper-parameters (e.g., number of layers, units, learning rates) that can be difficult to tune optimally. Choosing the wrong hyper-parameters can lead to suboptimal performance. Similar to other deep neural networks, the stacked model can suffer from the vanishing gradient problem, where gradients diminish as they are backpropagated through the layers, resulting in slower training or poor performance. In addition, if not properly regularized, the stacked model can overfit the training data, resulting in poor generalization to the test data. In this sense, the stacked model should be investigated in future studies.

Methods

Observation-wise linearity between input and codes

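Let $m$, $q$, and $p$ represent the number of units in the input, hidden, and code layers, respectively. Referring to Fig. 1, we can express the $j$th unit of the first hidden layer in terms of input observation $\alpha$ and the weights as

$$h_{\alpha j} = \phi^{(1)}\left(\sum_{i=0}^{m} x_{\alpha i}\, w_{ij}^{(1)}\right), \tag{19}$$

where $w_{0j}^{(1)}$ is the bias with $x_{\alpha 0} = 1$. Similarly, the $k$th unit in the code layer can be expressed using Eq. (19) as

$$z_{\alpha k} = \phi^{(2)}\left(\sum_{j=0}^{q} h_{\alpha j}\, w_{jk}^{(2)}\right) = \sum_{j=1}^{q} \phi^{(1)}\left(\sum_{i=0}^{m} x_{\alpha i}\, w_{ij}^{(1)}\right) w_{jk}^{(2)} + w_{0k}^{(2)}, \tag{20}$$

where $w_{0k}^{(2)}$ is the bias with $h_{\alpha 0} = 1$, and we used the identity function as the activation function $\phi^{(2)}(\cdot)$ of the code layer.

By rearranging the summations, we can rewrite Eq. (20) as

$$z_{\alpha k} = \sum_{i=0}^{m} x_{\alpha i} \sum_{j \in A_\alpha} w_{ij}^{(1)} w_{jk}^{(2)} + w_{0k}^{(2)}, \tag{21}$$

where the set $A_\alpha$ is a subset of the units in the hidden layer. Using the property that the ReLU activation function is the identity function for positive arguments, the set $A_\alpha$ is defined as

$$A_\alpha \equiv \left\{ j \,\middle|\, \sum_{i=0}^{m} x_{\alpha i}\, w_{ij}^{(1)} > 0 \ \text{for}\ j = 1, 2, \ldots, q \right\}. \tag{22}$$

The set $A_\alpha$ is the collection of units in the hidden layer that contribute to the code layer, and it depends on $x_{\alpha i}$ of observation $\alpha$. We can express the unit in the code layer in terms of the input observation by rewriting Eq. (21) as

$$z_{\alpha k} = \sum_{i=1}^{m} x_{\alpha i}\, W_{ik}^{\alpha} + \Delta_{\alpha k}^{(2)}, \quad \text{where} \quad W_{ik}^{\alpha} \equiv \sum_{j \in A_\alpha} w_{ij}^{(1)} w_{jk}^{(2)} \quad \text{and} \quad \Delta_{\alpha k}^{(2)} \equiv \sum_{j \in A_\alpha} w_{0j}^{(1)} w_{jk}^{(2)} + w_{0k}^{(2)}, \tag{23}$$

where $\Delta_{\alpha k}^{(2)}$ collectively represents biases.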
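The observation-wise linearity can be checked numerically with a small sketch. It builds a toy encoder with random weights, computes the code once through the ReLU network and once through the observation-specific linear map of Eq. (23), and compares the two. The layer sizes, the random weights, and the function names `encode` and `linear_map` are illustrative assumptions, not the configuration used in the experiments.

```python
import random


def relu(v):
    return v if v > 0 else 0.0


def encode(x, W1, b1, W2, b2):
    """Forward pass: input -> ReLU hidden layer -> identity code layer (Eqs. 19-20)."""
    q, p = len(b1), len(b2)
    h = [relu(b1[j] + sum(x[i] * W1[i][j] for i in range(len(x)))) for j in range(q)]
    return [b2[k] + sum(h[j] * W2[j][k] for j in range(q)) for k in range(p)]


def linear_map(x, W1, b1, W2, b2):
    """Observation-specific matrix W^alpha and bias Delta^alpha of Eq. (23)."""
    m, q, p = len(x), len(b1), len(b2)
    active = [j for j in range(q)
              if b1[j] + sum(x[i] * W1[i][j] for i in range(m)) > 0]  # the set A_alpha
    W = [[sum(W1[i][j] * W2[j][k] for j in active) for k in range(p)] for i in range(m)]
    delta = [b2[k] + sum(b1[j] * W2[j][k] for j in active) for k in range(p)]
    return W, delta


rng = random.Random(0)
m, q, p = 4, 6, 2  # toy sizes, not the sizes used in the paper
W1 = [[rng.uniform(-1, 1) for _ in range(q)] for _ in range(m)]
W2 = [[rng.uniform(-1, 1) for _ in range(p)] for _ in range(q)]
b1 = [rng.uniform(-1, 1) for _ in range(q)]
b2 = [rng.uniform(-1, 1) for _ in range(p)]
x = [rng.uniform(0, 1) for _ in range(m)]

z_network = encode(x, W1, b1, W2, b2)
W, delta = linear_map(x, W1, b1, W2, b2)
z_linear = [delta[k] + sum(x[i] * W[i][k] for i in range(m)) for k in range(p)]
```

The two codes agree up to floating-point rounding because inactive units contribute zero to the sum over the hidden layer, while active units pass their pre-activation through unchanged.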

Reconstruction

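Since the model uses tied weights, the weights in the decoder are the transpose of the corresponding weights in the encoder. Specifically, the weights connecting the code layer and the hidden layer in the decoder are the transpose of $w_{jk}^{(2)}$ in the encoder. Similarly, the weights connecting the hidden layer and the output layer in the decoder are the transpose of $w_{ij}^{(1)}$ in the encoder. With the tied weights, the units in the hidden layer of the decoder can be written in terms of the units in the code layer as

$$\tilde{h}_{\alpha j} = \phi^{(1)}\left(\sum_{k=0}^{p} z_{\alpha k}\, w_{kj}^{(2)}\right). \tag{24}$$

Using the above expression, the units in the output layer can be expressed as

$$\tilde{x}_{\alpha i} = \sum_{j \in C_{\hat{\theta}}} \tilde{h}_{\alpha j}\, w_{ji}^{(1)} + w_{0i}^{(1)} = \sum_{j \in C_{\hat{\theta}}} \phi^{(1)}\left(\sum_{k=0}^{p} z_{\alpha k}\, w_{kj}^{(2)}\right) w_{ji}^{(1)} + w_{0i}^{(1)} = \sum_{k=0}^{p} z_{\alpha k} \sum_{j \in C_{\hat{\theta}}} w_{kj}^{(2)} w_{ji}^{(1)} + w_{0i}^{(1)}, \tag{25}$$

where we have used the identity activation function $\phi^{(2)}(\cdot)$ of the output layer.

By defining $W_{ki} \equiv \sum_{j \in C_{\hat{\theta}}} w_{kj}^{(2)} w_{ji}^{(1)}$, which is the transpose of $W_{ik}$ in Eq. (9), the units in the output layer can be expressed as

$$\tilde{x}_{\alpha i} = \sum_{k=1}^{p} z_{\alpha k}\, W_{ki} + \Delta_{i}^{(1)}, \quad \text{where} \quad \Delta_{i}^{(1)} \equiv \sum_{j \in C_{\hat{\theta}}} w_{0j}^{(2)} w_{ji}^{(1)} + w_{0i}^{(1)}. \tag{26}$$

Absorbing the bias terms $\Delta_{i}^{(1)}$ in the output layer, we have

$$\tilde{x}_{\alpha i} - \Delta_{i}^{(1)} \rightarrow \tilde{x}_{\alpha i} = \sum_{k=1}^{p} z_{\alpha k}\, W_{ki} \quad \text{or} \quad \tilde{X} = Z_p W_p^{T} \tag{27}$$

in the matrix form, where the superscript $T$ stands for the transpose. In addition, $\tilde{X}$, $Z_p$, and $W_p^{T}$ are $n \times m$, $n \times p$, and $p \times m$ matrices, respectively.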
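The linear reconstruction can be sketched in a few lines. The example assumes that the encoder weights and the subset of retained hidden units are already known; the helper names, the bias arguments `c_hidden` and `c_out`, and all numeric values are hypothetical and serve only to show how $W_{ki}$ and $\Delta_i^{(1)}$ of Eq. (26) are assembled from the tied weights.

```python
def decoder_linear_map(W1, W2, c_hidden, c_out, C):
    """Build W_ki and Delta_i of Eq. (26) from tied encoder weights.

    W1[i][j] : encoder weight, input i -> hidden unit j (the decoder uses its transpose)
    W2[j][k] : encoder weight, hidden unit j -> code unit k (the decoder uses its transpose)
    c_hidden : decoder hidden-layer biases (length q)
    c_out    : output-layer biases (length m)
    C        : indices of the hidden units kept in the linear approximation
    """
    m, p = len(W1), len(W2[0])
    W = [[sum(W2[j][k] * W1[i][j] for j in C) for i in range(m)] for k in range(p)]
    delta = [c_out[i] + sum(c_hidden[j] * W1[i][j] for j in C) for i in range(m)]
    return W, delta


def reconstruct(Z, W, delta):
    """Row-wise version of Eq. (27) with the bias added back: x_tilde = z W + Delta."""
    return [[delta[i] + sum(z[k] * W[k][i] for k in range(len(z)))
             for i in range(len(delta))] for z in Z]


# Toy sizes: m = 3 inputs, q = 4 hidden units, p = 2 code units.
W1 = [[1.0, 0.0, 2.0, 1.0],
      [0.0, 1.0, 1.0, 1.0],
      [1.0, 1.0, 0.0, 1.0]]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0],
      [5.0, 5.0]]
W, delta = decoder_linear_map(W1, W2, [0.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.5], C=[0, 1, 2])
X_rec = reconstruct([[1.0, 2.0]], W, delta)
```

Hidden unit 3 is left out of `C`, so its large weights do not enter the linear map; this mirrors the removal of units outside the retained subset.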

Classifiers and evaluation metrics

As a supervised machine learning technique, the SVM is a binary classifier that demonstrates effectiveness and robustness in high-dimensional data spaces. Its application can extend to multi-class classification tasks. The practical performance of SVM depends on the parameter configuration, specifically the cost parameter (C) and the nonlinear kernel parameter (γ). Here, C represents the penalty imposed for misclassifications, while γ defines the characteristics of the radial basis kernel. To improve classification accuracy, we performed model selection using a grid search method to identify optimal values for γ and C. Our SVM implementation used the R package “e1071”38, which interfaces to the “libsvm” library39 and uses the one-against-one approach in multi-class classification scenarios.

Using the confusion matrices, we can evaluate the effectiveness of different models using appropriate metrics. Among these, accuracy stands out as a straightforward metric, calculated as the ratio of correctly predicted instances to the total number of instances. However, when dealing with unbalanced data, where instances are unevenly distributed across different classes, relying solely on accuracy can be misleading. This is because accuracy does not distinguish between correctly classified instances of different classes. To refine performance evaluation in such scenarios, we turned to three commonly used metrics: precision, recall (or sensitivity), and specificity.

For the ith class, precision, recall, and specificity are respectively defined as

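$$p_i \equiv \frac{(TP)_i}{(TP)_i + (FP)_i}, \quad r_i \equiv \frac{(TP)_i}{(TP)_i + (FN)_i}, \quad \text{and} \quad s_i \equiv \frac{(TN)_i}{(TN)_i + (FP)_i}. \tag{28}$$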

Here, $(TP)_i$, $(FP)_i$, $(FN)_i$, and $(TN)_i$ are the numbers of true positives, false positives, false negatives, and true negatives of the $i$th class, respectively. They are given, in terms of the confusion matrix $D$, as

$$(TP)_i = D_{ii}, \quad (FP)_i = \sum_{j \neq i}^{c} D_{ij}, \quad (FN)_i = \sum_{j \neq i}^{c} D_{ji}, \quad (TN)_i = \sum_{j \neq i}^{c} D_{jj}, \tag{29}$$

where $c$ is the number of classes. Precision measures the ratio of true positives to all positive predictions and indicates how often a positive prediction is correct. Higher precision corresponds to fewer false positives. Recall quantifies the ratio of correctly predicted positives to all actual positives and indicates how many of the positive instances are detected. A higher recall means fewer false negatives. Specificity is simply the true negative rate.

Using the three metrics, we evaluate the F1 score and the balanced accuracy as the performance measures. They are defined respectively as

$$(F1)_i \equiv \frac{2\, p_i\, r_i}{p_i + r_i} \quad \text{and} \quad (BA)_i \equiv \frac{1}{2}\left(r_i + s_i\right). \tag{30}$$

The F1 score is the harmonic mean of precision and recall. It is well suited for unbalanced datasets and provides a balanced view of precision and recall simultaneously. Balanced accuracy is defined as the average of sensitivity and specificity. Note that the balanced accuracy becomes the usual accuracy when instances are evenly distributed across different classes.
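As an illustration, the per-class metrics can be computed from a confusion matrix as sketched below. The code follows Eqs. (28)–(30) as printed, so row $i$ of $D$ holds the instances predicted as class $i$ and the true negatives are the remaining diagonal entries; a common one-vs-rest convention instead counts every instance outside row $i$ and column $i$ as a true negative, and the two coincide for two classes. The function name and the example matrix are hypothetical, and the sketch assumes every class is both predicted and present at least once so that no denominator is zero.

```python
def per_class_metrics(D):
    """Precision, recall, specificity, F1 score, and balanced accuracy for each class.

    D[i][j] follows Eq. (29): row i is the predicted class, column j the actual class.
    """
    c = len(D)
    results = []
    for i in range(c):
        tp = D[i][i]
        fp = sum(D[i][j] for j in range(c) if j != i)
        fn = sum(D[j][i] for j in range(c) if j != i)
        tn = sum(D[j][j] for j in range(c) if j != i)  # as written in Eq. (29)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        specificity = tn / (tn + fp)
        results.append({
            "precision": precision,
            "recall": recall,
            "specificity": specificity,
            "f1": 2 * precision * recall / (precision + recall),
            "balanced_accuracy": 0.5 * (recall + specificity),
        })
    return results


# Hypothetical two-class confusion matrix (rows: predicted, columns: actual).
D = [[8, 2],
     [1, 9]]
metrics = per_class_metrics(D)
```

For class 0 in this example, the precision is 8/10 and the recall is 8/9, so the F1 score equals 16/19.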

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) No. 2022R1A4A1030348 and No. 2021R1I1A3044289, Cooperative Research Program for Agriculture Science and Technology Development No. RS-2023-00222739 of RDA, and by the research grant of the Kongju National University in 2022.

Author contributions

C.-Y. L. contributed to the conception, analysis, and interpretation of data; S. K. contributed to the design of the work and creation of new software used in this work; S.-H. C. contributed to the acquisition of data and design of the work; Y.-J. P. contributed to the interpretation of data and drafted the work. C.-Y. L. and Y.-J. P. wrote the manuscript and all authors reviewed the manuscript.

Data availability

The preprocessed data, outputs, and scripts for the data analysis are available and maintained at the GitHub repository (https://github.com/infoLab204/tw_autoencoder/).

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-024-77080-8.

References

  • 1.Palo, H. K., Sahoo, S. & Subudhi, A. K. Dimensionality reduction techniques: Principles, benefits, and limitations. In Data Analytics in Bioinformatics: A Machine Learning Perspective. 77–107 (Wiley, 2021). 10.1002/9781119785620.ch4.
  • 2.Van Der Maaten, L., Postma, E. O. & van den Herik, H. J. Dimensionality reduction: A comparative review. J. Mach. Learn. Res.10, 66–71 (2009). [Google Scholar]
  • 3.Aziz, R., Verma, C. K. & Srivastava, N. Dimension reduction methods for microarray data: A review. AIMS Bioeng.4, 179–197. 10.3934/bioeng.2017.1.179 (2017). [Google Scholar]
  • 4.Manthena, V. et al. Evaluating dimensionality reduction for genomic prediction. Front. Genet.13, 958780. 10.3389/fgene.2022.958780 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Cunningham, J. P. & Ghahramani, Z. Linear dimensionality reduction: Survey, insights, and generalizations. J. Mach. Learn. Res.16, 2859–2900 (2015). [Google Scholar]
  • 6.Kruger, U., Zhang, J. & Xie, L. Developments and Applications of Nonlinear Principal Component Analysis-a Review (eds. Gorban, A.N., Kégl, B., Wunsch, D.C., & Zinovyev A.Y.) Principal Manifolds for Data Visualization and Dimension Reduction. Lecture Notes in Computational Science and Engineering. 58, 1–43 (2008). 10.1007/978-3-540-73750-6_1
  • 7.Abdi, H. & Williams, L. J. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat.2, 433–459. 10.1002/wics.101 (2010). [Google Scholar]
  • 8.Hyvärinen, A. & Oja, E. Independent component analysis: Algorithms and applications. Neural Netw.13, 411–430 (2000). [DOI] [PubMed] [Google Scholar]
  • 9.Tenenbaum, J. B., Silva, V. D. & Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science29, 2319–2323 (2000). [DOI] [PubMed] [Google Scholar]
  • 10.Kramer, M. A. Nonlinear principal component analysis using autoassociative neural networks. AIChE J.37, 233–243. 10.1002/aic.690370209 (1991). [Google Scholar]
  • 11.Kramer, M. A. Autoassociative neural networks. Comput. Chem. Eng.16, 313–328 (1992). [Google Scholar]
  • 12.Alsenan, S., Al-Turaiki, I. & Hafez, A. Autoencoder-based dimensionality reduction for QSAR modeling. In 3rd International Conference on Computer Applications & Information Security (ICCAIS) 1–4 (2020). 10.1109/ICCAIS48893.2020.9096747
  • 13.Fournier, Q. & Aloise, D. Empirical comparison between autoencoders and traditional dimensionality reduction methods. In 2019 IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering 211–214 (2019). 10.1109/AIKE.2019.00044
  • 14.Wang, Y., Yao, H. & Zhao, S. Auto-encoder based dimensionality reduction. Neurocomputing 184, 232–242. 10.1016/j.neucom.2015.08.104 (2016).
  • 15.Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y. & Manzagol, P. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408 (2010).
  • 16.Mahmud, M., Huang, J. & Fu, X. Variational autoencoder-based dimensionality reduction for high-dimensional small-sample data classification. Int. J. Comput. Intell. Appl. 19, 2050002. 10.1142/S1469026820500029 (2020).
  • 17.Roweis, S. & Saul, L. Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326. 10.1126/science.290.5500.2323 (2000).
  • 18.LeCun, Y., Cortes, C. & Burges, C. J. C. The MNIST database of handwritten digits. (accessed 10 Sep 2024); http://yann.lecun.com/exdb/mnist/
  • 19.Xiao, H., Rasul, K. & Vollgraf, R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747 (2017).
  • 20.Netzer, Y. et al. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 7 (2011).
  • 21.Krizhevsky, A., Nair, V. & Hinton, G. The CIFAR-10 dataset. (accessed 10 Sep 2024); https://www.cs.toronto.edu/~kriz/cifar.html
  • 22.Wolberg, W., Mangasarian, O., Street, N. & Street, W. Breast cancer Wisconsin (Diagnostic). UCI machine learning repository. (accessed 10 Sep 2024); 10.24432/C5DW2B
  • 23.Aeberhard, S. & Forina, M. Wine. UCI machine learning repository. (accessed 10 Sep 2024); 10.24432/C5PC7J
  • 24.Vincent, P. et al. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408 (2010).
  • 25.An, Z., Jiang, X. & Liu, J. Mode-decoupling auto-encoder for machinery fault diagnosis under unknown working conditions. IEEE Trans. Ind. Inf. 20, 4990–5003. 10.1109/TII.2023.3331129 (2024).
  • 26.Kasun, L., Yang, Y., Huang, G. & Zhang, Z. Dimension reduction with extreme learning machine. IEEE Trans. Image Process. 25, 3906–3918 (2016).
  • 27.Wan, L., Zeiler, M., Zhang, S., LeCun, Y. & Fergus, R. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning 28, 1058–1066 (2013).
  • 28.Makhzani, A. & Frey, B. k-Sparse Autoencoders. arXiv:1312.5663 (2013).
  • 29.Hyvärinen, A. & Oja, E. Independent component analysis: Algorithms and applications. Neural Netw. 13, 411–430 (2000).
  • 30.Hinton, G. E., Osindero, S. & Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Comput. 18, 1527–1554 (2006).
  • 31.Bengio, Y., Lamblin, P., Popovici, D. & Larochelle, H. Greedy layer-wise training of deep networks. In Proceedings of the 19th International Conference on Neural Information Processing Systems 19 (2006).
  • 32.Shapiro, L. & Stockman, G. Computer Vision. 83 (Prentice Hall, 2001). ISBN 978-0-13-030796-5.
  • 33.Abdulkadirov, R., Lyakhov, P. & Nagornov, N. Survey of optimization algorithms in modern neural networks. Mathematics 11, 2466. 10.3390/math11112466 (2023).
  • 34.Prechelt, L. Early stopping - but when? In Neural Networks: Tricks of the Trade (eds. Montavon, G., Orr, G. B. & Müller, K.-R.), Lecture Notes in Computer Science (Springer, 2012).
  • 35.Kuhn, M. A short introduction to the caret package. R Found Stat. Comput. 1, 1–10 (2015).
  • 36.Accessed 10 Sep 2024; https://cran.r-project.org/web/packages/STAT/index.html
  • 37.Accessed 10 Sep 2024; https://cran.r-project.org/web/packages/fastICA/index.html
  • 38.Meyer, D. & Wien, F. T. Support vector machines. R News 1, 23–26 (2001).
  • 39.Chang, C. & Lin, C. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 1–27 (2011).

Data Availability Statement

The preprocessed data, outputs, and scripts for the data analysis are available and maintained at the GitHub repository (https://github.com/infoLab204/tw_autoencoder/).


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group