Author manuscript; available in PMC: 2022 Jan 27.
Published in final edited form as: J Comput Graph Stat. 2021 Jan 27;30(3):519–529. doi: 10.1080/10618600.2020.1856119

Forward Stepwise Deep Autoencoder-based Monotone Nonlinear Dimensionality Reduction Methods

Youyi Fong 1, Jun Xu 2
PMCID: PMC8673912  NIHMSID: NIHMS1733511  PMID: 34924737

Abstract

Dimensionality reduction is an unsupervised learning task aimed at creating a low-dimensional summary and/or extracting the most salient features of a dataset. Principal components analysis (PCA) is a linear dimensionality reduction method in the sense that each principal component is a linear combination of the input variables. To allow features that are nonlinear functions of the input variables, many nonlinear dimensionality reduction methods have been proposed. In this paper we propose novel nonlinear dimensionality reduction methods based on bottleneck deep autoencoders (Kramer, 1991). Our contributions are two-fold: (1) We introduce a monotonicity constraint into bottleneck deep autoencoders for estimating a single nonlinear component and propose two methods for fitting the model. (2) We propose a new, forward stepwise (FS) deep learning architecture for estimating multiple nonlinear components. The former helps extract interpretable, monotone components when the assumption of monotonicity holds, and the latter helps evaluate reconstruction errors in the original data space for a range of components. We conduct numerical studies to compare different model fitting methods and use two real data examples from the studies of human immune responses to HIV to illustrate the proposed methods.

Keywords: machine learning, neural network

1. Introduction

Monotone functions have been used extensively in nonparametric regression models (Gijbels, 2005). Use of monotone functions in unsupervised learning, on the other hand, has been under-studied. This is unfortunate because while consistent model selection is achievable through cross validation in nonparametric regression (Silverman and Green, 1993), it is generally not possible in unsupervised learning tasks such as nonlinear dimensionality reduction (Duchamp et al., 1996). This makes restricting model space to monotone functions especially beneficial in unsupervised learning when the monotonicity assumption is reasonable.

Our interest in monotone nonlinear dimensionality reduction originated from the studies of human immune responses to vaccines or infectious agents. One of our motivating examples is shown in Figure D.2.1, which contains pairwise scatterplots between seven immune response biomarkers measured in Permar et al. (2015). These biomarkers measure different features of immune responses against HIV in HIV-infected pregnant women from a historical cohort (Rich et al., 2000). The markers fall into three groups. Markers in the first group – V3_BioV3B, V3_BioV3M and V3_gp70MNV3 – measure specific antibody binding to the V3 loop on the HIV envelope protein, whereas markers in the second group – CD4.JRFL, CD4.6240 and CD4.63521 – measure specific antibody binding to the CD4 binding site on the HIV envelope protein. Supplementary Figure A.1 shows the locations of the V3 loop and the CD4 binding site on the HIV envelope protein. Within each group, different biomarkers correspond to different HIV strains from which the particular V3 or CD4 binding site sequence originates. The last group, NAb_MN3, measures antibody-mediated neutralization of the HIV MN3 strain. The scatterplots clearly indicate that monotone and nonlinear relationships exist between these immune response biomarkers. The monotonicity may be expected because these biomarkers are measurement outcomes from biological assays probing human immune responses, while the nonlinearity may arise because the involved immunoassays have different ranges of linearity.

Many nonlinear dimensionality reduction (NLDR) methods have been proposed in the literature (see Van Der Maaten et al. (2009) and Sorzano et al. (2014) for two excellent reviews). To put the methods we will propose in perspective, we next discuss two broad categories of methods: (1) methods that work towards preserving proximity; and (2) methods that work towards minimizing a divergence metric between the observed data and a low dimensional representation of the data. Category 1 methods aim to find a faithful embedding on a low dimensional manifold, where ‘faithful’ means “ nearby points remain nearby and that distant points remain distant” (Weinberger and Saul, 2006). Many of the category 1 methods are convex techniques; one advantage of grouping convex techniques together, as Van Der Maaten et al. (2009) did, is that it highlights the connection, as pointed out by e.g. Ham et al. (2004) and others, between (1) kernel PCA (Schölkopf et al., 1997); (2) full spectral techniques such as isomap (Tenenbaum et al., 2000) and maximum variance unfolding (Weinberger and Saul, 2006); and (3) sparse/local spectral techniques (see Ting and Jordan (2018) for a differential operators-based theoretical framework) such as local linear embedding (LLE) (Roweis and Saul, 2000), Laplacian eigenmaps (Belkin and Niyogi, 2002), Hessian LLE (Donoho and Grimes, 2003), and local tangent space analysis (Zhang and Zha, 2004). Another big group of methods under category 1 are multidimensional scaling (MDS) methods, which can be further subdivided into classical MDS (Kruskal and Wish, 1978), metric MDS (e.g. Sammon, 1969), and nonmetric MDS. As both of these groups of methods aim to preserve proximity, it is not surprising that they have some overlap, e.g. kernel PCA using an isotropic kernel function can be viewed as a kind of metric MDS (Williams, 2002).

Not all nonlinear dimensionality reduction methods work towards preserving proximity. Category 2 methods aim to minimize a divergence metric between the observed data and a low dimensional representation of these data. Examples of category 2 methods include (1) principal curves, surfaces, and manifolds (e.g. Hastie and Stuetzle, 1989; Smola et al., 1999), which formalize the notion of a curve/surface/manifold through the middle of a dataset; (2) generative topological models (Bishop et al., 1998), a form of nonlinear latent variable model based on a constrained mixture of Gaussians; (3) dictionary-based methods such as nonnegative matrix factorization (NMF) (Lee and Seung, 1999), local NMF (Feng et al., 2002), and nonnegative sparse coding (Hoyer, 2004), which factorize a nonnegative data matrix into two nonnegative matrices; and (4) bottleneck autoencoders (Kramer, 1991; Scholz and Vigário, 2002; Hinton and Salakhutdinov, 2006; Plaut, 2018), another form of nonlinear latent variable model based on deep neural networks (see the next section for additional details). The distinction between these two broad categories of methods is not absolute, for example, Ting and Jordan (2018) developed a theory for manifold learners through a connection to nonparametric smoothing/autoencoders.

In the HIV-1 immune response biomarker application described earlier and in a second application that will be described in Section 4.3, the main goal is to extract features that underlie the measurements. In this paper we develop novel bottleneck autoencoder-based models with monotonicity constraints to help us extract more interpretable features. To promote feature discrimination, we develop a new architecture to extract multiple uncorrelated components. The features/components estimated in this way can be useful for downstream analyses such as data visualization, feature clustering, and supervised learning tasks. The proposed methods are implemented in the R package FSDAM, which is available from the Comprehensive R Archive Network.

The rest of the paper is organized as follows. In Section 2, we propose Deep Autoencoder-based Monotone (DAM) nonlinear dimensionality reduction methods for estimating the first nonlinear component. In Section 3 we propose a Forward Stepwise (FS) deep learning model for estimating subsequent nonlinear components. In Section 4 we conduct Monte Carlo studies to compare the performance of two proposed methods and related existing methods, and illustrate their application using the aforementioned immune response dataset from HIV-1 infected pregnant women and a second immune response dataset from the HVTN 505 immune correlates study (Neidich et al., 2019). We conclude the paper with a discussion in Section 5.

2. DAM: Deep bottleneck Autoencoder-based Monotone nonlinear dimensionality reduction

Autoencoders (Rumelhart et al., 1986; LeCun, 1987; Bourlard and Kamp, 1988) are a group of deep neural networks that are trained to recreate the input variables (instead of other outcomes of interest). Thus, the input and output layers of an autoencoder always have the same number of nodes as the number of input variables. For this reason, autoencoders are also called auto-associative neural networks. There are several types of autoencoders. The type that we are interested in is called bottleneck autoencoders or undercomplete autoencoders (Goodfellow et al., 2016, Chapter 14.1). The internal representation of these autoencoders, i.e. the layers between the input and the output layers, has fewer dimensions than the input representation. It is interesting to compare autoencoders with another type of neural network-based dimensionality reduction method used in natural language processing. The state-of-the-art word2vec program (Mikolov et al., 2013) for word embedding is similar to an autoencoder, but instead of training through reconstruction, it trains words against surrounding words in the input text. In this aspect, word2vec is reminiscent of the Category 1 methods that preserve proximity.

A simple linear bottleneck autoencoder is shown in Figure A.2 in the Supplementary Materials. There is only one internal layer – the code layer (Hinton and Salakhutdinov, 2006). The code layer contains latent variables or intrinsic components, also known as codes. In this particular model there is only one variable in the code layer. There is no nonlinearity in this model: the latent variable is a linear combination of the input variables plus an intercept term, and each output layer variable is a linear function of the latent variable plus an intercept. The model is trained to minimize the differences between the input variables, which contain the observed data, and the output variables, which depend on the model parameters. It can be shown that the latent variable estimated from this model corresponds to the first principal component of the observed data. More generally, autoencoders with one fully connected code layer of more than one latent variable project the input data onto the low dimensional principal subspaces, although we cannot sort the latent variables into ordered principal components as in PCA (Baldi and Hornik, 1989; Plaut, 2018).

To perform nonlinear dimensionality reduction, we can build upon the model in Figure A.2 by adding layers with nonlinear activation. Kramer (1991) proposed a model as shown in Figure 1(a). It has two additional hidden (i.e. between the input and output) layers: mapping and demapping. Both of these layers are equipped with nonlinear activation, which simply means that each variable in these layers is modeled as a nonlinear transformation of a linear combination of the variables from the previous layer.
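To make the architecture concrete, below is a minimal PyTorch sketch of a Kramer-style bottleneck autoencoder with one code node and tanh mapping/demapping layers. It illustrates the description above and is not the authors' FSDAM implementation; the layer width k is an arbitrary choice here.

```python
# A minimal sketch (not the FSDAM package) of the bottleneck autoencoder in
# Figure 1(a): input -> mapping (tanh) -> one code node -> demapping (tanh) -> output.
import torch
import torch.nn as nn

class BottleneckAutoencoder(nn.Module):
    def __init__(self, p, k=100):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(p, k), nn.Tanh(),   # mapping layer with nonlinear activation
            nn.Linear(k, 1),              # code layer: a single latent variable
        )
        self.decoder = nn.Sequential(
            nn.Linear(1, k), nn.Tanh(),   # demapping layer with nonlinear activation
            nn.Linear(k, p),              # output layer reconstructs the p inputs
        )

    def forward(self, y):
        x = self.encoder(y)               # latent code x_i
        return self.decoder(x), x         # reconstruction g(x_i) and the code
```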

Fig. 1.

Fig. 1

Network architectures. The input and output layer nodes are colored black, and the hidden layer nodes are colored gray. Nodes with nonlinear activation are marked by σ. In DAM, monotonicity constraints can be imposed in one of two ways; see details in the text. In FS-DAM, the node in the code layer that does not have connections to the mapping layer represents the first latent variable estimated from the previous stage.

Although it has only one latent variable in the code layer, the model of Figure 1(a) has the universal fitting property (Cybenko, 1989), implying that given enough variables in the mapping and demapping layers, it is possible to achieve perfect reconstruction of the input at the output layer in theory. This is overfitting at its extreme. Kramer (1991) proposed to control model complexity by selecting k, the number of variables in the mapping and demapping layers, through one of three methods: (a) looking for an elbow in the cross-validated reconstruction error as a function of k, (b) comparing information theoretic criteria such as Akaike’s Information Criterion, and (c) using an ad hoc inequality relating the number of parameters of the whole model and n × p, where n is the number of observations and p is the number of input variables. All three methods require human input because, as mentioned before, in unsupervised learning models such as autoencoders, the usual bias-variance trade-off in supervised learning that underpins cross validation-based consistent model selection procedures does not apply (Duchamp et al., 1996).

This motivates us to find additional information to help regularize the model. One such source of information comes in the form of a monotonicity assumption. To describe what we mean by monotonicity in the model of Figure 1(a), we need some notation. Let i ∈ {1, ⋯, n} index the samples, and let j ∈ {1, ⋯, p} index the variables in the input and output layers. For the ith observation, denote by $y_{i,j} \in \mathbb{R}$ the jth input variable, by $x_i \in \mathbb{R}$ the code layer representation, and by $z_i \in \mathbb{R}^k$ the demapping layer representation. $z_i$ and $x_i$ have the following relationship: $z_i = \sigma(\alpha + \beta x_i)$, where σ is a nonlinear activation function, $\alpha, \beta \in \mathbb{R}^k$, and the operation in $\alpha + \beta x_i$ is taken to be elementwise.

Three of the most popular types of activation functions are the logistic sigmoid activation function, the hyperbolic tangent activation function (tanh), and the rectified linear unit (ReLU). The ReLU function, which outputs the input directly if it is positive and 0 otherwise, is the default recommendation in modern neural networks (Goodfellow et al., 2016, Chapter 6). The logistic sigmoid function, $f(x) = 1/(1 + e^{-x})$, and the tanh function, $\tanh(x) = (1 - e^{-2x})/(1 + e^{-2x})$, are closely related because $\tanh(x) = 2f(2x) - 1$. The logistic sigmoid function was used in Kramer (1991), but the tanh function is easier to optimize and is recommended over the logistic sigmoid function when a sigmoidal activation function must be used. Based on experiments with both the ReLU and tanh activation functions (results not shown), we find that the tanh activation function performs better than the ReLU activation function in the context of our proposed models. Thus, the tanh function will be used for σ.

Let $g_j(x_i) = \gamma_j + \delta_j^T z_i$ denote the jth variable in the output layer, where $\gamma_j \in \mathbb{R}$ and $\delta_j \in \mathbb{R}^k$. The problem we have at hand can be stated as

$$\min_{\alpha,\,\beta,\,\{\gamma_j,\delta_j\}_{j=1}^{p}} \; n^{-1} p^{-1} \sum_{j=1}^{p} \sum_{i=1}^{n} \big(y_{i,j} - g_j(x_i)\big)^2$$

subject to $\{g_j\}_{j=1}^{p}$ being monotone functions of $x_i$.

To keep the problem statement concise, we focus on the decoder part of the autoencoder in the above statement, with the understanding that the encoder part enters the problem through xi. We propose two approaches to solve this problem: a constrained approach and a penalization approach.

In the penalization approach, we add a penalty term for negative gradients to the least squares loss term and seek to minimize

$$n^{-1} p^{-1} \sum_{j=1}^{p} \sum_{i=1}^{n} \big(y_{i,j} - g_j(x_i)\big)^2 \;+\; \tau_1 N^{-1} p^{-1} \sum_{j=1}^{p} \sum_{t=1}^{N} \left| \frac{dg_j}{dx}\Big|_{x_t^*} \right|_{-}, \quad (1)$$

where $|\cdot|_{-}$ denotes a function that returns the absolute value if its argument is negative and 0 otherwise, and $\tau_1$ is a positive tuning parameter. Details on the gradient calculation can be found in Supplementary Materials Section B. Importantly, the penalty term in (1) is not evaluated at $\{x_i\}_{i=1}^n$, the set of latent variable values at which the reconstruction error term in (1) is evaluated, but at $\{x_t^*\}_{t=1}^N$, which we define as a set of evenly spaced values spanning the range of $\{x_i\}_{i=1}^n$. The choice of N here is not important, and we set it to 2n by default.

If both the reconstruction error term and the penalty term in (1) are evaluated at $\{x_i\}_{i=1}^n$, the trained model tends to create a ‘vacuum’ in the range of $\{x_i\}_{i=1}^n$ such that $g_j(x)$ has negative slopes in that region, but none of the $\{x_i\}_{i=1}^n$ actually resides there (e.g. Supplementary Figure A.3). In doing so, the vacuum creates a partition of the dataset such that monotonicity only needs to be observed within each sub-dataset rather than across the whole dataset. In a way, the vacuum helps the trained model escape the negative gradient penalty while allowing it to explore a larger model space to achieve smaller reconstruction error.

To aid the choice of $\tau_1$ in the penalization approach, we mean-center the input data and scale it to have standard deviation 1. After scaling, the mean squared reconstruction error falls between 0 and 1. The penalty term in (1), sans the tuning parameter, is the mean of the gradient violators. Alternatively, we may change the penalty term to the sum of the gradient violators by removing $N^{-1}p^{-1}$. We compare these two forms through numerical studies and illustrate tuning parameter selection in Section C of the Supplementary Materials.
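As an illustration, the sketch below computes the penalty term of (1) with automatic differentiation, using the mean-of-violators form and a grid of N = 2n evenly spaced code values. Here `decoder` stands for the demapping-plus-output part of the network (e.g. the decoder in the earlier sketch) and is an assumed interface, not the FSDAM API; the default τ1 = 10 follows the value used later in the paper.

```python
# A sketch, under the notation of (1), of the negative-gradient penalty
# evaluated on an evenly spaced grid x*_1, ..., x*_N covering the observed codes.
import torch

def negative_gradient_penalty(decoder, x_codes, tau1=10.0):
    """Mean of |negative part| of dg_j/dx on a grid of N = 2n code values."""
    n = x_codes.shape[0]
    N = 2 * n                                                    # default N = 2n
    grid = torch.linspace(float(x_codes.min()), float(x_codes.max()), N).reshape(-1, 1)
    grid.requires_grad_(True)
    out = decoder(grid)                                          # N x p matrix of g_j(x*_t)
    p = out.shape[1]
    penalty = 0.0
    for j in range(p):
        # gradient of g_j with respect to the code, at every grid point
        grad_j = torch.autograd.grad(out[:, j].sum(), grid, create_graph=True)[0]
        penalty = penalty + torch.clamp(grad_j, max=0.0).abs().sum()
    return tau1 * penalty / (N * p)                              # mean of gradient violators
```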

In the constrained approach, we restrict β and δj to be non-negative component-wise during the optimization process. This can be implemented by setting negative β and δj to 0 at the end of each training epoch.
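A minimal sketch of the constrained approach follows, assuming the demapping and output layers are ordinary `nn.Linear` modules; the projection would be applied at the end of each training epoch as described above.

```python
# A sketch of the constrained approach: project beta and delta_j onto the
# nonnegative orthant after each epoch. Layer names are illustrative.
import torch

@torch.no_grad()
def enforce_nonnegative_weights(demapping_linear, output_linear):
    demapping_linear.weight.clamp_(min=0.0)   # beta in z_i = tanh(alpha + beta * x_i)
    output_linear.weight.clamp_(min=0.0)      # delta_j in g_j(x_i) = gamma_j + delta_j' z_i
```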

It is interesting to compare these two approaches for implementing monotonicity in an autoencoder with methods for implementing monotone regression. Gijbels (2005) grouped the latter into three categories: the projection framework for constrained smoothing (Ramsay et al., 1988; Mammen et al., 2001), isotonic regression (Brunk et al., 1958; Barlow et al., 1972), and the tilting method (Hall et al., 2001). The proposed penalization approach for autoencoders does not group with any of these methods, but the proposed constrained approach for autoencoders can be described as “smooth, then monotonize” (Mammen et al., 2001), thus resembling the projection framework for monotone regression.

3. Forward stepwise DAM

There are two distinct but related model selection problems when we consider more than one nonlinear component: 1) selecting the number of nonlinear components, and 2) estimating multiple nonlinear components. Selecting the number of nonlinear components in nonlinear dimensionality reduction is a complicated problem, perhaps more so than in linear dimensionality reduction, because it is intertwined with selecting the complexity of each nonlinear component. Lee and Verleysen (2007) reviewed the methods for estimating the number of nonlinear components in the nonlinear dimensionality reduction problem and divided them into three broad groups: fractal dimensions-based methods, e.g. correlation dimension (Grassberger and Procaccia, 1983); local methods, e.g. local PCA (Fukunaga and Olsen, 1971; Kambhatla and Leen, 1997); and “trial and error” methods, e.g. plotting Sammon’s stress (Sammon, 1969) against the number of nonlinear components and finding the number needed to bring the stress close to 0.

The first two groups of methods sidestep the problem of having to deal with the model complexity of each nonlinear component mentioned above. The fractal dimensions-based methods are beautiful in theory but have had limited success in practice. For example, to estimate the correlation dimension of the MTCT7 dataset, we make log-log plots in Supplementary Figure A.4 of the estimated correlation sum, namely the number of neighboring points lying closer than a certain threshold ϵ, together with the correlation dimensions estimated under different threshold choices. As the estimated correlation dimension changes with the threshold choice, we do not have a stable estimate of the correlation dimension. The second group of methods, the local methods, suffers from the curse of dimensionality: as the dimensionality of the input data increases, it becomes harder to divide the whole space into small local windows and obtain stable estimates from each window. The third group of methods may also avoid the problem of having to define the complexity of each nonlinear component if the chosen criterion function does not depend on it. For example, Sammon’s stress function is based on embedding the input data in a low-dimensional representation while preserving pairwise distances between the observations, which sidesteps the issue of nonlinear component complexity. However, there is a significant downside to this approach: the low-dimensional representation learned from such methods may bear no resemblance to the true latent structure.

In this section we study methods for estimating m nonlinear components using autoencoders. Kramer (1991) proposed two approaches for doing this: sequential and simultaneous. In the sequential approach, estimation is broken into m stages with one component estimated in each stage. Each stage uses the same model (Figure 1(a)), but the input changes from stage to stage because it is taken to be the residuals from the previous stage (Figure A.5b). In this approach the ith observation $y_i \in \mathbb{R}^p$ is approximated with the sum of the outputs from all stages: $g_1(f_1(y_i)) + \cdots + g_m(f_m(y_i))$, where $g_1, \ldots, g_m$ denote the vector-valued decoding functions estimated in stages 1, …, m, respectively, and $f_1, \ldots, f_m$ denote the corresponding coding functions. Because of this model form, and because each pair of $f_t$ and $g_t$ is estimated in a different stage, when we combine this model architecture with the monotonicity constraints, the resulting sequential DAM model has difficulty reconstructing the input even when m = p, as Figures 5(a) and 6(a) show.

Fig. 5.

Fig. 5

MTCT7 dataset. (a) MSE plots. Two runs of FS-DAM results are shown. (b) Relationship between the first estimated components from PCA, FS-DAM (first run), and kernel PCA. INT: rank-based inverse normal transformation.

Fig. 6.

Fig. 6

HVTN 505 dataset. (a) MSE plots. Two runs of FS-DAM results are shown. (b) Relationship between the first estimated components from PCA, FS-DAM (first run), and kernel PCA. INT: rank-based inverse normal transformation.

In the simultaneous approach, all components are estimated in one stage using a model whose code layer contains m nodes (Figure A.5a). The ith observation $y_i \in \mathbb{R}^p$ is approximated by $g(f_1(y_i), \ldots, f_m(y_i))$, where $f_1, \ldots, f_m$ denote the m coding functions and g denotes the vector-valued decoding function, all of which are optimized together in a single stage. The simultaneous approach is much more expressive than the sequential approach. The problem with the simultaneous approach, and the reason that Kramer (1991) proposed the sequential approach, is that the components estimated by the simultaneous approach have no ordering and are not unique.

Scholz and Vigário (2002) proposed an alternative simultaneous estimation method that promotes feature discrimination. For example, if we are to estimate two nonlinear components, the method fits a model such as the one shown in Supplementary Figure A.5(a) with a modified criterion function. This function is a linear combination of the usual reconstruction error and a partial reconstruction error measuring the distance between the input and the output when only one of the two nonlinear components is used. Using such a combination creates an asymmetry that can lead to an ordering of the two nonlinear components. However, how to choose the combination coefficient in the modified criterion function is an important issue to be addressed, and it becomes more serious as more nonlinear components are added.

We propose a new sequential method for estimating multiple nonlinear components, which we call the Forward Stepwise (FS) method. Like the sequential method in Kramer (1991), the FS method estimates one nonlinear component at a time. However, the FS method differs from Kramer’s sequential method in two important ways: 1) the model in stage m has m variables in the code layer; and 2) the input of each stage is always the original data. For example, Figure 1(b) shows a model used in the second stage. There are two nodes in the code layer: one node, representing the nonlinear component estimated in stage 1, is connected with the output but not the input; the other node is connected with both the input and the output and represents the nonlinear component to be estimated in stage 2. If a third nonlinear component is to be estimated, the model in Figure 1(b) can be extended so that two nodes in the code layer represent the nonlinear components already estimated. The process can be continued until there are as many components as input variables.
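The following is a sketch of the stage-m FS-DAM architecture of Figure 1(b), under the assumption that the previously estimated components are supplied as a fixed matrix; names such as `FSStage` are illustrative and are not taken from the FSDAM package.

```python
# A sketch of the stage-m FS architecture: the encoder maps the original data
# to one new code node, which is concatenated with the m-1 components already
# estimated (held fixed, not connected to the input) before decoding.
import torch
import torch.nn as nn

class FSStage(nn.Module):
    def __init__(self, p, m, k=100):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(p, k), nn.Tanh(), nn.Linear(k, 1))
        self.decoder = nn.Sequential(nn.Linear(m, k), nn.Tanh(), nn.Linear(k, p))

    def forward(self, y, previous_codes):
        # previous_codes: n x (m-1) matrix of components from earlier stages
        new_code = self.encoder(y)                                     # n x 1
        codes = torch.cat([previous_codes.detach(), new_code], dim=1)  # n x m
        return self.decoder(codes), codes                              # reconstruction, codes
```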

One potential issue with the FS method is that the nonlinear components estimated at different stages may be correlated with each other. To promote feature discrimination, we add a penalty term to the criterion function that is proportional to the L1 norm of the covariance between the current stage nonlinear component and the nonlinear components estimated in the previous stages. Thus the criterion function to be minimized at every stage differs. At the mth stage for m > 1, the criterion function becomes:

$$n^{-1} p^{-1} \sum_{j=1}^{p} \sum_{i=1}^{n} \big(y_{i,j} - g_j(x_i)\big)^2 \;+\; \tau_1\, n^{-1} p^{-1} \sum_{j=1}^{p} \sum_{i=1}^{n} \left| \frac{dg_j}{du}\Big|_{x_i} \right|_{-} \;+\; \tau_2 \sum_{j=1}^{m-1} \big| \mathrm{Cov}(x_{\cdot,j},\, x_{\cdot,m}) \big|, \quad (2)$$

where $x_i \in \mathbb{R}^m$ denotes the code vector for the ith observation, u denotes the mth element of $x_i$, $x_{\cdot,j}$ denotes the vector of jth code values across the n observations, and the population covariance formula is used in the covariance function. Note that in contrast to (1), the negative gradient penalty in (2) is evaluated at the same set of latent variable values as the reconstruction error term. This is because the curse of dimensionality makes it impractical to evaluate the derivatives on an m-dimensional grid of evenly spaced points. Empirically, we have not observed any adverse, vacuum-forming effects after the first stage.
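A sketch of the decorrelation penalty in (2) is given below, assuming the code matrix has the current-stage component in its last column; the population covariance formula (dividing by n) is used, as in the text, and τ2 = 1 follows the value used later in the paper.

```python
# A sketch of the L1 covariance penalty between the stage-m code and the
# previously estimated components.
import torch

def covariance_penalty(codes, tau2=1.0):
    # codes: n x m matrix whose last column is the current-stage component
    centered = codes - codes.mean(dim=0, keepdim=True)
    cov = (centered[:, :-1] * centered[:, -1:]).mean(dim=0)   # Cov(x_{.,j}, x_{.,m}), j < m
    return tau2 * cov.abs().sum()
```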

From an FS-DAM fit we can make an MSE plot, i.e. a line plot of the mean squared reconstruction errors (MSE) for m = 0 to p. MSE equals 1 at m = 0 and decreases as m increases. We often start the MSE plot at m = 1 to allow more details to be shown when the MSE approaches 0. The MSE plot is an important output of the FS-DAM model fit because 1-MSE is comparable to the “proportion of variance/total variability explained” in PCA (Meredith and Millsap, 1985; Westfall et al., 2017). In applied papers, we often see statements like “the first three components explained 63% of total variability” (Jolliffe and Cadima, 2016) based on PCA analyses. The MSE plot provides a counterpart from nonlinear dimensionality reduction. It is worth noting that for some nonlinear dimensionality reduction methods, it is a challenge just to measure MSE. For example, since kernel PCA mainly operates in the dual space, measuring MSE requires solving a pre-image problem (Mika et al., 1999).
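For completeness, here is a small sketch of how an MSE plot could be produced from a set of reconstructions; `reconstructions` is an assumed list whose mth entry is the n × p reconstruction based on the first m components of a fitted model, not an object provided by the FSDAM package.

```python
# A sketch of an MSE plot: mean squared reconstruction error of the
# standardized data for m = 1, ..., p components, so 1 - MSE plays the role of
# "proportion of total variability explained".
import numpy as np
import matplotlib.pyplot as plt

def mse_plot(y_scaled, reconstructions):
    mses = [np.mean((y_scaled - yhat) ** 2) for yhat in reconstructions]
    plt.plot(range(1, len(mses) + 1), mses, marker="o")
    plt.xlabel("number of components m")
    plt.ylabel("mean squared reconstruction error")
    plt.show()
```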

4. Numerical studies

In this section we study the performance of the proposed FS-DAM method. We implement the proposed deep learning models in Python using the PyTorch machine learning library (Paszke et al., 2017). For optimization, we use the Adam algorithm (Kingma and Ba, 2014). For parameter initialization, we use the Kaiming uniform scheme (He et al., 2015), which samples from $U(-\sqrt{q}, \sqrt{q})$, where q is the inverse of the number of input features of a node. The processes for tuning the model parameters are described in Section C of the Supplementary Materials; we use the parameter values τ1 = 10, τ2 = 1, and k = 100 in this section. Model training times are listed in Table D.1.9 of the Supplementary Materials. At the moderate dimensions of our motivating datasets, FS-DAM is faster to train on a CPU than on a GPU, but the program scales better on a GPU than on a CPU as n and k increase.
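A self-contained sketch of this training setup follows: Adam, PyTorch's default Kaiming-uniform-style initialization for `nn.Linear`, and a plain reconstruction loss to which the penalties of (1) and (2) would be added; the learning rate and epoch count are assumptions, not values from the paper.

```python
# A sketch of the training configuration described in the text.
import torch
import torch.nn as nn

p, k, n = 7, 100, 250
encoder = nn.Sequential(nn.Linear(p, k), nn.Tanh(), nn.Linear(k, 1))
decoder = nn.Sequential(nn.Linear(1, k), nn.Tanh(), nn.Linear(k, p))
# nn.Linear's default initialization is Kaiming-uniform-based, effectively
# U(-sqrt(q), sqrt(q)) with q the inverse of the layer's number of input features.
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

y = torch.randn(n, p)                           # placeholder for standardized input data
for epoch in range(1000):                       # epoch count is an assumption
    optimizer.zero_grad()
    codes = encoder(y)
    loss = ((y - decoder(codes)) ** 2).mean()   # add the penalty terms of (1)/(2) here
    loss.backward()
    optimizer.step()
```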

4.1. Monte Carlo studies

In this subsection we carry out simulation studies to compare the two implementations of monotonicity: constrained and penalization. We simulate nine variables in two phases. First, we simulate three latent variables independently from the uniform(0, 1) distribution: χ1, χ2 and χ3; second, we simulate nine observed variables as follows:

$$\begin{aligned}
Y_1 &= \chi_1 + \epsilon_1 & Y_6 &= \chi_2 - 0.9(\chi_2 - 0.5)_+ + \epsilon_6 \\
Y_2 &= \chi_1 - 0.9(\chi_1 - 0.5)_+ + \epsilon_2 & Y_7 &= 0.1\chi_2 - 0.9(\chi_2 - 0.5)_+ + \epsilon_7 \\
Y_3 &= \chi_1 + \epsilon_3 & Y_8 &= \chi_2^2 + \epsilon_8 \\
Y_4 &= 0.1\chi_1 + 0.9(\chi_1 - 0.5)_+ + \epsilon_4 & Y_9 &= \chi_3 + \epsilon_9, \\
Y_5 &= \chi_1^2 + \epsilon_5 &&
\end{aligned}$$

where $(x)_+ = x$ if x > 0 and 0 if x ≤ 0, and the ϵ’s are i.i.d. N(0, σ = 0.06). Pairwise scatterplots of a simulated dataset are shown in the Supplementary Materials Figure D.1.1.
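A sketch of this simulation model in numpy, assuming the formulas displayed above with σ = 0.06 and n = 250; the function name and seed are illustrative.

```python
# A sketch of the nine-variable simulation model with three latent components.
import numpy as np

def simulate(n=250, sigma=0.06, seed=1):
    rng = np.random.default_rng(seed)
    pos = lambda x: np.maximum(x, 0.0)                 # (x)_+
    chi1, chi2, chi3 = rng.uniform(size=(3, n))        # latent components ~ uniform(0, 1)
    eps = rng.normal(scale=sigma, size=(9, n))         # i.i.d. N(0, sigma) errors
    Y = np.column_stack([
        chi1 + eps[0],
        chi1 - 0.9 * pos(chi1 - 0.5) + eps[1],
        chi1 + eps[2],
        0.1 * chi1 + 0.9 * pos(chi1 - 0.5) + eps[3],
        chi1 ** 2 + eps[4],
        chi2 - 0.9 * pos(chi2 - 0.5) + eps[5],
        0.1 * chi2 - 0.9 * pos(chi2 - 0.5) + eps[6],
        chi2 ** 2 + eps[7],
        chi3 + eps[8],
    ])
    return Y, (chi1, chi2, chi3)
```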

We simulate 100 datasets with sample size 250 from the above model and fit FS-DAM to each dataset. The resulting MSE plots are shown in the left and middle panels of Figure 2. When m = 1, the MSE appears similar between the two approaches, but as m increases, the MSE from the penalization approach dips below that from the constrained approach. The rightmost panel of Figure 2 contrasts the MSE at m = 3 from the two approaches. The median MSE from the constrained approach is 0.13, while the median MSE from the penalization approach is 0.074; the two-sided Wilcoxon signed rank test p value for the null hypothesis that the two approaches perform the same is < 2.2 × 10−16. This difference in performance suggests that hard-thresholding model parameters at the end of each training epoch may make convergence more difficult for gradient-based optimization. Based on these results, we henceforth adopt the penalization approach to achieving monotonicity.

Fig. 2.

Fig. 2

Comparison of monotonicity approaches. Left and middle: MSE plots from FS-DAM fits of 100 Monte Carlo datasets using either the constrained approach or the penalization approach. MSE: mean squared reconstruction error. The dashed horizontal lines are drawn at height 0.05. Right: comparison of MSE between the two approaches when there are three components.

We now examine two Monte Carlo datasets in more detail. The first dataset is generated from the model above, and the second dataset is generated from a variant of the model in which Y3 and Y8 are changed to linear combinations of the two latent components χ1 and χ2: Y3 = 0.3χ1 + 0.7χ2 + ϵ3, Y8 = 0.7χ1 + 0.3χ2 + ϵ8. Pairwise scatterplots of the two datasets are shown in Figures D.1.1 and D.1.2, respectively. For each dataset, we compare MSE plots for three methods: PCA, sequential DAM, and FS-DAM. Figure 3 shows the MSE plots for these two datasets. The results are similar for both datasets. To explain at least 90% of the total variance, each dataset needs 5 PCA components, 4 sequential DAM components, and 3 FS-DAM components.

Fig. 3.

Fig. 3

MSE plots.

Because our main goal in developing FS-DAM is feature extraction, we also examine the estimated components in detail by looking at their correlation with the true underlying components. Figure 4 shows the scatterplots between the true latent components χ1–χ3 and the first three estimated components C1–C3. We apply a rank-based inverse normal transformation $\Phi^{-1}(r(x)/(n+1))$ (Van der Waerden, 1952) to the estimated components, where r(x) is the sample rank function, because the untransformed C1 appears heavy-tailed (Figure D.1.3). These results show that in both datasets we are able to extract features that correspond to the underlying latent components, but the performance is not as good when some of the input variables are associated with more than one latent component.
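A small sketch of this rank-based inverse normal transformation using scipy is shown below; handling ties by `rankdata`'s default average ranks is an assumption.

```python
# A sketch of the rank-based inverse normal transformation Phi^{-1}(r(x)/(n+1)).
import numpy as np
from scipy.stats import norm, rankdata

def inverse_normal_transform(x):
    x = np.asarray(x)
    n = len(x)
    return norm.ppf(rankdata(x) / (n + 1))   # r(x) is the sample rank function
```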

Fig. 4.

Fig. 4

Scatterplots between the true underlying components χ1χ3 and the first three estimated components C1-C3. Spearman correlation coefficients are shown in each plot. INT: rank-based inverse normal transformation.

In Figure D.1.7 of the Supplementary Materials, we compare the first components estimated by PCA, kernel PCA, and FS-DAM. The results show that the relationship between the first components estimated by PCA and FS-DAM is largely monotone, which is consistent with the fact that they both pick up χ1 (Figure D.1.8). In contrast, there is a cyclical relationship between those estimated by PCA and kernel PCA, which is harder to interpret.

To further investigate the effect of sample size on the ability to estimate the latent components, we simulate two datasets of sample sizes 50 and 100 from the model described at the beginning of this subsection and fit FS-DAM. Table D.1.4 compares the Spearman correlation coefficients between the true latent components and the estimated components across different sample sizes. As we might expect, the correspondence between χ1–χ3 and C1–C3 gradually weakens as the sample size decreases.

4.2. MTCT7

We now return to the MTCT dataset mentioned in the introduction. Figure 5(a) shows the MSE plots for PCA, sequential DAM, and two runs of FS-DAM from two random starting values; the raw numbers are shown in Table D.2.2. As the number of components increases from 1 to 7, the PCA MSE drops from 0.251 to 0.000, the FS-DAM MSE drops from 0.150/0.149 to 0.000, and the sequential DAM MSE drops from 0.150 to 0.023, the latter number lying between the PCA MSE at 4 and at 5 components.

Table 1 shows the correlations between the input variables and the estimated components from one run of FS-DAM. All of the input variables are highly correlated with the first component. Results for the second run show a similar pattern and are shown in Table D.2.3.

Table 1.

MTCT7 dataset. Spearman correlation coefficients between the estimated components and the input variables from the first FS-DAM run.

C1 C2 C3 C4 C5 C6 C7
V3_BioV3M .9 .2 −.1 .2 −.1 .1 .0
V3_BioV3B .9 .2 −.1 .1 .1 .1 .1
V3_gp70MNV3 .8 .3 −.1 .0 .2 .3 .0
CD4.JRFL .9 .0 .5 −.1 .0 .1 −.1
CD4.63521 .9 −.1 .3 .0 −.2 .3 .1
CD4.6240 .8 −.2 .2 −.2 −.2 .1 .1
NAb_MN3 .8 .5 .1 −.1 −.1 .1 .0

In Figure 5(b), we compare the first components estimated by PCA, FS-DAM (first run), and kernel PCA. Here we see the same pattern as for the Monte Carlo datasets: the relationship between the first components estimated by PCA and FS-DAM is largely monotone, whereas there is a cyclical relationship between those estimated by PCA and kernel PCA. This suggests that the first components from PCA and FS-DAM are picking up the same underlying signal.

4.3. HVTN 505 primary immune responses

In this subsection we illustrate the application of FS-DAM with another immune response biomarker dataset. HVTN 505 was a phase IIb trial of a DNA/rAd5 HIV-1 vaccine (Hammer et al., 2013) conducted in the United States from 2009 to 2013; the trial showed no overall efficacy. Janes et al. (2017), Fong et al. (2018) and Neidich et al. (2019) studied the immune responses elicited by the DNA/rAd5 vaccine. In particular, Neidich et al. (2019) examined a select panel of 8 immune response biomarkers that were of primary scientific interest. Figure D.3.1 in the Supplementary Materials shows the pairwise scatterplots between these biomarkers.

Figure 6(a) shows the MSE plots for PCA, sequential DAM, and two runs of FS-DAM from two random starting values, and the raw numbers are shown in Table D.3.2. The PCA MSE drops gradually and does not approach 0 until there are 7 or 8 components; the sequential DAM MSE also drops gradually and remains somewhat high even when there are 8 components; the FS-DAM MSE drops more quickly and is around 1% at m = 5.

Table 2 shows the correlations between the input variables and the first four estimated components from one run of FS-DAM. The results suggest that the eight biomarkers can be clustered into three groups based on their correlations with the estimated components. Five of the eight immune response biomarkers, IgGw28_env_mdw, R2aConSgp140CFI, ADCP1, IgG3w28_env_mdw, and IgGw28_gp41_mdw, are most highly correlated with the first component; CD8_env_pfs is most highly correlated with the third component; and IgAw28_env is most highly correlated with the second component. This clustering makes biological sense since the five biomarkers in the first group are all related to binding of IgG isotype antibodies to the HIV envelope proteins, CD8_env_pfs measures cell-based (instead of antibody-based) immune responses, and IgAw28_env measures binding of IgA isotype antibodies to the HIV envelope proteins. IgGw28_V1V2 does not fall neatly into any of the three groups, as it is highly correlated with both the first and second components; interestingly, it measures binding of IgG isotype antibodies to a specific region of an HIV envelope protein. Results for the second run show a similar pattern and are shown in Table D.3.3.

Table 2.

HVTN 505 dataset. Spearman correlation coefficients between the estimated components and the input variables from the first FS-DAM run.

C1 C2 C3 C4
CD8_env_pfs .4 −.1 .9 −.3
IgGw28_env_mdw .8 .1 .0 −.3
R2aConSgp140CFI .9 .0 .0 −.1
ADCP1 .8 .0 .1 .2
IgG3w28_env_mdw .7 −.2 .2 .6
IgGw28_gp41_mdw .7 −.1 −.2 −.3
IgGw28_V1V2_mdw .6 .4 −.1 −.1
IgAw28_env_mdw .4 .8 .0 .1

In Figure 6(b), we compare the first components estimated by PCA, kernel PCA, and FS-DAM (first run). As in the MTCT7 example and the Monte Carlo datasets, there is a monotone relationship between the first components estimated by PCA and FS-DAM (first run), suggesting that they are estimating the same underlying signal, whereas no such relationship exists between the first components estimated by PCA and kernel PCA.

5. Discussion

While linear dimensionality reduction approaches (e.g. PCA) are in almost every data analyst’s toolbox, nonlinear dimensionality reduction approaches are used less often because it is difficult to choose the appropriate model complexity for such approaches. Moreover, in contrast with supervised learning, cross validation cannot be used to determine model complexity in unsupervised learning. To alleviate these problems, we proposed Deep Autoencoder-based Monotone (DAM) nonlinear dimensionality reduction methods, which rely on imposing monotonicity constraints to regularize the model and extract interpretable, monotone components.

The second issue we tackled is estimation of multiple components. A natural way to achieve feature discrimination is to take a sequential approach to estimating multiple nonlinear components. However, when we impose monotonicity constraints at each stage, the reconstruction error does not approach 0 as the number of components reaches the input dimension. We solved this problem by including the nonlinear components estimated in the previous stages when estimating a new component in a new network architecture called Forward Stepwise (FS). As FS-DAM is not a convex technique and the solutions depend on the starting values, it is important to compare results from different random starting values.

From an FS-DAM fit we can make an MSE plot, a line plot in which MSE equals 1 at m = 0 and decreases as m increases. Since 1 − MSE has the same interpretation as the percent of variance/total variability explained in PCA, the MSE plot provides practitioners with a useful tool for interpreting the components. The components estimated from FS-DAM can also be useful in downstream analyses. For example, they can be used for feature clustering based on the correlations between the estimated components and the input variables. In Section 4.3 we saw that the input variables in the HVTN 505 example fell into three clusters that align with their biological distinctions. This contrasts with the results in Section 4.2, where the input variables in the MTCT example formed a single cluster, even though those input variables also relate to different aspects of the human immune response. In Section D.3.1 of the Supplementary Materials we use the HVTN 505 example to illustrate further uses of the estimated components for data visualization and supervised learning tasks.

An unexpected and interesting phenomenon occurred when we used a negative gradient penalty term evaluated at the set of latent variable values that corresponded to the input data in (1). The trained model circumvented the monotonicity constraint by creating a vacuum in the range of the latent variable. We fixed the problem by evaluating the penalty term at an evenly spaced grid of latent variable values. An alternative solution to this problem, suggested by an anonymous reviewer, is to sort the latent variables $\{x_i\}_{i=1}^n$ in increasing order and penalize negative values of $g(x_{(i+1)}) - g(x_{(i)})$. Doing so effectively integrates the gradients over the space between $x_{(i)}$ and $x_{(i+1)}$. We have not encountered this problem in estimation of the later components. It is hard to explain precisely why it is not commonly observed. One possible explanation is that the models are different in the later stages. For example, in the mth stage the code layer contains m − 1 nodes that are assigned values from previously estimated components and are not connected to the input. We conjecture that this somehow limits the vacuum-forming potential of the model. If problems do arise in later stages, a potential fix is to add another gradient penalty term to (2): $\tau_3 N^{-1} p^{-1} \sum_{j=1}^{p} \sum_{t=1}^{N} \left| \frac{dg_j}{du}\Big|_{x_t^*} \right|_{-}$, where u denotes the mth component variable, and $\{x_t^*\}_{t=1}^N$ denotes a series of points whose bth coordinate equals the median of $\{x_{i,b}\}_{i=1}^n$ for b ∈ {1, …, m − 1}, and whose mth coordinate takes N = 2n values evenly spaced in the range of $\{x_{i,m}\}_{i=1}^n$.
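A sketch of the reviewer-suggested alternative penalty follows, assuming a decoder that maps a code value to the p outputs; it penalizes negative first differences of the outputs after sorting the observed codes, and the tuning parameter and names are illustrative.

```python
# A sketch of the sorted-difference penalty: sort the observed codes and
# penalize negative values of g_j(x_(i+1)) - g_j(x_(i)) for every output j.
import torch

def sorted_difference_penalty(decoder, x_codes, tau=10.0):
    x_sorted, _ = torch.sort(x_codes.flatten())
    out = decoder(x_sorted.reshape(-1, 1))         # n x p, rows ordered by the code
    diffs = out[1:] - out[:-1]                     # consecutive differences for each j
    return tau * torch.clamp(diffs, max=0.0).abs().mean()
```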

A drawback of imposing monotonicity constraints is that when the monotonicity assumption is violated, more components than would be ideal are needed to reach a given level of total variability explained. A simple example is given in Section D.5 of the Supplementary Materials. One way to check the monotonicity assumption is to examine pairwise relationships between the input variables. If they are not all monotone, it suggests that the monotonicity assumption likely does not hold. Extensions to the proposed methods allowing for a mix of monotone and non-monotone relationships between the output layer variables and the code layer variables warrant future research.

Supplementary Material

Supp 1
Supp 2

Acknowledgment

The authors thank the Editor, the AE, and two anonymous referees for their highly constructive comments. The authors are also indebted to the investigators of the immune correlates study of mother-to-child transmission of HIV-1, in particular Sallie Permar, and the participants and investigators of HVTN 505, in particular Georgia Tomaras and Julie McElrath, for providing the biomarker data for the examples. The authors thank Lindsay N. Carpp for help with editing. This work was supported by the National Institutes of Health (R01-AI122991; UM1-AI068635; S10OD028685).

Contributor Information

Youyi Fong, Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle WA 98109, USA.

Jun Xu, Cincinnati, United States.

References

1. Baldi P and Hornik K (1989), “Neural networks and principal component analysis: Learning from examples without local minima,” Neural Networks, 2, 53–58.
2. Barlow RE, Bartholomew DJ, Bremner JM and Brunk HD (1972), Statistical Inference Under Order Restrictions: The Theory and Application of Isotonic Regression, Wiley, New York.
3. Belkin M and Niyogi P (2002), “Laplacian eigenmaps and spectral techniques for embedding and clustering,” in Advances in Neural Information Processing Systems, pp. 585–591.
4. Bishop C, Svensen M and Williams C (1998), “GTM: The generative topographic mapping,” Neural Computation, 10, 215–234.
5. Bourlard H and Kamp Y (1988), “Auto-association by multilayer perceptrons and singular value decomposition,” Biological Cybernetics, 59, 291–294.
6. Brunk H et al. (1958), “On the estimation of parameters restricted by inequalities,” The Annals of Mathematical Statistics, 29, 437–454.
7. Cybenko G (1989), “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, 2, 303–314.
8. Donoho DL and Grimes C (2003), “Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data,” Proceedings of the National Academy of Sciences, 100, 5591–5596.
9. Duchamp T, Stuetzle W et al. (1996), “Extremal properties of principal curves in the plane,” The Annals of Statistics, 24, 1511–1520.
10. Feng T, Li SZ, Shum HY and Zhang H (2002), “Local non-negative matrix factorization as a visual representation,” in Proceedings 2nd International Conference on Development and Learning (ICDL 2002), pp. 178–183, IEEE.
11. Fong Y, Shen X, Ashley VC, Deal A, Seaton KE, Yu C et al. (2018), “Vaccine-induced antibody responses modify the association between T-cell immune responses and HIV-1 infection risk in HVTN 505,” The Journal of Infectious Diseases, 217, 1280–1288.
12. Fukunaga K and Olsen DR (1971), “An algorithm for finding intrinsic dimensionality of data,” IEEE Transactions on Computers, 100, 176–183.
13. Gijbels I (2005), “Monotone regression,” in The Encyclopedia of Statistical Sciences, Wiley, Hoboken, NJ.
14. Goodfellow I, Bengio Y and Courville A (2016), Deep Learning, Adaptive Computation and Machine Learning, MIT Press, Cambridge, MA.
15. Grassberger P and Procaccia I (1983), “Measuring the strangeness of strange attractors,” Physica D: Nonlinear Phenomena, 9, 189–208.
16. Hall P, Huang LS et al. (2001), “Nonparametric kernel regression subject to monotonicity constraints,” The Annals of Statistics, 29, 624–647.
17. Ham J, Lee DD, Mika S and Schölkopf B (2004), “A kernel view of the dimensionality reduction of manifolds,” in Proceedings of the Twenty-First International Conference on Machine Learning, p. 47.
18. Hammer SM, Sobieszczyk ME, Janes H, Karuna ST, Mulligan MJ, Grove D et al. (2013), “Efficacy trial of a DNA/rAd5 HIV-1 preventive vaccine,” New England Journal of Medicine, 369, 2083–2092.
19. Hastie T and Stuetzle W (1989), “Principal curves,” Journal of the American Statistical Association, 84, 502–516.
20. He K, Zhang X, Ren S and Sun J (2015), “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.
21. Hinton GE and Salakhutdinov RR (2006), “Reducing the dimensionality of data with neural networks,” Science, 313, 504–507.
22. Hoyer PO (2004), “Non-negative matrix factorization with sparseness constraints,” Journal of Machine Learning Research, 5, 1457–1469.
23. Janes HE, Cohen KW, Frahm N, De Rosa SC, Sanchez B, Hural J et al. (2017), “Higher T-cell responses induced by DNA/rAd5 HIV-1 preventive vaccine are associated with lower HIV-1 infection risk in an efficacy trial,” The Journal of Infectious Diseases, 215, 1376–1385.
24. Jolliffe IT and Cadima J (2016), “Principal component analysis: a review and recent developments,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374, 20150202.
25. Kambhatla N and Leen TK (1997), “Dimension reduction by local principal component analysis,” Neural Computation, 9, 1493–1516.
26. Kingma DP and Ba J (2014), “Adam: A method for stochastic optimization,” in Proceedings of the 3rd International Conference on Learning Representations (ICLR).
27. Kramer MA (1991), “Nonlinear principal component analysis using autoassociative neural networks,” AIChE Journal, 37, 233–243.
28. Kruskal JB and Wish M (1978), Multidimensional Scaling, Sage.
29. LeCun Y (1987), “Modèles connexionnistes de l’apprentissage,” PhD thesis.
30. Lee DD and Seung HS (1999), “Learning the parts of objects by nonnegative matrix factorization,” Nature, 401, 788–791.
31. Lee JA and Verleysen M (2007), Nonlinear Dimensionality Reduction, Springer Science & Business Media, New York.
32. Mammen E, Marron J, Turlach B, Wand M et al. (2001), “A general projection framework for constrained smoothing,” Statistical Science, 16, 232–248.
33. Meredith W and Millsap RE (1985), “On component analyses,” Psychometrika, 50, 495–507.
34. Mika S, Ratsch G, Weston J, Scholkopf B and Mullers K (1999), “Fisher discriminant analysis with kernels,” in Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop, pp. 41–48, IEEE.
35. Mikolov T, Chen K, Corrado G and Dean J (2013), “Efficient estimation of word representations in vector space,” International Conference on Learning Representations.
36. Neidich SD, Fong Y, Li SS, Geraghty DE, Williamson BD, Young WC et al. (2019), “Antibody Fc effector functions and IgG3 associate with decreased HIV-1 risk,” Journal of Clinical Investigation, 129, 4838–4849.
37. Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z et al. (2017), “Automatic differentiation in PyTorch,” in NIPS-W.
38. Permar SR, Fong Y, Vandergrift N, Fouda GG, Gilbert P, Parks R et al. (2015), “Maternal HIV-1 envelope–specific antibody responses and reduced risk of perinatal transmission,” Journal of Clinical Investigation, 125, 2702–2706.
39. Plaut E (2018), “From principal subspaces to principal components with linear autoencoders,” arXiv preprint arXiv:1804.10253.
40. Ramsay JO et al. (1988), “Monotone regression splines in action,” Statistical Science, 3, 425–441.
41. Rich KC, Fowler MG, Mofenson LM, Abboud R, Pitt J, Diaz C et al. (2000), “Maternal and infant factors predicting disease progression in human immunodeficiency virus type 1-infected infants,” Pediatrics, 105, e8.
42. Roweis ST and Saul LK (2000), “Nonlinear dimensionality reduction by locally linear embedding,” Science, 290, 2323–2326.
43. Rumelhart D, Hinton G and Williams R (1986), “Learning internal representations by error propagation,” in Parallel Distributed Processing, chap. 8, MIT Press, Cambridge.
44. Sammon JW (1969), “A nonlinear mapping for data structure analysis,” IEEE Transactions on Computers, 100, 401–409.
45. Schölkopf B, Smola A and Müller K (1997), “Kernel principal component analysis,” in Lecture Notes in Computer Science, pp. 583–588, Springer.
46. Scholz M and Vigário R (2002), “Nonlinear PCA: a new hierarchical approach,” in Proceedings ESANN, pp. 439–444.
47. Silverman BW and Green P (1993), Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach, Chapman and Hall/CRC.
48. Smola AJ, Williamson RC, Mika S and Schölkopf B (1999), “Regularized principal manifolds,” in European Conference on Computational Learning Theory, pp. 214–229, Springer.
49. Sorzano COS, Vargas J and Montano AP (2014), “A survey of dimensionality reduction techniques,” arXiv preprint arXiv:1403.2877.
50. Tenenbaum JB, De Silva V and Langford JC (2000), “A global geometric framework for nonlinear dimensionality reduction,” Science, 290, 2319–2323.
51. Ting D and Jordan MI (2018), “On nonlinear dimensionality reduction, linear smoothing and autoencoding,” arXiv preprint arXiv:1803.02432.
52. Van Der Maaten L, Postma E and Van den Herik J (2009), “Dimensionality reduction: a comparative review,” J Mach Learn Res, 10, 13.
53. Van der Waerden B (1952), “Order tests for the two-sample problem and their power,” in Indagationes Mathematicae (Proceedings), vol. 55, pp. 453–458, Elsevier.
54. Weinberger KQ and Saul LK (2006), “Unsupervised learning of image manifolds by semidefinite programming,” International Journal of Computer Vision, 70, 77–90.
55. Westfall PH, Arias AL and Fulton LV (2017), “Teaching principal components using correlations,” Multivariate Behavioral Research, 52, 648–660.
56. Williams CK (2002), “On a connection between kernel PCA and metric multidimensional scaling,” Machine Learning, 46, 11–19.
57. Zhang Z and Zha H (2004), “Principal manifolds and nonlinear dimensionality reduction via tangent space alignment,” SIAM Journal on Scientific Computing, 26, 313–338.
