Entropy. 2021 Aug 3;23(8):1012. doi: 10.3390/e23081012

A Factor Analysis Perspective on Linear Regression in the ‘More Predictors than Samples’ Case

Sebastian Ciobanu 1,*, Liviu Ciortuz 1,*
Editor: Kevin R. Moon
PMCID: PMC8394575  PMID: 34441152

Abstract

Linear regression (LR) is a core model in supervised machine learning performing a regression task. One can fit this model using either an analytic/closed-form formula or an iterative algorithm. Fitting it via the analytic formula becomes a problem when the number of predictors is greater than the number of samples, because the closed-form solution contains a matrix inverse that is not defined in that case. The standard approaches to this issue are to use the Moore–Penrose inverse or L2 regularization. We propose another solution starting from a machine learning model that, this time, is used in unsupervised learning to perform a dimensionality reduction task or just a density estimation one—factor analysis (FA)—with one-dimensional latent space. Density estimation is our focus since, in this case, factor analysis can fit a Gaussian distribution even if the dimensionality of the data is greater than the number of samples; we therefore retain this advantage when creating the supervised counterpart of factor analysis, which is linked to linear regression. We also create its semisupervised counterpart and then extend it to be usable with missing data. We prove an equivalence to linear regression and report experiments for each extension of the factor analysis model. The resulting algorithms are either closed-form solutions or expectation–maximization (EM) algorithms. The latter are linked to information theory through optimizing a function containing a Kullback–Leibler (KL) divergence or the entropy of a random variable.

Keywords: more predictors than samples, linear regression, factor analysis, semisupervised regression, missing data

1. Introduction

In machine learning, models can be grouped into two categories: probabilistic and nonprobabilistic. Probabilistic models can be classified as generative and discriminative [1]. Examples of classic generative models are naive Bayes and Gaussian mixture models (GMM). Examples of classic discriminative models are linear regression (LR) and logistic regression. The key difference is whether they model the joint probability of the input and the output—generative models—or they just model the conditional probability of the output given the input—discriminative models. For a classification or a regression task, one may argue that what you need is just a discriminative model, but the generative models have their advantages: they can sometimes handle missing data, can easily generate new data, can be extended to be unsupervised or semisupervised, etc. ([2] p. 268).

As one may notice, there are generative models for unsupervised learning that have counterparts in supervised learning, even though this is not widely discussed in the literature. One such example is the GMM ([2] p. 339) with its counterpart, the Gaussian joint Bayes model ([2] p. 102), also known as quadratic discriminant analysis. Their training/fitting algorithms are similar, as one may notice, for example, in [3,4]:

  • for Gaussian joint Bayes:
    $\pi_j = \frac{1}{n}\sum_{i=1}^n 1\{z_i = j\}$
    $\mu_j = \frac{\sum_{i=1}^n 1\{z_i = j\}\, x_i}{\sum_{i=1}^n 1\{z_i = j\}}$
    $\Sigma_j = \frac{\sum_{i=1}^n 1\{z_i = j\}\,(x_i - \mu_j)(x_i - \mu_j)^\top}{\sum_{i=1}^n 1\{z_i = j\}}$
    where $x_i$ is an input observation, $(\pi_j, \mu_j, \Sigma_j)$ are the parameters of a GMM, observable $z_i$ is the class index corresponding to $x_i$, $j$ is a class index, and $1\{z_i = j\}$ is the indicator function, which returns 1 if the condition $z_i = j$ is true and 0 otherwise.
  • for expectation–maximization (EM) for the GMM, which optimizes a function containing a Kullback–Leibler (KL) divergence or the entropy of a random variable (check Appendix A for these details):

    E step:
    $w_{ij} = p(z_i = j \mid x_i; \pi, \mu, \Sigma) = \frac{p(x_i \mid z_i = j; \mu, \Sigma)\, p(z_i = j; \pi)}{\sum_{l=1}^K p(x_i \mid z_i = l; \mu, \Sigma)\, p(z_i = l; \pi)}$
    M step:
    $\pi_j = \frac{1}{n}\sum_{i=1}^n w_{ij}$
    $\mu_j = \frac{\sum_{i=1}^n w_{ij}\, x_i}{\sum_{i=1}^n w_{ij}}$
    $\Sigma_j = \frac{\sum_{i=1}^n w_{ij}\,(x_i - \mu_j)(x_i - \mu_j)^\top}{\sum_{i=1}^n w_{ij}}$
    where $x_i$ is an input observation, $(\pi_j, \mu_j, \Sigma_j)$ are the parameters of a GMM, unobservable $z_i$ is the cluster index corresponding to $x_i$, $j$ is a cluster index, and $w_{ij}$ is the probability that $x_i$ belongs to cluster $j$ (a minimal code sketch of these updates is given right after this list).
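To make the contrast between the two fitting procedures concrete, the following is a minimal NumPy/SciPy sketch of the EM updates written above; the function name gmm_em, the n × D layout of the data matrix X, and the initialization strategy are illustrative choices not taken from the paper.

import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=50, seed=0):
    # X: n x D data matrix, K: number of clusters (our own conventions)
    n, D = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                                  # mixing weights pi_j
    mu = X[rng.choice(n, size=K, replace=False)]              # means mu_j
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)    # covariances Sigma_j
    for _ in range(n_iter):
        # E step: responsibilities w_ij = p(z_i = j | x_i; pi, mu, Sigma)
        dens = np.column_stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                                for j in range(K)])
        W = dens / dens.sum(axis=1, keepdims=True)
        # M step: the Gaussian joint Bayes formulas weighted by w_ij
        Nj = W.sum(axis=0)
        pi = Nj / n
        mu = (W.T @ X) / Nj[:, None]
        for j in range(K):
            diff = X - mu[j]
            Sigma[j] = (W[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(D)
    return pi, mu, Sigma, W

The E step fills the responsibility matrix W (the $w_{ij}$ above), while the M step reuses the Gaussian joint Bayes formulas with $w_{ij}$ in place of the indicator $1\{z_i = j\}$.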

This similarity between GMM and Gaussian joint Bayes is intriguing; hence, we decided to further explore this aspect but starting from other supervised–unsupervised counterparts. As a result, we changed the root model into factor analysis (FA) [5] ([2] p. 381), which is normally used for dimensionality reduction or for density estimation when the dimensionality of the data is greater than the number of samples. Factor analysis is a Gaussian generative model used in unsupervised learning. We aimed at creating its supervised counterpart in order to handle a regression task and then exploit it as much as possible.

After creating the supervised counterpart, we proved a significant property, namely that linear regression is equivalent to (supervised) factor analysis—with one-dimensional latent space—when no constraints are imposed on the covariance matrices.

A linear regression model can be fitted via a closed-form solution or an iterative algorithm. When the number of predictors is greater than the number of samples, the closed-form solution is no longer directly applicable, because the matrix that must be inverted is singular. There are other solutions to this problem, as we will see.

We were at the point where we knew that factor analysis was linked to linear regression and that it could be used when the number of samples is lower than the dimensionality of the data—from now on, this setting is denoted as D>>n or n<<D. As a result, we shifted our focus from solely exploiting the factor analysis model to highlighting novel linear regression versions applicable in the D>>n regime, linear regression being a widely known and used model:

  • linear regression when D>>n,

  • semisupervised linear regression when D>>n,

  • (semisupervised) linear regression when D>>n with missing data.

The structure of this paper is as follows. In Section 2, we include some theoretical background to enhance the readability of this paper. Section 3 contains related work. In Section 4, we include the models we proposed, starting from factor analysis. Section 5 contains experiments using the proposed models. In Section 6, we conclude the paper and show future directions.

We include the full algorithms in the appendices in a pseudocode format, two of them being instances of the expectation–maximization schema.

2. Theoretical Background

We started our analysis from two core models in machine learning: linear regression and factor analysis. We will discuss the aspects of those two models that are relevant to understanding the next sections of this paper.

2.1. Linear Regression

Proposition 1.

Let $\{(x^{(i)}, y^{(i)}) \mid x^{(i)} \in \mathbb{R}^{D\times 1},\ y^{(i)} \in \mathbb{R},\ i \in \{1,\dots,n\}\}$ be a data set where $D$ is the dimensionality of the input data, $\{x^{(i)} \mid i \in \{1,\dots,n\}\}$ is the input, and $\{y^{(i)} \mid i \in \{1,\dots,n\}\}$ is the output. The linear regression model is as follows:

$Y^{(i)} = w x^{(i)} + b + \epsilon^{(i)}$

where $Y^{(i)}$ is a random variable corresponding to $y^{(i)}$, $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$, $\sigma \in \mathbb{R}_+^*$, $w \in \mathbb{R}^{1\times D}$, $b \in \mathbb{R}$. Then, the parameters $w$ and $b$ can be estimated via maximum likelihood as follows:

$\hat{w}_{LR} = \big(n\bar{y}\bar{x}^\top - YX^\top\big)\big(n\bar{x}\bar{x}^\top - XX^\top\big)^{-1}$ (1)
$\hat{b}_{LR} = \bar{y} - \hat{w}_{LR}\bar{x}$, (2)

or, equivalently, as follows:

$\begin{pmatrix} \hat{w}_{LR}^\top \\ \hat{b}_{LR} \end{pmatrix} = \big(\tilde{X}\tilde{X}^\top\big)^{-1}\tilde{X}Y^\top$ (3)

where $\bar{x} = \frac{x^{(1)}+\dots+x^{(n)}}{n}$, $\bar{y} = \frac{y^{(1)}+\dots+y^{(n)}}{n}$,

$X = \begin{pmatrix} x^{(1)} & \cdots & x^{(n)} \end{pmatrix} \in \mathbb{R}^{D\times n}$, $\tilde{X} = \begin{pmatrix} \tilde{x}^{(1)} & \cdots & \tilde{x}^{(n)} \end{pmatrix} \in \mathbb{R}^{(D+1)\times n}$,

$\tilde{x}^{(i)} = \begin{pmatrix} x^{(i)} \\ 1 \end{pmatrix} = \begin{pmatrix} x_1^{(i)} & \cdots & x_D^{(i)} & 1 \end{pmatrix}^\top \in \mathbb{R}^{D+1}$, and $Y = \begin{pmatrix} y^{(1)} & \cdots & y^{(n)} \end{pmatrix} \in \mathbb{R}^{1\times n}$.

A potential problem with Equation (3) is when $\tilde{X}\tilde{X}^\top$ is not invertible. Such a case arises when $D+1>n$, i.e., there are more predictors than samples. Two standard solutions to this problem are the following:

  • Let $A \in \mathbb{R}^{a\times b}$ be a matrix. Then, the Moore–Penrose inverse of $A$ can be defined as $A^+ = \lim_{\alpha \to 0^+}(A^\top A + \alpha I)^{-1}A^\top$, which, algorithmically, is computed via the singular value decomposition of $A$ ([6] Section 2.9). One may notice that the matrix $(\tilde{X}\tilde{X}^\top)^{-1}\tilde{X}$ from Equation (3) is just $(\tilde{X}^\top)^+$ when $\tilde{X}\tilde{X}^\top$ is invertible. When it is not, the solution is to replace the matrix $(\tilde{X}\tilde{X}^\top)^{-1}\tilde{X}$ from Equation (3) with $(\tilde{X}^\top)^+$.

  • L2 regularization, which results in ridge regression ([2] p. 225). The matrix $(\tilde{X}\tilde{X}^\top)^{-1}\tilde{X}$ from Equation (3) is replaced with $\big(\tilde{X}\tilde{X}^\top + \operatorname{diag}(\alpha, \dots, \alpha, 0)\big)^{-1}\tilde{X}$, with $\alpha > 0$ (the entry corresponding to the bias is left unpenalized); the bigger the $\alpha$, the more regularization we add to the model, i.e., the further we move away from overfitting. From this point of view, the first solution, using the Moore–Penrose inverse, can be interpreted as achieving the asymptotically lowest L2 regularization. A minimal code sketch of both workarounds is given right after this list.
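The following is a minimal NumPy sketch of these two workarounds for Equation (3) when $D+1>n$; the function names and the tiny example at the end are ours, and the ridge variant follows the matrix written above in leaving the bias entry unpenalized.

import numpy as np

def fit_lr_pinv(X, y):
    """X: D x n inputs, y: length-n outputs; returns (w, b) via the pseudoinverse."""
    Xt = np.vstack([X, np.ones(X.shape[1])])      # X tilde, shape (D+1) x n
    wb = np.linalg.pinv(Xt.T) @ y                 # (X tilde^T)^+ y
    return wb[:-1], wb[-1]

def fit_lr_ridge(X, y, alpha):
    """Ridge solution with the bias row left unregularized."""
    Xt = np.vstack([X, np.ones(X.shape[1])])
    reg = alpha * np.eye(Xt.shape[0])
    reg[-1, -1] = 0.0                             # do not penalize the intercept
    wb = np.linalg.solve(Xt @ Xt.T + reg, Xt @ y)
    return wb[:-1], wb[-1]

# tiny usage example with more predictors than samples (D = 10, n = 5)
rng = np.random.default_rng(0)
X, y = rng.normal(size=(10, 5)), rng.normal(size=5)
w1, b1 = fit_lr_pinv(X, y)
w2, b2 = fit_lr_ridge(X, y, alpha=1.0)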

2.2. Factor Analysis

The formulas stated in the conclusion of the following Proposition were proved in [5] and are relevant for the factor analysis algorithm—although the matrix Ψ is considered as being diagonal there, the formulas stay the same even if Ψ is not diagonal.

Proposition 2.

Let us consider the following factor analysis model:

$z \sim \mathcal{N}(0, I)$ — latent variable, $z \in \mathbb{R}^{d\times 1}$,

$x \mid z \sim \mathcal{N}(\mu + \Lambda z, \Psi)$, $x \in \mathbb{R}^{D\times 1}$, $\mu \in \mathbb{R}^{D\times 1}$, $\Lambda \in \mathbb{R}^{D\times d}$, and $\Psi \in \mathbb{R}^{D\times D}$ a diagonal matrix. Then:

$\begin{pmatrix} x \\ z \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \mu \\ 0 \end{pmatrix}, \begin{pmatrix} \Lambda\Lambda^\top + \Psi & \Lambda \\ \Lambda^\top & I \end{pmatrix}\right)$

$x \sim \mathcal{N}(\mu, \Lambda\Lambda^\top + \Psi)$

$z \mid x \sim \mathcal{N}\big(\Lambda^\top(\Lambda\Lambda^\top + \Psi)^{-1}(x - \mu),\ I - \Lambda^\top(\Lambda\Lambda^\top + \Psi)^{-1}\Lambda\big)$.

A factor analysis model can be fitted via an EM algorithm using the maximum likelihood estimation (MLE) principle, and factor analysis can be used as a density estimation technique when the dimensionality of the data is greater than the number of samples [5].

For the algorithms we developed, we will let $z \sim \mathcal{N}(\mu_z, \Sigma_z)$ and not $z \sim \mathcal{N}(0, I)$, because $z$ becomes observed data, and we want to learn its parameters, and not impose something unrealistic like $z \sim \mathcal{N}(0, I)$. This generalization leads to the following result.

Proposition 3.

Let us consider the following linear Gaussian system:

$z \sim \mathcal{N}(\mu_z, \Sigma_z)$ — latent variable, $z \in \mathbb{R}^{d\times 1}$, $\mu_z \in \mathbb{R}^{d\times 1}$, $\Sigma_z \in \mathbb{R}^{d\times d}$ a symmetric and positive definite matrix,

$x \mid z \sim \mathcal{N}(\mu + \Lambda z, \Psi)$, $x \in \mathbb{R}^{D\times 1}$, $\mu \in \mathbb{R}^{D\times 1}$, $\Lambda \in \mathbb{R}^{D\times d}$, and $\Psi \in \mathbb{R}^{D\times D}$ a symmetric and positive definite matrix. (If $\Psi$ is a diagonal matrix, then we say that we are in the FA case. If $\Psi$ is a scalar matrix, $\Psi = \eta I$, $\eta \in \mathbb{R}_+^*$, then we are in the probabilistic principal component analysis (PPCA) case [7]. If $\Psi$ is any symmetric and positive definite matrix, then we say we are in the unconstrained factor analysis (UncFA) case. While the first two cases are standard (FA and PPCA), the third one, UncFA, is proposed by us.) Then:

$\begin{pmatrix} x \\ z \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \mu + \Lambda\mu_z \\ \mu_z \end{pmatrix}, \begin{pmatrix} \Lambda\Sigma_z\Lambda^\top + \Psi & \Lambda\Sigma_z \\ (\Lambda\Sigma_z)^\top & \Sigma_z \end{pmatrix}\right)$
$x \sim \mathcal{N}(\mu + \Lambda\mu_z,\ \Lambda\Sigma_z\Lambda^\top + \Psi)$
$z \mid x \sim \mathcal{N}\big(\mu_z + \Sigma_z\Lambda^\top(\Lambda\Sigma_z\Lambda^\top + \Psi)^{-1}(x - \mu - \Lambda\mu_z),\ \Sigma_z - \Sigma_z\Lambda^\top(\Lambda\Sigma_z\Lambda^\top + \Psi)^{-1}\Lambda\Sigma_z\big)$. (4)

The proof can be found in ([8] pp. 9–11).
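As a small illustration of how the $z \mid x$ formula in (4) is evaluated in practice, here is a minimal NumPy sketch; the function name and the argument layout are ours, not part of the paper.

import numpy as np

def posterior_z_given_x(x, mu_z, Sigma_z, mu, Lam, Psi):
    # x: (D,), mu_z: (d,), Sigma_z: (d, d), mu: (D,), Lam: (D, d), Psi: (D, D)
    S = Lam @ Sigma_z @ Lam.T + Psi           # covariance of the marginal of x
    G = Sigma_z @ Lam.T @ np.linalg.inv(S)    # Sigma_z Lam^T (Lam Sigma_z Lam^T + Psi)^{-1}
    mean = mu_z + G @ (x - mu - Lam @ mu_z)   # posterior mean of z | x
    cov = Sigma_z - G @ Lam @ Sigma_z         # posterior covariance of z | x
    return mean, cov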

3. Related Work

Although factor analysis is widely used for dimensionality reduction, its supervised counterpart is, to the best of our knowledge, not present in the literature. What is present is a model called supervised principal component analysis or latent factor regression ([2] p. 405). The idea there is that, for a regression task, not only the input but also the output is generated by a latent variable; one applies factor analysis to replace the input with a low-dimensional embedding. The key point is that the purpose of supervised principal component analysis is still dimensionality reduction and not at all regression, which is where we want to push the factor analysis model.

There is also a term called linear Gaussian system ([2] p. 119). This was already presented in Section 2, and it generalizes the factor analysis generative process by considering that z has a learnable mean and covariance matrix, but it does not go further.

Factor analysis is strongly related to principal component analysis (PCA) [9] because by imposing a certain constraint in factor analysis, we get a model called probabilistic principal component analysis [7] that can be fitted using a closed-form solution, which, in an asymptotic case, is also the solution for PCA. Probabilistic PCA can be kernelized using a model called the Gaussian process latent variable model (GPLVM) [10]. This model also has supervised counterparts [11], but, as in the case of FA, the supervised extension targets dimensionality reduction, and the idea is similar to the one in supervised PCA.

4. Proposed Models

In this section, we propose three models starting from the FA model, each in a new subsection: simple-supervised factor analysis (S2.FA), simple-semisupervised factor analysis (S3.FA), and missing simple-semisupervised factor analysis (MS3.FA). While S2.FA is applicable in the supervised case, regression, S3.FA is meant to be used in a semisupervised context. MS3.FA handles missing input data in a (semi)supervised scenario.

One important remark that will not be restated in this paper is that all the models (FA, S2.FA, S3.FA, MS3.FA) are fitted by maximizing the likelihood (the MLE principle) of the observed data. Another important observation is regarding the names of our proposed algorithms: the algorithms are called “simple-”—S2 = simple-supervised; S3 = simple-semisupervised; MS3 = missing simple-semisupervised—not only because they constitute a simple adaptation of the factor analysis model, but mostly because we created an adaptation of the (simple-)supervised FA model called (simple-)supervised PPCA, and we did not want this model to be confused with the already existing supervised PCA model in the literature. Simple-supervised probabilistic principal component analysis (S2.PPCA) is not discussed in this paper, but it is implemented and usable in the R package that we developed (https://github.com/aciobanusebi/s2fa; accessed on 31 July 2021) along with other undiscussed but related models: Simple-semisupervised unconstrained factor analysis (S3.UncFA), Simple-semisupervised probabilistic principal component analysis (S3.PPCA), Missing simple-semisupervised unconstrained factor analysis (MS3.UncFA), and Missing simple-semisupervised probabilistic principal component analysis (MS3.PPCA).

4.1. The S2.FA Model

The core of this subsection regards the S2.FA model, but in order to link it with LR, we need to also introduce a similar model to S2.FA: S2.UncFA. This link will make S2.FA a good candidate for replacing LR when D>>n. These three ideas—S2.FA and S2.UncFA, S2.UncFA-LR link, and replacing LR via S2.FA—will be expanded below.

4.1.1. The S2.FA Model. S2.FA and S2.UncFA

The first model that we propose is called simple-supervised factor analysis. It is a linear Gaussian system with slight changes:

$z \sim \mathcal{N}(\mu_z, \sigma_z^2)$ — observed variable, $z \in \mathbb{R}$, $\mu_z \in \mathbb{R}$, $\sigma_z \in \mathbb{R}_+^*$

$x \mid z \sim \mathcal{N}(\mu + \Lambda z, \Psi)$, $x \in \mathbb{R}^{D\times 1}$, $\mu \in \mathbb{R}^{D\times 1}$, $\Lambda \in \mathbb{R}^{D\times 1}$, and $\Psi \in \mathbb{R}^{D\times D}$ a diagonal matrix.

If we do not impose the constraint of $\Psi$ being diagonal, we arrive at the simple-supervised unconstrained factor analysis (S2.UncFA):

$z \sim \mathcal{N}(\mu_z, \sigma_z^2)$ — observed variable, $z \in \mathbb{R}$, $\mu_z \in \mathbb{R}$, $\sigma_z \in \mathbb{R}_+^*$

$x \mid z \sim \mathcal{N}(\mu + \Lambda z, \Psi)$, $x \in \mathbb{R}^{D\times 1}$, $\mu \in \mathbb{R}^{D\times 1}$, $\Lambda \in \mathbb{R}^{D\times 1}$, and $\Psi \in \mathbb{R}^{D\times D}$ a symmetric and positive definite matrix.

In contrast with the factor analysis model, which is fitted via an EM algorithm, S2.UncFA and S2.FA are fitted via analytic formulas (see Propositions 4 and 5).

Proposition 4.

Let $\{(x^{(i)}, z^{(i)}) \mid x^{(i)} \in \mathbb{R}^{D\times 1},\ z^{(i)} \in \mathbb{R}^{d\times 1},\ i \in \{1,\dots,n\}\}$ be a data set where $D$ is the dimensionality of the input data, $d$ is the dimensionality of the output data (although in the context of this paper $d=1$, we decided to expose more general results—$d \geq 1$—in order for the reader to gain more insight; this is the reason why we write $\Sigma_z$ and not just $\sigma_z^2$, and treat $z^{(i)}$ and $\bar{z}$ as vectors rather than scalars), $\{x^{(i)} \mid i \in \{1,\dots,n\}\}$ is the input, and $\{z^{(i)} \mid i \in \{1,\dots,n\}\}$ is the output. We suppose that the data was generated as follows:

$z^{(i)} \sim \mathcal{N}(\mu_z, \Sigma_z)$, $z^{(i)} \in \mathbb{R}^{d\times 1}$, $\mu_z \in \mathbb{R}^{d\times 1}$, $\Sigma_z \in \mathbb{R}^{d\times d}$ a symmetric and positive definite matrix and

$x^{(i)} \mid z^{(i)} \sim \mathcal{N}(\mu + \Lambda z^{(i)}, \Psi)$, $x^{(i)} \in \mathbb{R}^{D\times 1}$, $\mu \in \mathbb{R}^{D\times 1}$, $\Lambda \in \mathbb{R}^{D\times d}$, while $\Psi \in \mathbb{R}^{D\times D}$ is a symmetric and positive definite matrix.

Then, the parameters in the S2.UncFA algorithm (training phase) can be estimated via maximum likelihood using the following closed-form formulas:

$\hat{\mu}_z = \frac{\sum_{i=1}^n z^{(i)}}{n}$ (5)
$\hat{\Sigma}_z = \frac{\sum_{i=1}^n (z^{(i)} - \hat{\mu}_z)(z^{(i)} - \hat{\mu}_z)^\top}{n}$ (6)
$\hat{\Lambda} = \left(n\bar{x}\bar{z}^\top - \sum_{i=1}^n x^{(i)} z^{(i)\top}\right)\left(n\bar{z}\bar{z}^\top - \sum_{i=1}^n z^{(i)} z^{(i)\top}\right)^{-1}$ (7)
$\hat{\mu} = \bar{x} - \hat{\Lambda}\bar{z}$ (8)
$\hat{\Psi} = \frac{\sum_{i=1}^n (x^{(i)} - \hat{\mu} - \hat{\Lambda} z^{(i)})(x^{(i)} - \hat{\mu} - \hat{\Lambda} z^{(i)})^\top}{n}$ (9)

where $\bar{x} = \frac{\sum_{i=1}^n x^{(i)}}{n}$ and $\bar{z} = \hat{\mu}_z$. For the testing/prediction phase, one uses the formula for $z \mid x$ from (4).

For more elaborate notations and the proof, see [8] pp. 13–17.
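A minimal NumPy sketch of this closed-form training (Equations (5)–(9)) and of the test-phase prediction via (4) could look as follows; the function names and the D × n / d × n matrix layout are ours, and the diagonal_psi flag anticipates the S2.FA restriction of Proposition 5 below.

import numpy as np

def fit_s2_uncfa(X, Z, diagonal_psi=False):
    # X: D x n inputs, Z: d x n outputs (d = 1 in the paper)
    n = X.shape[1]
    x_bar, z_bar = X.mean(axis=1, keepdims=True), Z.mean(axis=1, keepdims=True)
    mu_z = z_bar                                                   # Eq. (5)
    Sigma_z = (Z - z_bar) @ (Z - z_bar).T / n                      # Eq. (6)
    Lam = (n * x_bar @ z_bar.T - X @ Z.T) @ np.linalg.inv(
        n * z_bar @ z_bar.T - Z @ Z.T)                             # Eq. (7)
    mu = x_bar - Lam @ z_bar                                       # Eq. (8)
    R = X - mu - Lam @ Z
    Psi = R @ R.T / n                                              # Eq. (9)
    if diagonal_psi:                                               # S2.FA variant
        Psi = np.diag(np.diag(Psi))
    return mu_z, Sigma_z, Lam, mu, Psi

def predict(x_star, mu_z, Sigma_z, Lam, mu, Psi):
    # posterior mean of z | x from Eq. (4); x_star has shape (D, 1)
    G = Sigma_z @ Lam.T @ np.linalg.inv(Lam @ Sigma_z @ Lam.T + Psi)
    return mu_z + G @ (x_star - mu - Lam @ mu_z)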

Proposition 5.

[We will denote the parameter $\hat{\Psi}$ in (9) as $\hat{\Psi}_{\mathrm{S2.UncFA}}$.]

For the S2.FA algorithm, (9) is replaced by

$\hat{\Psi} = \mathrm{diag}\big(\hat{\Psi}_{\mathrm{S2.UncFA}}\big)$ (10)

where “diag” takes the diagonal of a matrix and returns the corresponding diagonal matrix.

The proof of Equation (10) is relatively simple, and we skip it for brevity. It can be found in [8] pp. 21–23.

For the step-by-step S2.FA algorithm and also for the matrix form of the algorithm, see Appendix B.

4.1.2. The S2.FA Model. The Link between LR and S2.UncFA

Linear regression and S2.UncFA have the same prediction function after fitting, as we claim and prove below.

Proposition 6.

Let $\{(x^{(i)}, z^{(i)}) \mid x^{(i)} \in \mathbb{R}^{D\times 1},\ z^{(i)} \in \mathbb{R}^d,\ i \in \{1,\dots,n\}\}$ be a data set where $D$ is the dimensionality of the input data, $d$ is the dimensionality of the output data (the same observation as earlier: in the context of this paper, only the "$d=1$" case is relevant), $\{x^{(i)} \mid i \in \{1,\dots,n\}\}$ is the input, and $\{z^{(i)} \mid i \in \{1,\dots,n\}\}$ is the output.

One can fit an S2.UncFA model and obtain—via the relationships (5)–(9)—$\hat{\mu}_z$, $\hat{\Sigma}_z$, $\hat{\Psi}$, $\hat{\Lambda}$, $\hat{\mu}$. Remember that at the test phase (see (4)), the predicted value is

$\mathrm{predicted}_{\mathrm{S2.UncFA}}(x^*) = \hat{\mu}_z + \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}(x^* - \hat{\mu} - \hat{\Lambda}\hat{\mu}_z), \quad x^* \in \mathbb{R}^{D\times 1}.$

One can fit a linear regression model and obtain $\hat{w}$, $\hat{b}$ from (1) and (2). Remember that at the test phase, the predicted value is

$\mathrm{predicted}_{\mathrm{LR}}(x^*) = \hat{w}x^* + \hat{b}, \quad x^* \in \mathbb{R}^{D\times 1}.$

Then:

$\mathrm{predicted}_{\mathrm{S2.UncFA}}(x^*) = \mathrm{predicted}_{\mathrm{LR}}(x^*), \quad \forall x^* \in \mathbb{R}^{D\times 1}.$
Proof. 

Let $X = \begin{pmatrix} x^{(1)} & \cdots & x^{(n)} \end{pmatrix} \in \mathbb{R}^{D\times n}$ and $Z = \begin{pmatrix} z^{(1)} & \cdots & z^{(n)} \end{pmatrix} \in \mathbb{R}^{d\times n}$.

We begin by computing $\hat{\Psi}$.

$\hat{\Psi} \overset{(9)}{=} \frac{1}{n}\sum_{i=1}^n (x^{(i)} - \hat{\mu} - \hat{\Lambda}z^{(i)})(x^{(i)} - \hat{\mu} - \hat{\Lambda}z^{(i)})^\top$
$= \frac{1}{n}\sum_{i=1}^n \big(x^{(i)}x^{(i)\top} - x^{(i)}\hat{\mu}^\top - x^{(i)}z^{(i)\top}\hat{\Lambda}^\top - \hat{\mu}x^{(i)\top} + \hat{\mu}\hat{\mu}^\top + \hat{\mu}z^{(i)\top}\hat{\Lambda}^\top - \hat{\Lambda}z^{(i)}x^{(i)\top} + \hat{\Lambda}z^{(i)}\hat{\mu}^\top + \hat{\Lambda}z^{(i)}z^{(i)\top}\hat{\Lambda}^\top\big)$
$= \frac{1}{n}XX^\top - \bar{x}\hat{\mu}^\top - \frac{1}{n}XZ^\top\hat{\Lambda}^\top - \hat{\mu}\bar{x}^\top + \hat{\mu}\hat{\mu}^\top + \hat{\mu}\bar{z}^\top\hat{\Lambda}^\top - \frac{1}{n}\hat{\Lambda}ZX^\top + \hat{\Lambda}\bar{z}\hat{\mu}^\top + \frac{1}{n}\hat{\Lambda}ZZ^\top\hat{\Lambda}^\top$
$= \frac{1}{n}XX^\top - \frac{1}{n}XZ^\top\hat{\Lambda}^\top - \frac{1}{n}\hat{\Lambda}ZX^\top + \frac{1}{n}\hat{\Lambda}ZZ^\top\hat{\Lambda}^\top - \bar{x}\hat{\mu}^\top - \hat{\mu}\bar{x}^\top + \hat{\mu}\hat{\mu}^\top + \hat{\Lambda}\bar{z}\hat{\mu}^\top + \hat{\mu}\bar{z}^\top\hat{\Lambda}^\top.$

We substitute $\hat{\mu}$ with $\bar{x} - \hat{\Lambda}\bar{z}$:

$\bar{x}\hat{\mu}^\top = \bar{x}(\bar{x} - \hat{\Lambda}\bar{z})^\top = \bar{x}\bar{x}^\top - \bar{x}\bar{z}^\top\hat{\Lambda}^\top$

$\hat{\mu}\bar{x}^\top = (\bar{x} - \hat{\Lambda}\bar{z})\bar{x}^\top = \bar{x}\bar{x}^\top - \hat{\Lambda}\bar{z}\bar{x}^\top$

$\hat{\mu}\hat{\mu}^\top = (\bar{x} - \hat{\Lambda}\bar{z})(\bar{x} - \hat{\Lambda}\bar{z})^\top = \bar{x}\bar{x}^\top - \bar{x}\bar{z}^\top\hat{\Lambda}^\top - \hat{\Lambda}\bar{z}\bar{x}^\top + \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top$

$\hat{\Lambda}\bar{z}\hat{\mu}^\top = \hat{\Lambda}\bar{z}(\bar{x} - \hat{\Lambda}\bar{z})^\top = \hat{\Lambda}\bar{z}\bar{x}^\top - \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top$

$\hat{\mu}\bar{z}^\top\hat{\Lambda}^\top = (\bar{x} - \hat{\Lambda}\bar{z})\bar{z}^\top\hat{\Lambda}^\top = \bar{x}\bar{z}^\top\hat{\Lambda}^\top - \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top$

We return to compute $\hat{\Psi}$:

$\hat{\Psi} = \frac{1}{n}XX^\top - \frac{1}{n}XZ^\top\hat{\Lambda}^\top - \frac{1}{n}\hat{\Lambda}ZX^\top + \frac{1}{n}\hat{\Lambda}ZZ^\top\hat{\Lambda}^\top - \bar{x}\bar{x}^\top + \bar{x}\bar{z}^\top\hat{\Lambda}^\top - \bar{x}\bar{x}^\top + \hat{\Lambda}\bar{z}\bar{x}^\top + \bar{x}\bar{x}^\top - \bar{x}\bar{z}^\top\hat{\Lambda}^\top - \hat{\Lambda}\bar{z}\bar{x}^\top + \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top + \hat{\Lambda}\bar{z}\bar{x}^\top - \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top + \bar{x}\bar{z}^\top\hat{\Lambda}^\top - \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top$
$= \frac{1}{n}XX^\top - \frac{1}{n}XZ^\top\hat{\Lambda}^\top - \frac{1}{n}\hat{\Lambda}ZX^\top + \frac{1}{n}\hat{\Lambda}ZZ^\top\hat{\Lambda}^\top - \bar{x}\bar{x}^\top + \hat{\Lambda}\bar{z}\bar{x}^\top + \bar{x}\bar{z}^\top\hat{\Lambda}^\top - \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top.$ (11)

We continue by computing $\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top$.

$\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top \overset{(6)}{=} \hat{\Lambda}\left(\frac{1}{n}ZZ^\top - \bar{z}\bar{z}^\top\right)\hat{\Lambda}^\top = \frac{1}{n}\hat{\Lambda}ZZ^\top\hat{\Lambda}^\top - \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top.$ (12)

We observe that the above term (see (12)) is also included in $\hat{\Psi}$ (see (11)).

$\hat{\Sigma}_z\hat{\Lambda}^\top \overset{(7)}{=} \left(\frac{1}{n}ZZ^\top - \bar{z}\bar{z}^\top\right)\big(n\bar{z}\bar{z}^\top - ZZ^\top\big)^{-1}\big(n\bar{z}\bar{x}^\top - ZX^\top\big) = \left(\frac{1}{n}ZZ^\top - \bar{z}\bar{z}^\top\right)\left(-\frac{1}{n}\right)\left(\frac{1}{n}ZZ^\top - \bar{z}\bar{z}^\top\right)^{-1}\big(n\bar{z}\bar{x}^\top - ZX^\top\big) = \frac{1}{n}ZX^\top - \bar{z}\bar{x}^\top.$ (13)

Since $(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top)^\top = \hat{\Lambda}\hat{\Sigma}_z^\top\hat{\Lambda}^\top = \hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top$, the matrix $\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top$ is symmetric.

We have that:

$\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top = \hat{\Lambda}\big(\hat{\Sigma}_z\hat{\Lambda}^\top\big) \overset{(13)}{=} \hat{\Lambda}\left(\frac{1}{n}ZX^\top - \bar{z}\bar{x}^\top\right) = \frac{1}{n}\hat{\Lambda}ZX^\top - \hat{\Lambda}\bar{z}\bar{x}^\top.$ (14)

We also have that:

$\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top = \big(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top\big)^\top \overset{(14)}{=} \left(\frac{1}{n}\hat{\Lambda}ZX^\top - \hat{\Lambda}\bar{z}\bar{x}^\top\right)^\top = \frac{1}{n}XZ^\top\hat{\Lambda}^\top - \bar{x}\bar{z}^\top\hat{\Lambda}^\top.$ (15)

We continue by computing $\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi}$.

As we have already noticed above, $\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top$ is also included in $\hat{\Psi}$ (see (11) and (12)). In the computation of $\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi}$, we replace $\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top$ once with (14) and then with (15). We get:

$\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi} \overset{(11),(12),(14),(15)}{=} \frac{1}{n}XX^\top - \frac{1}{n}XZ^\top\hat{\Lambda}^\top - \frac{1}{n}\hat{\Lambda}ZX^\top - \bar{x}\bar{x}^\top + \hat{\Lambda}\bar{z}\bar{x}^\top + \bar{x}\bar{z}^\top\hat{\Lambda}^\top + \frac{1}{n}\hat{\Lambda}ZX^\top - \hat{\Lambda}\bar{z}\bar{x}^\top + \frac{1}{n}XZ^\top\hat{\Lambda}^\top - \bar{x}\bar{z}^\top\hat{\Lambda}^\top = \frac{1}{n}XX^\top - \bar{x}\bar{x}^\top.$ (16)

Observation: The result is exactly the maximum likelihood estimate of the covariance matrix $\Sigma$ of the input data set if $x \sim \mathcal{N}(\mu, \Sigma)$: $\Sigma_{\mathrm{MLE}} = \frac{1}{n}XX^\top - \bar{x}\bar{x}^\top$. This is natural because, according to the relationship (4), we have $x \sim \mathcal{N}(\mu + \Lambda\mu_z,\ \Lambda\Sigma_z\Lambda^\top + \Psi)$, and there are enough free parameters in $\Lambda\Sigma_z\Lambda^\top + \Psi$, i.e., $dD + \frac{d^2-d}{2} + d + \frac{D^2-D}{2} + D$ free parameters ($dD$ in $\Lambda$, $\frac{d^2-d}{2} + d$ in $\Sigma_z$, $\frac{D^2-D}{2} + D$ in $\Psi$), for it to become $\Sigma_{\mathrm{MLE}} = \frac{1}{n}XX^\top - \bar{x}\bar{x}^\top$, since $\Sigma$ has $\frac{D^2-D}{2} + D$ free parameters.

We return to the initial computation:

$\mathrm{predicted}_{\mathrm{S2.UncFA}}(x^*)$

$\overset{(4)}{=} \hat{\mu}_z + \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}(x^* - \hat{\mu} - \hat{\Lambda}\hat{\mu}_z)$

$\overset{(5),(13),(16)}{=} \bar{z} + \left(\frac{1}{n}ZX^\top - \bar{z}\bar{x}^\top\right)\left(\frac{1}{n}XX^\top - \bar{x}\bar{x}^\top\right)^{-1}\big(x^* - \bar{x} + \hat{\Lambda}\bar{z} - \hat{\Lambda}\bar{z}\big)$

$= \left(\frac{1}{n}ZX^\top - \bar{z}\bar{x}^\top\right)\left(\frac{1}{n}XX^\top - \bar{x}\bar{x}^\top\right)^{-1}x^* + \bar{z} - \left(\frac{1}{n}ZX^\top - \bar{z}\bar{x}^\top\right)\left(\frac{1}{n}XX^\top - \bar{x}\bar{x}^\top\right)^{-1}\bar{x}$

$= \big(n\bar{z}\bar{x}^\top - ZX^\top\big)\big(n\bar{x}\bar{x}^\top - XX^\top\big)^{-1}x^* + \bar{z} - \big(n\bar{z}\bar{x}^\top - ZX^\top\big)\big(n\bar{x}\bar{x}^\top - XX^\top\big)^{-1}\bar{x}$

$\overset{(1)}{=} \hat{w}x^* + \bar{z} - \hat{w}\bar{x}$

$\overset{(2)}{=} \hat{w}x^* + \hat{b}$

$= \mathrm{predicted}_{\mathrm{LR}}(x^*), \quad \forall x^* \in \mathbb{R}^{D\times 1}.$ □

4.1.3. The S2.FA Model. A New Approach for LR When D>>n

Since FA can be used to estimate the density of a data set when D>>n, and S2.UncFA is equivalent to LR, we consider S2.FA as a new approach to extend LR when D>>n besides the two solutions mentioned in Section 2.

4.2. The S3.FA Model

Factor analysis is a classic generative unsupervised model. Its supervised counterpart is S2.FA as shown in the previous subsection. Those two can be merged into a semisupervised model that we propose here, named simple-semisupervised factor analysis:

$z \sim \mathcal{N}(\mu_z, \sigma_z^2)$ — either observed or latent variable, $z \in \mathbb{R}$, $\mu_z \in \mathbb{R}$, $\sigma_z \in \mathbb{R}_+^*$

$x \mid z \sim \mathcal{N}(\mu + \Lambda z, \Psi)$, $x \in \mathbb{R}^{D\times 1}$, $\mu \in \mathbb{R}^{D\times 1}$, $\Lambda \in \mathbb{R}^{D\times 1}$, and $\Psi \in \mathbb{R}^{D\times D}$ a diagonal matrix.

Good hints for combining supervised–unsupervised counterparts (such as Gaussian naive Bayes and the GMM) into a semisupervised model can be found in [12]. We applied those hints to our supervised–unsupervised counterparts—S2.FA and FA—and created an EM algorithm to fit an S3.FA model. For the step-by-step algorithm and the matrix form of the algorithm, see Appendix C. For more elaborate notations and the proof, see [8] pp. 26–34.

4.3. The MS3.FA Model

The algorithm that fits an S3.FA model can also be adapted for the case when not all the components of $x$ are known. We call the resulting model missing simple-semisupervised factor analysis:

$z \sim \mathcal{N}(\mu_z, \sigma_z^2)$ — either observed or latent variable, $z \in \mathbb{R}$, $\mu_z \in \mathbb{R}$, $\sigma_z \in \mathbb{R}_+^*$

$x \mid z \sim \mathcal{N}(\mu + \Lambda z, \Psi)$, $x \in \mathbb{R}^{D\times 1}$, $\mu \in \mathbb{R}^{D\times 1}$, $\Lambda \in \mathbb{R}^{D\times 1}$, and $\Psi \in \mathbb{R}^{D\times D}$ a diagonal matrix; each component of $x$: $x_1, \dots, x_D$ is either observed or latent.

The resulting algorithm that fits an MS3.FA model is an EM algorithm. For the step-by-step algorithm, see Appendix D. For more elaborate notations and the proof, see [8] pp. 34–37.
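The E step of such a missing-data algorithm repeatedly conditions a joint Gaussian on whichever components happen to be observed (cf. the CondNormalNA function in Appendix D). A minimal NumPy sketch of that conditioning step, with our own function name and a NaN-based encoding of missing entries, is:

import numpy as np

def condition_gaussian(y, mean, cov):
    """y: 1-D array with np.nan at latent positions; returns the mean and
    covariance of the missing block given the observed one."""
    miss = np.isnan(y)
    obs = ~miss
    S_oo = cov[np.ix_(obs, obs)]
    S_mo = cov[np.ix_(miss, obs)]
    gain = S_mo @ np.linalg.inv(S_oo)
    cond_mean = mean[miss] + gain @ (y[obs] - mean[obs])
    cond_cov = cov[np.ix_(miss, miss)] - gain @ S_mo.T
    return cond_mean, cond_cov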

5. Experiments

In this section, we include the experiments we carried out on data with D>>n using the S2.FA, S3.FA, and MS3.FA models, comparing them with other methods. In all the experiments, we computed errors between the real values and the predicted values; the metric we used is mean squared error (MSE):

$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^N (\mathrm{real}_i - \mathrm{predicted}_i)^2,$

where $N$ is the number of unknown elements whose real and predicted values are $\mathrm{real}_i$ and $\mathrm{predicted}_i$, respectively; an unknown element represents an output number for S2.FA and S3.FA or an input/output number for MS3.FA. We ran each experiment five times and computed a 95% confidence interval using the t-distribution. Furthermore, in each experiment we used the same three data sets: the gas sensor array under flow modulation data set [13], atp1d [14], and m5spec.

All three of these data sets have multiple outputs, but for each data set, we selected only the first output column that appears in the text data file and used it as the output: ace_conc for the gas sensor array under flow modulation data set, LBL_ALLminpA_fut_001 for atp1d, and the first column in the propvals file for m5spec.

We preprocessed each data set simply by dropping the constant columns. Only the second data set has constant columns: from 432 columns, we obtain 370.

5.1. The S2.FA Model: Experiment

The experiment concerning S2.FA covers the comparison of the three solutions presented so far for LR when D>>n:

  • Moore–Penrose inverse

  • ridge regression—L2 regularization

  • S2.FA.

Each data set was split into a training part—80%—and a testing part—20%. If the model had hyperparameters (ridge regression), the training part was further split into a new training part—60% of the whole data set—and a validation part—20% of the whole data set—in order to be able to set the hyperparameters (for ridge regression, we used a simple technique: pick the $\alpha \in \{10^{-2}, 10^{-1.9}, 10^{-1.8}, \dots, 10^{1.9}, 10^{2}\}$ that attains the minimum validation error); after setting the hyperparameters, we trained a new model on the initial training part—80% of the whole data set—and obtained the final model. All of the MSE errors are reported on the testing part and shown in Table 1.
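For illustration, a minimal NumPy sketch of this validation-based selection of $\alpha$ (with our own function names, and the unpenalized-bias ridge solution from Section 2) is:

import numpy as np

def fit_ridge(X, y, alpha):
    # X: D x n training inputs, y: length-n outputs; returns the stacked (w, b)
    Xt = np.vstack([X, np.ones(X.shape[1])])
    reg = alpha * np.eye(Xt.shape[0])
    reg[-1, -1] = 0.0                     # do not penalize the intercept
    return np.linalg.solve(Xt @ Xt.T + reg, Xt @ y)

def predict(wb, X):
    return wb[:-1] @ X + wb[-1]

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def select_alpha(X_tr, y_tr, X_val, y_val,
                 grid=10.0 ** np.arange(-2, 2.01, 0.1)):
    # pick the alpha on the grid 10^-2, 10^-1.9, ..., 10^2 with minimum validation MSE
    errors = [mse(y_val, predict(fit_ridge(X_tr, y_tr, a), X_val)) for a in grid]
    return grid[int(np.argmin(errors))]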

Table 1.

Simple-supervised factor analysis (S2.FA) experiment: mean squared error (MSE) 95% confidence intervals on three data sets using three methods for regression when D>>n; the best MSE means are marked in bold.

Data Set/Method Moore–Penrose Ridge Regression S2.FA
Gas sensor array under flow modulation 0.0251 ± 0.0254 0.0062 ± 0.007 0.0452 ± 0.0208
atp1d 94,627.7239 ± 80,183.0076 27,770.9253 ± 42,887.5216 4724.2957 ± 1616.3341
m5spec 0.00004 ± 0.00001 0.02676 ± 0.01372 0.37344 ± 0.25025

As one may notice, the best method is different for each data set, so our general advice is to use all the methods on a given data set and pick the best one.

5.2. The S3.FA Model: Experiment

The experiment concerning S3.FA includes an analysis of algorithms for semisupervised regression when D>>n:

  • Moore–Penrose inverse—a supervised method

  • S2.FA—a supervised method

  • S3.FA—a semisupervised method

  • label propagation [15]—a semisupervised method: we used the function sslLabelProp in the SSL R package [16] with the parameter alpha set to 1.

Each data set was split into a training part—80%—and a testing part—20%. We retained from the training part 5%, 10%, 15%, 20%, 25%, …, 100% of the output labels. For the supervised methods, we used only the labeled data in the training set, and for the semisupervised methods, we used the full training set when fitting the model. We initialized the S3.FA method with the fitted parameters returned by the S2.FA algorithm. All of the MSE errors are reported on the testing part and shown in Figure 1, Figure 2 and Figure 3—inspired by [17]—and in Table 2. The figures contain all the results from using 5% to 100% of the output labels, but we include less information in the table for brevity.

Figure 1.


Simple-semisupervised factor analysis (S3.FA) experiment: MSE 95% confidence intervals on the gas sensor array under flow modulation data set using four methods for semisupervised regression when D>>n; $p \in \{5\%, 10\%, 15\%, \dots, 100\%\}$ of the training output labels are retained.

Figure 2.


S3.FA experiment: MSE 95% confidence intervals on the atp1d data set using four methods for semisupervised regression when D>>n; $p \in \{5\%, 10\%, 15\%, \dots, 100\%\}$ of the training output labels are retained.

Figure 3.


S3.FA experiment: MSE 95% confidence intervals on the m5spec data set using four methods for semisupervised regression when D>>n; $p \in \{5\%, 10\%, 15\%, \dots, 100\%\}$ of the training output labels are retained.

Table 2.

S3.FA experiment: MSE 95% confidence intervals on three data sets using four methods for semisupervised regression when D>>n; $p \in \{5\%, 10\%, 15\%, 30\%, 50\%, 70\%\}$ of the training output labels are retained; the best MSE means are marked in bold.

Data Set Method p=5% p=10% p=15%
Gas sensor array under flow modulation
Moore–Penrose 0.034 ± 0.0437 0.0292 ± 0.0421 0.0267 ± 0.0326
S2.FA 0.2511 ± 0.3953 0.0899 ± 0.2367 0.0285 ± 0.0333
S3.FA 3.9799 ± 6.7087 0.349 ± 0.6626 0.2294 ± 0.1561
Label Propagation 0.0853 ± 0.0073 0.0825 ± 0.0051 0.0708 ± 0.013
atp1d Moore–Penrose 3763.2939 ± 749.0966 3079.5042 ± 1842.7676 3428.7567 ± 1722.6437
S2.FA 2706.9906 ± 477.3245 2339.3504 ± 324.8581 2279.0802 ± 99.5443
S3.FA 3771.605 ± 1563.6543 2972.1137 ± 639.6724 2633.0816 ± 274.7409
Label Propagation 110,820.3235 ± 0 110,820.3235 ± 0 110,820.3235 ± 0
m5spec Moore–Penrose 0.4602 ± 0.354 0.3609 ± 0.7631 0.0187 ± 0.0221
S2.FA 0.7003 ± 0.3834 0.4269 ± 0.5441 0.4999 ± 0.5172
S3.FA 0.8133 ± 0.6366 1.653 ± 3.9497 0.5118 ± 0.5503
Label Propagation 0.2175 ± 0.0707 0.1939 ± 0.0706 0.2343 ± 0.0624
Data Set Method p=30% p=50% p=70%
Gas sensor array under flow modulation
Moore–Penrose 0.0313 ± 0.0259 0.0192 ± 0.0145 0.0132 ± 0.0075
S2.FA 0.0588 ± 0.0576 0.0233 ± 0.0224 0.0279 ± 0.0116
S3.FA 0.645 ± 1.3044 0.1666 ± 0.1059 0.0912 ± 0.0443
Label Propagation 0.0668 ± 0.0086 0.0675 ± 0.018 0.0605 ± 0.0122
atp1d Moore–Penrose 6401.7587 ± 3602.9182 7907.484 ± 4449.2312 36,041.874 ± 21,300.704
S2.FA 2444.9001 ± 404.2393 2439.162 ± 108.3605 2399.7948 ± 160.1038
S3.FA 2760.6602 ± 391.9217 2659.6472 ± 112.9795 2521.9861 ± 222.3765
Label Propagation 110,820.3235 ± 0 110,820.3235 ± 0 110,820.3235 ± 0
m5spec Moore–Penrose 0.00057 ± 0.00085 0.00009 ± 0.00008 0.00009 ± 0.00004
S2.FA 0.37449 ± 0.12508 0.35796 ± 0.07275 0.3995 ± 0.03164
S3.FA 0.37865 ± 0.12808 0.36181 ± 0.07463 0.40419 ± 0.03415
Label Propagation 0.19391 ± 0.02754 0.17424 ± 0.00857 0.17752 ± 0.01545

We notice that on the selected data sets, S3.FA returns poorer results even than S2.FA, which uses only the labeled data. As in the previous experiment, the best method is also data-dependent. The models that show a greater amount of variability compared to the others are S3.FA—in two data sets—and Moore–Penrose—in one data set. Moreover, as expected, the errors tend to decrease as the percentage of labeled training data points increases; the plots do not help us in this regard, but this decrease can be seen numerically in Table 2.

5.3. The MS3.FA Model: Experiment

The experiment concerning MS3.FA includes a comparison of two different types of algorithms for data imputation when D>>n:

  • Mean imputation: for a given attribute (input column), compute its mean ignoring the missing values, then replace the missing data on that attribute with this computed mean (a minimal code sketch is given right after this list)

  • MS3.FA.
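A minimal NumPy sketch of the mean-imputation baseline is the following; the function name is ours, and missing cells are encoded as NaN in an n × D float matrix (the transpose of the D × n layout used elsewhere in the paper).

import numpy as np

def mean_impute(X):
    # X: n x D float matrix with np.nan marking missing input cells
    X = X.copy()
    col_means = np.nanmean(X, axis=0)          # per-attribute mean over observed cells
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]            # fill each missing cell with its column mean
    return X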

We also tried two other R packages, mice [18] and Amelia [19], but they could not be applied successfully to our data sets, perhaps because of their peculiarity: D>>n.

For each data set, we removed 10%, 20%, 30%, 40%, 50%, and 60% of the input cells (we could have also added missing data in the output, but we wanted to focus on the missing input data scenario and not on the semisupervised case) and imputed those via the above-mentioned algorithms. The results are presented in Figure 4, Figure 5 and Figure 6 and in Table 3.

Figure 4.


Missing simple-semisupervised factor analysis (MS3.FA) experiment: MSE 95% confidence intervals on the gas sensor array under flow modulation data set using two methods for imputing missing data when D>>n; $p \in \{10\%, 20\%, 30\%, 40\%, 50\%, 60\%\}$ of the input data are removed.

Figure 5.


MS3.FA experiment: MSE 95% confidence intervals on the atp1d data set using two methods for imputing missing data when D>>n; $p \in \{10\%, 20\%, 30\%, 40\%, 50\%, 60\%\}$ of the input data are removed.

Figure 6.


MS3.FA experiment: MSE 95% confidence intervals on the m5spec data set using two methods for imputing missing data when D>>n; $p \in \{10\%, 20\%, 30\%, 40\%, 50\%, 60\%\}$ of the input data are removed.

Table 3.

MS3.FA experiment: MSE 95% confidence intervals on three data sets using two methods for imputing missing data when D>>n; $p \in \{10\%, 20\%, 30\%, 40\%, 50\%, 60\%\}$ of the input data are removed; the best MSE means are marked in bold.

Data Set Method p=10% p=20% p=30%
Gas sensor array under flow modulation
Mean 0.0108 ± 0.0022 0.0241 ± 0.0027 0.0359 ± 0.0011
MS3.FA 0.00603 ± 0.00029 0.01176 ± 0.00081 0.01747 ± 0.00094
atp1d Mean 3239.9757 ± 68.2511 6831.0884 ± 105.6119 10,077.1798 ± 117.4693
MS3.FA 1807.7748 ± 105.03 3697.1146 ± 124.2197 5276.899 ± 104.6252
m5spec Mean 0.00013 ± 0.00001 0.00026 ± 0 0.00039 ± 0.00001
MS3.FA 0.14835 ± 0.000002 0.148433 ± 0.000002 0.148518 ± 0.000002
Data Set Method p=40% p=50% p=60%
Gas sensor array under flow modulation
Mean 0.0494 ± 0.0042 0.0597 ± 0.0015 0.0731 ± 0.0012
MS3.FA 0.02372 ± 0.00159 0.0299 ± 0.00047 0.0364 ± 0.0011
atp1d Mean 13,423.3625 ± 252.7735 16,899.3166 ± 223.6971 20,156.2493 ± 62.1299
MS3.FA 7071.6793 ± 143.3232 8921.5319 ± 165.5759 10,558.5723 ± 53.1182
m5spec Mean 0.000521 ± 0.000005 0.000658 ± 0.000008 0.00080 ± 0.000005
MS3.FA 0.1486 ± 0.00001 0.14869 ± 0.00001 0.148781 ± 0.000001

From these results, we discover that MS3.FA is better than mean imputation on two data sets, and, as expected, the error increases as the percentage of missing data increases.

6. Conclusions and Future Work

The initial purpose of this paper was to extend an already existing model: factor analysis. We developed its supervised counterpart (S2.FA) and noticed that the unconstrained version (S2.UncFA) is equivalent to linear regression. Because FA is applied in density estimation when the dimensionality of the data is greater than the number of samples, and because of the already mentioned equivalence, the purpose of the paper became to analyze this new method of applying LR when D>>n, i.e., via S2.FA. Since FA and S2.FA are generative models and are unsupervised–supervised counterparts, we combined both into a new model S3.FA as an extension of LR to semisupervised learning when D>>n. The final extension regards missing data; it is called MS3.FA. We developed an R package (s2fa) with these algorithms; it can be found on GitHub. The experimental parts included several comparisons in the D>>n scenario:

  • of S2.FA with other techniques extending LR to the D>>n case,

  • of S3.FA with other (semi)supervised regression methods,

  • of MS3.FA with another data imputation algorithm.

The bottom line is that we do not necessarily recommend S3.FA for semisupervised regression since our results suggest that it gives poor results, but we encourage the consideration of S2.FA for regression and MS3.FA for missing data imputation as algorithms to be compared with others on a given data set.

As for future work, we could further explore the S2.FA, S3.FA, and MS3.FA algorithms when z is a real vector, not just a real number. Moreover, we can experiment with the PPCA version of the algorithms. Questions regarding the time complexity—empirical or not—can be also addressed; we expect the fitting time to be impractical if the number of columns is large. Because there are models such as mixture of factor analyzers [20] and mixture of linear regression models ([21] Section 14.5.1), another research direction involves mixtures of S2.FAs. Another idea would be to investigate the memory resources required by the algorithms when the data set increases and also to consider scalable systems such as Spark [22] for implementation.

Acknowledgments

We thank Cristian Gaţu and Daniel Stamate for attentive proofreading and useful comments on previous versions of this paper.

Abbreviations

   The following abbreviations are used in this manuscript:

GMM Gaussian mixture model
LR Linear regression
EM Expectation–maximization
KL Kullback–Leibler
FA Factor analysis
PCA Principal component analysis
PPCA Probabilistic principal component analysis
GPLVM Gaussian process latent variable model
MLE Maximum likelihood estimation
UncFA Unconstrained factor analysis
S2.UncFA Simple-supervised unconstrained factor analysis
S2.FA Simple-supervised factor analysis
S2.PPCA Simple-supervised probabilistic principal component analysis
S3.UncFA Simple-semisupervised unconstrained factor analysis
S3.FA Simple-semisupervised factor analysis
S3.PPCA Simple-semisupervised probabilistic principal component analysis
MS3.UncFA Missing simple-semisupervised unconstrained factor analysis
MS3.FA Missing simple-semisupervised factor analysis
MS3.PPCA Missing simple-semisupervised probabilistic principal component analysis
MSE Mean squared error
ELBO Evidence lower bound

Appendix A. On the Expectation–Maximization Algorithm

This section provides theoretical details on the EM algorithm [23]. These are relevant in order to establish a link between EM and the information theory field.

A latent variable model assumes that the data we observed—usually, denoted by the random variable X—is not complete: there is also some latent data modeled via random variables—usually, denoted by Z. Often, pairs consisting of an observed point and a latent one constitute the complete dataset, e.g., in a GMM the latent data is the cluster number and such a number is assigned to each point in the observed dataset.

Usually, in a latent variable model the likelihood of the observed data is not tractable—although there are exceptions, like in GMM or factor analysis—and therefore we cannot maximize it directly. Instead we maximize a lower bound for the log-likelihood function called ELBO (evidence lower bound). To simplify the discussion, we consider that we have one single observed datapoint, x. We will also consider the case where Z is continuous; when Z is discrete the ∫ sign is replaced by the ∑ sign.

Let q be any distribution over z, where the z values are the possible values of the random variable Z. The log-likelihood of the observed datapoint x is:   

$\log p(x;\theta) \overset{\text{marginal}}{=} \log \int_z p(x,z;\theta)\,dz = \log \int_z \frac{q(z)}{q(z)}\,p(x,z;\theta)\,dz = \log \int_z q(z)\,\frac{p(x,z;\theta)}{q(z)}\,dz = \log \mathbb{E}_q\left[\frac{p(x,z;\theta)}{q(z)}\right] \overset{\text{Jensen}}{\geq} \mathbb{E}_q\left[\log \frac{p(x,z;\theta)}{q(z)}\right] \eqqcolon \mathrm{ELBO}(\theta, q).$

Furthermore, the following relationships important for the E step of the EM algorithm—see below—can be proven:

$\log p(x;\theta) - \mathrm{ELBO}(\theta,q) = \mathrm{KL}\big(q(\cdot)\,\|\,p(\cdot \mid x;\theta)\big) \Leftrightarrow \log p(x;\theta) = \mathrm{ELBO}(\theta,q) + \mathrm{KL}\big(q(\cdot)\,\|\,p(\cdot \mid x;\theta)\big) \Leftrightarrow \mathrm{ELBO}(\theta,q) = \log p(x;\theta) - \mathrm{KL}\big(q(\cdot)\,\|\,p(\cdot \mid x;\theta)\big).$

Moreover, the following relationships important for the M step of the EM algorithm—see below—can be proven:

$\mathrm{ELBO}(\theta,q) \overset{\text{def.}}{=} \mathbb{E}_q\left[\log \frac{p(x,z;\theta)}{q(z)}\right] = \int_z q(z)\log\frac{p(x,z;\theta)}{q(z)}\,dz = \int_z q(z)\log p(x,z;\theta)\,dz - \int_z q(z)\log q(z)\,dz = \mathbb{E}_q[\log p(x,z;\theta)] + H(q).$

Now, instead of carrying out $\max_\theta \log p(x;\theta)$, we will execute $\max_{\theta,q} \mathrm{ELBO}(\theta,q)$, since ELBO is a lower bound for the log-likelihood of $x$ and hence its maximization will not hurt the process of maximizing $\log p(x;\theta)$.

The ELBO can be maximized in at least two ways:

  • via (block) coordinate ascent

    The resulting meta-algorithm is the EM algorithm.

    In fact, this is the case for many classic models: EM for the GMM [3], EM for factor analysis [5], etc.

    EM is an iterative algorithm and an iteration encompasses two steps:

    1. E step:

      $q^{(t)} = \arg\max_q \mathrm{ELBO}(\theta^{(t-1)}, q)$ for $\theta^{(t-1)}$ fixed—from the previous iteration.

      Since $\mathrm{ELBO}(\theta,q) = \log p(x;\theta) - \mathrm{KL}(q(\cdot)\,\|\,p(\cdot \mid x;\theta))$, we have:
      $q^{(t)} = \arg\max_q \mathrm{ELBO}(\theta^{(t-1)}, q) = \arg\max_q \big(\log p(x;\theta^{(t-1)}) - \mathrm{KL}(q(\cdot)\,\|\,p(\cdot \mid x;\theta^{(t-1)}))\big) = \arg\min_q \mathrm{KL}(q(\cdot)\,\|\,p(\cdot \mid x;\theta^{(t-1)})) \overset{\text{KL property}}{=} p(\cdot \mid x;\theta^{(t-1)}).$

      (In this case, we have $\log p(x;\theta^{(t-1)}) = \mathrm{ELBO}(\theta^{(t-1)}, q^{(t)})$.)

      So, we obtained the distribution $q^{(t)}$ as a posterior distribution. Although in classic models where conjugate priors are used it is tractable to compute $p(\cdot \mid x;\theta^{(t-1)})$—this type of inference is called analytical inference—in other models this is not the case, and a solution to this shortcoming is represented by approximate/variational inference.

    2. M step:

      $\theta^{(t)} = \arg\max_\theta \mathrm{ELBO}(\theta, q^{(t)})$ for $q^{(t)}$ fixed—from the E step.

      Since $\mathrm{ELBO}(\theta,q) = \mathbb{E}_q[\log p(x,z;\theta)] + H(q)$, we have:
      $\theta^{(t)} = \arg\max_\theta \mathrm{ELBO}(\theta, q^{(t)}) = \arg\max_\theta \big(\mathbb{E}_{q^{(t)}}[\log p(x,z;\theta)] + H(q^{(t)})\big) = \arg\max_\theta \mathbb{E}_{q^{(t)}}[\log p(x,z;\theta)].$

      So, we obtained a relatively simpler term to maximize. Note that the maximization is further customized using the probabilistic assumptions at hand.

  • via gradient ascent: this is the case of the variational autoencoder [24], which will not be discussed since it is not necessarily relevant to this study.
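For completeness, a generic skeleton of the coordinate-ascent EM loop described above might look as follows; the e_step/m_step callables and the stopping rule are placeholders introduced here for illustration, while the concrete instances used in this paper are spelled out in Appendices C and D.

def em(theta0, e_step, m_step, log_likelihood, n_max_iterations=100, eps=1e-6):
    theta = theta0
    ll = log_likelihood(theta)
    for _ in range(n_max_iterations):
        q = e_step(theta)          # E step: q^(t) = p(. | x; theta^(t-1))
        theta = m_step(q)          # M step: maximize E_q[log p(x, z; theta)]
        new_ll = log_likelihood(theta)
        if abs(new_ll - ll) <= eps * abs(ll):   # relative-change stopping rule
            break
        ll = new_ll
    return theta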

Appendix B. S2.FA

Algorithm A1 S2.FA—nonmatrix form.
1: function Train($\{(x^{(i)}, z^{(i)}) \mid i \in \{1,\dots,n\}\}$)
2:     $\bar{x} = \frac{\sum_{i=1}^n x^{(i)}}{n}$
3:     $\hat{\mu}_z = \frac{\sum_{i=1}^n z^{(i)}}{n} = \bar{z}$
4:     $\hat{\Sigma}_z = \frac{\sum_{i=1}^n (z^{(i)} - \hat{\mu}_z)(z^{(i)} - \hat{\mu}_z)^\top}{n}$
5:     $\hat{\Lambda} = \left(n\bar{x}\bar{z}^\top - \sum_{i=1}^n x^{(i)} z^{(i)\top}\right)\left(n\bar{z}\bar{z}^\top - \sum_{i=1}^n z^{(i)} z^{(i)\top}\right)^{-1}$
6:     $\hat{\mu} = \bar{x} - \hat{\Lambda}\bar{z}$
7:     $\hat{\Psi} = \mathrm{diag}\left(\frac{\sum_{i=1}^n (x^{(i)} - \hat{\mu} - \hat{\Lambda} z^{(i)})(x^{(i)} - \hat{\mu} - \hat{\Lambda} z^{(i)})^\top}{n}\right)$
8:     return ($\hat{\mu}_z$, $\hat{\Sigma}_z$, $\hat{\Lambda}$, $\hat{\mu}$, $\hat{\Psi}$)
9: function Test($x^*$, ($\hat{\mu}_z$, $\hat{\Sigma}_z$, $\hat{\Lambda}$, $\hat{\mu}$, $\hat{\Psi}$))
10:    value = $\hat{\mu}_z + \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}(x^* - \hat{\mu} - \hat{\Lambda}\hat{\mu}_z)$
11:    covarianceMatrix = $\hat{\Sigma}_z - \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}\hat{\Lambda}\hat{\Sigma}_z$
12:    return (value, covarianceMatrix)

Algorithm A2 S2.FA—matrix form.
1: function Train($X$, $Z$)    ▹ $X = \begin{pmatrix} x^{(1)} & \cdots & x^{(n)} \end{pmatrix}$
2:                             ▹ $Z = \begin{pmatrix} z^{(1)} & \cdots & z^{(n)} \end{pmatrix}$
3:     $\bar{x} = \frac{\sum_{i=1}^n X_{:i}}{n}$
4:     $\hat{\mu}_z = \frac{\sum_{i=1}^n Z_{:i}}{n} = \bar{z}$
5:     $\hat{\Sigma}_z = \frac{1}{n}\big(Z - \hat{\mu}_z \mathbf{1}\big)\big(Z - \hat{\mu}_z \mathbf{1}\big)^\top = \frac{1}{n} Z Z^\top - \hat{\mu}_z \hat{\mu}_z^\top$, with $\mathbf{1} = (1 \cdots 1) \in \mathbb{R}^{1\times n}$    ▹ 2 ways to compute
6:     $\hat{\Lambda} = \big(n\bar{x}\hat{\mu}_z^\top - X Z^\top\big)\big(n\hat{\mu}_z\hat{\mu}_z^\top - Z Z^\top\big)^{-1}$
7:     $\hat{\mu} = \bar{x} - \hat{\Lambda}\bar{z}$
8:     $\hat{\Psi} = \mathrm{diag}\left(\frac{1}{n}\big(X - \hat{\mu}\mathbf{1} - \hat{\Lambda}Z\big)\big(X - \hat{\mu}\mathbf{1} - \hat{\Lambda}Z\big)^\top\right)$, with $\mathbf{1} = (1 \cdots 1) \in \mathbb{R}^{1\times n}$
9:     return ($\hat{\mu}_z$, $\hat{\Sigma}_z$, $\hat{\Lambda}$, $\hat{\mu}$, $\hat{\Psi}$)
10: function Test($x^*$, ($\hat{\mu}_z$, $\hat{\Sigma}_z$, $\hat{\Lambda}$, $\hat{\mu}$, $\hat{\Psi}$))
11:    value = $\hat{\mu}_z + \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}(x^* - \hat{\mu} - \hat{\Lambda}\hat{\mu}_z)$
12:    covarianceMatrix = $\hat{\Sigma}_z - \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}\hat{\Lambda}\hat{\Sigma}_z$
13:    return (value, covarianceMatrix)

Appendix C. S3.FA

The algorithms below are more general: z is not just a real number, as we state in the paper, but a real vector.

Algorithm A3 S3.FA—nonmatrix form.
1: function LogLikelihood($\{(x^{(1)},z^{(1)}),\dots,(x^{(a)},z^{(a)}),x^{(a+1)},\dots,x^{(n)}\}$, $(\mu_z,\Sigma_z,\mu,\Lambda,\Psi)$)
2:     return $\sum_{i=1}^a \big[\ln \mathcal{N}(z^{(i)} \mid \mu_z,\Sigma_z) + \ln \mathcal{N}(x^{(i)} \mid \mu+\Lambda z^{(i)},\Psi)\big] + \sum_{i=a+1}^n \ln \mathcal{N}(x^{(i)} \mid \mu+\Lambda\mu_z,\ \Lambda\Sigma_z\Lambda^\top+\Psi)$
3: function Train($\{(x^{(1)},z^{(1)}),\dots,(x^{(a)},z^{(a)}),x^{(a+1)},\dots,x^{(n)}\}$, nMaxIterations, eps)
4:     $\bar{x} = \frac{\sum_{i=1}^n x^{(i)}}{n}$
5:     $\theta^{(0)}$ = initializeParameters($\{(x^{(1)},z^{(1)}),\dots,(x^{(a)},z^{(a)}),x^{(a+1)},\dots,x^{(n)}\}$)
6:     lRV_Do^(0) = LogLikelihood($\{(x^{(1)},z^{(1)}),\dots,(x^{(a)},z^{(a)}),x^{(a+1)},\dots,x^{(n)}\}$, $\theta^{(0)}$)
7:     for t = 0:nMaxIterations do
8:         E step: compute $E[Z^{(i)}]$, $E[Z^{(i)}Z^{(i)\top}]$, $\forall i \in \{a+1,\dots,n\}$:
9:             $E_{Z^{(i)}\mid X^{(i)}=x^{(i)},\,\theta^{(t)}}[Z^{(i)}] = \mu_z^{(t)} + \Sigma_z^{(t)}\Lambda^{(t)\top}\big(\Lambda^{(t)}\Sigma_z^{(t)}\Lambda^{(t)\top}+\Psi^{(t)}\big)^{-1}\big(x^{(i)} - \mu^{(t)} - \Lambda^{(t)}\mu_z^{(t)}\big)$
10:            $E_{Z^{(i)}\mid X^{(i)}=x^{(i)},\,\theta^{(t)}}[Z^{(i)}Z^{(i)\top}] = \Sigma_z^{(t)} + E[Z^{(i)}]\,E[Z^{(i)}]^\top$
11:        M step: compute $\theta^{(t+1)} = (\mu_z^{(t+1)},\Sigma_z^{(t+1)},\mu^{(t+1)},\Lambda^{(t+1)},\Psi^{(t+1)})$:
12:            $\mu_z^{(t+1)} = \frac{1}{n}\Big(\sum_{i=1}^a z^{(i)} + \sum_{i=a+1}^n E[Z^{(i)}]\Big)$
13:            $\Sigma_z^{(t+1)} = \frac{1}{n}\Big(\sum_{i=1}^a (z^{(i)}-\mu_z^{(t+1)})(z^{(i)}-\mu_z^{(t+1)})^\top + \sum_{i=a+1}^n \big(E[Z^{(i)}Z^{(i)\top}] - E[Z^{(i)}]\mu_z^{(t+1)\top} - \mu_z^{(t+1)}E[Z^{(i)}]^\top + \mu_z^{(t+1)}\mu_z^{(t+1)\top}\big)\Big)$
14:            $\Lambda^{(t+1)} = \Big(n\bar{x}\mu_z^{(t+1)\top} - \sum_{i=1}^a x^{(i)}z^{(i)\top} - \sum_{i=a+1}^n x^{(i)}E[Z^{(i)}]^\top\Big)\Big(n\mu_z^{(t+1)}\mu_z^{(t+1)\top} - \sum_{i=1}^a z^{(i)}z^{(i)\top} - \sum_{i=a+1}^n E[Z^{(i)}Z^{(i)\top}]\Big)^{-1}$
15:            $\mu^{(t+1)} = \bar{x} - \Lambda^{(t+1)}\mu_z^{(t+1)}$
16:            $\Psi^{(t+1)} = \mathrm{diag}\Big(\frac{1}{n}\Big(\sum_{i=1}^a \big(x^{(i)}-\mu^{(t+1)}-\Lambda^{(t+1)}z^{(i)}\big)\big(x^{(i)}-\mu^{(t+1)}-\Lambda^{(t+1)}z^{(i)}\big)^\top + \sum_{i=a+1}^n \big((x^{(i)}-\mu^{(t+1)})(x^{(i)}-\mu^{(t+1)})^\top - (x^{(i)}-\mu^{(t+1)})E[Z^{(i)}]^\top\Lambda^{(t+1)\top} - \Lambda^{(t+1)}E[Z^{(i)}](x^{(i)}-\mu^{(t+1)})^\top + \Lambda^{(t+1)}E[Z^{(i)}Z^{(i)\top}]\Lambda^{(t+1)\top}\big)\Big)\Big)$
17:        lRV_Do^(t+1) = LogLikelihood($\{(x^{(1)},z^{(1)}),\dots,(x^{(a)},z^{(a)}),x^{(a+1)},\dots,x^{(n)}\}$, $\theta^{(t+1)}$)
18:        if $\|\theta^{(t)}-\theta^{(t+1)}\|_2^2 \,/\, \|\theta^{(t)}\|_2^2 \le$ eps or |lRV_Do^(t) − lRV_Do^(t+1)| / |lRV_Do^(t)| ≤ eps then
19:            break
20:     return $\theta^{(t)}$
21: function Test($x^*$, ($\hat{\mu}_z$, $\hat{\Sigma}_z$, $\hat{\Lambda}$, $\hat{\mu}$, $\hat{\Psi}$))    ▹ The same as in S2.FA
22:    value = $\hat{\mu}_z + \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}(x^* - \hat{\mu} - \hat{\Lambda}\hat{\mu}_z)$
23:    covarianceMatrix = $\hat{\Sigma}_z - \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}\hat{\Lambda}\hat{\Sigma}_z$
24:    return (value, covarianceMatrix)

Algorithm A4 S3.FA—matrix form.
1: function LogLikelihood($X$, $Z$, $(\mu_z,\Sigma_z,\mu,\Lambda,\Psi)$)    ▹ $X = \begin{pmatrix} x^{(1)} & \cdots & x^{(n)} \end{pmatrix}$
2:                                                                          ▹ $Z = \begin{pmatrix} z^{(1)} & \cdots & z^{(a)} \end{pmatrix}$
3:     return $\sum_{i=1}^a \big[\ln \mathcal{N}(Z_{:i} \mid \mu_z,\Sigma_z) + \ln \mathcal{N}(X_{:i} \mid \mu+\Lambda Z_{:i},\Psi)\big] + \sum_{i=a+1}^n \ln \mathcal{N}(X_{:i} \mid \mu+\Lambda\mu_z,\ \Lambda\Sigma_z\Lambda^\top+\Psi)$
4: function Train($X$, $Z$, nMaxIterations, eps)    ▹ $X = \begin{pmatrix} x^{(1)} & \cdots & x^{(n)} \end{pmatrix}$
5:                                                  ▹ $Z = \begin{pmatrix} z^{(1)} & \cdots & z^{(a)} \end{pmatrix}$
6:     $\bar{x} = \frac{\sum_{i=1}^n X_{:i}}{n}$
7:     $\theta^{(0)}$ = initializeParameters($X$, $Z$)
8:     lRV_Do^(0) = LogLikelihood($X$, $Z$, $\theta^{(0)}$)
9:     for t = 0:nMaxIterations do
10:        E step: compute $E[Z^{(i)}]$, $E[Z^{(i)}Z^{(i)\top}]$, $\forall i \in \{a+1,\dots,n\}$:
11:            $\mathrm{E\_Z} = \mu_z^{(t)} + \Sigma_z^{(t)}\Lambda^{(t)\top}\big(\Lambda^{(t)}\Sigma_z^{(t)}\Lambda^{(t)\top}+\Psi^{(t)}\big)^{-1}\big(X_{:,(a+1):n} - \mu^{(t)}\mathbf{1} - \Lambda^{(t)}\mu_z^{(t)}\mathbf{1}\big)$, with $\mathbf{1} = (1 \cdots 1) \in \mathbb{R}^{1\times(n-a)}$
12:            $\mathrm{E\_Z\_Z\_T} = \Sigma_z^{(t)} + \mathrm{E\_Z}\,\mathrm{E\_Z}^\top$
13:            $\mathrm{E\_Z} = \begin{pmatrix} Z & \mathrm{E\_Z} \end{pmatrix}$    ▹ column-wise concatenation with the observed outputs
14:            $\mathrm{E\_Z\_Z\_T} = Z Z^\top + \mathrm{E\_Z\_Z\_T}$
15:        M step: compute $\theta^{(t+1)} = (\mu_z^{(t+1)},\Sigma_z^{(t+1)},\mu^{(t+1)},\Lambda^{(t+1)},\Psi^{(t+1)})$:
16:            $\mu_z^{(t+1)} = \frac{1}{n}\sum_{i=1}^n \mathrm{E\_Z}_{:i}$
17:            $\Sigma_z^{(t+1)} = \frac{1}{n}\mathrm{E\_Z\_Z\_T} - \mu_z^{(t+1)}\mu_z^{(t+1)\top}$
18:            $\Lambda^{(t+1)} = \big(n\bar{x}\mu_z^{(t+1)\top} - X\,\mathrm{E\_Z}^\top\big)\big(n\mu_z^{(t+1)}\mu_z^{(t+1)\top} - \mathrm{E\_Z\_Z\_T}\big)^{-1}$
19:            $\mu^{(t+1)} = \bar{x} - \Lambda^{(t+1)}\mu_z^{(t+1)}$
20:            $\Psi^{(t+1)} = \mathrm{diag}\Big(\frac{1}{n}\big((X-\mu^{(t+1)}\mathbf{1})(X-\mu^{(t+1)}\mathbf{1})^\top - (X-\mu^{(t+1)}\mathbf{1})\,\mathrm{E\_Z}^\top\Lambda^{(t+1)\top} - \Lambda^{(t+1)}\mathrm{E\_Z}\,(X-\mu^{(t+1)}\mathbf{1})^\top + \Lambda^{(t+1)}\mathrm{E\_Z\_Z\_T}\,\Lambda^{(t+1)\top}\big)\Big)$, with $\mathbf{1} = (1 \cdots 1) \in \mathbb{R}^{1\times n}$
21:        lRV_Do^(t+1) = LogLikelihood($X$, $Z$, $\theta^{(t+1)}$)
22:        if $\|\theta^{(t)}-\theta^{(t+1)}\|_2^2 \,/\, \|\theta^{(t)}\|_2^2 \le$ eps or |lRV_Do^(t) − lRV_Do^(t+1)| / |lRV_Do^(t)| ≤ eps then
23:            break
24:     return $\theta^{(t)}$
25: function Test($x^*$, ($\hat{\mu}_z$, $\hat{\Sigma}_z$, $\hat{\Lambda}$, $\hat{\mu}$, $\hat{\Psi}$))    ▹ The same as in S2.FA
26:    value = $\hat{\mu}_z + \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}(x^* - \hat{\mu} - \hat{\Lambda}\hat{\mu}_z)$
27:    covarianceMatrix = $\hat{\Sigma}_z - \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}\hat{\Lambda}\hat{\Sigma}_z$
28:    return (value, covarianceMatrix)

Appendix D. MS3.FA

The algorithms below are more general: z is not just a real number, as we state in the paper, but a real vector.

Algorithm A5 MS3.FA—Other functions.
1: function CondNormalNA($y$, ($\mu$, $\Sigma$))
2:     $I = \{i \mid y_i = \mathrm{NA}\}$    ▹ I = Indexes
3:     $OI = \{i \mid y_i \neq \mathrm{NA}\}$    ▹ OI = Other Indexes
4:     mean $= \mu_I + \Sigma_{I,OI}\,\Sigma_{OI,OI}^{-1}\,(y_{OI} - \mu_{OI})$
5:     covarianceMatrix $= \Sigma_{I,I} - \Sigma_{I,OI}\,\Sigma_{OI,OI}^{-1}\,\Sigma_{I,OI}^\top$
6:     $E[Y]_I$ = mean
7:     $E[Y]_{OI} = y_{OI}$
8:     $E[YY^\top] = E[Y]\,E[Y]^\top$
9:     $E[YY^\top]_{I,I} = E[YY^\top]_{I,I}$ + covarianceMatrix
10:    return ($E[Y]$, $E[YY^\top]$)
11: function FullNormal($\mu_z$, $\Sigma_z$, $\mu$, $\Lambda$, $\Psi$)
12:    mean $= \begin{pmatrix} \mu + \Lambda\mu_z \\ \mu_z \end{pmatrix}$
13:    covarianceMatrix $= \begin{pmatrix} \Lambda\Sigma_z\Lambda^\top + \Psi & \Lambda\Sigma_z \\ (\Lambda\Sigma_z)^\top & \Sigma_z \end{pmatrix}$
14:    return (mean, covarianceMatrix)
15: function LogLikelihood($\{(x^{(1)},z^{(1)}),\dots,(x^{(n)},z^{(n)})\}$, ($\mu_z$, $\Sigma_z$, $\mu$, $\Lambda$, $\Psi$))
16:    (mean, cov) = FullNormal($\mu_z$, $\Sigma_z$, $\mu$, $\Lambda$, $\Psi$)
17:    result = 0
18:    for i = 1:n do
19:        $y^{(i)} = \begin{pmatrix} x^{(i)} \\ z^{(i)} \end{pmatrix}$
20:        $OI = \{j \mid y^{(i)}_j \neq \mathrm{NA}\}$    ▹ OI = Other indexes
21:        result = result + $\ln \mathcal{N}\big(y^{(i)}_{OI} \mid \mathrm{mean}_{OI}, \mathrm{cov}_{OI,OI}\big)$
22:    return result

Algorithm A6 MS3.FA—Train.
1: function Train($\{(x^{(1)},z^{(1)}),\dots,(x^{(n)},z^{(n)})\}$, nMaxIterations, eps)    ▹ data can have NAs
2:     $\theta^{(0)}$ = initializeParameters($\{(x^{(1)},z^{(1)}),\dots,(x^{(n)},z^{(n)})\}$)
3:     lRV_Do^(0) = LogLikelihood($\{(x^{(1)},z^{(1)}),\dots,(x^{(n)},z^{(n)})\}$, $\theta^{(0)}$)
4:     for t = 0:nMaxIterations do
5:         E step: compute the following for all $i \in \{1,\dots,n\}$:
6:             $y^{(i)} = \begin{pmatrix} x^{(i)} \\ z^{(i)} \end{pmatrix}$
7:             fullNormal = FullNormal($\mu_z^{(t)}$, $\Sigma_z^{(t)}$, $\mu^{(t)}$, $\Lambda^{(t)}$, $\Psi^{(t)}$)
8:             ($E[Y^{(i)}]$, $E[Y^{(i)}Y^{(i)\top}]$) = CondNormalNA($y^{(i)}$, fullNormal)
9:             $E[X^{(i)}] = E[Y^{(i)}]_{1:D}$    ▹ $D = \dim(x^{(i)})$
10:            $E[Z^{(i)}] = E[Y^{(i)}]_{(D+1):(D+d)}$    ▹ $d = \dim(z^{(i)})$
11:            $E[X^{(i)}X^{(i)\top}] = E[Y^{(i)}Y^{(i)\top}]_{1:D,\,1:D}$
12:            $E[X^{(i)}Z^{(i)\top}] = E[Y^{(i)}Y^{(i)\top}]_{1:D,\,(D+1):(D+d)}$
13:            $E[Z^{(i)}X^{(i)\top}] = E[X^{(i)}Z^{(i)\top}]^\top$
14:            $E[Z^{(i)}Z^{(i)\top}] = E[Y^{(i)}Y^{(i)\top}]_{(D+1):(D+d),\,(D+1):(D+d)}$
15:        M step: compute $\theta^{(t+1)} = (\mu_z^{(t+1)},\Sigma_z^{(t+1)},\mu^{(t+1)},\Lambda^{(t+1)},\Psi^{(t+1)})$:
16:            $\mu_z^{(t+1)} = \frac{1}{n}\sum_{i=1}^n E[Z^{(i)}]$
17:            $\Sigma_z^{(t+1)} = \frac{1}{n}\sum_{i=1}^n \big(E[Z^{(i)}Z^{(i)\top}] - E[Z^{(i)}]\mu_z^{(t+1)\top} - \mu_z^{(t+1)}E[Z^{(i)}]^\top + \mu_z^{(t+1)}\mu_z^{(t+1)\top}\big)$
18:            $\bar{x} = \frac{1}{n}\sum_{i=1}^n E[X^{(i)}]$
19:            $\Lambda^{(t+1)} = \Big(n\bar{x}\mu_z^{(t+1)\top} - \sum_{i=1}^n E[X^{(i)}Z^{(i)\top}]\Big)\Big(n\mu_z^{(t+1)}\mu_z^{(t+1)\top} - \sum_{i=1}^n E[Z^{(i)}Z^{(i)\top}]\Big)^{-1}$
20:            $\mu^{(t+1)} = \bar{x} - \Lambda^{(t+1)}\mu_z^{(t+1)}$
21:            $\Psi^{(t+1)} = \mathrm{diag}\Big(\frac{1}{n}\sum_{i=1}^n \big(E[X^{(i)}X^{(i)\top}] - E[X^{(i)}]\mu^{(t+1)\top} - \mu^{(t+1)}E[X^{(i)}]^\top + \mu^{(t+1)}\mu^{(t+1)\top} - E[X^{(i)}Z^{(i)\top}]\Lambda^{(t+1)\top} + \mu^{(t+1)}E[Z^{(i)}]^\top\Lambda^{(t+1)\top} - \Lambda^{(t+1)}E[Z^{(i)}X^{(i)\top}] + \Lambda^{(t+1)}E[Z^{(i)}]\mu^{(t+1)\top} + \Lambda^{(t+1)}E[Z^{(i)}Z^{(i)\top}]\Lambda^{(t+1)\top}\big)\Big)$
22:        lRV_Do^(t+1) = LogLikelihood($\{(x^{(1)},z^{(1)}),\dots,(x^{(n)},z^{(n)})\}$, $\theta^{(t+1)}$)
23:        if $\|\theta^{(t)}-\theta^{(t+1)}\|_2^2 \,/\, \|\theta^{(t)}\|_2^2 \le$ eps or |lRV_Do^(t) − lRV_Do^(t+1)| / |lRV_Do^(t)| ≤ eps then
24:            break
25:     return $\theta^{(t)}$

Algorithm A7 MS3.FA—Test and Impute.
1: function Test($y^* = (x^*, z^*)$ partially known, ($\hat{\mu}_z$, $\hat{\Sigma}_z$, $\hat{\Lambda}$, $\hat{\mu}$, $\hat{\Psi}$))
2:     fullNormal = FullNormal($\hat{\mu}_z$, $\hat{\Sigma}_z$, $\hat{\mu}$, $\hat{\Lambda}$, $\hat{\Psi}$)
3:     ($E[Y^*]$, $E[Y^*Y^{*\top}]$) = CondNormalNA($y^*$, fullNormal)
4:     value = $E[Y^*]$
5:     covarianceMatrix = $E[Y^*Y^{*\top}] - E[Y^*]\,E[Y^*]^\top$
6:     return (value, covarianceMatrix)
7: function Impute($y^* = (x^*, z^*)$ partially known, ($\hat{\mu}_z$, $\hat{\Sigma}_z$, $\hat{\Lambda}$, $\hat{\mu}$, $\hat{\Psi}$))
8:     (value, covarianceMatrix) = Test($y^*$, ($\hat{\mu}_z$, $\hat{\Sigma}_z$, $\hat{\Lambda}$, $\hat{\mu}$, $\hat{\Psi}$))
9:     return value

Author Contributions

Conceptualization, S.C. and L.C.; methodology, S.C. and L.C.; software, S.C.; validation, S.C.; formal analysis, S.C.; investigation, S.C.; resources, L.C.; data curation; writing—original draft preparation, S.C.; writing—review and editing, S.C. and L.C.; visualization, S.C. and L.C.; supervision, L.C.; project administration, S.C. and L.C.; funding acquisition. Both authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available data sets were analyzed in this study. This data can be found here: http://archive.ics.uci.edu/ml/datasets/Gas+sensor+array+under+flow+modulation (accessed on 31 July 2021) for the gas sensor array under flow modulation data set, https://www.openml.org/d/41475 (accessed on 31 July 2021) for atp1d, http://www.eigenvector.com/data/Corn (accessed on 31 July 2021) for m5spec.

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Mitchell T. 2017. [(accessed on 31 July 2021)]. Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression. (Additional Chapter to Machine Learning; McGraw-Hill: New York, NY, USA, 1997.) Published Online. Available online: https://bit.ly/39Ueb4o. [Google Scholar]
  • 2.Murphy K. Machine Learning: A Probabilistic Perspective. MIT Press; Cambridge, MA, USA: 2012. [Google Scholar]
  • 3.Ng A. Machine Learning Course, Lecture Notes, Mixtures of Gaussians and the EM Algorithm. [(accessed on 31 July 2021)]; Available online: http://cs229.stanford.edu/notes2020spring/cs229-notes7b.pdf.
  • 4.Singh A. Machine Learning Exercise Book (In Romanian) Alexandru Ioan Cuza University of Iași; Iași, Romania: 2019. [(accessed on 31 July 2021)]. Machine Learning Course, Homework 4, pr 1.1; CMU: Pittsburgh, PA, USA, 2010; p. 528 in Ciortuz, L.; Munteanu, A.; Bădărău, E. Available online: https://bit.ly/320ZuIk. [Google Scholar]
  • 5.Ng A. Machine Learning Course, Lecture Notes, Part X. [(accessed on 31 July 2021)]; Available online: http://cs229.stanford.edu/notes2020spring/cs229-notes9.pdf.
  • 6.Goodfellow I., Bengio Y., Courville A. Deep Learning. MIT Press; Cambridge, MA, USA: 2016. [(accessed on 31 July 2021)]. Available online: http://www.deeplearningbook.org. [Google Scholar]
  • 7.Tipping M.E., Bishop C.M. Probabilistic Principal Component Analysis. [(accessed on 31 July 2021)];J. R. Stat. Soc. Ser. (Stat. Methodol.) 1999 61:611–622. doi: 10.1111/1467-9868.00196. Available online: https://bit.ly/2PCxoRr. [DOI] [Google Scholar]
  • 8.Ciobanu S. Master’s Thesis. Alexandru Ioan Cuza University of Iași; Iași, Romania: 2019. [(accessed on 31 July 2021)]. Exploiting a New Probabilistic Model: Simple-Supervised Factor Analysis. Available online: https://bit.ly/31UsBx6. [Google Scholar]
  • 9.Ng A. Machine Learning Course, Lecture Notes, Part XI. [(accessed on 31 July 2021)]; Available online: http://cs229.stanford.edu/notes2020spring/cs229-notes10.pdf.
  • 10.Lawrence N.D. Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data. [(accessed on 31 July 2021)];Adv. Neural Inf. Process. Syst. 2004 :329–336. Available online: https://papers.nips.cc/paper/2540-gaussian-process-latent-variable-models-for-visualisation-of-high-dimensional-data.pdf. [Google Scholar]
  • 11.Gao X., Wang X., Tao D., Li X. Supervised Gaussian Process Latent Variable Model for Dimensionality Reduction. [(accessed on 31 July 2021)];IEEE Trans. Syst. Man, Cybern. Part (Cybern.) 2010 41:425–434. doi: 10.1109/TSMCB.2010.2057422. Available online: https://ieeexplore.ieee.org/document/5545418. [DOI] [PubMed] [Google Scholar]
  • 12.Mitchell T., Xing E., Singh A. Machine Learning Exercise Book (In Romanian) Alexandru Ioan Cuza University of Iași; Iași, Romania: 2019. [(accessed on 31 July 2021)]. Machine Learning Course, Midterm Exam, pr. 5.3; CMU: Pittsburgh, PA, USA, 2010; p. 565 Ciortuz, L.; Munteanu, A.; Bădărău, E. Available online: https://bit.ly/320ZuIk. [Google Scholar]
  • 13.Ziyatdinov A., Fonollosa J., Fernández L., Gutierrez-Gálvez A., Marco S., Perera A. Bioinspired early detection through gas flow modulation in chemo-sensory systems. Sens. Actuators Chem. 2015;206:538–547. doi: 10.1016/j.snb.2014.09.001. [DOI] [Google Scholar]
  • 14.Spyromitros-Xioufis E., Tsoumakas G., Groves W., Vlahavas I. Drawing parallels between multi-label classification and multi-target regression. arXiv 2014, arXiv:1211.6581v2. [Google Scholar]
  • 15.Xiaojin Z., Zoubin G. Learning from Labeled and Unlabeled Data with Label Propagation. Carnegie Mellon University; Pittsburgh, PA, USA: 2002. Technical Report CMU-CALD-02–107. [Google Scholar]
  • 16.Wang J. [(accessed on 31 July 2021)];SSL: Semi-Supervised Learning. 2016 R Package Version 0.1. Available online: https://CRAN.R-project.org/package=SSL.
  • 17.Oliver A., Odena A., Raffel C., Cubuk E.D., Goodfellow I.J. Realistic evaluation of deep semi-supervised learning algorithms. arXiv 2018, arXiv:1804.09170. [Google Scholar]
  • 18.van Buuren S., Groothuis-Oudshoorn K. Mice: Multivariate Imputation by Chained Equations in R. [(accessed on 31 July 2021)];J. Stat. Softw. 2011 45:1–67. doi: 10.18637/jss.v045.i03. Available online: https://www.jstatsoft.org/v45/i03/ [DOI] [Google Scholar]
  • 19.Honaker J., King G., Blackwell M. Amelia II: A Program for Missing Data. [(accessed on 31 July 2021)];J. Stat. Softw. 2011 45:1–47. doi: 10.18637/jss.v045.i07. Available online: http://www.jstatsoft.org/v45/i07/ [DOI] [Google Scholar]
  • 20.Ghahramani Z., Hinton G.E. The EM Algorithm for Mixtures of Factor Analyzers. University of Toronto; Toronto, ON, Canada: 1996. [(accessed on 31 July 2021)]. Technical Report, CRG-TR-96-1. Available online: http://mlg.eng.cam.ac.uk/zoubin/papers/tr-96-1.pdf. [Google Scholar]
  • 21.Bishop C.M. Pattern Recognition and Machine Learning. Springer Science + Business Media; Berlin, Germany: 2006. [Google Scholar]
  • 22.Zaharia M., Xin R.S., Wendell P., Das T., Armbrust M., Dave A., Meng X., Rosen J., Venkataraman S., Franklin M.J., et al. Apache Spark: A Unified Engine for Big Data Processing. Commun. ACM. 2016;59:56–65. doi: 10.1145/2934664. [DOI] [Google Scholar]
  • 23.Dempster A.P., Laird N.M., Rubin D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. (Methodol.) 1977;39:1–22. [Google Scholar]
  • 24.Kingma D.P., Welling M. Auto-encoding variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
