Abstract
Wyner’s common information is a measure that quantifies the commonality between two random variables. Based on it, we introduce a novel two-step procedure to construct features from data, referred to as Common Information Components Analysis (CICA). The first step can be interpreted as an extraction of Wyner’s common information. The second step is a form of back-projection of the common information onto the original variables, leading to the extracted features. A free parameter controls the complexity of the extracted features. We establish that, in the case of Gaussian statistics, CICA precisely reduces to Canonical Correlation Analysis (CCA), where the free parameter determines the number of CCA components that are extracted. In this sense, we establish a novel rigorous connection between information measures and CCA, of which CICA is a strict generalization. We also show that CICA has several desirable features, including a natural extension to more than two data sets.
Keywords: common information, dimensionality reduction, feature extraction, unsupervised, canonical correlation analysis, CCA
1. Introduction
Understanding relations between two (or more) sets of variates is key to many tasks in data analysis and beyond. To approach this problem, it is natural to reduce each set of variates separately in such a way that the reduced descriptions fully capture the commonality between the two sets, while suppressing aspects that are individual to each set. This makes it possible to understand the relation between the two sets without obfuscation. A popular framework to accomplish this task follows the classical viewpoint of dimensionality reduction and is referred to as Canonical Correlation Analysis (CCA) [1]. CCA seeks the best linear extraction, i.e., it considers linear projections of the original variates. Via the so-called kernel trick, this can be extended to cover arbitrary (fixed) function classes.
Wyner’s common information is a well-known and established measure of the dependence of two random variables. Intuitively, it seeks to extract a third random variable such that the two random variables are conditionally independent given the third, while at the same time keeping the third variable as compact as possible. Compactness is measured in terms of the mutual information that the third random variable retains about the original two. The resulting optimization problem is not convex (because the constraint set is not a convex set), and therefore, not surprisingly, closed-form solutions are rare. A natural generalization of Wyner’s common information is obtained by replacing the constraint of conditional independence by a limit on the conditional mutual information. If the limit is set equal to zero, we return precisely to the case of conditional independence. Exactly like mutual information, Wyner’s common information and its generalization are endowed with a clear operational meaning: they characterize the fundamental limits of data compression (in the Shannon sense) for a certain network situation.
1.1. Related Work
Connections between CCA and Wyner’s common information have been explored in the past. It is well known that, for Gaussian vectors, (standard, non-relaxed) Wyner’s common information is attained by all of the CCA components together, see [2]. This has been further interpreted, see, e.g., [3]. Needless to say, having all of the CCA components together essentially amounts to a one-to-one transform of the original data into a new basis. It does not yet capture the idea of feature extraction or dimensionality reduction. To put our work into context, it is only the relaxation of Wyner’s common information [4,5] that permits conceptualizing the sequential, one-by-one recovery of the CCA components, and thus the spirit of dimensionality reduction.
CCA also appears in a number of other problems related to information measures and probabilistic models. For example, in the so-called Gaussian information bottleneck problem, the optimizing solution can be expressed in terms of the CCA components [6], and an interpretation of CCA as a (Gaussian) probabilistic model was presented in [7].
Generalizations of CCA have appeared before in the literature. The most prominent is built around maximal correlation. Here, one seeks arbitrary remappings of the original data in such a way as to maximize their correlation coefficient. This perspective culminates in the well-known alternating conditional expectation (ACE) algorithm [8].
Feature extraction and dimensionality reduction have a vast literature attached to them, and it is beyond the scope of the present article to provide a comprehensive overview. In a part of that literature, information measures play a key role. Prominent examples are independent component analysis (ICA) [9] and the information bottleneck [10,11], amongst others. More recently, feature extraction alternatives based on information theory are presented in [12,13]. In [12], the estimation of Rényi’s quadratic entropy is studied, whereas, in [13], standard information-theoretic measures such as the Kullback–Leibler divergence are used for fault diagnosis. Other, more loosely related feature extraction methods that perform dimensionality reduction on a single dataset include [14,15,16,17,18,19,20]. More concretely, in [14], a sparse Support Vector Machine (SVM) approach is used for feature extraction. In [15], feature extraction is performed via regression by using curvilinearity instead of linearity. In [16], compressed sensing is used to extract features when the data have a sparse representation. In [17], an invariant mapping method is invoked to map high-dimensional data to low-dimensional data based on a neighborhood relation. In [18], feature extraction is performed by partially learning the geometry of the manifold. In [19], the distance correlation measure (a measure with properties similar to the regular Pearson correlation coefficient) is proposed as the basis of a new feature extraction method. In [20], kernel principal component analysis is used to perform feature extraction and allow for the extraction of nonlinearities. In [21], feature extraction is done by a robust regression-based approach and, in [22], a linear regression approach is used to extract features.
1.2. Contributions
The contributions of our work are the following:
We introduce a novel suite of algorithms, referred to as CICA. These algorithms are characterized by a two-step procedure. In the first step, a relaxation of Wyner’s common information is extracted. The second step can be interpreted as a form of projection of the common information back onto the original data so as to obtain the respective features. A free parameter is introduced to control the complexity of the extracted features.
We establish that, for the special case where the original data are jointly Gaussian, our algorithms precisely extract the CCA components. In this case, the parameter determines how many of the CCA components are extracted. In this sense, we establish a new rigorous connection between information measures and CCA.
We present initial results on how to extend CICA to more than two variates.
Via a number of paradigmatic examples, we illustrate that, for discrete data, CICA gives intuitively pleasing results while other methods, including CCA, do not. This is most pronounced in a simple example with three sources described in Section 7.1.
1.3. Notation
A bold capital letter such as X denotes a random vector, and the corresponding bold lowercase letter x its realization. The probability distribution of a random variable X will be denoted by p_X or p(x), depending on the context. A non-bold capital letter such as K denotes a (fixed) matrix, and K^H its Hermitian transpose. Specifically, K_X denotes the covariance matrix of the random vector X, and K_{XY} denotes the covariance matrix between the random vectors X and Y. Let P be the set of all probability distributions, discrete or continuous depending on the context. We denote by I_n the identity matrix of dimension n and by 0_{m×n} the zero matrix of dimension m × n. We denote the lower convex envelope of a function f(x, y) with respect to x by lce_x f(x, y); for random variables, the lower convex envelope is taken with respect to the corresponding conditional distribution. We denote by h_b(p) the binary entropy function for p ∈ [0, 1].
1.4. A Simple Example with Synthetic Data
To set the stage and under the guise of an informal problem statement, let us consider a simple example involving synthetic data. Specifically, we consider two-dimensional data, that is, the vectors X and Y are of length 2. The goal is to extract, separately from each of the two, a one-dimensional description in such a way as to extract the commonality between X and Y while suppressing their individual features. For simplicity, in the present artificial example, we will assume that the entries of the vectors only take values in the small finite set {0, 1, 2, 3}. To illustrate the point, we consider the following special statistical model:
X = (X_1, X_2) = (U ⊕ A, A)     (1)
and
Y = (Y_1, Y_2) = (U ⊕ B, B),     (2)
where U, A, and B are mutually independent uniform random variables over the set {0, 1, 2, 3}, and ⊕ denotes addition modulo 4.
The reason for this special statistical structure is that it is obvious what should be extracted: X should be reduced to U, and Y should also be reduced to U (in each case, U is the modulo-4 difference of the first and the second component). This reduces both X and Y to “one-dimensional” descriptions, and these one-dimensional descriptions capture precisely the dependence between X and Y. In this simple example, all the commonality between X and Y is captured by U. More formally, conditioned on U, the vectors X and Y are conditionally independent.
The interesting point of this example is that any pair of components of X and Y are independent of each other, such as, for example, X_1 and Y_1. Therefore, the joint covariance matrix of the merged vector (X, Y) is a scaled identity matrix. This implies that any method that only uses the covariance matrix as input, including CCA, cannot find any commonalities between X and Y in this example.
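To make this concrete, the following minimal sketch enumerates the model of Equations (1) and (2) exhaustively (using the auxiliary names U, A, and B introduced above) and prints the covariance matrix of the merged vector. It is an illustrative check, not part of the original experiments.

```python
import itertools
import numpy as np

# Exhaustive check of the synthetic model in Equations (1) and (2):
# X = ((U + A) mod 4, A), Y = ((U + B) mod 4, B), with U, A, B i.i.d. uniform on {0,1,2,3}.
samples = []
for u, a, b in itertools.product(range(4), repeat=3):   # all 64 equally likely triples
    x = ((u + a) % 4, a)
    y = ((u + b) % 4, b)
    samples.append(x + y)                                # merged vector (X1, X2, Y1, Y2)

Z = np.array(samples, dtype=float)
cov = np.cov(Z, rowvar=False, bias=True)                 # exact covariance of the model
print(np.round(cov, 3))
# Output is 1.25 * I_4: every cross-covariance vanishes, so CCA (whose only
# input is this covariance matrix) cannot detect the commonality carried by U.
```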
By contrast, the algorithmic procedure discussed in the present paper will correctly extract the desired answer. In Figure 1, we show numerical simulation outcomes for a couple of approaches. Specifically, in (a), we can see that, in this particular example, CCA fails to extract the common features. This, of course, was done on purpose: for the synthetic data at hand, the global covariance matrix is merely a scaled identity matrix, and, since CCA’s only input is the covariance matrix, it does not actually do anything in this example. In (b), we show the performance of the approximate gradient-descent-based implementation of the CICA algorithm proposed in this paper, as detailed in Section 6. In this simple example, it precisely coincides with the ideal theoretical performance of CICA as in Generic Procedure 1, but, in general, the gradient-descent-based implementation is not guaranteed to find the ideal solution.
Figure 1.
The situation for the synthetic data example described in Section 1.4. Figure 1a shows the scatterplot of the two one-dimensional features extracted by CCA. Evidently, the approach is not able to extract the commonality between the vectors X and Y in this synthetic example. Figure 1b shows the performance of the heuristic CICA algorithm described in Section 6, which, in this simple example, ends up matching the ideal theoretical performance of CICA as in Generic Procedure 1 on the data samples.
At this point, we should stress that, for such a simple example, many other approaches would also lead to the same, correct answer. One of them is maximal correlation. In that perspective, one seeks to separately reduce X and Y by applying possibly nonlinear functions f and g in such a way as to maximize the correlation between f(X) and g(Y). Clearly, for the simple example at hand, selecting f(X) = X_1 ⊖ X_2 and g(Y) = Y_1 ⊖ Y_2 (where ⊖ denotes subtraction modulo 4) leads to correlation one, and is thus a maximizer.
Finally, the present example is also too simplistic to express the finer information-theoretic structure of the problem. One step up is the example presented in Section 5 below, where the commonality between X and Y is not merely an equality (the component U above), but rather a probabilistic dependency.
2. Wyner’s Common Information and Its Relaxation
The main framework and underpinning of the proposed algorithm is Wyner’s common information and its extension, which is briefly reviewed in the sequel, along with its key properties.
2.1. Wyner’s Common Information
Wyner’s common information is defined for two random variables X and Y of arbitrary fixed joint distribution p_{X,Y}.
Definition 1
([23]). For random variables X and Y with joint distribution p_{X,Y}, Wyner’s common information is defined as
C(X;Y) = inf_{p(w|x,y) : X − W − Y} I(X, Y; W),     (3)
where the notation X − W − Y indicates that X and Y are conditionally independent given W, i.e., that they form a Markov chain.
Wyner’s common information satisfies a number of interesting properties. We state some of them below in Lemma 1 for a generalized definition given in Definition 2.
We note that explicit formulas for Wyner’s common information are known only for a small number of special cases. The case of the doubly symmetric binary source is solved completely in [23] and can be written as
C(X;Y) = 1 + h_b(a_0) − 2 h_b(a),     (4)
where a_0 denotes the probability that the two sources are unequal (assuming, without loss of generality, a_0 ≤ 1/2) and a = (1 − √(1 − 2 a_0))/2. In this case, the optimizing W in Equation (3) can be chosen to be binary. Further special cases of discrete-alphabet sources appear in [24].
Moreover, when X and Y are jointly Gaussian with correlation coefficient ρ, then C(X;Y) = (1/2) log((1 + |ρ|)/(1 − |ρ|)). Note that, for this example, this is never smaller than the mutual information I(X;Y) = (1/2) log(1/(1 − ρ²)). This case was solved in [25,26] using a parameterization of conditionally independent distributions, and we have recently found an alternative proof that also extends to the generalization of Wyner’s common information discussed in the next subsection [5].
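As a small numerical illustration of these closed-form expressions (working in bits), consider the following sketch; the function names are ours.

```python
import numpy as np

def gaussian_wyner_ci(rho):
    """Wyner's common information of two jointly Gaussian scalars, in bits."""
    return 0.5 * np.log2((1 + abs(rho)) / (1 - abs(rho)))

def gaussian_mutual_info(rho):
    """Mutual information of two jointly Gaussian scalars, in bits."""
    return 0.5 * np.log2(1.0 / (1 - rho ** 2))

for rho in (0.1, 0.5, 0.9):
    print(rho, round(gaussian_mutual_info(rho), 3), round(gaussian_wyner_ci(rho), 3))
# The common information always dominates the mutual information,
# and both diverge as |rho| tends to 1.
```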
2.2. A Natural Relaxation of Wyner’s Common Information
A natural generalization of Wyner’s common information (Definition 1) is obtained by replacing the constraint of conditional independence with a limit on the conditional mutual information, leading to the following:
Definition 2
(from [4,5,27]). For random variables X and Y with joint distribution p_{X,Y}, we define
C_γ(X;Y) = inf_{p(w|x,y) : I(X;Y|W) ≤ γ} I(X, Y; W).     (5)
This definition appears in slightly different form in Wyner’s original paper (Section 4.2 in [23]), where an auxiliary quantity satisfying a closely related condition is defined. The above definition first appears in [4]. Comparing Definitions 1 and 2, we see that, for γ = 0, we recover the regular Wyner’s common information. In this sense, one may refer to C_γ(X;Y) as the relaxed Wyner’s common information.
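For finite alphabets, the objective and the constraint appearing in Definition 2 are straightforward to evaluate from a joint probability array. The following minimal helper (names and layout are our own) computes both quantities, and then checks the trivial choice W = X, which satisfies the constraint even for γ = 0 at a cost of H(X), thereby giving an upper bound on C(X;Y).

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a probability array of any shape."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def wyner_terms(p_xyw):
    """For a joint pmf array p[x, y, w], return (I(X,Y;W), I(X;Y|W)) in bits."""
    p_xy = p_xyw.sum(axis=2)
    p_w = p_xyw.sum(axis=(0, 1))
    p_xw = p_xyw.sum(axis=1)
    p_yw = p_xyw.sum(axis=0)
    i_xy_w = entropy(p_xy) + entropy(p_w) - entropy(p_xyw)                        # objective
    i_x_y_given_w = entropy(p_xw) + entropy(p_yw) - entropy(p_w) - entropy(p_xyw) # constraint
    return i_xy_w, i_x_y_given_w

# Trivial feasible point for gamma = 0: choose W = X.  Then I(X;Y|W) = 0 and
# the objective equals H(X) (here 1 bit).
a0 = 0.1
p_xy = np.array([[0.5 * (1 - a0), 0.5 * a0],
                 [0.5 * a0, 0.5 * (1 - a0)]])
p_xyw = np.zeros((2, 2, 2))
for x in range(2):
    for y in range(2):
        p_xyw[x, y, x] = p_xy[x, y]
print(wyner_terms(p_xyw))   # approximately (1.0, 0.0)
```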
In line with the discussion following Definition 1, it is not surprising that explicit solutions to the optimization problem in Definition 2 are very rare. In fact, the only presently known general solution concerns the case of jointly Gaussian random variables [5]. The corresponding formula is given below in Theorem 1.
By contrast, the case of the doubly symmetric binary source remains open. An upper bound for this case is given by choosing the auxiliary W as
| (6) | 
where V is Bernoulli with probability and U is Bernoulli with probability . Thus, the upper bound is
| (7) | 
where is chosen such that
| (8) | 
where
| (9) | 
Numerical studies (Section 3.5 in [28]) suggest that this upper bound is tight, but no formal proof is available to date.
The following lemma summarizes some basic properties of C_γ(X;Y).
Lemma 1
(partially from [5]). C_γ(X;Y) satisfies the following basic properties:
- 1.
 
- 2.
 Data processing inequality: If forms a Markov chain,
then
- 3.
 is a convex and continuous function of γ for
- 4.
 Tensorization: For n independent pairs we have that
where the min is over all non-negative satisfying
- 5.
 If forms a Markov chain, then
- 6.
 The cardinality of may be restricted to
- 7.
 If and are one-to-one functions, then
- 8.
 For discrete we have
Proofs of items 1–4 are given in [5], and the proofs of items 5–8 are given in Appendix A.
2.3. The Non-Convexity of the Relaxed Wyner’s Common Information Problem
It is important to observe that the optimization problem of Definition 2 is not a convex problem. First, we observe that the objective I(X, Y; W) is indeed a convex function of p(w|x, y), which is a well-known fact, see, e.g., (Theorem 2.7.4 in [29]). The issue is with the constraint set: the set of conditional distributions p(w|x, y) for which I(X;Y|W) ≤ γ is not a convex set. To provide some intuition for the structure of this set, let us consider I(X;Y|W) as a function of p(w|x, y) and examine its (non-)convexity. The relation between the two is described by the epigraph
epi = { (p(w|x, y), γ) : I(X;Y|W) ≤ γ }.     (10)
The function I(X;Y|W) is convex in p(w|x, y) if and only if its epigraph is a convex set, which would imply that the set of distributions for which I(X;Y|W) ≤ γ is also convex. We now present an example showing that I(X;Y|W) is not a convex function of p(w|x, y).
Example 1.
Let the distributions be
(11) respectively. For this example, one can evaluate numerically that, under we have and under we have . By the same token, one can show that, under we have . Hence, we conclude that, for this example,
(12) which proves that cannot be convex.
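The specific distributions of Example 1 are not reproduced here, but the non-convexity is easy to exhibit numerically with a different, simple construction of our own: take the two endpoint conditionals W = X and W = Y, for which I(X;Y|W) = 0, and evaluate their midpoint. The sketch below shows that the midpoint value is strictly positive, so the function cannot be convex.

```python
import numpy as np

def H(q):
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

def cond_mi(p_xyw):
    """I(X;Y|W) in bits for a joint pmf array p[x, y, w]."""
    return H(p_xyw.sum(axis=1)) + H(p_xyw.sum(axis=0)) \
        - H(p_xyw.sum(axis=(0, 1))) - H(p_xyw)

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])                  # a correlated binary pair

# Endpoint conditionals: p0 encodes W = X, p1 encodes W = Y.  Both render X and Y
# conditionally independent given W, so I(X;Y|W) = 0 at both endpoints.
p0 = np.zeros((2, 2, 2))
p1 = np.zeros((2, 2, 2))
for x in range(2):
    for y in range(2):
        p0[x, y, x] = 1.0                      # p0(w | x, y) = 1{w = x}
        p1[x, y, y] = 1.0                      # p1(w | x, y) = 1{w = y}

def f(p_w_given_xy):
    return cond_mi(p_xy[:, :, None] * p_w_given_xy)

mid = 0.5 * (p0 + p1)
print(f(p0), f(p1), f(mid))
# f(p0) = f(p1) = 0 while f(mid) > 0: the value at the midpoint exceeds the
# average of the endpoint values, so I(X;Y|W) is not convex in p(w|x,y).
```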
2.4. The Operational Significance of the Relaxed Wyner’s Common Information Problem
It is important to note that Wyner’s common information has a clear and well-defined operational significance. This is perhaps not central to the detailed explanations and examples given in the sequel; however, it plays a role in appreciating the rigorous connection established in our work. An excellent description of the operational significance is given in (Section I in [23]), where two separate aspects are identified. The first concerns a source coding scenario with a single encoder and two decoders, one interested in X and the other in Y. Three bit streams are constructed by the encoder: one public stream (available to both decoders) and two private streams, one for each decoder. Then, C(X;Y) characterizes the minimum number of bits that must be sent on the public stream such that the total number of bits sent stays at the global minimum, which is well known to be the joint entropy of X and Y. If the rate on the public bit stream dips below C(X;Y), it is no longer possible to keep the total rate at the joint entropy. Rather, there is now a strict penalty, and this penalty can be expressed via C_γ(X;Y). The second rigorous operational significance concerns the distributed generation of correlated randomness. We have two separate processors, one generating X and the other Y. For a fixed desired resulting probability distribution p_{X,Y}, how many common random bits (shared between both processors) are required? Again, the answer is precisely C(X;Y). A connection between caching and the Gray–Wyner network is developed in [30].
3. The Algorithm
The main technical result of this paper is to establish that the outcome of a specific procedure induced by the relaxed Wyner’s common information is tantamount to CCA whenever the original underlying distribution is Gaussian. In preparation for this, we present the proposed algorithm in this section. In doing so, we will assume that the distribution p_{X,Y} of the data is given. In many applications involving CCA, the data distribution may not be known; rather, a number of samples of X and Y are provided, based on which CCA would then estimate the covariance matrix. A similar perspective can be taken on our procedure, but it is left for future work. A short discussion can be found in Section 8 below.
3.1. High-Level Description
The proposed algorithm takes as input the distribution of the data, as well as a level γ. The level γ is a non-negative real number and may be thought of as a resolution level or a measure of coarseness: if γ = 0, then the full commonality (or common information) between X and Y is extracted, in the sense that, conditioned on the common information, X and Y are conditionally independent. Conversely, if γ is large, then only the most important part of the commonality is extracted. Fixing the level γ, the idea of the proposed algorithm is to evaluate the relaxed Wyner’s common information of Equation (5) between the information sources (data sets) at the chosen level γ. This evaluation will come with an associated conditional distribution p(w|x, y), namely, the conditional distribution attaining the minimum in the optimization problem of Equation (5). The second half of the proposed algorithm consists of leveraging the minimizing p(w|x, y) in such a way as to separately reduce X and Y to those features that best express the commonality. This may be thought of as a type of projection of the minimizing random variable W back onto X and Y, respectively. For the case of Gaussian statistics, this can be made precise.
3.2. Main Steps of the Algorithm
The algorithm proposed here starts from the joint distribution of the data, p_{X,Y}. Estimates of this distribution can be obtained from data samples of X and Y via standard techniques. The main steps of the procedure can then be described as follows:
Generic Procedure 1
(CICA).
- 1.
 Select a non-negative real number γ. This is the compression level: a low value of γ represents low compression, and, thus, many components are retained. A high value of γ represents high compression, and, thus, only a small number of components are retained.
- 2.
 Solve the relaxed Wyner’s common information problem,
C_γ(X;Y) = inf_{p(w|x,y) : I(X;Y|W) ≤ γ} I(X, Y; W),     (13)
leading to an associated minimizing conditional distribution p(w|x, y).
- 3.
 Using the conditional distribution p(w|x, y) found in Step 2, the dimension-reduced data sets can now be found via one of the following three variants:
- (a)
 Version 1: MAP (maximum a posteriori):
û(x) = arg max_w p_{W|X}(w|x),     (14)
v̂(y) = arg max_w p_{W|Y}(w|y).     (15)
- (b)
 Version 2: Conditional Expectation:
û(x) = E[W | X = x],     (16)
v̂(y) = E[W | Y = y].     (17)
- (c)
 Version 3: Marginal Integration:
(18)
(19)
The present paper focuses on the three versions given here because, first, for these three versions, we can establish Theorem 2, showing that, in the case of Gaussian statistics, all three versions lead exactly to CCA. Second, for concrete examples, it is often evident which of the versions is preferable. For example, in Section 5, we consider a binary example where the associated W in Step 2 of our algorithm is also binary. In this case, Version 1 will reduce the original binary vector to a binary scalar, which is perhaps the most desirable outcome. By contrast, Versions 2 and 3 require an explicit embedding of the binary example into the reals, and will reduce the original binary vector to a real-valued scalar, which might not be as insightful.
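To make Step 3 concrete for finite alphabets, the following sketch takes a minimizing conditional distribution p(w|x, y) from Step 2 as given (as a NumPy array) and computes the reductions of Versions 1 and 2, following the reconstruction of Equations (14)–(17) above. The function and variable names are ours, and the real-valued embedding of W used for Version 2 is an arbitrary choice.

```python
import numpy as np

def step3_reductions(p_xy, p_w_given_xy, w_values=None):
    """Given p(x,y) and a minimizing p(w|x,y) from Step 2 (finite alphabets,
    all marginal probabilities assumed positive), return the Version 1 (MAP)
    and Version 2 (conditional expectation) reductions for every x and y."""
    p_xyw = p_xy[:, :, None] * p_w_given_xy            # joint p(x, y, w)
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    p_w_given_x = p_xyw.sum(axis=1) / p_x[:, None]     # p(w | x)
    p_w_given_y = p_xyw.sum(axis=0) / p_y[:, None]     # p(w | y)

    map_x = p_w_given_x.argmax(axis=1)                 # Version 1: arg max_w p(w|x)
    map_y = p_w_given_y.argmax(axis=1)                 # Version 1: arg max_w p(w|y)
    if w_values is None:                               # real-valued embedding of W
        w_values = np.arange(p_w_given_xy.shape[2], dtype=float)
    ce_x = p_w_given_x @ w_values                      # Version 2: E[W | X = x]
    ce_y = p_w_given_y @ w_values                      # Version 2: E[W | Y = y]
    return map_x, map_y, ce_x, ce_y
```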
4. For Gaussian, CICA Is CCA
In this section, we consider the special case where X and Y are jointly Gaussian random vectors. Since the mean has no bearing on either CCA or Wyner’s common information (it does not change any mutual information term), we will assume it to be zero in the sequel, without loss of generality. One key ingredient for the argument is a well-known change of basis, see, for example, [2], which we will now introduce in detail. We first need to introduce notation for CCA. To this end, let us express the covariance matrices, as usual, in terms of their (reduced) eigendecompositions as
K_X = Q_X Λ_X Q_X^H     (20)
and
K_Y = Q_Y Λ_Y Q_Y^H,     (21)
where Λ_X and Λ_Y are diagonal matrices containing only the non-zero eigenvalues, and r_X and r_Y denote the rank of K_X and K_Y, respectively. Starting from this, we define the matrices
T_X = Λ_X^{−1/2} Q_X^H     (22)
and
T_Y = Λ_Y^{−1/2} Q_Y^H,     (23)
where, for a diagonal matrix Λ with strictly positive entries, Λ^{−1/2} denotes the diagonal matrix whose diagonal entries are the reciprocals of the square roots of the entries of the matrix Λ. Using these matrices, the key step is to apply the change of basis
X̃ = T_X X,     (24)
Ỹ = T_Y Y.     (25)
In the new coordinates, the covariance matrices of X̃ and Ỹ, respectively, can be shown to be
K_X̃ = I_{r_X}     (26)
and
K_Ỹ = I_{r_Y}.     (27)
Moreover, we have
K_X̃Ỹ = T_X K_{XY} T_Y^H.     (28)
Let us denote the singular value decomposition of this matrix by
K_X̃Ỹ = U Σ V^H,     (29)
where Σ contains, on its diagonal, the ordered singular values of this matrix, denoted by ρ_1 ≥ ρ_2 ≥ ⋯. In addition, let us define
X̂ = U^H X̃,     (30)
Ŷ = V^H Ỹ,     (31)
which implies that K_X̂ = I_{r_X}, K_Ŷ = I_{r_Y}, and K_X̂Ŷ = Σ.
Next, we will leverage this change of basis to establish Wyner’s common information and its relaxation for the Gaussian vector case, and then to prove the connection between Generic Procedure 1 and CCA.
4.1. Wyner’s Common Information and Its Relaxation in the Gaussian Case
For the case where X and Y are jointly Gaussian random vectors, a full and explicit solution to the optimization problem of Equation (5) is found in [5]. To give some high-level intuition, the proof starts by mapping X to X̂ and Y to Ŷ as in Equations (30) and (31). This preserves all mutual information expressions as well as joint Gaussianity. Moreover, due to the structure of the covariance matrices of the vectors X̂ and Ŷ, the pairs (X̂_i, Ŷ_i) are independent pairs of Gaussian random variables. Thus, by the tensorization property (see Lemma 1), the vector problem can be reduced to n parallel scalar problems. The solution of the scalar problem is the main technical contribution of [5], and we refer to that paper for the detailed proof. The resulting formula can be expressed as in the following theorem.
Theorem 1
(from [5]). Let X and Y be jointly Gaussian random vectors of length n with a given joint covariance matrix. Then,
(32) where
(33) and (for ) are the singular values of where and are defined to mean that only the positive eigenvalues are inverted.
As pointed out above, we refer to (Theorem 7 in [5]) for a rigorous proof of this theorem.
4.2. CICA in the Gaussian Case and the Exact Connection with CCA
In this section, we consider the proposed CICA algorithm in the special case where the data distribution p_{X,Y} is a (multivariate) Gaussian distribution. We establish that, in this case, classic CCA is a solution to all versions of the proposed CICA algorithm. In this sense, CICA is a strict generalization of CCA. CCA is briefly reviewed in Appendix B. Leveraging the matrices U and V defined via the singular value decomposition in Equation (29), CCA performs the dimensionality reduction
X̂^(k) = U_k^H X̃,     (34)
Ŷ^(k) = V_k^H Ỹ,     (35)
where the matrix U_k contains the first k columns of U (that is, the k left singular vectors corresponding to the largest singular values), and the matrix V_k contains the respective right singular vectors. We refer to these as the “top k CCA components.”
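A compact way to compute these quantities is to whiten each vector and take an SVD of the normalized cross-covariance, in the spirit of Equations (20)–(31). The sketch below is an illustrative implementation under the assumption of full-rank covariance matrices; it uses the symmetric square root for whitening, which yields the same canonical components, and the synthetic self-check at the end uses toy parameters of our own.

```python
import numpy as np

def cca_top_k(Kx, Ky, Kxy, k):
    """Top-k CCA components via whitening and an SVD of the normalized
    cross-covariance (full-rank covariance matrices assumed).  Returns the
    two projection matrices and the top-k canonical correlations."""
    def inv_sqrt(K):
        lam, Q = np.linalg.eigh(K)
        return Q @ np.diag(1.0 / np.sqrt(lam)) @ Q.T
    Tx, Ty = inv_sqrt(Kx), inv_sqrt(Ky)
    U, s, Vt = np.linalg.svd(Tx @ Kxy @ Ty)            # normalized cross-covariance
    A = Tx @ U[:, :k]                                   # x is reduced to A.T @ x
    B = Ty @ Vt.T[:, :k]                                # y is reduced to B.T @ y
    return A, B, s[:k]

# Quick self-check on synthetic Gaussian samples (toy parameters of our own):
rng = np.random.default_rng(1)
n, d = 20000, 3
S = rng.standard_normal((n, d))
X = S + 0.3 * rng.standard_normal((n, d))
Y = S + 0.3 * rng.standard_normal((n, d))
Kx = np.cov(X, rowvar=False)
Ky = np.cov(Y, rowvar=False)
Kxy = (X - X.mean(0)).T @ (Y - Y.mean(0)) / (n - 1)
A, B, corrs = cca_top_k(Kx, Ky, Kxy, k=2)
print(np.round(corrs, 2))   # canonical correlations near 1/1.09, i.e., about 0.92
```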
Theorem 2.
 Let X and Y be jointly Gaussian random vectors. Then:
- 1.
 The top k CCA components are a solution to all three versions of Generic Procedure 1.
- 2.
 The parameter γ controls the number k as follows:
(36) where
Remark 1.
Note that k, viewed as a function of γ via Equation (36), is a decreasing, integer-valued function. An illustration for a special case is given in Figure 2.
Figure 2.
Illustration of the function from Theorem 2 for the concrete case where and have components each and the correlation coefficients are
Proof.
The main contribution of the theorem is the first item, i.e., the connection between CCA and Generic Procedure 1 in the case where X and Y are jointly Gaussian. The proof follows along the steps of the CICA procedure: we first show that, in Step 2, when X and Y are jointly Gaussian, the minimizing W may be taken to be jointly Gaussian with X and Y. Then, we establish that, in Step 3, with the W from Step 2, the dimension-reduced representations indeed turn into the top k CCA components. In detail:
Step 2 of Generic Procedure 1: The technical heavy lifting for this step in the case where p_{X,Y} is a multivariate Gaussian distribution is presented in [5]. We shall briefly summarize it here. In the case of Gaussian vectors, the solution to the optimization problem in Equation (5) is most easily described in two steps. First, we apply the change of basis indicated in Equations (24) and (25). This is a one-to-one transform, leaving all information expressions in Equation (5) unchanged. In the new basis, we have n independent pairs. By the tensorization property (see Lemma 1), when X and Y consist of independent pairs, the solution to the optimization problem in Equation (5) can be reduced to n separate scalar optimizations. The remaining crux then is solving the scalar Gaussian version of the optimization problem in Equation (5). This is done in (Theorem 3 in [5]) via an argument of factorization of the convex envelope. The full solution to the optimization problem is given in Equations (32) and (33). The remaining allocation problem over the non-negative numbers γ_i can be shown to lead to a water-filling solution, given in (Theorem 8 in [5]). More explicitly, to understand this solution, start by setting γ = I(X;Y). Then, the corresponding C_γ(X;Y) = 0, and the optimizing distribution trivializes. Now, as we lower γ, the various terms in the sum in Equation (32) start to become non-zero, starting with the term corresponding to the largest correlation coefficient ρ_1. Hence, an optimizing W can be expressed as a linear combination of the top k CCA components of X and of Y (see Equations (34) and (35) and the following discussion), plus additive Gaussian noise with mean zero, independent of X and Y.
Step 3 of Generic Procedure 1: For the algorithm, we need the corresponding conditional marginals, p_{W|X} and p_{W|Y}. By symmetry, it suffices to prove one formula. Changing basis as in Equations (24) and (25), we can write
(37) 
(38) 
(39) 
(40) 
(41) The first summand contains exactly the top k CCA components extracted from which is the claimed result. The second summand requires further scrutiny. To proceed, we observe that, for CCA, the projection vectors obey the relationship (see Equation (A12))
(42) for some real-valued constant Thus, combining the top k CCA components, we can write
(43) where D is a diagonal matrix. Hence,
(44) 
(45) where is the diagonal matrix
(46) This is precisely the set of top k CCA components (note that the solution to the CCA problem (A7) is only specified up to a scaling). This establishes the theorem for Version 2 of the proposed algorithm. Clearly, it also establishes that the conditional distribution of W given X = x is Gaussian with mean given by (45), thus establishing the theorem for Version 1 of the proposed algorithm. The proof for Version 3 follows along similar lines and is thus omitted. □
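The mechanism used in Step 3 of this proof can be checked numerically at the population level. The sketch below picks one particular linear combination of the top-k components of X and Y plus independent noise as a stand-in for the optimizing W (the optimizer of [5] has specific weights that we do not reproduce here); for any such diagonal combination, E[W | X = x] is a componentwise rescaling of the top-k CCA components of x, which is all that the argument above requires.

```python
import numpy as np

rng = np.random.default_rng(2)
dx, dy, k = 4, 4, 2

# A generic jointly Gaussian model, only to obtain covariance matrices with a
# nontrivial cross-covariance (parameters of our own choosing).
G = rng.standard_normal((dx, dx))
Hm = rng.standard_normal((dy, dx))
Kx = G @ G.T + np.eye(dx)
Kxy = G @ Hm.T
Ky = Hm @ Hm.T + np.eye(dy)

def inv_sqrt(K):
    lam, Q = np.linalg.eigh(K)
    return Q @ np.diag(1.0 / np.sqrt(lam)) @ Q.T

Tx, Ty = inv_sqrt(Kx), inv_sqrt(Ky)
U, s, Vt = np.linalg.svd(Tx @ Kxy @ Ty)
Uk, Vk, Sk = U[:, :k], Vt.T[:, :k], np.diag(s[:k])

# Take W = Uk^T (Tx X) + Vk^T (Ty Y) + Z with Z independent Gaussian noise
# (the noise does not affect the conditional mean).  For jointly Gaussian
# vectors, E[W | X = x] = Cov(W, X) Kx^{-1} x, and the claim is that this is a
# componentwise rescaling, (I + Sk), of the top-k CCA components Uk^T Tx x.
cov_WX = Uk.T @ Tx @ Kx + Vk.T @ Ty @ Kxy.T
lhs = cov_WX @ np.linalg.inv(Kx)
rhs = (np.eye(k) + Sk) @ Uk.T @ Tx
print(np.allclose(lhs, rhs))   # True: E[W | X] is carried by the top-k CCA components
```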
5. A Binary Example
In this section, we carry out a theoretical study of a somewhat more general version of the example discussed in Section 1.4, which is believed to be within the reach of practical data. In order to permit a theoretical study, we restrict the data to be binary, since Wyner’s common information of the doubly symmetric binary source is known in closed form.
Let us illustrate the proposed algorithm via a simple example. Consider the pair (U, V) of binary random variables, and suppose that (U, V) is a doubly symmetric binary source. This means that U is uniform and V is the result of passing U through a binary symmetric (“bit-flipping”) channel with flip probability denoted by a_0, to match the notation in (Section 3 in [23]). Without loss of generality, we may assume a_0 ≤ 1/2. Meanwhile, X_2 and Y_2 are independent binary uniform random variables, also independent of the pair (U, V). We will then form the vectors X and Y as
X = (X_1, X_2) = (U ⊕ X_2, X_2)     (47)
and
Y = (Y_1, Y_2) = (V ⊕ Y_2, Y_2),     (48)
where ⊕ denotes modulo-2 addition, as usual. How do various techniques perform for this example?
Let us first analyze the behavior and outcome of CCA in this particular example. The key observation is that any pair amongst the four entries of these two vectors, X_1, X_2, Y_1, and Y_2, are (pairwise) independent binary uniform random variables. Hence, the overall covariance matrix of the merged random vector (X, Y) is merely a scaled identity matrix. This, in turn, implies that CCA as described in Equations (34) and (35) merely boils down to the identity mapping. Concretely, this means that, for CCA, in this example, the best one-dimensional projections are ex aequo any pair of one coordinate of the vector X with one coordinate of the vector Y. As we have already explained above, any such pair is merely a pair of independent (and identically distributed) random variables, so CCA does not extract any dependence between X and Y at all. Of course, this is the main point of the present example.
- 
How does CICA perform in this example? We selected this example because it represents one of the only cases for which a closed-form solution to the optimization problem in Equation (13) is known, at least in the case γ = 0. To see this, let us first observe that, in our example, we have
Next, we observe that(49) (50) (51) 
where (51) follows from Lemma 1, Item 5, together with the Markov chain implied by (49), and the last Equation (52) likewise follows from Lemma 1, Item 5, together with the Markov chain implied by (49). That is, in this simple example, solving the optimization problem of Equation (13) is tantamount to solving the optimization problem in Equation (52). For the latter, the solution is well known; see (Section 3 in [23]). Specifically, we can express the conditional distribution that solves the optimization problem of Equation (13) and is required for Step 3 of Generic Procedure 1 as follows:
(52)
where
(53)
(54)
Let us now apply Version 1 (the MAP version) of Generic Procedure 1. To this end, we also need to calculate p_{W|X} and p_{W|Y}. Again, for γ = 0, these can be expressed in closed form as follows:
where(55) 
The formula for p_{W|Y} follows by symmetry and shall be omitted. The final step is to follow Equations (14) and (15) and find arg max_w p_{W|X}(w|x) for each x, as well as arg max_w p_{W|Y}(w|y) for each y. For the example at hand, these can be compactly expressed as
(56)
(57)
which follows from the fact that the flip probability is at most 1/2. Hence, we find that, for CICA as described in Generic Procedure 1, an optimal solution is to reduce X to U and Y to V. This captures all the dependence between the vectors X and Y, which appears to be the most desirable outcome.
(58)
As a final note, we point out that it is conceptually straightforward to evaluate Versions 2 and 3 (conditional expectation) of Generic Procedure 1 in this example, but this would require embedding the considered binary alphabets into the real numbers. This makes it a less satisfying option for the simple example at hand. (A small numerical check of this reduction is sketched at the end of this section.)
 
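As a sanity check of this example, the following sketch (assuming the construction in Equations (47) and (48) as written above, with an arbitrary flip probability) verifies numerically that I(X;Y) = I(U;V), i.e., that reducing X to U and Y to V loses none of the dependence.

```python
import itertools
import numpy as np

def H(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

a0 = 0.1                                   # flip probability of the DSBS (arbitrary)
# Joint pmf over (x1, x2, y1, y2) for X = (U xor X2, X2), Y = (V xor Y2, Y2).
p = np.zeros((2, 2, 2, 2))
for u, v, x2, y2 in itertools.product(range(2), repeat=4):
    w = (0.5 * (1 - a0) if u == v else 0.5 * a0) * 0.25
    p[u ^ x2, x2, v ^ y2, y2] += w

p_x = p.sum(axis=(2, 3))
p_y = p.sum(axis=(0, 1))
i_xy = H(p_x) + H(p_y) - H(p)              # I(X;Y)

# Recover U = X1 xor X2 and V = Y1 xor Y2 and compute I(U;V).
p_uv = np.zeros((2, 2))
for x1, x2, y1, y2 in itertools.product(range(2), repeat=4):
    p_uv[x1 ^ x2, y1 ^ y2] += p[x1, x2, y1, y2]
i_uv = H(p_uv.sum(axis=1)) + H(p_uv.sum(axis=0)) - H(p_uv)

print(round(i_xy, 6), round(i_uv, 6))      # equal (= 1 - h_b(a0)): nothing is lost
```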
6. A Gradient Descent Based Implementation
As we discussed above, in our problem, the objective is indeed a convex function of the optimization variables (but the constraint set is not convex). Clearly, this gives hope that gradient-based techniques may lead to interesting solutions. In this section, we examine a first tentative implementation and check it against ground truth for some simple examples.
For convex problems, gradient descent is guaranteed to converge to the optimal solution; otherwise, only convergence to a local optimum is guaranteed. Gradient descent runs in iterative steps, where each step performs a local linear approximation and the step size depends on a learning parameter. In our work, we want to minimize the objective while the constraint is held below a given level.
Instead, we apply a variant of gradient descent in which we minimize the weighted sum of the objective and the constraint, namely I(X, Y; W) + λ I(X; Y|W). The parameter λ permits some control on the constraint, and sweeping over λ traces out all of its possible values. We present the algorithm in terms of functions that represent the conditional distribution p(w|x, y) being optimized and the resulting joint distribution, respectively.
The exact computation of the stated update step is presented in the following lemma.
Lemma 2
(Computation of the update step). Let p_{X,Y} be a fixed distribution; then, the updating steps for the gradient descent are
(59) 
(60) 
Proof.
Let the function C be as defined above
(61) and, in terms of information theoretic terms, the function is . In addition, is a convex function of , shown in (Theorem 2.7.4 in [29]). Taking the first derivative, we get
(62) 
(63) On the other hand, the term can be expressed as
(64) 
(65) Taking the derivative with respect to becomes easier once is written in terms of function C and we already know the derivative of C from (63). Thus, the derivative would be
(66) 
(67) 
(68) 
(69) where (67) is an application of the chain rule, and the rest is straightforward computation. □
Remark 2.
In practice, it is useful and computationally cheaper to replace the derivative formulas in Lemma 2 by their standard finite-difference approximations. That is, the updating step in line 7 of Algorithm 1 is replaced by
(70) 
(71)
for a judicious choice of the perturbation parameter. This is the version that was used for Figure 1b. We point out that, in the general case, the error introduced by this approximation is not bounded.
Algorithm 1: Approximate CICA Algorithm via Gradient Descent.
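Since the pseudocode of Algorithm 1 is not reproduced in this version of the text, the following sketch illustrates the approach of this section: minimize the penalized objective I(X,Y;W) + λ I(X;Y|W) over p(w|x, y) by projected gradient descent with finite-difference gradients in the spirit of Remark 2. All parameter choices (λ, step size, perturbation size, alphabet size of W, number of iterations) are our own illustrative defaults, and there is no guarantee of reaching a global optimum.

```python
import numpy as np

def H(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def penalized_objective(p_xy, p_w_given_xy, lam):
    """I(X,Y;W) + lam * I(X;Y|W) for finite alphabets."""
    p = p_xy[:, :, None] * p_w_given_xy
    i_obj = H(p_xy) + H(p.sum(axis=(0, 1))) - H(p)                       # I(X,Y;W)
    i_con = H(p.sum(axis=1)) + H(p.sum(axis=0)) \
        - H(p.sum(axis=(0, 1))) - H(p)                                   # I(X;Y|W)
    return i_obj + lam * i_con

def project(q):
    """Map an arbitrary nonnegative array back to a conditional pmf over w."""
    q = np.clip(q, 1e-9, None)
    return q / q.sum(axis=2, keepdims=True)

def approximate_cica_gd(p_xy, nw=2, lam=5.0, mu=0.05, eps=1e-4, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    q = project(rng.random(p_xy.shape + (nw,)))          # random initial p(w|x,y)
    for _ in range(iters):
        base = penalized_objective(p_xy, q, lam)
        grad = np.zeros_like(q)
        for idx in np.ndindex(*q.shape):                 # finite-difference gradient
            qp = q.copy()
            qp[idx] += eps
            grad[idx] = (penalized_objective(p_xy, project(qp), lam) - base) / eps
        q = project(q - mu * grad)
    return q

# Example: doubly symmetric binary source with flip probability 0.1.
a0 = 0.1
p_xy = np.array([[0.5 * (1 - a0), 0.5 * a0],
                 [0.5 * a0, 0.5 * (1 - a0)]])
q = approximate_cica_gd(p_xy)
print(np.round(q, 2))   # a locally optimal p(w|x,y); no global guarantee
```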
7. Extension to More than Two Sources
It is unclear how one would extend CCA to more than two databases. By contrast, for CICA, this extension is conceptually straightforward. For Wyner’s common information in Definition 1, it suffices to replace the objective in the minimization by I(X_1, X_2, …, X_M; W) and to keep the constraint of conditional independence. To obtain an interesting algorithm, we again need to relax the constraint of conditional independence. The most natural way is via the conditional version of Watanabe’s total correlation [31], leading to the following definition:
Definition 3
(Relaxed Wyner’s Common Information for M variables). For a fixed probability distribution p_{X_1, X_2, …, X_M}, we define
C_γ(X_1; X_2; …; X_M) = inf I(X_1, X_2, …, X_M; W)     (72)
such that Σ_{i=1}^M H(X_i | W) − H(X_1, X_2, …, X_M | W) ≤ γ, where the infimum is over all conditional probability distributions p(w | x_1, …, x_M), that is, over all joint distributions with marginal p_{X_1, X_2, …, X_M}.
Not surprisingly, an explicit closed-form solution is difficult to find. One simple case appears below as part of the example presented in Section 7.1, see Lemma 4. By analogy with Lemma 1, we can again state basic properties.
Lemma 3.
C_γ(X_1; X_2; …; X_M) satisfies the following basic properties:
- 1.
 
- 2.
 is a convex and continuous function of γ for
- 3.
 If forms a Markov chain,
then
- 4.
 The cardinality of may be restricted to
- 5.
 If are one-to-one functions,
then
- 6.
 For discrete X, we have
Proofs for these basic properties can be found in Appendix C.
Leveraging Definition 3, it is conceptually straightforward to extend CICA (that is, Generic Procedure 1) to the case of M databases as follows. For completeness, we include an explicit statement of the resulting procedure.
Generic Procedure 2
(CICA with multiple sources).
- 1.
 Select a non-negative real number γ. This is the compression level: a low value of γ represents low compression, and, thus, many components are retained. A high value of γ represents high compression, and, thus, only a small number of components are retained.
- 2.
 Solve the relaxed Wyner’s common information problem,
inf I(X_1, X_2, …, X_M; W)  subject to  Σ_{i=1}^M H(X_i | W) − H(X_1, X_2, …, X_M | W) ≤ γ,     (73)
leading to an associated minimizing conditional distribution p(w | x_1, x_2, …, x_M).
- 3.
 Using the conditional distribution found in Step 2, the dimension-reduced data sets can now be found via one of the following three variants:
- (a)
 Version 1: MAP (maximum a posteriori):
û_i(x_i) = arg max_w p_{W|X_i}(w|x_i),     (74)
for i = 1, 2, …, M.
- (b)
 Version 2: Conditional Expectation:
û_i(x_i) = E[W | X_i = x_i],     (75)
for i = 1, 2, …, M.
- (c)
 Version 3: Marginal Integration:
(76)
for i = 1, 2, …, M.
Clearly, Generic Procedure 2 closely mirrors Generic Procedure 1. The key difference is that there is no direct analog of Theorem 2. This is no surprise since it is unclear how CCA would be extended to beyond the case of two sources. Nonetheless, it would be very interesting to explore what Generic Procedure 2 boils down to in the special case when all vectors are jointly Gaussian. At the current time, this is unknown. In fact, the explicit solution to the optimization problem in Definition 3 is presently an open problem.
Instead, we illustrate the promise of Generic Procedure 2 via a simple binary example in the next section. The example mirrors some of the basic properties of the example tackled in Section 5.
7.1. A Binary Example with Three Sources
In this section, we develop an example with three sources that borrows some of the ideas from the example discussed in Section 5. In a sense, the present example is even more illustrative because, in it, any two of the original vectors X, Y, and Z are (pairwise) independent. Therefore, any method based on pairwise measures, including CCA and maximal correlation, would not identify any commonality at all. Specifically, we consider the following simple statistical model:
X = (U, X_2),   Y = (V, Y_2),   Z = (U ⊕ V, Z_2),     (77)
where U, V, X_2, Y_2, and Z_2 are independent uniform binary random variables and ⊕ denotes modulo-2 addition. We observe that, amongst these three vectors, any pair is independent. This implies, for example, that any correlation-based technique (including maximal correlation) will not identify any relevant features, since correlation is a pairwise measure. By contrast, we can show that, for γ = 0, one output of Generic Procedure 2 is indeed to reduce each of the three vectors to its first component, which is the intuitively pleasing answer in this case. Going through the steps of Generic Procedure 2 for γ = 0, where the joint distribution satisfies
| (78) | 
we have that
| (79) | 
| (80) | 
| (81) | 
| (82) | 
where we use Lemma 3, Item 3, together with the Markov chain that follows from (78) to prove step (80). Similarly, another Markov chain implied by (78) proves step (81), again by making use of Lemma 3, Item 3. A similar argument is used for the last step (82). Computing the resulting quantity is thus equivalent to computing C_0(U; V; U ⊕ V), and we demonstrate how to compute it in the next part.
Lemma 4.
 Let U and V be independent uniform binary random variables, and let ⊕ denote modulo-2 addition. Then, the optimal solution to
(83)
is W = (U, V), where the expression evaluates to two.
The proof is given in Appendix D. If we apply Version 1 of Step 3 of Generic Procedure 2, we obtain
| (84) | 
that is, in this case, the maximizer is not unique. However, since the set of maximizers is a deterministic function of u alone, it is natural to reduce X as follows:
| (85) | 
By the same token, we can reduce Y and Z:
| (86) | 
| (87) | 
In this example, it is clear that this indeed extracts all of the dependency there is between our three sources, and, thus, is the correct answer.
As pointed out above, in this simple example, any pair of the random vectors X, Y, and Z is (pairwise) independent, which implies that the classic tools based on pairwise measures (CCA, maximal correlation) cannot identify any commonality between X, Y, and Z.
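The following sketch verifies these claims numerically, assuming the reading of Equation (77) given above: all pairwise mutual informations among X, Y, and Z vanish, while the total correlation is one bit.

```python
import itertools
import numpy as np

def H(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Model of Equation (77): X = (U, X2), Y = (V, Y2), Z = (U xor V, Z2).
# Each vector is encoded as 2 * (first component) + (second component).
p = np.zeros((4, 4, 4))
for u, v, x2, y2, z2 in itertools.product(range(2), repeat=5):
    p[2 * u + x2, 2 * v + y2, 2 * (u ^ v) + z2] += 1 / 32

p_x, p_y, p_z = p.sum(axis=(1, 2)), p.sum(axis=(0, 2)), p.sum(axis=(0, 1))
i_xy = H(p_x) + H(p_y) - H(p.sum(axis=2))
i_xz = H(p_x) + H(p_z) - H(p.sum(axis=1))
i_yz = H(p_y) + H(p_z) - H(p.sum(axis=0))
total_corr = H(p_x) + H(p_y) + H(p_z) - H(p)

print(i_xy, i_xz, i_yz, total_corr)
# 0.0 0.0 0.0 1.0 -- all pairwise dependencies vanish, yet jointly the three
# vectors share one bit of total correlation, invisible to pairwise methods.
```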
8. Conclusions and Future Work
We introduce a novel two-step procedure that we refer to as CICA. The first step consists of an information minimization problem related to Wyner’s common information, while the second can be thought of as a type of back-projection. We prove that, in the special case of Gaussian statistics, this two-step procedure precisely extracts the CCA components. A free parameter in CICA permits selecting the number of CCA components that are being extracted. In this sense, the paper establishes a novel rigorous connection between CCA and information measures. A number of simple examples are presented. It is also shown how to extend the novel algorithm to more than two sources.
Future work includes a more in-depth study to assess the practical promise of this novel algorithm. This will also require moving beyond the current setting, where it was assumed that the probability distribution of the data at hand is provided directly. Instead, this distribution has to be estimated from data, and one needs to understand what limitations this additional constraint imposes.
Acknowledgments
This work was supported in part by the Swiss National Science Foundation under Grant No. 169294.
Abbreviations
The following abbreviations are used in this manuscript:
| CCA | Canonical correlation analysis | 
| ACE | Alternating conditional expectation | 
| ICA | Independent component analysis | 
| CICA | Common information component analysis | 
Appendix A. Proof of Lemma 1
For Item 5, on the one hand, we have
| (A1) | 
| (A2) | 
| (A3) | 
where, in Equation (A2), we add the constraint that conditioned on W is selected to be independent of which cannot reduce the value of the infimum. That is, for such a choice of we have the Markov chain thus Furthermore, observe that the factorization also implies the factorization Hence, we also have the Markov chain ; thus, which thus established the last step. Conversely, observe that
| (A4) | 
| (A5) | 
| (A6) | 
where (A5) follows from the fact that the infimum of the sum is lower bounded by the sum of the infimums and the fact that relaxing constraints cannot increase the value of the infimum, and (A6) follows from non-negativity of the second term.
Item 6 is a standard cardinality bound, following from the arguments in [32]. For the context at hand, see also Theorem 1 in (p. 6396, [33]). Item 7 follows because all involved mutual information terms are invariant to one-to-one transforms. For Item 8), note that we can express which directly gives the result.
Appendix B. A Brief Review of Canonical Correlation Analysis (CCA)
A brief review of CCA [1] is presented. Let X and Y be zero-mean real-valued random vectors with covariance matrices K_X and K_Y, respectively. Moreover, let K_{XY} denote their cross-covariance matrix. We first apply the change of basis as in (24) and (25). CCA seeks to find vectors a and b so as to maximize the correlation between a^H X̃ and b^H Ỹ, that is,
| (A7) | 
which can be rewritten as
| (A8) | 
where
| (A9) | 
Note that this expression is invariant to arbitrary (separate) scaling of a and b. To obtain a unique solution, we could choose to impose that both vectors be unit vectors,
(A10)
From Cauchy–Schwarz, for a fixed a, the maximizing (unit-norm) b is given by
(A11)
or, equivalently, for a fixed b, the maximizing (unit-norm) a is given by
(A12)
Plugging in the latter, we obtain
| (A13) | 
or, dividing through,
| (A14) | 
The solution to this problem is well known: b is the right singular vector corresponding to the largest singular value of the matrix K_X̃Ỹ, and a is evidently the corresponding left singular vector. Restarting again from Equation (A7), but restricting to vectors that are orthogonal to the optimal choices of the first round, leads to the second CCA components, and so on.
Appendix C. Proof of Lemma 3
For item 1, we proceed as follows
| (A15) | 
| (A16) | 
where we used weak duality for and is
| (A17) | 
By setting , we obtain
| (A18) | 
| (A19) | 
| (A20) | 
where the infimum of in (A20) is attained for the trivial random variable W, thus . Item 2 follows from a similar argument as in (Corollary 4.5 in [23]). For item 3, we start by showing both sides of the inequality that will result in equality. One side of the inequality is shown below:
| (A21) | 
| (A22) | 
| (A23) | 
| (A24) | 
where the last inequality follows by restricting the possible set of W, such that W and Z are conditionally independent given ,
| (A25) | 
From the statement of the lemma, we have ,
| (A26) | 
By adding (A25) and (A26), we get . This implies that we have , which appears in the constraint of (A23). For the other part of the inequality we proceed as follows:
| (A27) | 
| (A28) | 
| (A29) | 
where the last part follows by relaxing the constraint set as and, by further bounding the terms in the objective, .
Item 4 is a standard cardinality bound, following from a similar argument in [32]. Item 5 follows because all involved mutual information terms are invariant to one-to-one transforms. For item 6, we apply the definition of relaxed Wyner’s common information for M variables, and we have
| (A30) | 
| (A31) | 
| (A32) | 
Appendix D. Proof of Lemma 4
An upper bound to the problem is obtained by picking W = (U, V); thus,
| (A33) | 
An equivalent way of writing the problem is to split the constraint into two constraints: since the constraint quantity cannot be smaller than zero, it must be exactly zero, and the problem can be written in the following way:
| (A34) | 
By using weak duality for , a lower bound to the problem would be the following
| (A35) | 
By further using the constraint , the above expression can be written as
| (A36) | 
| (A37) | 
| (A38) | 
| (A39) | 
where (A38) is a consequence of allowing a minimization (if minimum exists) over binary random variables and the rest of equalities is straightforward manipulation. The last equation is in terms of the lower convex envelope with respect to the distribution . The aim is to search for the tightest bound over by studying the lower convex envelope with respect to , which, for binary , can be simplified into
| (A40) | 
and the latter function is a lower convex envelope with respect to . Note that (A40) is a continuous function of , so a first order and a second order differentiation will be enough to compute the lower convex envelope. As a result for , the lower convex envelope of the right-hand side of (A40) is just zero, thus completing the proof.
Author Contributions
Conceptualization, M.C.G.; Data curation, M.C.G.; Formal analysis, E.S. and M.C.G.; Methodology, M.C.G.; Software, E.S.; Validation, E.S.; Visualization, E.S. and M.C.G.; Writing—original draft, E.S. and M.C.G.; Writing—review & editing, E.S. and M.C.G. The authors have contributed equally. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The dataset is available from the corresponding author on request.
Conflicts of Interest
The authors declare no conflict of interest.
Footnotes
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Hotelling H. Relations between two sets of variates. Biometrika. 1936;28:321–377. doi: 10.1093/biomet/28.3-4.321. [DOI] [Google Scholar]
 - 2.Satpathy S., Cuff P. Gaussian secure source coding and Wyner’s common information; Proceedings of the 2015 IEEE International Symposium on Information Theory (ISIT); Hong Kong, China. 14–19 June 2015; pp. 116–120. [Google Scholar]
 - 3.Huang S.L., Wornell G.W., Zheng L. Gaussian universal features, canonical correlations, and common information; Proceedings of the 2018 IEEE Information Theory Workshop (ITW); Guangzhou, China. 25–29 November 2018. [Google Scholar]
 - 4.Gastpar M., Sula E. Relaxed Wyner’s common information; Proceedings of the 2019 IEEE Information Theory Workshop; Visby, Sweden. 25–28 August 2019. [Google Scholar]
- 5.Sula E., Gastpar M. On Wyner’s common information in the Gaussian case. arXiv. 2019 arXiv:1912.07083 [Google Scholar]
 - 6.Chechik G., Globerson A., Tishby N., Weiss Y. Information bottleneck for Gaussian variables. J. Mach. Learn. Res. 2005;6:165–188. [Google Scholar]
 - 7.Bach F., Jordan M. A Probabilistic Interpretation of Canonical Correlation Analysis. University of California; Berkeley, CA, USA: 2005. Technical Report 688. [Google Scholar]
 - 8.Breiman L., Friedman J.H. Estimating optimal transformations for multiple regression and correlation. J. Am. Stat. Assoc. 1985;80:580–598. doi: 10.1080/01621459.1985.10478157. [DOI] [Google Scholar]
 - 9.Comon P. Independent component analysis; Proceedings of the International Signal Processing Workshop on High-Order Statistics; Chamrousse, France. 10–12 July 1991; pp. 111–120. [Google Scholar]
 - 10.Witsenhausen H.S., Wyner A.D. A conditional entropy bound for a pair of discrete random variables. IEEE Trans. Inf. Theory. 1975;21:493–501. doi: 10.1109/TIT.1975.1055437. [DOI] [Google Scholar]
 - 11.Tishby N., Pereira F.C., Bialek W. The information bottleneck method; Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing; Monticello, IL, USA. 22–24 September 1999; pp. 368–377. [Google Scholar]
- 12.Yuan X.-T., Hu B.-G. Robust feature extraction via information theoretic learning; Proceedings of the 26th Annual International Conference on Machine Learning; Montreal, QC, Canada. 14–18 June 2009; pp. 1193–1200. [Google Scholar]
 - 13.Wang H., Chen P. A feature extraction method based on information theory for fault diagnosis of reciprocating machinery. Sensors. 2009;9:2415–2436. doi: 10.3390/s90402415. [DOI] [PMC free article] [PubMed] [Google Scholar]
 - 14.Bi J., Bennett K.P., Embrechts M., Breneman C.M., Song M. Dimensionality reduction via sparse support vector machines. J. Mach. Learn. Res. 2003;3:1229–1243. [Google Scholar]
 - 15.Laparra V., Malo J., Camps-Valls G. Dimensionality reduction via regression in hyperspectral imagery. IEEE J. Sel. Top. Signal Process. 2015;9:1026–1036. doi: 10.1109/JSTSP.2015.2417833. [DOI] [Google Scholar]
 - 16.Gao J., Shi Q., Caetano T.S. Dimensionality reduction via compressive sensing. Pattern Recognit. Lett. 2012;33:1163–1170. doi: 10.1016/j.patrec.2012.02.007. [DOI] [Google Scholar]
 - 17.Hadsell R., Chopra S., LeCun Y. Dimensionality reduction by learning an invariant mapping; Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; New York, NY, USA. 17–22 June 2006. [Google Scholar]
 - 18.Zhang Z., Zha H. Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM J. Sci. Comput. 2004;26:313–338. doi: 10.1137/S1064827502419154. [DOI] [Google Scholar]
 - 19.Vepakomma P. Supervised dimensionality reduction via distance correlation maximization. Electron. J. Stat. 2018;12:960–984. doi: 10.1214/18-EJS1403. [DOI] [Google Scholar]
 - 20.Wu J., Wang J., Liu L. Feature extraction via KPCA for classification of gait patterns. Hum. Mov. Sci. 2007;26:393–411. doi: 10.1016/j.humov.2007.01.015. [DOI] [PubMed] [Google Scholar]
 - 21.Lai Z., Mo D., Wong W.K., Xu Y., Miao D., Zhang D. Robust discriminant regression for feature extraction. IEEE Trans. Cybern. 2018;48:2472–2484. doi: 10.1109/TCYB.2017.2740949. [DOI] [PubMed] [Google Scholar]
 - 22.Wang H., Zhang Y., Waytowich N.R., Krusienski D.J., Zhou G., Jin J., Wang X., Cichocki A. Discriminative feature extraction via multivariate linear regression for SSVEP-based BCI. IEEE Trans. Neural Syst. Rehabil. Eng. 2016;24:532–541. doi: 10.1109/TNSRE.2016.2519350. [DOI] [PubMed] [Google Scholar]
 - 23.Wyner A. The common information of two dependent random variables. IEEE Trans. Inf. Theory. 1975;21:163–179. doi: 10.1109/TIT.1975.1055346. [DOI] [Google Scholar]
 - 24.Witsenhausen H.S. Values and bounds for the common information of two discrete random variables. SIAM J. Appl. Math. 1976;31:313–333. doi: 10.1137/0131026. [DOI] [Google Scholar]
 - 25.Xu G., Liu W., Chen B. Wyner’s common information for continuous random variables—A lossy source coding interpretation; Proceedings of the Annual Conference on Information Sciences and Systems; Baltimore, MD, USA. 23–25 March 2011. [Google Scholar]
 - 26.Xu G., Liu W., Chen B. A lossy source coding interpretation of Wyner’s common information. IEEE Trans. Inf. Theory. 2016;62:754–768. doi: 10.1109/TIT.2015.2506560. [DOI] [Google Scholar]
 - 27.Gastpar M., Sula E. Common information components analysis; Proceedings of the Information Theory and Applications Workshop (ITA); San Diego, CA, USA. 2–7 February 2020. [Google Scholar]
 - 28.Wang C.Y. Ph.D. Thesis. École Polytechnique Fédérale de Lausanne; Lausanne, Switzerland: 2015. Function Computation over Networks: Efficient Information Processing for Cache and Sensor Applications. [Google Scholar]
 - 29.Cover T.M., Thomas J.A. Elements of Information Theory. 2nd ed. Wiley; Hoboken, NJ, USA: 2005. [Google Scholar]
 - 30.Timo R., Bidokhti S.S., Wigger M.A., Geiger B.C. A rate-distortion approach to caching. IEEE Trans. Inf. Theory. 2018;64:1957–1976. doi: 10.1109/TIT.2017.2768058. [DOI] [Google Scholar]
 - 31.Watanabe S. Information theoretical analysis of multivariate correlation. IBM J. Res. Dev. 1960;4:66–82. doi: 10.1147/rd.41.0066. [DOI] [Google Scholar]
 - 32.Ahlswede R., Körner J. Source coding with side information and a converse for degraded broadcast channels. IEEE Trans. Inf. Theory. 1975;21:629–637. doi: 10.1109/TIT.1975.1055469. [DOI] [Google Scholar]
 - 33.Wang C.Y., Lim S.H., Gastpar M. Information-theoretic caching: Sequential coding for computing. IEEE Trans. Inf. Theory. 2016;62:6393–6406. doi: 10.1109/TIT.2016.2604851. [DOI] [Google Scholar]
 