Author manuscript; available in PMC: 2020 Sep 10.
Published in final edited form as: Stat Med. 2019 Jun 19;38(20):3936–3946. doi: 10.1002/sim.8212

Maximum Diversity Weighting for Biomarkers with Application in HIV-1 Vaccine Studies

Zonglin He, Youyi Fong †,‡,*
PMCID: PMC6684395  NIHMSID: NIHMS1041471  PMID: 31215662

Abstract

While studying the association between risk of HIV-1 infection and vaccine-elicited immune responses in preventative HIV-1 vaccine recipients, we encountered a need to combine a collection of biomarkers in an unsupervised fashion with the goal of preserving signal diversity within that collection. Inspired by methods for weighting protein sequences from the biological sequence analysis literature, we propose novel methods for weighting biomarkers, which we call maximum diversity weights. These weights are defined as the weights that maximize measures of signal diversity within a collection of biomarkers. While the optimization problems do not admit analytical solutions, they are convex and hence can be solved efficiently using iterative search algorithms. Through Monte Carlo studies and a real data example from HIV-1 vaccine research, we show that using maximum diversity weights in association studies can lead to an increase in power over other commonly used weights such as uniform weights or principal component-based weights.

Keywords: multivariate analysis, unsupervised feature selection, profile hidden Markov model, HIV-1 vaccine development

1. Background

Unsupervised biomarker combination methods include a diverse set of methods. Unlike supervised marker combination methods used in classification, risk prediction, and treatment selection, unsupervised biomarker combination methods are performed using only information within the biomarkers and without using the clinical outcome of interest, e.g. disease status. There are a multitude of unsupervised biomarker combination methods that serve different purposes. For example, principal component analysis [PCA] [1] is often used to reduce data dimension to allow easier and more meaningful interpretation, and multidimensional scaling [MDS] [2] can be viewed as a way to visualize the proximity information contained in a distance metric. Our interest in unsupervised biomarker combination stems from a need to preserve signal diversity within a panel of related biomarkers, which arose from an effort to identify human immune response biomarkers associated with the risk of HIV-1 infection.

The RV144 phase III efficacy trial of an ALVAC prime/AIDSVAX boost vaccine regimen [3] showed an estimated vaccine efficacy of 31.2% (p value = 0.04) to prevent HIV-1 acquisition. Identifying immune response biomarkers associated with infection risk, which are termed correlates of risk [CoR] [4], in a vaccinated population can provide insight into vaccine-mediated protection from infection, even for imperfect vaccines such as the regimen tested in the RV144 trial. To identify CoR, Haynes et al. [5] measured a large number of immune response biomarkers from the vaccine recipients. The Scientific Advisory Board mandated an analysis framework for the purpose of testing six hypotheses in a multiple logistic regression model. They were interested in whether there was an association between risk of HIV-1 infection and each of six different primary variables: (1) gp70 V1V2-specific IgG binding antibodies, (2) envelope protein-specific IgG binding antibodies, (3) envelope protein-specific IgA binding antibodies, (4) neutralizing antibodies, (5) antibody-dependent cellular cytotoxicity, and (6) envelope-specific CD4+ T-cells [6]. Each of the six primary variables represents an immune function subclass. In this highly structured analysis plan, some of the six immune function subclasses map to a single measurement, whereas others map to a collection of measurements. For the latter, we need to find a summary score for a collection of measurements to represent an immune function subclass. Our goal is for the summary score to preserve the signal diversity within the collection, since the diversity in the measurements is directly due to the genetic and antigenic diversity of HIV-1 and is thought to be important for protection against a broad range of circulating HIV-1 strains.

We refer to the task of creating a summary score for a set of biomarkers with the goal of preserving signal diversity as maximum diversity weighting. A simple approach to maximum diversity weighting is to combine the biomarkers with equal weights. The main drawback of this approach is that it ignores potential correlations between the biomarkers. If some biomarkers are highly correlated with each other, they are in a sense redundant and ideally should be down-weighted so that the other biomarkers have a fairer representation in the score variable.

Maximum diversity weighting has a different goal than other unsupervised dimension reduction methods, which makes it uniquely suitable for some application scenarios. For example, in our motivating application we have a set of biomarkers that all measure the amount of binding antibodies present in participant serum samples, but each of the biomarkers encompasses antibody binding to a different subset of residues (i.e. a different epitope) on the surface of the HIV-1 envelope protein. Suppose that there is an easy-to-bind epitope on the surface that most antibodies recognize and there is also a more-difficult-to-bind epitope on the surface that only a few antibodies recognize. Further suppose that only the abundance of antibodies binding to the latter epitope is associated with decreased risk of HIV-1 infection. Since it is unknown a priori which epitope is of interest, it is best to preserve binding to all epitopes when we create the summary score variable. In this scenario, using maximum diversity weighting to create scores for use in subsequent association studies would often be more powerful than, for example, using the first principal component of PCA.

One potential approach for achieving the goal of preserving signal diversity is to first identify the latent, unique signals in a collection of biomarkers and then combine them equally. While conceptually straightforward, this strategy depends critically on resolving the unique signals, which is a challenging problem given typical sample sizes. For this reason we choose to work directly with the observed biomarkers. We find inspiration from the biological sequence analysis literature, which is briefly described in Section 2.1. In Section 2.2 we propose maximum diversity weights for biomarkers and study the theoretical properties of the estimators. In Section 3, we demonstrate the behavior of maximum diversity weighting through a series of Monte Carlo studies and compare their power with competitive weighting methods in association studies. In Section 4, we apply the proposed methods to a real dataset from the RV144 immune correlates study. We end with a discussion in Section 5.

2. Methods for weighting biomarkers to maximize signal diversity

Due to the dearth of unsupervised methods in the biostatistical literature for maximizing signal diversity, we start our investigation by surveying methods used in computational biology for weighting protein sequences in the training of profile hidden Markov models (Section 2.1). Although these two problems are vastly different on first appearance, further inspection reveals a degree of conceptual similarity that motivates ways of defining signal diversity for a collection of continuous biomarkers (Section 2.2).

2.1. Weighting protein sequences to improve prediction performance

With the advent of high-throughput genome sequencing technologies, the last decade of the twentieth century saw an explosion of statistical and machine learning methods for biological sequence analysis. A central problem in analyzing genome sequences is predicting the functions of a protein based on its amino acid sequence. The most successful set of solutions that emerged was to group protein sequences into protein families and represent each protein family with a profile hidden Markov model [HMM] [7]. Since the amino acid sequences of proteins determine to a very large extent their higher order structures, which, in turn, determine the functions they perform in a biological system, proteins within a family tend to share structural and functional features. Oftentimes some members of a protein family are well known to biologists from genetic screens or other types of screening studies. This knowledge can then be used to provide a starting point for functional analyses of the other members in the family. Estimation of profile HMMs was thus a key part of biological sequence analysis, and sequence weighting was an important aspect of this process.

Suppose we have a family of three amino acid sequences CAA, CGA, and CAA (Figure 1), for which we wish to estimate the profile HMM. If we let each sequence’s contribution to the model be equal, the resulting profile HMM would prefer A over G at position 2, as there are two A’s and one G. Researchers quickly realized that such an approach suffered from serious drawbacks. The collection of protein sequences used in estimation is usually a sample of convenience. That some sequences in the collection closely resemble each other (in our example, CAA appears twice) often stems from oversampling of a region in the sequence space rather than any reason of biological significance. Profile HMMs estimated with uniform weighting tend to under-perform when they are used to predict new members of the protein families because they underestimate the diversity within a protein family.
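The effect of uniform weighting in this toy family can be checked with a few lines of Python. The sketch below is only an illustration: the function name and the alternative weights (1/4, 1/2, 1/4), which split the weight of the duplicated sequence CAA between its two copies, are assumptions made for the example and are not taken from any of the published weighting schemes.

```python
from collections import defaultdict


def weighted_column_frequencies(sequences, weights, position):
    """Weighted residue frequencies at a 1-based alignment position."""
    totals = defaultdict(float)
    for seq, w in zip(sequences, weights):
        totals[seq[position - 1]] += w
    norm = sum(weights)
    return {residue: total / norm for residue, total in totals.items()}


family = ["CAA", "CGA", "CAA"]
uniform = [1 / 3, 1 / 3, 1 / 3]
downweighted = [0.25, 0.5, 0.25]  # hypothetical weights, for illustration only

print(weighted_column_frequencies(family, uniform, 2))
print(weighted_column_frequencies(family, downweighted, 2))
```

Under uniform weights, A receives two thirds of the mass at position 2; under the hypothetical down-weighting, A and G are treated equally, which is the kind of correction that sequence weighting methods aim for.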

Figure 1.


Conceptual similarity between weighting protein sequences and weighting continuous biomarkers for maximum diversity. (a) A group of 3 amino acid sequences arranged vertically. Each sequence (seq.) is assigned a weight. The numbers 1, 2 and 3 index the positions (pos.) within a sequence. (b) A panel of 3 immune response biomarkers measured from three subjects. The biomarker values are represented by color-coded tickmarks. Each color (biomarker) is assigned a weight. The numbers 1, 2, and 3 index the subjects (subj.). The arrows represent the real lines, and the bell curves over the tickmarks illustrate kernel density estimation.

To address this problem, a myriad of protein sequence weighting methods have been proposed (e.g. Section 5.8 of Durbin et al. [8]). Some approaches are more procedural, while others define the weights as the solutions to some maximization problems. As an example of the former, the Voronoi weights [9] are determined by partitioning the sequence space into polygons with each polygon occupied by one sequence and composed of all points closest to that sequence; the weights are then proportional to the volume of space in each polygon. The latter category includes maximum discrimination weights [10], which maximize the weighted posterior probability of the profile HMM given the observed sequences, and maximum entropy weights [11, 12], which maximize the overall entropy of the amino acid distributions across all positions.

2.2. Maximum diversity weights for continuous biomarkers

We propose to define maximum diversity weights for a set of continuous biomarkers as the weights that maximize a diversity measure of the weighted distributions of the biomarkers across all subjects. Suppose we have $p$ immune response biomarkers for $n$ subjects. Assume each biomarker has been scaled to have standard deviation 1. Denote the standardized biomarker values for the $i$th subject by $x_{\cdot i} = (x_{1i}, \ldots, x_{pi})^T \in \mathbb{R}^p$, $i = 1, \ldots, n$. Let $X = (x_{\cdot 1}, \ldots, x_{\cdot n})^T$, an $n \times p$ matrix. Let $w = (w_1, \ldots, w_p)^T$ be the weights corresponding to the $p$ biomarkers. We define the maximum entropy weights for continuous biomarkers as

$$\hat{w}^{e} = \arg\max_{w} \sum_{i=1}^{n} H(\hat{f}_{i,h,w}) \quad \text{subject to } 0 \le w \le 1,\; w^T 1 = 1, \tag{1}$$

where $H(f) = E_f\{-\log(f)\}$ denotes the entropy corresponding to a density function $f$. $\hat{f}_{i,h,w}(x) = \sum_{k=1}^{p} w_k K_h(x - x_{ki})$ is a nonparametric estimate of the distribution of the $p$ biomarkers in the $i$th subject, where $K_h(\cdot) = h^{-1} K(\cdot/h)$, $h$ is the bandwidth parameter, and $K$ is a kernel function, e.g. $K(u) = (2\pi)^{-1/2}\exp(-u^2/2)$. In other words, $\hat{w}^{e}$ maximizes the total entropy of the weighted within-subject distributions of different biomarkers.
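To make the criterion concrete, the following sketch evaluates the entropy of the weighted kernel mixture for one subject by numerical integration. It is a minimal illustration assuming a Gaussian kernel; the function names, the grid size, and the padding of 8 bandwidths beyond the extreme biomarker values are choices made for this example and are not taken from the authors' implementation.

```python
import numpy as np


def gaussian_kernel(u):
    """Standard normal density, K(u) = (2*pi)^(-1/2) * exp(-u^2 / 2)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)


def within_subject_entropy(x_i, w, h=1.0, n_grid=4001, pad=8.0):
    """Entropy of f(x) = sum_k w_k K_h(x - x_ki), via the trapezoid rule."""
    x_i = np.asarray(x_i, dtype=float)
    w = np.asarray(w, dtype=float)
    # Integration grid covering all kernels plus `pad` bandwidths on each side.
    grid = np.linspace(x_i.min() - pad * h, x_i.max() + pad * h, n_grid)
    # f[g] = sum_k w_k * K_h(grid[g] - x_ki)
    f = (gaussian_kernel((grid[:, None] - x_i[None, :]) / h) / h) @ w
    # Integrand -f log f, set to 0 wherever f underflows to 0.
    integrand = np.zeros_like(f)
    positive = f > 0
    integrand[positive] = -f[positive] * np.log(f[positive])
    return float(np.sum((integrand[1:] + integrand[:-1]) * np.diff(grid)) / 2.0)


def total_entropy(X, w, h=1.0):
    """Criterion in (1): sum of within-subject entropies over the rows of X."""
    return sum(within_subject_entropy(row, w, h) for row in X)
```

When all weight is placed on a single biomarker, the mixture reduces to one Gaussian kernel and the entropy equals the closed form 0.5 log(2πe h²); spreading the weight over distinct biomarker values can only increase it, which is what the maximization exploits.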

Computing the maximum entropy weights can be resource-intensive. For a faster method, we propose an alternative definition of maximum diversity weights based on variance. The maximum variance weights are defined as

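$$\hat{w}^{v} = \arg\max_{w} \sum_{i=1}^{n} \left\{ \sum_{k=1}^{p} w_k x_{ki}^2 - \left( \sum_{k=1}^{p} w_k x_{ki} \right)^2 \right\} \quad \text{subject to } 0 \le w \le 1,\; w^T 1 = 1. \tag{2}$$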

In other words, $\hat{w}^{v}$ maximizes the total variance of the weighted within-subject distributions of different biomarkers.
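The maximum variance criterion is a concave quadratic in w, so a simple first-order method is enough to illustrate it. The sketch below uses projected gradient ascent with a Euclidean projection onto the probability simplex. This is a swapped-in technique chosen for brevity, not the constrained optimizer the authors used; all function names, the step size rule, and the iteration count are assumptions of this example. The input is assumed to be already scaled.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    tau = css[rho] / (rho + 1.0)
    return np.maximum(v - tau, 0.0)


def max_variance_objective(X, w):
    """Sum over subjects of sum_k w_k x_ki^2 - (sum_k w_k x_ki)^2."""
    X = np.asarray(X, dtype=float)
    return float(np.sum((X ** 2) @ w - (X @ w) ** 2))


def max_variance_weights(X, n_iter=5000):
    """Projected gradient ascent for the maximum variance weights."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    a = (X ** 2).sum(axis=0)      # a_k = sum_i x_ki^2
    B = X.T @ X                   # objective is a'w - w'Bw
    # Step size 1/L, where L = 2 * lambda_max(B) bounds the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * np.linalg.eigvalsh(B)[-1])
    w = np.full(p, 1.0 / p)       # start from uniform weights
    for _ in range(n_iter):
        w = project_to_simplex(w + step * (a - 2.0 * B @ w))
    return w


# Two-sample toy data (rows are subjects, columns are biomarkers).
toy = np.array([[-0.9, -0.2, 1.2],
                [-0.2, -0.9, 1.2]])
print(np.round(max_variance_weights(toy), 3))
```

For this two-subject toy dataset, which is also discussed in Section 3.1, the optimum can be worked out by hand as (0.26, 0.26, 0.48), in agreement with the maximum variance weights reported there.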

To provide some intuition on our definitions of maximum diversity weights, Figure 1(b) shows a set of three immune response biomarkers from three subjects. Each row corresponds to a subject, and each biomarker is represented with a different color. In subjects 1 and 3, the values of the three biomarkers are relatively close to each other; in subject 2, the blue and red biomarker values are close to each other, but the green biomarker stands apart. Since the green biomarker may pick up a different signal from the blue and red biomarkers, we would like to give it more weight than the uniform weight.

It is worth noting again that the goal of maximum diversity weighting is very different from those of other, more commonly encountered dimension reduction methods such as PCA. The first principal component of PCA (PC1) maximizes variance of the weighted combination across subjects. Intuitively, PC1 captures the most dominant signal in a collection of biomarkers. On the other hand, maximum diversity weighting seeks to create a composite score that is equally reflective of all unique signals in the collection. For example, suppose we have 10 biomarkers (scaled to have standard deviation 1), the first 8 of which are different measurements of one latent signal, and the last 2 of which are different measurements of a second latent signal. PC1 is basically the average of the first 8 biomarkers; however, we want the maximum diversity score to be an average of PC1 and the average of the last 2 biomarkers.

The solutions to the optimization problems in (1) and (2) do not have an explicit form. But the following theorem shows that the criterion functions in both problems are concave. Proof of Theorem 2.1 is given in Supplementary Materials Section A.

Theorem 2.1 Both $\sum_{i=1}^{n} H(\hat{f}_{i,h,w})$ and $\sum_{i=1}^{n} \left\{ \sum_{k=1}^{p} w_k x_{ki}^2 - \left( \sum_{k=1}^{p} w_k x_{ki} \right)^2 \right\}$ are concave functions of $w$.

The above theorem, together with the convexity of the constraint functions, guarantees that the optimization problems in (1) and (2) are convex optimization problems. When p is small, the optimization problems can be solved efficiently using standard constrained optimization algorithms that are part of the R stats package; when p is large, more sophisticated interior point optimization methods as implemented in, for example, MOSEK, are needed. More details can be found in the Supplementary Materials Section E.

2.3. Asymptotic distributions of maximum diversity weights

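We now study the asymptotic properties of the maximum diversity weights estimators. Denote $\theta = (w_1, \ldots, w_{p-1})^T$, with $w_p = 1 - \sum_{k=1}^{p-1} w_k$. For a single observation $x_{\cdot i}$, let

$$m_\theta(x_{\cdot i}) = -\int \sum_{k=1}^{p} w_k K_h(x - x_{ki}) \log\left( \sum_{k=1}^{p} w_k K_h(x - x_{ki}) \right) dx \quad \text{or} \quad \sum_{k=1}^{p} w_k x_{ki}^2 - \left( \sum_{k=1}^{p} w_k x_{ki} \right)^2$$

for the maximum entropy weights or the maximum variance weights method, respectively. Denote $\hat{\theta}_n = \arg\max_{\theta \in \Theta} P_n m_\theta$, where $P_n(\cdot) = n^{-1} \sum_{i=1}^{n} (\cdot)$ is the empirical average operator, and $\Theta$ is the $(p - 1)$-simplex, e.g. when $p = 2$, $\Theta$ is the $[0, 1]$ segment on the real line, and when $p = 3$, $\Theta$ is the isosceles right triangular area in the first quadrant with its two equal sides being $[0, 1]$ on the x-axis and y-axis. Also denote $\theta_0 = (w_{1,0}, \ldots, w_{p-1,0})^T \equiv \arg\max_{\theta \in \Theta} E_x m_\theta$, where the expectation is taken with respect to the joint distribution function $F_x(\cdot)$ of $x_{\cdot i}$.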

Theorem 2.2 Assume (i) we have independent identically distributed (i.i.d.) observations $x_{\cdot i}$, $i = 1, \ldots, n$; (ii) $E\|x_{\cdot i}\|^2 < \infty$ and $F_x(\cdot)$ is absolutely continuous; (iii) there exists a unique $\theta_0$. Then $\hat{\theta}_n \to_p \theta_0$.

Theorem 2.2 follows from Theorem 5.14 of van der Vaart (2000) [13]. This is because $m_\theta$ is continuous in $\theta$, which ensures that $E_x \sup_{\theta \in U} m_\theta < \infty$ for every sufficiently small ball $U$. Adding the fact that $\hat{\theta}_n$ maximizes $P_n m_\theta$, all the conditions of Theorem 5.14 in [13] are satisfied.

We next study the asymptotic distribution of $\hat{\theta}_n$. Let $\dot{m}_\theta$ denote the first derivative of $m_\theta$ with respect to $\theta$ and $V_\theta$ denote the expectation of the second derivative matrix of $m_\theta$ with respect to $\theta$.

Theorem 2.3 In addition to the assumptions in Theorem 2.2, assume that we have a non-singular $(p - 1) \times (p - 1)$ matrix $V_{\theta_0}$ with its $(j, l)$th element defined as

$$E_x\left[ -\int \frac{\{K_h(x - x_j) - K_h(x - x_p)\}\{K_h(x - x_l) - K_h(x - x_p)\}}{\sum_{k=1}^{p} w_{k,0} K_h(x - x_k)} \, dx \right]$$

for the maximum entropy weights and $E_x[-2(x_j - x_p)(x_l - x_p)]$ for the maximum variance weights. Then we have

$$\sqrt{n}\,(\hat{\theta}_n - \theta_0) \to_d N\left(0,\; V_{\theta_0}^{-1} M_{\theta_0} V_{\theta_0}^{-1}\right),$$

where $M_{\theta_0} = E_x\, \dot{m}_{\theta_0} \dot{m}_{\theta_0}^T$.

The proof of Theorem 2.3 is detailed in Supplementary Materials Section A. Briefly, the result follows from Theorem 5.23 of van der Vaart (2000) [13] because $m_\theta$ is twice continuously differentiable at $\theta_0$ for both maximum diversity weights, which allows a two-term Taylor expansion in a neighborhood of $\theta_0$. The asymptotic variance-covariance matrix of $\hat{\theta}_n$ can be consistently estimated by $(P_n V_{\hat{\theta}_n})^{-1} P_n M_{\hat{\theta}_n} (P_n V_{\hat{\theta}_n})^{-1}$.
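For the maximum variance weights, the plug-in sandwich estimator has simple closed-form ingredients, which the sketch below assembles. It is a minimal illustration under stated assumptions: the estimated weights are taken to lie in the interior of the simplex, the score is obtained by differentiating the maximum variance criterion with the last weight written as one minus the sum of the others, and the function name and the delta-method step for the last weight are additions made for this example.

```python
import numpy as np


def max_variance_sandwich(X, w):
    """Plug-in sandwich covariance for theta = (w_1, ..., w_{p-1}).

    X is an n x p matrix of scaled biomarkers and w the estimated weights,
    assumed to lie in the interior of the simplex.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    n, p = X.shape
    D = X[:, :-1] - X[:, [-1]]                    # x_j - x_p for j < p
    s = X @ w                                     # weighted within-subject mean
    # Score: d m / d theta_j = x_j^2 - x_p^2 - 2 s (x_j - x_p)
    G = X[:, :-1] ** 2 - X[:, [-1]] ** 2 - 2.0 * s[:, None] * D
    V = -2.0 * (D.T @ D) / n                      # empirical second derivative
    M = (G.T @ G) / n                             # empirical outer product of scores
    V_inv = np.linalg.inv(V)
    cov_theta = V_inv @ M @ V_inv / n             # estimated covariance of theta-hat
    # Delta method: w = A theta + const, where the last row gives w_p = 1 - sum(theta).
    A = np.vstack([np.eye(p - 1), -np.ones((1, p - 1))])
    se_w = np.sqrt(np.diag(A @ cov_theta @ A.T))  # standard errors for all p weights
    return cov_theta, se_w
```

As a rough sanity check under additional assumptions (independent standard normal biomarkers and uniform weights), a short calculation gives an asymptotic standard deviation of 1/(3√n), about 0.033 at n = 100, which is in line with the value of 0.035 reported for the maximum variance method under C3,a in Table 2.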

3. Numerical studies

We first investigate how well the weighting schemes proposed in the previous section achieve the goal of preserving signal diversity (Section 3.1). We then compare the power of using maximum diversity weighting versus other weighting methods in association studies (Section 3.2).

3.1. Maximum diversity weights

To assess the performance of the proposed weights in achieving the goal of preserving signal diversity, we compare them with “ideal weights.” The ideal weights are not well defined in general, but in some special cases they can be easily identified based on symmetry. We will focus on these special cases. We start with a simple scenario of three biomarkers, and then extend to five biomarkers. This subsection ends with a study on asymptotic variance estimates. In all studies the sample size is 100 and the Monte Carlo replicate number is 1000.

Scenario I: Weights studies with three biomarkers. Suppose we have three biomarkers {X1, X2, X3} with a joint multivariate normal distribution. Let their mean be μ = (0, 0, 0)T and their marginal variances all be 1. We consider two main scenarios with the variance-covariance matrices denoted by

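$$C_{3,a} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad C_{3,b}(\rho) = \begin{pmatrix} 1 & \rho & 0.9 \\ \rho & 1 & \rho \\ 0.9 & \rho & 1 \end{pmatrix}.$$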

In C3,a, since the three variables are independent, the ideal maximum diversity weights would be uniform across markers. As Table 1 shows, both the maximum entropy weights (h = 1.0) and the maximum variance weights behave as expected.

Table 1.

Monte Carlo mean values of the estimated weights in Scenario I of the simulation study. h = 1.0 for maximum entropy weights.

| Covariance | Max entropy x1 | Max entropy x2 | Max entropy x3 | Max variance x1 | Max variance x2 | Max variance x3 |
|---|---|---|---|---|---|---|
| C3,a | 0.334 | 0.333 | 0.333 | 0.333 | 0.334 | 0.333 |
| C3,b(0) | 0.259 | 0.480 | 0.261 | 0.253 | 0.491 | 0.256 |
| C3,b(0.1) | 0.260 | 0.478 | 0.262 | 0.254 | 0.490 | 0.256 |
| C3,b(0.2) | 0.261 | 0.476 | 0.263 | 0.255 | 0.488 | 0.257 |
| C3,b(0.3) | 0.262 | 0.474 | 0.264 | 0.256 | 0.486 | 0.259 |
| C3,b(0.4) | 0.264 | 0.471 | 0.266 | 0.257 | 0.482 | 0.260 |
| C3,b(0.5) | 0.266 | 0.466 | 0.268 | 0.260 | 0.478 | 0.263 |
| C3,b(0.6) | 0.269 | 0.460 | 0.271 | 0.263 | 0.471 | 0.266 |
| C3,b(0.7) | 0.275 | 0.448 | 0.277 | 0.270 | 0.458 | 0.272 |
| C3,b(0.8) | 0.288 | 0.423 | 0.289 | 0.284 | 0.432 | 0.285 |
| C3,b(0.9) | 0.333 | 0.334 | 0.334 | 0.333 | 0.334 | 0.334 |

In C3,b, we fix the correlation coefficient between X1 and X3 at 0.9, and let the correlation coefficient between X2 and either X1 or X3 be ρ ∈{0, 0.1,… ,0.9}. When ρ equals 0, we have close to two unique signals; X1 and X3 should each get about one quarter of the total weight while X2 should get close to half of the total weight. When ρ equals 0.9, X1, X2, and X3 are exchangeable; the three variables should get the same weights. Table 1 shows that both maximum entropy weights and maximum variance weights behave as expected.
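The expected weights can also be approximated at the population level. For mean-zero biomarkers with unit variances and correlation matrix C, the expectation of the maximum variance criterion for one subject is 1 − wᵀCw whenever the weights sum to one, so the population maximum variance weights minimize wᵀCw. When the unconstrained minimizer on the sum-to-one hyperplane is non-negative, it is proportional to C⁻¹1. The sketch below evaluates this closed form for C3,b(ρ). The derivation and the function names are additions made for illustration and do not appear in the original analysis.

```python
import numpy as np


def population_max_variance_weights(C):
    """Minimizer of w'Cw subject to sum(w) = 1, valid when it is non-negative."""
    C = np.asarray(C, dtype=float)
    raw = np.linalg.solve(C, np.ones(C.shape[0]))
    if np.any(raw < 0):
        # The closed form leaves the simplex; a constrained solver is needed instead.
        raise ValueError("closed-form solution has negative weights")
    return raw / raw.sum()


def c3b(rho):
    """Correlation matrix C3,b(rho) from Scenario I."""
    return np.array([[1.0, rho, 0.9],
                     [rho, 1.0, rho],
                     [0.9, rho, 1.0]])


for rho in (0.0, 0.7, 0.9):
    print(rho, np.round(population_max_variance_weights(c3b(rho)), 3))
```

For ρ = 0 the closed form gives about (0.256, 0.487, 0.256), for ρ = 0.7 about (0.273, 0.455, 0.273), and for ρ = 0.9 exactly uniform weights, which tracks the maximum variance columns of Table 1 closely.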

We now study how the choice of the bandwidth parameter h affects the maximum entropy weights using C3,b(ρ). Figure 2 plots the Monte Carlo means of the weights assigned to x2 as functions of ρ by the maximum entropy method with h ∈ {0.1, nrd, 1.0} and the maximum variance method. When h = nrd, we use a variation of Silverman’s rule of thumb to select the bandwidth subject-wise as implemented by the bw.nrd function in R [14], and take the median of the selected bandwidths as h. The results show that when h = 1.0, the behavior of the maximum entropy weights is very close to that of the maximum variance weights. When h = 0.1, the maximum entropy weights assigned to x2 drop toward 1/3. The results obtained using h = nrd fall between those obtained with h = 1.0 and h = 0.1.
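The subject-wise bandwidth rule can be sketched in Python as follows. The first function is a port of the formula used by R's bw.nrd, 1.06 · min(sd, IQR/1.34) · m^(−1/5), applied to the m biomarker values of one subject; the second takes the median across subjects. The Python function names are hypothetical, and how the original analysis handled ties or degenerate subjects is not specified here.

```python
import numpy as np


def bw_nrd(x):
    """Port of R's bw.nrd: 1.06 * min(sd, IQR / 1.34) * m^(-1/5)."""
    x = np.asarray(x, dtype=float)
    if x.size < 2:
        raise ValueError("need at least 2 data points")
    # numpy's default linear interpolation matches R's default quantile type 7.
    q25, q75 = np.percentile(x, [25, 75])
    spread = min(np.std(x, ddof=1), (q75 - q25) / 1.34)
    return 1.06 * spread * x.size ** (-0.2)


def median_subjectwise_bandwidth(X):
    """Median over subjects (rows of X) of the per-subject bw.nrd bandwidth."""
    return float(np.median([bw_nrd(row) for row in np.asarray(X, dtype=float)]))
```

Because the rule is applied to the p biomarker values within a subject rather than to n observations of one variable, the selected bandwidth depends on p and on the within-subject spread of the biomarkers.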

Figure 2.


The weight assigned to x2 versus ρ when the correlation matrix is C3,b(ρ) in Scenario I of the simulation study. Three values of the bandwidth parameter h under the maximum entropy approach are compared with the maximum variance approach. When h = nrd, the bandwidth is selected using a standard bandwidth selection procedure for density estimation [14].

To help explain the effect of the bandwidth parameter choice, let us consider a dataset with just two samples: $x_{\cdot 1} = (-0.9, -0.2, 1.2)^T$, $x_{\cdot 2} = (-0.2, -0.9, 1.2)^T$. In panels (a) and (b) of Figure B.1 in the Supplementary Materials, we draw three Gaussian kernel functions with h = 1.0 and h = 0.1 at −0.9, −0.2 and 1.2, respectively. In panel (a) there is substantial overlap between the kernel functions, while there is almost no overlap in panel (b). The estimated maximum entropy weights for this dataset when h = 1.0 are {0.264, 0.264, 0.472}, close to the estimated maximum variance weights {0.260, 0.260, 0.480}; the estimated maximum entropy weights when h = 0.1 are {0.334, 0.334, 0.333}, close to {1/3, 1/3, 1/3}. This suggests that when using maximum entropy weights with small h, the biomarkers behave like categorical random variables. We further test this idea by considering another dataset of two samples: $x_{\cdot 1} = (-1.5, -0.2, 1.5)^T$, $x_{\cdot 2} = (-0.2, -1.5, 1.5)^T$. When h = 0.1, the overlap between the three kernel functions remains negligible, and the estimated maximum entropy weights are {0.334, 0.334, 0.333}. Because the biomarkers we study are continuous in nature and each biomarker is scaled to have standard deviation 1, we choose h = 1.0 for maximum entropy weighting in the remainder of the numerical studies. It is worth noting, however, that h = nrd provides a useful alternative when a less aggressive weight adjustment than maximum variance weighting (in other words, weights closer to the uniform weights) is desired.

Scenario II: Weights studies with five variables. Suppose we have five biomarkers {X1,X2,X3,X4,X5} with a joint multivariate normal distribution. Let each biomarker be distributed as a standard normal random variable marginally.

First, consider a set of correlation matrices C5,a (ρ) for ρ ∈ {0, 0.1,…, 0.9}:

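$$C_{5,a}(\rho) = \begin{pmatrix} 1 & 0.9 & \rho & 0 & 0 \\ 0.9 & 1 & \rho & 0 & 0 \\ \rho & \rho & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0.9 \\ 0 & 0 & 0 & 0.9 & 1 \end{pmatrix}, \qquad C_{5,b}(\rho) = \begin{pmatrix} 1 & \rho & 0 & 0 & 0 \\ \rho & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & \rho \\ 0 & 0 & 0 & \rho & 1 \end{pmatrix}.$$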

When ρ = 0, there are approximately three unique signals in the group; when ρ = 0.9, the number of unique signals drops to approximately two. Figure 3(a) shows that the sum of w1, w2 and w3 goes from near 2/3 to 1/2 as ρ increases from 0 to 0.9 for both the maximum entropy weights and the maximum variance weights. Next, consider another set of correlation matrices C5,b (ρ) for ρ ∈ {0, 0.1,…, 0.9}. As ρ increases from 0 to 0.9, the number of unique signals goes from 5 to approximately 3. Figure 3(b) shows that w3 goes from 1/5 to nearly 1/3 as ρ increases, as would be expected. Table B.1 of the Supplementary Materials lists the weight estimates numerically.

Figure 3.


Simulation study Scenario II. (a) The sum of weights assigned to x1, x2, x3 versus ρ when the correlation matrices are C5,a(ρ), (b) the weight assigned to x3 versus ρ when the correlation matrices are C5,b(ρ).

We use the simulation setups from this scenario to evaluate the impact of non-normal data. Figures B.5 and B.6 in the Supplementary Materials show the simulation results when the marginal distributions are Student’s t and gamma, respectively. We find that the estimated maximum entropy weights and maximum variance weights follow the same patterns as when the marginal distributions are Gaussian.

The simulation setups so far all have p < n. A study presented in Section D of the Supplementary Materials shows that the estimated maximum variance weights behave differently from our intuition when p > n and that we should be careful when considering using maximum diversity weights when p is large relative to n.

Asymptotic variances.

To study the performance of the estimated variances of the estimated maximum diversity weights, we choose two simulation setups from Scenario I that correspond to the covariance matrices C3,a and C3,b(0.7). Summaries of the simulation results from 10,000 replicates are shown in Table 2. For the simulation setup with covariance matrix C3,a, the true weight is taken to be {1/3, 1/3, 1/3}. To determine the true weights for the simulation setup with covariance matrix C3,b(0.7), we simulate datasets of sample size 10,000 and take the median of the estimated weights from 1000 replicates. In Table 2, we see that for both the maximum entropy weights and the maximum variance weights, the Monte Carlo standard deviations of the estimated weights are close to the Monte Carlo mean of the standard error estimates, and the coverage probabilities are close to the nominal 95% level. It is also interesting to note that the estimated maximum entropy weights appear to be less variable than the estimated maximum variance weights under both simulation setups.

Table 2.

Simulation study for variance estimation. SD, Monte Carlo standard deviation; Mean SE, Monte Carlo mean of standard error estimates; CP, coverage probability of 95% confidence intervals. Under C3,a, $\theta_0 = (0.333, 0.333)^T$ for both maximum entropy weights and maximum variance weights; under C3,b(0.7), $\theta_0 = (0.277, 0.446)^T$ and $(0.272, 0.455)^T$ for maximum entropy weights and maximum variance weights, respectively.

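| Weight | Statistic | Max entropy, C3,a | Max entropy, C3,b(0.7) | Max variance, C3,a | Max variance, C3,b(0.7) |
|---|---|---|---|---|---|
| w1 | % bias | 0.020 | −0.151 | −0.002 | −0.224 |
| w1 | SD | 0.023 | 0.055 | 0.035 | 0.079 |
| w1 | Mean SE | 0.023 | 0.054 | 0.034 | 0.077 |
| w1 | CP | 0.942 | 0.946 | 0.934 | 0.953 |
| w2 | % bias | 0.008 | 0.207 | 0.031 | 0.376 |
| w2 | SD | 0.023 | 0.010 | 0.035 | 0.012 |
| w2 | Mean SE | 0.023 | 0.011 | 0.034 | 0.012 |
| w2 | CP | 0.942 | 0.940 | 0.932 | 0.968 |
| w3 | % bias | −0.028 | −0.068 | −0.029 | −0.159 |
| w3 | SD | 0.023 | 0.055 | 0.035 | 0.079 |
| w3 | Mean SE | 0.023 | 0.054 | 0.034 | 0.077 |
| w3 | CP | 0.940 | 0.944 | 0.934 | 0.952 |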

3.2. Regression studies

One major application of maximum diversity weighting is to use the thus-derived score in association studies. To compare the power of using maximum diversity weights and other weighting methods in this application setting, we set up experiments using parameters that mimic the HIV-1 immune correlates studies. Specifically, we simulate an outcome variable from the following logistic regression model

$$\text{logit}(P(Y_i = 1)) = z_i^T \alpha + x_i^{*T} \beta.$$

Here $z_i$ is a covariate vector that may include an intercept and clinical variables, and $x_i^{*} = (x_{1i}^{*}, \ldots, x_{li}^{*})^T$ is a vector of latent, unobserved variables that underlie the observed biomarkers $x_i = (x_{1i}, \ldots, x_{pi})^T$. It is assumed that $p > l$. We start with $l = 2$ in Scenario III, and then extend to $l = 3$ in Scenario IV.

Given the observed data, we create a score variable si based on xi. We then regress Y on z and s, and test the null hypothesis that s is not associated with Y at alpha level 0.05. Four methods for creating the score variable are compared. In addition to the two proposed maximum diversity weighting methods, the uniform approach weighs each of (x1i,…, xpi) equally; the PCA approach uses the first principal component as the score.

Scenario III: Regression studies with two latent variables. Suppose $l = 2$, $p = 10$, and $n = 100$. Let $z_i$ be intercept only. Let $x_{1i}^{*}$ and $x_{2i}^{*}$ be two independent standard normal latent variables, and let $x_{ri} = x_{1i}^{*} + \varepsilon_{ri}$ for $r \in \{1, 2\}$ and $x_{ri} = x_{2i}^{*} + \varepsilon_{ri}$ for $r \in \{3, \ldots, 10\}$, where $(\varepsilon_{1i}, \ldots, \varepsilon_{10,i})$ are i.i.d. $N(0, \sigma_\varepsilon)$. The latent structure induces a correlation $\rho$ between the observed biomarkers $x_{1i}$ and $x_{2i}$, and among $x_{3i}, \ldots, x_{10,i}$. Three values of $\sigma_\varepsilon$ are examined, which correspond to $\rho \in \{0.8, 0.5, 0.2\}$. The estimated probabilities of rejecting the null hypothesis are shown in Table 3.
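The data-generating mechanism of this scenario can be sketched as follows. Under the stated model each observed biomarker has variance one plus the noise variance and shares covariance 1 with biomarkers driven by the same latent variable, so the within-group correlation is ρ = 1/(1 + noise variance); the sketch therefore parameterizes the noise directly by ρ. The function names, the zero intercept, and the use of standardized biomarkers for the uniform and PC1 scores are assumptions made for this illustration.

```python
import numpy as np


def simulate_scenario_iii(n, rho, beta, rng, intercept=0.0):
    """Simulate observed biomarkers X, binary outcome y and latent variables.

    The intercept value is hypothetical; it is not specified in the text.
    """
    noise_var = 1.0 / rho - 1.0                 # from rho = 1 / (1 + noise_var)
    latent = rng.standard_normal((n, 2))
    # Biomarkers 1-2 load on latent variable 1, biomarkers 3-10 on latent variable 2.
    membership = np.array([0, 0] + [1] * 8)
    X = latent[:, membership] + np.sqrt(noise_var) * rng.standard_normal((n, 10))
    eta = intercept + latent @ np.asarray(beta, dtype=float)
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    return X, y, latent


def uniform_and_pc1_scores(X):
    """Uniform-weight score and first principal component of standardized X."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    uniform = Z.mean(axis=1)
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    pc1 = Z @ vt[0]                             # sign of PC1 is arbitrary
    return uniform, pc1


rng = np.random.default_rng(0)
X, y, latent = simulate_scenario_iii(n=100, rho=0.8, beta=(1.2, 0.0), rng=rng)
uniform_score, pc1_score = uniform_and_pc1_scores(X)
```

Because eight of the ten biomarkers are driven by the second latent variable, PC1 is dominated by that group, which helps explain why the PCA score has little power when only the first latent variable is associated with the outcome.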

Table 3.

Estimated size (β1, $10^4$ replicates) and power (β2 to β4, $10^3$ replicates) in Scenario III of the simulation study. $\beta_1 = (0, 0)^T$, $\beta_2 = (1.2, 0)^T$, $\beta_3 = (0, 0.6)^T$, $\beta_4 = (0.5, 0.5)^T$. Four methods for creating scores are compared: PCA, first principal component; unif, uniform weight; entropy, maximum entropy weight; and var, maximum variance weight. glasso shows the probability of selecting the group using grouped LASSO.

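ρ = 0.8

| | PCA | unif | entropy | var | glasso |
|---|---|---|---|---|---|
| β1 | 0.045 | 0.043 | 0.043 | 0.045 | 0.047 |
| β2 | 0.061 | 0.193 | 0.871 | 0.874 | 0.904 |
| β3 | 0.786 | 0.764 | 0.505 | 0.485 | 0.358 |
| β4 | 0.592 | 0.753 | 0.867 | 0.868 | 0.481 |

ρ = 0.5

| | PCA | unif | entropy | var | glasso |
|---|---|---|---|---|---|
| β1 | 0.043 | 0.043 | 0.043 | 0.043 | 0.046 |
| β2 | 0.064 | 0.173 | 0.664 | 0.712 | 0.744 |
| β3 | 0.737 | 0.717 | 0.513 | 0.482 | 0.315 |
| β4 | 0.559 | 0.714 | 0.794 | 0.793 | 0.379 |

ρ = 0.2

| | PCA | unif | entropy | var | glasso |
|---|---|---|---|---|---|
| β1 | 0.044 | 0.044 | 0.044 | 0.043 | 0.047 |
| β2 | 0.067 | 0.136 | 0.287 | 0.323 | 0.343 |
| β3 | 0.591 | 0.564 | 0.455 | 0.415 | 0.224 |
| β4 | 0.437 | 0.570 | 0.586 | 0.575 | 0.226 |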

Four different β’s are examined. The size of the tests is checked under $\beta_1$; all methods have close to nominal type 1 error rates. Under $\beta_2 = (1.2, 0)^T$, only the first latent variable $x_{1i}^{*}$ is associated with the outcome, and both maximum diversity weighting methods outperform the uniform and PCA approaches; under $\beta_3 = (0, 0.6)^T$, only the second latent variable $x_{2i}^{*}$ is associated with the outcome, and we see the opposite trend. The performance differences are more notable when ρ is higher. Under $\beta_4 = (0.5, 0.5)^T$, both latent variables are associated with the outcome, and both maximum diversity weights perform better than the uniform and PCA weights.

The trade-off in power between $\beta_2$ and $\beta_3$ arises because many more observed biomarkers are related to the second latent variable $x_{2i}^{*}$ than to the first latent variable $x_{1i}^{*}$, and the maximum diversity weighting methods are designed to even out the contributions of unique signals. An advantage of the maximum diversity weights is that if we do not have any prior information that favors either $\beta_2$ or $\beta_3$, the worst outcome for the maximum diversity weights is substantially better than the worst outcome for the competing approaches.

An alternative approach to studying the association between the outcome and a group of p biomarkers is through grouped LASSO (e.g. [15, 16]). We fit logistic regression models with grouped penalties using the R package grpreg and use the AIC criterion to select the regularization parameter. Table 3 shows the probabilities of selecting the group based on $10^4$ MC replicates under $\beta_1$ and $10^3$ MC replicates under $\beta_2$, $\beta_3$ and $\beta_4$. We tune the grid size of the regularization parameter so that when there is no association between the outcome and the group, under $\beta_1$, the probability of selecting the group using grouped LASSO is around 0.05, which is comparable to the type 1 error rates of the hypothesis testing methods. When there is a true association between the outcome and the group, the results depend on β. Under $\beta_2$, the probability of selecting the group using grouped LASSO is slightly higher than the power of the maximum diversity weights (entropy and variance), and higher than the power of uniform weights and the first principal component. Under $\beta_3$, both grouped LASSO and maximum diversity weights underperform uniform weights and the first principal component, but maximum diversity weights outperform grouped LASSO substantially. Under $\beta_4$, grouped LASSO performs worst among all methods, while maximum diversity weights perform best among all methods. These results suggest that grouped LASSO and weighting methods may complement each other: while grouped LASSO performs well when a small number of biomarkers have large effect sizes, weighting methods appear to excel when effect sizes are smaller.

Scenario IV: Regression studies with three latent variables. Suppose l = 3, p = 10, and n = 100. Let zi be intercept only. Let x1i, x2i and x3i be three independent standard normal latent variables, and let the observed biomarkers be xri = x1i + εri for r ∈ {1, 2}, xri = x2i + εri for r ∈ {3, 4, 5, 6}, and xri = x3i + εri for r ∈ {7, 8, 9, 10}, where (ε1i, …, ε10,i) are i.i.d. N(0, σε). The latent structure induces a correlation ρ within each of the observed biomarker groups {x1i, x2i}, {x3i, x4i, x5i, x6i}, and {x7i, x8i, x9i, x10i}. Three values of σε are examined, which correspond to ρ ∈ {0.8, 0.5, 0.2}. Six different β’s are examined, and the estimated probabilities of rejecting the null hypothesis are shown in Table 4.
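The data-generating step of Scenario IV can be sketched in a few lines of Python. This is an illustration only, not the authors' simulation code. It assumes that σε denotes the standard deviation of the noise, in which case two biomarkers sharing a unit-variance latent variable have correlation ρ = 1/(1 + σε²), so that σε = √(1/ρ − 1). The function names (`sigma_for_rho`, `simulate_biomarkers`, `pearson`, `column`) and the large sample size used for the check are hypothetical choices made for this sketch.

```python
import math
import random

def sigma_for_rho(rho):
    """Noise SD that yields within-group correlation rho (assumes unit-variance latents)."""
    return math.sqrt(1.0 / rho - 1.0)

# Index (0, 1, 2) of the latent variable behind each of the p = 10 observed biomarkers:
# two biomarkers for the first latent variable, four each for the second and third.
GROUPS = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

def simulate_biomarkers(n, rho, rng):
    """Return n rows of 10 observed biomarkers, each a latent variable plus independent noise."""
    sigma = sigma_for_rho(rho)
    rows = []
    for _ in range(n):
        latent = [rng.gauss(0.0, 1.0) for _ in range(3)]
        rows.append([latent[g] + rng.gauss(0.0, sigma) for g in GROUPS])
    return rows

def pearson(u, v):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def column(rows, j):
    """Extract biomarker j (0-based) from the simulated rows."""
    return [row[j] for row in rows]

# A large n is used here only to check the correlation structure; the paper uses n = 100.
rng = random.Random(2019)
data = simulate_biomarkers(20000, 0.5, rng)
within = pearson(column(data, 2), column(data, 3))   # same latent variable
between = pearson(column(data, 0), column(data, 2))  # different latent variables
print("within-group correlation:", round(within, 3))
print("between-group correlation:", round(between, 3))
```

With ρ = 0.5 the within-group sample correlation should be close to 0.5 and the between-group correlation close to 0, which is the block structure that the maximum diversity weights are meant to balance.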

Table 4.

Estimated size (β1, 10⁴ replicates) and power (β2 to β6, 10³ replicates) in Scenario IV of the simulation study. β1 = (0, 0, 0)ᵀ, β2 = (1.5, 0, 0)ᵀ, β3 = (0, 1, 0)ᵀ, β4 = (0.5, 0.5, 0)ᵀ, β5 = (0, 0.5, 0.5)ᵀ, β6 = (0.4, 0.4, 0.4)ᵀ. Four methods for creating scores are compared: PCA, first principal component; unif, uniform weight; entropy, maximum entropy weight; and var, maximum variance weight. glasso shows the probability of selecting the group using grouped LASSO.

| | PCA (ρ = 0.8) | unif (ρ = 0.8) | entropy (ρ = 0.8) | var (ρ = 0.8) | PCA (ρ = 0.5) | unif (ρ = 0.5) | entropy (ρ = 0.5) | var (ρ = 0.5) | PCA (ρ = 0.2) | unif (ρ = 0.2) | entropy (ρ = 0.2) | var (ρ = 0.2) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| β1 | 0.045 | 0.044 | 0.044 | 0.045 | 0.043 | 0.044 | 0.044 | 0.045 | 0.043 | 0.043 | 0.044 | 0.044 |
| β2 | 0.149 | 0.398 | 0.812 | 0.825 | 0.147 | 0.335 | 0.603 | 0.624 | 0.139 | 0.233 | 0.309 | 0.317 |
| β3 | 0.795 | 0.791 | 0.661 | 0.637 | 0.688 | 0.705 | 0.588 | 0.568 | 0.472 | 0.456 | 0.417 | 0.391 |
| β4 | 0.311 | 0.570 | 0.674 | 0.670 | 0.273 | 0.476 | 0.542 | 0.543 | 0.176 | 0.297 | 0.305 | 0.297 |
| β5 | 0.490 | 0.821 | 0.719 | 0.702 | 0.487 | 0.753 | 0.654 | 0.632 | 0.332 | 0.516 | 0.478 | 0.462 |
| β6 | 0.400 | 0.843 | 0.849 | 0.846 | 0.358 | 0.770 | 0.780 | 0.766 | 0.226 | 0.529 | 0.520 | 0.500 |

The results are similar to those of the previous scenario. Under the simulation scenario corresponding to β1, we see that all methods have close to nominal type 1 error rates. We also see a trade-off between the maximum diversity weights and the other weights under β2 and β3, each of which has one non-zero coefficient, as well as under β4 and β5, each of which has two non-zero coefficients. Under β6, all three latent variables are associated with the outcome, and the maximum diversity weights perform similarly to the uniform weight and substantially better than the PCA weight. Taken together, the worst outcome for the maximum diversity weights is substantially better than the worst outcome for the competing approaches under the β’s for the power study.
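The worst-case comparison can be checked directly from the Table 4 entries. The short Python sketch below copies the power estimates for β2 to β6 from the table and takes, for each method and each ρ, the minimum over the five alternatives. The variable and function names are hypothetical and chosen only for this illustration.

```python
METHODS = ["PCA", "unif", "entropy", "var"]

# Power estimates copied from Table 4 (rows beta2 to beta6), keyed by rho.
# Within each row the columns follow the order in METHODS.
TABLE4_POWER = {
    0.8: [
        [0.149, 0.398, 0.812, 0.825],
        [0.795, 0.791, 0.661, 0.637],
        [0.311, 0.570, 0.674, 0.670],
        [0.490, 0.821, 0.719, 0.702],
        [0.400, 0.843, 0.849, 0.846],
    ],
    0.5: [
        [0.147, 0.335, 0.603, 0.624],
        [0.688, 0.705, 0.588, 0.568],
        [0.273, 0.476, 0.542, 0.543],
        [0.487, 0.753, 0.654, 0.632],
        [0.358, 0.770, 0.780, 0.766],
    ],
    0.2: [
        [0.139, 0.233, 0.309, 0.317],
        [0.472, 0.456, 0.417, 0.391],
        [0.176, 0.297, 0.305, 0.297],
        [0.332, 0.516, 0.478, 0.462],
        [0.226, 0.529, 0.520, 0.500],
    ],
}

def worst_case_power(rows):
    """Minimum power over the alternatives, computed separately for each method."""
    return {m: min(row[j] for row in rows) for j, m in enumerate(METHODS)}

worst = {rho: worst_case_power(rows) for rho, rows in TABLE4_POWER.items()}
for rho, by_method in worst.items():
    print("rho =", rho, by_method)
```

For ρ = 0.8 the minima are 0.149 (PCA), 0.398 (unif), 0.661 (entropy) and 0.637 (var); for ρ = 0.5 they are 0.147, 0.335, 0.542 and 0.543; and for ρ = 0.2 they are 0.139, 0.233, 0.305 and 0.297. At every ρ both maximum diversity weights have a higher worst-case power than PCA and uniform weights, although the gap narrows as ρ decreases.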

4. Real data example

As described in the Introduction, the RV144 immune correlates study [5] performed a broad characterization of the innate, antibody and cellular immune responses elicited by the ALVAC/AIDSVAX prime-boost HIV-1 vaccine. These immune response biomarkers were divided into six primary functional categories. One variable was chosen for each category, and the six variables were included in multiple regression models to study the association between the risk of HIV-1 infection and each immune response. One of the categories, IgA antibodies, includes 11 immune response biomarkers. The Spearman correlation coefficients between these biomarkers range from 0.31 to 0.91 (Figure B.2 of the Supplementary Materials). Here we compare four different methods of creating an IgA score variable. For the other five categories, we adopt the same representative variables as those used in the original study. All regression variables are standardized to have mean 0 and standard deviation 1.

Two logistic regression models are investigated. Model I comprises six main effects representing the six immune response categories, and Model II includes an additional interaction term between IgA and antibody-dependent cellular cytotoxicity (ADCC) because of a link between these two functional categories [17]. To compare the power of the different methods, we create 10³ bootstrap replicates of the dataset and estimate the probability of detecting a significant association between the IgA score and the outcome in Model I, and the probability of detecting a significant interaction between IgA and ADCC in Model II, both at two-sided alpha level 0.05. The results in Table 5 show that using maximum diversity weights to create the IgA score leads to higher probabilities of rejecting the null hypotheses in both models than using uniform weights or the first principal component.
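The bootstrap procedure can be outlined as follows: resample the dataset with replacement, rerun the test on each replicate, and report the fraction of replicates in which the test is significant. The Python sketch below is a simplified illustration rather than the analysis used in the paper. In particular, it replaces the logistic regression test of the IgA coefficient with a large-sample z-test that compares a score between cases and controls, so that the example needs no external libraries. The toy data and all function names are hypothetical.

```python
import math
import random

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def mean_difference_test(rows):
    """Stand-in test (not the paper's logistic regression): z-test comparing the
    mean score of cases and controls. rows holds (score, outcome) pairs, outcome in {0, 1}."""
    groups = {0: [], 1: []}
    for score, outcome in rows:
        groups[outcome].append(score)
    if min(len(groups[0]), len(groups[1])) < 2:
        return 1.0  # too few observations in one arm; treat as non-significant
    summaries = []
    for g in (0, 1):
        mean = sum(groups[g]) / len(groups[g])
        var = sum((s - mean) ** 2 for s in groups[g]) / (len(groups[g]) - 1)
        summaries.append((mean, var / len(groups[g])))
    se = math.sqrt(summaries[0][1] + summaries[1][1])
    if se == 0.0:
        return 1.0
    return two_sided_p_from_z((summaries[1][0] - summaries[0][0]) / se)

def bootstrap_rejection_rate(rows, test, n_boot=1000, alpha=0.05, seed=0):
    """Fraction of bootstrap replicates in which test(replicate) is below alpha."""
    rng = random.Random(seed)
    n = len(rows)
    rejections = 0
    for _ in range(n_boot):
        replicate = [rows[rng.randrange(n)] for _ in range(n)]
        if test(replicate) < alpha:
            rejections += 1
    return rejections / n_boot

# Toy data (hypothetical): 100 controls and 100 cases, with the score shifted by 1 in cases.
toy_rng = random.Random(1)
toy = [(toy_rng.gauss(1.0 if y else 0.0, 1.0), y) for y in [0, 1] * 100]
rate = bootstrap_rejection_rate(toy, mean_difference_test, n_boot=200)
print("estimated probability of rejection:", rate)
```

To compare scoring methods in this way, the same resampled rows would be scored under each weighting scheme and the rejection rates compared, which is the quantity reported in Table 5.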

Table 5.

Estimated probabilities of detecting the association between IgA and the risk of HIV-1 infection in Model I and the interaction between IgA and ADCC in Model II (10³ bootstrap replicates) in the RV144 example. Four methods for creating the IgA score are compared: PCA, first principal component; unif, uniform weights; entropy, maximum entropy weights; and var, maximum variance weights.

| | PCA | unif | entropy | var |
|---|---|---|---|---|
| Model I: IgA | 0.429 | 0.469 | 0.562 | 0.610 |
| Model II: IgA*ADCC | 0.333 | 0.371 | 0.479 | 0.527 |

5. Discussion

In this paper we studied an unsupervised dimension-reduction task aimed at preserving diversity within a collection of continuous biomarker measurements. Such unsupervised methods can be useful for preprocessing biomarkers in an analysis framework that calls for scores that summarize a collection of related biomarkers.

We proposed two criterion functions for measuring biomarker dispersion. We proved that the optimization problems for estimating the weights are convex, and derived the asymptotic variances of the estimators. Monte Carlo studies suggest that the estimated weights are very close to the ideal weights when such weights can be determined. The numerical results also suggest that, when used to create score variables for regression studies, maximum diversity weighting yields substantially better worst-case power than uniform weights or the first principal component. An implementation of the proposed methods can be found in the mdw package available from the Comprehensive R Archive Network.

The inspiration for our proposed methods comes from methods for weighting protein sequences in biological sequence analysis [11, 12]. The same methods can be applied to combining discrete biomarkers that share the same categories. We give an example of this in Section C of the Supplementary Materials. If the discrete biomarkers to be combined have different categories, or if we are to combine a collection of mixed discrete and continuous biomarkers, it becomes more difficult to define a criterion function that captures signal diversity. A potential solution is to transform the biomarkers in the collection in a way that makes them more comparable. The best way to perform the transformation would depend on the specific application.

In the RV144 immune correlates studies that motivated this study, there is more than one collection of biomarkers for which we need to create summary scores. There is no overlap between these collections of biomarkers, and we can simply apply maximum diversity weighting to each collection. In other applications, overlapping collections of biomarkers may be common. The proposed weighting methods can technically be used for each collection; however, we recommend caution when doing so, as other types of weighting methods may be more scientifically appropriate. For example, suppose the biomarkers are gene expression markers from two different pathways and we want the summary scores to capture the activities of the pathways. It is quite common for a gene to play multiple roles in different contexts. Whether it is more appropriate to use maximum diversity weighting or, say, the first principal component to create the scores would largely depend on the scientific question of interest.

When it is prohibitively expensive to measure biomarkers for every study participant, cost-effective study designs such as case-control studies [18] are needed. The proposed maximum diversity weights can be extended to such designs through inverse probability weighting. Details of the implementation and additional Monte Carlo studies are given in Section F of the Supplementary Materials.
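The general idea of inverse probability weighting can be illustrated with a small Python sketch. It does not reproduce the implementation described in the Supplementary Materials. It only shows, on a hypothetical cohort, how weighting each sampled participant by the inverse of the sampling probability recovers a cohort-level quantity (here a biomarker mean) when all cases but only a fraction of controls are measured. The cohort size, case rate, sampling probability and function name `ipw_mean` are assumptions made for this example.

```python
import random

def ipw_mean(values, sampling_probs):
    """Inverse-probability-weighted mean: each sampled unit is weighted by 1 / pi."""
    weights = [1.0 / p for p in sampling_probs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical cohort: about 5% cases, with the biomarker shifted upward in cases.
rng = random.Random(7)
cohort = []
for _ in range(10000):
    is_case = rng.random() < 0.05
    cohort.append((rng.gauss(1.0 if is_case else 0.0, 1.0), is_case))

# Measure the biomarker on every case but only on about 10% of controls.
CONTROL_SAMPLING_PROB = 0.1
sample = [
    (x, 1.0 if is_case else CONTROL_SAMPLING_PROB)
    for x, is_case in cohort
    if is_case or rng.random() < CONTROL_SAMPLING_PROB
]

cohort_mean = sum(x for x, _ in cohort) / len(cohort)
naive_mean = sum(x for x, _ in sample) / len(sample)
weighted_mean = ipw_mean([x for x, _ in sample], [p for _, p in sample])
print("full cohort mean:", round(cohort_mean, 3))
print("unweighted mean in the sample:", round(naive_mean, 3))
print("inverse-probability-weighted mean:", round(weighted_mean, 3))
```

Because cases are over-represented in the sample, the unweighted mean is pulled toward the case distribution, while the weighted mean stays close to the full-cohort value. The same weights could in principle be applied to the sample moments that enter a diversity criterion.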

Supplementary Material

Supp info

Acknowledgement

The authors thank the Editor, the AE and two anonymous referees for their thoughtful and constructive comments. The authors are indebted to the participants and investigators of RV144, in particular Georgia Tomaras and Guido Ferrari, for providing the biomarker data for the example. The authors thank Peter B. Gilbert for insightful comments and Lindsay N. Carpp for help with editing. This work was supported by the National Institutes of Health (R01-AI122991; UM1-AI068635) and the Henry M. Jackson Foundation (W81XWH-07-2-0067). The views expressed are those of the authors and should not be construed to represent the positions of the U.S. Army or the Department of Defense. The data that support the findings of this study are available from the U.S. Military Research Program, Walter Reed Army Institute of Research. Restrictions apply to the availability of these data, which were used under confidentiality agreement for this study.

References

  • 1. Pearson K. On lines and planes of closest fit to systems of points in space. Philosophical Magazine 1901; 2(11):559–572.
  • 2. Kruskal J, Wish M. Multidimensional Scaling. No. 11 in 07, SAGE Publications, London, 1978. URL https://books.google.com/books?id=ZzmIPcEXPf0C.
  • 3. Rerks-Ngarm S, Pitisuttithum P, Nitayaphan S, Kaewkungwal J, Chiu J, Paris R, et al. Vaccination with ALVAC and AIDSVAX to prevent HIV-1 infection in Thailand. New England Journal of Medicine 2009; 361(23):2209–2220.
  • 4. Plotkin S, Gilbert P. Nomenclature for immune correlates of protection after vaccination. Clinical Infectious Diseases 2012; 54(11):1615–1617.
  • 5. Haynes BF, Gilbert PB, McElrath MJ, Zolla-Pazner S, Tomaras GD, Alam SM, et al. Immune-correlates analysis of an HIV-1 vaccine efficacy trial. New England Journal of Medicine 2012; 366(14):1275–1286.
  • 6. Haynes BF, Gilbert PB, McElrath MJ, Zolla-Pazner S, Tomaras GD, Alam SM, Evans DT, Montefiori DC, Karnasuta C, Sutthent R, et al. Immune-correlates analysis of an HIV-1 vaccine efficacy trial. New England Journal of Medicine 2012; 366(14):1275–1286.
  • 7. Eddy S. Profile hidden Markov models. Bioinformatics 1998; 14:755–763.
  • 8. Durbin R, Eddy S, Krogh A, Mitchison G. Biological Sequence Analysis. Cambridge University Press, Cambridge, UK, 1998.
  • 9. Sibbald PR, Argos P. Weighting aligned protein or nucleic acid sequences to correct for unequal representation. Journal of Molecular Biology 1990; 216(4):813–818.
  • 10. Eddy SR, Mitchison G, Durbin R. Maximum discrimination hidden Markov models of sequence consensus. Journal of Computational Biology 1995; 2(1):9–23.
  • 11. Henikoff S, Henikoff JG. Position-based sequence weights. Journal of Molecular Biology 1994; 243(4):574–578.
  • 12. Krogh A, Mitchison GJ. Maximum entropy weighting of aligned sequences of proteins or DNA. ISMB, vol. 3, 1995; 215–221.
  • 13. van der Vaart A. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press, Cambridge, UK, 2000.
  • 14. Scott DW. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, 1992.
  • 15. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2006; 68(1):49–67.
  • 16. Meier L, Van De Geer S, Bühlmann P. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2008; 70(1):53–71.
  • 17. Tomaras G, Ferrari G, Shen X, Alam S, Liao H, Pollara J, et al. Vaccine-induced plasma IgA specific for the C1 region of the HIV-1 envelope blocks binding and effector function of IgG. Proceedings of the National Academy of Sciences 2013; 110(22):9019–9024.
  • 18. Prentice RL. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika 1986; 73(1):1–11.
