Abstract
In HIV vaccine/prevention research, probing into the vaccine-induced immune responses that can help predict the risk of HIV infection provides valuable information for the development of vaccine regimens. Previous correlate analysis of the Thai vaccine trial aided the discovery of interesting immune correlates related to the risk of developing an HIV infection. The present study aimed to identify the combinations of immune responses associated with the heterogeneous infection risk. We explored a “change-plane” via combination of a subset of immune responses that could help separate vaccine recipients into two heterogeneous subgroups in terms of the association between immune responses and the risk of developing infection. Additionally, we developed a new variable selection algorithm through a penalized likelihood approach to investigate a parsimonious marker combination for the change-plane. The resulting marker combinations can serve as candidate correlates of protection and can be used for predicting the protective effect of the vaccine against HIV infection. The application of the proposed statistical approach to the Thai trial has been presented, wherein the marker combinations were explored among several immune responses and antigens.
Keywords and phrases: Change-plane, Immune-correlates analysis, Latent subgroup analysis, RV144 trial
1. Introduction.
Immune correlates analyses of the HIV-1 vaccine studies aim to identify the immune response variables induced by the administration of a vaccine. Such a strategy can help predict the risk of HIV infection and thus can provide valuable information for the development of vaccine regimens. Multiple immune correlates analyses of the HIV-1 vaccine efficacy were conducted in the Thai Phase III Clinical Trial (RV144) (Rolland et al., 2012; Rolland and Gilbert, 2012), which was reportedly the first HIV vaccine trial to demonstrate any efficacy at preventing infection (Rerks-Ngarm et al., 2009). These previous correlates of risk (CoR) analyses using logistic and Cox-hazard regression models in the RV144 trial showed that two immune response variables were associated with the risk of developing HIV infection among the vaccine recipients: The binding of immunoglobulin G (IgG) antibodies to the variable loops 1 and 2 (V1V2) region of glycoprotein (gp) 120 (“V1V2-IgG”) correlated inversely (negatively) with the risk of developing HIV infection, and the binding of plasma IgA antibodies to the envelope proteins (Env) (“Env-specific IgA”) correlated directly (positively) with the infection risk (Haynes et al., 2012; Yates et al., 2018). Interestingly, the interaction analyses based on the pre-determined subgroups of the Env-specific IgA levels (low, medium, high) suggested that the Env-specific IgA antibodies might modify the association between the infection risk and each of the other immune response variables. For example, four immune response variables exhibited a significant inverse correlation with the infection risk among the vaccine recipients whose Env-specific IgA level was low, while the association was not significant among the vaccine recipients with high levels of Env-specific IgA (Haynes et al., 2012; Zolla-Pazner et al., 2014). 
These findings raised a question regarding the existence of heterogeneous subgroups in the RV144 trial, in which the association between the immune response variable and the risk of developing infection differed across subgroups defined by other immune markers and individual characteristics.
Motivated by the previous immune correlates analysis of the RV144 trial, we aimed to investigate the existence of heterogeneous subgroups among the vaccine recipients, characterized by differential associations of the candidate immune response endpoints with the risk of developing HIV infection. This enables us to identify immune responses that are strongly associated with the risk of developing infection within subgroups, even when no strong association is observed in the whole population. Another important question is the development of an immune response combination score that could be used to divide the vaccine recipients into subgroups with differential risk. This type of score can serve as an outcome in vaccine immunogenicity studies for selecting and ranking candidate vaccine regimens.
There is extensive literature on the statistical methods used for latent subgroup analyses. Owing to space limitations, we focus here on the case in which the latent subgroups can be identified based on a function of an individual’s baseline characteristics. The finite mixture distribution is one of the most commonly used methods for modeling heterogeneous data in such a case. For example, Chung, Flaherty and Schafer (2006) applied the latent class logistic model to items assessing self-reported marijuana use and attitudes, associating class membership with individual variables. In typical finite-mixture modeling approaches, subgroup memberships are commonly determined through the posterior probability of subgroup membership, and the determination of the best threshold for subgroup identification is challenging. Furthermore, the posterior probability is a function of covariates and outcome variables in the mixture of binomial distributions; hence, subgroup identification is not possible for a new person without his/her outcome measurement. A few recent studies proposed alternative methods for subgroup identification. Shen and He (2015) introduced a likelihood-based test for the existence of subgroups using a structured logistic-normal mixture model and used predictive scores from logistic regression for subgroup identification. Shen and Qu (2020) proposed a structured mixed-effects model with a two-component mixture model for evaluation of the treatment effects and used a method similar to that of Shen and He (2015) for the subgroup determination.
Due to these concerns associated with a typical finite-mixture modeling approach, we chose to adopt instead a “change-plane” regression framework for subgroup analysis of the RV144 immune correlates data. This framework has been recently developed to study heterogeneous data structure. The change-plane method can be considered as an extension of the change-point method. While most classical change-point analyses focus on the detection of changes in the gradual distribution of data points, the change-plane model focuses on the identification of latent subgroups of individuals whose association between outcome variable and regression covariates of interest or the distribution of outcomes changes according to a set of covariates. In the change-plane method, the subgroup identification depends on the sign of the change-plane, where the intercept estimation of the change-plane provides the best threshold (Section 3). Therefore, a separate procedure to determine the best threshold is not necessary. Kosorok et al. (2007) studied the estimation and inference of the change-point regression model for the right-censored survival outcome with a change-point in a one-dimensional covariate. Kang (2011) extended the change-point model from one-dimensional covariate spaces to two-dimensional covariate spaces (“change-line” method). Wei and Kosorok (2013) adopted a sieve maximization scheme using sliced inverse regression to study latent subgroups in the change-plane classification and subsequently used the model for the establishment of Cox regression models (Wei and Kosorok, 2018). The composite large margin classifier (Chen et al., 2016) is similar to the change-plane method in that the subclasses are determined by the sign of the split function of covariates. Recently, the change-plane method has been applied to the estimation and inference of latent subgroups with heterogeneous treatment effects (Fan, Song and Lu, 2017; Kang, Lu and Song, 2017).
In vaccine studies, there could be a substantial number of candidate covariates, including demographics and lab-measured biomarkers, that could potentially contribute to the heterogeneity between subgroups. Selection of the most relevant features to characterize the subgroup can lead to a model with more stable performance, because fewer parameters need to be estimated. More importantly, it reduces the cost of measuring the information necessary to determine an individual’s subgroup membership, especially when lab measures are involved in the subgroup definition. This motivated us to develop a new method that incorporates feature selection into change-plane regression modeling, which has not been extensively studied in the existing change-plane methods. Our proposed method relies on a penalized likelihood framework. Smoothly clipped absolute deviation (SCAD) (Fan and Li, 2001) is frequently used to determine a subset of important variables. However, the standard change-plane method contains a non-differentiable function, which makes it difficult to adopt SCAD for variable selection.
To overcome this challenge, we have proposed a penalized probabilistic version of the change-plane regression model using SCAD and selected variables that contributed to subgroup identification through the change-plane. Through the introduction of a smooth likelihood, our method facilitates the natural implementation of SCAD to produce a penalized log-likelihood for selection of variables in change-plane characterization. Another contribution of the proposed method is that it enables computation of the conditional probability of individual subgroup membership. To the best of our knowledge, none of the existing standard change-plane regression models have accomplished these two tasks simultaneously.
The remainder of this paper is organized as follows. We have provided more details on the RV144 trial in Section 2. In Section 3, we have introduced our proposed change-plane method with and without variable selection. We adopted an ad hoc approach to validate the resulting subgroups using the proposed method. We have illustrated our method using the simulation studies described in Section 4. In Section 5, we highlighted the application of the proposed method to the immune correlates study of the RV144 trial and identified subgroups of vaccine recipients associated with the heterogeneous risk of developing HIV infection based on a small set of individual characteristics and immune markers. We have concluded this article with a discussion of the method and open research questions in Section 6.
2. The RV144 Thai trial and immune markers.
RV144 is a community-based, randomized, multi-center, double-blinded, placebo-controlled vaccine efficacy trial designed to evaluate a regimen consisting of a replication-defective canarypox vector vaccine (ALVAC-HIV) in combination with recombinant gp 120 subunits (AIDSVAX B/E) (Zolla-Pazner et al., 2014). The vaccine and placebo injections were administered intramuscularly to 16,402 healthy men and women between the ages of 18 and 30 years who were at risk of developing HIV-1 infection in Thailand between 2003 and 2009. In this study, we analyzed the data from the case-control correlates study, which consisted of peak immune response data obtained at week 26 from all 41 vaccine recipients who acquired HIV-1 infection after week 26 and from a frequency-matched random sample of 205 vaccine recipients who were HIV-1 negative at the end of the follow-up period at month 42 (Corey et al., 2015).
The immune correlates analysis of the RV144 trial aimed to determine whether any of the immune responses were associated with the risk of developing HIV infection. In this study, we have considered two sets of immune response variables (Tables 10 and 11 of the Supplementary materials). First, the following six primary immune response variables in gp 120 were considered: 1) V1V2-IgG, 2) Env-specific IgA, 3) the IgG antibodies avidity score to IgA (“Avidity”), 4) antibody-dependent cellular cytotoxicity(“ADCC”), 5) the magnitude of CD4+T cells intracytoplasmic cytokines (IFNγ, IL-2, TNFα, CD154) stimulated by AE.92TH023 peptides (“ICS”), and 6) tier 1 neutralizing antibodies (“NAB”). These variables were selected from the previously reported immune correlates analyses (Haynes et al., 2012). Second, the following V1V2-scaffold antigens using either enzyme-linked immunosorbent assay (ELISA) or a binding antibody multiplex assay (BAMA) performed in independent laboratories were explored: S1) gp70.A(92RW020)-V1V2.GN, S2) gp70.AE(92TH023)-V1V2.AP, S3) gp70.B(CaseA2.p623)-V1V2.Aporig, S4) gp70.B(Case A2)-V1V2.LL, S5) gp70.B(Case A2/V169K)-V1V2.LL, S6) gp70.B(Case A2/mut3)-V1V2.LL, S7) gp70.C(97ZA012)-V1V2.GN, S8) tags.C(1086)-V1V2.LL (using ELISA); S9) gp70.A(92RW020)-V1V2.GN, S10) gp70.AE(92TH023)-V1V2.AP, S11) gp70.B(CaseA2.p623)-V1V2.APorig, S12) gp70.B(CaseA2)-V1V2.GN, S13) gp70.B(Case A2)-V1V2.LL, S14) gp70.B(Case A2/V169K)-V1V2.LL, S15) gp70.B(Case A2/mut3)-V1V2.LL, S16) gp70.C(1086)-V1V2.LL, S17) gp70.C(97ZA012)-V1V2.GN, S18) tags.AE(A244)-V1V2.LL, and S19) tags.C(1086)-V1V2.LL (using BAMA). The gp70 (CaseA2/ug/ml)-V1V2.CH58 (S20) was generated separately. Except for the four proteins (S12, S18, S19 and S20), eight proteins were generated using both ELISA and BAMA. More detailed information on the RV144 trial can be found in the previous publications (Zou and Li, 2008; Haynes et al., 2012; Zolla-Pazner et al., 2014).
3. Method.
3.1. Data setting and the change-plane method.
Suppose the binary disease outcome variable Y can be expressed using the following risk model:

$$\operatorname{logit} P(Y = 1 \mid Z, C) = C\,\beta^{T} Z + (1 - C)\,\delta^{T} Z,$$

where C ∈ {0, 1} indicates individual subgroup membership, Z represents a set of covariates of interest, and β, δ represent vectors of the regression parameters. In the immune correlates analysis of the HIV-1 vaccine studies, we aim to estimate the regression parameters β and δ. We assumed that β ≠ δ. That is, for individuals whose subgroup variable C = 1, the risk model could be expressed as a function of Z indexed by β; otherwise, the risk model is a function of Z indexed by δ. In practice, individual subgroup membership is not directly observed from the data; hence, C is considered a latent variable. We assume that a set of covariates, X, provides complete information that can be used to determine the latent subgroup such that:

$$\operatorname{logit} P(Y = 1 \mid Z, X) = \beta^{T} Z\, 1\{\zeta^{T} X > 0\} + \delta^{T} Z\, 1\{\zeta^{T} X \le 0\}, \qquad (1)$$

where the first column of X contains ones for the intercept and $\zeta = (\zeta_0, \zeta_1^{T})^{T}$. Here, $\zeta_0$ and $\zeta_1$ represent the intercept and direction vector, respectively, that determine the latent subgroup. For example, Y and Z denote the HIV infection status and an immune response variable in the RV144 study; X represents a set of covariates that can include demographics and immune response variables. Our goal was to assess whether the vaccine recipients as a whole could be characterized by two subgroups with heterogeneous risk models, in which the risk of disease development was differently associated with the main covariates of interest, Z. The modeling approach in (1) is the change-plane regression model (Kosorok et al., 2007; Kang, 2011; Wei and Kosorok, 2013; Fan, Song and Lu, 2017). Here, ζ is referred to as the “change-plane parameter” and (β, δ) are referred to as the “regression parameters”.
In this study, we estimated the change-plane parameters using a probabilistic model G = G(X; η) for the following two reasons: 1) to compute the conditional probability of individual subgroup membership, and 2) to select variables among the change-plane covariates X by SCAD penalization, which relies on the smoothness and differentiability of the objective function. In the change-plane method, we determined the subgroup membership for each individual i by using the sign of $\eta^{T} X_i$, that is, $C_i = 1\{\eta^{T} X_i > 0\}$ in the model (1), where η is subject to a scale constraint for parameter identifiability (Kang, 2011; Wei and Kosorok, 2013; Fan, Song and Lu, 2017; Kang, Lu and Song, 2017; Wei and Kosorok, 2018).
3.2. Estimation of the change-plane parameters using the EM algorithm.
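Considering the latent variable C as missing data, we modeled P(C = 1 | X) using a smooth function G(X; η) indexed by η. Let $\theta = (\beta^{T}, \delta^{T}, \eta^{T})^{T}$ be the vector of the parameters. Let the complete and observed data be denoted by $O_c = \{(Y_i, Z_i, X_i, C_i)\}_{i=1}^{n}$ and $O = \{(Y_i, Z_i, X_i)\}_{i=1}^{n}$ for n independent individuals. The complete-data likelihood function is

$$L_c(\theta; O_c) = \prod_{i=1}^{n} \{G(X_i; \eta)\, f_1(Y_i \mid Z_i; \beta)\}^{C_i}\, \{(1 - G(X_i; \eta))\, f_0(Y_i \mid Z_i; \delta)\}^{1 - C_i},$$

where $f_1$ and $f_0$ denote the probability densities of the two subgroups for C = 1 and C = 0, respectively, that is, $f_c(y \mid z) = p_c^{y} (1 - p_c)^{1 - y}$ for c = 0, 1; $p_1 = P(Y = 1 \mid Z; \beta)$ and $p_0 = P(Y = 1 \mid Z; \delta)$ represent the risk models for individuals with C = 1 and C = 0, respectively. We used logistic regression for G(x; η); hence the probability of subgroup membership was computed as $G(x; \eta) = \{1 + \exp(-\eta^{T} x)\}^{-1}$. Here, we assumed that P(C = c | x) > 0 for c = 0, 1, and thus every individual had a positive probability of membership in both subgroups.

We estimated the change-plane and regression parameters using the expectation-maximization (EM) algorithm (Dempster, Laird and Rubin, 1977). Using the logistic regression for G(x; η), the conditional expectation of $\ell_c = \log L_c$ given the observed data, assuming that the true parameter values are $\theta^{(k)}$, is expressed by

$$Q(\theta, \theta^{(k)}, O) = \sum_{i=1}^{n} \Big[ E(C_i \mid O; \theta^{(k)}) \{\log G(X_i; \eta) + \log f_1(Y_i \mid Z_i; \beta)\} + \{1 - E(C_i \mid O; \theta^{(k)})\} \{\log(1 - G(X_i; \eta)) + \log f_0(Y_i \mid Z_i; \delta)\} \Big].$$

In the E-step, we calculated $Q(\theta, \theta^{(k)}, O)$ and

$$E(C_i \mid O; \theta^{(k)}) = \frac{G(X_i; \eta^{(k)})\, f_1(Y_i \mid Z_i; \beta^{(k)})}{G(X_i; \eta^{(k)})\, f_1(Y_i \mid Z_i; \beta^{(k)}) + \{1 - G(X_i; \eta^{(k)})\}\, f_0(Y_i \mid Z_i; \delta^{(k)})}. \qquad (2)$$

In the M-step, we estimated $\theta^{(k+1)} = \arg\max_{\theta} Q(\theta, \theta^{(k)}, O)$. By using a smooth function G, which is differentiable at least up to the second order, instead of the indicator of the latent subgroup, we can estimate both the change-plane and regression parameters by solving the score equation: $\partial Q(\theta, \theta^{(k)}, O) / \partial \theta = 0$.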
Once the change-plane parameters, η, were estimated using the EM algorithm, the individual membership was determined by the sign of $\hat{\eta}^{T} X$. Finally, we fit an outcome regression model for each subgroup separated by the sign of $\hat{\eta}^{T} X$, and the regression parameters were estimated by fitting the ordinary logistic regression model. The EM procedure to estimate η without the SCAD penalty is similar to that of the standard finite mixture of logistic binomial regression. However, we used the change-plane method to determine subgroup memberships and the ordinary logistic regression model for each subgroup to estimate the final regression parameters. Thus, our proposed method can produce different results from the standard finite mixture of logistic binomial regression models or from the change-plane regression model that simply replaces 1{ζTX > 0} in (1) with a smoothing approximation function. Our method without the SCAD penalty for a binary outcome has a structure analogous to the logistic-normal mixture model for a continuous outcome proposed by Shen and He (2015). Shen and He (2015) focused on hypothesis testing for the existence of subgroups, whereas we focused on estimation and variable selection that can produce a parsimonious change-plane to separate the data points into two subgroups. In the EM procedure of Shen and He (2015), (β, δ) and the “gating function” parameters (η) were estimated separately within each iteration using equations (10) and (11) of Shen and He (2015) in the logistic-normal mixture modeling framework. A pre-determined compact set (η) was used to estimate η, and (β, δ) was optimized for a given η. Our EM procedure estimated both the change-plane and intermediate regression parameters simultaneously within each iteration in the logistic-binomial mixture modeling framework, without candidate values of η.
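As a minimal sketch of the E-step, the snippet below computes the conditional probability of subgroup membership for one individual. It assumes that the weight in (2) is the usual two-component mixture posterior, G f1 / {G f1 + (1 − G) f0}, with logistic forms for both the gating function G and the two risk models; the function names and the numerical values are hypothetical and serve only as illustration.

```python
import math

def sigmoid(t):
    """Numerically stable logistic function 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def bernoulli_pmf(y, p):
    """P(Y = y) for a binary outcome with success probability p."""
    return p if y == 1 else 1.0 - p

def posterior_membership(y, z, x, beta, delta, eta):
    """E-step weight P(C = 1 | y, z, x) at the current parameter values."""
    g = sigmoid(dot(eta, x))                       # G(x; eta) = P(C = 1 | x)
    f1 = bernoulli_pmf(y, sigmoid(dot(beta, z)))   # outcome density if C = 1
    f0 = bernoulli_pmf(y, sigmoid(dot(delta, z)))  # outcome density if C = 0
    return g * f1 / (g * f1 + (1.0 - g) * f0)

# Illustrative values only (assumed): eta = 0 gives equal prior odds,
# and the two subgroups have opposite risk models.
z, x = [1.0, 1.0], [1.0, 0.3]
beta, delta, eta = [1.0, 1.0], [-1.0, -1.0], [0.0, 0.0]
w_case = posterior_membership(1, z, x, beta, delta, eta)
w_control = posterior_membership(0, z, x, beta, delta, eta)
```

Because the weight depends on the outcome y, it cannot be computed for a new individual without an outcome measurement, whereas the sign rule based on the change-plane covariates alone can.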
3.3. The SCAD-penalized model for change-plane covariates selection.
The following estimating equation has been adopted for variable selection using the SCAD penalty:

$$\frac{\partial Q(\theta, \theta^{(k)}, O)}{\partial \theta} - n\, q_{\lambda}(|\theta|) \circ \operatorname{sign}(\theta) = 0, \qquad (3)$$

where $q_{\lambda}(|\theta|) = \{q_{\lambda}(|\theta_1|), \ldots, q_{\lambda}(|\theta_p|)\}^{T}$ represents a p-dimensional vector of penalty functions with a regularization parameter λ, ∘ denotes the elementwise product, sign(a) = 1{a > 0} − 1{a < 0}, and $\operatorname{sign}(\theta) = \{\operatorname{sign}(\theta_1), \ldots, \operatorname{sign}(\theta_p)\}^{T}$. We selected the SCAD penalty function based on its desirable properties (unbiasedness, sparsity, and continuity), as discussed by Fan and Li (2001) and Wang, Zhou and Qu (2012): $q_{\lambda}(\theta) = \lambda \left\{ 1(\theta \le \lambda) + \frac{(h\lambda - \theta)_{+}}{(h - 1)\lambda}\, 1(\theta > \lambda) \right\}$ for θ ≥ 0 and h > 2. In this study, we conducted variable selection for the change-plane covariates only, although the SCAD penalty could be extended to both the change-plane and regression parameters. Therefore, only the η components of θ are penalized, and the entries of $q_{\lambda}(|\theta|)$ that correspond to β and δ are set to zero. Fan and Li (2001) reported that h = 3.7 was close to the value obtained by using the generalized cross-validation method (Craven and Wahba, 1978). Therefore, we set h to 3.7. To reduce the computational burden, we selected the λ that minimizes the Bayesian information criterion (BIC) proposed by Ueki (2009), which depends on the number of nonzero parameters. We solved the penalized estimating equations using the iterative algorithm described by Wang, Zhou and Qu (2012), and used the functions in the R package PGEE (Inan and Wang, 2017) with a few modifications.
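The SCAD penalty term can be sketched as follows. The derivative formula is the standard one from Fan and Li (2001), written with the tuning constant h used in the text; the function names, the parameter layout (β, δ, η), and the example values are assumptions made for illustration and are not taken from the PGEE package.

```python
def scad_derivative(t, lam, h=3.7):
    """SCAD penalty derivative q_lam(t) for t = |theta_j| >= 0 (Fan and Li, 2001)."""
    if t < 0 or lam <= 0 or h <= 2:
        raise ValueError("requires t >= 0, lam > 0 and h > 2")
    if t <= lam:
        return lam  # constant, LASSO-like rate for small coefficients
    # Tapers linearly to zero at t = h * lam, so large coefficients are unpenalized.
    return max(h * lam - t, 0.0) / (h - 1.0)

def sign(a):
    """sign(a) = 1{a > 0} - 1{a < 0}."""
    return (a > 0) - (a < 0)

def penalty_vector(theta, n_regression, lam, h=3.7):
    """Elementwise q_lam(|theta_j|) * sign(theta_j) for theta = (beta, delta, eta).

    The first n_regression entries (beta and delta) are left unpenalized,
    so only the change-plane block eta is subject to selection.
    """
    out = []
    for j, th in enumerate(theta):
        if j < n_regression:
            out.append(0.0)
        else:
            out.append(scad_derivative(abs(th), lam, h) * sign(th))
    return out

# Illustrative call (assumed values): two regression parameters, four eta entries.
example = penalty_vector([1.0, -1.0, 0.2, -1.0, 3.0, 0.0], n_regression=2, lam=0.5)
```

With λ = 0.5 and h = 3.7, a coefficient of magnitude 0.2 is penalized at the full rate λ, a coefficient of magnitude 1.0 at a reduced rate, and a coefficient of magnitude 3.0 (beyond hλ = 1.85) is not penalized at all.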
1. The estimation of the parameters using EM is conducted as follows:
   a. E-step: compute $E(C_i \mid O; \theta^{(k)})$ as described in (2).
   b. M-step: compute $\theta^{(k+1)}$ by solving the conditional score equations, $\partial Q(\theta, \theta^{(k)}, O) / \partial \theta = 0$, after plugging $\theta^{(k)}$ into E(C).
   c. Evaluate $Q^{(k)}$ at $\theta^{(k+1)}$. Compute the change in Q from the previous iteration and the minimum subgroup size implied by the sign of $\eta^{(k+1)T} X_i$.
   d. Repeat steps 1a–1c until convergence. The iterations stop according to the following rules: 1) if the change in Q is smaller than the pre-specified threshold value; 2) if the number of iterations exceeds the pre-specified maximum number; or 3) if the minimum subgroup size is smaller than the pre-determined minimum size.
2. The final change-plane estimates are $\hat{\eta} = \eta^{(k)}$ for the final k of the EM procedure. The individual subgroup membership is determined by $\hat{C}_i = 1\{\hat{\eta}^{T} X_i > 0\}$.
3. Fit an outcome regression model for each subgroup separated by $\hat{C}_i = 1$ or 0 and compute the final regression parameter estimates from the ordinary logistic regression model. The final estimates are denoted by $(\hat{\beta}, \hat{\delta})$. If the EM procedure stops due to a small subgroup size or fails to separate the samples into two subgroups, we fit a single outcome model to the entire data set.
4. To avoid the parameter identifiability problem, compare the first elements of the estimated parameter vectors $\hat{\beta}$ and $\hat{\delta}$, and relabel the subgroups if necessary so that a fixed ordering between them is satisfied. The change-plane parameter vector is scaled to satisfy the scale constraint imposed for identifiability.

Estimation of the parameters with the SCAD penalty: We repeat the EM procedure as described above, except that we solve equation (3) iteratively until convergence.
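The subgroup assignment and the minimum-subgroup-size stopping rule can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the paper: the function names are hypothetical, and the default minimum size (the larger of the number of regression parameters and 5% of the sample size) is borrowed from the simulation set-up in Section 4.1.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def assign_subgroups(X, eta_hat):
    """Hard subgroup labels C_i = 1{eta_hat' x_i > 0} from the change-plane."""
    return [1 if dot(eta_hat, x) > 0 else 0 for x in X]

def minimum_size_required(n, n_regression_params, fraction=0.05):
    """Larger of the number of regression parameters and fraction * n."""
    return max(n_regression_params, fraction * n)

def subgroups_large_enough(labels, n_regression_params, fraction=0.05):
    """False when the smaller subgroup falls below the minimum size (stopping rule 3)."""
    n1 = sum(labels)
    smallest = min(n1, len(labels) - n1)
    return smallest >= minimum_size_required(len(labels), n_regression_params, fraction)

# Illustrative data (assumed): first column of X is the intercept.
X = [[1.0, 0.5], [1.0, -2.0], [1.0, 1.0], [1.0, -0.1]]
labels = assign_subgroups(X, eta_hat=[0.2, 1.0])
```

When the size check fails, the procedure described above falls back to fitting a single outcome model to the entire data set.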
4. Simulation study.
4.1. Simulation set-up.
We conducted simulation studies to evaluate the finite sample performance of the proposed estimation procedure with (EM+SCAD) and without (EM) variable selection using SCAD under the following three different scenarios: 1) when the two subgroups exhibited different associations, positive or negative associations, between the risk of disease outcome and a set of regression covariates, 2) when two subgroups demonstrated different associations, where the direction of the association was the same, but one subgroup showed a stronger association than the other group, and 3) when no heterogeneous subgroup existed. For Scenario 1(a), the true values of the regression coefficient in the risk model were β0 = (1, 1)T versus δ0 = (−1, −1)T for the two subgroups. For Scenario 1(b), β0 = (1, −1)T versus δ0 = (−1, 1)T. For Scenario 2, the true values of the regression coefficients were β0 = (3.5, 3.5)T (Scenario 2(a)) or (1, 1.5)T (Scenario 2(b)) versus δ0 = (0.5, 0.5)T. For Scenario 3, we set β0 = δ0 = (1, 1)T.
For each scenario, we generated training data sets for the following three sample sizes: n = 250, 500, and 1,000. The change-plane covariates, X = (X1, …, X9, X10), were generated from Uniform(−2, 2), where the last column of X, X10, contained ones for the intercept. The regression covariate Z was generated from Uniform(−2, 2), where the first column of Z contained ones for the intercept. The true change-plane parameters were nonzero only for η1, η2, and η10; hence the intercept term X10 and the two variables X1 and X2 provide complete information on the latent subgroups through C = 1{η10 + η1X1 + η2X2 > 0}. The binary outcomes Y were generated by using Y = CW1 + (1 − C)W0, where W1 and W0 represented binary outcomes generated from the logistic regression model with β0 and δ0, respectively, for each scenario. We performed a Monte Carlo (MC) simulation with 1,000 replications. For each MC replication, the methods were evaluated using separate test data sets with a sample size of 10,000. The minimum subgroup size was set equal to the total number of regression parameters or 5% of the training data sample size, whichever was greater.
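The data-generating mechanism for Scenario 1(a) can be sketched as follows. The regression coefficients β0 = (1, 1) and δ0 = (−1, −1) come from the set-up above; the nonzero change-plane values (η1, η2, η10) = (1, 1, 0) are hypothetical placeholders chosen for illustration, and Z is assumed to consist of an intercept and one Uniform(−2, 2) covariate, matching the two-dimensional coefficients.

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic function 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def simulate(n, beta, delta, eta_nonzero, seed=0):
    """Draw n records (y, z, x, c) from the two-subgroup change-plane model.

    eta_nonzero = (eta1, eta2, eta10); eta3..eta9 are zero, so X3..X9 are noise.
    """
    rng = random.Random(seed)
    eta1, eta2, eta10 = eta_nonzero
    rows = []
    for _ in range(n):
        # X1..X9 ~ Uniform(-2, 2); X10 = 1 is the intercept (last column).
        x = [rng.uniform(-2.0, 2.0) for _ in range(9)] + [1.0]
        # Z = (1, Z1): intercept first, then one Uniform(-2, 2) covariate.
        z = [1.0, rng.uniform(-2.0, 2.0)]
        c = 1 if eta10 + eta1 * x[0] + eta2 * x[1] > 0 else 0
        # Y = C*W1 + (1 - C)*W0: only the outcome of the active subgroup
        # is drawn, which yields the same distribution for Y.
        coef = beta if c == 1 else delta
        p = sigmoid(coef[0] * z[0] + coef[1] * z[1])
        y = 1 if rng.random() < p else 0
        rows.append({"y": y, "z": z, "x": x, "c": c})
    return rows

# Scenario 1(a) coefficients with assumed change-plane values.
data = simulate(250, beta=(1.0, 1.0), delta=(-1.0, -1.0),
                eta_nonzero=(1.0, 1.0, 0.0), seed=1)
```

The true label `c` is kept in each record only so that classification metrics can be evaluated; an estimation procedure would see `y`, `z`, and `x` alone.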
Results for Scenario 1(b), 2(a), 2(b), and more simulations (scenarios 4–8) to explore the performance of the proposed methods for high-dimensional change-plane covariates and the robustness against model misspecification are given in sections 1.1, 1.2 and 1.3 of the Supplementary materials. In summary, model misspecification could lead to bias in estimation and increased misclassification, to an extent that depended on the simulation setting. Variable selection performance was generally better than random guessing under minor misspecification of the change-plane model. A larger sample size is needed for good performance of the proposed method with the SCAD penalty when the dimensionality of the candidate covariates in the change-plane model increases, particularly for selecting many weakly contributing covariates.
The following metrics were used to compare the finite sample performance of the subgroup identification and variable selection in our simulation study: 1) the misclassification rate of the subgroups (MCR_C), that is, the proportion of individuals whose estimated subgroup differs from the true subgroup; 2) the area under the receiver operating characteristic (ROC) curve for the identification of the subgroup (AUC_C); 3) the true non-zero selection rate, that is, the number of selected variables among the three variables with non-zero coefficients (X1, X2, and X10) in each simulation, divided by 3; and 4) the true zero selection rate, that is, 1 − (the number of selected variables among the seven variables with zero coefficients, X3, …, X9, in each simulation, divided by 7).
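The four metrics can be computed as in the following sketch. The function names are hypothetical, and the AUC is computed with the rank-based (Mann-Whitney) formula, which is one standard way to obtain the area under the ROC curve; the paper does not state which computation was used.

```python
def misclassification_rate(c_true, c_hat):
    """Proportion of individuals whose estimated subgroup differs from the truth."""
    return sum(a != b for a, b in zip(c_true, c_hat)) / len(c_true)

def auc(c_true, scores):
    """Area under the ROC curve via the Mann-Whitney statistic.

    Equals P(score of a C = 1 individual > score of a C = 0 individual),
    counting ties as one half. Quadratic in sample size; fine for a sketch.
    """
    pos = [s for c, s in zip(c_true, scores) if c == 1]
    neg = [s for c, s in zip(c_true, scores) if c == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def selection_rates(selected, nonzero, zero):
    """Return (true non-zero selection rate, true zero selection rate)."""
    true_nonzero = len(selected & nonzero) / len(nonzero)
    true_zero = 1.0 - len(selected & zero) / len(zero)
    return true_nonzero, true_zero

# Illustrative inputs (assumed): four individuals and one selection result.
mcr = misclassification_rate([1, 1, 0, 0], [1, 0, 0, 0])
auc_c = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
nonzero = {"X1", "X2", "X10"}
zero = {"X3", "X4", "X5", "X6", "X7", "X8", "X9"}
rates = selection_rates({"X1", "X2", "X3"}, nonzero, zero)
```

In this toy example, one of four labels is wrong, three of the four (C = 1, C = 0) score pairs are correctly ordered, two of the three non-zero variables are selected, and one of the seven zero variables is wrongly selected.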
4.2. Validation of the resulting subgroups.
Validation of the resulting subgroups is important in the subgroup analysis. As reported by Sun et al. (2010) and Simon and Wang (2006), it is difficult to validate the results of subgroup analysis without the inclusion of a considerable external data set. Alternatively, simple ad-hoc approaches, such as the interaction test (Sun et al., 2010) and cross-validation methods (Simon and Wang, 2006), have been used for internal validations. To examine whether the subgroups resulting from the proposed methods were statistically meaningful, we introduced a binary variable indicating each individual’s subgroup, $\hat{C}_i = 1\{\hat{\eta}^{T} X_i > 0\}$, where $\hat{\eta}$ was obtained by adopting our proposed method. Then, we fit the following logistic regression model including the interaction term with the subgroup indicators to either the training data set or the test data set (N = 10,000):
| (4) |
This approach has been used in the statistical literature (Wei and Kosorok, 2013) as an ad-hoc validation method for subgroup analysis. In the simulation studies, we fit the logistic regression model (4) to the test data sets. We also tested the null hypothesis of no interaction with the subgroup indicator using the likelihood ratio and score tests.
4.3. Simulation results.
4.3.1. Scenarios 1(a) and 2: In the presence of the heterogeneous subgroups.
Figures 1 and 2 show the estimates of the coefficients for the three change-plane covariates with non-zero coefficients (η1, η2, η10) and one of the change-plane covariates with zero coefficients (η3) as well as the regression coefficients of the two separated subgroups in scenarios 1 and 2, respectively. The same information can be found in Table 1 in the Supplementary materials. Overall, the proposed methods showed better performance for the estimation and identification of the heterogeneous subgroups when two subgroups exhibited the opposite direction of the association (Scenario 1), compared to those with the same direction but different strengths of the associations (Scenario 2). Regardless of the use of the SCAD penalty, the proposed methods improved both estimations for the change-plane and regression coefficients as the training sample sizes increased.
FIG 1.

Results of the simulation study: scenario 1(a), where two subgroups exhibit the opposite direction of the association with (β1, β2) = (1, 1) and (δ1, δ2) = (−1, −1). Sample sizes of n = 250, 500, and 1000 have been reported. Change-plane and regression parameters have been estimated by using EM without variable selection (EM, Salmon) or with variable selection using SCAD (EM+SCAD, Turquoise). Estimates of the three non-zero change-plane parameters (η1, η2, η10), one zero change-plane parameter (η3), and the regression parameters of the two subgroups are reported using box-plots.
FIG 2.

Results of the simulation study: scenario 2(a), where two subgroups exhibit the same direction, but different strengths of the association. Sample sizes n = 250, 500, and 1000 have been reported. Change-plane and regression parameters have been estimated by using EM without variable selection (EM, Salmon) or with variable selection using SCAD (EM+SCAD, Turquoise). Estimates of the three non-zero change-plane parameters (η1, η2, η10), one zero change-plane parameter (η3), and the regression parameters of the two subgroups have been reported using box-plots.
Table 1 reports the performance for the latent subgroup classification and variable selection in Scenarios 1(a) and 2(a). The mean and MC standard deviations of the subgroup misclassification rates (MCR_C) and AUCs (AUC_C) of the latent subgroup were computed using independent test data sets. By selecting the change-plane covariates using SCAD, the proposed method improved the latent subgroup identification over EM without variable selection in both scenarios; the only exception was the misclassification rate in Scenario 2(a) with n = 1,000 (0.042 versus 0.036). In Scenario 1(a) with n = 250, the SCAD-penalized method correctly selected 99.3% of the variables that contributed to subgroup identification and correctly excluded 99.6% of the variables that did not contribute.
TABLE 1.
Simulation results of the evaluation of the latent subgroup classification and variable selection with training sample sizes of n = 250, 500, and 1,000 are presented. The mean (Mean) and Monte-Carlo standard deviation (MCSD) of misclassification rates (MCR_C) and AUCs (AUC_C) of the latent subgroup are computed using independent test data sets. When SCAD is implemented for variable selection, the mean and MCSD of true non-zero effect variable selection rates (True non-zero selection rate) and true zero effect variable selection rates (True zero selection rate) over 1,000 MC replications are reported.
| Measurement | Statistic | n = 250: EM | n = 250: EM+SCAD | n = 500: EM | n = 500: EM+SCAD | n = 1000: EM | n = 1000: EM+SCAD |
|---|---|---|---|---|---|---|---|
| Scenario 1(a): two subgroups with opposite directions | | | | | | | |
| MCR_C | Mean | 0.089 | 0.035 | 0.049 | 0.023 | 0.024 | 0.015 |
| MCR_C | MCSD | 0.055 | 0.050 | 0.057 | 0.051 | 0.041 | 0.040 |
| AUC_C | Mean | 0.948 | 0.993 | 0.973 | 0.995 | 0.989 | 0.996 |
| AUC_C | MCSD | 0.054 | 0.047 | 0.057 | 0.045 | 0.042 | 0.039 |
| True non-zero selection rate | Mean | - | 0.993 | - | 0.994 | - | 0.995 |
| True non-zero selection rate | MCSD | - | 0.063 | - | 0.058 | - | 0.054 |
| True zero selection rate | Mean | - | 0.996 | - | 0.997 | - | 0.995 |
| True zero selection rate | MCSD | - | 0.026 | - | 0.041 | - | 0.056 |
| Scenario 2(a): two subgroups with the same direction but different strengths | | | | | | | |
| MCR_C | Mean | 0.134 | 0.082 | 0.072 | 0.051 | 0.036 | 0.042 |
| MCR_C | MCSD | 0.056 | 0.074 | 0.040 | 0.037 | 0.023 | 0.021 |
| AUC_C | Mean | 0.889 | 0.978 | 0.945 | 0.996 | 0.976 | 0.999 |
| AUC_C | MCSD | 0.057 | 0.057 | 0.038 | 0.024 | 0.023 | 0.008 |
| True non-zero selection rate | Mean | - | 0.971 | - | 0.998 | - | 0.999 |
| True non-zero selection rate | MCSD | - | 0.099 | - | 0.033 | - | 0.015 |
| True zero selection rate | Mean | - | 0.992 | - | 0.996 | - | 1.000 |
| True zero selection rate | MCSD | - | 0.038 | - | 0.035 | - | 0.000 |
The sample density plots of the probabilities of subgroup membership are shown in Figure 3. We randomly selected one data set among 1,000 simulated data sets for scenarios 1 and 2 with a sample size of 1,000. We reported the true probabilities and estimated probabilities with and without SCAD by true subgroups. The estimated probabilities of subgroup membership were found to be closer to the true probabilities of subgroup membership with variable selection than those without variable selection in both scenarios.
FIG 3.

Results of the simulation study: one of the simulated data sets for scenario 1(a) and scenario 2(a) with a sample size of 1,000. The probabilities of subgroup membership are reported for the true probabilities (black), probability estimates without SCAD (salmon), and probability estimates with SCAD (turquoise) by the true subgroup memberships (solid and dashed lines for the true subgroups C = 1 and C = 0, respectively).
As an ad-hoc validation of the resulting subgroup, we fit the logistic regression model including the interaction with the change-plane subgroup indicator using the test data sets (Table 2). Because each estimated subgroup contained a mix of observations with different directions or strengths of association (1 versus −1 in scenario 1; 0.3 versus 3.5 in scenario 2), the regression parameter estimates were biased toward the null, that is, pulled toward those of the other subgroup, especially in scenario 2. However, the latent subgroups were correctly identified, and the regression parameter estimates improved with larger training data sets. The LRT and score test rejection rates (%) of H0 of no interaction effect in (4) were over 99% in both scenarios.
TABLE 2.
Simulation results of the ad-hoc validations. Logistic regression models including the interaction terms with the resulting subgroup indicators are fitted to the independent test data sets. Averages of the regression coefficient estimates, standard error estimates, and p-values of the t-test of the zero-coefficient null hypothesis using training sample sizes of n = 250, 500, and 1,000 are presented. The rejection rates (%) of the LRT and score test of H0 of zero interaction effects with the subgroup indicator over 1,000 Monte-Carlo replications are reported.
| Coefficient | True value | Average statistic | n = 250: EM | n = 250: EM+SCAD | n = 500: EM | n = 500: EM+SCAD | n = 1000: EM | n = 1000: EM+SCAD |
|---|---|---|---|---|---|---|---|---|
| Scenario 1(a): two subgroups with opposite directions | | | | | | | | |
| α0 | −1 | Estimates | −0.671 | −0.877 | −0.803 | −0.919 | −0.895 | −0.946 |
| | | Std.Err | 0.037 | 0.041 | 0.039 | 0.041 | 0.041 | 0.042 |
| | | p-value | 0.001 | 0.001 | 0.002 | 0.000 | 0.001 | 0.000 |
| α1 | −1 | Estimates | −0.671 | −0.879 | −0.804 | −0.922 | −0.895 | −0.947 |
| | | Std.Err | 0.034 | 0.037 | 0.036 | 0.038 | 0.037 | 0.038 |
| | | p-value | 0.001 | 0.000 | 0.002 | 0.000 | 0.002 | 0.000 |
| α2 | 2 | Estimates | 1.471 | 1.792 | 1.688 | 1.863 | 1.835 | 1.906 |
| | | Std.Err | 0.048 | 0.052 | 0.050 | 0.053 | 0.052 | 0.053 |
| | | p-value | 0.001 | 0.004 | 0.004 | 0.002 | 0.000 | 0.002 |
| α3 | 2 | Estimates | 1.470 | 1.792 | 1.688 | 1.864 | 1.832 | 1.904 |
| | | Std.Err | 0.044 | 0.048 | 0.046 | 0.049 | 0.048 | 0.049 |
| | | p-value | 0.002 | 0.007 | 0.000 | 0.001 | 0.001 | 0.003 |
| LRT rejection rate | | | 99.86% | 99.49% | 99.57% | 99.35% | 99.80% | 99.50% |
| Score test rejection rate | | | 99.86% | 99.49% | 99.57% | 99.35% | 99.80% | 99.50% |
| Scenario 2(a): two subgroups with the same direction but different strengths | | | | | | | | |
| α0 | 0.3 | Estimates | 0.413 | 0.344 | 0.351 | 0.309 | 0.324 | 0.302 |
| | | Std.Err | 0.036 | 0.036 | 0.035 | 0.035 | 0.034 | 0.035 |
| | | p-value | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| α1 | 0.3 | Estimates | 0.420 | 0.346 | 0.357 | 0.311 | 0.326 | 0.303 |
| | | Std.Err | 0.032 | 0.032 | 0.031 | 0.031 | 0.030 | 0.031 |
| | | p-value | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| α2 | 3.2 | Estimates | 1.520 | 1.954 | 1.982 | 2.130 | 2.412 | 2.155 |
| | | Std.Err | 0.067 | 0.078 | 0.077 | 0.080 | 0.089 | 0.080 |
| | | p-value | 0.000 | 0.002 | 0.000 | 0.001 | 0.000 | 0.000 |
| α3 | 3.2 | Estimates | 1.569 | 1.996 | 2.026 | 2.173 | 2.447 | 2.200 |
| | | Std.Err | 0.062 | 0.072 | 0.071 | 0.074 | 0.082 | 0.074 |
| | | p-value | 0.000 | 0.001 | 0.000 | 0.001 | 0.000 | 0.000 |
| LRT rejection rate | | | 100.00% | 99.78% | 100.00% | 99.93% | 100.00% | 99.92% |
| Score test rejection rate | | | 100.00% | 99.78% | 100.00% | 99.93% | 100.00% | 99.92% |
| Scenario 3: No subgroup exists | | | | | | | | |
| α0 | 1 | Estimates | 1.001 | 1.000 | 1.001 | 0.999 | 1.000 | 0.999 |
| | | Std.Err | 0.039 | 0.040 | 0.038 | 0.040 | 0.039 | 0.039 |
| | | p-value | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| α1 | 1 | Estimates | 1.002 | 1.002 | 1.003 | 1.003 | 1.002 | 1.001 |
| | | Std.Err | 0.036 | 0.037 | 0.036 | 0.037 | 0.036 | 0.036 |
| | | p-value | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| α2 | 0 | Estimates | −0.001 | 0.001 | 0.001 | 0.005 | 0.000 | 0.003 |
| | | Std.Err | 0.056 | 0.057 | 0.057 | 0.059 | 0.058 | 0.059 |
| | | p-value | 0.516 | 0.510 | 0.503 | 0.492 | 0.472 | 0.485 |
| α3 | 0 | Estimates | −0.005 | −0.004 | −0.002 | 0.001 | −0.001 | 0.001 |
| | | Std.Err | 0.051 | 0.053 | 0.052 | 0.054 | 0.053 | 0.055 |
| | | p-value | 0.458 | 0.476 | 0.509 | 0.498 | 0.492 | 0.491 |
| LRT rejection rate | | | 5.47% | 4.30% | 5.47% | 5.70% | 5.81% | 4.65% |
| Score test rejection rate | | | 5.65% | 4.30% | 5.32% | 5.78% | 5.66% | 4.65% |
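The link between the interaction coefficients in Table 2 and the subgroup-specific associations can be written out as a worked equation. This is only a sketch: it assumes that model (4) contains an intercept, a single marker $Z$, and the interactions of both with the subgroup indicator $C$. That form is consistent with the true values listed in Table 2, but the exact specification of (4) is not restated in this section.

```latex
\begin{aligned}
\operatorname{logit} P(Y = 1 \mid Z, C) &= \alpha_0 + \alpha_1 Z + C\,(\alpha_2 + \alpha_3 Z), \\
C = 0:\quad \operatorname{logit} P(Y = 1 \mid Z) &= \alpha_0 + \alpha_1 Z, \\
C = 1:\quad \operatorname{logit} P(Y = 1 \mid Z) &= (\alpha_0 + \alpha_2) + (\alpha_1 + \alpha_3) Z .
\end{aligned}
```

Under this assumed form, the true values in scenario 1(a) give coefficients of $-1$ in one subgroup and $-1 + 2 = 1$ in the other, and those in scenario 2(a) give $0.3$ and $0.3 + 3.2 = 3.5$. In scenario 3, $\alpha_2 = \alpha_3 = 0$, so the two lines coincide and no subgroup exists.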
4.3.2. Scenario 3: In the absence of heterogeneous subgroups.
As shown in Figure 2 of the Supplementary materials, the estimates of the change-plane coefficients were distributed around zero, but not all coefficients were equal to zero because of the restriction imposed on the change-plane coefficients. Since there were no heterogeneous subgroups, the difference in the resulting regression coefficients between the two subgroups separated by the estimated change-plane was minimal, particularly with a larger training sample size. The same pattern could be observed in the ad-hoc validation presented in Table 2. The estimates of the interaction effect in (4) were close to zero, and the LRT and score test rejection rates were close to the nominal level (0.05). This approach could be easily used as an ad-hoc validation tool for assessing heterogeneous subgroups identified by the proposed methods.
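A rough way to judge whether the Scenario 3 rejection rates in Table 2 are "close to the nominal level" is to compare them with the Monte Carlo sampling error of a rejection rate. The sketch below assumes roughly 1,000 independent replications per cell and a normal approximation to the binomial; both are simplifying assumptions rather than details stated in the text, and the function name is hypothetical.

```python
import math


def mc_interval(nominal=0.05, replications=1000, z=1.96):
    """Approximate 95% range for an empirical rejection rate when the
    true rejection probability equals the nominal level."""
    se = math.sqrt(nominal * (1.0 - nominal) / replications)
    return nominal - z * se, nominal + z * se


# Scenario 3 rejection rates (%) copied from Table 2 (LRT row, then score test row).
reported = [5.47, 4.30, 5.47, 5.70, 5.81, 4.65,
            5.65, 4.30, 5.32, 5.78, 5.66, 4.65]

low, high = mc_interval()
inside = [low <= r / 100.0 <= high for r in reported]
print(f"approximate range: [{low:.4f}, {high:.4f}]")
print(f"all reported rates inside the range: {all(inside)}")
```

Under these assumptions the range is about 0.036 to 0.064, so none of the reported Scenario 3 rates departs from 0.05 by more than Monte Carlo error alone would explain.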
5. Application to the RV144 trial immune correlates study.
5.1. Application setting and variables of the RV144 immune correlates study.
We demonstrated the application of the proposed change-plane regression model with or without variable selection for the RV144 immune correlates study (Rolland et al., 2012; Rolland and Gilbert, 2012). The specific goal of the application was to find the combinations of immune markers that separate the vaccine recipients of RV144 into two subgroups in order to address the following questions: 1) to assess whether the associations between the infection risk and each of Avidity, ADCC, ICS, and NAB could be modified by the level of Env-specific IgA; 2) to identify immune markers whose association with the infection risk was not strong in the entire set of vaccine recipients, but was stronger within subgroups; and 3) to determine whether one subgroup exhibited a higher risk of developing infection than the other subgroup. We were also interested in finding a parsimonious set of immune markers that made a valuable contribution to the identification of heterogeneous subgroups through the change-plane. The resulting combination of markers would then serve as candidate correlates of protection for predicting the vaccine’s protective effect against HIV infection. The application of our proposed methods to the RV144 data targeted exploration and hypothesis generation; future studies to evaluate the subgroups and immunologic mechanisms are required to validate the CoR analysis results, which is beyond the scope of this study.
Considering a binary indicator of infection status as an outcome variable, a single logistic regression model was used in the previous immune correlates analysis to study the association between each of the immune markers and infection risk, with and without adjustment for the effect of baseline age and behavioral risk. The immune markers (X) listed in Section 2 and Tables 10 and 11 of the Supplementary materials represent candidate markers that can be used to build the change-plane. The baseline behavioral risk level (Moderate = 1{Moderate level of behavioral risk}; High = 1{High level of behavioral risk}; low risk as a reference level) and age group (Age1 = 1{21 ≤ Age ≤ 25}; Age2 = 1{Age ≥ 26}; 1{Age ≤ 20} as a reference level) were also considered as X, except for the change-plane model based on the use of Env-specific IgA in Section 5.2.1. The immune markers are measured on diverse scales, and thus each marker was standardized by using the empirical mean and standard deviation to enable the use of SCAD for variable selection. The reciprocal of the sampling ratio (infected: non-infected) was used to weight each individual in the regression model. All other settings for the estimation procedure were the same as those described in Section 4.1. For example, the optimal value for the SCAD penalty term, λ, was determined by minimizing the BIC. The minimum subgroup size was set equal to the total number of regression parameters or 5% of the data sample size, whichever was greater.
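The preprocessing steps described above can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the function names are hypothetical, the standard deviation uses the n − 1 divisor (the text does not say which divisor was used), and the sampling probabilities passed to the weighting function are made-up values.

```python
import math
from statistics import mean, stdev


def standardize(values):
    """Center and scale one marker by its empirical mean and standard deviation."""
    m, s = mean(values), stdev(values)  # stdev uses the n - 1 divisor (assumption)
    if s == 0:
        raise ValueError("marker has zero variance")
    return [(v - m) / s for v in values]


def inverse_sampling_weights(infected, prob_case, prob_control):
    """Weight each subject by the reciprocal of its sampling probability."""
    return [1.0 / (prob_case if y == 1 else prob_control) for y in infected]


def minimum_subgroup_size(n_subjects, n_regression_params, fraction=0.05):
    """Larger of the number of regression parameters and 5% of the sample size."""
    return max(n_regression_params, math.ceil(fraction * n_subjects))


marker = [1.0, 2.0, 3.0]                      # toy marker values
z = standardize(marker)
w = inverse_sampling_weights([1, 0], prob_case=1.0, prob_control=0.25)
n_min = minimum_subgroup_size(n_subjects=246, n_regression_params=6)
print(z, w, n_min)
```

With an illustrative sample of 246 subjects and 6 regression parameters, the 5% rule dominates, so the smallest admissible subgroup has 13 subjects.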
As an ad-hoc validation of the resulting subgroups, we fit the logistic regression model including the interaction with the change-plane subgroup indicator to the entire set of vaccine recipients. Since the sample size was small, we also repeatedly split the data into training and test data sets equally with different random seeds and evaluated the resulting subgroups using the test data set. The rejection rates of the score test for the zero interaction effect with the subgroup indicator over 100 random splits have been reported.
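The repeated-split validation can be outlined as a small loop. The sketch below is schematic: `fit_and_test` is a hypothetical callable standing in for "estimate the change-plane on the training half, then return the score test p-value on the test half", and the stub used in the demonstration returns a fixed value only so that the example runs.

```python
import random


def split_rejection_rate(subjects, fit_and_test, n_splits=100, alpha=0.05, seed=0):
    """Share of random half-splits in which the test on the held-out half rejects H0.

    fit_and_test(train, test) must return a p-value (hypothetical interface).
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_splits):
        shuffled = list(subjects)
        rng.shuffle(shuffled)                 # a new random split on every pass
        half = len(shuffled) // 2
        train, test = shuffled[:half], shuffled[half:]
        if fit_and_test(train, test) < alpha:
            rejections += 1
    return rejections / n_splits


def stub_fit_and_test(train, test):
    """Placeholder only: a real version would fit the model and run the score test."""
    return 0.01


rate = split_rejection_rate(list(range(20)), stub_fit_and_test)
print(f"rejection rate over 100 splits: {rate:.2f}")
```

Because each half contains only about half of an already small sample, the rejection rate obtained this way reflects the power of the test at the reduced sample size.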
5.2. Results of the application.
5.2.1. The change-plane using Env-specific IgA.
The interaction analyses using the weighted logistic or Cox regression models reported in previous studies suggested that Env-specific IgA antibodies might modify the association between the infection risk and each of Avidity, ADCC, ICS, and NAB. These four variables were inversely associated with the infection risk for the vaccine recipients whose Env-specific IgA level was low, while the association was not significant for vaccine recipients with high levels of Env-specific IgA (Haynes et al., 2012; Zolla-Pazner et al., 2014). Therefore, we first explored the heterogeneous associations between the infection risk and each of Avidity, ADCC, ICS, and NAB by fitting the change-plane model using Env-specific IgA alone as a change-plane covariate.
As shown in Table 3, after adjusting for the effect of behavioral risk and age, the slope of the regression line between ADCC and the risk of developing infection was significantly greater than zero for the vaccine recipients whose standardized value of IgA > 1.121 (estimate = 1.039, p-value = 0.002); the slope was negative, yet only marginally significant, for the vaccine recipients whose standardized value of IgA ≤ 1.121 (estimate = −0.399, p-value = 0.055). The interaction effect of the subgroup indicator based on Env-specific IgA and ADCC was statistically significant (Table 14 of the Supplementary materials). The subgroup indicator based on the estimated change-plane, after adjustment for the effect of behavioral risk and age, therefore indicates whether the standardized Env-specific IgA value exceeds 1.121.
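When the change-plane uses a single covariate, it reduces to a threshold on that covariate. The following worked equation sketches the reduction; the symbols $\gamma_0$, $\gamma_1$, and $\hat{C}$ are notation assumed here for illustration and may differ from the notation used elsewhere in the article.

```latex
\hat{C} \;=\; 1\{\hat\gamma_0 + \hat\gamma_1\,\mathrm{IgA} > 0\}
        \;=\; 1\{\mathrm{IgA} > -\hat\gamma_0/\hat\gamma_1\},
\qquad \hat\gamma_1 > 0 .
```

Under this reading, the reported threshold of about 1.121 on the standardized IgA scale plays the role of $-\hat\gamma_0/\hat\gamma_1$, and subgroup membership is determined by the sign of the fitted change-plane.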
TABLE 3.
Results of the RV144 trial data analysis using Env-specific IgA as a change-plane covariate. A weighted logistic regression model with and without adjustment for the effect of behavioral risk and age is fitted to each of the two estimated subgroups separately, to associate each variable with the infection risk. The subgroups are identified by the estimated change-plane using Env-specific IgA. A single weighted logistic regression is also applied to the data on all vaccine recipients (Single logistic).
| Variable | Single logistic: Estimate | Single logistic: Std.Err | Single logistic: P-value | Change-plane with IgA, low-IgA subgroup: Estimate | Low-IgA subgroup: Std.Err | Low-IgA subgroup: P-value | Change-plane with IgA, high-IgA subgroup: Estimate | High-IgA subgroup: Std.Err | High-IgA subgroup: P-value |
|---|---|---|---|---|---|---|---|---|---|
| ADCC without adjusting for the behavioral risk and age | | | | | | | | | |
| Intercept | −5.132 | 0.157 | < 10⁻⁵ | −5.469 | 0.215 | < 10⁻⁵ | −4.602 | 0.268 | < 10⁻⁵ |
| ADCC | −0.022 | 0.157 | 0.889 | −0.329 | 0.211 | 0.119 | 0.409 | 0.259 | 0.114 |
| ADCC adjusting for the behavioral risk and age | | | | | | | | | |
| Intercept | −5.345 | 0.403 | < 10⁻⁵ | −5.707 | 0.518 | < 10⁻⁵ | −2.771 | 0.708 | < 10⁻⁵ |
| ADCC | −0.089 | 0.164 | 0.587 | −0.399 | 0.208 | 0.055 | 1.039 | 0.342 | 0.002 |
| Age1 | −0.667 | 0.424 | 0.115 | −0.319 | 0.524 | 0.543 | −6.377 | 1.296 | < 10⁻⁵ |
| Age2 | 0.708 | 0.402 | 0.079 | 0.707 | 0.518 | 0.172 | −0.581 | 0.900 | 0.518 |
| Moderate | 0.006 | 0.407 | 0.989 | −0.014 | 0.483 | 0.976 | −0.952 | 1.035 | 0.358 |
| High | 0.796 | 0.377 | 0.035 | 0.530 | 0.471 | 0.261 | 3.628 | 0.950 | 0.0001 |
We noted that the slope of ADCC was negative in a single weighted logistic regression model, but the relationship was not significant (estimate = −0.022, p-value = 0.889 without adjustment; estimate = −0.089, p-value = 0.587 with adjustment). The proposed change-plane method helped identify two subgroups of recipients in which the relationship between ADCC levels and the risk of developing infection could differ. We could not find heterogeneous subgroups for Avidity, ICS, and NAB by using the change-plane based on Env-specific IgA alone. None of the interaction effects were statistically significant, as shown in Tables 13 and 14 of the Supplementary materials. However, for these variables, the estimated slopes in the change-plane regression model were negative for the low level of Env-specific IgA without adjustment for the effect of behavioral risk and age, which was consistent with the inverse association at low levels of IgA reported in previous studies.
5.2.2. The change-plane using 6 primary variables + V1V2-scaffold antigen using ELISA + demographics.
The results of the immune correlates analysis of each marker using the weighted logistic regression model are reported in Table 12 of the Supplementary materials. Among the six primary immune response variables and 20 antigens, no statistically significant relationship with the risk of developing infection was found for the following 13 markers: V1V2-IgG, ADCC, Avidity, ICS, NAB, S1, S3, S6, S9, S10, S15, S16, and S20. Similar to the interaction analysis described in Section 5.2.1, we sought to identify, among these 13 markers, those that did not exhibit a significant relationship with the risk of developing infection in the entire set of vaccine recipients but demonstrated a stronger relationship within subgroups in the logistic regression model. Instead of conducting an interaction analysis by exhaustively establishing interactions of each of the 26 markers with each of the 13 markers listed above, we applied the change-plane regression model to explore heterogeneous subgroups regarding the relationship with the risk of developing infection through a linear combination of markers. We focused on the relationship between the infection risk (Y) and each of the 13 markers (Z) through the adoption of a change-plane consisting of immune markers and demographics (X). Without using prior information on candidate moderators, we considered all available immune markers and demographics as candidate markers for the change-plane covariates, except for certain immune markers that exhibited extremely high correlations with other markers. Then, a subset of markers that contribute substantially to the subgroup identification can be identified by conducting variable selection among the candidate change-plane covariates using the SCAD penalty.
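For readers unfamiliar with the SCAD penalty, a minimal sketch of its standard form (Fan and Li, 2001) is given below. The default a = 3.7 is the value commonly recommended in that reference; whether the present analysis used this value is an assumption, and the function name is hypothetical.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty for a single coefficient (standard three-piece form).

    Linear (lasso-like) near zero, quadratic in the middle range, and
    constant beyond a * lam, so large coefficients are not shrunk further.
    """
    if lam < 0 or a <= 2:
        raise ValueError("requires lam >= 0 and a > 2")
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


# The penalty grows with |theta| and then levels off.
for value in (0.5, 1.0, 2.0, 3.7, 10.0):
    print(value, round(scad_penalty(value, lam=1.0), 4))
```

The flat region is what distinguishes SCAD from the lasso: small coefficients are pushed to zero, which removes covariates from the change-plane, while large coefficients incur a constant penalty.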
In this article, among the 13 markers listed above, we reported the change-plane analysis results for the markers that showed a significant interaction effect with the resulting subgroup indicator in the ad-hoc validations.
First, we applied the change-plane regression model to combine six primary immune response variables, selected V1V2-scaffold antigens using ELISA, V1V2-scaffold antigens generated using BAMA only, S20, the behavioral risk level, and age group, with and without variable selection. Among the eight V1V2-scaffold antigens using ELISA, five antigens (S3, S4, S5, S6, S8) were excluded because they were highly correlated (> 0.9) with the six immune response variables or other antigens using ELISA. In addition to the six immune response variables, behavioral risk, and age groups, the following markers were also included as candidate change-plane covariates: S1) gp70.A(92RW020)-V1V2.GN, S2) gp70.AE(92TH023)-V1V2.AP, S7) gp70.C(97ZA012)-V1V2.GN, S12) gp70.B(CaseA2)-V1V2.GN, S18) tags.AE(A244)-V1V2.LL, and S20) gp70 (CaseA2/ug/ml)-V1V2.CH58. The intercept and the four indicators of behavioral risk and age group were included by default in the variable selection to adjust for the effect of the demographics. The results of the regression model including the interaction with the resulting subgroup indicators are presented in Table 16 of the Supplementary materials along with the score test for the null hypothesis of no interaction effect at the significance level of 0.05. We have presented the density plots of the estimated probability of subgroup membership and box plots of immune markers for infected and non-infected vaccine recipients by estimated subgroups in Figures 2 and 3 of the Supplementary materials, respectively. The resulting subgroup indicators obtained by using the estimated change-plane without variable selection are described in Section 2.3 of the Supplementary materials. We repeated the change-plane analysis by replacing the V1V2-scaffold antigens using ELISA with the antigens using BAMA. The details of the change-plane analysis results are provided in Section 2.4 of the Supplementary materials.
ICS:
As shown in Table 4, the slope of the regression line between ICS and the risk of developing infection was significantly greater than zero (p-value = 0.0007 without SCAD; estimate = 0.588, p-value = 0.015 with SCAD) for vaccine recipients in one of the estimated subgroups, while the relationship was not significant for the remaining vaccine recipients. The box-plots of the standardized ICS levels for infected and non-infected recipients (Figure 4 of the Supplementary materials) show a different comparison result for the two estimated subgroups, which is consistent with the estimation result. When we fit a single logistic regression, the estimated effect of ICS was positive, but the relationship was not significant (estimate = 0.126, p-value = 0.397). The subgroup indicator based on the estimated change-plane with the selected variables can be expressed using the following:
TABLE 4.
Results of the RV144 data analysis obtained by combining six primary immune response variables, selected V1V2-scaffold antigens using ELISA, antigens measured using BAMA alone, gp70 (CaseA2/ug/ml)-V1V2.CH58, behavioral risk and age group for the change-plane with variable selection using SCAD. A weighted logistic regression model is fitted for each of the two subgroups estimated by using the change-plane regression methods. The reciprocal of the sampling probability is used as the weight for each individual.
| Variable | Single logistic: Estimate | Single logistic: Std.Err | Single logistic: P-value | Change-plane, first estimated subgroup: Estimate | First subgroup: Std.Err | First subgroup: P-value | Change-plane, second estimated subgroup: Estimate | Second subgroup: Std.Err | Second subgroup: P-value |
|---|---|---|---|---|---|---|---|---|---|
| ICS with variable selection | | | | | | | | | |
| Intercept | −5.139 | 0.158 | < 10⁻⁵ | −5.555 | 0.230 | < 10⁻⁵ | −4.494 | 0.226 | < 10⁻⁵ |
| ICS | 0.126 | 0.149 | 0.397 | 0.588 | 0.241 | 0.0147 | −0.314 | 0.303 | 0.300 |
| NAB with variable selection | | | | | | | | | |
| Intercept | −5.139 | 0.158 | < 10⁻⁵ | −5.889 | 0.311 | < 10⁻⁵ | −4.579 | 0.199 | < 10⁻⁵ |
| NAB | 0.105 | 0.168 | 0.532 | 0.889 | 0.308 | 0.0039 | −0.335 | 0.194 | 0.084 |
| S9:gp70.A(92RW020)-V1V2.GN with variable selection | | | | | | | | | |
| Intercept | −5.137 | 0.158 | < 10⁻⁵ | −6.054 | 0.308 | < 10⁻⁵ | −4.230 | 0.186 | < 10⁻⁵ |
| S9 | −0.287 | 0.151 | 0.057 | −0.680 | 0.342 | 0.047 | 0.087 | 0.180 | 0.631 |
| S16:gp70.C(1086)-V1V2.LL with variable selection | | | | | | | | | |
| Intercept | −5.129 | 0.157 | < 10⁻⁵ | −5.417 | 0.202 | < 10⁻⁵ | −3.891 | 0.255 | < 10⁻⁵ |
| S16 | −0.243 | 0.139 | 0.080 | −0.741 | 0.229 | 0.001 | −0.390 | 0.275 | 0.156 |
Based on the logistic regression model including the interaction with the change-plane subgroup indicator (Table 16 of the Supplementary materials), the two degrees of freedom score test was found to reject the null hypothesis of no interaction effect at the significance level of 0.05 (score test statistics 19.03, p-value < 0.0001 without SCAD; 15.45, p-value=0.0004 using SCAD). The rejection rate using the test data sets of 100 random splits was 27.9% and 21.8% with and without SCAD, respectively.
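The reported p-values can be checked by hand, because the upper tail of a chi-square distribution with two degrees of freedom has the closed form exp(−x/2). The sketch below assumes the usual asymptotic chi-square reference distribution for the two-degrees-of-freedom score statistic; the function name is hypothetical.

```python
import math


def chi2_sf_2df(statistic):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    if statistic < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    return math.exp(-statistic / 2.0)


# Score statistics for ICS as reported in the text.
for label, stat in [("without SCAD", 19.03), ("with SCAD", 15.45)]:
    print(f"ICS, {label}: statistic = {stat}, p = {chi2_sf_2df(stat):.5f}")
```

Under this assumption, the two statistics give p-values of about 0.00007 and 0.00044, in line with the reported "< 0.0001" and "0.0004".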
NAB:
The slope of the regression line between NAB and the risk of developing infection was significantly greater than zero (p-value = 0.019 without SCAD; estimate = 0.889, p-value = 0.0039 using SCAD) for vaccine recipients in one of the estimated subgroups, while the relationship was not significant for the remaining vaccine recipients (Table 4). The box-plots of the standardized NAB levels for infected and non-infected recipients (Figure 4 of the Supplementary materials) show a very clear difference for the two estimated subgroups. The estimated effect of NAB was positive in the single weighted logistic regression model, but the relationship was not significant (estimate = 0.105, p-value = 0.532).
The two degrees of freedom score test was found to reject the null hypothesis of no interaction effect (30.73, p-value < 0.0001 without SCAD; 24.45, p-value <0.0001 using SCAD). The rejection rate of the two degrees of freedom score test using the test data sets was 30.6% and 22.4% with and without SCAD, respectively.
S9-gp70.A(92RW020)-V1V2.GN:
The slope of the regression line between gp70.A(92RW020)-V1V2.GN and the risk of developing infection was smaller than zero (p-value = 0.053 without SCAD; estimate = −0.68, p-value = 0.047 with SCAD) for vaccine recipients in one of the estimated subgroups, while the relationship was not statistically significant for the remaining vaccine recipients (Table 4). The box-plots of the standardized gp70.A(92RW020)-V1V2.GN levels for infected and non-infected recipients show a very clear difference for the two estimated subgroups. The estimated effect of S9 in the single weighted logistic regression model was negative and marginally significant (estimate = −0.287, p-value = 0.057). The subgroup indicator based on the estimated change-plane with selected variables is as follows:
The two degrees of freedom score test was found to reject the null hypothesis of no interaction effect with the subgroup indicator (32.48, p-value < 0.0001 without SCAD; 37.904, p-value < 0.0001 using SCAD). The rejection rate using the test data sets was 22.8% and 22.1% with and without the SCAD penalty, respectively.
S16-gp70.C(1086)-V1V2.LL:
The slope of the regression line between gp70.C(1086)-V1V2.LL and the risk of developing infection was significantly smaller than zero (estimate = −0.741, p-value = 0.001) for vaccine recipients in one of the estimated subgroups, while the relationship was not statistically significant for the remaining vaccine recipients. The relationship was not statistically significant when all variables were combined without variable selection (Table 4). The box-plots of the standardized gp70.C(1086)-V1V2.LL levels for infected and non-infected recipients (Figure 4 of the Supplementary materials) show a very clear difference for the two estimated subgroups. The slope was negative and marginally significant (estimate = −0.243, p-value = 0.08) in the single weighted logistic regression model. The subgroup indicator based on the estimated change-plane with selected variables is as follows:
The two degrees of freedom score test was found to reject the null hypothesis of no interaction effect with the subgroup indicator (37.89, p-value < 0.0001 without SCAD; 37.9, p-value < 0.0001 with SCAD). The rejection rate using the test data sets was 23% and 31.1% with and without the SCAD penalty, respectively.
5.2.3. Clustering by using the change-plane model.
We also used the change-plane models in which the regression model included only the intercept term. The resulting change-plane model can be used to separate the vaccine recipients into two subgroups with different risks of developing HIV-1 infection. The combination of immune response endpoints that characterizes the change-plane, and is thus associated with the infection risk, could play an important role in vaccine development by helping to rank and select vaccine candidates based on their immunogenicity. Similar to the previous regression models, we combined data on the six primary immune response variables, baseline behavioral risk, and age with the selected V1V2-scaffold antigens using ELISA or BAMA separately. The score test using the weighted logistic regression model including the interaction effect with the resulting subgroup indicator was found to reject the null hypothesis of no interaction effect for both ELISA and BAMA sets, with and without variable selection (Table 18 of the Supplementary materials). The rejection rate of the score test using the test data sets of 100 random splits was not high (32.2% and 26.2% with and without SCAD, respectively, using ELISA set; 28.3% and 26.2% with and without SCAD, respectively, using BAMA set). The subgroup indicators obtained by using the estimated change-plane with the selected variables using ELISA or BAMA are as follows (the estimated change-plane without variable selection is presented in Section 2.5 of the Supplementary materials):
We compared the observed and estimated infection risks between the two subgroups as well as the overall infection risk (Table 5). Here, the observed infection risk was computed as the proportion of infected vaccine recipients among all subjects in each subgroup, weighted by the reciprocal of the sampling probability. The estimated change-plane could separate the vaccine recipients into two subgroups, one with a lower and the other with a higher infection risk than that of all study participants. The estimated risks computed from the fitted intercept-only model for each subgroup are similar to the observed infection risks, particularly with variable selection. The cells in Table 20 of the Supplementary materials present the empirical risks of the subgroups cross-classified by the change-plane models with and without variable selection, along with the number of observed vaccine recipients in each subgroup in parentheses. The subgroup with the highest infection risk was found with variable selection.
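The relationship between the observed and estimated risks can be sketched in a few lines. The example assumes that the estimated risk is the inverse-logit of the fitted intercept of an intercept-only weighted logistic model, in which case it coincides with the weighted proportion of infected subjects when the same weights are used; the function names and the toy data are hypothetical.

```python
import math


def weighted_risk(infected, weights):
    """Weighted proportion of infected subjects (weights = 1 / sampling probability)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(w for y, w in zip(infected, weights) if y == 1) / total


def expit(x):
    """Inverse logit, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def logit(p):
    return math.log(p / (1.0 - p))


# Toy subgroup: one infected subject sampled with certainty and three
# non-infected subjects, each sampled with probability 0.1.
infected = [1, 0, 0, 0]
weights = [1.0, 10.0, 10.0, 10.0]

observed = weighted_risk(infected, weights)
intercept = logit(observed)        # value a weighted intercept-only fit converges to
estimated = expit(intercept)
print(f"observed = {observed:.6f}, intercept = {intercept:.3f}, estimated = {estimated:.6f}")
```

As a rough check on scale, `expit(-5.132)` is about 0.00587, the overall risk shown in Table 5. The value −5.132 is the single-logistic intercept from Table 3, which comes from a model that also contains ADCC, so the agreement is only indicative.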
TABLE 5.
Results of the logistic regression model including the interaction terms with the subgroup indicators that are estimated by adopting the change-plane model to explore the heterogeneous risk of developing infection in the RV144 trial. The change-plane parameters are estimated by combining six primary immune response variables, selected V1V2-scaffold antigens (ELISA or BAMA) with variable selection using SCAD. Cells show the empirical risk ratio and estimated risk ratio with the standardized error estimates of the intercept parameter. Reciprocal of the sampling probability is used as the weight for each individual.
| V1V2-scaffold antigen | Variable selection | Overall risk | Lower-risk subgroup: empirical risk | Lower-risk subgroup: estimated risk | Lower-risk subgroup: n | Higher-risk subgroup: empirical risk | Higher-risk subgroup: estimated risk | Higher-risk subgroup: n |
|---|---|---|---|---|---|---|---|---|
| ELISA | No SCAD | 0.00587 | 0.001487 | 0.001487 | 122 | 0.011852 | 0.011856 | 124 |
| | SCAD | | 0.002017 | 0.002017 | 122 | 0.010922 | 0.010922 | 124 |
| BAMA | No SCAD | 0.00587 | 0.002492 | 0.002492 | 162 | 0.015809 | 0.015810 | 84 |
| | SCAD | | 0.002457 | 0.002456 | 164 | 0.016521 | 0.016520 | 82 |
6. Conclusion.
We applied the change-plane regression modeling method to explore potentially heterogeneous subgroups of vaccine recipients in the immune correlates analysis of the RV144 trial. We added a SCAD penalty to produce a parsimonious change-plane. Individual subgroup membership is determined by the sign of the change-plane. The simulation results demonstrated that the proposed penalized probabilistic version of the change-plane regression model could successfully identify the heterogeneous subgroups and the change-plane covariates that contributed to subgroup identification in the presence of heterogeneous subgroups. The results of the application to the RV144 trial suggested the existence of some immune markers that did not exhibit a strong relationship with the risk of developing infection among all vaccine recipients but demonstrated a stronger relationship within subgroups. The use of the SCAD penalized method enabled the identification of immune markers that contributed to subgroup identification. This method can also be used to estimate a threshold value of Env-specific IgA above and below which the association between ADCC and the infection risk may differ. Our analysis results can serve as a helpful exploratory tool to identify potential correlates of HIV infection risk and to determine the vaccine’s protective effect when used in combination with biological knowledge and further validations.
Our study has several limitations that should be considered for future studies. Regarding generalizability or reproducibility, a limitation of the study is that we did not present a well-established hypothesis test or external validation procedure to test the existence of the latent subgroup in the RV144 trial. Without such tests or external validations using an independent data set, these analyses may be considered hypothesis-generating. The development of a testing procedure for assessing the existence of latent subgroups in the change-plane approach is challenging because of the non-identifiability of change-plane parameters in the absence of latent subgroups (Kosorok et al., 2007). The application of recently developed testing procedures, such as a score test using a sphere coordinate transformation and resampling method (Fan, Song and Lu, 2017) or a maximum of likelihood ratio statistics (Huang, Cho and Fong, 2021), to our penalized change-plane model will be explored in a separate study. We deliberately restricted our consideration to the simple case in which the change-plane was constructed by using a linear combination of change-plane covariates. Our method can be easily extended to establish a non-linear form of the change-plane, for example, by using a nonparametric or kernel logistic regression model to improve the robustness. The proposed method with the SCAD penalty tends to require a large sample size for a high-dimensional change-plane. Further research is warranted to develop the asymptotic properties of the estimator and to improve variable selection performance for a high-dimensional change-plane.
Despite these limitations, the proposed method can be useful for exploring latent subgroups in which the subgroup can be identified by considering a small subset of variables from a collection of multiple candidate covariates. By using the logistic regression model to approximate the indicator of the standard change-plane model, we can compute the probability of subgroup membership using change-plane covariates. For example, we targeted the vaccine recipients in the RV144 trial whose estimated probability of membership was higher than 0.6 (who were more likely to be members of C = 1) or lower than 0.4 (who were more likely to be members of C = 0). The box plots of the targeted vaccine recipients show a stronger heterogeneous relationship between immune markers and the risk of developing infection than those of the entire set of vaccine recipients (Figures 3 and 5 of the Supplementary materials). The change-plane is a function of covariates; hence, we can use the method for subgroup identification and probability computation for a new vaccine recipient before the disease outcome information is obtained.
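The targeting rule described above can be sketched as follows. The example assumes that the membership probability is the inverse-logit of a linear combination of the standardized change-plane covariates; the coefficients and covariate values are made up for illustration, and the function names are hypothetical.

```python
import math


def membership_probability(x, gamma, intercept=0.0):
    """Approximate P(C = 1 | x) as expit(intercept + gamma . x) (assumed form)."""
    score = intercept + sum(g * v for g, v in zip(gamma, x))
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


def target_label(prob, upper=0.6, lower=0.4):
    """Return 1 or 0 for confidently classified subjects, None otherwise."""
    if prob > upper:
        return 1
    if prob < lower:
        return 0
    return None


gamma = [2.0, -1.0]                      # hypothetical change-plane coefficients
new_recipients = [[1.0, 0.0], [-1.0, 0.0], [0.1, 0.2]]
for x in new_recipients:
    p = membership_probability(x, gamma)
    print(x, round(p, 3), target_label(p))
```

Because only covariates enter the calculation, the same two functions apply to a new vaccine recipient whose infection outcome is not yet known; subjects with probabilities between 0.4 and 0.6 are left unclassified.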
Acknowledgements.
Kang was supported in part by a grant from the University of Pittsburgh Central Research Development Fund (CRDF) and by the University of Pittsburgh Center for Research Computing through the resources provided. Huang was supported in part by NIH R01 GM106177–01 and 2R37AI05465–08. The authors thank the participants, investigators and sponsors of the RV144 trial. The latter include the U.S. Military HIV Research Program (MHRP); U.S. Army Medical Research and Materiel Command; NIAID; U.S. and Thai Components, Armed Forces Research Institute of Medical Science; Ministry of Public Health, Thailand; Mahidol University; Sanofi Pasteur; and Global Solutions for Infectious Diseases. The views expressed are those of the authors and should not be construed to represent the positions of the U.S. Army, the Department of Defense, or HJF.
Footnotes
SUPPLEMENTARY MATERIAL
1. More results from the simulation study
More tables and graphs to summarize the results from the simulation study are provided.
2. More results from the application of the change-plane methods to the RV144 trial
We have provided more results from the application of the proposed change-plane regression model to the immune correlate analysis of the RV144 trial.
REFERENCES
- CHEN G, LIU Y, SHEN D and KOSOROK MR (2016). Composite large margin classifiers with latent subclasses for heterogeneous biomedical data. Statistical Analysis and Data Mining: The ASA Data Science Journal 9 75–88.
- CHUNG H, FLAHERTY BP and SCHAFER JL (2006). Latent class logistic regression: application to marijuana use and attitudes among high school seniors. Journal of the Royal Statistical Society: Series A (Statistics in Society) 169 723–743.
- COREY L, GILBERT PB, TOMARAS GD, HAYNES BF, PANTALEO G and FAUCI AS (2015). Immune correlates of vaccine protection against HIV-1 acquisition. Science Translational Medicine 7 310rv7–310rv7.
- CRAVEN P and WAHBA G (1978). Smoothing noisy data with spline functions. Numerische Mathematik 31 377–403.
- DEMPSTER AP, LAIRD NM and RUBIN DB (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39 1–22.
- FAN J and LI R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96 1348–1360.
- FAN A, SONG R and LU W (2017). Change-plane analysis for subgroup detection and sample size calculation. Journal of the American Statistical Association 112 769–778.
- HAYNES BF, GILBERT PB, MCELRATH MJ, ZOLLA-PAZNER S, TOMARAS GD, ALAM SM, EVANS DT, MONTEFIORI DC, KARNASUTA C, SUTTHENT R et al. (2012). Immune-correlates analysis of an HIV-1 vaccine efficacy trial. New England Journal of Medicine 366 1275–1286.
- HUANG Y, CHO J and FONG Y (2021). Threshold-based subgroup testing in logistic regression models in two-phase sampling designs. Journal of the Royal Statistical Society: Series C 70 291–311.
- INAN G and WANG L (2017). PGEE: An R Package for Analysis of Longitudinal Data with High-Dimensional Covariates. The R Journal 9 393–402.
- KANG C (2011). New statistical learning methods for chemical toxicity data analysis. PhD Thesis, The University of North Carolina at Chapel Hill.
- KANG S, LU W and SONG R (2017). Subgroup detection and sample size calculation with proportional hazards regression for survival data. Statistics in Medicine 36 4646–4659.
- KOSOROK MR, SONG R et al. (2007). Inference under right censoring for transformation models with a change-point based on a covariate threshold. Annals of Statistics 35 957–989.
- RERKS-NGARM S, PITISUTTITHUM P, NITAYAPHAN S, KAEWKUNGWAL J, CHIU J, PARIS R, PREMSRI N, NAMWAT C, DE SOUZA M, ADAMS E et al. (2009). Vaccination with ALVAC and AIDSVAX to prevent HIV-1 infection in Thailand. New England Journal of Medicine 361 2209–2220.
- ROLLAND M and GILBERT P (2012). Evaluating immune correlates in HIV type 1 vaccine efficacy trials: what RV144 may provide. AIDS Research and Human Retroviruses 28 400–404.
- ROLLAND M, EDLEFSEN PT, LARSEN BB, TOVANABUTRA S, SANDERS-BUELL E, HERTZ T, CARRICO C, MENIS S, MAGARET CA, AHMED H et al. (2012). Increased HIV-1 vaccine efficacy against viruses with genetic signatures in Env V2. Nature 490 417–420.
- SHEN J and HE X (2015). Inference for subgroup analysis with a structured logistic-normal mixture model. Journal of the American Statistical Association 110 303–312.
- SHEN J and QU A (2020). Subgroup analysis based on structured mixed-effects models for longitudinal data. Journal of Biopharmaceutical Statistics 30 607–622.
- SIMON R and WANG S (2006). Use of genomic signatures in therapeutics development in oncology and other diseases. The Pharmacogenomics Journal 6 166–173.
- SUN X, BRIEL M, WALTER SD and GUYATT GH (2010). Is a subgroup effect believable? Updating criteria to evaluate the credibility of subgroup analyses. BMJ 340.
- UEKI M (2009). A note on automatic variable selection using smooth-threshold estimating equations. Biometrika 96 1005–1011.
- WANG L, ZHOU J and QU A (2012). Penalized generalized estimating equations for high-dimensional longitudinal data analysis. Biometrics 68 353–360.
- WEI S and KOSOROK MR (2013). Latent supervised learning. Journal of the American Statistical Association 108 957–970.
- WEI S and KOSOROK MR (2018). The change-plane Cox model. Biometrika 105 891–903.
- YATES NL, DECAMP AC, KORBER BT, LIAO H-X, IRENE C, PINTER A, PEACOCK J, HARRIS LJ, SAWANT S, HRABER P et al. (2018). HIV-1 envelope glycoproteins from diverse clades differentiate antibody responses and durability among vaccinees. Journal of Virology 92.
- ZOLLA-PAZNER S, DECAMP A, GILBERT PB, WILLIAMS C, YATES NL, WILLIAMS WT, HOWINGTON R, FONG Y, MORRIS DE, SODERBERG KA et al. (2014). Vaccine-induced IgG antibodies to V1V2 regions of multiple HIV-1 subtypes correlate with decreased risk of HIV-1 infection. PLoS ONE 9 e87572.
- ZOU H and LI R (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics 36 1509.
