Abstract
Preventive vaccines are an effective public health intervention for reducing the burden of infectious diseases, but have yet to be developed for several major infectious diseases. Vaccine sieve analysis studies whether and how the efficacy of a vaccine varies with the genetics of the infectious pathogen, which may help guide future vaccine development and deployment. A standard statistical approach to sieve analysis compares the effect of the vaccine to prevent infection and disease caused by pathogen types defined dichotomously as genetically near or far from a reference pathogen strain inside the vaccine construct. For example, near may be defined by amino acid identity at all amino acid positions considered in a multiple alignment and far defined by at least one amino acid difference. An alternative approach is to study the efficacy of the vaccine as a function of genetic distance from a pathogen to a reference vaccine strain where the distance cumulates over the set of amino acid positions. We propose a nonparametric method for estimating and testing the trend in the effect of a vaccine across genetic distance. We illustrate the operating characteristics of the estimator via simulation and apply the method to a recent preventive malaria vaccine efficacy trial.
Keywords: vaccines, competing risks, causal inference, marginal structural model, Hamming distance
1. Introduction
Over the past century, disease burden due to infectious pathogens has been substantially reduced by preventive vaccines. However, many existing vaccines are only partially efficacious, a fact that may be explained in part by genetic heterogeneity of pathogens. Whereas vaccines are typically constructed using only one or several specific pathogen sequences, pathogens may exhibit broad genetic heterogeneity. Thus, the vaccine may stimulate immune responses that are protective against infection or disease caused by pathogens with these few sequences, but may not confer protection more broadly, leading to reduced overall efficacy. Therefore, it is often informative to study whether and how the efficacy of a partially effective vaccine varies by pathogen genetics.
A common analogy used to describe the protective mechanism of vaccines is to imagine a vaccine as a sieve: the vaccine blocks infection or disease caused by some genotypes of pathogens, but lets others through. Equivalently, the sieve may be regarded as the latent immune response elicited by vaccination, which may differ based on pathogen genotypes and may impact risk of infection or disease. These analogies have led to the study of vaccine efficacy as a function of pathogen genetics being referred to as vaccine sieve analysis (Gilbert et al., 1998, 2001). A vaccine is said to exhibit a sieve effect at a particular genetic region of the pathogen if the vaccine is differentially efficacious against pathogens depending on their amino acid sequence in that region. Identification of sieve effects may help guide the selection of antigens to include in future multivalent vaccines.
It is common practice in sieve analysis to compare binary categories of genetic sequences, such as the vaccine’s efficacy for preventing infection or disease caused by (a) pathogens that are fully matched to the vaccine in a particular region, i.e., have the same genetic sequence, versus (b) pathogens that are mismatched to the vaccine in this region, i.e., have at least one amino acid difference (e.g., Rolland et al., 2012; Neafsey et al., 2015). We expect that if a sieve effect is present, (a) will be greater than (b), indicating that the vaccine works better against matched than mismatched pathogens. Under a binary categorization, many genotypes are combined in the mismatched category: the category includes pathogens with a single amino acid that differs from the vaccine, as well as pathogens with many amino acids that differ from the vaccine. An alternative approach is to study how vaccine efficacy depends on genetic distance from the vaccine, which can be helpful for better understanding mechanisms of protection associated with antibody epitopes in the selected region. This approach can help overcome the common problem that, although a large number of discrete genotypes exist, there are often too few infection or disease endpoints of individual genotypes to reliably assess vaccine efficacy against each genotype. Several methodologies have been proposed for estimating genotype-specific vaccine efficacy with genotypes defined by genetic distance from the vaccine (Gilbert et al., 2008; Sun et al., 2009, 2012; Juraska and Gilbert, 2013, 2016). Whereas most approaches have developed tests centered around parameters that describe the vaccine’s effect on the hazard or instantaneous risk of disease or infection acquisition, Gilbert et al. (2008) study the vaccine’s effect on cumulative risk of disease or infection as a function of genetic distance, which may be more relevant for public health decision making in the common setting of waning vaccine efficacy.
We propose a new approach to studying the vaccine’s effect on cumulative risk as a function of genetic distance. Our approach differs from that of Gilbert et al. (2008) in several ways. Gilbert and colleagues focused on nonparametric smoothing for a continuous distance, while our approach treats distance as an ordered count with many categories and avoids nonparametric smoothing. Our approach allows for participant dropout to depend on measured characteristics, whereas the approach of Gilbert and colleagues assumed random censoring. Finally, our approach lends itself to a causal interpretation under assumptions, whereas that of Gilbert et al. (2008) does not.
Our study is motivated by a Phase III randomized, controlled trial of the RTS,S/AS01 vaccine (henceforth, RTS,S), which, from a regulatory perspective, is the most advanced vaccine candidate for malaria prevention. Malaria is the clinical disease associated with infection by the Plasmodium parasite. The parasite has a complicated life-cycle, which involves stages in the human blood and liver, as well as stages in a mosquito host. The RTS,S vaccine consists of a single antigen: a portion of the circumsporozoite protein (CS protein) found on the surface of the Plasmodium falciparum parasite during the pre-erythrocytic (sporozoite) stage of its life-cycle. The vaccine reduced the average instantaneous risk of clinical malaria over 12 months by an estimated 63% in 5–17 month old children (RTS,S Clinical Trials Partnership, 2011; Agnandji et al., 2012) and pilot implementation programs are disseminating the vaccine in areas of high malaria transmission (WHO, 2016). The protective mechanism of the vaccine is incompletely understood and differing hypotheses exist regarding how immunity is mediated through immune responses. We are thus motivated to study how vaccine efficacy varies as a function of genetic distance to contribute to scientific understanding of the immune mechanisms associated with the RTS,S vaccine.
The remainder of the article is organized as follows. In Section 2, we introduce notation, formalize our definition of a sieve effect, and define a parameter to describe the trend in vaccine efficacy as a function of genetic distance. In Section 3, we discuss estimation of sieve effects and inference for the trend parameter. In Section 4, we include a simulation study and in Section 5 we analyze the RTS,S data. We conclude with a discussion.
2. Notation and parameter of interest
Data typical of vaccine sieve analysis are generated as follows. Trial participants are enrolled and baseline covariates W are measured. Participants are subsequently randomly assigned to either an active vaccine (Z = 1) or a control vaccine (Z = 0). The random assignment could depend on covariates, as in the RTS,S/AS01 trial, where vaccine assignment was randomized within each of eleven study sites. Participants are followed for a fixed study period and monitored for the occurrence of a study endpoint, such as pathogen infection or clinical disease with documentation of pathogen infection. We use T to denote the time from baseline until the first endpoint. We assume that T takes only a finite number of values, as is the case if T corresponds to days until first endpoint. Participants who do not experience an endpoint are followed until completion of the study at a pre-specified time τ. It is common in trials with longitudinal follow-up that some participants are right-censored before τ. We use U to denote the time until last follow-up. We set U = τ for participants who do not experience a study endpoint by τ. For each observed endpoint, we obtain the genetic sequence of the pathogen at the time of the endpoint. Based on a multiple alignment of pathogen amino acid sequences from subjects with the failure event, we use J ∈ {1, … , K + 1} to categorize sequences based on their genetic distance from the vaccine antigen in a genetic region of interest comprising K amino acids. Sequences are categorized so that a categorization of J corresponds to endpoints with genetic distance J−1. For example, we use J = 1 to denote sequences that are fully matched to the vaccine along the entire region of interest (i.e., genetic distance of zero), J = 2 for sequences with genetic distance of one, and so on. Due to censoring, we do not observe T, U, and J for all participants; instead, we observe $\tilde{T} := \min(T, U)$ and $\Delta J$, where $\Delta := I(T \le U)$.
The observed data are assumed to be independent and identically distributed copies of $O := (W, Z, \tilde{T}, \Delta J)$.
For our developments, it is useful to alternatively express the observed data in terms of discrete counting processes. Specifically, we write $O = (W, Z, \{N_j(t), C(t) : j = 0, \ldots, K \text{ and } t = 1, \ldots, \tau\})$, where for $j = 0, \ldots, K$, $N_j(t) := I(\tilde{T} \le t, \Delta J = j + 1)$, and $C(t) := I(\tilde{T} \le t, \Delta = 0)$, with $\tilde{T}$ the observed follow-up time and $\Delta$ the indicator that an endpoint was observed. In words, $N_j(t)$ is an indicator that an observed endpoint occurred prior to or at time t, and that the pathogen associated with the endpoint had distance j from the vaccine insert. Similarly, $C(t)$ is an indicator that right-censoring has occurred prior to or at time t. We will also use the shorthand $N_\cdot(t)$ to simultaneously refer to $N_j(t)$ for all j = 0, … , K; for example, we write $N_\cdot(t) = 0$ to denote that $N_0(t) = 0, \ldots, N_K(t) = 0$. By convention, if a participant has an endpoint with genetic distance k at time s, then we set $N_k(t) = 1$ for all t > s, $N_j(t) = 0$ for j ≠ k and t > s, and $C(t) = 0$ for all t ≥ s. If a participant is censored at time s, we arbitrarily set $N_\cdot(t) = 0$ for all t > s. We denote by $P_0$ the true distribution of O and denote by $\mathcal{M}$ our statistical model, which we take to be nonparametric. However, our theoretical developments apply to any model that makes assumptions about the probability of receiving vaccine given covariates and the probability of censoring given vaccine status and covariates.
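As a concrete illustration of this encoding, the following Python sketch builds the counting processes for a single participant. It assumes the definitions $N_j(t) = I(\tilde{T} \le t, \Delta J = j + 1)$ and $C(t) = I(\tilde{T} \le t, \Delta = 0)$; the function name `counting_processes` and its argument names are hypothetical and chosen only for this example.

```python
def counting_processes(t_obs, delta_j, K, tau):
    """Encode one participant's observed data as discrete counting processes.

    t_obs   : observed time min(T, U), an integer in 1..tau
    delta_j : 0 if right-censored, else the category J in 1..K+1
              (J = j + 1 codes genetic distance j)
    Returns (N, C) with N[j][t - 1] = N_j(t) and C[t - 1] = C(t), t = 1..tau.
    """
    if not 1 <= t_obs <= tau:
        raise ValueError("t_obs must lie in 1..tau")
    if not 0 <= delta_j <= K + 1:
        raise ValueError("delta_j must lie in 0..K+1")
    times = range(1, tau + 1)
    N = [[int(delta_j == j + 1 and t_obs <= t) for t in times]
         for j in range(K + 1)]
    C = [int(delta_j == 0 and t_obs <= t) for t in times]
    return N, C


# Endpoint with genetic distance 1 observed at time 3 (K = 2, tau = 5).
N, C = counting_processes(t_obs=3, delta_j=2, K=2, tau=5)
```

In this example only the process for distance 1 jumps to one (at time 3), and the censoring process stays at zero throughout.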
We define our causal parameter of interest using a structural causal model (Pearl, 2009). We assume that each component of the observed data structure is a function of a set of observed parent variables and an unmeasured exogenous error term. The observed parent of Z is W. For t = 1, … , τ, the observed parents of $N_j(t)$ are W, Z, C(t−1), and $N_\cdot(t-1)$, and the observed parents of C(t) are W, Z, C(t−1) and $N_\cdot(t)$. We denote by $P_{0,z}$ the distribution the data would have under an intervention that sets Z = z (e.g., assigning vaccine to all participants) and sets C(t) = 0 for t = 1, … , τ−1 (i.e., assigning all participants to remain under study). We refer to this distribution as the post-intervention distribution and define $O_z := (W, z, \{N_{j,z}(t) : j = 0, \ldots, K \text{ and } t = 1, \ldots, \tau\})$ to be a counterfactual random variable with this distribution. The counterfactual parameter

$$F_{j,0}^{(z)}(\tau) := P_{0,z}\{N_{j,z}(\tau) = 1\} \tag{1}$$

is the proportion of participants who experience an endpoint with genetic distance j from the vaccine antigen by time τ if all participants are assigned Z = z and remain under study until τ. A typical summary of the causal effect of the vaccine relative to the control vaccine for preventing endpoints with genetic distance j is genotype-specific vaccine efficacy $\mathrm{VE}_{j,0}(\tau) := 1 - F_{j,0}^{(1)}(\tau) / F_{j,0}^{(0)}(\tau)$. Values of $\mathrm{VE}_{j,0}(\tau)$ near to one indicate reduced incidence of endpoints caused by a pathogen of genetic distance j. Note that the scale of $\mathrm{VE}_{j,0}(\tau) \in (-\infty, 1]$ is not symmetric, so we instead focus on a parameter with a symmetric scale, the log-ratio of counterfactual cumulative incidence control vs. vaccine,

$$L_{j,0}(\tau) := \log\left\{ \frac{F_{j,0}^{(0)}(\tau)}{F_{j,0}^{(1)}(\tau)} \right\}. \tag{2}$$
Values of Lj,0(τ) greater than zero indicate a higher incidence of j-distance endpoints when control vaccine is assigned compared to when the active vaccine is assigned. Large positive values of Lj,0(τ) indicate the vaccine works well at preventing endpoints with distance j from the vaccine. Furthermore, Lj,0(τ) can be related back to vaccine efficacy, VEj,0(τ) = 1−exp{−Lj,0(τ)}.
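The relationship between the two scales can be checked numerically. The short Python sketch below uses made-up cumulative incidences (0.10 under control, 0.05 under vaccine) purely for illustration; the function names are hypothetical.

```python
import math


def log_ratio(f_control, f_vaccine):
    """Log-ratio of cumulative incidences, control vs. vaccine."""
    return math.log(f_control / f_vaccine)


def vaccine_efficacy(log_ratio_value):
    """Map the log-ratio back to the vaccine efficacy scale: 1 - exp(-L)."""
    return 1.0 - math.exp(-log_ratio_value)


# Illustrative values only: incidence is halved by the vaccine.
L = log_ratio(f_control=0.10, f_vaccine=0.05)
ve = vaccine_efficacy(L)
```

Here the log-ratio equals log 2 and the corresponding vaccine efficacy is one half, matching 1 minus the ratio of vaccine to control incidence.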
We are interested in testing the null hypothesis that Lj,0(τ) is the same for all j = 0, … , K. One approach for testing this hypothesis is to estimate Lj,0(τ) for j = 0, … , K, test each pairwise null hypothesis Lj,0 = Lk,0 for j = 0, … , K and k ≠ j, and perform a multiplicity correction to properly control the type-one error rate. However, if K is large, there will be many pairwise comparisons and multiplicity corrections may result in a test with low overall power to detect sieve effects. Instead, we propose to test for a trend in the effect across genetic distances. We expect that if there is a sieve effect present in the genetic region under study, then Lj,0(τ) will be monotone non-increasing in j. That is, the vaccine will work best against endpoints that are genetically similar to the vaccine antigen with efficacy decreasing as endpoints become less similar to the vaccine. Thus, we would like to design a test that has high power under this alternative hypothesis. To that end, we consider the parameter
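$$(\alpha_0, \beta_0) := \operatorname*{argmin}_{(\alpha, \beta) \in \mathbb{R}^2} \sum_{j=0}^{K} \omega_j \{L_{j,0}(\tau) - \alpha - \beta j\}^2, \tag{3}$$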
where ω := (ωj : j) is a user-specified vector of positive weights. We discuss choices of weight function in Section 3.4. This parameter equals the weighted L2 projection of the true function describing how Lj,0(τ) varies with j onto a linear working model. Notice that the linear working model is truly a working model in the sense that the definition of the parameter does not depend on the true function being linear. The parameter β0 merely provides a useful summary of the trend in the vaccine’s effect across genetic distance. In particular, note that β0 = 0 corresponds with the null hypothesis of interest, that Lj,0(τ) is the same for all j = 0, … , K, while values of β0 less than 0 indicate decreasing efficacy with increasing distance. Our goal is to estimate β0 and design a test of the null hypothesis that β0 = 0.
3. Identification, estimation and inference
3.1. Identification
Our approach to estimating β0 is to estimate the counterfactual cumulative incidence $F_{j,0}^{(z)}(\tau)$ for z = 0, 1 and j = 0, … , K. Subsequently, these estimates are plugged into (2) and (3) to obtain estimates of $L_{j,0}$ and (α0, β0), respectively. The counterfactual cumulative incidence may be estimated using the observed data under the following assumptions:
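(consistency) $N_{j,z}(t) = N_j(t)$ if Z = z, C(t−1) = 0 for t = 1, … , τ−1;
(no interference) the counterfactual outcome for participant i depends only on the treatment assignment for participant i;
(sequential randomization) $N_{j,z}(t) \perp Z \mid W$ and $N_{j,z}(t) \perp C(t-1) \mid Z = z, C(t-2) = 0, N_\cdot(t-2) = 0, W$, for t = 1, … , τ;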
(positivity) P0(Z = z | W) and P0(C(t−1) = 0 | Z = z, C(t−2) = 0, N․(t−2) = 0, W) for t = 1, … , τ−1 are each strictly greater than zero with probability one.
The first two assumptions are fundamental in order to ensure that the counterfactual endpoints are well defined. The consistency assumption essentially says that the hypothetical intervention that assigns vaccine and no censoring does not fundamentally alter the way the vaccine works. Thus, the outcome we see in the observed data is the same outcome we would have seen under this hypothetical intervention. The assumption of no interference is often dubious in infectious disease settings, where the infection status of a given participant might depend on whether or not their family and friends also received the vaccine (Hudgens and Halloran, 2008). Given the complicated life cycle of malaria, it is uncertain the degree to which the assumption of no interference may have been violated in the RTS,S trial. We proceed as though this assumption approximately holds, leaving to future work the development of methodologies which fully relax this assumption. The sequential randomization assumption states that there are no unmeasured confounders of vaccine assignment nor of censoring. The former is guaranteed by randomization of vaccine assignment in clinical trials, while the latter is generally not testable based on observed data. Instead, we must hope that the covariates W collected are sufficiently rich so as to include all possible variables that might influence both a participant’s risk of the endpoint and her/his propensity to drop out of the trial. The positivity assumption states that there are no groups of participants with zero probability of receiving the vaccine and remaining uncensored. Because this is an assumption on the observed data distribution P0, this assumption can be studied empirically (Petersen et al., 2010).
If these assumptions hold, then we can estimate counterfactual cumulative incidence by estimating
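$$F_{j,0}^{(z)}(\tau) = \int \sum_{t=1}^{\tau} \lambda_{j,0}(t \mid z, w) \prod_{s=1}^{t-1} \{1 - \lambda_{\cdot,0}(s \mid z, w)\} \, dG_0(w), \tag{4}$$

where for t = 1, … , τ, we define $\lambda_{j,0}(t \mid z, w) := P_0\{N_j(t) = 1 \mid N_\cdot(t-1) = 0, C(t-1) = 0, Z = z, W = w\}$, $\lambda_{\cdot,0}(t \mid z, w) := \sum_{j=0}^{K} \lambda_{j,0}(t \mid z, w)$, and $G_0(w) := P_0(W \le w)$. We refer to $\lambda_{j,0}$ as the conditional genotype-specific hazard function, $\lambda_{\cdot,0}$ as the conditional total hazard function, and $G_0$ as the distribution of baseline covariates. The key to proving (4) is that for t = 1, … , τ, under sequential randomization

$$P_{0,z}\{N_{j,z}(t) = 1 \mid N_{\cdot,z}(t-1) = 0, W = w\} = \lambda_{j,0}(t \mid z, w). \tag{5}$$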
That is, the covariate-conditional cause-specific hazard function for the counterfactual counting process is equal to the covariate-conditional cause-specific hazard function for the observed counting process. Intuitively, if W contains all information about meaningful differences between participants with respect to their probability of endpoints and censoring, then within each stratum defined by W at a given time, there are no meaningful differences (with respect to the probability of experiencing an endpoint) between participants who have previously dropped out of the trial and those who remain. Thus, the counterfactual probability of an endpoint at the next time point in each stratum is identical to the observed data probability of an endpoint at the next time point. Notice that the positivity assumption is required in order that the right-hand-side of (5) is well-defined.
Once the equivalence of counterfactual and observed data hazard functions is established, all that remains is to relate the observed data hazard function to cumulative incidence. Toward that end, it is helpful to enumerate the possible ways that an endpoint with genetic distance j may be observed by time τ given vaccine assignment Z = z and covariates W = w. Writing $\lambda_{j,0}(t \mid z, w)$ for the conditional genotype-specific hazard and $\lambda_{\cdot,0}(t \mid z, w)$ for the conditional total hazard at time t, such an endpoint may be observed at the first time point, which occurs with probability $\lambda_{j,0}(1 \mid z, w)$. The endpoint may also be observed to occur at the second time point, in which case no endpoint of any type must have occurred at the first time point. Given no endpoint at the first time point, the probability of an event at the second time point is $\lambda_{j,0}(2 \mid z, w)$, while the probability of no endpoint at the first time point is $1 - \lambda_{\cdot,0}(1 \mid z, w)$. Thus, the probability of observing a j-distance endpoint at the second time is $\lambda_{j,0}(2 \mid z, w)\{1 - \lambda_{\cdot,0}(1 \mid z, w)\}$ and the cumulative probability of observing an endpoint by the second time is $\lambda_{j,0}(1 \mid z, w) + \lambda_{j,0}(2 \mid z, w)\{1 - \lambda_{\cdot,0}(1 \mid z, w)\}$. The summation in (4) is thus made plain: the t-th term in the sum is the probability of observing a j-distance endpoint at t given no previous endpoint, $\lambda_{j,0}(t \mid z, w)$, multiplied by the probability that no endpoint of any distance was observed prior to t, $\prod_{s=1}^{t-1}\{1 - \lambda_{\cdot,0}(s \mid z, w)\}$. The sum therefore yields the cumulative probability of j-distance events in participants with Z = z and W = w, while the integral averages these covariate-conditional probabilities with respect to the distribution of covariates in the participant population.
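The enumeration above can be written as a short loop. The following Python sketch computes the inner sum of (4) within a single stratum (fixed z and w) from a table of conditional hazards; the function name and the hazard values are hypothetical and serve only to illustrate the arithmetic.

```python
def cumulative_incidence(hazards, j):
    """Cumulative incidence of distance-j endpoints within one stratum.

    hazards[t - 1][k] is the conditional hazard of a distance-k endpoint at
    time t, given no earlier endpoint; the number of rows plays the role of tau.
    """
    no_prior_endpoint = 1.0  # probability of no endpoint of any distance before t
    total = 0.0
    for hazards_at_t in hazards:
        total += hazards_at_t[j] * no_prior_endpoint
        no_prior_endpoint *= 1.0 - sum(hazards_at_t)
    return total


# Two time points, two genetic distances, constant hazards (illustrative).
hazards = [[0.1, 0.2], [0.1, 0.2]]
f0 = cumulative_incidence(hazards, j=0)  # 0.1 + 0.1 * 0.7
f1 = cumulative_incidence(hazards, j=1)  # 0.2 + 0.2 * 0.7
```

As a sanity check, the cumulative incidences across all distances plus the probability of remaining endpoint-free (here 0.7 squared) add up to one.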
3.2. Estimation
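Equation (4) implies that an estimate of cumulative incidence may be obtained by estimating the conditional genotype-specific hazard, the conditional total hazard, and $G_0$ and substituting these estimates into (4). If covariates are low-dimensional and discrete, then we can use empirical estimates for each of these components. For t = 1, … , τ we define the at-risk indicator $R_i(t) := I\{N_{\cdot,i}(t-1) = 0, C_i(t-1) = 0\}$ and the stratum indicator $I_i(z, w) := I(Z_i = z, W_i = w)$. Empirical estimates of the conditional genotype-specific and total hazard at time t = 1, … , τ can be computed respectively as

$$\lambda_{j,n}(t \mid z, w) := \frac{\sum_{i=1}^{n} N_{j,i}(t) R_i(t) I_i(z, w)}{\sum_{i=1}^{n} R_i(t) I_i(z, w)}, \qquad \lambda_{\cdot,n}(t \mid z, w) := \sum_{j=0}^{K} \lambda_{j,n}(t \mid z, w).$$

Similarly, we may use empirical estimates of the distribution of baseline covariates,

$$G_n(w) := \frac{1}{n} \sum_{i=1}^{n} I(W_i \le w).$$

Together, these empirical estimates of the hazard functions and baseline covariate distribution give an estimate of cumulative incidence:

$$F_{j,n}^{(z)}(\tau) := \frac{1}{n} \sum_{i=1}^{n} \sum_{t=1}^{\tau} \lambda_{j,n}(t \mid z, W_i) \prod_{s=1}^{t-1} \{1 - \lambda_{\cdot,n}(s \mid z, W_i)\}. \tag{6}$$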
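A minimal Python sketch of this stratified plug-in estimator follows. It assumes each record is a tuple `(w, z, t_obs, delta_j)` with `delta_j = 0` for censored participants and `delta_j = j + 1` for an endpoint with distance j; this record layout, the function name, and the toy data are assumptions made for illustration and are not taken from the trial analysis.

```python
from collections import Counter


def empirical_cumulative_incidence(data, j, z, tau):
    """Stratified plug-in estimate of cumulative incidence by time tau."""
    n = len(data)
    # Empirical distribution of W uses participants from both arms.
    w_counts = Counter(w for w, _, _, _ in data)
    estimate = 0.0
    for w, n_w in w_counts.items():
        stratum = [(t, d) for (wi, zi, t, d) in data if wi == w and zi == z]
        if not stratum:
            raise ValueError("empirical positivity violation in stratum %r" % (w,))
        no_prior_endpoint, f_w = 1.0, 0.0
        for t in range(1, tau + 1):
            at_risk = sum(1 for ti, _ in stratum if ti >= t)
            if at_risk == 0:
                break  # nobody left at risk: later terms are treated as zero
            events_j = sum(1 for ti, d in stratum if ti == t and d == j + 1)
            events_any = sum(1 for ti, d in stratum if ti == t and d > 0)
            f_w += no_prior_endpoint * events_j / at_risk
            no_prior_endpoint *= 1.0 - events_any / at_risk
        estimate += (n_w / n) * f_w
    return estimate


toy = [(0, 1, 1, 1), (0, 1, 2, 2), (0, 1, 3, 0), (0, 1, 3, 1),
       (0, 0, 1, 1), (0, 0, 3, 0)]
f_vaccine_dist0 = empirical_cumulative_incidence(toy, j=0, z=1, tau=3)
```

In the toy data nobody is censored before τ = 3, so the estimate for the vaccine arm reduces to the simple proportion of distance-0 endpoints (2 of 4).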
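An estimate $L_{j,n}(\tau)$ of $L_{j,0}(\tau)$ may be computed for each j = 0, … , K by computing (6) for j = 0, … , K and z = 0, 1 and substituting into (2). An estimate of (α0, β0) can be computed by substituting $L_{j,n}(\tau)$ into (3),

$$(\alpha_n, \beta_n) := \operatorname*{argmin}_{(\alpha, \beta) \in \mathbb{R}^2} \sum_{j=0}^{K} \omega_j \{L_{j,n}(\tau) - \alpha - \beta j\}^2,$$

which can be computed explicitly. To wit, we define

$$L_n := (L_{0,n}(\tau), \ldots, L_{K,n}(\tau))^\top, \qquad X := \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ \vdots & \vdots \\ 1 & K \end{pmatrix}, \qquad \Omega := \mathrm{diag}(\omega_0, \ldots, \omega_K),$$

and note that $(\alpha_n, \beta_n)^\top = S L_n$, where

$$S := (X^\top \Omega X)^{-1} X^\top \Omega. \tag{7}$$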
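The closed form can be sketched in a few lines of Python. The code below assumes the weighted least squares expression S = (XᵀΩX)⁻¹XᵀΩ, with X the design matrix whose rows are (1, j) for j = 0, …, K, and accepts a general (K+1) × (K+1) weight matrix; the function name and the example values are hypothetical.

```python
def projection_matrix(omega):
    """Return S (2 rows, K+1 columns) for a (K+1) x (K+1) weight matrix omega."""
    m = len(omega)
    X = [[1.0, float(j)] for j in range(m)]
    # A = X^T Omega, a 2 x m matrix.
    A = [[sum(X[i][r] * omega[i][c] for i in range(m)) for c in range(m)]
         for r in range(2)]
    # B = X^T Omega X, a 2 x 2 matrix inverted in closed form.
    B = [[sum(A[r][i] * X[i][c] for i in range(m)) for c in range(2)]
         for r in range(2)]
    det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
    B_inv = [[B[1][1] / det, -B[0][1] / det],
             [-B[1][0] / det, B[0][0] / det]]
    return [[sum(B_inv[r][k] * A[k][c] for k in range(2)) for c in range(m)]
            for r in range(2)]


# Equal weights and log-ratios that fall by 0.5 per unit of distance.
identity = [[1.0 if a == b else 0.0 for b in range(3)] for a in range(3)]
S = projection_matrix(identity)
L_n = [1.0, 0.5, 0.0]
alpha_n = sum(s * l for s, l in zip(S[0], L_n))
beta_n = sum(s * l for s, l in zip(S[1], L_n))
```

Because the example log-ratios are exactly linear in distance, the projection recovers the intercept 1 and slope −0.5; with a constant vector of log-ratios the slope would be zero.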
3.3. Inference
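In this section, we establish an asymptotic distribution for βn, which serves as the basis for construction of Wald-style confidence intervals and hypothesis tests. We begin by establishing asymptotic linearity of $F_{j,n}^{(z)}(\tau)$ for a given j, z. We show that asymptotic linearity immediately implies a joint distribution for $F_n(\tau) := (F_{j,n}^{(z)}(\tau) : j = 0, \ldots, K; z = 0, 1)^\top$, an estimator of $F_0(\tau) := (F_{j,0}^{(z)}(\tau) : j = 0, \ldots, K; z = 0, 1)^\top$. We subsequently use this joint distribution and the delta method to derive an asymptotic distribution for βn.
By definition, an estimator $F_{j,n}^{(z)}(\tau)$ of $F_{j,0}^{(z)}(\tau)$ is asymptotically linear if $F_{j,n}^{(z)}(\tau) - F_{j,0}^{(z)}(\tau) = n^{-1} \sum_{i=1}^{n} D_{j,z,0}(O_i) + o_P(n^{-1/2})$, where $D_{j,z,0}$ is a mean-zero, finite-variance function of the observed data that is referred to as the influence function of $F_{j,n}^{(z)}(\tau)$ (Hampel, 1974). The central limit theorem implies that $n^{1/2}\{F_{j,n}^{(z)}(\tau) - F_{j,0}^{(z)}(\tau)\}$ converges in distribution to a normally distributed random variable with mean zero and variance $E_0\{D_{j,z,0}(O)^2\}$. In previous work, we have shown that $F_{j,n}^{(z)}(\tau)$ is asymptotically linear and have derived its influence function (Benkeser et al., 2018). We summarize those results here. The influence function depends on the conditional probability of vaccine given covariates, on the conditional hazard of censoring at each time t = 1, … , τ−1, and on the conditional hazard functions of the endpoint.
The influence function of $F_{j,n}^{(z)}(\tau)$, labeled (8) in what follows, is derived explicitly in Benkeser et al. (2018).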
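Obtaining the joint distribution of several asymptotically linear estimators is straightforward: by the asymptotic linearity of each component estimator of $F_n(\tau)$,

$$n^{1/2}\{F_n(\tau) - F_0(\tau)\} = n^{-1/2} \sum_{i=1}^{n} \mathbf{D}_0(O_i) + o_P(1). \tag{9}$$

By the multivariate central limit theorem, (9) converges in distribution to a mean-zero multivariate normal distribution with covariance matrix $\Sigma_0 := E_0\{\mathbf{D}_0(O) \mathbf{D}_0(O)^\top\}$, where $\mathbf{D}_0 := (D_{j,z,0} : j = 0, \ldots, K; z = 0, 1)^\top$ stacks the influence functions of the components of $F_n(\tau)$.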
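Based on this joint distribution, we can use the delta method to derive a distribution for βn. We use S[i,j] to denote the (i, j) entry in S, as defined in (7), and note that β0 = h(F0), where

$$h(F) := \sum_{j=0}^{K} S[2, j+1] \log\left\{ \frac{F_j^{(0)}(\tau)}{F_j^{(1)}(\tau)} \right\}.$$

The delta method implies that $n^{1/2}(\beta_n - \beta_0)$ converges in distribution to a mean-zero normally distributed variate with variance

$$\sigma_0^2 := \nabla h(F_0)^\top \Sigma_0 \nabla h(F_0), \tag{10}$$

where $\Sigma_0$ is the asymptotic covariance matrix of $n^{1/2}\{F_n(\tau) - F_0(\tau)\}$ and ∇h is the gradient of h,

$$\frac{\partial h(F)}{\partial F_j^{(0)}(\tau)} = \frac{S[2, j+1]}{F_j^{(0)}(\tau)}, \qquad \frac{\partial h(F)}{\partial F_j^{(1)}(\tau)} = -\frac{S[2, j+1]}{F_j^{(1)}(\tau)}, \qquad j = 0, \ldots, K.$$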
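A consistent estimate of $\sigma_0^2$ can be computed by substituting the gradient evaluated at $F_n$ and the estimated influence function into (10). Specifically, we define $D_{j,z,n}$ to be the estimated influence function that substitutes the empirical hazard estimates and empirical estimates of $\zeta_0$ and of the other unknown quantities into (8). We additionally define $\mathbf{D}_n := (D_{j,z,n} : j = 0, \ldots, K; z = 0, 1)^\top$ and $\Sigma_n := n^{-1} \sum_{i=1}^{n} \mathbf{D}_n(O_i) \mathbf{D}_n(O_i)^\top$. An estimate of $\sigma_0^2$ is $\sigma_n^2 := \nabla h(F_n)^\top \Sigma_n \nabla h(F_n)$. This variance estimate may be used to construct asymptotic 100(1−α)% confidence intervals of the form $\beta_n \pm z_{1-\alpha/2} n^{-1/2} \sigma_n$, where $z_{1-\alpha/2}$ is the 1−α/2 quantile of the standard normal distribution. Similarly, two-sided level α Wald-style hypothesis tests of the null hypothesis that β0 = 0 may be performed by rejecting the null hypothesis whenever $|n^{1/2} \beta_n / \sigma_n| > z_{1-\alpha/2}$.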
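The delta-method variance and the resulting Wald interval and test can be sketched as follows. This Python example assumes the components of F are ordered with the K + 1 control incidences first and the K + 1 vaccine incidences second, and that the covariance matrix is supplied in the same order; the function names, the ordering convention, and all numerical inputs are hypothetical.

```python
import math
from statistics import NormalDist


def slope_variance(s_row, f_control, f_vaccine, sigma):
    """Delta-method variance: gradient' * Sigma * gradient for the slope.

    s_row     : second row of S, one entry per distance j = 0..K
    f_control : cumulative incidences under control, j = 0..K
    f_vaccine : cumulative incidences under vaccine, j = 0..K
    sigma     : 2(K+1) x 2(K+1) covariance matrix, control block first
    """
    grad = [s / f for s, f in zip(s_row, f_control)]
    grad += [-s / f for s, f in zip(s_row, f_vaccine)]
    m = len(grad)
    return sum(grad[a] * sigma[a][b] * grad[b]
               for a in range(m) for b in range(m))


def wald_inference(beta_n, sigma_n, n, level=0.95):
    """Return (lower, upper, p_value) for the two-sided Wald test of beta = 0."""
    se = sigma_n / math.sqrt(n)
    z_crit = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(beta_n) / se))
    return beta_n - z_crit * se, beta_n + z_crit * se, p_value


identity = [[1.0 if a == b else 0.0 for b in range(4)] for a in range(4)]
variance = slope_variance([-0.5, 0.5], [0.2, 0.2], [0.1, 0.1], identity)
lower, upper, p_value = wald_inference(beta_n=0.1, sigma_n=1.0, n=400)
```

With a standard error of 0.05 the test statistic in the example is 2, so the null hypothesis of no trend would be rejected at the 5% level.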
Remark:
While not the focus of the present work, if W features many discrete components, then the empirical estimator will likely have high variance and perform poorly in finite samples. If W contains continuous-valued variates, then the stratified estimator cannot be constructed without some discretization of covariates, which risks incurring bias. In these cases, rather than empirical hazard estimators, we may prefer instead regression-based hazard functions that borrow information over time and across covariate levels. The bias-variance tradeoff for the hazard regression may be optimized through the use of cross-validated estimator selection or by regression stacking, also known as super learning (Wolpert, 1992; Breiman, 1996; van der Laan et al., 2007). This approach allows for pre-specification of a large number of candidate regressions, which can include parametric regression as well as adaptive regression approaches. Cross-validation is used to estimate the convex combination of candidate regression estimators that optimizes a user-selected risk criterion. Under assumptions, the selected combination of regression functions estimates the true hazard function essentially as well as the unknown best combination of regression function estimates (van der Laan et al., 2004). Stitelman et al. (2011) discuss super learning in the context of estimating hazard functions. We note, however, that if these techniques are used instead of empirical estimators in (6), Fn is no longer asymptotically linear and performing valid inference for estimates of β0 is challenging. Techniques from the semiparametric efficiency theory literature, such as targeted minimum loss-based estimation (van der Laan and Rubin, 2006), may be used to modify initial hazard estimates in such a way that the estimator (6) based on the modified hazard estimators is asymptotically linear. Benkeser et al. (2018) discuss targeted minimum loss-based estimation in this context.
These methods are implemented in the R package survtmle freely available from the Comprehensive R Archive Network (Benkeser and Hejazi, 2017).
3.4. Choice of weights
We now turn to selection of the weight matrix Ω. At many amino acid positions, a specific residue is required for biological viability of the pathogen, and, at positions that tolerate multiple residues, viable residues typically have different associations with pathogen fitness. Because of these physiological constraints, endpoints with certain genetic distances may be uncommon. Therefore, we may wish to give more weight to genetic distances that are commonly observed in the population of interest, for example, by weighting estimates of Lj,0 in proportion to the inverse asymptotic covariance matrix of an efficient estimator of L0 := (L0,0, … , LK,0)⊤. If this matrix were known, the projection parameter defined by this choice of weights would have the lowest efficiency bound of any choice of weight matrix (Aitken, 1936).
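In practice, the covariance matrix of L0 is unknown, but can be estimated using the influence function methodology of the previous section. We define $g(F) := (\log\{F_j^{(0)}(\tau) / F_j^{(1)}(\tau)\} : j = 0, \ldots, K)^\top$, so that $L_0 = g(F_0)$, and the Jacobian of g as

$$\nabla g(F) := \left[ \mathrm{diag}\{1 / F_j^{(0)}(\tau) : j = 0, \ldots, K\}, \; -\mathrm{diag}\{1 / F_j^{(1)}(\tau) : j = 0, \ldots, K\} \right],$$

a (K + 1) × 2(K + 1) matrix in which the components of F are ordered with the control incidences first and the vaccine incidences second.
An estimate of the covariance matrix of L0 is $\Upsilon_n := \nabla g(F_n) \Sigma_n \nabla g(F_n)^\top$, where $\Sigma_n$ is the estimated covariance matrix of the influence functions of $F_n(\tau)$. The projection parameter of interest based on this choice of weights is $(\alpha_{0n}, \beta_{0n}) := \operatorname*{argmin}_{(\alpha, \beta)} \{L_0 - X(\alpha, \beta)^\top\}^\top \Upsilon_n^{-1} \{L_0 - X(\alpha, \beta)^\top\}$, where X is the (K + 1) × 2 matrix whose (j + 1)-th row is (1, j). We index the parameter by n to denote that it depends on the sample. Such parameters are sometimes referred to as data-adaptive, in that they are unknown until one has seen the data (Hubbard et al., 2016). Similarly as above, an estimate of this parameter is

$$(\alpha_n, \beta_n) := \operatorname*{argmin}_{(\alpha, \beta)} \{L_n - X(\alpha, \beta)^\top\}^\top \Upsilon_n^{-1} \{L_n - X(\alpha, \beta)^\top\}. \tag{11}$$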
Theorem 3 of Hubbard et al. (2016) establishes conditions for asymptotic linearity of estimators of data-adaptive target parameters. Under the conditions of this theorem, we may apply our previous results without modification for the estimated weight matrix.
An alternative approach is to define the target parameter as a weighted L2 projection based on ϒ0, the true asymptotic covariance matrix of an efficient estimator of L0. The inference we have derived would likely be anti-conservative for estimation of this parameter as we ignore uncertainty induced by estimation of ϒ0. However, a nonparametric bootstrap wherein in each resample one estimates both ϒ0 and the projection parameter may well lead to confidence intervals with proper coverage. We leave this study to future work.
4. Simulation
We studied the performance of our estimators of β0n via simulation. A single covariate W was drawn from a Binomial(4, 0.5) distribution, which mimics the geographic site variable used in the RTS,S analysis. Vaccine assignment Z was drawn from a Bernoulli(0.5) distribution, which mimics a randomized trial with equal vaccine allocation. Given W = w and Z = z an endpoint time was generated from a geometric distribution with failure probability expit[−2 + 0.4{I(w = 0) + I(w = 1) + I(w = 2)}−0.2I(w = 3)−z]. Similarly, a censoring time was generated from a geometric distribution with failure probability expit{−3+0.2I(w = 2)−0.2I(w = 3)}. The observed failure time was taken to be the minimum of the endpoint and censoring times, with ties recorded as endpoints. Given Z = z, the genetic distance associated with each endpoint was drawn from a Binomial(4, expit(0.2z)) distribution. This choice resulted in vaccine efficacy that decreased with genetic distance, the direction most commonly seen in vaccine sieve analysis. We set the final observation at τ = 6, and any observations with an endpoint beyond this time were right-censored at τ. We analyzed 1,000 simulated data sets at each of the sample sizes n = 1,000, 2,500, and 5,000.
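The data-generating mechanism described above can be sketched in Python as follows. The sketch assumes that the geometric distributions have support 1, 2, … (the time of the first "failure"), which the text does not state explicitly; the function names, the record layout `(w, z, t_obs, delta_j)`, and the seed are hypothetical.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def draw_geometric(p, rng):
    """Time of the first success in Bernoulli(p) trials (support 1, 2, ...)."""
    t = 1
    while rng.random() >= p:
        t += 1
    return t


def draw_binomial(size, p, rng):
    return sum(rng.random() < p for _ in range(size))


def simulate_participant(rng, tau=6):
    """Return (w, z, t_obs, delta_j); delta_j = 0 if censored, else distance + 1."""
    w = draw_binomial(4, 0.5, rng)
    z = int(rng.random() < 0.5)
    p_event = expit(-2 + 0.4 * (w in (0, 1, 2)) - 0.2 * (w == 3) - z)
    p_cens = expit(-3 + 0.2 * (w == 2) - 0.2 * (w == 3))
    t_event = draw_geometric(p_event, rng)
    t_cens = draw_geometric(p_cens, rng)
    # Ties between endpoint and censoring times are recorded as endpoints;
    # endpoints beyond tau are right-censored at tau.
    if t_event <= t_cens and t_event <= tau:
        distance = draw_binomial(4, expit(0.2 * z), rng)
        return w, z, t_event, distance + 1
    return w, z, min(t_cens, tau), 0


rng = random.Random(2024)
data = [simulate_participant(rng) for _ in range(5000)]
```

Each simulated record could then be passed to an estimator of cumulative incidence by genetic distance; repeating the draw many times per sample size reproduces the structure, though not necessarily the exact numbers, of the simulation study.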
Figure 1 shows the true efficacy across genetic distance for this data generating mechanism. The log-ratio of cumulative incidences is greater than zero for each distance indicating efficacy against each distance value, though the efficacy decreases with distance. To illustrate the impact of the estimated weight matrix on the parameter of interest, the figure shows the projection lines with the largest and smallest true value of the slope parameter β0n across all simulations. The two lines are quite similar indicating that the estimated covariance matrix ϒn was relatively stable and had only a minor effect on the true value of the parameter of interest across simulations.
Figure 1.

True values of the log ratio of cumulative incidences by genetic distance (circles). The two lines are the lines associated with the most extreme true slope parameters across all simulated data sets.
For each of the sample sizes considered, the estimators of β0n were approximately unbiased (Table 1). The bias and variance of the estimators decreased appropriately with sample size and the confidence intervals achieved approximately nominal coverage at all sample sizes. Overall, the estimators had excellent finite-sample performance.
Table 1.
Results from the simulation study showing bias, variance, and mean squared-error (MSE) of βn as an estimator of β0n. The coverage of nominal 95% Wald-style confidence intervals for β0n is also shown.
| | Bias (×10²) | Variance (×10²) | MSE (×10²) | Coverage (%) |
|---|---|---|---|---|
| n = 1,000 | 0.59 | 1.20 | 1.20 | 94.2 |
| n = 2,500 | 0.38 | 0.45 | 0.46 | 94.9 |
| n = 5,000 | 0.28 | 0.21 | 0.21 | 96.1 |
5. Data analysis
The design of the Phase III RTS,S/AS01 trial has been previously described (RTS,S Clinical Trials Partnership, 2011; Agnandji et al., 2012), as have the methods for Plasmodium parasite sequencing (Neafsey et al., 2015). We illustrate the proposed method by studying the efficacy of RTS,S as a function of Hamming distance, i.e., the number of mismatched amino acid residues, between an aligned founder sequence and the vaccine insert sequence in the Th3R epitope region contained within the sequenced C-terminus amplicon of the CS protein. Th3R is a putative T-cell epitope region that is 12 amino acids long. Figure 2 shows the distribution of the Th3R distance from the RTS,S vaccine by treatment arm in children aged 5 to 17 months. We also analyzed Hamming distances defined based on the entire C-terminus amplicon of CS and on three other pre-specified epitope regions, but for brevity restrict reporting of results to the Th3R distances.
Figure 2.

Distribution of the Th3R Hamming distance of clinical malaria sequences to the vaccine insert sequence in children aged 5–17 months
A unique feature of the pathogen sequence collection was that multiple founder parasites could be sequenced from the dried blood spot sample of a single participant. The majority of the 2,090 clinical malaria endpoints through τ = 12 months post-vaccination with available C-terminus sequence data were found to have multiple parasite genotypes, with each founder parasite likely caused by a transmission event from a distinct mosquito. The presence of multiple founder infections complicates the assessment of sieve effects by genetic distance, as a single participant may have several founder parasites each with a unique genetic distance. We used multiple outputation (Follmann et al., 2003) to estimate the trend in the vaccine’s effect across the Th3R Hamming distance for a randomly sampled founder parasite of a clinical malaria case. This approach separates the genetic component of a sieve effect from any effect that the vaccine might have on the number of infecting parasites (Neafsey et al., 2015). Multiple outputation was performed by repeatedly sampling a single parasite genotype at random from each clinical malaria endpoint and applying a statistical method designed for a single genotype per endpoint. Based on the guidelines of Follmann et al. (2003), the estimated number of outputations required for stable inference was B = 4,127. For each resample, we estimated Lj,0(τ), j = 0, … , 4, adjusting for geographical study site as the sole covariate, and averaged these estimates over resamples. We then estimated and tested the proposed trend parameter on the outputation-averaged estimates of Lj,0(τ), weighting by the estimated inverse variance of the multiple-outputation estimates.
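The resampling step of multiple outputation can be sketched as below. This Python example only illustrates the averaging of point estimates over random draws of one founder per endpoint; it does not implement the variance calculation of Follmann et al. (2003), and the function names, the toy estimator, and the toy data are hypothetical.

```python
import random


def multiple_outputation(founder_distances, estimator, n_outputations, seed=0):
    """Average an estimator over repeated draws of one founder per endpoint.

    founder_distances : list with one list of genetic distances per endpoint
    estimator         : function mapping a list with one distance per endpoint
                        to a list of numeric estimates
    """
    if n_outputations < 1:
        raise ValueError("n_outputations must be at least 1")
    rng = random.Random(seed)
    totals = None
    for _ in range(n_outputations):
        sampled = [rng.choice(distances) for distances in founder_distances]
        estimate = estimator(sampled)
        if totals is None:
            totals = [0.0] * len(estimate)
        totals = [a + b for a, b in zip(totals, estimate)]
    return [total / n_outputations for total in totals]


def distance_proportions(sampled, max_distance=2):
    """Toy estimator: share of endpoints at each distance 0..max_distance."""
    return [sum(d == j for d in sampled) / len(sampled)
            for j in range(max_distance + 1)]


endpoints = [[0], [1, 2], [1], [0, 0, 2]]
averaged = multiple_outputation(endpoints, distance_proportions, n_outputations=200)
```

In the analysis described above, the role of the toy estimator would be played by the estimator of the log-ratios of cumulative incidence, applied to each outputated data set.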
Point estimates of VEj,0(τ) with 95% Wald-style confidence intervals (CIs) are plotted in Figure 3, with VE0,n(τ) = 46% (95% CI, 32 to 57%) against a perfectly vaccine-matched parasite in the Th3R region and a steady decline to VE4,n(τ) = 11% (95% CI, −25 to 36%) against parasites with four vaccine-mismatched Th3R residues. The two-sided Wald-style test of the null hypothesis that β0 = 0 yields a p-value of 0.0036, suggesting potential immunological relevance of Th3R epitopes for multivalent vaccine design.
Figure 3. Point estimates of VEj,0(τ) against clinical malaria with Th3R Hamming distance j = 0, … , 4 to the vaccine insert sequence, with 95% Wald-style confidence intervals, in children aged 5–17 months. The superimposed curve 1−exp{−αn−βn j} is a transformation to the VE scale of the linear projection of Lj,n(τ). The two-sided Wald-style test of the null hypothesis that β0 = 0 yields a p-value of 0.0036.
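To make the trend summary concrete, the sketch below fits an inverse-variance-weighted linear projection of distance-specific estimates onto distance, computes a two-sided Wald-style p-value for the slope, and maps the fitted line to the VE scale using the transformation 1−exp{−α−βj} stated in the Figure 3 caption. It rests on several assumptions that are not part of the analysis above: the input estimates and variances are made-up numbers, the estimates at different distances are treated as uncorrelated (the delta-method calculation in the paper would use their joint covariance), and the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple


def weighted_trend(
    estimates: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float, float, float]:
    """Inverse-variance-weighted line through (j, estimate_j), j = 0, 1, ...

    Returns (alpha, beta, se_beta, p_value). Assumes uncorrelated estimates.
    """
    distances = range(len(estimates))
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    j_bar = sum(w * j for w, j in zip(weights, distances)) / total
    l_bar = sum(w * l for w, l in zip(weights, estimates)) / total
    s_xx = sum(w * (j - j_bar) ** 2 for w, j in zip(weights, distances))
    # The slope is a linear combination of the estimates with these coefficients.
    coefs = [w * (j - j_bar) / s_xx for w, j in zip(weights, distances)]
    beta = sum(c * l for c, l in zip(coefs, estimates))
    alpha = l_bar - beta * j_bar
    se_beta = math.sqrt(sum(c * c * v for c, v in zip(coefs, variances)))
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(beta / se_beta) / math.sqrt(2.0))
    return alpha, beta, se_beta, p_value


# Hypothetical inputs (not trial results): values lying exactly on a line.
l_hat = [0.6 - 0.1 * j for j in range(5)]
var_hat = [0.01, 0.02, 0.02, 0.03, 0.05]

alpha, beta, se_beta, p_value = weighted_trend(l_hat, var_hat)
# Fitted curve on the VE scale, using the caption's transformation.
ve_curve = [1.0 - math.exp(-alpha - beta * j) for j in range(5)]
```

Under this convention, a negative slope corresponds to vaccine efficacy that declines as the distance from the vaccine insert grows.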
6. Discussion
The proposed trend parameters are appealing in the context of vaccine sieve analysis in that we often expect trends in efficacy by genetic distance to be monotone. Thus, the slope of the projection onto a linear working model provides a reasonable summary of the trend. In other applications, monotonicity may not be expected. For example, in cardiovascular epidemiology, researchers are interested in different types of heart failure, which are defined by the strength of the heart’s contraction using a measure called ejection fraction. Our methods could be applied to estimate the trend in a relationship between biomarkers and ejection fraction. However, we might not have a reason to expect monotonicity in this relationship. Nevertheless, our methods easily extend to more flexible working models (e.g., including polynomial terms) with simple modifications to the delta method calculus. Multiple-degree-of-freedom tests could be developed to test whether there is any variation in the effect of a treatment across levels of an ordinal competing risk.
The efficacy of vaccines often wanes over time, due to waning immune responses many months or years after receipt of vaccinations, and this occurred in the Phase III RTS,S efficacy trial. While our method makes no assumption of time constancy of the vaccine’s effect, it does require the selection of a single fixed time at which to examine trends in efficacy by pathogen genetic distance, and these trends may vary over time. Therefore, an interesting elaboration of our method would extend the working models to summarize trends in efficacy across distance and over time. Working models could include functions of both genetic distance and time, while hypothesis tests could test how the trend in efficacy varies over time. Another interesting extension would provide simultaneous confidence bands or regions for the pattern of vaccine efficacy as it varies over genetic distance and/or time.
A code repository, which can be downloaded as an R package, is available (https://github.com/benkeser/sievetrend). The repository contains the code used to execute the simulation study and data analysis.
Acknowledgements
The authors thank the RTS,S/AS01 Phase 3 trial study participants and study investigators, and in particular thank Daniel Neafsey at the Department of Immunology Broad Institute of Massachusetts Institute of Technology and Dyann Wirth at the Harvard T.H. Chan School of Public Health for collaboration and generation of the malaria parasite sequence data.
References
- Agnandji S, Lell B, Fernandes J, Abossolo B, Methogo B, Kabwende A, Adegnika A, Mordmüller B, Issifou S, et al. (2012). A phase 3 trial of RTS,S/AS01 malaria vaccine in African infants. New England Journal of Medicine, 367(24):2284–95.
- Aitken AC (1936). On least squares and linear combination of observations. Proceedings of the Royal Society of Edinburgh, 55:42–48.
- Benkeser D, Carone M, and Gilbert PB (2018). Improved estimation of the cumulative incidence of rare outcomes. Statistics in Medicine, 37(2):280–293.
- Benkeser D and Hejazi NS (2017). survtmle: Targeted Minimum Loss-Based Estimation for Survival Analysis in R.
- Breiman L (1996). Stacked regressions. Machine learning, 24(1):49–64.
- Follmann D, Proschan M, and Leifer E (2003). Multiple outputation: Inference for complex clustered data. Biometrics, 59:420–429.
- Gilbert PB, McKeague IW, and Sun Y (2008). The two-sample problem for failure rates depending on a continuous mark: An application to vaccine efficacy. Biostatistics, 9(2):263–276.
- Gilbert PB, Self SG, and Ashby MA (1998). Statistical methods for assessing differential vaccine protection against human immunodeficiency virus types. Biometrics, 54(3):799–814.
- Gilbert PB, Self SG, Rao M, Naficy A, and Clemens J (2001). Sieve analysis: methods for assessing from vaccine trial data how vaccine efficacy varies with genotypic and phenotypic pathogen variation. Journal of Clinical Epidemiology, 54(1):68–85.
- Hampel FR (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383–393.
- Hubbard AE, Kherad-Pajouh S, and van der Laan MJ (2016). Statistical inference for data adaptive target parameters. The International Journal of Biostatistics, 12(1):3–19.
- Hudgens MG and Halloran ME (2008). Toward causal inference with interference. Journal of the American Statistical Association, 103(482):832–842.
- Juraska M and Gilbert PB (2013). Mark-specific hazard ratio model with multivariate continuous marks: An application to vaccine efficacy. Biometrics, 69(2):328–337.
- Juraska M and Gilbert PB (2016). Mark-specific hazard ratio model with missing multivariate marks. Lifetime Data Analysis, 22(4):606–625.
- Neafsey DE, Juraska M, Bedford T, Benkeser D, Valim C, Griggs A, Lievens M, Abdulla S, Adjei S, Agbenyega T, et al. (2015). Genetic diversity and protective efficacy of the RTS,S/AS01 malaria vaccine. New England Journal of Medicine, 373(21):2025–2037.
- Pearl J (2009). Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, NY.
- Petersen ML, Porter KE, Gruber S, Wang Y, and van der Laan MJ (2010). Diagnosing and responding to violations in the positivity assumption. Statistical Methods in Medical Research, 21(1):31–54.
- Rolland M, Edlefsen PT, Larsen BB, Tovanabutra S, Sanders-Buell E, Hertz T, Carrico C, Menis S, Magaret CA, and Ahmed H (2012). Increased HIV-1 vaccine efficacy against viruses with genetic signatures in Env V2. Nature, 490(7420):417–420.
- RTS,S Clinical Trials Partnership (2011). First results of phase 3 trial of RTS,S/AS01 malaria vaccine in African children. New England Journal of Medicine, 365(20):1863–1875. PMID: 22007715.
- Stitelman OM, Wester CW, De Gruttola V, and van der Laan MJ (2011). Targeted maximum likelihood estimation of effect modification parameters in survival analysis. The International Journal of Biostatistics, 7(1):1–34.
- Sun Y, Gilbert PB, and McKeague IW (2009). Proportional hazards models with continuous marks. Annals of Statistics, 37(1):394.
- Sun Y, Li M, and Gilbert PB (2012). Mark-specific proportional hazards model with multivariate continuous marks and its application to HIV vaccine efficacy trials. Biostatistics, 14(1):60–74.
- van der Laan M and Rubin D (2006). Targeted maximum likelihood learning. The International Journal of Biostatistics, 2(1):1–40.
- van der Laan MJ, Dudoit S, and Keles S (2004). Asymptotic optimality of likelihood-based cross-validation. Statistical Applications in Genetics and Molecular Biology, 3(1):1–23.
- van der Laan MJ, Polley EC, and Hubbard AE (2007). Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1):1–23.
- WHO (2016). Weekly epidemiological record. World Health Organization, 91:33–52.
- Wolpert D (1992). Stacked generalization. Neural Networks, 5(2):241–259.
