Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2018 Sep 27.
Published in final edited form as: Biometrics. 2017 Dec 14;74(3):1023–1033. doi: 10.1111/biom.12833

Sieve Analysis Using the Number of Infecting Pathogens

Dean Follmann 1,*, Chiung-Yu Huang 2,**
PMCID: PMC6004265  NIHMSID: NIHMS968286  PMID: 29238965

Summary

Assessment of vaccine efficacy as a function of the similarity of the infecting pathogen to the vaccine is an important scientific goal. Characterization of pathogen strains for which vaccine efficacy is low can increase understanding of the vaccine's mechanism of action and offer targets for vaccine improvement. Traditional sieve analysis estimates differential vaccine efficacy using a single identifiable pathogen for each subject. The similarity between this single entity and the vaccine immunogen is quantified, for example, by exact match or number of mismatched amino acids. With new technology, we can now obtain the actual count of genetically distinct pathogens that infect an individual. Let F be the number of distinct features of a species of pathogen. We assume a log-linear model for the expected number of infecting pathogens with feature “f,” f = 1, …, F. The model can be used directly in studies with passive surveillance of infections where the count of each type of pathogen is recorded at the end of some interval, or active surveillance where the time of infection is known. For active surveillance, we additionally assume that a proportional intensity model applies to the time of potentially infectious exposures and derive product and weighted estimating equation (WEE) estimators for the regression parameters in the log-linear model. The WEE estimator explicitly allows for waning vaccine efficacy and time-varying distributions of pathogens. We give conditions where sieve parameters have a per-exposure interpretation under passive surveillance. We evaluate the methods by simulation and analyze a phase III trial of a malaria vaccine.

Keywords: Competing risks, Cox regression, Empirical process, Infectious disease, Marked process, Multiple Outputation, Within Cluster Resampling

1. Introduction

Randomized clinical trials of vaccine candidates can provide a rich source of clues about vaccine mechanism in addition to a rigorous evaluation of overall efficacy. One important analysis for clues focuses on whether the vaccine has differential efficacy across different types of infecting pathogens (e.g., strains of Hepatitis C). The idea is that pathogens that are genetically or functionally similar to the vaccine immunogen may be more effectively blocked by the vaccine compared to pathogens that are different. Such analyses are known as sieve analyses as they reveal if the vaccine differentially blocks certain strains from infecting.

A broad and sophisticated literature for sieve analysis has developed; see for example, Gilbert et al. (1998, 2001), Gilbert (2000, 2001), Juraska and Gilbert (2013), Sun et al. (2013), and Juraska and Gilbert (2016). A traditional feature of these methods is that infection was categorized by a single strain of pathogen and thus infection by multiple strains was not considered. Recently, technology has improved so that counts of multiple infecting pathogen strains can be characterized. Intuitively, this additional information should yield more efficient analyses than when only infection yes/no by a single type of pathogen is recorded for each subject.

Follmann and Huang (2015) developed methods for analyzing vaccine trials where, at the time that infection was detected, the count of infecting pathogens was also determined. They assumed an underlying proportional intensity model for the risk of a potentially infectious exposure along with a log-linear model for the number of infecting pathogens. Vaccine efficacy was defined as the reduction in the mean number of infecting pathogens. Different parametric and semi-parametric methods for estimation of the overall vaccine efficacy were developed and evaluated. However, they did not consider the case where the infecting pathogens could be classified into multiple strains and did not consider sieving methods.

This article develops methods that blend the sieve approaches of Gilbert et al. cited above with the founder pathogen approaches developed by Follmann and Huang (2015). Both active and passive surveillance vaccine trials are considered. For passive surveillance trials, a log-linear model for the mean number of infecting pathogens with a given feature f over a fixed interval of follow-up is specified where f = 1, …, F characterize the features of interest. For active surveillance, we allow for a proportional intensity model on the incidence of exposures that can result in terminal infection, that is, infections that terminate follow-up. Following exposure, a vector of pathogen counts X1, …, XF is drawn. Infection occurs if any Xf > 0 and for example, Xf = 2 means two distinct pathogens with feature f infected the person. With parsimonious specification of the mean function, these approaches can be applied even if F is large.

This article is organized as follows: We describe the technology of how infecting pathogens can be characterized by long amino-acid strings and specify a log-linear model for the pathogen counts following exposure. We discuss how data are obtained under passive and active surveillance and provide conditions under which vaccine and sieve effects from such data estimate per-exposure effects. We conduct a small simulation to illustrate GEE estimation and compare it to old technology and to the within cluster resampling (WCR) method (Hoffman et al., 2001; Follmann et al., 2003). We next consider active surveillance methods and generalize the methods of Follmann and Huang (2015) to the sieve setting. A simple product estimator is developed, and then weighted estimating equations (WEE) are derived for a general model that allows for time-varying vaccine efficacy and pathogen distributions that vary with time. The asymptotic distribution of the WEE estimator is established. We conclude with an analysis of a malaria vaccine trial and explore the difference between the proposed and WCR approaches to active surveillance.

2. Pathogen Quantification and Sampling

Sieve analysis involves quantifying the similarity between the vaccine immunogen and the infecting pathogens. Vaccine efficacy is compared between those infected with vaccine-similar strains versus those with vaccine-dissimilar strains. In this article, we allow “infection” to mean either disease (e.g., clinical malaria) or evidence of replication (e.g., virus positive as in HIV or Zika infection). Historically, similarity was defined by simple methods where an infecting pathogen might be placed into one of several “races,” serotypes, or strains, depending on the era and technology. A single infecting pathogen would be identified for each infected individual.

While many infections are caused by a single infecting pathogen (or multiple identical clones), other pathogens can cause serial infections each by single or multiple unique infecting pathogens. If the infecting pathogens are genetically distinct it is possible to count them. To fix ideas, consider malaria which is caused by the parasite plasmodium falciparum. The infectious cycle starts by a mosquito bite of a human where the sporozoite form of the parasite is transferred from the mosquito to human. The sporozoite has an outer membrane of circumsporozoite (CS) protein which can induce an immune response. The RTS,S malaria vaccine uses part of this outer membrane as an immunogen and part of the amino acid (AA) sequence of the RTS,S immunogen is detailed in the first row of Table 1 (see Doud et al., 2012).

Table 1.

Partial amino acid sequence of the RTS,S vaccine immunogen along with an illustration of 4 infecting or founder parasites (aka haplotypes in malaria) that could have been recovered from an infected volunteer in a vaccine trial. A dot indicates agreement with the amino acid of the vaccine immunogen. The consensus sequence is given in the bottom row.

graphic file with name nihms968286f3.jpg

For ease of discussion, suppose that all subjects are followed for the same fixed interval of time. To identify infecting parasites, blood is drawn at the end of the study and a number of parasites sampled. The part of the parasite's DNA that codes for the CS region of Table 1 is amplified so that the DNA for each parasite is determined. This DNA sequence identifies the actual protein that coats the outside of the parasite. The middle four rows of Table 1 illustrate four different parasites. For malaria, the sampled parasites are genetically identical progeny of the infecting pathogens. Other diseases are different. For example, HIV undergoes replication and mutation within the infected host which adds further complexity to the identification of infecting pathogens (see e.g., Keele et al., 2008). Nonetheless, with detailed knowledge of the pathogen life cycle, assay technology, and amplification method, one can identify the genotypes and count the clonally distinct infecting pathogens.

The potential number of ways that the infecting pathogens can differ is large and simple metrics are often used to characterize the difference or “distance” between the vaccine and an infecting parasite. The final 3 columns of Table 1 give three examples of dissimilarity; (A) whether a vaccine/pathogen match occurs at amino acid location # 320, (B) whether a match occurs for all amino acids from location 293 through 302 (the DV10 region of the CS protein), or (C) the total number of mismatches from position 290 to 331.

To develop some notation, suppose that Xf is the number of infecting pathogens with feature f for a given subject following exposure and define the vector of counts as X = (X1, …, XF). For example, if we define feature by a match at location 320, with f = 1 denoting a match and f = 2 a mismatch, then F = 2 and from Table 1, we have (X1, X2) = (1, 3). If we define feature as the total number of mismatches (i.e., Hamming distance) over the 42 locations 290-331 then F = 43 and Xf is the number of pathogens with f–1 mismatches; for Table 1, X = (0, 0, 1, 2, 0, 1, 0, 0, …, 0). For malaria, multiple founders often follow from multiple infections so X1 + X2 > 1 from a single bite would be somewhat unusual.

A convenient regression model for count data is to assume that the log of the mean count is a linear function of covariates. Usually one has a single outcome X along with covariates W and one assumes that E(X|W) = exp(Wα). With sieve models we have a vector of outcomes X, a vaccine indicator Z and each element f has its associated covariate vector Vf. For the Hamming distance metric described above, we have Vf = (f – 1) and the mean model is

E(Xf|Wf)=exp(Wfα)=exp{α1+α2(f1)+α3Z+α4Z(f1)}. (1)

The other metrics described above can be represented in a similar fashion.

Vaccine efficacy is typically defined as a percent reduction in an outcome on a vaccine relative to a placebo (see Halloran et al., 2010). A traditional metric is to use infection by a single pathogen as an outcome and thus vaccine efficacy against infection by a pathogen with feature f is

VEIf=1P(Xf>0|Z=1)P(Xf>0|Z=0).

With the number of infecting pathogens with feature f, we can define vaccine efficacy in terms of the reduction in the mean number of infecting pathogens.

VEMf=1E(Xf|Z=1)E(Xf|Z=0)=1Δf.

Follmann and Huang (2015) introduced the metric VEM. While VEIf is typically more clinically relevant than VEMf, the latter may be more efficient at uncovering mechanistic clues.

While in vaccine challenge studies, X can be recorded after each controlled exposure in field trials, humans are exposed and infected during the course of follow-up, but when infection is detected depends on the study sampling framework. We postulate a process for exposure and infection where, at each exposure, a vector of counts of clonally unique infecting pathogens, X, is drawn from a distribution F(). For active follow-up, surveillance continues until an infection or censoring occurs. Whenever an infection occurs, we sample the infecting pathogens, determine XA, and stop follow-up. We sometimes call this a “terminal” infection as follow-up stops. What constitutes a terminal infection varies. For HIV any infection is terminal, while for malaria multiple subclinical infections are possible before the first terminal infection (e.g., parasitemia plus symptoms). For passive surveillance, we follow until the end of the study and genotype any infecting pathogens which reveals XP.

Figure 1 illustrates the intricacies of exposure, infection, and terminal infection under passive and active surveillance. The top trajectory is for an individual who has 4 exposures but remains uninfected until the end of follow-up. Under passive surveillance, we measure XP = 0 while for active surveillance, we write XA as 0. The second trajectory is where all infections are terminal such as HIV. Follow-up stops for active surveillance when infection is detected but continues for passive surveillance. Here, XP = XA though the sample is obtained at different times. Finally, the bottom trajectories are where subclinical infections can occur and possibly dissipate over time such as subclinical malaria. In this case, XP and XA may be weighted random sums of per exposure X's with bigger weights closer to the time of sampling. The difference is that the last infection in the sum comprising XA must be terminal. Passive surveillance has no such requirement.

Figure 1.

Figure 1

Trajectories of the exposure, infection (possibly subclinical), and terminal infection processes for four individuals Exposure=e at which time X is drawn from FZ(). XA and XP are the vector of counts obtained under active (solid line) and passive (dashed line) surveillance. Xs which result in subclinical infections do not terminate follow-up. Trajectory 2 corresponds to a disease where all infections are terminal (e.g., HIV) while trajectories 3 and 4 correspond to a disease with subclinical infections (e.g., malaria). Note that, we allow that old sub-clinical infections may be cleared (x1 from trajectory 3 under active surveillance).

In general, the distributions of XA, XP, and X are different and thus VEMf and sieve effects based on XA, XP, and X may be different. However for trajectories of type 1 and 2, XA = XP. We later provide conditions where the per-exposure VEMf can be recovered from XA. This is not possible for passive surveillance but we can provide conditions under which the mean per-exposure sieve effect θf,g = Δfg is recoverable from XP. If these conditions are not met, sieve effects can still be tested, but the estimates may not have the per-exposure interpretation. For simplicity, we have no covariates other than, Z, the vaccine indicator. We allow different patients to have different lengths of follow-up L.

Assumption 1. The per exposure counts of infecting pathogens X are independent and identically distributed draws from FZ() both within and across subjects within randomization group Z. Thus, the distribution of circulating strains of pathogens at each exposure is constant over time. Note this forbids an all-or-none vaccine effect where placebos and some vaccinees repeatedly draw X from an F() while other vaccinees repeatedly draw from point mass at 0.

Assumption 2. Our setting is for the second trajectory of Figure 1, where any infection is terminal and stops active surveillance. Multiple infections are not allowed. The count vector X that obtain from a terminal infection is recorded without error.

Under these assumptions, for both active and passive surveillance the distribution of XP = XA follows FZP(X)=FZ(X|X+>0)pZ+δ0(X)(1pZ), where Z = 0, 1 is the vaccine indicator, pZ is the probability of a (terminal) infection during follow-up in group Z, X+=f=1FXf, and δ0() is point mass at 0. Note that pZ implicitly depends on the distribution of follow-up lengths, L, in group Z.

With passive surveillance, we can recover the per-exposure ratio of means:

θf,gP=E(XfP|Z=1)/E(XfP|Z=0)E(XgP|Z=1)/E(XgP|Z=0)=p1E(Xf|X+>0,Z=1)/{p0E(Xf|X+>0,Z=0)}p1E(Xg|X+>0,Z=1)/{p0E(Xg|X+>0,Z=0)}=E(Xf|X+>0,Z=1)/{E(Xf|X+>0,Z=0)}E(Xg|X+>0,Z=1)/{E(Xg|X+>0,Z=0)}=π1E(Xf|X+>0,Z=1)/{π0E(Xf|X+>0,Z=0)}π1E(Xg|X+>0,Z=1)/{π0E(Xg|X+>0,Z=0)}=E(Xf|Z=1)/{E(Xf|Z=0)}E(Xg|Z=1)/{E(Xg|Z=0)}=θf,g,

where πz is the per-exposure probability of infection P(X+ > 0|Z). This result is analogous to that of Gilbert et al. (1998) who used the infection indicator I(Xf > 0) as outcome and only allowed for a single infecting strain.

3. Passive Surveillance

In trials with passive surveillance, volunteers are followed for a length of time. At the end of follow-up, infection status is determined and for infected volunteers, a sample of infecting pathogens genotyped. Let Li be the length of followup for person i. Let Zi be the vaccine indicator, XiP=X1iP,,XFiP the vector of counts of infecting pathogens, and Wfi a covariate vector, which depends implicitly on Z and which we sometimes write as Wfi(Zi). Similar to (1), we specify

E(XfP|Wf)=exp(Wfα). (2)

We allow that Wf can include terms involving L to reflect different lengths of follow-up.

As in Gilbert et al. (1998), who parameterized sieve effects for the infection indicator I(XfiP>0), we specify Wfi to allow for an arbitrary mean for feature f in the placebo group, with possibly structured vaccine effects. For example, one may postulate

E(XfP|Wf)=exp(β0f+γ1L+ZVfψ). (3)

Unstructured vaccine effects occur with Vf = If identifying the fth feature, for example, 0, …, 1, …, 0. Ordinal effects occur with Vf as above but with ψ1ψ2 · · · ≤ ψf. Another possibility is to specify a linear effect for the Hamming distance, for example, Vf = 1, f – 1 where f – 1 is the number of mismatches. These models can be fit using generalized estimating equations (GEE) with each individual a cluster (Zeger and Liang, 1986).

A small simulation was performed to evaluate the GEE estimation. We categorized pathogens as either matched (f = 1) or mismatched (f = 2) to the vaccine and generated counts ( X1P, X2P) using a bivariate negative binomial model with mean given by

E(XfiP|Wfi,bi)=exp{bi+α1+α2I(f=1)+α3Zi+α4ZiI(f=1)}=exp{bi+Wfiα}, (4)

where exp(bi) follows a gamma distribution with μ = E{exp(bi)} =.5 or 1 and σexp(b)2=ver{exp(bi)}=0, 1, or 2. We specify α = (0.96, 0, −0.32, −1.28) so that VEM1 = .80 and VEM2 = .27. In this model, α4 quantifies the sieve effect.

For estimation, we use GEE with a working independence correlation matrix. To mimic the old technology where only a single infecting pathogen is characterized, we selected a pathogen at random from each infected subject and recorded whether it was a match or not. We then fit a logistic regression model with probability exp(α2 + Ziα4)/{1 + exp(α2 + Ziα4)} to the match indicator. As an alternate estimation procedure, one could repeat the above random selection many times and average the associated estimates of α4. This averaging corresponds to a within cluster resampling (WCR) or multiple outputation approach to dealing with the multiple outcomes from each individual (Hoffman et al., 2001; Follmann et al., 2003). The WCR approach was introduced for use in sieve analysis in Neafsey et al. (2015). Here, we average over all possibilities and call this exhaustive WCR or EWCR. This approach may offer some robustness as an aberrant subject with large X+ and a peculiar suite of infecting pathogens will have substantial weight under the GEE approach with working independence but not under WCR.

Table 2 summarizes the simulation results. The two left columns give the true mean and variance of exp(bi) while the column summarizes the average count averaged over f = 1, 2, and Z. We define the squared Wald statistic as 𝒵2 = θ̂̄2 /S2 (θ̂) or the squared sample average of an estimate divided by its sample variance. The far right columns give the ratio of 𝒵2 for different estimators. This can be viewed as the relative efficiency of different estimators. For unbiased estimates, the ratio reduces to the ratio of variances, while for estimators that estimate different parameters, it approximates the ratio of sample sizes required to achieve a given power. So if ZA2/ZB2 is 2, then method B requires twice the sample size as method A to achieve the same power.

Table 2.

Simulated performance of different methods of estimation of a sieve effect using the match/mismatch metric for the mean number of infecting pathogens in a trial with passive surveillance. Pairs of rows report the average and variance of the estimated α4 over 1000 simulated datasets. Data generated under a bivariate negative binomial model. is the mean X averaged over groups and f and then averaged over the simulations. Relative Efficiency is the ratio of squared Wald statistics or ZA2/ZB2, where 𝒵 is the sample average divided by the sample standard deviation.

Relative efficiency

E{exp(bi)} var{exp(bi)} GEE Single pathogen EWCR GEE/Single GEE/ EWCR
Bivariate negative binomial

0.5 0.0 0.9 −1.302 −1.304 −1.304 2.68 1.18
0.073 0.151 0.087
0.5 1.0 0.9 −1.314 −1.336 −1.325 2.51 1.51
0.067 0.173 0.103
0.5 2.0 0.9 −1.289 −1.322 −1.292 3.49 1.82
0.060 0.222 0.110
1.0 0.0 1.9 −1.283 −1.299 −1.283 3.33 1.31
0.032 0.109 0.042
1.0 1.0 1.9 −1.295 −1.322 −1.299 4.89 1.96
0.030 0.152 0.059
1.0 2.0 1.9 −1.279 −1.283 −1.278 5.00 2.19
0.033 0.168 0.073

All-or-none

0.5 0.0 6.0 −0.997 −1.241 −1.237 1.21 0.78
0.076 0.143 0.091
0.5 1.0 4.8 −0.863 −1.148 −1.137 1.02 0.72
0.092 0.166 0.114
0.5 2.0 4.0 −0.799 −1.108 −1.075 2.75 0.68
0.102 0.540 0.126
1.0 0.0 8.1 −0.742 −1.082 −1.065 1.34 0.62
0.035 0.101 0.045
1.0 1.0 6.4 −0.660 −0.971 −0.965 1.24 0.65
0.045 0.121 0.063
1.0 2.0 5.3 −0.610 −0.907 −0.899 1.20 0.63
0.054 0.142 0.073

For the top half of Table 2, data are generated under (4) and all estimators are unbiased. The new technology where we characterize multiple infecting pathogens is more efficient than the old technology where only a single pathogen is identified; the variance ratio ranges from 2.68 to 3.49 when μ = 0.5 and from 3.33 to 5.00 when μ = 1. The final columns shows that GEE is more efficient than WCR with gains ranging from 18% to 119%.

The top half of Table 2 was generated using a conditional Poisson model for the mean count which is consistent with the models used for analysis. Another possibility is a kind of all-or-none vaccine effect where the vaccine reduces the probability of infection, but has no effect on the count of infecting pathogens once infection has occurred. We formalize this idea by perturbing (4) by replacing any Xf|Xf > 0 with a discrete uniform(1,21), irrespective of group. Results are given in the bottom half of Table 2. Simulations show that for this setting, the EWCR and GEE methods estimate different parameters. The GEE method is more powerful than the old technology, but the GEE approach is less powerful than EWCR with relative efficiencies between 0.62 and 0.78.

4. Field Trial Time to Event Analysis

In a field trial with active surveillance, the time to event (e.g., infection for HIV or first clinical disease for malaria) is recorded along with the counts of each type of infecting pathogen. In this section, we parallel the approach of Follmann and Huang (2015) and develop estimators of the V:P ratio of means. Our methods are best conceptualized with trajectory 2 of Figure 1 where any infection is terminal, that is, stops follow-up and we recover the per-exposure mean ratio. Our methods also apply to settings 3 and 4 though here XA for the terminal infection may include some previous subclinical infections and the mean ratio may not have a crisp per-exposure interpretation.

4.1. Product Estimators of VEMf

Let ω(t) be the instantaneous risk of exposure by any pathogen. This risk should be free of Z in a blinded trial. Following exposure, X is generated from FZ(). If f=1FXf=X+>0 the person is infected and we see XA while if X+ = 0 they remain at risk of future infection. If no infection occurs during follow-up, we write XA = 0. Under the assumption that the time of exposure and X are conditionally independent given Z, the instantaneous risk of (terminal) infection can be written as

h(t|Z)=ω(t)P(X+>0|Z)=h0(t)exp[log{P(X+>0|Z=1)P(X+>0|Z=0)}Z]=h0(t)exp(Zβ). (5)

Thus, exp(β) is the V:P ratio of per exposure infection probabilities. Estimation of β can be achieved using standard software for Cox regression and the asymptotic distribution of β̂ is well known. Additionally, covariates that impact the risk of exposure can be easily incorporated in equation (5) if desired.

Given infection, we can drill down to get at feature specific vaccine efficacy and from that sieve effects. Similar to before, we assume that the expectation of XfA|X+A>0 is given by

E(XfA|Wf,X+A>0)=exp(Wfα). (6)

Note that if all infections are terminal then XA=dX. The vector Wf can be parameterized as described previously and α can be estimated using the GEE method with data from infected individuals. Using arguments in Follmann and Huang (2015), we can blend the Cox estimator of β with an estimate of α to obtain a consistent estimate of VEMf. Specifically, as the sample size goes to infinity

exp(β^)exp{Wf(1)α^}exp{Wf(0)α^}P(X+>0|Z=1)P(X+>0|Z=1)E(Xf|Z=1,X+>0)E(Xf|Z=0,X+>0)=E(Xf|Z=1)E(Xf|Z=0)=1VEMf,

in probability. Here, Wf(Z) are the covariates for a person in group Z. Remarkably, even though, we only see XfA|X+A>0, we are able to obtain a consistent estimate of the ratio of unselected means. Asymptotics for the product estimator easily follow from the delta method as both β̂ and α̂ are asymptotically normal and independent given standard conditions including that the Xis are independent draws from Xi|W1i, …, Wfi.

A sieving effect occurs if the VEMf are not constant over f which happens if

E{XfA|Wf(1)}E{XfA|Wf(0)}=exp[{Wf(1)Wf(0)}α]

is nonconstant over f. Exactly what this means in terms of α depends on Wf. For example, with the Hamming distance metric parameterized as in equation (1), a sieve effect occurs if α4 ≠ 0. For the mis-match metric sieving occurs if α4 ≠ 0 with (f – 1) replaced by I(f = 1) in (1).

For certain diseases, such as malaria, subclinical infections can occur as suggested by trajectories 3 and 4 of Figure 1 and thus an XA associated with a terminal infection (which has X+A>0) may include infecting pathogens from prior subclinical infection(s). Such an XA can be represented as the sum of the count vector from the terminal infection plus a random number of other count vectors from recent subclinical infections. We might write XA=c=1CXc. The X1, …, XC could have a complex distribution with, for example, C–1 iid subclinical infections drawn from FWsc followed by a terminal infection drawn from FZt. In this case, the product method recovers a ratio of means, but it does not have a crisp per-exposure interpretation.

Interestingly, we can repeat the above development by the indicator of infection yes/no I(Xf > 0) instead of Xf. This may be a more sensitive measure of vaccine efficacy if the vaccine has an impact on I(Xf > 0) but no impact on the count Xf|Xf > 0. We assume the same model for the risk of infection by any pathogen (5). But now our outcome is the indicator of an infection of type “f” given infection has occurred. For this, we assume that

E{I(XfA>0)|WfX+A>0}=exp{Wfα} (7)

Arguing as before it follows that as the sample size goes to infinity

exp(β^)exp(Wf(1)α^)exp(Wf(0)α^)1VEIf,

in probability with asymptotic normality following from the delta method.

4.2. Weighted Estimating Equations

While the product estimators of VEIf and VEMf for active surveillance are simple to obtain, they do not naturally allow for complex covariates that can model nonconstant vaccine efficacy nor allow for changing distributions of pathogens. In this section, we extend the WEE approach in Follmann and Huang (2015) to allow for these effects.

Let Ti be the time to infection and Ci the time to censoring for individual i. We assume that the censoring time Ci is independent of Ti conditioning on the covariates. Moreover, define Yi = min(Ti, Ci), δi = I(TiCi) and Ni(t) = δiI(Yit). Motivated by Follmann and Huang (2015), we propose to construct unbiased estimating equations based on the observed stochastic processes XfiAdNi(t), f = 1, …, F and i = 1, …, n.

Define by WE covariates that impact the exposure to pathogens and by WfX covariates that impact the count of Xf given exposure to pathogens. Define W=(WE,W1X,,WFX). We assume that the intensity of exposure to any pathogen is given by

ω(t|W)=ω(t)exp(θWE), (8)

while the mean of X given exposure at time t satisfies the proportional mean model

E(Xf|Wf)=E(Xf|WfX)=exp{βf(t)+ϕWI+ψZVf}. (9)

With βf(t) varying with t, we allow the mean response in the placebo group to change arbitrarily with time. We include WI to allow for pan-feature effects such as innate immunity or actual/counterfactual immune response to the vaccine while Vf specifies the vaccine effect for feature f. In the appendix, we show these assumptions and only terminal infections imply that

E{XfiAdNi(t)|Wi}=λf(t)exp(αWfiu)P(Yit|Wi)dt (10)

where λf(t) = ω(t) exp{βf(t)} is a nuisance function and Wfiu are the unique covariates in (WE, WI, ZVf). Note that for a covariate included in both WiE and WiI, the corresponding α is the sum of the corresponding θ and ϕ. When there are no such common covariates, we have Wfiu=(WiE,WiI,ZiVfi). A simple example is Wfiu={Zi,Zi(f1)}.

Based on the zero-mean property of the observed stochastic processes, XfiAdNi(t)E{XfiAdNi(t)}, we derive p = dim(α) unbiased estimating equations after profiling out the nuisance functions λf(t), f = 1, …, F,

U(α)=i=1n0τf=1FXfiA{Wfiuj=1nWfjuexp(αWfju)I(Yju)k=1nexp(αWfku)I(Yku)}×dNi(u)=0. (11)

Define the solution as α̂, which we call the weighted estimating equation (WEE) estimator. In Web Appendix A, we show that √n(α̂α0) converges to a multivariate normal distribution with mean zero and variance-covariance matrix Γ−1ΩΓ−1 where Γ = −n−1E{∂U(α)/α |α=α0} and Ω = var{Ui(α0)}.

Remarks

  1. Stratification can be incorporated in the proposed WEE method in the usual way. Within each stratum, for example, sites in a field trial, we construct estimating equation (11) and then sum the equations over strata to estimate the regression coefficients.

  2. Analogous to the product estimator, for trajectories of type 3 and 4 where XA may include pathogens from prior nonterminal infections, the WEE estimator recovers a ratio of means, but they do not have a per-exposure interpretation.

  3. The equations (11) for failure type f correspond to estimating equation from standard Cox regression based on a weighted partial likelihood where the ith failure of type f gets weight XfiA. The equations are summed over all f which have at least 1 failure which results in equation (11). Cox regression software which allows this sort of weighting can be re-purposed to solve these equations. Note that weighted Cox regression typically replicates a person XfiA times including in the “risk set” or denominator of (11). The weight in equation (11) is different with an unchanged risk set but the contribution at Ti reproduced XfiA times.

  4. If such weighting is not allowed, the equations can still be solved with standard Cox regression software. Let 𝒦 be the set of subjects with an event. For subject k ∈ 𝒦 with an event at time Tk, we concatenate f=1FXkfA data sets. If an infection with feature f is observed for this subject k, a total of XkfA>0 identical datasets of failure type f are created from the risk set ℛ(Tk) of subjects under study at time Tk. Each data set has #ℛ(Tk) – 1 nonevents with covariate vectors Wft from those ℓ in the risk set ℛ(Tk) (but excluding subject k) and a single infection with feature f with covariate vector Wfk for subject k. This results in kKf=1FXfkA concatenated datasets. If each dataset defines its own stratum and stratified Cox regression is run then the a solution to (11) can be obtained. The bootstrap can be used to approximate standard errors for the regression coefficients.

  5. The same arguments apply if we replace XfiA with I(XfiA>0) when our goal is to estimate VEIf. In this case, the estimating equations can be solved using stratified Cox regression software. For feature f, we create a Cox regression data set using time to infection with feature f as the failure time and the censoring indicator in the usual fashion along with Wfi as the covariate for person i in the fth dataset. We can then concatenate the datasets into a large dataset and apply stratified Cox software. Note that we only need concatenate UF datasets where U is the number of unique infection features observed in the data. Each feature that has an event defines a stratum so there are U strata. Again, the bootstrap can be used to obtain standard errors for the regression coefficients. This formulation extends Method B of Lunn and McNeil (1995) to multiple noncompeting events.

  6. The assumption of a constant VEMf throughout follow-up can be weakened. For example, the mean at time t for feature f might be proportional to

    E(XfiA|Wfu)=exp{βf(t)+Z(αf1+αf2t)},

    Thus, VEMf = 1 – exp(αf1 + αf2t) can wane smoothly over time. If we further require αf2 = αf for all f, then the ensemble VEM1, …, VEMf can wane similarly over time. Another formulation would have

    E(XfiA|Wfu)=exp[βf(t)+Z{I(tτ)αf1+I(t>τ)αf2}],

    so that VEMf can differ before and after τ, for example, the median follow-up.

In Web Appendix B, we evaluate the performance of the WEE and product approaches via simulation. We consider pathogen distributions, that is, placebo draws from FZ, that are constant, increase, or decrease over time, and allow for no and substantial subject level heterogeneity. The WEE method allows for changing pathogen distributions while the product methods does not. Both methods assume no subject-level heterogeneity. Interestingly, the simulations show that both methods are unbiased for the sieve effect θf,g for all scenarios. For feature specific vaccine efficacy, VEMf the performance differs. The WEE is unbiased for all null scenarios and only modestly biased at about 10% and then only when there is both substantial heterogeneity and the mean of FZ() increases over time. WEE is unbiased for moderate heterogeneity. For both non and non-null scenarios, the product estimate shows bias with nonconstant FZ and is substantially biased under heterogeneity for all FZ.

5. Example

Malaria is an ancient parasitic disease that causes substantial illness and disease, primarily in the developing world. While vaccine development has been long and difficult, the RTS,S vaccine has demonstrated promising efficacy; see Agnandji et al. (2015); RTS et al. (2012). Investigation of potential sieve effects of the vaccine is of keen interest. Towards this end, next-generation PCR amplification of the CS region in AA positions 293–389 was undertaken and the infecting parasites characterized as illustrated in Table 1. Neafsey et al. (2015) conducted an extensive sieve analysis of the RTS,S vaccine using the mismatch metric for individual AA locations and subregions of the region defined by amino acid positions 293–389. Sophisticated methods that had been developed for single infecting pathogens were leveraged using the Monte Carlo WCR approach.

To illustrate our methods, we use children 5–17 months old from all trial sites of the RTS,S trial (Neafsey et al., 2015). There were 6912 children from 11 trial sites with a total of 2089 terminal infections, that is, first or only episodes of clinical malaria within the 1st year post vaccination. For illustration, we focus on the mismatch metric applied to the DV10 region of AA positions 293–302. We define X1A, X2A as the counts of matched and mismatched pathogens at terminal infection. Of the 2089 infections for this AA region, 1722 had all infecting pathogens mismatched, 68 had all infecting pathogens matched and 299 had both matched and mismatched infecting pathogens. The top panel of Figure 2 displays a random sample of 165 children, 56 of them had an episode of first or only clinical malaria (parasitemia plus symptoms) and are denoted by a number providing the jittered total count of infecting pathogens at the time of terminal infection. The bottom panel is a jittered scatterplot of the count of mismatched and matched infecting pathogens; the correlation between the two is about 0.28.

Figure 2.

Figure 2

Top panel: Years to censoring or first/only episode of clinical malaria for 165 randomly selected children. Small dots denote censoring, numbers provide the number of infecting pathogens at the time of the detection of infection. Bottom panel: a scatterplot of the count of infecting pathogens by mismatch/match to the DV10 region.

We begin our analysis by evaluating three different estimators of VEIf: WCR, an analysis that approximates old technology where a single pathogen is identified, and the product estimator. For WCR we generated 10,000 datasets by selecting one infecting strain at random for each infected child. For each generated dataset, the Cox proportional hazards model was applied separately to matched (f = 1) and mismatched (f = 2) infections using the parametrization

λf(t)=λ0f(t)exp[Z{α1I+I(f=2)α2I}],

see Method B of (Lunn and McNeil, 1995). Here, α1I, α1I + α2I correspond to matched and mismatched vaccine effects, and α2I ≠ 0 indicates differential vaccine efficacy. An estimate of its variance is obtained as described in Hoffman et al. (2001) and Follmann et al. (2003). To mimic the old technology where a single infecting pathogen was identified, we selected the median α̂2I and corresponding standard error. We view this an approximation to the estimate that would likely be obtained under the old technology. For the product method with I(Xf > 0) as outcome, we use (7) with Wfα = α1 + α1IZ + α2I(f = 2) + α2IZI(f = 2). The estimator reduces to

VE^If=1=exp(β^)I(Xf1A>0)¯I(Xf0A>0)¯. (12)

where I(Xfz>0)¯ is the proportion infected with feature f among the infected in group z and β̂ is estimated from Cox regression. One can show that the estimate of the sieving effect simplifies to α^2I=log{I(X11A>0)¯/I(X10A>0)¯}log{I(X01A>0)¯/I(X00A>0)¯}.

For the product method with Xf as outcome, the development is similar. The estimates of VEMf, α2M are analogous to VEIf, α2I with I(XfzA>0)¯ replaced by XfzA¯. The nonparametric bootstrap was used to estimate the variance of the estimates based on the product method.

Table 3 reports the results. The first three columns all estimate VEIf. We see that the approximation of the old technology, where only one parasite was used for each infection, has a smaller Wald statistic than that from Monte Carlo WCR, which shows significant evidence of sieving. The use of the product method with the infection indicator as outcome, I(Xf > 0), gives a substantially larger Wald statistic than Monte Carlo WCR. The difference between the WCR and product estimators of VEIf was interesting and explored in detail analytically and via simulations in the Web Appendix C for the simple setting of constant pathogen distributions and exponential times to infection. We show that these two approaches actually estimate different parameters; the WCR approach estimates VEIf for a randomly selected pathogen while the product approach estimates a marginal VEIf. Limited simulations show that the relative efficiency ZPROD2/ZWCR2 was about 3.5 for these scenarios, suggesting the product method may be substantially more efficient. However, these simulations do not evaluate an all-or-none type vaccine effect.

Table 3.

Different methods of estimating differential vaccine efficacy applied to the DV10 region of the circumsporozoite protein. Data from a phase 3 trial of the RTS,S/AS01 malaria vaccine in African infants. The differential VE parameter α2U = log{(1-VEU1)/(1-VEU2)}, where U = I for infection indicator or M for mean. 95% confidences in parentheses.

VEIf on infection VEMf on count

Parameter One parasite 10,000 Monte Carlo WCR Product method on I(Xf > 0) Product method on Xf
Matched: VE1 0.55 (0.39,0.66) 0.56 (0.44,0.65) 0.60 (0.51,0.67) 0.61 (0.52,0.68)
Mismatched:VE2 0.43 (0.38,0.48) 0.43 (0.38,0.48) 0.44 (0.39,0.49) 0.52 (0.46,0.56)

Sieving effect

α̂2 −0.219 −0.245 −0.324 −0.211
Vâr(α̂2) 0.0244 0.0150 0.0097 0.0100
α^2/va^r(α^2)
−1.40 −2.01 −3.29 −2.10

The final column estimates VEMf which are larger than the corresponding VEif^s indicating a greater estimated effect of the vaccine on the mean than infection. However, the increase is more substantial for mismatched infections compared to matched infections. As a consequence, the Wald test of differential vaccine efficacy is less substantial (−2.01) than for the product estimator using I(Xf > 0) (−3.29).

6. Discussion

This article has developed methods to assess differential vaccine efficacy when multiple infecting strains can be quantified. While many pathogens typically have only a single clone infect a person, others, including HIV, malaria, and hepatitis C, can have multiple clones infect during the interval(s) between infection assessment. Such characterization offers the potential for increased efficiency in identifying clues about the mechanism of action of the vaccine. Methods for both passive and active surveillance were developed, the former is an obvious generalization of existing methods developed by Gilbert et al., while the latter is an extension of the methods of Follmann and Huang (2015). Because this article covered a lot of different approaches with different assumptions, in Table 4, we provide a summary of key assumptions required for the estimands of θf,g and 1-VEMf to achieve a per-exposure interpretation.

Table 4.

Key assumptions required to recover per-exposure estimands. All methods require that any infection be terminal (e.g., a trajectory of type 2 of Figure 1). If non-terminal (e.g., subclinical) infections are allowed, θf,g recovers a sieve effect and 1-VEMf a ratio of means, but neither has a clear per-exposure interpretation. In practice the assumptions below can be weakened further via stratification for any method or by allow use of time-varying covariates for the WEE and product methods.

Est. Met. Data Estimand Assumptions
GEE XP
Wf
L
Per-exposure ratio of means θf,g E(XfP|Wf)=exp(αWf)
per-exposure X iid FW () any infection is terminal
Prod XA
W
δ, T
VEMf=1E{Xf|W(1)}E{Xf|W(0)}
Same V,P exposure: ω(t) exp(θWE)
X independent draws from a dbn. with
  • · P(X+>0|Z=1)P(X+>0|Z=0)=exp(β)

  • · E(Xf|W, X+ > 0) = exp(Wfα)

  • · E(Xf|W,X+>0)=E(XfA|W,X+A>0) any infection is terminal

WEE XA
W
δ, T
VEMf=1E{Xf|W(1)}E{Xf|W(0)}
Same V,P exposure: ω(t) exp(θWE)
X at time t an independent draw from a dbn. with
  • · E(Xf|W) = exp{βf(t) + ϕW′ + ψZVf}

  • · E(Xf|W,X+>0)=E(XfA|W,X+A>0) any infection is terminal

Notes: X is the per exposure count vector; XP is observed at end of follow-up L; XA is observed at terminal infection at time T, 0 otherwise; T is the time to infection or censoring; δ is the infection indicator; f = 1, … F are the features of interest of the pathogen; WfE are covariates that impact exposure; Wf are covariates that impact Xf; W1 are covariates that impact Xf for vaccine and placebo; Vf are covariates that describe the vaccine efficacy for feature f; Z is the vaccine indicator; W = WE, W1, …, WF; W(Z) denotes covariates in group Z = 1 vaccine or Z = 0 placebo.

For the RTS,S malaria vaccine trial, there were large Wald statistics for the new method compared to cataloging a single pathogen when yes/no infection was used as the readout. Additionally, in this dataset use of VEIf rather than VEMf had a more significant result. Whether VEIf would generally be more powerful than VEMf for malaria or other diseases will be answered in the analysis of additional trials. The example also showed that use of the marginal VEIf from active surveillance provided more evidence of sieving than the WCR estimator of VEIf for a randomly selected pathogen. This behavior was investigated analytically and reproduced in limited simulations. The two estimators do estimate different parameters and thus provide complementary information; as above, it will be interesting to see which approach is more powerful for different diseases.

Our weighted estimating equations were obtained by summing unbiased estimating equations over the different pathogen features. This is akin to the use of working independence correlation matrix in the GEE method and greater efficiency may be achievable if the equations were weighted based on the covariance cov(X|X+ > 0). The method of generalized methods of moments approach could be used in future work to derive a more efficient estimator of α based on the “F × p” estimating equations.

Supplementary Material

Datacode
Supplemental Materials

Acknowledgments

This work utilized the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov). We thank Peter Gilbert and Michal Juraska for helpful comments. We also thank the participants, investigators, and sponsors of the RTS,S trials. We especially thank GlaxoSmithKline, Christian Ockenhouse for the Path Malaria Vaccine Initiative, Dyann Wirth for the Harvard School of Public Health, and Daniel Neafsey for the Broad Institute Genomics Platform.

Footnotes

Supplementary Materials: Web appendices A and B referenced in Section 4.2; Web Appendix C referenced in Section 5; and the computer code and example data for Table 3 are available with this article at the Biometrics website on Wiley Online Library.

References

  1. Agnandji ST, Fernandes JF, Bache EB, Ramharter M. Clinical development of RTS, S/AS malaria vaccine: A systematic review of clinical phase I-III trials. Future Microbiology. 2015;10:1553–1578. doi: 10.2217/fmb.15.90. [DOI] [PubMed] [Google Scholar]
  2. Doud MB, Koksal AC, Mi LZ, Song G, Lu C, Springer TA. Unexpected fold in the circumsporozoite protein target of malaria vaccines. Proceedings of the National Academy of Sciences. 2012;109:7817–7822. doi: 10.1073/pnas.1205737109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Follmann D, Huang CY. Incorporating founder virus information in vaccine field trials. Biometrics. 2015;71:386–396. doi: 10.1111/biom.12277. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Follmann D, Proschan M, Leifer E. Multiple outputation: Inference for complex clustered data by averaging analyses from independent data. Biometrics. 2003;59:420–429. doi: 10.1111/1541-0420.00049. [DOI] [PubMed] [Google Scholar]
  5. Gilbert PB. Comparison of competing risks failure time methods and time-independent methods for assessing strain variations in vaccine protection. Statistics in Medicine. 2000;19:3065–3086. doi: 10.1002/1097-0258(20001130)19:22<3065::aid-sim600>3.0.co;2-d. [DOI] [PubMed] [Google Scholar]
  6. Gilbert PB. Interpretability and robustness of sieve analysis models for assessing HIV strain variations in vaccine efficacy. Statistics in Medicine. 2001;20:263–279. doi: 10.1002/1097-0258(20010130)20:2<263::aid-sim660>3.0.co;2-1. [DOI] [PubMed] [Google Scholar]
  7. Gilbert PB, Self S, Rao M, Naficy A, Clemens J. Sieve analysis: Methods for assessing how vaccine efficacy depends on genotypic and phenotypic pathogen variation from vaccine trial data. Journal of Clinical Epidemiology. 2001;54:68–85. doi: 10.1016/s0895-4356(00)00258-4. [DOI] [PubMed] [Google Scholar]
  8. Gilbert PB, Self SG, Ashby MA. Statistical methods for assessing differential vaccine protection against human immunodeficiency virus types. Biometrics. 1998;54:799–814. [PubMed] [Google Scholar]
  9. Halloran ME, Longini IM, Struchiner CJ, Longini IM, Struchiner CJ. Design and Analysis of Vaccine Studies. New York: Springer; 2010. [Google Scholar]
  10. Hoffman EB, Sen PK, Weinberg CR. Withincluster resampling. Biometrika. 2001;88:1121–1134. [Google Scholar]
  11. Juraska M, Gilbert P. Mark-specific hazard ratio model with multivariate continuous marks: An application to vaccine efficacy. Biometrics. 2013;69:328–337. doi: 10.1111/biom.12016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Juraska M, Gilbert PB. Mark-specific hazard ratio model with missing multivariate marks. Lifetime Data Analysis. 2016;22:606–625. doi: 10.1007/s10985-015-9353-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Keele BF, Giorgi EE, Salazar-Gonzalez JF, Decker JM, Pham KT, Salazar MG, et al. Identification and characterization of transmitted and early founder virus envelopes in primary HIV-1 infection. Proceedings of the National Academy of Sciences. 2008;105:7552–7557. doi: 10.1073/pnas.0802203105. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Lunn M, McNeil D. Applying cox regression to competing risks. Biometrics. 1995;51:524–532. [PubMed] [Google Scholar]
  15. Neafsey DE, Juraska M, Bedford T, Benkeser D, Valim C, Griggs A, et al. Genetic diversity and protective efficacy of the RTS, S/AS01 malaria vaccine. New England Journal of Medicine. 2015;373:2025–2037. doi: 10.1056/NEJMoa1505819. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. RTS S, Agnandji ST, Lell B, Fernandes JF, Abossolo BP, Methogo B, et al. A phase 3 trial of RTS, S/AS01 malaria vaccine in african infants. New England Journal of Medicine. 2012;367:2284–95. doi: 10.1056/NEJMoa1208394. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Sun Y, Li M, Gilbert PB. Mark-specific proportional hazards model with multivariate continuous marks and its application to hiv vaccine efficacy trials. Biostatistics. 2013;14:60–74. doi: 10.1093/biostatistics/kxs022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Zeger SL, Liang KY. Longitudinal data analysis for discrete and continuous outcomes. Biometrics. 1986;42:121–130. [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Datacode
Supplemental Materials

RESOURCES