Assessing etiological heterogeneity for multinomial outcome with two-phase outcome-dependent sampling design

Sarah A Reifeis; Michael G Hudgens; Melissa A Troester; Michael I Love

Am J Epidemiol. 2024 Jul 16;194(4):1072–1078. doi: 10.1093/aje/kwae212

PMCID: PMC13070532 PMID: 39010753

Abstract

Etiological heterogeneity occurs when distinct sets of events or exposures give rise to different subtypes of disease. Inference about subtype-specific exposure effects from two-phase outcome-dependent sampling data requires adjustment for both confounding and the sampling design. Common approaches to inference for these effects do not necessarily adjust appropriately for these sources of bias, or allow for formal comparisons of effects across different subtypes. We show that using inverse probability weighting (IPW) to fit a multinomial model yields valid inference with this sampling design for subtype-specific exposure effects and contrasts thereof. We compare the IPW approach to common regression-based methods for assessing exposure effect heterogeneity using simulations. The methods are applied to estimate subtype-specific effects of various exposures on breast cancer risk in the Carolina Breast Cancer Study (1993-2001).

Keywords: cancer subtypes, confounding, exposure effect, inverse probability weighting, polytomous outcome, observational data

Introduction

Two-phase outcome-dependent sampling (ODS) designs are often utilized in observational studies as a cost-effective way to estimate exposure effects on an outcome of interest. Two-phase sampling designs entail two steps. First, a random sample of individuals is drawn from a population; outcome and covariate data are collected on these individuals. Second, individuals are stratified according to the first-phase data, and simple random samples are drawn from each stratum, with known stratum-specific selection probabilities. Additional covariates, possibly including exposures of interest, are collected on the individuals selected for this second phase.¹ Assessing exposure effects in two-phase observational studies is challenging because adjustment is needed for both confounding and unequal probability sampling.
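The second sampling step can be sketched in a few lines of Python. This is a minimal illustration only, not the study's sampling code: the record keys `y` and `l1`, the stratum definition (case status by a binary first-phase covariate), and the per-stratum sample sizes are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict

def two_phase_sample(phase_one, n_per_stratum, seed=0):
    """Draw a simple random sample from each first-phase stratum.

    phase_one: list of dicts with hypothetical keys 'y' (outcome, 0 = no
               disease) and 'l1' (a first-phase covariate).
    n_per_stratum: dict mapping (is_case, l1) to the number to select.
    Returns selected records tagged with the known selection probability
    and its inverse (the sampling weight).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in phase_one:
        strata[(rec["y"] > 0, rec["l1"])].append(rec)  # first-phase data only

    selected = []
    for key, members in strata.items():
        n_h = min(n_per_stratum[key], len(members))
        prob = n_h / len(members)          # known by design
        for rec in rng.sample(members, n_h):
            selected.append({**rec, "selection_prob": prob,
                             "sampling_weight": 1.0 / prob})
    return selected

# Toy first phase: roughly 22% cases spread over two subtypes.
_rng = random.Random(1)
phase_one = [{"y": _rng.choice([0] * 7 + [1, 2]), "l1": _rng.randint(0, 1)}
             for _ in range(5000)]
n_per_stratum = {(False, 0): 100, (False, 1): 100,
                 (True, 0): 100, (True, 1): 100}
phase_two = two_phase_sample(phase_one, n_per_stratum)
```

Because cases are far rarer than non-cases in the toy first phase, equal stratum sample sizes oversample cases, and the sampling weights sum back to the first-phase size.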

ODS is often performed with respect to a binary outcome, such as in case–control studies. Many disease outcomes, although dichotomized for ease of analysis, are not truly binary. For instance, cancers may have multiple subtypes, and researchers may compare exposure effects across these subtypes. Etiological heterogeneity occurs when the degree of relative risk conferred by exposures differs across subtypes of disease. This is crucial in cancer epidemiology, as identifying whether an exposure raises the risk for a particular subtype can reveal unknown relationships and generate new mechanistic hypotheses.² For example, in breast cancer research, investigators might explore whether oral contraceptive use affects estrogen receptor (ER)-positive versus ER-negative breast cancer differently, and if so, to what extent. Answers to these kinds of questions can be informed by point and interval estimates of subtype-specific exposure effects as well as hypothesis tests for equality of these effects.

Environmental or behavioral risk factors and exposures are often influenced by many other variables. As a result, confounding can generally be expected when estimating environmental and behavioral exposure effects on disease subtypes from observational data. Zabor and Begg² compared multinomial logistic regression with methods from multiple studies³⁻⁵ that were developed for investigation of etiological heterogeneity, but this comparison did not evaluate these methods for confounding adjustment. Benefield et al⁶ estimated various environmental exposure effects on risk of developing subtypes of breast cancer using regression and including confounders as predictors in the models. While the aforementioned methods allow for incorporating confounders as predictors in a regression model, they each rely on the strong assumption that the confounders and exposure do not interact to influence the outcome.

Inverse probability weighting (IPW) has been proposed to adjust for both confounding and biased sampling when estimating exposure effects for studies with ODS,⁷ but not in the context of a multinomial outcome. Additionally, IPW has been used for estimation of exposure effects on multinomial outcomes,⁸ but not in the context of ODS. When estimating exposure effects on multinomial outcomes for studies using ODS, methods are needed that (1) appropriately adjust for confounding, (2) account for the sampling design, and (3) accommodate the multinomial nature of the outcome to answer questions of exposure effect heterogeneity.

In this paper, methods are considered that address (1)-(3). In particular, in Methods an IPW approach is developed that extends a method from Wang et al⁷ to ODS studies where the outcome is multinomial for inference about subtype-specific exposure effects, and two common regression-based approaches are also described. These methods are compared in Results using simulation studies and data from the Carolina Breast Cancer Study (CBCS). We conclude with a review and discussion of the main results, and address limitations and future directions for this research. Appendix S1 provides further details of the methods introduced here. Additional simulation study results are included in Appendix S2, and Appendix S3 contains SAS (SAS Institute, Inc, Cary, NC) code for analyzing an example data set with each method considered.

Methods

Consider a two-phase ODS design where the goal is to estimate the effects of a binary exposure A on a multinomial outcome Y, where Y takes values $k$ for $k = 0, 1, \ldots, K$. The outcome levels k are referred to as (disease) subtypes. Assume $Y = 0$ is coded as “no disease,” as inference is dependent on the choice of reference level. Below A = 0 is referred to as “unexposed” and A = 1 as “exposed,” although in general the coding for A may represent any two exposure levels. Let $L = (L_1, L_2)$, where $L_1$ and $L_2$ denote vectors of covariates observed for all first- and second-phase individuals, respectively. Define $Y_k = I(Y = k)$ to be the indicator of experiencing subtype $k$. Note that $\sum_{k=0}^{K} Y_k = 1$. Let S be the indicator of selection for the second phase, and let $P(S = 1 \mid Y, L_1, A)$ represent the known second-phase selection probabilities. Suppose we observe m independent and identically distributed copies of the observed data for $i = 1, \ldots, m$. Assume that, conditioning on the first-phase data, selection is independent of the covariates observed in the second phase of sampling. Let $n = \sum_{i=1}^{m} S_i$ denote the number of individuals selected in the second phase of sampling.

Exposure (or risk factor) effects can be defined using potential outcomes (or counterfactuals). Let $Y^a$ and $Y_k^a = I(Y^a = k)$ be the potential outcome and potential subtype indicator if, possibly counter to fact, an individual had exposure a for $a \in \{0, 1\}$. Let $\pi_k^a = P(Y^a = k)$ represent the counterfactual risk of developing subtype $k$ had the exposure been a, for $k = 0, 1, \ldots, K$ and $a \in \{0, 1\}$. Let $P(A = a \mid L)$ represent the probability of exposure a given covariate vector L, for $a \in \{0, 1\}$. Ultimately our goal is to draw inference about the effects of A on the risk of developing different disease subtypes, which may be defined in terms of contrasts in the subtype-specific counterfactual risks $\pi_k^a$.

Assume $Y_k^a \perp A \mid L$ for $a \in \{0, 1\}$ and $k = 0, 1, \ldots, K$ (conditional exchangeability), $Y_k = A Y_k^1 + (1 - A) Y_k^0$ for $k = 0, 1, \ldots, K$ (causal consistency), and $P(A = a \mid L = l) > 0$ for $a \in \{0, 1\}$, and all l such that $dF_L(l) > 0$ (positivity), where in general $F_X$ denotes the cumulative distribution function (CDF) of X. In other words, positivity requires that there are both exposed and unexposed participants at every combination of the values of the observed covariates L in the population of interest.⁹ Also assume that every individual has a nonzero second-phase sampling probability, ie, $P(S = 1 \mid Y = y, L_1 = l_1, A = a) > 0$ for all y, $l_1$, and a where $dF_{Y, L_1, A}(y, l_1, a) > 0$. In combination with the other aforementioned assumptions, knowing these selection probabilities permits identifiability of the risks in the original population for the K nonreference subtypes, so functions of the risk (eg, the risk difference, risk ratio, and relative risk ratio) are also identifiable.

There are various possible estimands of interest to quantify etiological heterogeneity. Some estimands are covariate-conditional, meaning that the estimands quantify the exposure effects within strata of individuals with the same values of the covariates L. Estimands that are not covariate-conditional are referred to as marginal estimands. When the research question of interest is best answered by estimating a single quantity describing the population of interest, marginal estimands may be preferred. All estimands considered here are subtype-conditional, meaning that the estimand for each subtype k is conditional on the outcome being subtype k or no disease (ie, subtype 0). We will hereafter refer to these simply as subtype-specific estimands. This quality is desirable because it allows the “no disease” subtype to serve as a common reference outcome, facilitating comparisons between subtypes of disease. First we consider IPW of a marginal structural model (MSM) where the estimands are marginal, and subsequently we consider regression-based estimators where the corresponding estimands are covariate-conditional.

Inverse probability weighting

Consider the following multinomial MSM

$$\log\frac{P(Y^a = k)}{P(Y^a = 0)} = \beta_{1k} + \beta_{2k} a, \quad k = 1, \ldots, K,$$

or equivalently $P(Y^a = k) = \exp(\beta_{1k} + \beta_{2k} a) / \{1 + \sum_{j=1}^{K} \exp(\beta_{1j} + \beta_{2j} a)\}$. The parameter $\beta_{2k}$ can be interpreted as a log relative risk ratio (RRR) or log subtype-specific odds ratio (OR) by noting that

$$\exp(\beta_{2k}) = \frac{P(Y^1 = k) / P(Y^1 = 0)}{P(Y^0 = k) / P(Y^0 = 0)}. \quad (1)$$

The numerator of equation 1 is the risk ratio comparing the risk of subtype k versus the risk of no disease had everyone been exposed; the denominator of equation 1 has a similar interpretation had everyone been unexposed. Thus $\exp(\beta_{2k})$ is the causal RRR of developing subtype k relative to no disease had everyone been exposed versus unexposed. The RRR being less than 1 indicates exposure is protective against developing subtype k relative to no disease. Conversely if the RRR is greater than 1, exposure confers increased risk of developing subtype k relative to no disease. In this work, etiological heterogeneity is defined as differences in the causal RRR between subtypes. Thus, for the multinomial MSM above, comparison of the $\beta_{2k}$ coefficients for different values of k quantifies the extent to which there is exposure effect heterogeneity across subtypes of disease, with no heterogeneity corresponding to $\beta_{21} = \beta_{22} = \cdots = \beta_{2K}$. Additionally, $\exp(\beta_{2k})$ can be interpreted as the causal subtype-specific OR because the risk ratios in the numerator and denominator of equation 1 can be equivalently expressed as conditional odds, ie, $P(Y^a = k) / P(Y^a = 0) = P(Y^a = k \mid Y^a \in \{0, k\}) / P(Y^a = 0 \mid Y^a \in \{0, k\})$ for $a \in \{0, 1\}$ and $k = 1, \ldots, K$. Thus the numerator of equation 1 is the odds of developing subtype k had everyone been exposed, among those who would develop subtype k or no disease, and the denominator has an analogous interpretation.
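A small numerical check may help fix the two readings of this parameter. The sketch below uses made-up counterfactual risks (they are not estimates from any study) and confirms that the relative risk ratio and the ratio of conditional odds coincide; the function names are hypothetical.

```python
def relative_risk_ratio(risk_exposed, risk_unexposed, k):
    """RRR for subtype k versus no disease (index 0): ratio of risk ratios."""
    numerator = risk_exposed[k] / risk_exposed[0]
    denominator = risk_unexposed[k] / risk_unexposed[0]
    return numerator / denominator

def conditional_odds(risks, k):
    """Odds of subtype k among those with subtype k or no disease."""
    p_k = risks[k] / (risks[k] + risks[0])
    return p_k / (1.0 - p_k)

# Hypothetical counterfactual risks for (no disease, subtype 1, subtype 2).
risk_exposed = [0.90, 0.06, 0.04]
risk_unexposed = [0.94, 0.03, 0.03]

rrr = [relative_risk_ratio(risk_exposed, risk_unexposed, k) for k in (1, 2)]
odds_ratio = [conditional_odds(risk_exposed, k) / conditional_odds(risk_unexposed, k)
              for k in (1, 2)]
```

In this toy setting the exposure raises the risk of both subtypes relative to no disease, but more so for subtype 1, which is the kind of difference the heterogeneity contrast is meant to capture.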

The parameters of the MSM can be estimated by weighted maximum likelihood as follows. First estimate $P(A = a \mid L)$ by $P(A = a \mid L; \hat{\alpha})$ using weighted maximum likelihood, where $\hat{\alpha}$ represents estimated coefficients from the propensity score model, as described in Appendix S1.1 of the Supplementary Material. Then for each individual compute the inverse probability of exposure weight from the estimated propensity score. Next fit the following multinomial logistic regression model

$$\log\frac{p_k(a)}{p_0(a)} = \beta_{1k} + \beta_{2k} a, \quad k = 1, \ldots, K, \quad (2)$$

using weighted maximum likelihood with weights that combine the estimated exposure weight and the known sampling weight, where $p_k(a) = P(Y = k \mid S = 1, A = a)$ is the probability of developing subtype k given second-phase selection and A, for $k = 0, 1, \ldots, K$, and $a \in \{0, 1\}$. Let $\hat{\beta}$ denote the weighted maximum likelihood estimator (MLE) of the parameters in model 2. The estimator $\hat{\beta}$ is consistent for $\beta$ and asymptotically normal; see the proof in Appendix S1.2 of the Supplementary Material.
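The estimation steps can be illustrated with a deliberately simplified Python sketch. Several choices here are assumptions made for the example and are not taken from the paper or its appendix: a single discrete confounder `l`, a propensity score estimated by sampling-weighted proportions rather than a fitted regression, and an unstabilized combined weight equal to the sampling weight divided by the estimated probability of the observed exposure. Because the outcome model contains only an intercept and a binary exposure for each subtype, its weighted MLE has a closed form in terms of weighted cell totals, so no optimizer is needed.

```python
import math
from collections import defaultdict

def ipw_log_odds_ratios(records, n_subtypes):
    """Closed-form IPW estimates of the log subtype-specific ORs.

    records: dicts with hypothetical keys 'y' (0..K), 'a' (0/1),
             'l' (discrete confounder), 'sampling_weight'.
    """
    # Step 1: sampling-weighted propensity score P(A = 1 | L = l).
    total, exposed = defaultdict(float), defaultdict(float)
    for r in records:
        total[r["l"]] += r["sampling_weight"]
        exposed[r["l"]] += r["sampling_weight"] * r["a"]
    ps = {l: exposed[l] / total[l] for l in total}

    # Step 2: combined weight (assumed form) and weighted (a, y) cell totals.
    cell = defaultdict(float)
    for r in records:
        p_obs = ps[r["l"]] if r["a"] == 1 else 1.0 - ps[r["l"]]
        cell[(r["a"], r["y"])] += r["sampling_weight"] / p_obs

    # Step 3: saturated weighted multinomial model -> log weighted odds ratios.
    return [math.log((cell[(1, k)] / cell[(1, 0)]) / (cell[(0, k)] / cell[(0, 0)]))
            for k in range(1, n_subtypes + 1)]

# Toy second-phase sample: exposure has no effect within either level of l,
# but l is associated with both exposure and disease.
counts = {  # (l, a, y): number of records
    (0, 0, 0): 80, (0, 0, 1): 10, (0, 0, 2): 10,
    (0, 1, 0): 40, (0, 1, 1): 5, (0, 1, 2): 5,
    (1, 0, 0): 30, (1, 0, 1): 10, (1, 0, 2): 10,
    (1, 1, 0): 60, (1, 1, 1): 20, (1, 1, 2): 20,
}
records = [{"l": l, "a": a, "y": y, "sampling_weight": 10.0 if y == 0 else 1.0}
           for (l, a, y), n in counts.items() for _ in range(n)]
log_or = ipw_log_odds_ratios(records, n_subtypes=2)
```

In this toy sample a crude comparison that ignores `l` would suggest an exposure effect, whereas the weighted estimates return to the null. A real analysis would use a fitted propensity score model and survey software that also delivers variance estimates.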

There are several methods available in standard software to estimate the covariance matrix of the estimated coefficient vector $\hat{\beta}$, including a Taylor series (TS) approximation¹⁰ or a nonparametric bootstrap,¹¹ the latter method either accounting for IPW weight estimation or not. These methods and their implementation are described in more detail in Appendix S1.3.

Finally, it may be of interest to construct hypothesis tests comparing subtype-specific ORs (or equivalently RRRs). F-tests may be used for formal comparisons of subtype-specific ORs. For example, the test of no exposure effect heterogeneity across subtypes of disease, ie, $H_0: \beta_{21} = \beta_{22} = \cdots = \beta_{2K}$ for the exposure coefficients $\beta_{2k}$ of the MSM, may be of interest. The test statistic $F = (R\hat{\beta})^\top (R \hat{\Sigma} R^\top)^{-1} (R\hat{\beta}) / (K - 1)$ may be used to test the null hypothesis $H_0: R\beta = 0$, where R is a contrast matrix with K − 1 rows representing the structure of $H_0$ and $\hat{\Sigma}$ is an estimated covariance matrix (eg, estimated using the TS method) evaluated at $\hat{\beta}$. If the estimator $\hat{\Sigma}$ is consistent, then in large samples $F$ will have an approximate F distribution with K − 1 numerator degrees of freedom under the null hypothesis. Similarly, the researcher may wish to evaluate exposure effect heterogeneity for subtypes j and k by modifying R and using the F-test of $H_0: \beta_{2j} = \beta_{2k}$.
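The contrast-based test can be sketched as a Wald-type quadratic form divided by the number of contrasts. The snippet below is an illustration under assumptions: the coefficient vector holds only the K exposure coefficients, the numbers are invented, and the exact scaling and denominator degrees of freedom used by survey software may differ from this textbook form. No P value is computed because the Python standard library has no F distribution.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def wald_f_statistic(beta_hat, cov_hat, contrast):
    """(R b)' (R V R')^{-1} (R b) / q for a contrast matrix R with q rows."""
    p, q = len(beta_hat), len(contrast)
    r_beta = [sum(row[a] * beta_hat[a] for a in range(p)) for row in contrast]
    r_cov = [[sum(ri[a] * cov_hat[a][b] * rj[b]
                  for a in range(p) for b in range(p))
              for rj in contrast] for ri in contrast]
    z = solve(r_cov, r_beta)
    return sum(u * v for u, v in zip(r_beta, z)) / q

# Hypothetical exposure coefficients for K = 3 subtypes and their covariance.
beta_hat = [0.5, 0.2, 0.1]
cov_hat = [[0.04, 0.0, 0.0], [0.0, 0.04, 0.0], [0.0, 0.0, 0.04]]
contrast = [[1, -1, 0], [0, 1, -1]]   # K - 1 rows: equality of all three
f_stat = wald_f_statistic(beta_hat, cov_hat, contrast)
```

Comparing only two subtypes amounts to passing a single-row contrast such as `[[1, -1, 0]]`.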

Regression with sampling weights or offsets

An alternative to IPW commonly used to estimate the effects of A on different subtypes entails using regression as described below. Let $X = (1, A, L^\top)^\top$ and consider the following multinomial regression model

$$\log\frac{P(Y^a = k \mid L = l)}{P(Y^a = 0 \mid L = l)} = \gamma_{1k} + \gamma_{2k} a + \gamma_{3k}^\top l, \quad k = 1, \ldots, K, \quad (3)$$

where $\gamma_{3k}$ is a column vector. Assuming conditional exchangeability and that model 3 is correctly specified, $\exp(\gamma_{2k})$ represents the subtype-specific OR of developing subtype k had everyone been exposed versus unexposed, within strata defined by L. Note that there are no terms for interaction between the exposure A and the covariates L in model 3, such that the subtype-specific OR is constant across strata. Since the OR is not generally collapsible¹² over levels of the covariates L, the parameter $\gamma_{2k}$ need not equal the parameter $\beta_{2k}$ of the marginal structural model 2.

The parameters of model 3 can be estimated using either selection weights or an offset term to account for ODS. The selection weight for an individual i is defined as the inverse of their sampling probability, ie, $1 / P(S = 1 \mid Y_i, L_{1i}, A_i)$, which is known by design; see Appendix S1.4 of the Supplementary Material for details of fitting the regression model with sampling weights. The offset approach¹³ is applicable for study designs where the log relative risk of selection for the second phase is linear in X, ie,

$$\log\frac{P(S = 1 \mid Y = k, X)}{P(S = 1 \mid Y = 0, X)} = \delta^\top X, \quad k = 1, \ldots, K, \quad (4)$$

for some vector of constants $\delta$. Because the sampling probabilities are known, $\delta$ will be known and therefore $\delta^\top X$ can be evaluated for each individual and used as an offset term in the regression model; see Appendix S1.5 of the Supplementary Material for details. Note that the offset term does not depend on the value of k because here it is assumed that selection is performed with respect to the dichotomized outcome where 0 is the reference category and the selection probabilities for all subtypes $k = 1, \ldots, K$ are equal, ie, $P(S = 1 \mid Y = k, X) = P(S = 1 \mid Y = 1, X)$ for $k = 1, \ldots, K$.
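When selection depends only on case status and a discrete first-phase stratum, the offset for each individual is simply the log ratio of the two known selection probabilities in that individual's stratum. A minimal sketch follows; the stratum labels and probabilities are hypothetical.

```python
import math

def selection_offset(p_case, p_control):
    """Log ratio of second-phase selection probabilities (cases vs no disease)."""
    if not (0.0 < p_case <= 1.0 and 0.0 < p_control <= 1.0):
        raise ValueError("selection probabilities must lie in (0, 1]")
    return math.log(p_case / p_control)

# Hypothetical design: (P(select | case), P(select | no disease)) per stratum.
design = {"younger": (0.50, 0.05), "older": (0.80, 0.02)}
offsets = {stratum: selection_offset(p_case, p_control)
           for stratum, (p_case, p_control) in design.items()}
```

The same stratum-specific value would be supplied as the offset for every nonreference subtype, reflecting the assumption that all subtypes share one selection probability within a stratum.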

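Using either selection weights or the offset term 4, the parameters of model 3 can be estimated by fitting the model

$$\log\frac{p_k(a, l)}{p_0(a, l)} = \gamma_{1k} + \gamma_{2k} a + \gamma_{3k}^\top l, \quad k = 1, \ldots, K, \quad (5)$$

where $p_k(a, l) = P(Y = k \mid S = 1, A = a, L = l)$ is the probability of developing subtype k among the selected given A and L for $k = 0, 1, \ldots, K$, $a \in \{0, 1\}$, $\gamma_{3k}$ is a column vector, and $\gamma$ collects the coefficients of all K nonreference subtypes. Let $\hat{\gamma}^{w}$ represent the MLE of $\gamma$ when fitting model 5 with sampling weights, and $\hat{\gamma}^{o}$ denote the MLE of $\gamma$ when fitting model 5 with the offset term. Then $\hat{\gamma}^{w}$ and $\hat{\gamma}^{o}$ are consistent for $\gamma$ and asymptotically normal; see Appendixes S1.1 and S1.2 of the Supplementary Material for details.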

As with IPW, covariance matrix estimation for the regression approaches can be carried out with the TS or nonparametric bootstrap methods. Wald CIs for the estimated subtype-specific ORs can be constructed using the variance estimates computed by these methods. Similarly, F-tests of any null hypothesis of interest that can be expressed as a contrast of model parameters may be performed. Note the null hypotheses constructed with covariate-conditional parameters $\gamma_{2k}$ differ from the null hypotheses described for IPW above, which are constructed using marginal parameters $\beta_{2k}$.

Simulation studies

Simulation studies were conducted to compare the finite-sample performance of the IPW and regression methods described above. Four phase-one data sets were simulated, with the target parameters of interest defined specifically with respect to the m individuals in the first phase; this simulation design mimics the breast cancer study described in the following section. First-phase data sets of size m = 20 000 and m = 200 000 were simulated for each of two simulation study scenarios: one scenario with no confounding in which the exposure and potential outcomes are generated unconditional on covariates $L_1$ and $L_2$, and one scenario with confounding in which exposure and potential outcomes are generated conditional on those covariates. Details on the generation of covariates, exposure, and potential outcomes are described in Appendix S2.1. Five hundred second-phase samples of size n = 2000 were drawn from each of the first-phase data sets, resulting in sampling percentages of 10% and 1% for the first-phase data sets. The sampling weights were dependent on $L_1$ and the dichotomized observed outcome $I(Y > 0)$, but not the exposure A, and were defined such that approximately half of the observations within each second-phase data set had Y = 0; see Section C.1 of the Supplementary Material for further details. Subtype-specific ORs and their corresponding confidence intervals were estimated for subtypes 1 and 2, using each of the methods described in Methods. All analyses here and in the following section were conducted with SAS software, version 9.4, using the SURVEYLOGISTIC procedure¹⁴ and the %BOOT macro.

Choosing the marginal subtype-specific OR as the target parameter, the empirical bias of the estimated ORs and 95% CI coverage and width were computed, and the results were aggregated over the 500 simulated second-phase data sets. Each OR estimator’s variance was estimated using Taylor series and bootstrap estimators. For the IPW estimator, the covariance matrix was estimated using the bootstrap estimator with and without accounting for estimation of the propensity score. All bootstrap covariance matrix estimates were computed using 250 replicates. For the IPW method, the empirical type I error and power were also computed with the F-test of no exposure effect heterogeneity across subtypes at the 0.05 significance level.
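The three performance measures can be computed from the per-replicate estimates as sketched below. Two details are assumptions of this example rather than statements about the authors' code: bias is taken on the OR scale as the mean estimate minus the true OR, and each 95% CI is a Wald interval built on the log scale and then exponentiated.

```python
import math

def summarize_simulation(log_or_hats, std_errors, true_or, z=1.96):
    """Empirical bias, CI coverage, and mean CI width over replicates."""
    n = len(log_or_hats)
    bias = sum(math.exp(b) for b in log_or_hats) / n - true_or
    covered, total_width = 0, 0.0
    for b, se in zip(log_or_hats, std_errors):
        lower, upper = math.exp(b - z * se), math.exp(b + z * se)
        covered += lower <= true_or <= upper   # True counts as 1
        total_width += upper - lower
    return {"bias": bias, "coverage": covered / n, "ci_width": total_width / n}

# Two made-up replicates for a true OR of 1.5.
summary = summarize_simulation([0.0, math.log(2.0)], [0.3, 0.3], true_or=1.5)
```

In a full study the two input lists would hold one entry per second-phase sample, with the standard errors coming from whichever variance estimator is being evaluated.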

CBCS analysis

The methods described above were applied to the CBCS Phase I-II data for estimation of subtype-specific effects of several exposures on breast cancer. These data have been described and analyzed previously.⁶^,¹⁵ Phase I (1993-1996) and II (1996-2001) of CBCS were population-based case–control studies, where “Phase” here refers to the distinct time periods over which the CBCS was conducted and does not indicate phase with respect to two-phase ODS. Individuals with breast cancer (“cases”) were recruited into CBCS from the North Carolina Central Cancer Registry, and “controls” representing the source population from which the cases were selected were recruited from Department of Motor Vehicles (DMV) and Medicare records. Age, race, and breast cancer status were known for all individuals in these records. Breast cancer subtype, exposures, and other covariate information were not recorded in the cancer registry, DMV, or Medicare records, and thus were only obtained for those individuals selected into the CBCS. Individuals in the cancer registry or in the DMV or Medicare databases were assumed to constitute a random sample of the population of interest, namely women from central North Carolina. Thus, these individuals constitute the first-phase sample and participants in the CBCS comprise the second-phase sample in this two-phase ODS study.

Breast cancer subtypes were defined as in Benefield et al,⁶ namely based on ER status (positive or negative) and tumor suppressor gene p53 status (positive or negative) obtained from genomic analysis of the breast tumor sample. These outcome subtypes will be hereafter referred to as ER−/p53−, ER−/p53+, ER+/p53−, and ER+/p53+. The dichotomized exposures considered were body mass index (BMI, calculated as weight [kg]/height [m]²), breastfeeding (ever/never), oral contraceptive (OC) use (ever/never), and parity (parous/nulliparous). The selection probabilities were fixed by study design and were dependent on study Phase (I or II), age, race, and breast cancer status. Note that in Methods these probabilities were allowed to depend on the exposure A, but for the CBCS data they are not dependent on any of the exposures considered. For all exposures considered, the set of variables assumed sufficient for confounding adjustment (ie, conditional exchangeability) included age, race, and age at menarche. For the parity and breastfeeding exposures, the adjustment set also included BMI and OC use. For the BMI exposure, the adjustment set also included OC use. Finally, when OC use was the exposure, parity was also included in the adjustment set. Complete data on 2789 individuals (pooled over Phases I and II) were analyzed.

Results

Simulation studies

Results from the simulation studies are presented in Table 1 for the setting where 1% of observations were selected from the first phase; results including outcome regression approaches and results for the 10% sampling setting are presented in Table S1. In the scenario with no confounding and the scenario with confounding as well as an interaction between the exposure and the covariates, the IPW estimator was approximately unbiased and the corresponding CIs achieved nominal coverage levels. CI widths were comparable regardless of the variance estimators used. For the scenario with confounding and an interaction between the exposure and the covariates, the regression methods exhibited substantial bias and poor coverage for the marginal subtype-specific OR for subtype 1, $\exp(\beta_{21})$, but not for subtype 2, $\exp(\beta_{22})$ (Table S2). These results accord with expectation in that the regression estimators are not necessarily expected to perform well when drawing inference about marginal effects in the presence of confounding. For the IPW method, the F-test using either the TS or bootstrap variance estimator had empirical type I error close to the nominal level in the first scenario (0.06 for TS, 0.04 for bootstrap) and high power in the second scenario (… for TS, 0.98 for bootstrap) for the 1% sampling setting; results for 10% sampling were similar.

Table 1.

Average empirical bias and 95% CI coverage and width for the inverse-probability weighting estimator.^a

Scenario	Method	Variance estimation	Bias (subtype 1)	Bias (subtype 2)	Coverage (subtype 1)	Coverage (subtype 2)	CI width (subtype 1)	CI width (subtype 2)
No confounding	IPW	Boot+PS	0.03	0.01	0.97	0.94	0.96	0.43
		Boot			0.97	0.95	0.97	0.43
		TS			0.97	0.94	0.95	0.43
Confounding with interactions	IPW	Boot+PS	0.00	0.01	0.95	0.96	0.35	0.30
		Boot			0.97	0.97	0.40	0.30
		TS			0.97	0.96	0.40	0.28

Abbreviations: Boot, bootstrap; IPW, inverse-probability weighting; OR, odds ratio; PS, propensity score; TS, Taylor series.

^aVariances were estimated using the Taylor series and bootstrap variance estimators, as well as a bootstrap estimator accounting for propensity score estimation. Results are given for the marginal subtype-conditional odds ratio (OR) for both subtype 1 ($\exp(\beta_{21})$) and subtype 2 ($\exp(\beta_{22})$). Scenarios shown are for 1% sampling proportion of first-phase observations into the second phase.

CBCS analysis

The CBCS data were used to evaluate differences in the subtype-specific exposure effect estimates for BMI, breastfeeding, oral contraceptive use, and parity. Figure 1 displays the estimated ORs and 95% CIs for each exposure of interest using each method outlined in Methods. The variances for the estimated subtype-specific ORs from each method were estimated using both the Taylor series and nonparametric bootstrap estimators, as well as the bootstrap estimator accounting for propensity score estimation for the IPW estimator. For each of the exposures considered, the CIs for the regression estimator with offsets were narrower than those of the other estimators. Variance estimates were similar for each subtype-specific OR estimator. Slight differences in subtype-specific OR estimates and their estimated variances were observed across methods, with the most pronounced differences occurring for parity. Frequency distributions of breast cancer subtype within each exposure are also given in Figure 1. Note for parity that the cell counts for the nulliparous group were small, particularly for the ER−/p53+ subtype, contributing to the larger width of the CI for the corresponding OR.

Figure 1. Subtype-conditional odds ratio estimates and 95% CIs for breast cancer outcome subtypes according to body mass index (BMI), breastfeeding (referent: never), oral contraceptive (OC) use (referent: never), and parity (referent: nulliparous). Odds ratios and CIs estimated using each of the methods presented in Methods. Breast cancer subtypes were defined as in Benefield et al,⁶ namely based on estrogen receptor (ER) status (positive or negative) and tumor suppressor gene p53 status (positive or negative) obtained from genomic analysis of the breast tumor sample. Contingency tables of the exposure-outcome frequency distributions are displayed beneath their corresponding column of the plot. Boot, bootstrap; IPW, inverse-probability weighting; OC, oral contraceptive; Reg Offset, regression offset; Reg Wts, regression weights; TS, Taylor series.

The distribution of the weights estimated in the IPW approach was examined for each exposure, and trimming of weights was considered. In general, the mean of the estimated weights should be close to 1, and there should not be extreme weights for any individual if the propensity score model is correctly specified.¹⁶ The estimated weights corresponding to the BMI, breastfeeding, and oral contraceptive use analyses each met these criteria. For the parity analysis, the mean of the estimated weights was 1.31 and there were some individuals with extreme weights (range, 0.3-80.4). The estimated weights were trimmed to the 99th percentile (ie, weights greater than 10.1 were set to this value) in an effort to reduce possible bias and inflated variance of the estimated exposure effect. After trimming, the estimated ORs remained similar and their corresponding CIs were narrower; the results of the parity analysis with trimmed weights are displayed in Figure 1.
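Trimming at an upper percentile can be sketched as follows. The nearest-rank percentile rule used here is an assumption (statistical packages offer several percentile definitions that can give slightly different cut points), and the weight vector is a toy example that merely echoes the magnitudes mentioned above; it is not the CBCS weight distribution.

```python
import math

def trim_weights(weights, percentile=99.0):
    """Cap weights at the given percentile (nearest-rank definition)."""
    ordered = sorted(weights)
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    cap = ordered[rank - 1]
    return [min(w, cap) for w in weights], cap

# Toy weights: mostly near 1, with two large values.
toy_weights = [1.0] * 98 + [10.1, 80.4]
trimmed, cap = trim_weights(toy_weights)
mean_before = sum(toy_weights) / len(toy_weights)
mean_after = sum(trimmed) / len(trimmed)
```

Checking the mean and range of the weights before and after trimming mirrors the diagnostic described in the text.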

Table 2 gives P values for the test of no heterogeneity of exposure effect across breast cancer subtypes, for each of BMI, breastfeeding, oral contraceptive use, and parity. Specifically, each P value corresponds to the F-test of $H_0: \mathrm{OR}_{\mathrm{ER-/p53-}} = \mathrm{OR}_{\mathrm{ER-/p53+}} = \mathrm{OR}_{\mathrm{ER+/p53-}} = \mathrm{OR}_{\mathrm{ER+/p53+}}$ for the given exposure, where $\mathrm{OR}_{\mathrm{ER-/p53+}}$ represents the subtype-specific OR for subtype ER−/p53+, and ORs for the other subtypes are defined analogously. Recall that for regression the ORs are covariate-conditional, whereas for IPW the ORs have a marginal interpretation. Overall, none of the P values for the test of heterogeneity were less than 0.05, but there were some differences between methods. For instance, for parity the F-test P values corresponding to IPW were several times larger than P values corresponding to the regression approaches. Within each modeling approach, variance estimation methods performed similarly for each exposure. Finally, note that while Table 2 provides P values across a number of exposures without subsequent adjustment, multiple testing should generally be taken into account for such analyses involving multiple comparisons.

Table 2.

P values for the test of no heterogeneity across subtypes for each of body mass index, breastfeeding, oral contraceptive use, and parity. Carolina Breast Cancer Study, 1993-2001.

Method	Variance Estimation	BMI	Breastfeeding	OC use	Parity
IPW	Boot+PS	.98	.26	.75	.69
	Boot	.97	.25	.65	.67
	TS	.97	.25	.69	.65
Regression weights	Boot	.77	.29	.68	.27
	TS	.75	.26	.71	.21
Regression offset	Boot	.71	.57	.90	.09
	TS	.70	.47	.87	.07

Abbreviations: BMI, body mass index; Boot, bootstrap; IPW, inverse-probability weighting; OC, oral contraceptive; PS, propensity score; TS, Taylor series.

Discussion

The proposed IPW method can be used to simultaneously account for confounding and two-phase ODS when the outcome is multinomial. On the other hand, regression with sampling weights or offsets will not yield unbiased estimates of marginal subtype-specific ORs in general in the presence of confounding. This was demonstrated in the simulation studies, where the regression approaches at times showed substantial bias of the OR estimator and poor CI coverage. For CBCS, similar results were observed across methods for BMI, breastfeeding, and oral contraceptive use, but substantial differences were observed across methods for parity. All methods considered here modeled subtypes jointly. Alternatively, outcome subtypes may be modeled separately, which provides reduced asymptotic relative efficiency on joint tests of parameters as described by Begg and Gray.¹⁷ Fitting the multinomial models described in Methods facilitates between-subtype comparisons, allowing for estimation of the covariance matrix and therefore testing null hypotheses involving multiple subtype-specific ORs. Applying IPW in this setting enables inference about effect heterogeneity, yielding approximately unbiased estimates of marginal subtype-specific ORs and valid CIs in the presence of confounding and biased sampling. Further, while we focus in this work on the subtype-specific ORs, the methods can be used to draw inference about any function of the counterfactual risks $P(Y^a = k)$. For example, the quantity $\{P(Y^1 = 2) / P(Y^1 = 1)\} / \{P(Y^0 = 2) / P(Y^0 = 1)\}$, ie, the ratio of the risks of subtype 2 disease and subtype 1 disease when exposed, over the ratio of the risks when unexposed, can be estimated using the IPW method by $\exp(\hat{\beta}_{22} - \hat{\beta}_{21})$, where $\hat{\beta}_{2k}$ denotes the IPW estimate of the log subtype-specific OR for subtype k, with corresponding standard error estimates obtained using the delta method.
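The delta-method calculation for such a ratio of ORs is short enough to write out. In this sketch the two log OR estimates, their variances, and their covariance are invented inputs; in practice they would come from the estimated coefficient vector and its covariance matrix.

```python
import math

def ratio_of_odds_ratios(log_or_1, log_or_2, var_1, var_2, cov_12):
    """Estimate exp(b2 - b1) with a first-order delta-method standard error."""
    diff = log_or_2 - log_or_1
    var_diff = var_1 + var_2 - 2.0 * cov_12     # variance of the difference
    estimate = math.exp(diff)
    std_error = estimate * math.sqrt(var_diff)  # derivative of exp is exp
    return estimate, std_error

# Hypothetical inputs: subtype 1 OR of 2, subtype 2 OR of 3.
estimate, std_error = ratio_of_odds_ratios(
    math.log(2.0), math.log(3.0), var_1=0.04, var_2=0.05, cov_12=0.0)
```

Because the subtypes are modeled jointly, the covariance term is available from the same fit, which is what makes this contrast straightforward to obtain.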

In the present work, we consider etiological heterogeneity defined as differences in the relative risk with respect to an exposure, comparing risk of each subtype to the risk of no disease. In this formulation, increase or decrease of counterfactual risk for one subtype k due to exposure A may additionally affect the risk of first incidence of disease for another subtype j. Alternatively, one may consider different definitions of etiological heterogeneity, including those in which counterfactual risk for any subtype not affected by exposure A takes a functional form not dependent on A.¹⁸ Longitudinal data and competing risks analysis are necessary to fully investigate the relationship between risk of incidence of disease subtypes beyond the risk for first incidence of disease as modeled here.

In the methods presented above, the exposure A takes on two values, 0 and 1. In many settings where etiological heterogeneity is of interest, there may be multiple exposures. The methods developed above can be readily extended to account for the joint effects of multiple exposures. In particular, suppose A takes on multiple values corresponding to all possible combinations of the multiple exposures, where A = 0 corresponds to being unexposed to all of the exposures. Then the multinomial MSM in (1) can be extended to accommodate multiple-valued A by including a separate parameter for each subtype and each nonzero level of A, which reduces to the model in (1) when A is binary. The parameter for subtype k indexed by j, once exponentiated, is interpreted as the causal relative risk ratio for developing subtype k if everyone had exposure j − 1 versus everyone being unexposed to all of the exposures. This approach could create sparsity in observed exposures and subtypes, as seen in CBCS. Certain breast cancer subtypes are rare, making it difficult to assess joint effects of multiple exposures on subtypes. For example, there are only 29 nulliparous individuals with ER−/p53+ breast cancer in CBCS Phases I and II.
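One way the multiple-exposure extension described in this paragraph could be written is sketched below. The notation is an illustrative assumption rather than the authors' own: Y^a denotes the counterfactual outcome under exposure level a, with Y^a = 0 meaning no disease; A takes values 0, 1, ..., J; and β_{jk} are the model parameters for subtype k = 1, ..., K. The indexing is chosen so that j corresponds to exposure level j − 1, matching the interpretation given in the text.

```latex
% Assumed notation (not from the source): Y^{a}, J, K, \beta_{jk}.
\log \frac{\Pr(Y^{a} = k)}{\Pr(Y^{a} = 0)}
  = \beta_{1k} + \sum_{j=2}^{J+1} \beta_{jk}\, I(a = j - 1),
  \qquad k = 1, \dots, K.

% With J = 1 (binary exposure) the sum has a single term:
\log \frac{\Pr(Y^{a} = k)}{\Pr(Y^{a} = 0)} = \beta_{1k} + \beta_{2k}\, a .

% Interpretation of the exposure parameters, for j = 2, \dots, J + 1:
\exp(\beta_{jk})
  = \frac{\Pr(Y^{j-1} = k) \,/\, \Pr(Y^{j-1} = 0)}
         {\Pr(Y^{0} = k) \,/\, \Pr(Y^{0} = 0)} .
```

Under this sketch the model is saturated in A, so the number of exposure parameters grows as J × K, which makes concrete why sparse exposure-by-subtype cells become a concern.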

Several areas of related research merit further study. For instance, future work could explore improving efficiency by utilizing data from individuals not selected into the second phase. In particular, the simple and enriched doubly robust estimators and the locally efficient estimator presented in Wang et al⁷ for two-phase ODS may be adapted to the multinomial outcome setting. In addition, further investigation of the potential consequences of extreme sampling weights and violations (or near-violations) of positivity is desirable.

Contributor Information

Sarah A Reifeis, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States.

Michael G Hudgens, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States.

Melissa A Troester, Department of Epidemiology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States.

Michael I Love, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States; Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States.

Supplementary material

Supplementary material is available at American Journal of Epidemiology online.

Funding

S.A.R. was supported by The Chancellor’s Fellowship from The Graduate School at the University of North Carolina at Chapel Hill. M.G.H. was supported by R01 AI085073. M.I.L. and M.A.T. were supported by P50 CA058223.

Conflict of interest

The authors have declared no competing interest.

Data availability

The data analyzed in this study are available from the Carolina Breast Cancer Study. Restrictions apply to the availability of these data, which were used under data use agreements for this study. The data are not publicly available; however, investigators may submit a letter of intent to gain access upon reasonable request.

References

1. Neyman J. Contribution to the theory of sampling human populations. J Am Stat Assoc. 1938;33(201):101-116. doi:10.1080/01621459.1938.10503378
2. Zabor EC, Begg CB. A comparison of statistical methods for the study of etiologic heterogeneity. Stat Med. 2017;36(25):4050-4060. doi:10.1002/sim.7405
3. Chatterjee N. A two-stage regression model for epidemiological studies with multivariate disease classification data. J Am Stat Assoc. 2004;99(465):127-138. doi:10.1198/016214504000000124
4. Rosner B, Glynn RJ, Tamimi RM, et al. Breast cancer risk prediction with heterogeneous risk profiles according to breast cancer tumor markers. Am J Epidemiol. 2013;178(2):296-308. doi:10.1093/aje/kws457
5. Wang M, Kuchiba A, Ogino S. A meta-regression method for studying etiological heterogeneity across disease subtypes classified by multiple biomarkers. Am J Epidemiol. 2015;182(3):263-270. doi:10.1093/aje/kwv040
6. Benefield HC, Zabor EC, Shan Y, et al. Evidence for etiologic subtypes of breast cancer in the Carolina Breast Cancer Study. Cancer Epidemiol Biomarkers Prev. 2019;28(11):1784-1791. doi:10.1158/1055-9965.EPI-19-0365
7. Wang W, Scharfstein D, Tan Z, et al. Causal inference in outcome-dependent two-phase sampling designs. J R Stat Soc Series B Stat Methodology. 2009;71(5):947-969. doi:10.1111/j.1467-9868.2009.00712.x
8. Richardson DB, Kinlaw AC, Keil AP, et al. Inverse probability weights for the analysis of polytomous outcomes. Am J Epidemiol. 2018;187(5):1125-1127. doi:10.1093/aje/kwy020
9. Westreich D, Cole SR. Invited commentary: positivity in practice. Am J Epidemiol. 2010;171(6):674-677. doi:10.1093/aje/kwp436
10. Binder DA. On the variances of asymptotically normal estimators from complex surveys. Int Stat Rev. 1983;51(3):279-292. doi:10.2307/1402588
11. Mashreghi Z, Haziza D, Léger C. A survey of bootstrap methods in finite population sampling. Stat Surv. 2016;10:1-52. doi:10.1214/16-SS113
12. Greenland S, Robins JM, Pearl J. Confounding and collapsibility in causal inference. Stat Sci. 1999;14(1):29-46. doi:10.1214/ss/1009211805
13. Weinberg CR, Wacholder S. The design and analysis of case-control studies with biased sampling. Biometrics. 1990;46(4):963-975. doi:10.2307/2532441
14. SAS Institute Inc. Chapter 114: the SURVEYLOGISTIC procedure. In: SAS/STAT 14.3 User’s Guide. SAS Institute Inc.; 2017:9328-9428.
15. Newman B, Moorman PG, Millikan R, et al. The Carolina Breast Cancer Study: integrating population-based epidemiology and molecular biology. Breast Cancer Res Treat. 1995;35(1):51-60. doi:10.1007/BF00694745
16. Cole SR, Hernán MA. Constructing inverse probability weights for marginal structural models. Am J Epidemiol. 2008;168(6):656-664. doi:10.1093/aje/kwn164
17. Begg CB, Gray R. Calculation of polychotomous logistic regression parameters using individualized regressions. Biometrika. 1984;71(1):11-18. doi:10.2307/2336391
18. Sun S, Hood M, Scott L, et al. Differential expression analysis for RNAseq using Poisson mixed models. Nucleic Acids Res. 2017;45(11):e106. doi:10.1093/nar/gkx204
