osDesign: An R Package for the Analysis, Evaluation, and Design of Two-Phase and Case-Control Studies

Sebastien Haneuse; Takumi Saegusa; Thomas Lumley

doi:10.18637/jss.v043.i11

Author manuscript; available in PMC: 2012 Apr 25.

Published in final edited form as: J Stat Softw. 2011 Aug;43(11):v43/i11/paper. doi: 10.18637/jss.v043.i11

PMCID: PMC3337215 NIHMSID: NIHMS340920 PMID: 22545023

Abstract

The two-phase design has recently received attention in the statistical literature as an extension to the traditional case-control study for settings where a predictor of interest is rare or subject to misclassification. Despite a thorough methodological treatment and the potential for substantial efficiency gains, the two-phase design has not been widely adopted. This may be due, in part, to a lack of general-purpose, readily available software. The osDesign package for R provides a suite of functions for analyzing data from a two-phase and/or case-control design, as well as for evaluating operating characteristics, including bias, efficiency and power. The evaluation is simulation-based, permitting flexible application of the package to a broad range of scientific settings. Using lung cancer mortality data from Ohio, the package is illustrated with a detailed case study in which two statistical goals are considered: (i) the evaluation of small-sample operating characteristics for two-phase and case-control designs and (ii) the planning and design of a future two-phase study.

Keywords: operating characteristics, power, simulation, study design

1. Introduction

Researchers have at their disposal a wide variety of study designs with which to identify and assess associations between predictors and outcomes. In epidemiology, the case-control design is a mainstay of observational research for binary outcomes (Prentice and Pyke 1979; Breslow and Day 1980). While the design provides an efficient framework for investigating a rare outcome, gains over simple random sampling may be lost if the predictor or exposure of interest is also rare. For the setting where the exposure is binary, White (1982) proposed using the two-phase design as a means of improving efficiency (Neyman 1938). In this context, the design stratifies the population jointly on the outcome and exposure, resulting in four phase I strata. From each of these, a sample is drawn and additional exposure/confounder information is retrospectively ascertained. As in the case-control design, efficiency gains arise from over-sampling a rare sub-population (particularly exposed cases). Over the past 25 years the statistical literature has built on this (and other) work to provide a comprehensive methodological treatment of the two-phase design; key recent references include Breslow and Holubkov (1997a), Scott and Wild (1997) and Breslow and Chatterjee (1999).

Despite the potential for substantial efficiency gains relative to a case-control design, the two-phase design has not been widely adopted. To illustrate this, and motivated by our work in epidemiologic applications, we recently conducted a survey of five top-line epidemiological/medical journals (American Journal of Epidemiology, Epidemiology, International Journal of Epidemiology, New England Journal of Medicine and Journal of the American Medical Association). Of the 4,792 studies published between 2002 and 2007, 816 used the case-control design; only one specifically employed the two-phase design. The lack of uptake may be due, in part, to a paucity of general-purpose, readily-available software for (i) the analysis of two-phase designs and (ii) the evaluation of small-sample operating characteristics, such as bias and power.

In this article we introduce and provide an overview of a new R (R Development Core Team 2011) package, osDesign, that contains a suite of functions useful when designing and analyzing two-phase and case-control studies. The package is available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=osDesign. In Section 2 we provide a brief review of the two-phase design including notation, an overview of estimation/inference techniques and a summary of the published literature on design considerations. Section 3 outlines a proposed simulation-based algorithm for evaluating small-sample operating characteristics of two-phase and case-control designs, implemented in the osDesign package. Sections 4, 5, and 6 illustrate the package with a detailed case-study using lung cancer mortality data from Ohio. Section 7 briefly discusses the trade-off between computing time and accuracy inherent to simulation-based estimation. Finally, Section 8 concludes with a brief summary and areas for further work.

2. The two-phase design

Let Y be the binary outcome under investigation and X a vector of explanatory variables which will generally include the exposure of interest as well as confounders and effect modifiers. Suppose the relationship between Y and X is characterized via the logistic regression model:

P(Y = 1 \mid X = x) = \frac{\exp\{\beta_{0} + x^{\top} \beta_{x}\}}{1 + \exp\{\beta_{0} + x^{\top} \beta_{x}\}} \qquad (1)

In the context of this work, the primary statistical task is to perform estimation and inference with respect to the vector of exposure-specific log odds ratios, β_x. The rest of this section outlines the general two-phase sampling scheme and corresponding likelihood, presents an overview of estimation and inference techniques, and reviews the literature and available software for designing two-phase studies.
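As a numerical illustration of model (1), the following base R sketch evaluates the outcome probability for two covariate profiles and checks that the odds ratio between them equals the exponentiated log odds ratio. The coefficient values are copied from the estimates (Int, Age1, Age2, Sex, Race) listed in this article and serve only as illustrative inputs; the helper names `logistic_risk` and `odds` are hypothetical and are not part of osDesign.

```r
# Illustrative inputs: intercept and log odds ratios copied from the
# estimates reported in this article (an assumption made for this example).
beta0  <- -5.9620986
beta_x <- c(Age1 = 0.5953782, Age2 = 0.6879627, Sex = -1.0041234, Race = 0.2830300)

# Equation (1): P(Y = 1 | X = x) = exp(eta) / (1 + exp(eta)),
# where eta = beta0 + x' beta_x is the linear predictor.
logistic_risk <- function(x, beta0, beta_x) {
  eta <- beta0 + sum(x * beta_x)
  exp(eta) / (1 + exp(eta))
}

# Two covariate profiles that differ only in the Age1 indicator.
x_ref  <- c(Age1 = 0, Age2 = 0, Sex = 0, Race = 0)
x_age1 <- c(Age1 = 1, Age2 = 0, Sex = 0, Race = 0)

p_ref  <- logistic_risk(x_ref,  beta0, beta_x)
p_age1 <- logistic_risk(x_age1, beta0, beta_x)

# Under model (1) the odds ratio between the profiles is exp(beta_x["Age1"]).
odds    <- function(p) p / (1 - p)
or_age1 <- odds(p_age1) / odds(p_ref)

print(c(p_ref = p_ref, p_age1 = p_age1, or_age1 = or_age1))
```

Because the outcome is rare, the two probabilities are both well below 1%, while their odds ratio is determined by the Age1 coefficient alone.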

2.1. Data collection scheme

Initially, suppose N individuals from the population of interest are observed. These data can be either a complete enumeration of the observable population or a large sample drawn cross-sectionally, prospectively or retrospectively (Breslow and Holubkov 1997b). In addition to outcome status, some further information is available for all N observed individuals; from this information a stratification variable, denoted by S and taking one of K levels, is constructed. For example, if sex and age are available, S could be defined by sex alone, by categorized age, or by some combination of the two.

The cross-classification by Y and S is referred to as the phase I data and yields N_0k controls and N_1k cases in the k-th stratum of S, k = 1,…, K; Table 1 summarizes the notation.

Table 1.

Notation for the phase I data.

	S = 1	S = 2	…	S = K	Total
Y = 0	N_01	N_02	…	N_0K	N_0
Y = 1	N_11	N_12	…	N_1K	N_1
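The cross-classification described above can be sketched in base R. The example assumes that the Age, Sex, Race, N and Death table given in this article holds, for each covariate cell, the number of individuals (N) and the number of cases (Death), and it takes S to be the three-level Age variable. The object names `ohio` and `phaseI` are chosen for this illustration only.

```r
# Grouped data transcribed from the Age/Sex/Race/N/Death table in this article.
# Assumption: N is the cell size and Death the number of cases (Y = 1).
ohio <- data.frame(
  Age   = rep(0:2, each = 4),
  Sex   = rep(c(0, 0, 1, 1), times = 3),
  Race  = rep(c(0, 1), times = 6),
  N     = c(429617, 48382, 476170, 54662,
            319387, 29972, 408229, 38767,
            139050, 11610, 244965, 19366),
  Death = c(1025, 172, 507, 81,
            1477, 182, 733, 62,
            768, 100, 391, 35)
)

# Phase I data: counts of controls (N_0k) and cases (N_1k) in each stratum of S.
cases    <- tapply(ohio$Death, ohio$Age, sum)
controls <- tapply(ohio$N - ohio$Death, ohio$Age, sum)

phaseI <- rbind("Y = 0" = controls, "Y = 1" = cases)
colnames(phaseI) <- paste("S =", seq_len(ncol(phaseI)))
print(phaseI)
```

With K = 3 the result is a 2 x 3 table in the layout of Table 1; any other variable, or combination of variables, known for all N individuals could be used to define S in the same way.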

	Age	Sex	Race	N	Death
1	0	0	0	429617	1025
2	0	0	1	48382	172
3	0	1	0	476170	507
4	0	1	1	54662	81
5	1	0	0	319387	1477
6	1	0	1	29972	182
7	1	1	0	408229	733
8	1	1	1	38767	62
9	2	0	0	139050	768
10	2	0	1	11610	100
11	2	1	0	244965	391
12	2	1	1	19366	35

	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	−5.96210	0.02572	−231.769	< 2e-16	***
factor(Age)1	0.59538	0.03118	19.097	< 2e-16	***
factor(Age)2	0.68796	0.03671	18.738	< 2e-16	***
Sex	−1.00412	0.02879	−34.878	< 2e-16	***
Race	0.28303	0.04238	6.678	2.42e-11	***
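The coefficient table above has the layout of R's `summary()` output for a logistic regression with main effects for Age (as a factor), Sex and Race. A minimal sketch of such a fit using base R's `glm()` is given below. It assumes that the grouped counts tabulated earlier (N individuals and Death cases per covariate cell) are the data being modeled; whether this call matches the authors' own code is not confirmed by the text shown here.

```r
# Grouped data transcribed from the Age/Sex/Race/N/Death table in this article.
# Assumption: N is the cell size and Death the number of cases (Y = 1).
ohio <- data.frame(
  Age   = rep(0:2, each = 4),
  Sex   = rep(c(0, 0, 1, 1), times = 3),
  Race  = rep(c(0, 1), times = 6),
  N     = c(429617, 48382, 476170, 54662,
            319387, 29972, 408229, 38767,
            139050, 11610, 244965, 19366),
  Death = c(1025, 172, 507, 81,
            1477, 182, 733, 62,
            768, 100, 391, 35)
)

# Logistic regression on grouped binomial data: the response is a two-column
# matrix of (cases, non-cases) per covariate cell.
fit <- glm(cbind(Death, N - Death) ~ factor(Age) + Sex + Race,
           family = binomial, data = ohio)

est <- coef(fit)               # estimates of beta_0 and beta_x in model (1)
print(summary(fit)$coefficients)
```

Adding a `Sex:Race` term to the formula gives the corresponding model with a Sex by Race interaction.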

Design	Elements available at phase I	Number of phase I strata	Elements requiring collection at phase II
1^a	Y, Age, Sex, Race	24	NA
2^b	Y	2	Age, Sex, Race
3	Y, Age	6	Sex, Race
4	Y, Sex	4	Age, Race
5	Y, Race	4	Age, Sex
6	Y, Age, Sex	12	Race
7	Y, Age, Race	12	Sex
8	Y, Sex, Race	8	Age
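The number of phase I strata in the design table above is the product of the number of outcome levels (two) and the number of levels of each covariate available at phase I. The sketch below recomputes that column, assuming three levels for Age and two each for Sex and Race, as the tabulated data suggest; `levels_per_var` and `n_strata` are hypothetical names used only for this example.

```r
# Number of levels of each covariate that may be available at phase I
# (assumed from the tabulated data: Age in {0, 1, 2}, Sex and Race binary).
levels_per_var <- c(Age = 3, Sex = 2, Race = 2)

# Strata are formed by crossing the binary outcome Y with the phase I covariates.
n_strata <- function(vars) 2 * prod(levels_per_var[vars])

# Covariates available at phase I for designs 1 to 8, in table order.
designs <- list(
  c("Age", "Sex", "Race"),  # design 1
  character(0),             # design 2: outcome only
  "Age",                    # design 3
  "Sex",                    # design 4
  "Race",                   # design 5
  c("Age", "Sex"),          # design 6
  c("Age", "Race"),         # design 7
  c("Sex", "Race")          # design 8
)

strata <- sapply(designs, n_strata)
names(strata) <- paste("Design", seq_along(designs))
print(strata)
```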

	S = 1	S = 2	S = 3
Y = 0	1,007,046	793,901	413,697	2,220,177
Y = 1	1,785	2,454	1,294	5,533

Int	−5.9620986
Age1	0.5953782
Age2	0.6879627
Sex	−1.0041234
Race	0.2830300

Design	Sex main effect, β_S			Interaction, β_SR
	WL	PL	ML	WL	PL	ML
Complete data	-	-	0.0	-	-	0.6
Case-control design^a	-	-	1.7	-	-	38.7
Two-phase design
Sex^b	1.4	0.1	0.1	5.7	2.7	1.1
Race^b	6.3	2.9	2.9	−15.5	1.4	1.4
Sex and Race jointly^c	0.9	−0.4	−0.4	−5.7	0.6	0.6

		N0k	N1k
Age = 0	Race = 0	904235	1552
Age = 0	Race = 1	102811	233
Age = 1	Race = 0	725434	2182
Age = 1	Race = 1	68467	272
Age = 2	Race = 0	382848	1167
Age = 2	Race = 1	30849	127

	p0k	p1k
Age = 0	0.000099	0.056022
Age = 1	0.000126	0.040750
Age = 2	0.000242	0.077280
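The p0k and p1k values above are consistent with sampling fractions n_yk / N_yk for a balanced phase II sample. The sketch below reproduces them under the assumption, not stated in the text shown here, that 100 individuals are drawn from each of the six outcome-by-age strata, using the phase I counts stratified by Age that appear in this article.

```r
# Phase I counts by age stratum, taken from the Y-by-S table in this article.
N0k <- c("Age = 0" = 1007046, "Age = 1" = 793901, "Age = 2" = 413697)  # controls
N1k <- c("Age = 0" = 1785,    "Age = 1" = 2454,   "Age = 2" = 1294)    # cases

# Assumption for this illustration: a balanced design with 100 individuals
# sampled at phase II from every stratum.
n_per_stratum <- 100

# Stratum-specific sampling fractions for controls and cases.
p0k <- n_per_stratum / N0k
p1k <- n_per_stratum / N1k

sampling <- round(cbind(p0k, p1k), 6)
print(format(sampling, scientific = FALSE))

# Total phase II sample size implied by the assumed design.
n_phaseII <- n_per_stratum * (length(N0k) + length(N1k))
```

The contrast between the two columns shows how heavily cases are over-sampled relative to controls when the outcome is rare.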

Function call	Manuscript sub-section	Distinct designs^a			Simulation sample size, B
		TP	CC	CD	500	1,000	10,000
ocAge <- tpsSim()	5.1	1	1	1	0.28	0.58	5.95
ocAll <- tpsSim()	5.3	6	1	1	1.45	2.93	27.92
powerRaceTPS <- tpsPower()	6.2	9	9	1	2.48	4.87	49.83
powerCC <- ccPower()	6.3	0	9	1	0.38	0.77	7.60

	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	−5.97008	0.02599	−229.722	< 2e-16	***
factor(Age)1	0.59547	0.03118	19.099	< 2e-16	***
factor(Age)2	0.68752	0.03672	18.724	< 2e-16	***
Sex	−0.97979	0.03045	−32.182	< 2e-16	***
Race	0.35097	0.05024	6.986	2.84e-12	***
Sex:Race	−0.22213	0.09362	−2.373	0.0177	*

	Int	Age1	Age2	Sex	Race
CD	0.0	−0.1	−0.1	0.1	0.0
CC	NA	2.6	3.0	2.3	7.1
TPS-WL	−0.1	1.3	0.3	3.4	13.4
TPS-PL	−0.1	0.0	0.5	2.0	6.3
TPS-ML	−0.1	0.0	0.5	2.0	6.3

	Int	Age1	Age2	Sex	Race
CD	95.0	94.7	95.1	94.8	95.1
CC	NA	94.5	94.7	94.5	95.1
TPS-WL	95.1	96.9	96.2	94.5	94.9
TPS-PL	94.8	97.0	95.6	94.5	95.5
TPS-ML	94.8	97.0	95.6	94.5	95.5

	Int	Age1	Age2	Sex	Race
CD	NA	11.0	10.8	11.6	10.0
CC	NA	100.0	100.0	100.0	100.0
TPS-WL	NA	42.9	37.0	104.0	110.7
TPS-PL	NA	30.3	27.3	98.3	99.3
TPS-ML	NA	30.3	27.3	98.3	99.3
