Journal of Survey Statistics and Methodology. 2020 Jan 11;9(3):626–649. doi: 10.1093/jssam/smz054

Oversampling of Minority Populations Through Dual-Frame Surveys

Sixia Chen, Alexander Stubblefield, Julie A Stoner
PMCID: PMC8308969  PMID: 34322557

Abstract

Previous studies have shown disparities in health conditions and behaviors among different ethnic groups. Sampling designs that do not consider oversampling certain minority populations, such as American Indians or African Americans, may not produce sufficient sample sizes for estimating health parameters for minority populations. Oversampling is one of the most common approaches that researchers use to achieve required precision levels for small domain estimation. However, it has not been rigorously investigated in dual-frame survey settings. To take advantage of extra information for minority populations in the Marketing Systems Group database, we propose a novel optimal oversampling strategy that minimizes the domain variance subject to total cost restriction or vice versa. We further extend the method to oversample multiple minorities simultaneously. Empirical study using a population-based community survey shows the benefits of our proposed methods compared with traditional methods in terms of statistical efficiency and cost balance.

Keywords: Minority, Oversampling, Stratification, Telephone survey

1. INTRODUCTION

The dual-frame telephone survey, which uses both cell phone and landline numbers, has been used frequently because of its time and cost efficiency and its high coverage rate. Even though response rates for telephone surveys have declined in recent years, the telephone survey remains an effective data collection tool: it is more cost effective than in-person surveys and has several advantages over mail or email surveys (see Marcus and Crane 1986; Chang and Krosnick 2009; and Szolnoki and Hoffmann 2013, among others). According to the 2016 National Health Interview Survey (NHIS), the estimated coverage rate for either cell phones or landlines in the United States is more than 95 percent. Many surveys implement such a sampling design, including the 2016 Behavioral Risk Factor Surveillance System (BRFSS), the 2016 Tobacco Settlement Endowment Trust Healthy Living Program Community Member Survey (TSET HLP Survey), and the 2015–2016 California Health Interview Survey (CHIS). The corresponding statistical inference for dual-frame surveys has been discussed in Lohr and Rao (2000). Estimators and weight adjustments for dual-frame surveys have been considered in Hartley (1962, 1974), Skinner and Rao (1996), Fuller and Burmeister (1972), Bankier (1986), Kalton and Anderson (1986), and Rao and Wu (2010), among others.

Most existing dual-frame telephone surveys were designed to estimate parameters for the general population. For example, Wolter, Tao, Montgomery, and Smith (2015) discussed optimal allocation for a dual-frame telephone survey under an overall cost constraint. The idea is to minimize the variability subject to the budget constraint or vice versa. Because they only considered estimating parameters for the overall population, their proposed approach cannot be used to target minority populations, such as American Indians. Yet estimating health-related parameters for minority populations is crucial in public health research, since estimates may differ for these populations compared with the general population. For instance, Espey, Jim, Cobb, Bartholomew, Becker, et al. (2014) and Mowery, Dube, Thorne, Garrett, Homa, et al. (2015) showed that American Indians and Alaska Natives have a higher risk of experiencing tobacco-related disease and death due to high prevalence of cigarette smoking and other commercial tobacco use. The 2014 Minnesota Survey on Adult Substance Use (MSASU; Helba, Love, Wivagg, Frissell, Lee, et al. 2015) is one of the few dual-frame random digit dialing surveys that has oversampled certain minority populations, such as African American, Asian American, and American Indian populations. However, an ad hoc manual adjustment procedure was used to oversample these populations.

Several methods for oversampling rare populations have been proposed (see Kalton and Anderson 1986; Kalsbeek 2003; Elliot, Finch, Klein, Sai, Phuong Do, et al. 2008; Kalton 2009; Tourangeau, Edwards, Johnson, Wolter, and Bates 2014; and Chen and Kalton 2015). Chen and Kalton (2015) proposed an optimal sampling design for oversampling single or multiple rare populations simultaneously. These researchers examined the methods for achieving oversampling by increasing the sampling fractions for phone numbers from areas with a greater prevalence of the rare populations of interest. Other forms of oversampling using this method were performed in the 2003 US National Assessment of Adult Literacy to oversample the African American population (Mohadjer and Krenzke 2009) and in the US National Health and Nutrition Examination Survey (NHANES) to oversample several race/ethnicity populations (Curtin, Mohadjer, Dohrmann, Kruszon-Moran, Mirel, et al. 2013).

Despite the well-established theory and practice of dual-frame telephone surveying and of oversampling rare domains, to the best of our knowledge, there has been little rigorous discussion of the combination of these two methods. Even though dual-frame optimization for area and telephone surveys has been considered in the sample design report of the 1997 and 1999 National Survey of America’s Families (NSAF; Judkins, Brick, Broene, Ferraro, and Strickler 2001), the literature lacks a rigorously justified analysis of the optimal allocation for single and multiple minorities in a dual-frame telephone survey. The method of optimal allocation detailed in this article presents a novel approach to improve the precision of estimators for rare populations using oversampling for dual-frame telephone surveys. We propose using constrained optimization to achieve this goal. Given that a dual-frame approach is becoming increasingly standard for telephone surveys and that more precise estimators of rare populations are often needed, this method could contribute significantly to future survey design.

Section 2 introduces relevant mathematical notations and problems. Section 3 presents our proposed approach for oversampling a single minority using a dual-frame design. The corresponding design effect due to oversampling and stratification is contained in section 4. We then discuss oversampling multiple minorities simultaneously in section 5. Section 6 contains a real application of this allocation approach to the TSET HLP Survey. In section 7, we conclude the article with final remarks and discussion. All technical proofs are contained in the appendix.

2. NOTATIONS

We consider only dual-frame telephone surveys in the following sections, even though our proposed approach can work for any dual-frame sampling design. Denote $A$ and $B$ as the landline and cell phone frames, $Y$ as the study variable of interest, and $D$ as the domain of interest. Population-level notation, including populations, domain populations, population sizes, domain population sizes, domain totals, domain means, and domain variances, is defined in table 1. Let $y_i$ be the study variable of interest for unit $i$ and $D_i$ be the domain indicator for domain $D$, such that $D_i = 1$ if unit $i$ is in the domain $D$ and zero otherwise. Assume the screening costs per interview for frame A and frame B are $c_A$ and $c_B$ and the full interview costs per interview are $c_A^*$ and $c_B^*$. The total budget is denoted as $C$. The parameter of interest is the population domain mean $\theta_D = N_D^{-1} \sum_{i \in U_D} y_i$.

Table 1.

Notations for Different Populations

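| Category | Population | Domain | Population size | Domain size | Domain total | Domain mean | Domain variance |
|---|---|---|---|---|---|---|---|
| Overall | $U$ | $U_D$ | $N$ | $N_D$ | $Y_D$ | $\bar{Y}_D$ | $S_D^2$ |
| LL | $U_A$ | $U_{A,D}$ | $N_A$ | $N_{A,D}$ | $Y_{A,D}$ | $\bar{Y}_{A,D}$ | $S_{A,D}^2$ |
| Cell | $U_B$ | $U_{B,D}$ | $N_B$ | $N_{B,D}$ | $Y_{B,D}$ | $\bar{Y}_{B,D}$ | $S_{B,D}^2$ |
| LL only | $U_a$ | $U_{a,D}$ | $N_a$ | $N_{a,D}$ | $Y_{a,D}$ | $\bar{Y}_{a,D}$ | $S_{a,D}^2$ |
| LL dual | $U_{ab}$ | $U_{ab,D}$ | $N_{ab}$ | $N_{ab,D}$ | $Y_{ab,D}$ | $\bar{Y}_{ab,D}$ | $S_{ab,D}^2$ |
| Cell dual | $U_{ba}$ | $U_{ba,D}$ | $N_{ba}$ | $N_{ba,D}$ | $Y_{ba,D}$ | $\bar{Y}_{ba,D}$ | $S_{ba,D}^2$ |
| Cell only | $U_b$ | $U_{b,D}$ | $N_b$ | $N_{b,D}$ | $Y_{b,D}$ | $\bar{Y}_{b,D}$ | $S_{b,D}^2$ |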

Note.— Pop, population; Dom, domain; LL, landline.

3. PROPOSED METHOD

Denote the oversampling strata for frame A as $h_A = 1, 2, \ldots, H_A$ and the oversampling strata for frame B as $h_B = 1, 2, \ldots, H_B$. Such oversampling strata can be constructed using the cumulative root frequency rule developed by Dalenius (1957) and population-level aggregated information obtained from the Marketing Systems Group company. Specifically, the Marketing Systems Group produced rate center–level ethnicity information for the cell phone frame and six-digit group-level ethnicity information for the landline frame. Let the sample sizes and population sizes for strata $h_A$ and $h_B$ be $n_{A,h_A}$, $n_{B,h_B}$, $N_{A,h_A}$, and $N_{B,h_B}$. Stratified simple random sampling without replacement is applied to both frames. Write the corresponding samples for different frames and strata as $s_{A,h_A}$ and $s_{B,h_B}$ for $h_A = 1, \ldots, H_A$ and $h_B = 1, \ldots, H_B$. The consistent weighted estimator for $\theta_D$ can be written as

$$\hat{\theta}_D = \frac{1}{\hat{N}_D}\left(\hat{Y}_{a,D} + p_D \hat{Y}_{ab,D} + q_D \hat{Y}_{ba,D} + \hat{Y}_{b,D}\right), \tag{1}$$

where $p_D$ and $q_D$ depend on $D$ with $p_D + q_D = 1$; $\hat{Y}_{a,D}$, $\hat{Y}_{ab,D}$, $\hat{Y}_{ba,D}$, and $\hat{Y}_{b,D}$ are the consistent estimators for the corresponding population domain totals; and $\hat{N}_D$ is the consistent estimator of $N_D$. They can be written as

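$$\begin{aligned}
\hat{Y}_{a,D} &= \sum_{h_A=1}^{H_A} \frac{N_{A,h_A}}{n_{A,h_A}} \sum_{i \in s_{A,h_A}} a_i D_i y_i, &
\hat{Y}_{ab,D} &= \sum_{h_A=1}^{H_A} \frac{N_{A,h_A}}{n_{A,h_A}} \sum_{i \in s_{A,h_A}} ab_i D_i y_i, \\
\hat{Y}_{b,D} &= \sum_{h_B=1}^{H_B} \frac{N_{B,h_B}}{n_{B,h_B}} \sum_{i \in s_{B,h_B}} b_i D_i y_i, &
\hat{Y}_{ba,D} &= \sum_{h_B=1}^{H_B} \frac{N_{B,h_B}}{n_{B,h_B}} \sum_{i \in s_{B,h_B}} ba_i D_i y_i,
\end{aligned}$$

and

$$\begin{aligned}
\hat{N}_D ={}& \sum_{h_A=1}^{H_A} \frac{N_{A,h_A}}{n_{A,h_A}} \sum_{i \in s_{A,h_A}} a_i D_i + p_D \sum_{h_A=1}^{H_A} \frac{N_{A,h_A}}{n_{A,h_A}} \sum_{i \in s_{A,h_A}} ab_i D_i \\
&+ q_D \sum_{h_B=1}^{H_B} \frac{N_{B,h_B}}{n_{B,h_B}} \sum_{i \in s_{B,h_B}} ba_i D_i + \sum_{h_B=1}^{H_B} \frac{N_{B,h_B}}{n_{B,h_B}} \sum_{i \in s_{B,h_B}} b_i D_i,
\end{aligned}$$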

where $a_i$, $ab_i$, $b_i$, and $ba_i$ are the corresponding indicator variables (zero or one) for $U_a$, $U_{ab}$, $U_b$, and $U_{ba}$. The optimal choices of $p_D$ and $q_D$ were discussed in Hartley (1962, 1974) and Skinner and Rao (1996), among others. By using Taylor linearization and after some algebra (see Appendix A), we have

$$V(\hat{\theta}_D) = \frac{1}{N_D^2}\left(\sum_{h_A=1}^{H_A} \frac{N_{A,h_A}^2 E_{A,h_A}}{n_{A,h_A}} + \sum_{h_B=1}^{H_B} \frac{N_{B,h_B}^2 E_{B,h_B}}{n_{B,h_B}}\right) + o(n^{-1}), \tag{2}$$

where $n = \min(n_A, n_B)$,

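$$\begin{aligned}
E_{A,h_A} ={}& \frac{N_{A,h_A,a,D}}{N_{A,h_A}} S_{A,h_A,a,D}^2 + p_D^2 \frac{N_{A,h_A,ab,D}}{N_{A,h_A}} S_{A,h_A,ab,D}^2 \\
&+ \frac{N_{A,h_A,a,D}}{N_{A,h_A}} (\bar{Y}_{A,h_A,a,D} - \theta_D)^2 + p_D^2 \frac{N_{A,h_A,ab,D}}{N_{A,h_A}} (\bar{Y}_{A,h_A,ab,D} - \theta_D)^2 \\
&- \left\{\frac{N_{A,h_A,a,D}}{N_{A,h_A}} (\bar{Y}_{A,h_A,a,D} - \theta_D) + p_D \frac{N_{A,h_A,ab,D}}{N_{A,h_A}} (\bar{Y}_{A,h_A,ab,D} - \theta_D)\right\}^2,
\end{aligned}$$

and

$$\begin{aligned}
E_{B,h_B} ={}& \frac{N_{B,h_B,b,D}}{N_{B,h_B}} S_{B,h_B,b,D}^2 + q_D^2 \frac{N_{B,h_B,ba,D}}{N_{B,h_B}} S_{B,h_B,ba,D}^2 \\
&+ \frac{N_{B,h_B,b,D}}{N_{B,h_B}} (\bar{Y}_{B,h_B,b,D} - \theta_D)^2 + q_D^2 \frac{N_{B,h_B,ba,D}}{N_{B,h_B}} (\bar{Y}_{B,h_B,ba,D} - \theta_D)^2 \\
&- \left\{\frac{N_{B,h_B,b,D}}{N_{B,h_B}} (\bar{Y}_{B,h_B,b,D} - \theta_D) + q_D \frac{N_{B,h_B,ba,D}}{N_{B,h_B}} (\bar{Y}_{B,h_B,ba,D} - \theta_D)\right\}^2,
\end{aligned}$$

where $N_{A,h_A,a,D}$, $N_{A,h_A,ab,D}$, $N_{B,h_B,b,D}$, and $N_{B,h_B,ba,D}$ are the corresponding population sizes for domain $D$ in subpopulations; $S_{A,h_A,a,D}^2$, $S_{A,h_A,ab,D}^2$, $S_{B,h_B,b,D}^2$, and $S_{B,h_B,ba,D}^2$ are the corresponding domain population variances; and $\bar{Y}_{A,h_A,a,D}$, $\bar{Y}_{A,h_A,ab,D}$, $\bar{Y}_{B,h_B,b,D}$, and $\bar{Y}_{B,h_B,ba,D}$ are the corresponding domain population means. The total cost can be written as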

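$$C = \sum_{h_A=1}^{H_A} n_{A,h_A}\left\{P_{A,h_A} c_A^* + (1 - P_{A,h_A}) c_A\right\} + \sum_{h_B=1}^{H_B} n_{B,h_B}\left\{P_{B,h_B} c_B^* + (1 - P_{B,h_B}) c_B\right\}, \tag{3}$$

with $P_{A,h_A}$ and $P_{B,h_B}$ as the stratum prevalence for domain $D$. By minimizing $V(\hat{\theta}_D)$ in (2) subject to cost constraint (3), it can be shown that

$$f_{A,h_A,\mathrm{opt}} = K \sqrt{\frac{E_{A,h_A}}{P_{A,h_A}(c_A^* - c_A) + c_A}}, \qquad f_{B,h_B,\mathrm{opt}} = K \sqrt{\frac{E_{B,h_B}}{P_{B,h_B}(c_B^* - c_B) + c_B}}, \tag{4}$$

where $f_{A,h_A,\mathrm{opt}} = n_{A,h_A,\mathrm{opt}}/N_{A,h_A}$ and $f_{B,h_B,\mathrm{opt}} = n_{B,h_B,\mathrm{opt}}/N_{B,h_B}$.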
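The allocation in (4) can be sketched numerically as follows. This is a minimal illustration rather than the authors' implementation: the dictionary layout, the function names, and all numeric inputs are hypothetical, strata from both frames are pooled in one list, and the constant $K$ is assumed to be fixed by spending the budget in (3) exactly.

```python
import math

def unit_cost(P, c_full, c_screen):
    """Expected cost per sampled number, P*c_full + (1 - P)*c_screen, as in (3)."""
    return P * c_full + (1.0 - P) * c_screen

def optimal_fractions(strata, budget):
    """Sampling fractions f_h = K * sqrt(E_h / unit_cost_h), following (4).

    K is chosen so that the total cost (3) equals the budget (an assumption
    of this sketch).
    """
    costs = [unit_cost(s["P"], s["c_full"], s["c_screen"]) for s in strata]
    shape = [math.sqrt(s["E"] / c) for s, c in zip(strata, costs)]
    K = budget / sum(s["N"] * g * c for s, g, c in zip(strata, shape, costs))
    return [K * g for g in shape]

# Hypothetical inputs: a high-prevalence and a low-prevalence stratum.
strata = [
    {"N": 100_000, "E": 0.020, "P": 0.15, "c_full": 1.0, "c_screen": 0.2},
    {"N": 400_000, "E": 0.005, "P": 0.03, "c_full": 1.0, "c_screen": 0.2},
]
budget = 5000.0
fractions = optimal_fractions(strata, budget)
spent = sum(
    f * s["N"] * unit_cost(s["P"], s["c_full"], s["c_screen"])
    for f, s in zip(fractions, strata)
)
```

With these made-up inputs the high-prevalence stratum receives the larger sampling fraction, and `spent` reproduces the budget.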

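Remark 3.1 If we assume $S_{A,h_A,a,D}^2 = S_{A,h_A,ab,D}^2 = S_{B,h_B,b,D}^2 = S_{B,h_B,ba,D}^2 = S_D^2$ and $\bar{Y}_{A,h_A,a,D} = \bar{Y}_{A,h_A,ab,D} = \bar{Y}_{B,h_B,b,D} = \bar{Y}_{B,h_B,ba,D} = \theta_D$, then (4) can be simplified as

$$f_{A,h_A,\mathrm{opt}} = \tilde{K} \sqrt{\frac{P_{A,h_A}\left(Q_{A,h_A,a} + p_D^2 Q_{A,h_A,ab}\right)}{P_{A,h_A}(c_A^* - c_A) + c_A}},$$

and

$$f_{B,h_B,\mathrm{opt}} = \tilde{K} \sqrt{\frac{P_{B,h_B}\left(Q_{B,h_B,b} + q_D^2 Q_{B,h_B,ba}\right)}{P_{B,h_B}(c_B^* - c_B) + c_B}},$$

where $Q_{A,h_A,a} = N_{A,h_A,a,D} N_{A,h_A,D}^{-1}$, $Q_{A,h_A,ab} = N_{A,h_A,ab,D} N_{A,h_A,D}^{-1}$, $Q_{B,h_B,b} = N_{B,h_B,b,D} N_{B,h_B,D}^{-1}$, and $Q_{B,h_B,ba} = N_{B,h_B,ba,D} N_{B,h_B,D}^{-1}$. Therefore, the sampling fractions $f_{A,h_A,\mathrm{opt}}$ and $f_{B,h_B,\mathrm{opt}}$ increase as $P_{A,h_A}$, $Q_{A,h_A,a}$, $Q_{A,h_A,ab}$ and $P_{B,h_B}$, $Q_{B,h_B,b}$, $Q_{B,h_B,ba}$ increase.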

According to Remark 3.1, the sampling fractions $f_{A,h_A,\mathrm{opt}}$ and $f_{B,h_B,\mathrm{opt}}$ increase as the prevalences $P_{A,h_A}$ and $P_{B,h_B}$ increase and as the cost ratios $c_A^*/c_A$ and $c_B^*/c_B$ decrease, consistent with previous literature.
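The behavior described in Remark 3.1 can be checked with a small sketch of the simplified fraction. The function name, the choice $\tilde{K} = 1$, and the numeric inputs are assumptions made for illustration; the comparison below looks at the relative oversampling rate between a high-prevalence and a low-prevalence stratum, with the full interview cost fixed at 1 and the screening cost equal to 1 divided by the cost ratio.

```python
import math

def simplified_fraction(P, Q_only, Q_dual, p, c_full, c_screen, K=1.0):
    """Simplified sampling fraction from Remark 3.1 (landline form; K is arbitrary here)."""
    return K * math.sqrt(
        P * (Q_only + p**2 * Q_dual) / (P * (c_full - c_screen) + c_screen)
    )

def relative_rate(P_high, P_low, cost_ratio, Q_only=0.3, Q_dual=0.7, p=0.5):
    """Ratio of fractions for a high- vs low-prevalence stratum at a given cost ratio."""
    c_full, c_screen = 1.0, 1.0 / cost_ratio
    hi = simplified_fraction(P_high, Q_only, Q_dual, p, c_full, c_screen)
    lo = simplified_fraction(P_low, Q_only, Q_dual, p, c_full, c_screen)
    return hi / lo

# Hypothetical prevalences of 15 percent and 3 percent.
rates = {r: relative_rate(0.15, 0.03, r) for r in (1, 2, 5, 10, 30)}
```

Under these assumptions the relative rate starts at $\sqrt{0.15/0.03}$ when screening costs as much as a full interview and shrinks toward one as the cost ratio grows, so cheap screening leaves less to gain from differential sampling.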

4. DESIGN EFFECT DUE TO STRATIFICATION

Without using oversampling stratification, our proposed optimal allocation in (4) becomes

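$$f_{A,\mathrm{opt}} = K_0 \sqrt{\frac{E_A}{P_A(c_A^* - c_A) + c_A}}, \qquad f_{B,\mathrm{opt}} = K_0 \sqrt{\frac{E_B}{P_B(c_B^* - c_B) + c_B}}, \tag{5}$$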

where

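$$\begin{aligned}
E_A ={}& \frac{N_{A,a,D}}{N_A} S_{A,a,D}^2 + p_D^2 \frac{N_{A,ab,D}}{N_A} S_{A,ab,D}^2 + \frac{N_{A,a,D}}{N_A} (\bar{Y}_{A,a,D} - \theta_D)^2 + p_D^2 \frac{N_{A,ab,D}}{N_A} (\bar{Y}_{A,ab,D} - \theta_D)^2 \\
&- \left\{\frac{N_{A,a,D}}{N_A} (\bar{Y}_{A,a,D} - \theta_D) + p_D \frac{N_{A,ab,D}}{N_A} (\bar{Y}_{A,ab,D} - \theta_D)\right\}^2,
\end{aligned}$$

and

$$\begin{aligned}
E_B ={}& \frac{N_{B,b,D}}{N_B} S_{B,b,D}^2 + q_D^2 \frac{N_{B,ba,D}}{N_B} S_{B,ba,D}^2 + \frac{N_{B,b,D}}{N_B} (\bar{Y}_{B,b,D} - \theta_D)^2 + q_D^2 \frac{N_{B,ba,D}}{N_B} (\bar{Y}_{B,ba,D} - \theta_D)^2 \\
&- \left\{\frac{N_{B,b,D}}{N_B} (\bar{Y}_{B,b,D} - \theta_D) + q_D \frac{N_{B,ba,D}}{N_B} (\bar{Y}_{B,ba,D} - \theta_D)\right\}^2.
\end{aligned}$$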

The optimal allocation in (4) can be treated as an extension of Wolter et al. (2015) with estimation of the domain parameter. Suppose we wish to compare our proposed sampling design without oversampling stratification with our proposed sampling design with oversampling stratification. Assume our target sample size for minority domain $D$ is $n_D^*$. Then we have

$$n_D^* = f_{A,\mathrm{opt}} N_A P_A + f_{B,\mathrm{opt}} N_B P_B.$$

Together with (5), we have

$$K_0 = \left\{N_A P_A \sqrt{\frac{E_A}{P_A(c_A^* - c_A) + c_A}} + N_B P_B \sqrt{\frac{E_B}{P_B(c_B^* - c_B) + c_B}}\right\}^{-1} n_D^*.$$

So, the total cost can be written as

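$$C_0^* = f_{A,\mathrm{opt}}^* N_A \left\{P_A c_A^* + (1 - P_A) c_A\right\} + f_{B,\mathrm{opt}}^* N_B \left\{P_B c_B^* + (1 - P_B) c_B\right\},$$

where $f_{A,\mathrm{opt}}^*$ and $f_{B,\mathrm{opt}}^*$ are defined in (5) by using $K_0$, defined previously. Define the point estimator as $\hat{\theta}_D^{(0)}$ under this design by using the formula in (1) without stratification. The corresponding design variance of $\hat{\theta}_D^{(0)}$ can be written as

$$V_{\mathrm{opt}}(\hat{\theta}_D^{(0)}) = \frac{1}{N_D^2}\left(\frac{N_A^2 E_A}{n_{A,\mathrm{opt}}^*} + \frac{N_B^2 E_B}{n_{B,\mathrm{opt}}^*}\right) + o(n^{-1}), \tag{6}$$

where $n_{A,\mathrm{opt}}^* = f_{A,\mathrm{opt}}^* N_A$ and $n_{B,\mathrm{opt}}^* = f_{B,\mathrm{opt}}^* N_B$. According to (3), (4), and (6), to achieve the same cost, the corresponding value of $K$ can be written as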

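$$K^* = (\Delta_A + \Delta_B)^{-1} C_0^*, \tag{7}$$

with

$$\Delta_A = \sum_{h_A=1}^{H_A} N_{A,h_A} \sqrt{\frac{E_{A,h_A}}{P_{A,h_A}(c_A^* - c_A) + c_A}} \left\{P_{A,h_A} c_A^* + (1 - P_{A,h_A}) c_A\right\},$$

and

$$\Delta_B = \sum_{h_B=1}^{H_B} N_{B,h_B} \sqrt{\frac{E_{B,h_B}}{P_{B,h_B}(c_B^* - c_B) + c_B}} \left\{P_{B,h_B} c_B^* + (1 - P_{B,h_B}) c_B\right\}.$$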

According to (2), (4), and (7), we have

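$$V_{\mathrm{opt}}(\hat{\theta}_D) = \frac{1}{N_D^2}\left(\sum_{h_A=1}^{H_A} \frac{N_{A,h_A}^2 E_{A,h_A}}{n_{A,h_A,\mathrm{opt}}^*} + \sum_{h_B=1}^{H_B} \frac{N_{B,h_B}^2 E_{B,h_B}}{n_{B,h_B,\mathrm{opt}}^*}\right) + o(n^{-1}), \tag{8}$$

where $n_{A,h_A,\mathrm{opt}}^* = N_{A,h_A} f_{A,h_A,\mathrm{opt}}^*$ and $n_{B,h_B,\mathrm{opt}}^* = N_{B,h_B} f_{B,h_B,\mathrm{opt}}^*$, with the fractions obtained from (4) by using $K^*$. Therefore, according to (6) and (8), the corresponding design effect for oversampling stratification is

$$\mathrm{Deff}_s = \frac{V_{\mathrm{opt}}(\hat{\theta}_D)}{V_{\mathrm{opt}}(\hat{\theta}_D^{(0)})} \approx \frac{\sum_{h_A=1}^{H_A} N_{A,h_A}^2 E_{A,h_A} (n_{A,h_A,\mathrm{opt}}^*)^{-1} + \sum_{h_B=1}^{H_B} N_{B,h_B}^2 E_{B,h_B} (n_{B,h_B,\mathrm{opt}}^*)^{-1}}{N_A^2 E_A (n_{A,\mathrm{opt}}^*)^{-1} + N_B^2 E_B (n_{B,\mathrm{opt}}^*)^{-1}}. \tag{9}$$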

The design effect defined in (9) can be used to compare the design efficiency for designs with and without oversampling stratification. Such comparisons are conducted in section 6.
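The chain from the target sample size to the design effect in (9) can be traced with a short sketch. It is illustrative only: the helper names, the dictionary layout, and every numeric input are hypothetical, and the frame-level values of $E$ and $P$ are simply taken as size-weighted averages of the stratum values, which ignores any between-stratum contribution to $E_A$ and $E_B$.

```python
import math

def cell(N, E, P, c_full=1.0, c_screen=0.2):
    """One frame or stratum; all numbers used below are made up."""
    return {"N": N, "E": E, "P": P, "c_full": c_full, "c_screen": c_screen}

def unit_cost(c):
    return c["P"] * c["c_full"] + (1.0 - c["P"]) * c["c_screen"]

def allocate(cells, K):
    """Sample sizes n = K * N * sqrt(E / unit cost), the form shared by (4) and (5)."""
    return [K * c["N"] * math.sqrt(c["E"] / unit_cost(c)) for c in cells]

def cost(cells, n):
    return sum(ni * unit_cost(c) for c, ni in zip(cells, n))

def variance_core(cells, n):
    """Sum of N^2 E / n; the factor 1/N_D^2 cancels in the ratio (9)."""
    return sum(c["N"] ** 2 * c["E"] / ni for c, ni in zip(cells, n))

def design_effect(frames, strata, n_target):
    base0 = allocate(frames, 1.0)
    K0 = n_target / sum(b * f["P"] for b, f in zip(base0, frames))  # from n_D*
    n0 = [K0 * b for b in base0]
    C0 = cost(frames, n0)                      # cost of the unstratified design
    base_s = allocate(strata, 1.0)
    K_star = C0 / cost(strata, base_s)         # equation (7)
    n_s = [K_star * b for b in base_s]
    return variance_core(strata, n_s) / variance_core(frames, n0)

strata = [cell(100_000, 0.020, 0.15), cell(400_000, 0.005, 0.03),   # landline
          cell(200_000, 0.030, 0.20), cell(800_000, 0.004, 0.02)]   # cell
frames = [cell(500_000, 0.0080, 0.054), cell(1_000_000, 0.0092, 0.056)]
deff = design_effect(frames, strata, n_target=1000)
```

For these inputs the ratio falls below one, meaning the stratified design has the smaller variance at equal cost; passing the unstratified frames as the "strata" returns a ratio of one.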

5. EXTENSION TO OVERSAMPLE MULTIPLE MINORITIES

For simplicity, we only consider two minority populations here. However, our proposed approach can be naturally extended to oversample more than two minority populations. Denote the more prevalent minority population as $M$ and the less prevalent minority population as $L$. Similar to Chen and Kalton (2015), we propose first constructing final oversampling strata by crossing the two oversampling strata for $M$ and $L$. For example, if we consider African American as $M$ and American Indian as $L$, then we can first construct two oversampling strata for African American and American Indian separately by using the cumulative root frequency rule developed by Dalenius (1957). The final four oversampling strata can then be constructed by crossing the two kinds of oversampling strata. Waksberg, Judkins, and Massey (1997) and Jewett and Judkins (1988) also discussed multivariate stratification with minority populations. Suppose we first draw $n_{A,h_A}$ cases and $n_{B,h_B}$ cases in stratum $h_A$ of frame A and stratum $h_B$ of frame B by using simple random sampling without replacement (SRSWOR), then keep all sampled cases in $L$ and draw a second-phase sample in $M$ by using stratified simple random sampling without replacement with sampling fractions $f_{A,h_A}^{(M)}$ and $f_{B,h_B}^{(M)}$. The total cost can be written as

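$$\begin{aligned}
C ={}& \sum_{h_A=1}^{H_A} n_{A,h_A}\left\{P_{A,h_A}^{(L)} c_A^* + f_{A,h_A}^{(M)} P_{A,h_A}^{(M)} c_A^* + c_A\left(1 - P_{A,h_A}^{(L)} - f_{A,h_A}^{(M)} P_{A,h_A}^{(M)}\right)\right\} \\
&+ \sum_{h_B=1}^{H_B} n_{B,h_B}\left\{P_{B,h_B}^{(L)} c_B^* + f_{B,h_B}^{(M)} P_{B,h_B}^{(M)} c_B^* + c_B\left(1 - P_{B,h_B}^{(L)} - f_{B,h_B}^{(M)} P_{B,h_B}^{(M)}\right)\right\},
\end{aligned} \tag{10}$$

where $P_{A,h_A}^{(L)}$, $P_{A,h_A}^{(M)}$, $P_{B,h_B}^{(L)}$, and $P_{B,h_B}^{(M)}$ are the population prevalences of the corresponding minority and stratum. Suppose we are interested in estimating two minority population means $\theta_D^{(L)} = (N_D^{(L)})^{-1} \sum_{i \in U_D^{(L)}} y_i$ and $\theta_D^{(M)} = (N_D^{(M)})^{-1} \sum_{i \in U_D^{(M)}} y_i$, where $N_D^{(L)}$ and $N_D^{(M)}$ are the population sizes for the two minority populations $L$ and $M$, and $U_D^{(L)}$ and $U_D^{(M)}$ denote the universes for $L$ and $M$. Our proposed sample sizes $n_{A,h_A}$, $n_{B,h_B}$ and subsampling fractions $f_{A,h_A}^{(M)}$ and $f_{B,h_B}^{(M)}$ can be obtained by minimizing the total cost (10) subject to the following precision requirements for the estimates of the two minority population parameters

$$\mathrm{CV}(\hat{\theta}_D^{(L)}) \le \alpha^{(L)}, \qquad \mathrm{CV}(\hat{\theta}_D^{(M)}) \le \alpha^{(M)}, \tag{11}$$

where $\mathrm{CV}(\hat{\theta}) = V^{1/2}(\hat{\theta})/\theta$ is the coefficient of variation of $\hat{\theta}$, and $\alpha^{(L)}$ and $\alpha^{(M)}$ are the prespecified thresholds. In practice, $\theta$ is unknown, and one can use weighted estimates obtained from a previous or similar survey. According to the derivations in appendix B, by using the Lagrange multiplier approach, we have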

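$$n_{A,h_A} = \frac{N_{A,h_A}}{N_D^{(L)}} \sqrt{\frac{\lambda_1 E_{A,h_A}^{(L)}}{(c_A^* - c_A) P_{A,h_A}^{(L)} + c_A}}, \qquad n_{B,h_B} = \frac{N_{B,h_B}}{N_D^{(L)}} \sqrt{\frac{\lambda_1 E_{B,h_B}^{(L)}}{(c_B^* - c_B) P_{B,h_B}^{(L)} + c_B}}, \tag{12}$$

$$f_{A,h_A}^{(M)} = \frac{N_D^{(L)}}{N_D^{(M)}} \sqrt{\frac{\lambda_2 E_{A,h_A}^{(M)} \left\{P_{A,h_A}^{(L)} + c_A (c_A^* - c_A)^{-1}\right\}}{\lambda_1 P_{A,h_A}^{(M)} E_{A,h_A}^{(L)}}}, \tag{13}$$

and

$$f_{B,h_B}^{(M)} = \frac{N_D^{(L)}}{N_D^{(M)}} \sqrt{\frac{\lambda_2 E_{B,h_B}^{(M)} \left\{P_{B,h_B}^{(L)} + c_B (c_B^* - c_B)^{-1}\right\}}{\lambda_1 P_{B,h_B}^{(M)} E_{B,h_B}^{(L)}}}, \tag{14}$$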

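where $E_{A,h_A}^{(L)}$ and $E_{B,h_B}^{(L)}$ are defined in section 3, and $\lambda_1$ and $\lambda_2$ are the Lagrange multipliers. From (12), we find that the sample sizes $n_{A,h_A}$ and $n_{B,h_B}$ decrease as the cost ratios $c_A^*/c_A$ and $c_B^*/c_B$ increase and increase as the prevalences $P_{A,h_A}^{(L)}$ and $P_{B,h_B}^{(L)}$ increase. According to (13) and (14), the subsampling fractions $f_{A,h_A}^{(M)}$ and $f_{B,h_B}^{(M)}$ increase as the population ratio $N_D^{(L)}/N_D^{(M)}$ increases and as $P_{A,h_A}^{(L)}$ and $P_{B,h_B}^{(L)}$ increase or $P_{A,h_A}^{(M)}$ and $P_{B,h_B}^{(M)}$ decrease. The Lagrange multipliers $\lambda_1$ and $\lambda_2$ can be obtained by plugging (12)–(14) into (11) and solving the two equations. If $f_{A,h_A}^{(M)} \ge 1$, then we set $f_{A,h_A}^{(M)} = 1$ and redo the Lagrange multiplier procedure to obtain

$$n_{A,h_A} = N_{A,h_A} \sqrt{\frac{\lambda_1 E_{A,h_A}^{(L)} (N_D^{(L)})^{-2} + \lambda_2 E_{A,h_A}^{(M)} (N_D^{(M)})^{-2}}{\left(P_{A,h_A}^{(L)} + P_{A,h_A}^{(M)}\right)(c_A^* - c_A) + c_A}},$$

and $n_{A,h_A}$ and $f_{A,h_A}^{(M)}$ for $f_{A,h_A}^{(M)} < 1$ remain the same. Similarly, if $f_{B,h_B}^{(M)} \ge 1$, then we set $f_{B,h_B}^{(M)} = 1$ and redo the Lagrange multiplier procedure to obtain

$$n_{B,h_B} = N_{B,h_B} \sqrt{\frac{\lambda_1 E_{B,h_B}^{(L)} (N_D^{(L)})^{-2} + \lambda_2 E_{B,h_B}^{(M)} (N_D^{(M)})^{-2}}{\left(P_{B,h_B}^{(L)} + P_{B,h_B}^{(M)}\right)(c_B^* - c_B) + c_B}}.$$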

Then $\lambda_1$ and $\lambda_2$ can be obtained by solving the equations in (11). The corresponding total cost can be obtained by plugging $n_{A,h_A}$, $n_{B,h_B}$, $f_{A,h_A}^{(M)}$, and $f_{B,h_B}^{(M)}$ into (10). Therefore, our proposed oversampling design can be compared with other designs in terms of total cost. Such comparisons are conducted in section 6.
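A rough sketch of the two-minority allocation follows. It evaluates (12)–(14), the capped formula, and the cost (10) for given multipliers, with strata from both frames pooled in one list. The closed-form multipliers in `uncapped_multipliers` are not quoted from the article: they are what one obtains by substituting (12)–(14) into the binding constraints (11) when no subsampling fraction reaches one, and should be read as an assumption of this sketch. All names and numbers are hypothetical.

```python
import math

def allocate_two_minorities(strata, N_L, N_M, lam1, lam2):
    """(n_h, f_h) per stratum from (12)-(14); the capped formula is used when f_h >= 1."""
    result = []
    for s in strata:
        d = s["c_full"] - s["c_screen"]  # assumed strictly positive
        n = s["N"] / N_L * math.sqrt(lam1 * s["E_L"] / (d * s["P_L"] + s["c_screen"]))
        f = (N_L / N_M) * math.sqrt(
            lam2 * s["E_M"] * (s["P_L"] + s["c_screen"] / d)
            / (lam1 * s["P_M"] * s["E_L"]))
        if f >= 1.0:
            f = 1.0
            n = s["N"] * math.sqrt(
                (lam1 * s["E_L"] / N_L**2 + lam2 * s["E_M"] / N_M**2)
                / ((s["P_L"] + s["P_M"]) * d + s["c_screen"]))
        result.append((n, f))
    return result

def total_cost(strata, alloc):
    """Total cost (10)."""
    return sum(
        n * (s["P_L"] * s["c_full"] + f * s["P_M"] * s["c_full"]
             + s["c_screen"] * (1.0 - s["P_L"] - f * s["P_M"]))
        for s, (n, f) in zip(strata, alloc))

def variances(strata, alloc, N_L, N_M):
    """Leading terms of V for L and M, as written in appendix B."""
    v_L = sum(s["N"]**2 * s["E_L"] / n for s, (n, f) in zip(strata, alloc)) / N_L**2
    v_M = sum(s["N"]**2 * s["E_M"] / (n * f) for s, (n, f) in zip(strata, alloc)) / N_M**2
    return v_L, v_M

def uncapped_multipliers(strata, N_L, N_M, C_L, C_M):
    """Multipliers that make both constraints binding when every f_h < 1 (sketch assumption)."""
    r1 = sum(s["N"] * math.sqrt(
        s["E_L"] * ((s["c_full"] - s["c_screen"]) * s["P_L"] + s["c_screen"]))
        for s in strata) / (N_L * C_L)
    r2 = sum(s["N"] * math.sqrt(
        s["E_M"] * s["P_M"] * (s["c_full"] - s["c_screen"]))
        for s in strata) / (N_M * C_M)
    return r1**2, r2**2

# Hypothetical strata; E values are made up, not estimated from any survey.
strata = [
    {"N": 100_000, "P_L": 0.10, "P_M": 0.12, "E_L": 0.0160, "E_M": 0.019,
     "c_full": 1.0, "c_screen": 0.2},
    {"N": 900_000, "P_L": 0.02, "P_M": 0.05, "E_L": 0.0032, "E_M": 0.008,
     "c_full": 1.0, "c_screen": 0.2},
]
N_L = sum(s["N"] * s["P_L"] for s in strata)
N_M = sum(s["N"] * s["P_M"] for s in strata)
C_L = C_M = (0.05 * 0.8) ** 2          # (alpha * theta)^2 with theta = 0.8 assumed
lam1, lam2 = uncapped_multipliers(strata, N_L, N_M, C_L, C_M)
alloc = allocate_two_minorities(strata, N_L, N_M, lam1, lam2)
v_L, v_M = variances(strata, alloc, N_L, N_M)
cost_total = total_cost(strata, alloc)
```

If any stratum were capped at a subsampling fraction of one, the multipliers would have to be re-solved numerically, as the text describes; the closed form above no longer applies in that case.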

6. REAL APPLICATION

We consider two scenarios in this section. Scenario one (section 6.1) includes a comparison between the stratified optimal and unstratified suboptimal sampling designs described in sections 3 and 4 for oversampling a single minority population. Scenario two (section 6.2) contains a comparison between the stratified optimal and unstratified suboptimal sampling designs for oversampling multiple minority populations simultaneously described in section 5.

Data are taken from the TSET HLP Survey, which is a dual-frame (landline and cell) random digit dialing (RDD) survey of adults living in Oklahoma. The survey was designed to gather information on Oklahomans’ knowledge, attitudes, and behaviors regarding physical activity, nutrition, tobacco use, and overall wellness and identify descriptive norms related to these health topics. The final number of completed surveys is about 4,500. The original sampling design of the survey is stratified simple random sampling without replacement by using frame (landline and cell) as the strata. The survey did not consider oversampling minority populations. Final weights were created by taking into account probability sampling, combining landline with cell frames, nonresponse adjustment, raking, and trimming. We use the survey data to estimate unknown quantities defined in previous sections for comparison purposes. In addition to the survey data, we used population-level aggregated data files obtained from the Marketing Systems Group company. Specifically, rate center–level ethnicity information for the cell phone frame and six-digit group-level ethnicity information for the landline frame were used to create oversampling strata.

6.1 Oversampling a Single Minority Population

This section compares the design effect of using stratified oversampling of a single minority population in a dual-frame telephone survey as opposed to a nonstratified sampling approach. We focus on the American Indian and African American populations separately in this application. We have created two different designs, one for each of these populations. Design effects were calculated by using our proposed methods in sections 3 and 4 to compare the variance of the stratified oversampling method with the nonstratified method.

We consider the following two study variables of interest. The first is the response to the question, “Do you believe that being overweight or obese can cause Type 2 Diabetes?” The study variable $Y_1 = 1$ if the answer is “yes” and zero if the answer is “no.” The second is the response to the question, “Do you believe that being overweight or obese can cause heart disease?” The study variable $Y_2 = 1$ if the answer is “yes” and zero if the answer is “no.” Any missing responses for either of the questions were imputed by using gender, age group, race, phone type, household income, and marriage status as covariates.

By using the population-level aggregated ethnicity information provided by Marketing Systems Group, we establish three strata for each race. Stratum one consists of those rate centers or six-digit groups with a relatively high density of the minority population. Stratum two consists of those rate centers or six-digit groups with a relatively moderate density of that minority population. Stratum three consists of those rate centers or six-digit groups with a low density of that minority population. The strata levels were determined by using the cumulative root frequency rule developed by Dalenius (1957) and population-level aggregated ethnicity information. The density cutoffs for strata were 5 percent and 13 percent for the American Indian population and 4 percent and 14 percent for the African American population. The prevalence of African Americans was about 7.56 percent, and the prevalence of American Indians was about 6.63 percent.
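The cumulative root frequency rule used to set these cutoffs can be sketched as below. The article does not describe its binning, so the number of classes (`n_bins`), the equal-width classes, and the synthetic density values are assumptions; the idea is to cumulate the square roots of the class frequencies and cut at equal steps of that cumulative total.

```python
import math

def cum_root_freq_cutoffs(values, n_strata, n_bins=50):
    """Stratum boundaries from the cumulative square-root-of-frequency rule.

    values: stratification variable (e.g., minority density per rate center).
    Returns n_strata - 1 cutoffs placed at upper edges of equal-width classes.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins          # assumes hi > lo
    counts = [0] * n_bins
    for v in values:
        k = min(int((v - lo) / width), n_bins - 1)
        counts[k] += 1
    cum, total = [], 0.0
    for c in counts:
        total += math.sqrt(c)
        cum.append(total)
    cutoffs = []
    for j in range(1, n_strata):
        target = total * j / n_strata
        k = next(i for i, c in enumerate(cum) if c >= target)
        cutoffs.append(lo + (k + 1) * width)
    return cutoffs

# Synthetic, right-skewed densities between 0 and 0.4 (illustrative only).
densities = [0.4 * (i / 999) ** 3 for i in range(1000)]
cut_low, cut_high = cum_root_freq_cutoffs(densities, n_strata=3)
```

Units with density below `cut_low` would form the low-density stratum, those between the two cutoffs the moderate-density stratum, and the rest the high-density stratum.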

Without loss of generality, we assume the costs of a complete full interview with landline and cell are $c_A^* = 1$ and $c_B^* = 1$. We consider seven different cost structures for the ratio of the cost of the full interview to the cost of initial screening: $c_A^*/c_A = c_B^*/c_B = 1, 2, 3, 5, 10, 20,$ and $30$. Assume the desired number of completes for the minority population is $n_D^* = 1{,}000$. The corresponding design effects are calculated by using the formulas described in section 4. The results are presented in tables 2 and 3 for the first and second study variables with different cost ratios. For both study variables and both ethnicities, the design effects are below one and increase toward one as the cost ratio increases, so oversampling stratification is most beneficial when the cost ratio is low. The strength of oversampling stratification also depends on the study variable. For instance, the design effects for study variable one are generally less than those for study variable two.

Table 2.

Ratios of the Sampling Variances of the Designs With and Without Oversampling Stratification by Different Ethnicity for “Type 2 Diabetes” Study Variable

| Cost ratio | American Indians | African Americans |
|---|---|---|
| 1 | 0.684 | 0.700 |
| 2 | 0.699 | 0.715 |
| 3 | 0.711 | 0.728 |
| 5 | 0.732 | 0.747 |
| 10 | 0.766 | 0.778 |
| 20 | 0.801 | 0.807 |
| 30 | 0.820 | 0.820 |

Table 3.

Ratios of the Sampling Variances of the Designs With and Without Oversampling Stratification by Different Ethnicity for “Heart Disease” Study Variable

| Cost ratio | American Indians | African Americans |
|---|---|---|
| 1 | 0.898 | 0.928 |
| 2 | 0.908 | 0.935 |
| 3 | 0.916 | 0.941 |
| 5 | 0.928 | 0.949 |
| 10 | 0.946 | 0.962 |
| 20 | 0.960 | 0.972 |
| 30 | 0.965 | 0.976 |

6.2 Oversampling Multiple Minority Populations

In this section, we compare the total costs between designs with and without oversampling stratification in terms of targeting multiple minority populations simultaneously. The TSET HLP Survey was again used, and the two minorities oversampled were African Americans and American Indians. African Americans were the more prevalent minority, representing 7.56 percent of the population, compared with American Indians, who represented 6.63 percent of the population.

This time, one cutoff per minority divided each minority group into two strata, creating a total of four strata combinations. Strata were again determined using the cumulative root frequency rule. The density cutoff for the American Indian population was set at 9 percent, and the density cutoff for the African American population was set at 7 percent. Thus, the following four strata were established: high prevalence of both American Indians (greater than 9 percent) and African Americans (greater than 7 percent), high prevalence of American Indians (greater than 9 percent) and low prevalence of African Americans (less than 7 percent), low prevalence of American Indians (less than 9 percent) and high prevalence of African Americans (greater than 7 percent), and low prevalence of both (less than 9 percent American Indians and less than 7 percent African Americans). Suppose the study variable of interest is again $Y_1$: “Being overweight or obese can cause Type 2 Diabetes?” Missing responses were imputed by using gender, age group, race, phone type, household income, and marriage status as covariates.

We assume a cost structure $c_A^* = 1$ and $c_B = 2$, with $c = c_A^*(c_A)^{-1} = c_B^*(c_B)^{-1}$, with $c$ taking the values 2, 3, 5, 10, 20, and 30. For simplicity, we set $p = q = 0.5$, where $p$ and $q$ are defined in (1). We assume $\alpha^{(L)} = \alpha^{(M)} = 0.05$. We followed the method described in section 5 to determine the total cost for both the stratified oversampling and nonstratified approaches. The ratios of the costs for the stratified to nonstratified designs are presented in table 4. The total costs based on our proposed approach with oversampling stratification are about 0.83 to 0.88 times those for the design without oversampling stratification. The benefits of stratification decrease as the cost ratio increases, which is consistent with previous findings.

Table 4.

Ratios of the Total Costs With and Without Oversampling Stratification for Multiple Minorities by Different Cost Structures for “Type 2 Diabetes” Study Variable

| Cost ratio c | Stratified cost | Nonstratified cost | Ratio of total costs |
|---|---|---|---|
| 2 | 118.931 | 143.161 | 0.831 |
| 3 | 88.385 | 105.251 | 0.840 |
| 5 | 63.718 | 74.822 | 0.852 |
| 10 | 44.984 | 51.894 | 0.867 |
| 20 | 35.460 | 40.369 | 0.878 |
| 30 | 32.238 | 36.513 | 0.883 |
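As a quick arithmetic check, the last column of table 4 can be recomputed from the two cost columns. The snippet below only restates the published numbers; the variable names are illustrative.

```python
# (cost ratio c, stratified cost, nonstratified cost, reported ratio) from table 4
table4 = [
    (2, 118.931, 143.161, 0.831),
    (3, 88.385, 105.251, 0.840),
    (5, 63.718, 74.822, 0.852),
    (10, 44.984, 51.894, 0.867),
    (20, 35.460, 40.369, 0.878),
    (30, 32.238, 36.513, 0.883),
]

# Recompute stratified / nonstratified and round to three decimals.
recomputed = [round(strat / nonstrat, 3) for _, strat, nonstrat, _ in table4]
reported = [ratio for _, _, _, ratio in table4]
savings_percent = [round(100 * (1 - r), 1) for r in recomputed]
```

The recomputed ratios agree with the reported column, and `savings_percent` expresses the same information as the percentage cost reduction from stratification at each cost ratio.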

7. CONCLUDING REMARKS

In this article, we propose a novel oversampling procedure for targeting one or multiple minority populations with a dual-frame sampling design. One application of our proposed approach is the dual-frame random digit dialing (RDD) telephone survey. We derived theoretical formulas for optimal allocation, including the corresponding sample sizes and subsampling fractions for all strata. We applied our proposed approach to the TSET HLP Survey and demonstrated its benefits compared with designs without oversampling stratification. Weighting and statistical analysis based on our sampling design can be conducted by using techniques similar to those described in the dual-frame research literature (Lohr and Rao, 2000; Lohr, 2011). Fahimi and Judkins (1991) considered differential sampling at the second stage of a multi-stage sampling design to oversample small populations. Further investigation of oversampling minority populations with stratified multi-stage sampling designs will be conducted.

Appendix

A. SKETCHED PROOF OF (2)

By using Taylor linearization, we have:

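$$\begin{aligned}
\hat{\theta}_D ={}& \theta_D + \frac{1}{N_D}\left(\hat{Y}_{a,D} + p_D \hat{Y}_{ab,D} + q_D \hat{Y}_{ba,D} + \hat{Y}_{b,D} - Y_D\right) - \frac{Y_D}{N_D^2}\left(\hat{N}_D - N_D\right) + o_p(n^{-1/2}) \\
={}& \theta_D + \frac{1}{N_D}\sum_{h_A=1}^{H_A} \frac{N_{A,h_A}}{n_{A,h_A}} \sum_{i \in s_{A,h_A}} a_i D_i (y_i - \theta_D) + p_D \frac{1}{N_D}\sum_{h_A=1}^{H_A} \frac{N_{A,h_A}}{n_{A,h_A}} \sum_{i \in s_{A,h_A}} ab_i D_i (y_i - \theta_D) \\
&+ q_D \frac{1}{N_D}\sum_{h_B=1}^{H_B} \frac{N_{B,h_B}}{n_{B,h_B}} \sum_{i \in s_{B,h_B}} ba_i D_i (y_i - \theta_D) + \frac{1}{N_D}\sum_{h_B=1}^{H_B} \frac{N_{B,h_B}}{n_{B,h_B}} \sum_{i \in s_{B,h_B}} b_i D_i (y_i - \theta_D) + o_p(n^{-1/2}) \\
={}& \theta_D + T_1 + p_D T_2 + q_D T_3 + T_4 + o_p(n^{-1/2}),
\end{aligned}$$

where

$$\begin{aligned}
T_1 &= \frac{1}{N_D}\sum_{h_A=1}^{H_A} \frac{N_{A,h_A}}{n_{A,h_A}} \sum_{i \in s_{A,h_A}} a_i D_i (y_i - \theta_D), &
T_2 &= \frac{1}{N_D}\sum_{h_A=1}^{H_A} \frac{N_{A,h_A}}{n_{A,h_A}} \sum_{i \in s_{A,h_A}} ab_i D_i (y_i - \theta_D), \\
T_3 &= \frac{1}{N_D}\sum_{h_B=1}^{H_B} \frac{N_{B,h_B}}{n_{B,h_B}} \sum_{i \in s_{B,h_B}} ba_i D_i (y_i - \theta_D), &
T_4 &= \frac{1}{N_D}\sum_{h_B=1}^{H_B} \frac{N_{B,h_B}}{n_{B,h_B}} \sum_{i \in s_{B,h_B}} b_i D_i (y_i - \theta_D).
\end{aligned}$$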

Then,

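$$\begin{aligned}
V(\hat{\theta}_D) &= V(T_1 + p_D T_2 + q_D T_3 + T_4) + o(n^{-1}) \\
&= V(T_1 + p_D T_2) + V(q_D T_3 + T_4) + o(n^{-1}) \\
&= V(T_1) + p_D^2 V(T_2) + 2 p_D \operatorname{cov}(T_1, T_2) + q_D^2 V(T_3) + V(T_4) + 2 q_D \operatorname{cov}(T_3, T_4) + o(n^{-1}).
\end{aligned}$$

Let $n_{1i} = a_i D_i (y_i - \theta_D)$. Then,

$$V(T_1) = \frac{1}{N_D^2}\sum_{h_A=1}^{H_A} \frac{N_{A,h_A}^2}{n_{A,h_A}}\left(1 - \frac{n_{A,h_A}}{N_{A,h_A}}\right) S_{n_1,A,h_A}^2, \tag{1}$$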

where

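$$\begin{aligned}
S_{n_1,A,h_A}^2 &= \frac{1}{N_{A,h_A} - 1}\sum_{i=1}^{N_{A,h_A}} (n_{1i} - \bar{n}_{1,A,h_A})^2 \\
&= \frac{1}{N_{A,h_A} - 1}\sum_{i=1}^{N_{A,h_A}} \left(n_{1i}^2 - 2 n_{1i}\bar{n}_{1,A,h_A} + \bar{n}_{1,A,h_A}^2\right) \\
&= \frac{1}{N_{A,h_A} - 1}\left(\sum_{i=1}^{N_{A,h_A}} n_{1i}^2 - N_{A,h_A}\bar{n}_{1,A,h_A}^2\right) \\
&= \frac{1}{N_{A,h_A} - 1}\left(\sum_{i=1}^{N_{A,h_A}} a_i D_i (y_i - \theta_D)^2 - N_{A,h_A}\bar{n}_{1,A,h_A}^2\right),
\end{aligned} \tag{2}$$

and

$$\begin{aligned}
\bar{n}_{1,A,h_A} &= \frac{1}{N_{A,h_A}}\sum_{i=1}^{N_{A,h_A}} a_i D_i (y_i - \theta_D) = \frac{1}{N_{A,h_A}}\left(\sum_{i=1}^{N_{A,h_A}} a_i D_i y_i - N_{A,h_A,a,D}\,\theta_D\right) \\
&= \frac{N_{A,h_A,a,D}}{N_{A,h_A}}\bar{Y}_{A,h_A,a,D} - \frac{N_{A,h_A,a,D}}{N_{A,h_A}}\theta_D = \frac{N_{A,h_A,a,D}}{N_{A,h_A}}\left(\bar{Y}_{A,h_A,a,D} - \theta_D\right).
\end{aligned} \tag{3}$$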

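We also have

$$\begin{aligned}
\sum_{i=1}^{N_{A,h_A}} a_i D_i (y_i - \theta_D)^2 &= \sum_{i=1}^{N_{A,h_A}} a_i D_i \left(y_i - \bar{Y}_{A,h_A,a,D} + \bar{Y}_{A,h_A,a,D} - \theta_D\right)^2 \\
&= \sum_{i=1}^{N_{A,h_A}} a_i D_i \left(y_i - \bar{Y}_{A,h_A,a,D}\right)^2 + N_{A,h_A,a,D}\left(\bar{Y}_{A,h_A,a,D} - \theta_D\right)^2.
\end{aligned} \tag{4}$$

According to (2), (3), and (4), we have

$$S_{n_1,A,h_A}^2 \approx \frac{N_{A,h_A,a,D}}{N_{A,h_A}} S_{A,h_A,a,D}^2 + \frac{N_{A,h_A,a,D}}{N_{A,h_A}}\left(\bar{Y}_{A,h_A,a,D} - \theta_D\right)^2 - \left(\frac{N_{A,h_A,a,D}}{N_{A,h_A}}\right)^2\left(\bar{Y}_{A,h_A,a,D} - \theta_D\right)^2. \tag{5}$$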

By using similar techniques, we have

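$$V(T_2) = \frac{1}{N_D^2}\sum_{h_A=1}^{H_A} \frac{N_{A,h_A}^2}{n_{A,h_A}}\left(1 - \frac{n_{A,h_A}}{N_{A,h_A}}\right) S_{n_2,A,h_A}^2, \quad \text{where } n_{2i} = ab_i D_i (y_i - \theta_D), \tag{6}$$

$$S_{n_2,A,h_A}^2 \approx \frac{N_{A,h_A,ab,D}}{N_{A,h_A}} S_{A,h_A,ab,D}^2 + \frac{N_{A,h_A,ab,D}}{N_{A,h_A}}\left(\bar{Y}_{A,h_A,ab,D} - \theta_D\right)^2 - \left(\frac{N_{A,h_A,ab,D}}{N_{A,h_A}}\right)^2\left(\bar{Y}_{A,h_A,ab,D} - \theta_D\right)^2,$$

$$V(T_3) = \frac{1}{N_D^2}\sum_{h_B=1}^{H_B} \frac{N_{B,h_B}^2}{n_{B,h_B}}\left(1 - \frac{n_{B,h_B}}{N_{B,h_B}}\right) S_{n_3,B,h_B}^2, \quad \text{where } n_{3i} = ba_i D_i (y_i - \theta_D), \tag{7}$$

$$S_{n_3,B,h_B}^2 \approx \frac{N_{B,h_B,ba,D}}{N_{B,h_B}} S_{B,h_B,ba,D}^2 + \frac{N_{B,h_B,ba,D}}{N_{B,h_B}}\left(\bar{Y}_{B,h_B,ba,D} - \theta_D\right)^2 - \left(\frac{N_{B,h_B,ba,D}}{N_{B,h_B}}\right)^2\left(\bar{Y}_{B,h_B,ba,D} - \theta_D\right)^2,$$

$$V(T_4) = \frac{1}{N_D^2}\sum_{h_B=1}^{H_B} \frac{N_{B,h_B}^2}{n_{B,h_B}}\left(1 - \frac{n_{B,h_B}}{N_{B,h_B}}\right) S_{n_4,B,h_B}^2, \quad \text{where } n_{4i} = b_i D_i (y_i - \theta_D), \tag{8}$$

$$S_{n_4,B,h_B}^2 \approx \frac{N_{B,h_B,b,D}}{N_{B,h_B}} S_{B,h_B,b,D}^2 + \frac{N_{B,h_B,b,D}}{N_{B,h_B}}\left(\bar{Y}_{B,h_B,b,D} - \theta_D\right)^2 - \left(\frac{N_{B,h_B,b,D}}{N_{B,h_B}}\right)^2\left(\bar{Y}_{B,h_B,b,D} - \theta_D\right)^2,$$

$$\operatorname{cov}(T_1, T_2) = \frac{1}{N_D^2}\sum_{h_A=1}^{H_A} \frac{N_{A,h_A}^2}{n_{A,h_A}}\left(1 - \frac{n_{A,h_A}}{N_{A,h_A}}\right) S_{n_1 n_2,A,h_A}^2, \tag{9}$$

$$S_{n_1 n_2,A,h_A}^2 = \frac{1}{N_{A,h_A} - 1}\sum_{i=1}^{N_{A,h_A}} (n_{1i} - \bar{n}_{1,A,h_A})(n_{2i} - \bar{n}_{2,A,h_A}) \approx -\frac{N_{A,h_A,a,D}}{N_{A,h_A}}\frac{N_{A,h_A,ab,D}}{N_{A,h_A}}\left(\bar{Y}_{A,h_A,a,D} - \theta_D\right)\left(\bar{Y}_{A,h_A,ab,D} - \theta_D\right),$$

$$\operatorname{cov}(T_3, T_4) = \frac{1}{N_D^2}\sum_{h_B=1}^{H_B} \frac{N_{B,h_B}^2}{n_{B,h_B}}\left(1 - \frac{n_{B,h_B}}{N_{B,h_B}}\right) S_{n_3 n_4,B,h_B}^2, \tag{10}$$

$$S_{n_3 n_4,B,h_B}^2 = \frac{1}{N_{B,h_B} - 1}\sum_{i=1}^{N_{B,h_B}} (n_{3i} - \bar{n}_{3,B,h_B})(n_{4i} - \bar{n}_{4,B,h_B}) \approx -\frac{N_{B,h_B,b,D}}{N_{B,h_B}}\frac{N_{B,h_B,ba,D}}{N_{B,h_B}}\left(\bar{Y}_{B,h_B,b,D} - \theta_D\right)\left(\bar{Y}_{B,h_B,ba,D} - \theta_D\right).$$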

According to (1)–(10) and after some algebra, we have

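$$V(\hat{\theta}_D) = \frac{1}{N_D^2}\left(\sum_{h_A=1}^{H_A} \frac{N_{A,h_A}^2 E_{A,h_A}}{n_{A,h_A}} + \sum_{h_B=1}^{H_B} \frac{N_{B,h_B}^2 E_{B,h_B}}{n_{B,h_B}}\right) + o(n^{-1}),$$

where $E_{A,h_A}$ and $E_{B,h_B}$ are defined in section 3.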

B. SKETCHED PROOF FOR TWO MINORITY POPULATIONS

We seek to minimize

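$$\begin{aligned}
C ={}& \sum_{h_A=1}^{H_A} n_{A,h_A}\left[P_{A,h_A}^{(L)} c_A^* + f_{A,h_A}^{(M)} P_{A,h_A}^{(M)} c_A^* + c_A\left(1 - P_{A,h_A}^{(L)} - f_{A,h_A}^{(M)} P_{A,h_A}^{(M)}\right)\right] \\
&+ \sum_{h_B=1}^{H_B} n_{B,h_B}\left[P_{B,h_B}^{(L)} c_B^* + f_{B,h_B}^{(M)} P_{B,h_B}^{(M)} c_B^* + c_B\left(1 - P_{B,h_B}^{(L)} - f_{B,h_B}^{(M)} P_{B,h_B}^{(M)}\right)\right]
\end{aligned}$$

subject to the constraints:

$$\frac{\sqrt{V(\hat{\theta}_D^{(L)})}}{\theta_D^{(L)}} \le \alpha_L, \qquad \frac{\sqrt{V(\hat{\theta}_D^{(M)})}}{\theta_D^{(M)}} \le \alpha_M,$$

that is, $V(\hat{\theta}_D^{(L)}) \le C_L$ and $V(\hat{\theta}_D^{(M)}) \le C_M$, where $C_L = (\alpha_L \theta_D^{(L)})^2$ and $C_M = (\alpha_M \theta_D^{(M)})^2$, and

$$\begin{aligned}
V(\hat{\theta}_D^{(L)}) &= (N_D^{(L)})^{-2}\left(\sum_{h_A=1}^{H_A} \frac{N_{A,h_A}^2 E_{A,h_A}^{(L)}}{n_{A,h_A}} + \sum_{h_B=1}^{H_B} \frac{N_{B,h_B}^2 E_{B,h_B}^{(L)}}{n_{B,h_B}}\right), \\
V(\hat{\theta}_D^{(M)}) &= (N_D^{(M)})^{-2}\left(\sum_{h_A=1}^{H_A} \frac{N_{A,h_A}^2 E_{A,h_A}^{(M)}}{n_{A,h_A} f_{A,h_A}^{(M)}} + \sum_{h_B=1}^{H_B} \frac{N_{B,h_B}^2 E_{B,h_B}^{(M)}}{n_{B,h_B} f_{B,h_B}^{(M)}}\right).
\end{aligned}$$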

Using Lagrange multipliers, we have

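$$\begin{aligned}
F ={}& \sum_{h_A=1}^{H_A} n_{A,h_A}\left[P_{A,h_A}^{(L)} c_A^* + f_{A,h_A}^{(M)} P_{A,h_A}^{(M)} c_A^* + c_A\left(1 - P_{A,h_A}^{(L)} - f_{A,h_A}^{(M)} P_{A,h_A}^{(M)}\right)\right] \\
&+ \sum_{h_B=1}^{H_B} n_{B,h_B}\left[P_{B,h_B}^{(L)} c_B^* + f_{B,h_B}^{(M)} P_{B,h_B}^{(M)} c_B^* + c_B\left(1 - P_{B,h_B}^{(L)} - f_{B,h_B}^{(M)} P_{B,h_B}^{(M)}\right)\right] \\
&+ \lambda_1\left[V(\hat{\theta}_D^{(L)}) - C_L\right] + \lambda_2\left[V(\hat{\theta}_D^{(M)}) - C_M\right].
\end{aligned}$$

Taking the derivative of $F$ with respect to $n_{A,h_A}$ and setting it equal to zero yields

$$\begin{aligned}
\frac{\partial F}{\partial n_{A,h_A}} ={}& P_{A,h_A}^{(L)} c_A^* + f_{A,h_A}^{(M)} P_{A,h_A}^{(M)} c_A^* + c_A\left(1 - P_{A,h_A}^{(L)} - f_{A,h_A}^{(M)} P_{A,h_A}^{(M)}\right) \\
&+ \lambda_1 \frac{1}{(N_D^{(L)})^2} N_{A,h_A}^2 E_{A,h_A}^{(L)} (-1)\frac{1}{n_{A,h_A}^2} + \lambda_2 \frac{1}{(N_D^{(M)})^2} \frac{N_{A,h_A}^2 E_{A,h_A}^{(M)}}{f_{A,h_A}^{(M)}} (-1)\frac{1}{n_{A,h_A}^2} = 0.
\end{aligned}$$

Then

$$\frac{N_{A,h_A}^2}{n_{A,h_A}^2}\left[\frac{\lambda_1 E_{A,h_A}^{(L)}}{(N_D^{(L)})^2} + \frac{\lambda_2 E_{A,h_A}^{(M)}}{(N_D^{(M)})^2 f_{A,h_A}^{(M)}}\right] = P_{A,h_A}^{(L)} c_A^* + f_{A,h_A}^{(M)} P_{A,h_A}^{(M)} c_A^* + c_A\left(1 - P_{A,h_A}^{(L)} - f_{A,h_A}^{(M)} P_{A,h_A}^{(M)}\right).$$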

Similarly, taking the derivative of F with respect to nB,hB and setting it equal to zero yields

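$$\begin{aligned}
\frac{\partial F}{\partial n_{B,h_B}} ={}& P_{B,h_B}^{(L)} c_B^* + f_{B,h_B}^{(M)} P_{B,h_B}^{(M)} c_B^* + c_B\left(1 - P_{B,h_B}^{(L)} - f_{B,h_B}^{(M)} P_{B,h_B}^{(M)}\right) \\
&+ \lambda_1 \frac{1}{(N_D^{(L)})^2} N_{B,h_B}^2 E_{B,h_B}^{(L)} (-1)\frac{1}{n_{B,h_B}^2} + \lambda_2 \frac{1}{(N_D^{(M)})^2} \frac{N_{B,h_B}^2 E_{B,h_B}^{(M)}}{f_{B,h_B}^{(M)}} (-1)\frac{1}{n_{B,h_B}^2} = 0
\end{aligned}$$

and

$$\frac{N_{B,h_B}^2}{n_{B,h_B}^2}\left[\frac{\lambda_1 E_{B,h_B}^{(L)}}{(N_D^{(L)})^2} + \frac{\lambda_2 E_{B,h_B}^{(M)}}{(N_D^{(M)})^2 f_{B,h_B}^{(M)}}\right] = P_{B,h_B}^{(L)} c_B^* + f_{B,h_B}^{(M)} P_{B,h_B}^{(M)} c_B^* + c_B\left(1 - P_{B,h_B}^{(L)} - f_{B,h_B}^{(M)} P_{B,h_B}^{(M)}\right).$$

Taking the derivative of $F$ with respect to $f_{A,h_A}^{(M)}$ and setting it equal to zero yields

$$\frac{\partial F}{\partial f_{A,h_A}^{(M)}} = n_{A,h_A}\left[P_{A,h_A}^{(M)} c_A^* - P_{A,h_A}^{(M)} c_A\right] + \lambda_2 \frac{1}{(N_D^{(M)})^2} \frac{N_{A,h_A}^2 E_{A,h_A}^{(M)}}{n_{A,h_A}} (-1)\frac{1}{(f_{A,h_A}^{(M)})^2} = 0.$$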

Then

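$$n_{A,h_A}^2 P_{A,h_A}^{(M)} (c_A^* - c_A) = \frac{\lambda_2 N_{A,h_A}^2 E_{A,h_A}^{(M)}}{(N_D^{(M)})^2 (f_{A,h_A}^{(M)})^2} \quad\Longrightarrow\quad \frac{N_{A,h_A}^2}{n_{A,h_A}^2} = \frac{P_{A,h_A}^{(M)} (c_A^* - c_A) (N_D^{(M)})^2 (f_{A,h_A}^{(M)})^2}{\lambda_2 E_{A,h_A}^{(M)}}.$$

Similarly,

$$\frac{N_{B,h_B}^2}{n_{B,h_B}^2} = \frac{P_{B,h_B}^{(M)} (c_B^* - c_B) (N_D^{(M)})^2 (f_{B,h_B}^{(M)})^2}{\lambda_2 E_{B,h_B}^{(M)}}.$$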

Then

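$$\frac{P_{A,h_A}^{(M)} (c_A^* - c_A) (N_D^{(M)})^2 (f_{A,h_A}^{(M)})^2}{\lambda_2 E_{A,h_A}^{(M)}} \times \frac{\lambda_1 E_{A,h_A}^{(L)}}{(N_D^{(L)})^2} + P_{A,h_A}^{(M)} (c_A^* - c_A) f_{A,h_A}^{(M)} = P_{A,h_A}^{(L)} (c_A^* - c_A) + c_A + f_{A,h_A}^{(M)} P_{A,h_A}^{(M)} (c_A^* - c_A).$$

Then

$$P_{A,h_A}^{(M)} \lambda_1 (N_D^{(M)})^2 E_{A,h_A}^{(L)} (f_{A,h_A}^{(M)})^2 \left(\lambda_2 E_{A,h_A}^{(M)} (N_D^{(L)})^2\right)^{-1} = P_{A,h_A}^{(L)} + \frac{c_A}{c_A^* - c_A},$$

so that

$$f_{A,h_A}^{(M)} = \sqrt{\frac{\left(P_{A,h_A}^{(L)} + c_A (c_A^* - c_A)^{-1}\right)\lambda_2 E_{A,h_A}^{(M)} (N_D^{(L)})^2}{P_{A,h_A}^{(M)} (N_D^{(M)})^2 \lambda_1 E_{A,h_A}^{(L)}}} = \frac{N_D^{(L)}}{N_D^{(M)}}\sqrt{\frac{\lambda_2 E_{A,h_A}^{(M)}\left(P_{A,h_A}^{(L)} + c_A (c_A^* - c_A)^{-1}\right)}{P_{A,h_A}^{(M)} \lambda_1 E_{A,h_A}^{(L)}}}.$$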

Similarly, we have

fB,hB(M)=ND(L)ND(M)λ2EB,hB(M)(PB,hB(L)+cB(cBcB)1)PB,hB(M)λ1EB,hB(L)nA,hA=λ2NA,hA2EA,hA(M)(ND(M))2PA,hA(M)(cAcA)×PA,hA(M)(ND(M))2λ1EA,hA(L)(PA,hA(L)+cA(cAcA)1)λ2EA,hA(M)(ND(L))2=NA,hA2λ1EA,hA(L)[(cAcA)PA,hA(L)+cA](ND(L))2

so

$$n_{A,h_A} = \frac{N_{A,h_A}}{N_D^{(L)}}\sqrt{\frac{\lambda_1 E_{A,h_A}^{(L)}}{(c_A - c_A') P_{A,h_A}^{(L)} + c_A'}}$$

and

$$n_{B,h_B} = \frac{N_{B,h_B}}{N_D^{(L)}}\sqrt{\frac{\lambda_1 E_{B,h_B}^{(L)}}{(c_B - c_B') P_{B,h_B}^{(L)} + c_B'}}.$$

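Then $\lambda_1$ and $\lambda_2$ can be obtained by solving

$$V(\hat{\theta}_D^{(L)}) = (N_D^{(L)})^{-2}\left(\sum_{h_A=1}^{H_A}\frac{N_{A,h_A}^2 E_{A,h_A}^{(L)}}{n_{A,h_A}} + \sum_{h_B=1}^{H_B}\frac{N_{B,h_B}^2 E_{B,h_B}^{(L)}}{n_{B,h_B}}\right) = C_L$$

and

$$V(\hat{\theta}_D^{(M)}) = (N_D^{(M)})^{-2}\left(\sum_{h_A=1}^{H_A}\frac{N_{A,h_A}^2 E_{A,h_A}^{(M)}}{n_{A,h_A} f_{A,h_A}^{(M)}} + \sum_{h_B=1}^{H_B}\frac{N_{B,h_B}^2 E_{B,h_B}^{(M)}}{n_{B,h_B} f_{B,h_B}^{(M)}}\right) = C_M,$$

where $C_L = (\theta_D^{(L)} \alpha_L)^2$ and $C_M = (\theta_D^{(M)} \alpha_M)^2$.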
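
When every subsampling fraction for M stays below 1, plugging the closed-form sample sizes and fractions back into the two variance equations appears to decouple them, so each multiplier can be written down directly. Writing c for the full unit cost, s for the second (smaller) unit cost, and k = (c - s) P_L + s, this gives sqrt(lambda1) = sum of N sqrt(E_L k) divided by (N_D_L C_L), and sqrt(lambda2) = sum of N sqrt(E_M P_M (c - s)) divided by (N_D_M C_M), with sums over the strata of both frames. This shortcut is worked out here from the expressions above rather than stated in the text, so treat the Python sketch below as an assumption-laden check. All names and numbers are hypothetical.

```python
from math import sqrt


def optimal_allocation(N, P_L, P_M, E_L, E_M, c_full, c_screen,
                       N_D_L, N_D_M, C_L, C_M):
    """Return (n, f_M, lambda1, lambda2), assuming every f_M ends up below 1.

    Each list has one entry per stratum, with the strata of frames A and B
    pooled; c_full and c_screen hold the unit costs of the stratum's frame.
    """
    strata = list(zip(N, P_L, P_M, E_L, E_M, c_full, c_screen))
    # k_h = (c - s) * P_L + s, the denominator under the root in n_h.
    k = [(c - s) * pl + s for _, pl, _, _, _, c, s in strata]
    root1 = sum(Nh * sqrt(el * kh)
                for (Nh, _, _, el, _, _, _), kh in zip(strata, k)) / (N_D_L * C_L)
    root2 = sum(Nh * sqrt(em * pm * (c - s))
                for Nh, _, pm, _, em, c, s in strata) / (N_D_M * C_M)
    n = [Nh / N_D_L * root1 * sqrt(el / kh)
         for (Nh, _, _, el, _, _, _), kh in zip(strata, k)]
    f_M = [N_D_L / N_D_M * (root2 / root1)
           * sqrt(em * (pl + s / (c - s)) / (pm * el))
           for _, pl, pm, el, em, c, s in strata]
    return n, f_M, root1**2, root2**2


# Hypothetical pooled strata (say two from frame A, one from frame B).
N = [1000, 2000, 1500]
P_L = [0.10, 0.20, 0.15]
P_M = [0.40, 0.30, 0.35]
E_L = [0.09, 0.16, 0.12]
E_M = [0.24, 0.21, 0.22]
c_full = [10.0, 10.0, 12.0]
c_screen = [2.0, 2.0, 3.0]
N_D_L, N_D_M = 725.0, 1525.0
C_L = C_M = 0.01

n, f_M, lam1, lam2 = optimal_allocation(N, P_L, P_M, E_L, E_M, c_full,
                                        c_screen, N_D_L, N_D_M, C_L, C_M)

# Recompute both variances from the allocation to confirm the targets are met.
v_L = sum(Nh**2 * e / nh for Nh, e, nh in zip(N, E_L, n)) / N_D_L**2
v_M = sum(Nh**2 * e / (nh * fh)
          for Nh, e, nh, fh in zip(N, E_M, n, f_M)) / N_D_M**2
print([round(x, 1) for x in n], [round(x, 3) for x in f_M])
print(v_L, v_M)
```

If any returned fraction reaches 1 or more, the shortcut no longer applies and the multipliers have to be found numerically together with the boundary case that follows.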

If $f_{A,h_A}^{(M)} \geq 1$, then set $f_{A,h_A}^{(M)} = 1$.

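$$\frac{N_{A,h_A}^2}{n_{A,h_A}^2} \times \left[\frac{\lambda_1 E_{A,h_A}^{(L)}}{(N_D^{(L)})^2} + \frac{\lambda_2 E_{A,h_A}^{(M)}}{(N_D^{(M)})^2}\right] = P_{A,h_A}^{(L)} c_A + P_{A,h_A}^{(M)} c_A + c_A'\left(1 - P_{A,h_A}^{(L)} - P_{A,h_A}^{(M)}\right) = \left(P_{A,h_A}^{(L)} + P_{A,h_A}^{(M)}\right)(c_A - c_A') + c_A',$$

so that

$$n_{A,h_A} = \sqrt{\frac{N_{A,h_A}^2\left(\dfrac{\lambda_1 E_{A,h_A}^{(L)}}{(N_D^{(L)})^2} + \dfrac{\lambda_2 E_{A,h_A}^{(M)}}{(N_D^{(M)})^2}\right)}{\left(P_{A,h_A}^{(L)} + P_{A,h_A}^{(M)}\right)(c_A - c_A') + c_A'}}.$$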

Similarly, if $f_{B,h_B}^{(M)} \geq 1$, then set $f_{B,h_B}^{(M)} = 1$.

Then, we have

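$$n_{B,h_B} = \sqrt{\frac{N_{B,h_B}^2\left(\dfrac{\lambda_1 E_{B,h_B}^{(L)}}{(N_D^{(L)})^2} + \dfrac{\lambda_2 E_{B,h_B}^{(M)}}{(N_D^{(M)})^2}\right)}{\left(P_{B,h_B}^{(L)} + P_{B,h_B}^{(M)}\right)(c_B - c_B') + c_B'}}.$$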
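
The boundary formula can be sanity-checked numerically. The Python sketch below takes a single hypothetical stratum with the fraction for M fixed at 1, computes the sample size from the closed form, and then confirms by a central finite difference that the part of F depending on that sample size has zero slope there. The multipliers, costs, and stratum values are made-up inputs, and `c_screen` again stands for the second, smaller unit cost.

```python
from math import sqrt


def n_with_f_equal_one(N_h, E_L, E_M, P_L, P_M, c_full, c_screen,
                       N_D_L, N_D_M, lam1, lam2):
    """Stratum sample size from the boundary formula (f_M fixed at 1)."""
    weight = lam1 * E_L / N_D_L**2 + lam2 * E_M / N_D_M**2
    unit_cost = (P_L + P_M) * (c_full - c_screen) + c_screen
    return sqrt(N_h**2 * weight / unit_cost)


def lagrangian_term(n_h, N_h, E_L, E_M, P_L, P_M, c_full, c_screen,
                    N_D_L, N_D_M, lam1, lam2):
    """Part of F that depends on this stratum's n_h when f_M = 1."""
    unit_cost = (P_L + P_M) * (c_full - c_screen) + c_screen
    return (n_h * unit_cost
            + lam1 * N_h**2 * E_L / (N_D_L**2 * n_h)
            + lam2 * N_h**2 * E_M / (N_D_M**2 * n_h))


# Hypothetical stratum and multipliers.
args = dict(N_h=1000, E_L=0.09, E_M=0.24, P_L=0.10, P_M=0.40,
            c_full=10.0, c_screen=2.0, N_D_L=725.0, N_D_M=1525.0,
            lam1=2.0e5, lam2=1.0e5)
n_star = n_with_f_equal_one(**args)

# Central finite difference of the Lagrangian term around n_star.
step = 1e-4 * n_star
slope = (lagrangian_term(n_star + step, **args)
         - lagrangian_term(n_star - step, **args)) / (2 * step)
print(n_star, slope)
```

Because the term has the form a n + b / n with positive a and b, the stationary point is also its minimum over positive n.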

We thank a referee for his/her useful comments and suggestions, which improved the quality of this article. Data were collected by the Sooner Survey Center as part of the evaluation of the Tobacco Settlement Endowment Trust’s statewide Healthy Living Program (PI Rebekah Rhoades). The research of Drs. Sixia Chen and Alexander Stubblefield is supported by a Presbyterian Health Foundation seed grant (number C5101401 / ORA # 20171573). Support for Drs. Chen and Stoner was provided through the National Institutes of Health, National Institute of General Medical Sciences (Grant 2U54GM104938-06, PI Judith James) and the National Institutes of Health, National Institute on Minority Health and Health Disparities (Grant 1R25MD011564, PI Julie Stoner/Courtney Houchen).

References

  1. Bankier M. D. (1986), “Estimators Based on Several Stratified Samples with Applications to Multiple Frame Surveys,” Journal of the American Statistical Association, 81, 1074–1079.
  2. Chang L., Krosnick J. A. (2009), “National Surveys via RDD Telephone Interviewing versus the Internet,” Public Opinion Quarterly, 73, 641–678.
  3. Chen S., Kalton G. (2015), “Geographic Oversampling for Race/Ethnicity Using Data from the 2010 US Population Census,” Journal of Survey Statistics and Methodology, 3, 543–565.
  4. Curtin L. R., Mohadjer L. K., Dohrmann S. M., Kruszon-Moran D., Mirel L. B., Carroll M. D., Hirsch R., Burt V. L., Johnson C. L. (2013), National Health and Nutrition Examination Survey: Sample Design, 2007-2010, Washington, DC: US Government Printing Office.
  5. Dalenius T. (1957), Sampling in Sweden, Stockholm: Almqvist and Wiksell.
  6. Elliot M. N., Finch B. K., Klein D., Sai M., Phuong Do D., Beckett M. K., Orr N., Lurie N. (2008), “Sample Designs for Measuring the Health of Small Racial/Ethnic Subgroups,” Statistics in Medicine, 27, 4016–4029.
  7. Espey D. K., Jim M. A., Cobb N., Bartholomew M., Becker T., Haverkamp D., Plescia M. (2014), “Leading Causes of Death and All-Cause Mortality in American Indians and Alaska Natives,” American Journal of Public Health, 104, S303–S311.
  8. Fahimi M., Judkins D. (1991), “PSU Probabilities Given Differential Sampling at Second Stage,” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 538–543.
  9. Fuller W. A., Burmeister L. F. (1972), “Estimators for Samples Selected from Two Overlapping Frames,” Proceedings of the Social Statistics Section, American Statistical Association, pp. 245–249.
  10. Hartley H. O. (1962), “Multiple Frame Surveys,” Proceedings of the Social Statistics Section, American Statistical Association, pp. 203–206.
  11. Hartley H. O. (1974), “Multiple Frame Methodology and Selected Applications,” Sankhyā, 36, 99–118.
  12. Helba C., Love C., Wivagg J., Frissell K., Lee K. C., Whitwell C. (2015), “Estimating the Need for Treatment for Substance Use Disorders among Minnesota Adults: Results of the 2014/2015 Minnesota Survey on Adult Substance Use,” unpublished report, available at https://mn.gov/dhs/partners-and-providers/news-initiatives-reports-workgroups/alcohol-drug-other-addictions/mn-adult-su-survey/.
  13. Jewett R. S., Judkins D. R. (1988), “Multivariate Stratification with Size Constraints,” SIAM Journal on Scientific and Statistical Computing, 9, 1091–1097.
  14. Judkins D. R., Brick J. M., Broene P., Ferraro D., Strickler T. (2001), 1999 NSAF Sample Design: Report No. 2, Washington, DC: Urban Institute Press.
  15. Kalsbeek W. D. (2003), “Sampling Minority Groups in Health Surveys,” Statistics in Medicine, 22, 1527–1549.
  16. Kalton G. (2009), “Methods for Oversampling Rare Subpopulations in Social Surveys,” Survey Methodology, 35, 125–141.
  17. Kalton G., Anderson D. (1986), “Sampling Rare Populations,” Journal of the Royal Statistical Society, 149, 65–82.
  18. Lohr S. L. (2011), “Alternative Survey Sample Designs: Sampling with Multiple Overlapping Frames,” Survey Methodology, 37, 197–213.
  19. Lohr S. L., Rao J. N. K. (2000), “Inference from Dual Frame Surveys,” Journal of the American Statistical Association, 95, 271–280.
  20. Marcus A. C., Crane L. A. (1986), “Telephone Surveys in Public Health Research,” Medical Care, 24, 97–112.
  21. Mohadjer L., Krenzke T. (2009), “Sample Design,” in Technical Report and Data File User’s Manual for the 2003 National Assessment of Adult Literacy, eds. Baldi S. and US National Center for Education Statistics, Washington, DC: US Government Printing Office.
  22. Mowery P. D., Dube S. R., Thorne S. L., Garrett B. E., Homa D. M., Nez-Henderson P. (2015), “Disparities in Smoking-Related Mortality among American Indians/Alaska Natives,” American Journal of Preventive Medicine, 49, 738–744.
  23. Rao J. N. K., Wu C. (2010), “Pseudo-Empirical Likelihood Inference for Multiple Frame Surveys,” Journal of the American Statistical Association, 105, 1494–1503.
  24. Skinner C. J., Rao J. N. K. (1996), “Estimation in Dual Frame Surveys with Complex Designs,” Journal of the American Statistical Association, 91, 349–356.
  25. Szolnoki G., Hoffmann D. (2013), “Online, Face-to-Face and Telephone Surveys- Comparing Different Sampling Methods in Wine Consumer Research,” Wine Economics and Policy, 2, 57–66.
  26. Tourangeau R., Edwards B., Johnson T. P., Wolter K. M., Bates N. (2014), Hard-to-Survey Populations, Cambridge University Press.
  27. Waksberg J., Judkins D., Massey J. T. (1997), “Geographic-Based Oversampling in Demographic Surveys of the United States,” Survey Methodology, 23, 61–71.
  28. Wolter K. M., Tao X., Montgomery R., Smith P. J. (2015), “Optimum Allocation for a Dual-Frame Telephone Survey,” Survey Methodology, 41, 389–401.

Articles from Journal of Survey Statistics and Methodology are provided here courtesy of Oxford University Press
