Author manuscript; available in PMC: 2024 May 20.
Published in final edited form as: Stat Med. 2023 Mar 7;42(11):1641–1668. doi: 10.1002/sim.9692

Comparisons of Statistical Methods for Handling Attrition in a Follow-up Visit with Complex Survey Sampling

Jianwen Cai 1, Donglin Zeng 1, Haolin Li 1, Nicole M Butera 2, Pedro L Baldoni 1, Poulami Maitra 3, Li Dong 1
PMCID: PMC10957339  NIHMSID: NIHMS1962082  PMID: 37183765

Abstract

Design-based analysis, which accounts for the design features of the study, is commonly used to analyze data from studies with complex survey sampling, such as the Hispanic Community Health Study/Study of Latinos (HCHS/SOL). In longitudinal studies of this type, attrition is a common problem. Although various statistical approaches have been proposed to handle attrition, such as inverse probability weighting (IPW), non-response cell weighting (NRCW), multiple imputation (MI), and the full information maximum likelihood (FIML) approach, there has not been a systematic assessment comparing their performance in design-based analyses. In this paper, we perform extensive simulation studies and compare the performance of different missing data methods in linear and generalized linear population models under different missing data mechanisms. We find that design-based analysis produces valid estimation and statistical inference when the missing data are handled appropriately using IPW, NRCW, MI, or FIML under a missing-completely-at-random (MCAR) or missing-at-random (MAR) mechanism and when the missingness model is correctly specified or over-specified. We also illustrate the use of these methods using data from HCHS/SOL.

Keywords: Missing data, attrition, complex survey sampling design, design-based analysis

1 |. INTRODUCTION

Complex sample surveys have been widely used in many large cohort studies to oversample under-represented subgroups in the study population and to improve the efficiency and convenience of sample selection. The sampling scheme often involves stratified sampling, cluster sampling, and/or multistage sampling. For example, our motivating study, the Hispanic Community Health Study/Study of Latinos (HCHS/SOL),1,2 is a community-based, multi-center, longitudinal cohort study that started in year 2008, with one of the study goals to examine risk and protective factors for chronic diseases in Hispanics/Latinos. The study recruited 16,415 self-identified Hispanic/Latino adults, aged 18–74 years, from four US field centers: Bronx, NY; Chicago, IL; Miami, FL; and San Diego, CA. A stratified three-stage probability sampling design was implemented within each field center (LaVange, et al., 2010)2: census block groups were selected in stage 1, households within the selected block groups were selected in stage 2, and participants within the selected households were selected in stage 3. Furthermore, unequal probabilities of sampling were implemented in each stage to minimize cost and to ensure large enough sample sizes for particular subgroups of the final cohort (e.g., based on age, Hispanic/Latino background distribution, neighborhood socioeconomic status distribution).

There are two commonly used approaches for analyzing data from complex sample surveys: model-based approach and design-based approach. In model-based approaches, samples are assumed to come from a hypothetical infinite population, and the observed values are regarded as the realizations of the random variables that follow some distributions, so that the statistical inference is based on the modeling and distribution assumptions. On the other hand, in design-based approaches, the population is regarded as fixed, and the variation in the sample is due to random sampling, so that the statistical inference is based on the design features of the study.

Attrition often occurs in longitudinal studies such as HCHS/SOL. In HCHS/SOL, study participants may not return for the follow-up visit. This could happen due to moving away from the study clinic, inability to schedule a study visit, or no longer being interested in participating in the study, among other reasons. Out of the original 16,415 participants in HCHS/SOL, 4,792 (29.2%) did not return for the follow-up examination during 2014–2017. It is well known that if the study participants who returned for the follow-up visit are systematically different from those who did not return, the results based on standard statistical methods assuming random samples using only the subset of the participants who returned to the follow-up visit may be biased. Hence, it is important to consider how to handle attrition in longitudinal studies appropriately. Since attrition leads to missing data in longitudinal studies, we will handle it in the framework of missing data.

Many statistical approaches have been proposed to handle missing data in survey studies using model-based approaches. One group of methods uses only the data from the complete units, with an appropriate adjustment to the sampling weights so that the complete units adequately represent the whole population. Specifically, one form of weight adjustment is inverse probability weighting (IPW).3,4,5 In this approach, the response probability (e.g., the probability of returning for the second study visit) for each complete unit is estimated from a posited model for the response status, and the adjusted weight for each complete unit is the sampling weight multiplied by the inverse of the estimated response probability. An alternative form of weight adjustment is non-response cell weighting (NRCW),6,7,8 which is similar to IPW, except that the whole sample is first stratified into several classes based on observed variables that are thought to be related to the response, and the response probability for a complete unit is then estimated as the proportion of those with observed data in the class out of those who were originally sampled in the same class. For NRCW, post-stratification is usually used to further calibrate the adjusted weights so that, after weighting, the complete units match the whole population on some variables (e.g., demographic variables).
A second group of methods to handle missing data consists of different ways to impute all missing values for each unit.9,10 Among them, the most commonly used imputation method is multiple imputation (MI).11 This method “imputes” (i.e., fills in) the missing data multiple times based on a statistical model for the missing variables (i.e., creates m imputed datasets, each including the observed data and a unique set of imputed values for the missing data), then performs separate statistical analyses using standard complete-data methods on each of the m imputed datasets to obtain the parameter estimates and covariance matrices. Finally, the parameter estimates and covariances are combined and adjusted to obtain the final estimates and covariances, respectively. A third group of methods to handle missing data is based on analyzing all available data. For example, the full information maximum likelihood (FIML) approach13,14,15 assumes a model for the joint distribution of all variables. The parameter estimates are then obtained by maximizing the weighted log-likelihood function formed from all available information in the complete and partially complete observations.

Although these missing data methods mentioned above could be adopted for the design-based approaches, there has not been a systematic assessment of these methods to compare their performance in design-based analyses. Consequently, the choices of methods are unclear when studying the associations between risk factors and various types of outcomes in such analysis. In this paper, we aim to fill this gap by performing a systematic comparison of some commonly used methods for missing data in design-based analyses. We conduct extensive simulation studies where data are simulated mimicking HCHS/SOL. Specifically, we consider body mass index (BMI) change as an example of a continuous outcome and incident chronic kidney disease (CKD) as an example of a binary outcome. For the binary outcome, we examine both logistic regression adjusting for the time elapsed between the two visits for studying disease risk and Poisson regression for studying disease incidence. The paper is structured as follows. Section 2 provides a detailed review of the existing methods for analyzing complex survey data with missing data and summarizes the currently available statistical software and corresponding packages and procedures that can be used for these purposes. Section 3 presents simulation studies to evaluate the performance of statistical methods for data obtained from a complex survey sampling design in the presence of attrition in design-based approaches. Section 4 illustrates the use of these methods using data from HCHS/SOL. Section 5 summarizes the conclusions based on our empirical studies and discusses further considerations.

2 |. METHODS FOR HANDLING ATTRITION IN COMPLEX SAMPLE SURVEY

In this section, we provide an overview of commonly used methods that can be adopted for a design-based approach for analyzing longitudinal complex sample survey data in the presence of attrition. We particularly focus on estimating the effect of risk factors in the regression setting. First, we establish the framework of the generalized linear model (GLM) from a model-based viewpoint. Let $Y_i$ be the response variable for subject $i$ and $x_i$ be the $p \times 1$ dimensional vector of explanatory variables. Then we consider the following GLM in the target population:

$$g\{E(Y_i \mid x_i)\} = x_i^\top \beta, \qquad (1)$$

where $\beta$ is the vector of population parameters of interest and $g(\cdot)$ is a known link function. For continuous $Y_i$, linear regression is often used, and $g(\cdot)$ is the identity function. For binary $Y_i$, logistic regression is often used for disease risk and Poisson regression for incidence rate. For logistic regression, the link function is $g(p) = \operatorname{logit}(p) = \log\{p/(1-p)\}$. For the incidence rate of an event, Poisson regression is used with the log link function and an offset term, $\log t_i$, i.e., $\log E(Y_i \mid x_i) = \log t_i + x_i^\top \beta$, where $Y_i$ is an indicator for the incident event of interest and $t_i$ is the person-time at risk for subject $i$.

Binder (1983)16 and Lumley and Scott (2017)17 provide the theory for estimating the parameters of a GLM under the finite population framework of design-based analysis when there are no missing data. Let $B$ denote the census parameters of interest (i.e., the maximum quasi-likelihood estimator for $\beta$ based on the full population data, which can be understood as the design-based counterpart of $\beta$). The corresponding estimator, $\hat{B}$, is then obtained by solving the score equation:

$$G(B) = \sum_{i \in S} w_i U_i(B) = 0,$$

where $S$ denotes the units included in the sample, $U_i(B)$ denotes the score function of $B$ from subject $i$ based on the GLM, and $w_i$ denotes the inverse of the sampling probability and is used to adjust for either oversampling or undersampling of certain subjects.18 To obtain the correct variance estimator for $\hat{B}$, it is necessary to take the sampling design (i.e., sampling probability, strata, and clusters) into consideration to incorporate the variability of sample selection. When the first-stage sampling probabilities are small or when the first-stage clusters are sampled with replacement, Taylor series linearization can be used to estimate standard errors (Woodruff, 1971).19 Specifically, to construct the design-based variance estimator for $\hat{B}$, let

$$D = -\left[\left.\frac{\partial G(B)}{\partial B}\right|_{B=\hat{B}}\right]^{-1},$$

and the variance estimator can be expressed as

$$\hat{V}(\hat{B}) = D M D,$$

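where $M$ is constructed as follows. Let $j$ denote the index for strata, where $j = 1, \ldots, J$, and $J$ is the total number of strata in the sample, and let $\mathcal{C}_{jk}$ denote the units included in sampling cluster $k$ in stratum $j$, where $k = 1, \ldots, K_j$, and $K_j$ is the total number of clusters in stratum $j$. Then let

$$U_{jk}(\hat{B}) = \sum_{i \in \mathcal{C}_{jk}} w_i U_i(\hat{B}) \quad \text{and} \quad \bar{U}_j(\hat{B}) = \frac{1}{K_j} \sum_{k=1}^{K_j} U_{jk}(\hat{B}),$$

and $M$ is given by

$$M = \sum_{j=1}^{J} \frac{K_j}{K_j - 1} \sum_{k=1}^{K_j} \left[U_{jk}(\hat{B}) - \bar{U}_j(\hat{B})\right]\left[U_{jk}(\hat{B}) - \bar{U}_j(\hat{B})\right]^\top.$$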
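As an illustration of the linearization variance described above, the following minimal sketch computes the D M D estimator for the simplest possible case, an intercept-only linear model (a weighted mean), where the score is U_i(B) = y_i - B. The function name and the flat-list data layout are assumptions made for this example; they are not taken from any of the software packages discussed in this paper.

```python
from collections import defaultdict


def weighted_mean_with_design_variance(y, w, stratum, cluster):
    """Weighted mean B_hat and its linearization variance D * M * D.

    Hypothetical helper for illustration only. Intercept-only linear
    model, so the score contribution of unit i is U_i(B) = y_i - B.
    """
    sum_w = sum(w)
    b_hat = sum(wi * yi for wi, yi in zip(w, y)) / sum_w

    # G(B) = sum_i w_i (y_i - B), so dG/dB = -sum_i w_i and D = 1 / sum_i w_i.
    d = 1.0 / sum_w

    # Weighted score totals U_jk(B_hat) per cluster, grouped by stratum.
    totals = defaultdict(lambda: defaultdict(float))
    for yi, wi, j, k in zip(y, w, stratum, cluster):
        totals[j][k] += wi * (yi - b_hat)

    # M = sum_j K_j / (K_j - 1) * sum_k (U_jk - U_bar_j)^2  (scalar case).
    m = 0.0
    for j, by_cluster in totals.items():
        k_j = len(by_cluster)
        if k_j < 2:
            raise ValueError(f"stratum {j!r} needs at least two clusters")
        u_bar = sum(by_cluster.values()) / k_j
        m += k_j / (k_j - 1) * sum((u - u_bar) ** 2 for u in by_cluster.values())

    return b_hat, d * m * d
```

In the scalar case the transpose in M reduces to a square, and the sign of D does not affect the product D M D.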

With missing data, it is important to identify the missing data mechanism. Generally, there are three types of missing data mechanisms (Little and Rubin, 2020).9 Data are missing-completely-at-random (MCAR) if the subset of study subjects who returned for the follow-up is a random subset of the entire study cohort. Data are missing-at-random (MAR) if the probability that subjects return for the follow-up is independent of the outcome at the follow-up visit conditional on baseline observed variables. Data are missing-not-at-random (MNAR) if the probability that the subjects return for the follow-up visit depends on unobserved outcome measures at the follow-up visit, even after conditioning on the baseline observed data.

To handle missing data, one naive method, referred to as complete case (CC) analysis, is to exclude subjects with missing data and apply standard statistical methods assuming no missing data. This method is valid under the MCAR assumption. However, estimates obtained from this analysis may be biased if the excluded subjects are systematically different from those included in the analysis. In that situation, a variety of statistical methods can be used to accommodate missing data. These methods usually assume that data are MAR. In the following sub-sections, we describe the methods that are assessed and compared under the design-based approach in this paper.

2.1 |. Inverse Probability Weighting (IPW)

Inverse probability weighting (IPW) is one of several methods that can reduce the bias caused by CC analysis (Seaman and White, 2011).3 In this method, complete cases are weighted by the inverse of their probability of being a complete case, so that the complete cases will represent all the subjects in the study sample. Denoting $R_i = 1$ if $Y_i$ and $X_i$ are observed and $R_i = 0$ otherwise, one can fit a model for $R$, say, based on a logistic regression model, known as the ‘missingness model’, with outcome $R$ and predictors being some variables from $\{X, Y, Z\}$, where $Z$ contains some auxiliary variables that were collected in the study but are not part of the analysis model. Let $\hat{\pi}_i$ denote the estimated non-missing probability $P(R_i = 1 \mid Y_i, X_i, Z_i)$ based on this missingness model. The IPW method solves the following equation for parameter estimation:

$$\sum_{i \in S} w_i \frac{R_i}{\hat{\pi}_i} U_i(B) = 0. \qquad (2)$$

In the following text, we refer to $w_i / \hat{\pi}_i$ as the adjusted IPW weights.
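The weight adjustment just described can be sketched as follows. This is a minimal illustration that assumes the estimated response probabilities are already available (for example, from a fitted logistic missingness model); the function names are hypothetical, and the weighted mean stands in for the solution of the IPW estimating equation in the intercept-only linear case.

```python
def adjusted_ipw_weights(w, r, pi_hat):
    """Return w_i * R_i / pi_hat_i for every sampled unit.

    Hypothetical helper for illustration. Units with R_i = 0 get weight 0,
    so they drop out of the weighted estimating equation.
    """
    adjusted = []
    for wi, ri, pi in zip(w, r, pi_hat):
        if ri == 1:
            if not 0.0 < pi <= 1.0:
                raise ValueError("estimated response probability must be in (0, 1]")
            adjusted.append(wi / pi)
        else:
            adjusted.append(0.0)
    return adjusted


def weighted_mean(y, weights):
    """Solve sum_i weight_i * (y_i - B) = 0 for B, skipping zero-weight units."""
    num = sum(wt * yi for yi, wt in zip(y, weights) if wt > 0)
    den = sum(wt for wt in weights if wt > 0)
    return num / den


# Toy data: the first four units respond with probability 0.5, the rest always.
y = [10, 10, None, None, 20, 20, 20, 20]
w = [1.0] * 8
r = [1, 1, 0, 0, 1, 1, 1, 1]
pi_hat = [0.5] * 4 + [1.0] * 4

ipw_estimate = weighted_mean(y, adjusted_ipw_weights(w, r, pi_hat))
cc_estimate = weighted_mean(y, [wi * ri for wi, ri in zip(w, r)])
```

In this toy sample the complete-case mean over-represents the second group, while the IPW-weighted mean restores each group to its share of the original sample.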

2.2 |. Non-Response Cell Weighting (NRCW)

Another commonly used method for handling missing data is non-response cell weighting (NRCW),6 which uses post-stratification to align the sample to the population based on some variables collected at baseline and over the follow-up period, referred to as post-stratification variables (for example, age, sex, or geographical region).

For the post-stratification adjustment, cross-classification is formed based on the post-stratification variables that were possibly related to missingness. Then, the respondents are grouped into different weighting classes defined by the design and post-stratification variables. Within each weighting class c, the post-stratification weight adjustment can be done by multiplying the survey sampling weight of each participant in the class by the following factor:

$$\frac{\sum_{i \text{ in class } c} w_i}{\sum_{i \text{ in class } c \text{ and } i \text{ returns for visit 2}} w_i}. \qquad (3)$$

If the sample size is too small in certain weighting classes, a common practice is to combine some of these classes in calculating the post-stratification weights. After the post-stratification adjustment, one can then use these multiplied weights, which we will refer to as the adjusted NRCW weights, for various statistical analyses.
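The class-level adjustment factor described above can be sketched as follows. The function names and the use of a single class label per unit are assumptions for this example; in practice the classes come from cross-classifying the design and post-stratification variables.

```python
from collections import defaultdict


def nrcw_factors(w, r, cls):
    """Per-class factor: (sum of w over the class) / (sum of w over returners).

    Hypothetical helper for illustration only.
    """
    total = defaultdict(float)
    returned = defaultdict(float)
    for wi, ri, c in zip(w, r, cls):
        total[c] += wi
        if ri == 1:
            returned[c] += wi
    factors = {}
    for c, tot in total.items():
        if returned[c] == 0.0:
            # Mirrors the advice in the text: sparse classes should be combined.
            raise ValueError(f"class {c!r} has no respondents; combine classes")
        factors[c] = tot / returned[c]
    return factors


def adjusted_nrcw_weights(w, r, cls):
    """Multiply each respondent's sampling weight by its class factor."""
    factors = nrcw_factors(w, r, cls)
    return [wi * factors[c] if ri == 1 else 0.0 for wi, ri, c in zip(w, r, cls)]
```

By construction, the adjusted weights of the respondents in each class sum to the total sampling weight of everyone originally sampled in that class.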

2.3 |. Multiple Imputation (MI)

Multiple Imputation (MI) is another statistical method that is often used to deal with incomplete data. Unlike IPW and NRCW, where one weights the observed cases, in MI one imputes the missing data. Estimates are then obtained by applying standard statistical methods to the combined observed and imputed data. This process is repeated several times, and the estimates from each repetition are then combined to obtain the final results using ‘Rubin’s formula’ (Rubin, 2004).11
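The combining step can be sketched for a single scalar parameter as follows. This uses the standard form of Rubin's rules (pooled estimate equal to the mean of the per-imputation estimates, total variance equal to the within-imputation variance plus (1 + 1/m) times the between-imputation variance); the function name is hypothetical, and in a design-based analysis each input variance would be the design-based variance estimate from one imputed dataset.

```python
def pool_rubin(estimates, variances):
    """Combine m per-imputation estimates and variances for one parameter.

    Hypothetical helper for illustration only.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    q_bar = sum(estimates) / m                      # pooled point estimate
    u_bar = sum(variances) / m                      # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation
    total_var = u_bar + (1.0 + 1.0 / m) * b
    return q_bar, total_var
```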

There are various methods for multiple imputation. The ones that are commonly implemented include expectation-maximization (EM) algorithm for maximum likelihood estimation in parametric models, fully conditional specification (FCS) methods for multivariate imputation, and Markov Chain Monte Carlo (MCMC) algorithm based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. Based on our empirical experience, the different ways of imputing data resulted in very similar results. Therefore, we proceed with the FCS method in this paper.

The FCS method specifies a conditional imputation model for each missing variable conditional on the other variables and assumes that a joint distribution exists that corresponds to this set of conditional distributions. This method comprises two phases: a ‘filled-in’ phase and an ‘imputation’ phase. For each imputation, in the first phase, the missing values for all variables are filled in sequentially, one variable at a time, by randomly drawing from the conditional distribution of the observed variable under consideration given the preceding variables; this provides the starting values for the missing values in the imputation phase. In the imputation phase, the missing values for each variable are imputed sequentially for a number of burn-in iterations before the imputation. Multiple imputation is generally more computationally intensive than weighting methods (e.g., IPW and NRCW), because it involves fitting the final regression model multiple times (once for each imputed dataset). In addition, Kim et al. (2006)12 show that MI with complex sample survey data can have problems with variance estimation, especially for domains, so it is also necessary to incorporate the sampling weight and design variables in the MI procedure.

2.4 |. Full Information Maximum Likelihood (FIML) Approach

Full Information Maximum Likelihood (FIML) approach is another method that can be utilized to deal with incomplete data.13 Instead of imputing the missing values, the FIML approach assumes that all of the variables follow a multivariate normal distribution and is only applicable in the context of continuous, normally distributed incomplete variables. This approach maximizes the following log-likelihood function, which is formed based on all available information obtained from the complete and partially complete observations from the n subjects in the sample:

$$\ell_{\mathrm{FIML}}(\mu, \Sigma \mid Y) = \sum_{i \in S} w_i \left[ -\frac{k_i}{2} \log(2\pi) - \frac{1}{2} \log|\Sigma_i| - \frac{1}{2} (Y_i - \mu_i) \Sigma_i^{-1} (Y_i - \mu_i)^\top \right],$$

where $w_i$ is the sampling weight for subject $i$, $Y$ is an $n \times k$ matrix which includes both outcomes and covariates, $k_i$ is the number of non-missing entries in the $i$th row of $Y$, $Y_i$ is a $1 \times k_i$ vector representing the $i$th row of matrix $Y$ after removing all the missing entries, $\mu_i$ is a $1 \times k_i$ population mean vector after removing all the means corresponding to missing entries, and $\Sigma_i$ is a $k_i \times k_i$ population covariance matrix after removing all the rows and columns corresponding to missing entries. In this way, FIML estimation adjusts the log-likelihood function to make use of all the complete and partially complete observations and provides valid point estimates and confidence intervals for the parameters of interest. In addition, the FIML approach can also incorporate the auxiliary variables using the maximum likelihood estimation of Graham's saturated correlates model (2003).20
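To make the row-wise structure of this log-likelihood concrete, the sketch below evaluates it for the special case of two variables (k = 2), where the 2 x 2 determinant and inverse have closed forms. The function name and data layout (missing entries coded as None) are assumptions for this example; it only evaluates the weighted log-likelihood at given values of the mean vector and covariance matrix and does not perform the maximization.

```python
import math


def fiml_loglik_bivariate(rows, weights, mu, sigma):
    """Weighted observed-data normal log-likelihood for two variables.

    Hypothetical helper for illustration only.
    rows: list of (y1, y2) pairs with None marking a missing entry.
    mu: (mu1, mu2); sigma: symmetric 2 x 2 covariance as nested lists.
    """
    log_2pi = math.log(2.0 * math.pi)
    total = 0.0
    for row, w in zip(rows, weights):
        observed = [j for j, v in enumerate(row) if v is not None]
        if not observed:
            continue  # a fully missing row contributes nothing
        if len(observed) == 1:
            # k_i = 1: keep only the mean and variance of the observed entry.
            j = observed[0]
            var = sigma[j][j]
            ll = -0.5 * log_2pi - 0.5 * math.log(var) - 0.5 * (row[j] - mu[j]) ** 2 / var
        else:
            # k_i = 2: full bivariate density with closed-form inverse.
            det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
            d1, d2 = row[0] - mu[0], row[1] - mu[1]
            quad = (sigma[1][1] * d1 * d1
                    - 2.0 * sigma[0][1] * d1 * d2
                    + sigma[0][0] * d2 * d2) / det
            ll = -log_2pi - 0.5 * math.log(det) - 0.5 * quad
        total += w * ll
    return total
```

Partially observed rows still contribute through the marginal density of their observed entries, which is how the approach uses information beyond the complete cases.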

2.5 |. Advantages and Disadvantages of Common Methods

Each of the commonly used methods described above has advantages and disadvantages. The advantage of MI is that it helps correct for bias and leads to very efficient estimates, because MI uses information not only from complete cases but also from observations with partially missing data.5,21 However, the downside of MI is that it needs to be developed within the context of the specific analysis. Since different analysts are concerned with different contexts, no single set of imputations can satisfy all interests.22,23 Additionally, the imputation model needs to accommodate the structure of the analysis model, which may contain interactions, quadratic terms, or even random effects. More complications can arise in practice, so there is a danger of MI being applied incorrectly.5 In contrast, the weighting-based methods (IPW and NRCW) only require correct specification of the model for missingness and are often easier to understand and to explain to collaborators.5 However, these methods are generally less efficient than MI because they only use information from fully observed cases.5 For the FIML approach, the advantage is that it makes use of full information (both complete and partially complete cases) to calculate the log-likelihood. Its drawbacks are that it only accommodates continuous variables and that, when the multivariate normal assumption is violated, it may under-estimate the standard errors.24,25 All the methods mentioned above are able to produce unbiased parameter estimates under MCAR and MAR missing mechanisms. In this paper, our results also shed light on the advantages and disadvantages of these methods for design-based analysis with complex sample survey data.

2.6 |. Software

We examined the following software that can conduct design-based analysis: R, SAS, SUDAAN, Stata, and Mplus. Table 1 provides a summary of different software and their corresponding procedures that can be used to conduct design-based analysis with three commonly used linear and generalized linear models with complete data. Table 2 presents how to use each software to handle attrition by using different missing data methods (weight-based approach (IPW and NRCW), imputation based approach (MI), and full information likelihood approach (FIML)), and how to incorporate the auxiliary variables in the analysis.

TABLE 1.

Statistical Software and Corresponding Procedures for Design-Based Analysis of Complex Sample Survey Data for Linear and Generalized Linear Models with Complete Data

| Model | Software | Procedures/Packages | Workflow |
| --- | --- | --- | --- |
| Linear regression | R | survey | Specify design features using svydesign function and use svyglm function to fit the model. In svyglm function, specify family = “gaussian”, with identity link function. |
| | SAS | proc surveyreg | Specify the analysis model in model statement, sampling weights in weight statement, and use strata and cluster statements to specify design features. |
| | SUDAAN | proc regress | Specify the analysis model in model statement, sampling weights in weight statement, and use nest statements to apply design-based analysis. |
| | Stata | svy | Use svyset to specify survey design features such as sampling units and weights, and use svy: regress to specify the model. |
| | Mplus | see workflow | Specify the analysis model in MODEL command and use TYPE = COMPLEX in ANALYSIS command, along with STRAT, CLUSTER, WEIGHT, and/or SUBPOPULATION options in the VARIABLE command. |
| Logistic regression | R | survey | Specify design features using svydesign function and use svyglm function to fit the model. In svyglm function, specify family = “binomial”, with logit link function. |
| | SAS | proc surveylogistic | Specify the analysis model in model statement, sampling weights in weight statement, and use strata and cluster statements to specify design features. |
| | SUDAAN | proc rlogist | Specify the analysis model in model statement, sampling weights in weight statement, and use nest statements to apply design-based analysis. |
| | Stata | svy | Use svyset to specify survey design features such as sampling units and weights, and use svy: logit to specify the model. |
| | Mplus | see workflow | Specify the analysis model in MODEL command and use TYPE = COMPLEX in ANALYSIS command, along with STRAT, CLUSTER, WEIGHT, and/or SUBPOPULATION options in the VARIABLE command. In addition, the outcome variable should also be included in CATEGORICAL option in VARIABLE command. |
| Poisson regression | R | survey | Specify design features using svydesign function and use svyglm function to fit the model. In svyglm function, specify family = “poisson”, with log link function. |
| | SUDAAN | proc loglink | Specify the analysis model in model statement, sampling weights in weight statement, and use nest statements to apply design-based analysis. |
| | Stata | svy | Use svyset to specify survey design features such as sampling units and weights, and use svy: poisson to specify the model. |
| | Mplus | see workflow | Specify the analysis model in MODEL command and use TYPE = COMPLEX in ANALYSIS command, along with STRAT, CLUSTER, WEIGHT, and/or SUBPOPULATION options in the VARIABLE command. In addition, the outcome variable should also be included in COUNT option in VARIABLE command. |

TABLE 2.

Statistical Software and Corresponding Procedures for Design-Based Analysis of Sample Survey Data in the Presence of Attrition

| Method | Software | Procedures/Packages | Workflow |
| --- | --- | --- | --- |
| Weighting-based | R | survey | Use the same functions as for the complete data in Table 1 with the adjusted IPW/NRCW weights. |
| | SAS | proc surveyreg and proc surveylogistic | Use the same functions as for the complete data in Table 1 with the adjusted IPW/NRCW weights. |
| | SUDAAN | proc regress, proc rlogist, and proc loglink | Use the same functions as for the complete data in Table 1 with the adjusted IPW/NRCW weights. |
| | Stata | svy | Use the same functions as for the complete data in Table 1 with the adjusted IPW/NRCW weights. |
| | Mplus | see workflow | Use the same functions as for the complete data in Table 1 with the adjusted IPW/NRCW weights. |
| Imputation-based | R | MICE, missForest, Amelia, and mi | There are various packages in R that can be used to conduct multiple imputations. For example, MICE package implements the FCS method for conducting multiple imputation; missForest package implements the nonparametric missing value imputation using random forest; Amelia package enables multiple imputation through an expectation-maximization with bootstrap (EMB) algorithm; and mi package imputes missing values in an approximate Bayesian framework and provides the corresponding diagnostics for multiple imputation. |
| | SAS | proc mi and proc mianalyze | Use FCS statement in proc mi (or other statements such as EM and MCMC) to generate imputed complete data sets. Regular analyses are applied to these imputed data sets and proc mianalyze can be used to combine the results to get the final estimates. |
| | SUDAAN | proc impute | Specify the imputation method in method option in proc impute. Available options include weighted sequential hotdeck (WSHD) multivariate imputation, cell mean imputation, linear regression imputation (continuous outcomes only), and logistic regression imputation (binary outcomes only). Specify the imputation class variables in impby statement and the outcome variable in impvar statement. |
| | Stata | mi | Use mi set to specify the name of the data set that stores the imputations, and use mi register to specify the variables we would like to impute. Then, use mi impute to impute missing values. There are various available options of imputation methods. For example, sequential imputation using chained equations (i.e., FCS) is implemented in mi impute chained, and multivariate normal regression is implemented in mi impute mvn. Additionally, the mi prefix should be added to run the analysis model, mi svyset is used to specify the study design, and mi estimate: svy is used to fit a model on MI survey data. |
| | Mplus | see workflow | Multiple imputation of missing data using Bayesian analysis is implemented in Mplus. In the DATA IMPUTATION command, specify the variables for which we will impute the missing values in IMPUTE option, specify the number of imputed data sets in NDATASET option, and use the SAVE option to save the imputed data sets for further analysis using TYPE=IMPUTATION in the DATA command. |
| FIML | Stata | sem | Specify the option method(mlmv) in sem command. |
| | Mplus | see workflow | Use MISSING option to identify the values or symbols in the dataset that are treated as missing and use AUXILIARY option to include the auxiliary variables. |

3 |. SIMULATION STUDY

In this section, we present the setup of the extensive simulation studies used to compare the missing data methods under the design-based approach. We first describe the data generation process and then the different methods used to analyze the data.

3.1 |. Sampling Design

Data were generated to mimic the sampling design and variable distributions from the HCHS/SOL (LaVange et al., 2010).2 Specifically, we generated the simulated data based on a simplified version of the sampling design for the Bronx site in HCHS/SOL. First, data were generated for a population with a nested structure, i.e., subjects were nested within households and households were nested within census block groups. Then 1000 independent samples were drawn from this same population using a complex sampling design. Specifically, the data were generated based on a stratified three-stage sampling design, where stage 1 corresponded to census block groups (BG), stage 2 corresponded to households (HH) within block groups, and stage 3 corresponded to subjects within households. A flowchart illustrating the overall sampling design for the simulated data is displayed in Figure 1.

FIGURE 1.


Flowchart to Illustrate Sampling Design for Simulated Data

Stage 1: Block Groups (BG)

The population contained 752 BGs, which were divided into 4 strata. At stage 1, entire BGs were sampled without replacement from the population. Figure 1 contains the number of BGs in the population and the sampling probability for each stratum.

Stage 2: Households (HH)

Households were nested within each BG. The total number of HHs in each BG was independently generated as 1 + Round(Exp(450)), where “Round()” indicates rounding to the nearest integer and Exp(θ) refers to the exponential distribution with mean θ. We divided all HHs into HHs with Hispanic surnames and HHs with other surnames. Since the target population of HCHS/SOL is Hispanics/Latinos in the age range of 18–74 years, not all HHs would include eligible subjects. For example, an HH with only non-Hispanic/Latino subjects would not be eligible. Figure 1 presents the proportion of eligible HHs within each BG by Hispanic/other surnames (stage 2 stratum). HHs were sampled without replacement from those eligible HHs that were in BGs selected during stage 1. The stratified sampling probabilities for stage 2 sampling are presented in Figure 1.

Stage 3: Subjects

Within each HH, subjects were divided into 2 strata (stage 3 strata) based on baseline age: “younger adults” (age 18–44 years) and “older adults” (age 45–74 years). The number of subjects in each HH was independently generated as 1+Poisson(1). Baseline age (years) for each subject was independently generated as N(40,225), truncated to the range 18 to 74 years. At stage 3, subjects were sampled without replacement from each HH that was selected during stage 2. The sampling probabilities differed by stage 3 stratum, and are presented in Figure 1.
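The size and age distributions used in stages 2 and 3 can be sketched with the Python standard library as follows. Several points are assumptions made for this example rather than details confirmed by the text: N(40, 225) is read as mean 40 and variance 225 (standard deviation 15), truncation is implemented by redrawing values outside the range rather than clipping, and the function names are hypothetical.

```python
import math
import random


def n_households(rng, mean=450.0):
    """1 + Round(Exp(450)): exponential with mean 450, rounded to an integer."""
    return 1 + round(rng.expovariate(1.0 / mean))


def poisson_draw(rng, lam):
    """Poisson(lam) draw via Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    k = 0
    p = 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def n_subjects(rng):
    """Number of subjects in a household: 1 + Poisson(1)."""
    return 1 + poisson_draw(rng, 1.0)


def baseline_age(rng, mean=40.0, variance=225.0, low=18.0, high=74.0):
    """Truncated normal age via rejection sampling (assumed truncation rule)."""
    sd = math.sqrt(variance)
    while True:
        age = rng.gauss(mean, sd)
        if low <= age <= high:
            return age


rng = random.Random(2023)  # arbitrary seed for reproducibility
household_ages = [baseline_age(rng) for _ in range(n_subjects(rng))]
stage3_strata = ["younger" if a < 45 else "older" for a in household_ages]
```

The last two lines generate one household and assign each subject to a stage 3 stratum using the 45-year cut point given in the text.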

Based on the design, sampling weights were calculated in three steps. First, weights for each stage of sampling were calculated as the inverse of the sampling probability for that stage. Second, base weights were calculated as the product of the weights from all three stages of sampling. Third, the base weights were normalized by dividing the base weight by the mean of the base weight in the sample, so that the sum of the normalized weights in each sample would equal the sample size. These final normalized weights were used for all weighted analyses.
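The three weighting steps above can be sketched as follows. The function name and the input layout (one tuple of stage-wise selection probabilities per sampled subject) are assumptions for this example.

```python
def normalized_weights(stage_probs):
    """Normalized sampling weights from per-stage selection probabilities.

    Hypothetical helper for illustration only.
    stage_probs: list of (p_stage1, p_stage2, p_stage3), one per sampled subject.
    """
    base = []
    for probs in stage_probs:
        weight = 1.0
        for p in probs:
            if not 0.0 < p <= 1.0:
                raise ValueError("sampling probabilities must be in (0, 1]")
            weight *= 1.0 / p  # step 1: stage weight; step 2: running product
        base.append(weight)
    mean_base = sum(base) / len(base)
    # Step 3: divide by the sample mean so the weights sum to the sample size.
    return [b / mean_base for b in base]
```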

3.2 |. Data Generation

Generation of Covariates

Covariates were generated independently of each other for all subjects in the population, based on the weighted distributions of baseline variables from the HCHS/SOL data. Specifically, for generating the continuous outcome, the following covariates were generated: X1 (gender) was generated as Bernoulli(0.5), X2 (high school/GED or lower vs. higher level of education) as Bernoulli(0.67), X3 (current smoking) as Bernoulli(0.2), X4 (born in the United States) as Bernoulli(0.2), X5 (self-reported moderate to vigorous physical activity (hours/day)) as N(2,6.5), truncated to the range 0 to 24 hours/day, and X6 (years between visits) as N(6,0.25), truncated to the range 3 to 9 years. For generating the binary outcome, the following covariates were generated: X7 (Hispanic/Latino background, equals 1 if Puerto Rican and 0 if Dominican) was generated as Bernoulli(0.5), X8 (gender) as Bernoulli(0.3), X9 (diabetes) as Bernoulli(0.25), and X10 (red blood cell count) as N(4.5,0.17), truncated above 0. These specifications are based on data from HCHS/SOL participants with a baseline estimated glomerular filtration rate (eGFR) > 60 mL/min/1.73 m², which is an indicator of normal kidney function.

Generation of Continuous Outcome

Generation of the continuous outcome at baseline and visit 2 (Y1 and Y2) was based on fitted models for body mass index (BMI) with the corresponding covariates using the HCHS/SOL data. Specifically, the following model was used to generate the baseline continuous outcome:

$$Y_1 = X_{\text{con},1}\,\alpha_1 + a_{HH} + \delta_1$$

where $X_{\text{con},1} = (1, X_1, X_2, X_3, X_4, X_5, \text{age}, \text{stratum}_1, \text{stratum}_2, \text{stratum}_3, \text{stratum}_1 \cdot X_2, \text{stratum}_2 \cdot X_2, \text{stratum}_3 \cdot X_2, I\{\text{age} \geq 45\} \cdot X_5)$, and $\text{stratum}_h$ indicates membership in sampling stratum $h$ from the BG stage of sampling. For example, $\text{stratum}_1$ indicates membership in sampling stratum 1. $\alpha_1 = (30, -1, -0.7, -1, 2, -0.8, 0.03, 0.4, -0.04, -1.5, 1.7, 0.7, -0.6, 1.8)$ is the vector of fixed parameters. The HH clustering effect was generated as $a_{HH} \sim N(0, 7)$, and the within-subject error was generated as $\delta_1 \sim N(0, 32)$. Note that there was no clustering due to BG in the generation of $Y_1$. This was chosen based on what was observed in the HCHS/SOL study. With this data generation process, the intraclass correlation coefficient (ICC) of $Y_1$ within the HH level is 0.179, and the ICC within the BG level is 0.

The following model was used to generate the difference in continuous outcome between visits:

$$Y_2 - Y_1 = X_{\text{con},2}\,\alpha_2 + b_{BG} + b_{HH} + \delta_2$$

where $X_{\text{con},2} = (1, X_1, X_2, X_3, X_4, X_5, \text{age}, \text{stratum}_1, \text{stratum}_2, \text{stratum}_3, X_6, X_1 \cdot X_6, X_2 \cdot X_6, X_3 \cdot X_6, X_4 \cdot X_6, X_5 \cdot X_6, \text{age} \cdot X_6, \text{stratum}_1 \cdot X_6, \text{stratum}_2 \cdot X_6, \text{stratum}_3 \cdot X_6, Y_1, \text{stratum}_1 \cdot X_2, \text{stratum}_2 \cdot X_2, \text{stratum}_3 \cdot X_2, I\{\text{age} \geq 45\} \cdot X_5)$. $\alpha_2 = (8, 4, 0.1, -7.5, -6, -0.9, -0.1, 7, -0.02, 1, -0.05, -0.7, -0.02, 1, 1, 0.1, 0.01, -1, 0.1, -0.1, -0.2, 1, 0, -0.3, 1.3)$ is the vector of fixed parameters. The BG clustering effect was generated as $b_{BG} \sim N(0, 0.2)$, the HH clustering effect was generated as $b_{HH} \sim N(0, 0.3)$, and the within-subject error was generated as $\delta_2 \sim N(0, 9)$. The continuous outcome from visit 2, $Y_2$, was then calculated as the sum of the baseline continuous outcome and the difference in continuous outcome between visits. With this data generation process, the ICC of $Y_2$ within the HH level is 0.154, and the ICC within the BG level is 0.004.

Generation of Binary Outcome

The binary outcome variable was generated in the following two steps: (1) a continuous variable was generated similarly to the continuous outcome variable in the previous section, and then (2) the continuous variable was dichotomized as a binary variable based on a specified cut-point. Generation of the continuous variable at baseline and visit 2 (W1 and W2) was based on fitted models for estimated glomerular filtration rate (eGFR) with the corresponding covariates using the HCHS/SOL data (among a sample restricted to subjects with baseline eGFR > 60 mL/min/1.73 m²). The following model was used to generate the baseline continuous variable:

$$W_1 = X_{\text{bin},1}\,\beta_1 + c_{BG} + c_{HH} + \epsilon_1$$

where $X_{\text{bin},1} = (1, X_7, X_8, X_9, X_{10}, \text{age}, \text{stratum}_1, \text{stratum}_2, \text{stratum}_3, \text{stratum}_1 \cdot X_7, \text{stratum}_2 \cdot X_7, \text{stratum}_3 \cdot X_7, I\{\text{age} \geq 45\} \cdot X_{10})$. $\beta_1 = (150, -0.17, -2.95, 4.87, -4, -0.85, 2.37, -6.38, -2.58, -1, 1.2, 0.2, 2.5)$ is the vector of fixed parameters. The BG clustering effect was generated as $c_{BG} \sim N(0, 3)$, the HH clustering effect was generated as $c_{HH} \sim N(0, 30)$, and the within-subject error was generated as $\epsilon_1 \sim N(0, 120)$. With this data generation process, the ICC of $W_1$ within the HH level is 0.216, and the ICC within the BG level is 0.020.
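The ICC values reported for the baseline variables can be reproduced from the variance components alone. The sketch below assumes that the second argument of N(·,·) is a variance, that the ICCs are conditional on the fixed effects, and that two subjects in the same HH also share the BG effect; the function name is illustrative.

```python
def icc(bg_var, hh_var, resid_var):
    """Return (ICC within HH, ICC within BG) for nested random effects."""
    total = bg_var + hh_var + resid_var
    # Subjects in the same household share both the BG and the HH effects.
    icc_hh = (bg_var + hh_var) / total
    # Subjects in the same block group (different households) share only the BG effect.
    icc_bg = bg_var / total
    return icc_hh, icc_bg


w1_hh, w1_bg = icc(bg_var=3, hh_var=30, resid_var=120)  # components used for W1
y1_hh, y1_bg = icc(bg_var=0, hh_var=7, resid_var=32)    # components used for Y1
```

Rounded to three decimals, these give 0.216 and 0.020 for W1 and 0.179 and 0 for Y1, matching the values stated in the text. The same shortcut does not apply directly to Y2 and W2, whose models include the baseline outcome as a covariate.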

The following model was used to generate the difference in the continuous variable between visits:

$$W_2 - W_1 = X_{\text{bin},2}\,\beta_2 + d_{BG} + d_{HH} + \epsilon_2$$

where $X_{\text{bin},2} = (1, X_7, X_8, X_9, X_{10}, \text{age}, \text{stratum}_1, \text{stratum}_2, \text{stratum}_3, X_6, X_7 \cdot X_6, X_8 \cdot X_6, X_9 \cdot X_6, X_{10} \cdot X_6, \text{age} \cdot X_6, \text{stratum}_1 \cdot X_6, \text{stratum}_2 \cdot X_6, \text{stratum}_3 \cdot X_6, Y_1, \text{stratum}_1 \cdot X_7, \text{stratum}_2 \cdot X_7, \text{stratum}_3 \cdot X_7, I\{\text{age} \geq 45\} \cdot X_{10})$. $\beta_2 = (19, 9.30, -7.53, 9.46, 2.6, 0.04, -6.87, 10.50, -3.19, 7.00, -1.67, 0.95, -2.05, -0.26, -0.07, 1.36, -1.98, 0.68, -0.40, -3.1, 1.7, 0.7, -1.6)$ is the vector of fixed parameters. The BG clustering effect was generated as $d_{BG} \sim N(0, 3)$, the HH clustering effect was generated as $d_{HH} \sim N(0, 30)$, and the within-subject error was generated as $\epsilon_2 \sim N(0, 90)$. The value of this continuous variable from visit 2, $W_2$, was then calculated as the sum of the baseline continuous variable and the difference in the continuous variable between visits. With this data generation process, the ICC of $W_2$ within the HH level is 0.239, and the ICC within the BG level is 0.022. The binary outcome variable was created by dichotomizing this continuous variable (the binary variables at visits 1 and 2 are denoted by $Z_1$ and $Z_2$): $Z_t = I(W_t < 90)$, for $t = 1, 2$. The final overall event rate was about 0.20.

3.3 |. Generation of Missing Indicators

Missing indicators of not returning to visit 2 were generated after the samples were drawn from the population. Missing indicators were generated under missing-completely-at-random (MCAR), missing-at-random (MAR), or missing-not-at-random (MNAR) mechanisms, with 30% missing at visit 2.

MCAR

A common MCAR indicator was generated for both the continuous outcome and binary outcome variable, as a Bernoulli random variable with a 30% event rate for missing.

MAR

A common MAR indicator was generated for both the continuous outcome and binary outcome variable. The following model was used to generate the MAR indicator (R):

$$\text{logit}\{P(R = 1)\} = X_{\text{mar}}\,\gamma_1$$

where $X_{\text{mar}} = (1, \text{age}, \text{stratum}_1, \text{stratum}_2, \text{stratum}_3, X_2, X_8)$, and $\gamma_1 = (2.1, -0.1, 0.25, -0.5, 1, 0.2, -0.5)$.
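The MAR mechanism can be sketched as a Bernoulli draw whose probability comes from the logistic model above. Two assumptions are made here: the coefficient vector is read as seven values, one for each of the seven entries of the design vector, and R = 1 is taken to mean that visit 2 is missing. The function names and the example subject are hypothetical.

```python
import math
import random

# Assumption: seven coefficients, ordered as
# (intercept, age, stratum1, stratum2, stratum3, X2, X8).
GAMMA1 = (2.1, -0.1, 0.25, -0.5, 1.0, 0.2, -0.5)


def mar_probability(age, stratum1, stratum2, stratum3, x2, x8):
    """P(R = 1) from the logistic missingness model."""
    x = (1.0, age, stratum1, stratum2, stratum3, x2, x8)
    eta = sum(g * v for g, v in zip(GAMMA1, x))
    return 1.0 / (1.0 + math.exp(-eta))


def draw_mar_indicator(rng, **subject):
    """Draw R as Bernoulli(P(R = 1)) for one subject."""
    return int(rng.random() < mar_probability(**subject))


# Hypothetical subject: age 40, reference stratum, X2 = 0, X8 = 0.
p_example = mar_probability(age=40, stratum1=0, stratum2=0, stratum3=0, x2=0, x8=0)
r_example = draw_mar_indicator(random.Random(1), age=40, stratum1=0, stratum2=0,
                               stratum3=0, x2=0, x8=0)
```

Because the probability depends only on baseline variables and design strata, and not on the visit 2 outcome, the resulting missingness is MAR given those variables.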

MNAR

Separate MNAR indicators were generated for the continuous outcome variable and the binary outcome variable. The following model was used to generate the MNAR indicator (R) for the continuous outcome:

$$\text{logit}\{P(R = 1)\} = X_{\text{mnar},1}\,\gamma_2$$

where $X_{\text{mnar},1} = (1, \text{age}, \text{stratum}_1, \text{stratum}_2, \text{stratum}_3, Y_2)$, and $\gamma_2 = (1, -0.1, 0.1, -1, 0.2, 0.05)$. Similarly, the following model was used to generate the MNAR indicator (R) for the binary outcome:

$$\text{logit}\{P(R = 1)\} = X_{\text{mnar},2}\,\gamma_3$$

where $X_{\text{mnar},2} = (1, \text{age}, \text{stratum}_1, \text{stratum}_2, \text{stratum}_3, W_2)$, and $\gamma_3 = (-1.8, 0.1, 0.8, -0.1, 1, -0.05)$.

3.4 |. Analysis Models

Since in HCHS/SOL, we are interested in characterizing the effect of the baseline variables on the change in the response variables between baseline and the second study visit, and the HCHS/SOL data have both continuous and discrete response variables of interest, we considered the following models for analysis depending on the type of the variable.

1. Difference model

The linear model with covariates $X_{\text{diff}} = (1, X_1, X_5, X_6, Y_1)$ was used to model the difference in BMI between the two visits, $Y_2 - Y_1$, in the population. Specifically, the design-based parameter of interest, $B_{\text{diff}}$, is the solution of the following score equation:

$$\sum_{i} \left[ X_{\text{diff},i}^{\top} X_{\text{diff},i} B_{\text{diff}} - X_{\text{diff},i}^{\top} (Y_{2,i} - Y_{1,i}) \right] = 0.$$
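Because this score equation is linear in the parameter, it has a closed-form solution. The display below is a sketch under two assumptions: each covariate vector is treated as a row vector (consistent with the product of covariates and coefficients used in the data-generating models), and the sample analogue weights each subject in the sample $s$ by the normalized sampling weight $w_i$ from Section 3.1, which is the usual design-based form rather than a formula quoted from the text.

```latex
\[
B_{\text{diff}}
  = \Big( \sum_{i} X_{\text{diff},i}^{\top} X_{\text{diff},i} \Big)^{-1}
    \sum_{i} X_{\text{diff},i}^{\top} \, (Y_{2,i} - Y_{1,i}),
\qquad
\hat{B}_{\text{diff}}
  = \Big( \sum_{i \in s} w_i \, X_{\text{diff},i}^{\top} X_{\text{diff},i} \Big)^{-1}
    \sum_{i \in s} w_i \, X_{\text{diff},i}^{\top} \, (Y_{2,i} - Y_{1,i}).
\]
```

The first expression is the population quantity reported in the "true" column of the results tables; the second is a weighted least squares estimate computed from the sample.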

2. Rate of change model

Another quantity of interest is the rate of change of the continuous variable. In this case, we use the linear model with covariates $X_{\text{rate}} = (1, X_1, X_5, Y_1)$ to model the annual rate of change of BMI, $(Y_2 - Y_1)/X_6$, in the population. Specifically, the design-based parameter of interest, $B_{\text{rate}}$, is the solution of the following score equation:

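$$\sum_{i} \left[ X_{\text{rate},i}^{\top} X_{\text{rate},i} B_{\text{rate}} - X_{\text{rate},i}^{\top} \, \frac{Y_{2,i} - Y_{1,i}}{X_{6,i}} \right] = 0.$$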

3. Logistic regression model

We used logistic regression with covariates $X_{\text{logistic}} = (1, X_6, X_7, Y_1)$ to model the incidence probability of the binary outcome $Z_2$ in the population. Specifically, the design-based parameter of interest, $B_{\text{logistic}}$, is the solution of the following score equation:

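$$\sum_{i:\, Z_{1,i} = 0} \left[ X_{\text{logistic},i}^{\top} Z_{2,i} - \frac{\exp(X_{\text{logistic},i} B_{\text{logistic}})}{1 + \exp(X_{\text{logistic},i} B_{\text{logistic}})} \, X_{\text{logistic},i}^{\top} \right] = 0.$$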

Note that, for incident event analysis, we restricted the analysis to the domain where the baseline response is 0, i.e., $Z_1 = 0$.

4. Poisson regression model

We used the Poisson regression model with covariates $X_{\text{poi}} = (1, X_5, X_8)$ and offset term $\log X_6$ to model the incidence rate of the binary outcome $Z_2$ in the population. Specifically, the design-based parameter of interest, $B_{\text{poi}}$, is the solution of the following score equation:

$$\sum_{i:\, Z_{1,i} = 0} \left[ \frac{Z_{2,i}}{X_{6,i}} - \exp(X_{\text{poi},i} B_{\text{poi}}) \right] X_{\text{poi},i}^{\top} = 0.$$

Note that, for incident event analysis, we restricted the analysis to the domain where the baseline response is 0, i.e., $Z_1 = 0$.
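To see what the Poisson parameter measures, it may help to consider the special case of an intercept-only model. The display below is a sketch that assumes the first term in the score equation is the ratio of the visit 2 event indicator to the years between visits; $N_0$, the number of population units with $Z_1 = 0$, is notation introduced here for illustration.

```latex
\[
\sum_{i:\, Z_{1,i} = 0} \left[ \frac{Z_{2,i}}{X_{6,i}} - \exp(B_0) \right] = 0
\quad \Longrightarrow \quad
\exp(B_0) = \frac{1}{N_0} \sum_{i:\, Z_{1,i} = 0} \frac{Z_{2,i}}{X_{6,i}}.
\]
```

Under this reading, the exponentiated intercept is the average number of incident events per year of follow-up among subjects who were event-free at baseline, and the remaining coefficients describe how this annual rate varies with the covariates on the log scale.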

Note that all of the above analysis models are misspecified, i.e., the analysis models we postulate for the sample data differ from the models that hold in the population.

3.5 |. Missing Data Methods

In our simulation studies, we considered 3 missing data mechanisms (MCAR, MAR, and MNAR) and 5 methods for handling missing data: complete case (CC) analysis, inverse probability weighting (IPW), non-response cell weighting (NRCW), multiple imputation (MI), and the full information maximum likelihood (FIML) approach. Among these five methods, FIML is the only one that assumes a multivariate normal distribution for all the variables. Therefore, we used the FIML approach only for linear regression models (the difference model and the rate of change model), fitted with continuous covariates alone; we refer to these as the “reduced” models. To assess whether the performance of the FIML approach in linear models is robust to the multivariate normality assumption, we also used the FIML approach to fit the linear regression models with both continuous and discrete covariates, which we refer to as the “full” models. Table 3 provides a summary of the models fitted using each missing data method.

TABLE 3.

Summary of Analysis Models with Missing Data

| Missing Data Method | Model | Covariates included in the model |
|---|---|---|
| CC/IPW/NRCW/MI | Difference Model | X1, X5, X6, Y1 |
| CC/IPW/NRCW/MI | Rate of Change Model | X1, X5, Y1 |
| CC/IPW/NRCW/MI | Logistic Regression Model | X6, X7, Y1 |
| CC/IPW/NRCW/MI | Poisson Regression Model | X5, X8, log X6 as offset |
| FIML (Reduced) | Difference Model | X5, X6, Y1 |
| FIML (Reduced) | Rate of Change Model | X5, Y1 |
| FIML (Full) | Difference Model | X1, X5, X6, Y1 |
| FIML (Full) | Rate of Change Model | X1, X5, Y1 |

3.6 |. Misspecified Missingness Models

When we use various approaches for handling attrition (IPW, NRCW, MI, and FIML), there is an underlying assumption on what variables are associated with the missingness. In the simulation study, we also considered the situations where the missingness model is misspecified. In particular, we considered three types of situations to specify the missingness model: (1) under-specified model, which is defined as the model which fails to consider all the variables associated with the missingness; (2) correctly specified model, which is defined as the model which considers the same set of variables as that truly associated with missingness; and (3) over-specified model, which is defined as the model which considers more variables in addition to all the variables that are actually associated with missingness.

Table 4 summarizes the variables that we considered to be associated with missingness in under-specified, correctly specified, and over-specified models. In particular, the variables used in the correctly specified models correspond to the variables used in generating attrition indicators under MAR missing mechanism. The variable list for over-specified models is formed by adding Hispanic Surname, and that for under-specified models is formed by ignoring stratum and baseline age.

TABLE 4.

Summary of Variables in Under-Specified, Correctly Specified, and Over-Specified Models

| Model Specification | Variables Considered to Be Associated with Missingness |
|---|---|
| Under-specified | X2, X8 |
| Correctly specified | stratum, baseline age, X2, X8 |
| Over-specified | stratum, baseline age, X2, X8, Hispanic Surname |

Note that, for each type of missingness model specification, although the variables that we considered to be associated with missingness are the same across various missing data methods (IPW, NRCW, MI, and FIML), different missing data methods do handle those variables differently: IPW regards those variables as auxiliary variables to model the missingness; NRCW uses those variables as post-stratification variables to form weighting classes; MI includes those variables in addition to the covariates in the analysis model to impute the missing data; and FIML approach treats them as auxiliary variables by using the maximum likelihood estimation of Graham saturated correlates model (Graham, 2003).20
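To make the contrast between the two weighting approaches concrete, the sketch below shows a generic weighting-class adjustment (as in NRCW) next to an inverse probability adjustment (as in IPW). It illustrates the general techniques and is not necessarily the exact implementation used in this paper; the function names, the cell labels, and the small example data are hypothetical, and in practice the response probabilities would be estimated from a fitted missingness model.

```python
from collections import defaultdict


def nrcw_weights(weights, cells, responded):
    """Within each weighting cell, respondents absorb the weight of non-respondents."""
    total = defaultdict(float)
    resp = defaultdict(float)
    for w, c, r in zip(weights, cells, responded):
        total[c] += w
        if r:
            resp[c] += w
    # Non-respondents receive weight 0; respondents are scaled up within their cell.
    return [w * total[c] / resp[c] if r else 0.0
            for w, c, r in zip(weights, cells, responded)]


def ipw_weights(weights, response_probs, responded):
    """Divide each respondent's weight by the estimated probability of responding."""
    return [w / p if r else 0.0
            for w, p, r in zip(weights, response_probs, responded)]


# Hypothetical sample of five subjects in two weighting cells.
weights = [1.0, 2.0, 1.0, 1.5, 0.5]
cells = ["a", "a", "a", "b", "b"]
responded = [True, False, True, True, False]
adjusted = nrcw_weights(weights, cells, responded)
```

In the weighting-class adjustment, the adjusted weights of respondents sum to the original total weight within each cell, so the cell totals are preserved after attrition.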

3.7 |. Simulation Results

In Tables 5, 6, 7, and 8, we present the simulation results for the difference (linear) model, the rate of change (linear) model, the logistic regression model, and the Poisson regression model, respectively, in which the missing percentage is 30%. The column “mechanism” indicates the missing data mechanism (i.e., MCAR, MAR, or MNAR); the column “specification” shows the missingness model specification (i.e., under-specified, correctly specified, or over-specified); the column “effect” presents the covariates in the analysis model; and the column “true” shows the true value in the population, which is obtained by fitting the corresponding generalized linear model in the target population. The “bias” section shows the average bias over the 1,000 simulations for each missing data method (i.e., CC, IPW, NRCW, MI, and FIML), where the bias in each simulation is calculated by subtracting the corresponding true value from the point estimate. In addition, the “ESE; SE” section presents information on the variance estimation of the various methods. In particular, “SE” represents the average of the estimated standard errors of the regression coefficient over the 1,000 simulations, and “ESE” represents the empirical standard deviation of the 1,000 point estimates obtained from the simulations. Finally, the “coverage” section compares the coverage rates of the 95% confidence intervals resulting from the various methods, defined as the proportion of simulations in which the 95% confidence interval covers the true value of each parameter.
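The summary measures reported in the tables can be computed from the simulation replicates as in the sketch below. It assumes Wald-type intervals of the form estimate ± 1.96 × SE and an empirical standard deviation with an n − 1 denominator; the intervals actually used may instead be based on design-based degrees of freedom. The function name and the four replicate values are hypothetical.

```python
import statistics


def summarize(estimates, std_errors, true_value, z=1.96):
    """Bias, ESE, average SE, and 95% CI coverage across simulation replicates."""
    bias = statistics.mean(estimates) - true_value
    ese = statistics.stdev(estimates)   # empirical SD of the point estimates
    se = statistics.mean(std_errors)    # average of the estimated standard errors
    covered = [abs(est - true_value) <= z * s
               for est, s in zip(estimates, std_errors)]
    return {"bias": bias, "ESE": ese, "SE": se,
            "coverage": sum(covered) / len(covered)}


# Hypothetical results from four replicates for one coefficient.
result = summarize(estimates=[0.20, 0.19, 0.21, 0.35],
                   std_errors=[0.06, 0.06, 0.06, 0.06],
                   true_value=0.198)
```

An ESE close to the average SE indicates that the variance estimator is tracking the true sampling variability, which is why the two are reported side by side.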

TABLE 5.

Simulation Results from Difference Models (Linear Regression) for Various Methods and Missing Mechanisms

Mechanism Specification Effect True | Bias: CC IPW NRCW MI FIML | ESE; SE: CC IPW NRCW MI FIML | Coverage: CC IPW NRCW MI FIML

MCAR Under Int −2.354 0.009 0.009 0.006 −0.086 0.011 1.382; 1.331 1.383; 1.332 1.383; 1.331 1.223; 1.327 1.385; 1.330 0.946 0.947 0.946 0.974 0.946
x1 −0.013 −0.008 −0.008 −0.007 −0.001 −0.008 0.197; 0.203 0.197; 0.204 0.197; 0.204 0.176; 0.196 0.197; 0.203 0.955 0.956 0.954 0.971 0.956
x5 0.198 −0.002 −0.002 −0.002 0.005 −0.002 0.062; 0.061 0.062; 0.061 0.062; 0.061 0.053; 0.057 0.062; 0.061 0.945 0.945 0.945 0.959 0.944
x6 0.553 0.003 0.003 0.003 0.015 0.002 0.214; 0.205 0.214; 0.205 0.214; 0.205 0.189; 0.207 0.215; 0.205 0.940 0.940 0.939 0.971 0.940
y1 −0.032 −0.001 −0.001 −0.001 −0.002 −0.001 0.015; 0.014 0.015; 0.014 0.015; 0.014 0.013; 0.013 0.015; 0.014 0.939 0.943 0.939 0.950 0.942




Correct Int −2.354 0.009 0.010 0.019 −0.091 0.005 1.382; 1.331 1.386; 1.333 1.401; 1.350 1.244; 1.329 1.332; 1.257 0.946 0.948 0.946 0.971 0.943
x1 −0.013 −0.008 −0.008 −0.005 −0.006 −0.007 0.197; 0.203 0.197; 0.204 0.201; 0.207 0.177; 0.194 0.189; 0.191 0.955 0.955 0.950 0.966 0.949
x5 0.198 −0.002 −0.001 −0.001 0.000 0.000 0.062; 0.061 0.062; 0.061 0.063; 0.062 0.054; 0.057 0.057; 0.056 0.945 0.949 0.941 0.962 0.943
x6 0.553 0.003 0.002 0.001 0.019 0.002 0.214; 0.205 0.215; 0.206 0.218; 0.208 0.190; 0.207 0.206; 0.194 0.940 0.941 0.941 0.965 0.940
y1 −0.032 −0.001 −0.001 −0.001 −0.001 −0.001 0.015; 0.014 0.015; 0.014 0.015; 0.014 0.013; 0.013 0.014; 0.013 0.939 0.940 0.937 0.949 0.932




Over Int −2.354 0.009 0.010 0.014 −0.070 0.005 1.382; 1.331 1.386; 1.333 1.385; 1.336 1.248; 1.344 1.331; 1.257 0.946 0.947 0.948 0.965 0.942
x1 −0.013 −0.008 −0.007 −0.008 −0.003 −0.007 0.197; 0.203 0.197; 0.204 0.197; 0.204 0.177; 0.196 0.189; 0.191 0.955 0.955 0.954 0.970 0.950
x5 0.198 −0.002 −0.001 −0.002 −0.001 0.000 0.062; 0.061 0.062; 0.061 0.062; 0.061 0.054; 0.057 0.057; 0.056 0.945 0.950 0.940 0.962 0.943
x6 0.553 0.003 0.002 0.002 0.016 0.002 0.214; 0.205 0.215; 0.206 0.215; 0.206 0.192; 0.210 0.206; 0.194 0.940 0.941 0.941 0.965 0.939
y1 −0.032 −0.001 −0.001 −0.001 −0.001 −0.001 0.015; 0.014 0.015; 0.014 0.015; 0.014 0.013; 0.013 0.014; 0.013 0.939 0.940 0.940 0.955 0.931




MAR Under Int −2.354 −0.641 −0.638 −0.646 −0.538 −0.629 1.452; 1.383 1.451; 1.384 1.443; 1.384 1.264; 1.388 1.456; 1.382 0.907 0.911 0.910 0.954 0.910
x1 −0.013 −0.005 −0.007 −0.007 −0.008 −0.005 0.217; 0.211 0.217; 0.212 0.217; 0.212 0.190; 0.217 0.217; 0.211 0.942 0.939 0.940 0.972 0.943
x5 0.198 0.169 0.171 0.171 0.219 0.169 0.065; 0.064 0.065; 0.065 0.065; 0.064 0.058; 0.060 0.065; 0.064 0.254 0.250 0.247 0.056 0.249
x6 0.553 0.043 0.043 0.044 0.060 0.042 0.221; 0.214 0.221; 0.214 0.220; 0.214 0.191; 0.216 0.222; 0.214 0.939 0.943 0.943 0.965 0.937
y1 −0.032 0.006 0.006 0.006 −0.003 0.006 0.015; 0.014 0.015; 0.015 0.015; 0.015 0.014; 0.015 0.015; 0.014 0.913 0.914 0.913 0.960 0.915




Correct Int −2.354 −0.641 −0.021 −0.174 −0.349 −0.235 1.452; 1.383 1.577; 1.505 1.562; 1.515 1.262; 1.361 1.349; 1.274 0.907 0.931 0.937 0.959 0.923
x1 −0.013 −0.005 −0.012 −0.011 −0.011 −0.014 0.217; 0.211 0.234; 0.229 0.239; 0.231 0.182; 0.200 0.200; 0.194 0.942 0.954 0.943 0.962 0.939
x5 0.198 0.169 −0.001 0.005 −0.002 −0.002 0.065; 0.064 0.069; 0.067 0.070; 0.068 0.056; 0.059 0.060; 0.059 0.254 0.942 0.940 0.957 0.942
x6 0.553 0.043 0.005 0.017 0.062 0.041 0.221; 0.214 0.239; 0.232 0.234; 0.233 0.192; 0.212 0.205; 0.196 0.939 0.946 0.955 0.964 0.935
y1 −0.032 0.006 0.000 0.002 −0.001 0.000 0.015; 0.014 0.017; 0.016 0.017; 0.016 0.013; 0.014 0.014; 0.014 0.913 0.936 0.927 0.959 0.937




Over Int −2.354 −0.641 −0.018 −0.354 −0.347 −0.231 1.452; 1.383 1.577; 1.505 1.510; 1.431 1.277; 1.355 1.354; 1.274 0.907 0.929 0.922 0.956 0.921
x1 −0.013 −0.005 −0.012 −0.012 −0.012 −0.014 0.217; 0.211 0.234; 0.229 0.223; 0.219 0.181; 0.200 0.200; 0.194 0.942 0.955 0.945 0.969 0.939
x5 0.198 0.169 −0.001 0.021 −0.003 −0.002 0.065; 0.064 0.069; 0.067 0.066; 0.065 0.056; 0.059 0.060; 0.059 0.254 0.941 0.926 0.950 0.942
x6 0.553 0.043 0.005 0.023 0.060 0.040 0.221; 0.214 0.239; 0.232 0.228; 0.221 0.194; 0.211 0.205; 0.196 0.939 0.947 0.948 0.960 0.934
y1 −0.032 0.006 0.000 0.004 −0.001 0.000 0.015; 0.014 0.017; 0.016 0.016; 0.015 0.013; 0.014 0.014; 0.014 0.913 0.936 0.912 0.968 0.936




MNAR Under Int −2.354 −0.842 −0.491 −0.848 −0.852 −0.837 1.431; 1.352 1.390; 1.348 1.429; 1.353 1.261; 1.372 1.437; 1.352 0.883 0.932 0.884 0.924 0.884
x1 −0.013 0.002 0.006 0.002 0.000 0.002 0.209; 0.209 0.214; 0.207 0.209; 0.209 0.184; 0.214 0.209; 0.209 0.944 0.941 0.944 0.979 0.944
x5 0.198 0.095 0.008 0.095 0.141 0.095 0.065; 0.063 0.061; 0.062 0.065; 0.063 0.059; 0.060 0.065; 0.063 0.662 0.949 0.660 0.366 0.662
x6 0.553 0.059 0.047 0.060 0.072 0.058 0.220; 0.210 0.214; 0.209 0.220; 0.210 0.194; 0.213 0.221; 0.210 0.925 0.943 0.926 0.949 0.924
y1 −0.032 0.011 0.003 0.011 0.007 0.011 0.015; 0.014 0.015; 0.014 0.015; 0.014 0.013; 0.014 0.015; 0.014 0.859 0.934 0.859 0.933 0.860




Correct Int −2.354 −0.842 0.099 −0.366 −0.591 −0.475 1.431; 1.352 1.444; 1.392 1.537; 1.443 1.258; 1.352 1.321; 1.258 0.883 0.945 0.928 0.946 0.910
x1 −0.013 0.002 −0.004 −0.005 0.001 −0.002 0.209; 0.209 0.218; 0.213 0.223; 0.223 0.179; 0.200 0.196; 0.193 0.944 0.944 0.952 0.966 0.948
x5 0.198 0.095 −0.109 −0.037 −0.002 −0.001 0.065; 0.063 0.061; 0.063 0.067; 0.066 0.055; 0.059 0.059; 0.058 0.662 0.581 0.902 0.961 0.943
x6 0.553 0.059 0.014 0.034 0.069 0.048 0.220; 0.210 0.221; 0.215 0.237; 0.223 0.194; 0.211 0.202; 0.194 0.925 0.948 0.935 0.962 0.932
y1 −0.032 0.011 −0.006 0.005 0.000 0.000 0.015; 0.014 0.015; 0.015 0.016; 0.015 0.013; 0.014 0.014; 0.013 0.859 0.916 0.920 0.965 0.945




Over Int −2.354 −0.842 0.098 −0.491 −0.612 −0.475 1.431; 1.352 1.443; 1.393 1.467; 1.388 1.267; 1.356 1.321; 1.258 0.883 0.944 0.913 0.945 0.911
x1 −0.013 0.002 −0.004 −0.006 −0.001 −0.002 0.209; 0.209 0.218; 0.213 0.213; 0.214 0.179; 0.200 0.196; 0.193 0.944 0.944 0.948 0.966 0.947
x5 0.198 0.095 −0.109 −0.024 −0.001 −0.001 0.065; 0.063 0.061; 0.063 0.065; 0.063 0.055; 0.059 0.059; 0.058 0.662 0.582 0.926 0.961 0.943
x6 0.553 0.059 0.014 0.038 0.072 0.048 0.220; 0.210 0.221; 0.215 0.225; 0.215 0.195; 0.212 0.202; 0.194 0.925 0.948 0.939 0.958 0.932
y1 −0.032 0.011 −0.006 0.006 0.000 0.000 0.015; 0.014 0.015; 0.015 0.015; 0.014 0.013; 0.014 0.014; 0.013 0.859 0.916 0.910 0.963 0.945

Abbreviations: CC = complete-case analysis; IPW = inverse probability weighting; NRCW = non-response cell weighting; MI = multiple imputation; FIML = full information maximum likelihood approach (full model); MCAR = missing-completely-at-random; MAR = missing-at-random; MNAR = missing-not-at-random.

Notation: bias = average bias from 1,000 simulations; ESE = empirical standard deviation of the point estimates from 1,000 simulations; SE = average standard error estimates from 1,000 simulations; coverage = 95% confidence interval coverage rate from 1,000 simulations.

TABLE 6.

Simulation Results from Rate of Change Models (Linear Regression) for Various Methods and Missing Mechanisms

Mechanism Specification Effect True | Bias: CC IPW NRCW MI FIML | ESE; SE: CC IPW NRCW MI FIML | Coverage: CC IPW NRCW MI FIML

MCAR Under Int 0.160 0.004 0.004 0.004 0.000 0.004 0.085; 0.082 0.085; 0.082 0.085; 0.082 0.075; 0.078 0.085; 0.082 0.943 0.946 0.946 0.949 0.946
x1 0.003 −0.001 −0.001 −0.001 0.000 −0.001 0.033; 0.034 0.033; 0.035 0.033; 0.034 0.030; 0.033 0.033; 0.034 0.952 0.952 0.952 0.960 0.954
x5 0.033 0.000 0.000 0.000 0.001 0.000 0.011; 0.010 0.011; 0.010 0.010; 0.010 0.009; 0.010 0.010; 0.010 0.941 0.944 0.941 0.961 0.944
y1 −0.005 0.000 0.000 0.000 0.000 0.000 0.003; 0.002 0.003; 0.002 0.003; 0.002 0.002; 0.002 0.003; 0.002 0.942 0.943 0.944 0.953 0.943




Correct Int 0.160 0.004 0.004 0.004 0.004 0.003 0.085; 0.082 0.085; 0.082 0.087; 0.084 0.076; 0.078 0.081; 0.078 0.943 0.948 0.940 0.956 0.941
x1 0.003 −0.001 −0.001 −0.001 0.000 −0.001 0.033; 0.034 0.033; 0.035 0.034; 0.035 0.030; 0.033 0.032; 0.032 0.952 0.952 0.947 0.967 0.951
x5 0.033 0.000 0.000 0.000 0.000 0.000 0.011; 0.010 0.010; 0.010 0.011; 0.010 0.009; 0.010 0.010; 0.010 0.941 0.941 0.942 0.957 0.944
y1 −0.005 0.000 0.000 0.000 0.000 0.000 0.003; 0.002 0.003; 0.002 0.003; 0.002 0.002; 0.002 0.002; 0.002 0.942 0.947 0.936 0.958 0.939




Over Int 0.160 0.004 0.004 0.004 0.005 0.003 0.085; 0.082 0.085; 0.082 0.085; 0.083 0.075; 0.078 0.081; 0.078 0.943 0.948 0.942 0.955 0.941
x1 0.003 −0.001 −0.001 −0.001 −0.001 −0.001 0.033; 0.034 0.033; 0.035 0.033; 0.035 0.030; 0.033 0.032; 0.032 0.952 0.952 0.951 0.967 0.950
x5 0.033 0.000 0.000 0.000 0.000 0.000 0.011; 0.010 0.010; 0.010 0.011; 0.010 0.009; 0.010 0.010; 0.010 0.941 0.941 0.939 0.957 0.945
y1 −0.005 0.000 0.000 0.000 0.000 0.000 0.003; 0.002 0.003; 0.002 0.003; 0.002 0.002; 0.002 0.002; 0.002 0.942 0.947 0.945 0.957 0.940




MAR Under Int 0.160 −0.065 −0.065 −0.065 −0.030 −0.064 0.090; 0.084 0.090; 0.084 0.090; 0.084 0.079; 0.084 0.090; 0.084 0.857 0.859 0.854 0.955 0.859
x1 0.003 −0.001 −0.001 −0.001 −0.001 −0.001 0.037; 0.036 0.037; 0.036 0.037; 0.036 0.032; 0.037 0.037; 0.036 0.940 0.940 0.940 0.975 0.939
x5 0.033 0.028 0.029 0.029 0.037 0.028 0.011; 0.011 0.011; 0.011 0.011; 0.011 0.010; 0.010 0.011; 0.011 0.261 0.263 0.256 0.056 0.261
y1 −0.005 0.001 0.001 0.001 0.000 0.001 0.003; 0.002 0.003; 0.002 0.003; 0.002 0.002; 0.002 0.003; 0.002 0.913 0.914 0.914 0.960 0.914




Correct Int 0.160 −0.065 0.002 −0.013 0.004 0.001 0.090; 0.084 0.097; 0.092 0.100; 0.094 0.079; 0.080 0.085; 0.080 0.857 0.936 0.931 0.952 0.934
x1 0.003 −0.001 −0.002 −0.002 −0.002 −0.002 0.037; 0.036 0.040; 0.039 0.041; 0.039 0.031; 0.034 0.034; 0.033 0.940 0.950 0.940 0.967 0.941
x5 0.033 0.028 0.000 0.001 −0.001 0.000 0.011; 0.011 0.012; 0.011 0.012; 0.012 0.010; 0.010 0.010; 0.010 0.261 0.941 0.938 0.954 0.945
y1 −0.005 0.001 0.000 0.000 0.000 0.000 0.003; 0.002 0.003; 0.003 0.003; 0.003 0.002; 0.002 0.002; 0.002 0.913 0.939 0.932 0.960 0.943




Over Int 0.160 −0.065 0.002 −0.037 0.002 0.001 0.090; 0.084 0.097; 0.092 0.095; 0.088 0.078; 0.081 0.085; 0.080 0.857 0.935 0.920 0.955 0.934
x1 0.003 −0.001 −0.002 −0.002 −0.002 −0.002 0.037; 0.036 0.040; 0.039 0.038; 0.037 0.032; 0.034 0.034; 0.033 0.940 0.950 0.945 0.958 0.942
x5 0.033 0.028 0.000 0.004 −0.001 0.000 0.011; 0.011 0.012; 0.011 0.011; 0.011 0.009; 0.010 0.010; 0.010 0.261 0.941 0.925 0.960 0.945
y1 −0.005 0.001 0.000 0.001 0.000 0.000 0.003; 0.002 0.003; 0.003 0.003; 0.003 0.002; 0.002 0.002; 0.002 0.913 0.939 0.921 0.959 0.945




MNAR Under Int 0.160 −0.083 −0.036 −0.083 −0.072 −0.083 0.086; 0.081 0.086; 0.081 0.087; 0.081 0.077; 0.082 0.086; 0.081 0.795 0.908 0.789 0.878 0.795
x1 0.003 0.001 0.002 0.001 0.000 0.001 0.035; 0.035 0.036; 0.035 0.035; 0.035 0.032; 0.036 0.035; 0.035 0.945 0.944 0.946 0.970 0.945
x5 0.033 0.016 0.001 0.016 0.024 0.016 0.011; 0.011 0.010; 0.011 0.011; 0.011 0.010; 0.010 0.011; 0.011 0.659 0.953 0.659 0.397 0.662
y1 −0.005 0.002 0.001 0.002 0.001 0.002 0.002; 0.002 0.003; 0.002 0.002; 0.002 0.002; 0.002 0.002; 0.002 0.860 0.937 0.857 0.947 0.860




Correct Int 0.160 −0.083 0.031 −0.028 −0.030 −0.032 0.086; 0.081 0.089; 0.085 0.093; 0.088 0.077; 0.079 0.083; 0.078 0.795 0.928 0.921 0.945 0.918
x1 0.003 0.001 0.000 0.000 0.001 0.000 0.035; 0.035 0.037; 0.036 0.038; 0.038 0.031; 0.034 0.033; 0.033 0.945 0.949 0.954 0.974 0.950
x5 0.033 0.016 −0.018 −0.006 0.000 0.000 0.011; 0.011 0.010; 0.011 0.011; 0.011 0.009; 0.010 0.010; 0.010 0.659 0.584 0.901 0.961 0.940
y1 −0.005 0.002 −0.001 0.001 0.000 0.000 0.002; 0.002 0.003; 0.002 0.003; 0.003 0.002; 0.002 0.002; 0.002 0.860 0.916 0.925 0.967 0.945




Over Int 0.160 −0.083 0.030 −0.045 −0.032 −0.031 0.086; 0.081 0.089; 0.085 0.090; 0.084 0.076; 0.079 0.083; 0.078 0.795 0.925 0.901 0.940 0.917
x1 0.003 0.001 0.000 −0.001 0.000 0.000 0.035; 0.035 0.037; 0.036 0.036; 0.036 0.030; 0.034 0.033; 0.033 0.945 0.945 0.953 0.966 0.950
x5 0.033 0.016 −0.018 −0.004 0.000 0.000 0.011; 0.011 0.010; 0.011 0.011; 0.011 0.009; 0.010 0.010; 0.010 0.659 0.581 0.931 0.959 0.940
y1 −0.005 0.002 −0.001 0.001 0.000 0.000 0.002; 0.002 0.003; 0.002 0.003; 0.002 0.002; 0.002 0.002; 0.002 0.860 0.916 0.917 0.958 0.945

Abbreviations: CC = complete-case analysis; IPW = inverse probability weighting; NRCW = non-response cell weighting; MI = multiple imputation; FIML = full information maximum likelihood approach (full model); MCAR = missing-completely-at-random; MAR = missing-at-random; MNAR = missing-not-at-random.

Notation: bias = average bias from 1,000 simulations; ESE = empirical standard deviation of the point estimates from 1,000 simulations; SE = average standard error estimates from 1,000 simulations; coverage = 95% confidence interval coverage rate from 1,000 simulations.

TABLE 7.

Simulation Results from Binary Models (Logistic Regression) for Various Methods and Missing Mechanisms

Mechanism Specification Effect True | Bias: CC IPW NRCW MI | ESE; SE: CC IPW NRCW MI | Coverage: CC IPW NRCW MI

MCAR Under Int −2.669 0.013 0.013 0.005 0.059 1.407; 1.344 1.407; 1.345 1.405; 1.347 1.235; 1.402 0.931 0.938 0.931 0.967
x7 0.106 0.007 0.007 0.007 0.005 0.209; 0.211 0.209; 0.211 0.211; 0.211 0.187; 0.218 0.948 0.947 0.945 0.974
x6 −0.241 −0.004 −0.005 −0.003 0.004 0.219; 0.209 0.219; 0.209 0.219; 0.209 0.193; 0.220 0.938 0.941 0.940 0.968
y1 0.053 0.000 0.000 0.000 −0.002 0.014; 0.014 0.014; 0.014 0.014; 0.014 0.013; 0.015 0.943 0.944 0.941 0.975




Correct Int −2.669 0.013 0.011 −0.002 −0.024 1.407; 1.344 1.410; 1.345 1.441; 1.364 1.238; 1.374 0.931 0.936 0.933 0.970
x7 0.106 0.007 0.007 0.005 0.006 0.209; 0.211 0.210; 0.211 0.216; 0.214 0.182; 0.212 0.948 0.948 0.947 0.975
x6 −0.241 −0.004 −0.004 −0.001 0.005 0.219; 0.209 0.220; 0.209 0.224; 0.212 0.193; 0.215 0.938 0.941 0.939 0.972
y1 0.053 0.000 0.000 0.000 −0.001 0.014; 0.014 0.014; 0.014 0.015; 0.014 0.013; 0.015 0.943 0.944 0.943 0.975




Over Int −2.669 0.013 0.011 0.007 −0.025 1.407; 1.344 1.410; 1.345 1.412; 1.353 1.263; 1.364 0.931 0.936 0.931 0.963
x7 0.106 0.007 0.007 0.008 0.006 0.209; 0.211 0.210; 0.211 0.212; 0.212 0.184; 0.211 0.948 0.948 0.947 0.976
x6 −0.241 −0.004 −0.004 −0.003 0.004 0.219; 0.209 0.220; 0.209 0.220; 0.210 0.197; 0.214 0.938 0.941 0.942 0.961
y1 0.053 0.000 0.000 0.000 0.000 0.014; 0.014 0.014; 0.014 0.014; 0.014 0.013; 0.014 0.943 0.944 0.941 0.970




MAR Under Int −2.669 0.413 0.382 0.385 0.517 1.278; 1.255 1.280; 1.257 1.279; 1.256 1.103; 1.319 0.937 0.940 0.940 0.967
x7 0.106 0.011 0.012 0.011 0.013 0.197; 0.196 0.197; 0.196 0.197; 0.196 0.170; 0.208 0.946 0.949 0.947 0.985
x6 −0.241 0.020 0.025 0.024 0.029 0.199; 0.195 0.200; 0.195 0.199; 0.195 0.174; 0.207 0.941 0.941 0.941 0.979
y1 0.053 −0.009 −0.009 −0.009 −0.013 0.014; 0.013 0.014; 0.013 0.014; 0.013 0.012; 0.014 0.872 0.875 0.873 0.891




Correct Int −2.669 0.413 0.006 0.095 −0.116 1.278; 1.255 1.380; 1.347 1.407; 1.382 1.196; 1.307 0.937 0.946 0.953 0.967
x7 0.106 0.011 0.006 0.008 0.010 0.197; 0.196 0.213; 0.208 0.214; 0.213 0.179; 0.201 0.946 0.946 0.951 0.967
x6 −0.241 0.020 −0.005 −0.011 0.028 0.199; 0.195 0.213; 0.208 0.217; 0.213 0.186; 0.203 0.941 0.958 0.954 0.960
y1 0.053 −0.009 0.000 −0.001 −0.002 0.014; 0.013 0.015; 0.014 0.015; 0.015 0.013; 0.014 0.872 0.947 0.938 0.970




Over Int −2.669 0.413 0.008 0.266 −0.130 1.278; 1.255 1.380; 1.346 1.349; 1.338 1.195; 1.312 0.937 0.946 0.948 0.972
x7 0.106 0.011 0.006 0.002 0.012 0.197; 0.196 0.213; 0.208 0.212; 0.208 0.180; 0.200 0.946 0.948 0.949 0.966
x6 −0.241 0.020 −0.005 −0.002 0.030 0.199; 0.195 0.213; 0.208 0.210; 0.207 0.185; 0.204 0.941 0.958 0.940 0.975
y1 0.053 −0.009 0.000 −0.005 −0.002 0.014; 0.013 0.015; 0.014 0.015; 0.014 0.013; 0.014 0.872 0.948 0.923 0.973




MNAR Under Int −2.669 0.053 0.035 0.058 −0.019 1.795; 1.680 2.019; 1.964 1.800; 1.686 1.642; 1.699 0.925 0.935 0.923 0.953
x7 0.106 0.019 0.022 0.017 0.019 0.279; 0.266 0.329; 0.312 0.278; 0.266 0.257; 0.268 0.938 0.930 0.940 0.960
x6 −0.241 −0.072 −0.097 −0.072 −0.061 0.281; 0.262 0.320; 0.307 0.281; 0.262 0.256; 0.265 0.922 0.933 0.920 0.946
y1 0.053 −0.009 −0.016 −0.009 −0.008 0.019; 0.018 0.022; 0.021 0.019; 0.018 0.017; 0.018 0.907 0.868 0.904 0.957




Correct Int −2.669 0.053 −0.135 0.021 0.067 1.795; 1.680 2.203; 2.070 1.900; 1.755 1.558; 1.653 0.925 0.937 0.925 0.965
x7 0.106 0.019 0.007 0.009 0.021 0.279; 0.266 0.364; 0.334 0.295; 0.280 0.241; 0.258 0.938 0.944 0.942 0.956
x6 −0.241 −0.072 −0.047 −0.036 −0.050 0.281; 0.262 0.351; 0.326 0.299; 0.274 0.244; 0.260 0.922 0.929 0.926 0.964
y1 0.053 −0.009 −0.014 −0.008 −0.006 0.019; 0.018 0.024; 0.021 0.020; 0.018 0.016; 0.017 0.907 0.852 0.892 0.965




Over Int −2.669 0.053 −0.133 0.115 0.112 1.795; 1.680 2.197; 2.066 1.812; 1.677 1.579; 1.646 0.925 0.937 0.922 0.959
x7 0.106 0.019 0.007 0.025 0.020 0.279; 0.266 0.362; 0.333 0.274; 0.265 0.243; 0.255 0.938 0.945 0.935 0.959
x6 −0.241 −0.072 −0.047 −0.054 −0.057 0.281; 0.262 0.350; 0.325 0.282; 0.261 0.248; 0.259 0.922 0.927 0.928 0.957
y1 0.053 −0.009 −0.014 −0.011 −0.006 0.019; 0.018 0.024; 0.021 0.018; 0.018 0.016; 0.017 0.907 0.855 0.879 0.957

Abbreviations: CC = complete-case analysis; IPW = inverse probability weighting; NRCW = non-response cell weighting; MI = multiple imputation; FIML = full information maximum likelihood approach (full model); MCAR = missing-completely-at-random; MAR = missing-at-random; MNAR = missing-not-at-random.

Notation: bias = average bias from 1,000 simulations; ESE = empirical standard deviation of the point estimates from 1,000 simulations; SE = average standard error estimates from 1,000 simulations; coverage = 95% confidence interval coverage rate from 1,000 simulations.

TABLE 8.

Simulation Results from Incidence Models (Poisson Regression) for Various Methods and Missing Mechanisms

Mechanism Specification Effect True | Bias: CC IPW NRCW MI | ESE; SE: CC IPW NRCW MI | Coverage: CC IPW NRCW MI

MCAR Under Int −4.418 −0.010 −0.010 −0.011 0.007 0.195; 0.188 0.195; 0.188 0.193; 0.189 0.186; 0.192 0.938 0.939 0.939 0.959
x8 0.329 −0.011 −0.011 −0.011 0.006 0.202; 0.199 0.202; 0.199 0.202; 0.200 0.185; 0.205 0.934 0.935 0.937 0.960
x5 −0.002 −0.001 −0.001 −0.001 0.000 0.051; 0.048 0.051; 0.048 0.051; 0.049 0.047; 0.050 0.938 0.938 0.940 0.958




Correct Int −4.418 −0.010 −0.009 −0.010 −0.003 0.195; 0.188 0.194; 0.188 0.195; 0.191 0.179; 0.191 0.938 0.939 0.943 0.968
x8 0.329 −0.011 −0.011 −0.011 −0.007 0.202; 0.199 0.202; 0.199 0.205; 0.202 0.184; 0.201 0.934 0.936 0.937 0.966
x5 −0.002 −0.001 −0.001 −0.001 −0.001 0.051; 0.048 0.051; 0.048 0.052; 0.049 0.046; 0.049 0.938 0.936 0.936 0.963




Over Int −4.418 −0.010 −0.009 −0.010 −0.001 0.195; 0.188 0.194; 0.188 0.192; 0.189 0.182; 0.193 0.938 0.940 0.944 0.963
x8 0.329 −0.011 −0.011 −0.010 −0.008 0.202; 0.199 0.202; 0.199 0.202; 0.200 0.186; 0.201 0.934 0.938 0.938 0.968
x5 −0.002 −0.001 −0.001 −0.001 0.000 0.051; 0.048 0.051; 0.048 0.051; 0.049 0.047; 0.050 0.938 0.936 0.941 0.956




MAR Under Int −4.418 0.280 0.280 0.280 0.288 0.173; 0.172 0.174; 0.172 0.173; 0.172 0.171; 0.178 0.626 0.633 0.631 0.663
x8 0.329 −0.084 −0.086 −0.086 −0.075 0.186; 0.181 0.186; 0.181 0.186; 0.181 0.171; 0.189 0.916 0.916 0.916 0.956
x5 −0.002 0.000 0.000 0.000 0.002 0.045; 0.044 0.045; 0.044 0.045; 0.044 0.041; 0.046 0.941 0.940 0.938 0.969




Correct Int −4.418 0.280 −0.010 0.019 −0.003 0.173; 0.172 0.189; 0.188 0.193; 0.191 0.176; 0.183 0.626 0.949 0.950 0.957
x8 0.329 −0.084 −0.010 −0.015 −0.007 0.186; 0.181 0.201; 0.197 0.204; 0.202 0.181; 0.191 0.916 0.942 0.941 0.958
x5 −0.002 0.000 −0.002 −0.001 0.000 0.045; 0.044 0.050; 0.048 0.051; 0.049 0.044; 0.047 0.941 0.930 0.938 0.966




Over Int −4.418 0.280 −0.010 0.092 0.001 0.173; 0.172 0.189; 0.188 0.188; 0.186 0.175; 0.187 0.626 0.947 0.920 0.957
x8 0.329 −0.084 −0.010 0.011 −0.007 0.186; 0.181 0.201; 0.197 0.198; 0.193 0.180; 0.194 0.916 0.942 0.934 0.968
x5 −0.002 0.000 −0.002 0.000 0.000 0.045; 0.044 0.050; 0.048 0.048; 0.048 0.044; 0.048 0.941 0.929 0.937 0.966




MNAR Under Int −4.418 −0.645 −0.645 −0.647 −0.636 0.258; 0.247 0.258; 0.247 0.259; 0.247 0.243; 0.246 0.272 0.278 0.276 0.278
x8 0.329 0.041 0.042 0.042 0.051 0.280; 0.261 0.280; 0.261 0.280; 0.261 0.261; 0.260 0.922 0.922 0.924 0.938
x5 −0.002 −0.002 −0.002 −0.002 −0.001 0.067; 0.063 0.067; 0.063 0.067; 0.063 0.063; 0.063 0.931 0.932 0.933 0.948




Correct Int −4.418 −0.645 −0.384 −0.420 −0.406 0.258; 0.247 0.277; 0.260 0.270; 0.258 0.224; 0.240 0.272 0.636 0.586 0.632
x8 0.329 0.041 0.032 0.030 0.024 0.280; 0.261 0.296; 0.275 0.282; 0.272 0.232; 0.247 0.922 0.933 0.935 0.962
x5 −0.002 −0.002 −0.001 −0.001 −0.001 0.067; 0.063 0.071; 0.066 0.070; 0.066 0.055; 0.061 0.931 0.937 0.936 0.967




Over Int −4.418 −0.645 −0.384 −0.476 −0.405 0.258; 0.247 0.276; 0.259 0.257; 0.245 0.224; 0.241 0.272 0.637 0.490 0.640
x8 0.329 0.041 0.033 −0.016 0.017 0.280; 0.261 0.295; 0.275 0.275; 0.258 0.236; 0.248 0.922 0.934 0.929 0.958
x5 −0.002 −0.002 −0.001 −0.002 −0.001 0.067; 0.063 0.071; 0.066 0.066; 0.062 0.056; 0.061 0.931 0.937 0.931 0.965

Abbreviations: CC = complete-case analysis; IPW = inverse probability weighting; NRCW = non-response cell weighting; MI = multiple imputation; FIML = full information maximum likelihood approach (full model); MCAR = missing-completely-at-random; MAR = missing-at-random; MNAR = missing-not-at-random.

Notation: bias = average bias from 1,000 simulations; ESE = empirical standard deviation of the point estimates from 1,000 simulations; SE = average standard error estimates from 1,000 simulations; coverage = 95% confidence interval coverage rate from 1,000 simulations.
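
The summary measures defined in the table notes can be computed from replicate-level output along the following lines. This is a minimal sketch, not the authors' simulation code: the function name `summarize_simulation`, the normal-approximation Wald interval (multiplier 1.96), and the simulated replicates are all assumptions made for illustration.

```python
import numpy as np

def summarize_simulation(estimates, std_errors, true_value, z=1.96):
    """Summarize replicate estimates of a single coefficient.

    estimates, std_errors: one entry per simulated data set.
    The Wald interval est +/- z * se is an assumption of this sketch.
    """
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    bias = est.mean() - true_value        # average bias
    ese = est.std(ddof=1)                 # empirical SD of the point estimates
    avg_se = se.mean()                    # average estimated standard error
    lower, upper = est - z * se, est + z * se
    coverage = np.mean((lower <= true_value) & (true_value <= upper))
    return {"bias": bias, "ESE": ese, "SE": avg_se, "coverage": coverage}

# Hypothetical replicates (not the paper's data): an unbiased estimator
# whose reported standard error matches its true sampling variability.
rng = np.random.default_rng(0)
est = rng.normal(loc=0.329, scale=0.2, size=1000)
se = np.full(1000, 0.2)
summary = summarize_simulation(est, se, true_value=0.329)
print(summary)
```

Reading ESE next to the average SE shows whether the variance estimator is calibrated, while coverage reflects both calibration and bias.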

As the tables show, under the MCAR missing data mechanism all of the missing data methods, including CC analysis, produced approximately unbiased estimates of all the regression coefficients, as expected, regardless of whether the missingness model was under-specified, correctly specified, or over-specified. The average estimated standard errors are close to the corresponding empirical standard errors, and all of the coverage rates for the 95% confidence intervals are close to the nominal level. Under the MAR missing data mechanism, the CC analysis generally produces biased estimates. For example, in the difference model, the bias for the regression coefficient of X5 is as high as 0.169, leading to a 95% confidence interval coverage rate as low as 0.247. Similar observations can be made in the simulation results for the other types of generalized linear models. Additionally, under MAR, when the missingness model is under-specified, none of the missing data methods is guaranteed to work. For example, in the rate of change model, the biases for the coefficient of X5 are 0.029, 0.029, 0.037, and 0.028 for IPW, NRCW, MI, and FIML, respectively, leading to 95% confidence interval coverage rates of 0.263, 0.256, 0.056, and 0.261. Under-coverage of the 95% confidence interval due to high bias also appears for the covariate X5 in the difference model, for Y1 in the logistic model, and for the intercept terms in all four types of models. In these under-specified MAR scenarios, bias appears in the coefficient estimates for some covariates but not others, which can be traced to the underlying data generation process.
Specifically, in the difference model and the rate of change model, the outcome of interest is associated with X5 via an interaction term with I{Age45}, which implies that X5 has different effects on the outcome in different age groups. Therefore, we see bias in X5 in those scenarios where the missingness model is under-specified and does not adjust for age. For the logistic model, it is again the misspecification of the missingness model that leads to the bias for Y1, since Y1 is the only covariate in the model that is associated with age. For the Poisson model, neither X5 nor X8 has an interaction effect with age on the outcome, and both are independent of age. Yet the outcome is associated with age because of the way the data were generated; consequently, we see bias for the intercept estimates only. Because some of the coefficient estimates are biased, and because in real studies we would not know which estimates are biased and which are not, we conclude that none of the methods works well in this situation. However, when the missingness model is correctly specified or over-specified in IPW, NRCW, MI, and FIML, the bias is close to zero, the average estimated standard errors are close to the empirical standard deviations, and the 95% confidence interval coverage rates are close to the nominal level. We also observe that the variance estimates for the correctly specified and over-specified models tend to be very similar. In addition, comparing the empirical standard errors across the missing data methods shows that those from MI are generally lower than those of the other methods, which supports the missing data theory that MI is often more efficient.5,21 Finally, under the MNAR missing data mechanism, as expected, none of the missing data methods is guaranteed to provide valid estimates and inference.

We also conducted additional simulations with different missing percentages; the results are presented in the Supplementary Material, and the conclusions are similar. The simulation results for the “reduced” models using the FIML approach can also be found in the Supplementary Material; the conclusions are similar to those from the “full” models using the FIML approach. Sample code for the simulations is available at https://github.com/jianwen-cai/HCHSMissingData.

4 |. APPLICATION

We applied the IPW, NRCW, MI, and FIML methods to data from HCHS/SOL. Of the 16,415 participants at Visit 1, 11,623 returned for Visit 2. We fit three models: (1) a linear regression for the difference in eGFR between Visit 1 and Visit 2, (2) a linear regression for the difference in eGFR between Visit 1 and Visit 2 divided by the years between the two visits (i.e., the rate of change of eGFR), and (3) a Poisson regression for the incidence of chronic kidney disease (CKD) between visits. The Poisson regression included only participants who did not have CKD at Visit 1 and used the log-transformed years between visits as an offset term. All three analyses included the following covariates: age, sex, field center, Hispanic/Latino background, education (high school graduate/GED vs. not), baseline eGFR, smoking status, born in the US or not, and hours per day of moderate or vigorous physical activity. In addition, the eGFR difference model included years between visits as an additional covariate.
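
To make the role of the offset concrete, the sketch below fits a weighted Poisson regression in which log(years between visits) enters the linear predictor with its coefficient fixed at 1, so the coefficients describe log rates per year. It is an illustration under stated assumptions, not the analysis code used for HCHS/SOL: the function `fit_weighted_poisson`, the Newton-Raphson fitting, and the simulated counts, covariate, and weights are hypothetical, and only point estimates are produced (no design-based standard errors).

```python
import numpy as np

def fit_weighted_poisson(X, y, exposure, weights, n_iter=100, tol=1e-10):
    """Weighted Poisson regression with offset log(exposure).

    Assumes the first column of X is an intercept column of ones.
    Solves the weighted score equations by Newton-Raphson.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    offset = np.log(np.asarray(exposure, dtype=float))
    beta = np.zeros(X.shape[1])
    # Start the intercept at the log of the weighted crude rate.
    beta[0] = np.log(np.sum(w * y) / np.sum(w * np.exp(offset)))
    for _ in range(n_iter):
        mu = np.exp(offset + X @ beta)            # expected counts
        score = X.T @ (w * (y - mu))              # weighted score vector
        info = X.T @ (X * (w * mu)[:, None])      # weighted information matrix
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical data: one binary covariate, 5 to 7 years of follow-up.
rng = np.random.default_rng(1)
n = 5000
x = rng.binomial(1, 0.5, size=n)
years = rng.uniform(5.0, 7.0, size=n)
true_beta = np.array([-2.0, 0.5])
y = rng.poisson(years * np.exp(true_beta[0] + true_beta[1] * x))
X = np.column_stack([np.ones(n), x])
w = rng.uniform(0.5, 2.0, size=n)   # stand-in for sampling weights
beta_hat = fit_weighted_poisson(X, y, years, w)
print(beta_hat)
```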

For each outcome, we compared the results from IPW, NRCW, MI, and FIML methods. We first used a classification tree approach26 to identify the variables that are related to returning to Visit 2. The following covariates were identified: sex, sampling strata, refusal to participate in annual follow-up (AFU) data collection, natural log of distance between residence and data collection clinic, Hispanic/Latino background, baseline age, baseline eGFR, and baseline education level. These covariates were then used to estimate the IPW weights and the NRCW weights.

For the IPW weights, a logistic regression model was used to estimate the probability of attending Visit 2. The logistic regression model included the identified covariates as well as selected two-way interaction terms. The two-way interaction terms were selected using a backward variable selection procedure with a p-value threshold of 0.05. The IPW weights were then calculated as in Section 2.1.
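
The general shape of this step can be sketched as follows. The code assumes the common form of the adjustment, in which a respondent's baseline sampling weight is divided by the estimated probability of returning; the exact formula in Section 2.1 should be taken as authoritative. The helper names `fit_logistic` and `ipw_adjusted_weights`, the unweighted fit of the response model, and the simulated data are assumptions for illustration.

```python
import numpy as np

def fit_logistic(X, r, weights=None, n_iter=100, tol=1e-10):
    """Logistic regression for the response indicator r by Newton-Raphson."""
    X = np.asarray(X, dtype=float)
    r = np.asarray(r, dtype=float)
    w = np.ones(len(r)) if weights is None else np.asarray(weights, dtype=float)
    gamma = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ gamma)))
        score = X.T @ (w * (r - p))
        info = X.T @ (X * (w * p * (1.0 - p))[:, None])
        step = np.linalg.solve(info, score)
        gamma = gamma + step
        if np.max(np.abs(step)) < tol:
            break
    return gamma

def ipw_adjusted_weights(base_weights, responded, p_hat):
    """Respondents get base weight / estimated response probability; others get 0."""
    base = np.asarray(base_weights, dtype=float)
    resp = np.asarray(responded).astype(bool)
    return np.where(resp, base / np.asarray(p_hat, dtype=float), 0.0)

# Hypothetical data: returning to the follow-up visit depends on one covariate.
rng = np.random.default_rng(2)
n = 4000
z = rng.normal(size=n)
X = np.column_stack([np.ones(n), z])
p_true = 1.0 / (1.0 + np.exp(-(X @ np.array([1.0, 0.8]))))
responded = rng.binomial(1, p_true)
base_w = rng.uniform(1.0, 3.0, size=n)   # stand-in for baseline sampling weights

gamma_hat = fit_logistic(X, responded)
p_hat = 1.0 / (1.0 + np.exp(-(X @ gamma_hat)))
adj_w = ipw_adjusted_weights(base_w, responded, p_hat)
print(gamma_hat, adj_w.sum() / base_w.sum())
```

When the response model is adequate, the adjusted weights of the respondents sum to roughly the total baseline weight of the full sample, which is a quick sanity check on the fitted probabilities.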

For the NRCW weights, the continuous variables in the identified covariate list were first categorized using the cutpoints provided by the classification tree approach. These categorized variables, together with the identified categorical variables, were then used to form post-stratification strata. The proportion of participants who attended Visit 2 in each stratum was then calculated. The NRCW weights were calculated as in Section 2.2.
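
A minimal sketch of the cell-weighting step is given below. It assumes the usual form of the adjustment, in which each respondent's baseline weight is divided by the response proportion of its weighting class; Section 2.2 remains the authoritative definition. The function `nrcw_adjusted_weights`, the `use_base_weights` switch (unweighted versus weight-based response rates), and the toy inputs are hypothetical.

```python
from collections import defaultdict

def nrcw_adjusted_weights(base_weights, responded, cells, use_base_weights=False):
    """Divide each respondent's base weight by its cell's response rate.

    With use_base_weights=False the rate is the plain proportion of cell
    members who responded; with True it is the base-weighted proportion.
    Nonrespondents receive weight 0.
    """
    total = defaultdict(float)
    resp = defaultdict(float)
    for w, r, c in zip(base_weights, responded, cells):
        unit = w if use_base_weights else 1.0
        total[c] += unit
        if r:
            resp[c] += unit
    # An empty class has no respondents to carry the weight of its members.
    empty = [c for c in total if resp[c] == 0.0]
    if empty:
        raise ValueError(f"weighting classes with no respondents: {empty}")
    return [w * total[c] / resp[c] if r else 0.0
            for w, r, c in zip(base_weights, responded, cells)]

# Toy example: two weighting classes, each with a 50% response rate.
cells = ["A", "A", "A", "A", "B", "B"]
responded = [1, 1, 0, 0, 1, 0]
base_w = [10.0, 10.0, 10.0, 10.0, 20.0, 20.0]
adj_w = nrcw_adjusted_weights(base_w, responded, cells)
print(adj_w)
```

Raising an error for classes without respondents makes the sparse-cell problem visible early, so that classes can be collapsed before weights are computed.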

For the MI approach, the multiple imputation models included all covariates that were identified to be related to attendance at Visit 2, all covariates that were included in the final regression models for the outcome of interest, baseline sampling weights, and selected auxiliary variables and two-way interaction terms. The auxiliary variables were selected from the following set of initial baseline variables: marital status, employment status, health insurance, language preference, years lived in the U.S., body mass index, asthma, systolic blood pressure, diastolic blood pressure, total cholesterol, high-density lipoprotein (HDL) cholesterol, low-density lipoprotein (LDL) cholesterol, diabetes, cardiovascular heart disease, previous myocardial infarction, Center for Epidemiologic Studies Depression Scale (CES-D), State-Trait Anxiety Inventory (STAI), frequency of doctor visits, periodontal disease, and diet quality. The sets of auxiliary variables and two-way interaction terms in each imputation model were identified through a backward variable selection procedure based on a linear regression model for years between the two visits, a linear regression model for eGFR at Visit 2, or a logistic regression model for CKD at Visit 2. Fully conditional specification (FCS) imputation models were fit using the MI procedure in SAS.
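
Once each imputed data set has been analyzed, the per-imputation results are typically pooled with Rubin's rules. The text above does not spell out the pooling step, so the sketch below is an assumption about the standard procedure rather than a description of the authors' code; the function `pool_rubin` and the numbers are hypothetical. In a design-based analysis, the within-imputation variances would be the design-based variance estimates from each completed data set.

```python
import math

def pool_rubin(estimates, variances):
    """Combine one coefficient across m imputed data sets with Rubin's rules.

    estimates: point estimates, one per imputed data set.
    variances: squared standard errors, one per imputed data set.
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    if m < 2:
        raise ValueError("at least two imputations are needed")
    q_bar = sum(estimates) / m                               # pooled estimate
    u_bar = sum(variances) / m                               # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    total_var = u_bar + (1.0 + 1.0 / m) * b
    return q_bar, math.sqrt(total_var)

# Hypothetical results from three imputed data sets.
pooled_est, pooled_se = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.04, 0.04])
print(pooled_est, pooled_se)
```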

For the FIML approach, the difference model and rate of change model included the same set of covariates as the models for other missing data methods, and the baseline sampling weights were used as subject-level weights in the analysis. The auxiliary variables and two-way interaction terms were also included and were the same as for MI. The FIML approach can only handle continuous response variables, so it was only used for linear models.

Table 9 presents the regression coefficient estimates with their standard errors and p-values for the eGFR difference model, the eGFR rate of change model, and the CKD incidence model. Because IPW and NRCW both estimate the probability of attending Visit 2 and adjust the sampling weights accordingly, the regression estimates were mostly similar between these two methods (with some exceptions), with NRCW generally yielding smaller standard errors. The inverse probability weights and non-response cell weights were highly correlated (correlation = 0.69). The estimates from MI and FIML differ somewhat from those of the weighting methods but are generally in the same direction.

TABLE 9.

Coefficient Estimates, Standard Errors, and p-values from: (1) Linear Model for Difference in eGFR; (2) Linear Model for Rate of Change in eGFR; (3) Poisson Model for CKD Incidence.

Model Effect Estimate(IPW) SE(IPW) p-value(IPW) Estimate(NRCW) SE(NRCW) p-value(NRCW) Estimate(MI) SE(MI) p-value(MI) Estimate(FIML) SE(FIML) p-value(FIML)

Linear Model for Difference in eGFR Intercept 57.426 3.275 <.001 55.206 2.829 <.001 53.240 3.631 <.001 54.685 2.945 <.001
Age −0.353 0.020 <.001 −0.365 0.019 <.001 −0.354 0.020 <.001 −0.359 0.018 <.001
Background
Central American 2.174 0.941 0.021 2.909 0.864 0.001 2.399 1.063 0.033 2.935 0.771 <.001
Cuban 0.921 0.939 0.327 1.185 0.858 0.168 1.296 1.009 0.209 1.458 0.783 0.063
Mexican 3.547 0.876 <.001 3.807 0.800 <.001 3.032 0.940 0.002 3.641 0.774 <.001
Puerto Rican 0.624 0.820 0.446 0.596 0.698 0.394 0.831 0.931 0.381 0.525 0.667 0.432
South American 2.574 1.103 0.020 1.754 0.772 0.023 1.430 0.931 0.134 2.002 0.703 0.004
Other/Mixed 2.694 0.993 0.007 2.896 0.919 0.002 2.948 1.544 0.071 2.940 0.928 0.002
Field Center
Chicago −0.566 0.662 0.393 −0.775 0.544 0.155 −0.536 0.567 0.346 −0.831 0.563 0.140
Miami −0.354 0.814 0.664 −0.224 0.714 0.754 −0.555 0.720 0.445 −0.565 0.623 0.365
San Diego −1.029 0.685 0.134 −1.103 0.614 0.073 −0.627 0.757 0.409 −0.969 0.630 0.124
Current Smoker 0.831 0.487 0.088 0.772 0.430 0.073 0.540 0.478 0.263 0.433 0.428 0.311
High School Diploma/GED −0.308 0.383 0.422 −0.810 0.318 0.011 −0.217 0.343 0.529 −0.706 0.316 0.025
Gender (Male) −2.140 0.315 <.001 −1.874 0.288 <.001 −1.954 0.314 <.001 −1.842 0.274 <.001
eGFR −0.398 0.017 <.001 −0.395 0.015 <.001 −0.383 0.014 <.001 −0.390 0.016 <.001
Hrs/Day MVPA 0.142 0.055 0.010 0.138 0.048 0.005 0.165 0.060 0.009 0.158 0.044 <.001
Born in US −0.164 0.461 0.722 −0.215 0.416 0.606 −0.544 0.484 0.264 −0.382 0.390 0.328
Years between Y1 and Y2 −0.574 0.283 0.043 −0.181 0.210 0.389 −0.153 0.328 0.645 −0.202 0.214 0.346





Linear Model for Rate of Change in eGFR Intercept 8.717 0.415 <.001 8.795 0.365 <.001 8.595 0.350 <.001 8.679 0.372 <.001
Age −0.057 0.003 <.001 −0.060 0.003 <.001 −0.058 0.003 <.001 −0.058 0.003 <.001
Background
Central American 0.371 0.154 0.016 0.511 0.135 <.001 0.443 0.174 0.017 0.500 0.124 0.017
Cuban 0.170 0.157 0.278 0.256 0.137 0.063 0.316 0.152 0.043 0.267 0.128 0.037
Mexican 0.569 0.144 <.001 0.628 0.131 <.001 0.560 0.169 0.002 0.574 0.126 <.001
Puerto Rican 0.125 0.138 0.369 0.124 0.114 0.278 0.211 0.120 0.083 0.103 0.108 0.336
South American 0.470 0.202 0.020 0.322 0.127 0.011 0.285 0.131 0.032 0.336 0.116 0.004
Other/Mixed 0.419 0.158 0.008 0.477 0.147 0.001 0.465 0.286 0.121 0.451 0.114 0.002
Field Center
Chicago −0.093 0.107 0.385 −0.127 0.087 0.142 −0.104 0.109 0.344 −0.118 0.089 0.183
Miami −0.056 0.134 0.675 −0.073 0.108 0.501 −0.115 0.134 0.400 −0.121 0.100 0.224
San Diego −0.170 0.111 0.126 −0.179 0.099 0.071 −0.111 0.153 0.472 −0.156 0.101 0.120
Current Smoker 0.120 0.077 0.121 0.110 0.068 0.107 0.028 0.078 0.724 0.052 0.068 0.442
High School Diploma/GED −0.042 0.064 0.513 −0.132 0.052 0.011 −0.042 0.058 0.476 −0.104 0.051 0.043
Gender (Male) −0.348 0.053 <.001 −0.306 0.047 <.001 −0.311 0.052 <.001 −0.311 0.045 <.001
eGFR −0.065 0.003 <.001 −0.064 0.002 <.001 −0.063 0.002 <.001 −0.064 0.003 <.001
Hrs/Day MVPA 0.025 0.009 0.006 0.024 0.008 0.002 0.028 0.009 0.003 0.028 0.007 <.001
Born in US −0.044 0.074 0.550 −0.044 0.067 0.508 −0.097 0.086 0.270 −0.073 0.061 0.228





Poisson Model for CKD Incidence Intercept 9.577 1.163 <.001 9.381 1.224 <.001 8.292 1.223 <.001 - - -
Age −0.026 0.005 <.001 −0.026 0.006 <.001 −0.025 0.005 <.001 - - -
Background
Central American −0.638 0.228 0.005 −0.696 0.247 0.005 −0.506 0.247 0.047 - - -
Cuban −0.187 0.229 0.414 −0.311 0.236 0.190 −0.122 0.240 0.612 - - -
Mexican −0.640 0.300 0.033 −0.672 0.264 0.011 −0.527 0.273 0.060 - - -
Puerto Rican −0.230 0.189 0.223 −0.189 0.176 0.284 −0.136 0.192 0.482 - - -
South American −0.429 0.243 0.078 −0.389 0.235 0.099 −0.333 0.211 0.116 - - -
Other/Mixed −0.602 0.325 0.065 −0.549 0.323 0.090 −0.221 0.344 0.525 - - -
Field Center
Chicago −0.152 0.218 0.485 −0.167 0.196 0.396 −0.217 0.218 0.328 - - -
Miami 0.133 0.214 0.536 0.156 0.220 0.478 0.071 0.201 0.724 - - -
San Diego −0.026 0.283 0.927 −0.024 0.250 0.925 −0.119 0.285 0.678 - - -
Current Smoker −0.094 0.136 0.488 −0.103 0.129 0.424 −0.063 0.128 0.624 - - -
High School Diploma/GED 0.087 0.115 0.450 0.078 0.109 0.475 0.055 0.092 0.550 - - -
Gender (Male) 0.074 0.105 0.483 0.065 0.103 0.532 0.014 0.117 0.908 - - -
eGFR −0.109 0.009 <.001 −0.107 0.010 <.001 −0.097 0.011 <.001 - - -
Hrs/Day MVPA −0.030 0.019 0.112 −0.026 0.017 0.130 −0.021 0.016 0.177 - - -
Born in US −0.058 0.173 0.738 −0.092 0.169 0.586 0.016 0.152 0.915 - - -

5 |. DISCUSSION

In this paper, we compared the performance of various methods for handling missing data in the design-based approach for data from complex sample surveys. Since there is no existing literature that systematically examines the performance of various statistical methods for handling missing data for design-based analysis of different types of outcomes in multi-stage complex surveys, even in the context of cross-sectional data, our work is needed in order to examine whether some commonly used methods for handling missing data in model-based analysis are valid in the design-based inference. Our simulation results provide evidence and confirmation that the commonly used missing data methods (MI, IPW, NRCW, and FIML for continuous outcomes) in the model-based approach are also valid for handling missing data in the design-based approach with data from multi-stage complex surveys. This information is new and significant because it forms the basis for adopting these missing data methods in the design-based analysis of data from multi-stage complex surveys. Without these results, we can only speculate that these methods might be valid in design-based inference. More importantly, these new results were the basis for our recommendation for analyzing data from the baseline and one follow-up visit in HCHS/SOL, a large ongoing cohort study of Hispanic/Latino adults with diverse backgrounds. This work has a direct impact on public health research.

In this paper, the population models we considered are linear, logistic, and Poisson regression; the missing data methods are CC, IPW, NRCW, MI, and FIML; and the missing data mechanisms are MCAR, MAR, and MNAR. As expected, the simulation results showed that all methods worked well for all population models under MCAR and that none of them performed well under MNAR. We observed that CC produced biased results under MAR, because the complete cases may be systematically different from the incomplete cases, and CC analyses are biased when the missingness depends on some variables that are not included in the target population parameters.

We also observed that under MAR, when the missingness model was correctly specified or over-specified, IPW, NRCW, MI, and FIML all performed well, whereas they did not perform well when the missingness model was under-specified. In practice, one would not know the correct missingness model, so based on these results, we recommend that analysts include in the missingness model all factors considered possibly related to not returning for the follow-up visit. This could result in an over-specified model, which is preferable to an under-specified model.

Even though an over-specified missingness model is preferred, we caution that the size of the model should be controlled so that it does not include too many unnecessary covariates. This is especially important for the NRCW method, where weighting classes are formed based on the factors in the missingness model (post-stratification variables). If there are too many post-stratification variables, many sparse or even empty weighting classes will be formed. One way to control the size of the missingness model is to apply a variable selection procedure to reduce the initial list of variables considered potentially related to the missingness. For HCHS/SOL, we used a classification tree approach26 to identify the variables that are related to returning to Visit 2. An advantage of the classification tree approach is that interactions among the variables are taken into consideration and estimates of the cutpoints for continuous variables are provided, which facilitates discretizing continuous variables to form weighting classes for the NRCW method.

The FIML approach assumes that all variables follow a multivariate normal distribution. We examined the robustness of the FIML approach to violation of this assumption by including discrete covariates in the model. The results showed that the estimates performed well even though the multivariate normality assumption was violated in the situations we considered.

We also examined the performance of these methods with various amounts of missing data. The results provided in the supplementary material showed that when the missing percentage was low, such as 2%, these methods all performed well. However, when the missing percentage was higher than 5%, CC did not perform well under the MAR missing mechanism and missing data methods (such as IPW, NRCW, MI, and FIML) would be needed.

In terms of choosing the missing data analysis method when conducting design-based analysis for complex survey data, our recommendations based on the simulation results are the following. (1) Use MI because it can be more efficient compared to other methods. We note that, for valid MI inference, the imputation procedure should incorporate not only the variables associated with the missingness but also the sampling weights and design variables. In addition, the imputation model needs to be tailored to a particular analysis model so that they are congenial (i.e., they need to be derived from the same joint model); for example, the imputation model should at least include all the covariates from the analysis model.27,28 In situations when analysts do not have access to all the design information, as might happen when strata are collapsed, or other measures are taken for disclosure limitation, sampling weights, although not perfect, can serve as a surrogate for the design information. (2) If the computation of MI is an issue (e.g., there are multiple variables that need to be imputed, or computation time is too long due to limited computation resources), weight-based methods (IPW or NRCW) can be used as alternative approaches because they are also valid.

It should also be noted that while the definition of the design-consistent estimators in design-based inference is robust to misspecification of the analysis model (i.e., they are well defined regardless of whether the analytic model is correctly specified), focusing on this type of estimator is not without a price. When the analysis model postulated for the sample data is correctly specified, the design-consistent estimators under the design-based framework, which are weighted by sampling weights, can be less efficient than other methods that are model dependent but do not use sampling weights. In other words, when the outcome model is correctly specified, other model-based methods are preferred over design-consistent estimators in design-based inference because of their efficiency.29

In this paper, we considered the situation where there are only two visits. This is an important situation in practice; for example, in our motivating HCHS/SOL study alone, hundreds of manuscripts using two-visit data are under active preparation. In addition, all longitudinal studies have to start with two visits. For longitudinal studies with more than two visits, Si et al. (2022) conducted an empirical evaluation of various approaches for handling attrition when analyzing longitudinal survey data based on the Monitoring the Future (MTF) panel study.30 We are currently working on simulations to examine the extension of the methods considered in this paper to more than two visits, which include some of the methods considered in Si et al. (2022). We will report these results in separate communications.

Supplementary Material

supplement

ACKNOWLEDGEMENTS

The authors thank Drs. Sierra A. Bainter and Maria M. Llabre for their helpful advice on Mplus programming. This work was supported in part by contract from the National Heart, Lung, and Blood Institute of the National Institutes of Health (NIH) N01-HC65233, NIH grant P01CA142538, and National Institute of Environmental Health Sciences (NIEHS) grant T32ES007018. The Hispanic Community Health Study/Study of Latinos was carried out as a collaborative study supported by contracts from the NHLBI (N01-HC65233, N01-HC65234, N01-HC65235, N01-HC65236, and N01-HC65237). The authors thank the staff and participants of HCHS/SOL for their important contributions. Investigators website - http://www.cscc.unc.edu/hchs/. The Hispanic Community Health Study/Study of Latinos is a collaborative study supported by contracts from the National Heart, Lung, and Blood Institute (NHLBI) to the University of North Carolina (HHSN268201300001I / N01-HC-65233), University of Miami (HHSN268201300004I / N01-HC-65234), Albert Einstein College of Medicine (HHSN268201300002I / N01-HC-65235), University of Illinois at Chicago (HHSN268201300003I / N01-HC-65236 Northwestern Univ), and San Diego State University (HHSN268201300005I / N01-HC-65237). The following Institutes/Centers/Offices have contributed to the HCHS/SOL through a transfer of funds to the NHLBI: National Institute on Minority Health and Health Disparities, National Institute on Deafness and Other Communication Disorders, National Institute of Dental and Craniofacial Research, National Institute of Diabetes and Digestive and Kidney Diseases, National Institute of Neurological Disorders and Stroke, NIH Institution-Office of Dietary Supplements.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are available from the Hispanic Community Health Study/Study of Latinos (HCHS/SOL). Restrictions apply to the availability of these data. Information about collaborating with HCHS/SOL can be found at https://sites.cscc.unc.edu/hchs/.

References

  1. Sorlie PD, Avilés-Santa LM, Wassertheil-Smoller S, et al. Design and implementation of the Hispanic Community Health Study/Study of Latinos. Ann Epidemiol. 2010;20(8):629–641. doi: 10.1016/j.annepidem.2010.03.015
  2. Lavange LM, Kalsbeek WD, Sorlie PD, et al. Sample design and cohort selection in the Hispanic Community Health Study/Study of Latinos. Ann Epidemiol. 2010;20(8):642–649. doi: 10.1016/j.annepidem.2010.05.006
  3. Seaman S, White I. Inverse Probability Weighting with Missing Predictors of Treatment Assignment or Missingness. Communications in Statistics - Theory and Methods. 2014;43(16):3499–3515. doi: 10.1080/03610926.2012.700371
  4. Grau E, Potter F, Williams S, Diaz-Tena N. Nonresponse adjustment using logistic regression: To weight or not to weight. American Statistical Association, Survey Research Methods Section. Alexandria. 2006;3073–3080.
  5. Seaman SR, White IR. Review of inverse probability weighting for dealing with missing data. Stat Methods Med Res. 2013;22(3):278–295. doi: 10.1177/0962280210395740
  6. Little RJ, Vartivarian S. On weighting the rates in non-response weights. Stat Med. 2003;22(9):1589–1599. doi: 10.1002/sim.1513
  7. Bethlehem JG. Weighting nonresponse adjustments based on auxiliary information; 2002.
  8. Little RJ, Vartivarian S. Does weighting for nonresponse increase the variance of survey means? Survey Methodology. 2005;31(2):161.
  9. Little RJA, Rubin DB. Statistical Analysis with Missing Data. John Wiley & Sons, Inc.; 2002. doi: 10.1002/9781119013563
  10. Molenberghs G. Handbook of Missing Data Methodology. Chapman and Hall/CRC; 2014. doi: 10.1201/b17622
  11. Rubin DB. Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, Inc.; 1987. doi: 10.1002/9780470316696
  12. Kim JK, Michael Brick J, Fuller WA, Kalton G. On the bias of the multiple-imputation variance estimator in survey sampling. J Royal Statistical Soc B. 2006;68(3):509–521. doi: 10.1111/j.1467-9868.2006.00546.x
  13. Anderson TW. Maximum Likelihood Estimates for a Multivariate Normal Distribution when Some Observations are Missing. J Am Stat Assoc. 1957;52(278):200–203. doi: 10.1080/01621459.1957.10501379
  14. Hartley HO, Hocking RR. The analysis of incomplete data. Biometrics. 1971;27(4):783. doi: 10.2307/2528820
  15. Enders CK. A primer on maximum likelihood algorithms available for use with missing data. Structural Equation Modeling: A Multidisciplinary Journal. 2001;8(1):128–141. doi: 10.1207/S15328007SEM0801_7
  16. Binder DA. On the Variances of Asymptotically Normal Estimators from Complex Surveys. International Statistical Review / Revue Internationale de Statistique. 1983;51(3):279. doi: 10.2307/1402588
  17. Lumley T, Scott A. Fitting regression models to survey data. Stat Sci. 2017;32(2):265–278. doi: 10.1214/16-STS605
  18. Lohr SL. Sampling: Design and Analysis. Chapman and Hall/CRC; 2021. doi: 10.1201/9780429298899
  19. Woodruff RS. A simple method for approximating the variance of a complicated estimate. J Am Stat Assoc. 1971;66(334):411–414. doi: 10.1080/01621459.1971.10482279
  20. Graham JW. Adding missing-data-relevant variables to FIML-based structural equation models. Structural Equation Modeling: A Multidisciplinary Journal. 2003;10(1):80–100. doi: 10.1207/S15328007SEM1001_4
  21. Schenker N, Raghunathan TE, Chiu P-L, Makuc DM, Zhang G, Cohen AJ. Multiple imputation of missing income data in the national health interview survey. J Am Stat Assoc. 2006;101(475):924–933. doi: 10.1198/016214505000001375
  22. Little RJA. Missing-Data Adjustments in Large Surveys. Journal of Business & Economic Statistics. 1988;6(3):287. doi: 10.2307/1391878
  23. Brick JM, Kalton G. Handling missing data in survey research. Stat Methods Med Res. 1996;5(3):215–238. doi: 10.1177/096228029600500302
  24. Enders C, Bandalos D. The relative performance of full information maximum likelihood estimation for missing data in structural equation models. Structural Equation Modeling: A Multidisciplinary Journal. 2001;8(3):430–457. doi: 10.1207/S15328007SEM0803_5
  25. Enders CK. The impact of nonnormality on full information maximum-likelihood estimation for structural equation models with missing data. Psychol Methods. 2001;6(4):352–370. doi: 10.1037/1082-989X.6.4.352
  26. Therneau T, Atkinson B, Ripley B, et al. Package ‘rpart’. https://cran.r-project.org/web/packages/rpart/rpart.pdf. Published 2019. Accessed October 1, 2021.
  27. Reiter RJ, Raghunathan TE, Kinney SK. The importance of modeling the sampling design in multiple imputation for missing data. Survey Methodology. 2006;32(2):143–149.
  28. Quartagno M, Carpenter JR, Goldstein H. Multiple imputation with survey weights: A multilevel approach. Journal of Survey Statistics and Methodology. 2020;8(5):965–989. doi: 10.1093/jssam/smz036
  29. Pfeffermann D. The role of sampling weights when modeling survey data. International Statistical Review. 1993;61(2):317. doi: 10.2307/1403631
  30. Si Y, West BT, Veliz P, et al. An empirical evaluation of alternative approaches to adjusting for attrition when analyzing longitudinal survey data on young adults’ substance use trajectories. International Journal of Methods in Psychiatric Research. 2022;31(3). doi: 10.1002/mpr.1916
