Summary
Multiple informant data refers to information obtained from different individuals or sources used to measure the same construct; for example, researchers might collect information regarding child psychopathology from the child's teacher and the child's parent. Frequently, studies with multiple informants have incomplete observations; in some cases the missingness of informants is substantial. We introduce a Maximum Likelihood (ML) technique to fit models with multiple informants as predictors that permits missingness in the predictors as well as the response. We provide closed form solutions when possible and analytically compare the ML technique to the existing Generalized Estimating Equations (GEE) approach. We demonstrate that the ML approach can be used to compare the effect of the informants on response without standardizing the data. Simulations incorporating missingness show that ML is more efficient than the existing GEE method. In the presence of MCAR missing data, we find through a simulation study that the ML approach is robust to a relatively extreme departure from the normality assumption. We implement both methods in a study investigating the association between physical activity and obesity with activity measured using multiple informants (children and their mothers).
Keywords: missingness in the response, missingness in the covariates, missing completely at random, multiple informants, maximum likelihood
1 Introduction
Multiple informant data refers to information obtained from different individuals or sources used to measure the same construct. Obtaining accurate physical activity and inactivity measurements, for example, can be difficult and time-consuming. Hernández et al. [1] developed a self-reported questionnaire to assess these measures and performed a validation study to compare responses from the questionnaire to a comparison criterion of 24-hour recalls. The validation study collected activity measurements from multiple informants: children and their mothers. In addition to using this study for validation of their scales, they also planned to design a larger study [2] to more fully investigate the relationship between physical activity and obesity. However, collecting information from mothers in a large scale study is not feasible, so it is of interest to compare the predictive values of the mother's and child's reports from a subset of the larger study [2] in the context of study design.
Missing data is prevalent in multiple informant research; as the number of informants increases, the risk of missingness also rises. In the Hernández et al. [2] study with body mass index (BMI) as response and physical activity reported by the child and the child's mother, 26% of the cases have missingness either in the response, the multiple informant variables or both. In a study of mental health service utilization in children from Connecticut, 43% of cases had missingness [3], [4], [5], [6], [7]. Having a large amount of missingness can lead to inferential problems including bias and efficiency loss [8].
In previous work [9], we reviewed a GEE approach [10], [11] and provided a novel Maximum Likelihood (ML) technique for analysis of multiple informants as predictors. In addition, we showed that the GEE and ML approaches yield identical estimates for a broad range of models, but the ML technique permits more flexibility in modeling. That research did not consider missing data; however, general properties of the GEE and ML methods in the presence of missingness are well known. If nonresponse is Missing Completely At Random (MCAR) [12], then the GEE approach gives unbiased yet potentially inefficient estimates [11], [13], while ML leads to consistent and asymptotically normal estimates assuming Missing At Random (MAR) missingness [12], [14], [15] provided the likelihood model is correct. MCAR missingness does not depend on values of the variables (e.g., observed cases are a random subsample of all cases) while MAR missingness depends only on the components of the variable that are observed.
In this paper, rather than simply focus on the common situation with missingness in the response, we consider the non-standard case of fitting marginal regression models with MCAR missingness in the covariates as well as the response. We assume MCAR missingness, partly because it is a reasonable assumption for our data and also to ensure consistency for the GEE approach. We will show under what conditions the GEE and ML approaches yield the same estimates in the presence of MCAR missing data.
In general, when fitting marginal regression models with multiple informant covariates the goal is to compare the relationship of one informant with response to the other informant with response. In data without missingness, this is done by standardizing the multiple informant data and comparing the marginal regression coefficients. However, standardizing data with missingness can be undesirable, so we describe how to use ML to compare the effect of the informants on response without doing so.
Another goal of this paper is to evaluate the efficiency of using ML compared with GEE for fitting marginal regression models in the presence of MCAR missingness. It is expected that ML will offer some efficiency gain over the GEE approach. One well-known drawback to using ML is that it requires the assumption of multivariate normality; thus, we investigate how robust the ML approach is to this assumption in the presence of MCAR missingness.
We begin in Section 2 by describing how to implement the GEE approach incorporating missingness in the response and/or multiple informant covariates. In Section 3, we derive closed form solutions for the ML approach and analytically compare the two methods in the presence of missing data. We also describe how to use ML to compare the effect of each informant on response without standardizing the data. We compare the efficiency of the ML approach with the GEE technique through a simulation study in Section 4. We begin Section 5 by describing an application of the GEE and ML methods using data from Hernández’s study of physical activity/inactivity and obesity [1], [2]. We also describe a simulation that demonstrates how the ML approach is robust to the assumption of multivariate normality in the presence of MCAR missingness.
2 Generalized Estimating Equations Approach Incorporating Missingness
Pepe et al. [11] and Horton et al. [10] describe a nonstandard GEE approach for fitting multiple informants as covariates assuming no missing data; we review the method and extend it to include missingness in the response and the multiple informants. Let Y be the outcome and X1, . . . , XK be the K multiple informant predictors. The GEE approach models the univariate associations between Y and Xk, defined as E(Y |Xk) for k = 1, ..., K. The model with no covariates and distinct parameters for each informant assuming an identity link is E(Y|Xk) = αk + βkXk for k = 1, . . . , K where αk is the intercept and βk is the slope in the kth regression. Defining Ỹi = (Yi, . . . , Yi)T as the K × 1 vector that repeats the response, Xi as the block-diagonal design matrix whose kth block is (1, Xik), and θ = (α1, β1, . . . , αK, βK)T, the GEE equations assuming an identity link, constant variance and a working independence correlation matrix are

$$\sum_{i=1}^{n} \mathbf{X}_i^{T}\bigl(\tilde{\mathbf{Y}}_i - \mathbf{X}_i\,\boldsymbol{\theta}\bigr) = \mathbf{0}. \qquad (1)$$
The GEE approach generalizes easily when Xik is missing by removing the corresponding rows from Ỹi and Xi. However, a missing Yi means the entire observation is omitted from the estimating equations. Specifically, αk and βk are estimated by Ordinary Least Squares (OLS) using only cases with both observed Yi and Xik values. For example, with two multiple informant covariates (K = 2), consider six simple monotone and MCAR missingness patterns: data missing X1 only, X2 only, both Y and X1, both Y and X2, Y only and both X1 and X2. When X1 only is missing, the estimate of β1 is based on cases with observed Yi and Xi1 (Complete Cases (CC)), whereas the estimate of β2 is based on cases with observed Yi and Xi2 (Available Cases (AC)). A similar pattern occurs when X2 only is missing. In all other situations considered, GEE estimates of β1 and β2 are based on complete cases (see Table 1).
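To make this concrete, a minimal R sketch (our own illustration, not the authors' code; a data frame with columns y, x1 and x2 is assumed) of the informant-specific OLS fits under working independence is:

```r
# Sketch: under working independence, the GEE slopes reduce to informant-specific
# OLS fits restricted to the cases with both Y and X_k observed.
gee_slopes <- function(dat) {
  sapply(c("x1", "x2"), function(xk) {
    keep <- !is.na(dat$y) & !is.na(dat[[xk]])      # available cases for (Y, X_k)
    coef(lm(reformulate(xk, response = "y"), data = dat[keep, , drop = FALSE]))[2]
  })
}

# Example with MCAR missingness in X1:
# dat <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
# dat$x1[sample(100, 30)] <- NA
# gee_slopes(dat)
```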
Table 1. Data used by the GEE and ML estimates of β1 and β2 under the six simple monotone MCAR missingness patterns.

| Variable(s) Missing | β̂1 | β̂2 |
| --- | --- | --- |
| X1 only | *† | ‡ available cases of Y and X2 |
| X2 only | ‡ available cases of Y and X1 | §† |
| Both Y and X1 | *† | ‡ complete cases of Y and X2 |
| Both Y and X2 | ‡ complete cases of Y and X1 | §† |
| Y only | *† | §† |
| Both X1 and X2 | *† | §† |

\* GEE is based on complete cases of Y and X1
§ GEE is based on complete cases of Y and X2
† ML is a combination of expressions involving all observed data
‡ ML estimate is the same as the GEE estimate
As previously done we assume working independence [11] and use the model-based estimate of variance:

$$\widehat{\operatorname{var}}(\hat{\boldsymbol{\theta}}) = \left(\sum_{i=1}^{n}\mathbf{X}_i^{T}\mathbf{X}_i\right)^{-1}\left(\sum_{i=1}^{n}\mathbf{X}_i^{T}\hat{\mathbf{V}}\,\mathbf{X}_i\right)\left(\sum_{i=1}^{n}\mathbf{X}_i^{T}\mathbf{X}_i\right)^{-1} \qquad (2)$$

where V̂ is the estimated K × K covariance matrix of Ỹi given Xi, assumed common to all individuals, with (k, l) element equal to the covariance of the residuals from the kth and lth univariate regressions.
As described in Pepe et al. [11], assuming working independence treats data from each subject as independent data clusters and uses independence as the working correlation matrix. This must be done for the model to be valid [16]; however, we have shown that the use of the independence working correlation matrix is optimal for certain models when assuming normality, where the GEE and ML approaches yield identical estimates and standard errors [9]. The variance in Equation 2 is model-based since it assumes the same covariance matrix V for each individual, and V does not depend on the design matrix. To accommodate missingness, estimates of var(θ̂) are based only on available cases or complete cases, depending on the missingness pattern, as when obtaining θ̂. We can fit the GEE model and obtain model-based estimates of standard error with missingness using a standard statistical package such as R [17].
Often data from multiple informants are standardized to the same scale and a model with equal slope coefficients (e.g., setting β1 = β2 = . . . = βK = βC) may be desirable; this can lead to more precise parameter estimates [9]. To fit this constrained model in the presence of missing data, Ỹi and Xi change depending on the observed data. With K = 2 informants, we solve for βC assuming β1 = β2 = βC and find α̂k = Ȳ(k) − β̂C X̄(k) for k = 1, 2 and

$$\hat{\beta}_C = \frac{\displaystyle\sum_{k=1}^{2}\sum_{i=1}^{n} I_{ik}\,\bigl(X_{ik}-\bar{X}^{(k)}\bigr)\bigl(Y_i-\bar{Y}^{(k)}\bigr)}{\displaystyle\sum_{k=1}^{2}\sum_{i=1}^{n} I_{ik}\,\bigl(X_{ik}-\bar{X}^{(k)}\bigr)^{2}} \qquad (3)$$

where Ȳ(k) and X̄(k) are the means of Yi and Xik over the observations with Iik = 1, and Iik is an indicator function for those observations with observed Yi and Xik for k = 1, 2. In the Hernández et al. [2] study, the physical activity measurements from the mother and child have different variances; hence, it is desirable to first standardize the Xik values and then find estimates under the constrained model. However, in the presence of non-monotone missing data, standardizing a priori on the basis of observed responses can be unattractive since standardization of different multiple informant covariates may deal with different subsets of the data. We provide an alternative technique for dealing with this issue in Section 3.
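A minimal R sketch of the pooled common-slope estimate in Equation (3) (our own illustration, not the authors' code; columns y, x1 and x2 are assumed) is:

```r
# Sketch: common-slope estimate beta_C pooled over the two informants, using the
# indicator I_ik of cases with both Y and X_k observed (cf. Equation (3)).
beta_constrained <- function(dat) {
  num <- 0
  den <- 0
  for (xk in c("x1", "x2")) {
    I_k <- !is.na(dat$y) & !is.na(dat[[xk]])
    yk  <- dat$y[I_k]
    xkv <- dat[[xk]][I_k]
    num <- num + sum((xkv - mean(xkv)) * (yk - mean(yk)))
    den <- den + sum((xkv - mean(xkv))^2)
  }
  num / den
}
```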
3 Maximum Likelihood Approach Incorporating Missingness
To use the ML approach we assume a joint multivariate distribution for the outcome and multiple informant predictors. For simplicity we assume only two predictors here; the model can be easily extended to accommodate more predictors. For each of n observations with complete data, let Qi = (Yi, X1i, X2i)T and thus

$$\mathbf{Q}_i \sim N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \qquad \boldsymbol{\mu} = \begin{pmatrix} \mu_Y \\ \mu_{X_1} \\ \mu_{X_2} \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \sigma^2_Y & \sigma_{X_1,Y} & \sigma_{X_2,Y} \\ \sigma_{X_1,Y} & \sigma^2_{X_1} & \sigma_{X_1,X_2} \\ \sigma_{X_2,Y} & \sigma_{X_1,X_2} & \sigma^2_{X_2} \end{pmatrix}.$$

From this distribution, we find estimates for θ = (μY, μX1, μX2, σ²Y, σX1,Y, σX2,Y, σ²X1, σX1,X2, σ²X2)T. Here we consider the regression parameter estimates, found from solving Equation 1 when K = 2. Using conditional mean formulas for the multivariate normal distribution, we find E(Y|Xi) = αi + βiXi where i = 1, 2. We define αi = μY − βiμXi and βi = σXi,Y/σ²Xi where i = 1, 2. We also define V11, V22 and V12 in terms of θ by utilizing conditional variance formulas for the multivariate normal distribution, e.g., V11 = var(Y |X1), V22 = var(Y |X2) and V12 = cov(Y |X1, Y |X2). To make the transformation full rank, we include two parameters, μY and σ²Y, from θ in θ* = (α1, β1, α2, β2, V11, V22, V12, μY, σ²Y)T. Estimates of θ have closed form solutions with complete data in this case; furthermore, estimates of α1, β1, α2, β2 are identical to those obtained using GEE. This implies that although the derivation of ML estimates assumes multivariate normality, because the ML and GEE estimates are the same, both estimates have the same robustness properties [9].
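The mapping from θ to the marginal regression parameters can be written as a small R helper (a sketch, not the authors' code; we interpret V12 as the covariance between the residuals from the two univariate regressions):

```r
# Sketch: convert the mean vector and covariance matrix of (Y, X1, X2) into the
# marginal regression parameters alpha_i, beta_i and the conditional variances.
theta_to_regression <- function(mu, Sigma) {
  mu <- as.numeric(mu)                         # ordered (Y, X1, X2)
  beta1  <- Sigma[1, 2] / Sigma[2, 2]          # sigma_{X1,Y} / sigma^2_{X1}
  beta2  <- Sigma[1, 3] / Sigma[3, 3]          # sigma_{X2,Y} / sigma^2_{X2}
  alpha1 <- mu[1] - beta1 * mu[2]
  alpha2 <- mu[1] - beta2 * mu[3]
  V11 <- Sigma[1, 1] - Sigma[1, 2]^2 / Sigma[2, 2]   # var(Y | X1)
  V22 <- Sigma[1, 1] - Sigma[1, 3]^2 / Sigma[3, 3]   # var(Y | X2)
  V12 <- Sigma[1, 1] - beta1 * Sigma[1, 2] - beta2 * Sigma[1, 3] +
    beta1 * beta2 * Sigma[2, 3]                      # covariance of the residuals
  c(alpha1 = alpha1, beta1 = beta1, alpha2 = alpha2, beta2 = beta2,
    V11 = V11, V22 = V22, V12 = V12)
}
```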
To find solutions when any element of Qi can have missingness, we use the expectation-maximization (EM) algorithm [18], consisting of an expectation (E) and a maximization (M) step. The M step proceeds as if there were no missing observations, and the E step finds the conditional expectation of the missing data given the observed data and the current parameter estimates. These steps are iterated until convergence. In particular, we use R [17] for implementation; we stratify by whether or not an observation is complete and then use the EM algorithm (see Appendix). Details are also provided in the Appendix for a case with monotone missingness where closed form solutions can be found. While other techniques (e.g., Newton-Raphson, Fisher scoring) have quadratic convergence rates [19], the EM algorithm has a linear convergence rate and does not use second derivatives in its calculations. Consequently, the algorithm does not automatically provide the asymptotic standard errors of the ML estimates, so we use the bootstrap [20] to obtain variance estimates for the ML parameter estimates.
For data with the simple monotone missingness patterns considered in Section 2 (data missing X1 only, X2 only, both Y and X1, both Y and X2, Y only and both X1 and X2), some closed form solutions exist. Using the factorization theorem for finding ML estimates with monotone missingness patterns, in some cases we find explicit solutions for β̂1 and β̂2 (results in Table 1). Furthermore, when the ML solutions involve only complete or available cases, the estimates are the same as GEE, but in general they differ.
We now describe using the factorization theorem in the specific monotone situations to derive ML estimates. First we consider when X1 alone is missing. According to the factorization theorem, we write f(Y, X1, X2) = f(X1|Y, X2)f(Y, X2), where parameters from f(X1|Y, X2) are estimated from complete cases and parameters from f(Y, X2) are estimated from all observed cases (available cases). We show how to use the EM algorithm to obtain the estimates in the Appendix. The estimate of β1, calculated from estimates of σX1,Y and σ²X1, is based on a combination of expressions involving all observed data, whereas β̂2 is calculated from estimates of σX2,Y and σ²X2 and is based on available cases for Y and X2. Therefore, β̂2 from ML is equivalent to that obtained using GEE, but β̂1 is not. The argument with missing X2 is derived similarly.
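Written out, the observed-data likelihood under this missingness pattern factors into pieces with distinct parameters (a standard factored-likelihood display; the φ labels are ours):

$$L(\theta) \;=\; \prod_{i:\,X_{1i}\ \mathrm{observed}} f\!\left(X_{1i} \mid Y_i, X_{2i};\, \phi_{1\mid Y,2}\right)\;\times\;\prod_{i=1}^{n} f\!\left(Y_i, X_{2i};\, \phi_{Y,2}\right),$$

so φY,2 (and hence β2 = σX2,Y/σ²X2) is estimated from all n cases, while β1 = σX1,Y/σ²X1 depends on both φ1|Y,2 and φY,2 and therefore combines the complete cases with the remaining observed data.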
In the case of missingness in both Y and X1, we write f(Y, X1, X2) = f(Y, X1|X2)f(X2), hence parameters from f(Y, X1|X2) are estimated from complete cases and parameters from f(X2) come from all observed cases. In deriving f(Y |X1), we find that X2 does give information about the joint distribution of Y and X1, so the ML estimate of β1 is based on all observed cases. However, the ML estimate of β2 is the same as the complete case (GEE) estimate. As expected, the converse is true when both Y and X2 are missing.
When Y only is missing, we find f(Y, X1, X2) = f(Y |X1, X2)f (X1, X2) where parameters from f(Y |X1, X2) are estimated using complete cases and parameters in f (X1, X2) are estimated using all observations. Therefore, estimates of β1 and β2 are found from a combination of all observed data. This implies that even when Y is missing, observations with observed X1 and X2 values contribute to the likelihood; this is contrary to previous work on estimating parameters in ordinary regression models with missing Y values [21]. Specifically, they found that when interested in inferences about a response with missing data given complete covariates, the missingness in Y did not affect the regression estimates obtained; that is, the same regression estimates were found when analyzing the data with and without the missing Y values.
When both X1 and X2 are missing, using the same technique as above, f(Y, X1, X2) = f(X1, X2|Y )f (Y ) where parameters from f (X1, X2|Y ) are estimated using complete cases and parameters from f(Y ) are from all observations. Therefore, estimates of both slopes will be different when calculated using GEE and ML.
While in many examples comparing the slopes is desirable, in the Hernández et al. [2] study the variances of the two multiple informants are not the same, so comparing the predictive values of the multiple informant reports is more appropriate. For example, we are interested in whether the relationship between vigorous exercise reported by the child and BMI is similar to the relationship between vigorous exercise reported by the mother and BMI, without assuming that the variance of vigorous exercise reported by each informant is the same. If the relationship is indeed similar, this implies that using the child's report of vigorous exercise is adequate to assess the relationship in future studies. To compare the two marginal relationships, we test ρX1,Y = ρX2,Y, where ρX1,Y = σX1,Y/(σX1σY) and ρX2,Y = σX2,Y/(σX2σY); we use the bootstrap [20] to obtain an appropriate standard error for the difference in correlations. The ability to easily compare the correlation between each informant report and the response is an advantage of using ML with the bootstrap variance estimate.
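A minimal R sketch of this bootstrap comparison (our own illustration, not the authors' code; it refits the trivariate normal on each resample with an EM routine such as the em_mvnorm() sketch in the Appendix, and assumes columns y, x1, x2) is:

```r
# Sketch: bootstrap test of rho_{X1,Y} = rho_{X2,Y}, refitting the trivariate
# normal by EM on each resample (em_mvnorm() is the sketch in the Appendix).
boot_rho_diff <- function(dat, B = 2000) {
  rho_diff <- function(d) {
    fit <- em_mvnorm(as.matrix(d[, c("y", "x1", "x2")]))
    S <- fit$Sigma
    rho1 <- S[1, 2] / sqrt(S[1, 1] * S[2, 2])   # cor(Y, X1)
    rho2 <- S[1, 3] / sqrt(S[1, 1] * S[3, 3])   # cor(Y, X2)
    rho1 - rho2
  }
  est   <- rho_diff(dat)
  boots <- replicate(B, rho_diff(dat[sample(nrow(dat), replace = TRUE), ]))
  c(estimate = est, se = sd(boots), z = est / sd(boots))
}
```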
4 Simulations: Efficiency of ML Compared with GEE
MCAR missingness means that the observed outcomes (response and multiple informant covariates) are a random subsample of all outcomes [12]. For simplicity, we do not include additional covariates with missingness, although the techniques extend to do so. We consider MCAR missingness only in the simulations, since GEE may be biased when missingness is not MCAR, and in our example the MCAR assumption appears reasonable because missing data arose due to logistical reasons rather than refusals. Thus, we use the simulations to compare the efficiency of the ML and GEE approaches.
In the case of two multiple informant predictors, monotone missingness often occurs when one predictor has missingness and the other is complete; for example, a mental health service utilization study had many teacher ratings of children missing but all parents responded about their children [3], [4], [5], [6], [7]. In general, multiple informant data missingness is not necessarily monotone and missing responses are also possible. For example, in a study predicting utilization of health services obtained from multiple sources [22], both self report and administrative source responses had missingness. The Hernández et al. [2] example has missingness in the response in addition to the multiple informant covariates.
In our simulations, we compare the GEE and ML models under situations similar to those described above. We assume a multivariate normal distribution for Qi = (Yi, X1i, X2i)T and generate 500 observations for different values of σX1,Y, σX2,Y and σX1,X2. Specifically, we consider situations with zero covariance between the two informants, σX1,X2 = 0.3334 (as calculated from the Hernández et al. dataset [2] without missingness) and a moderately large value, σX1,X2 = 0.6. We consider the same three values for σX1,Y and σX2,Y (see Table 2). For each set of values, we perform 10,000 simulations with four scenarios: (1) data with no missingness; (2) data where X1 is missing 50% of its observations at random (MCAR and monotone missingness in one multiple informant); (3) data where X1 is missing 25% of its observations at random and X2 is missing 25% of its observations at random (non-monotone MCAR missingness in the multiple informants); and (4) data where Y is missing 25% of its observations at random (MCAR and monotone missingness in the response). This last case is designed to illustrate the loss of efficiency when respondents with Y missing are discarded.
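For concreteness, one simulated dataset under scenario (2) (50% of X1 missing completely at random) could be generated as follows; this is a sketch in which we additionally assume zero means and unit variances, so the σ values play the role of correlations:

```r
library(MASS)  # for mvrnorm

# Sketch: generate one dataset of n = 500 trivariate normal observations
# (order Y, X1, X2) and impose 50% MCAR missingness on X1 (scenario 2).
sim_one <- function(n = 500, s_x1y = 0.3334, s_x2y = 0.3334, s_x1x2 = 0.3334) {
  Sigma <- matrix(c(1,      s_x1y,  s_x2y,
                    s_x1y,  1,      s_x1x2,
                    s_x2y,  s_x1x2, 1), nrow = 3, byrow = TRUE)
  dat <- as.data.frame(mvrnorm(n, mu = c(0, 0, 0), Sigma = Sigma))
  names(dat) <- c("y", "x1", "x2")
  dat$x1[sample(n, n / 2)] <- NA   # MCAR: a random half of X1 is deleted
  dat
}
```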
Table 2. Ratio of the variance of β̂1 under ML to that under GEE (100% indicates that ML and GEE are equally efficient), by missingness scenario and covariance values.

|  | 50% missing X1 |  |  | 25% missing X1 and 25% missing X2 |  |  | 25% missing Y |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | σX1,Y = 0 | σX1,Y = 0.3334 | σX1,Y = 0.6 | σX1,Y = 0 | σX1,Y = 0.3334 | σX1,Y = 0.6 | σX1,Y = 0 | σX1,Y = 0.3334 | σX1,Y = 0.6 |
| σX1,X2 = 0 |  |  |  |  |  |  |  |  |  |
| σX2,Y = 0 | 101% | 91% | 77% | 100% | 100% | 100% | 100% | 100% | 100% |
| σX2,Y = 0.3334 | 101% | 91% | 77% | 100% | 95% | 96% | 97% | 97% | 96% |
| σX2,Y = 0.6 | 101% | 88% | 74% | 100% | 96% | 88% | 90% | 90% | 85% |
| σX1,X2 = 0.3334 |  |  |  |  |  |  |  |  |  |
| σX2,Y = 0 | 96% | 88% | 77% | 100% | 100% | 100% | 100% | 100% | 99% |
| σX2,Y = 0.3334 | 95% | 89% | 78% | 100% | 95% | 95% | 97% | 98% | 99% |
| σX2,Y = 0.6 | 92% | 90% | 76% | 100% | 95% | 88% | 89% | 92% | 93% |
| σX1,X2 = 0.6 |  |  |  |  |  |  |  |  |  |
| σX2,Y = 0 | 83% | 76% | 68% | 100% | 100% | 100% | 100% | 99% | 92% |
| σX2,Y = 0.3334 | 83% | 79% | 73% | 100% | 98% | 95% | 96% | 99% | 100% |
| σX2,Y = 0.6 | 73% | 81% | 76% | 100% | 95% | 87% | 86% | 93% | 97% |
With the monotone missingness in the multiple informants situation, we calculate β̂1 and var(β̂1) only, since we illustrated in Section 3 that the ML estimate of β2 is equivalent to that obtained from GEE. We also calculate the ratio of var(β̂1) from ML to var(β̂1) from GEE and compare the two methods, where 100% means that ML and GEE are equivalent in terms of efficiency (Table 2). For the non-monotone missingness in the multiple informants situation and with missingness in Y, both efficiency ratios for var(β̂1) and var(β̂2) are affected similarly by the missingness, so we present only var(β̂1) (also in Table 2). We repeat the simulations to compare ML and GEE assuming the constrained model with equal variances under similar scenarios (not presented).
In Table 2, simulations with 50% missingness in X1 show that the correlation between informant one and response, σX1,Y, is the largest determinant of the efficiency of ML compared with GEE, whereas σX2,Y has the smallest effect. Specifically, the largest efficiency gains (22 – 32%) for ML compared with GEE are found when σX1,Y = 0.6. ML is also more efficient than GEE with moderate effects (σX1,Y = 0.3334), with efficiency gains of 9 – 24%. For a nonzero σX1,Y value, gains are largest when σX2,Y = 0 and σX1,X2 = 0.6; in this case, X2 and Y provide independent information about X1. Under the null hypothesis of no relationship between the first informant and response (σX1,Y = 0), with σX1,X2 = 0 or σX1,X2 = 0.3334, small efficiency gains (0 – 8%) are realized, but when the relationship between the two informants is large (σX1,X2 = 0.6), the efficiency gain is as high as 27%.
Regarding bias, we find no evidence that GEE or ML produces biased estimates in these situations; thus, we present only the efficiency ratios and not the estimates of β or their variances. The ratio of variances for data missing 50% of its X1 observations to data with no missingness is about two for GEE across all scenarios; this follows since approximately 50% of the data are not being used. As described earlier, the loss is not as great when using ML.
Efficiency gains for the non-monotone missingness in the multiple informants pattern are less than for the previous pattern (Table 2). For var(β̂1), gains are largest (10 – 13%) when σX1,Y = 0.6; it appears to make little difference what the σX2,Y and σX1,X2 values are. Similarly, for var(β̂2) the biggest gains occur when σX2,Y = 0.6, with a range of 11 – 13% (not presented). Thus, the non-monotonicity of the missingness pattern does not affect the efficiency gains, but the amount of missingness in the X1 and X2 values does. Specifically, in the monotone scenario first considered with 50% missingness in X1, more efficiency is gained than in the non-monotone case with only 25% missingness in X1 (and 25% missingness in X2). When 25% of observations are missing Y (Table 2), the largest efficiency gain for estimation of β1 using ML (14%) is found when σX1,Y = 0, σX2,Y = 0.6, σX1,X2 = 0.6 or when σX1,Y = 0.6, σX2,Y = 0.6, σX1,X2 = 0.6. A maximum efficiency gain of the same magnitude occurs for estimation of β2 (not presented).
Under the constrained model with monotone missingness in the multiple informants, the efficiency gains of ML compared with GEE are small (0 – 7%) since the constrained estimate averages over both X1 and X2. Instead, if we consider the non-monotone missingness pattern where both X1 and X2 are missing 25% of their observations, efficiency gains for ML compared with GEE are 7 – 15% with the largest gains found when σX1,Y = σX2,Y = 0.6. When Y is missing 25% of its observations, as in the unconstrained case, the largest efficiency gain (18%) is found when σX1,X2 = 0, σX1,Y = 0.6, σX2,Y = 0.6. While generally the efficiency gains of ML compared with GEE under the constrained model are equal or smaller than the unconstrained model, the efficiency loss when comparing models with missing values and no missing values is less than in the unconstrained case.
5 Illustration
We begin this section by describing results of applying the GEE and ML methods and end by investigating the robustness of ML to the assumption of multivariate normality in the presence of MCAR missingness. In 1996, a study investigating the association between physical activity/inactivity and obesity was performed in two towns of Mexico City [1], [2]. We illustrate ML to evaluate the relationship between BMI (Y ) and vigorous exercise as reported by the child (X1) and the child’s mother (X2) and compare to the GEE approach. One question we consider is the consequence of relying on the child’s information for larger surveys. Partial information is available for 29 of the 111 observations (3 are missing X1 only, 13 are missing X2 only, 10 are missing Y only and 3 are missing Y and X1). We find that X1 and X2 are not predictive of missingness in Y ; similar conclusions are drawn for the missingness in X1 and X2. Because missing data arose due to logistical reasons rather than refusals, the MCAR assumption is reasonable. To calculate standard errors for both approaches, we use 2000 bootstrap [20] samples.
We begin by presenting the estimated means and covariance matrix based on available cases (as used by GEE) and from ML (Table 3); both the means and variances are similar for the two approaches. While the ML and GEE estimates in this analysis of 111 cases (Table 4) are similar to the previous complete case analysis of 82 cases (data not shown), the GEE model-based standard errors are 6–18% smaller and the ML standard errors are 15–20% smaller. This indicates that missingness has not substantially biased our estimates, but by incorporating information from the incomplete cases, we gain efficiency when using ML or GEE. In addition, we can directly compare the ML estimates of standard error to the model-based GEE estimates for our analysis of the 111 cases. We find estimated efficiency gains of ML compared with GEE of 4% for estimating var(β̂1) and 20% for estimating var(β̂2). Generally consistent with our simulations (Section 4), our example demonstrates that ML is more efficient than GEE.
Table 3. Estimated means and covariance matrix of (Y, X1, X2) based on available cases (ACs) and from ML.

| Variable | Estimated Mean using ACs | Estimated Mean from ML |
| --- | --- | --- |
| BMI (Y) | 21.294 | 21.287 |
| Vigorous exercise reported by child (X1) | 0.972 | 0.974 |
| Vigorous exercise reported by mother (X2) | 0.791 | 0.787 |

| Σ using ACs | Y | X1 | X2 |
| --- | --- | --- | --- |
| Y | 10.975 | −0.282 | −0.391 |
| X1 | −0.282 | 0.939 | 0.187 |
| X2 | −0.391 | 0.187 | 0.439 |

| Σ from ML | Y | X1 | X2 |
| --- | --- | --- | --- |
| Y | 10.989 | −0.317 | −0.358 |
| X1 | −0.317 | 0.937 | 0.170 |
| X2 | −0.358 | 0.170 | 0.436 |
Table 4. Estimates of β1 and β2 and their standard errors from GEE and ML (n = 111).

| Method | β̂1 | SE(β̂1) | β̂2 | SE(β̂2) |
| --- | --- | --- | --- | --- |
| GEE | −0.333 | 0.331 | −0.889 | 0.529 |
| ML | −0.338 | 0.324 | −0.820 | 0.474 |
Our primary goal is to compare the relationship of the child's report of vigorous exercise and BMI with that of the mother's report and BMI; however, the variances of the two reports are different according to an F test (p < 0.0001). Thus, rather than constraining the two slopes to be equal, we test whether the two correlations are equal (ρX1,Y = ρX2,Y). We find ρ̂X1,Y ≈ −0.10 and ρ̂X2,Y ≈ −0.16 (computed from the ML estimates in Table 3); the difference between the two correlations, with standard error 0.131, is not statistically significant. Thus, using either the mother's or the child's report of vigorous exercise gives similar predictive power, and either would be appropriate to use in a subsequent study.
When using the ML approach, we must be concerned about deviations from the assumption of normality. In this example, the distributions of vigorous exercise reported by the child and the child's mother are skewed to the right (not presented), with many children reporting no vigorous exercise. Residual plots (not presented) show some evidence of increasing heterogeneity in the residuals as the predicted Y values increase. Therefore, the joint distribution of BMI and vigorous activity reported by the child and the child's mother is likely not multivariate normal. We therefore describe a simulation designed to address whether this deviation from normality impacts results from the ML model.
Although ML assumes normality, with complete data the ML estimates are not sensitive to this assumption since they are identical to the GEE estimates. To investigate sensitivity in the presence of MCAR missingness, we simulated data from the observed empirical distribution of the 82 complete cases in the Hernández et al. dataset [2]. Specifically, for each of 2000 simulated datasets, we drew 111 cases with replacement from this complete case distribution. For each simulated dataset, we created missing data in 29 cases at random according to the same general pattern of missingness as in the original dataset of 111 observations (a sketch of this resampling scheme follows Table 5). We calculated estimates of β using ML from the simulated data and compared these to the original complete case dataset to test for bias (Table 5); we find no evidence of bias. In summary, we generated data which were not multivariate normal, used an MCAR missingness mechanism to create missing data, used ML assuming normality to estimate parameters, and found no bias in the ML estimates. Hence, we have confirmed the robustness of ML to the assumption of multivariate normality in the presence of MCAR missingness in our data.
Table 5. ML estimates of β1 and β2 from the original complete case dataset and from the 2000 simulated datasets (with 95% intervals).

|  | β̂1 | 95% CI for β1 | β̂2 | 95% CI for β2 |
| --- | --- | --- | --- | --- |
| Original dataset | −0.417 |  | −0.908 |  |
| Simulated dataset | −0.417 | (−0.433, −0.401) | −0.888 | (−0.911, −0.865) |
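A sketch of this resampling scheme (our illustration, not the authors' code; complete_cases holds the 82 fully observed rows with columns ordered y, x1, x2, and miss_pattern is a 111 × 3 logical matrix marking the entries that were missing in the original data) is:

```r
# Sketch: draw 111 rows with replacement from the 82 complete cases, then blank
# out entries according to a randomly permuted copy of the original missingness
# pattern (TRUE = missing), and refit by ML via the EM sketch in the Appendix.
simulate_robustness <- function(complete_cases, miss_pattern) {
  boot <- complete_cases[sample(nrow(complete_cases), nrow(miss_pattern),
                                replace = TRUE), ]
  pattern <- miss_pattern[sample(nrow(miss_pattern)), ]  # MCAR: random assignment
  boot[pattern] <- NA
  em_mvnorm(as.matrix(boot))
}
```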
6 Conclusion
Both ML and GEE marginal regression models can be used to combine information from multiple informants into one analysis; here we describe how to extend them to deal with missing data. In the Hernández et al. dataset [2], researchers collected activity measurements from children and their mothers. We found that the relationship between BMI and physical activity as reported by the child is similar to the relationship of BMI and physical activity as reported by the child's mother in this small study. Thus, a subsequent larger study (where it is not feasible to obtain mother's reports) would provide reasonable estimates of the relationship between physical activity and BMI. In addition, our simulations illustrated that the ML approach can be more efficient than the GEE technique.
Although we have found that the ML approach for estimation of marginal regression models with multiple source predictors incorporating missingness is more efficient than GEE, ML requires additional assumptions. For instance, ML assumes multivariate normality whereas GEE only assumes that the mean model is correctly specified. Without missingness, the ML and GEE approaches are identical in many cases, indicating that ML is not sensitive to the normality assumption in the absence of missing data. The vigorous activity multiple informant measurements in the Hernández et al. [2] dataset were skewed to the right, but the residuals from the model were approximately normal. Analyzing the data without missingness reveals that ML is still equivalent to GEE, confirming the robustness of ML to deviations from normality. We also performed a simulation study based on the Hernández et al. data [2] confirming the robustness of the ML estimates to the normality assumption with MCAR missing data. Hence, there is evidence that the normality assumption is not critical in practice: ML has little potential for bias due to departures from normality and provides more efficient estimates than GEE for estimation of marginal regression models with multiple source predictors incorporating MCAR missingness.
We have considered MCAR missingness in the simulations since GEE can give biased estimates under more general missingness mechanisms, and MCAR was a reasonable assumption for the Hernández et al. dataset [2]. However, the ML technique is also valid assuming MAR missingness. While the GEE method presented is generally not appropriate for MAR missingness, an extension of the method (inverse probability weighted GEE [23], [24]) is. The weighted GEE technique has been applied in the presence of missing Y's [24] or missing X's [23], but not both (as in our example), and construction of appropriate weights for that setting has not yet been investigated. The weights are designed to correct the bias of the estimates; their effect on efficiency also remains to be investigated.
In our example, the variances of the two multiple informants are not equal; to alleviate this problem, data can be standardized. However, with missingness, standardizing data a priori is not attractive and we prefer to leave the data unstandardized. When the scale of measurement of the informants is an issue and standardizing is not feasible, ML allows us to compute correlations and thus compare the effect of the informants on the response without standardizing the data. Comparing the relationship of each informant with the response in this way could alleviate the need to collect information from both informants in future studies.
We have considered a simple case with one response and two multiple informant covariates. The ML method extends straightforwardly to include more than two sets of multiple informants or to measure one construct with more than two multiple informants. For example, the Hernández et al. study [2] also included measures such as video viewing, moderate exercise and videogame playing. We can also envision a study collecting information from additional informants, e.g., children, the child's mother and the child's teacher. In addition, the ML model extends readily to include covariates that are not measured by multiple informants, as in Litman et al. [9].
Acknowledgments
We are grateful for the support provided by the National Institute of Mental Health (NIMH) grant number MH54693 and National Institutes of Health (NIH) grant number T32-MH017119.
Appendix
We begin by giving details on how the EM algorithm is implemented for a case with two multiple informant predictors where Y and X2 are complete but X1 has MCAR missingness [12]. As in Section 3, we let Qi = (Yi, X1i, X2i)T, with Qi having a trivariate normal distribution. We note that Qi is composed of an observed portion, Qobs,i, and a missing portion, Qmis,i. We aim to maximize the log likelihood based on the observed data; the complete data belong to an exponential family with the following sufficient statistics:

$$\sum_{i=1}^{n} Y_i,\quad \sum_{i=1}^{n} X_{1i},\quad \sum_{i=1}^{n} X_{2i},\quad \sum_{i=1}^{n} Y_i^2,\quad \sum_{i=1}^{n} X_{1i}^2,\quad \sum_{i=1}^{n} X_{2i}^2,\quad \sum_{i=1}^{n} Y_iX_{1i},\quad \sum_{i=1}^{n} Y_iX_{2i},\quad \sum_{i=1}^{n} X_{1i}X_{2i}.$$
In general, the estimates of θ are found by equating the observed sufficient statistics with their expected values; this defines the M step. We let the current estimates at the tth iteration be θ(t). The E step finds the conditional expectation of the missing data given the observed data and the current parameter estimates. For example, for the sufficient statistic involving X1i, the E step is:

$$E\!\left(\sum_{i=1}^{n} X_{1i} \,\middle|\, Q_{obs}, \theta^{(t)}\right) = \sum_{i=1}^{n} x_{1i}^{(t)}, \qquad x_{1i}^{(t)} = \begin{cases} X_{1i} & \text{if } X_{1i}\ \text{is observed} \\ E\!\left(X_{1i} \mid Y_i, X_{2i};\, \theta^{(t)}\right) & \text{if } X_{1i}\ \text{is missing.} \end{cases}$$
Thus, missing values of X1i are replaced by the conditional mean of X1i given the set of values Qobs,i that are observed for that case. From the M step, closed form solutions for θ̂ can be found. In particular, for the parameters from the joint distribution of Y and X2, which have no missing data, such as the mean of Y, the M step is:

$$E\!\left(\sum_{i=1}^{n} Y_i \,\middle|\, Q_{obs}, \theta^{(t)}\right) = \sum_{i=1}^{n} Y_i = n\,\mu_Y.$$

This leads to the solution that

$$\hat{\mu}_Y = \frac{1}{n}\sum_{i=1}^{n} Y_i = \bar{Y}.$$
For the parameters from the conditional distribution of X1 given Y and X2 that do have missing data, the same technique is used; however, the solutions are more complicated. For example, the M step involving μX1 is:

$$E\!\left(\sum_{i=1}^{n} X_{1i} \,\middle|\, Q_{obs}, \theta^{(t)}\right) = \sum_{i:\,X_{1i}\ \mathrm{obs}} X_{1i} + \sum_{i:\,X_{1i}\ \mathrm{mis}} E\!\left(X_{1i} \mid Y_i, X_{2i};\, \theta^{(t)}\right) = n\,\mu_{X_1},$$

which leads to the solution that

$$\hat{\mu}_{X_1} = \bar{X}_1^{obs} + \hat{b}_{Y}\bigl(\bar{Y} - \bar{Y}^{obs}\bigr) + \hat{b}_{X_2}\bigl(\bar{X}_2 - \bar{X}_2^{obs}\bigr),$$

where X̄1^obs, Ȳ^obs and X̄2^obs are sample means based on only the cases with X1 observed, Ȳ and X̄2 are sample means based on all observations, and b̂Y and b̂X2, the regression coefficients of X1 on Y and X2, are functions of the sample variance estimates involving only observed values. Once all the parameters are found directly through the M step, a transformation is made to find estimates of θ. Another transformation is made to obtain θ̂*, since interest lies in the marginal regression parameter estimates, e.g., β̂1 = σ̂X1,Y/σ̂²X1.
In our situation where any of the Y, X1, X2 can have missingness, the sufficient statistics and the M step remain the same as in the case just described, but the E step is more complicated because all of the sufficient statistics involve missing data. Thus, none of the θ parameters have simple solutions from the M step; all involve combinations of expressions involving all observed data. However, the general technique to obtain estimates using the EM algorithm remains the same. That is, the expectations of the sufficient statistics are found as sums of the quantities over all observations and the M step calculates the moment based estimates from the filled-in sufficient statistics [12]. We also note that although we have demonstrated using the EM algorithm in a situation with one response and two multiple informant covariates, it can be similarly used in situations with more than two covariates, but finding closed form estimates becomes increasingly complicated.
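To make these steps concrete, the following R sketch (our own illustration, not the authors' implementation) iterates the E and M steps for a p-variate normal with arbitrary missingness, accumulating the expected sufficient statistics in the E step and taking moment estimates in the M step; the marginal regression parameters can then be recovered with a helper such as theta_to_regression() from Section 3.

```r
# Sketch: EM for a p-variate normal with MCAR/MAR missingness. Y is an n x p
# numeric matrix containing NAs (each row is assumed to have at least one
# observed component); returns ML estimates of mu and Sigma.
em_mvnorm <- function(Y, max_iter = 500, tol = 1e-8) {
  n <- nrow(Y); p <- ncol(Y)
  # Starting values: available-case means and a diagonal covariance matrix.
  mu <- colMeans(Y, na.rm = TRUE)
  Sigma <- diag(apply(Y, 2, var, na.rm = TRUE))
  for (iter in seq_len(max_iter)) {
    sum1 <- numeric(p)          # E( sum_i Q_i       | observed data, theta )
    sum2 <- matrix(0, p, p)     # E( sum_i Q_i Q_i'  | observed data, theta )
    for (i in seq_len(n)) {
      obs <- !is.na(Y[i, ]); mis <- !obs
      yi <- Y[i, ]
      Ci <- matrix(0, p, p)     # conditional covariance of the missing part
      if (any(mis)) {
        # E step: conditional mean and covariance of the missing components
        # given the observed ones, under the current (mu, Sigma).
        S_oo_inv <- solve(Sigma[obs, obs, drop = FALSE])
        B <- Sigma[mis, obs, drop = FALSE] %*% S_oo_inv
        yi[mis] <- mu[mis] + B %*% (yi[obs] - mu[obs])
        Ci[mis, mis] <- Sigma[mis, mis, drop = FALSE] -
          B %*% Sigma[obs, mis, drop = FALSE]
      }
      sum1 <- sum1 + yi
      sum2 <- sum2 + tcrossprod(yi) + Ci
    }
    # M step: moment estimates from the filled-in sufficient statistics.
    mu_new <- sum1 / n
    Sigma_new <- sum2 / n - tcrossprod(mu_new)
    converged <- max(abs(mu_new - mu), abs(Sigma_new - Sigma)) < tol
    mu <- mu_new; Sigma <- Sigma_new
    if (converged) break
  }
  list(mu = mu, Sigma = Sigma)
}
```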
References
1. Hernández B, Gortmaker SL, Laird NM, Colditz GA, Parra-Cabrera S, Peterson KE. Validity and reproducibility of a physical activity and inactivity questionnaire for Mexico City's schoolchildren. Salud Publica de Mexico. 2000;42:315–323.
2. Hernández B, Gortmaker SL, Colditz GA, Peterson KE, Laird NM, Parra-Cabrera S. Association of obesity with physical activity, television programs and other forms of video viewing among children in Mexico City. International Journal of Obesity. 1999;23:845–854. doi: 10.1038/sj.ijo.0800962.
3. Achenbach TM. Manual for the teacher's report form and 1991 profile. University of Vermont, Department of Psychiatry; 1991.
4. Horton NJ, Laird NM. Maximum likelihood analysis of generalized linear models with missing covariates. Statistical Methods in Medical Research. 1999;8:37–50. doi: 10.1177/096228029900800104.
5. Zahner GEP, Daskalakis C. Factors associated with mental health, general health and school-based service use for psychopathology. American Journal of Public Health. 1997;87:1440–1448. doi: 10.2105/ajph.87.9.1440.
6. Zahner GEP, Jacobs JH, Freeman DH, Trainor K. Rural-urban child psychopathology in a north-eastern US state: 1986–1989. Journal of the American Academy of Child Adolescent Psychiatry. 1993;32:378–387. doi: 10.1097/00004583-199303000-00020.
7. Zahner GEP, Pawelkiewicz W, DeFrancesco JJ, Adnopoz J. Children's mental health service needs and utilization patterns in an urban community. Journal of the American Academy of Child Adolescent Psychiatry. 1992;31:951–960. doi: 10.1097/00004583-199209000-00025.
8. Horton NJ, Laird NM, Murphy JM, Monson RR, Sobol AM, Leighton AH. Multiple informants: Mortality associated with psychiatric disorders in the Stirling County Study. American Journal of Epidemiology. 2001;154:649–656. doi: 10.1093/aje/154.7.649.
9. Litman HJ, Horton NJ, Hernández B, Laird NM. Estimation of marginal regression models with multiple source predictors. Handbook of Statistics. 2005. Submitted. doi: 10.1002/sim.2593.
10. Horton NJ, Laird NM, Zahner GEP. Use of multiple informant data as a predictor in psychiatric epidemiology. International Journal of Methods in Psychiatric Research. 1999;8:6–18.
11. Pepe MS, Whitaker RC, Seidel K. Estimating and comparing univariate associations with application to the prediction of adult obesity. Statistics in Medicine. 1999;18:163–173. doi: 10.1002/(sici)1097-0258(19990130)18:2<163::aid-sim11>3.0.co;2-f.
12. Little RJA, Rubin DB. Statistical Analysis with Missing Data. 2nd Edition. New Jersey: John Wiley; 2002.
13. Fitzmaurice GM, Laird NM, Zahner GEP, Daskalakis C. Bivariate logistic regression analysis of child psychopathology ratings using multiple informants. American Journal of Epidemiology. 1995;142:1194–1203. doi: 10.1093/oxfordjournals.aje.a117578.
14. Goldwasser MA, Fitzmaurice GM. Multivariate linear regression of childhood psychopathology using multiple informant data. International Journal of Methods in Psychiatric Research. 2001;20:1–11.
15. Little RJA. Regression with missing X's: A review. Journal of the American Statistical Association. 1992;87:1227–1237.
16. Pepe MS, Anderson GL. A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Communications in Statistics. 1994;23(4):939–951.
17. R Development Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2004.
18. Dempster AP, Laird NM, Rubin DB. Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society Series B. 1977;39:1–38.
19. Laird NM. Analysis of longitudinal and cluster-correlated data. Vol. 8. Ohio: Institute of Mathematical Statistics and the American Statistical Association; 2004.
20. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. New York, NY: Chapman and Hall; 1993.
21. Baker SG, Laird NM. Regression analysis for categorical variables with outcome subject to nonignorable nonresponse. Journal of the American Statistical Association. 1988;83:62–69.
22. Horton NJ, Saitz R, Laird NM, Samet JH. A method for modeling utilization data from multiple sources: Application in a study of linkage to primary care. Health Services and Outcomes Research Methodology. 2002;3:211–223.
23. Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association. 1994;89:846–866.
24. Robins JM, Rotnitzky A, Zhao LP. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association. 1995;90:106–121.