ABSTRACT
Generalized linear mixed models have been widely used to analyze correlated data in many research areas. The linear mixed model with normal errors is a popular model for repeated measures and longitudinal data. Outliers, however, can severely distort inference under the linear mixed model, and the normal-theory model does not fully account for them. One popular robust estimator, the M-estimator, attains robustness at the expense of first-order or second-order efficiency, whereas the minimum Hellinger distance estimator is both efficient and robust. In this paper, we propose a robust Bayesian version of parameter estimation via a pseudo posterior distribution based on the minimum Hellinger distance. It accommodates an appropriate nonparametric kernel density estimation for longitudinal data, as required by the proposed cross-validation estimator. We conduct a simulation study and real data analyses with the orthodontic study data and the Alzheimer's disease (AD) study data. In the simulation study, the proposed method shows smaller biases, mean squared errors, and standard errors than the residual maximum likelihood (REML) method in the presence of outliers or missing values. In the real data analyses, the standard errors and variance-covariance components from the proposed method are lower than those from the REML method in both data sets.
Keywords: Linear mixed model, outliers, kernel density estimation, minimum Hellinger distance, pseudo posterior
1. Introduction
Linear mixed models have become a popular method for analyzing repeated measurement data arising in many areas such as agriculture, biology, and economics. Their increasing popularity is mostly explained by the flexibility they offer in modeling the within-subject correlation often present in repeated measures data, by their ability to handle both balanced and unbalanced data, and by the availability of efficient software. The most popular linear mixed model for a continuous response assumes normal distributions for the random effects and the within-subject errors [9], and parameter estimation has usually been based on residual maximum likelihood (REML). In longitudinal data, repeated measurements on the same subject are in general correlated, whereas different subjects are assumed to be independent, which is why the linear mixed model is often used for this type of data. As a motivating example, the AD study analyzed in Section 5.3 includes 248 individuals [12]. They are diagnosed with mild cognitive impairment at the first visit, and their diagnosis changes to AD by their last visit. They visit hospitals 6 times on average, with 6 or 12 months between successive visits. That is, the data consist of repeated multivariate measurements on 248 individuals, with the measurements for the ith individual obtained at that individual's own set of time points. We model the data by the linear mixed model below.
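For each subject $i$ with $n_i$ observations, the linear mixed model for the outcome $y_i$ is given by

$$y_i = X_i \beta + Z_i b_i + \epsilon_i,$$

where $y_i$ is an $n_i \times 1$ vector, $X_i$ is an $n_i \times p$ matrix of fixed covariates, $\beta$ is a $p \times 1$ vector of fixed effects, $Z_i$ is an $n_i \times q$ matrix of covariates for the $q \times 1$ random effects $b_i$, and $\epsilon_i$ is an $n_i \times 1$ vector of error terms. $b_i$ and $\epsilon_i$ are assumed to be independent of each other and normally distributed with $b_i \sim N(0, D)$ and $\epsilon_i \sim N(0, \sigma^2 I_{n_i})$ [8,9]. Under these assumptions, along with conditional independence of the observations of subject $i$ given the random effect $b_i$,

$$y_i \mid b_i \sim N(X_i \beta + Z_i b_i, \sigma^2 I_{n_i}).$$

Large-sample inference regarding $\beta$ should not be much affected by the distribution of the random effects [9]. In this sense, marginally,

$$y_i \sim N(X_i \beta, Z_i D Z_i^{\top} + \sigma^2 I_{n_i}),$$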
which shows the unique covariance structure for subject i. For balanced data, our model assumes the same number of observations for every subject. An observation, however, may be an outlier due to either an extreme subject effect or an extreme time-varying error. Such outlying observations can have an inordinate influence on the analysis of the linear mixed model. For example, model inference is not robust when the normality assumption is violated because of heavy tails and outliers. Within the generalized linear mixed model framework, parameters are often estimated through REML. The method, however, is vulnerable to outliers, can be computationally very intensive, and is sometimes computationally infeasible. When values are missing, it also relies on the missing at random (MAR) assumption. The present paper takes up these issues directly. Our aim is to propose an approach to parameter estimation that achieves efficiency and robustness against outliers and influential data points. The robust estimation procedures proposed so far for longitudinal data mostly rely on asymptotic theory in the framework of partial or (semi)parametric mixed models and focus on robustness rather than efficiency [2,7,11,13–15]. The minimum Hellinger distance estimator belongs to a class of estimators that are efficient as well as robust, whereas one of the most popular robust estimators, the M-estimator, achieves robustness at the cost of first-order or second-order efficiency [1]. The minimum Hellinger distance estimator minimizes the squared Hellinger distance between a nonparametric density estimate from the sample and the parametric density given by the model assumption. It was originally designed for independent and identically distributed (IID) observations. Missing data points reduce the sample size, which costs statistical efficiency, and they can also cause substantial bias in the results. Unfortunately, robust methods for missing values have been little developed so far.
The Alzheimer data in the motivating example may contain outliers or missing values, as real data often do. As a result, standard parameter estimates may be biased or inefficient. We therefore also consider the effect of missing data points on minimum Hellinger distance estimators.
The pseudo posterior distribution was proposed to extend the minimum Hellinger distance estimator to dependent observations in the context of continuous time interest rate models [3]. The authors described classical and Bayesian minimum divergence estimators and the related bandwidth selection for such data. To establish the minimum Hellinger distance estimation method for longitudinal data, we account for the dependence structure among repeated measurements. To this end, we provide a general framework for a Bayesian version of the minimum Hellinger distance estimator in which the nonparametric kernel density estimation is tailored to longitudinal data.
The paper is organized as follows. In Section 2, we introduce the minimum Hellinger distance estimator and show how to apply it to the linear mixed model. In Section 3, nonparametric kernel density estimation for the linear mixed model is presented. In Section 4, a Bayesian version of the minimum Hellinger distance estimator for the linear mixed model is developed. In Section 5, we assess the performance of the proposed method in a simulation study and on real data. Section 6 contains concluding remarks.
2. Minimum Hellinger distance for longitudinal data
We consider a model for a continuous random variable Y with parametric density $f_\theta$, where $\theta$ is the set of parameters. To measure the distance between $f_\theta$ and a nonparametric density estimator $\hat{g}$, one of the most popular and effective distance measures in the literature is the squared Hellinger distance. To define the Hellinger distance, let P and Q denote two probability measures that are absolutely continuous with respect to a third probability measure λ. We define the squared Hellinger distance between P and Q as the quantity

$$H^2(P, Q) = \frac{1}{2} \int \left( \sqrt{\frac{dP}{d\lambda}} - \sqrt{\frac{dQ}{d\lambda}} \right)^2 d\lambda.$$

Now, in terms of $f_\theta$ and the nonparametric estimator $\hat{g}$, define the parameter estimate based on the minimum squared Hellinger distance as

$$\hat{\theta} = \arg\min_{\theta} H^2(f_\theta, \hat{g}). \qquad (1)$$
The minimum distance method is one of the most attractive alternatives to maximum likelihood estimation because the resulting estimator has good robustness properties while being first-order efficient under the assumed model. The method is even more appealing when the data contain severe outliers that make likelihood-based inference infeasible. Most parameter estimation methods for random-effects models have used REML based on normal theory, which may not match the data structure in longitudinal data analysis. Applying the method to longitudinal data, with the integral approximated by a sum, the squared Hellinger distance is rewritten as
where and
| (2) |
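As a rough illustration of the minimum Hellinger distance idea with the integral approximated by a sum, the Python sketch below estimates a single location parameter by grid search. It is not the estimator for the linear mixed model developed in this paper: the normal model with known scale, the Gaussian kernel, the bandwidth `h`, the two grids, the factor 1/2 in the distance, and all function names are assumptions made for this example.

```python
import numpy as np


def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def kernel_density(y_grid, sample, h):
    """Gaussian kernel density estimate on y_grid (kernel and h are assumptions)."""
    u = (y_grid[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (sample.size * h * np.sqrt(2.0 * np.pi))


def squared_hellinger(f, g, dy):
    """Riemann-sum approximation of 0.5 * integral of (sqrt(f) - sqrt(g))^2."""
    return 0.5 * float(np.sum((np.sqrt(f) - np.sqrt(g)) ** 2) * dy)


def mhd_location(sample, sigma, h, y_grid, mu_grid):
    """Grid-search minimizer of the squared Hellinger distance over mu."""
    dy = y_grid[1] - y_grid[0]
    g_hat = kernel_density(y_grid, sample, h)
    distances = [
        squared_hellinger(normal_pdf(y_grid, mu, sigma), g_hat, dy) for mu in mu_grid
    ]
    return float(mu_grid[int(np.argmin(distances))])


# Illustrative data: 200 draws from N(0.6, 1), with 20 of them shifted by +10.
rng = np.random.default_rng(0)
sample = rng.normal(0.6, 1.0, size=200)
sample[:20] += 10.0

y_grid = np.linspace(-6.0, 17.0, 2301)
mu_grid = np.linspace(-2.0, 4.0, 601)
mhd_estimate = mhd_location(sample, sigma=1.0, h=0.4, y_grid=y_grid, mu_grid=mu_grid)
sample_mean = float(sample.mean())
```

In this toy setting the sample mean is pulled toward the shifted values, while the grid-search minimizer stays close to the center of the uncontaminated observations, which is the behavior the robustness argument above describes.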
The minimum Hellinger distance estimator is known to retain efficiency even in small samples compared with other statistical approaches, which is an advantage when missing data reduce the sample size. A minimum Hellinger-type distance method for censored data was developed by Ying [16]; censored data can be viewed as a form of missing observations, and the method was shown to be robust and efficient for such data. In addition, the minimum Hellinger distance relies on a nonparametric kernel regression estimator, and this type of estimator is known to attain robustness in data with missing values when an appropriate bandwidth is chosen [17]. We discuss nonparametric kernel regression estimation for longitudinal data in more detail in the next section.
3. Nonparametric kernel regression estimation for longitudinal data
Most previous research on kernel regression estimation relies on IID observations. Appropriate bandwidth choices for kernel estimation based on dependent data have been proposed [4,5], and kernel regression estimation for repeated measurement data has been introduced for the case of one explanatory variable [6]. The observations made on each experimental unit are naturally correlated. We assume that the observations are generated by the nonparametric kernel regression model below, with observations taken at regularly spaced design points i/n. Define
| (3) |
The errors are assumed to come from a stationary process with covariance function given by
| (4) |
where the variance is independent of m and and is a correlation function. The structure of the covariance function above allows the autocorrelation among the errors to vary both with the distance between design points and with the sample size. For our case, we assume that where ρ is a Lipschitz continuous function, that is, one satisfying $|\rho(a) - \rho(b)| \le A|a - b|$ for all a and b and some constant A. The estimate of is given by
| (5) |
where , , and K is a density function. In order to compute an appropriate value of bandwidth , we use an estimated mean average squared error (MASE) curve which is given by
| (6) |
For our model, we rewrite as
| (7) |
where denotes trace, , is an n ×n matrix with th element and is a matrix of correlations with th element . For more details, please see [6].
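To make the idea of kernel regression on repeated measurements more concrete, here is a minimal Python sketch. It swaps in a Nadaraya-Watson estimator with a Gaussian kernel applied to the per-design-point averages, which is not necessarily the estimator in (5), and it picks the bandwidth by minimizing the average squared error against a known mean function, which is possible only in simulation and is not the estimated MASE criterion in (6) and (7). The AR(1)-type errors, the sine mean function, the bandwidth grid, and all names are assumptions for illustration.

```python
import numpy as np


def nw_kernel_regression(x_eval, x_design, y_bar, h):
    """Nadaraya-Watson estimate with a Gaussian kernel and bandwidth h."""
    u = (x_eval[:, None] - x_design[None, :]) / h
    w = np.exp(-0.5 * u ** 2)
    return (w * y_bar[None, :]).sum(axis=1) / w.sum(axis=1)


rng = np.random.default_rng(1)
m, n = 30, 50                        # m units, n regularly spaced design points
x = np.arange(1, n + 1) / n          # design points i/n
true_mean = np.sin(2.0 * np.pi * x)  # assumed mean function

# Assumed stationary AR(1)-type errors within each unit (correlation rho per step).
rho, sigma = 0.5, 0.5
eps = np.empty((m, n))
eps[:, 0] = rng.normal(0.0, sigma, size=m)
for j in range(1, n):
    innovation = rng.normal(0.0, sigma, size=m)
    eps[:, j] = rho * eps[:, j - 1] + np.sqrt(1.0 - rho ** 2) * innovation

y = true_mean[None, :] + eps
y_bar = y.mean(axis=0)               # average over units at each design point

# Oracle bandwidth choice: smallest average squared error on a small grid.
bandwidths = np.array([0.02, 0.05, 0.10, 0.20])
ase = np.array(
    [np.mean((nw_kernel_regression(x, x, y_bar, h) - true_mean) ** 2) for h in bandwidths]
)
best_h = float(bandwidths[int(np.argmin(ase))])
fit = nw_kernel_regression(x, x, y_bar, best_h)
```

With correlated errors, averaging over units reduces the noise at each design point but leaves the within-unit correlation pattern in place, which is why the bandwidth criterion in the text involves the correlation matrix of the errors.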
4. Bayesian minimum Hellinger distance estimator for longitudinal data
In this section, we describe how a Bayesian version of the minimum Hellinger distance estimator for the linear mixed model can be formulated, following [3]. For simplicity, all subjects are assumed to have the same number of repeated measurements. As stated before,
| (8) |
We assume that conditional on the random effect , the repeated observations on subject i are independent. Thus, the likelihood for subjects in the GLMM is
| (9) |
Letting , the priors should be made as gamma and In the Bayesian framework, is commonly assumed to have a Wishart prior, that is, Wishart, where , and is a positive definite matrix. Then, Henceforth, we have the posterior density for expressed as
| (10) |
where gamma
The posterior density of is given by
Generating posterior sample gives us robust simultaneous estimators of .
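The following Python sketch shows one way posterior samples could be generated from a Hellinger-based pseudo posterior, using a random-walk Metropolis sampler for a single location parameter. It is a toy under stated assumptions, not the sampler for the linear mixed model above: the pseudo posterior is taken to be proportional to exp(-n H²) times a normal prior, and that scaling by n, the factor 1/2 in H², the known scale, the Gaussian kernel, the bandwidth, the prior standard deviation, and the step size are all assumptions for illustration.

```python
import numpy as np


def kernel_density(y_grid, sample, h):
    """Gaussian kernel density estimate on y_grid (kernel and h are assumptions)."""
    u = (y_grid[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (sample.size * h * np.sqrt(2.0 * np.pi))


def make_log_pseudo_posterior(sample, y_grid, h, sigma=1.0, prior_sd=10.0):
    """Return log of exp(-n * H^2(f_mu, g_hat)) * N(0, prior_sd^2) prior, up to a constant."""
    dy = y_grid[1] - y_grid[0]
    sqrt_g = np.sqrt(kernel_density(y_grid, sample, h))

    def log_post(mu):
        f = np.exp(-0.5 * ((y_grid - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
        h2 = 0.5 * np.sum((np.sqrt(f) - sqrt_g) ** 2) * dy
        return -sample.size * h2 - 0.5 * (mu / prior_sd) ** 2

    return log_post


def random_walk_metropolis(log_target, start, n_iter, step, rng):
    """Random-walk Metropolis sampler for a scalar parameter."""
    draws = np.empty(n_iter)
    current, current_lp = start, log_target(start)
    for t in range(n_iter):
        proposal = current + step * rng.normal()
        proposal_lp = log_target(proposal)
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        draws[t] = current
    return draws


# Illustrative data: 200 draws from N(0.6, 1), with 20 of them shifted by +10.
rng = np.random.default_rng(3)
sample = rng.normal(0.6, 1.0, size=200)
sample[:20] += 10.0

y_grid = np.linspace(-6.0, 17.0, 2301)
log_post = make_log_pseudo_posterior(sample, y_grid, h=0.4)
draws = random_walk_metropolis(
    log_post, start=float(np.median(sample)), n_iter=3000, step=0.3, rng=rng
)
posterior_mean = float(draws[500:].mean())  # discard the first 500 draws as burn-in
```

The posterior mean of the retained draws serves as the point estimate, and the spread of the draws gives a measure of uncertainty; in the full model the same scheme would have to be applied jointly to the fixed effects, the variance-covariance components, and τ.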
5. Numerical study
5.1. Simulation study
The performance of our proposed method is compared with that of the REML by allowing for outliers in the data. We also demonstrate the computations and inference for these models.
To assess the performance of the proposed estimator, two sets of simulations are run. The first set assesses the behavior of the robust estimates when the data contain no outliers or missing values. The second set investigates their behavior when the data contain outliers or missing values. The statistical package R is used for the analysis. In the first set, the proposed method is compared with REML in terms of the biases and mean squared errors (MSE) of the parameter estimates for uncontaminated data without missing values. We generate 100 data sets with sample sizes m=100 and n=5. The values of the predictor are generated from . The time points are selected as . The regression coefficients are each fixed at the true value 0.6, and τ is fixed at the true value 0.4, as listed in Table 1.
The 100 random effects are generated from a bivariate normal distribution with mean zero and covariance matrix

$$D = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}$$

with $(\sigma_{11}, \sigma_{12}, \sigma_{22}) = (0.30, 0.25, 0.30)$. Table 1 reports the biases, standard errors, and MSEs of the estimates obtained from REML and the proposed method without outliers and missing values. The proposed method and REML perform almost equally well. Our aim is to investigate the behavior of the robust method when the data contain outliers or missing values. In their presence, the gain in precision of the robust method over the REML method is expected to be large.
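A data set with the structure described above might be generated along the following lines. This is a sketch under assumptions that the text does not confirm: the fixed effects are taken to be an intercept, a predictor effect, and a time effect; the random effects are a random intercept and a random slope in time; the predictor is standard normal; the time points are 1 to 5; and τ = 0.4 is treated as the error variance. Only m = 100, n = 5, the coefficient value 0.6, and the values 0.30, 0.25, 0.30 come from the text.

```python
import numpy as np

rng = np.random.default_rng(2024)

m, n = 100, 5                            # subjects and repeated measurements
beta = np.array([0.6, 0.6, 0.6])         # assumed: intercept, predictor, time
D = np.array([[0.30, 0.25],
              [0.25, 0.30]])             # random-effects covariance matrix
error_var = 0.4                          # assumption: tau used as error variance
t = np.arange(1, n + 1, dtype=float)     # assumption: time points 1, ..., 5


def simulate_dataset(rng):
    """Draw one balanced data set from the assumed random intercept-slope model."""
    x = rng.normal(size=(m, n))                            # assumed predictor
    b = rng.multivariate_normal(np.zeros(2), D, size=m)    # one row per subject
    eps = rng.normal(scale=np.sqrt(error_var), size=(m, n))
    fixed_part = beta[0] + beta[1] * x + beta[2] * t[None, :]
    random_part = b[:, [0]] + b[:, [1]] * t[None, :]
    return x, fixed_part + random_part + eps


x, y = simulate_dataset(rng)

# Marginal covariance of one subject's responses: Z D Z' + error_var * I.
Z = np.column_stack([np.ones(n), t])
V = Z @ D @ Z.T + error_var * np.eye(n)
```

The matrix `V` is the same for every subject here because the design is balanced and the random-effects covariates depend only on time.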
Table 1. Biases and MSEs of parameter estimates without outliers and missing values.
| Parameter | True value | Proposed method: Bias | Proposed method: Standard error | Proposed method: MSE | REML: Bias | REML: Standard error | REML: MSE |
|---|---|---|---|---|---|---|---|
| Regression coefficient | .6 | −.0054 | .0104 | .0118 | −.0048 | .0108 | .0123 |
| Regression coefficient | .6 | .0138 | .0114 | .0132 | .0144 | .0115 | .0135 |
| Regression coefficient | .6 | .0136 | .0081 | .0062 | .0146 | .0069 | .0069 |
| Variance-covariance component | .30 | −.0028 | .0088 | .0082 | −.0025 | .0092 | .0088 |
| Variance-covariance component | .30 | .0036 | .0076 | .0314 | .0039 | .0080 | .0324 |
| Variance-covariance component | .25 | .0029 | .0015 | .0278 | .0030 | .0020 | .0298 |
| τ | .4 | .0058 | .0078 | .0284 | .0061 | .0081 | .0288 |
In the second set, the two methods are examined in the presence of outliers or missing values, the latter under the missing at random (MAR) assumption. Again, 100 longitudinal data sets are generated. The outliers are created by replacing 20 randomly chosen y values by y+10. Table 2 shows the simulated biases, standard errors, and MSEs of the parameter estimates obtained from the two methods with outliers. Unlike the proposed estimates, the REML estimates are influenced by the outliers or missing values in the data. The REML method has larger MSEs for all the parameter estimates and larger absolute biases for all but one of them. The standard errors of the proposed estimates are smaller than those of the REML estimates for all the parameters. Table 3 shows the simulated biases and MSEs of the parameter estimates obtained from the two methods when the data have design outliers or missing values. As expected, the proposed method provides smaller biases, MSEs, and standard errors than the REML method for all the parameter estimates. The standard deviation in the Monte Carlo simulation experiments is 0.0058 without outliers and 0.0124 with outliers. The averages of the standard errors in Table 1 are 0.0079 for the proposed method and 0.00807 for REML. The averages of the standard errors in Table 2 are 0.0051 for the proposed method and 0.0057 for REML.
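The contamination scheme and the Monte Carlo summaries used in the tables can be sketched in Python as follows. The sample mean and the sample median stand in for a non-robust and a robust estimator purely for illustration; they are not REML or the proposed method. Taking the standard error to be the standard deviation of the estimates across replicates is also an assumption, as are the function names and the normal data-generating model.

```python
import numpy as np


def contaminate(y, n_outliers, shift, rng):
    """Return a copy of y with n_outliers randomly chosen entries replaced by y + shift."""
    flat = np.asarray(y, dtype=float).ravel().copy()
    idx = rng.choice(flat.size, size=n_outliers, replace=False)
    flat[idx] += shift
    return flat.reshape(np.shape(y))


def monte_carlo_summary(estimates, true_value):
    """Bias, standard error (SD across replicates), and MSE of a set of estimates."""
    estimates = np.asarray(estimates, dtype=float)
    bias = float(estimates.mean() - true_value)
    std_error = float(estimates.std(ddof=1))
    mse = float(np.mean((estimates - true_value) ** 2))
    return bias, std_error, mse


rng = np.random.default_rng(7)
true_value = 0.6
means, medians = [], []
for _ in range(100):                                  # 100 simulated data sets
    y = rng.normal(true_value, 1.0, size=(100, 5))    # m = 100 subjects, n = 5
    y_out = contaminate(y, n_outliers=20, shift=10.0, rng=rng)
    means.append(y_out.mean())                        # non-robust stand-in
    medians.append(np.median(y_out))                  # robust stand-in

mean_summary = monte_carlo_summary(means, true_value)
median_summary = monte_carlo_summary(medians, true_value)
```

Each summary is a tuple `(bias, std_error, mse)`, so the two estimators can be compared column by column in the same way as the two methods in Tables 1 to 3.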
Table 2. Biases and MSEs of parameter estimates with outliers or missing values.
| Parameter | True value | Proposed method: Bias | Proposed method: Standard error | Proposed method: MSE | REML: Bias | REML: Standard error | REML: MSE |
|---|---|---|---|---|---|---|---|
| Regression coefficient | .6 | .0144 | .0094 | .0118 | .0238 | .0108 | .0215 |
| Regression coefficient | .6 | .0048 | .0107 | .0122 | .0134 | .0124 | .0219 |
| Regression coefficient | .6 | .0036 | .0055 | .0045 | .0130 | .0062 | .0239 |
| Variance-covariance component | .30 | −.0026 | .0045 | .0059 | −.0025 | .0048 | .0061 |
| Variance-covariance component | .30 | .0023 | .0016 | .0219 | .0025 | .0020 | .0225 |
| Variance-covariance component | .25 | .0032 | .0016 | .0387 | .0035 | .0021 | .0398 |
| τ | .4 | .0014 | .0018 | .0161 | .0015 | .0021 | .0162 |
Table 3. Biases and MSEs of parameter estimates with design outliers or missing values (standard errors in parentheses).
| Parameter | True value | Proposed method: Bias | Proposed method: MSE | REML: Bias | REML: MSE |
|---|---|---|---|---|---|
| Regression coefficient | .6 | .0056 (.0064) | .0129 | .0165 (.0078) | .0238 |
| Regression coefficient | .6 | .0065 (.0096) | .0145 | .0256 (.0097) | .0438 |
| Regression coefficient | .6 | .0018 (.0035) | .0056 | .0126 (.0052) | .0289 |
| Variance-covariance component | .30 | .0016 (.0025) | .0079 | .0143 (.0056) | .0161 |
| Variance-covariance component | .30 | .0033 (.0015) | .0079 | .0234 (.0023) | .0285 |
| Variance-covariance component | .25 | .0038 (.0025) | .0314 | .0039 (.0027) | .0348 |
| τ | .4 | .0025 (.0029) | .0087 | .0136 (.0038) | .0259 |
5.2. Application to longitudinal data 1
The data for our first analysis come from the orthodontic study [10]. The study involved 27 children, 16 boys and 11 girls. For each child, the distance (mm) from the center of the pituitary to the pterygomaxillary fissure was measured at ages 8, 10, 12, and 14 years. The analysis uses a linear growth curve model for the boys and girls together with a variance-covariance model that incorporates the correlations among all observations from the same person. The data are assumed to be Gaussian, and their likelihood is maximized to estimate the parameters. We also fit the data with the proposed method for comparison. Table 4 presents the parameter estimates computed by the two methods.
Table 4. Parameter estimates for the orthodontic study data.
| Parameter coefficient | Gender | Proposed method: Estimate | Proposed method: Standard error | REML: Estimate | REML: Standard error |
|---|---|---|---|---|---|
| gender | F | 16.4987 | 0.5386 | 17.3727 | 0.7386 |
| gender | M | 18.4570 | 0.8478 | 16.3406 | 1.1114 |
| age*gender | F | 0.5367 | 0.0585 | 0.4795 | 0.0618 |
| age*gender | M | 0.8794 | 0.0854 | 0.7844 | 0.0972 |
| Variance-covariance component | | 0.9871 | 0.9876 | 3.1978 | 1.4169 |
| Variance-covariance component | | 0.0078 | 0.8765 | 0.0197 | 0.9835 |
| Variance-covariance component | | 0.0186 | 0.5674 | 0.0297 | 0.6865 |
| τ | F | 0.5635 | 0.1342 | 0.4449 | 0.1764 |
| τ | M | 0.2469 | 0.0959 | 0.2456 | 0.1648 |
The regression coefficient estimates obtained from the two methods are similar. The three variance-covariance components are larger under REML than under the proposed method. The proposed method has smaller standard errors than the REML method for all the parameter estimates.
5.3. Application to longitudinal data 2
Age-related brain diseases such as Parkinson's disease and Alzheimer's disease are complex diseases with multiple metabolic effects on the mechanisms of the brain. Models of disease progression that describe the sequence and timing of these effects over the course of the disease remain largely hypothetical. Large databases containing biological evidence of the patterns of disease progression have recently been collected. These databases are longitudinal in that they contain repeated measurements of many subjects at several time points. The data used here are the neuropsychological assessment test ADAS-Cog 13 from the ADNI1, ADNIGO, or ADNI2 cohorts of the Alzheimer's Disease Neuroimaging Initiative. They include 248 individuals who were diagnosed with mild cognitive impairment at their first visit and whose diagnosis changed to AD before their last visit. There is an average of 6 visits per subject, with a minimum of 3 and a maximum of 11. The response variable is the measurement of the specific biomarker. The fixed effects in the linear mixed model are the parameters of the average geodesic: the point on the manifold, the time-point, and the velocity. The random effects are the acceleration factors, time-shifts, and space-shifts. For more details, please refer to [12].
We analyze the data with both the proposed method and REML for comparison. Table 5 presents the parameter estimates from the two methods.
Table 5. Parameter estimates for the Alzheimer's disease study data.
| Parameter coefficient | Proposed method: Estimate | Proposed method: Standard error | REML: Estimate | REML: Standard error |
|---|---|---|---|---|
| the point on the manifold | 9.4584 | 0.4983 | 10.0011 | 0.4986 |
| the time-point | 20.4681 | 0.5647 | 22.3516 | 1.0931 |
| the velocity | 1.5324 | 0.1831 | 1.8923 | 0.2618 |
| Variance-covariance component | 0.0671 | 0.0825 | 1.1124 | 1.1034 |
| Variance-covariance component | 0.0066 | 0.0134 | 0.0298 | 0.0349 |
| Variance-covariance component | 0.0786 | 0.0097 | 0.0998 | 0.0159 |
| Variance-covariance component | 0.0187 | 0.0677 | 0.0199 | 0.0799 |
| Variance-covariance component | 0.0393 | 0.0569 | 0.0497 | 0.0864 |
| Variance-covariance component | 0.0886 | 0.0431 | 0.0999 | 0.0996 |
| τ | 0.4983 | 0.0351 | 0.8175 | 0.5683 |
The regression coefficient estimates obtained from the two methods are similar. The six variance-covariance components are larger under REML than under the proposed method. The proposed method has smaller standard errors than the REML method for all the parameter estimates.
6. Concluding remarks
The REML method has often been used to fit the linear mixed model to longitudinal data. It can, however, be highly influenced by severe outliers or missing values. The most popular robust method, the M-estimator, may be used instead, but it sacrifices first-order or second-order efficiency, whereas the minimum Hellinger distance method provides efficient and robust estimators. We construct a pseudo posterior based on the minimum Hellinger distance for the linear mixed model; this posterior requires a nonparametric kernel density estimate for longitudinal data. Using the pseudo posterior, we propose a Bayesian version of the minimum Hellinger distance estimator for longitudinal data. The simulation study shows that the proposed method is less affected by outliers or missing values than the REML method, yielding smaller biases, MSEs, and standard errors for all the parameter estimates in their presence. In the real data analyses, the standard errors and variance-covariance components from the proposed method are lower than those from the REML method in both data sets.
Funding Statement
This work was supported by the Research Institute of Natural Science of Gangneung-Wonju National University.
Disclosure statement
No potential conflict of interest was reported by the author.
References
- 1. Beran R., Minimum Hellinger distance estimates for parametric models, Ann. Stat. 5 (1977), pp. 445–463. doi: 10.1214/aos/1176343842
- 2. Cantoni E. and Ronchetti E., Robust inference for generalized linear models, J. Am. Stat. Assoc. 96 (2001), pp. 1022–1030.
- 3. Giet L. and Lubrano M., Minimum Hellinger distance estimator for stochastic differential equations: An application to statistical inference for continuous time interest rate models, Comput. Stat. Data An. 52 (2008), pp. 2945–2965. doi: 10.1016/j.csda.2007.10.004
- 4. Hall P., Lahiri S.N. and Truong Y.K., On bandwidth choice for density estimation with dependent data, Ann. Stat. 23 (1995), pp. 2241–2263. doi: 10.1214/aos/1034713655
- 5. Hart J.D. and Vieu P., Data-driven bandwidth choice for density estimation based on dependent data, Ann. Stat. 18 (1990), pp. 873–890. doi: 10.1214/aos/1176347630
- 6. Hart J.D. and Wehrly T.E., Kernel regression estimation using repeated measurements data, J. Am. Stat. Assoc. 81 (1986), pp. 1080–1088. doi: 10.1080/01621459.1986.10478377
- 7. He X., Fung W.K. and Zhu Z., Robust estimation in generalized partial linear models for clustered data, J. Am. Stat. Assoc. 100 (2005), pp. 1176–1184. doi: 10.1198/016214505000000277
- 8. Kleinman K.P. and Ibrahim J.G., A semiparametric Bayesian approach to the random effects model, Biometrics 54 (1998), pp. 921–938. doi: 10.2307/2533846
- 9. Laird N.M. and Ware J.H., Random-effects models for longitudinal data, Biometrics 38 (1982), pp. 963–974. doi: 10.2307/2529876
- 10. Potthoff R. and Roy S.N., A generalized multivariate analysis of variance model useful especially for growth curve problems, Biometrika 51 (1964), pp. 313–326. doi: 10.1093/biomet/51.3-4.313
- 11. Qin G. and Zhu Z., Robust estimation in generalized semiparametric mixed models for longitudinal data, J. Multivar. Anal. 98 (2007), pp. 1658–1683. doi: 10.1016/j.jmva.2007.01.006
- 12. Schiratti J.B., Allassonnière S., Colliot O. and Durrleman S., Learning spatiotemporal trajectories from manifold-valued longitudinal data, NIPS (2015), pp. 1–9.
- 13. Sinha S.K., Robust analysis of generalized linear mixed models, J. Am. Stat. Assoc. 99 (2004), pp. 451–460. doi: 10.1198/016214504000000340
- 14. Wang Y.G., Lin X. and Zhu M., Robust estimating functions and bias correction for longitudinal data analysis, Biometrics 61 (2005), pp. 684–691.
- 15. Yau K.K.W. and Kuk A.Y.C., Robust estimation in generalized linear mixed models, J. R. Stat. Soc. B 64 (2002), pp. 101–117. doi: 10.1111/1467-9868.00327
- 16. Ying Z., Minimum Hellinger-type distance estimation for censored data, Ann. Stat. 20 (1992), pp. 1361–1390. doi: 10.1214/aos/1176348773
- 17. Zhao G. and Ma Y., Robust nonparametric kernel regression estimator, Stat. Probab. Lett. 116 (2016), pp. 72–79. doi: 10.1016/j.spl.2016.04.010
