Published in final edited form as: Multivariate Behav Res. 2019 Sep 25;55(5):704–721. doi: 10.1080/00273171.2019.1667217

The Performance of Multivariate Methods for Two-Group Comparisons with Small Samples and Incomplete Data

Keenan A Pituch a, Megha Joshi b, Molly E Cain b, Tiffany A Whittaker b, Wanchen Chang c, Ryoungsun Park d, Graham J McDougall e

Abstract

In intervention studies having multiple outcomes, researchers often use a series of univariate tests (e.g., ANOVAs) to assess group mean differences. Previous research found that this approach properly controls Type I error and generally provides greater power compared to MANOVA, especially under realistic effect size and correlation combinations. However, when group differences are assessed for a specific outcome, these procedures are strictly univariate and do not consider the outcome correlations, which may be problematic with missing outcome data. Linear mixed or multivariate multilevel models (MVMMs), implemented with maximum likelihood estimation, present an alternative analysis option where outcome correlations are taken into account when specific group mean differences are estimated. In this study, we use simulation methods to compare the performance of separate independent samples t tests estimated with ordinary least squares and analogous t tests from MVMMs to assess two-group mean differences with multiple outcomes under small sample and missingness conditions. Study results indicated that a MVMM implemented with restricted maximum likelihood estimation combined with the Kenward–Roger correction had the best performance. Therefore, for intervention studies with small N and normally distributed multivariate outcomes, the Kenward–Roger procedure is recommended over traditional methods and conventional MVMM analyses, particularly with incomplete data.

Keywords: ANOVA, Kenward–Roger correction, missing data, multivariate multilevel model, small samples

Introduction

Comparing mean performance of two groups for each of several outcomes is common practice in intervention research, where participants are randomized to groups. When researchers wish to examine group mean differences for each of several outcomes (and not for linear combinations of such variables), methodologists (e.g., Enders, 2003; Frane, 2015; Huberty & Petoskey, 2000; Rencher & Christensen, 2012) often recommend using a series of Bonferroni-adjusted univariate tests (e.g., multiple t tests or ANOVAs) to assess such differences with no need for a prerequisite MANOVA. Indeed, Frane’s extensive simulation study confirmed that using a series of Bonferroni-adjusted independent samples t tests not only properly controlled Type I error when multiple outcomes were present but also generally provided greater power than MANOVA, with MANOVA providing greater power than univariate methods only for parameter combinations that, according to Frane (2015, p. 237), rarely occur in practice.

Although general linear model (GLM) methods (e.g., ANOVA) have long been applied to traditional (i.e., non-multilevel) experimental designs, a more recently developed procedure, the multivariate multilevel model (MVMM), also known as the multivariate linear mixed model, has important advantages for group comparisons, even for simple experimental designs. First, like the GLM, which is often accompanied by use of ordinary least squares (OLS) estimation, MVMMs, implemented with maximum likelihood estimation, can be used to estimate group mean differences for multiple outcomes in traditional randomized experimental designs (i.e., where participants are randomly assigned to treatment groups). Further, simulation studies have shown that statistical tests associated with MVMMs provide more power for group mean differences than those available with OLS GLMs (Ashbeck & Bell, 2016; Park, Pituch, Kim, Chung, & Dodd, 2015; Sullivan, White, Salter, Ryan, & Lee, 2018). Second, when the analyst transforms scores for each variable so that they are on the same scale, MVMMs can be used to test whether group mean differences are the same or differ across the outcomes. In fact, Baldwin, Imel, Braithwaite, and Atkins (2014) assert that intervention researchers routinely hypothesize that treatment effects are stronger for primary than secondary outcomes but rarely conduct a statistical test of that hypothesis, which MVMMs offer. Although OLS GLMs are capable of implementing such a test, MVMMs are more powerful because they take into account the correlations among the multiple outcomes, whereas such GLMs do not. Third, when estimating differences in means for multiple outcomes, MVMMs allow the analyst to estimate different variance–covariance matrices for each group, which is rarely, if ever, done in traditional analyses (Brown & Prescott, 2015).

Fourth, and perhaps most importantly given the prevalence of missing data in applied studies, GLMs implemented with OLS (which we now label throughout the paper as GLMs) have limited capacity to handle incomplete responses (i.e., listwise or pairwise deletion). Deleting cases with incomplete data, as the GLM does, may be suitable when data are missing completely at random (MCAR), which means that the probability of missingness is not related to observed or unobserved variables. With MCAR data, GLM analysis may experience a reduction in power but parameter estimates remain unbiased (Ashbeck & Bell, 2016; Park et al., 2015). GLMs may also be appropriate for data missing at random (MAR), which means that the probability of missingness is related to one or more of the observed variables. The critical aspect of treating data that are MAR is that the observed variables accounting for missingness must be included in the analysis model to yield unbiased estimates (Collins, Schafer, & Kam, 2001; Rubin, 1976; Schafer & Graham, 2002). Thus, the potential problem with using a series of separate univariate t tests or ANOVAs in the presence of multiple outcomes is that the variables that account for missingness may be one or more of the multivariate outcomes. If so, such univariate GLMs may yield biased estimates (Ashbeck & Bell, 2016; Chang & Pituch, 2019; Sullivan et al., 2018). MVMMs, on the other hand, provide maximum likelihood treatment for incomplete outcome data, although not incomplete predictors. Further, if the response variables that are responsible for missingness are included in the model, this treatment is capable of providing unbiased estimates and greater power, compared to GLMs, when outcome data are MCAR or MAR (Ashbeck & Bell, 2016; Chang & Pituch, 2019; Park et al., 2015; Sullivan et al., 2018).

Although MVMMs provide additional advantages compared to GLMs, MVMMs do not appear to be used regularly in applied intervention research involving conventional (i.e., non-multilevel) designs. To assess how often MVMMs are used in applied studies having a traditional experimental design, a design that is the focus of this paper, we examined 64 recently published carefully controlled intervention studies. These studies were implemented mostly in a lab setting, where participants were randomly assigned to treatments, data on multiple outcomes were collected, and where the goal of each study was to estimate differences in group means for two or more independent groups. The 64 articles were published during 2017 in two journals: the Journal of Strength and Conditioning Research (n = 38) and the American Journal of Physical Medicine and Rehabilitation (n = 26). Given the context of the interventions, group sizes were generally small, with the mean, median, and modal group size being 17, 12, and 10, respectively, and two treatment conditions were most common (in 44 studies, 69%).

In these intervention studies, MVMMs or linear mixed models were rarely used. Instead, a vast majority of the studies (59 studies or 92%) used a series of univariate GLM analyses (i.e., ANOVAs, ANCOVAs). In contrast, only three studies (5%) used a linear mixed model (analogous to MVMM), and two studies (3%) used traditional MANOVA. In addition, even in these carefully controlled interventions, incomplete responses were reported in the majority of studies. We could not determine whether response data were complete or incomplete in one of the 64 studies, but of the remaining 63 studies, 30 (48%) reported no missing data and 33 (52%) reported incomplete response data for one or more outcomes. Only three studies reported using MVMMs to treat incomplete response data, with the remaining 30 studies using GLMs, with such analyses accompanied by the use of pairwise or listwise deletion. These findings are consistent with other reviews that examined analysis methods used in larger scale clinical trials. For example, Vickerstaff, Ambler, King, Nazareth, and Omar (2015) found that MVMMs were not used in any of 209 published randomized controlled trials in the fields of neurology and psychiatry. Further, Bell, Fiero, Horton, and Hsu (2014) reported that of 73 clinical trials published in top medical journals in 2013 that reported missing data, only 15% of the studies used linear mixed models to treat missing data whereas the remaining studies used an analysis method that assumed data were MCAR.

When small group sizes are present, conventional MVMM procedures have been shown to yield inflated Type I error rates for tests of group mean differences (Park et al., 2015). However, MVMM procedures designed for smaller samples (i.e., the Kenward–Roger correction) may provide unbiased parameter estimates (as a standard MVMM does) while also adequately controlling Type I error and providing as much or more power than competing analyses when incomplete responses are present. Although methodological studies have examined the performance of the Kenward–Roger correction, such studies have included only complete data (Kowalchuk, Keselman, Algina, & Wolfinger, 2004; Ferron, Bell, Hess, Rendina-Gobioff, & Hibbard, 2009; Vallejo & Livacic-Rojas, 2005), a univariate outcome (Bell, Morgan, Schoenberger, Kromrey, & Ferron, 2014; McNeish & Stapleton, 2016), or moderate to large sample sizes and/or limited missingness conditions (Ashbeck & Bell, 2016; Gosho & Maruo, 2018; Spilke, Piepho, & Hu, 2005; Sullivan et al., 2018). We are not aware of any study that has systematically assessed the small sample performance of MVMMs in this particular multivariate context focused on group mean differences while also including multiple missing data mechanisms, as described below.

The purpose of this study is to extend multivariate methodological research to conventional experimental designs where the primary research interest is to estimate and test group mean differences for each of a set of outcomes where small group sizes and various types of missingness are present. Our goal is to determine how well MVMMs – particularly those intended for small samples – perform compared to standard MVMMs and GLMs under these taxing conditions by examining estimation bias, Type I error, and power. If MVMMs perform universally better than GLMs, use of OLS GLMs may be unnecessary for group mean comparisons for multiple outcomes. To compare the performance of these different methods, we conducted two simulation studies that differ primarily in the number of outcome variables included in the design, where the first study had two outcomes and the second study included a third outcome. Because the results of the two studies were similar, we focus only on the two-outcome study here. The methodology and results associated with the three-outcome study are available from the first author.

Simulation study method

We generated data having two outcomes with the following model:

$$y_1 = \beta_{10} + \beta_{11} D + r_1, \tag{1}$$

$$y_2 = \beta_{20} + \beta_{21} D + r_2, \tag{2}$$

where y1 and y2 are outcomes, and D is a dummy-coded variable (e.g., 0 = control and 1 = treatment). Thus, β10 and β20 represent control group means and β11 and β21 represent group mean differences for y1 and y2, respectively. The residuals are multivariate normal, each with mean of zero and variance of 1, and covariance matrix

$$\mathbf{R} = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix}, \tag{3}$$

where $\sigma_1^2$ and $\sigma_2^2$ represent the variances of the residuals in Equations (1) and (2), respectively, and $\sigma_{12}$ represents the covariance of the residuals. SAS software, version 9.4 (SAS Institute Inc., 2014), was used for this study, with 10,000 replications for each condition. Two groups were used because many interventions have just two groups, as our review of 64 intervention studies suggests. Note that the SAS code used to generate the data is included in the online Supplementary Materials.
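To make the generating model concrete, the following minimal SAS sketch simulates one replication under Equations (1)–(3). It is not the authors’ supplementary code: the seed, the parameter values, and the data set and variable names (sim, id, d, y1, y2) are our illustrative assumptions.

    /* Minimal sketch of one simulated replication under Equations (1)-(3).
       Parameter values, seed, and names are illustrative assumptions. */
    %let n      = 20;   /* total sample size (10 per group)       */
    %let beta11 = 0.5;  /* treatment effect for y1                */
    %let beta21 = 0.5;  /* treatment effect for y2                */
    %let rho    = 0.4;  /* residual correlation between y1 and y2 */

    data sim;
      call streaminit(123);
      do id = 1 to &n;
        d  = (id > &n / 2);   /* dummy code: 0 = control, 1 = treatment */
        r1 = rand("Normal");  /* residual for y1, variance 1            */
        r2 = &rho * r1 + sqrt(1 - &rho**2) * rand("Normal");
                              /* residual for y2: variance 1, correlated
                                 rho with r1                             */
        y1 = 0 + &beta11 * d + r1;  /* Equation (1), control mean = 0    */
        y2 = 0 + &beta21 * d + r2;  /* Equation (2), control mean = 0    */
        output;
      end;
    run;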

Conditions

Five factors were systematically varied for a total of 1296 conditions. Table 1 shows the study factors and levels, and we now provide justifications for these study conditions. As Park et al. (2015) found somewhat elevated Type I error rates for conventional MVMMs with an N of 40, the smaller sample sizes included here are used to assess how MVMMs perform under more taxing sample size conditions, including some missing data conditions where the treatment and control group sizes differ, as described below. Given our missing data conditions, the average group sizes for this simulation study are 7, 8.5, 10, 14, 17, 20, 21, 25.5, and 30, values consistent with those obtained in our review of the 64 intervention studies. Note also that, relative to Park et al. (2015), we include additional analysis methods, described below, that are designed to perform better with extremely small samples.

Table 1. Simulation design features.

Factor | Levels
Sample size (N) | 20, 40, 60
Residual outcome correlation | −0.8, −0.6, −0.4, −0.2, 0, 0.2, 0.4, 0.6, 0.8
Percent missing on y1 | 0%, 15%, 30%
y1 effect size | 0, 0.2, 0.5, 0.8
y2 effect size | 0, 0.2, 0.5, 0.8

We varied correlations between the outcome variables and included nine different correlation values. The primary advantage of including a variety of correlation values is to assess how well the analysis procedures perform under different missing data mechanisms. As we explain below, when data are incomplete, outcomes that are uncorrelated yield data that are MCAR. In contrast, non-zero correlations produce MAR data, and correlations with large absolute values produce more severe MAR conditions that are likely to bias GLM estimates.

We included a variety of effect size values to examine the performance of the analysis methods under a wide range of univariate and multivariate effect sizes. Null effect sizes allow us to assess Type I error. Also, given that the population residual standard deviations were set to 1, non-zero effect sizes represent small (0.2), moderate (0.5), and large (0.8) effects for each outcome, enabling an assessment of power across a wide range of univariate effects. Similarly, given that the multivariate effect size D2 (Mahalanobis, 1936) is calculated from the univariate effect sizes and correlations, the resulting non-zero multivariate effect sizes range widely from a very small value of 0.04 to a very large value of 6.4. Using Stevens’ (1980) classifications for D2, 21% of the non-null D2 values are small effects (D2 ≤ 0.25), 24% are medium effects (0.25 < D2 ≤ 0.64), 31% are large (0.64 < D2 ≤ 1), and 24% are very large effects (D2 > 1). The effect sizes, given the other conditions in the study, provided for maximum univariate and multivariate power values near 0.90 and 1.00, respectively.
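To show how D2 follows from the univariate effect sizes and the correlation in the two-outcome case, the display below gives the standard closed form (our reconstruction, with notation ours); because the residual variances are 1, the matrix R of Equation (3) is also the correlation matrix:

$$D^2 = \mathbf{d}'\mathbf{R}^{-1}\mathbf{d} = \frac{d_1^2 + d_2^2 - 2\rho_{12}\,d_1 d_2}{1 - \rho_{12}^2},$$

where $d_1$ and $d_2$ are the univariate effect sizes and $\rho_{12}$ is the residual outcome correlation. This reproduces the endpoint values just reported: $d_1 = d_2 = 0.8$ with $\rho_{12} = -0.8$ gives $D^2 = (0.64 + 0.64 + 1.024)/0.36 = 6.4$, the largest value above, whereas $d_1 = 0.2$ and $d_2 = 0$ with $\rho_{12} = 0$ give the smallest non-null value, 0.04.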

A no-missingness condition was included to assess how the analysis models perform with complete data. The other missingness levels (15%, 30%) allow us to assess how the methods perform under very small samples (e.g., N of 20 becomes N of 14 under 30% missingness, which is 7 cases per group) and under realistic proportions of incomplete data. For example, in the 64 intervention studies we reviewed, 48% of the studies reported no missing data, 27% reported that at least 15% of the participants had incomplete response data, and 10% reported that at least 30% of the participants had incomplete response data. Reporting even higher rates, Bell et al. (2014) found that 73 of 77 (95%) randomized clinical trials published in top medical journals reported missing data, with about half the studies reporting that incomplete data were present for at least 10% of study participants.

Missing data

Missing data were obtained only for y1 by selecting as missing either 15% or 30% of the cases from the complete data. The missing data mechanisms included in this study are MCAR and MAR, a broader set of mechanisms than was included in the work by Park et al. (2015), Frane (2015), and Gosho and Maruo (2018), whose simulation studies included only complete data or a single missing data mechanism. Further, although Sullivan et al. (2018) generated MCAR and MAR data in their simulation study examining various methods that can be used to compare means in a two-group study, their study did not consider small sample sizes (they used N = 600).

There are infinitely many ways to obtain MAR data. Here, we modified a procedure by Enders (2001) and Chang and Pituch (2019) so that missingness for y1 depends on the observed y2 scores, such that cases having lower y2 scores are more likely to be selected as missing for y1. To obtain missing scores for y1, we first generated complete outcome data for all cases according to our generating model. We then sorted the complete data in ascending order for y2 and obtained the corresponding cumulative proportion for each case. Subtracting each of these proportions from 1 then served as the probability of being missing. For example, for N of 20, the cumulative proportion for the case having the lowest y2 score is 0.05, and 1 − 0.05 results in a 0.95 probability of being missing for this case. To ensure that only cases from the lower end of the y2 distribution could be selected as missing for y1, we allowed only those cases scoring below the 50th percentile on y2 to be missing. From these lower scoring y2 cases, we then selected cases to be missing for y1 without replacement, with probabilities proportional to the probability of being missing, which means that among the cases scoring below the 50th percentile those with lower y2 scores are more likely to be selected as missing for y1.
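A SAS sketch of this selection procedure follows. It is our illustration rather than the authors’ supplementary code: it reuses the illustrative sim data set from the earlier sketch (N = 20 with 30% missingness, so six cases are set missing), and the strict handling of the boundary case at the 50th percentile is our assumption.

    proc sort data=sim out=bysort; by y2; run;

    data eligible;
      set bysort nobs=ntot;
      cumprop = _n_ / ntot;   /* cumulative proportion after sorting by y2 */
      pmiss   = 1 - cumprop;  /* probability of being selected as missing  */
      if cumprop < 0.5;       /* keep only cases below the 50th percentile */
    run;

    /* Select six cases without replacement, with probability proportional
       to pmiss (assumes no selection probability exceeds 1). */
    proc surveyselect data=eligible out=selected method=pps sampsize=6
                      seed=456 noprint;
      size pmiss;
    run;

    proc sort data=sim;      by id; run;
    proc sort data=selected; by id; run;

    data sim_mar;             /* set y1 to missing for the selected cases */
      merge sim selected(in=insel keep=id);
      by id;
      if insel then y1 = .;
    run;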

Several desirable characteristics resulted from this missing data selection method. First, when the population correlation between y1 and y2 is zero, the missing data mechanism is MCAR. The reason for MCAR here is that cases having a lower y2 score could have virtually any score on y1 due to the variables being uncorrelated. Thus, as noted by Collins et al. (2001), this process mimics a completely random selection of cases to be missing for y1. Second, when the population correlation is not zero, the missing data mechanism is MAR, as missingness on y1 is now systematically dependent on the observed y2 score. Further, when the magnitude of this correlation increases, we obtain increasing departures from the MCAR mechanism or more severe cases of MAR. For example, when y1 is positively correlated with y2, cases having low y2 scores also tend to have lower y1 scores. As a result, cases that are selected as missing for y1 will likely have below average y1 scores. Note this type of missingness is consistent with scenarios where participants who are not responding to a given treatment are willing to complete a less revealing or less demanding measure but not a more revealing or intrusive measure. For example, participants may be willing to report low compliance to an exercise regimen in a weight loss intervention but may refuse to complete a treadmill test. On the other hand, in our simulation study, when the correlation between y1 and y2 is negative, cases having higher y1 scores are more likely to be selected as missing because they tend to have lower y2 scores (as only those cases having lower y2 scores can be selected as missing). This type of missingness is consistent, for example, with an intervention study for patients with Type 2 diabetes where participants who have lower weight loss refuse to submit to a blood glucose test. Generating missing data for y1 in this way allowed us to examine the performance of the statistical models under two different missingness mechanisms: MCAR and MAR, with increasing departures from MCAR occurring as the magnitude of the correlation increases.

Estimating models and expected performance

Table 2 displays the key features of the four estimating models that were used to analyze each simulated data set. We now explain why these analysis methods were selected and describe the expected performance of each method. First, separate GLM independent samples t tests were included because such methods are commonly used to assess group mean differences, as our review of the 64 studies indicated. When the GLM is implemented with OLS, unbiased parameter estimates are expected with complete and MCAR data, but we expect biased estimation for MAR data, particularly as the effect size for y2 and the magnitude of the correlation between y1 and y2 increase (i.e., for more severe cases of MAR, as found in Sullivan et al. (2018)). For example, when the y2 effect size is large and the correlation is positive and large, cases selected to be missing for y1 will be primarily those in the control group (due to the large y2 effect size) that have generally low y1 scores (due to the correlation), analogous to the situation where lower performing control group participants refuse to complete one of the measures. When these y1 scores are removed (because they are selected as missing), the estimate of the control group mean for y1 will be larger and the control group y1 variance will be smaller than if the cases had been present. As such, the treatment effect estimate and residual variance for y1 are expected to be underestimated, along with the potential for an inflated Type I error rate. Similarly, when the y2 effect size is large and the correlation between y1 and y2 is negative and strong, the cases selected as missing for y1 will be primarily those in the control group that have higher y1 scores. As a result, the treatment effect for y1 is expected to be overestimated with the removal of the higher y1 scores from the control group. The residual y1 variance is, again, expected to be underestimated, which, as before, may result in inflated Type I errors. Note also that unbalanced group sizes will be created with this method, as more control cases will, under some conditions, be selected as missing. However, this imbalance alone will not be responsible for any estimation bias because inclusion of the dummy-coded treatment variable would provide proper missing data treatment if group membership – and not y2 – were the cause of missingness (Sullivan et al., 2018, p. 2613).

Table 2. Analysis models used in the two-outcome simulation study.

Acronym | Type | Estimation | Test distribution
GLM | Separate general linear model t tests implemented with pairwise deletion | Least squares | t distribution with N − 2 df
ML | Multivariate multilevel model t tests | Full maximum likelihood | Standard normal (infinite df)
REML | Multivariate multilevel model t tests | Restricted maximum likelihood | t distribution with between-subjects df
KR | Multivariate multilevel model t tests | Restricted maximum likelihood with Kenward–Roger corrected standard errors | t distribution with Kenward–Roger corrected df

A second set of statistical tests included in the simulation study are Wald (or z) tests from a MVMM using full maximum likelihood (ML) estimation. This procedure is included because it is the default estimation and testing procedure in some structural equation modeling programs as well as the multilevel software program MLwiN (Charlton, Rasbash, Browne, Healy, & Cameron, 2019). ML-based procedures are known to provide unbiased estimates of fixed effects (i.e., group mean differences) under MCAR and MAR data, provided that for MAR data the variables responsible for missingness are included in the analysis model. However, when sample size is small, ML estimation is expected to produce negatively biased variance estimates. This is a well-known problem (e.g., Raudenbush & Bryk, 2002; West, Welch, & Galecki, 2015) and is due to the fact that ML estimation does not take into account the degrees of freedom used in estimating fixed effects. For example, Raudenbush and Bryk (2002) note that the ML estimator of the variance of a variable x is $\sum_{i=1}^{n}(x_i - \bar{x})^2 / n$, which underestimates the population variance, a bias that, in effect, vanishes when sample size is sufficiently large. Another expected performance problem for ML is that the accompanying Wald tests will yield inflated Type I error rates for the tests of group mean differences, for two reasons. First, the negatively biased variance estimates will produce downwardly biased standard errors, which are expected to inflate Type I error rates, especially for small sample sizes. Second, the use of the standard normal reference distribution is known to be improper for linear mixed models, particularly for small sample sizes. As such, the critical value for the Wald z test (i.e., ±1.96, for α = 0.05) is too small, and its use is known to produce inflated Type I error rates (McNeish & Matta, 2018; Schaalje, McBride, & Fellingham, 2002).
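To make the source of this bias concrete, the following display sketches the standard result (our addition, not the article’s): dividing by n rather than the degrees of freedom makes the ML variance estimator too small on average,

$$\hat{\sigma}^2_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2, \qquad E\!\left[\hat{\sigma}^2_{\mathrm{ML}}\right] = \frac{n-1}{n}\,\sigma^2.$$

With p fixed-effect parameters in the model, the analogous expectation is approximately $((n-p)/n)\sigma^2$; for N = 20 and the two fixed effects per outcome in Equations (1) and (2), this gives (20 − 2)/20 = 0.90, anticipating the ML residual variance estimates near 0.90 reported in the Results.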

The remaining two statistical models are MVMMs using restricted maximum likelihood estimation. In particular, the third analysis procedure uses MVMM t tests with the between-subjects df. This procedure is included because it is the default procedure for mixed models in SAS (version 9.4) when the unstructured covariance matrix is specified, as in Equation (3). (See the Appendix for the default df for mixed modeling with SAS and SPSS). The fourth and final procedure also uses MVMM t tests but with a Kenward–Roger correction (Kenward & Roger, 1997). Given that these two procedures use the same estimation method – restricted maximum likelihood estimation – they will produce identical point estimates of treatment effects and residual variances. Like the aforementioned ML procedure, REML is known to produce unbiased estimates of group mean differences with MCAR and MAR data. Further, because this estimation procedure properly takes into account the loss of degrees of freedom (analogous to using n – 1 instead of n for the variance), variances are known to be estimated without bias. However, even with unbiased parameter estimates, the standard errors estimated for tests of group differences are known to be downwardly biased (Kackar & Harville, 1984). As such, inflated Type I error rates are expected for the restricted maximum likelihood procedure with the default degrees of freedom (which we label as REML) particularly under smaller sample sizes, as observed by Park et al. (2015). In contrast, with the Kenward-Roger method, these standard errors are adjusted (inflated) to improve inference. In addition, the KR procedure employs a Satterthwaite (1946) type correction to improve the estimation of the degrees of freedom (which generally makes the critical value larger). The KR procedure was included, then, based on its superior performance for very small samples (e.g., Bell et al., 2014; Kowalchuk et al., 2004; McNeish & Stapleton, 2016). See McNeish (2017) for a conceptually focused presentation of the KR correction.
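To make the MVMM specification concrete, the following minimal SAS sketch fits the KR procedure of Table 2. It is our illustration, not the authors’ code: the two outcomes are stacked into one record per outcome per person (the data set and variable names long, var, and y are ours), and PROC MIXED is fit with an unstructured residual covariance matrix, as in Equation (3), and DDFM=KR for the Kenward–Roger corrected tests.

    data long;    /* stack the outcomes: one record per outcome per person */
      set sim;
      var = 1; y = y1; output;
      var = 2; y = y2; output;
      keep id d var y;
    run;

    proc mixed data=long method=reml;
      class id var;
      model y = var var*d / noint solution ddfm=kr;
           /* NOINT with VAR and VAR*D gives a control mean and a
              treatment effect for each outcome */
      repeated var / subject=id type=un;
           /* unstructured residual covariance matrix, Equation (3) */
    run;

Records with a missing y are simply omitted from the likelihood, so a case missing y1 still contributes its observed y2. Replacing DDFM=KR with DDFM=BW yields the uncorrected REML procedure, and METHOD=ML corresponds to the estimation used by the ML procedure (whose Wald z tests use an infinite-df reference distribution).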

Note that for the statistical tests, an alpha of 0.025 was used (given two outcomes) when testing the omnibus multivariate null hypothesis, and we rejected the omnibus null when the p value for at least one outcome was ≤ 0.025. For study conditions with missing data, pairwise deletion was used for the GLM t tests so that all available information was used for the complete outcome, y2. For tests of group mean differences for a specific outcome, the same statistical tests were used. Here, though, an alpha of 0.05 was used so that readers can easily make comparisons to the familiar 0.05 alpha level. In practice, Bonferroni adjustments may be a better choice given their superior Type I error control. We use such adjustments in the “Illustration” section.

Analysis

The analysis models were evaluated by computing several dependent variables, including relative bias for estimates of treatment effects (group mean differences for each outcome) and residual variances. Relative bias was calculated as $(\hat{\theta} - \theta)/\theta$, where $\hat{\theta}$ is an estimate of its population parameter $\theta$. When the population treatment effect was equal to zero, in which case relative bias cannot be calculated, we inspected plots of mean bias (i.e., the mean of $\hat{\theta} - \theta$ for each condition) and examined Type I error rates. Point estimates were considered to be biased when the magnitude of the relative bias exceeded 0.10 (Muthén & Muthén, 2002).

We also computed empirical Type I error rates and power for tests of multivariate and univariate null hypotheses. For the multivariate null hypothesis, we computed the proportion of replications for which the null was rejected. This proportion served as the multivariate Type I error rate for null group mean differences for all outcomes and multivariate power when at least one population group mean difference was non-null. For each univariate null hypothesis, we computed the proportion of replications for which the univariate null was rejected. This proportion is the univariate Type I error rate when the population mean difference was null and represents power otherwise. To assess the accuracy of Type I error rates, we used Bradley’s (1978) fairly stringent criterion, which, for an alpha of 0.05, considers the mean Type I error rate as accurate for a given condition if it lies between 0.045 and 0.055.

The effects of the study factors (e.g., method, missingness, etc.) on relative bias, Type I error, and power were examined by estimating full factorial repeated measures ANOVAs for each outcome, where method is the repeated factor. Interpretations of the importance of the effects were based on partial eta squared (ηp2) effect size estimates, with nontrivial effects defined as ηp2 values greater than 0.01 (Enders, 2001). We provide graphs of the key study results, with missingness levels included in each figure given the focus on incomplete data.

Results

Estimation bias

The models converged for all simulated data sets. For the estimates of group mean differences (i.e., treatment effects), there was no bias when the outcomes were complete, with all relative bias estimates being smaller than 0.05. However, for the incomplete outcome (i.e., y1), GLM estimates were biased at times when data were MAR. ANOVA results indicated that method was associated with relative bias via a two-way interaction with correlation (ηp2=0.03), a three-way interaction with correlation and missingness (ηp2=0.03), a three-way interaction with correlation and the y2 effect size (ηp2=0.01), and a four-way interaction with correlation, missingness, and y2 effect size (ηp2=0.02).

Figure 1 shows the treatment effect estimates for the y1 population effect size of 0.5 by (a) method, (b) the population effect size for y2 (0.2, 0.5, and 0.8, representing increasingly severe cases of MAR), (c) missingness level, and (d) population correlation. Figure 1 displays the y1 treatment effect estimates themselves – not relative bias – so that readers can directly see the values of the estimates, which are averaged across sample size levels, as N was not related to relative bias. Further, Figure 1 displays the estimates only for the moderate y1 effect size, as bias was not related to the y1 effect sizes. Figure 1 shows that the treatment effect estimates for all methods except GLM completely overlap one another and also virtually overlap the population effect size, thus exhibiting no estimation bias across all study conditions. With no missingness, Figure 1 shows that the GLM treatment effect estimates are also unbiased, and some generally minimal bias is present for 15% missingness. However, for 30% missingness, the bias associated with the GLM estimates is evident, especially for more severe cases of MAR and when the magnitude of the residual outcome correlation is greater. Specifically, the y1 treatment effect estimates for GLM were much greater than the population effect sizes when the correlation was negative, much smaller than the true effect sizes when the correlation was positive, but not biased for MCAR conditions (i.e., when ρ12 = 0).

Figure 1. Estimates for the y1 population effect size of 0.5 by method, y2 effect size, missingness, and residual outcome correlation. Increasing values of the y2 effect size and of the correlation magnitude represent more severe MAR conditions.

For estimates of the residual variances, the ANOVA results for the complete outcome (y2) indicated that relative bias was associated with method through its main effect (ηp2=0.90) and interaction with sample size (ηp2=0.65). For the incomplete outcome (y1), ANOVA results showed that residual variance estimation was related to method through its main effect (ηp2=0.41), two-way interactions with sample size (ηp2=0.13), the y1 and y2 correlation (ηp2=0.08), and missingness (ηp2=0.15), and a three-way interaction with correlation and missingness (ηp2=0.08).

Figure 2 shows the y1 residual variance estimates by method, sample size, missingness, and correlation. When N = 20, Figure 2 shows that the ML residual variance estimates were consistently near 0.90 for both complete and missing data, with the degree of bias becoming negligible for larger sample sizes. GLM estimates of the y1 residual variance were unbiased for complete data and generally negligibly biased when data were 15% missing, but were negatively biased for 30% missingness when the magnitude of the residual outcome correlation was large (i.e., for MAR data). In contrast, the REML and KR procedures, which of course produce identical variance estimates, were unbiased across all conditions.

Figure 2. Estimates for the y1 residual population variance of 1.0 by method, sample size, missingness, and residual outcome correlation. Increasing magnitudes of the correlation represent more severe MAR conditions.

Type I error

For omnibus Type I error, results from the ANOVA indicated that method is related to the Type I error rates through its main effect (ηp2=0.02), with ηp2 for all other effects being smaller than 0.01. Figure 3 shows estimates of the Type I error rates for the test of the omnibus multivariate null hypothesis by method, sample size, and missingness, along with reference lines at 0.045 and 0.055; rates falling between these lines are considered accurate. As shown in Figure 3, ML tests produced Type I error rates that were nearly always above the upper limit of the robustness range and never below it, with elevated Type I error rates appearing in 73 of 81 (90%) multivariate null conditions. REML tests produced much more accurate Type I error rates but still exhibited inflated rates in 7 of 81 (9%) conditions and conservative rates in 12 (15%) conditions. The GLM and KR tests did not produce elevated omnibus Type I error rates for any condition. However, KR produced Type I error rates below the robustness interval in 21 (26%) conditions, and GLM produced such conservative rates in 10 (12%) conditions.

Figure 3. Empirical Type I error rates for the test of the multivariate null hypothesis by method, sample size, and missingness.

For the Type I error rates of specific outcomes (i.e., y1 and y2), results from the ANOVAs indicated that the main effect of method was important for y1(ηp2=0.01) and y2(ηp2=0.02), with ηp2 for all other effects being smaller than 0.01. Figure 4 shows the estimated Type I error rates for the outcome having missing data (i.e., y1) for each method across sample size, missingness, and correlation levels. As shown in Figure 4, the ML Wald test produced elevated Type I error rates in 323 of the 324 (99.7%) conditions where population effects were null for y1. Although we do not display a figure for y2, ML tests similarly yielded Type I error rates that exceeded the robustness interval in 321 (or 99%) of such null conditions. REML t tests also produced elevated Type I error rates, particularly for the missing data conditions. For y1, REML produced error rates greater than 0.055 in 90 (28%) study conditions and had such elevated error rates in only six study conditions for the specific test for y2 (2%). In addition, REML produced Type I error rates that were below the robustness interval for one (<1%) condition for y1 and three conditions for y2 (< 1%). The GLM t tests produced elevated Type I error rates in 30 (9.2%) conditions for y1, particularly for MAR conditions with 30% missingness, as shown in Figure 4. These MAR conditions are those evidenced in Figures 1 and 2, where GLM provided biased estimates of treatment effects and residual variances. For other conditions, though, Type I error was well controlled with GLM, with elevated Type I error rates for the complete outcome (y2) occurring in only six (2%) conditions and error rates lower than 0.045 occurring one and three times (each <1%) for y1 and y2, respectively. For the KR method, Type I error rates were above the robustness interval in 2 (<1%) conditions for y1 and below this interval in 11 (3%) conditions. The KR method had performance identical to GLM for the complete outcome (y2).

Figure 4. Empirical Type I error rates for the test of the null hypothesis for y1 by method, sample size, missingness, and residual outcome correlation.

For the overall study performance of the methods with respect to multivariate and univariate Type I error, the KR method was the best performing procedure. Across all null conditions, the KR method produced Type I error rates that were outside the robustness interval for 29 of 729 conditions (4%), with these error rates exceeding the 0.055 level in 8 (1%) study conditions and falling below the 0.045 level in 21 (3%) of conditions. GLM was the next best performing method having empirical Type I error rates outside the robustness interval for 50 conditions (7%), with elevated error rates in 36 (5%) conditions and conservative error rates in 14 (2%) of conditions. REML produced inaccurate Type I error rates in 119 conditions (16%), yielding elevated Type I error rates in 103 (14%) conditions and conservative error rates in 16 (2%) of conditions. The poorest Type I error control was provided by ML, which yielded elevated Type I error rates in 717 or 98% of study conditions and never exhibited conservative Type I error rates.

Power

For omnibus power, results from the ANOVA indicated that power was associated with the main effects of method (ηp2=0.03), effect size (ηp2=0.16), sample size (ηp2=0.05), and correlation (ηp2=0.01), along with two-way interactions involving effect size and sample size (ηp2=0.02) and effect size and correlation (ηp2=0.01). Figure 5 shows omnibus power as a function of method, effect size, and missingness. The power advantage of ML Wald tests across effect size and missingness levels is apparent in the plots. Recall, though, that this power advantage is accompanied by the inflated Type I error rates that occurred across most study conditions. Note that Figure 5 shows generally very small power differences among the other three methods. Of lesser importance, and thus not shown in Figure 5, multivariate power increased as N increased, the increase in power for larger effect sizes was more pronounced for larger sample sizes, and the power increase for larger effect sizes was somewhat smaller when the magnitude of the residual correlation was larger.

Figure 5. Empirical power for the test of the multivariate null hypothesis by method, multivariate effect size, and missingness.

For the power of tests for specific outcomes, conclusions from the ANOVA results were the same for y1 and y2. That is, the main effect of method was important for y1 (ηp2=0.03) and y2 (ηp2=0.04). Other effects, not involving method, include the main effect of effect size for y1 (ηp2=0.23) and y2 (ηp2=0.25), the main effect of sample size for y1 (ηp2=0.06) and y2 (ηp2=0.07), and a two-way interaction between effect size and sample size for y1 and y2 (ηp2=0.03 for both). Figure 6 shows the power for the test of y1 when the y1 effect size is large, for each method by sample size, missingness, and correlation levels. Figure 6 shows the power advantage for ML, particularly for N of 20, although the plots show that this advantage decreases for larger sample sizes. Again, though, any enthusiasm for the Wald z test must be tempered by its inflated Type I error rate, which occurred across virtually all study conditions. Further, in the plots showing power for 30% missingness, note that GLM is nearly as powerful as ML for negative correlations but offers the least power for positive correlations. This finding is due particularly to the misestimated treatment effects for GLM under MAR, where the effect is overestimated for negative correlations and underestimated for positive correlations. Note that the plots for the complete outcome (y2) mirror the plots in Figure 6 when y1 is complete, with the power of ML superior and little to no difference among the other methods. Finally, the interaction between effect size and sample size is such that the increase in power as effect size increases is much greater for larger sample sizes.

Figure 6. Empirical power when the population effect size for y1 = 0.8 by method, sample size, missingness, and residual outcome correlation.

Illustration

In this illustration, we compare results obtained by applying GLM and MVMM KR procedures to multivariate data from a well-conducted intervention, where estimating and testing differences in means across multiple correlated outcomes is the primary research interest. The data for this illustration are taken from the Senior WISE (Wisdom is Simply Exploration) study, a two-group multisite randomized trial that enrolled 265 community-dwelling adults without dementia aged 65 years and older between 2001 and 2006 (McDougall, Becker, Pituch et al., 2010). Study participants were randomly assigned to 12 hours of either a memory or a health intervention at each of several sites. The content of the memory intervention was based on the cognitive behavioral model of everyday memory and self-efficacy theory (Bandura, 1997; McDougall, Becker, Acee et al., 2010; McDougall, Becker, Pituch et al., 2010). The health training condition, in contrast, featured 18 topics that emphasized successful aging, including, for example, exercise, weight management, getting the most from your physician visit, and nutrition. The primary hypothesis tested in the Senior WISE study was that at-risk older adults who received the memory training intervention would show significantly better memory-related self-efficacy, performance, and instrumental functioning in daily living than participants in the health training condition.

For this illustration, we use data from only one of the Senior WISE study sites so that the sample size for the illustration is 27. At this site, 14 adults were assigned to the memory intervention and 13 to the health group. The average age for participants in the memory group is 75 and in the health group is 73 (memory group SD = 4.5; health group SD = 7.0), and the years of formal education completed by participants is similar for each group (memory group M = 12.1, SD = 5.0; health group M = 11.9, SD = 3.0). In addition, 12 females are in each group, with most participants (23 of 27) identifying themselves as African-American or Hispanic.

Depressive symptoms, known to affect memory functioning, are the outcomes selected for the illustration, and we include symptoms collected upon completion of the intervention. To assess depressive symptoms, we used scale scores of well-being measured with 4 items (e.g., felt good, hopeful), depressed affect measured with 7 items (e.g., felt sad, lonely), and somatic symptoms measured with 7 items (e.g., had restless sleep, felt bothered). These represent three of the four subscales from the Center for Epidemiologic Studies Depression instrument. Each item has four response options (0, 1, 2, 3) ranging from “rarely or none of the time” to “most or all of the time.” Item responses are summed for each subscale, with the well-being items reverse scored (with the scale then called lack of well-being) so that higher scores on each scale indicate greater depressive symptoms. Factor analytic work by Hertzog, Van Alstine, Usala, and Hultsch (1990) supports using the subscales with older adults. The statistical procedures used in the illustration to test for group mean differences are (a) GLM independent samples t tests (one for each outcome) with OLS estimation and pairwise deletion and (b) t tests from the MVMM procedure with the Kenward–Roger correction. Given three outcomes, we used a Bonferroni adjustment to maintain the family-wise alpha level at 0.05.
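As a hypothetical sketch of this analysis (the input data set and variable names below are ours, not from the Senior WISE code base), the three subscales can be stacked and analyzed with the same PROC MIXED specification shown earlier, comparing each outcome’s p value to the Bonferroni-adjusted level of 0.05/3 ≈ 0.0167:

    data long3;   /* hypothetical wide data set: id, group (0 = health,
                     1 = memory), and subscale scores wb, da, ss        */
      set wide;
      var = 1; y = wb; output;   /* lack of well-being                  */
      var = 2; y = da; output;   /* depressed affect                    */
      var = 3; y = ss; output;   /* somatic symptoms (two set missing)  */
      keep id group var y;
    run;

    proc mixed data=long3 method=reml;
      class id var;
      model y = var var*group / noint solution ddfm=kr;
      repeated var / subject=id type=un;
    run;
    /* Compare each var*group p value to 0.05/3 = 0.0167 (Bonferroni). */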

Although the 27 cases had no missing data for these variables, we created missing responses for one of the outcomes in a manner consistent with the simulation study. In particular, for the somatic symptoms scale, we removed the scores of two cases from the health group. These two cases reported post-intervention scores of zero for depressed affect as well as for somatic symptoms, and these two variables were highly correlated (r = 0.73). The two cases also had scores of 0 and 1 for lack of well-being, although the correlation between this variable and somatic symptoms was not as strong (r = 0.41). As such, the illustration data are not MCAR, as missingness on somatic symptoms is related to depressed affect and lack of well-being. Creating the two missing responses in this way allows us to illustrate what can happen even when a small percentage of cases (2 of 27, or 7%) has incomplete responses.

Table 3 provides the key analysis results. Note that the values in Table 3 are the same for each analysis method except for the health group mean for somatic symptoms and the associated mean difference. In the actual data set (with no missing data for somatic symptoms), this group mean is 5.42. Now, with the two cases removed, Table 3 shows that the GLM estimate is about one point greater (6.41), whereas the REML estimate (6.01) is not as large. The GLM analysis assumes that the data are MCAR, which is not true with these data, as the two cases with missingness have very low scores on the other two outcomes. The REML estimate takes the correlations among the outcomes into account and provides an estimate of the health group mean that is closer to the value obtained with the actual data. In this case, the two analysis procedures yield different conclusions for somatic symptoms. As shown in Table 3, the mean difference for somatic symptoms is statistically significant with the standard t test procedure, t(23) = −2.60, p = 0.048 (Bonferroni adjusted), but not with the MVMM KR procedure, t(24.3) = −2.34, p = 0.084 (Bonferroni adjusted). Although we do not know whether the null hypothesis is true for this outcome, the performance of the GLM procedure – apparently overestimating the mean difference and perhaps committing a Type I error – is consistent with the results of the simulation study.

Table 3. Key results from GLM and MVMM methods for depressive symptoms.

Estimate | Lack of well-being | Depressed affect | Somatic symptoms

GLM independent samples t tests
Memory group M | 2.29 (n = 14) | 2.50 (n = 14) | 2.93 (n = 14)
Health group M | 3.67 (n = 13) | 3.86 (n = 13) | 6.41 (n = 11)
Difference (SE) | −1.38 (1.01) | −1.36 (1.52) | −3.48* (1.34)

MVMM t tests with KR correction (N = 27)
Memory group M | 2.29 | 2.50 | 2.93
Health group M | 3.67 | 3.86 | 6.01
Difference (SE) | −1.38 (1.01) | −1.36 (1.52) | −3.08 (1.32)

* p < 0.05 using the Bonferroni adjustment.

Discussion

The goal of this study was to compare the performance of the commonly used GLM independent samples t tests implemented with OLS to the much less frequently used MVMM procedures in the context of a conventional two-group intervention study where the interest is in estimating and testing group mean differences for each of a set of normally distributed outcomes. We tested the analysis methods using small samples, as these are often present in applied studies and because tests of group mean differences with the MVMM have previously been shown to exhibit inflated Type I error rates (Park et al., 2015). Incomplete response data were also included because of their prevalence in applied research, including the set of 64 interventions we examined. An extensive simulation study was conducted to assess the performance of the GLM and MVMM approaches across a variety of realistic conditions, including sample size, effect size, residual outcome correlation, and missingness. Balanced and unbalanced group sizes were present in the simulation.

How well did the analysis methods perform under the taxing sample size and missingness conditions? First, the performance of the GLM procedure depended on the type of missingness that was present, highlighting the importance of context in the performance and selection of an analysis method. On the positive side, when response data were complete or MCAR, GLM estimates were always unbiased, Type I error rates were accurate, and power, though lower than ML, was comparable to that provided by the other methods. This suggests that pairwise deletion, often regarded as a poor missing data treatment, can function well in this specific context. Note also that although other studies have found the use of MVMM tends to provide more power than GLM approaches for complete and MCAR data (Ashbeck & Bell, 2016; Park et al., 2015), the MVMM power advantage for these situations appears to be due to the use of conventional testing procedures that do not involve the KR correction. When KR procedures are compared to GLM, minimal differences in performance between these two methods were observed in our simulation study for MCAR data, which is also evident in the work by Sullivan et al. (2018).

In contrast, when missingness on one outcome was related to another outcome (i.e., data MAR), GLM provided noticeably biased estimates of treatment effects and residual variances, particularly so when these outcomes were highly correlated. In addition, the independent samples t tests produced inflated Type I error rates for tests of group mean differences – especially for 30% missingness – with this problem worsening as N increased. Similar performance was observed by Sullivan et al. (2018). As such, in the presence of MAR missingness, the GLM exhibited the poorest performance. This poor performance was due to use of a univariate analysis model that excluded a variable responsible for missingness (e.g., y2) when group mean differences were estimated for an incomplete outcome (e.g., y1). This can be considered a type of model misspecification that is known to produce biased parameter estimates and inflated Type I error rates, with such misspecification causing problems even for the relatively simple analysis of multivariate group mean differences, as evidenced in this study. Note that use of OLS MANOVA would have made matters worse for two reasons. First, when group mean differences are estimated and tested for individual outcomes (e.g., y1 or y2), standard MANOVA uses strictly univariate procedures. Thus, the outcome responsible for missingness would have been excluded here as well. Second, OLS MANOVA uses listwise deletion. As such, the performance problems observed in this study for y1 would also have been present for y2 with the use of MANOVA, particularly when the outcomes were highly correlated.

Second, the performance of the MVMM implemented with full maximum likelihood estimation and the Wald z test was generally unacceptable. Although the ML method always produced unbiased estimates of group mean differences, negatively biased estimates of residual variances were observed for the smallest sample sizes (most notably N = 20), with such estimates expected due to the use of full maximum likelihood estimation. Further, Type I error rates associated with the z tests of multivariate and univariate group mean differences were almost always inflated. Thus, although the ML method offered greater power than the other methods, this power advantage came at the cost of inflated Type I error rates for virtually all conditions included in this study.

For the MVMM implemented with the between-subjects degrees of freedom (i.e., with no KR correction), statistical performance depended particularly on missingness and was worse for smaller sample sizes. First, as expected, this REML procedure produced unbiased estimates of group mean differences for all conditions, and, unlike the ML procedure, produced unbiased estimates of residual variances. Unfortunately, as shown by Kackar and Harville (1984), the standard errors for the fixed effects – or group mean differences – are negatively biased even when these unbiased parameter estimates are obtained. In our simulation study, this problem resulted in inflated Type I error rates that occurred particularly when data were missing. To illustrate, for N of 20 and N of 40, when data were complete, the REML method – using the uncorrected standard errors and the default degrees of freedom – produced inflated univariate Type I error rates for 6% of the study conditions for each of these sample sizes. However, for N of 20, the percentage of conditions with inflated univariate Type I error rates increased to 47% with 15% missingness (or N = 17) and 92% for 30% missingness (or N = 14). For N of 40, the percentage of conditions with inflated Type I error rates was 14% when missingness was 15% (or N = 34) and 67% when missingness was 30% (or N = 28). Thus, it is not sample size per se which causes particular problems for the non-KR corrected REML, but missingness (both MCAR and MAR) combined with smaller sample sizes that result in a much greater frequency (or greater likelihood) of inflated Type I error rates. Note that this same pattern also occurred for N of 60, where this REML procedure produced accurate Type I error rates when data were complete and for 15% missingness (or N = 51), but produced elevated Type I error rates in 19% of conditions for 30% missingness (or N = 42). Thus, if response data are missing and sample size is relatively small, the uncorrected REML procedure cannot be recommended given the importance of accurate Type I error rates.

The inflated Type I error rates for the REML procedure were remedied by use of the KR correction. This procedure produced the most accurate Type I error rates, provided unbiased estimates of treatment effects and residual variances, and exhibited power that equaled or approximated all other procedures, except for ML, which had inflated Type I error rates. The KR procedure also performed well for complete, MCAR, and MAR data as well as for the various small sample size conditions included in the study. As such, the KR procedure is the best performing method of those we studied.

We should note several limitations of this study. First, we examined the performance of the methods only for normally distributed outcomes. As such, we cannot be sure that the results obtained here apply to binary or categorical outcomes. Second, the data were generated with all multivariate assumptions perfectly satisfied in the population. However, Arnau, Bendayan, Blanca, and Bono (2014) and Kowalchuk et al. (2004), who examined the performance of the KR correction under a variety of non-normality and other conditions, found that the KR correction performs well in such conditions, although Arnau et al. note that a sample size of at least 45 is recommended if the groups have different distributions, provided that excessive skew or kurtosis is not present. Third, the simulation study included only relatively simple models where group mean differences on a set of outcomes are the focus, which is often the primary interest in intervention studies, and the study did not include any interactions. However, Bell et al. (2014), for a multilevel design with small sample sizes and a univariate outcome, reported good performance for KR-corrected tests of main effects and interactions. We also limited the number of dependent variables to three. This small number seems reasonable given the small group sizes (as low as 7 per group), but we cannot be sure that the KR-corrected t tests will perform as well for a larger number of outcome variables under such small sample sizes, although we have no reason to suspect that its performance would be poor. Obviously, when the number of outcomes becomes too large relative to sample size, estimation failures will occur. In such cases, where only a relatively small number of outcomes can be included in a given MVMM, our simulation results suggest that complete outcomes that are highly correlated with incomplete outcomes should be included in the model to minimize or avoid estimation bias and poor Type I error control.

Two other important issues should be mentioned. First, we did not include any conditions where data were missing not at random (MNAR). With MNAR data, the probability of missingness depends on unobserved variables. Instead, we focused on MCAR and MAR data because these conditions, MAR in particular, are commonly assumed for the primary analysis of intervention data, especially for longitudinal designs that also have correlated outcomes. Second, the outcome variables included in the simulation study were not considered to represent items that form a scale. For MNAR data and for incomplete responses to multiple items that make up a scale, multiple imputation offers more flexibility than maximum likelihood treatment of incomplete data and may be preferred, or needed, in these contexts (Mazza, Enders, & Ruehlman, 2015; O'Kelly & Ratitch, 2014). However, treating incomplete data with multiple imputation is more complicated than applying maximum likelihood missing data treatment. Further, in the context of a randomized controlled trial, Sullivan et al. (2018) found that multiple imputation did not generally outperform other methods and can provide biased estimates of treatment effects, particularly when an interaction involving the treatment group variable is excluded from the imputation model. In contrast, maximum likelihood estimation does not involve a separate imputation model, so this type of misspecification cannot occur.
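As an illustration of one common safeguard against this imputation-model misspecification, an analyst can impute separately within each treatment arm, which preserves all interactions involving the group variable. The following is a minimal SAS sketch under assumed data set and variable names (wide, group, y1-y3); it is not an approach evaluated in this study.

proc sort data=wide;
  by group;
run;

/* Imputing within each arm allows the means, variances, and
   covariances of the outcomes to differ by treatment group */
proc mi data=wide out=imputed nimpute=20 seed=123;
  by group;
  var y1 y2 y3;
run;

/* The analysis model would then be fit to each imputed data
   set (by _Imputation_) and pooled with PROC MIANALYZE */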

In conclusion, this study supports the use of MVMMs with the Kenward–Roger correction for intervention studies in which estimating and testing mean differences between two independent groups on a set of multivariate normally distributed outcomes is the primary research goal. Although use of the MVMM with the KR correction is often recommended for repeated measures designs, a series of univariate analyses (i.e., separate t tests or ANOVAs) is far more commonly recommended, and used in practice as our review showed, for the analysis of multivariate outcomes arising in designs with, for example, a single post-intervention measurement occasion. Although our simulation study supports the use of standard independent samples t tests for multivariate outcomes with two groups when data are complete or MCAR, the MVMM (or multivariate linear mixed model) with the KR correction performs comparably under those conditions and performs much better for MAR data when an incomplete outcome is correlated with another, complete outcome. In the presence of such MAR data, MVMMs with the KR correction provide what traditional GLMs generally could not: unbiased estimates of treatment effects and accurate Type I error rates, even for very small sample sizes. Given that Pituch, Whittaker, and Chang (2016) found that applied intervention studies nearly always collect data on multiple outcomes, researchers should consider using the superior performing MVMMs with the KR correction for testing and estimating group mean differences. For those interested in learning more about MVMMs or linear mixed models, several accessible textbook treatments are available (e.g., Brown & Prescott, 2015; Pituch & Stevens, 2016; West et al., 2015).

Supplementary Material


Acknowledgments:

The ideas and opinions expressed herein are those of the authors alone, and endorsement by the authors’ institutions or the NIA is not intended and should not be inferred.

Funding

This work was supported by Grant R01 AG 15384 from the National Institutes of Health, National Institute on Aging.

Role of the Funders/Sponsors: None of the funders or sponsors of this research had any role in the design and conduct of the study; collection, management, analysis, and interpretation of data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication.

Appendix

Mixed modeling default methods in SAS (Version 9.4) and SPSS software (Version 25) for calculating the denominator degrees of freedom for fixed effect tests

Residual covariance matrix (R) specification        SAS                 SPSS
Random                                              Containment         Satterthwaite
Repeated (with no random statement)                 Between-within      Satterthwaite
Repeated with UN (and no random statement)          Between-subjects    Satterthwaite

Notes. UN is the unstructured residual covariance matrix specification, which was used in the simulation study and illustration.
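
The REPEATED-with-UN specification in the last row, which was used in this study, requires the data in long format, with one record per participant-outcome combination. Below is a minimal restructuring sketch, again under assumed data set and variable names (wide, id, group, y1-y3).

data long;
  set wide;                /* one input row per participant */
  array ys{3} y1-y3;       /* the three outcome variables */
  do outcome = 1 to 3;
    y = ys{outcome};       /* records with missing y are dropped */
    output;                /* by PROC MIXED, while the person's  */
  end;                     /* observed outcomes are retained     */
  keep id group outcome y;
run;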

Footnotes

Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/hmbr.

Supplemental data for this article can be accessed at the publisher's website.

Conflict of interest disclosures:

Each author signed a form for disclosure of potential conflicts of interest. No authors reported any financial or other conflicts of interest in relation to the work described.

Ethical Principles:

The authors affirm having followed professional ethical guidelines in preparing this work. These guidelines include obtaining informed consent from human participants, maintaining ethical treatment and respect for the rights of human or animal participants, and ensuring the privacy of participants and their data, such as ensuring that individual participants cannot be identified in reported results or from publicly available original or archival data.

References

1. Arnau J, Bendayan R, Blanca MJ, & Bono R (2014). Should we rely on the Kenward–Roger approximation when using linear mixed models if the groups have different distributions? British Journal of Mathematical and Statistical Psychology, 67(3), 408–429. doi: 10.1111/bmsp.12026
2. Ashbeck EL, & Bell ME (2016). Single time point comparisons in longitudinal randomized controlled trials: Power and bias in the presence of missing data. BMC Medical Research Methodology, 16(1), 43. doi: 10.1186/s12874-016-0144-0
3. Baldwin SA, Imel ZE, Braithwaite SR, & Atkins DC (2014). Analyzing multiple outcomes in clinical research using multivariate multilevel models. Journal of Consulting and Clinical Psychology, 82(5), 920–930. doi: 10.1037/a0035628
4. Bandura A (1997). Self-efficacy: The exercise of control. New York, NY: W. H. Freeman and Company.
5. Bell BA, Morgan GB, Schoeneberger JA, Kromrey JD, & Ferron JM (2014). How low can you go? An investigation of the influence of sample size and model complexity on point and interval estimates in two-level linear models. Methodology, 10(1), 1–11. doi: 10.1027/1614-2241/a000062
6. Bell ML, Fiero F, Horton NJ, & Hsu CH (2014). Handling missing data in RCTs; a review of the top medical journals. BMC Medical Research Methodology, 14(1), 118. doi: 10.1186/1471-2288-14-118
7. Bradley JV (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31(2), 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x
8. Brown H, & Prescott R (2015). Applied mixed models in medicine (3rd ed.). Chichester, UK: Wiley.
9. Chang W, & Pituch KA (2019). The performance of multilevel models when outcome data are incomplete. The Journal of Experimental Education, 87(1), 1–16. doi: 10.1080/00220973.2017.1377676
10. Charlton C, Rasbash J, Browne WJ, Healy M, & Cameron B (2019). MLwiN (Version 3.03) [Computer software]. Centre for Multilevel Modelling, University of Bristol, United Kingdom. http://www.bristol.ac.uk/cmm/software/mlwin/download/manuals.html
11. Collins LM, Schafer JL, & Kam C (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6(4), 330–351. doi: 10.1037//1082-989X.6.4.330
12. Enders CK (2001). The performance of the full information maximum likelihood estimator in multiple regression models with missing data. Educational and Psychological Measurement, 61(5), 713–740. doi: 10.1177/0013164401615001
13. Enders CK (2003). Performing multivariate group comparisons following a statistically significant MANOVA. Measurement and Evaluation in Counseling and Development, 36(1), 40. doi: 10.1080/07481756.2003.12069079
14. Ferron JM, Bell BA, Hess MR, Rendina-Gobioff G, & Hibbard ST (2009). Making treatment effect inferences from multiple-baseline data: The utility of multilevel modeling approaches. Behavior Research Methods, 41(2), 372–384. doi: 10.3758/BRM.41.2.372
15. Frane AV (2015). Power and type I error control for univariate comparisons in multivariate two-group designs. Multivariate Behavioral Research, 50(2), 233–247. doi: 10.1080/00273171.2014.968836
16. Gosho M, & Maruo K (2018). Effect of heteroscedasticity between treatment groups on mixed-effects models for repeated measures. Pharmaceutical Statistics, 17(5), 578–592. doi: 10.1002/pst.1872
17. Hertzog C, Van Alstine J, Usala PD, & Hultsch DF (1990). Measurement properties of the Center for Epidemiological Studies Depression Scale (CES-D) in older populations. Psychological Assessment, 2(1), 64–72. doi: 10.1037/1040-3590.2.1.64
18. Huberty CJ, & Petoskey MD (2000). Multivariate analysis of variance and covariance. In Tinsley HEA & Brown SD (Eds.), Handbook of applied multivariate statistics and mathematical modeling (pp. 183–208). San Diego, CA: Academic Press.
19. Kackar RN, & Harville DA (1984). Approximations for standard errors of fixed and random effects in mixed linear models. Journal of the American Statistical Association, 79, 853–862. doi: 10.2307/2288715
20. Kenward MG, & Roger JH (1997). Small sample inference for fixed effects from restricted maximum likelihood. Biometrics, 53(3), 983–997. doi: 10.2307/2533558
21. Kowalchuk RK, Keselman HJ, Algina J, & Wolfinger RD (2004). The analysis of repeated measurements with mixed model adjusted F tests. Educational and Psychological Measurement, 64(2), 224–242. doi: 10.1177/0013164403260196
22. Mahalanobis PC (1936). On the generalized distance in statistics. Proceedings of the National Institute of Science, Calcutta, 2, 49–55.
23. Mazza GL, Enders CK, & Ruehlman LS (2015). Addressing item-level missing data: A comparison of proration and full information maximum likelihood estimation. Multivariate Behavioral Research, 50(5), 504–519. doi: 10.1080/00273171.2015.1068157
24. McDougall GJ, Becker H, Acee TW, Vaughan PW, Pituch K, & Delville C (2010). Health-training intervention for community elderly. Archives of Psychiatric Nursing, 24(2), 125–136. doi: 10.1016/j.apnu.2009.06.003
25. McDougall GJ, Becker H, Pituch K, Acee TW, Vaughan PW, & Delville C (2010). The SeniorWISE study: Improving everyday memory in older adults. Archives of Psychiatric Nursing, 24(5), 291–306. doi: 10.1016/j.apnu.2009.11.001
26. McNeish D (2017). Small sample methods for multilevel modeling: A colloquial elucidation of REML and the Kenward–Roger correction. Multivariate Behavioral Research, 52(5), 661–670. doi: 10.1080/00273171.2017.1344538
27. McNeish D, & Matta T (2018). Differentiating between mixed-effects and latent-curve approaches to growth modeling. Behavior Research Methods, 50(4), 1398–1414. doi: 10.3758/s13428-017-0976-5
28. McNeish D, & Stapleton LM (2016). The effect of small sample size on two-level model estimates: A review and illustration. Educational Psychology Review, 28(2), 295–314. doi: 10.1007/s10648-014-9287-x
29. Muthén LK, & Muthén BO (2002). How to use a Monte Carlo study to decide on sample size and power. Structural Equation Modeling: A Multidisciplinary Journal, 9(4), 599–620. doi: 10.1207/S15328007SEM0904_8
30. O'Kelly M, & Ratitch B (2014). Clinical trials with missing data: A guide for practitioners. Chichester, UK: Wiley.
31. Park R, Pituch KA, Kim J, Chung H, & Dodd BG (2015). Comparing the performance of multivariate multilevel modeling to traditional analyses with complete and incomplete data. Methodology, 11(3), 100–109. doi: 10.1027/1614-2241/a000096
32. Pituch KA, & Stevens JP (2016). Applied multivariate statistics for the social sciences: Analyses with SAS and IBM's SPSS (6th ed.). New York, NY: Routledge.
33. Pituch KA, Whittaker TA, & Chang W (2016). Multivariate models for normal and binary responses in intervention studies. American Journal of Evaluation, 37(2), 270–286. doi: 10.1177/1098214015626297
34. Raudenbush SW, & Bryk AS (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage Publications, Inc.
35. Rencher AC, & Christensen WF (2012). Methods of multivariate analysis (3rd ed.). Hoboken, NJ: Wiley.
36. Rubin DB (1976). Inference and missing data. Biometrika, 63(3), 581–592. doi: 10.2307/2335739
37. SAS Institute Inc. (2014). SAS (Version 9.4) [Computer software]. Cary, NC: SAS Institute Inc.
38. Satterthwaite FE (1946). An approximate distribution of estimates of variance components. Biometrics Bulletin, 2(6), 110–114. doi: 10.2307/3002019
39. Schaalje GB, McBride JB, & Fellingham GW (2002). Adequacy of approximations to distributions of test statistics in complex mixed linear models. Journal of Agricultural, Biological, and Environmental Statistics, 7(4), 512–524. doi: 10.1198/108571102726
40. Schafer JL, & Graham JW (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177. doi: 10.1037//1082-989X.7.2.147
41. Spilke J, Piepho HP, & Hu X (2005). A simulation study on tests of hypotheses and confidence intervals for fixed effects in mixed models for blocked experiments with missing data. Journal of Agricultural, Biological, and Environmental Statistics, 10(3), 374–389. doi: 10.1198/108571105X58199
42. Stevens JP (1980). Power of the multivariate analysis of variance tests. Psychological Bulletin, 88(3), 728–737. doi: 10.1037/0033-2909.88.3.728
43. Sullivan TR, White IR, Salter AB, Ryan P, & Lee KJ (2018). Should multiple imputation be the method of choice for handling missing data in randomized trials? Statistical Methods in Medical Research, 27(9), 2610–2626. doi: 10.1177/0962280216683570
44. Vallejo G, & Livacic-Rojas P (2005). Comparison of two procedures for analyzing small sets of repeated measures data. Multivariate Behavioral Research, 40(2), 179–205. doi: 10.1207/s15327906mbr4002_2
45. Vickerstaff V, Ambler G, King M, Nazareth I, & Omar RZ (2015). Are multiple primary outcomes analysed appropriately in randomised controlled trials? A review. Contemporary Clinical Trials, 45, 8–12. doi: 10.1016/j.cct.2015.07.016
46. West BT, Welch KB, & Galecki AT (2015). Linear mixed models: A practical guide using statistical software (2nd ed.). Boca Raton, FL: CRC Press.
