Abstract
Standardization procedures are commonly used to combine phenotype data that were measured using different instruments, but there is little information on how the choice of standardization method influences pooled estimates and heterogeneity. Heterogeneity is of key importance in meta-analyses of observational studies because it affects the statistical models used and the decision of whether or not it is appropriate to calculate a pooled estimate of effect. Using 2-stage individual participant data analyses, we compared 2 common methods of standardization, T-scores and category-centered scores, to create combinable memory scores using cross-sectional data from 3 Canadian population-based studies (the Canadian Study of Health and Aging (1991–1992), the Canadian Community Health Survey on Healthy Aging (2008–2009), and the Quebec Longitudinal Study on Nutrition and Aging (2004–2005)). A simulation was then conducted to assess the influence of varying the following items across population-based studies: 1) effect size, 2) distribution of confounders, and 3) the relationship between confounders and the outcome. We found that pooled estimates based on the unadjusted category-centered scores tended to be larger than those based on the T-scores, although the differences were negligible when adjusted scores were used, and that most individual participant data meta-analyses identified significant heterogeneity. The results of the simulation suggested that in terms of heterogeneity, the method of standardization played a smaller role than did different effect sizes across populations and differential confounding of the outcome measure across studies. Although there was general consistency between the 2 types of standardization methods, the simulations identified a number of sources of heterogeneity, some of which are not the usual sources considered by researchers.
Keywords: cognition, harmonization, individual participant data, meta-analysis, standardization
To explore many important scientific questions (e.g., understanding the influence of lifestyle, psychological, social, nutritional, or genetic factors on disease or phenotypic outcomes), researchers need to link both genotype and phenotype data. Currently, investigators for large national and international cohorts, such as those from the Canadian Longitudinal Study on Aging (CLSA) (1), UK Biobank (2), and LifeLines (3) from the Netherlands, are collecting a wide range of information, including biological, social, psychological, lifestyle, and health status data, on hundreds of thousands of participants. Although many of these individual cohorts and data sources are large, multiple data sets are sometimes required when studying rare outcomes or gene-environment interactions or when exploring the influence of geographical and cultural variations in exposure-outcome relationships. To maximize the utility of publicly funded projects and increase the speed of scientific discovery, there has been a worldwide push to combine multiple data sources in order to explore important research questions (4).
The current gold standard for analyzing multiple data sources as part of a systematic review is individual participant data (IPD) meta-analysis, because it provides flexibility with regard to the types of analyses that can be done and thus provides reliable results (5). It also increases the power to explore differential treatment effects in randomized controlled trials and allows for adjustments of confounding factors in meta-analyses of observational studies; however, it is time consuming and costly to conduct (6). Combining IPD is also scientifically and technically very challenging. Ensuring data compatibility and content equivalence through harmonization allows integration of information from different studies/databases and can thereby permit pooling of data from a large number of studies to obtain valid results. It also allows one to properly explore the similarities and discrepancies across studies, jurisdictions, or countries and improve the validity and reliability of research results.
Deriving combinable phenotypic variables using algorithmic methods, for example, creating common categories of weight from continuous data, is fairly straightforward. There is less research, however, on how best to harmonize complex constructs such as cognition measures. We conducted an environmental scan of meta-analyses in the area of cognition to explore the current practices of harmonization and found that in most aggregate data meta-analyses, researchers used standardization to combine cognitive measures across studies (7). Standardization methods are often utilized because they are easily implemented and do not require complex modeling, such as latent variable analysis.
Although many studies used these methods, we could not find general guidelines for the selection of which specific standardization method to use or find information on their performances in IPD meta-analyses. We therefore undertook a case study and simulation to explore the influence of commonly used standardization methods on harmonization of cognition measures in IPD analyses. We chose to focus our analysis on the relationship between physical activity and memory because of known associations (8, 9), and we used data from 3 large Canadian studies to examine how standardization methods influence the overall estimates of effect and measures of heterogeneity in a 2-stage IPD meta-analysis. We further explored the robustness of our results using a simulation study. In the present study, we provide evidence of the influence of using easily implemented procedures for harmonization of complex constructs.
METHODS
We included individual-level data from the following 3 Canadian studies: the Canadian Study of Health and Aging (CSHA) (10), the Canadian Community Health Survey on Healthy Aging (CCHS) (11), and the Quebec Longitudinal Study on Nutrition and Aging (NuAge) (12) (Web Table 1, available at http://aje.oxfordjournals.org/). Each study provided population-based data on adults who were 65 years of age or older, including results from neuropsychological tests and physical activity level.
The Rey Auditory Verbal Learning Test (13), a 15-item word-learning test, was used to measure short-term memory in the CSHA and the CCHS. The test is one of the most widely used neuropsychological tests (14) and generally has good test-retest reliability (0.51 ≤ r ≤ 0.86) (15). The Buschke Cued Recall Procedure tests memory under conditions of free recall (hereafter referred to as the Free Buschke test) and cued recall (hereafter referred to as the Total Buschke test). The CSHA used English and French versions of the 12-item Buschke memory test (16), and NuAge used a French version of the 16-item Free and Cued Selective Reminding Test adapted from Grober and Buschke (17). Free recall and cued recall have acceptable sensitivity (62%–100%) and specificity (94%–100%) when comparing individuals with Alzheimer disease to healthy controls (18). We also used the Health Utility Index as an indirect measure of memory. The Health Utility Index has been used in many settings and has been shown to have strong validity and reliability (19, 20). Furthermore, the cognition subscale of the Health Utility Index has been shown to be correlated with the Rey Auditory Verbal Learning Test and other neuropsychological tests (21).
Potential confounding variables were selected from each of the 3 data sets based on the demonstrated relationship with cognition and physical activity in the literature (22) and were endorsed by a technical expert panel (23). These variables included sociodemographic and lifestyle factors (age, sex, educational level, income, country of birth, smoking status, and alcohol consumption) and anthropometric and health conditions (height, weight, body mass index, hip circumference, heart rate, diastolic and systolic blood pressures, and self-reported diagnosis of high blood pressure, stroke, diabetes, or myocardial infarction and family history of high blood pressure, stroke, diabetes, or myocardial infarction).
An algorithmic approach using DataSchema and Harmonization Platform for Epidemiological Research was used to harmonize physical activity and potential confounding variables (24, 25). In short, a priori rules were used to determine whether the information collected in a given study could be used to generate a variable that would be common among all data sets. Selection and definition of variables, rule creation, and decisions about whether or not a variable could be harmonized were based on protocols involving iteration between domain experts and a validation panel. The compatibility of each study's data was assessed on a 3-level scale of matching quality: complete, partial, or impossible match. Variables that were a complete or partial match in all 3 data sets were included. There were complete or partial matches across all studies for 14 targeted variables: physical activity, age, sex, income, educational level, country of birth, height, weight, body mass index, alcohol consumption, diabetes, high blood pressure, and myocardial infarction. The remaining variables could not be included because they were not recorded across all studies or because they did not represent the same information.
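The inclusion rule described above (retain a variable only if it is a complete or partial match in all 3 data sets) can be sketched in a few lines of Python. This is an illustration of the rule only, not the DataSchema platform itself; the `Match` enum, the `harmonizable` function, and the example ratings are hypothetical names and values introduced here.

```python
from enum import Enum


class Match(Enum):
    """Three-level matching quality used when assessing each study's data."""
    COMPLETE = "complete"
    PARTIAL = "partial"
    IMPOSSIBLE = "impossible"


def harmonizable(ratings, studies):
    """Return variables rated complete or partial in every listed study.

    A variable with no rating for a study (not recorded there) is treated
    as an impossible match.
    """
    selected = []
    for variable, by_study in ratings.items():
        levels = [by_study.get(study, Match.IMPOSSIBLE) for study in studies]
        if all(level is not Match.IMPOSSIBLE for level in levels):
            selected.append(variable)
    return selected


STUDIES = ("CCHS", "CSHA", "NuAge")

# Hypothetical ratings, for illustration only.
example_ratings = {
    "age": {"CCHS": Match.COMPLETE, "CSHA": Match.COMPLETE, "NuAge": Match.COMPLETE},
    "diabetes": {"CCHS": Match.COMPLETE, "CSHA": Match.COMPLETE, "NuAge": Match.PARTIAL},
    "hip_circumference": {"CCHS": Match.IMPOSSIBLE, "CSHA": Match.COMPLETE},
}

selected = harmonizable(example_ratings, STUDIES)
print(selected)
```

In this sketch a single impossible or missing rating excludes the variable from the pooled data set, which mirrors the all-studies requirement stated in the text.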
Statistical analyses
We studied 2 commonly used standardization methods, T-scores and category-centered scores, to create combinable memory scores across studies in order to examine whether or not these approaches provided similar results in terms of overall effect estimates and measures of heterogeneity in a 2-stage IPD meta-analysis. T-scores are dependent on the full underlying distribution of cognitive measures in each study and have been used to create norms and compare different cognitive measures on a common scale (26). Category-centered scores use the mean and standard deviation for a common demographically determined group (within studies) that is presumed to be homogeneous with respect to the cognitive measures to standardize or “center” the individual cognitive measures. More details about the standardization methods are provided in Web Appendix 1. We applied the scores to our case study and also separately undertook a simulation study to examine the robustness of our case study findings.
Case study
T-scores were standardized with respect to selected covariates (age, sex, and educational level) using linear regression analysis. Category-centered scores were standardized with respect to a homogeneous subgroup with a sufficient sample size across all studies. We standardized to the subgroup of female participants with a high educational level and an age range of 70–74 years (Web Appendix 1).
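A minimal sketch of the 2 standardization methods follows. The exact formulas are given in Web Appendix 1 and are not reproduced here, so the code assumes one common formulation: a regression-based T-score (50 plus 10 times the standardized residual from a linear regression on age, sex, and educational level) and a category-centered score (the raw score centered and scaled by the mean and standard deviation of the reference subgroup). The function names and the simulated data are hypothetical.

```python
import numpy as np


def t_scores(score, covariates):
    """Regression-based T-score: 50 + 10 * standardized residual (assumed form)."""
    X = np.column_stack([np.ones(len(score)), covariates])
    beta, *_ = np.linalg.lstsq(X, score, rcond=None)
    residual = score - X @ beta
    # Residual SD with degrees of freedom reduced by the number of coefficients.
    z = residual / residual.std(ddof=X.shape[1])
    return 50.0 + 10.0 * z


def category_centered(score, in_reference):
    """Center and scale all scores by the reference subgroup's mean and SD."""
    reference = score[in_reference]
    return (score - reference.mean()) / reference.std(ddof=1)


# Simulated data for a single study (all values hypothetical).
rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(65, 90, n)
female = rng.binomial(1, 0.5, n)
high_edu = rng.binomial(1, 0.4, n)
score = 10 - 0.1 * (age - 65) + 0.5 * female + 1.0 * high_edu + rng.normal(0, 2, n)

t = t_scores(score, np.column_stack([age, female, high_edu]))

# Reference subgroup: women aged 70-74 years with a high educational level.
reference = (female == 1) & (high_edu == 1) & (age >= 70) & (age < 75)
c = category_centered(score, reference)
```

Under these assumptions the T-score removes the covariate effects within each study, whereas the category-centered score only rescales the raw score, so differences in age, sex, and education outside the reference subgroup remain in the standardized values.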
We conducted a 2-stage IPD meta-analysis in which summary estimates were first created for each study and then combined across studies using traditional aggregate data meta-analysis methods (27). We restricted the meta-analyses to memory constructs that are presumed to be most similar. The Rey Auditory Verbal Learning Test and Free Buschke test both measure noncued recall, whereas the Total Buschke test is a cued test and is more similar to the Health Utility Index because both are susceptible to ceiling effects (28). Separate meta-analyses were conducted for each combination of compatible memory scores. Thus, we used only the following combinations of scores from the CCHS, CSHA, and NuAge studies, respectively: Rey, Rey, and Free Buschke; Rey, Free Buschke, and Free Buschke; and Health Utility Index, Total Buschke, and Total Buschke.
For each of the 3 combinations, we calculated 3 sets of effect sizes: unadjusted; unadjusted but restricted to participants with complete data for all potential confounders; and adjusted. We computed Hedges’ g from the weighted mean differences in T-scores and category-centered scores between participants reporting no or low physical activity and those reporting moderate or high levels of physical activity (29). We applied the random effects model (30, 31) and assessed heterogeneity with the Q statistic (32) and the I2 statistic (33). An I2 greater than 50% was considered to indicate substantial heterogeneity (34). Meta-analyses were conducted using MetaAnalyst 3.0 (Tufts Evidence Based Practice Centers, Medford, Massachusetts) (35).
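The second stage of the analysis can be illustrated with a short Python sketch. It is not the MetaAnalyst implementation: it assumes the standard small-sample correction for Hedges’ g and the DerSimonian-Laird estimator for the between-study variance, which the text does not name explicitly. The function names and the per-study summary values are hypothetical.

```python
import math


def hedges_g(mean_1, sd_1, n_1, mean_0, sd_0, n_0):
    """Return Hedges' g and its variance for two independent groups."""
    df = n_1 + n_0 - 2
    pooled_sd = math.sqrt(((n_1 - 1) * sd_1**2 + (n_0 - 1) * sd_0**2) / df)
    d = (mean_1 - mean_0) / pooled_sd
    j = 1.0 - 3.0 / (4.0 * df - 1.0)  # small-sample correction factor
    var_d = (n_1 + n_0) / (n_1 * n_0) + d**2 / (2.0 * (n_1 + n_0))
    return j * d, j**2 * var_d


def random_effects(effects, variances):
    """DerSimonian-Laird pooling; returns (pooled, se, Q, I2 as a proportion)."""
    w = [1.0 / v for v in variances]
    fixed = sum(wi * gi for wi, gi in zip(w, effects)) / sum(w)
    q = sum(wi * (gi - fixed) ** 2 for wi, gi in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * gi for wi, gi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, se, q, i2


# Hypothetical per-study summaries: (mean, SD, n) for participants with
# moderate or high activity, then for participants with no or low activity.
studies = [
    ((50.8, 10.0, 5000), (49.0, 10.0, 1900)),
    ((50.5, 10.0, 490), (49.9, 10.0, 990)),
    ((50.2, 10.0, 340), (49.5, 10.0, 90)),
]

per_study = [hedges_g(*active, *inactive) for active, inactive in studies]
pooled, se, q, i2 = random_effects([g for g, _ in per_study],
                                   [v for _, v in per_study])
print(f"g = {pooled:.2f}, 95% CI: {pooled - 1.96 * se:.2f}, {pooled + 1.96 * se:.2f}, "
      f"Q = {q:.2f}, I2 = {i2:.2f}")
```

With only 3 studies, Q has 2 degrees of freedom, so both the heterogeneity test and the between-study variance estimate rest on very little information.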
Simulation study
The covariates age, sex, and educational level were generated independently of one another, and separately for each of 3 cohort studies, using normal and Bernoulli distributions. In this way, we could generate populations that were homogeneous or heterogeneous with respect to these 3 potential confounders. A 3-level ordinal variable for physical activity level was generated through the continuous logistic distribution in which the mean level depended on the 3 covariates age, sex, and educational level. The relationships between the confounders and physical activity level were kept the same across cohort studies (homogeneous associations). Memory scores were generated from latent variables, generated per cohort study, that represented the true memory ability of individuals. Conditional on the latent construct, we applied a binomial distribution to simulate a sum score on memory. The latent variable was affected by age, sex, educational level, and physical activity level. We simulated homogeneous or heterogeneous associations between the latent memory variable and the 3 potential confounders (confounder association with memory = homogeneous or heterogeneous), and we simulated homogeneous or heterogeneous associations between physical activity and memory across cohort studies (effect size of physical activity = homogeneous or heterogeneous). Details of the simulation study are provided in Web Appendix 1 and Web Tables 2–4. We then applied an analysis approach similar to that used in the case study. For each of the 8 possible scenarios, we generated an average effect size, the power, and the average I2. The simulation analyses were conducted using SAS, version 9.2 (SAS Institute, Inc., Cary, North Carolina) (36).
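The data-generating steps can be sketched as follows. The original simulation was run in SAS with parameter values given in Web Tables 2–4; the Python version below is only a structural illustration, and every numeric value, cut point, and name in it (`simulate_cohort`, the coefficients, the cohort size) is a hypothetical choice rather than a value from the study.

```python
import numpy as np


def simulate_cohort(rng, n, age_mean, p_female, p_high_edu,
                    confounder_betas, activity_beta, n_items=15):
    """Simulate one cohort: covariates, ordinal activity, binomial memory score."""
    # Covariates are drawn independently of one another.
    age = rng.normal(age_mean, 5.0, n)
    female = rng.binomial(1, p_female, n)
    high_edu = rng.binomial(1, p_high_edu, n)
    age_c = (age - 75.0) / 5.0

    # 3-level activity from a latent logistic variable whose location depends
    # on the covariates; these coefficients are the same for every cohort.
    latent_activity = (-0.3 * age_c - 0.2 * female + 0.4 * high_edu
                       + rng.logistic(0.0, 1.0, n))
    activity = np.digitize(latent_activity, [-1.0, 0.0])  # 0, 1, or 2

    # Latent memory ability depends on the confounders and on activity.
    b_age, b_female, b_edu = confounder_betas
    latent_memory = (b_age * age_c + b_female * female + b_edu * high_edu
                     + activity_beta * activity + rng.normal(0.0, 1.0, n))

    # Observed sum score: binomial, conditional on the latent ability.
    p_item = 1.0 / (1.0 + np.exp(-latent_memory))
    memory = rng.binomial(n_items, p_item)
    return {"age": age, "female": female, "high_edu": high_edu,
            "activity": activity, "memory": memory}


# One scenario: homogeneous populations and activity effect, but heterogeneous
# confounder associations with memory across the 3 cohorts.
rng = np.random.default_rng(2024)
confounder_betas_by_cohort = [(-0.2, 0.1, 0.3), (-0.5, 0.3, 0.6), (-0.8, 0.0, 0.1)]
cohorts = [simulate_cohort(rng, 1000, 75.0, 0.55, 0.4, betas, 0.3)
           for betas in confounder_betas_by_cohort]
```

Switching a factor between homogeneous and heterogeneous amounts to passing the same or different arguments to each cohort, which yields the 8 scenarios when the 3 factors are crossed.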
RESULTS
The CCHS participants (mean age, 73.2 years) and NuAge participants (73.7 years) were younger on average than the CSHA participants (79.7 years) (Table 1). The CSHA participants tended to have lower levels of education and income (adjusted to 1992) than did the CCHS and NuAge participants. In addition, fewer participants in CSHA reported being born in Canada. More CSHA participants reported a low level of physical activity compared with CCHS and NuAge participants. The CCHS participants reported high blood pressure and diabetes more often than did CSHA or NuAge participants.
Table 1.
Baseline Demographic and Health-Related Characteristics of Participants With Cognition Dataᵃ, Canadian Community Health Survey-Canadian Longitudinal Study on Aging (2008–2009), Canadian Study of Health and Aging (1991–1992), and Quebec Longitudinal Study on Nutrition and Aging (2004–2005)
| Characteristic | CCHS-CLSA (n = 7,107) | | | | CSHA (n = 1,730) | | | | NuAge (n = 432) | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | No. | % | Mean | SD | No. | % | Mean | SD | No. | % | Mean | SD |
| Age, years | | | 73.2 | 5.9 | | | 79.7 | 7.0 | | | 73.7 | 4.0 |
| Age group, years | | | | | | | | | | | | |
| 65–74 | 4,162 | 58.6 | | | 367 | 21.2 | | | 265 | 61.3 | | |
| 75–85 | 2,945 | 41.4 | | | 976 | 56.4 | | | 167 | 38.7 | | |
| >85 | | | | | 387 | 22.4 | | | | | | |
| Female sex | 4,103 | 57.7 | | | 1,084 | 62.7 | | | 232 | 53.7 | | |
| Highest level of education | | | | | | | | | | | | |
| Low (0–8 years) | 1,342 | 19.0 | | | 841 | 48.9 | | | 66 | 15.3 | | |
| Medium (9–12 years) | 2,664 | 37.7 | | | 619 | 35.8 | | | 171 | 39.6 | | |
| High (≥13 years) | 3,055 | 43.3 | | | 270 | 15.6 | | | 195 | 45.1 | | |
| Missing | 46 | | | | | | | | | | | |
| Household income, $ | | | | | | | | | | | | |
| <10,000 | 75 | 1.1 | | | 39 | 6.9 | | | 3 | 0.7 | | |
| 10,000–14,999 | 412 | 5.8 | | | 108 | 19.1 | | | 20 | 4.6 | | |
| 15,000–19,999 | 727 | 10.2 | | | 35 | 6.2 | | | 19 | 4.4 | | |
| 20,000–29,999 | 1,287 | 18.1 | | | 66 | 11.6 | | | 66 | 15.2 | | |
| 30,000–39,999 | 975 | 13.7 | | | 31 | 5.5 | | | 90 | 20.8 | | |
| 40,000–49,999 | 694 | 9.8 | | | 21 | 3.7 | | | 57 | 13.2 | | |
| 50,000–59,999 | 591 | 8.3 | | | 15 | 2.7 | | | 50 | 11.6 | | |
| 60,000–69,999 | 379 | 5.3 | | | 9 | 1.6 | | | 19 | 4.4 | | |
| ≥70,000 | 1,433 | 20.2 | | | 8 | 1.4 | | | 54 | 12.5 | | |
| Preferred not to answer | 981 | 13.8 | | | 81 | 14.3 | | | | | | |
| Do not know | | | | | 34 | 23.6 | | | | | | |
| Missing | | | | | 20 | 3.5 | | | 54 | | | |
| Not asked | | | | | 1,163 | | | | | | | |
| Canadian | 5,781 | 81.4 | | | 1,166 | 67.4 | | | 387 | 89.6 | | |
| Ever/current alcohol useᵇ | 6,550 | 92.2 | | | 344 | 22.8 | | | 411 | 95.1 | | |
| Level of physical activity | | | | | | | | | | | | |
| None | 1,161 | 16.3 | | | 788 | 45.5 | | | 46 | 10.6 | | |
| Low | 777 | 10.9 | | | 203 | 11.7 | | | 45 | 10.4 | | |
| Moderate to high | 5,169 | 72.7 | | | 493 | 28.5 | | | 341 | 78.9 | | |
| Missing | | | | | 246 | | | | | | | |
| Heightᵇ, cm | | | | | | | | | | | | |
| Male | 3,000 | | 174.7 | 7.2 | 607 | | 170.5 | 7.7 | 200 | | 168.5 | 7.4 |
| Female | 4,075 | | 160.4 | 6.4 | 993 | | 157.3 | 7.4 | 232 | | 155.4 | 5.7 |
| All | 7,075 | | 166.5 | 9.7 | 1,600 | | 162.3 | 9.9 | 432 | | 161.5 | 9.2 |
| Weightᵇ, kg | | | | | | | | | | | | |
| Male | 2,990 | | 82.8 | 14.3 | 622 | | 72.6 | 12.7 | 200 | | 80.0 | 12.9 |
| Female | 4,011 | | 68.6 | 13.8 | 1,032 | | 60.3 | 12.5 | 232 | | 66.4 | 12.8 |
| All | 7,001 | | 74.6 | 15.7 | 1,554 | | 64.9 | 13.9 | 400 | | 72.7 | 14.5 |
| Chronic conditionsᵇ | | | | | | | | | | | | |
| High blood pressure | 3,993 | 56.2 | | | 614 | 35.5 | | | 206ᶜ | 47.7 | | |
| Stroke | 283 | 4.0 | | | 226 | 13.5 | | | 0ᶜ | | | |
| Diabetes | 1,258 | 17.7 | | | 228 | 13.2 | | | 40ᶜ | 9.3 | | |
| Myocardial infarction | 876 | 12.4 | | | 263 | 15.2 | | | 57ᶜ | 13.4 | | |
Abbreviation: SD, standard deviation.
ᵃ Shown are potential confounding variables for each of the 3 data sets. The compatibility of each study's data was assessed on a 3-level scale of matching quality: complete, partial, or impossible match. Variables that were a “complete” or “partial” match in all 3 data sets were included.
ᵇ Totals may not sum to the full sample because of missing values. Percentages were calculated excluding missing data.
ᶜ Partial match.
Combined data set analysis: 2-stage IPD meta-analysis
Table 2 and Web Table 5 present the meta-analysis results for the combinations of cognitive measures. The overall estimated effect sizes were small, ranging from 0.07 to 0.18. None of the Health Utility Index/Total Buschke/Total Buschke summary estimates of association were statistically significant. The cognitive measure combination most likely to result in a statistically significant overall estimate was Rey/Rey/Free Buschke (4 of 6 comparisons). Significant heterogeneity was found in most analyses, and none of the analyses had an I2 value less than 50%. Six of the 18 analyses had a P value for heterogeneity greater than 0.05, and 1 had a P value greater than 0.10. The analyses with the least heterogeneity were also those for the Rey/Rey/Free Buschke combination (4 of 6 analyses with P > 0.05). Of the 6 analyses that did not indicate statistically significant heterogeneity at the P < 0.05 level, 5 used T-scores, as did the 1 analysis that did not indicate heterogeneity at the P < 0.10 level. In general, the results of the T-score and category-centered score analyses were more similar to each other when adjusted than when unadjusted.
Table 2.
Summary Hedges’ g Values for the Weighted Mean Difference of Combinations of Memory Tests in People Who Reported No or Low Physical Activity Compared With People Who Reported Moderate or High Levels of Physical Activityᵃ, Canadian Community Health Survey-Canadian Longitudinal Study on Aging (2008–2009), Canadian Study of Health and Aging (1991–1992), and Quebec Longitudinal Study on Nutrition and Aging (2004–2005)
| Study/Memory Test Given and Type of Outcome | Hedges’ g | 95% CI | I2 | Q Statistic for Heterogeneity | P for Heterogeneity |
|---|---|---|---|---|---|
| Unadjusted | |||||
| CCHS/RAVLT; CSHA/RAVLT; and NuAge/Free Buschke | |||||
| T-score | 0.12 | 0.01, 0.23 | 0.64 | 5.5 | 0.06 |
| Category-centered score | 0.16 | 0.01, 0.30 | 0.78 | 8.96 | 0.01 |
| CCHS/RAVLT; CSHA/Free Buschke; and NuAge/Free Buschke | |||||
| T-score | 0.14 | −0.03, 0.31 | 0.85 | 12.92 | 0.002 |
| Category-centered score | 0.18 | −0.01, 0.36 | 0.88 | 16.46 | <0.001 |
| CCHS/HUI; CSHA/Total Buschke; and NuAge/Total Buschke | |||||
| T-score | 0.07 | −0.04, 0.19 | 0.65 | 5.72 | 0.06 |
| Category-centered score | 0.10 | −0.05, 0.24 | 0.80 | 9.76 | 0.008 |
| Unadjusted Using Participants With Complete Data for All Potential Confounders | |||||
| CCHS/RAVLT; CSHA/RAVLT; and NuAge/Free Buschke | |||||
| T-score | 0.12 | 0.01, 0.22 | 0.55 | 4.47 | 0.11 |
| Category-centered score | 0.16 | 0.01, 0.30 | 0.75 | 7.84 | 0.02 |
| CCHS/RAVLT; CSHA/Free Buschke; and NuAge/Free Buschke | |||||
| T-score | 0.14 | −0.03, 0.32 | 0.82 | 11.35 | 0.003 |
| Category-centered score | 0.18 | 0.0001, 0.36 | 0.85 | 13.37 | 0.001 |
| CCHS/HUI; CSHA/Total Buschke; and NuAge/Total Buschke | |||||
| T-score | 0.07 | −0.05, 0.19 | 0.64 | 5.51 | 0.06 |
| Category-centered score | 0.10 | −0.05, 0.24 | 0.77 | 8.65 | 0.01 |
| Adjusted Effect Estimates | |||||
| CCHS/RAVLT; CSHA/RAVLT; and NuAge/Free Buschke | |||||
| T-score | 0.11 | −0.02, 0.23 | 0.66 | 5.81 | 0.06 |
| Category-centered score | 0.11 | −0.02, 0.23 | 0.66 | 5.81 | 0.06 |
| CCHS/RAVLT; CSHA/Free Buschke; and NuAge/Free Buschke | |||||
| T-score | 0.14 | −0.05, 0.32 | 0.85 | 13.38 | 0.001 |
| Category-centered score | 0.13 | −0.05, 0.31 | 0.85 | 12.95 | 0.002 |
| CCHS/HUI; CSHA/Total Buschke; and NuAge/Total Buschke | |||||
| T-score | 0.08 | −0.05, 0.21 | 0.69 | 6.39 | 0.04 |
| Category-centered score | 0.08 | −0.04, 0.20 | 0.67 | 6.10 | 0.047 |
Abbreviations: CCHS, Canadian Community Health Survey; CI, confidence interval; CSHA, Canadian Study of Health and Aging; Free Buschke, Buschke Cued Recall Procedure under conditions of free recall; HUI, Health Utility Index; NuAge, Quebec Longitudinal Study on Nutrition and Aging; RAVLT, Rey Auditory Verbal Learning Test; Total Buschke, Buschke Cued Recall Procedure under conditions of cued recall.
ᵃ Shown are results from separate meta-analyses for combinations of compatible memory tests for each study. CCHS included the RAVLT and HUI; CSHA included the RAVLT, Free Buschke, and Total Buschke; and NuAge included the Free Buschke and Total Buschke.
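The heterogeneity columns of Table 2 can be cross-checked from the reported Q statistics alone. The sketch below assumes the usual definition I2 = (Q − df)/Q, truncated at 0, and uses the fact that with 3 studies Q is referred to a chi-square distribution with 2 degrees of freedom, whose upper-tail probability is exp(−Q/2). The function names are hypothetical; the numbers are 4 rows taken from Table 2.

```python
import math


def i2_from_q(q, n_studies):
    """I2 as a proportion, from Cochran's Q and the number of studies."""
    df = n_studies - 1
    return max(0.0, (q - df) / q)


def p_from_q_three_studies(q):
    """Upper-tail P value for Q with 2 degrees of freedom (3 studies only)."""
    return math.exp(-q / 2.0)


# (Q, reported I2, reported P for heterogeneity) from 4 rows of Table 2.
reported = [
    (5.50, 0.64, 0.06),
    (8.96, 0.78, 0.01),
    (4.47, 0.55, 0.11),
    (6.10, 0.67, 0.047),
]

for q, i2_reported, p_reported in reported:
    i2 = i2_from_q(q, 3)
    p = p_from_q_three_studies(q)
    print(f"Q = {q:5.2f}: I2 = {i2:.2f} (reported {i2_reported}), "
          f"P = {p:.3f} (reported {p_reported})")
```

The first and third rows show how the 2 criteria can disagree: I2 exceeds 50% in both, yet neither reaches P < 0.05, because a Q statistic based on only 3 studies has little power.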
Simulation study
Adjustment for age, sex, and educational level increased the effect sizes on average by approximately 4%–13%, despite the fact that the category-centered scores and T-scores are essentially corrected for these 3 variables (Table 3). The increase was greater for the category-centered score than for the T-score; however, the 2 scores gave identical results when the effect sizes were adjusted for the covariates age, sex, and educational level.
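As a rough arithmetic check, the average increases quoted above can be recovered from the rounded effect sizes in Table 3. The sketch assumes that “on average” means the mean of the per-scenario relative increases across the 8 scenarios, which the text does not specify; the variable and function names are hypothetical.

```python
# Effect sizes from Table 3, in the order the 8 scenarios are listed.
unadjusted_t = [0.57, 0.62, 0.58, 0.62, 0.39, 0.46, 0.40, 0.47]
unadjusted_c = [0.53, 0.57, 0.54, 0.57, 0.36, 0.41, 0.37, 0.42]
adjusted = [0.59, 0.65, 0.60, 0.64, 0.41, 0.48, 0.41, 0.48]  # same for both scores


def mean_percent_increase(before, after):
    """Mean of the per-scenario relative increases, in percent."""
    increases = [100.0 * (a - b) / b for b, a in zip(before, after)]
    return sum(increases) / len(increases)


increase_t = mean_percent_increase(unadjusted_t, adjusted)
increase_c = mean_percent_increase(unadjusted_c, adjusted)
print(f"T-score: {increase_t:.1f}%, category-centered score: {increase_c:.1f}%")
```

Under this reading, the lower bound of the 4%–13% range corresponds to the T-score and the upper bound to the category-centered score.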
Table 3.
Summary Hedges’ g for the Weighted Mean Difference for Simulated Memory Tests in People Who Reported No or Low Physical Activity Compared With People Who Reported Moderate or High Levels of Physical Activityᵃ, Canadian Community Health Survey-Canadian Longitudinal Study on Aging (2008–2009), Canadian Study of Health and Aging (1991–1992), and Quebec Longitudinal Study on Nutrition and Aging (2004–2005)
| Effect of Physical Activity | Population | Confounder Effect on Memory | Type of Outcome | Effect Size | Power | Average I2 |
|---|---|---|---|---|---|---|
| Unadjusted | ||||||
| Homogeneous | Homogeneous | Homogeneous | T-score | 0.57 | 100 | 16.6 |
| Homogeneous | Homogeneous | Homogeneous | C-score | 0.53 | 100 | 17.1 |
| Homogeneous | Homogeneous | Heterogeneous | T-score | 0.62 | 100 | 87.6 |
| Homogeneous | Homogeneous | Heterogeneous | C-score | 0.57 | 100 | 82.7 |
| Homogeneous | Heterogeneous | Homogeneous | T-score | 0.58 | 100 | 27.7 |
| Homogeneous | Heterogeneous | Homogeneous | C-score | 0.54 | 100 | 30.0 |
| Homogeneous | Heterogeneous | Heterogeneous | T-score | 0.62 | 100 | 91.9 |
| Homogeneous | Heterogeneous | Heterogeneous | C-score | 0.57 | 100 | 89.8 |
| Heterogeneous | Homogeneous | Homogeneous | T-score | 0.39 | 73.2 | 95.4 |
| Heterogeneous | Homogeneous | Homogeneous | C-score | 0.36 | 56.2 | 95.6 |
| Heterogeneous | Homogeneous | Heterogeneous | T-score | 0.46 | 19.9 | 98.2 |
| Heterogeneous | Homogeneous | Heterogeneous | C-score | 0.41 | 13.5 | 98.0 |
| Heterogeneous | Heterogeneous | Homogeneous | T-score | 0.40 | 62.8 | 96.0 |
| Heterogeneous | Heterogeneous | Homogeneous | C-score | 0.37 | 46.9 | 96.0 |
| Heterogeneous | Heterogeneous | Heterogeneous | T-score | 0.47 | 10.2 | 98.4 |
| Heterogeneous | Heterogeneous | Heterogeneous | C-score | 0.42 | 7.0 | 98.3 |
| Adjusted | ||||||
| Homogeneous | Homogeneous | Homogeneous | T-score | 0.59 | 100 | 18.4 |
| Homogeneous | Homogeneous | Homogeneous | C-score | 0.59 | 100 | 18.4 |
| Homogeneous | Homogeneous | Heterogeneous | T-score | 0.65 | 100 | 88.5 |
| Homogeneous | Homogeneous | Heterogeneous | C-score | 0.65 | 100 | 88.5 |
| Homogeneous | Heterogeneous | Homogeneous | T-score | 0.60 | 100 | 29.9 |
| Homogeneous | Heterogeneous | Homogeneous | C-score | 0.60 | 100 | 29.9 |
| Homogeneous | Heterogeneous | Heterogeneous | T-score | 0.64 | 100 | 92.5 |
| Homogeneous | Heterogeneous | Heterogeneous | C-score | 0.64 | 100 | 92.5 |
| Heterogeneous | Homogeneous | Homogeneous | T-score | 0.41 | 72.7 | 95.8 |
| Heterogeneous | Homogeneous | Homogeneous | C-score | 0.41 | 72.7 | 95.8 |
| Heterogeneous | Homogeneous | Heterogeneous | T-score | 0.48 | 19.4 | 99.4 |
| Heterogeneous | Homogeneous | Heterogeneous | C-score | 0.48 | 19.4 | 99.4 |
| Heterogeneous | Heterogeneous | Homogeneous | T-score | 0.41 | 61.5 | 96.4 |
| Heterogeneous | Heterogeneous | Homogeneous | C-score | 0.41 | 61.5 | 96.4 |
| Heterogeneous | Heterogeneous | Heterogeneous | T-score | 0.48 | 9.2 | 98.6 |
| Heterogeneous | Heterogeneous | Heterogeneous | C-score | 0.48 | 9.2 | 98.6 |
Abbreviation: C-score, category-centered score.
ᵃ Shown are results from meta-analyses for the 8 scenarios in which each of the following 3 factors was either homogeneous or heterogeneous across population-based studies: 1) effect size, 2) distribution of confounders, and 3) relationship between confounders and the outcome.
The effect sizes for all adjusted and unadjusted scores were affected by the simulation settings, that is, by whether each of the following was homogeneous or heterogeneous across studies: 1) the effect size of the association between physical activity and memory, 2) the population distribution of confounders, and 3) the relationship between confounders and memory. Whether the association between physical activity and memory was homogeneous or heterogeneous had the largest influence on the overall effect size, because the pooled estimates were not identical for these 2 settings. Within each of these 2 settings, however, the association between physical activity and memory should have been identical across the different settings for age, sex, and educational level, because it was held constant across them. Whether we pooled populations that were homogeneous or heterogeneous with respect to the distribution of confounders (age, sex, and educational level) had the least influence on the pooled estimates. Different relationships between the confounders and memory across studies also influenced the pooled results: Pooled estimates in the heterogeneous setting were 6%–17% larger than those in the homogeneous setting.
The I2 clearly detects when the association between physical activity and memory is different across studies. However, a large I2 is also observed when the association between physical activity and memory is consistent across studies. This occurs when the influences of the confounders on memory are different across studies. Populations that are heterogeneous with respect to the distribution of confounders have only a limited influence on the I2 compared with homogeneous populations. Power was also most influenced when the influence of confounders on memory was heterogeneous across populations.
DISCUSSION
In the present study, we explored whether 2 standardization methods, T-scores and category-centered scores, can influence estimates of effect and heterogeneity when outcomes are measured using different scales or instruments. Researchers conducting meta-analyses often use measures of heterogeneity, which may be defined as the proportion of total variation in measured pooled risk estimates that is due to between-study heterogeneity rather than to chance, as an indication that findings across studies are consistent and thus can be pooled. In the case study, there is a suggestion that important heterogeneity may be masked by one's choice of standardization procedure. When using a criterion of I2 > 50%, all analyses indicated there was important heterogeneity. When the criterion of P < 0.05 for the Q statistic was used, however, 6 of the 18 analyses did not indicate statistically significant heterogeneity; 5 of these 6 analyses involved the T-score. Because the T-scores are standardized to the same mean across studies, it was expected that the T-scores would reduce between-study heterogeneity when compared with the category-centered score, especially in the unadjusted analyses. In fact, in the adjusted analysis, the same results were found regardless of the method of standardization.
In the case study, we also found that the effect estimates of physical activity on memory based on the unadjusted T-score and category-centered score were similar, but the magnitudes of those using the category-centered scores tended to be larger. In the adjusted analysis, the effect estimates based on the category-centered scores and T-scores were nearly identical, and they were closer to the unadjusted T-scores than to the unadjusted category-centered scores. This is supported by the simulation analysis and implies that the method of standardization may be less important if standardized measures are adjusted for a common set of important confounders. If only unadjusted analyses are available, the T-scores may be preferable in terms of bias, because they are already adjusted for important confounders. It was interesting, however, that there was still residual confounding, because the effect estimates based on the T-scores in the simulation still increased by approximately 4%.
In the simulation study, we compared the 2 standardization methods across a number of scenarios to examine the types of heterogeneity that researchers generally explore in a meta-analysis. We found that the method of standardization and the population characteristics had only a small influence on heterogeneity. As one would expect, heterogeneity was evident when we varied the effect size of physical activity on memory across populations. Interestingly, substantial heterogeneity was also evident when the relationship between the confounding variables and the outcome differed across the studies, even when the population distributions of the confounders and the effect sizes of physical activity on memory were consistent across cohorts and regardless of whether or not the effect estimates were adjusted. This implies that in terms of sources of heterogeneity, the method of standardization plays a much smaller role than does differential confounding of the outcome measure across studies and that a significant I2 can be obtained even when the “standard” sources of heterogeneity are absent across studies.
This also has implications for conducting aggregate data meta-analyses. To fully explore the contribution of these factors to heterogeneity, one requires exploration of study-specific data and IPD meta-analysis. In our analyses, we conducted 2-stage meta-analysis. Although the results are often similar, there are occasions when 1-stage and 2-stage meta-analyses can provide different parameter estimates and different conclusions (37); however, it is not clear whether the use of a 1-stage rather than a 2-stage model affects measures of heterogeneity. We expect that a 1-stage IPD analysis would be able to better address the heterogeneity in each of the effect sizes. Indeed, using random coefficient models makes it possible to study heterogeneity for each effect size. Furthermore, exploring whether or not the outcome being measured is unidimensional and consistent across studies would also require more complex modeling. For example, latent variable modeling allows for simultaneous use of information on all measures of a construct, testing of the goodness of fit of the proposed model, and testing of whether or not there is consistency of the measures across data sets. When using the other methods of standardization, researchers implicitly assume that all instruments are measuring the same construct, and this assumption is generally not verified.
Data from observational studies are presented in this article; methods to retrospectively harmonize outcome, exposure, and covariate data were used. If one were applying harmonization methods to a meta-analysis of randomized controlled trials, using unadjusted measures of effect would generally be more appropriate, and thus the inclusion of covariate data would not be warranted. There are situations, however, such as when one is interested in effect modification, in which combinable covariate data are required. As well, in the context of evaluating harms, one is often limited to nonexperimental data.
Overall, there was a general consistency between the 2 types of standardization methods, especially when an adjusted analysis was performed. In the case study, there were multiple examples in which important heterogeneity was not identified when the less complex standardization methods were used. This masking of heterogeneity happened more often when a T-score, rather than a category-centered score, was used to standardize the cognition scores. The simulation study also identified a number of sources of heterogeneity that can affect the I², some of which are not the standard sources considered by researchers. One would not be able to explore these types of heterogeneity in an aggregate data meta-analysis because individual-level data are needed. Moreover, to fully explore these different sources of heterogeneity and the underlying structure of the construct, more complex models are required. Indeed, standardization by itself is not harmonization because putting variables on the same scale can be done with any 2 variables, which does not necessarily imply that the standardized variables carry the same information.
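The final point can be made concrete with a small sketch. It assumes the conventional T-score definition (50 + 10z, with z computed from the sample's own mean and standard deviation), which may differ from the reference sample used in this article; the two variables and their values are hypothetical.

```python
import statistics

def t_score(values):
    """Conventional T-score: mean 50, standard deviation 10 (assumed definition)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [50 + 10 * (v - mean) / sd for v in values]

# Two hypothetical, unrelated variables measured on the same 8 people
recall_words = [4, 7, 9, 5, 8, 6, 10, 3]       # words recalled on a memory test
shoe_sizes = [38, 44, 41, 39, 45, 40, 42, 43]  # clearly not a memory measure

t_recall = t_score(recall_words)
t_shoes = t_score(shoe_sizes)

# Both now share a scale, although they measure different things
for name, scores in (("recall", t_recall), ("shoes", t_shoes)):
    print(name, round(statistics.fmean(scores), 1), round(statistics.stdev(scores), 1))
```

Sharing a mean and standard deviation says nothing about whether two instruments measure the same construct, which is why approaches such as latent variable modeling are needed to test that assumption.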
Supplementary Material
ACKNOWLEDGMENTS
Author affiliations: Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada (Lauren E. Griffith, Parminder Raina, Nazmul Sohel, Meghan Kenny); Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands (Edwin van den Heuvel); Research Institute of the McGill University Health Centre, Montreal, Quebec, Canada (Isabel Fortier, Dany Doiron); Department of Psychology, University of Victoria, Victoria, British Columbia, Canada (Scott M. Hofer); Research Center on Aging, CIUSSS de l'Estrie-CHUS, and Faculty of Medicine and Health Sciences, University of Sherbrooke, Sherbrooke, Quebec, Canada (Hélène Payette); Research Institute of the McGill University Health Centre, McGill University, Montreal, Quebec, Canada (Christina Wolfson); Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, Quebec, Canada (Christina Wolfson); and Research Center, Institut Universitaire de Gériatrie de Montréal and Psychology Department, Université de Montréal, Montreal, Quebec, Canada (Sylvie Belleville).
The groundwork for this manuscript is based on the methods research report Harmonization of Cognitive Measures in Individual Participant Data and Aggregate Data Meta-Analysis, funded by the Agency for Healthcare Research and Quality, United States Department of Health and Human Services, under contract No. HHSA 290 2007 10060 I. L.E.G. is supported by a Canadian Institutes of Health Research New Investigators Award. P.R. holds a Tier 1 Canada Research Chair in Geroscience and the Raymond and Margaret Labarge Chair in Research and Knowledge Application for Optimal Aging.
The authors are solely responsible for the content of the review. The opinions expressed herein do not necessarily reflect the opinions of the Agency for Healthcare Research and Quality.
Conflict of interest: none declared.
REFERENCES
- 1. Raina PS, Wolfson C, Kirkland SA, et al. The Canadian longitudinal study on aging (CLSA). Can J Aging. 2009;28(3):221–229.
- 2. Ollier W, Sprosen T, Peakman T. UK Biobank: from concept to reality. Pharmacogenomics. 2005;6(6):639–646.
- 3. Stolk RP, Rosmalen JG, Postma DS, et al. Universal risk factors for multifactorial diseases: LifeLines: a three-generation population-based study. Eur J Epidemiol. 2008;23(1):67–74.
- 4. Thompson A. Thinking big: large-scale collaborative research in observational epidemiology. Eur J Epidemiol. 2009;24(12):727–731.
- 5. Stewart LA, Tierney JF. To IPD or not to IPD? Advantages and disadvantages of systematic reviews using individual patient data. Eval Health Prof. 2002;25(1):76–97.
- 6. Griffith LE, Shannon HS, Wells RP, et al. Individual participant data meta-analysis of mechanical workplace risk factors and low back pain. Am J Public Health. 2012;102(2):309–318.
- 7. Griffith LE, van den Heuvel E, Fortier I, et al. Statistical approaches to harmonize data on cognitive measures in systematic reviews are rarely reported. J Clin Epidemiol. 2015;68(2):154–162.
- 8. Carvalho A, Rea IM, Parimon T, et al. Physical activity and cognitive function in individuals over 60 years of age: a systematic review. Clin Interv Aging. 2014;9:661–682.
- 9. Roig M, Nordbrandt S, Geertsen SS, et al. The effects of cardiovascular exercise on human memory: a review with meta-analysis. Neurosci Biobehav Rev. 2013;37(8):1645–1666.
- 10. Canadian Study of Health and Aging Working Group. Canadian study of health and aging: study methods and prevalence of dementia. CMAJ. 1994;150(6):899–913.
- 11. Statistics Canada. Canadian Community Health Survey – Healthy aging (CCHS). http://www23.statcan.gc.ca/imdb/p2SV.pl?Function=getSurvey&SDDS=5146. Published March 26, 2008. Updated November 27, 2008. Accessed January 5, 2016.
- 12. Gaudreau P, Morais JA, Shatenstein B, et al. Nutrition as a determinant of successful aging: description of the Quebec longitudinal study Nuage and results from cross-sectional pilot studies. Rejuvenation Res. 2007;10(3):377–386.
- 13. Taylor EM. The Appraisal of Children With Cerebral Deficits. Cambridge, MA: Harvard University Press; 1959.
- 14. Butler M, Retzlaff P, Vanderploeg R. Neuropsychological test usage. Prof Psychol Res Pr. 1991;22(6):510–512.
- 15. Lezak MD, Howieson DB, Loring DW. Neuropsychological Assessment. 4th ed. New York, NY: Oxford University Press; 2004.
- 16. Buschke H. Cued recall in amnesia. J Clin Neuropsychol. 1984;6(4):433–440.
- 17. Grober E, Buschke H. Genuine memory deficits in dementia. Dev Neuropsychol. 1987;3(1):13–36.
- 18. Carlesimo GA, Perri R, Caltagirone C. Category cued recall following controlled encoding as a neuropsychological tool in the diagnosis of Alzheimer's disease: a review of the evidence. Neuropsychol Rev. 2011;21(1):54–65.
- 19. Horsman J, Furlong W, Feeny D, et al. The Health Utilities Index (HUI): concepts, measurement properties and applications. Health Qual Life Outcomes. 2003;1:54.
- 20. Kavirajan H, Hays RD, Vassar S, et al. Responsiveness and construct validity of the health utilities index in patients with dementia. Med Care. 2009;47(6):651–661.
- 21. Findlay L, Bernier J, Tuokko H, et al. Validation of cognitive functioning categories in the Canadian Community Health Survey-Healthy Aging. Health Rep. 2010;21(4):85–100.
- 22. Coley N, Andrieu S, Gardette V, et al. Dementia prevention: methodological explanations for inconsistent results. Epidemiol Rev. 2008;30:35–66.
- 23. Griffith L, van den Heuvel E, Fortier I, et al. Harmonization of Cognitive Measures in Individual Participant Data and Aggregate Data Meta-Analysis. Rockville, MD: Agency for Healthcare Research and Quality; 2013. (AHRQ Publication No. 13-EHC040-EF).
- 24. Fortier I, Burton PR, Robson PJ, et al. Quality, quantity and harmony: the DataSHaPER approach to integrating data across bioclinical studies. Int J Epidemiol. 2010;39(5):1383–1393.
- 25. Fortier I, Doiron D, Little J, et al. Is rigorous retrospective harmonization possible? Application of the DataSHaPER approach across 53 large studies. Int J Epidemiol. 2011;40(5):1314–1328.
- 26. Tuokko H, Woodward TS. Development and validation of a demographic correction system for neuropsychological measures used in the Canadian Study of Health and Aging. J Clin Exp Neuropsychol. 1996;18(4):479–616.
- 27. Riley RD, Simmonds MC, Look MP. Evidence synthesis combining individual patient data and aggregate data: a systematic review identified current practice and possible methods. J Clin Epidemiol. 2007;60(5):431–439.
- 28. Dion M, Potvin O, Belleville S, et al. Normative data for the Rappel libre/Rappel indicé à 16 items (16-item Free and Cued Recall) in the elderly Quebec-French population. Clin Neuropsychol. 2015;28(suppl 1):S1–S19.
- 29. Horn JL. Organization of abilities and the development of intelligence. Psychol Rev. 1968;75(3):242–259.
- 30. DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials. 1986;7(3):177–188.
- 31. DerSimonian R, Kacker R. Random-effects model for meta-analysis of clinical trials: an update. Contemp Clin Trials. 2007;28(2):105–114.
- 32. Fleiss JL. Statistical Methods for Rates and Proportions. 2nd ed. New York, NY: John Wiley & Sons; 1981.
- 33. Higgins JP, Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med. 2002;21(11):1539–1558.
- 34. Deeks JJ, Higgins JPT, Altman DG. Chapter 9: Analysing data and undertaking meta-analyses. In: Higgins JPT, Green S, eds. Cochrane Handbook for Systematic Reviews of Interventions. Version 5.1.0. Chichester, UK: The Cochrane Collaboration; 2011. http://handbook.cochrane.org/. Updated January 19, 2016. Accessed January 19, 2016.
- 35. Wallace BC, Schmid CH, Lau J, et al. Meta-analyst: software for meta-analysis of binary, continuous and diagnostic data. BMC Med Res Methodol. 2009;9:80.
- 36. SAS/STAT User's Guide, Version 8. Cary, NC: SAS Institute Inc.; 2003.
- 37. Debray TP, Moons KG, Abo-Zaid GM, et al. Individual participant data meta-analysis for a binary outcome: one-stage or two-stage. PLoS One. 2013;8(4):e60650.