Abstract
Background: Precision medicine is the Holy Grail of interventions: treatments tailored to each patient's individual characteristics. However, the conventional design of randomized trials assumes that each individual benefits by the same amount.
Methods: We reviewed parallel trials with quantitative outcomes published in 2004, 2007, 2010 and 2013. We collected baseline and final standard deviations of the main outcome. We assessed homoscedasticity by comparing the outcome variability between treated and control arms.
Results: The review yielded 208 articles with enough information to conduct the analysis. At the end of the study, 113 (54%, 95% CI 47 to 61%) papers showed less variability in the treated arm. The adjusted point estimate of the mean ratio (treated to control group) of the outcome variances was 0.89 (95% CI 0.81 to 0.97).
Conclusions: Some variance inflation was observed in just 1 out of 6 interventions, suggesting the need for further eligibility criteria to tailor precision medicine. Surprisingly, the variance was more often smaller in the intervention group, suggesting, if anything, a reduced role for precision medicine. Homoscedasticity is a useful tool for assessing whether or not the premise of constant effect is reasonable.
Keywords: Constant Effect, Precision medicine, Homoscedasticity, Clinical Trial, Variability, Standard deviation, Review
Introduction
The idea behind precision medicine is to develop prevention and treatment strategies that take individual characteristics into account. US President Obama launched the Precision Medicine Initiative in 2015 to capitalize on recent developments, with this strong endorsement: “The prospect of applying this concept broadly has been dramatically improved by recent developments in large-scale biologic databases (such as the human genome sequence), powerful methods for characterizing patients (such as proteomics, metabolomics, genomics, diverse cellular assays, and mobile health technology), and computational tools for analyzing large sets of data.” 1, 2. However, we are not convinced that this is a sensible idea.
Variability of a clinical trial outcome measure should interest researchers because it conveys important information about whether there is a need for precision medicine. Does variance come only from unpredictable and ineluctable sources of patient variability? Or should it also be attributed to a differential treatment effect that requires more precise prescription rules 3–5? Researchers assess treatment effect modifications (“interactions”) among subgroups based on relevant variables. The main problem with that methodology is that, by the usual standards of a classical phase III trial, the stratification factors must be known in advance and be measurable. This in turn implies that when new variables are discovered and introduced into the causal path, new clinical trials are needed. Fortunately, one observable consequence of a constant effect is that the treatment will not affect variability, and therefore the outcome variances in both arms should be equal (“homoscedasticity”). If homoscedasticity holds, there is no need to repeat the clinical trial once a new possible effect modifier becomes measurable.
Nevertheless, the fundamental problem of causal inference is that for each patient in a parallel group trial, we can know the response to only one of the interventions. That is, we observe their response to either the new Treatment or to the Control, but not both. By experimentally controlling unknown confounders through randomization, a clinical trial may estimate the average causal effect. In order to translate this population estimate into effects for individual patients, additional assumptions are needed. We try to elucidate whether the comparison of observed variances may shed some light on the non-observable individual treatment effect. See Figure 1 for examples and references illustrating their interpretation 6–15.
The assumption that the average effect equals the single unit effect underlies the rationale behind the usual sample size calculation, where only a single effect is specified. As an example, the 10 clinical trials published in the Trials Journal in October 2017 (see Supplementary File 1 : Table S1) were designed under this scenario of a fixed, constant or unique effect in the sample size calculation.
Our objectives were, first, to compare the variability of the main outcome between arms in clinical trials published in medical journals and, second, to provide a first, rough estimate of the proportion of studies that could potentially benefit from precision medicine. As a sensitivity analysis, we explored the changes in the experimental arm’s variability over time (from baseline to the end of the study). We also fitted a random effects model to the outcome variance ratio in order to isolate studies with a variance ratio outside the values expected from random variability alone (heterogeneity).
Methods
Population
Our target population was parallel randomized clinical trials with quantitative outcomes. Trials needed to provide enough information to assess two homoscedasticity assumptions in the primary endpoint: between arms at trial end; and baseline to outcome over time in the treated arm. Therefore, baseline and final SDs for the main outcome were necessary or, failing that, at least one measure that would allow us to calculate them (variances, standard errors or mean confidence intervals).
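When only standard errors or mean confidence intervals were reported, the SD can be recovered with standard conversions. The following Python sketch illustrates them (function names are ours, not taken from the authors' code); the CI version assumes a normal, z-based interval:

```python
import math
from statistics import NormalDist

def sd_from_se(se, n):
    """Recover the standard deviation from a reported standard error of the mean."""
    return se * math.sqrt(n)

def sd_from_ci(lower, upper, n, level=0.95):
    """Recover the SD from a reported confidence interval for the mean,
    assuming a normal (z-based) interval."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # e.g., 1.96 for 95%
    half_width = (upper - lower) / 2
    return half_width * math.sqrt(n) / z
```

For trials reporting t-based intervals, the z quantile would be replaced by the corresponding t quantile, so the z-based version slightly underestimates the SD in small samples.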
Data collection
Articles on parallel clinical trials from the years 2004, 2007, 2010 and 2013 were selected from the Medline database with the following criteria: “AB (clinical trial* AND random*) AND AB (change OR evolution OR (difference AND baseline))” [The word “difference” was paired with “baseline” because the initial purpose of the data collection, subsequently modified, was to estimate the correlation between baseline and final measurements]. The rationale behind the selection of these years was to obtain a global view of the behavior of the studies over a whole decade. For the years 2004 and 2007, we selected all papers that met the inclusion criteria; for the years 2010 and 2013, as the search retrieved a greater number of articles (478 and 653, respectively), we chose a random sample of 300 papers (Section II in Supplementary File 1).
Data were collected by two different researchers (NM, MkV) in two phases: 2004/2007 and 2010/2013. Later, two statisticians (JC, MtV) verified the correctness of the data and made them accessible to readers through a Shiny application and through the Figshare repository 16.
Variables
Collected variables were: baseline and outcome SDs; experimental and reference interventions; sample size in each group; medical field according to Web of Science (WOS) classification; main outcome; patient’s disease; kind of disease (chronic or acute); outcome type (measured or scored); intervention type (pharmacological or not); and whether or not the main effect was significant.
For studies with more than one quantitative outcome, the primary endpoint was determined according to the following hierarchical criteria: (1) objective or hypothesis; (2) sample size determination; (3) main statistical method; or (4) first quantitative variable reported in the results.
Similarly, the “experimental” arm was chosen based on its role in the following sections of the article: (1) objective or hypothesis; (2) sample size determination; (3) rationale in the introduction; or (4) first comparison reported in the results (in the case of more than two arms).
Statistical analysis
We assessed homoscedasticity between treatments and over time. Our main analysis compared, for the former, the outcome variability between Treated (T) and Control (C) arms at the trial end. For the latter, we compared the variability between Outcome (O) and its Baseline (B) value for the treated arm.
To distinguish random variability from heterogeneity, we fitted a mixed effects model using the logarithm of the variance ratio at the end of the trial as the response, with the study as a random effect and the logarithm of the variance ratio at baseline as a fixed effect 17. An analogous model was employed to assess homoscedasticity over time, as such a model allows the separation of random allocation variability from additional heterogeneity. To obtain a reference in the absence of treatment effect, we first modeled the baseline variance ratio as a response that is expected to have heterogeneity equal to 0 due to randomization, so long as no methodological impurities are present (e.g., treating outcomes obtained 1 month after the start of treatment as baseline values). This reference model allows us to estimate the proportion of studies in the previous models with additional heterogeneity that cannot be explained by the variability among studies (sections III, V and VI in Supplementary File 1).
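The building blocks of such an analysis can be sketched in a few lines. This is a simplified Python illustration, not the authors' model (they used R, and their model additionally adjusts for the baseline log variance ratio as a fixed effect): it uses the Bartlett–Kendall approximation Var[log s²] ≈ 2/(n−1) for the per-study standard error, and a DerSimonian–Laird moment estimator for between-study heterogeneity.

```python
import math

def log_variance_ratio(sd_t, n_t, sd_c, n_c):
    """Log of the treated/control variance ratio and its approximate
    standard error (Bartlett & Kendall: Var[log s^2] ~ 2/(n-1))."""
    lvr = math.log(sd_t**2 / sd_c**2)
    se = math.sqrt(2.0 / (n_t - 1) + 2.0 / (n_c - 1))
    return lvr, se

def dl_tau2(effects, ses):
    """DerSimonian-Laird moment estimate of between-study heterogeneity
    (tau^2) from per-study effects and their standard errors."""
    w = [1.0 / se**2 for se in ses]
    mu = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    return max(0.0, (q - df) / c)  # truncated at zero
```

A tau² near zero for the baseline model is what randomization predicts; excess tau² in the outcome model is the "additional heterogeneity" discussed in the Results.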
Funnel plots, centered at zero, of the measurement of interest as a function of its standard error are reported in order to investigate asymmetries.
As sensitivity analyses, we assessed homoscedasticity in each single study: (a) between outcomes on both arms with F-test for independent samples; and (b) between baseline and outcome in the treated arm with a specific test for paired samples 18 when the variance of the paired difference was available. All tests were two-sided (α=5%).
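As an illustration of these per-study tests, the Python sketch below computes the F statistic for independent arms and a paired-variances t statistic in the Pitman–Morgan form. We offer it as a plausible stand-in for the paired test of reference 18 (not necessarily the exact variant used); the baseline–outcome correlation is recovered from the variance of the paired difference, as described in the text.

```python
import math

def f_statistic(sd_t, sd_c):
    """F statistic for equality of two independent variances;
    referred to an F(n_t - 1, n_c - 1) distribution."""
    return sd_t**2 / sd_c**2

def pitman_morgan_t(sd_b, sd_o, sd_d, n):
    """Pitman-Morgan t statistic for equality of two *paired* variances
    (baseline vs outcome in the same arm). The correlation r is recovered
    from the variance of the paired difference:
    Var(D) = Var(B) + Var(O) - 2*r*SD(B)*SD(O)."""
    r = (sd_b**2 + sd_o**2 - sd_d**2) / (2 * sd_b * sd_o)
    t = (sd_b**2 - sd_o**2) * math.sqrt(n - 2) / (
        2 * sd_b * sd_o * math.sqrt(1 - r**2))
    return t  # compare with a t distribution on n - 2 degrees of freedom
```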
Several subgroup analyses were carried out according to the statistical significance of the main effect of the study and to the different types of outcomes and interventions.
All analyses were performed with R version 3.2.5. (The R code for the main analysis is available from https://doi.org/10.5281/zenodo.1133609 19.)
Results
Population
A total of 1214 articles were retrieved from the search. Of those, 542 belonged to the target population and 208 (38.4%) contained enough information to conduct the analysis ( Figure 2).
Most of the selected studies were non-pharmacological (122, 58.6%), referred to chronic conditions (101, 57.4%), had an outcome measured in units (132, 63.8%) rather than on a constructed scale, and had an outcome that was measured rather than assessed (125, 60.1%). Regarding the primary objective of each trial, the authors found statistically significant differences between the arms in 83 (39.9%) studies. Following the WOS criteria, 203 articles (97.6%) belonged to at least one medical field. The main areas of study were: General & Internal Medicine (n=31, 14.9%), Nutrition & Dietetics (21, 10.1%), Endocrinology & Metabolism (19, 9.1%), and Cardiovascular System & Cardiology (16, 7.7%).
Homoscedasticity
There is a high average concordance between variances in the treatment and control arms, but with evidence of smaller variability in the treated arm. At the end of the study, 113/208 (54%, 95% CI, 47 to 61%) papers showed less variability in the treated arm ( Supplementary File 1 : Figure S1). Among the treated arms, 111/208 (53%, 95% CI, 46 to 60%) had less or equal variability at the end of follow-up than at the beginning ( Supplementary File 1 : Figure S2).
We found statistically significant differences (at 5%) between outcome variances in 41 out of 208 (19.7%) studies: 7.2% were in favor of greater variance in the treated arm, and 12.5% in the opposite direction. A greater proportion was obtained from the comparisons over time of 95 treated arms: 16.8% had significantly greater variability at the end of the study and 23.2% at the beginning ( Table 1).
Table 1. Variance comparison.
After treatment, variability is… (cells show n (%)):

| Comparing variances | N | Method | Increased | Decreased | Not changed |
| --- | --- | --- | --- | --- | --- |
| Outcome between treatment arms | 208 | F test | 15 (7.2%) | 26 (12.5%) | 167 (80.3%) |
| Outcome between treatment arms | 208 | Random model | 11 (5.3%) | 19 (9.1%) | 178 (85.6%) |
| Outcome versus baseline in treated arm | 95 ¥ | Paired test | 16 (16.8%) | 22 (23.2%) | 57 (60.0%) |
| Outcome versus baseline in treated arm | 95 ¥ | Random model | 13 (13.7%) | 19 (20.0%) | 63 (66.3%) |
Regarding the comparison between arms, the adjusted point estimate of the ratio (Treated to Control group) of the outcome variances is 0.89 (95% CI 0.81 to 0.97), indicating that treatments tend to reduce the variability of the patient's response by about 11% on average. The comparison over time provides a similar result: the average variability at the end of the study is 14% lower than that at the beginning ( Supplementary File 1 : Table S4).
According to the random model, the baseline heterogeneity was 0.31; this is a very high value which can only be explained by methodological flaws similar to those presented by Carlisle 20. Fortunately, the exclusion of the four most extreme papers reduced it to 0.07; one of them was the study by Hsieh et al. 21 whose “baseline” values were obtained 1 month after the treatment started. When we modeled the outcome instead of the baseline variances as the response, heterogeneity was approximately doubled. We found 30 studies that compromised homoscedasticity (11 with higher variance in the treated arm and 19 with lower, Table 1). Figure 3 shows the funnel plots for both between-arm and over-time comparisons.
Subgroup analyses suggest that only significant interventions had an effect on reducing variability ( Supplementary File 1 : Figures S3–S5); this has already been observed in other studies 22, 23 and is in line with other works that found a positive correlation between the effect size and its heterogeneity 24, 25: in fact, it is difficult to find heterogeneity when there is no overall treatment effect. The remaining subgroup analyses did not raise concerns (section V in Supplementary File 1).
Discussion
Our main objective was to show that comparing variances can provide some evidence about how much precision medicine is needed. The variability seems to decrease for treatments that perform significantly better than the reference; otherwise, it remains similar. Therefore, contrary to popular belief, variability tends to be reduced on average after treatment, thus making precision medicine dispensable in most cases. This could be due to several reasons: some measurements may have “ceiling” or “floor” effects (e.g., in the extreme case, if a treatment makes a person well, no further improvement is possible); or the treatment may act proportionally rather than linearly, in which case the logarithm of the outcome would serve as a better scale.
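The proportional-effect point can be made concrete with a toy example (hypothetical numbers, ours): a treatment that reduces every patient's outcome by the same 20% shrinks the raw-scale variance even though the effect is perfectly constant, and hence homoscedastic, on the log scale.

```python
import math
import statistics

control = [5.0, 10.0, 20.0, 40.0]      # hypothetical control-arm outcomes
treated = [0.8 * x for x in control]   # a constant 20% reduction for everyone

# On the raw scale, the treated arm looks less variable (ratio = 0.8**2 = 0.64)...
raw_ratio = statistics.pvariance(treated) / statistics.pvariance(control)

# ...but on the log scale the effect is an additive constant (log 0.8),
# so the two variances coincide.
log_control = [math.log(x) for x in control]
log_treated = [math.log(x) for x in treated]
```

Under such a multiplicative effect, a between-arm variance ratio below 1 reflects the choice of scale, not a differential response.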
When both arms have equal variances, an obvious default explanation is that the treatment is equally effective for all, rendering the search for predictors of differential response futile. This means that the treatment effect obtained by comparing the means between groups can be used to estimate both the average treatment effect and the non-observable individual treatment effect.
Furthermore, our second objective was to provide a rough estimate of the proportion of interventions that require a greater degree of precision medicine, and our answer is “not many”: considering the most extreme result from Table 1, we found that 1 in 6 interventions (16.8%) showed some variance inflation.
There are three reasons why these findings do not invalidate precision medicine in all settings. First, some additional heterogeneity is present in the outcome variance ratios, indicating that variability had increased in some studies between arms as well as over time. Second, the outcomes of some types of interventions (surgery, for example) are greatly influenced by the skills and training of those administering the intervention, which could increase variability. And third, we focused on quantitative outcomes, which are neither time-to-event nor binary, meaning that the effect could take a different form, such as all-or-nothing.
The results rely on published articles, which raises some relevant issues. First, some of our analyses are based on Normality assumptions for the outcomes that are unverifiable without access to raw data. Second, a high number of manuscripts (61.6%, Figure 2) act contrary to CONSORT 26 advice in that they do not report variability. Thus, the results may be biased in either direction. Third, trials are usually powered to test constant effects, and thus the presence of greater variability would lead to underpowered trials, non-significant results and unpublished papers. Fourth, the random effects model reveals additional heterogeneity in the outcome variance ratio, which may be the result of methodological inaccuracies 20 arising from typographical errors in data translation, inadequate follow-up, insufficient reporting, or even data fabrication. On the other hand, this heterogeneity could also be the result of relevant undetected factors interacting with the treatment, which would indeed justify the need for precision medicine. A fifth limitation is that many clinical trials are not completely randomized. For example, multicenter trials are often blocked by center through the permuted blocks method. This means that if variances are calculated as if the trial were completely randomized (which is standard practice), the simple standard theory covering the random variation of variances from arm to arm is at best approximately true 22.
The main limitation of our study arises from the fact that, although constant effect always implies homoscedasticity on the chosen scale, the reverse is not true; i.e., homoscedasticity does not necessarily imply a constant effect. Nevertheless, a constant effect is the simplest explanation for homoscedasticity. For example, the non-parsimonious situation reflected in Figure 4 indicates homoscedasticity but without a constant effect.
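A toy numerical version of such a non-parsimonious scenario (ours, not the one depicted in Figure 4): a treatment that reflects each response about the group mean leaves both the mean and the variance untouched, yet produces wildly different individual effects.

```python
import statistics

control = [6, 8, 10, 12, 14]           # hypothetical control-arm outcomes
mean = statistics.mean(control)

# Treatment "reflects" each response about the mean: 14, 12, 10, 8, 6.
treated = [2 * mean - x for x in control]

# Mean and variance are identical across arms (homoscedasticity)...
# ...yet individual effects range from -8 to +8: anything but constant.
individual_effects = [t - c for t, c in zip(treated, control)]
```

This is why homoscedasticity supports, but cannot prove, a constant effect: it is merely the simplest explanation consistent with the data.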
Heteroscedasticity may suggest the need for further refinements of the eligibility criteria or for finding an additive scale 22, 27. Because interaction analyses cannot include unknown variables, all trials would potentially need to be repeated once any new potential interaction variable emerges (e.g., a new biomarker) as a candidate for a new subgroup analysis. Nevertheless, we have shown how homoscedasticity can be assessed when reporting trials with numerical outcomes, regardless of whether every potential effect modifier is known.
For most trials, subjects vary little in their response to treatment, which suggests that precision medicine’s scope may be less than what is commonly assumed. In the past century, Evidence-Based Medicine operated under the paradigm of a constant effect assumption, by which we learned from previous patients in order to develop practical clinical guides for treating future ones. Here, we have provided empirical insights for the rationale behind Evidence-Based Medicine. However, even where one common effect applies to all patients fulfilling the eligibility criteria, this does not imply the same decision is optimal for all patients, specifically because different patients and stakeholders may vary in their weighting not only of efficacy outcomes, but also of the harm and cost of the interventions – thus bridging the gap between common evidence and personalized decisions.
Nevertheless, in 16 trials of our sample, there was some evidence of variation arising from the treatment effect, suggesting a possible role for more tailored treatments: either with finer selection criteria (common effect within specific subgroups), or with n-of-1 trials (no subgroups of patients with a common effect). By identifying indications where the scope for precision medicine is limited, studies such as ours may free up resources for cases with a greater scope.
Our results uphold the assertion by Horwitz et al. that there is a “need to measure a greater range of features to determine [...] the response to treatment” 28. One of these features is an old friend of statisticians: the variance. Looking only at averages can cause us to miss out on important information.
Data availability
Data is available through two sources:
- A Shiny app that allows the user to interact with the data without the need to download it: http://shiny-eio.upc.edu/pubs/F1000_precision_medicine/
- The Figshare repository: https://doi.org/10.6084/m9.figshare.5552656 16
In both sources, the data can be downloaded under a Creative Commons License v. 4.0.
The code for the main analysis is available in the following link: https://doi.org/10.5281/zenodo.1133609 19
Funding Statement
Partially supported by Methods in Research on Research (MiRoR, Marie Skłodowska-Curie No. 676207); MTM2015-64465-C2-1-R (MINECO/FEDER); and 2014 SGR 464.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
[version 1; peer review: 2 approved with reservations]
Supplementary material
Supplementary File 1: The supplementary material contains the following sections:
- Section I: Constant effect assumption in sample size rationale
- Section II: Bibliographic review
- Section III: Descriptive measures
- Section IV: Random effects models
- Section V: Subgroup analyses
- Section VI: Standard error of log(V OT/V OC) in independent samples
- Section VII: Standard error of log(V OT/V BT) in paired samples
References
- 1. Collins FS, Varmus H: A new initiative on precision medicine. N Engl J Med. 2015;372:793–5. 10.1056/NEJMp1500523
- 2. Kohane IS: Ten things we have to do to achieve precision medicine. Science. 2015;349(6243):37–8. 10.1126/science.aab1328
- 3. Schork NJ: Personalized medicine: Time for one-person trials. Nature. 2015;520(7549):609–11. 10.1038/520609a
- 4. Willis JC, Lord GM: Immune biomarkers: the promises and pitfalls of personalized medicine. Nat Rev Immunol. 2015;15(5):323–29. 10.1038/nri3820
- 5. Wallach JD, Sullivan PG, Trepanowski JF, et al.: Evaluation of Evidence of Statistical Support and Corroboration of Subgroup Claims in Randomized Clinical Trials. JAMA Intern Med. 2017;177(4):554–60.
- 6. Durán-Cantolla J, Aizpuru F, Montserrat JM, et al.: Continuous positive airway pressure as treatment for systemic hypertension in people with obstructive sleep apnoea: randomised controlled trial. BMJ. 2010;341:c5991. 10.1136/bmj.c5991
- 7. Kojima Y, Kaga H, Hayashi S, et al.: Comparison between sitagliptin and nateglinide on postprandial lipid levels: the STANDARD study. World J Diabetes. 2013;4(1):8–13. 10.4239/wjd.v4.i1.8
- 8. International Conference on Harmonisation: Statistical principles for clinical trials ICH-E9. 1998. Accessed September 14, 2017.
- 9. Shamseer L, Sampson M, Bukutu C, et al.: CONSORT extension for reporting N-of-1 trials (CENT) 2015: Explanation and elaboration. BMJ. 2015;350:h1793. 10.1136/bmj.h1793
- 10. Araujo A, Julious S, Senn S: Understanding Variation in Sets of N-of-1 Trials. PLoS One. 2016;11(12):e0167167. 10.1371/journal.pone.0167167
- 11. Senn S: Individual response to treatment: is it a valid assumption? BMJ. 2004;329(7472):966–68. 10.1136/bmj.329.7472.966
- 12. Senn S: Mastering variation: variance components and personalised medicine. Stat Med. 2016;35(7):966–77. 10.1002/sim.6739
- 13. Wang R, Lagakos SW, Ware JH, et al.: Statistics in medicine – reporting of subgroup analyses in clinical trials. N Engl J Med. 2007;357(21):2189–94. 10.1056/NEJMsr077003
- 14. Senn S, Richardson W: The first t-test. Stat Med. 1994;13(8):785–803. 10.1002/sim.4780130802
- 15. Kim SH, Schneider SM, Bevans M, et al.: PTSD symptom reduction with mindfulness-based stretching and deep breathing exercise: randomized controlled clinical trial of efficacy. J Clin Endocr Metab. 2013;98(7):2984–92. 10.1210/jc.2012-3742
- 16. Cortés J: review_homoscedasticity_clinical_trials [Data set]. Figshare. 2017. 10.6084/m9.figshare.5552656
- 17. Bartlett MS, Kendall DG: The statistical analysis of variance-heterogeneity and the logarithmic transformation. J R Stat Soc. 1946;8(1):128–38. 10.2307/2983618
- 18. Sachs L: Applied Statistics: A Handbook of Techniques. 2nd ed. New York: Springer-Verlag, 1984. 10.1007/978-1-4612-5246-7
- 19. Cortés J: R code for analysis of homoscedasticity in clinical trials. Zenodo. 2017. 10.5281/zenodo.1133609
- 20. Carlisle JB: Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944–52. 10.1111/anae.13938
- 21. Hsieh LL, Kuo CH, Yen MF, et al.: A randomized controlled clinical trial for low back pain treated by acupressure and physical therapy. Prev Med. 2004;39(1):168–76. 10.1016/j.ypmed.2004.01.036
- 22. Senn S: Controversies concerning randomization and additivity in clinical trials. Stat Med. 2004;23(24):3729–53. 10.1002/sim.2074
- 23. Jamieson J: Measurement of change and the law of initial values: A computer simulation study. Educ Psychol Meas. 1995;55(1):38–46. 10.1177/0013164495055001004
- 24. Senn S: Trying to be precise about vagueness. Stat Med. 2007;26(7):1417–30. 10.1002/sim.2639
- 25. Greenlaw N: Constructing appropriate models for meta-analyses. University of Glasgow, 2010. Accessed September 14, 2017.
- 26. Schulz KF, Altman DG, Moher D, et al.: CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials. BMJ. 2010;340:c332. 10.1136/bmj.c332
- 27. Rothman KJ, Greenland S, Walker AM: Concepts of interaction. Am J Epidemiol. 1980;112(4):467–70. 10.1093/oxfordjournals.aje.a113015
- 28. Horwitz RI, Cullen MR, Abell J, et al.: (De)personalized medicine. Science. 2013;339(6124):1155–6. 10.1126/science.1234106