Editorial
J Natl Cancer Inst. 2022 Sep 26;115(1):1–3. doi: 10.1093/jnci/djac186

Weighing evidence: robustness vs quantity

Scott R Evans 1,2, Toshimitsu Hamasaki 3,4
PMCID: PMC9830481  PMID: 36156151

Many new cancer therapies target specific molecular subtypes. This effectively reduces the size of the target population and the number of eligible participants to study in trials. Augmenting control group data in a small randomized trial with historical control data has been proposed to address this issue.

Freidlin and Korn (1) logically appraise the utility of using historical control data in this setting. They thoughtfully evaluate operating characteristics of augmentation via dynamic borrowing and meta-analytic approaches that can lessen the biases of using historical data, and conduct a thorough review of applications of the approach in the literature. They conclude by noting the limited value of the augmentation strategy. Sample sizes in the precision-medicine setting are too small to guide dynamic borrowing, and limitations in the number of historical cohorts render meta-analytic approaches unreliable. Molecular classifications involving novel biomarkers are regularly updated, challenging the important matching of current and historical populations to help reduce bias. We agree with the conclusions of Freidlin and Korn (1).
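To see what borrowing does mechanically, consider a minimal sketch of a power prior for a binary control-arm endpoint (our illustration only, with hypothetical counts and fixed borrowing weights rather than any published dynamic rule). The historical likelihood is raised to a power a0 between 0 (no borrowing) and 1 (full pooling), and everything hinges on choosing a0 well, which is precisely the step that small current samples cannot reliably inform.

```python
# Illustrative power-prior borrowing for a binary control-arm response rate.
# All counts are hypothetical; a0 is a fixed borrowing weight, not an
# estimated (dynamic) one.

def power_prior_posterior(x_cur, n_cur, x_hist, n_hist, a0,
                          prior_a=1.0, prior_b=1.0):
    """Beta posterior parameters for the control response rate when the
    historical likelihood is down-weighted by a0 (0 = ignore, 1 = pool)."""
    alpha = prior_a + x_cur + a0 * x_hist
    beta = prior_b + (n_cur - x_cur) + a0 * (n_hist - x_hist)
    return alpha, beta

# Concurrent randomized controls: 6/20 responses (hypothetical).
# Historical control cohort: 45/100 responses (hypothetical).
for a0 in (0.0, 0.25, 0.5, 1.0):
    a, b = power_prior_posterior(6, 20, 45, 100, a0)
    print(f"a0={a0:.2f}: posterior mean = {a / (a + b):.3f}, "
          f"effective control n = {a + b - 2:.0f}")
```

As a0 grows, the posterior mean drifts from the concurrent control rate toward the historical rate while the apparent effective sample size climbs; with only 20 concurrent controls, the data provide little basis for deciding how much borrowing is justified.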

Identifying enough eligible patients to conduct a well-powered trial within a reasonable period is a practical challenge. In theory, the magnitude of the treatment effect may be larger in the right refined population, and smaller sample sizes would be required to detect this larger effect. However, reality infrequently measures up to researcher optimism in well-controlled research settings. Researchers might simply accept that well-powered trials will take longer to enroll and conduct. However, patience carries costs. Desperation motivates the search for alternatives.

At the core is whether one sacrifices evidence robustness or quantity. Loosely, one can choose an imprecise right answer or a more precise wrong answer. Is the increase in bias worth the increase in precision? Much rides on this decision, including the control of errors.
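The trade-off can be made arithmetic. In the small sketch below (hypothetical numbers throughout, assuming a simple pooled estimate of a control-arm mean), mean squared error is bias squared plus variance: augmentation shrinks the variance but adds bias whenever the historical population has drifted from the current one.

```python
# Hypothetical bias-variance arithmetic: mean squared error of a control-arm
# mean estimated with and without historical augmentation.

def mse_augmented(sigma2, n_cur, n_hist, drift):
    """Pooled estimator: smaller variance, but biased if the historical
    population mean differs from the current one by `drift`."""
    w_hist = n_hist / (n_cur + n_hist)   # weight placed on historical data
    bias = w_hist * drift
    variance = sigma2 / (n_cur + n_hist)
    return bias ** 2 + variance

def mse_randomized(sigma2, n_cur):
    """Concurrent randomized controls only: unbiased, larger variance."""
    return sigma2 / n_cur

sigma2, n_cur, n_hist = 1.0, 30, 90      # hypothetical design values
print(f"randomized only: MSE = {mse_randomized(sigma2, n_cur):.4f}")
for drift in (0.0, 0.1, 0.2, 0.3):
    print(f"augmented, drift = {drift:.1f}: "
          f"MSE = {mse_augmented(sigma2, n_cur, n_hist, drift):.4f}")
```

For these particular numbers, the augmented estimator wins when the drift is small but becomes worse than the unbiased randomized-only estimator once the drift exceeds roughly 0.2 outcome standard deviations. The difficulty in practice is that the drift is unknown.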

Randomization is the foundation for statistical inference, providing the basis for understanding uncertainty and controlling errors. This important fact is increasingly underappreciated amid the growing momentum of real-world data hype. Scientists must not only appreciate but demand scientific rigor in research settings where our most reliable evidence is earned through quality data from randomized controlled studies rather than through assumptions that lower the evidentiary standard and introduce greater uncertainty (2).

The benefits that randomization provides must be protected during study design, conduct, and analysis, or else they can be lost. Here, with the addition of historical data, the biases of observational studies have been introduced despite the initial randomization. There is no longer an assured expectation of balance across arms of all potentially confounding factors other than treatment assignment, nor the foundational infrastructure for controlling error rates. The study must now be considered observational, subject to the associated biases.

Many studies comparing results obtained from nonrandomized studies with randomized studies illustrate that much of the nonrandomized evidence does not replicate under randomized controlled settings (3). Nonrandomized studies often underestimate risk of harms (4) and overestimate mortality benefits (5). Many examples of cancer therapies for which nonrandomized evidence suggested beneficial effects, some resulting in substantial uptake into clinical practice (6), were later shown to be ineffective or harmful in randomized trials (7,8).

Methods have been proposed for addressing the bias induced by augmenting the control-arm data of a randomized clinical trial with nonconcurrent data from patients who received the control treatment outside the trial (9). The effectiveness of these methods in producing valid results relies on the validity of their assumptions. The assumptions include similarity of study design and conduct between the concurrent and historical studies, including data collection and assessment (eg, study endpoints are defined in the same way and measured using the same protocol and procedures); homogeneity of the treatment effect and associated variance across the concurrent and historical studies; and balance between the treatment and control groups with respect to important confounding variables.

The expectation of balance of all factors—known or unknown, measured or unmeasured—is assured in fully randomized studies. This no longer applies with augmented historical control data. Statistical methods can be applied to address imbalances, though known and recorded covariates capture only the tip of the iceberg because some factors may be unrecognized or unrecorded. The historical control data source should be evaluated for quality and completeness to ensure sufficient data to conduct analyses to address imbalances. Matching is particularly important to consider for molecular characteristics, demographics, baseline disease severity, diagnostic evaluation, supportive clinical care and concomitant therapies, patient follow-up, and endpoint evaluation definition and procedures. It may not be feasible to balance some important factors. If the “age” of the historical control cohort is too old, then it represents a different population, one that was unable to benefit from recent advances in standard of care and evolution of clinical practice, including diagnostics and therapeutic interventions.
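One routine diagnostic for the recorded covariates, sketched below with simulated data (our illustration, not a procedure drawn from the trials discussed here), is the standardized mean difference between the concurrent and historical control cohorts. It can flag measured imbalances, but by construction it says nothing about unrecognized or unrecorded factors.

```python
# Illustrative covariate-balance check between concurrent and historical
# control cohorts using standardized mean differences (SMDs).  The data are
# simulated; values above roughly 0.1 are often flagged as imbalance.
import numpy as np

def standardized_mean_difference(x_current, x_historical):
    """SMD for one covariate between two cohorts."""
    m1, m2 = np.mean(x_current), np.mean(x_historical)
    v1, v2 = np.var(x_current, ddof=1), np.var(x_historical, ddof=1)
    return (m1 - m2) / np.sqrt((v1 + v2) / 2)

rng = np.random.default_rng(0)
age_cur, age_hist = rng.normal(62, 8, 40), rng.normal(58, 8, 120)          # years
burden_cur, burden_hist = rng.normal(3.1, 1.0, 40), rng.normal(3.0, 1.0, 120)
print("SMD, age:          ", round(standardized_mean_difference(age_cur, age_hist), 2))
print("SMD, tumor burden: ", round(standardized_mean_difference(burden_cur, burden_hist), 2))
```

Such checks help judge whether adjustment or matching on recorded covariates is even plausible; they cannot rescue a historical cohort drawn from an era with different diagnostics, supportive care, or endpoint procedures.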

Assumptions come at the sacrifice of robustness. The validity of some of the assumptions can be evaluated using the data. Other assumptions cannot be verified. Acceptance of the assumptions increases interpretation complexity. Incorrect inferences ultimately result in ineffective or unsafe treatments and suboptimal patient care. Reasonable justification of all assumptions, verified with data to the extent possible, is critical.

An alternative framing of the decision provides insight. Suppose we have a study that is subject to the biases of nonrandomized evidence. We are offered the opportunity to buy the integrity of randomized evidence by sacrificing precision through a modest reduction in the effective sample size. With this purchase, we gain the expectation of balance with respect to all potentially confounding factors, measured or unmeasured, known or unknown, subsequently gaining error control and the peace of mind that goes along with the avoidance of the biases associated with observational studies. Unless the gains from the use of historical data were of substantive magnitude and induced biases are confidently and rationally reasoned to be low risk and small magnitude, then a modest reduction in effective sample size is a small price to pay.

Brown et al. (10) implemented an alternative strategy, relaxing the statistical significance level to .15 in a trial evaluating treatments for newly diagnosed KMT2A-rearranged infant acute lymphoblastic leukemia. Japanese regulators have suggested this strategy in the development of drugs to treat rare diseases (11). This seemingly unattractive strategy has appealing features. The selection of statistical significance levels should be a conscious, calculated decision rather than a compulsory one, with thresholds defined based on study goals, carefully weighing the trade-offs of the consequences of incorrect decisions (12). The selected significance level is explicit and transparent, which, in well-conducted randomized trials, translates to control of error rates. In contrast, the true error probabilities associated with nonrandomized evidence are unknown without the infrastructure for fully understanding uncertainty that randomization provides. There are obvious scenarios where the significance level could be relaxed, for example, in exploratory settings where confirmatory evidence is expected to come from future studies or in the context of serious diseases for which there is no proven effective treatment. In such settings, a false negative result may be a more important error than a false positive result.
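A rough calculation, using assumed design values rather than anything from the trial in reference (10), shows why relaxing the significance level buys feasibility while keeping the error trade-off explicit: for a two-sided, two-sample comparison at 80% power, moving from alpha = .05 to alpha = .15 cuts the required per-arm sample size by roughly one-third.

```python
# Hypothetical sample-size arithmetic for relaxing the significance level.
# Effect size, power, and variance are assumptions for illustration only.
from math import ceil
from scipy.stats import norm

def n_per_arm(alpha, power, delta, sigma=1.0):
    """Per-arm n for a two-sided, two-sample z-test on a mean difference."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return ceil(2 * (sigma * (z_alpha + z_beta) / delta) ** 2)

for alpha in (0.05, 0.15):
    print(f"alpha = {alpha:.2f}: n per arm = {n_per_arm(alpha, 0.80, 0.5)}")
# alpha = 0.05 -> 63 per arm; alpha = 0.15 -> 42 per arm.  Because the trial
# remains fully randomized, the false-positive rate is known and controlled
# at whichever level is chosen.
```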

We encourage a focus on randomized evidence, particularly in most late-phase confirmatory settings. If researchers choose to implement designs that incorporate nonrandomized evidence, we advocate improved reporting at publication. Presenting a study first as "randomized" and only secondarily as augmented with historical data, without explicit clarification that the analyses are subject to the biases of observational studies, as has been done in many recent platform trials, is misleading. Readers may note that randomization was conducted and be led to believe that the data should be interpreted as randomized evidence. Cautionary language clarifying that the analyses should not be interpreted as randomized evidence, together with a clear statement of the assumptions required for valid statistical inference, will improve transparency and reduce the risk of inappropriate interpretation.

Funding

None.

Notes

Role of the funder: Not applicable.

Disclosures: SRE reports board membership for the American Statistical Association, Frontier Science Foundation; consulting fees from Takeda, Johnson & Johnson, ChemoCentryx, Becton Dickinson, Roivant, International Drug Development Institute; DSMB or advisory board service for the Breast International Group, Roche, Pfizer, Takeda, DayOneBio, Alexion, Tracon, Rakuten, Abbvie, GSK, Eli Lilly, Advantagene/Candel; and contracts from Aceragen. TH reports consulting fees from Tanabe-Mitsubishi Pharma K.K.; a contract from Aceragen; payment for lectures from Duke University and Johnson and Johnson K.K.; and payment for manuscript preparation from Cancer and Chemotherapy K.K. (Japanese Journal of Cancer and Chemotherapy).

Author contributions: SRE: writing—original draft; writing—review and editing. TH: writing—original draft; writing—review and editing.

Contributor Information

Scott R Evans, The Biostatistics Center, Milken Institute School of Public Health, George Washington University, Rockville, MD, USA; Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA.

Toshimitsu Hamasaki, The Biostatistics Center, Milken Institute School of Public Health, George Washington University, Rockville, MD, USA; Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA.

Data availability

No data are presented or analyzed in this editorial.

References

  1. Freidlin B, Korn EL. Augmenting randomized clinical trial data with historical control data: precision medicine applications. J Natl Cancer Inst. 2023;115(1):14-20.
  2. Evans SR. Radical thinking: scientific rigor and pragmatism. Stat Biopharm Res. 2022;14(2):140-152.
  3. Sacks H, Chalmers TC, Smith H Jr. Randomized versus historical controls for clinical trials. Am J Med. 1982;72(2):233-240.
  4. Papanikolaou PN, Christidi GD, Ioannidis JPA. Comparison of evidence on harms of medical interventions in randomized and nonrandomized studies. CMAJ. 2006;174(5):635-641.
  5. Hemkens LG, Contopoulos-Ioannidis DG, Ioannidis JPA. Agreement of treatment effects for mortality from routinely collected data and subsequent randomized trials: meta-epidemiological survey. BMJ. 2016;352:i493.
  6. Mello MM, Brennan TA. The controversy over high-dose chemotherapy with autologous bone marrow transplant for breast cancer. Health Aff (Millwood). 2001;20(5):101-117.
  7. Freidlin B, Korn EL. Assessing causal relationships between treatments and clinical outcomes: always read the fine print. Bone Marrow Transplant. 2012;47(5):626-632.
  8. Fisher RI, Gaynor ER, Dahlberg S, et al. Comparison of a standard regimen (CHOP) with three intensive chemotherapy regimens for advanced non-Hodgkin's lymphoma. N Engl J Med. 1993;328(14):1002-1006.
  9. Chen J, Ho M, Lee K, et al. The current landscape in biostatistics of real-world data and evidence: clinical study design and analysis. Stat Biopharm Res. 2021. doi:10.1080/19466315.2021.1883474.
  10. Brown PA, Kairalla JA, Hilden JM, et al. FLT3 inhibitor lestaurtinib plus chemotherapy for newly diagnosed KMT2A-rearranged infant acute lymphoblastic leukemia: Children's Oncology Group trial AALL0631. Leukemia. 2021;35(5):1279-1290.
  11. Japanese Ministry of Health, Labor, and Welfare (MHLW). Japanese Translation of ICH E9 with Questions and Answers. 1998. https://www.pmda.go.jp/files/000156112.pdf. Accessed September 30, 2022.
  12. Evans SR. Waking up to p: comment on "The role of p-values in judging the strength of evidence and realistic replication expectations". Stat Biopharm Res. 2021;13(1):19-21.


