ABSTRACT
Two questions regarding the scientific literature have become grist for public discussion: 1) what place should P values have in reporting the results of studies? 2) How should the perceived difficulty in replicating the results reported in published studies be addressed? We consider these questions to be 2 sides of the same coin; failing to address them can lead to an incomplete or incorrect message being sent to the reader. If P values (which are derived from the estimate of the effect size and a measure of the precision of the estimate of the effect) are used improperly, for example by reporting only significant findings, reporting P values without accounting for multiple comparisons, or failing to indicate the number of tests performed, the scientific record can be biased. Moreover, if there is a lack of transparency in the conduct of a study and the reporting of study results, it will not be possible to repeat the study in a manner that allows inferences from the original study to be reproduced, or to design and conduct a different experiment whose aim is to confirm the original study's findings. The goal of this article is to discuss how P values can be used in a manner that is consistent with the scientific method, and to increase transparency and reproducibility in the conduct and analysis of nutrition research.
Keywords: transparency, reproducibility, reliability, P value, strategies
Introduction
Recently, much has been written about probability values (i.e. P values) and their misuse in the medical literature (1–3). Additionally, concern has been expressed about the nonreproducibility of studies (4, 5). There are many types of nutrition-related research, which can broadly be divided into 2 categories: quantitative and qualitative. Although these 2 categories differ in some aspects of their methodology (and within each category there are additional differences), they share a need for transparent methodology and reporting standards if we are to improve the reproducibility of research. In addition to discussing limitations of P values, the goal of this article is to highlight actions that increase the transparency (6) and, ultimately, the reproducibility of research (4).
Karl Pearson formalized the concept of the P value as “The measure of the probability of a complex system of n errors occurring with a frequency as great or greater than that of the observed system” when he developed the chi-squared test in 1900 (7). The P value gives the probability that an effect as extreme or more extreme than the observed effect (the change attributed to the intervention) would be seen if the intervention truly had no effect on the measured outcome. The use of P values in statistical analyses of scientific hypotheses was promoted by Ronald Fisher in his seminal textbook, Statistical Methods for Research Workers (8). In a second textbook, The Design of Experiments (9), published in multiple editions from 1935 to 1971, Fisher introduced the concept of the null hypothesis in his description of a Gedankenexperiment, A Lady Tasting Tea. The thought experiment describes a randomized experiment designed to determine if, by taste alone, it was possible to determine whether milk or tea was poured first into a tea cup (10). Earlier investigators who used similar concepts include John Arbuthnot (11) and Pierre-Simon Laplace (12, 13). The P value has held its place in the testing of scientific hypotheses for >100 y.
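As a concrete illustration of how a P value is computed, consider the tea-tasting design itself. The following minimal sketch (in Python; the 8-cup layout with 4 cups of each type and a perfect classification is assumed here purely for illustration) obtains the exact 1-sided P value with Fisher's exact test: the probability, under the null hypothesis of no ability to discriminate, of an agreement at least as good as the one observed.

```python
# Illustrative sketch: exact P value for the "Lady Tasting Tea" design.
# Assumed setup (for illustration only): 8 cups, 4 milk-first and 4 tea-first,
# with every cup classified correctly by the taster.
from scipy.stats import fisher_exact

#                called milk-first  called tea-first
table = [[4, 0],   # truly milk-first
         [0, 4]]   # truly tea-first

# One-sided P value: probability of agreement at least this good
# if the taster were merely guessing (i.e. if H0, "no discrimination", were true).
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"P = {p_value:.4f}")   # 1/70 ≈ 0.0143
```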
The P value: what it does and does not tell us
In the biological sciences, a study generally starts with 2 hypotheses, a null hypothesis (referred to as H0), which serves as a straw man that we attempt to show is not correct, and the hypothesis of interest, the alternative hypothesis (referred to as Ha), that we try to show is supported by the data. The null hypothesis typically (but not always, e.g. when performing a noninferiority test) states that the effect of an intervention (or treatment) does not differ from the control state; the alternative hypothesis typically states that an intervention (or treatment) is more (or less) effective than the control. The P value is used to help determine if we should reject H0 and accept Ha. As noted below, failing to reject H0 does not mean that H0 is true.
P values have 4 important characteristics. (1) A P value is a probability value obtained from an indirect test of a hypothesis. The P value is not obtained from a test of the hypothesis of interest, Ha, but rather from a test of the null hypothesis, H0. H0 and Ha need to be mutually exclusive (if one is true the other must be false, and conversely) and collectively exhaustive (subsuming all possible outcome states). If the mutual exclusivity and collective exhaustiveness criteria hold, rejecting H0 implies that Ha, the alternative hypothesis of interest, is supported. The hypothesis of interest, Ha, and the null hypothesis, H0 (if not obvious), should be stated so that the hypothesis being tested is clear and to allow the reader to determine that mutual exclusivity and collective exhaustiveness hold.
(2) The P value that is traditionally used to define statistical significance, P < 0.05, is an arbitrary threshold; there is no scientific proof or theory indicating that 0.05 is the correct or optimal criterion to identify a significant value. The 0.05 value was proposed by Fisher in 1926 (14):
“If 1 in 20 does not seem high enough odds, we may, if we prefer it, draw the line at 1 in 50 (the 2 percent point), or 1 in 100 (the 1 percent point). Personally, the writer prefers to set a low standard of significance at the 5 percent point, and ignore entirely all results which fail to reach that level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.”
Because the P value denoting significance is an arbitrary value, the P value accepted as indicating significance must be selected a priori. More importantly, it must be understood that when something is said to be statistically significant, the division between significant and nonsignificant is done as a convenience and does not indicate the biological or clinical importance of a finding (a “nonsignificant” finding may be biologically important) nor that a finding should be deemed an absolute truth (a significant finding may be misleading, i.e. falsely positive).
(3) Failing to reject (e.g. P ≥ 0.05) the null hypothesis, H0, does not prove the null hypothesis. The P value is the probability of observing data as extreme as, or more extreme than, the observed data if H0 is true; it is not the probability that H0 is true. Ha might be true even though the data from a given experiment do not reject the null hypothesis, H0, and thus do not indirectly support the alternative hypothesis, Ha (i.e. H0 is not rejected even though Ha is correct). There are many reasons why H0 might not be rejected when the alternative hypothesis Ha is in fact true, including confounding, bias, measurement error, model misspecification (e.g. overadjustment), lack of statistical power (e.g. insufficient sample size or too few outcome events), and insufficient variation in the variable quantifying the exposure. In The Design of Experiments, Fisher wrote, “We may speak of this hypothesis as the ‘null hypothesis’ […] the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation” (15). Rather than proving the null hypothesis, H0, a nonsignificant result indicates that the null hypothesis remains tenable, that the data at hand do not provide sufficient evidence to reject H0 (16). Armitage and Berry (17) describe a nonsignificant result as follows:
“Although a ‘significant’ departure provides some degree of evidence against a null hypothesis, it is important to realize that a ‘nonsignificant’ departure does not provide positive evidence in favor of that hypothesis. The situation is rather that we have failed to find strong evidence against the null hypothesis. To draw an analogy with a court of law, the null hypothesis is rather like a presumption of innocence of an accused person. A significant result is then like a verdict of guilty, but a nonsignificant result is more like the Scottish verdict of ‘not proven’ than the English verdict of ‘not guilty’.”
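One of these reasons, lack of statistical power, is easy to quantify. The sketch below is a minimal illustration (the effect size and sample sizes are assumed values, not drawn from any particular study) using the statsmodels power routines: with a true standardized effect of 0.2 and 50 subjects per group, fewer than 1 in 5 studies would reject H0 at P < 0.05, even though H0 is false.

```python
# Sketch: low statistical power is one reason H0 may not be rejected
# even when the alternative hypothesis Ha is true.
# The effect size and sample sizes are assumed values for illustration.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 0.2          # true standardized effect (Cohen's d); H0 is false
for n_per_group in (50, 100, 400, 800):
    power = analysis.power(effect_size=effect_size, nobs1=n_per_group,
                           alpha=0.05, ratio=1.0, alternative="two-sided")
    print(f"n = {n_per_group:4d} per group: power = {power:.2f}")
# With n = 50 per group, power is only ~0.17: most such studies will
# "fail to reject" H0 even though a real effect exists.
```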
(4) The P value alone does not fully communicate the result of an experiment. The P value does not provide information about the effect size of a treatment or the strength of an association between exposure and outcome, nor does it quantify the magnitude of the treatment's impact on the outcome. It is important to understand that when other effects and variables are fixed (and when the 2 values being compared are not exactly equal), the P value is an inverse function of sample size. For an experiment where there is a true treatment effect, i.e. H0 is false, the larger the sample size, the smaller the P value for any given effect size (Figure 1).
FIGURE 1.
Relation between sample size and P value of an analysis from a 2-sample paired Student's t-test. The mean of 1 group was 0.0; the mean of the second group was 0.18. The SD of both groups was 1. Each estimate represents the mean of 10,000 simulations. The sample sizes for each group, experimental and control, in each of the simulations were 100, 125, 167, 250, 333, 500, 667, 833, and 1000 subjects. The horizontal dashed line represents the “standard” 2-tailed P value (0.05) used to establish significance.
Unlike a P value, the true treatment effect (e.g. the change brought about by a treatment) does not change with sample size, and the estimate of the true treatment effect is relatively stable as long as the sample size is reasonable. Because the P value depends on sample size whereas the estimate of the treatment effect is relatively stable, additional information should be given whenever a P value is used so that the P value can be properly understood. The additional information can include the sample size (and the number of events for a categorical outcome), the estimated treatment effect and the precision of this estimate (e.g. SD or SE), or the estimated treatment effect and a measure of its variability, such as a 95% CI. For a categorical outcome such as mortality, the number of events and nonevents (and, for survival analyses, the follow-up time) are critical values that should be reported regardless of whether data are reported with a P value or a CI.
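This behavior is easy to demonstrate by simulation. The sketch below is not the authors' Figure 1 code; it assumes the same true effect size (0.18, SD = 1) quoted in the figure legend but uses fewer replicates and an ordinary 2-sample t-test for brevity.

```python
# Sketch: for a fixed, nonzero true effect, the P value shrinks as the
# sample size grows, while the estimated effect size remains stable.
# Parameters loosely follow the Figure 1 legend; not the authors' code.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
true_effect, n_sims = 0.18, 2000

for n in (100, 250, 500, 1000):
    p_values, effects = [], []
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n)
        treated = rng.normal(true_effect, 1.0, n)
        _, p = ttest_ind(treated, control)
        p_values.append(p)
        effects.append(treated.mean() - control.mean())
    print(f"n = {n:4d}/group: median P = {np.median(p_values):.4f}, "
          f"mean effect estimate = {np.mean(effects):.3f}")
```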
Common conditions that can lead to incorrect P values
When designing or analyzing the results of a study, limited resources may lead investigators to test the effect of treatment on multiple endpoints of interest, and they may or may not report the results of all the tests, or even the total number of statistical tests performed. Performing multiple tests increases the experiment-wise error rate. The experiment-wise error rate is the probability across all comparisons performed of having at least one type-I error. A type-I error occurs when one incorrectly states that a treatment is effective when it is not effective (i.e. when the investigator improperly rejects H0 and accepts Ha when H0 is true). The experiment-wise error rate rapidly increases with the number of tests performed when the P value threshold for significance is large (e.g. 0.05 or 0.01), and at a slower rate when the P value denoting significance is reduced, e.g. 0.001 or 0.0005 (Figure 2). Performing multiple tests and reporting only a subset, such as those that are found to be statistically significant, should never be done, as this eliminates the ability to evaluate a significant finding in the context of its experiment-wise type-I error rate. When multiple tests are performed, if not obvious, the number of tests performed should be reported and an adjustment to the P value should be considered for multiple comparisons, including tests that are not reported. If no adjustment for multiple comparisons is made, this should be clearly indicated and justified in the article's methods section.
FIGURE 2.
The probability of ≥1 false-positive result from a series of tests (the number of tests appears on the abscissa) when the null hypothesis, H0, is true (i.e. when there is no difference between experimental and control groups). Each line represents a series of tests with a different significance threshold (i.e. type-I error rate): 0.0005, 0.001, 0.01, 0.025, and 0.05. The points of the lines were calculated using the formula: probability = 1 − (1 − type-I error rate)^(number of tests). See supplementary material for a derivation of the formula.
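The formula in the legend can be evaluated directly. The short sketch below reproduces the experiment-wise error rates plotted in Figure 2 for a few per-test thresholds, under the formula's assumption that the tests are independent.

```python
# Sketch: experiment-wise (family-wise) probability of >=1 false-positive
# result when all null hypotheses are true and the tests are independent:
#   P(>=1 false positive) = 1 - (1 - alpha)**n_tests
for alpha in (0.05, 0.01, 0.001):
    for n_tests in (1, 5, 10, 20, 50):
        p_any = 1 - (1 - alpha) ** n_tests
        print(f"alpha = {alpha:6.3f}, {n_tests:2d} tests: "
              f"P(>=1 false positive) = {p_any:.3f}")
```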
Commonly used methods that control the experiment-wise error rate include the Bonferroni adjustment, which is conservative (i.e. it can result in a very low experiment-wise type-I error rate at the cost of an increased probability of false negatives, that is, a substantial loss of power). Alternatively, the false discovery rate (FDR) can be controlled using the method of Benjamini and Hochberg (18, 19), an approach that preserves more power than the Bonferroni adjustment. Minimizing the experiment-wise error rate in the setting of regression analyses can be achieved using the methods described by Dunnett (20, 21), Scheffé (22), and Tukey-Kramer (23), among others. Designing an experiment so that a limited number of tests will be performed also effectively reduces the experiment-wise type-I error rate. As noted by Armitage (page 598),
“The danger of data dredging is usually reduced by the specification of 1 response variable, or perhaps a very small number of variables, as the primary endpoint, reflecting the main purpose of the trial. Differences in these variables between treatments is taken at their face value. Other variables are denoted as secondary endpoints. Differences in these are regarded as important but less clearly established, perhaps being subject to a multiplicity correction or providing candidates for exploration in further trials” (24).
Thus, the selection of the primary and secondary endpoints should be done a priori.
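As an illustration of the adjustment methods discussed above, the following sketch applies the Bonferroni and Benjamini-Hochberg procedures to a set of hypothetical raw P values (the values are invented for illustration) using the statsmodels multipletests routine.

```python
# Sketch: adjusting a family of P values for multiple comparisons.
# The raw P values below are hypothetical, chosen only for illustration.
from statsmodels.stats.multitest import multipletests

raw_p = [0.001, 0.008, 0.020, 0.041, 0.150, 0.600]

# Bonferroni: controls the experiment-wise (family-wise) error rate; conservative.
reject_bonf, p_bonf, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg: controls the false discovery rate; preserves more power.
reject_bh, p_bh, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")

for p, pb, rb, pf, rf in zip(raw_p, p_bonf, reject_bonf, p_bh, reject_bh):
    print(f"raw P = {p:.3f}  Bonferroni = {pb:.3f} ({'reject' if rb else 'keep'})"
          f"  BH-FDR = {pf:.3f} ({'reject' if rf else 'keep'})")
```

In this hypothetical example the Bonferroni adjustment declares 2 of the 6 tests significant at an experiment-wise level of 0.05, whereas the Benjamini-Hochberg procedure, which controls the FDR rather than the experiment-wise error rate, declares 3.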
For any nonzero effect size, whether the observed effect size approximates the true treatment effect or is the result of random variation, a sufficiently large sample will yield a P value that is less than the value (0.05) usually used to denote a significant result (Figure 1). As a result, a large sample size can increase the probability of a type-I error. When P values are used and the sample size is large, consideration should be given to using a threshold smaller than 0.05 to denote significance, in order to constrain the type-I error rate. The choice of the P value that will denote statistical significance should be made a priori. Interpretation of results in the setting of multiple comparisons or a large sample size, even with appropriate adjustment, must be undertaken with caution because there is no perfect adjustment method. For example, although adjustment can minimize the type-I error rate, doing so increases the probability of a type-II error, i.e. improperly concluding that an intervention is not effective when the intervention is truly effective.
Extreme data values can artificially inflate or deflate the effect size, can decrease the precision with which the effect size is estimated (i.e. increase the variance, SD, or SE of the estimate), and can have a profound influence on P values. Although the statistical methods section of a scientific article should describe the steps taken to identify and handle extreme values, a detailed description of the results of these analyses need not be reported. If an extreme value is biologically plausible, and the value cannot be ascribed to identifiable errors such as recording or assay errors (25), consideration should be given to reporting study results with and without the extreme values. Alternatively, data can be analyzed using approaches that are resistant (i.e. less sensitive) to extreme values, such as nonparametric methods or robust regression (26, 27). Review of data for extreme and influential outlier values should include exploratory data analyses (28): univariate analyses (e.g. stem-and-leaf plots), bivariate plots (e.g. BMI compared with height), and multivariable analyses (e.g. review of residual plots, computation of the variance inflation factor (28), and Cook's D in the setting of regression).
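The sketch below (simulated data, invented for illustration) shows 2 of the approaches named above: flagging influential observations with Cook's D computed from an ordinary least-squares fit, and refitting with a robust (Huber M-estimation) regression that down-weights the extreme value.

```python
# Sketch: identifying influential values (Cook's D) and using robust
# regression that is less sensitive to them. Data are simulated for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 60)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 60)
y[0] = 40.0                          # plant one extreme, influential value

X = sm.add_constant(x)

ols_fit = sm.OLS(y, X).fit()
cooks_d = ols_fit.get_influence().cooks_distance[0]
# 4/n is a common rule-of-thumb cutoff for flagging influential observations.
print("observations with Cook's D > 4/n:", np.where(cooks_d > 4 / len(y))[0])

# Robust regression (Huber M-estimator) down-weights the extreme value.
rlm_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()
print("OLS slope:", round(ols_fit.params[1], 3),
      " robust slope:", round(rlm_fit.params[1], 3))
```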
Transparent reporting of statistical methods and results
An experiment's study design and analysis plan must be clearly stated in the methods section of the article. If page limitations preclude a complete description of the methods, additional details can be provided as supplementary online material, but as full a description as possible should be included in the main text of the article. References to published methodology articles and preregistered statistical analysis plans can also be utilized. Ultimately, the methods section of the manuscript should be sufficiently detailed so that a third party could replicate the study reported in the article (29, 30). Similarly, the statistical methods section should describe the analytic approach in sufficient detail so that if provided with the data, the reported results could be reproduced by an independent third party. In addition to the descriptions given in the methods section of the article, for all clinical trials, the registration number of the detailed trial protocol (obtained at or before study initiation) should be reported (31). The NIH defines a clinical trial as (32) “A research study in which 1 or more human subjects are prospectively assigned to 1 or more interventions (which may include placebo or other control) to evaluate the effects of those interventions on health-related biomedical or behavioral outcomes.” The definition has been converted to a series of 4 questions (33). If the answer to all 4 questions is yes, then the NIH considers the study to be a clinical trial: 1) does the study involve human participants? 2) Are the participants prospectively assigned to an intervention? 3) Is the study designed to evaluate the effect of the intervention on the participants? 4) Is the effect being evaluated a health-related biomedical or behavioral outcome?
The trial registration should include prespecified descriptions of all interventions, outcomes, subgroups, and analyses to be performed. If analyses are undertaken that were not anticipated at the time the trial was registered, or if analyses that were anticipated are not performed, a statement indicating the change and an explanation for the change should be included in the article reporting study results. Clinical trials supported by the NIH must be registered at clinicaltrials.gov (34). Other registries acceptable to the journals of the American Society for Nutrition include, but are not limited to, ClinicalTrials.gov, ANZCTR (Australian New Zealand Clinical Trials Registry), ISRCTN (International Standard Randomized Controlled Trial Number), UMIN-CTR (UMIN Clinical Trials Registry), the Netherlands Trial Register, and any of the primary registries that participate in the WHO International Clinical Trials Registry Platform (https://www.who.int/clinical-trials-registry-platform/network/primary-registries) (35). Systematic reviews of existing literature should be registered; this can be done at PROSPERO (https://www.crd.york.ac.uk/prospero/).
Although making the study data and statistical programming code (R, SAS, SPSS, STATA, etc.) publicly available is not currently a universal standard of practice and not required by the American Society for Nutrition publications at this time, this practice serves to promote reproducibility and transparency in research. Of note, the NIH has mandated that all studies stemming from NIH-funded research have a data-sharing plan (taking into account any potential restrictions or limitations on sharing), effective as of 25 January, 2023 (https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html). A list of data-sharing repositories can be found at: https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html.
Study results can be divided into those that are hypothesis driven and those that are not, i.e. serendipitous post hoc findings. Although the former are preferred, the latter have their place, as a serendipitous post hoc finding can lead to new lines of research (36) and serve as impetus for future experiments. The a priori hypothesis that underlies the finding of a hypothesis-driven, preplanned experiment should be clearly stated in the methods section of the article. For a clinical trial, the study registration should specify the a priori hypothesis. Nonhypothesis-driven results, identified as such, are best presented as a measure of association and a CI without a P value. Formal statistical testing of a nonhypothesis-driven a posteriori observation is at best questionable, as it represents a test of a hypothesis chosen to fit the data rather than a test of whether the data indirectly support a hypothesis of interest. Investigators who report nonhypothesis-driven results often add an a posteriori hypothesis to their article, making the 2 types of publications look essentially the same. It is therefore essential, when reporting results, to distinguish between the 2.
Statistical analyses have inherent assumptions. For example, 1 assumption of a 2-group Student's t-test is that the data in the 2 groups are normally and identically distributed (i.e. the data being compared come from 2 normal distributions having the same variance). A linear regression has 4 basic assumptions: the relation between the predictor variables and the outcome is a straight line, the predictor (i.e. x) variables are independent, the residuals of the regression of the outcome on the predictor variables are normally distributed, and the variance of the residuals is constant (homoscedasticity). An important part of every analysis is assuring that the data conform to the assumptions of the statistical method used to analyze the data, and that the statistical method is appropriate. If the data do not conform to the assumptions of the statistical method, other approaches should be considered, e.g. data transformation or repeated measures analyses. Optimally, the steps taken to verify the assumptions should be described with the published results, or at least a statement that the assumptions were met should be provided. In addition to assuring that assumptions specific to the statistical method (e.g. linear regression compared with logistic regression) used to analyze study data are met, other factors that can bias a study's results need to be addressed. These factors include, among others, confounding, selection bias, and nonrandom systematic measurement error. Steps taken to avoid such sources of bias should be included in the methods and discussion sections of the article.
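As one concrete way of verifying such assumptions, the sketch below (simulated data, for illustration only) fits a linear regression and then examines normality of the residuals with the Shapiro-Wilk test and constancy of the residual variance with the Breusch-Pagan test; graphical review of residual plots remains an essential complement to such formal tests.

```python
# Sketch: checking two linear-regression assumptions after fitting a model.
# Data are simulated for illustration.
import numpy as np
import statsmodels.api as sm
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = 1.0 + 0.3 * x + rng.normal(0, 1, 200)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
residuals = fit.resid

# Normality of the residuals (Shapiro-Wilk).
w_stat, p_normal = shapiro(residuals)
print(f"Shapiro-Wilk P = {p_normal:.3f}  (small P suggests non-normal residuals)")

# Constant variance of the residuals (Breusch-Pagan).
_, p_homosced, _, _ = het_breuschpagan(residuals, X)
print(f"Breusch-Pagan P = {p_homosced:.3f}  (small P suggests heteroscedasticity)")
```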
When reporting results, both the effect size estimate and a measure of its precision should be given. There are several metrics of precision, including the SD, SE, and CI. Although each of these metrics has its place, unless there is a specific reason for reporting only the SD or SE, consideration should be given to reporting a CI as the default measure of precision. An advantage of giving the point estimate of a parameter along with a CI (generally expressed as a 95% CI) is that it gives a range in which the true value of the parameter is expected to lie (under the assumptions of no bias, measurement error, or model misspecification, described above). The combination of the point estimate and CI therefore gives more information than an effect estimate and its P value, and certainly more than a P value alone. The P value says nothing about the magnitude of the observed point estimate (or effect) nor the range of values in which the true value is believed to lie; it simply provides the probability of seeing data as extreme as, or more extreme than, the observed data if the null hypothesis is true. Estimates of precision should accompany effect estimates in the manuscript text, tables (in numeric form), and figures (as error bars). When the SD or SE is reported, the reader must be informed which is being used. Similarly, because various CIs can be computed, it is necessary that the size of the interval, e.g. 95%, be stated. A list of guidelines for reporting study data (including data from qualitative research) has been compiled by the EQUATOR network (www.equator-network.org).
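A minimal sketch of this style of reporting is given below (simulated data; the group means and sample sizes are invented for illustration): the mean difference between 2 groups is reported together with its 95% CI and the P value, rather than the P value alone.

```python
# Sketch: report the effect estimate and its 95% CI, not just the P value.
# Data are simulated for illustration.
import numpy as np
from statsmodels.stats.weightstats import CompareMeans, DescrStatsW

rng = np.random.default_rng(3)
treated = rng.normal(5.4, 1.0, 80)   # hypothetical outcome in the treated group
control = rng.normal(5.0, 1.0, 80)   # hypothetical outcome in the control group

cm = CompareMeans(DescrStatsW(treated), DescrStatsW(control))
diff = treated.mean() - control.mean()
low, high = cm.tconfint_diff(alpha=0.05, usevar="unequal")   # 95% CI
t_stat, p_value, _ = cm.ttest_ind(usevar="unequal")

print(f"mean difference = {diff:.2f} (95% CI: {low:.2f}, {high:.2f}), P = {p_value:.3f}")
```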
Future of the P value
Some scientists have advocated banning P values (2) because they have been misinterpreted, misused, and misunderstood (37). We do not agree that P values should be banned. As noted above, P values have served research for >100 y and, with proper use, may continue to do so. Medical and nutritional research often require making a binary choice: to declare a treatment effective or not, to endorse 1 set of nutritional recommendations or another, to investigate further or move on to another question. Clinical guidelines and treatment decisions do not generally rely on a single study's outcome; they are derived from an accumulation of effect estimates from the literature. P values can help inform the accumulation and the subsequent binary decision. Although P values can be helpful, when used, they should always be accompanied by the sample size, an estimated effect size, and a measure of the precision of the estimate. Ultimately, any inference derived from a study is strongest when it is driven by all available information, including the quality of the study, the potential for bias, the point estimate and its precision, the P value, and the nature and importance of the question being asked.
In addition to summarizing study findings, the discussion section of an article should address the clinical or public health importance of the reported results. A finding can be statistically significant but not clinically meaningful. Conversely, a finding can be clinically important but not statistically significant. To help the reader understand the importance of a study's results, it can be helpful to report a minimal clinically important difference (MCID) and to compare the MCID with the estimate of the effect size, where applicable.
In summary (Table 1), when P values are used, their limitations must be understood, and they should be used properly. If the null hypothesis is not rejected, it is inappropriate to conclude that the null hypothesis has been proven. P values should be accompanied by a statement of the sample size, an estimate of the treatment effect, and a measure of its variability. It must be understood that any statement of significance compared with nonsignificance is based on an arbitrary cut-point and does not represent absolute truth. The distinction between a statistically significant and a nonsignificant finding is a convenience that aids the decision-making process; the practical application of a statement regarding significance depends upon the situation to which it is to be applied. It is critical for the scientific community to take steps that increase the transparency and reproducibility of study results. These steps include, but are not limited to: reporting the total number of hypotheses tested over the course of the experiment; adjusting for multiple comparisons when warranted (to minimize the experiment-wise error rate); eschewing the reporting of only statistically significant results; and describing the study plan and statistical methods in sufficient detail to allow the study to be reproduced by a third party, both in the methods section of the article and, for clinical trials, in an accredited trial registry. In conclusion, the P value remains useful, but must be applied and interpreted with care.
TABLE 1.
Summary of concepts that should be addressed when reporting study results
| Topic | Concept to be addressed |
| --- | --- |
| P value | P = 0.05, a cut-point frequently used to denote statistical significance, is an arbitrary value that is not evidence based |
| | The P value chosen to indicate significance, and whether the test is 1- or 2-sided, should be defined a priori and should be stated |
| | The distinction between statistically nonsignificant and significant is a convenience. It does not indicate that a finding should be completely ignored or, conversely, endorsed |
| | A statistically significant finding is not necessarily clinically important; a nonsignificant finding from an adequately powered study is not necessarily of no importance |
| | Failing to reject the null hypothesis H0 (e.g. P > 0.05) does not prove the null hypothesis is true |
| | P values should always be accompanied by an estimate of effect size and a measure of the estimate's precision (e.g. SD, SE, or 95% CI) |
| Methods | The description of research methods and statistical approach should be sufficiently detailed to allow a third party to replicate the study |
| | It is essential to distinguish between hypothesis-driven results and results that are nonhypothesis driven, i.e. serendipitous, post hoc findings |
| | The null and alternative hypotheses should be explicitly stated if they are not obvious |
| | The steps taken to avoid bias (e.g. confounding and nonrandom systematic measurement error) should be described |
| | The steps taken to assure that the statistical assumptions are met should be described |
| | The steps taken to identify and handle extreme data values should be described |
| Reporting results | The sample size, effect size, and a measure of the precision of the estimate should be given; a P value alone does not communicate this information. For an analysis where the outcome is a categorical event such as mortality, the number of events (and, if the study is a survival analysis, the follow-up time) should be given in addition to the sample size |
| | The number of tests performed should be reported if the number is not obvious or if some tests were not reported. Adjustment for multiple testing should be considered and, if done, the method used to adjust should be described |
| | Measures of precision should be included in the manuscript text, tables (in numeric form), and figures (as error bars) |
Supplementary Material
Acknowledgments
JDS would like to thank Reubin Andres who started him on his journey, and Amelia M Andres who supplied nutrition along the way. Our thanks to Steve Brooks and Hope Weiler who read the manuscript and made helpful suggestions that improved the wording of the manuscript.
The authors’ responsibilities were as follows—JDS: wrote the manuscript; PAMS, AJM, AA, RLP, BBH, JO, TAD, KLT, CPD, and DT: edited the manuscript; and all authors: read and approved the final manuscript. JDS is a member of the American Society for Nutrition's Statistical Review Board; MM, PAMS, AJM, and AA are Associate Editors of the American Journal of Clinical Nutrition; JO is Editor-in-Chief of Current Developments in Nutrition; TAD is Editor-in-Chief of The Journal of Nutrition; KLT is Editor-in-Chief of Advances in Nutrition; CPD is Editor-in-Chief of the American Journal of Clinical Nutrition; DT is Academic Editor of the American Journal of Clinical Nutrition; RLP reports no conflicts of interest.
Notes
JDS was supported by the Baltimore VA Medical Center GRECC, NIA AG028747 and NIDDK P30 DK072488, CPD was supported by K24 DK104676 and P30 DK04056, BBH was supported by VA RR&D Grant 5I21RX003169-02, TAD was supported by HD-085573, HD-099080, and USDA CRIS 3092-51000-060.
Supplemental Material is available from the “Supplementary data” link in the online posting of the article and from the same link in the online table of contents at https://academic.oup.com/ajcn/.
Contributor Information
John D Sorkin, Geriatric Research, Education, and Clinical Center, Baltimore VA Medical Center, Baltimore, MD, USA; Department of Medicine, Division of Gerontology, Geriatrics, and Palliative Medicine, University of Maryland School of Medicine, Baltimore, MD, USA.
Mark Manary, Department of Pediatrics, Washington University, St. Louis, MO, USA.
Paul A M Smeets, Division of Human Nutrition and Health, Wageningen University, Wageningen, The Netherlands.
Amanda J MacFarlane, Nutrition Research Division, Health Canada, Ottawa, Ontario, Canada; Department of Biology, Carleton University, Ottawa, Ontario, Canada.
Arne Astrup, Novo Nordisk Foundation, Centre for Healthy Weight, Hellerup, Denmark.
Ronald L Prigeon, Independent Scholar, Baltimore MD, USA.
Beth B Hogans, Geriatric Research, Education, and Clinical Center, Baltimore VA Medical Center, Baltimore, MD, USA; Department of Neurology, Johns Hopkins School of Medicine, Baltimore MD, USA.
Jack Odle, Department of Animal Science, North Carolina State University, Raleigh, NC, USA.
Teresa A Davis, USDA/Agricultural Research Service, Children's Nutrition Research Center, Department of Pediatrics, Baylor College of Medicine, Houston, TX, USA.
Katherine L Tucker, Department of Biomedical and Nutritional Sciences and Center for Population Health, University of Massachusetts Lowell, Lowell, MA, USA.
Christopher P Duggan, Center for Nutrition, Division of Gastroenterology, Hepatology and Nutrition, Boston Children's Hospital, Boston, MA, USA; Department of Nutrition, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
Deirdre K Tobias, Division of Preventive Medicine, Brigham and Women's Hospital, Boston, MA, USA; Harvard Medical School and Department of Nutrition, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
References
- 1. Harrington D, D'Agostino RB, Gatsonis C, Hogan JW, Hunter DJ, Normand S-LT, Drazen JM, Hamel MB. New guidelines for statistical reporting in the journal. N Engl J Med. 2019;381(3):285–6.
- 2. Trafimow D. Editorial. Basic Appl Soc Psychol. 2014;36(1):1–2.
- 3. Wasserstein RL, Lazar NA. The ASA statement on P-values: context, process, and purpose. The American Statistician. 2016;70(2):129–33.
- 4. Rigor and reproducibility. [Internet]. 20 Jan, 2021; Available from: https://www.nih.gov/research-training/rigor-reproducibility.
- 5. Li SX, Imamura F, Ye Z, Schulze MB, Zheng J, Ardanaz E, Arriola L, Boeing H, Dow C, Fagherazzi G, et al. Interaction between genes and macronutrient intake on the risk of developing type 2 diabetes: systematic review and findings from European Prospective Investigation into Cancer (EPIC)-InterAct. Am J Clin Nutr. 2017;106(1):263–75.
- 6. Garza C, Stover PJ, Ohlhorst SD, Field MS, Steinbrook R, Rowe S, Woteki C, Campbell E. Best practices in nutrition science to earn and keep the public's trust. Am J Clin Nutr. 2019;109(1):225–43.
- 7. Pearson K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science. 1900;50(302):157–75.
- 8. Fisher RA. Statistical Methods for Research Workers. Edinburgh, London: Oliver and Boyd, 1925.
- 9. Fisher RA. The Design of Experiments. Edinburgh, London: Oliver and Boyd, 1935. xi, p. 252.
- 10. Salsburg D. The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century. New York: W.H. Freeman, 2001. xi, p. 340.
- 11. Arbuthnot J. An argument for divine providence, taken from the constant regularity observed in the births of both sexes. By Dr. John Arbuthnott, Physitian in Ordinary to Her Majesty, and Fellow of the College of Physitians and the Royal Society. Phil Trans R Soc. 1710;27:186–90.
- 12. Stigler SM. The History of Statistics: The Measurement of Uncertainty before 1900. Cambridge, MA: Belknap Press of Harvard University Press, 1986. xvi, p. 410.
- 13. P-Value. [Internet]. Available from: https://en.wikipedia.org/wiki/P-value.
- 14. Fisher RA. The arrangement of field experiments. Journal of the Ministry of Agriculture. 1926;33:503–15.
- 15. Simpson JA, Weiner ESC, Berg DL. The Compact Oxford English Dictionary: Complete Text Reproduced Micrographically. Oxford, New York: Clarendon Press; Oxford University Press, 1991, 1187, Sense 4. e. Statistics.
- 16. Snedecor GW, Cochran WG. Statistical Methods. 8th ed. Ames: Iowa State University Press, 1989. xx, p. 503.
- 17. Armitage P, Berry G. Statistical Methods in Medical Research. 2nd ed. Oxford: Blackwell Scientific, 1987, 96.
- 18. Hochberg Y, Benjamini Y. More powerful procedures for multiple significance testing. Stat Med. 1990;9(7):811–8.
- 19. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological). 1995;57(1):289–300.
- 20. Dunnett CW. New tables for multiple comparisons with a control. Biometrics. 1964;20(3):482–91.
- 21. Dunnett's test. Wikipedia. [Internet]. 20 Jan, 2021; Available from: https://en.wikipedia.org/wiki/Dunnett%27s_test.
- 22. Scheffé H. The Analysis of Variance. Wiley Classics Library ed. A Wiley Publication in Mathematical Statistics. New York: Wiley-Interscience Publication, 1999. xvi, p. 477.
- 23. Tukey JW. Comparing individual means in the analysis of variance. Biometrics. 1949;5(2):99–114.
- 24. Armitage P, Berry G, Matthews JNS. Statistical Methods in Medical Research. 4th ed. Malden, MA: Blackwell Science, 2001, xi, 817.
- 25. O'Callaghan KM, Roth DE. Standardization of laboratory practices and reporting of biomarker data in clinical nutrition research. Am J Clin Nutr. 2019;112(Suppl 1):453S–7S.
- 26. Robust regression | SAS data analysis examples. [Internet]. 7 April, 2021; Available from: https://stats.idre.ucla.edu/sas/dae/robust-regression/.
- 27. Belsley DA, Kuh E, Welsch RE. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley Series in Probability and Mathematical Statistics. New York: Wiley, 1980, xv, p. 292.
- 28. Hoaglin DC, Mosteller F, Tukey JW. Understanding Robust and Exploratory Data Analysis. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. New York: Wiley, 1983. xvi, p. 447.
- 29. Kroeger CM, Garza C, Lynch CJ, Myers E, Rowe S, Schneeman BO, Sharma AM, Allison DB. Scientific rigor and credibility in the nutrition research landscape. Am J Clin Nutr. 2018;107(3):484–94.
- 30. Sorkin BC, Kuszak AJ, Williamson JS, Hopp DC, Betz JM. The challenge of reproducibility and accuracy in nutrition research: resources and pitfalls. Adv Nutr. 2016;7(2):383–9.
- 31. ICMJE (International Committee of Medical Journal Editors). Clinical trials 1. Registration. [Internet]. [cited 11 Jan, 2020]; Available from: http://www.icmje.org/recommendations/browse/publishing-and-editorial-issues/clinical-trial-registration.html.
- 32. Clinical trial research. [Internet]. [cited 19 Dec, 2020]; Available from: https://www.niaid.nih.gov/grants-contracts/clinical-trial-research.
- 33. NIH definition of clinical trial case studies. [Internet]. 19 Dec, 2020; Available from: https://grants.nih.gov/policy/clinical-trials/case-studies.htm.
- 34. Clinical trials registration and results information submission. A rule by the Health and Human Services Department on 09/21/2016. Federal Register. The Daily Journal of the United States Government. [Internet]. [cited 11 Jan, 2020]; Available from: https://www.federalregister.gov/documents/2016/09/21/2016-22129/clinical-trials-registration-and-results-information-submission.
- 35. AJCN Instructions for Authors. [Internet]. [cited 11 Jan, 2020]; Available from: https://academic.oup.com/ajcn/pages/General_Instructions#Research%20Registration.
- 36. Viagra: how a little blue pill changed the world. Feb 16, 2019. [Internet]. [cited 1 Jan, 2020]; Available from: https://www.drugs.com/slideshow/viagra-little-blue-pill-1043.
- 37. Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, Altman DG. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31(4):337–50.