Abstract
During the last decade, evidence‐based medicine has been described as a paradigm shift in clinical practice, and as “the conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients”. Appropriate statistical methods for analyzing data are critical for the correct interpretation of results and thus for establishing the evidence. However, in the medical literature, these methods are often incorrectly applied or misinterpreted, leading to serious methodological errors. This review highlights several important aspects of design and statistical analysis for evidence‐based reproductive medicine. First, we clarify the distinction between ratios, proportions, and rates, and provide a definition of pregnancy rate. Second, we focus on a special type of bias called ‘confounding bias’, which occurs when a factor is associated with both the exposure and the disease but is not part of the causal pathway. Finally, we present concerns regarding the misuse of statistical software and the application of inappropriate statistical methods, especially in medical research.
Keywords: Biostatistical methods, Confounding factor, Evidence‐based reproductive medicine, Exploratory data analysis, Pregnancy rate
Abbreviations
- ANCOVA
Analysis of covariance
- ANOVA
Analysis of variance
- ART
Assisted reproduction technologies
- BMI
Body mass index
- EBM
Evidence-based medicine
- EDA
Exploratory data analysis
- HCG
Human chorionic gonadotropin
- NEJM
New England Journal of Medicine
- ROC
Receiver operating characteristic curves
Introduction
Reproductive medicine has made remarkable progress over the past 50 years with the introduction of new medicines, diagnostic tools, and advanced reproductive technology. Undoubtedly, reproductive medicine now offers many options that allow couples to have a happy and healthy reproductive life [1, 2, 3, 4, 5, 6]. A new paradigm called evidence‐based medicine (EBM) has emerged, which places less emphasis on intuition and on clinical experience obtained in a nonsystematic manner as sufficient grounds for making clinical decisions [7]. EBM focuses on tackling clinical problems using existing research findings, which must be sought out and evaluated using formal rules for the critical appraisal of evidence. Clinicians and medical researchers have therefore entered a new era of medical practice that requires them to critically appraise the design, conduct, and data analysis of a study in order to interpret its results correctly and better serve patients [8, 9, 10, 11].
EBM has established a consistent basis for the use of statistical information, in which findings are classified as true, false, or due to chance [11, 12]. The inappropriate use of statistical methods is therefore a serious problem: it may lead to incorrect conclusions, artificial results, and a waste of valuable resources [13, 14, 15]. In medical research, there are two potential sources of error that may lead to incorrect conclusions about the validity of study results: random error and systematic error (bias) [16]. Random error is always present in a measurement and is caused by unknown and unpredictable changes that may occur in the measuring instruments or in the environmental conditions. There are two basic ways to reduce random error in a study: increasing the sample size and reducing variability in measurements. The other type of error is systematic error, defined as the “systematic tendency of any factors associated with the design, conduct, analysis and evaluation of the results to make the estimate of any effect deviate from its true value”. For medical research, critical appraisal focuses on four general types of systematic error: selection bias (systematic differences in the comparison groups attributable to incomplete randomization), performance bias (systematic differences in the care provided apart from the intervention being evaluated), exclusion bias (systematic differences in withdrawals from the trial), and detection bias (systematic differences in outcome assessment) [17]. A primary objective in study design is to avoid systematic error, since this error arises from poor study design. There are essentially two ways to reduce bias: randomization and retrospective adjustment for perceived sources of bias [16, 17]. Therefore, in medical research, it is critical to adhere to statistical principles and follow a sound statistical methodology to minimize bias and maximize precision [18].
The misuse of statistical methods is considered unethical and can have serious clinical consequences [15, 19, 20].
In the first section of this review, we explain the differences between rates, ratios, and proportions, and provide a definition of pregnancy rate. We then provide an example demonstrating that the same data can lead to dramatically different conclusions, focusing on a special type of bias called ‘confounding bias’. Next, we present concerns regarding the misuse of statistical software and the application of inappropriate statistical methods in medical research. Finally, we review which statistical methods are frequently used in medical research and present recent trends and perspectives.
Definition of pregnancy rate
The terms rate, ratio, and proportion are often used inappropriately. In epidemiology, these three terms refer to types of calculations used to describe and compare measures of disease occurrence. A ratio is defined as the result obtained by dividing one quantity by another (r = a/b) [17, 21, 22, 23]. In the general form of a ratio, there need not be any specified relationship between the numerator and denominator. For example, the ‘sex ratio’ is a ratio of two unrelated numbers: the number of males divided by the number of females in a given population, usually expressed as the number of males per 100 females. Another example is the body mass index (BMI), the weight in kilograms divided by the square of the height in meters (kg/m²), which is a simple index of weight‐for‐height commonly used to classify adults as underweight, overweight, or obese.
A proportion is a type of ratio in which the numerator is part of the denominator (p = a/(a + b)), so the entities represented by the two numbers are related to one another [21]. Proportions, also known as fractions, are often expressed as percentages and range from 0 to 1 (or 0 to 100%). A proportion can describe, for example, the fraction of a population in which a particular event, such as the occurrence of a disease, has happened before a given time. The incidence proportion (also known as cumulative incidence) is the number of new cases during a specified period divided by the number of subjects at risk in the population at the beginning of the study; for example, if a population initially contains 1,000 nondiseased individuals and 28 of them develop a condition over 2 years of observation, the incidence proportion is 28 cases per 1,000 individuals (i.e., 2.8%). Proportions are often misidentified as ‘rates’ (e.g., attack ‘rate’ and relapse ‘rate’).
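As a minimal Python sketch of the distinctions above (the population counts and the 70 kg / 1.75 m body measurements are hypothetical values chosen purely for illustration; the incidence example uses the numbers from the text):

```python
# Ratio: numerator and denominator need not be related
# (hypothetical counts: 105 males per 100 females).
males, females = 105, 100
sex_ratio = males / females                  # 1.05 males per female

# BMI is also a ratio: weight (kg) divided by squared height (m^2)
# (hypothetical person: 70 kg, 1.75 m).
bmi = 70 / 1.75 ** 2                         # ~22.9 kg/m^2

# Proportion: numerator is part of the denominator (the text's example:
# 28 new cases among 1,000 initially disease-free individuals over 2 years).
incidence_proportion = 28 / 1000             # 0.028, i.e., 2.8%
```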
For rates, time is an integral part of the denominator; a rate is defined as the number of events in a group of individuals at risk for the event divided by the total time units contributed by those individuals while at risk [17, 21, 22, 23]. The term ‘rate’ describes how fast something is happening, much as a car's speedometer indicates the rate of travel in kilometers per hour. In Fig. 1, by the end of the observational study, 3 of the individuals had died, and the individuals were observed for a total of 30 person‐years; the measure of mortality is therefore a rate (3 deaths/30 person‐years = 0.1 deaths per person‐year). Expressed per 10,000 person‐years, this mortality rate is 1,000 deaths per 10,000 person‐years. The denominator, person‐years, can be converted into any interval appropriate to the disease being studied; epidemiologists generally use 100,000 person‐years for rare diseases and those that take a long time to develop [17]. By contrast, if we calculate the ratio of the number of deaths to the number of individuals at the beginning of the time interval, we obtain a proportion, not a rate. In Fig. 1, the proportion of deaths is 0.6 (= 3/5), so the rate and the proportion differ considerably; in epidemics, especially when disease development is very rapid, the two quantities may diverge even further.
Figure 1.

Measurement of person‐time in a hypothetical population
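A short Python sketch of the Fig. 1 calculation (3 deaths among 5 individuals who together contributed 30 person‐years of observation) makes the difference between the rate and the proportion explicit:

```python
deaths = 3
individuals_at_start = 5
person_years = 30                    # total observation time contributed

# Rate: events per unit of person-time
mortality_rate = deaths / person_years            # 0.1 deaths per person-year
rate_per_10000 = mortality_rate * 10000           # 1,000 per 10,000 person-years

# Proportion: events per individual at the start of the interval
death_proportion = deaths / individuals_at_start  # 0.6 -- a very different number
```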
In reproductive medicine, the pregnancy rate is used to indicate the success rate of treatment. Although there are several definitions of pregnancy, clinical pregnancy is defined by the Society for Assisted Reproductive Technology as the visualization of a gestational sac in the uterine cavity by ultrasound [24, 25]. The pregnancy proportion is defined as the total number of clinical pregnancies divided by the total number of treated patients. The pregnancy rate, by contrast, is defined as the number of pregnancies observed multiplied by 12 months, divided by the number of women observed multiplied by the number of months of observation. For example, if 200 women received a fertility treatment for 24 months and 40 of them became pregnant, the pregnancy rate would be 10 per 100 woman‐years, and the pregnancy proportion would be 0.2 (= 40/200). Pregnancy rates are often used to compare institutes and treatments. However, such comparisons of clinic success rates may not be meaningful, because patient characteristics, treatment approaches, and entrance criteria for assisted reproduction technologies (ART) may vary from clinic to clinic.
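The worked example in the text (200 women treated for 24 months, 40 pregnancies) can be sketched directly from the two definitions:

```python
pregnancies = 40
women = 200
months_observed = 24

# Pregnancy rate: (pregnancies x 12) / (women x months observed),
# i.e., pregnancies per woman-year of observation.
woman_years = women * months_observed / 12                    # 400 woman-years
rate_per_100_woman_years = pregnancies / woman_years * 100    # 10 per 100 woman-years

# Pregnancy proportion: pregnancies / treated patients.
pregnancy_proportion = pregnancies / women                    # 0.2
```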
Simpson's paradox
Yule [26] and Simpson [27] described a statistical paradox that can occur when data are aggregated: an association between two dichotomous variables may point in the same direction within each subgroup of a population, yet reverse its sign when the subgroups are pooled without stratification [26, 27, 28, 29, 30, 31]. Several examples based on real data have been published [31, 32, 33, 34]. To provide an intuitive explanation, we present an example of Simpson's paradox with virtual data comparing the pregnancy proportions of two institutes. Table 1, the marginal table for the relationship between institute and pregnancy, shows that the pregnancy proportion of Hospital B is higher than that of Hospital A, since P(pregnancy|Hospital A) = 60/220 = 0.27 < P(pregnancy|Hospital B) = 84/220 = 0.38 (Fisher's exact test, P = 0.019). Should you therefore conclude that Hospital B is better than Hospital A?
Table 1.
Overall data on pregnancy proportion in two institutes
| Institute | No. of pregnancies | No. of patients | Pregnancy proportion (%) |
|---|---|---|---|
| Hospital A | 60 | 220 | 27 |
| Hospital B | 84 | 220 | 38 |
Fisher's exact test: P = 0.019
If these patients were divided into two groups by age [i.e., young patients (< 35 years old) and old patients (≥ 35 years old)], then the data interpretation changes: the three‐way table representing this relation based on age is shown in Table 2. For the subgroup of younger patients, Hospital A is slightly better than Hospital B, since P(pregnancy|Hospital A & young) = 10/20 = 0.50 > P(pregnancy|Hospital B & young) = 80/200 = 0.40. Even in the subgroup of older patients, Hospital A is still slightly better [P(pregnancy|Hospital A & old) = 50/200 = 0.25 > P(pregnancy|Hospital B & old) = 4/20 = 0.20]. However, the difference in each subgroup did not achieve statistical significance (Fisher's P value = 0.48 and 0.79, respectively).
Table 2.
Pregnancy proportion in two institutes by patient's age
| Institute | No. of pregnancies (age < 35) | No. of patients (age < 35) | Pregnancy proportion, % (age < 35) | No. of pregnancies (age ≥ 35) | No. of patients (age ≥ 35) | Pregnancy proportion, % (age ≥ 35) |
|---|---|---|---|---|---|---|
| Hospital A | 10 | 20 | 50 | 50 | 200 | 25 |
| Hospital B | 80 | 200 | 40 | 4 | 20 | 20 |
Fisher's exact test: P = 0.48 (age < 35), P = 0.79 (age ≥ 35)
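The reversal shown in Tables 1 and 2 can be verified directly from the counts; the sketch below reproduces the proportions in the text (the P values would additionally require Fisher's exact test, which is omitted here):

```python
# (pregnancies, patients) per hospital and age stratum (Table 2)
hospital_a = {"young": (10, 20), "old": (50, 200)}
hospital_b = {"young": (80, 200), "old": (4, 20)}

def proportion(pregnant, patients):
    return pregnant / patients

# Aggregated (Table 1): Hospital B appears better ...
agg_a = proportion(10 + 50, 20 + 200)   # 60/220 ~ 0.27
agg_b = proportion(80 + 4, 200 + 20)    # 84/220 ~ 0.38
better_aggregated = "B" if agg_b > agg_a else "A"

# ... but within every age stratum Hospital A is better: Simpson's paradox.
better_in_every_stratum = all(
    proportion(*hospital_a[s]) > proportion(*hospital_b[s])
    for s in ("young", "old")
)
```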
This phenomenon is paradoxical because two different conclusions follow depending on whether we account for patient age. According to published studies [35, 36], a patient's age appears to be the most important factor affecting pregnancy outcome after ART. The third variable (in this case, age) that causes the reversal of the direction of association is called a confounding variable. Confounding comes from the Latin confundere, ‘to mix together’ [17], and is defined in an epidemiology dictionary as a “situation in which a measure of the effect of an exposure on risk is distorted because of the association of exposure with other factors that influence the outcome under study” [17, 22, 33].
In this example, the confounding by ‘age’ arises from the disproportionate allocation of younger and older patients to the two institutes. Although we do not know the reason for this disproportionate allocation, the patients somehow selected themselves into the two institutes. In Table 2, the numbers are flipped between the two groups: 20 young patients and 200 old patients in Hospital A, and 200 young patients and 20 old patients in Hospital B. Because younger patients had a higher pregnancy proportion at both institutes, Hospital A was penalized in the aggregated comparison simply because it treated fewer young patients. Hence, two conditions play a role in this example: the existence of a confounding variable (age), and the disproportionate allocation of age levels between the two institutes.
To avoid Simpson's paradox, one should select a design that generates balanced group sample sizes and apply appropriate statistical control procedures to account for potential confounding factors [28, 29, 37, 38, 39]. However, even in a carefully designed study, it is not possible to control all confounding variables, because some are unknown a priori. One of the most useful statistical procedures for adjusting for potential confounding variables is the analysis of covariance (ANCOVA), which allows comparison of one variable between two or more groups while taking into account the variability of other variables, called covariates [17, 28]. Meticulous approaches to study design and statistical analysis can reduce the risk of Simpson's paradox.
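For a binary outcome such as pregnancy, one standard stratified adjustment (not discussed in the text; offered here only as an illustrative sketch) is the Mantel–Haenszel pooled odds ratio, which combines the stratum-specific 2×2 tables of Table 2 instead of collapsing them:

```python
# Each stratum: [[A pregnant, A not pregnant], [B pregnant, B not pregnant]]
strata = [
    [[10, 10], [80, 120]],   # age < 35
    [[50, 150], [4, 16]],    # age >= 35
]

def mantel_haenszel_or(tables):
    """Mantel-Haenszel pooled odds ratio across 2x2 strata."""
    num = den = 0.0
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den

crude_or = (60 * 136) / (160 * 84)        # ~0.61: pooled data favor Hospital B
adjusted_or = mantel_haenszel_or(strata)  # ~1.43: after age adjustment, Hospital A
```

The crude odds ratio below 1 and the adjusted odds ratio above 1 express the same reversal seen in the tables.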
Misuse of statistical software
Recently, statistical software has become considerably more specialized and sophisticated as the power and versatility of personal computers have increased. Many statistical software packages are currently available for diverse areas such as medical science, engineering, business and marketing, and the social sciences [40], and they allow users to quickly calculate and display results that researchers previously computed by hand. Although the availability of multifaceted statistical software packages makes it easy for statistically unskilled researchers to conduct their own data analysis, this can lead to misinterpretation, misuse, and abuse of statistics arising from insufficient knowledge of the underlying mathematical concepts or statistical ideas [41]. Here, we describe an example of the misuse of statistics and statistical software.
Regression analysis is applied to examine the relationship between a response variable (Y) and a set of explanatory variables (X) for a wide variety of purposes. In particular, linear regression is one of the simpler and easier‐to‐use modeling techniques among a variety of regression analyses. To illustrate the method of simple regression, consider the fictitious data in Table 3. Data from 11 women were examined in an investigation of the relationship between human chorionic gonadotropin (HCG) and other hormones. These data were generated based on Anscombe's regression example data [42]. The example data in Table 3 were plotted in Fig. 2a–d. To investigate the relationships, we applied a simple linear regression model using the following formula:
Table 3.
Virtual data of relationship between HCG (X) and hormone (Y) in each experiment
| Observation No. | Experiment 1 | Experiment 2 | Experiment 3 | Experiment 4 | ||||
|---|---|---|---|---|---|---|---|---|
| X 1 | Y 1 | X 2 | Y 2 | X 3 | Y 3 | X 4 | Y 4 | |
| 1 | 30 | 34.08 | 42 | 43.56 | 54 | 42.66 | 48 | 46.26 |
| 2 | 48 | 41.7 | 48 | 48.84 | 24 | 32.34 | 114 | 75 |
| 3 | 78 | 45.48 | 60 | 54.84 | 72 | 48.9 | 48 | 39.48 |
| 4 | 36 | 43.44 | 30 | 28.44 | 66 | 46.86 | 48 | 33.36 |
| 5 | 24 | 25.56 | 66 | 55.56 | 60 | 44.76 | 48 | 53.04 |
| 6 | 54 | 52.86 | 24 | 18.6 | 78 | 76.44 | 48 | 31.5 |
| 7 | 60 | 48.24 | 36 | 36.78 | 42 | 38.52 | 48 | 34.56 |
| 8 | 42 | 28.92 | 54 | 52.62 | 48 | 40.62 | 48 | 50.82 |
| 9 | 66 | 49.98 | 72 | 54.78 | 30 | 34.38 | 48 | 47.46 |
| 10 | 84 | 59.76 | 84 | 48.6 | 36 | 36.48 | 48 | 41.34 |
| 11 | 72 | 65.04 | 78 | 52.44 | 84 | 53.04 | 48 | 42.24 |
Figure 2.

Scatter plots of pairs (X, Y) from the hypothetical dataset in Table 3
Y_i = a + b × X_i, where X_i is the observed HCG value and Y_i is the observed hormone value for the ith woman (i = 1, 2, …, 11) in each experiment. The slope of the line is ‘b’, and the intercept is ‘a’ (the value of Y when X = 0). The estimated equation for the line in each experiment is Ŷ = 18 + 0.5 × X. These four datasets all have the same regression line, the same Pearson's correlation coefficient (r = 0.816), the same coefficient of determination (R² = 0.667), and the same result for the overall F test of linear fit (F(1,9) = 18; P value = 0.0022), yet they are very different datasets. The correlation coefficient indicates a strong positive relationship between HCG and hormone levels, and the P value indicates that this correlation differs significantly from 0 (no correlation) and is thus unlikely to be a chance finding. While the first graph (Fig. 2a) shows a line that provides a good fit to the data, the other datasets do not fit as well: the second (Fig. 2b) follows a quadratic rather than a linear relationship, the third (Fig. 2c) is a perfect linear fit apart from one outlier, and the last (Fig. 2d) is unsuitable for linear fitting, the fitted line being determined essentially by one extreme observation. Whenever a linear regression model is fitted, it is assumed that the model satisfies independence, linearity, homoscedasticity, and normality [43]; one or more of these assumptions are not met in this example. An analysis based exclusively on summary statistics, such as the correlation coefficient, cannot detect these differences in pattern. Therefore, it is important to examine the scatter plot of Y against X before interpreting the correlation coefficient. The scatter plot can reveal unequal variance, nonlinearity, and outlying observations; thus, producing a scatter plot should always be the first step in any regression analysis [44, 45].
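The shared summary statistics can be reproduced with ordinary least squares computed from first principles; the sketch below uses Experiment 1 from Table 3 (the other three experiments give the same fitted line and correlation):

```python
import math

# Experiment 1 from Table 3 (X = HCG, Y = hormone)
x = [30, 48, 78, 36, 24, 54, 60, 42, 66, 84, 72]
y = [34.08, 41.70, 45.48, 43.44, 25.56, 52.86, 48.24, 28.92, 49.98, 59.76, 65.04]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

slope = sxy / sxx                       # ~0.5
intercept = mean_y - slope * mean_x     # ~18
r = sxy / math.sqrt(sxx * syy)          # ~0.816
r_squared = r ** 2                      # ~0.667
```

Identical numbers for datasets this different are exactly why the scatter plot, not the summary statistics, must come first.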
Additionally, the assumptions of linear regression are generally checked by examining the residuals, e_i = y_i − ŷ_i, where y_i are the observed responses and ŷ_i are the fitted responses calculated from the regression model [46]. A residual plot shows the residuals on the vertical axis and the independent variable on the horizontal axis. The residual plots for the example data are shown in Fig. 3a–d. If the points in a residual plot show no pattern and are randomly dispersed around the horizontal axis, as in Fig. 3a, a linear regression model is appropriate for the data. On the other hand, if a residual plot shows a curved pattern, as in Fig. 3b, the data are not well described by a linear fit. Figure 3c and d show no curvature but an increasing or decreasing spread about the line; for such data a linear fit can still be used, but the least‐squares regression line will be less accurate where the spread is larger. Residual plots often reveal violations of model assumptions more clearly than a plot of the regression line on a scatter plot [46, 47, 48].
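The residuals themselves need no plotting library to compute; a sketch for Experiment 1 of Table 3 against the published line Ŷ = 18 + 0.5X (the residuals sum only approximately to zero here because the published coefficients are rounded):

```python
# Experiment 1 from Table 3 and the fitted line y_hat = 18 + 0.5 * x
x = [30, 48, 78, 36, 24, 54, 60, 42, 66, 84, 72]
y = [34.08, 41.70, 45.48, 43.44, 25.56, 52.86, 48.24, 28.92, 49.98, 59.76, 65.04]

fitted = [18 + 0.5 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# A least-squares fit forces the residuals to sum to (essentially) zero;
# a residual plot then checks whether they scatter randomly around zero.
residual_sum = sum(residuals)   # ~0 (not exact: coefficients above are rounded)
```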
Figure 3.

Residual plots of the linear regression model for predicted hormone values from the hypothetical dataset in Table 3
In 1977, Tukey [45] suggested the use of exploratory data analysis (EDA) that employs a variety of techniques (mostly graphical) to maximize insight into a data set, uncover underlying structure, extract important variables, detect outliers and anomalies, test underlying assumptions, and develop suitable models. The statistical approach should allow researchers to explore data with an open mind. Therefore, graphical techniques of visual exploration, in combination with natural pattern‐recognition capabilities and knowledge of the subject, facilitate the discovery of the structural secrets of the data [46].
How to detect and handle outliers
In medical research, outliers often arise through measurement error (system behavior, human error, instrument error) or simply through natural population variation. In particular, a small number of outliers is to be expected in a large data set, and outliers may include the sample maximum or minimum, or both; however, the sample maximum and minimum are not always outliers. Grubbs defined an outlier as a data point that appears to deviate markedly from the other members of the sample [49], but there is no rigid statistical criterion; identifying whether an observation is an outlier is therefore ultimately subjective [48].
Because visual inspection alone is subjective and can lead to mislabeling an observation as an outlier, it is important first to examine the data using graphical methods such as histograms, scatter plots, or box plots. Several statistical approaches for identifying outliers have also been developed to accommodate outliers and to reduce their impact on the analysis [48, 49, 50]. For example, Grubbs' test is used to detect a single outlier in a univariate data set that follows an approximately normal distribution [47], while Dixon's Q test [51] does not require a normality assumption and performs well with small samples (< 10 observations) [48, 49, 50]. In such hypothesis tests, if the null hypothesis (no outlier) is rejected, the conclusion is that the most extreme value is an outlier.
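A minimal sketch of Grubbs' test on hypothetical data; the critical value is an assumption of this example, taken from standard Grubbs' test tables (approximately 2.29 for n = 10 at a two-sided α of 0.05) rather than computed here:

```python
import math

# Hypothetical measurements with one suspicious value (75)
data = [42, 44, 45, 46, 47, 48, 49, 50, 51, 75]

n = len(data)
mean = sum(data) / n
sd = math.sqrt(sum((v - mean) ** 2 for v in data) / (n - 1))

# Grubbs' statistic: largest absolute deviation in units of the sample SD
g = max(abs(v - mean) for v in data) / sd     # ~2.72

critical = 2.29            # assumed tabulated value: n = 10, alpha = 0.05, two-sided
is_outlier = g > critical  # True here: 75 is flagged for further investigation
```

As the following paragraph stresses, a statistically flagged point should prompt investigation, not automatic deletion.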
Clear identification of an outlier is most convincing when it can be justified medically as well as statistically; the medical context will then often define the appropriate action. On the other hand, eliminating outliers on the basis of statistical analysis alone, without an assignable cause, is not recommended. Altman [52] suggested performing one analysis with the actual values and at least one other analysis without the outlier; any differences between these results should be addressed if there was no predetermined method for handling outliers in the study protocol [52]. This type of analysis is called ‘sensitivity analysis’ [18]. As alternatives, nonparametric and robust methods (e.g., weighted least‐squares regression) have been suggested to minimize the effect of an outlying observation [53, 54, 55].
Outliers should be investigated carefully, and often provide useful information about the process under investigation or the data gathering and recording process. Before considering removing these points from the data, one should investigate why they are present and if similar values will continue to appear.
Statistical methods in medical research: recent trends and perspectives
The use of statistical methods has become a core component of data analysis, data interpretation, and presentation and is a necessary component of medical research. A wide variety of statistical methods have been used; the most common methods are simple descriptive methods including t tests, chi‐squared tests (analysis of contingency tables), and Pearson's correlation coefficient. There is evidence that journals with a more thorough statistical review process report a more complex and wider variety of statistical techniques [56, 57]. However, there is no definitive minimum amount of statistical knowledge that researchers need [57]. Indeed, previous studies [58, 59] have investigated the frequency of different statistical methods presented in published papers in medical fields including rehabilitation [60, 61], surgery [62], anesthesia [63], oncology [64], radiology [65, 66, 67], pediatrics [68], internal medicine [69], and general practice [70, 71].
For each article, the methods section was reviewed and the results section and supplemental appendixes were scanned to determine the types of statistical methods used. Statistical methods were classified into categories based on those used in other similar investigations [58, 59]. The statistical methods included summary statistics (means, standard deviations, standard errors, modes, medians, central tendency, variation, range, and variance), t test (one‐sample, independent samples, and paired samples), contingency table analyses (chi‐squared, Fisher's exact test, and likelihood ratio), analysis of variance (simple and multivariate analysis), and nonparametric tests (Wilcoxon rank‐sum test, Wilcoxon's signed‐rank test, Mann–Whitney test, sign test, runs test, Kolmogorov–Smirnov test, and Kruskal–Wallis test). Advanced statistical methods included simple linear regression (least‐squares regression with one predictor and one response variable), multiple regression (polynomial regression and stepwise regression), epidemiological methods (relative risk, odds ratio, log odds, measures of association, sensitivity, specificity), and survival analysis (Kaplan–Meier, life tables, and Cox regression).
Table 4 shows a broad similarity across the six reviews. Summary (descriptive) statistics were used in approximately 20% to 50% of all articles; these statistics summarize the data and provide information about the sample from which the data were drawn and the accuracy with which the sample represents the population of interest [52]. The most commonly used statistical methods were t tests, correlation tests, contingency table analysis, and regression analysis. Hayden [68] concluded that, although a reader familiar with these methods could have understood 97% of articles published in 1952, the percentage had fallen to 65% by 1982.
Table 4.
Statistical content of published papers in six reviews showing the number (%) of appearance of statistical methods
| Researchers | Emerson & Colditz | Horton & Switzer | Hayden | Reznick et al. | Goldin et al. | Rigby et al. |
|---|---|---|---|---|---|---|
| Journals reviewed | NEJM | NEJM | Pediatrics | Four surgical journals | Two radiology journals | Three general practice journals |
| Survey years | 1978–1979 | 2004–2005 | 1982 | 1985, 2003 | 1994 | 2000 |
| No. of papers | 760 | 311 | 151 | 200 | 218 | 305 |
| No. of summary (descriptive) statistics | 91 (27) | 39 (13) | 79 (52) | 89 (45) | 103 (47) | 121 (39) |
| t test | 147 (44) | 80 (26) | 53 (35) | 44 (22) | 43 (20) | 46 (15) |
| Contingency tables | 91 (27) | 166 (53) | 43 (28) | 44 (22) | 31 (14) | 79 (25) |
| Pearson's correlation | 40 (12) | 10 (3) | 19 (13) | 14 (7) | 20 (9) | 14 (5) |
| Nonparametric correlation | 13 (4) | 14 (5) | 1 (1) | 7 (4) | 12 (5) | 22 (7) |
| Nonparametric test | 38 (11) | 85 (27) | 4 (3) | 30 (15) | 2 (1) | 39 |
| Epidemiological methods | 33 (10) | 110 (35) | – | 13 (7) | – | 38 (13) |
| Simple linear regression | 28 (8) | 19 (6) | 10 (7) | – | 17 (8) | 74 (24) |
| Multiple regression | 15 (5) | 160 (51) | 5 (3) | 6 (3) | – | – |
| Analysis of variance | 25 (8) | 50 (16) | 9 (6) | 6 (3) | 0 (0) | 9 (3) |
| Multiple comparison | 11 (3) | 70 (23) | – | – | – | 10 (3) |
| Transformation | 23 (7) | 31 (10) | 0 (0) | 0 (0) | 9 (4) | – |
| Survival analysis | 36 (11) | 190 (61) | 1 (1) | 24 (12) | 8 (4) | 6 (2) |
| Repeated measures analysis | – | 37 (12) | – | – | – | 8 (3) |
| Receiver operating characteristics | – | 7 (2) | 2 (1) | – | ||
| Power analysis | 10 (3) | 121 (39) | 4 (2) | 26 (9) | ||
| Other methods | 36 | 121 | 0 | 0 | 3 | 73 |
NEJM The New England Journal of Medicine
Four surgical journals: Annals of Surgery, Archives of Surgery, Journal of the American College of Surgeons and Journal of Surgical Research
Two radiology journals: Clinical Radiology and British Journal of Radiology
Three general practice journals: British Medical Journal, British Journal of General Practice, and Family Practice
Similarly, in a comparison between the 1978–1979 and 2004–2005 surveys in the New England Journal of Medicine (NEJM), the proportion of articles containing only summary statistics did not change substantially, but there were substantial increases in the use of contingency table analysis, nonparametric tests, epidemiological statistics, analysis of variance, survival analysis, multiple regression, multiple comparisons, and power analyses [56, 58, 59]. Despite the greater variety of statistical methods used since the previous studies were published, the use of summary statistics remains essentially universal in the evaluation of quantitative data. However, these observations from this general medicine journal (NEJM) are not necessarily applicable to medical specialty journals (e.g., pediatrics, radiology, and surgical literature).
In medical specialty journals, nonparametric methods (test and correlation) were common; however, there were some differences in the use of advanced statistical methods. Survival analysis was frequently utilized in surgical journals, ANOVA was used in pediatric journals, receiver operating characteristic (ROC) curves were only used in radiology journals, and repeated measures analysis was only used in general practice journals. These slight differences may in part be due to differences in research objectives and study sample sizes.
Based on these results, Horton and Switzer [56] concluded that with knowledge of summary (descriptive) statistics, t‐tests, contingency table analysis, nonparametric tests, epidemiological methods, correlations, and simple regression, medical researchers should be able to interpret up to 82% of the medical literature.
On the other hand, the transfer of new advanced statistical methods into the medical literature lags by years: Altman and Goodman [72] concluded that many methodological innovations of the 1980s had still not made their way into the medical literature by the 1990s, indicating a typical lag time of 4–6 years, with only a modest increase in the use of newer, more sophisticated statistical techniques. The increased use and sophistication of advanced statistical methods in medical studies have implications for medical and statistical educators [57, 60, 62]. Wainapel and Kayne [61] and Altman [73] stressed the need to teach biostatistics and epidemiology to doctors who wish to undertake research or who expect to read, understand, and apply findings published in the medical literature [51, 57, 73]. Such education would promote the application of appropriate statistical methods, a better understanding of the published literature, and better scientific writing skills. Furthermore, consulting a biostatistician is useful for choosing an appropriate study design and statistical methods when planning a study, for interpreting the results, and for guidance on data presentation. Maintaining and fostering good communication between medical researchers and biostatisticians will lead to productive collaborations and successful research.
Biostatistics is an important element of EBM. In reproductive medicine and all medical areas, it is imperative for researchers to be aware of statistical issues and to collaborate with biostatisticians and bioinformatists in developing novel approaches that can improve current techniques. The systematic evaluation of statistical methods represents an essential step toward the goal of EBM; the successful and efficient implementation of these strategies will greatly accelerate development and clinical validation of therapy.
Acknowledgments
We are grateful to Dr. Takao Miyake at Miyake Women's Clinic, Dr. Maki Murakami at Pharmaceuticals and Medical Devices Agency, Dr. Chizuru Ito at Department of Anatomy and Developmental Biology, Graduate School of Medicine, Chiba University, and Professor Isao Yoshimura at Tokyo University of Science for their valuable advice and suggestions.
Conflict of interest
None of the authors have a duality of interest with regard to this work.
References
- 1. Steptoe PC, Edwards RG. Birth after the reimplantation of a human embryo. Lancet, 1978, 2, 366 10.1016/S0140‐6736(78)92957‐4 [DOI] [PubMed] [Google Scholar]
- 2. Noord‐Zaadstra BM, Looman CW, Alsbach H, Habbema JD, te Velde ER, Karbaat J. Delaying childbearing: effect of age on fecundity and outcome of pregnancy. BMJ, 1991, 302, 1361–1365 10.1136/bmj.302.6789.1361 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Gnoth C, Godehardt D, Godehardt E, Frank‐Herrmann P, Freundl G. Time to pregnancy: results of the German prospective study and impact on the management of infertility. Hum Reprod, 2003, 18, 1959–1966 10.1093/humrep/deg366 [DOI] [PubMed] [Google Scholar]
- 4. Hamilton BE, Ventura SJ. Fertility and abortion rates in the United States, 1960–2002. Int J Androl, 2006, 29, 34–45 10.1111/j.1365‐2605.2005.00638.x [DOI] [PubMed] [Google Scholar]
- 5. Schoen R, Canudas‐Romo V. Timing effects on first marriage: twentieth‐century experience in England and Wales and the USA. Popul Stud (Camb), 2005, 59, 135–146 10.1080/00324720500099124 [DOI] [PubMed] [Google Scholar]
- 6. Tetering EA, Dessel HJHM, Mol BWJ. Evidence‐based reproductive medicine in clinical practice: the case of clomiphene‐resistant PCOS. Eur Clinics Obstet Gynecol, 2005, 1 (2) 89–94 10.1007/s11296‐005‐0014‐5 [Google Scholar]
- 7. Evidence‐based Medicine Working Group . Evidence‐based medicine: a new approach to teaching the practice of medicine. JAMA, 1992, 268, 2420–2425 10.1001/jama.1992.03490170092032 [DOI] [PubMed] [Google Scholar]
- 8. Davidoff F, Haynes B, Sackett D, Smith R. Evidence based medicine: a new journal to help doctors identify the information they need. BMJ, 1995, 310, 1085–1086 10.1136/bmj.310.6987.1085 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Sackett DL, Rosenberg WM, Gray JA, Haynes RB, Richardson WS. Evidence based medicine: what it is and what it isn't. BMJ, 1996, 312, 71–72 10.1136/bmj.312.7023.71 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Coomarasamy A, Khan KS. What is the evidence that postgraduate teaching in EBM changes anything? A systematic review. BMJ, 2004, 329, 1017 10.1136/bmj.329.7473.1017 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Keus F, Wetterslev J, Gluud C, Laarhoven CJ. Evidence at a glance: error matrix approach for overviewing available evidence. BMC Med Res Methodol, 2010, 10, 90 10.1186/1471‐2288‐10‐90 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Higgins JPT, Green S. Cochrane handbook for systematic reviews of interventions. The Cochrane Collaboration, 2008. [Google Scholar]
- 13. Pocock SJ, Hughes MD, Lee RJ. Statistical problems in the reporting of clinical trials—a survey of three medical journals. N Engl J Med, 1987, 317, 426–432 10.1056/NEJM198708133170706 [DOI] [PubMed] [Google Scholar]
- 14. Porter AM. Misuse of correlation and regression in three medical journals. J Roy Soc Med., 1999, 92, 123–128 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Strasak AM, Zaman Q, Pfeiffer KP, Göbel G, Ulmer H. Statistical errors in medical research—a review of common pitfalls. Swiss Med Wkly, 2007, 137, 44–49 [DOI] [PubMed] [Google Scholar]
- 16. Cox DR, Reid N The theory of the design of experiments, 2000. London: Chapman & Hall/CRC; [Google Scholar]
- 17. Rothman KJ, Greenland S, Lash TL Modern epidemiology, 2008. 3 New York: Lippincott Williams & Wilkins; [Google Scholar]
- 18. ICH E9 Expert Working Group . Statistical principles for clinical trials. Stat Med, 1999, 18, 1905–1942 [PubMed] [Google Scholar]
- 19. Altman DG. Statistics and ethics in medical research: misuse of statistics is unethical. BMJ, 1980, 281, 1267–1269 10.1136/bmj.281.6250.1267 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Gardenier JS, Resnik DB. The misuse of statistics: concepts, tools, and a research agenda. Acc Res, 2002, 9, 65–74 [DOI] [PubMed] [Google Scholar]
- 21. Elandt‐Johnson RC. Definition of rates: some remarks on their use and misuse. Am J Epidemiol, 1975, 102 (4) 267–271 [DOI] [PubMed] [Google Scholar]
- 22. Ahrens W, Pigeot I Handbook of epidemiology, 2005. Berlin: Springer; [Google Scholar]
- 23. Aschengrau A, Seage G Essentials of epidemiology in public health, 2008. 2 Sudbury: Jones and Bartlett Publishers, Inc; [Google Scholar]
- 24. Tur‐Kaspa I, Yuval Y, Bider D, Levron J, Shulman A, Dor J. Difficult or repeated sequential embryo transfers do not adversely affect in‐vitro fertilization pregnancy rates or outcome. Hum Reprod, 1998, 13, 2452–2455 10.1093/humrep/13.9.2452 [DOI] [PubMed] [Google Scholar]
- 25. Hearns‐Stokes RM, Miller BT, Scott L, Creuss D, Chakraborty PK, Segars JH. Pregnancy rates after embryo transfer depend on the provider at embryo transfer. Fertil Steril, 2000, 74 (1) 80–86 10.1016/S0015‐0282(00)00582‐3 [DOI] [PubMed] [Google Scholar]
- 26. Yule GU. Notes on the theory of association of attributes in statistics. Biometrika, 1903, 2 (2) 121–134 10.1093/biomet/2.2.121 [Google Scholar]
- 27. Simpson EH. The interpretation of interaction in contingency tables. J Roy Stat Soc B, 1951, 13, 238–241 [Google Scholar]
- 28. Appleton DR, French JM, Vanderpump MPJ. Ignoring a covariate: an example of Simpson's paradox. Am Stat, 1996, 50 (4) 340–341 10.2307/2684931 [Google Scholar]
- 29. Julious SA, Mullee MA. Confounding and Simpson's paradox. BMJ, 1994, 309 (6967) 1480–1481 10.1136/bmj.309.6967.1480 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30. Wagner CH. Simpson's paradox in real life. Am Stat, 1982, 36 (1) 46–48 10.2307/2684093 [Google Scholar]
- 31. Reintjes R, Boer A, Pelt W, Mintjes‐de Groot J. Simpson's paradox: an example from hospital epidemiology. Epidemiology, 2000, 11 (1) 81–83 10.1097/00001648‐200001000‐00017 [DOI] [PubMed] [Google Scholar]
- 32. Neutel CI. The potential for Simpson's paradox in drug utilization studies. Ann Epidemiol, 1997, 7, 517–521 10.1016/S1047‐2797(97)00084‐7 [DOI] [PubMed] [Google Scholar]
- 33. Julious SA, Mullee MA. Confounding and Simpson's paradox. BMJ, 1994, 309, 1480–1481 10.1136/bmj.309.6967.1480 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34. Hand DJ. Psychiatric examples of Simpson's paradox. Br J Psychiatry, 1979, 135, 90–91 10.1192/bjp.135.1.90b [DOI] [PubMed] [Google Scholar]
- 35. Finer LB, Henshaw SK. Disparities in rates of unintended pregnancy in the United States, 1994 and 2001. Perspect Sex Repro H, 2006, 38 (2) 90–96 10.1363/3809006 [DOI] [PubMed] [Google Scholar]
- 36. Bateman BT, Simpson LL. Higher rate of stillbirth at the extremes of reproductive age: a large nationwide sample of deliveries in the United States. Am J Obstet Gynecol, 2006, 194 (3) 840–845 10.1016/j.ajog.2005.08.038 [DOI] [PubMed] [Google Scholar]
- 37. Altman DG, Deeks JJ. Meta‐analysis, Simpson's paradox, and the number needed to treat. BMC Med Res Methodol, 2002, 2, 3 10.1186/1471‐2288‐2‐3 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Cates CJ. Simpson's paradox and calculation of number needed to treat from meta‐analysis. BMC Med Res Methodol, 2002, 2, 1 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39. Rücker G, Schumacher M. Simpson's paradox visualized: the example of the rosiglitazone meta‐analysis. BMC Med Res Methodol, 2008, 8, 34 10.1186/1471‐2288‐8‐34 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40. Howell DC Fundamental statistics for the behavioral sciences, 2008. 6 Belmont: Wadsworth; [Google Scholar]
- 41. Ercan I, Yazici B, Yang Y. Misusage of statistics in medical research. Eur J Gen Med, 2007, 4, 128–134 [Google Scholar]
- 42. Anscombe FJ. Graphs in statistical analysis. Am Stat, 1973, 27 (1) 17–21 10.2307/2682899 [Google Scholar]
- 43. Rao CR Linear statistical inference and its applications, 1973. 2 New York: Wiley; [Google Scholar]
- 44. Quinn GP, Keough MJ Experimental design and data analysis for biologists, 2002. Cambridge: Cambridge University Press; [Google Scholar]
- 45. Tukey JW Exploratory data analysis, 1977. Reading: Addison‐Wesley; [Google Scholar]
- 46. Theus M, Urbanek S Interactive graphics for data analysis: principles and examples, 2008. Boca Raton: CRC Press; [Google Scholar]
- 47. Barnett V, Lewis T Outliers in statistical data, 1994. 3 NY: Wiley; [Google Scholar]
- 48. Rousseeuw PJ, Leroy AM Robust regression and outlier detection, 1987. NY: Wiley; [Google Scholar]
- 49. Grubbs FE. Procedures for detecting outlying observations in samples. Technometrics, 1969, 11, 1–21 10.2307/1266761 [Google Scholar]
- 50. Iglewicz B, Hoaglin DC How to detect and handle outliers, 1993. Milwaukee: American Society for Quality Control; [Google Scholar]
- 51. Dean RB, Dixon WJ. Simplified statistics for small numbers of observations. Anal Chem, 1951, 23 (4) 636–638 10.1021/ac60052a025 [Google Scholar]
- 52. Altman DG Practical statistics for medical research, 1991. London: Chapman and Hall; [Google Scholar]
- 53. Conover WJ Practical nonparametric statistics, 1971. 2 NY: Wiley; [Google Scholar]
- 54. Armitage P, Berry G Statistical methods in medical research, 1994. 3 NY: Blackwell Science; [Google Scholar]
- 55. Hollander M, Wolfe DA Nonparametric statistical methods, 1999. 2 NY: Wiley‐Interscience; [Google Scholar]
- 56. Horton NJ, Switzer SS. Statistical methods in the journal. N Engl J Med, 2005, 353 (18) 1977–1979 10.1056/NEJM200511033531823 [DOI] [PubMed] [Google Scholar]
- 57. Altman DG, Bland JM. Improving doctors’ understanding of statistics. J Roy Stat Soc A, 1991, 154, 223–267 10.2307/2983040 [Google Scholar]
- 58. Emerson JD, Colditz GA. Use of statistical analysis in New England Journal of Medicine. N Engl J Med, 1983, 309, 709–713 10.1056/NEJM198309223091206 [DOI] [PubMed] [Google Scholar]
- 59. Emerson JD, Colditz GA. Use of statistical analysis in the New England Journal of Medicine. In: Bailar JC, Mosteller F, eds. Medical uses of statistics, 1992. 3 Boston: NEJM Books; 45–57 [Google Scholar]
- 60. Schwartz SJ, Sturr M, Goldberg G. Statistical methods in rehabilitation literature: a survey of recent publications. Arch Phys Med Rehabil, 1996, 77, 497–500 10.1016/S0003‐9993(96)90040‐4 [DOI] [PubMed] [Google Scholar]
- 61. Wainapel SF, Kayne HL. Statistical methods in rehabilitation research. Arch Phys Med Rehabil, 1985, 66, 322–324 10.1016/0003‐9993(85)90172‐8 [PubMed] [Google Scholar]
- 62. Kurichi JE, Sonnad SS. Statistical methods in the surgical literature. J Am Coll Surg, 2006, 202 (3) 476–484 10.1016/j.jamcollsurg.2005.11.018 [DOI] [PubMed] [Google Scholar]
- 63. Avram MJ, Shanks CA, Dykes MH et al. Statistical methods in anesthesia articles: an evaluation of two American journals during two six‐month periods. Anesth Analg, 1985, 64, 607–611 10.1213/00000539‐198506000‐00009 [PubMed] [Google Scholar]
- 64. Hokanson JA, Luttman DJ, Weiss GB. Frequency and diversity of use of statistical techniques in oncology journals. Cancer Treat Rep, 1986, 7, 589–594 [PubMed] [Google Scholar]
- 65. Goldin J, Zhu W, Sayre JW. A review of the statistical analysis used in papers published in Clinical Radiology and British Journal of Radiology. Clin Radiol, 1996, 51, 47–50 10.1016/S0009‐9260(96)80219‐4 [DOI] [PubMed] [Google Scholar]
- 66. Elster AD. Use of statistical analysis in the AJR and Radiology: frequency, methods, and subspecialty differences. Am J Roentgenol, 1994, 163, 711–715 [DOI] [PubMed] [Google Scholar]
- 67. Huang W, LaBerge JM, Lu Y, Glidden DV. Research publications in vascular and interventional radiology: research topics, study designs, and statistical methods. J Vasc Interv Radiol, 2002, 13, 247–255 10.1016/S1051‐0443(07)61717‐5 [DOI] [PubMed] [Google Scholar]
- 68. Hayden GF. Biostatistical trends in pediatrics: implications for the future. Pediatrics, 1983, 72, 84–87 [PubMed] [Google Scholar]
- 69. Cardiel MH, Goldsmith CH. Type of statistical techniques in rheumatology and internal medicine journals. Rev Invest Clin, 1995, 47, 197–201 [PubMed] [Google Scholar]
- 70. Thomas T, Fahey T, Somerset M. The content and methodology of research papers published in three United Kingdom primary care journals. Br J Gen Pract, 1998, 48, 1229–1232 [PMC free article] [PubMed] [Google Scholar]
- 71. Rigby AS, Armstrong GK, Campbell MJ, Summerton N. A survey of statistics in three UK general practice journals. BMC Med Res Methodol, 2004, 4 (1) 28 10.1186/1471‐2288‐4‐28 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72. Altman DG, Goodman SN. Transfer technology from statistical journals to the biomedical literature. Past trends and future predictions. JAMA, 1994, 272, 129–132 10.1001/jama.1994.03520020055015 [PubMed] [Google Scholar]
- 73. Altman DG. Statistics in medical journals: some recent trends. Stat Med, 2000, 19 (23) 3275–3289 10.1002/1097‐0258(20001215)19:23<3275::AID‐SIM626>3.0.CO;2‐M [DOI] [PubMed] [Google Scholar]
