
Statistics mean different things to different people. Ask any two statisticians the same question, and you are likely to receive three different answers. It is not that you have necessarily been given incorrect answers, but rather that each statistician looks at the same question from a different perspective. One may consider the method of data collection or the size of the data set, while the other may contemplate whether the observations were independent or correlated, or whether the necessary assumptions were satisfied.
When I first joined the editorial board, I conducted an in-depth review of how statistical methods were used in articles published in the Journal over a three-year period. As a result, the Journal team began encouraging one another and its authors to pay closer attention to the use of statistics in their research. Through outreach sessions at the annual meetings of the American Public Health Association and Statistically Speaking columns in the Journal, we brought statistics to the forefront of our discussions. It is in this spirit that I offer the following guidance regarding the research question, the size of the data set, the method of collection, the representativeness of the sample as compared with the population of interest, and the method of analysis used in drawing conclusions pertinent to public health.
Use care in ensuring that the research question from which the hypothesis is formulated is in harmony with the analysis that is undertaken.
Choose the proper software package for the size of the data set being analyzed. Options exist to analyze very small data sets with exact statistical methods, and others for large or so-called “big data.”
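To make the distinction between exact and large-sample methods concrete, the following is a minimal sketch in Python using only the standard library. The 2x2 table is hypothetical, and the function names `fisher_exact_two_sided` and `pearson_chi2_p` are chosen here for illustration; they are not taken from any particular package. The sketch compares a two-sided Fisher exact p-value with the p-value from an uncorrected Pearson chi-square test, which relies on asymptotic theory.

```python
from math import comb, erfc, sqrt


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x, given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    total = sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)  # small tolerance for float ties
    )
    return min(1.0, total)


def pearson_chi2_p(a, b, c, d):
    """Asymptotic p-value from the uncorrected Pearson chi-square (1 df)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    return erfc(sqrt(stat / 2))


# Hypothetical small study: 8 subjects in total.
exact_p = fisher_exact_two_sided(3, 1, 1, 3)
asymptotic_p = pearson_chi2_p(3, 1, 1, 3)
print(f"exact p = {exact_p:.3f}, asymptotic p = {asymptotic_p:.3f}")
```

For this hypothetical table the exact p-value is 34/70, about 0.49, whereas the asymptotic test gives about 0.16. With so few observations the two approaches can disagree substantially, which is why the choice of method and software should reflect the size of the data set.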
Failure to incorporate the method of data collection into the choice of the analysis can lead to incorrect conclusions, as the researcher, more often than not, obtains a standard error that is smaller than the variability actually present in the data. The core problem lies in favoring a method of data analysis that is well known but ignores the study design.
Too often researchers confuse the fact that they may have random observations with the reality that the sample may not be a good representation of the population of interest. Such cases greatly limit any valid extrapolations that may be made.
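The understatement of the standard error can be seen in a small simulation. The sketch below is illustrative only: the clustered design (20 clusters of 30 observations), the variance components, and the function names are all assumptions made for this example. It compares the standard error of the mean computed as if all observations were independent with one computed from the cluster means, which respects the way the data were collected.

```python
import random
import statistics


def simulate_clustered_sample(n_clusters=20, cluster_size=30, seed=0):
    """Simulate observations that share a common effect within each cluster."""
    rng = random.Random(seed)
    clusters = []
    for _ in range(n_clusters):
        cluster_effect = rng.gauss(0.0, 1.0)  # shared by every member of the cluster
        clusters.append(
            [cluster_effect + rng.gauss(0.0, 0.5) for _ in range(cluster_size)]
        )
    return clusters


def naive_standard_error(clusters):
    """Standard error of the mean, treating every observation as independent."""
    values = [value for cluster in clusters for value in cluster]
    return statistics.stdev(values) / len(values) ** 0.5


def cluster_standard_error(clusters):
    """Standard error of the mean based on cluster means (equal cluster sizes)."""
    cluster_means = [statistics.mean(cluster) for cluster in clusters]
    return statistics.stdev(cluster_means) / len(cluster_means) ** 0.5


clusters = simulate_clustered_sample()
naive_se = naive_standard_error(clusters)
design_se = cluster_standard_error(clusters)
print(f"naive SE = {naive_se:.3f}, design-based SE = {design_se:.3f}")
```

Under these assumptions the naive standard error is several times smaller than the design-based one, because observations within a cluster carry less independent information than the same number of unrelated observations would.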
It is not uncommon to find instances where the method of analysis is treated separately from the rest of the research study. “Collapsing of data,” for instance, has serious implications that often go unnoticed. Consider a system of several factors. Examining how one factor interacts with another in the presence of many other factors is not necessarily the same as examining how the two factors interact in isolation. In collapsing the data, the researcher is in fact answering two different questions: one before collapsing the data, and one after. The fundamental problem with this approach is well presented in the widely acknowledged bible of categorical data analysis, Discrete Multivariate Analysis: Theory and Practice (Bishop YMM, Fienberg SE, Holland PW. Cambridge, MA: The MIT Press; 1975).
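A small numerical illustration may help show why the two questions differ. The counts below are hypothetical and were constructed for this example; the stratum labels and variable names are likewise assumptions. Within each stratum the exposed group has the higher recovery proportion, yet once the strata are collapsed into a single table the direction of the association reverses.

```python
# Hypothetical counts: (recovered, total) for exposed and unexposed subjects
# in two strata of a third factor.
strata = {
    "stratum A": {"exposed": (90, 100), "unexposed": (800, 1000)},
    "stratum B": {"exposed": (300, 1000), "unexposed": (20, 100)},
}


def risk_difference(exposed, unexposed):
    """Recovery proportion among the exposed minus that among the unexposed."""
    return exposed[0] / exposed[1] - unexposed[0] / unexposed[1]


# Question 1: the association within each stratum (before collapsing).
stratum_differences = {
    name: risk_difference(groups["exposed"], groups["unexposed"])
    for name, groups in strata.items()
}


def collapse(group):
    """Add the counts for one group across all strata."""
    recovered = sum(groups[group][0] for groups in strata.values())
    total = sum(groups[group][1] for groups in strata.values())
    return recovered, total


# Question 2: the association after collapsing over the third factor.
collapsed_difference = risk_difference(collapse("exposed"), collapse("unexposed"))

for name, difference in stratum_differences.items():
    print(f"{name}: risk difference = {difference:+.2f}")
print(f"collapsed: risk difference = {collapsed_difference:+.2f}")
```

In this constructed example the exposed group does better by about 0.10 in each stratum but appears worse by about 0.39 after collapsing, because exposure is distributed very unevenly across the strata. The stratified and collapsed analyses are each arithmetically correct; they simply answer different questions.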
The field of statistics has grown considerably over the last three decades. Computers allow us to conduct exact statistical tests and data analysis without assuming independence of the observations or relying on asymptotic theory for large data sets. During my tenure on the editorial board, I sought to promote the use of sound statistics in Journal articles. In response, there is now greater emphasis on using and reporting statistical tests that incorporate how the data were collected and that account for the dependency among correlated observations, as is so often the case in public health research.
Despite these successes, researchers are still tempted to use outmoded and inappropriate statistics in their research studies. This likely has more to do with their comfort level in using what was available during their public health training than with their access to the latest statistical techniques and software.
