For anyone interested in conducting or evaluating research, one of the fundamental skills is being able to describe data adequately and succinctly. Descriptive statistics are an essential component of biomedical research; they provide simple summaries of a data set in order to communicate as much information as possible. Two types of descriptive statistics are measures of central tendency and measures of variability.
Measures of central tendency attempt to approximate the center of a distribution and thus identify a representative value of a data set. These include the mean, median, and mode. It is important to realize that the mean requires numerical data and the median requires data that can at least be ranked; of the three, only the mode can be reported for unordered categorical variables. Continuous variables are numerical data that can take any value between a theoretical minimum and a theoretical maximum. Examples of continuous variables in toxicology include drug concentrations (e.g., acetaminophen, digoxin) and laboratory values (e.g., serum lactate). In contrast, categorical variables are those that have two or more discrete groups (or “categories”); they are called nominal when the categories have no inherent order and ordinal when they do. Examples of categorical variables when discussing acetaminophen overdose include the “presence” or “absence” of hepatotoxicity (nominal) and the stage of hepatic encephalopathy (ordinal).
For continuous data, the mean is the arithmetic average; it is calculated by summing all of the values in a data set and then dividing by the total number of values. The median is the middle value when the data are arranged in either ascending or descending order of magnitude; when the number of values is even, it is the average of the two middle values. The mode is the value that occurs most frequently in a data set. While measures of central tendency are fairly straightforward to calculate, it is worth noting some subtleties.
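As a minimal sketch of these three definitions, the following Python snippet uses the standard-library statistics module on a small set of made-up concentrations. The values are illustrative assumptions and are not taken from the figures.

```python
import statistics

# Hypothetical acetaminophen concentrations (ug/mL), for illustration only.
concentrations = [100, 125, 125, 150, 125, 100, 150, 175, 75]

mean_value = statistics.mean(concentrations)      # sum of values / number of values
median_value = statistics.median(concentrations)  # middle value after sorting
mode_value = statistics.mode(concentrations)      # most frequently occurring value

print(mean_value, median_value, mode_value)
```

For this symmetric example all three measures come out to 125. For data with more than one mode, statistics.multimode returns every most-frequent value rather than a single one.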
The following data set represents a hypothetical sample of 80 patients and their corresponding acetaminophen concentrations (Fig. 1), created for the purpose of illustration.
Fig. 1.
Cases of supratherapeutic acetaminophen concentrations in a normal distribution. The y-axis represents case counts, and the x-axis represents acetaminophen concentration. The red arrow represents the mean, median, and mode of this data set
The data here are normally distributed (or have a Gaussian distribution). As depicted, these data fall in a “bell shaped” curve. In a normal distribution, there is no skew, and the mean, median, and mode are the same value. Here, the mean, median, and mode are all 125 ug/mL as delineated by the red arrow. While normal distributions represent a unimodal data set where the mean, median, and mode are the same value, it should be noted that multimodal (e.g., bimodal) distributions can also be found.
When determining which measure of central tendency to report in a scientific manuscript, it is essential to consider how the data are distributed. Outliers (i.e., data points that differ markedly from other values in the data set) can skew the data and may therefore affect measures of central tendency, particularly the mean. The following data set represents a hypothetical sample of patients and their corresponding acetaminophen concentrations (Fig. 2).
Fig. 2.
Cases of supratherapeutic acetaminophen concentrations with outliers. The y-axis represents case counts, and the x-axis represents acetaminophen concentration. The red arrow represents the median and mode of this data set. The black arrow represents the mean
In this data set, most acetaminophen concentrations fall between 50 and 150 ug/mL. However, there are ten cases with a concentration of 250 ug/mL, and these values represent outliers. As a result, while both the mode and median are 100 ug/mL (red arrow), the mean is pulled upward by the larger acetaminophen concentrations and is 125 ug/mL (black arrow). When a sample has multiple outliers, the mean becomes a less accurate marker of central location and is “dragged” away from the “true” central location of the data. In contrast, the median is not as strongly influenced by extreme values. Thus, when describing skewed data sets, the median is typically a better measure of central tendency than the mean, especially when the study sample size is small.
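The pull of outliers on the mean can be sketched in a few lines of Python. The counts below are an assumption: they were chosen only so that the summary values match those quoted above (mean 125, median and mode 100, ten cases at 250 ug/mL) and are not the data behind Fig. 2.

```python
import statistics

# Hypothetical counts per concentration (ug/mL); NOT the authors' data set.
counts = {50: 5, 75: 10, 100: 20, 125: 10, 150: 5, 250: 10}
with_outliers = [value for value, n in counts.items() for _ in range(n)]
without_outliers = [value for value in with_outliers if value != 250]

for label, data in (("with outliers", with_outliers),
                    ("without outliers", without_outliers)):
    print(label,
          "mean:", statistics.mean(data),
          "median:", statistics.median(data),
          "mode:", statistics.mode(data))
```

With the ten outlying cases removed, the mean of this made-up sample falls from 125 to 100 ug/mL, while the median and mode stay at 100 ug/mL in both cases.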
When discussing both Gaussian distributions and outliers, it is also worth referencing the central limit theorem. At its core, this theorem states that as the size of each sample increases, the distribution of sample means (i.e., the means that would be obtained from repeated samples of the same population) approximates a Gaussian distribution, even when the underlying data are skewed or contain outliers. As a result, obtaining a sufficiently large sample will reduce sampling error and thus allow the sample to more accurately estimate the characteristics of a population.
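A small simulation can make the theorem more concrete. The sketch below assumes a right-skewed exponential “population” with a mean of 100 (an arbitrary choice for illustration) and compares the skewness of individual values with the skewness of sample means for two sample sizes; the helper names are hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

def skewness(values):
    """Skewness: mean cubed deviation divided by the cubed standard deviation."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)

def draw_sample(size):
    """Draw `size` values from a right-skewed exponential distribution (mean 100)."""
    return [random.expovariate(1 / 100) for _ in range(size)]

raw_values = draw_sample(5000)
sample_means = {
    size: [statistics.fmean(draw_sample(size)) for _ in range(2000)]
    for size in (5, 50)
}

print("skewness of individual values:", round(skewness(raw_values), 2))
for size, means in sample_means.items():
    print(f"skewness of sample means (n={size}):", round(skewness(means), 2))
```

The skewness of the sample means shrinks toward zero as the sample size grows, which is the behavior the theorem describes; the individual values themselves remain skewed.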
In contrast to measures of central tendency, measures of variability (or dispersion) attempt to quantify the spread of a data set. These include range, interquartile range, variance, and standard deviation. Ultimately, these tools aim to describe the degree of precision (i.e., homogeneity or heterogeneity) of the data set. Low variability implies that values are fairly consistent and thus more precise; high variability reflects less precision and greater dispersion. While precision refers to the degree of spread, accuracy is a distinctly different term that reflects how well the data captures a population parameter.
The range is calculated by taking the difference between the largest and smallest values in the data set. Since it involves only two values, the range is heavily influenced by outliers. When evaluating the data set in Fig. 2, the range is 200 ug/mL (250 − 50 ug/mL). However, the range would have been 100 ug/mL if the outliers did not exist (150 − 50 ug/mL).
Unlike the range, which reflects the spread of the entire data set, the interquartile range (IQR) describes the spread of the middle half of the data and is less susceptible to the presence of outliers. To calculate the IQR, the data must first be divided into quartiles, which partition the ordered sample into four equal-sized groups. The first quartile (Q1) is the middle value between the minimum and the median data points. The third quartile (Q3) is the middle value between the median and the maximum data points. The IQR is then calculated by subtracting the first quartile from the third quartile (IQR = Q3 − Q1).
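The range and IQR calculations can be sketched as follows. The quartiles helper is a hypothetical function that takes Q1 and Q3 as the medians of the lower and upper halves of the sorted data, mirroring the description above; the counts are made-up values chosen to be consistent with the summary numbers quoted for Fig. 2, not the authors' data.

```python
import statistics

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the median position
    upper = ordered[-half:]   # values above the median position
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

# Hypothetical counts per concentration (ug/mL); NOT the authors' data set.
counts = {50: 5, 75: 10, 100: 20, 125: 10, 150: 5, 250: 10}
data = [value for value, n in counts.items() for _ in range(n)]

data_range = max(data) - min(data)
q1, median, q3 = quartiles(data)
iqr = q3 - q1

print("range:", data_range)
print("Q1:", q1, "median:", median, "Q3:", q3, "IQR:", iqr)
```

Statistical software often interpolates between data points when computing quartiles, so Q1 and Q3 from other tools may differ slightly from this convention.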
All of the aforementioned measures can be represented graphically with a box-and-whisker plot. These plots allow for visualization of a five-number summary: minimum value, Q1, median, Q3, and maximum value. Figure 3 represents a box-and-whisker plot using the same data as Fig. 2.
Fig. 3.
Box-and-whisker plot of supratherapeutic acetaminophen concentrations with outliers
In this type of graphic, the height of the box represents the IQR. The smaller this height, the smaller the degree of variability. Moreover, the direction of the skew is reflected by box placement. The whiskers often (though not always) help to define outliers, as they do in this figure. Here, the lower whisker limit is Q1 − [1.5 × IQR], whereas the upper whisker limit is Q3 + [1.5 × IQR]. Any value beyond these limits is generally considered an outlier.
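The 1.5 × IQR rule can be sketched as below. As before, the counts are hypothetical values chosen to be consistent with the summary numbers quoted for Fig. 2, and the quartiles are taken as medians of the lower and upper halves of the sorted data, which is one of several conventions.

```python
import statistics

# Hypothetical counts per concentration (ug/mL); NOT the authors' data set.
counts = {50: 5, 75: 10, 100: 20, 125: 10, 150: 5, 250: 10}
data = sorted(value for value, n in counts.items() for _ in range(n))

half = len(data) // 2
q1 = statistics.median(data[:half])    # median of the lower half
q3 = statistics.median(data[-half:])   # median of the upper half
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [value for value in data if value < lower_limit or value > upper_limit]

print("limits:", lower_limit, "to", upper_limit)
print("number of outliers:", len(outliers), "values:", sorted(set(outliers)))
```

In this made-up sample the limits are 12.5 and 212.5 ug/mL, so only the ten cases at 250 ug/mL are flagged. Many plotting tools draw each whisker to the most extreme data point that still lies within these limits rather than to the limit itself.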
In contrast to range and IQR, variance and standard deviation are measures of spread that take into account every value in a data set. The sample variance, represented by s² (Fig. 4), is the sum of the squared differences from the sample mean divided by n − 1; it is approximately the average squared difference from the mean. The sample standard deviation, or s, is the square root of the variance. (When referring to the variance and standard deviation in a sample of patients, “s²” and “s” are used to represent the sample variance and standard deviation, respectively. When describing the variance and standard deviation of an entire population, “σ²” and “σ,” respectively, are used. Population values of variance and standard deviation for most variables are not usually known, and therefore “s²” and “s” are more commonly seen in the scientific literature.)
Fig. 4.
Sample variance (s²) and sample standard deviation (s): $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$ and $s = \sqrt{s^2}$. Σ, sum of; x, each value; x̄, sample mean; n, number of values in the sample
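As a worked sketch of these two quantities, the snippet below computes the sample variance and sample standard deviation step by step for five made-up concentrations and cross-checks the result against Python's statistics module. It assumes the usual n − 1 denominator for a sample.

```python
import math
import statistics

# Hypothetical acetaminophen concentrations (ug/mL), for illustration only.
values = [50, 75, 100, 125, 150]

n = len(values)
sample_mean = sum(values) / n
squared_deviations = [(x - sample_mean) ** 2 for x in values]

sample_variance = sum(squared_deviations) / (n - 1)  # units: (ug/mL) squared
sample_sd = math.sqrt(sample_variance)               # units: ug/mL

# Cross-check against the standard library's sample (n - 1) formulas.
matches_library = (
    math.isclose(sample_variance, statistics.variance(values))
    and math.isclose(sample_sd, statistics.stdev(values))
)

print("variance:", sample_variance, "SD:", round(sample_sd, 2), "matches:", matches_library)
```

For population values (σ² and σ), the denominator would be n rather than n − 1; the corresponding library functions are statistics.pvariance and statistics.pstdev.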
Whereas the variance is expressed in squared units, the standard deviation is expressed in the same units as the mean of the data set. A smaller standard deviation indicates that the values are close to the mean and thus less dispersed. In a normal distribution, approximately 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean, respectively.
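These percentages can be checked with the standard library's NormalDist class. In the sketch below, the mean of 125 ug/mL echoes the earlier example, while the standard deviation of 25 ug/mL is an assumed value chosen only for illustration.

```python
from statistics import NormalDist

# Mean of 125 ug/mL echoes the text; the SD of 25 ug/mL is an assumption.
dist = NormalDist(mu=125, sigma=25)

coverage = {}
for k in (1, 2, 3):
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    # Probability of falling between the two bounds.
    coverage[k] = dist.cdf(upper) - dist.cdf(lower)
    print(f"within {k} SD ({lower:.0f} to {upper:.0f} ug/mL): {coverage[k]:.1%}")
```

The proportions do not depend on the particular mean or standard deviation chosen; any normal distribution gives the same three values.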
Often, calculating the variance is the penultimate step before determining the standard deviation. Since the variance is expressed in squared units, its magnitude is difficult to interpret directly and serves mainly as a mathematical measure of dispersion. The standard deviation is generally more intuitive, as it has the same units as the original data set, and is thus used when analyzing samples to gain a better grasp of what is typical in the true population.
When presenting research and data, understanding measures of central tendency (i.e., mean, median, and mode) helps one to synthesize and illustrate data in an elegant and meaningful manner for others to understand. Characterization of the data in terms of spread (i.e., range, IQR, variance, and standard deviation) is useful for determining the homo- or heterogeneity of any data set. These fundamental measures are the foundation of most quantitative data analysis and are usually the first analyses performed on quantitative data. Furthermore, more advanced statistical testing (i.e., inferential statistics) often relies on these measures. Familiarity with the measures of central tendency and variability, and with how they are calculated, will help you find the “middle” of the data and describe how the data are distributed (i.e., their “spread”).
Sources of Funding
None.
Declarations
Conflicts of Interest
None.
Key Points
• In a normal (or Gaussian) distribution, the mean, median, and mode are all the same value and the data will follow a “bell shaped” curve.
• Outliers in a sample may skew the data set and “pull the mean” away from the true “center” of the data especially when the study sample size is small. In cases like this, the median represents a better measure of central tendency.
• A large sample size reduces sampling error and thus more accurately reflects the characteristics of a population. If the sample size is large enough, the distribution of sample means approximates a normal distribution even if the underlying data are skewed or contain outliers.
• Variance and standard deviation are mathematical tools used to describe the spread of a data set and represent the homo- or heterogeneity of the data. Datasets with a small or narrow variance (or standard deviation) are more homogeneous than datasets with a large or wide variance (or standard deviation).