Abstract
The Mahalanobis distance is a statistical technique that can be used to measure how distant a point is from the centre of a multivariate normal distribution. By measuring Mahalanobis distances in environmental space ecologists have also used the technique to model: ecological niches, habitat suitability, species distributions, and resource selection functions. Unfortunately, the original description of the Mahalanobis distance technique for ecological modelling contained an error describing how Mahalanobis distances could be converted into probabilities using a chi-squared distribution. This error has been repeated in the literature, and is present in popular modelling software. In the hope of correcting this error to maximise the potential application of the Mahalanobis distance technique within the ecological modelling community, I explain how Mahalanobis distances are calculated, and through a virtual ecology experiment demonstrate how to correctly produce probabilities and discuss the implications of the error for previous Mahalanobis distance studies.
Keywords: Species distribution modelling, Ecological niche modelling, Mahalanobis distance, Habitat suitability modelling, Resource selection functions, Virtual ecology, Presence-only, Multivarite normal distribution, Chi-squared distribution, Probability
Introduction
The Mahalanobis distance (Mahalanobis, 1936) is a statistical technique that can be used to measure how distant a point is from the centre of a multivariate normal distribution. Consider a data matrix A with m rows of observations and n columns of measured variables.
The Mahalanobis distance D2 for each observation vector xm = [xm1, xm2, xm3, …, xmn] is calculated as a function of an n-dimensional vector containing the means for each column of variables, and a variance–covariance matrix S of dimensions n × n that contains variances for each column along the main diagonal and pair-wise column covariances values elsewhere (Manly, 2005).
| (1) |
When applied to m = 50 points in n = 2 dimensions, the calculated D2 values follow a characteristic elliptical pattern with D2 radiating out from the central location of the distribution (Fig. 1A).
Figure 1. Two-dimensional example of Mahalanobis distance.
(A) Given a set of points distributed in two-dimensional space, the Mahalanobis distances (D2) for each point can be calculated. (B) The D2 values can be transformed into probabilities using a chi-squared cumulative probability distribution. This highlights that there are two points points that have a very high probability of not belonging to the distribution and could be classified as outliers when α = 0.975 as they are beyond that probability threshold.
While D2 can be calculated for any n-dimensions, the values of D2 are not comparable when n varies, as D2 increases as n increases (Figs. 2A–2C). However, the and S−1 used to calculate D2 (Eq. (1)) transform the values from each n into independent standard normal distributions, with centering and S−1 scaling and rotating each variable distribution. This means that D2 is essentially the sum of n independent standard normal variables, and as such follows a chi-squared distribution with degrees of freedom equal to the number of dimensions n (Manly, 2005).
Figure 2. Mahalanobis distance (D2) dimensionality effects using data randomly generated from independent standard normal distributions.
We can see that the values of D2 grow following a chi-squared distribution as a function of the number of dimensions (A) n = 2, (B) n = 4, and (C) n = 8. By using a chi-squared cumulative probability distribution the D2 values can be put on a common scale, such that all values range 0–1 and so that statistics such as the median are consistent across dimensions of (D) n = 2, (E) n = 4, and (F) n = 8. The near uniform distribution of the probability values is expected given the underlying data were randomly generated from independent standard normal distributions.
For a chi-squared random variable with n degrees of freedom denoted as , the probability density function f of when x ≥ 0 is
| (2) |
where Γ is a gamma function (as the chi- squared distribution is actually a special case of the gamma distribution), and the associated cumulative distribution function is (Johnson, Kotz & Balakrishnan, 1994). When the variables in A are normally distributed, the association between D2 and can be clearly seen (Figs. 2A–2C).
By converting D2 into probabilities using we can put D2 from any number of dimensions on a common 0–1 scale that indicates the probability that a location has a D2 that is greater than that would be expected by chance (Figs. 2D–2F). By specifying a significance level α this process is commonly used as an outlier detection method as it is: parameter free, computational efficient, accounts for collinearity between variable dimensions, and is scale independent (Aggarwal, 2017). Returning to our earlier example, having transformed the values from D2 to we can see that there are two points that are very likely to be outliers, and would be classified as outliers with α = 0.975 (Fig. 1B).
The potential of D2 for use in habitat modelling was first identified by Clark, Dunn & Smith (1993) and then discussed further in the context of niche modelling by Farber & Kadmon (2003)—which is how I will continue the discussion. The premise here is that given a data matrix A of m species observations for which various environmental variables n are measured, D2 can be used as a measure of niche suitability from an optimum location in environmental space. Having defined a niche in this way, by measuring the D2 for each location on a landscape a map of niche suitability can then be produced. The key advantages of using D2 over other methods are that the D2 method needs only presence information, and so does not require either absences or a background definition, and that independence of explanatory variables is not required (Clark, Dunn & Smith, 1993; Farber & Kadmon, 2003). Studies that have compared Mahalanobis distance to other modelling approaches have also shown that while the optimum method tends to vary by the species in question, the Mahalanobis distance approach performs well against a variety of other presence-only, presence-background, and presence-absence modelling approaches (Dettmers, Buehler & Bartlett, 2002; Johnson & Gillingham, 2005; Tsoar et al., 2007).
While introducing D2 into the ecological modelling domain Clark, Dunn & Smith (1993) also highlighted that D2 will follow a chi-squared distribution, and therefore the potential to convert the distances into probabilities. However, as ecological niche models aim to describe probability of belonging to the niche, rather than use the chi-squared cumulative distribution function that has 0 at the optimum, we use the inverse chi-squared cumulative distribution function that has 1 at the optimum. This means that the probabilities indicate locations with a D2 that is less than that would be expected by chance, and hence are more likely to be within the niche.
Unfortunately, when describing the use of a chi-squared distribution to convert D2 into probabilities, Clark, Dunn & Smith (1993) p.522 state that “Assuming multivariate normality, Mahalanobis distances are approximately distributed as Chi-square with n − 1 degrees of freedom, where n equals the number of habitat characters.”, but this is incorrect. D2 values follow a chi-squared distribution with degrees of freedom equal to n (Manly, 2005) as has already been clearly shown (Fig. 2). This error has been repeatedly described in the literature (Knick & Rotenberry, 1998; Farber & Kadmon, 2003; Hellgren et al., 2007), and has even permeated into software such as the R package adehabitat (Calenge, 2006).
To demonstrate this error and to examine its implications, I present an experiment based on a virtual ecology approach (Zurell et al., 2010)—which allows us to examine methodologies in a controlled system uncomplicated by the uncertainties of the real world!
Materials and Methods
The virtual ecology experiment began by defining, and therefore knowing truthfully, the fundamental niche of the imaginary species Mimbulus mimbletonia (Rowling, 2003). This fundamental niche N was defined using a multivariate normal distribution describing the niche in relation to two environmental variables of temperature and rainfall.
| (3) |
A sample of observations was then created by randomly sampling the niche space ranging 15–35 °C of temperature and 0–200 mm of rainfall. At each randomly selected location the species was considered detected using a probability equal to the fundamental niche. This process was continued until a sample of 200 observations was generated.
The values of temperature and rainfall at each of the 200 sampled locations were then used to estimate the fundamental niche using Mahalanobis distances. The D2 was calculated for each of the sampled locations, and the resulting distribution of D2 was compared against a range of inverse cumulative chi-squared distribution functions based on differing degrees of freedom.
Finally, the D2 values for the samples were converted into probabilities of belonging to the fundamental niche, and these predictions were compared against the known fundamental niche values for the locations. Probabilities were calculated using chi-squared distributions with n and n − 1 degrees of freedom to examine any differences.
Results
The fundamental niche defined using the multivariate normal distribution (Eq. (3)) produced an elliptically shape niche with positive correlation between rainfall and temperature (Fig. 3A). The random sampling resulted in a set of 200 samples that followed this elliptical niche pattern, with a greater concentration of samples towards the centre of the niche (Fig. 3B).
Figure 3. Virtual ecology experiment to examine the importance of chi-squared distribution degrees of freedom.
(A) The virtual fundamental niche with (B) random sampling across niche space and (C) resulting D2 calculations. (D) The trend of D2 compared against the inverse chi-squared cumulative probability distributions with differing degrees of freedom n. The trend for the sampling locations between the actual fundamental niche values and the niche estimates from D2 values converted to probabilities via an inverse cumulative probability distribution with degrees of freedom equal to (E) n = 2 and (F) n = 1 (or n = 2 − 1).
The D2 values calculated on the basis of these samples also followed an elliptical pattern (Fig. 3C), and when the calculated D2 values for each sample were plotted against the actual fundamental niche value for each sample, a trend that clearly follows the inverse cumulative chi-squared distribution when n = 2 can be seen (Fig. 3D). This is as we would expect as in this example the fundamental niche is based on the two environmental variables of temperature and rainfall.
When the D2 values were converted to probabilities using n = 2 degrees of freedom we see a near-perfect linear fit between the estimated niche suitability and the actual known niche suitability (Fig. 3E). This proves quite clearly that to get a truthful estimate of our known fundamental niche, the probabilities need to be based on an inverse cumulative chi-squared distribution with n degrees of freedom. In contrast, when the D2 values are converted to probabilities using n − 1 degrees of freedom we see a badly-fitting curvilinear relationship that underestimates niche suitability (Fig. 3F).
Discussion
The results of the virtual ecology experiment (Fig. 3) clearly shows the erroneous under prediction of niche suitability when using a chi-squared distribution with n − 1 degrees of freedom. Fortunately most previous D2 studies simply rescaled the D2 into quantiles or ranks of increasing suitability (Knick & Dyer, 1997; Knick & Rotenberry, 1998; Johnson & Gillingham, 2005; Hellgren et al., 2007; Etherington et al., 2009) or binary classifications based on a threshold (Farber & Kadmon, 2003; Thatcher, Van Manen & Clark, 2006; Tsoar et al., 2007) and so are not affected by this problem. Also, although I could not find any examples of this, those studies that did create chi-square probabilities using n − 1 degrees of freedom, but then converted to categories based on quantiles or a predictive threshold would not have impacted the conclusions of the study as while the shape of the relationship becomes curved with n − 1, the trend is still one of monotonic increase, therefore the quantiles would be the same. However, those studies that have created chi-squared probabilities with n − 1 as some form of suitability index (Clark, Dunn & Smith, 1993) will have underestimated the niche suitability.
Conclusion
Given that D2 values are unitless and unbounded and are not directly comparable for different dimensions, I would argue that anyone using the Mahalanobis distance method should present their results as chi-squared probabilities. This will put Mahalanobis distance models on a 0–1 scale that enables models based on differing numbers of n to be directly comparable, and is consistent with most other types of ecological niche models that also use a 0–1 scale. As such, it will be very important that the chi-square probabilities are calculated correctly and hopefully the methodological description and experimental evidence presented here will enable that to be achieved.
Supplemental Information
Each R script will generate one of the figures used as examples of Mahalanobis distance behaviour within the manuscript.
Funding Statement
This research was funded by internal investment by Manaaki Whenua—Landcare Research. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Additional Information and Declarations
Competing Interests
Thomas R. Etherington is employed by Manaaki Whenua—Landcare Research, and declares that there are no competing interests.
Author Contributions
Thomas R. Etherington conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, approved the final draft.
Data Availability
The following information was supplied regarding data availability:
R scripts to reproduce the examples, virtual experiments, and figures are available as a Supplemental File.
References
- Aggarwal (2017).Aggarwal CC. Outlier analysis. 2nd edition Springer; Cham: 2017. [Google Scholar]
- Calenge (2006).Calenge C. The package “adehabitat” for the R software: a tool for the analysis of space and habitat use by animals. Ecological Modelling. 2006;197(3–4):516–519. doi: 10.1016/j.ecolmodel.2006.03.017. [DOI] [Google Scholar]
- Clark, Dunn & Smith (1993).Clark JD, Dunn JE, Smith KG. A multivariate model of female black bear habitat use for a geographic information system. Journal of Wildlife Management. 1993;57(3):519–526. doi: 10.2307/3809276. [DOI] [Google Scholar]
- Dettmers, Buehler & Bartlett (2002).Dettmers R, Buehler DA, Bartlett JB. A test and comparison of wildlife-habitat modeling techniques for predicting bird occurrence at a regional scale. In: Scott JM, Heglund PJ, Morrison ML, Haufler JB, Raphael MG, Wall WA, Samson FB, editors. Predicting species occurrences: issues of accuracy and scale. Island Press; Washington: 2002. pp. 607–616. [Google Scholar]
- Etherington et al. (2009).Etherington TR, Ward AI, Smith GC, Pietravalle S, Wilson GJ. Using the Mahalanobis distance statistic with unplanned presence-only survey data for biogeographical models of species distribution and abundance: a case study of badger setts. Journal of Biogeography. 2009;36(5):845–853. doi: 10.1111/j.1365-2699.2008.02041.x. [DOI] [Google Scholar]
- Farber & Kadmon (2003).Farber O, Kadmon R. Assessment of alternative approaches for bioclimatic modeling with special emphasis on the Mahalanobis distance. Ecological Modelling. 2003;160:115–130. doi: 10.1016/S0304-3800(02)00327-7. [DOI] [Google Scholar]
- Hellgren et al. (2007).Hellgren EC, Bales SL, Gregory MS, Leslie DM, Clark JD. Testing a Mahalanobis distance model of black bear habitat use in the Ouachita Mountains of Oklahoma. Journal of Wildlife Management. 2007;71(3):924–928. doi: 10.2193/2006-031. [DOI] [Google Scholar]
- Johnson & Gillingham (2005).Johnson CJ, Gillingham MP. An evaluation of mapped species distribution models used for conservation planning. Environmental Conservation. 2005;32(2):117–128. doi: 10.1017/s0376892905002171. [DOI] [Google Scholar]
- Johnson, Kotz & Balakrishnan (1994).Johnson NL, Kotz S, Balakrishnan N. Continuous probability distributions: volume 1. 2nd edition John Wiley & Sons; New York: 1994. [Google Scholar]
- Knick & Dyer (1997).Knick ST, Dyer DL. Distribution of black-tailed jackrabbit habitat determined by GIS in southwestern Idaho. Journal of Wildlife Management. 1997;61(1):75–85. doi: 10.2307/3802416. [DOI] [Google Scholar]
- Knick & Rotenberry (1998).Knick ST, Rotenberry JT. Limitations to mapping habitat use areas in changing landscapes using the Mahalanobis distance statistic. Journal of Agricultural, Biological, and Environmental Statistics. 1998;3(3):311–322. doi: 10.2307/1400585. [DOI] [Google Scholar]
- Mahalanobis (1936).Mahalanobis PC. On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India. 1936;2(1):49–55. [Google Scholar]
- Manly (2005).Manly BFJ. Multivariate statistical methods: a primer. 3rd edition Chapman & Hall/CRC Press; Boca Raton: 2005. [Google Scholar]
- Rowling (2003).Rowling JK. Harry Potter and the Order of the Phoenix. Bloomsbury; London: 2003. [Google Scholar]
- Thatcher, Van Manen & Clark (2006).Thatcher CA, Van Manen FT, Clark JD. Identifying suitable sites for Florida panther reintroduction. Journal of Wildlife Management. 2006;70(3):752–763. doi: 10.2193/0022-541X(2006)70[752:ISSFFP]2.0.CO;2. [DOI] [Google Scholar]
- Tsoar et al. (2007).Tsoar A, Allouche O, Steinitz O, Rotem D, Kadmon R. A comparative evaluation of presence-only methods for modelling species distribution. Diversity and Distributions. 2007;13(4):397–405. doi: 10.1111/j.1472-4642.2007.00346.x. [DOI] [Google Scholar]
- Zurell et al. (2010).Zurell D, Berger U, Cabral JS, Jeltsch F, Meynard CN, Münkemüller T, Nehrbass N, Pagel J, Reineking B, Schröder B, Grimm V. The virtual ecologist approach: simulating data and observers. Oikos. 2010;119(4):622–635. doi: 10.1111/j.1600-0706.2009.18284.x. [DOI] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Each R script will generate one of the figures used as examples of Mahalanobis distance behaviour within the manuscript.
Data Availability Statement
The following information was supplied regarding data availability:
R scripts to reproduce the examples, virtual experiments, and figures are available as a Supplemental File.



