Abstract
Pearson and Spearman correlations are important tools for all scientific fields and are widely used in medical sciences, especially during the current COVID-19 pandemic emergency. This technical report shows that conventional criteria for evaluating the adoption of these coefficients can conceal substantial scientific information regarding correlations that occur above or below a certain threshold. In particular, the Pearson coefficient can reveal hidden correlations even when data are not normally distributed. Finally, an optimized operational guide to reveal any hidden correlation is reported.
Keywords: correlation, statistics, medical physics, medical statistics, Pearson, Spearman
Introduction
The search for statistical correlations between two data distributions constitutes one of the fundamental elements of scientific research [1-4]. Particularly in the fields of public health, social sciences, infoveillance, and epidemiology, these can provide important information on risk perception and the spread of viruses and bacteria [5-8]. The two most frequently used correlation indices are those of Pearson and Spearman: the first one measures the linear relationship between two continuous random variables and is adopted when the data follow a normal distribution, while the second one measures any monotonic relationship between two continuous random variables and is adopted when the data do not follow a normal distribution; both range from -1 to 1 [1-4]. A correlation (ρ) is often defined in medicine as very strong (|ρ| > 0.7), moderate (0.5 < |ρ| ≤ 0.7), fair (0.3 ≤ |ρ| ≤ 0.5), or poor (|ρ| < 0.3) [3]. Nonetheless, it is customary to evaluate its significance based not only on the ρ value itself but also on the associated p-value [3]. The first problem in this approach is precisely the meaning of the p-value: some authors believe that exceeding the significance threshold α implies the immediate acceptance of the null hypothesis [9]; others assert that the p-value should be used as an index of the evidence found against the null hypothesis [10-12], and others conclude that the p-value in itself does not provide any information on the validity of the model used [13-14]. However, all cited authors agree that the mere violation of the significance threshold is not a criterion for the rejection of a statistical relationship. Therefore, in this paper, no significance threshold has been fixed, and p-values (p) and Pearson (R) and Spearman (r) coefficients were used to evaluate the statistical significance and the strength of the correlations analyzed.
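For reference, the two coefficients can be written in their standard textbook form. This is a sketch and not a formula taken from this report: R and r follow the notation used here, while n (the number of pairs), x_i, y_i, the sample means, and the rank operator are symbols assumed for illustration.

```latex
R = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
r = R\bigl(\operatorname{rank}(x),\ \operatorname{rank}(y)\bigr)
```

In words, Spearman's r is Pearson's R computed on the ranks of the data (tied values sharing their average rank), which is why r depends only on the ordering of the values and not on their magnitude.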
Technical report
Is it okay to avoid the Pearson coefficient when the data are not normally distributed?
There are various methods to evaluate whether a data series is normally distributed: some finer, such as the Shapiro-Wilk test, others coarser, such as the standard errors-test for kurtosis and skewness [15-16]. In this paper, the latter has been used together with the graphical representation of every distribution. In Table 1, despite data not being normally distributed in most cases, we can see how the Pearson coefficient is able to highlight monotonic trends. Clearly, in the proposed situation, the Spearman coefficient is more appropriate since it perfectly detects this relationship. On the other hand, this shows that it is wrong to state that Pearson's coefficient is only useful when data are normally distributed, although it remains true that it would not be able to identify certain non-linear correlations.
Table 1. Comparison between Pearson and Spearman correlations on data distributions generated by monotonic polynomial functions.
K-test = Kurtosis test, S-test = Skewness test, R = Pearson’s correlation value, r = Spearman’s correlation value, Δ% = Percentage difference between R and r
| | x | x^2 | x^3 | x^4 | x^5 | x^6 | x^7 | x^8 | x^9 | x^10 |
| | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| | 3 | 9 | 27 | 81 | 243 | 729 | 2187 | 6561 | 19683 | 59049 |
| | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 | 65536 | 2.62E+05 | 1.05E+06 |
| | 5 | 25 | 125 | 625 | 3125 | 15625 | 78125 | 3.91E+05 | 1.95E+06 | 9.77E+06 |
| | 6 | 36 | 216 | 1296 | 7776 | 46656 | 2.80E+05 | 1.68E+06 | 1.01E+07 | 6.05E+07 |
| | 7 | 49 | 343 | 2401 | 16807 | 1.18E+05 | 8.24E+05 | 5.76E+06 | 4.04E+07 | 2.82E+08 |
| | 8 | 64 | 512 | 4096 | 32768 | 2.62E+05 | 2.10E+06 | 1.68E+07 | 1.34E+08 | 1.07E+09 |
| | 9 | 81 | 729 | 6561 | 59049 | 5.31E+05 | 4.78E+06 | 4.30E+07 | 3.87E+08 | 3.49E+09 |
| | 10 | 100 | 1000 | 10000 | 1.00E+05 | 1.00E+06 | 1.00E+07 | 1.00E+08 | 1.00E+09 | 1.00E+10 |
| K-test | -0.77 | -0.48 | 0.20 | 0.96 | 1.71 | 2.40 | 3.02 | 3.57 | 4.05 | 4.46 |
| S-test | 0 | 0.87 | 1.47 | 1.92 | 2.28 | 2.58 | 2.82 | 3.03 | 3.21 | 3.35 |
| R | 1 | 0.97 | 0.93 | 0.88 | 0.84 | 0.80 | 0.77 | 0.74 | 0.72 | 0.70 |
| r | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Δ% | 0 | 2.61 | 7.71 | 13.42 | 19.13 | 24.64 | 29.88 | 34.80 | 39.42 | 43.73 |
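The R and r rows of Table 1 can be recomputed with a short sketch. The Python code below is illustrative and is not the author's code: it uses only the standard library, the helper names `pearson`, `ranks`, and `spearman` are hypothetical, and Δ% is assumed to be 100·(r − R)/R, a definition inferred from the tabulated values rather than stated in the text.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's R: centered cross-product scaled by both spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(x, y):
    """Spearman's r: Pearson's R computed on the ranks."""
    return pearson(ranks(x), ranks(y))

x = list(range(1, 11))
results = {}
for k in range(1, 11):
    y = [v ** k for v in x]          # monotonic polynomial x^k
    R, r = pearson(x, y), spearman(x, y)
    results[k] = (R, r)
    # Assumed definition of the percentage difference: 100 * (r - R) / R
    print(f"x^{k}: R = {R:.2f}, r = {r:.2f}, delta% = {100 * (r - R) / R:.2f}")
```

Because every series is strictly increasing, the ranks of x and x^k coincide and r stays at 1, while R decreases as the curve departs from a straight line.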
However, there is a more relevant aspect to discuss. In Figure 1, we can observe a peculiar statistical phenomenon, that is, a sequence of monotonic correlations that occur only when a certain threshold is exceeded. In this specific case, although the data are not normally distributed, Pearson's coefficient is even more effective than Spearman's, for it gives more weight to higher values (Figure 1).
Figure 1. Comparison between the Pearson and Spearman coefficients in data distributions that show correlations beyond a specific threshold.
R = Pearson’s correlation value; r = Spearman’s correlation value, K-test = Kurtosis test, S-test = Skewness test
In these examples, the hidden correlations are visible to the naked eye. However, when dealing with hundreds of distributions, it is not always possible to graph each data series. In such cases, comparing the two coefficients can be an effective and efficient way to reveal such hidden phenomena. When Pearson's R is larger and more significant than Spearman's r, it is important to interpret this as a signal of plausible correlations.
This method works with all monotonic correlations, provided that all correlated values are greater than those preceding the threshold. Similarly, the same rules also apply to correlations that occur below certain thresholds.
How to discover correlations hidden in large data variability?
It remains an open question how to proceed when the correlated values are lower than the unrelated values. A quick but rough method is to calculate, for each hypothetically dependent value k ≠ 0, the quantity 1/k and repeat the analysis. Although the nature of the correlation is distorted, in doing so, it is possible to signal the presence of a local monotonic relationship between the two variables (Figure 2).
Figure 2. Hidden correlations revealed by the Pearson and Spearman coefficients through the reciprocal 1/k of the values k.
K-test = Kurtosis test, S-test = Skewness test, R = Pearson’s correlation value, r = Spearman’s correlation value
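The reciprocal transformation can be sketched as follows. The data are synthetic and invented purely for illustration (they are not the data behind Figure 2): five unrelated pairs with large y values are interleaved with five pairs whose small y values halve at each step. The helper names are hypothetical. For strictly positive values, 1/k reverses the ordering, so Spearman's r only changes sign; the gain, if any, comes from Pearson's R.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's R: centered cross-product scaled by both spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(x, y):
    """Spearman's r: Pearson's R computed on the ranks."""
    return pearson(ranks(x), ranks(y))

def reciprocal(values):
    """Return 1/k for each value k; the method requires k != 0."""
    if any(k == 0 for k in values):
        raise ValueError("reciprocal transform is undefined for k = 0")
    return [1 / k for k in values]

# Synthetic data (assumption): odd x carry large unrelated y values,
# even x carry small y values that decrease monotonically with x.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [60, 1, 90, 0.5, 50, 0.25, 80, 0.125, 70, 0.0625]
z = reciprocal(y)

R_raw, r_raw = pearson(x, y), spearman(x, y)
R_rec, r_rec = pearson(x, z), spearman(x, z)
print(f"(x, y):   R = {R_raw:.2f}, r = {r_raw:.2f}")
print(f"(x, 1/y): R = {R_rec:.2f}, r = {r_rec:.2f}")
```

On these data, Pearson's R is weak on (x, y) and becomes clearly larger in magnitude on (x, 1/y), because the small correlated values dominate after the transformation.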
Nevertheless, this method is ineffective when correlated values have the same magnitude as unrelated values (Figure 3).
Figure 3. Hidden correlations not revealed by the Pearson and Spearman coefficients.
K-test = Kurtosis test, S-test = Skewness test, R = Pearson’s correlation value, r = Spearman’s correlation value
Thus, when even the latter procedure cannot identify any hidden correlations, finer methods could be used. For example, hidden correlations could be revealed via cyclic iterations: supposing we have two data distributions whose pairs are indexed from 1 to n, we first compute the correlation over all the pairs; then, we redo the calculation over pairs 1 to n-1, 1 to n-2, and so on. After that, the whole operation is repeated starting from pair 2 (i.e., we compute the correlation over pairs 2 to n, then 2 to n-1, and so on).
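A minimal sketch of this cyclic-iterative search is given below, under stated assumptions: the function name `cyclic_search`, the `min_size` parameter (a lower bound on subset length, added here because very short subsets correlate trivially), and the synthetic data are all hypothetical and do not come from the report. Only Pearson's R is computed, to keep the example short.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's R: centered cross-product scaled by both spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def cyclic_search(x, y, min_size=5):
    """Correlate every contiguous run of pairs with at least min_size pairs.

    Returns (first_pair, last_pair, R) tuples, 1-based as in the text,
    sorted by decreasing |R|.
    """
    n = len(x)
    found = []
    for start in range(n):                       # pair 1, then pair 2, ...
        for end in range(n, start + min_size - 1, -1):   # n, n-1, n-2, ...
            xs, ys = x[start:end], y[start:end]
            if len(set(xs)) < 2 or len(set(ys)) < 2:
                continue                         # R is undefined on constant data
            found.append((start + 1, end, pearson(xs, ys)))
    return sorted(found, key=lambda t: abs(t[2]), reverse=True)

# Synthetic data (assumption): pairs 7..11 are perfectly linear, and their
# y values have the same magnitude as the unrelated values in pairs 1..6.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
y = [5, 1, 4, 2, 6, 3, 1, 2, 3, 4, 5]

best = cyclic_search(x, y)[0]
print(f"strongest subset: pairs {best[0]} to {best[1]}, R = {best[2]:.2f}")
```

The number of subsets grows quadratically with n, and scanning many subsets inflates the chance of spurious strong correlations, so the output is better read as a list of candidates for inspection than as a result in itself.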
A real example
In the early stages of the COVID-19 epidemic, the web interest of Italian netizens in the novel coronavirus was correlated with the number of cases per region above a certain threshold as shown in Table 2.
Table 2. Italian netizens' web interest in COVID-19 during the early stages of the pandemic in Italy (from February 20 to February 25, 2020): correlations between COVID-19 cases and web interest and between the number of medical swabs and web interest.
RSV = Relative search volume
| | Region | Coronavirus RSV | COVID-19 Total Cases | Medical Swabs |
| 1 | Abruzzo | 61 | 0 | 5 |
| 2 | Basilicata | 65 | 0 | 0 |
| 3 | Calabria | 63 | 0 | 2 |
| 4 | Campania | 67 | 0 | 10 |
| 5 | Emilia-Romagna | 84 | 26 | 391 |
| 6 | Friuli-Venezia Giulia | 66 | 0 | 89 |
| 7 | Lazio | 60 | 3 | 124 |
| 8 | Liguria | 68 | 1 | 39 |
| 9 | Lombardy | 100 | 240 | 3700 |
| 10 | Marche | 71 | 0 | 21 |
| 11 | Molise | 57 | 0 | 0 |
| 12 | PA Trentino-Alto Adige/South Tyrol | 60 | 1 | 4 |
| 13 | Piedmont | 82 | 3 | 141 |
| 14 | Puglia | 60 | 0 | 0 |
| 15 | Sardinia | 45 | 0 | 1 |
| 16 | Sicily | 54 | 3 | 5 |
| 17 | Tuscany | 69 | 2 | 296 |
| 18 | Umbria | 73 | 0 | 8 |
| 19 | Valle d'Aosta | 77 | 0 | 7 |
| 20 | Veneto | 79 | 43 | 3780 |
| | Kurtosis-test | 1.25 | 16.58 | 6.27 |
| | Skewness-test | 1.35 | 7.67 | 5.19 |
| | Pearson R (p-value) | | .68 (.001) | .63 (.003) |
| | Spearman r (p-value) | | .41 (.07) | .75 (.0002) |
In this case, despite the data not being normally distributed, the use of Spearman's correlation alone would not have highlighted the first correlation, which is most likely of a causal nature. The data were collected using the Google Trends tool and the website of the Italian Civil Protection Department (URL: https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/b0c68bce2cce478eaac82fe38d4138b1).
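The coefficients in Table 2 can be checked with a short sketch that uses the tabulated values directly. The code is illustrative, relies only on the Python standard library, and uses hypothetical helper names; p-values are omitted because they would require a t-distribution routine that the standard library does not provide.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's R: centered cross-product scaled by both spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(x, y):
    """Spearman's r: Pearson's R computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Values copied from Table 2, in the same region order (Abruzzo ... Veneto).
rsv   = [61, 65, 63, 67, 84, 66, 60, 68, 100, 71,
         57, 60, 82, 60, 45, 54, 69, 73, 77, 79]
cases = [0, 0, 0, 0, 26, 0, 3, 1, 240, 0,
         0, 1, 3, 0, 0, 3, 2, 0, 0, 43]
swabs = [5, 0, 2, 10, 391, 89, 124, 39, 3700, 21,
         0, 4, 141, 0, 1, 5, 296, 8, 7, 3780]

R_cases, r_cases = pearson(cases, rsv), spearman(cases, rsv)
R_swabs, r_swabs = pearson(swabs, rsv), spearman(swabs, rsv)
print(f"cases vs RSV: R = {R_cases:.2f}, r = {r_cases:.2f}")
print(f"swabs vs RSV: R = {R_swabs:.2f}, r = {r_swabs:.2f}")
```

For cases versus RSV, the eleven regions with zero cases share a single average rank, which dilutes Spearman's r, whereas Pearson's R is driven by the few regions with many cases.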
Discussion
The aim of this technical report is to provide a guide for the appropriate use of the Pearson and Spearman correlation coefficients, showing that data (non-)normality should not be the sole criterion for their adoption or rejection. Indeed, phenomena capable of manifesting and correlating above a certain threshold are known in the literature [7,17-19]. This paper shows that Pearson’s coefficient can reveal such hidden phenomena even when statistical tests suggest that the data are not normally distributed. Furthermore, the simultaneous use of both correlations makes it possible to compensate for some potential failures of normality tests. In fact, the kurtosis and skewness standard errors-test is efficient but can be inaccurate, while the Shapiro-Wilk test is more reliable but also operationally complex [15]. Therefore, if we have a pronounced Pearson correlation and a weak Spearman correlation, there may be a “correlation threshold”, i.e., we need to investigate further. If we have a weak Pearson correlation and a pronounced Spearman correlation, the relationship is likely to exist, but we must make sure that the data are not normally distributed; otherwise, further investigation is required. If both Pearson and Spearman correlations are pronounced, the correlation holds. Finally, if both correlations are weak, we need to recalculate them using the reciprocal of the hypothetically dependent variable to unmask any other possible hidden correlation. That said, the only truly comprehensive method for detecting hidden correlations is the cyclical search for correlations between data subsets: all the methods listed above can fail when the hidden correlations involve values similar in magnitude to the non-correlated ones. Nevertheless, these procedures can drastically reduce the amount of data to which the cyclic-iterative method must be applied.
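The four cases of this guide can be summarized in a small lookup function. This is only a sketch: the report does not define a numerical boundary between "weak" and "pronounced", so the `cutoff` of 0.5 is an assumption made for illustration, as are the function name `interpret` and the wording of the returned messages.

```python
def interpret(R, r, cutoff=0.5):
    """Map a (Pearson R, Spearman r) pair to the suggested next step.

    cutoff is a hypothetical boundary between 'weak' and 'pronounced'.
    """
    strong_R = abs(R) >= cutoff
    strong_r = abs(r) >= cutoff
    if strong_R and strong_r:
        return "correlation holds"
    if strong_R:
        return "possible correlation threshold: investigate further"
    if strong_r:
        return "relationship likely: confirm non-normality, otherwise investigate further"
    return "recompute on (X, 1/Y); if still weak, consider the cyclic-iterative search"

# Cases vs. web interest from Table 2: R = .68, r = .41
print(interpret(0.68, 0.41))
```

With the values reported for cases versus web interest, the function points to a possible threshold effect, which matches the reading given in the real example.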
Some plausible scenarios in which it is legitimate to expect hidden causal correlations are: i) the effects of air pollution, where exceeding specific thresholds can cause an increase in population mortality and disease due to an impairment of the immune and respiratory systems [5,18], ii) the levels of interest, stress, and anxiety among the population, which can reach high values when negative news exceeds a particular number [7], iii) the spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), in which exceeding a certain value of the population density could play an important role [6,18,20], and iv) particulate matter, which could act as a virus carrier, especially beyond specific thresholds [17-18].
Conclusions
When data distributions are numerous, it is always recommended to calculate both the Pearson and Spearman correlations. To highlight hidden correlations on continuous data (X, Y), it is also important to recalculate both correlations on data (X, 1/Y). When even this procedure is not able to detect any hidden correlation and there are valid reasons to support its existence, it is necessary to resort to cyclic-iterative methods.
The authors have declared that no competing interests exist.
Human Ethics
Consent was obtained by all participants in this study
Animal Ethics
Animal subjects: All authors have confirmed that this study did not involve animal subjects or tissue.
References
- 1. A guide to appropriate use of correlation coefficient in medical research. Mukaka MM. Malawi Med J. 2012;24:69–71. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3576830/
- 2. Comparing the Pearson and Spearman correlation coefficients across distributions and sample sizes: a tutorial using simulations and empirical data. De Winter JCF, Gosling SD, Potter J. Psychol Methods. 2016;21:273–290. doi: 10.1037/met0000079.
- 3. User's guide to correlation coefficients. Akoglu H. Turk J Emerg Med. 2018;18:91–93. doi: 10.1016/j.tjem.2018.08.001.
- 4. Correlation coefficients: appropriate use and interpretation. Schober P, Boer C, Schwarte LA. Anesth Analg. 2018;126:1763–1768. doi: 10.1213/ANE.0000000000002864.
- 5. Assessing the relationship between surface levels of PM2.5 and PM10 particulate matter impact on COVID-19 in Milan, Italy. Zoran AM, Savastru SR, Savastru DM, Tautan MN. Sci Total Environ. 2020;738:139825. doi: 10.1016/j.scitotenv.2020.139825.
- 6. Correlation between COVID-19 morbidity and mortality rates in Japan and local population density, temperature, and absolute humidity. Kodera S, Rashed AE, Hirata A. Int J Environ Res Public Health. 2020;17:5477. doi: 10.3390/ijerph17155477.
- 7. The Impact of COVID-19 on Italian web users: a quantitative analysis of regional hygiene interest and emotional response. Rovetta A, Castaldo L. Cureus. 2020;12:0. doi: 10.7759/cureus.10719.
- 8. Study of bacterial community composition and correlation of environmental variables in Rambla Salada, a hypersaline environment in South-Eastern Spain. Oueriaghli N, Castro JD, Llamas I, Béjar V, Martínez-Checa F. Front Microbiol. 2018;21:1377. doi: 10.3389/fmicb.2018.01377.
- 9. Hypothesis testing, type I and type II errors. Banerjee A, Chitnis UB, Jadhav LS, Bhawalkar JS, Chaudhury S. Ind Psychiatry J. 2009;18:127–131. doi: 10.4103/0972-6748.62274.
- 10. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, Altman DG. Eur J Epidemiol. 2016;31:337–350. doi: 10.1007/s10654-016-0149-3.
- 11. The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research. Amrhein V, Korner-Nievergelt F, Roth T. PeerJ. 2017;7:0. doi: 10.7717/peerj.3544.
- 12. Statistical significance: p value, 0.05 threshold, and applications to radiomics-reasons for a conservative approach. Di Leo G, Sardanelli F. Eur Radiol Exp. 2020;4:18. doi: 10.1186/s41747-020-0145-y.
- 13. P-value: what is and what is not. Tanha K, Mohammadi N, Janani L. Med J Islam Repub Iran. 2017;31:65. doi: 10.14196/mjiri.31.65.
- 14. The debate about p-values. Lu Y, Belitskaya-Levy I. Shanghai Arch Psychiatry. 2015;27:381–385. doi: 10.11919/j.issn.1002-0829.216027. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4858512/
- 15. Problematic standard errors and confidence intervals for skewness and kurtosis. Wright DB, Herrington JA. Behav Res Methods. 2011;43:8–17. doi: 10.3758/s13428-010-0044-x.
- 16. Analysis of skewness and kurtosis, real statistics using Excel. 2016. http://www.real-statistics.com/tests-normality-and-symmetry/analysis-skewness-kurtosis/ Accessed October 2020.
- 17. Potential role of particulate matter in the spreading of COVID-19 in Northern Italy: first observational study based on initial epidemic diffusion. Setti L, Passarini F, De Gennaro G, et al. BMJ Open. 2020;10:0. doi: 10.1136/bmjopen-2020-039338.
- 18. Relationships between demographic, geographic, and environmental statistics and the spread of novel coronavirus disease (COVID-19) in Italy [PREPRINT]. Rovetta A, Castaldo L. Cureus. 2020;12:0. doi: 10.7759/cureus.11397.
- 19. Photoelectric effect. 2020. https://www.khanacademy.org/science/ap-chemistry/electronic-structure-of-atoms-ap/bohr-model-hydrogen-ap/a/photoelectric-effect Accessed November 2020.
- 20. Population density, a factor in the spread of COVID-19 in Algeria: statistic study. Kadi N, Khelfaoui M. Bull Natl Res Cent. 2020;44:138. doi: 10.1186/s42269-020-00393-x.



