Source Code for Biology and Medicine. 2019 Dec 20;14:6. doi: 10.1186/s13029-019-0076-2

Computing and graphing probability values of Pearson distributions: a SAS/IML macro

Qing Yang 1, Xinming An 2, Wei Pan 1
PMCID: PMC6923921  PMID: 31889995

Abstract

Background

Any empirical data can be approximated by one of the Pearson distributions using the first four moments of the data (Elderton WP, Johnson NL. Systems of Frequency Curves. 1969; Pearson K. Philos Trans R Soc Lond Ser A. 186:343–414 1895; Solomon H, Stephens MA. J Am Stat Assoc. 73(361):153–60 1978). Thus, Pearson distributions make statistical analysis possible for data with unknown distributions. There are both older printed tables (Pearson ES, Hartley HO. Biometrika Tables for Statisticians, vol. II. 1972) and contemporary computer programs (Amos DE, Daniel SL. Tables of percentage points of standardized Pearson distributions. 1971; Bouver H, Bargmann RE. Tables of the standardized percentage points of the Pearson system of curves in terms of β1 and β2. 1974; Bowman KO, Shenton LR. Biometrika. 66(1):147–51 1979; Davis CS, Stephens MA. Appl Stat. 32(3):322–7 1983; Pan W. J Stat Softw. 31(Code Snippet 2):1–6 2009) for obtaining percentage points of Pearson distributions corresponding to certain pre-specified percentages (or probability values; e.g., 1.0%, 2.5%, 5.0%). However, they are of little use in statistical analysis because unwieldy second-difference interpolation is required to calculate the probability value of a Pearson distribution corresponding to a given percentage point, such as an observed test statistic in hypothesis testing.

Results

The present study develops a SAS/IML macro program that identifies the appropriate type of Pearson distribution from either an input dataset or the values of the four moments, and then computes and graphs probability values of the Pearson distribution for any given percentage point.

Conclusions

The SAS macro program returns accurate approximations to Pearson distributions and can efficiently help researchers conduct statistical analysis on data with unknown distributions.

Keywords: Pearson distributions, Curve fitting, Distribution-free statistics, Hypothesis testing

Background

Most statistical analysis relies on the assumption of normally distributed data, but this assumption is often difficult to meet in practice. A Pearson distribution can be fitted to any data using the first four moments of the data [1–3]. Thus, Pearson distributions make statistical analysis possible for data with unknown distributions. For instance, in hypothesis testing, the sampling distribution of an observed test statistic is usually unknown, but it can be approximated by one of the Pearson distributions. We can then compute a p-value (or probability value) from the approximated Pearson distribution and use it to make a statistical decision in such distribution-free hypothesis testing.

There are both older printed tables [4] and contemporary computer programs [5–9] that provide a means of obtaining percentage points of Pearson distributions corresponding to certain pre-specified percentages (or probability values; e.g., 1.0%, 2.5%, 5.0%). Unfortunately, they are of little use in statistical analysis because unwieldy second-difference interpolation over both skewness √β1 and kurtosis β2 is required to calculate the probability value of a Pearson distribution corresponding to a given percentage point, such as an observed test statistic in hypothesis testing. Thus, a new program is needed for efficiently computing probability values of Pearson distributions for any given data point, so that researchers can conduct more widely applicable statistical analysis, such as distribution-free hypothesis testing, on data with unknown distributions.

Pearson distributions are a family consisting of seven types of distributions plus the normal distribution (Table 1). To determine the type of Pearson distribution and the parameters of the density function for the chosen type, all we need to know is the first four moments of the data. Let X represent the given data; its mean and its second through fourth central moments are calculated by

$$\mu_1 = E(X); \quad \mu_i = E[X - E(X)]^i = E[X - \mu_1]^i, \quad i = 2, 3, 4. \tag{1}$$

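Table 1. Types of Pearson distributions

| Type | κ-Criterion | Density function | Domain |
|---|---|---|---|
| Main type | | | |
| I | $\kappa < 0$ | $f(x) = y_0 \left(1 + \frac{x}{a_1}\right)^{m_1} \left(1 - \frac{x}{a_2}\right)^{m_2}$ | $-a_1 \le x \le a_2$ |
| IV | $0 < \kappa < 1$ | $f(x) = y_0 \left(1 + \frac{x^2}{a^2}\right)^{-m} e^{-\nu \arctan(x/a)}$ | $-\infty < x < \infty$ |
| VI | $\kappa > 1$ | $f(x) = y_0 (x - a)^{q_2} x^{-q_1}$ | $a \le x < \infty$ |
| Transition type | | | |
| Normal | $\kappa = 0\ (\beta_2 = 3)$ | $f(x) = y_0 e^{-x^2/(2\mu_2)}$ | $-\infty < x < \infty$ |
| II | $\kappa = 0\ (\beta_2 < 3)$ | $f(x) = y_0 \left(1 - \frac{x^2}{a^2}\right)^{m}$ | $-a \le x \le a$ |
| III | $\kappa = \pm\infty$ | $f(x) = y_0 \left(1 + \frac{x}{a}\right)^{\gamma a} e^{-\gamma x}$ | $-a \le x < \infty$ |
| V | $\kappa = 1$ | $f(x) = y_0 x^{-p} e^{-\gamma/x}$ | $0 < x < \infty$ |
| VII | $\kappa = 0\ (\beta_2 > 3)$ | $f(x) = y_0 \left(1 + \frac{x^2}{a^2}\right)^{-m}$ | $-\infty < x < \infty$ |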

The four moments can also be uniquely determined from the mean, variance, skewness, and kurtosis, which are more commonly used parameters of a distribution and are easily obtained from statistical software. The relationships between the skewness √β1 and the third central moment, and between the kurtosis β2 and the fourth central moment, are as follows:

$$\sqrt{\beta_1} = \frac{\mu_3}{\mu_2^{3/2}} \quad \left(\text{also } \beta_1 = \left(\sqrt{\beta_1}\right)^2 = \frac{\mu_3^2}{\mu_2^3}\right); \qquad \beta_2 = \frac{\mu_4}{\mu_2^2}. \tag{2}$$
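As a minimal sketch of Eqs. 1 and 2 (written in Python for illustration only; it is not the macro's SAS/IML code), the moments can be computed from raw data as below. The function name `moment_summary` and the sample values are hypothetical. The sketch uses population moments with divisor n; some statistical packages instead report bias-adjusted skewness or excess kurtosis (β2 − 3), so values taken from software may need converting before use.

```python
def moment_summary(xs):
    """Return mean, variance (mu_2), skewness sqrt(beta_1) and kurtosis beta_2."""
    n = len(xs)
    mu1 = sum(xs) / n
    # Central moments mu_i = E[(X - mu_1)^i] for i = 2, 3, 4 (Eq. 1, divisor n).
    mu2, mu3, mu4 = (sum((x - mu1) ** i for x in xs) / n for i in (2, 3, 4))
    sqrt_beta1 = mu3 / mu2 ** 1.5  # skewness, Eq. 2
    beta2 = mu4 / mu2 ** 2         # kurtosis (not excess kurtosis), Eq. 2
    return mu1, mu2, sqrt_beta1, beta2


# Hypothetical toy data, used only to exercise the function.
mean, var, skew, kurt = moment_summary([1.0, 2.0, 3.0, 4.0, 10.0])
print(mean, var, skew, kurt)
```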

Once the four moments, or equivalently the mean, variance, skewness, and kurtosis, have been calculated, the type of Pearson distribution that approximates X is determined by the κ-criterion, defined as follows [1]:

$$\kappa = \frac{\beta_1 (\beta_2 + 3)^2}{4 (4\beta_2 - 3\beta_1)(2\beta_2 - 3\beta_1 - 6)}. \tag{3}$$
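The classification implied by Eq. 3 and the κ ranges of Table 1 can be sketched as follows. This is an illustrative Python sketch, not the macro's implementation; the function names and the tolerance `tol` used to detect the transition types are assumptions, since exact equalities such as κ = 0 rarely hold in floating-point arithmetic.

```python
import math


def kappa(beta1, beta2):
    """kappa-criterion of Eq. 3 (beta1 = squared skewness, beta2 = kurtosis)."""
    if beta1 == 0:
        return 0.0  # numerator vanishes; also avoids 0/0 when beta2 == 3
    denom = 4 * (4 * beta2 - 3 * beta1) * (2 * beta2 - 3 * beta1 - 6)
    if denom == 0:
        return math.inf  # 2*beta2 - 3*beta1 - 6 = 0: type III boundary
    return beta1 * (beta2 + 3) ** 2 / denom


def pearson_type(beta1, beta2, tol=1e-8):
    """Map (beta1, beta2) to a Pearson type label using the ranges in Table 1."""
    k = kappa(beta1, beta2)
    if math.isinf(k):
        return "III"
    if abs(k) <= tol:
        if abs(beta2 - 3) <= tol:
            return "Normal"
        return "II" if beta2 < 3 else "VII"
    if k < 0:
        return "I"
    if abs(k - 1) <= tol:
        return "V"
    return "IV" if k < 1 else "VI"


# beta1 and beta2 for the type IV example reported in Table 2.
print(pearson_type(0.005366, 3.172912), kappa(0.005366, 3.172912))
```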

How the κ-criterion (Eq. 3) determines the type of Pearson distribution is shown in Table 1. Table 1 also shows that the density function of each type has a closed form with a clearly defined domain of X. This closed form makes it possible to obtain probability values of the approximated Pearson distributions by numerical integration. The parameters of the density function are calculated with different formulas for each type. Without loss of generality, we illustrate the type IV formulas below. The formulas for the remaining types can be found in [1].

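The density function for the type IV Pearson distribution is

$$y = y_0 \left[1 + \frac{(x - \lambda)^2}{a^2}\right]^{-m} e^{-\nu \tan^{-1}\left((x - \lambda)/a\right)}, \tag{4}$$

where $m = \tfrac{1}{2}(r + 2)$, $\nu = \dfrac{r(r - 2)\sqrt{\beta_1}}{\sqrt{16(r - 1) - \beta_1 (r - 2)^2}}$, $r = \dfrac{6(\beta_2 - \beta_1 - 1)}{2\beta_2 - 3\beta_1 - 6}$, the scale parameter $a = \sqrt{(\mu_2/16)\left[16(r - 1) - \beta_1 (r - 2)^2\right]}$, the location parameter $\lambda = \mu_1 + \nu a / r$, and the normalization coefficient $y_0 = \dfrac{N}{a\,F(r, \nu)}$.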
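The type IV parameter formulas given with Eq. 4 can be evaluated as in the sketch below. This is illustrative Python rather than the macro's SAS/IML code, and the function name and dictionary keys are hypothetical. The sign of ν follows the formula as written above, and y0 is omitted because F(r, ν) is not defined in this section. The call uses the mean and variance from the worked example together with the type IV β1 and β2 reported in Table 2.

```python
import math


def type4_parameters(mu1, mu2, sqrt_beta1, beta2):
    """Type IV parameters r, m, nu, a, lambda from the first four moments."""
    beta1 = sqrt_beta1 ** 2
    r = 6 * (beta2 - beta1 - 1) / (2 * beta2 - 3 * beta1 - 6)
    m = (r + 2) / 2
    # Shared radical sqrt(16(r - 1) - beta1 (r - 2)^2) appears in nu and a.
    root = math.sqrt(16 * (r - 1) - beta1 * (r - 2) ** 2)
    nu = r * (r - 2) * sqrt_beta1 / root
    a = math.sqrt(mu2 / 16) * root  # scale parameter
    lam = mu1 + nu * a / r          # location parameter
    return {"r": r, "m": m, "nu": nu, "a": a, "lambda": lam}


params = type4_parameters(mu1=44.578, mu2=115.0,
                          sqrt_beta1=math.sqrt(0.005366), beta2=3.172912)
print(params)
```

The scale parameter a depends on the units in which the variance is expressed, so it is only comparable with a published value when both use the same units.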

The parameters of the density function for each type of Pearson distribution are computed automatically by the SAS/IML [10] macro program described in the next section. Probability values of Pearson distributions are then obtained through numerical integration with the SAS subroutine QUAD.

Implementation

To make the macro flexible, we allow two ways to input the required information. The first is to input the dataset and variable; the macro then automatically calculates the mean, variance, skewness, and kurtosis of the input variable. The second is to input the mean, variance, skewness, and kurtosis of the variable directly. The main SAS/IML macro program (see Additional file 1) to compute and graph probability values of Pearson distributions is as follows:

%PearsonProb(data=, var=, mean=, variance=, skew=, kurt=, x0=, plot=)

where

  • data = the name of the dataset from which to calculate the four moments (can be omitted if mean, variance, skew, and kurt are supplied);
  • var = the name of the variable in the dataset from which to calculate the moments (can be omitted if mean, variance, skew, and kurt are supplied);
  • mean = the mean of the variable (can be omitted if data and var are supplied);
  • variance = the variance of the variable (can be omitted if data and var are supplied);
  • skew = the skewness of the variable (can be omitted if data and var are supplied);
  • kurt = the kurtosis of the variable (can be omitted if data and var are supplied);
  • x0 = the percentage point x0;
  • plot = 1 for graph, 0 for no graph.

This SAS/IML macro program has four steps. The first step is either to calculate the mean, variance, skewness, and kurtosis from the input dataset or to take the four values directly from the input parameters. The second step is to calculate κ using Eq. (3) and identify the specific type of Pearson distribution based on the κ-criterion displayed in Table 1. In the third step, the macro calculates the parameters of the density function for the identified type. For example, for the type IV Pearson distribution, y0, m, ν, a, and λ are calculated according to the specifications underneath Eq. (4). In the fourth and last step, the probability value of the Pearson distribution corresponding to the input percentage point x0 is calculated by numerical integration with the SAS subroutine QUAD. If the input x0 lies outside the defined domain, a warning message is printed, for example, “WARNING: x0 is out of the domain of type VI Pearson distribution.” If the computation succeeds, the probability value is printed along with the parameters (see Fig. 1).
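The domain check in the fourth step might look like the following Python sketch. It is an assumption about the logic, not the macro's code: the function names and the parameter dictionary keys are hypothetical, the domains are those listed in Table 1, x0 is assumed to be expressed relative to the same origin as the fitted density, and endpoints are treated as inside the domain for simplicity.

```python
import math


def domain(ptype, p):
    """Domain of each Pearson type from Table 1, given fitted parameters p."""
    if ptype == "I":
        return (-p["a1"], p["a2"])
    if ptype == "II":
        return (-p["a"], p["a"])
    if ptype == "III":
        return (-p["a"], math.inf)
    if ptype == "V":
        return (0.0, math.inf)
    if ptype == "VI":
        return (p["a"], math.inf)
    return (-math.inf, math.inf)  # Normal, IV and VII cover the real line


def check_x0(ptype, p, x0):
    """Return the warning text if x0 is outside the domain, else None."""
    lo, hi = domain(ptype, p)
    if not (lo <= x0 <= hi):
        return f"WARNING: x0 is out of the domain of type {ptype} Pearson distribution"
    return None


# Illustrative call: a type VI fit whose domain starts at a = 10.379832.
print(check_x0("VI", {"a": 10.379832}, 5.0))
```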

Fig. 1. SAS output for Type IV Pearson distribution parameters and probability

To graph the probability value on the approximated density function of the Pearson distribution, a small SAS/IML macro, %plotprob, was written for use within the main SAS/IML macro %PearsonProb(data=, var=, mean=, variance=, skew=, kurt=, x0=, plot=). If plot = 1, the graphing macro calls the SAS subroutines GDRAW, GPOLY, etc. to plot the density function and indicate the probability value. Otherwise (i.e., plot = 0), no graph is produced.

To illustrate the process, we provide an example of input and output below (two example datasets are available online: Additional files 2 & 3). One could either input a dataset and variable name (Item 1) or input the values of “mean”, “variance”, “skewness”, and “kurtosis” (Item 2) to the %PearsonProb macro. Both the dataset “dataIV” and the values of the four moments for this example are taken from [1].

  1. %PearsonProb(data = pearson.dataIV, var = x, x0 = 66, plot = 1);

  2. %PearsonProb(mean = 44.578, variance = 115, skew = 0.07325, kurt = 3.1729, x0 = 66, plot = 1);

Both statements produce the same output. The standard output (see Fig. 1) includes the values of the mean, variance, skewness, and kurtosis, and indicates the type of Pearson distribution identified. It also gives the formula for the density function and the values of its parameters. Lastly, it prints the calculated probability. Because we used the plot = 1 option, a figure illustrating the distribution and the probability is also produced (see Fig. 2).

Fig. 2. A type IV Pearson distribution with a probability value indicated

Results

To evaluate the accuracy of the SAS/IML macro program for computing and graphing probability values of Pearson distributions, the parameters of the approximated Pearson distributions calculated by the macro were first compared with the corresponding ones in [1]. As can be seen in Table 2, the absolute differences between the parameters calculated by the SAS/IML macro and those from the tables in [1] are all very small: almost all are less than .001, and the rest are less than .019. The same holds for the relative differences, with the unsurprising exception (4.46%) of κ for type IV, whose original magnitude is very small.

Table 2. Computed parameters and their accuracy

| Type (a) | Parameter | Value from SAS/IML macro | Value from Elderton and Johnson (1969) | Absolute difference (b) | Relative difference (c) |
|---|---|---|---|---|---|
| I | β1 | .507296 | .507296 | <.0001 | <.01% |
| | β2 | 2.935111 | 2.935110 | <.0001 | <.01% |
| | κ | -.264690 | -.264500 | .0002 | .07% |
| | r | 5.186821 | 5.186811 | <.0001 | <.01% |
| | α1 | 1.977543 | 1.996380 | .0188 | .94% |
| | α2 | 13.508428 | 13.527280 | .0189 | .14% |
| | m1 | .406954 | .409833 | .0029 | .70% |
| | m2 | 2.779867 | 2.776878 | .0030 | .12% |
| IV | β1 | .005366 | .005366 | <.0001 | <.01% |
| | β2 | 3.172912 | 3.172912 | <.0001 | <.01% |
| | κ | .012230 | .012800 | .0006 | 4.46% |
| | r | 39.442562 | 39.442540 | <.0001 | <.01% |
| | ν | 4.388796 | 4.388794 | <.0001 | <.01% |
| | α | 13.111988 | 13.111980 | <.0001 | <.01% |
| | m | 20.721280 | 20.721270 | <.0001 | <.01% |
| VI | β1 | .995360 | .995361 | <.0001 | <.01% |
| | β2 | 4.739349 | 4.739349 | <.0001 | <.01% |
| | κ | 1.894437 | 1.895000 | .0006 | .03% |
| | r | -33.421430 | -33.421290 | .0001 | <.01% |
| | q1 | 42.030520 | 42.030800 | .0003 | <.01% |
| | q2 | 6.609095 | 6.609500 | .0004 | <.01% |
| | α | 10.379832 | 10.379470 | .0004 | <.01% |

(a) Elderton and Johnson (1969) does not have the other types of Pearson distributions.

(b) Absolute difference = |Value from Elderton and Johnson (1969) − Value from SAS/IML macro|.

(c) Relative difference = |(Value from Elderton and Johnson (1969) − Value from SAS/IML macro) / Value from Elderton and Johnson (1969)| × 100%.
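The two accuracy measures defined in the footnotes of Table 2 can be reproduced with a few lines of Python. The sketch below is illustrative (the function name is hypothetical) and uses the type I α1 row of Table 2 as its worked example.

```python
def differences(reference, computed):
    """Absolute difference and relative difference (in percent) as in Table 2."""
    abs_diff = abs(reference - computed)
    rel_diff = abs_diff / abs(reference) * 100.0
    return abs_diff, rel_diff


# Type I, alpha_1: Elderton and Johnson value vs. SAS/IML macro value.
abs_diff, rel_diff = differences(reference=1.996380, computed=1.977543)
print(f"{abs_diff:.4f}  {rel_diff:.2f}%")  # prints 0.0188  0.94%
```

Rows computed this way from the rounded table entries can differ in the last digit from the published columns, which were presumably derived from unrounded values.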

Then, the probability values computed by the SAS/IML macro were evaluated using the percentage points in Table 32 of [4] (p. 276) corresponding to probability values of 2.5% and 97.5%, for illustration purposes only. From Table 3, we can see that the probability values computed by the SAS/IML macro are very close to .025 (or 2.5%) and .975 (or 97.5%), respectively, with absolute differences of less than .0001.

Table 3. Computed probability values and their accuracy

| Type (a) | β1 | β2 | Percentage point from Pearson and Hartley (1972), for 2.5% | Percentage point from Pearson and Hartley (1972), for 97.5% | Probability value from SAS/IML macro, 2.5% | Probability value from SAS/IML macro, 97.5% | Absolute difference (b), for 2.5% | Absolute difference (b), for 97.5% |
|---|---|---|---|---|---|---|---|---|
| Normal | .0 | 3.0 | -1.9600 | 1.9600 | .0249970 | .9750020 | <.00001 | <.00001 |
| I | .6 | 3.2 | -1.5998 | 2.2320 | .0249965 | .9749989 | <.00001 | <.00001 |
| II | .0 | 2.6 | -1.9196 | 1.9196 | .0250030 | .9749970 | <.00001 | <.00001 |
| IV | 1.4 | 8.6 | -1.5068 | 2.3801 | .0249838 | .9749471 | .00002 | .00005 |
| VI | 2.0 | 11.2 | -1.1915 | 2.5545 | .0250054 | .9750021 | .00001 | <.00001 |
| VII | .0 | 8.4 | -1.9925 | 1.9925 | .0249999 | .9750001 | <.00001 | <.00001 |

(a) Pearson and Hartley (1972) does not have examples of types III and V.

(b) Absolute difference = |.025 − Probability value from SAS/IML macro| and |.975 − Probability value from SAS/IML macro|, respectively.
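The Normal row of Table 3 can be cross-checked independently, because the normal distribution function has a closed form in terms of the error function. The Python sketch below is such a check under that assumption; it does not involve the macro, and the function name is hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# Percentage points of the Normal row in Table 3 and their nominal probabilities.
for point, target in ((-1.9600, 0.025), (1.9600, 0.975)):
    p = normal_cdf(point)
    print(f"x0 = {point:+.4f}  P = {p:.7f}  |target - P| = {abs(target - p):.7f}")
```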

Discussion

Pearson distributions are a family of non-parametric distributions. They are often used when the normal distribution assumption does not hold for the data. Of the two input approaches described in this paper, the first, supplying a dataset to the macro, is the more commonly used. The second, entering the first four moments directly, is more helpful when the researcher has already computed descriptive statistics from the data.

Conclusions

The new SAS/IML macro program provides an efficient and accurate means of determining the type of Pearson distribution from either a dataset or the values of the first four moments, and then computing probability values of the specific Pearson distribution. Thus, researchers can use this SAS/IML macro program to conduct distribution-free statistical analysis on data with unknown distributions. The macro can also graph the probability values, making them visible on the Pearson distribution curves.

Availability and requirements

Project name: PearsonProb

Project home page: To be available

Operating system(s): Platform independent

Programming language: SAS/IML

Other requirements: SAS 9.4 or higher

License: Not applicable

Any restrictions to use by non-academics: None

Additional material

Additional file 1 (18KB, sas)

SAS/IML macro program. The SAS/IML macro program for computing and graphing probability values of Pearson distributions is available as an additional file, PearsonDistributionProbfinal.sas

Additional file 2 (128KB, sas7bdat)

Sample dataset 1. The dataset dataI.sas7bdat was taken from [1].

Additional file 3 (256KB, sas7bdat)

Sample dataset 2. The dataset dataIV.sas7bdat was taken from [1].

Acknowledgments

Not applicable.

Authors’ contributions

QY extensively revised the manuscript and the SAS program. XA revised the manuscript. WP initially wrote the manuscript and the SAS program. All authors read and approved the final manuscript.

Funding

Not applicable.

Availability of data and materials

Not applicable.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary information accompanies this paper at 10.1186/s13029-019-0076-2.

References

  • 1. Elderton WP, Johnson NL. Systems of Frequency Curves. London: Cambridge University Press; 1969.
  • 2. Pearson K. Contributions to the mathematical theory of evolution. II. Skew variations in homogeneous material. Philos Trans R Soc Lond Ser A. 1895;186:343–414. doi: 10.1098/rsta.1895.0010.
  • 3. Solomon H, Stephens MA. Approximations to density functions using Pearson curves. J Am Stat Assoc. 1978;73(361):153–60. doi: 10.1080/01621459.1978.10480019.
  • 4. Pearson ES, Hartley HO. Biometrika Tables for Statisticians, vol. II. New York: Cambridge University Press; 1972.
  • 5. Amos DE, Daniel SL. Tables of percentage points of standardized Pearson distributions, Research Report SC-RR-71 0348. Albuquerque: Sandia Laboratories; 1971.
  • 6. Bouver H, Bargmann RE. Tables of the standardized percentage points of the Pearson system of curves in terms of β1 and β2, Technical Report No. 107. Georgia: Department of Statistics and Computer Science, University of Georgia; 1974.
  • 7. Bowman KO, Shenton LR. Approximate percentage points for Pearson distributions. Biometrika. 1979;66(1):147–51. doi: 10.1093/biomet/66.1.147.
  • 8. Davis CS, Stephens MA. Approximate percentage points using Pearson curves. Appl Stat. 1983;32(3):322–7. doi: 10.2307/2347964.
  • 9. Pan W. A SAS/IML macro for computing percentage points of Pearson distributions. J Stat Softw. 2009;31(Code Snippet 2):1–6. doi: 10.18637/jss.v031.c02.
  • 10. SAS Institute Inc. SAS/IML 9.3 User’s Guide. 2011. http://www.sas.com/. Accessed 23 Jun 2012.

