Scientific Reports. 2021 Aug 27;11:17309. doi: 10.1038/s41598-021-96214-w

Bias in Zipf’s law estimators

Charlie Pilgrim1, Thomas T Hills2,3
PMCID: PMC8397718  PMID: 34453066

Abstract

The prevailing maximum likelihood estimators for inferring power law models from rank-frequency data are biased. The source of this bias is an inappropriate likelihood function. The correct likelihood function is derived and shown to be computationally intractable. A more computationally efficient method of approximate Bayesian computation (ABC) is explored. This method is shown to have less bias for data generated from idealised rank-frequency Zipfian distributions. However, the existing estimators and the ABC estimator described here assume that words are drawn from a simple probability distribution, while language is a much more complex process. We show that this false assumption leads to continued biases when applying any of these methods to natural language to estimate Zipf exponents. We recommend that researchers be aware of the bias when investigating power laws in rank-frequency data.

Subject terms: Human behaviour, Applied mathematics, Statistics, Computational science


If we take a book and rank each word based on how many times it appears, we will find that the number of occurrences of each word is approximately inversely proportional to its rank1. The second most frequent word will appear approximately 1/2 as often as the most frequent word, the third around 1/3 as frequently. This describes a power law relationship between the frequency of a word, n, and the word’s rank in terms of its frequency, r_e, with exponent γ1,2.

$n(r_e) \propto r_e^{-\gamma}$  (1)

This is known as Zipf’s law and is consistent, in a general sense, across human communication3,4. We do not have a satisfactory explanation for why this is so2, and the exponent, γ, is not always 1 but varies between different speakers3 and texts3,5. Sound analytical tools are needed to investigate these research areas.
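The rank-frequency relationship in Eq. (1) is easy to compute for any text. A minimal sketch, using a toy sentence rather than a real book:

```python
from collections import Counter

# Rank the words of a small text by frequency, then compare each observed
# frequency to the prediction of Eq. (1) with gamma = 1: n(r_e) ~ n(1) / r_e.
text = "the cat sat on the mat and the dog sat on the rug"
counts = Counter(text.split())
freqs = sorted(counts.values(), reverse=True)  # n(r_e) for r_e = 1, 2, ...
zipf_prediction = [freqs[0] / r for r in range(1, len(freqs) + 1)]
print(freqs[:3], [round(z, 1) for z in zipf_prediction[:3]])
```

A toy sentence is far too short for the law to hold precisely; with book-length texts the agreement over the first few thousand ranks is striking.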

Equation (1) describes an observed empirical relationship. It is tempting to assume that this is equivalent to a probability distribution over words (an early example is Shannon using Zipf’s law to estimate the entropy of English6). Indeed, Zipf’s law is often expressed as a relationship between a word’s probability of occurrence7,8 and the word’s rank in the probability distribution, r_p.

$p(r_p) \propto r_p^{-\lambda}$  (2)

The conflation of Eqs. (1) and (2) causes the prevailing maximum likelihood estimators to miscalculate λ in Eq. (2) with a positive bias2,9,10 (Fig. 1). This bias applies specifically to rank-frequency distributions, where the ranks of events are not known a priori and instead are extracted from the frequency distribution, as is the case with word frequencies. The root of the bias is that the existing estimators assume that the observed empirical frequency rankings of data [r_e in Eq. (1)] are equivalent to rankings in an underlying probability distribution [r_p in Eq. (2)]2. The nth most frequent word is assumed to be the nth most likely word, which is not necessarily the case2. This is often overlooked in the literature2.

Figure 1.

Bias in maximum likelihood estimation for rank-frequency data. 100 values of λ between 1 and 2 were investigated. For each λ, samples with N = 100,000 were generated from an unbounded power law distribution and Clauset et al.’s estimator was applied to the empirical rank-frequency distribution. This was repeated 100 times and the results averaged. There is a clear and strong positive bias for λ ≲ 1.5.

In the 2000s there were a series of papers8,11,12 describing a method of maximum likelihood estimation that gave more accurate (lower bias) estimates for power law exponents than graphical methods8. The most influential of these is Clauset et al.’s paper8. The estimators had been derived and presented before11 (as early as 1952 in the discrete case13) but Clauset et al.’s paper popularised the idea and provided a clear methodology including techniques to perform goodness of fit tests8. In all of these papers, the derivation of the likelihood function assumes that there is some a priori ordering on an independent variable. This works very well for power laws with some natural way to order events, such as the size vs frequency of earthquakes8. However, it does not work so well with rank-frequency distributions, where the rank is extracted empirically from the frequency distribution, so that the empirical rank and frequency are correlated variables2, both dependent on the same underlying mechanism. This difference was not addressed by Clauset et al., who include examples of applying their estimator to Zipf’s law in language8. The same data can look very different depending on whether we know its true rank or not, as shown in Fig. 2.

Figure 2.

Difference between distributions with probability and empirical ranks. Data was generated from an underlying power law probability distribution with exponent λ=1, number of possible events W=60 and N=200 samples. The dotted blue line shows the probability distribution. The blue circles show the sampled event frequencies with a priori known probability ranks. The red crosses show the empirical rank-frequency distribution from the same data. There is a significant difference between the two distributions. The current estimators are designed to fit data with a priori known ranks, not empirical ranks.

Recently Clauset et al.’s estimator has been shown, empirically, to be biased for some rank-frequency distributions2,9,10. In particular, Clauset et al.’s method over-estimates exponents for rank-frequency data generated from known power law probability distributions with exponents below about 1.5 (Fig. 1)10. The problem is related to low sampling in the tail9,10: the observed empirical ranks tend to “bunch up” above the line of the true probability distribution before decaying sharply at the end of the observed tail (Fig. 2). To our knowledge this bias has not been adequately explained or solved.
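The experiment behind Fig. 1 can be sketched as follows. This is an illustrative reimplementation, not the authors' code: the sampler is bounded at a large W as a stand-in for an unbounded distribution, and the estimator is a simple grid-search version of the prevailing MLE, which treats empirical ranks as probability ranks:

```python
import numpy as np
from scipy.special import zeta

rng = np.random.default_rng(1)

def sample_rank_frequency(lam, N, W=100_000):
    # Bounded at a large W as a stand-in for the unbounded distribution.
    ranks = np.arange(1, W + 1)
    p = ranks ** -lam
    counts = rng.multinomial(N, p / p.sum())
    counts = counts[counts > 0]
    return np.sort(counts)[::-1]            # n(r_e) for r_e = 1, 2, ...

def prevailing_mle(counts):
    # Grid-search MLE assuming r_e = r_p, with zeta(lam) as the normaliser.
    r = np.arange(1, len(counts) + 1)
    grid = np.linspace(1.01, 3.0, 400)
    lls = [-lam * np.sum(counts * np.log(r)) - counts.sum() * np.log(zeta(lam))
           for lam in grid]
    return grid[int(np.argmax(lls))]

true_lam = 1.1
est = np.mean([prevailing_mle(sample_rank_frequency(true_lam, 10_000))
               for _ in range(10)])
print(f"true λ = {true_lam}, mean estimate ≈ {est:.2f}")  # biased upward
```

The mean estimate comes out noticeably above the true exponent, reproducing the direction of the bias in Fig. 1 (the magnitude depends on N and on the bounded-W approximation).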

  • In 2014 Piantadosi2 explained the problem and suggested splitting a corpus, calculating ranks of words from one part of the split and frequencies from the other, breaking the correlation of errors. However, the method does not take into account uncorrelated errors in the ranks. In particular, the empirical ranks of events in the tail will almost certainly be lower than the actual ranks in the probability distribution, as many events in the tail will not be observed at all.

  • Hanel et al.10 identified the problem and suggested using a finite set of events instead of Clauset et al.’s unbounded event set8. This gives more accurate results in the limited case that the number of possible events, W, is finite and known10. Often W is not known and the choice of W can substantially change the results. With Zipf’s law in language, W represents the writer’s vocabulary and is usually modelled as unbounded2,8,12. This seems appropriate given that Heaps’ Law suggests that the number of unique words in a document continues to rise indefinitely as the document length increases14.

  • In 2019 Corral et al.9 examined the problem and explored transforming the data to a distribution-of-frequencies representation, f(n), which is also a power-law-type distribution that they call Zipf’s law for sizes9. This distribution has an a priori known independent variable of frequency sizes, so the bias does not apply to this representation. However, there is still difficulty in estimating the rank-frequency exponent, as a power law in the rank-frequency distribution, n(r_e), will only approximately map to a power law in the distribution of frequencies, f(n), for real-world sample sizes9.

Overall these ad-hoc methods can remove the bias to some extent but not completely. The methods also introduce a host of somewhat arbitrary choices for the researcher to resolve.

We derive a new maximum likelihood estimator that does not make the false assumption that the empirical ranks, re, are equivalent to the probability ranks, rp. The new estimator considers all the possible ways that the events could be ranked in the underlying probability distribution to generate the observed empirical data. Unfortunately this new likelihood function is computationally intractable for all but the smallest data sets. In order to estimate parameters for larger data sets, we turn to approximate Bayesian computation (ABC), a method that is designed for situations where likelihood functions cannot be computed15. We show that this method has much lower bias than Clauset et al.’s estimator for rank-frequency data generated from simple power laws. We further explore two different implementations of ABC and find that they give different results when applied to word distributions in books because ABC and Clauset et al.’s method both assume an underlying power law probability model, while natural language arises from a more complex model. We suggest that this false assumption means that maximum likelihood estimation with simple models will always have some arbitrary bias when studying rank-frequency data in natural language, including ABC and Clauset et al.’s method.

Model

Likelihood function: general case with no a priori ordering

A vector of data, d = [d_1, d_2, …, d_N], represents N observations of a random variable X. Each of these observations is one of a discrete set of W events, with no a priori ordinality. An example is words in a book.

We can transform the vector d to counts of each event, ordered from most to least frequent, n = [n(x_(1)), n(x_(2)), …, n(x_(W))]. n(x_(r_e)) represents the count of the r_e-th most common event, where r_e is the event’s ranking in the empirical frequency distribution. For ease of notation we will refer to n(x_(r_e)) as n(r_e).
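As a sketch, this transformation is a one-liner, assuming the observations are held in a plain Python list:

```python
from collections import Counter

# The transformation from a data vector d to sorted event counts n:
d = ["b", "a", "b", "c", "b", "a"]            # N = 6 observations, W = 3 events
n = sorted(Counter(d).values(), reverse=True)
print(n)  # n(1) = 3, n(2) = 2, n(3) = 1
```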

We assume a simple model where each of these events has some unknown fixed probability of being observed, p(x_{r_p}) = Pr(X = x_{r_p}), where r_p is the event’s rank in the underlying probability distribution.

The key insight is that given an event’s empirical rank, we do not know that event’s rank in the underlying probability distribution. We can describe the mapping of events from the data generating probability ranking to the empirical ranking with a vector s, so that s(r_p) = r_e. For example, s = [2, 1, 3] would mean that the second most probable event was observed most often, the most probable event was observed second most often, and the third most likely third most often. For any valid mapping, s must be a permutation of the integers from 1 to W. Figure 3 shows an example mapping.

Figure 3.

An example mapping from probability to empirical ranks. The observed data n = [8, 6, 3, 2, 1, 1] can arise from any valid permutation of events from the probability distribution. Here the permutation is s = [2, 1, 5, 3, 4, 6]. The most likely event is observed the second most times (s(1) = 2), etc. The likelihood of the data given this permutation is $p(n|s,\theta) = p_1^{6} p_2^{8} p_3^{1} p_4^{3} p_5^{2} p_6^{1}$.
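The mapping rule s(r_p) = r_e can be checked mechanically. A short sketch reproducing the exponents from the Figure 3 example:

```python
# The exponent on p_{r_p} is the count of the event observed s(r_p)-th most
# often, i.e. n(s(r_p)).
n = [8, 6, 3, 2, 1, 1]                 # n(r_e) for r_e = 1..6
s = [2, 1, 5, 3, 4, 6]                 # s[r_p - 1] = r_e
exponents = [n[s[rp - 1] - 1] for rp in range(1, 7)]
print(exponents)  # exponents of p_1, ..., p_6
```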

We assume that the probability distribution is parameterised by θ. Considering Bayes’ rule

$p(\theta|n) = \frac{p(n|\theta)\,p(\theta)}{p(n)}$  (3)

The likelihood can be written as (ignoring constants of proportionality)

$p(n|\theta) = \prod_{r_e=1}^{W} p(x_{(r_e)})^{n(r_e)}$  (4)

This likelihood equation is in terms of the events’ empirical rank, re, whereas the underlying probability model is in terms of probability rank, rp. To convert the likelihood to be in terms of rp we condition on the mapping vector, s.

$p(n|\theta,s) = \prod_{r_p=1}^{W} p(x_{r_p})^{n(s(r_p))}$  (5)

Using the law of total probability we sum over all possible mappings of probability rankings onto empirical rankings. S(W) is the set of all possible permutations of the numbers 1 to W, known as the symmetric group.

$p(n|\theta) = \sum_{s \in S(W)} \prod_{r_p=1}^{W} p(x_{r_p})^{n(s(r_p))}$  (6)

Equation (6) is the likelihood for any data that represents observations of discrete events, where the events have no a priori ordering in relation to the underlying model. The equation generalises to W → ∞, suitable to describe models with unbounded event sets, as is the case in many Zipf type models.
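For very small W, Eq. (6) can be evaluated directly by enumerating the symmetric group. A sketch, with an assumed probability vector p over W = 3 events:

```python
import itertools
import math

# Direct O(W!) evaluation of Eq. (6): sum the likelihood over every
# permutation s mapping probability ranks to empirical ranks.
def full_likelihood(n, p):
    W = len(n)
    total = 0.0
    for s in itertools.permutations(range(W)):       # s[rp] = re (0-indexed)
        total += math.prod(p[rp] ** n[s[rp]] for rp in range(W))
    return total

p = [0.5, 0.3, 0.2]   # assumed probabilities, most to least likely
n = [3, 2, 1]         # observed sorted counts
print(full_likelihood(n, p))
```

The leading (identity-permutation) term here is 0.5³ × 0.3² × 0.2 = 0.00225, while the full sum over all six permutations is nearly three times larger, which is exactly why keeping only the leading term distorts the inference.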

Likelihood function: power laws with no a priori ordering

A common model applied to rank-frequency distributions is the power law, used by Zipf in his study of words1. A power law probability distribution is of the form

$p(x_{r_p}) = \frac{r_p^{-\lambda}}{Z_\lambda}$  (7)

where λ is the power law exponent and Z_λ is a normalising factor. We use the simplest form of Zipf’s law for ease of analysis. The method described here can also be used with other models, such as the Zipf–Mandelbrot law16. The normalising factor is:

$Z_\lambda = \sum_{r_p=1}^{W} r_p^{-\lambda}$  (8)

W is the number of possible events. In the limit W → ∞, Z_λ becomes the Riemann zeta function, ζ(λ)8.

Considering Eq. (6), the likelihood can be written as

$L(\lambda|n) = \sum_{s \in S(W)} \prod_{r_p=1}^{W} \left( \frac{r_p^{-\lambda}}{Z_\lambda} \right)^{n(s(r_p))}$  (9)

And the differential of the likelihood with respect to λ is

$\frac{\partial}{\partial\lambda} L(\lambda|n) = -\sum_{s \in S(W)} \left( \frac{N Z'_\lambda}{Z_\lambda} + \sum_{r_p=1}^{W} n(s(r_p)) \ln(r_p) \right) \prod_{r_p=1}^{W} \left( \frac{r_p^{-\lambda}}{Z_\lambda} \right)^{n(s(r_p))}$  (10)

Z′_λ is the derivative of the normalising factor with respect to λ. To find the maximum likelihood estimate, we can use numerical methods to either (a) maximise Eq. (9) or (b) find the root of Eq. (10) (Fig. 4).
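Putting the pieces together, the full maximum likelihood estimate can be found numerically for small data sets. A sketch using the Fig. 4 data, assuming a finite event set with W equal to the number of observed events (the estimate will differ under other normalisation assumptions):

```python
import itertools
import math
from scipy.optimize import minimize_scalar

# Numerically maximise the full likelihood (Eq. 9) for a tiny data set,
# summing over all W! permutations with a finite-W normaliser (Eq. 8).
def full_log_likelihood(lam, n):
    W = len(n)
    Z = sum((rp + 1) ** -lam for rp in range(W))
    p = [(rp + 1) ** -lam / Z for rp in range(W)]
    like = sum(math.prod(p[rp] ** n[s[rp]] for rp in range(W))
               for s in itertools.permutations(range(W)))
    return math.log(like)

n = [10, 3, 3, 2, 1, 1]                      # the Fig. 4 data
res = minimize_scalar(lambda lam: -full_log_likelihood(lam, n),
                      bounds=(0.1, 5.0), method="bounded")
print(f"full MLE: {res.x:.2f}")
```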

Figure 4.

Likelihood functions of the full likelihood (blue) and only the leading term (red). Both likelihoods are calculated for the data n = [10, 3, 3, 2, 1, 1]. The leading term of the full likelihood is equivalent to the likelihood function as defined by Hanel et al.10, which is adapted for finite event sets from Clauset et al.’s estimator8. The top figure shows the full likelihood compared to Hanel et al.’s likelihood, with the maximum likelihood estimators shown as dashed lines. The bottom figure shows the differential of the likelihood functions. The form of the differential of the full likelihood is markedly different to only the first term. There is a substantial difference in the maximum likelihood estimator, with the Hanel et al. estimator giving $\hat{\lambda} = 1.27$ and the full estimator giving $\hat{\lambda} = 1.16$.

The prevailing estimators from the literature (often implicitly) assume that the empirical ranks match the probability ranks2,8,12, so that they consider only the leading term in the main sum in Eqs. (9) and (10) (the term associated with the identity permutation s_I = [1, 2, …, W]). This is the source of the bias in the existing estimators.

The number of terms in the likelihood function (Eq. 6) scales as O(W!), so that naive computation of the likelihood is impractical even at W ≈ 10. The computation can be shown to be equivalent to computing the permanent of a matrix with entries a_{ij} = p(x_j)^{n(i)}. The best known algorithm for exactly computing the permanent of a matrix is Ryser’s algorithm17,18, with complexity O(W·2^W). This is computationally intractable for real-world data sets such as text corpora with vocabularies of W > 1000. A more in-depth discussion of the computational complexity can be found in the Supplementary Information.
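To make the complexity claim concrete, here is a sketch of Ryser's formula (a straightforward version, without the Gray-code optimisation), checked against the brute-force O(W!) definition on a small matrix:

```python
import itertools
import math

# Brute-force permanent: sum over all permutations, O(W!).
def permanent_bruteforce(A):
    n = len(A)
    return sum(math.prod(A[i][s[i]] for i in range(n))
               for s in itertools.permutations(range(n)))

# Ryser's inclusion-exclusion formula:
# perm(A) = (-1)^n * sum over non-empty column subsets S of
#           (-1)^|S| * prod_i (sum_{j in S} a_ij).
def permanent_ryser(A):
    n = len(A)
    total = 0.0
    for mask in range(1, 1 << n):                  # each mask is a subset S
        prod_of_row_sums = math.prod(
            sum(A[i][j] for j in range(n) if mask >> j & 1) for i in range(n))
        total += (-1) ** bin(mask).count("1") * prod_of_row_sums
    return (-1) ** n * total

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(permanent_bruteforce(A), permanent_ryser(A))  # both give 450
```

Even with the exponential-instead-of-factorial saving, 2^W subsets is hopeless at vocabulary sizes of W > 1000.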

Approximate Bayesian computation

Approximate Bayesian computation is a technique for approximating posterior distributions without calculating a likelihood function19–21. Instead, we assume a model, M, simulate data, n_i, from possible parameters, λ_i, and observe how close the simulated data is to the empirical data using a distance measure ρ(n_i, n_obs)19,21. The ABC rejection algorithm is based upon the principle that we can approximate the actual posterior by estimating the probability of λ given that the data is within some small tolerance, ϵ, of the observed empirical data19,22. This assumes that the model, M, is a good representation of the actual data generating process.

$p(\lambda \mid n = n_{obs}, M) \approx p(\lambda \mid \rho(n, n_{obs}) < \epsilon, M)$  (11)

$p(\lambda \mid \rho(n, n_{obs}) < \epsilon, M) = \frac{p(\rho(n, n_{obs}) < \epsilon \mid \lambda, M)\, p(\lambda \mid M)}{p(\rho(n, n_{obs}) < \epsilon \mid M)}$  (12)

The ABC rejection algorithm begins by sampling parameter values from the prior. For each of these parameter values, data is generated from the model and tested on the condition ρ(n_i, n_obs) < ϵ19. With enough samples, the density of successful parameters will approximate the right hand side of Eq. (12), and hence the posterior distribution19. If we use a uniform prior, this estimate is proportional to the likelihood.
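The rejection scheme can be sketched in a few lines. This is not the paper's ABC-PMC implementation: it uses a single generation with a uniform prior, and it keeps the closest fraction of particles (in the spirit of the survivalFraction rule described below) rather than a fixed tolerance:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Single-generation ABC rejection for the Zipf exponent, with a bounded
# power law simulator and the Wasserstein distance between count vectors.
rng = np.random.default_rng(2)
W, N = 1000, 5000

def simulate(lam):
    p = np.arange(1, W + 1) ** -lam
    counts = rng.multinomial(N, p / p.sum())
    return np.sort(counts[counts > 0])[::-1]       # empirical rank-frequency data

n_obs = simulate(1.3)                              # stand-in for observed data
particles = rng.uniform(1.01, 2.5, size=2000)      # draws from a uniform prior
dists = [wasserstein_distance(simulate(lam), n_obs) for lam in particles]
accepted = particles[np.argsort(dists)[:50]]       # keep the 50 closest particles
posterior_mean = float(np.mean(accepted))
print(f"posterior mean ≈ {posterior_mean:.2f}")    # close to the true λ = 1.3
```

Because the simulator here matches the data generating model exactly, the posterior concentrates near the true exponent; the PMC refinement below is needed to make this affordable at realistic tolerances.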

An ideal distance measure, ρ(n_i, n_obs), would compare Bayesian sufficient summary statistics of the data21. In practice Bayesian sufficiency usually cannot be achieved19,21, and some information is lost, so that the approximation of the posterior includes some error19. A common technique is to summarise the data sets with summary statistics, S(n), and define the distance as the difference between those, ρ(n_i, n_obs) = |S(n_i) − S(n_obs)|15,19,21. Recently the Wasserstein distance, a metric between distributions, has been shown to work well as a distance measure23. This is a principled approach that avoids the difficult selection of summary statistics23, and it is the measure we use here.

The ABC rejection algorithm requires a small tolerance in order to find a good estimate for the posterior22. This in turn requires a high density of samples in order to have enough successful parameters to build the posterior approximation. Sampling at high density across a reasonable parameter space with a uniform prior would be prohibitively computationally expensive. Instead, we use population Monte Carlo to sample from a proposal distribution that focuses on areas of high posterior probability while avoiding areas of negligible probability24. At each time step, the results are weighted using principles from importance sampling to account for the fact that we are sampling from the proposal distribution instead of the prior24. This algorithm, adapted from25, is shown in Algorithm 1 and Fig. 5 (the 2-parameter algorithm is equivalent, with the variance replaced by a covariance matrix). The parameters in the algorithm were set by trial and error to balance computation time and accuracy.

Figure 5.

Approximate Bayesian computation with population Monte Carlo (ABC-PMC). (a) Given the observed data. (b) Particles are generated from a proposal distribution and data is simulated for each particle. For each particle, the Wasserstein distance is measured between the simulated data and the observed data. (c) This is repeated until nParticles samples are generated with Wasserstein distance within a tolerance ϵ. (d) A new proposal distribution is generated by a weighted kernel density estimate on the accepted particles, with weights based on importance sampling principles. A new tolerance is set based upon the proportion survivalFraction of particles with the smallest distances found in this time step. This is repeated for a given number of generations. The final successful particles are used to generate an approximation of the posterior distribution using a weighted kernel density estimate. Figure adapted in part from19,21.

We also investigated an alternative approach to approximate Bayesian computation known as ABC regression. Instead of the Wasserstein distance, we used the mean of the log transformed event counts as a summary statistic with this method. Full details are in the Supplementary Information.

[Algorithm 1 is presented as an image in the published article.]

ABC results

Approximate Bayesian computation with Zipf distributions

Rank-frequency data was generated (N = 10,000) from an unbounded power law with exponents ranging from 1 to 2. For each generated data set, the exponent was estimated using (a) Clauset et al.’s estimator and (b) ABC-PMC with the Wasserstein distance. This was repeated 100 times to find the mean bias and variance. The ABC method has much lower bias and similar variance to Clauset et al.’s method (Fig. 6).

Figure 6.

Bias in ABC (solid blue) vs Clauset et al.’s estimator (dashed red) for unbounded power laws. For each of 100 values of λ between 1.01 and 2, rank-frequency data (N=10,000) was generated by sampling an unbounded power law. This was run 100 times. The left figure shows the known λ and the mean estimated λ. The centre figure shows the mean bias, with a 68% confidence interval shaded. The right figure shows the variance of the estimators. The ABC estimator has much lower bias and similar variance to Clauset et al.’s estimator.

We also investigated how the bias changes with sample size. Rank-frequency data was generated with λ = 1.1 and sample sizes up to N = 1,000,000. Clauset et al.’s estimator shows positive bias at all values of N, although it decreases at large N. ABC shows much lower bias for all values of N. The variance of ABC is higher for N ≲ 1000. Overall the variance is still very low, and is insignificant compared to the positive bias shown by Clauset et al.’s estimator (Fig. 7).

Figure 7.

Bias in ABC (solid blue) vs Clauset et al.’s estimator (dashed red) for unbounded power laws. Rank-frequency data was generated for λ=1.1 with varying sizes, N. This was run 100 times. The left figure shows the known λ against the mean estimated λ. The centre figure shows the mean bias, with a 68% confidence interval shaded. The right figure shows the variance of the estimators. The bias is much lower with ABC. The ABC estimator has higher variance than Clauset et al. at low N, although the variance is still very low.

In addition to the results shown here, we explored a variation of the algorithm using ABC rejection with the mean of the logged event counts as a summary statistic. This method has similarly low bias and variance as the results shown here. See the Supplementary Information for full details.

Approximate Bayesian computation with Zipf–Mandelbrot model

The Zipf–Mandelbrot law is a modification of Zipf’s law derived by Mandelbrot that accounts for a departure from a strict power law in the head of the rank-frequency distribution16.

$p(r_p) \propto (r_p + q)^{-\lambda}, \quad q \in \{0, 1, 2, \ldots\}$  (13)

We tested the ABC-PMC algorithm with this 2-parameter model. The algorithm is of the same form as Algorithm 1, with the variance replaced by a covariance matrix. The algorithm is demonstrated on one generated data set with q=4, λ=1.2 and N=100,000. ABC-PMC performs well, with estimates close to the true parameters (see Fig. 8). The approximated likelihood function gives negligible probability to q=0, suggesting that the algorithm can discriminate between data generated from Zipf’s law and the Zipf–Mandelbrot law.
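Sampling from the Zipf–Mandelbrot model (Eq. 13) is a small extension of the power law sampler; this sketch bounds the event set at a large W as a stand-in for the unbounded case:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_zipf_mandelbrot(lam, q, N, W=100_000):
    # p(r_p) proportional to (r_p + q)^-lam, truncated at a large W.
    ranks = np.arange(1, W + 1)
    p = (ranks + q) ** -lam
    return rng.multinomial(N, p / p.sum())

counts = sample_zipf_mandelbrot(lam=1.2, q=4, N=100_000)
# With q > 0 the head is flattened: the top-ranked counts are closer
# together than a pure power law with the same exponent would give.
print(counts[:3])
```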

Figure 8.

Results of ABC-PMC for the Zipf–Mandelbrot law with data generated with known exponent λ=1.2 and q=4 (red cross) with N=100,000 words. The likelihood function (darker blue regions have higher likelihood) was approximated using a kernel density estimate. The mode of the KDE gives the maximum likelihood estimate (green circle). The estimator correctly identifies q and is close to the correct exponent λ.

Analysis of books

Both Clauset et al.’s method and the approximate Bayesian computation method described here assume a Zipfian data generating model. We have demonstrated that ABC-PMC with the Wasserstein distance works well for data generated from a known power law, with much lower bias than Clauset et al.’s method. In the Supplementary Information, we also describe an ABC regression method using the mean log of the word counts that has similarly low bias when applied to data from a power law distribution.

It is reasonable to suggest that natural language is a more complex process than drawing words from a power law probability distribution. Indeed, deep learning language models like GPT-3 use billions of parameters26. As such, estimators that assume a Zipfian data generating model are not necessarily suitable for analysing language. To demonstrate the problem, we analysed books using (a) Clauset et al.’s method, (b) ABC-PMC with the Wasserstein distance, and (c) ABC regression with the mean of the log transformed word counts as a summary statistic (Table 1). All of the books were downloaded from Project Gutenberg27. Each text sample was first “cleaned” by removing all punctuation, replacing numbers with a # symbol, and converting all text to lowercase. The word frequencies were then counted.
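The cleaning pipeline described above can be sketched as follows; the regular expressions here are assumptions about the exact cleaning rules, which the authors' repository specifies precisely:

```python
import re
from collections import Counter

# Lowercase, replace numbers with '#', strip punctuation, then count words.
def clean_and_count(text):
    text = text.lower()
    text = re.sub(r"\d+", "#", text)          # numbers -> '#'
    text = re.sub(r"[^\w#\s]", "", text)      # strip punctuation, keep '#'
    return Counter(text.split())

counts = clean_and_count(
    "Call me Ishmael. Some years ago - never mind how long - 42 times.")
print(counts.most_common(3))
```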

Table 1.

Comparison of estimators of Zipf’s law in books

Book Clauset et al. ABC PMC with Wasserstein ABC regression with mean log
Moby Dick 1.19 1.25 1.16
A tale of two cities 1.21 1.27 1.17
Alice in Wonderland 1.22 1.25 1.18
Chronicles of London 1.19 1.20 1.15
Ulysses 1.18 1.22 1.14

The two forms of ABC give different results, which bracket the results of Clauset et al.’s estimator. This does not imply that Clauset et al.’s estimator is the most accurate, as we show above that it is biased upwards. What these results indicate is that there is no correct “ground truth”, because the assumed underlying models are wrong.

Discussion

We have demonstrated that the prevailing Zipf’s law maximum likelihood estimators for rank-frequency data are biased due to an inappropriate likelihood function. This bias is particularly strong in the range of natural language, with exponents close to 1. The correct likelihood function is intractable. We have presented one approach to overcoming this bias using a likelihood-free method of approximate Bayesian computation. The ABC method is shown to work well with data generated from actual power law distributions, with lower bias than Clauset et al.’s estimator.

ABC works well in an idealised situation where the true model is known. However, when applied to analysing books, the two ABC approaches that we explored give very different estimates for the Zipf exponents. The Zipfian approaches we investigate all assume a simple bag of words probability model, whereas our results on books indicate that natural language generation is a more complex process; otherwise the two ABC methods would converge. The ABC algorithms are searching a parameter space for the closest model based on the distance measure. This works well when the parameter space includes the true data generating process. But with natural language the assumed simple Zipf model is wrong, so there is no “correct” location in the parameter space (or the “correct” location is outside the parameter space). Different distance measures will prejudice different aspects of the observed data and so arrive at different estimates. This bias is arbitrary in nature and there seems to be no reasonable way to decide which distance measure is “correct”. The error lies in the assumption of an incorrect data generating model. This problem applies to ABC and Clauset et al.’s estimator, and seems to be inherent in applying maximum likelihood estimation using simple models to describe power laws in natural language.

Zipf’s law for word types is an empirical relationship between frequencies of words and ranks in that frequency distribution. The difficulty arises when a probabilistic model is used to describe the mechanism that is generating this relationship, when the actual mechanism is more complex. The main aim of this publication is to clearly show that Clauset et al.’s estimator is strongly biased for rank-frequency data. The correct likelihood function provides an unbiased framework that works well when the underlying data generating process is known. This does not appear to be the case for natural language. Graphical methods may therefore be more suitable to study Zipf’s law when investigating the empirical relationship between ranks and frequencies (Eq. 1) and not the probability distribution (Eq. 2). All Zipf estimators have some bias and the best choice will depend on the specific application.

The scripts and data used here are available at the repository https://github.com/chasmani/PUBLIC_bias_in_zipfs_law_estimators. That repository includes the approximate Bayesian computation algorithm as well as implementations of other estimators from the literature.

Acknowledgements

The study was funded by the EPSRC grant for the Mathematics for Real-World Systems CDT at Warwick (Grant No. EP/L015374/1). T.T.H. was supported on this work by the Royal Society Wolfson Research Merit Award (WM160074) and a Fellowship from the Alan Turing Institute.

Author contributions

C.P. conceived of the presented idea and carried out the analyses. T.T.H. supervised C.P. and offered guidance, suggestions and support throughout. All authors reviewed the manuscript.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-021-96214-w.

References

  • 1.Zipf, G. K. Human Behavior and the Principle of Least Effort. (Addison-Wesley Press, 1949).
  • 2.Piantadosi ST. Zipf’s word frequency law in natural language: A critical review and future directions. Psychon. Bull. Rev. 2014;21:1112–1130. doi: 10.3758/s13423-014-0585-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Ferrer i Cancho R. The variation of Zipf’s law in human language. Eur. Phys. J. B. 2005;44:249–257. doi: 10.1140/epjb/e2005-00121-8. [DOI] [Google Scholar]
  • 4.Moreno-Sánchez I, Font-Clos F, Corral Á. Large-scale analysis of Zipf’s law in English texts. PLoS ONE. 2016;11:e0147073. doi: 10.1371/journal.pone.0147073. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Montemurro MA, Zanette DH. New perspectives on Zipf’s law in linguistics: From single texts to large corpora. Glottometrics. 2002;4:87–99. [Google Scholar]
  • 6.Shannon CE. Prediction and entropy of printed English. Bell Syst. Tech. J. 1951;30:50–64. doi: 10.1002/j.1538-7305.1951.tb01366.x. [DOI] [Google Scholar]
  • 7.Newman ME. Power laws, Pareto distributions and Zipf’s law. Contemp. Phys. 2005;46:323–351. doi: 10.1080/00107510500052444. [DOI] [Google Scholar]
  • 8.Clauset A, Shalizi CR, Newman ME. Power-law distributions in empirical data. SIAM Rev. 2009;51:661–703. doi: 10.1137/070710111. [DOI] [Google Scholar]
  • 9.Corral, A., Serra, I. & Ferrer-i Cancho, R. The distinct flavors of Zipf’s law in the rank-size and in the size-distribution representations, and its maximum-likelihood fitting. arXiv preprint arXiv:1908.01398 (2019). [DOI] [PubMed]
  • 10.Hanel R, Corominas-Murtra B, Liu B, Thurner S. Fitting power-laws in empirical data with estimators that work for all exponents. PLoS ONE. 2017;12:e0170920. doi: 10.1371/journal.pone.0170920. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Goldstein ML, Morris SA, Yen GG. Problems with fitting to the Power-law distribution. Eur. Phys. J. B. 2004 doi: 10.1140/epjb/e2004-00316-5. [DOI] [Google Scholar]
  • 12.Bauke H. Parameter estimation for power-law distributions by maximum likelihood methods. Eur. Phys. J. B. 2007;58:167–173. doi: 10.1140/epjb/e2007-00219-y. [DOI] [Google Scholar]
  • 13.Seal H. The maximum likelihood fitting of the discrete pareto law. J. Inst. Actuar. 1952;1886–1994(78):115–121. doi: 10.1017/S0020268100052501. [DOI] [Google Scholar]
  • 14.Heaps HS. Information Retrieval, Computational and Theoretical Aspects. Academic Press; 1978. [Google Scholar]
  • 15.Beaumont MA. Approximate Bayesian computation in evolution and ecology. Annu. Rev. Ecol. Evol. Syst. 2010;41:379–406. doi: 10.1146/annurev-ecolsys-102209-144621. [DOI] [Google Scholar]
  • 16.Mandelbrot B. An informational theory of the statistical structure of language. Commun. Theory. 1953;84:486–502. [Google Scholar]
  • 17.Ryser, H. J. Combinatorial Mathematics, vol. 14 (American Mathematical Soc., 1963).
  • 18.Glynn DG. The permanent of a square matrix. Eur. J. Comb. 2010;31:1887–1891. doi: 10.1016/j.ejc.2010.01.010. [DOI] [Google Scholar]
  • 19.Sunnåker M, et al. Approximate Bayesian computation. PLoS Comput. Biol. 2013;9:e1002803. doi: 10.1371/journal.pcbi.1002803. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Beaumont MA, Zhang W, Balding DJ. Approximate Bayesian computation in population genetics. Genetics. 2002;162:2025–2035. doi: 10.1093/genetics/162.4.2025. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Csilléry K, Blum MGB, Gaggiotti OE, François O. Approximate Bayesian computation (ABC) in practice. Trends Ecol. Evol. 2010;25:410–418. doi: 10.1016/j.tree.2010.04.001. [DOI] [PubMed] [Google Scholar]
  • 22.Sisson SA, Fan Y, Tanaka MM. Sequential Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. U.S.A. 2007;104:1760–1765. doi: 10.1073/pnas.0607208104. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Bernton, E., Jacob, P. E., Gerber, M. & Robert, C. P. Approximate Bayesian computation with the Wasserstein distance. arXiv preprint arXiv:1905.03747 (2019).
  • 24.Cappé O, Guillin A, Marin JM, Robert CP. Population Monte Carlo. J. Comput. Graph. Stat. 2004;13:907–929. doi: 10.1198/106186004X12803. [DOI] [Google Scholar]
  • 25.Beaumont MA, Cornuet J-M, Marin J-M, Robert CP. Adaptive approximate Bayesian computation. Biometrika. 2009;96:983–990. doi: 10.1093/biomet/asp052. [DOI] [Google Scholar]
  • 26.Brown, T. B. et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020).
  • 27.Project Gutenberg (2020). [Online; accessed 16. Jul. 2020].
