Abstract
In a language corpus, the probability that a word occurs n times is often proportional to 1/n^2. When words are assigned a rank, s, according to their abundance, a plot of log s versus log n typically has a slope of minus one. That simple Zipf's law pattern also arises in the population sizes of cities, the sizes of corporations, and other patterns of abundance. By contrast, for the abundances of different biological species, the probability of a population of size n is typically proportional to 1/n, declining exponentially for larger n, the log series pattern.
This article shows that the differing patterns of Zipf's law and the log series arise as the opposing endpoints of a more general theory. The general theory follows from the generic form of all probability patterns as a consequence of conserved average values and the associated invariances of scale.
To understand the common patterns of abundance, it is sufficient to combine the generic form of probability distributions with the constraint of conserved average abundance. The general theory includes cases that lie between the Zipf and log series endpoints, providing a broad framework for analyzing widely observed abundance patterns.
Keywords: scaling patterns, ecology, demography, linguistics, probability theory
Introduction
A few simple patterns recur in nature. Adding up random processes often leads to the bell-shaped normal distribution. Death and other failures typically follow the extreme value distributions.
Those simple patterns recur under widely varying conditions. Something fundamental must set the relations between pattern and underlying process. To understand the common patterns of nature, we must know what fundamentally constrains the forms that we see.
Without that general understanding, we will often reach for unnecessarily detailed and complex models of process to explain what is in fact some structural property that influences the invariant form of observed pattern.
We already understand that the central limit theorem explains the widely observed normal distribution 1. Similar limit theorems explain why failure often follows the extreme value pattern 2, 3.
The puzzles set by other commonly observed patterns remain unsolved. Each of those puzzles poses a challenge. The solutions will likely broaden our general understanding of what causes pattern. Such insight will help greatly in the big data analyses that play an increasingly important role in modern science.
Zipf’s law is one of the great unsolved puzzles of invariant pattern. The frequency of word usage 4, the sizes of cities 5, 6, and the sizes of corporations 7 have the same shape. On a log-log plot of rank versus abundance, the slope is minus one. For cities, the largest city would have a rank of one, the second largest city a rank of two, and so on. Abundance is population size.
The abundance of species is another great unsolved puzzle of invariant pattern. In an ecological community, the probability that a species has a population size of n individuals is proportional to p^n/n, the log series pattern 8. Communities differ only in their average population size, described by the parameter, p. Actual data vary, but most often fit closely to the log series 9.
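To make the contrast between the two patterns concrete, here is a minimal numerical sketch in Python. It is illustrative only: the parameter value p = 0.95, the abundance cutoff at n = 1000, and the normalization over that finite range are arbitrary choices of mine, not values from the text.

```python
import numpy as np

# Illustrative only: compare the shapes of the log series, p^n / n, and Zipf's law, 1 / n^2.
# The parameter p and the cutoff n_max are arbitrary choices for this sketch.
n = np.arange(1, 1001)
p = 0.95

log_series = p**n / n
zipf = 1.0 / n**2

# Normalize each form over this finite range so that the probabilities sum to one.
log_series /= log_series.sum()
zipf /= zipf.sum()

for k in (1, 10, 100, 1000):
    print(f"n={k:5d}   log series={log_series[k - 1]:.3e}   Zipf={zipf[k - 1]:.3e}")
```

The printout shows the qualitative difference described above: the log series falls off exponentially at large n, whereas Zipf's law keeps its power law tail.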
In this article, I show that Zipf’s law and the log series arise as the opposing endpoints of a more general theory. That theory provides insight into the particular puzzles of Zipf’s law and species abundances. The analysis also suggests deeper insights that will help to unify understanding of commonly observed patterns.
Theory
The argument begins with the invariances that define alternative probability patterns 10, 11. To analyze the invariances of a probability distribution, note that we can write almost any probability distribution, q_z, as

q_z = k e^{−λ T_z},    (1)

in which T(z) ≡ T_z is a function of the variable z. The probability pattern, q_z, is invariant to a constant shift, T_z ↦ a + T_z, because we can write the transformed probability pattern in Equation 1 as

q_z = k_a e^{−λ(a + T_z)} = k e^{−λ T_z},

with k = k_a e^{−λa}. We express k in this way because k adjusts to satisfy the constraint that the total probability be one. In other words, conserved total probability implies that the probability pattern is shift invariant with respect to T_z.
Now consider the consequences if the average of some value over the distribution q_z is conserved. That constraint causes the probability pattern to be invariant to a multiplicative stretching (or shrinking), T_z ↦ bT_z, because

q_z = k e^{−λ_b (bT_z)} = k e^{−λ T_z},

with λ = λ_b b. We specify λ in this way because λ adjusts to satisfy the constraint of conserved average value. Thus, an invariant average value implies that the probability pattern is stretch invariant with respect to T_z.
Conserved total probability and conserved average value cause the probability pattern to be invariant to an affine transformation of the T_z scale, T_z ↦ a + bT_z, in which “affine” means both shift and stretch.
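As a quick symbolic check of this combined invariance (my own sketch, not part of the original derivation), the following sympy code confirms that an affine change of the scale, T ↦ a + bT, in the exponential form is absorbed by redefining the constants as k = k_a e^{−λ_b a} and λ = λ_b b.

```python
import sympy as sp

# Symbols: T is the scale value; a and b define the affine transformation;
# k_a and lam_b are the constants before they adjust to the constraints.
T, a, b, k_a, lam_b = sp.symbols('T a b k_a lambda_b', positive=True)

# Probability pattern after the affine transformation T -> a + b*T.
transformed = k_a * sp.exp(-lam_b * (a + b * T))

# Adjusted constants: k absorbs the shift, lambda absorbs the stretch.
k = k_a * sp.exp(-lam_b * a)
lam = lam_b * b

# The transformed pattern equals the original exponential form k * exp(-lam * T).
print(sp.simplify(transformed - k * sp.exp(-lam * T)))   # prints 0
```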
The affine invariance of probability patterns with respect to T_z induces significant structure on the form of T_z and the associated form of probability patterns. Understanding that structure provides insight into probability patterns and the processes that generate them 10, 12, 13.
In particular, Frank & Smith 12 showed that the invariance of probability patterns to affine transformation, T_z ↦ a + bT_z, implies that T_z satisfies the differential equation

dT_z/dz ∝ w′(z) e^{β w(z)},

in which w(z) is a function of the variable z. The solution of this differential equation expresses the scaling of probability patterns in the generic form

T_z = (e^{βw} − 1)/β,

in which, because of the affine invariance of T_z, I have added and multiplied by constants to obtain a convenient form, with T_z → w as β → 0. With this expression for T_z, we may write probability patterns generically as

q_z = k e^{−λ(e^{βw} − 1)/β}.    (3)
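A short symbolic check (my own addition, not from the article) confirms the limiting behavior stated above: the convenient form (e^{βw} − 1)/β reduces to w as β → 0.

```python
import sympy as sp

beta, w = sp.symbols('beta w', positive=True)

# The convenient form of the scale T_z used in the text.
T = (sp.exp(beta * w) - 1) / beta

print(sp.limit(T, beta, 0))        # the limit is w
print(sp.series(T, beta, 0, 2))    # w plus corrections of order beta
```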
Turning now to the log series and Zipf’s law, the relation n = e^r between observed pattern, n, and process, r, plays a central role. Here, r represents the total of all proportional processes acting on abundance. A proportional process simply means that the number of individuals or entities affected by the process increases in proportion to the number currently present, n.
The sum of all of the proportional processes acting on abundance over some period of time is

r = ∫_0^τ m(t) dt = log n − log n_0.

Here, m(t) is a proportional process acting at time t to change abundance. The value of r = log n is the total of the m values over the total time, τ. For simplicity, I assume n_0 = 1.
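The accumulation of proportional processes can be illustrated with a small simulation (my own sketch; the number of steps, the time increment, and the randomly drawn rates are arbitrary choices). Multiplying abundance by e^{m(t) dt} at each step gives a final abundance whose logarithm equals the summed rates, r.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Arbitrary illustrative choices: tau steps of length dt, with random proportional rates m(t).
tau, dt = 1000, 0.01
m = rng.normal(loc=0.2, scale=0.5, size=tau)

# Each proportional process multiplies abundance; start from n_0 = 1.
n = 1.0
for rate in m:
    n *= np.exp(rate * dt)

r = np.sum(m * dt)     # total of the proportional processes over the total time
print(np.log(n), r)    # the two values agree: r = log n
```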
Proportional processes are often discussed in terms of population growth 5, 14. However, many different processes act individually on the members of a population. If the number of individuals affected increases in proportion to population size, then the process is a proportional process.
Growth and other proportional processes often lead to an approximate power law, q_n ≈ k n^{−ρ}. However, the exponent of a growth process does not necessarily match the values observed in the log series and Zipf’s law. We need both the power law aspect of proportional process and something further to get the specific forms of those widely observed abundance distributions. That something further arises from conserved quantities and their associated invariances.
The log series and Zipf’s law follow as special cases of the generic probability pattern in Equation 3. To analyze abundance, focus on the process scale by letting the variable of interest be z ≡ r, with the key scaling simply the process variable itself, w(r) = r. Then Equation 3 becomes

q_r dr = k e^{−λ(e^{βr} − 1)/β} dr,

in which q_r dr is the probability of a process value in the small interval from r to r + dr. From the relation between abundance and process, n = e^r, we can change from the process scale to the abundance scale by the substitutions r ↦ log n and dr ↦ n^{−1} dn, yielding the identical probability pattern expressed on the abundance scale

q_n dn = k n^{−1} e^{−λ(n^β − 1)/β} dn.    (5)
The value of k always adjusts to satisfy the constraint of invariant total probability, and the value of λ always adjusts to satisfy the constraint of invariant average value.
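The change of variables from the process scale to the abundance scale can be checked numerically (a minimal sketch of my own; the parameter values k, λ, and β below are arbitrary). Evaluating the process-scale density at r = log n and multiplying by the Jacobian dr/dn = 1/n reproduces the abundance-scale density of Equation 5.

```python
import numpy as np

# Arbitrary illustrative parameter values, not taken from the text.
k, lam, beta = 1.0, 0.5, 0.8

def q_r(r):
    # Process-scale density: k * exp(-lam * (e^(beta*r) - 1) / beta).
    return k * np.exp(-lam * (np.exp(beta * r) - 1.0) / beta)

def q_n(n):
    # Abundance-scale density of Equation 5: k * n^(-1) * exp(-lam * (n^beta - 1) / beta).
    return k * np.exp(-lam * (n**beta - 1.0) / beta) / n

n = 3.7
print(q_r(np.log(n)) / n, q_n(n))   # the two values agree
```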
For β = 1, we obtain the log series distribution

q_n = k n^{−1} e^{−λn},

replacing n − 1 by n in the exponential term which, because of affine invariance, describes the same probability pattern. The log series is often written with e^{−λ} = p, and thus q_n = k p^n/n. One typically observes discrete values n = 1, 2, . . . . The Supplemental material for this article 15 shows the relation between discrete and continuous distributions 16 and the domain of the variables. The continuous analysis here is sufficient to understand pattern.
For β → 0, we have (n^β − 1)/β → log n, which yields

q_n = k n^{−(1+λ)},

for n ≥ 1. If we constrain average abundance, 〈n〉, with respect to this distribution, then

〈n〉 = λ/(λ − 1).

For any average abundance that is finite and not small, λ → 1, which yields q_n ∝ n^{−2}, Zipf’s law.
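A small numerical sketch (my own check, based on the continuous density q_n = λ n^{−(1+λ)} on n ≥ 1) shows how the constraint on average abundance pins λ near one. Inverting 〈n〉 = λ/(λ − 1) gives λ = 〈n〉/(〈n〉 − 1), so any average abundance that is not small forces λ toward one and the density toward n^{−2}.

```python
# For the continuous density q_n = lam * n**(-(1 + lam)) on n >= 1 (with lam > 1),
# the normalizing constant is lam and the mean abundance is lam / (lam - 1).

def lam_for_mean(avg_n):
    # Invert <n> = lam / (lam - 1) to get the lambda implied by a given average abundance.
    return avg_n / (avg_n - 1.0)

for avg in (2.0, 10.0, 100.0, 10_000.0):
    lam = lam_for_mean(avg)
    print(f"<n> = {avg:8.1f}   lambda = {lam:.4f}   exponent 1 + lambda = {1 + lam:.4f}")

# As the average abundance grows, the exponent 1 + lambda approaches 2: Zipf's law.
```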
Equation 5 provides a general expression for abundance distributions. The log series and Zipf’s law set the endpoints of β = 1 and β → 0. We can understand the differences between abundance distributions in terms of the parameter β by writing the distribution in the generic form of Equation 1, with the defining affine invariant scale

T_n = (1/λ) log n + (n^β − 1)/β.    (8)

This scale expresses the invariances that define the pattern. At the Zipf’s law endpoint, β → 0, the scale becomes 2 log n = 2r, because the constraint that the average abundance, 〈n〉, is sufficiently large forces λ → 1.
In this case, with affine invariant scale T_n = 2r, neither addition nor multiplication of the process value, r ↦ a + br, alters the pattern. We could have started with this affine invariance, and derived the probability pattern from the invariance properties 10, 11.
For the log series endpoint, β = 1, the affine invariant scale is

T_n = (1/λ) log n + n − 1 = r/λ + e^r − 1.

The dominant aspect of the scale changes with n. For small abundances, the logarithmic scale r = log n dominates, and for large abundances, the linear scale n = e^r dominates. Many common probability patterns change their scaling with magnitude 13, 17.
For log series patterns, the dominance of the scale at small magnitude by r corresponds to affine invariance with respect to r. At larger abundances, the dominance by the effectively linear scale, n, corresponds to invariance to a shift in process, r ↦ a + r, but not to a multiplication of process, r ↦ br, because e^{br} = n^b is a power transformation of abundance. Linear scales are not invariant to power transformations. Once again, we could have derived the pattern from the invariances.
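To see how the two terms of this scale trade off, here is an illustrative numerical sketch (my own; the value λ = 0.1, a small λ corresponding to a large average abundance, is an arbitrary choice). The logarithmic term dominates at small abundances and the nearly linear term dominates at large abundances, matching the change of scale with magnitude described above.

```python
import numpy as np

# Illustrative only: beta = 1 (log series endpoint) and a small, arbitrary lambda.
beta, lam = 1.0, 0.1

def scale_terms(n):
    # The two components of T_n = (1/lam) * log(n) + (n**beta - 1) / beta.
    log_term = np.log(n) / lam
    growth_term = (n**beta - 1.0) / beta
    return log_term, growth_term

for n in (2.0, 10.0, 100.0, 10_000.0):
    log_term, growth_term = scale_terms(n)
    dominant = "log term" if log_term > growth_term else "growth term"
    print(f"n={n:9.1f}  log term={log_term:10.2f}  growth term={growth_term:12.2f}  dominant: {dominant}")
```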
In Equation 8, intermediate values of β combine aspects of Zipf’s law and the log series. The closer β is to one of the endpoints, the more the invariance characteristics of that endpoint dominate pattern.
Conclusion
This analysis shows how two great and seemingly unconnected puzzles solve very simply in terms of a single continuum between alternative invariances. This approach reveals the simple invariant structure of many common probability patterns.
Data availability
Underlying data
All data underlying the results are available as part of the article and no additional source data are required.
Extended data
Zenodo: Supplemental Material for “The common patterns of abundance: the log series and Zipf’s law”. https://doi.org/10.5281/zenodo.2597895 15.
Acknowledgements
I completed this work while on sabbatical in the Theoretical Biology group of the Institute for Integrative Biology at Eidgenössische Technische Hochschule (ETH) Zürich.
A previous version of this article is available on arXiv: https://arxiv.org/abs/1812.09662
Funding Statement
The Donald Bren Foundation supports my research.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
[version 1; peer review: 4 approved]
References
- 1. Fischer H: A History of the Central Limit Theorem: From Classical to Modern Probability Theory. Springer, New York, 2011. doi: 10.1007/978-0-387-87857-7
- 2. Kotz S, Nadarajah S: Extreme Value Distributions: Theory and Applications. World Scientific, Singapore, 2000. doi: 10.1142/9781860944024
- 3. Coles S: An Introduction to Statistical Modeling of Extreme Values. Springer, New York, 2001. doi: 10.1007/978-1-4471-3675-0
- 4. Zipf GK: The Psycho-biology of Language. Houghton Mifflin, Boston, 1935.
- 5. Gabaix X: Zipf’s law for cities: an explanation. Q J Econ. 1999;114(3):739–767.
- 6. Arshad S, Hu S, Ashraf BN: Zipf’s law and city size distribution: a survey of the literature and future research agenda. Physica A: Stat Mech Appl. 2018;492:75–92. doi: 10.1016/j.physa.2017.10.005
- 7. Axtell RL: Zipf distribution of U.S. firm sizes. Science. 2001;293(5536):1818–1820. doi: 10.1126/science.1062081
- 8. Fisher RA, Corbet AS, Williams CB: The relation between the number of species and the number of individuals in a random sample of an animal population. J Anim Ecol. 1943;12(1):42–58. doi: 10.2307/1411
- 9. Baldridge E, Harris DJ, Xiao X, et al.: An extensive comparison of species-abundance distribution models. PeerJ. 2016;4:e2823. doi: 10.7717/peerj.2823
- 10. Frank SA: Common probability patterns arise from simple invariances. Entropy. 2016;18(5):192. doi: 10.3390/e18050192
- 11. Frank SA: Measurement invariance explains the universal law of generalization for psychological perception. Proc Natl Acad Sci U S A. 2018;115(39):9803–9806. doi: 10.1073/pnas.1809787115
- 12. Frank SA, Smith E: A simple derivation and classification of common probability distributions based on information symmetry and measurement scale. J Evol Biol. 2011;24(3):469–484. doi: 10.1111/j.1420-9101.2010.02204.x
- 13. Frank SA: How to read probability distributions as statements about process. Entropy. 2014;16:6059–6098. doi: 10.3390/e16116059
- 14. Gibrat R: Les Inégalités Économiques. Librairie du Recueil Sirey, Paris, 1931.
- 15. Frank SA: Supplemental Material for “The common patterns of abundance: the log series and Zipf’s law”. Zenodo. 2019. doi: 10.5281/zenodo.2597895
- 16. Au C, Tam J: Transforming variables using the Dirac generalized function. Am Stat. 1999;53(3):270–272. doi: 10.2307/2686109
- 17. Frank SA: The invariances of power law size distributions [version 2; peer review: 2 approved]. F1000Res. 2016;5:2074. doi: 10.12688/f1000research.9452.3