PLOS ONE. 2024 Sep 27; 19(9): e0301240. doi: 10.1371/journal.pone.0301240

Machine learning of the prime distribution

Alexander Kolpakov 1,*,#, A Alistair Rocke 2,#
Editor: Viacheslav Kovtun3
PMCID: PMC11432912  PMID: 39331654

Abstract

In the present work we use maximum entropy methods to derive several theorems in probabilistic number theory, including a version of the Hardy–Ramanujan Theorem. We also provide a theoretical argument explaining the experimental observations of Y.–H. He about the learnability of primes, and posit that the Erdős–Kac law would be very unlikely to be discovered by current machine learning techniques. Numerical experiments that we perform corroborate our theoretical findings.

Introduction

Below we briefly recall some known results from Kolmogorov's complexity theory and algorithmic randomness. The reader may find a detailed exposition of this theory in the monographs [1, 2]. Here, however, we assume that the most fundamental notions of computability theory are known to the reader. We also provide interpretations for some of these results in the wider epistemological context.

Kolmogorov’s Invariance Theorem

Let U be a Turing–complete language that is used to simulate a universal Turing machine. Let p be an input of U that produces a given binary string x ∈ {0, 1}*. Then the Kolmogorov Complexity (or Minimal Description Length) of x is defined as

K_U(x) = \min_p \{ |p| : U_p = x \} \qquad (1)

where U_p denotes the output of U on input p.

Kolmogorov's Invariance Theorem states that the above definition is (on the large scale) invariant under the choice of U. Namely, any other Turing–complete language (or, equivalently, another universal Turing machine) U′ satisfies

\forall x \in \{0, 1\}^* : \quad |K_U(x) - K_{U′}(x)| \le O(1) \qquad (2)

In other terms,

\forall x \in \{0, 1\}^* : \quad -c(U, U′) \le K_U(x) - K_{U′}(x) \le c(U, U′) \qquad (3)

for some positive constant c(U, U′) that depends only on U and U′.

The minimal description p such that Up = x serves as a natural representation of the string x relative to the Turing–complete language U. However, it turns out that p, and thus KU(x), is not computable.
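Although K_U itself is not computable, any lossless compressor yields a computable upper bound on it, up to the additive cost of the decompressor. The following Python sketch is our illustration of this point (it is not part of the argument above) and uses zlib to contrast a highly structured string with a typical random one:

```python
import os
import zlib

def complexity_upper_bound(x: bytes) -> int:
    """Bit length of a zlib-compressed description of x.

    Compression gives only a computable *upper bound* on K_U(x),
    up to the additive cost of the decompressor; the true
    Kolmogorov Complexity itself is not computable.
    """
    return 8 * len(zlib.compress(x, 9))

print(complexity_upper_bound(b"01" * 4096))      # highly structured: compresses well
print(complexity_upper_bound(os.urandom(8192)))  # "typical" random bytes: barely compress
```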

Levin’s Universal Distribution

The Algorithmic Probability of a binary string x can be defined as the probability of x being generated by U on random input p, where we consider p to be a binary string generated by fair coin flips:

P(x) = \sum_{p \,:\, U_p = x} 2^{-|p|} \qquad (4)

However, this quantity is not well–defined: we can choose one such input p and use it as a prefix for some p′ that is about \log_2 k bits longer than p and such that U produces the same binary string: U_{p′} = x. Then 2^{-|p′|} \sim 2^{-|p|}/k, and we have that

P(x) \gtrsim 2^{-|p|} \sum_k \frac{1}{k} \qquad (5)

for k ranging over any subset of integers. Thus, P(x) need not be at most 1, nor even finite.

Levin’s idea effectively formalizes Occam’s razor: we need to consider prefix–free Turing–complete languages only. Such languages are easy to imagine: if we agree that all documents end with the instruction \end{document} that cannot appear anywhere else, then we have a prefix–free language.

Given that any prefix–free code satisfies the Kraft-McMillan inequality, we obtain the Universal Distribution:

2^{-K_U(x)} \le P(x) = \sum_{p \,:\, U_p = x} 2^{-|p|} \le 1 \qquad (6)

where from now on we consider U to be prefix–free, and K_U now denotes the prefix–free Kolmogorov Complexity.
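As a small illustration (ours, with a made-up codeword set), the Kraft–McMillan inequality can be checked directly for a toy prefix–free code:

```python
# Toy check of the Kraft–McMillan inequality for a prefix-free code.
# The codeword set below is an illustrative example, not taken from the paper.
codewords = ["0", "10", "110", "1110", "1111"]

def is_prefix_free(words):
    return not any(a != b and b.startswith(a) for a in words for b in words)

kraft_sum = sum(2 ** -len(w) for w in codewords)
print(is_prefix_free(codewords))  # True
print(kraft_sum)                  # 1.0, hence <= 1, as Kraft–McMillan requires
```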

Interpretation

The above facts may also be given a physical interpretation: in a computable Universe, given a phenomenon with encoding x ∈ {0, 1}* generated by a physical process, the probability of that phenomenon is well–defined and equal to the sum over the probabilities of distinct and independent causes. The prefix–free condition is precisely what guarantees causal independence.

Furthermore, Levin's Universal Distribution formalizes Occam's razor: the most likely cause of the process is provided by the minimal description of x, while lengthier explanations are less probable causes.

Levin’s Coding Theorem

In the setting of prefix–free Kolmogorov complexity, Levin’s Coding theorem states that

K_U(x) = -\log_2 P(x) + O(1) \qquad (7)

or, in other terms,

P(x) = \Theta(2^{-K_U(x)}) \qquad (8)

The algorithmic probability of x was defined by Solomonoff simply as

R(x) = 2^{-K_U(x)}, \qquad (9)

which is proportional to the leading term of the Universal Distribution probability.

Interpretation

Relative to a prefix–free Turing–complete language U (or, equivalently, a universal prefix–free Turing machine), the number of fair coin flips required to generate the shortest program that outputs x is on the order of K_U(x).

Maximum entropy via Occam’s razor

Given a discrete random variable X with computable probability semi–measure P, it holds that

\mathbb{E}[K_U(X)] = \sum_{x} P(x) \cdot K_U(x) = H(X) + O(1), \qquad (10)

where H(X) is the Shannon Entropy of X in base 2.

Moreover, we have that

\mathbb{E}[R(X)] = 2^{-\mathbb{E}[K_U(X)]} = 2^{-H(X) + O(1)}, \qquad (11)

which means that the expected Solomonoff probability is inversely proportional to the effective alphabet size.

Interpretation

The Shannon Entropy of a random variable in base 2 equals its Expected Kolmogorov Complexity up to a constant that becomes negligible in asymptotic analysis. This provides us with a precise answer to von Neumann’s original question.

As follows from (10), the average Kolmogorov Complexity and Entropy have the same order of magnitude. Hence machine learning systems that minimise the KL–Divergence are implicitly applying Occam’s razor.
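As an illustration of (10) in a fully computable setting, one may compare the Shannon Entropy of a distribution with the expected length of a Huffman code, a computable stand-in for minimal description length. The distribution below is a made-up example of ours:

```python
import heapq
import math

def huffman_lengths(probs):
    """Codeword lengths of a binary Huffman code for the distribution `probs`."""
    heap = [(p, [i]) for i, p in enumerate(probs)]
    lengths = [0] * len(probs)
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, s1 = heapq.heappop(heap)   # merge the two least probable groups
        p2, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1            # every symbol in the merged group gets one more bit
        heapq.heappush(heap, (p1 + p2, s1 + s2))
    return lengths

probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]    # an illustrative distribution
H = -sum(p * math.log2(p) for p in probs)     # Shannon Entropy in bits
L = sum(p * l for p, l in zip(probs, huffman_lengths(probs)))
print(H, L)   # H <= L < H + 1 by Shannon's source coding theorem
```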

However, what exactly do we mean by a random variable? In a computable Universe the sample space of a random variable X represents the state–space of a Turing Machine with unknown dynamics whose output sequence is computable. As the generated sequence is computable, it is finite–state incompressible in the worst case, i.e. it is a normal number. Hence, a random variable corresponds to a stochastic source that is finite–state random.

This definition comes from the well–known correspondence between finite–state machines and normal numbers, which establishes that a sequence is normal if and only if no lossless finite–state machine can compress it.

Algorithmic randomness

Let x be a binary sequence, and x|_N be its initial segment of N bits. Then x is algorithmically random if

K_U(x|_N) \ge N - c, \qquad (12)

for all N and some constant c ≥ 0. Such a sequence x is also called incompressible.

Thus, an integer n \in \mathbb{N} is algorithmically random (or incompressible) if

K_U(n) = \log_2 n + O(1), \qquad (13)

and a discrete random variable X taking N values with computable distribution is algorithmically random if

\mathbb{E}[K_U(X)] = \log_2 N + O(1), \qquad (14)

which corresponds to the maximal entropy case of the uniform distribution.

Interpretation

A random binary string cannot be compressed because it does not contain any information about itself. Thus, a random number n needs log2 n bits to be written in its binary form, and a random variable is “truly random” in a computable Universe if its expected Kolmogorov complexity is maximal.

Maximum entropy methods for probabilistic number theory

Below we illustrate the information–theoretic approach to classical theorems in number theory. Some proofs using the notions of entropy and Kolmogorov complexity are already known [3], while other proofs below are new. We keep both kinds in order to present the whole picture.

The Erdős–Euclid theorem

This information–theoretic adaptation of Erdős’ proof of Euclid’s theorem is originally due to Ioannis Kontoyiannis [3]. In essence, this proof demonstrates that the information content of finitely many primes is insufficient to generate all the integers.

Let π(N) be the number of primes that are less than or equal to a given natural number N. Let us suppose that the set of primes \mathcal{P} is finite, so that \mathcal{P} = \{ p_i \}_{i=1}^{\pi(N)}, where π(N) is constant for N large enough.

Then we can consider an integer–valued random variable Z chosen uniformly from [1, N], such that

Z = \left( \prod_{i=1}^{\pi(N)} p_i^{X_i} \right) \cdot Y^2 \qquad (15)

for some integer–valued random variables 1 ≤ Y ≤ N and X_i ∈ {0, 1}, such that Z/Y^2 is square–free. In particular, we have that Y \le \sqrt{N}, and thus the upper bound for Shannon's Entropy from Jensen's inequality implies:

H(Y) \le \log_2 \sqrt{N} = \frac{1}{2} \log_2 N. \qquad (16)

Also, since X_i is a binary variable, we have H(X_i) ≤ 1.

Then, we compute

H(Z) = \log_2 N. \qquad (17)

Moreover, we readily obtain the following inequality:

H(Z) = H(Y, X_1, \ldots, X_{\pi(N)}) \le H(Y) + \sum_{i=1}^{\pi(N)} H(X_i) \le \frac{1}{2} \log_2 N + \pi(N) \qquad (18)

which implies

\pi(N) \ge \frac{1}{2} \log_2 N \qquad (19)

This clearly contradicts the assumption that π(N) is constant for all natural N, and provides us with a simple, though far from optimal, lower bound.
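To see how weak the bound (19) is, one can compare π(N) with (1/2) \log_2 N numerically; the following sketch uses SymPy, and the chosen values of N are arbitrary:

```python
from math import log2
from sympy import primepi

# How weak is the entropy lower bound pi(N) >= (1/2) * log2(N)?
for N in (10**3, 10**6, 10**7):
    print(N, int(primepi(N)), 0.5 * log2(N))
# e.g. pi(10**7) = 664_579, while the bound only guarantees about 11.6
```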

Chebyshev’s theorem via algorithmic probability

Below we present an information–theoretic derivation of Chebyshev's Theorem [4], an important precursor of the Prime Number Theorem, from the Maximum Entropy Principle. Another proof was given by Ioannis Kontoyiannis in [3].

Chebyshev’s Theorem is a classical result that states that

\sum_{p \le N} \frac{1}{p} \cdot \ln p \sim \ln N, \qquad (20)

where we sum over the primes p ≤ N.

Here, two functions are asymptotically equivalent, written f(n) \sim g(n), if \lim_{n \to \infty} f(n)/g(n) = 1.

In information–theoretical terms, the expected information gained from observing a prime number in the interval [1, N] is on the order of ∼log2 N.

For an integer Z sampled uniformly from the interval [1, N] we may define its random prime factorization in terms of the random variables Xp:

Z = \prod_{p \le N} p^{X_p}. \qquad (21)

As we have no prior information about Z, it has the maximum entropy distribution among all possible distributions on [1, N].

Since the set of finite strings is countable, there is a 1–to–1 map from {0, 1}* to \mathbb{N}, and we may define the Kolmogorov Complexity as a map from integers to integers, K_U : \mathbb{N} \to \mathbb{N}. As almost all finite strings are incompressible [2, Theorem 2.2.1], it follows that almost all integers are algorithmically random. Thus, for large N and for n ∈ [1, N], we have K_U(n) = \log_2 n + O(1) almost surely.

Thus,

\mathbb{E}[K_U(Z)] \approx \mathbb{E}[\log_2 Z] = \frac{1}{N} \sum_{k=1}^{N} \log_2 k = \frac{\log_2 (N!)}{N} \sim \log_2 N, \qquad (22)

by using Stirling's approximation

\log_2 (N!) = N \log_2 N - N \log_2 e + O(\log_2 N). \qquad (23)

On the other hand,

\mathbb{E}[K_U(Z)] \approx \mathbb{E}[\log_2 Z] = \sum_{p \le N} \mathbb{E}[X_p] \cdot \log_2 p. \qquad (24)

All of the above implies

\mathbb{E}[K_U(Z)] \approx \sum_{p \le N} \mathbb{E}[X_p] \cdot \log_2 p \sim \log_2 N. \qquad (25)

Here, we may apply the Maximum Entropy Principle to X_p, provided that \mathbb{E}[X_p] is fixed: then X_p follows a geometric distribution [5, Theorem 12.1.1]. This can also be verified directly, as shown in [3].

Thus, we compute:

\mathbb{E}[X_p] = \sum_{k \ge 1} P(X_p \ge k) = \sum_{k \ge 1} \frac{1}{N} \left\lfloor \frac{N}{p^k} \right\rfloor \sim \frac{1}{p}. \qquad (26)

Thus, we rediscover Chebyshev's theorem:

H(X_{p_1}, \ldots, X_{p_{\pi(N)}}) = H(Z) \approx \mathbb{E}[K_U(Z)] \sim \sum_{p \le N} \frac{1}{p} \cdot \log_2 p \sim \log_2 N. \qquad (27)

It should be noted that the asymptotic equivalence above does not depend on the logarithm's base (as long as the same base is used on both sides).
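A quick numerical check of (20), with SymPy enumerating the primes and arbitrary cut-offs of ours:

```python
from math import log
from sympy import primerange

# Numerical check of Chebyshev's theorem: sum_{p <= N} (ln p)/p ~ ln N.
for N in (10**2, 10**4, 10**6):
    s = sum(log(p) / p for p in primerange(2, N + 1))
    print(N, round(s, 3), round(log(N), 3), round(s / log(N), 3))
# The last column (the ratio of the two sides) slowly approaches 1 as N grows.
```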

The Hardy–Ramanujan theorem from information theory

Based on the ideas explained in the previous paragraphs, we can deduce a version of the classical Hardy–Ramanujan theorem.

The probability of a prime factor

Given an integer Z chosen uniformly from [1, N] with prime factorisation

Z = \prod_{p \le N} p^{X_p}, \qquad (28)

we observe that

P(X_1 \ge k_1, \ldots, X_n \ge k_n) = \frac{1}{N} \left\lfloor \frac{N}{p_1^{k_1} \cdots p_n^{k_n}} \right\rfloor \sim \prod_{i \le n} \frac{1}{N} \left\lfloor \frac{N}{p_i^{k_i}} \right\rfloor = \prod_{i \le n} P(X_i \ge k_i), \qquad (29)

which means that the X_i's are asymptotically independent (i.e. independent in the large N limit).

In particular, the probability of having p \in \mathcal{P} as a prime factor satisfies

P(X_p \ge 1) \sim \frac{1}{p}. \qquad (30)

Likewise, the probability that two distinct primes p, q \in \mathcal{P} occur as factors simultaneously satisfies

P(X_p \ge 1 \wedge X_q \ge 1) \sim \frac{1}{p} \cdot \frac{1}{q}. \qquad (31)

Thus, any two sufficiently large primes p, q \in \mathcal{P} are statistically independent.

The expected number of unique prime factors

For any integer Z sampled uniformly from [1, N], we may define its number of unique prime factors w(Z) = \sum_{p \le N} I(X_p \ge 1). Thus, we calculate the expected value

\mathbb{E}[w(Z)] = \sum_{p \le N} P(X_p \ge 1) \sim \sum_{p \le N} \frac{1}{p} = \ln \ln N + O(1), \qquad (32)

where we use Mertens’ 2nd Theorem for the last equality.

The variance of w(Z)

Given our previous analysis, the random variables I_p = I(X_p \ge 1) are asymptotically independent, so that the variance of w(Z) satisfies

\mathrm{Var}[w(Z)] \sim \sum_{p \le N} \left( \mathbb{E}[I_p^2] - \mathbb{E}[I_p]^2 \right) \sim \sum_{p \le N} \left( \frac{1}{p} - \frac{1}{p^2} \right) = \ln \ln N + O(1), \qquad (33)

since \sum_{p \le N} \frac{1}{p^2} \le \frac{\pi^2}{6}.
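The estimates (32) and (33) can be checked empirically by sampling integers and counting their distinct prime factors with SymPy; the range and sample size below are arbitrary choices of ours:

```python
import math
import random
from sympy import primefactors

# Empirical check of E[w(Z)] and Var[w(Z)] against ln ln N.
N = 2 ** 24
random.seed(0)
sample = [random.randint(1, N) for _ in range(20_000)]
w = [len(primefactors(n)) for n in sample]

mean = sum(w) / len(w)
var = sum((x - mean) ** 2 for x in w) / len(w)
print(mean, var, math.log(math.log(N)))   # both are ln ln N + O(1); here ln ln N ≈ 2.81
```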

The Hardy–Ramanujan theorem

For any k > 0 and a random variable Y with \mathbb{E}[Y] = \mu and \mathrm{Var}[Y] = \sigma^2, Chebyshev's inequality informs us that

P(|Y - \mu| > k\sigma) \le \frac{1}{k^2}. \qquad (34)

With Y = w(Z) and k = (\ln \ln N)^{\varepsilon} we obtain a probabilistic version of the Hardy–Ramanujan theorem. Namely, for a random integer Z \sim U([1, N]) and any function \psi(N) \sim (\ln \ln N)^{1/2 + \varepsilon}, in the large N limit we have

|w(Z) - \ln \ln N| \le \psi(N) \qquad (35)

with probability

1 - O\left( \frac{1}{(\ln \ln N)^{2\varepsilon}} \right). \qquad (36)

Discussion and conclusions

The Erdős–Kac theorem states that

\frac{w(Z) - \ln \ln N}{\sqrt{\ln \ln N}} \qquad (37)

converges to the standard normal distribution \mathcal{N}(0, 1) as N → ∞.

This theorem is of great interest to the broader mathematical community, as it is impossible to guess from empirical observations. In fact, it is far from certain that Erdős and Kac would have proved the Erdős–Kac theorem if its precursor, the Hardy–Ramanujan theorem, had not been discovered first.

More generally, in the era of Big Data this theorem raises the question of how some scientists were able to formulate correct theories based on virtually zero empirical evidence.

In our computational experiments Z ran through all possible N–bit integers with N = 24, and no normal distribution emerged according to the D'Agostino–Pearson test. The p–value associated with this test equals the probability that normally distributed data produce a value of the D'Agostino–Pearson statistic at least as extreme as the one computed from the actual observations. Thus, a small p–value may be taken as evidence against the normality hypothesis. In our case we obtained a p–value of 0.0.

For comparison, the Central Limit Theorem clearly manifests itself already for 2^{16} (not even 2^{24}) binomial distribution samples. In this case, the p–value associated with the D'Agostino–Pearson test equals ∼0.4657.
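The comparison above can be sketched as follows. This is a scaled-down illustration of ours: it uses 18-bit rather than 24-bit integers so that it runs quickly, and the binomial parameters are placeholders rather than the exact settings behind the figures above; the full-scale code is available in the repository [6].

```python
import numpy as np
from scipy.stats import normaltest

bits = 18
N = 2 ** bits

# w(n): number of distinct prime factors of n, computed by a sieve.
w = np.zeros(N + 1, dtype=np.int32)
for p in range(2, N + 1):
    if w[p] == 0:        # p has no smaller prime factor, hence p is prime
        w[p::p] += 1     # p divides every multiple of p
print("w(n) p-value:", normaltest(w[2:]).pvalue)   # essentially 0: not normal

# Genuinely binomial samples, by contrast, typically pass the test.
rng = np.random.default_rng(0)
samples = rng.binomial(n=1000, p=0.5, size=2 ** 16)
print("binomial p-value:", normaltest(samples).pvalue)
```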

The code used in our numerical experiments is available on GitHub [6], and all computations can be reproduced on a laptop computer such as a MacBook Pro M1 with 8 GB RAM, or comparable hardware.

In order to observe the normal order in the Erdős–Kac law, one would need Z to reach about 2^{240}, not 2^{24} [7].

Thus, non–trivial scientific discoveries of this kind that are provably beyond the scope of computational induction (and hence machine learning) do not yet have an adequate explanation.

Learning the prime distribution

Below we provide a theoretical study of the Kolmogorov complexity of the prime distribution.

Entropy of primes

In what follows p \in \mathcal{P} will always be a fixed prime number, and X ∈ [1, N] will be chosen at random among the first N natural numbers. Considering prime numbers as random variables, or as elements of random walks, goes back to Billingsley's heuristics in [8].

From the information–theoretic perspective, Chebyshev's theorem states that the average code length of X expressed as the sequence of prime exponents X_{p_1}, \ldots, X_{p_{\pi(N)}} satisfies

H(X) = H(X_{p_1}, \ldots, X_{p_{\pi(N)}}) \sim \sum_{p \le N} \frac{1}{p} \cdot \ln p \sim \ln N, \qquad (38)

where we use the natural logarithm entropy instead of binary entropy. As previously noted, this is only a matter of computational convenience.

It turns out that the entropy of X, which is almost surely composite in the large N limit, essentially depends on the available primes p ≤ N.

A given non–negative integer n has binary code length

\ell(n) = \lfloor \log_2 n \rfloor + 1 = \log_2 n + O(1). \qquad (39)

Given an integer n ∈ [1, N], we need at most \ell(n) bits to encode it, and thus it can be produced by a program of length \le \ell(n). Note that \ell(N) is so far irrelevant here, as we need a prefix–free encoding and do not consider padding all binary strings to the same length with zero bits.

By Levin’s coding theorem,

P(X = n) = \Theta(2^{-K_U(n)}), \qquad (40)

and thus we have

2^{-\ell(n)} \le P(X = n) \le \alpha \cdot 2^{-\ell(n)}, \qquad (41)

with α > 1, for most numbers n ∈ [1, N], as most binary strings are incompressible in the large N limit [2, Theorem 2.2.1].

Thus, we may as well resort to the following Ansatz:

P(X = n) = \frac{1}{n}, \qquad (42)

as it provides a computable measure on [1, N] that is roughly equivalent to Levin's original distribution. The same discrete measure arises in the heuristic derivation of Benford's law [9].

Let us consider a random variable Y that generates primes within [1, N] with probability

P(Y = p) = \frac{1}{p},

which appears to represent the algorithmic probability of a prime rather than its frequentist probability.

Then we can write

H(Y) = -\sum_{p \le N} P(Y = p) \cdot \ln P(Y = p) = \sum_{p \le N} \frac{\ln p}{p} \sim \ln N.

There are π(N) primes p satisfying p ≤ N, and thus we need exactly π(N) random variables Y_i to generate them all. Each Y_i has the same distribution as Y, and we assume that the Y_i's are independent.

Let \tilde{Y}_N = (Y_1, \ldots, Y_{\pi(N)}) be the ordered sequence generating all primes in [1, N]. Then Shannon's source coding theorem informs us that \tilde{Y}_N needs no fewer than \pi(N) \cdot H(Y) bits to be encoded without information loss: any smaller encoding will almost surely lead to information loss. This means that

H(\tilde{Y}_N) = \pi(N) \cdot H(Y) + O(1). \qquad (43)

Thus,

\mathbb{E}[K_U(\tilde{Y}_N)] = H(\tilde{Y}_N) + O(1) = \pi(N) \cdot H(Y) + O(1) \sim \pi(N) \cdot \ln N \sim N, \qquad (44)

as the Prime Number Theorem provides the asymptotic equivalence

\pi(N) \sim \frac{N}{\ln N}. \qquad (45)

On the other hand, the most obvious encoding of primes is the N–bit string where we put 1 in position k if k is prime, and 0 if k is composite. The above equality for the expected Kolmogorov complexity of \tilde{Y}_N implies that for large values of N this string is almost surely incompressible.

Discussion and conclusions

The above discussion provides a theoretical corroboration of the experimental fact observed by Yang–Hui He in [10]. Namely, the complexity of “machine learning” the prime distribution on the interval [1, N] is equivalent to learning an algorithmically random sequence. Thus, the true positive rate of any model predicting primes should be very low.

There are, however, several theoretical issues with the above argument. One is that the proposed Benford’s probability is not a finite measure, as we have

\sum_{k \le N} P(X = k) = \sum_{k \le N} \frac{1}{k} = \ln N + O(1). \qquad (46)

Mertens' 1st theorem shows that summing over the primes only gives

\sum_{p \le N} P(Y = p) = \sum_{p \le N} \frac{1}{p} = \ln \ln N + O(1), \qquad (47)

which is a much smaller quantity asymptotically, yet unbounded.

We can apply some regularization, such as setting

P(X = n) = \frac{1}{n^s}, \qquad (48)

for some s > 1, thus obtaining a finite measure whose s → 1 limit can then be studied. This measure is not normalized to be a probability measure; however, this is not crucial (e.g. Levin's universal probability is not normalized either).

Indeed, we shall have

\sum_{k=1}^{\infty} P(X = k) = \zeta(s) < \infty, \qquad (49)

for any s > 1.

Moreover,

H_s(Y) = s \cdot \sum_{p \le N} \frac{\ln p}{p^s} \xrightarrow{\; s \to 1 \;} H(Y). \qquad (50)

In contrast, P will not converge to a finite measure as ζ(s) has a pole at s = 1.
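A small numerical illustration of the limit in (50), with an arbitrary cut-off N of ours:

```python
from math import log
from sympy import primerange

# H_s(Y) = s * sum_{p <= N} (ln p)/p**s approaches H(Y) = sum_{p <= N} (ln p)/p
# as s -> 1, for a fixed cut-off N.
N = 10 ** 5
primes = list(primerange(2, N + 1))
H = sum(log(p) / p for p in primes)
for s in (2.0, 1.5, 1.1, 1.01):
    Hs = s * sum(log(p) / p ** s for p in primes)
    print(s, round(Hs, 4))
print("s -> 1 limit:", round(H, 4))
```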

Given the fact that the Prime Encoding produces a bit–string that is algorithmically random (at least, asymptotically), the expected True Positive Rate (TPR) for inferring the first N numbers as primes or composites is on the order of

\mathbb{E}[\mathrm{TPR}] \sim \frac{1}{N} \sum_{k=1}^{N} P(X = k) = \frac{1}{N} \sum_{k=1}^{N} \frac{1}{k} \sim \frac{\ln N}{N}, \qquad (51)

and thus tends to 0 as N becomes large. Our numerical experiments corroborate the fact that the TPR of a machine learning model is indeed small and, moreover, that it diminishes as N grows.
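For the two sample sizes used in the experiments of the next section, the heuristic bound (51) evaluates as follows (a one-line sanity check of ours; the confusion matrices reported below are normalised over all samples, so only the qualitative trend is comparable):

```python
from math import log

# The heuristic order of magnitude E[TPR] ~ ln(N)/N for the two experiment sizes.
for bits in (18, 24):
    N = 2 ** bits
    print(bits, log(N) / N)
# Both values are tiny and shrink as N grows.
```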

Another point worthy of discussion is the independence of the primes Y_1, \ldots, Y_{\pi(N)}. This assumption brings Shannon's source coding theorem into play; however, it does not seem to fully hold (even theoretically). Indeed, there exists a Turing machine that enumerates any finite initial segment of the primes. Arguably, since K(n) \sim \log_2 n for most numbers, we may write the following upper bound for the complexity of the primes up to 2^N:

K(\pi(2^N)) + O(1) \le \log_2 \left( \frac{2^N}{N \ln 2} \right) + O(1) \sim N - \log_2 N + O(1). \qquad (52)

Thus, the primes are somewhat compressible, as their complexity is not outright N − O(1). However, our result about the expected Kolmogorov complexity of primes holds in the sense of asymptotic equivalence, and N − \log_2 N \sim N in the large N limit.

Numerical experiments

We posit that a machine learning model may not be reliably used to predict the locations of prime numbers. Indeed, X being algorithmically random means that no prediction can be reasonably inferred for it by using inductive learning.

Previous experiments on prime number inference using deep learning were done in [10], and showed a very low true positive rate (∼10^{-3}). The neural network had a three–layer architecture, and no specific training was performed. Modern tree–based classifiers approximate the Kolmogorov complexity very efficiently by using Huffman encoding or a more advanced variant thereof.

For example, XGBoost often outperforms other models in Kaggle competitions, especially on tabular data and in classification tasks. Thus, we may take XGBoost as a more practical experimental model.

We performed XGBoost experiments on prime learning for N–bit integers with N = 18 and N = 24. Each integer was represented as a binary string of length N with leading zeroes where appropriate. The code used for our experiments is available on GitHub [6], and the computations may be reproduced on a GPU–equipped laptop, e.g. MacBook Pro M1 with 8 GB memory.

For N = 18, there are 23′000 primes and 239′144 composites out of 262′144 numbers in total. We have the following probability confusion matrix:

\begin{array}{l|cc}
 & \text{Predicted composite} & \text{Predicted prime} \\
\hline
\text{Actual composite} & 0.638886 & 0.273278 \\
\text{Actual prime} & 0.058209 & 0.029628 \\
\end{array}

For N = 24, there are 1′077′871 primes and 15′699′345 composites out of 16′777′216 numbers in total. The probability confusion matrix turns out to be:

\begin{array}{l|cc}
 & \text{Predicted composite} & \text{Predicted prime} \\
\hline
\text{Actual composite} & 0.635707 & 0.300054 \\
\text{Actual prime} & 0.041912 & 0.022326 \\
\end{array}

In both cases we used Bayesian hyper–parameter optimization with HyperOpt and an 80% / 20% train/test split. This achieves a true positive rate that is better (by an order of magnitude) than in [10], but still insignificant. In fact, as follows from the above confusion matrices, the true positive rate declines as N grows. All of this is expected given our theoretical analysis, and our experiments corroborate the theoretical conclusion that “machine learning” primes turns out to be no better than guessing a random number.
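For reference, here is a simplified sketch of the kind of pipeline described above: binary-digit features, an XGBoost classifier, and an 80% / 20% split. The hyper-parameters below are illustrative defaults rather than the HyperOpt-tuned values; the actual code is in the repository [6].

```python
import numpy as np
from sympy import isprime
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier

bits = 18
numbers = np.arange(1, 2 ** bits)
# Each integer is represented by its N binary digits (leading zeroes included).
X = ((numbers[:, None] >> np.arange(bits)) & 1).astype(np.uint8)
y = np.array([isprime(int(n)) for n in numbers], dtype=np.uint8)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = XGBClassifier(n_estimators=200, max_depth=6)
model.fit(X_tr, y_tr)

# Confusion matrix normalised over all test samples, as in the tables above.
print(confusion_matrix(y_te, model.predict(X_te), normalize="all"))
```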

As shown in [11], recent experiments with Neural Network classifiers for semi–primes largely fail to infer the semi–prime distribution. More precisely, for N = 426, detecting semi–primes with N–bit factors yields a false negative rate on par with the true positive rate (both on the order of ∼0.25). It is worth mentioning that the false positive rate (∼0.05) is relatively small. However, the problem considered in [11] is substantially different from ours.

Acknowledgments

We would like to thank Anders Södergren, Ioannis Kontoyiannis, Hector Zenil, Steve Brunton, Marcus Hutter, Cristian Calude, Igor Rivin, and the anonymous reviewers for their constructive feedback on the earlier versions of this manuscript.

Data Availability

Data are available at: https://github.com/sashakolpakov/xgb-primes.

Funding Statement

This study was supported by the Swiss National Science Foundation project PP00P2–202667 awarded to A.K. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Grünwald P., Vitányi P. Algorithmic Information Theory. arXiv:0809.2754; 2008.
2. Li M., Vitányi P. An Introduction to Kolmogorov Complexity and Its Applications. Springer; 2019.
3. Kontoyiannis I. Some Information-Theoretic Computations Related to the Distribution of Prime Numbers. In: Grunwald P, Myllymaki P, Tabus I, Weinberger M, Yu B, editors. Festschrift in Honor of Jorma Rissanen. Tampere University Press; 2008. p. 135–143.
4. Chebychev P. L. Sur la totalité des nombres premiers inférieurs à une limite donnée. J de Math Pures Appl. 1852; 17:341–365.
5. Cover T. M., Thomas J. A. Elements of Information Theory. 2nd ed. Wiley; 2006.
6. Kolpakov A., Rocke A. XGBoost Prime Numbers. Available from: https://github.com/sashakolpakov/xgb-primes/
7. Rényi A., Turán P. On a Theorem of Erdős–Kac. Acta Arithmetica. 1958; 4(1):71–84. doi: 10.4064/aa-4-1-71-84
8. Billingsley P. Prime numbers and Brownian motion. American Mathematical Monthly. 1973; 80:1099. doi: 10.1080/00029890.1973.11993463
9. Wolfram Research. Benford’s Law. Available from: https://mathworld.wolfram.com/BenfordsLaw.html.
10. He Y.–H. Deep–Learning the Landscape. arXiv:1706.02714; 2017.
11. Blake S. Integer Factorisation, Fermat & Machine Learning on a Classical Computer. arXiv:2308.12290; 2023.
