Entropy. 2022 Oct 14;24(10):1469. doi: 10.3390/e24101469

A Formal Framework for Knowledge Acquisition: Going beyond Machine Learning

Ola Hössjer 1, Daniel Andrés Díaz-Pachón 2,*, J Sunil Rao 2
Editor: Deniz Gençağa
PMCID: PMC9601974  PMID: 37420489

Abstract

Philosophers frequently define knowledge as justified, true belief. We built a mathematical framework that makes it possible to define learning (an increasing number of true beliefs) and knowledge of an agent in precise ways, by phrasing belief in terms of epistemic probabilities, defined from Bayes’ rule. The degree of true belief is quantified by means of active information I+: a comparison between the degree of belief of the agent and that of a completely ignorant person. Learning has occurred when either the agent’s strength of belief in a true proposition has increased in comparison with the ignorant person (I+ > 0), or the strength of belief in a false proposition has decreased (I+ < 0). Knowledge additionally requires that learning occurs for the right reason, and in this context we introduce a framework of parallel worlds that correspond to parameters of a statistical model. This makes it possible to interpret learning as a hypothesis test for such a model, whereas knowledge acquisition additionally requires estimation of a true world parameter. Our framework of learning and knowledge acquisition is a hybrid between frequentism and Bayesianism. It can be generalized to a sequential setting, where information and data are updated over time. The theory is illustrated using examples of coin tossing, historical and future events, replication of studies, and causal inference. It can also be used to pinpoint shortcomings of machine learning, where the focus is typically on learning rather than knowledge acquisition.

Keywords: active information; Bayes’ rule; counterfactuals; epistemic probability; learning; justified true belief; knowledge acquisition; replication studies

1. Introduction

1.1. The Present Article

The process by which cognitive agents acquire knowledge is complicated, and has been studied from different perspectives within educational science, psychology, neuroscience, cognitive science, and social science [1]. Philosophers usually distinguish between three types of knowledge [2]: acquaintance knowledge (to get to know other persons), knowledge how (to learn certain skills), and knowledge that (to learn about propositions or facts). Mathematically, acquaintance knowledge has been studied via trees and networks, for instance, in small-world-type models and rumor-spreading models [3,4,5]. Knowledge how has been widely developed in education and psychology, since the middle of the twentieth century, by means of testing and psychometry, using classical statistics [6,7,8].

The purpose of this paper is to formulate knowledge that in mathematical terms. Our starting point is to define knowledge that as justified true belief (JTB), which generally is agreed to constitute at least a sufficient condition for such knowledge [9,10]. The primary tools will be the concepts of truth, probabilities, and information theory. Probabilities, in addition to logic, are used to formulate mechanisms of reasoning in order to define beliefs [11,12]. More specifically, a Bayesian approach with subjective probabilities will be used to quantify rational agents’ degrees of beliefs in a proposition. These subjective probabilities may vary between agents, but since each agent is assumed to be rational, its probabilities satisfy basic axioms of probability [13]. This is also referred to as the personalistic view of probabilities in [14].

The degree of belief in a proposition is associated with some type of randomness or uncertainty regarding the truth of the proposition. It is helpful in this context to distinguish between ontological randomness (genuine randomness regarding the truth of the proposition) and epistemic randomness (incomplete knowledge about propositions that are either true or false). Here the focus will be on epistemic randomness, and following [15], subjective probabilities are referred to as epistemic probabilities. The epistemic randomness assumption that each proposition has a fixed truth value can be viewed as a frequentist component of our framework.

To use epistemic probabilities in a wider context of knowledge that (subsequently simply referred to as knowledge), we incorporate degrees of beliefs within a framework of parallel worlds in order to define more clearly what JTB means. These parallel worlds correspond to parameters of a statistical model and a second frequentist notion of one parameter being true, whereas the others are counterfactuals [16]. An agent’s maximal possible discernment between worlds is described in terms of the σ-algebra G. The agent’s degrees of belief are obtained through Bayes’ rule from prior belief and data [17], in such a way that it is not possible to discern between worlds beyond the limits set by G.

Learning is associated with increased degrees of true belief, although these beliefs need not necessarily be justified. More specifically, the agent’s degree of belief in a proposition is compared to that of an ignorant person. This corresponds to a hypothesis test within a frequentist framework, where the null hypothesis that the proposition is true is tested against the alternative hypothesis that the proposition is false. As a test statistic, we use active information I+ [18,19,20], which quantifies how much the agent has learned about the truth value of the proposition compared to an ignorant person. In particular, learning has occurred when the agent’s degree of belief in a true proposition is larger than that of an ignorant person (I+ > 0), or if the agent’s degree of belief in a false proposition is less than that of an ignorant person (I+ < 0). In either case, G sets a limit in terms of the maximal amount of possible learning. Learning is, however, not sufficient for knowledge acquisition, since the latter concept also requires that the true belief is justified, or has been formed for the right reason. Knowledge acquisition is defined as a learning process where the agent’s degree of belief in the true world is increased, corresponding to a more accurate estimate of the true world parameter. Thus, knowledge acquisition goes beyond learning in that it also deals with the "justified" part of the JTB condition. It is related to consistency of a posterior distribution, a notion that is meaningful only within our hybrid frequentist/Bayesian approach.

To the best of our knowledge, the hybrid frequentist/Bayesian approach has only been used in the context of Bayesian asymptotic theory (Section 7.2), but not as a general tool for modeling the distinction between learning and knowledge acquisition. Although the concept of a true world (or the true state of affairs) is used in the context of Bayesian decision theory and its extensions, such as robust Bayesian inference and belief functions based on the Dempster–Shafer theory [21,22,23,24], the goal is then to maximize an expected utility (or to minimize an expected cost) of the agent that makes the decision. In our context, the Bayesian approach is only used to formulate beliefs as posterior distributions, whereas the criteria for learning (probabilities of rejecting a false or true proposition) and knowledge acquisition (consistency) are frequentist. Given that a model with one unique, true world is correct, the frequentist error probability and consistency criteria are objective, since they depend on the true world. No such criteria exist within a purely Bayesian framework.

Illustration 1.

In order to illustrate our approach for modeling learning and knowledge acquisition, we present an example that will be revisited several times later on. A teacher (the agent) wants to evaluate whether a child has learned addition. The teacher gives the student a home assignment test with two-choice answers, one right and one wrong, to measure the proposition S: “The child is expected to score well on the test." In this case, we have a set X = {x1, x2, x3} of three possible worlds. An ignorant person who does not ask for help is expected to have half her questions right and half her questions wrong (x1). A child who knows addition is expected to get a large fraction of the answers right (x2). However, there is also a third alternative, where an ignorant student asks for help and is expected to have a high score for that reason (x3). Notice in particular that S is true only for the two worlds of the set A = {x2, x3}. If the child answers substantially more questions right than wrong, the active information will be positive and the teacher learns S. However, this learning that S is true does not represent knowledge of whether the student knows how to add, since the teacher is not able to distinguish x2 from x3. Now, let us say that the test has only two questions. In this setting, an ignorant person is expected to have one question right and one wrong. However, it is also quite probable that, even if the child does not know his sums well, he answers both questions correctly. In this case, the teacher has not learned substantially about S (nor attained knowledge of whether the student knows how to add). The reason is that, since the test has only two questions, the teacher cannot exclude any of x1, x2, and x3. The more questions the test has (provided the student scores well), the more certain the teacher is that either x2 or x3 is true, that is, the more he learns about S. If the student is also monitored during the exam, alternative x3 is excluded and the teacher knows that x2 is true; that is, the teacher not only learns about S, but also acquires knowledge that the student knows how to add.

Each of the following sections contains remarks and illustrations like the previous one. At the end of the paper, a whole section with multiple examples will explore in more depth how the model works in practice.

1.2. Related Work

Other contributions have been made to developing a mathematical framework for learning and knowledge acquisition. Hopkins [25] studied the theoretical properties of two different models of learning in games, namely, reinforcement learning and stochastic fictitious play. He developed an equivalence relation between the two under a variety of different scenarios with increasing degrees of structure. Stoica and Strack [26] introduced a stochastic model for acquired knowledge and showed that empirical data fit the estimated outcomes of the model well, using data from student performance in university-level classes. Taylor [27] proposed a model using the notion of concept lattices and the mathematical theory of closure spaces to describe knowledge acquisition and organization. However, none of these works has been developed through basic concepts in probability and information theory the way we do here. Our approach permits important generalizations which cover a wide range of real-life scenarios.

2. Possible Worlds, Propositions, and Discernment

Consider a collection X of possible worlds, of which x0 ∈ X is the true world, and all other worlds x ∈ X \ {x0} are counterfactuals. We will regard x as a statistical parameter, and the fact that this parameter has a true but unknown value x0 corresponds to a frequentist assumption. The set X is the parameter space of interest, and it is assumed to be either finite or a bounded and open subset of Euclidean space R^q of dimension q. Let S be a proposition (or statement), and impose a second frequentist assumption that S is either true or false, although the truth value of S may depend on the world x ∈ X. Define a binary-valued truth function f: X → {0,1} by f(x) = 1 or 0, depending on whether S is true or not in world x. The set A = {x ∈ X; f(x) = 1} consists of all worlds for which S is a true proposition. Although there is a one-to-one correspondence between f and A, in the sequel it will be convenient to use both notions. The simplest truth scenario of S is one for which the truth value of S is unique for the true world, i.e.,

A0 = {x0},        if f(x0) = 1,
     X \ {x0},    if f(x0) = 0. (1)

x0 being unique and f being binary-valued together correspond to a framework of epistemic randomness, where the actual truth value f(x0) of S is either 0 or 1. S is referred to as falsifiable [28] if it is logically possible (in principle) to find a data set D implying that the truth value of S is 0, or equivalently, that none of the worlds in A is true. It is possible though to falsify S without knowing x0.

3. Probabilities

3.1. Degrees of Beliefs and Sigma Algebras

Let (X, F) be a measurable space. When X is finite, F consists of all subsets of X (i.e., F = 2^X); otherwise, F is the class of Borel sets. The Bayesian part of our approach is to quantify an agent’s belief about which world is true by means of an epistemic probability measure P on the measurable space (X, F), whereas the beliefs of an ignorant person follow another probability measure P0. It is often assumed that

P0(B) = |B| / |X|,  B ∈ F, (2)

is the uniform probability measure that maximizes entropy among all probability measures on (X,F), where | · | refers to the cardinality for finite X and to the Lebesgue measure for continuous X. Then, (2) corresponds to a maximal amount of ignorance about which possible world is true [29]. Sometimes (as in Example 5 below) some general background knowledge is assumed also for the ignorant person, so that P0 differs from (2).

The agent’s and the ignorant person’s strength of belief in S are quantified by P(A) and P0(A), respectively. Following [15], it is helpful to interpret P(A) and P0(A) as the agent’s and the ignorant person’s predictions of the physical probability f(x0) ∈ {0,1} of S. Whereas P and P0 involve epistemic uncertainty, the physical probability is an indicator for the real (physical) event that S is true or not.

When an agent’s belief P is formed, it is assumed that any information accessible to him, beyond that of the ignorant person, belongs to a sub-σ-algebra G ⊆ F. This means that the agent is no better than the ignorant person at discerning events whose discernment requires considering events in F that do not belong to G. Mathematically, this corresponds to a requirement

EP[g | G′] = EP0[g | G′], (3)

for all F-measurable functions g: X → R, and all sigma algebras G′ such that G ⊆ G′ ⊆ F. It is assumed, on the left-hand side of (3), that g is a random variable defined on the probability space (X, F, P), whereas g is defined on the probability space (X, F, P0) on the right-hand side of (3). It follows from (3) that G sets the limit in terms of the agent’s possibility to form propositions about which world is true. Therefore, G is referred to as the agent’s maximal possible discernment about which world is true. It follows from (3) that

P(A) = EP[f] = EP{EP[f | G]} = EP{EP0[f | G]}. (4)

The minimal amount of discernment corresponds to the trivial σ-algebra G0 = {∅, X}. Whenever (3) holds with G = G0, necessarily P = P0. This corresponds to removing the outer expectation on the right-hand side of (4), so that

P0(A) = EP0[f] = EP0[f | G0]. (5)

Remark 1.

Suppose there exists an oracle or omniscient agent O that is able to discern between all possible worlds and also knows x0. Mathematically, the discernment requirement means that O has knowledge about all sets in a σ-algebra F that corresponds to a maximal amount of discernment between possible worlds. We will assume that f is measurable with respect to F, so that A is measurable (i.e., A ∈ F). Knowledge of F is, however, not sufficient for knowing A, since A may involve x0, as in (1). By this we mean that if the agent knows F, and if A involves x0, then there are several candidates of A for the agent, and he does not know a priori which one of these candidates is the actual A. However, since O knows F and x0, he also knows A. It follows that O knows that S is true for all worlds in (the actual) A, and that S is false for all worlds outside of (the actual) A. That is, the oracle knows for which possible worlds the proposition S is true.

As mentioned in Remark 1, the truth function f is measurable with respect to the maximal σ-algebra F. However, depending on how G is constructed, and whether A involves x0 or not, the set A may or may not be known to the agent. Therefore, when A involves x0, the agent may not be able to compute P0(A) and P(A) himself. Although he is able to compute P0(B) and P(B) for all B ∈ F, since he does not know x0, he does not know which of these sets B equals A. Therefore, he does not know P(A) and P0(A), unless P(B) = P(A) and P0(B) = P0(A), respectively, for all B that are among the agent’s candidates for the set A. For instance, suppose X = {1, 2, 3}, P(1) = 1/5, P(2) = P(3) = 2/5, and A = {3}. If the agent’s candidates for A are {1}, {2}, and {3}, then the agent does not know P(A). On the other hand, if the agent’s candidates for A are {2} and {3}, then he knows P(A), although he does not know A.

As will be seen from the examples of Section 8, it is often helpful (but not necessary) to construct G as the σ-algebra that is generated by a random variable Y whose domain is X (i.e., G = σ(Y)). This means that Y determines the collection G of subsets of X for which the agent is free to form beliefs beyond that of the ignorant person. Typically, Y highlights the way in which information is lost by going from F to G. For instance, suppose X = [0, ∞) and Y: [0, ∞) → {0, 1, 2, …} is defined by Y(x) = [x/δ] for some δ > 0; then, G = σ({[0, δ), [δ, 2δ), …}) is the sigma-algebra obtained from a quantization procedure with accuracy δ.
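As a small numerical sketch of this quantization construction (with our own toy numbers, the domain truncated to [0, 1) and four bins), the snippet below illustrates the content of condition (3): when the likelihood is G-measurable, i.e., constant on each bin [iδ, (i+1)δ), the posterior reweights whole bins but leaves the within-bin shape of the prior unchanged, so no discernment beyond G is gained.

```python
import numpy as np

delta = 0.25
grid = np.linspace(0.0, 1.0, 400, endpoint=False)   # fine grid on [0, 1)
p0 = np.ones_like(grid)                              # uniform prior density P0

bin_of = np.floor(grid / delta).astype(int)          # Y(x) = [x / delta]
L_per_bin = np.array([0.1, 0.5, 1.0, 0.4])           # G-measurable likelihood: one value per bin
L = L_per_bin[bin_of]

post = p0 * L
post /= post.mean()                                  # normalize so the density integrates to 1

# Within each bin the posterior/prior ratio is constant: the agent discerns bins,
# but not points inside a bin, exactly as condition (3) requires.
for i in range(4):
    ratio = post[bin_of == i] / p0[bin_of == i]
    print(f"bin {i}: ratio constant within bin -> {np.allclose(ratio, ratio[0])}")
```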

3.2. Bayes’ Rule and Posterior Probabilities

A Bayesian approach will be used to define the agent’s degree of belief P. To this end, we regard x ∈ X as a parameter of a statistical model and assume that the agent has access to data d ∈ D. The agent assumes that (x, d) is an observation of a random variable (X, D): Ω → X × D defined on some sample space Ω. The joint distribution of the parameter X and data D, according to the agent’s beliefs, is dQ(x, D) = dP0(x) L(D | x) dD. This is a probability measure on subsets of X × D, with prior distribution P0 of the parameter X, and with a conditional distribution L(D | x) = dQ(D | x)/dD that corresponds to the likelihood of data D. A posterior probability

P(A) = Q(A | D) = ∫_A dQ(x | D) (6)

of A is formed by updating the prior distribution P0 based on data D. It is assumed that the likelihood x ↦ L(D | x) is measurable with respect to G, so that data conform with the agent’s maximal possible discernment between possible worlds. The likelihood function x ↦ L(D | x) includes the agent’s interpretation of D. Although this interpretation may involve a subjective part, it is still assumed that the agent is not willing to speculate about possible worlds beyond the limits set by G. That is, whenever the agent discerns events beyond the limits set by G, this discernment is the same as for an ignorant person.

Remark 2.

To account for the possibility that the agent still speculates beyond the limits set by external data, G = σ(Gext, Gint) could be defined as the smallest σ-algebra containing the σ-algebras Gext and Gint that originate from external data Dext and internal data Dint (the agent’s internal experiences, such as dreams and revelations), respectively. Note, however, that x ↦ L(D | x) is subjective, even when internal data are absent, since agents might interpret external data in different ways, due to the way in which they perceive such data and incorporate previous life experience.

From Bayes’ rule we find that the posterior distribution satisfies

P(A) = Q(A | D) = ∫_A dQ(x | D) = L(D | A) P0(A) / L(D) = ∫_A L(D | x) dP0(x) / ∫_X L(D | x) dP0(x). (7)

A couple of additional reasons reinforce the subjectivity of P: the prior P0 might be subjective, and acquisition of data D might vary between agents [30]. Additionally, acquisition of data D will not necessarily make P more concentrated around the true world x0, since it is possible that the data themselves are biased or that the agent interprets the data in a sub-optimal way.

Since the likelihood function is measurable with respect to G, it follows from (4) that the agent’s belief P, after having observed D, does not lead to a different discernment between possible worlds beyond G than for an ignorant person. Given G, together with an unlimited amount of unbiased data that the agent interprets correctly, the G-optimal choice of P is

P(B) = 1(x0 ∈ B),  B ∈ G. (8)

Equations (4) and (8) uniquely define the G-optimal choice of P. Whenever G ⊂ F is a proper subset of the maximal σ-algebra F, the measure P in (8) is not the same thing as a point mass δx0 at x0. On the other hand, for an oracle with a maximal amount of knowledge about which world is true, G = F, and (8) reduces to a point mass at the true world—i.e.,

P = δx0 ⟺ P(B) = 1(x0 ∈ B),  B ∈ F. (9)

Remark 3.

An extreme example of biased beliefs is a true-world-excluding probability measure P, with support that does not include x0:

supp(P) ⊆ X \ {x0}. (10)

Another example is a correct-proposition-excluding probability measure P, with support that excludes all worlds x with a correct value f(x) = f(x0) of S:

supp(P) ⊆ A^c = X \ A,   if x0 ∈ A,
          A,             if x0 ∉ A. (11)

Illustration 2

(Continuation of Illustration 1). Suppose data D ∈ D = {0, 1, …, 10} are available to the teacher (the agent) in terms of the number of correct answers of a home assignment test with 10 questions. The prior P0(xi) = 1/3 is uniform on X = {x1, x2, x3}, whereas data D | xi ∼ Bin(10, πi) have a binomial distribution with probabilities π1 = 0.5 and π2 = π3 = 0.8 of answering each question correctly, for a student that either guesses or has math skills/asks for help. Let d be the observed value of D. Since data have the same likelihood (L(d | x2) = L(d | x3)) for a student who scores well, regardless of whether he knows how to add or gets help, it is clear that the posterior distribution

P(xi) = L(d | xi) / ∑_{j=1}^{3} L(d | xj) = (10 choose d) πi^d (1 − πi)^{10−d} / ∑_{j=1}^{3} (10 choose d) πj^d (1 − πj)^{10−d}

satisfies P(x2)=P(x3). Since the teacher cannot distinguish x2 from x3, his sigma-algebra

G = {∅, {x1}, {x2, x3}, X} (12)

has only four elements, whereas the full sigma-algebra F = 2^X consists of all eight subsets of X. Note that Equation (3) stipulates that the teacher cannot discern between the elements of X, beyond the limits set by G, better than the ignorant person. In order to verify (3), since there is no sigma-algebra between G and F, we only need to check this equation for G′ = G. To this end, let g: X → R be a real-valued function. Then, since P(x2) = P(x3), it follows that

EP(g | G)(xi) = EP0(g | G)(xi) = g(x1),               i = 1,
                                 (g(x2) + g(x3))/2,   i = 2, 3,

in agreement with (3).
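As a concrete check of these computations, the following Python sketch (our own code, with d = 8 correct answers as an assumed observation) reproduces the posterior of Illustration 2 and verifies the discernment condition (3) on the block {x2, x3} of the σ-algebra (12).

```python
from math import comb

# Worlds of Illustrations 1-2: x1 = guesses, x2 = knows addition, x3 = gets help
pi = {"x1": 0.5, "x2": 0.8, "x3": 0.8}      # per-question success probabilities
prior = {x: 1/3 for x in pi}                # uniform prior P0

def posterior(d, n=10):
    """Posterior P(x_i) from Bayes' rule (7), given d correct answers out of n."""
    L = {x: comb(n, d) * p**d * (1 - p)**(n - d) for x, p in pi.items()}
    Z = sum(L[x] * prior[x] for x in pi)
    return {x: L[x] * prior[x] / Z for x in pi}

P = posterior(d=8)
print(P)                                    # P(x2) == P(x3), as claimed

# Check condition (3) on the block {x2, x3} of G: conditional expectations of an
# arbitrary g agree under P and P0, since P(x2) = P(x3).
g = {"x1": 2.0, "x2": -1.0, "x3": 5.0}
block = ["x2", "x3"]
cond_P  = sum(g[x] * P[x] for x in block) / sum(P[x] for x in block)
cond_P0 = sum(g[x] * prior[x] for x in block) / sum(prior[x] for x in block)
print(cond_P, cond_P0)                      # both equal (g(x2) + g(x3)) / 2 = 2.0
```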

Illustration 3.

During the Russo-Japanese war, the czar Nicholas II was convinced that Russia would easily defeat Japan [31]. His own biases (he considered the Japanese weak and the Russians superior) and the partial information he received from his advisors blinded him to reality. In the end, Russian forces were heavily beaten by the Japanese. In this scenario, the proposition S is “Russia will beat Japan", X consists of all possible future scenarios, and f(x) = 1 for those scenarios x ∈ X in which Russia would win the war. As history reveals, f(x0) = 0. The information he received from his advisors was D, and we know it was heavily biased. Nicholas II adopted (very subjectively!) a correct-proposition-excluding probability measure, as in (11), because he did not even consider the possibility of Russia being defeated. The main reason was a dramatically poor assessment of the likelihood L(D | x), for x ∈ X, on top of a prior P0 that had a low probability for scenarios x ∈ A^c. Nicholas II’s verdict was P(A) ≈ 1.

3.3. Expected Posterior Beliefs

Since D is random, so is P. For this reason, the expected posterior distribution

P¯(B) = Ex0[Q(B | D)] = Ex0[P(B)],  B ∈ F, (13)

will be used occasionally, with an expectation corresponding to averaging over all possible data sets D according to its distribution L(·|x0) in the true world. Consequently, P¯(A) represents the agent’s expected belief in S in the true world x0. Note in particular that in contrast to the posterior P, the expected posterior P¯ is not a purely Bayesian notion, since it depends on x0.

4. Learning

4.1. Active Information for Quantifying the Amount of Learning

The active information (AIN) of an event B is

I+(B) = log [P(B) / P0(B)]. (14)

In particular, I+(A) quantifies how much an agent has learned about whether S is true or not compared to an ignorant person. By inserting (7) into (14), we find that the AIN

I+(A) = log [L(D | A) / L(D)] = log [∫ L(D | x) dP0(x | A) / ∫ L(D | x) dP0(x)] (15)

is the logarithm of the ratio between how likely it is to observe data when S holds, and how likely data are when no assumption regarding S is made (see also [32]). The corresponding AIN for expected degrees of beliefs is

I¯+(A) = log [P¯(A) / P0(A)]. (16)

Definition 1

(Learning). Learning about S has occurred (conditionally on observed D) if the probability measure P either satisfies I+(A) > 0 when x0 ∈ A or I+(A) < 0 when x0 ∉ A. In particular, full learning corresponds to I+(A) = −log P0(A) when x0 ∈ A and I+(A) = −∞ when x0 ∉ A. Learning is expected to occur if the probability measure P¯ is such that I¯+(A) > 0 when x0 ∈ A or I¯+(A) < 0 when x0 ∉ A. In particular, full learning is expected to occur if I¯+(A) = −log P0(A) when x0 ∈ A or I¯+(A) = −∞ when x0 ∉ A.
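As a minimal sketch of Definition 1 (continuing the three-world example, with the true world set by hand—an assumption of ours rather than something the agent knows), the active information (14) can be computed and the learning condition checked as follows.

```python
from math import comb, log

pi = {"x1": 0.5, "x2": 0.8, "x3": 0.8}
prior = {x: 1/3 for x in pi}
A = {"x2", "x3"}                  # worlds in which S ("scores well") is true

def posterior(d, n=10):
    L = {x: comb(n, d) * p**d * (1 - p)**(n - d) for x, p in pi.items()}
    Z = sum(L[x] * prior[x] for x in pi)
    return {x: L[x] * prior[x] / Z for x in pi}

def active_information(d, n=10):
    """I+(A) = log [P(A) / P0(A)], Equation (14)."""
    P = posterior(d, n)
    return log(sum(P[x] for x in A) / sum(prior[x] for x in A))

x0 = "x2"                         # hypothetical true world (unknown to the agent)
I_plus = active_information(d=8)
learned = (I_plus > 0) if x0 in A else (I_plus < 0)
print(f"I+(A) = {I_plus:.3f}, learning has occurred: {learned}")
```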

Remark 4.

Two extreme scenarios for the active information, when x0 ∈ A, are

I+(A) = −log P0(A),  if (8) holds and A ∈ G,
        −∞,          if (11) holds. (17)

According to Definition 1, the upper part of (17) represents full learning—that is, P(A) = 1; whereas the lower part corresponds to a maximal amount of false belief about S when x0 ∈ A—that is, P(A) = 0.

Remark 5.

Suppose S is a proposition that a certain entity or machine functions; then, −log P0(A) is the functional information associated with the event A of observing such a functioning entity [33,34,35]. In our context, functional information corresponds to the maximal amount of learning about S when the machine works (f(x0) = 1).

4.2. Learning as Hypothesis Testing

It is possible to view the AIN in (15) as a test statistic for choosing between the two statistical hypotheses

H0: S is true ⟺ x0 ∈ A,    H1: S is false ⟺ x0 ∉ A, (18)

with the null hypothesis H0 being rejected (conditionally on observed D) when

I+(A) ≤ I (19)

for some threshold I [36,37,38]. Typically, this threshold represents a lower bound of what is considered to be a significant amount of learning when S is true. Note in particular that the framework of the hypothesis test, (18) and (19), is frequentist, although we use Bayesian tools (the prior and posterior distributions) to define the test statistic.

In order to introduce performance measures of how much the agent has learnt, let Prx0 refer to probabilities when data D ∼ L(· | x0) are generated according to what one expects in the true world. The type I and II errors of the test (18) and (19) are then defined as

α(x0) = Prx0[I+(A) ≤ I],  x0 ∈ A,
β(x0) = Prx0[I+(A) > I],  x0 ∉ A, (20)

respectively. Both these error probabilities are functions of x0, and they quantify how much the agent has learnt about the truth (cf. Figure 1 for an illustration).

Figure 1.


Illustration of the density function y ↦ fP(A)(y) of P(A) when the data set D ∼ L(· | x0) varies according to the likelihood of the true world parameter, for two scenarios where S is either false (a) or true (b). The threshold of the hypothesis test (19) is I = log[p/P0(A)], so that H0 is rejected when P(A) ≤ p = 0.5. Note that P¯(A) is the expected value of each density, whereas the error probabilities of type I and II correspond to the areas under the curves in (b) and (a) to the left and right of p, respectively.
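The error probabilities (20) can also be estimated by simulation. The sketch below (our own numerical choices, with threshold I = log[p/P0(A)] and p = 0.5 as in Figure 1) approximates α for the true world x2 and β for the true world x1 in the running three-world example.

```python
import numpy as np
from math import comb, log

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.8, 0.8])            # success probability in worlds x1, x2, x3
prior = np.array([1/3, 1/3, 1/3])
A = np.array([False, True, True])         # S is true in x2 and x3
n = 10
p_thr = 0.5
I_thr = log(p_thr / prior[A].sum())       # threshold of the test (19)

def ain(d):
    """Active information I+(A) for d correct answers out of n."""
    L = np.array([comb(n, d) * q**d * (1 - q)**(n - d) for q in pi])
    post = L * prior / (L * prior).sum()
    return log(post[A].sum() / prior[A].sum())

def reject_rate(world, reps=20000):
    """Fraction of simulated data sets for which H0 is rejected (I+ <= I)."""
    d = rng.binomial(n, pi[world], size=reps)
    return np.array([ain(di) <= I_thr for di in d]).mean()

alpha = reject_rate(world=1)        # type I error when x0 = x2 (S true)
beta = 1 - reject_rate(world=0)     # type II error when x0 = x1 (S false)
print(f"alpha(x2) ~ {alpha:.3f}, beta(x1) ~ {beta:.3f}")
```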

4.3. The Bayesian Approach to Learning

Within a Bayesian framework, we think of H0 and H1 as two different models, A and Ac, that represent a subdivision of the parameter space into two disjoint subsets. The posterior odds

PostOdds = [1 − P(A)] / P(A) = {[1 − P0(A)] / P0(A)} · [L(D | Ac) / L(D | A)] = {[1 − P0(A)] / P0(A)} · BF (21)

factor into a product of the prior odds and the Bayes factor. Hypothesis H1 is chosen whenever

PostOdds ≥ r, (22)

for some threshold r. If the cost of drawing a parameter X ∼ P from A (Ac) is C0 (C1) when H1 (H0) is chosen, the optimal Bayesian decision rule corresponds to r = C0/C1. A little algebra reveals that the AIN is a monotone decreasing function

I+ = −log[P0(A)(1 + PostOdds)]

of the posterior odds. From this, it follows that the frequentist test (19), with the AIN as test statistic, is equivalent to the Bayesian test (22), whenever I = −log[P0(A)(1 + r)]. However, the interpretations of the two tests differ. Whereas the aim of the Bayesian decision rule is to minimize an expected cost (or maximize an expected reward/utility), the aim of the frequentist test is to keep the error probabilities of type I and II low.
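For completeness, the algebra behind this equivalence is short and follows directly from definitions (14) and (21):

PostOdds = [1 − P(A)] / P(A)   ⟹   P(A) = 1 / (1 + PostOdds),

so that

I+ = log [P(A) / P0(A)] = log {1 / [P0(A)(1 + PostOdds)]} = −log [P0(A)(1 + PostOdds)],

and the rejection regions {I+ ≤ I} and {PostOdds ≥ r} coincide exactly when I = −log[P0(A)(1 + r)].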

4.4. Test Statistic When x0 Is Unknown

Recall that the agent may or may not know the set A. In the latter case, the agent cannot determine the value of the test statistic I+(A), and hence he cannot test between H0 and H1 himself. This happens, for instance, for the truth function (1), with A={x0}, since the AIN I+(A)=log[p(x0)/p0(x0)] then involves the unknown x0, with p(x)dx=dP(x) and p0(x)dx=dP0(x). Although I+(A) is not known for this particular choice of A, the agent may still use the posterior distribution (7) in order to compute the expected value (conditionally on observed D)

EQ[I+({X}) | D] = EQ[log(p(X)/p0(X)) | D] = EP[log(p(X)/p0(X))] = ∫_X log(p(x)/p0(x)) p(x) dx = DKL(P‖P0) = H(P, P0) − H(P) (23)

of the test statistic according to his posterior beliefs. Note that (23) equals the Kullback–Leibler divergence DKL(P‖P0) between P and P0, or the difference between the cross entropy H(P, P0) between P and P0, and the entropy H(P) of P. If we also take randomness of the data set D into account, and make use of (7), it follows that the expected AIN, for the same choice of A, equals the mutual information

EQ[I+({X})] = EQ{EQ[log(p(X)/p0(X)) | D]} = ∫ log[L(d | x)/L(d)] dQ(x, d) = ∫ log[q(x, d)/(p0(x)L(d))] dQ(x, d), (24)

between X and D, when (X, D) ∼ Q vary jointly according to the agent’s Bayesian expectations, and with q(x, d) = dQ(x, d)/d(x, d).
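A numeric sketch of (23), again for the three-world posterior of Illustration 2 (our own numbers): the expected AIN of the singleton proposition equals the Kullback–Leibler divergence between posterior and prior, which in turn equals cross entropy minus entropy.

```python
from math import comb, log

pi = {"x1": 0.5, "x2": 0.8, "x3": 0.8}
prior = {x: 1/3 for x in pi}

def posterior(d, n=10):
    L = {x: comb(n, d) * p**d * (1 - p)**(n - d) for x, p in pi.items()}
    Z = sum(L[x] * prior[x] for x in pi)
    return {x: L[x] * prior[x] / Z for x in pi}

P = posterior(d=8)

# Expected AIN (23): D_KL(P || P0) = sum_x P(x) log(P(x)/P0(x))
kl = sum(P[x] * log(P[x] / prior[x]) for x in pi)

# Equivalently, cross entropy minus entropy: H(P, P0) - H(P)
cross_entropy = -sum(P[x] * log(prior[x]) for x in pi)
entropy = -sum(P[x] * log(P[x]) for x in pi)
print(kl, cross_entropy - entropy)   # identical up to rounding
```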

5. Knowledge Acquisition

5.1. Knowledge Acquisition Goes beyond Learning

As mentioned in the introduction, knowledge acquisition goes beyond learning, since it also requires that a true belief in S is justified (see Figure 2 for an illustration).

Figure 2.


Illustration of the difference between learning and knowledge acquisition for a scenario with a set of worlds X = [0,1] and a statement S whose truth function x ↦ f(x) is depicted to the left (a) and right (b). It is assumed that S is true (x0 ∈ A), and that the degrees of belief P0 of an ignorant person correspond to a uniform distribution on X. The filled histograms correspond to the density functions p(x)dx = dP(x) of two agents’ beliefs. The agent to the left (a) has learnt about S but not acquired knowledge, since x0 does not belong to the support of P. The agent to the right has not only learnt about S, but also acquired knowledge, since his belief is justified, corresponding to a distribution P that is more concentrated around the true world x0, compared to the ignorant person. Hence, the JTB condition is satisfied for the agent to the right, but not for the agent to the left.

It is possible, in principle, for an agent whose probability measure P corresponds to a smaller belief in x0 compared to that of the ignorant person, to have a value of I+ anywhere in the range [−∞, −log P0(A)] when S is true (i.e., when x0 ∈ A). One can think of a case in which the agent will believe in S with certainty (P(A) = 1) if supp(P) ⊆ A; but this belief in S is for the wrong reason if, for instance, the agent does not believe in the true world, i.e., if (10) holds, corresponding to the left part of Figure 2. Another less extreme situation occurs when the agent has a higher belief in A compared to the ignorant person but has lost some (but not all) confidence in the true world with respect to that of the ignorant person; in this case, the agent has not acquired new knowledge about the true world compared to the ignorant person, although he still has learned about S and has some knowledge about the true world.

5.2. A Formal Definition of Knowledge Acquisition

Knowledge acquisition is formulated using tools from statistical estimation theory. Loosely speaking, the agent acquires knowledge, based on data D, if the posterior distribution P gets more concentrated around x0, compared to an ignorant person. By this we mean that each closed ball centered at x0 has a probability that is at least as large under P as under P0. Closed balls require, in turn, the concept of a metric or distance; that is, a function d: X × X → [0, ∞). Some examples of metrics are:

  1. If X ⊆ R^q, we use the Euclidean distance d(x1, x2) = √(∑_{i=1}^{q} (x2i − x1i)^2) between x1, x2 ∈ X as metric.

  2. If X = {0,1}^q consists of all binary sequences of length q, then d(x1, x2) = ∑_{i=1}^{q} |x2i − x1i| is the Hamming distance between x1 and x2.

  3. If X is a finite categorical space, we put
    d(x1, x2) = 0,  x1 = x2,
                1,  x1 ≠ x2.

Equipped with a metric on X, knowledge acquisition is now defined:

Definition 2

(Knowledge acquisition and full knowledge). Let Bϵ(x0) = {x ∈ X: d(x, x0) ≤ ϵ} be the closed ball of radius ϵ that is centered at x0 with respect to some metric d. We say that an agent has acquired knowledge about S (conditionally on observed D) if learning has occurred according to Definition 1, and, in order for this learning to be justified, the following two properties are satisfied for all ϵ > 0:

P(Bϵ(x0)) > 0, (25)

and

P(Bϵ(x0)) ≥ P0(Bϵ(x0)) (26)

with strict inequality for at least one ϵ > 0. Full knowledge about S requires that (9) holds; i.e., that the agent with certainty believes that the true world x0 is true. The agent is expected to acquire knowledge about S if learning is expected to occur, according to Definition 1, and if (25) and (26) hold with P¯ instead of P. The agent is expected to acquire full knowledge about S if (9) holds with P¯ instead of P.
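To make Definition 2 concrete, the sketch below checks conditions (25) and (26) numerically in a hypothetical continuous setting: a uniform prior on X = [0, 1], a Beta posterior obtained from binomial data, the Euclidean metric, and a true world x0 and data chosen by us for illustration.

```python
import numpy as np
from scipy.stats import beta

x0 = 0.7                       # hypothetical true world in X = [0, 1]
k, m = 50, 36                  # k tosses, m successes (assumed observed data)
post = beta(1 + m, 1 + k - m)  # posterior under a uniform prior

def P_ball(dist_cdf, eps):
    """Probability of the closed ball B_eps(x0), intersected with [0, 1]."""
    lo, hi = max(0.0, x0 - eps), min(1.0, x0 + eps)
    return dist_cdf(hi) - dist_cdf(lo)

def P0_ball(eps):
    """Uniform (ignorant) probability of the same ball."""
    return min(1.0, x0 + eps) - max(0.0, x0 - eps)

eps_grid = np.linspace(1e-3, 1.0, 200)
ok = all(P_ball(post.cdf, e) >= P0_ball(e) for e in eps_grid)
strict = any(P_ball(post.cdf, e) > P0_ball(e) for e in eps_grid)
print("condition (26) holds on the grid:", ok, "- with strict inequality somewhere:", strict)
print("condition (25), P(B_eps(x0)) > 0 for small eps:", P_ball(post.cdf, 1e-3) > 0)
```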

Several remarks are in order.

Remark 6.

Property (25) ensures that x0 is in the support of P ([39], p. 20). When P0 is the uniform distribution (2), (25) follows from (26). Property (26) is equivalent to I+(Bϵ(x0)) ≥ 0, when P0(Bϵ(x0)) > 0. In this case, the requirement that (26) is satisfied with strict inequality for some ϵ = ϵ* > 0 is equivalent to learning the proposition Sϵ*: "The distance of a world to the true world x0 is less than or equal to ϵ*," corresponding to a truth function

fϵ*(x) = 1(x ∈ Bϵ*(x0)). (27)

Since the agent does not know x0, neither fϵ* nor Aϵ* = {x ∈ X; fϵ*(x) = 1} is known to him, even if he is able to discern between all possible worlds. If fϵ* differs from the original truth function f, learning of Sϵ* can be viewed as meta-learning. Note also that A0 = {x0} corresponds to the set in (1).

Remark 7.

Suppose the truth function used to define learning and knowledge acquisition satisfies (27), i.e., f = fϵ for some ϵ ≥ 0. Then (25) and (26) are sufficient for knowledge acquisition, since they imply that learning of S = Sϵ* has occurred, according to Definition 1. Although knowledge acquisition in general requires more than learning, the two concepts are equivalent for a truth function f = f0, with A = A0 = {x0}, as defined in (1). Indeed, in this case it is not possible to learn whether S = S0 is true or not for the wrong reason.

Remark 8.

Recall from Definition 1 that an agent has fully learnt S when

P(A) = 1(x0 ∈ A) = 1,  x0 ∈ A,
                   0,  x0 ∉ A. (28)

For a rational agent, the lower part of (28) should hold when data D falsify S. In general, (28) is a necessary but not sufficient condition for full knowledge. Indeed, it follows from (9) that, for a person to have full knowledge, P(B) = 1(x0 ∈ B) must hold for all B ∈ F, not only for the set A of worlds for which S is true.

Remark 9.

Suppose a distance measure d(P, Q) between probability distributions on (X, F) is defined. This gives rise to a different definition of knowledge acquisition, whereby the agent acquires knowledge if he has learnt about S and additionally d(P, δx0) < d(P0, δx0), that is, if his beliefs are closer than the ignorant person’s beliefs to the oracle’s beliefs. Possible choices of distances are the Kullback–Leibler divergence d(P, Q) = DKL(Q||P) and the Wasserstein metric d(P, Q) = min_{X1,X2} E|X1 − X2|, where the minimum is taken over all random vectors (X1, X2) whose marginals have distributions P and Q, respectively. Note in particular that the KL choice of distance yields d(Q, δx0) = −log Q(x0). The corresponding notion of knowledge acquisition is weaker than in Definition 2, requiring (25) and (26) to hold only for ϵ = 0.

Illustration 4

(Continuation of Illustration 1). To check whether learning or knowledge acquisition has occurred, according to Definitions 1 and 2, for the student who takes the math home assignment, x0 must be known. The reader may think of an instructor with full information—an F-optimal measure according to (9)—who checks whether a pupil has learned and acquired knowledge or not. However, in Illustration 1 it is the teacher who is the pupil and learns and acquires knowledge about the skills of a math student. In this context, the instructor is a supervisor of the teacher who knows whether the math student is able to add (x0 = x2) or not, and in the latter case whether the student gets help (x0 = x3) or not (x0 = x1). Whereas the instructor’s sigma-algebra is F, the teacher’s sigma-algebra G in (12) does not make it possible to discern between x2 and x3. Suppose x0 = x2. No matter how many questions the home exam has, as long as the teacher does not get information from the instructor on whether the student solved the home exam without help or not, although the teacher learns that S is true, since the student scores well, he will never acquire full knowledge that the student knows how to add, since P(x0) = P(x2) = P(x3) ≤ 0.5 < 1.

6. Learning and Knowledge Acquisition Processes

The previous two sections dealt with learning and knowledge acquisition of a static belief P, corresponding to an agent who is able to discern between worlds according to one sub-σ-algebra G of F, and who has access to one data set D. The setting is now extended to consider an agent who is exposed to an increasing amount of information about (or discernment between) the possible worlds in X, and increasingly larger data sets.

6.1. The Process of Discernment and Data Collection

Mathematically, an increased ability to discern between possible worlds is expressed as a sequence of σ-algebras

G1 ⊆ ⋯ ⊆ Gn ⊆ F. (29)

Typically, Gk is generated by a random variable Yk whose domain is X, for k = 1, …, n. The σ-algebras in (29) are associated with increasingly larger data sets D1, …, Dn, with Dk ∈ Dk. Let dQk(x, Dk) = dP0(x) L(Dk | x) dDk refer to the joint distribution of the parameter and data in step k, such that the likelihood x ↦ L(Dk | x) of Dk is Gk-measurable. This implies that an agent who interprets data Dk according to this likelihood function has beliefs (represented by the posterior probability measure Pk(·) = Qk(· | Dk)) that correspond to not being able to discern events outside of Gk better than an ignorant person. Mathematically, this is phrased as a requirement

EPk[g | Gk′] = EP0[g | Gk′], (30)

for all F-measurable functions g: X → R and sigma algebras Gk′ such that Gk ⊆ Gk′ ⊆ F, for k = 1, …, n. The collection of pairs (D1, G1), …, (Dn, Gn) is referred to as a discernment and data collection process. The active information, after k steps of the discernment and data collection process, is

Ik+(A) = log [Pk(A) / P0(A)]. (31)

Let P¯k(·) = Ex0[Pk(· | Dk)] refer to expected degrees of belief after k steps of the information and data collection process, if data Dk ∼ L(· | x0) vary according to what one expects in the true world. The corresponding active information is

I¯k+(A) = log [P¯k(A) / P0(A)]. (32)

In the following sections we will use the sequences I1+, …, In+ and P1, …, Pn of AINs and posterior beliefs in order to define different notions of learning and knowledge acquisition.

6.2. Strong Learning and Knowledge Acquisition

Definition 3

(Strong learning). The probability measures P1, …, Pn, obtained from the discernment and data collection process, represent a learning process in the strong sense (conditionally on observed D1, …, Dn) if

0 ≤ I1+(A) ≤ ⋯ ≤ In+(A),  if x0 ∈ A,
0 ≥ I1+(A) ≥ ⋯ ≥ In+(A),  if x0 ∉ A, (33)

with at least one strict inequality. Learning is expected to occur, in the strong sense, if (33) holds with I¯1+(A), …, I¯n+(A) instead of I1+(A), …, In+(A).

Definition 4

(Strong knowledge acquisition). With Bϵ(x0) as in Definition 2, the learning process is knowledge acquiring in the strong sense (conditionally on observed D1, …, Dn) if, in addition to (33), this learning process is justified, so that for all ϵ > 0, P1(Bϵ(x0)) > 0 and

P0(Bϵ(x0)) ≤ P1(Bϵ(x0)) ≤ ⋯ ≤ Pn(Bϵ(x0)), (34)

with strict inequality for at least one step of (34) and for at least one ϵ > 0. Knowledge acquisition is expected to occur, in the strong sense, if learning is expected to occur in the strong sense, according to Definition 3, and additionally (34) holds with P¯1, …, P¯n instead of P1, …, Pn.

Illustration 5

(Continuation of Illustration 1). Assume the teacher of the math student has a discernment and data collection process (G1, D1), (G2, D2), where in the first step, G1 = G and D1 | xi ∼ Bin(10, πi) are obtained from a home assignment with 10 questions (as described in Section 3.2). Suppose the student knows how to add (x0 = x2). It can be seen that

P1(A) = P1(x2) + P1(x3) > P0(A) = 2/3,
P1(x0) = P1(x2) > P0(x2) = 1/3 (35)

whenever 7 ≤ d1 ≤ 10. Assume that in a second step the teacher receives information Z2 ∈ {0, 1} from the instructor on whether the student used external help (Z2 = 1) or not (Z2 = 0) during the exam. Let d2 = (d1, z2) refer to observed data after step 2. The likelihood, after the second step, then takes the form

L(d2 | xi) = (10 choose d1) πi^{d1} (1 − πi)^{10−d1} · L(z2 | xi),

where L(1 | xi) = 1(xi = x3) and L(0 | xi) = 1(xi ∈ {x1, x2}). If the instructor correctly reports that the student did not use external help (z2 = 0), it follows that

P2(A) = P2(x2) = P1(x2)/(P1(x1) + P1(x2)) < 2P1(x2)/(P1(x1) + 2P1(x2)) = P1(A),
P2(x0) = P2(x2) > P1(x2)/(P1(x1) + 2P1(x2)) = P1(x2) = P1(x0). (36)

We deduce from (35) and (36) that

P0(x0)<P1(x0)<P2(x0), (37)

which suggests that knowledge acquisition has occurred if the categorical space metric d(xi, xj) = 1(xi ≠ xj) is used on X. However, since P2(A) < P1(A), neither learning nor knowledge acquisition in the strong sense has occurred. The reason is that the information from the instructor (that the student has not cheated) makes the teacher less certain as to whether the student is able to score well on the test. On the other hand, if we change the proposition to S: "The student knows how to add," with A = {x2}, then strong learning and knowledge acquisition have occurred because of (37), since Pk(A) = Pk(x0) for k = 0, 1, 2.
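The two-step updates (35)-(37) can be reproduced with a few lines of Python; the observed score d1 = 8 and the instructor's report z2 = 0 are illustrative choices of ours.

```python
from math import comb

pi = {"x1": 0.5, "x2": 0.8, "x3": 0.8}     # per-question success probabilities
P0 = {x: 1/3 for x in pi}
x0, A = "x2", {"x2", "x3"}                 # true world, and worlds where S is true

def normalize(w):
    Z = sum(w.values())
    return {x: v / Z for x, v in w.items()}

# Step 1: home assignment with 10 questions and d1 correct answers.
d1 = 8
P1 = normalize({x: P0[x] * comb(10, d1) * p**d1 * (1 - p)**(10 - d1)
                for x, p in pi.items()})

# Step 2: the instructor reports z2 = 0 (no external help), which rules out x3.
L2 = {"x1": 1.0, "x2": 1.0, "x3": 0.0}
P2 = normalize({x: P1[x] * L2[x] for x in pi})

print("P1(A) =", P1["x2"] + P1["x3"], "> P0(A) =", P0["x2"] + P0["x3"])   # (35)
print("P2(A) =", P2["x2"] + P2["x3"], "< P1(A)")                          # (36)
print("P0(x0) < P1(x0) < P2(x0):", P0[x0] < P1[x0] < P2[x0])              # (37)
```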

6.3. Weak Learning and Knowledge Acquisition

Learning and knowledge acquisition are often fluctuating processes, and the requirements of Definition 3 are sometimes too strict. Weaker versions of learning and knowledge acquisition are therefore introduced.

Definition 5

(Weak learning). Learning in the weak sense has occurred at time n (conditionally on observed Dn), if

0 < In+(A),  if x0 ∈ A,
0 > In+(A),  if x0 ∉ A. (38)

Learning is expected to occur in the weak sense if (38) holds with I¯n+ instead of In+.

Definition 6

(Weak knowledge acquisition). Knowledge acquisition in the weak sense occurs (conditionally on observed Dn) if, in addition to the weak learning condition (38), in order for this learning to be justified, it holds for all ϵ>0 that Pn(Bϵ(x0))>0 and

P0(Bϵ(x0)) ≤ Pn(Bϵ(x0)), (39)

with strict inequality for at least one ϵ>0. Knowledge acquisition is expected to occur in the weak sense if weak learning occurs according to Definition 5 and (39) holds with P¯n instead of Pn.

7. Asymptotics

Strong and weak learning (or strong and weak knowledge acquisition) are equivalent for n = 1. The larger n is, the more restrictive strong learning becomes in comparison to weak learning. However, for large n, neither strong nor weak learning (knowledge acquisition) is entirely satisfactory. For this reason, in this section we will introduce asymptotic versions of learning and knowledge acquisition, for an agent whose discernment between worlds and collected data sets increase over a long period of time.

7.1. Asymptotic Learning and Knowledge Acquisition

In order to define asymptotic learning and knowledge acquisition, as the number of steps n of the discernment and data collection process tends to infinity, we first need to introduce AIN versions of limits. Define

Ilim inf+(B) = log [lim inf_{k→∞} Pk(B) / P0(B)], (40)
Ilim sup+(B) = log [lim sup_{k→∞} Pk(B) / P0(B)], (41)

and when the two limits of (40) and (41) agree, we refer to the common value as Ilim+(B). Define also

I¯lim inf+(B) = log [lim inf_{k→∞} P¯k(B) / P0(B)], (42)
I¯lim sup+(B) = log [lim sup_{k→∞} P¯k(B) / P0(B)], (43)

with I¯lim+(B) the common value whenever the two limits of (42) and (43) agree. Since Ilim+(B) only exists when Ilim inf+(B) = Ilim sup+(B), and Ilim inf+(B) ≤ Ilim sup+(B), the following definitions of asymptotic learning and knowledge acquisition are natural:

Definition 7

(Asymptotic learning). Learning occurs asymptotically (conditionally on the observed data sequence {Dk}_{k=1}^{∞}) if

Ilim inf+(A) > 0,  for x0 ∈ A,
Ilim sup+(A) < 0,  for x0 ∉ A. (44)

Full learning occurs asymptotically (conditionally on {Dk}_{k=1}^{∞}) if

Ilim+(A) = −log P0(A),  for x0 ∈ A,
Ilim+(A) = −∞,          for x0 ∉ A. (45)

Learning is expected to occur asymptotically if (44) holds with I¯lim sup+ and I¯lim inf+, instead of Ilim sup+ and Ilim inf+, respectively. Full learning is expected to occur asymptotically, if (45) holds with I¯lim+ instead of Ilim+.

Definition 8

(Asymptotic knowledge acquisition). Knowledge acquisition occurs asymptotically (conditionally on {Dk}_{k=1}^{∞}) if, in addition to the asymptotic learning condition (44), in order for this asymptotic learning to be justified, for every ϵ > 0, it holds that

lim inf_{k→∞} Pk(Bϵ(x0)) > 0

and

Ilim inf+(Bϵ(x0)) ≥ 0, (46)

with strict inequality for at least one ϵ > 0. Full knowledge acquisition occurs asymptotically (conditionally on {Dk}_{k=1}^{∞}) if (45) holds and

Ilim+(Bϵ(x0)) = −log P0(Bϵ(x0)) (47)

is satisfied for all ϵ>0. If learning is expected to occur asymptotically according to Definition 7, and if (46) holds with I¯lim inf+ instead of Ilim inf+, then knowledge acquisition is expected to occur asymptotically. Full knowledge acquisition is expected to occur asymptotically if full learning is expected to occur asymptotically according to Definition 7, and if (47) holds with I¯lim+ instead of Ilim+.

7.2. Bayesian Asymptotic Theory

In this subsection we will use Bayesian asymptotic theory in order to quantify and give conditions for when asymptotic learning and knowledge acquisition occur. Let Ω be a large space that incorporates prior beliefs and data for all k = 1, 2, …. Define Xk: Ω → X as a random variable whose distribution corresponds to the agent’s posterior beliefs, based on data set Dk, which itself varies according to another random variable Dk: Ω → Dk with distribution Dk ∼ L(· | x0). Let Prx0 be a probability measure on subsets of Ω that induces the distributions Xk | Dk ∼ Pk and Xk ∼ P¯k, respectively. The following proposition is a consequence of Definitions 7 and 8:

Proposition 1.

Suppose full learning is expected to occur asymptotically, in the sense of (45), with I¯lim+ instead of Ilim+. Then,

Prx0(Xk ∈ A) = P¯k(A) → 1,  x0 ∈ A,
                        0,  x0 ∉ A (48)

as k → ∞. In particular, the type I and II errors of the hypothesis test (18) and (19), with threshold I = log[p/P0(A)] for some 0 < p < 1, satisfy

αk(x0) = Prx0[Ik+(A) ≤ I] = Prx0[Pr(Xk ∈ A | Dk) ≤ p] = Prx0[Pk(A) ≤ p] → 0,  x0 ∈ A,
βk(x0) = Prx0[Ik+(A) > I] = Prx0[Pr(Xk ∈ A | Dk) > p] = Prx0[Pk(A) > p] → 0,  x0 ∉ A, (49)

respectively, as k → ∞. If full knowledge acquisition occurs asymptotically, in the sense of (47), then

Xk | Dk →p x0   conditionally on {Dk}_{k=1}^{∞} (50)

as k → ∞, with →p referring to convergence in probability. If full knowledge acquisition is expected to occur asymptotically, in the sense of Definition 8, then

Xk →p x0 (51)

as k → ∞.

Remark 10.

Full asymptotic knowledge acquisition (50) is closely related to the notion of posterior consistency [40]. For our model, the latter concept is usually defined as

Prx0(Xk | Dk →p x0 as k → ∞) = 1, (52)

where the probability refers to variations in the data sequence {Dk}_{k=1}^{∞} when x0 holds. Thus, posterior consistency (52) means that full asymptotic knowledge acquisition (50) holds with probability 1. Let L(X) refer to the distribution of the random variable X. Then, (52) is equivalent to

Pk = L(Xk | Dk) →a.s. δx0 (53)

as k → ∞, with a.s. referring to almost sure weak convergence with respect to variations in the data sequence {Dk}_{k=1}^{∞} when x = x0. On the other hand, it follows from Definition 8 that if full knowledge acquisition is expected to occur asymptotically, this is equivalent to

Pk = L(Xk | Dk) →p δx0 (54)

as k → ∞, which is a weaker concept than posterior consistency, since almost sure weak convergence implies weak convergence in probability. However, sometimes (54), rather than (52) and (53), is used as a definition of posterior consistency.

Remark 11.

It is sometimes possible to sharpen (54) and obtain the rate at which the posterior distribution converges to δx0. The posterior distribution is said to contract at rate ϵk → 0 to δx0 as k → ∞ (see for instance [41]), if for every sequence Mk → ∞ it holds that

Q(Xk ∈ B(x0, Mkϵk)^c | Dk) = Pk(B(x0, Mkϵk)^c) →p 0, (55)

when {Dk}_{k=1}^{∞} varies according to what one expects in the true world x0. Since convergence in probability is equivalent to convergence in mean for bounded random variables, it can be seen that (55) is equivalent to P¯k(B(x0, Mkϵk)^c) → 0, or

(Mkϵk)^{−1}(Xk − x0) →p 0 (56)

as k → ∞ for each sequence Mk. Comparing (51) with (55) and (56), we find that a contraction of the posterior towards δx0 at rate ϵk is equivalent to expecting full knowledge acquisition asymptotically at rate ϵk.

It follows from Proposition 1 and Remarks 10 and 11 that Bayesian asymptotic theory can be used, within our frequentist/Bayesian framework, to give sufficient conditions for asymptotic learning and knowledge acquisition to occur. Suppose, for instance, that Dk = (Z1, …, Zk) is a sample of k independent and identically distributed random variables Zl with distribution Zl ∼ F(· | x0) that belongs to the statistical model {F(· | x); x ∈ X}. The likelihood function is then a product

L(Dk | x) = ∏_{l=1}^{k} Pr(Zl | x) (57)

of the likelihoods of all observations Zl. For such a model, a number of authors [40,42,43,44,45] have provided sufficient conditions for posterior consistency (52) and (53) to occur. It follows from Remark 10 that these conditions also imply the weaker concept (54) of full, expected knowledge acquisition to occur asymptotically.

Suppose (57) holds with a parameter space X ⊆ R^q that is a subset of Euclidean space of dimension q. It is possible then to obtain the rate (56) at which knowledge acquisition is expected to occur. The first step is to use the Bernstein–von Mises theorem, which under appropriate conditions (see for instance [46]) approximates the posterior distribution Pk = L(Xk | Dk) by a normal distribution centered around the maximum likelihood (ML) estimator

x̂0k = x̂0k(Dk) = argmax_{x∈X} L(Dk | x) (58)

of x0. More specifically, this theorem provides weak convergence

√k (Xk − x̂0k) | Dk →L N(0, J(x0)^{−1}) (59)

as k → ∞, of a re-scaled version of the distribution of Xk | Dk when {Dk}_{k=1}^{∞} varies according to what one expects when x = x0. The limiting distribution is a q-dimensional normal distribution with mean 0 and a covariance matrix that equals the inverse of the Fisher information matrix J(x0), evaluated at the true world parameter x0. On the other hand, the standard asymptotic theory of maximum likelihood estimation (see for instance [47]) implies

√k (x̂0k − x0) →L N(0, J(x0)^{−1}) (60)

as k → ∞, with weak convergence referring to variations in the data sequence {Dk}_{k=1}^{∞} when x = x0. Combining Equations (59) and (60), we arrive at the following result:

Theorem 1.

Assume data {Dk = (Z1, …, Zk)}_{k=1}^{∞} consist of independent and identically distributed random variables {Zl}_{l=1}^{∞}, and that the Bernstein–von Mises theorem (59) and asymptotic normality (60) of the ML estimator hold. Then, Xk converges weakly towards x0 at rate 1/√k, in the sense that

√k (Xk − x0) →L N(0, 2J(x0)^{−1}) (61)

as k → ∞. In particular, full knowledge acquisition is expected to occur asymptotically at rate 1/√k.

Proof. 

Let s ↦ Fk(s) = F_{√k(x̂0k − x0)}(s) refer to the distribution function of √k(x̂0k(Dk) − x0), defined for all q-dimensional vectors s = (s1, …, sq) ∈ R^q. Let also t = (t1, …, tq) ∈ R^q and denote the distribution function of N(0, 2J(x0)^{−1}) by G. Combining (59) and (60), and making use of the fact that the convolution of two independent N(0, J(x0)^{−1}) variables is distributed as N(0, 2J(x0)^{−1}), we can find that

Prx0(√k(Xk − x0) ≤ t) = ∫ Prx0(√k(Xk − x̂0k) ≤ t − s | √k(x̂0k − x0) = s) dFk(s) → ∫_{s≤t} dG(s) (62)

as k → ∞, with s ≤ t referring to sj ≤ tj for j = 1, …, q. Since (62) holds for any t ∈ R^q, Equation (61) follows. Moreover, in view of (56), Equation (61) implies that full knowledge acquisition is expected to occur asymptotically at rate 1/√k. □

The conditions of Theorem 1 typically require that data, and the agent’s interpretation of data, are unbiased. When these conditions fail (cf. Remark 2), there is no guarantee that knowledge acquisition is expected to occur asymptotically as k → ∞.
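A quick simulation sketch illustrates the 1/√k rate of Theorem 1 in the simplest case we could pick (Bernoulli data with a uniform prior, so the posterior is a Beta distribution): the posterior standard deviation around x0 shrinks roughly like 1/√k, and sd·√k stabilizes near J(x0)^{−1/2}. All numerical settings below are our own.

```python
import numpy as np

rng = np.random.default_rng(1)
x0 = 0.3                                    # hypothetical true success probability
for k in [100, 400, 1600, 6400]:
    z = rng.binomial(1, x0, size=k)         # i.i.d. data Z_1, ..., Z_k
    a, b = 1 + z.sum(), 1 + k - z.sum()     # Beta posterior under a uniform prior
    post_sd = np.sqrt(a * b / ((a + b)**2 * (a + b + 1)))
    print(f"k = {k:5d}: posterior sd = {post_sd:.4f}, sd * sqrt(k) = {post_sd * np.sqrt(k):.3f}")
# sd * sqrt(k) stabilizes near sqrt(x0 * (1 - x0)), i.e. J(x0)^(-1/2) for Bernoulli data
```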

8. Examples

Example 1

(Coin tossing). Let x0 ∈ X = [0, 1] be the probability of heads when a certain coin is tossed. An agent wants to find out whether the proposition

S: the coin is symmetric with margin ε > 0

is true or not. This corresponds to a truth function f(x) = 1(x ∈ A), with A = [0.5 − ε, 0.5 + ε], that is known to the agent. Suppose the coin is tossed a large number of times (n = ∞), and let Dk = (Z1, …, Zk) ∈ Dk = {0, 1}^k be a binary sequence of length k that represents the first k tosses, with tails and heads corresponding to 0 (Zk = 0) and 1 (Zk = 1), respectively. The number of heads Mk = ∑_{l=1}^{k} Zl ∼ Bin(k, x0) after k tosses is then a sufficient statistic for estimating x0 based on data Dk. Even though {Dk} is an increasing sequence of data sets, we put Yk(x) = x and Gk = F = B([0, 1]), the Borel σ-algebra on [0, 1], for k = 1, 2, …. Let P0 be the uniform prior distribution on [0, 1]. Since the uniform distribution is a beta distribution, and beta distributions are conjugate priors to binomial distributions, it is well known [17] that the posterior distribution

Pk = Beta(1 + Mk, 1 + k − Mk)

belongs to the beta family as well. Consequently, if Xk is a random variable that reflects the agent’s degree of belief in the probability of heads after k tosses, it follows that his belief in a symmetric coin, if Mk=m, is

Pk(A) = Pr(Xk ∈ A | Dk) = Pr(Xk ∈ A | Mk = m) = ∫_{0.5−ε}^{0.5+ε} pk(x | m) dx,

where

pk(x | m) = (1 − x)^{k−m} x^m / B(1 + m, 1 + k − m) = (k + 1) (k choose m) (1 − x)^{k−m} x^m = (k + 1) L(m | x) (63)

is the posterior density function of the parameter x, whereas B(a, b) is the beta function and x ↦ L(m | x) the likelihood function. From this, it follows that the AIN after k coin tosses with m heads and k − m tails equals

Ik+(A) = log[(2ε)^{−1} Pk(A)] = log[(2ε)^{−1} (k + 1) (k choose m) ∫_{0.5−ε}^{0.5+ε} (1 − x)^{k−m} x^m dx].

Since data are random, Pk(A) (and hence also Ik+(A)) will fluctuate randomly up and down with probability one (see Figure 3); for this reason, {Pk}_{k=1}^{∞} does not represent a learning process in the strong sense of Definition 3. On the other hand, it follows by the strong law of large numbers that Mk/k →a.s. x0 as k → ∞, and from properties of the beta distribution, this implies that full learning and knowledge acquisition occur asymptotically according to Definitions 7 and 8, with probability 1. In view of Remark 10, we also have posterior consistency (52) and (53).

By analyzing P¯k instead of Pk, we may also assess whether learning and knowledge acquisition are expected to occur. The expected degree of belief in a symmetric coin, after k tosses, is

P¯k(A) = ∫_{0.5−ε}^{0.5+ε} p¯k(x) dx,

where

p¯k(x) = Ex0[pk(x | Mk)] = ∑_{m=0}^{k} L(m | x0) pk(x | m) = (k + 1) ∑_{m=0}^{k} L(m | x0) L(m | x)

is the expected posterior density function of x, after k tosses of the coin. Note in particular that

∫_0^1 p¯k(x) dx = 1.

It can be shown that (63) and the weak law of large numbers (Mk/k →p x0 as k → ∞, where →p refers to convergence in probability) lead to uniform convergence

sup_{x: |x − x0| ≥ ϵ} p¯k(x) → 0

as k → ∞ for any ϵ > 0. The last four displayed equations imply P¯k(A) → 1(x0 ∈ A) and P¯k → δx0 weakly as k → ∞. This and Definitions 7 and 8 imply that full learning and knowledge acquisition are expected to occur asymptotically. This result is also a consequence of posterior consistency, or of Theorem 1. Notice, however, that a purely Bayesian analysis does not allow us to conclude that knowledge acquisition occurs, or is expected to occur, asymptotically.

Figure 3.


Degree of belief is represented as a function of coin tosses. There is no strong learning because the belief oscillates. However, there is weak learning after a few coin tosses. In particular, when the number of coin tosses is 1000, there is weak learning since P1000(A)>P0(A) and I1000+(A)>0.
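The fluctuating trajectory of Figure 3 can be reproduced with a short simulation (a sketch with our own choices of x0, ε and the random seed): Pk(A) is the Beta posterior probability of the interval A = [0.5 − ε, 0.5 + ε], and Ik+(A) the corresponding active information.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(42)
x0, eps = 0.52, 0.05                       # true heads probability and margin
A_lo, A_hi = 0.5 - eps, 0.5 + eps
P0_A = 2 * eps                             # prior (uniform) probability of A

tosses = rng.binomial(1, x0, size=1000)
for k in [10, 50, 100, 500, 1000]:
    m = tosses[:k].sum()
    post = beta(1 + m, 1 + k - m)          # posterior P_k after k tosses
    Pk_A = post.cdf(A_hi) - post.cdf(A_lo)
    Ik = np.log(Pk_A / P0_A)               # active information I_k^+(A), Eq. (31)
    print(f"k = {k:4d}: P_k(A) = {Pk_A:.3f}, I_k^+(A) = {Ik:+.3f}")
```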

Example 2

(Historical events). Let X=(0,1] represent a time interval of the past. A person wants to find out whether his ancestor died or not during a famine that occurred in the province where the ancestor lived. Formally, this corresponds to a proposition

S: the ancestor died during the time of the famine.

Let f(x) = 1 if the famine occurred at time x, and f(x) = 0 if not. Assume that the ancestor died at an unknown time point x0 and that the time period during which the famine lasted is A = [a, b], where 0 ≤ a < b ≤ 1 are known. If X corresponds to a fairly short time interval of the past, it is reasonable to assume that P0 has a uniform distribution on (0, 1].

In the first step of the learning process, suppose radiometric dating D1=Z1 of a burial find from the ancestor appears. If δ=1/N represents the precision of this dating method, the corresponding σ-algebra is

$\mathcal{G}_1 = \sigma\big((0,\delta], (\delta, 2\delta], \ldots, ((N-1)\delta, 1]\big) = \sigma(Y_1),$ (64)

where Y1: X → {1,…,N} is defined through Y1(x)=[x/δ]+1, and where [x/δ] is the integer part of x/δ. Due to (3), it follows that P1 has a density function

$p_1(x) = N \sum_{i=1}^{N} p_{1i}\, 1\big(x \in ((i-1)\delta, i\delta]\big),$ (65)

for some non-negative probability weights p1i ≥ 0 that sum to 1. Since p1i=p1i(D1), this measure is constructed from the radiometric dating data D1 of the burial find from the ancestor. The G1-optimal probability measure is obtained from (8) as

$p_{1i} = \begin{cases} 1, & i = i_0 = [x_0/\delta]+1, \\ 0, & i \ne i_0. \end{cases}$

It corresponds to dating the time of death of the ancestor correctly, given the accuracy of this dating method. On the other hand, if the radiometric dating equipment has a systematic error of −δ, a truth-excluding probability measure (10) is obtained with

$p_{1i} = \begin{cases} 1, & i = i_0 - 1 = [x_0/\delta], \\ 0, & i \ne i_0 - 1. \end{cases}$ (66)

In the second step of the learning process, suppose data D2=(Z1,Z2) is extended to include a piece of text Z2 from a book where the time of death of the ancestor can be found. This extra source of information increases the σ-algebra to G2=F=B([0,1]), and if the contents of the book are reliable, P2=δx0 is F-optimal. It follows from Definition 3 that strong learning has occurred if Na=ia and Nb=ib are integers and

$0 < I_1^{+} = \log\Big[\sum_{i=i_a+1}^{i_b} p_{1i}\,/\,(b-a)\Big] < I_2^{+} = \log[1/(b-a)], \quad \text{if } x_0 \in (a,b),$
$0 > I_1^{+} = \log\Big[\sum_{i=i_a+1}^{i_b} p_{1i}\,/\,(b-a)\Big] > I_2^{+} = -\infty, \quad \text{if } x_0 \notin (a,b).$ (67)

Figure 4 illustrates another scenario where not only strong learning but also strong knowledge acquisition occurs. Suppose now that (66) holds, with ia+1 ≤ i0−1 ≤ ib. If P2=δx0, the strong learning condition (67) is satisfied, and the weak knowledge acquisition requirement of Definition 6 holds as well. Strong knowledge acquisition has not occurred though, since p1i0=0 means that Equation (34) of Definition 4 (with n=2) is violated for sufficiently small ϵ>0. Note in particular that these conclusions about knowledge acquisition cannot be drawn from a purely Bayesian analysis.

Assume now that the contents of the book are not reliable. A probability measure P2 on [0,1] may be chosen so that it incorporates data Z1 from the radiometric dating and data Z2 from the book. This probability measure will also include information about the way the text of the book is believed to be unreliable. If the agent trusts Z2 too much, it may happen that strong learning does not occur.
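The sketch below (hypothetical numbers; the spread of the dating weights is an illustrative choice, whereas the G1-optimal measure of the text would place all weight on a single interval) evaluates the strong-learning condition (67) numerically.

```python
# Minimal sketch of the strong-learning condition (67) in Example 2: active
# information after radiometric dating (step 1) and after also reading a
# reliable book (step 2). All numbers are hypothetical.
import numpy as np

delta = 0.05                                   # dating precision; N = 1/delta intervals
N = int(round(1 / delta))
a, b = 0.30, 0.50                              # famine period A = [a, b]; N*a and N*b are integers
x0 = 0.42                                      # ancestor's true time of death

centers = (np.arange(N) + 0.5) * delta
p1 = np.exp(-0.5 * ((centers - x0) / 0.08) ** 2)   # dating uncertainty spread around x0
p1 /= p1.sum()                                 # probability weights p_{11}, ..., p_{1N}

ia, ib = int(round(N * a)), int(round(N * b))
P1_A = p1[ia:ib].sum()                         # P_1(A): mass of the intervals inside [a, b]
I1 = np.log(P1_A / (b - a))                    # AIN after radiometric dating
I2 = np.log(1.0 / (b - a))                     # AIN after the (reliable) book: P_2 = point mass at x0
print(f"0 < I_1^+ = {I1:.3f} < I_2^+ = {I2:.3f}: strong learning (x0 in [a, b])")
```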

Figure 4. Posterior densities p1(x) and p2(x) after one and two steps of the discernment and data collection process of Example 2 when S is true (x0 ∈ [a,b]). Since p1 is measurable with respect to G1, it is piecewise-constant with step length δ. Note that strong learning and knowledge acquisition occur.

Example 3

(Future events). A youth camp with certain outdoor activities is planned for a weekend. Let X=(0,1]^2 denote the set of possible temperatures x=(x1,x2) of the two days for which the camp is planned, each normalized within the range 0 ≤ xi ≤ 1. The outdoor activities are only possible within a certain sub-range 0 < a ≤ x1, x2 ≤ b < 1 of temperatures. The proposition

S: it is possible to have the outdoor activities

corresponds to a truth function f(x)=1(x ∈ [a,b]^2) and A=[a,b]^2. The leaders have to travel to the camp five days before it starts and then make a decision on whether to bring equipment for the outdoor activities or for some other indoor activities. In the first step they consult weather forecast data D1=Z1, with a σ-algebra G1 given by

$\sigma\big(\{((i-1)\delta_1, i\delta_1] \times ((j-1)\delta_2, j\delta_2];\ 1 \le i \le N_1,\ 1 \le j \le N_2\}\big),$

which is σ(Y1), the σ-algebra generated by Y1, where δ1 and δ2>δ1 represent the maximal possible accuracy of weather forecasts five and six days ahead, respectively, Ni=1/δi, and Y1(x)=([x1/δ1]+1, [x2/δ2]+1). Let P0 be the uniform distribution on [0,1]^2. Due to (3), P1 has a density function

$p_1(x) = N_1 N_2 \sum_{i=1}^{N_1}\sum_{j=1}^{N_2} p_{1ij}\, 1\big((x_1,x_2) \in R_{ij}\big),$ (68)

for some non-negative probability weights p1ij = p1ij(D1) ≥ 0 that sum to 1, with

$R_{ij} = ((i-1)\delta_1, i\delta_1] \times ((j-1)\delta_2, j\delta_2],$

a rectangular region that corresponds to the ith temperature interval of the first day of the camp and the jth temperature interval of the second day. Consequently, the accuracy G1 of weather forecast data forces p1 to be constant over each Rij. A G1-optimal measure assigns full weight 1 to the rectangle Rij with i=[x01/δ1]+1 and j=[x02/δ2]+1, where x0=(x01,x02) represents the actual temperature of the two days. Observe that the G1-optimal measure is limited by the forecast accuracies δ1 and δ2; it can do no better than assign each day's temperature to the interval of size δ1 or δ2 that contains the actual temperature, but it cannot say what the exact temperature is. An exact prediction requires an F-optimal measure.

In a second step, in order to get some additional information, the leaders of the camp consult a prophet. Let P2 refer to the probability measure based on the weather forecast Z1 and the message Z2 of the prophet, so that D2=(Z1,Z2) and G2=F. If the prophet always speaks the truth, and if the leaders of the camp rely on his message, they will make use of the F-optimal measure P2=δx0, corresponding to weak (and full) learning, and a full amount of knowledge. In general, the camp leaders' prediction in step k is correct with

$\text{probability} = \begin{cases} P_k([a,b]^2), & \text{if } x_0 \in [a,b]^2, \\ 1 - P_k([a,b]^2), & \text{if } x_0 \notin [a,b]^2. \end{cases}$

If this probability is less than 1 for k=2, the reason is either that the prophet does not always speak the truth or that the leaders do not rely solely on the message of the prophet. In particular, it follows from Definition 3 that strong learning has occurred if

$0 < I_1^{+} = \log\big[P_1([a,b]^2)/(b-a)^2\big] < I_2^{+} = \log\big[P_2([a,b]^2)/(b-a)^2\big], \quad \text{if } x_0 \in [a,b]^2,$
$0 > I_1^{+} = \log\big[P_1([a,b]^2)/(b-a)^2\big] > I_2^{+} = \log\big[P_2([a,b]^2)/(b-a)^2\big], \quad \text{if } x_0 \notin [a,b]^2.$ (69)

Suppose the weather forecast and the message of the prophet are biased, but they still correctly predict whether outdoor activities are possible or not. Then, neither weak nor strong knowledge acquisition occurs, in spite of the fact that the strong learning condition (69) holds. Note in particular that such a conclusion is not possible with a purely Bayesian analysis. Another scenario wherein neither (weak or strong) learning nor knowledge acquisition takes place is depicted in Figure 5.
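As a numerical illustration of (69), the following sketch (hypothetical temperatures, forecast accuracies and forecast weights) computes I1+ and I2+ for a scenario in which both the forecast and the prophet point towards outdoor activities being possible.

```python
# Minimal sketch of the strong-learning condition (69) in Example 3: active
# information from a weather forecast (step 1) and from a truthful prophet
# (step 2). Forecast weights are an illustrative Gaussian spread around x0.
import numpy as np

d1, d2 = 0.05, 0.10                      # forecast accuracies 5 and 6 days ahead
N1, N2 = int(round(1 / d1)), int(round(1 / d2))
a, b = 0.40, 0.70                        # outdoor activities possible if both temps are in [a, b]
x0 = np.array([0.55, 0.60])              # actual (normalized) temperatures of the two days

# Forecast weights p_{1ij} over the rectangles R_ij, concentrated around x0.
c1 = (np.arange(N1) + 0.5) * d1
c2 = (np.arange(N2) + 0.5) * d2
w = np.exp(-0.5 * (((c1[:, None] - x0[0]) / 0.08) ** 2
                   + ((c2[None, :] - x0[1]) / 0.12) ** 2))
w /= w.sum()

def overlap(lo, hi, width, n):
    """Fraction of each cell ((i-1)*width, i*width], i = 1..n, lying inside [lo, hi]."""
    left = np.arange(n) * width
    right = left + width
    return np.clip(np.minimum(right, hi) - np.maximum(left, lo), 0.0, width) / width

# P1 is uniform within each rectangle, so P_1(A) weights each cell by its overlap with A.
P1_A = np.sum(w * overlap(a, b, d1, N1)[:, None] * overlap(a, b, d2, N2)[None, :])
P2_A = 1.0                               # prophet speaks the truth and x0 lies in [a, b]^2
I1 = np.log(P1_A / (b - a) ** 2)
I2 = np.log(P2_A / (b - a) ** 2)
print(f"0 < I_1^+ = {I1:.3f} < I_2^+ = {I2:.3f}: strong learning")
```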

Figure 5. Posterior densities p1(x1,x2) and p2(x1,x2) after one and two steps of data collection for Example 3. Since x01<a, it is not possible to have outdoor activities the first day of the camp. The weather forecast density p1 is supported and piecewise-constant on the four rectangles with width δ1 and height δ2, corresponding to σ-algebra G1. The true temperatures (x01,x02) are within the support of p1. On the other hand, the prophet incorrectly predicts that outdoor activities are possible both days; p2 is supported on the ellipse. In this case, neither (weak or strong) learning nor knowledge acquisition takes place.

Example 4

(Replication of studies). Some researchers want to find the prevalence of the physical symptoms of a certain disease. Let X=[0,1]^2 refer to the possible set of values x=(x1,x2) for the prevalence of the symptoms, obtained from two different laboratories. The first value x1 corresponds to the prevalence obtained in Laboratory 1, whereas the second value x2 is obtained when Laboratory 2 tries to replicate the study of Laboratory 1. The board members of the company to which the two laboratories belong want to find out whether the two estimates are consistent, within some tolerance level 0<ε<1. In that case, the second study is regarded as a replication of the first one. The proposition

S: the second study replicates the first one

corresponds to a truth function f(x)=1(|x2−x1| ≤ ε) and

$A = \{(x_1,x_2);\ |x_2 - x_1| \le \varepsilon\}.$ (70)

The true value x0=(x01,x02) represents the actual prevalences of the symptoms, obtained from the two laboratories under ideal conditions. Importantly, it may still be the case that x01 ≠ x02, if the prevalence of the symptoms changes between the two studies and/or the two laboratories estimate the prevalences within two different subpopulations.

Let D2 be a data set by which Laboratory 2 receives all needed data from Laboratory 1 in order to set up its study properly (so that, for instance, affection status is defined in the same way in the two laboratories). We will assume Y2(x1,x2)=(x1,x2), so that the corresponding σ-algebra

$\mathcal{G}_2 = \mathcal{F} = \mathcal{B}(\mathcal{X}) = \mathcal{B}_0 \times \mathcal{B}_0,$

corresponds to full discernment, with B(X) being the Borel σ-algebra on X, whereas B0=B((0,1]) is the Borel σ-algebra on the unit interval (see Remark 1). If P2 is the probability measure obtained from D2, the probability of concluding that the second study replicated the first is

$P_2(A) = \int_A dP_2(x_1,x_2).$ (71)

In particular, when each laboratory makes use of data from all individuals in its subpopulation (which may or may not be the same for the two laboratories), the F-optimal probability measure (9) corresponds to

$P_2 = \delta_{x_0} \;\Longrightarrow\; P_2(A) = 1(|x_{02} - x_{01}| \le \varepsilon).$ (72)

Now consider another scenario where Laboratory 2 only gets partial information from Laboratory 1. This corresponds to a data set D1 with the same sampled individuals as in D2, but Laboratory 2 has incomplete information from Laboratory 1 regarding the details of how the first study was set up. For this reason, they make use of a coarser σ-algebra, by which it is only possible to quantify prevalence with precision δ. If this σ-algebra is referred to as Bδ ⊂ B0, it follows that Y1(x1,x2)=(x1, [x2/δ]+1) and

$\mathcal{G}_1 = \mathcal{B}_0 \times \mathcal{B}_\delta.$

The corresponding loss of information is measured through a probability P1 that has the same marginal distribution as P2 for all events B that are discernible from G1, i.e.,

$P_1(B) = P_2(B), \quad B \in \mathcal{G}_1.$ (73)

Hence, it follows from (30) and (73) that

$dP_1(x_1,x_2) = N \sum_{j=1}^{N} p_{1j}(x_1)\, 1(x_2 \in R_j)\, dP_2(x_1)\, dx_2,$

where N=1/δ, p1j(x1)=P2(X2 ∈ Rj | X1=x1), and Rj=((j−1)δ, jδ] is the j-th possible region for the prevalence estimate of Laboratory 2. In particular, the probability that the second study replicates the first one is

$P_1(A) = N \int_0^1 \sum_{j=1}^{N} p_{1j}(x_1)\, \big|R_j \cap [x_1-\varepsilon,\ x_1+\varepsilon]\big|\, dP_2(x_1).$ (74)

If both laboratories perform a screening and collect data from all individuals in their regions, so that (72) holds, then P1 is a G1-optimal measure according to (8), with

$P_1(A) = N \big|R_{j_0} \cap [x_{01}-\varepsilon,\ x_{01}+\varepsilon]\big|,$ (75)

and j0=[x02/δ]+1. Making use of Definition 3, we notice that a sufficient condition for strong learning to occur is that P0 has a uniform distribution on X (so that P0(A)=2ε−ε^2), that (72) and (75) hold, and that

$0 < I_1^{+} = \log\big[N|R_{j_0} \cap [x_{01}-\varepsilon, x_{01}+\varepsilon]|/(2\varepsilon-\varepsilon^2)\big] < I_2^{+} = \log\big[1/(2\varepsilon-\varepsilon^2)\big], \quad \text{if } |x_{02}-x_{01}| < \varepsilon,$
$0 > I_1^{+} = \log\big[N|R_{j_0} \cap [x_{01}-\varepsilon, x_{01}+\varepsilon]|/(2\varepsilon-\varepsilon^2)\big] > I_2^{+} = -\infty, \quad \text{if } |x_{02}-x_{01}| > \varepsilon.$

With full information transfer between the two laboratories, the replication probabilities (71) and (72) based on data D2 only depend on ε, whereas the corresponding replication probabilities (74) and (75), under incomplete information transfer between the laboratories and data D1, also depend on δ. In particular, P1(A) will always be less than 1 when 2ε<δ, even when (75) holds and x01=x02. Moreover, δ sets the limit in terms of how much knowledge can be obtained from the two studies under incomplete information transfer, since

$P_1(B_\epsilon(x_0)) < 1, \quad \text{for all } 0 < \epsilon < \delta.$

Note that this last conclusion cannot be obtained from a Bayesian analysis, since a true pair x0 of prevalences does not belong to a purely Bayesian framework.
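The following sketch (hypothetical values of ε, δ and the prevalences) evaluates (72) and (75) and shows how the coarsened σ-algebra caps the replication probability P1(A) below 1 when 2ε<δ.

```python
# Minimal sketch of formulas (72) and (75) in Example 4: the probability of
# concluding that the second study replicates the first, under full versus
# coarsened information transfer between the laboratories. Numbers are hypothetical.
eps = 0.02                         # replication tolerance
delta = 0.10                       # precision of Laboratory 2's coarsened sigma-algebra
N = int(round(1 / delta))
x01 = x02 = 0.43                   # true prevalences (equal, so the studies do replicate)

# Full information (72): P_2 is a point mass at x0, so replication is detected with certainty.
P2_A = float(abs(x02 - x01) <= eps)

# Coarsened information (75): only the length of R_{j0} intersected with
# [x01 - eps, x01 + eps] matters, where R_{j0} is the interval containing x02.
j0 = int(x02 / delta)              # 0-indexed interval R_{j0} = (j0*delta, (j0+1)*delta]
lo, hi = j0 * delta, (j0 + 1) * delta
P1_A = N * max(0.0, min(hi, x01 + eps) - max(lo, x01 - eps))

print(f"P_2(A) = {P2_A:.2f},  P_1(A) = {P1_A:.2f}  (< 1 since 2*eps = {2*eps} < delta = {delta})")
```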

Example 5

(Unmeasured confounding and causal inference). This example illustrates unmeasured confounding and causal inference. Let q=n and X={0,1}^n. An individual is assigned a binary vector x=(x1,…,xn) of length n, where xn ∈ {0,1} codes for whether that person will have symptoms within five years (xn=1) or not (xn=0) that are associated with a certain mental disorder. The first component x1 ∈ {0,1} refers to the individual's binary exposure, whereas the other variables xk ∈ {0,1}, k=2,…,n−1, are binary confounders. The truth function f(x)=xn corresponds to symptom status, whereas

$A = \{x \in \mathcal{X};\ x_n = 1\}$

represents the vectors x of all individuals in the population with symptoms. Consider the proposition

S: Adam will have the symptoms within five years,

and let x0=(x01,…,x0n) be the vector associated with Adam. We will introduce a sequence of probability measures P0, P1, …, Pn, where P0 represents the distribution of X=(X1,…,Xn) in the whole population, whereas Pk corresponds to the conditional distribution of X ∼ P0, given that its first k covariates Dk=(Z1,…,Zk)=(x01,…,x0k) ∈ 𝒟k={0,1}^k have been observed, with values equal to those of Adam's first k covariates. Since the conditional distribution Dk|x0 is non-random, it follows that

$\bar{P}_k = P_k = \prod_{l=1}^{k}\delta_{x_{0l}} \times P_0\big(\cdot \mid \{X_l = x_{0l}\}_{l=1}^{k}\big)$ (76)

for k=0,1,…,n−1, whereas P̄n=Pn=δx0 for k=n. According to Definition 5, this implies that weak learning occurs with probability 1, and in particular that weak learning is expected to occur. If Yk(x1,…,xn)=(x1,…,xk), we have that

$\mathcal{G}_k = 2^{\{0,1\}^k} \times \{0,1\}^{n-k}$ (77)

for k=0,…,n. Note, in particular, that Pk is Gk-optimal, corresponding to error-free measurement of Adam's first k covariates.

In order to specify the null distribution P0, we assume that a logistic regression model [48]

$P_0(X_n = 1 \mid x_1,\ldots,x_{n-1}) = \frac{\exp\big(\beta_0 + \sum_{k=1}^{n-1}\beta_k x_k\big)}{1 + \exp\big(\beta_0 + \sum_{k=1}^{n-1}\beta_k x_k\big)} = g(x_1,\ldots,x_{n-1})$ (78)

holds for the probability of having the symptoms within five years, conditionally on the n−1 covariates (one exposure and n−2 confounders). It is also assumed that the regression parameters β0,…,βn−1 are known, so that g is known as well. It follows from Equations (76) and (78) that

$P_k(A) = P_0\big(X_n = 1 \mid \{X_l = x_{0l}\}_{l=1}^{k}\big) = E_{P_0}\big[g(X_1,\ldots,X_{n-1}) \mid \{X_l = x_{0l}\}_{l=1}^{k}\big] =: g_k(x_{01},\ldots,x_{0k})$ (79)

can be interpreted as increasingly better predictions of Adam's symptom status five years ahead, for k=0,1,…,n−1, whereas Pn(A)=f(x0)=x0n represents full knowledge of S. In particular, P0(A) is the prevalence of the symptoms in the whole population, whereas P1(A)=g1(x01) is Adam's predicted probability of having the symptoms when his exposure x01 is known but none of his confounders are measured.

Suppose x2,…,xn−1 are sufficient for confounding control, and that the exposure and the confounders (in principle) can be assigned. Let x0=(x01,…,x0n) represent a hypothetical individual for which all covariates are assigned. Under a so-called conditional exchangeability condition [16], it is possible to use a slightly different definition

$\tilde{P}_k = \prod_{l=1}^{k}\delta_{x_{0l}} \times E_{P_0}\Big[P_0\big(\cdot \mid \{X_l\}_{l=1}^{k}\big)\Big]$

of the probability measures in order to compute the counterfactual probability

$h_k(x_{01},\ldots,x_{0k}) = \tilde{P}_k(A) = E_{P_0}\big[g(x_{01},\ldots,x_{0k}, X_{k+1},\ldots,X_{n-1})\big]$

of the potential outcome Xn=1, under the scenario that the first k covariates were set to x01,…,x0k. In particular, it is of interest to know how much the unknown causal risk ratio effect h1(1)/h1(0) of the exposure maximally differs from the known risk ratio g1(1)/g1(0) [49,50,51,52]. Note in particular that the corresponding logged quantities

$\log[g_1(1)/g_1(0)] = I_1^{+}(A;1) - I_1^{+}(A;0), \qquad \log[h_1(1)/h_1(0)] = \tilde{I}_1^{+}(A;1) - \tilde{I}_1^{+}(A;0),$

can be expressed in terms of the active information

$I_1^{+}(A;x_{01}) = \log[P_1(A)/P_0(A)] = \log\big[P_0(X_n=1 \mid x_{01})/P_0(X_n=1)\big],$
$\tilde{I}_1^{+}(A;x_{01}) = \log[\tilde{P}_1(A)/P_0(A)] = \log\big[E_{P_0}\big(P_0(X_n=1 \mid x_{01}, X_2,\ldots,X_{n-1})\big)/P_0(X_n=1)\big].$
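The sketch below (hypothetical regression coefficients and covariate distribution with n=4; not the paper's own computation) contrasts the known risk ratio g1(1)/g1(0) with the causal risk ratio h1(1)/h1(0), obtained by standardization over the confounders, and expresses the logged known risk ratio as a difference of active informations as above.

```python
# Minimal sketch (hypothetical parameters) of Example 5 with n = 4: exposure X1,
# confounders X2, X3, outcome X4. The outcome follows the logistic model (78);
# g1 conditions on the exposure, h1 standardizes over the confounder distribution.
from itertools import product
import math

sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
beta = [-1.0, 1.2, 0.8, 0.5]                          # beta_0, beta_1 (exposure), beta_2, beta_3
g = lambda x1, x2, x3: sigmoid(beta[0] + beta[1]*x1 + beta[2]*x2 + beta[3]*x3)

def p_conf(x2, x3):                                   # marginal distribution of the confounders
    return (0.5 if x2 else 0.5) * (0.3 if x3 else 0.7)

def p_exp(x1, x2, x3):                                # exposure depends on the confounders
    p1 = sigmoid(-0.5 + 1.0*x2 + 0.8*x3)
    return p1 if x1 else 1 - p1

def g1(x01):                                          # P_0(X_4 = 1 | X_1 = x01), Equation (79)
    num = den = 0.0
    for x2, x3 in product([0, 1], repeat=2):
        w = p_conf(x2, x3) * p_exp(x01, x2, x3)
        num, den = num + w * g(x01, x2, x3), den + w
    return num / den

def h1(x01):                                          # counterfactual probability (standardization)
    return sum(p_conf(x2, x3) * g(x01, x2, x3) for x2, x3 in product([0, 1], repeat=2))

p0_A = sum(p_conf(x2, x3) * p_exp(x1, x2, x3) * g(x1, x2, x3)
           for x1, x2, x3 in product([0, 1], repeat=3))   # prevalence P_0(A)

print(f"known risk ratio   g_1(1)/g_1(0) = {g1(1)/g1(0):.3f}")
print(f"causal risk ratio  h_1(1)/h_1(0) = {h1(1)/h1(0):.3f}")
print(f"log g-ratio as AIN difference:     {math.log(g1(1)/p0_A) - math.log(g1(0)/p0_A):+.3f}")
```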

9. Discussion

In this paper, we studied an agent’s learning and knowledge acquisition within a mathematical framework of possible worlds. Learning is interpreted as an increased degree of true belief, whereas knowledge acquisition additionally requires that the belief is justified, corresponding to an increased belief in the correct world. The theory is put into a framework that involves elements of frequentism and Bayesianism, with possible worlds corresponding to the parameters of a statistical model, where only one parameter value is true, whereas the agent’s beliefs are obtained from a posterior distribution. We formulated learning as a hypothesis test within this framework, whereas knowledge acquisition corresponds to consistency of posterior distributions. Importantly, we argue that a hybrid frequentist/Bayesian approach is needed in order to model mathematically the way in which philosophers distinguish learning from knowledge acquisition.

Some applications of our theory were provided in the examples of Section 8. Apart from those, we argue that our framework has quite general implications for machine learning, in particular, supervised learning. A typical task of machine learning is to obtain a predictor of a binary outcome variable Y=f(x0), when only incomplete information X about x0 is obtained from training data. The performance of a machine learning algorithm is typically assessed in terms of prediction accuracy, that is, how well f(X) approximates Y, with less focus on the closeness between X and x0. In our terminology, the purpose of machine learning is learning rather than knowledge acquisition. This can often be a disadvantage, since knowledge acquisition often provides deeper insights than learning. For instance, full knowledge acquisition may fail asymptotically as k→∞, even when data are unbiased and interpreted correctly by the agent, if there is a lack of discernment between the possible worlds in X, even in the limit k→∞.

On the other hand, it makes no sense to go beyond learning in game theory, where the purpose is to find the optimal strategy (an instance of knowledge how). In more detail, let x ∈ X={0,…,M−1} refer to the strategy x of a player among a finite set of M possible strategies. The optimal strategy x0 is the one that maximizes an expected reward function R(x) for the actions taken with strategy x, D refers to data from previous games that a player makes use of to estimate R(·), and G represents the player's maximal possible discernment between strategies. Since the objective is to find the optimal strategy, it is natural to use a truth function

f(x)=1(x=x0), (80)

with the associated set A=A0={x0} of true worlds corresponding to the upper row of (1). It follows from Remark 7 that learning and knowledge acquisition are equivalent for game theory whenever (80) is used. Various algorithms, such as reinforcement learning [53] and sequential sampling models [54,55], could be used by a player in order to generate his beliefs P about which strategy is the best.

Many extensions of our work are possible. A first extension would be to generalize the framework of Theorem 1 and Example 1, where data {Dk=(Z1,…,Zk)}, k=1,…,n, are collected sequentially according to a Markov process with increasing state space, without requiring that Z1,…,Zn are independent and identically distributed. We will mention two related models for which this framework applies. For both of these models a student's mastery of q skills (which represent knowledge how rather than knowledge that) is of interest. More specifically, x=(x1,…,xq) is a binary sequence of length q, with xi=1 or 0 depending on whether the student has acquired skill i or not, whereas Dk corresponds to the exercises that are given to a student up to time k, and the student's answers to these exercises. It is also known which skills are required to solve each type of exercise. The first model is Bayesian knowledge tracing (BKT) [56], which has recently been analyzed using recurrent neural networks [1]. In BKT, a tutor trains the student to learn the q skills, so that the student's learning profile changes over time. At each time point, the tutor is free to choose the exercises at time k based on previous exercises and what the student learnt up to time k−1. The goal of the tutoring is to reach a state x0=(1,…,1) where the student has learned all skills. The most restrictive truth function (80) monitors whether the student has learned all skills or not, so that Pk(A) is the probability that the student has learnt all skills at time k. In view of Remark 7, there is no distinction between learning and knowledge acquisition for such a truth function. A less restrictive truth function f(x)=xi focuses on whether the student has learnt skill i or not, so that Pk(A) is the probability that the student has learnt skill i at time k. The second model, the Bayesian version of Diagnostic Classification Models (DCMs) [57], can be viewed as an extension of Illustration 1. The purpose of DCMs is not to train the student (as in knowledge tracing), but rather to diagnose the student's (or respondent's) current vector x0=(x01,…,x0q), where x0i=1 or 0 according to whether this particular student masters skill (or attribute) i or not. The exercises of DCMs are usually referred to as items. Under the truth function (80), Pk(A) is the probability that the diagnostic test by time k has learnt which attributes the student masters. Note in particular that the student's attribute mastery profile x0 is fixed, and it is rather the instructor who learns about x0 when the student is tested on new items.
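For orientation, the following is a minimal sketch of the standard single-skill BKT posterior update (illustrative guess, slip and transition parameters; the sketch is not part of the framework developed here).

```python
# Minimal sketch of the standard single-skill Bayesian knowledge tracing update:
# posterior probability of mastery given the latest answer, followed by the
# learning transition. Parameter values are illustrative.
def bkt_update(p_know, correct, p_guess=0.2, p_slip=0.1, p_transit=0.15):
    """One BKT step: condition on the observed answer, then apply the transition."""
    if correct:
        obs = p_know * (1 - p_slip)
        post = obs / (obs + (1 - p_know) * p_guess)
    else:
        obs = p_know * p_slip
        post = obs / (obs + (1 - p_know) * (1 - p_guess))
    return post + (1 - post) * p_transit          # chance of learning after the exercise

p = 0.1                                           # prior probability that the skill is mastered
for k, answer in enumerate([1, 0, 1, 1, 1], start=1):
    p = bkt_update(p, answer)
    print(f"after exercise {k}: P(skill mastered) = {p:.3f}")
```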

A second extension would be to consider opinion making and consensus formation [58] for a whole group of N agents that are connected according to some social network. In this context, Gk represents the maximal amount of discernibility between possible worlds that can be achieved after k time steps, based on external data (available to all agents) and information from other agents (which varies between agents and depends on properties of the social network). It is of interest in this context to study the dynamics of {Pki(A)}, i=1,…,N, over time, where Pki(A) represents the belief of agent (or individual) i in proposition S after k time steps. This can be accomplished using a dynamical Bayesian network [59] with N nodes i=1,…,N that represent individuals, associating each node i with a distribution Pki over the set of possible worlds X, corresponding to the beliefs of agent i at time k. A particularly interesting example in this vein would be to explore the degree to which social media and social networks can influence learning and knowledge acquisition.

The third possible extension is related to consensus formation, but with a more explicit focus on how N decentralized agents make a collective decision. In order to illustrate this, we first describe a related model of cognition in bacterial populations. Marshall [60] has concluded that “the direction of causation in biology is cognition → code → chemicals”. Cognition is observed when there is a discernment and data collection process that either optimizes code or improves the probability of a given chemical outcome. Accordingly, the strong learning process of Definition 3 can be used to model how biological cognition is attained (or at least is expected to be attained). For instance, in quorum sensing, bacteria emit a chemical signal to ascertain the number of neighboring bacteria [61]; once a critical density is reached, the population performs certain functions as a unit (Table 1 of [62] presents several examples of bacterial functions partially controlled by quorum sensing). The proposition under consideration here is S: “the function is performed by at least a fraction ε of bacteria”, where ε represents a critical density above which the bacteria act as a unit. The parameter x=(x1,…,xN) is a binary sequence reflecting the way in which a population of N=q bacteria acts, so that xi=1 if bacterium i performs the function, whereas f(x)=1(x∈A)=1(∑i xi ≥ εN). For collective decisions, xi rather represents a local decision of agent i, whereas f(x) corresponds to the global decision of all agents. Learning about S at time k=1,2,… is described by Pk(A), the probability that the population acts as a unit at time k. There is a phase transition at time k if the probabilities P1(A),…,Pk−1(A) of the population acting as a unit are essentially null, whereas Pk(A) becomes positive (and hence Ik+(A) gets large) when discernment ability and data are extended from (Gk−1,Dk−1) to (Gk,Dk). This is closely related to the fine-tuning of biological systems [33,35], with f being a specificity function and A a set of highly specified states, and fine-tuning after k steps of an algorithm that models the evolution of the system corresponding to Ik+(A) being large. As for the direction of causation from cognition to code, Kolmogorov's complexity, which measures the complexity of an outcome as the shortest code that produces it, can be used in place of or jointly with active information to measure learning [63].
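As a toy illustration of such a phase transition, the sketch below assumes (purely for simplicity, and not as part of the model above) that each of the N bacteria performs the function independently with a common probability pk at time k, and computes Pk(A)=P(∑i Xi ≥ εN).

```python
# Toy sketch of the phase-transition behaviour, under a simplifying independence
# assumption: each of the N bacteria performs the function with probability p_k,
# so P_k(A) = P(Binomial(N, p_k) >= eps * N). The p_k values are illustrative.
from scipy.stats import binom

N, eps = 100, 0.5                                 # population size, critical fraction
for k, p_k in enumerate([0.05, 0.2, 0.4, 0.5, 0.6], start=1):
    P_k_A = binom.sf(int(eps * N) - 1, N, p_k)    # P(sum of X_i >= eps * N)
    print(f"k={k}: p_k={p_k:.2f}  P_k(A)={P_k_A:.4f}")
```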

A fourth theoretical extension is to consider the case |X|=∞. In this case, instead of the (discrete or continuous) uniform distribution given by (2), it will be necessary to consider more general maximum entropy distributions P0, subject to some restrictions, in order to measure learning and knowledge acquisition [20,64,65,66].

A fifth extension is to consider models where the data sets Dk are not nested. This is of interest, for instance, in Example 5, when non-nested subsets of confounders are used to predict Adam’s disease status. For such scenarios, it might be preferable to use information-based model selection criteria (such as maximizing AIN) in order to quantify learning [67], rather than sequentially testing various pairs of nested hypotheses by means of

$\Delta I_k^{+} = I_k^{+} - I_{k-1}^{+} = \log\frac{P_k(A)}{P_{k-1}(A)},$

in order to assess whether learning has occurred in each step k (corresponding to strong learning of Definition 3).

A sixth extension would be to compare the proposed Bayesian/frequentist notions of learning and knowledge acquisition, with purely frequentist counterparts. Since learning corresponds to choosing between the two hypotheses in (18), we may consider a test that rejects the null hypothesis when the log likelihood ratio is small enough, or equivalently, when

$\Lambda = -2\log\frac{\max_{x \in A} L(D \mid x)}{\max_{x \in \mathcal{X}} L(D \mid x)} \ge t$ (81)

for some appropriately chosen threshold t. The frequentist notion of learning is then formulated in terms of error probabilities of type I and II, analogously to (20), but for the LR-test (81) rather than the Bayesian/frequentist test (19) with test statistic AIN, or the purely Bayesian approach that relies on posterior odds (21). A frequentist version of knowledge acquisition corresponds to using data D in order to produce a one-dimensional class of confidence regions CR for x0, whose nominal coverage probability varies. In order to quantify how much knowledge is acquired, it is possible to use the steepness of a curve that plots the actual coverage probability P(x0 ∈ CR) as a function of the volume |CR|. However, a disadvantage of the frequentist versions of learning and knowledge acquisition is that they do not involve degrees of belief, the philosophical starting point of this article. This is related to the critique of frequentist hypothesis testing offered in [68]. Since no prior probabilities are allowed within a frequentist setting, important notions such as the false report probability (FRP) and true report probability (TRP) are not computable, leading to many non-replicated findings.
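A minimal sketch of the statistic (81) in the coin-tossing setting of Example 1 (binomial likelihood, A=[0.5−ε, 0.5+ε], illustrative data) is as follows.

```python
# Minimal sketch of the likelihood-ratio statistic (81) for the coin-tossing
# setting of Example 1: A = [0.5 - eps, 0.5 + eps], binomial likelihood.
import numpy as np
from scipy.stats import binom

eps, k, m = 0.05, 200, 116                       # tolerance, number of tosses, number of heads

def log_lik(x):
    return binom.logpmf(m, k, x)                 # log L(D | x) for heads probability x

x_hat = m / k                                    # unrestricted maximum likelihood estimate
x_hat_A = np.clip(x_hat, 0.5 - eps, 0.5 + eps)   # MLE restricted to A
Lambda = -2 * (log_lik(x_hat_A) - log_lik(x_hat))
print(f"Lambda = {Lambda:.3f}")                  # compare with a threshold t to decide
```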

A seventh extension is to consider multiple propositions S1,…,Sm, as in [69,70]. For each possible world x ∈ X, we let f: X → {0,1}^m be a truth function such that f(x)=(f1(x),…,fm(x)), with fi(x)=1 (0) if Si is true (false) in world x. It is then of interest to develop a theory of learning and knowledge acquisition of these m propositions. To this end, for each y=(y1,…,ym) ∈ {0,1}^m, let Ay={x ∈ X; f(x)=y} refer to the set of worlds for which the truth value of Si is yi for i=1,…,m. Learning is then a matter of determining which Ay is true (the one for which x0 ∈ Ay), whereas justified true beliefs in S1,…,Sm amount to finding x0 as well. Learning of compound statements such as Si ∨ Sj and Si ∧ Sj can be addressed using the m=1 theory of this paper, since they correspond to binary-valued truth functions f̃(x)=fi(x)+fj(x)−fi(x)fj(x) and f̃(x)=fi(x)fj(x), respectively.

Acknowledgments

The authors wish to thank Glauco Amigo at Baylor University for his help with producing Figure 3. We also appreciate the comments of three anonymous reviewers that made it possible to considerably improve the quality of the paper.

Author Contributions

Conceptualization and methodology: O.H., D.A.D.-P., J.S.R.; writing—original draft preparation: O.H.; writing—review and editing: D.A.D.-P., J.S.R. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Funding Statement

This research received no external funding.

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Piech C., Bassen J., Huang J., Ganguli S., Sahami M., Guibas L.J., Sohl-Dickstein J. Deep Knowledge Tracing; Proceedings of the Neural Information Processing Systems (NIPS) 2015; Montreal, QC, Canada. 7–12 December 2015; pp. 505–513. [Google Scholar]
  • 2.Pavese C. Knowledge How. In: Zalta E.N., editor. The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University; Stanford, CA, USA: 2021. [Google Scholar]
  • 3.Agliari E., Pachón A., Rodríguez P.M., Tavani F. Phase transition for the Maki-Thompson rumour model on a small-world network. J. Stat. Phys. 2017;169:846–875. doi: 10.1007/s10955-017-1892-x. [DOI] [Google Scholar]
  • 4.Lyons R., Peres Y. Probability on Trees and Networks. Cambridge University Press; Cambridge, UK: 2016. [Google Scholar]
  • 5.Watts D.J., Strogatz S.H. Collective dynamics of ‘small-world’ networks. Nature. 1998;393:440–442. doi: 10.1038/30918. [DOI] [PubMed] [Google Scholar]
  • 6.Embreston S.E., Reise S.P. Item Response Theory for Psychologists. Psychology Press; New York, NY, USA: 2000. [Google Scholar]
  • 7.Stevens S.S. On the Theory of Scales of Measurement. Science. 1946;103:677–680. doi: 10.1126/science.103.2684.677. [DOI] [PubMed] [Google Scholar]
  • 8.Thompson B. Exploratory and Confirmatory Factor Analysis: Understanding Concepts and Applications. American Psychological Association; Washington, DC, USA: 2004. [Google Scholar]
  • 9.Gettier E.L. Is Justified True Belief Knowledge? Analysis. 1963;23:121–123. doi: 10.1093/analys/23.6.121. [DOI] [Google Scholar]
  • 10.Ichikawa J.J., Steup M. The Analysis of Knowledge. In: Zalta E.N., editor. The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University; Stanford, CA, USA: 2018. [Google Scholar]
  • 11.Hájek A. Probability, Logic, and Probability Logic. In: Goble L., editor. The Blackwell Guide to Philosophical Logic. Blackwell; Hoboken, NJ, USA: 2001. pp. 362–384. Chapter 16. [Google Scholar]
  • 12.Demey L., Kooi B., Sack J. Logic and Probability. In: Zalta E.N., editor. The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University; Stanford, CA, USA: 2019. [Google Scholar]
  • 13.Hájek A. Interpretations of Probability. In: Zalta E.N., editor. The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University; Stanford, CA, USA: 2019. [Google Scholar]
  • 14.Savage L. The Foundations of Statistics. Wiley; Hoboken, NJ, USA: 1954. [Google Scholar]
  • 15.Swinburne R. Epistemic Justification. Oxford University Press; Oxford, UK: 2001. [Google Scholar]
  • 16.Pearl J. Causality: Models, Reasoning and Inference. 2nd ed. Cambridge University Press; Cambridge, UK: 2009. [Google Scholar]
  • 17.Berger J. Statistical Decision Theory and Bayesian Analysis. 2nd ed. Springer; New York, NY, USA: 2010. [Google Scholar]
  • 18.Dembski W.A., Marks R.J., II Bernoulli’s Principle of Insufficient Reason and Conservation of Information in Computer Search; Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics; San Antonio, TX, USA. 11–14 October 2009; pp. 2647–2652. [DOI] [Google Scholar]
  • 19.Dembski W.A., Marks R.J., II Conservation of Information in Search: Measuring the Cost of Success. IEEE Trans. Syst. Man Cybern.-Part A Syst. Hum. 2009;5:1051–1061. doi: 10.1109/TSMCA.2009.2025027. [DOI] [Google Scholar]
  • 20.Díaz-Pachón D.A., Marks R.J., II Generalized active information: Extensions to unbounded domains. BIO-Complexity. 2020;2020:1–6. doi: 10.5048/BIO-C.2020.3. [DOI] [Google Scholar]
  • 21.Shafer G. Belief functions and parametric models. J. R. Stat. Soc. Ser. B. 1982;44:322–352. doi: 10.1111/j.2517-6161.1982.tb01211.x. [DOI] [Google Scholar]
  • 22.Wasserman L. Prior envelopes based on belief functions. Ann. Stat. 1990;18:454–464. doi: 10.1214/aos/1176347511. [DOI] [Google Scholar]
  • 23.Dubois D., Prade H. Belief functions and parametric models. Int. J. Approx. Reason. 1992;6:295–319. doi: 10.1016/0888-613X(92)90027-W. [DOI] [Google Scholar]
  • 24.Denoeux T. Decision-making with belief functions: A review. Int. J. Approx. Reason. 2019;109:87–110. [Google Scholar]
  • 25.Hopkins E. Two competing models of how people learn in games. Econometrica. 2002;70:2141–2166. doi: 10.1111/1468-0262.00372. [DOI] [Google Scholar]
  • 26.Stoica G., Strack B. Acquired knowledge as a stochastic process. Surv. Math. Appl. 2017;12:65–70. [Google Scholar]
  • 27.Taylor C.M. Ph.D. Thesis. University of Virginia; Charlottesville, VA, USA: 2002. A Mathematical Model for Knowledge Acquisition. [Google Scholar]
  • 28.Popper K. The Logic of Scientific Discovery. Hutchinson; London, UK: 1968. [Google Scholar]
  • 29.Jaynes E.T. Prior Probabilities. IEEE Trans. Syst. Sci. Cybern. 1968;4:227–241. doi: 10.1109/TSSC.1968.300117. [DOI] [Google Scholar]
  • 30.Hössjer O. Modeling decision in a temporal context: Analysis of a famous example suggested by Blaise Pascal. In: Hasle P., Jakobsen D., Øhrstrøm P., editors. The Metaphysics of Time, Themes from Prior. Logic and Philosophy of Time. Volume 4. Aalborg University Press; Aalborg, Denmark: 2020. pp. 427–453. [Google Scholar]
  • 31.Kowner R. Nicholas II and the Japanese body: Images and decision-making on the eve of the Russo-Japanese War. Psychohist. Rev. 1998;26:211–252. [Google Scholar]
  • 32.Hössjer O., Díaz-Pachón D.A., Chen Z., Rao J.S. Active information, missing data, and prevalence estimation. arXiv. 2022. doi: 10.48550/arXiv.2206.05120. [DOI] [Google Scholar]
  • 33.Díaz-Pachón D.A., Hössjer O. Assessing, testing and estimating the amount of fine-tuning by means of active information. Entropy. 2022;24:1323. doi: 10.3390/e24101323. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Szostak J.W. Functional information: Molecular messages. Nature. 2003;423:689. doi: 10.1038/423689a. [DOI] [PubMed] [Google Scholar]
  • 35.Thorvaldsen S., Hössjer O. Using statistical methods to model the fine-tuning of molecular machines and systems. J. Theor. Biol. 2020;501:110352. doi: 10.1016/j.jtbi.2020.110352. [DOI] [PubMed] [Google Scholar]
  • 36.Díaz-Pachón D.A., Sáenz J.P., Rao J.S. Hypothesis testing with active information. Stat. Probab. Lett. 2020;161:108742. doi: 10.1016/j.spl.2020.108742. [DOI] [Google Scholar]
  • 37.Montañez G.D. A Unified Model of Complex Specified Information. BIO-Complexity. 2018;2018:1–26. doi: 10.5048/BIO-C.2018.4. [DOI] [Google Scholar]
  • 38.Yik W., Serafini L., Lindsey T., Montañez G.D. Identifying Bias in Data Using Two-Distribution Hypothesis Tests; Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society; Oxford, UK. 19–21 May 2021; New York, NY, USA: ACM; 2022. pp. 831–844. [DOI] [Google Scholar]
  • 39.Kallenberg O. Foundations of Modern Probability. 3rd ed. Volume 1 Springer; New York, NY, USA: 2021. [Google Scholar]
  • 40.Ghosal S., van der Vaart A. Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press; Cambridge, UK: 2017. [Google Scholar]
  • 41.Shen W., Tokdar S.T., Ghosal S. Adaptive Bayesian multivariate density estimation with Dirichlet mixtures. Biometrika. 2013;100:623–640. doi: 10.1093/biomet/ast015. [DOI] [Google Scholar]
  • 42.Barron A.R. Uniformly Powerful Goodness of Fit Tests. Ann. Stat. 1989;17:107–124. doi: 10.1214/aos/1176347005. [DOI] [Google Scholar]
  • 43.Freedman D.A. On the Asymptotic Behavior of Bayes’ Estimates in the Discrete Case. Ann. Math. Stat. 1963;34:1386–1403. doi: 10.1214/aoms/1177703871. [DOI] [Google Scholar]
  • 44.Cam L.L. Convergence of Estimates Under Dimensionality Restrictions. Ann. Stat. 1973;1:38–53. doi: 10.1214/aos/1193342380. [DOI] [Google Scholar]
  • 45.Schwartz L. On Bayes procedures. Z. Wahrscheinlichkeitstheorie Verw Geb. 1965;4:10–26. doi: 10.1007/BF00535479. [DOI] [Google Scholar]
  • 46.Cam L.L. Asymptotic Methods in Statistical Decision Theory. Springer; New York, NY, USA: 1986. [Google Scholar]
  • 47.Lehmann E.L., Casella G. Theory of Point Estimation. 2nd ed. Springer; New York, NY, USA: 1998. [Google Scholar]
  • 48.Agresti A. Categorical Data Analysis. 3rd ed. Wiley; Hoboken, NJ, USA: 2013. [Google Scholar]
  • 49.Robins J.M. The analysis of Randomized and Nonrandomized AIDS Treatment Trials Using A New Approach to Causal Inference in Longitudinal Studies. In: Sechrest L., Freeman H., Mulley A., editors. Health Service Research Methodology: A Focus on AIDS. U.S. Public Health Service, National Center for Health Services Research; Washington, DC, USA: 1989. pp. 113–159. [Google Scholar]
  • 50.Manski C.F. Nonparametric Bounds on Treatment Effects. Am. Econ. Rev. 1990;80:319–323. [Google Scholar]
  • 51.Ding P., VanderWeele T.J. Sensitivity Analysis Without Assumptions. Epidemilogy. 2016;27:368–377. doi: 10.1097/EDE.0000000000000457. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Sjölander A., Hössjer O. Novel bounds for causal effects based on sensitivity parameters on the risk difference scale. J. Causal Inference. 2021;9:190–210. doi: 10.1515/jci-2021-0024. [DOI] [Google Scholar]
  • 53.Sutton R.S., Barto A.G. Reinforcement Learning: An Introduction. MIT Press; Cambridge, MA, USA: 1998. [Google Scholar]
  • 54.Ratcliff R., Smith P.L. A Comparison of Sequential Sampling Models for Two-Choice Reaction Time. Psychol. Rev. 2004;111:333–367. doi: 10.1037/0033-295X.111.2.333. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Chen W.J., Krajbich I. Computational modeling of epiphany learning. Proc. Natl. Acad. Sci. USA. 2017;114:4637–4642. doi: 10.1073/pnas.1618161114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Corbett A.T., Anderson J.R. Knowledge Tracing: Modeling the Acquisition of Procedural Knowledge. User Model. User-Adapt. Interact. 1995;4:253–278. doi: 10.1007/BF01099821. [DOI] [Google Scholar]
  • 57.Oka M., Okada K. Assessing the Performance of Diagnostic Classification Models in Small Sample Contexts with Different Estimation Methods. arXiv. 2022. arXiv:2104.10975. [Google Scholar]
  • 58.Hirscher T. Ph.D. Thesis. Division of Mathematics, Department of Mathematical Sciences, Chalmers University of Technology; Gothenburg, Sweden: 2014. Consensus Formation in the Deffuant Model. [Google Scholar]
  • 59.Murphy K.P. Ph.D. Thesis. University of California; Berkeley, CA, USA: 2002. Dynamic Bayesian Networks: Representation, Inference and Learning. [Google Scholar]
  • 60.Marshall P. Biology transcends the limits of computation. Prog. Biophys. Mol. Biol. 2021;165:88–101. doi: 10.1016/j.pbiomolbio.2021.04.006. [DOI] [PubMed] [Google Scholar]
  • 61.Atkinson S., Williams P. Quorum sensing and social networking in the microbial world. J. R. Soc. Interface. 2009;6:959–978. doi: 10.1098/rsif.2009.0203. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Shapiro J.A. All living cells are cognitive. Biochem. Biophys. Res. Commun. 2020;564:134–149. doi: 10.1016/j.bbrc.2020.08.120. [DOI] [PubMed] [Google Scholar]
  • 63.Ewert W., Dembski W., Marks R.J., II Algorithmic Specified Complexity in the Game of Life. IEEE Trans. Syst. Man Cybern. Syst. 2015;45:584–594. doi: 10.1109/TSMC.2014.2331917. [DOI] [Google Scholar]
  • 64.Díaz-Pachón D.A., Hössjer O., Marks R.J., II Is Cosmological Tuning Fine or Coarse? J. Cosmol. Astropart. Phys. 2021;2021:020. doi: 10.1088/1475-7516/2021/07/020. [DOI] [Google Scholar]
  • 65.Díaz-Pachón D.A., Hössjer O., Marks R.J., II Sometimes size does not matter. arXiv. 2022. arXiv:2204.11780. [Google Scholar]
  • 66.Zhao X., Plata G., Dixit P.D. SiGMoiD: A super-statistical generative model for binary data. PLoS Comput. Biol. 2021;17:e1009275. doi: 10.1371/journal.pcbi.1009275. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Stephens P.A., Buskirk S.W., Hayward G.D., del Río C.M. Information theory and hypothesis testing: A call for pluralism. J. Appl. Ecol. 2005;42:4–12. doi: 10.1111/j.1365-2664.2005.01002.x. [DOI] [Google Scholar]
  • 68.Szucs D., Ioannidis J.P.A. When Null Hypothesis Significance Testing Is Unsuitable for Research: A Reassessment. Front. Hum. Neurosci. 2017;11:390. doi: 10.3389/fnhum.2017.00390. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Cox R.T. The Algebra of Probable Inference. Johns Hopkins University Press; Baltimore, MD, USA: 1961. [Google Scholar]
  • 70.Jaynes E.T. Probability Theory: The Logic of Science. Cambridge University Press; Cambridge, UK: 2003. [DOI] [Google Scholar]
