Abstract
Philosophers frequently define knowledge as justified, true belief. We build a mathematical framework that makes it possible to define learning (an increasing number of true beliefs) and knowledge of an agent in precise ways, by phrasing belief in terms of epistemic probabilities, defined from Bayes’ rule. The degree of true belief is quantified by means of active information $I^+$: a comparison between the degree of belief of the agent and that of a completely ignorant person. Learning has occurred when either the agent’s strength of belief in a true proposition has increased in comparison with the ignorant person ($I^+>0$), or the strength of belief in a false proposition has decreased ($I^+<0$). Knowledge additionally requires that learning occurs for the right reason, and in this context we introduce a framework of parallel worlds that correspond to parameters of a statistical model. This makes it possible to interpret learning as a hypothesis test for such a model, whereas knowledge acquisition additionally requires estimation of a true world parameter. Our framework of learning and knowledge acquisition is a hybrid between frequentism and Bayesianism. It can be generalized to a sequential setting, where information and data are updated over time. The theory is illustrated using examples of coin tossing, historical and future events, replication of studies, and causal inference. It can also be used to pinpoint shortcomings of machine learning, where typically learning rather than knowledge acquisition is in focus.
Keywords: active information; Bayes’ rule; counterfactuals; epistemic probability; learning; justified true belief; knowledge acquisition; replication studies
1. Introduction
1.1. The Present Article
The process by which cognitive agents acquire knowledge is complicated, and has been studied from different perspectives within educational science, psychology, neuroscience, cognitive science, and social science [1]. Philosophers usually distinguish between three types of knowledge [2]: acquaintance knowledge (to get to know other persons), knowledge how (to learn certain skills), and knowledge that (to learn about propositions or facts). Mathematically, acquaintance knowledge has been studied via trees and networks, for instance, in small-world-type models and rumor-spreading models [3,4,5]. Knowledge how has been widely developed in education and psychology, since the middle of the twentieth century, by means of testing and psychometry, using classical statistics [6,7,8].
The purpose of this paper is to formulate knowledge that in mathematical terms. Our starting point is to define knowledge that as justified true belief (JTB), which generally is agreed to constitute at least a necessary condition for such knowledge [9,10]. The primary tools will be the concepts of truth, probabilities, and information theory. Probabilities, in addition to logic, are used to formulate mechanisms of reasoning in order to define beliefs [11,12]. More specifically, a Bayesian approach with subjective probabilities will be used to quantify rational agents’ degrees of belief in a proposition. These subjective probabilities may vary between agents, but since each agent is assumed to be rational, its probabilities satisfy basic axioms of probability [13]. This is also referred to as the personalistic view of probabilities in [14].
The degree of belief in a proposition is associated with some type of randomness or uncertainty regarding the truth of the proposition. It is helpful in this context to distinguish between ontological randomness (genuine randomness regarding the truth of the proposition) and epistemic randomness (incomplete knowledge about propositions that are either true or false). Here the focus will be on epistemic randomness, and following [15], subjective probabilities are referred to as epistemic probabilities. The epistemic randomness assumption that each proposition has a fixed truth value can be viewed as a frequentist component of our framework.
To use epistemic probabilities in a wider context of knowledge that (subsequently simply referred to as knowledge), we incorporate degrees of belief within a framework of parallel worlds in order to define more clearly what JTB means. These parallel worlds correspond to parameters of a statistical model and a second frequentist notion of one parameter being true, whereas the others are counterfactuals [16]. An agent’s maximal possible discernment between worlds is described in terms of a σ-algebra $\mathcal{G}$ of events. The agent’s degrees of belief are obtained through Bayes’ rule from prior belief and data [17], in such a way that it is not possible to discern between worlds beyond the limits set by $\mathcal{G}$.
Learning is associated with increased degrees of true belief, although these beliefs need not necessarily be justified. More specifically, the agent’s degree of belief in a proposition is compared to that of an ignorant person. This corresponds to a hypothesis test within a frequentist framework, where the null hypothesis of the proposition being true is tested against an alternative hypothesis that the proposition is false. As a test statistic, we use active information [18,19,20], which quantifies how much the agent has learned about the truth value of the proposition compared to an ignorant person. In particular, learning has occurred when the agent’s degree of belief in a true proposition is larger than that of an ignorant person ($I^+>0$), or if the agent’s degree of belief in a false proposition is less than that of an ignorant person ($I^+<0$). In either case, $\mathcal{G}$ sets a limit in terms of the maximal amount of possible learning. Learning is, however, not sufficient for knowledge acquisition, since the latter concept also requires that the true belief is justified, or has been formed for the right reason. Knowledge acquisition is defined as a learning process where the agent’s degree of belief in the true world is increased, corresponding to a more accurate estimate of the true world parameter. Thus, knowledge acquisition goes beyond learning in that it also deals with the “justified” part of the JTB condition. It is related to consistency of a posterior distribution, a notion that is meaningful only within our hybrid frequentist/Bayesian approach.
To the best of our knowledge, the hybrid frequentist/Bayesian approach has only been used in the context of Bayesian asymptotic theory (Section 7.2), but not as a general tool for modeling the distinction between learning and knowledge acquisition. Although the concept of a true world (or the true state of affairs) is used in the context of Bayesian decision theory and its extensions, such as robust Bayesian inference and belief functions based on the Dempster–Shafer theory [21,22,23,24], the goal is then to maximize an expected utility (or to minimize an expected cost) of the agent that makes the decision. In our context, the Bayesian approach is only used to formulate beliefs as posterior distributions, whereas the criteria for learning (probabilities of rejecting a false or true proposition) and knowledge acquisition (consistency) are frequentist. Given that a model with one unique, true world is correct, the frequentist error probability and consistency criteria are objective, since they depend on the true world. No such criteria exist within a purely Bayesian framework.
Illustration 1.
In order to illustrate our approach for modeling learning and knowledge acquisition, we present an example that will be revisited several times later on. A teacher (the agent) wants to evaluate whether a child has learned addition. The teacher gives the student a home assignment test with two-choice answers, one right and one wrong, to measure the proposition S: “The child is expected to score well on the test.” In this case, we have a set $\mathcal{X}=\{x_1,x_2,x_3\}$ of three possible worlds. An ignorant person who does not ask for help is expected to have half her questions right and half her questions wrong ($x_1$). A child who knows addition is expected to get a large fraction of the answers right ($x_2$). However, there is also a third alternative, where an ignorant student asks for help and is expected to have a high score for that reason ($x_3$). Notice in particular that S is true only for the two worlds of the set $A=\{x_2,x_3\}$. If the child answers substantially more questions right than wrong, the active information will be positive and the teacher learns S. However, this learning that S is true does not represent knowledge of whether the student knows how to add, since the teacher is not able to distinguish $x_2$ from $x_3$. Now, let us say that the test has only two questions. In this setting, it is expected that an ignorant person has one question right and one wrong. However, it is also highly probable that even if the child does not know his sums well, he can answer the two questions in the right way. In this case, the teacher has not learned substantially about S (nor attained knowledge of whether the student knows how to add). The reason is that, since the test has only two questions, the teacher cannot exclude any of $x_1$, $x_2$, and $x_3$. The more questions the test has, and if the student scores well, the more certain the teacher is that either $x_2$ or $x_3$ is true, that is, the more he learns about S. If the student is also monitored during the exam, alternative $x_3$ is excluded and the teacher knows that $x_2$ is true; that is, the teacher not only learns about S, but also acquires knowledge that the student knows how to add.
Each of the following sections contains remarks and illustrations like the previous one. At the end of the paper, a whole section with multiple examples explores in more depth how the model works in practice.
1.2. Related Work
Other contributions have been made toward developing a mathematical framework for learning and knowledge acquisition. Hopkins [25] studied the theoretical properties of two different models of learning in games, namely, reinforcement learning and stochastic fictitious play. He developed an equivalence relation between the two under a variety of different scenarios with increasing degrees of structure. Stoica and Strack [26] introduced a stochastic model for acquired knowledge and showed that empirical data fit the estimated outcomes of the model well, using data from student performance in university-level classes. Taylor [27] proposed a model using the notion of concept lattices and the mathematical theory of closure spaces to describe knowledge acquisition and organization. However, none of these works has been developed through basic concepts in probability and information theory the way we do here. Our approach permits important generalizations that cover a wide range of real-life scenarios.
2. Possible Worlds, Propositions, and Discernment
Consider a collection $\mathcal{X}$ of possible worlds, of which $x_0\in\mathcal{X}$ is the true world, and all other worlds $x\in\mathcal{X}\setminus\{x_0\}$ are counterfactuals. We will regard x as a statistical parameter, and the fact that this parameter has a true but unknown value $x_0$ corresponds to a frequentist assumption. The set $\mathcal{X}$ is the parameter space of interest, and it is assumed to be either finite or a bounded and open subset of Euclidean space of dimension q. Let S be a proposition (or statement), and impose a second frequentist assumption that S is either true or false, although the truth value of S may depend on the world $x\in\mathcal{X}$. Define a binary-valued truth function $f:\mathcal{X}\to\{0,1\}$ by $f(x)=1$ or 0, depending on whether S is true or not in world x. The set $A=\{x\in\mathcal{X}:\ f(x)=1\}$ consists of all worlds for which S is a true proposition. Although there is a one-to-one correspondence between f and A, in the sequel it will be convenient to use both notions. The simplest truth scenario of S is one for which the truth value of S is unique for the true world, i.e.,
$A = \{x_0\}.$  (1)
$x_0$ being unique and f being binary-valued together correspond to a framework of epistemic randomness, where the actual truth value of S is either 0 or 1. S is referred to as falsifiable [28] if it is logically possible (in principle) to find a data set D implying that the truth value of S is 0, or equivalently, that none of the worlds in A is true. It is possible though to falsify S without knowing $x_0$.
3. Probabilities
3.1. Degrees of Beliefs and Sigma Algebras
Let $(\mathcal{X},\mathcal{F})$ be a measurable space. When $\mathcal{X}$ is finite, $\mathcal{F}$ consists of all subsets of $\mathcal{X}$ (i.e., $\mathcal{F}=2^{\mathcal{X}}$); otherwise, $\mathcal{F}$ is the class of Borel sets. The Bayesian part of our approach is to quantify an agent’s belief in which world is true by means of an epistemic probability measure P on the measurable space $(\mathcal{X},\mathcal{F})$, whereas the beliefs of an ignorant person follow another probability measure $P^0$. It is often assumed that
$P^0(B) = \dfrac{|B|}{|\mathcal{X}|}, \quad B\in\mathcal{F},$  (2)
is the uniform probability measure that maximizes entropy among all probability measures on $(\mathcal{X},\mathcal{F})$, where $|\cdot|$ refers to the cardinality for finite $\mathcal{X}$ and to the Lebesgue measure for continuous $\mathcal{X}$. Then, (2) corresponds to a maximal amount of ignorance about which possible world is true [29]. Sometimes (as in Example 5 below) some general background knowledge is assumed also for the ignorant person, so that $P^0$ differs from (2).
The agent’s and the ignorant person’s strength of belief in S are quantified by $P(A)$ and $P^0(A)$, respectively. Following [15], it is helpful to interpret $P(A)$ and $P^0(A)$ as the agent’s and the ignorant person’s predictions of the physical probability of S. Whereas P and $P^0$ involve epistemic uncertainty, the physical probability is an indicator for the real (physical) event that S is true or not.
When an agent’s belief P is formed, it is assumed that any information accessible to him, beyond that of the ignorant person, belongs to a sub-σ-algebra $\mathcal{G}\subset\mathcal{F}$. This means that the agent has no more knowledge of how to discern events in $\mathcal{F}$ than the ignorant person, if this discernment requires that he considers events in $\mathcal{F}$ that do not belong to $\mathcal{G}$. Mathematically, this corresponds to a requirement
$E_P\left(g\mid\mathcal{H}\right) = E_{P^0}\left(g\mid\mathcal{H}\right)$  (3)
for all $\mathcal{F}$-measurable functions $g:\mathcal{X}\to\mathbb{R}$, and all sigma algebras $\mathcal{H}$ such that $\mathcal{G}\subset\mathcal{H}\subset\mathcal{F}$. It is assumed, on the left-hand side of (3), that g is a random variable defined on the probability space $(\mathcal{X},\mathcal{F},P)$, whereas g is defined on the probability space $(\mathcal{X},\mathcal{F},P^0)$ on the right-hand side of (3). It follows from (3) that $\mathcal{G}$ sets the limit in terms of the agent’s possibility to form propositions about which world is true. Therefore, $\mathcal{G}$ is referred to as the agent’s maximal possible discernment about which world is true. It follows from (3) that
$P(F) = E_P\left[P^0\left(F\mid\mathcal{G}\right)\right], \quad F\in\mathcal{F}.$  (4)
The minimal amount of discernment corresponds to the trivial σ-algebra $\mathcal{G}=\{\emptyset,\mathcal{X}\}$. Whenever (3) holds with $\mathcal{G}=\{\emptyset,\mathcal{X}\}$, necessarily $P=P^0$. This corresponds to removing the outer expectation on the right-hand side of (4), so that
$P(F) = P^0(F), \quad F\in\mathcal{F}.$  (5)
Remark 1.
Suppose there exists an oracle or omniscient agent $\mathcal{O}$ that is able to discern between all possible worlds and also knows $x_0$. Mathematically, the discernment requirement means that $\mathcal{O}$ has knowledge about all sets in a σ-algebra $\mathcal{F}$ that corresponds to a maximal amount of discernment between possible worlds. We will assume that f is measurable with respect to $\mathcal{F}$, so that A is measurable (i.e., $A\in\mathcal{F}$). Knowledge of $\mathcal{F}$ is, however, not sufficient for knowing A, since A may involve $x_0$, as in (1). By this we mean that if the agent knows $\mathcal{F}$, and if A involves $x_0$, then there are several candidates of A for the agent, and he does not know a priori which one of these candidates is the actual A. However, since $\mathcal{O}$ knows $\mathcal{F}$ and $x_0$, he also knows A. It follows that $\mathcal{O}$ knows that S is true for all worlds in (the actual) A, and that S is false for all worlds outside of (the actual) A. That is, the oracle knows for which possible worlds the proposition S is true.
As mentioned in Remark 1, the truth function f is measurable with respect to the maximal σ-algebra $\mathcal{F}$. However, depending on how A is constructed, and whether A involves $x_0$ or not, the set A may or may not be known to the agent. Therefore, when A involves $x_0$, the agent may not be able to compute $P(A)$ and $P^0(A)$ himself. Although he is able to compute $P(B)$ and $P^0(B)$ for all $B\in\mathcal{F}$, since he does not know $x_0$, it follows that he does not know which of these sets B equals A. Therefore, he does not know $P(A)$ and $P^0(A)$, unless $P(B)$ and $P^0(B)$, respectively, coincide for all B that are among the agent’s candidates for the set A. For instance, suppose $\mathcal{X}=\{x_1,x_2,x_3\}$, with $P(\{x_1\})=P(\{x_2\})$ differing from $P(\{x_3\})$. If the agent’s candidates for A are $\{x_1\}$, $\{x_2\}$, and $\{x_3\}$, then the agent does not know $P(A)$. On the other hand, if the agent’s candidates for A are $\{x_1\}$ and $\{x_2\}$, then he knows $P(A)$, although he does not know A.
As will be seen from the examples of Section 8, it is often helpful (but not necessary) to construct $\mathcal{G}$ as the σ-algebra that is generated by a random variable Y whose domain is $\mathcal{X}$ (i.e., $\mathcal{G}=\sigma(Y)$). This means that Y determines the collection of subsets of $\mathcal{X}$ for which the agent is free to form beliefs beyond that of the ignorant person. Typically, Y highlights the way in which information is lost by going from $\mathcal{F}$ to $\mathcal{G}$. For instance, suppose $\mathcal{X}=[0,1)$ and Y is defined by $Y(x)=\lfloor x/\delta\rfloor$ for some $\delta>0$; then, $\mathcal{G}=\sigma(Y)$ is the sigma-algebra obtained from $\mathcal{F}$ by a quantization procedure with accuracy δ.
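This quantization construction can be sketched numerically. The snippet below is a minimal sketch (assuming, for illustration, that the worlds form the unit interval, that $P^0$ is uniform, and hypothetical atom weights): an agent whose discernment is $\mathcal{G}=\sigma(Y)$ may reweight the atoms of $\mathcal{G}$, but within each atom his relative beliefs must coincide with the ignorant person’s, in the spirit of (3).

```python
import numpy as np

# Sketch (assumptions: worlds form the unit interval, P0 is uniform, and the
# agent's discernment G = sigma(Y) comes from quantizing with accuracy delta).
delta = 0.25                      # quantization accuracy
bins = np.arange(0, 1, delta)     # atoms of G: [0, .25), [.25, .5), ...

def agent_measure(bin_weights):
    """Return a density on [0,1) that is G-measurable: constant within each
    atom of G, i.e., the agent reweights atoms but, inside an atom, keeps
    the ignorant person's (uniform) relative beliefs, as required by (3)."""
    w = np.asarray(bin_weights, dtype=float)
    w = w / w.sum()
    return lambda x: w[np.minimum((x // delta).astype(int), len(bins) - 1)] / delta

p = agent_measure([0.1, 0.6, 0.2, 0.1])   # hypothetical posterior atom weights
xs = np.linspace(0, 1, 9, endpoint=False)
print(np.round(p(xs), 3))  # piecewise-constant density with step length delta
```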
3.2. Bayes’ Rule and Posterior Probabilities
A Bayesian approach will be used to define the agent’s degree of belief P. To this end, we regard $x\in\mathcal{X}$ as a parameter of a statistical model and assume that the agent has access to data D. The agent assumes that D is an observation of a random variable defined on some sample space $\mathcal{D}$. The joint distribution of the parameter X and data D, according to the agent’s beliefs, is a probability measure on subsets of $\mathcal{X}\times\mathcal{D}$, with prior distribution $P^0$ of the parameter X, and with a conditional distribution $\Pr(D\mid X=x)$ that corresponds to the likelihood of data D. A posterior probability
$P(A) = \Pr\left(X\in A\mid D\right)$  (6)
of A is formed by updating the prior distribution based on data D. It is assumed that the likelihood $x\mapsto\Pr(D\mid X=x)$ is measurable with respect to $\mathcal{G}$, so that data conform with the agent’s maximal possible discernment between possible worlds. The likelihood function includes the agent’s interpretation of D. Although this interpretation may involve a subjective part, it is still assumed that the agent is not willing to speculate about possible worlds beyond the limits set by $\mathcal{G}$. That is, whenever the agent discerns events in $\mathcal{F}$ beyond the limits set by $\mathcal{G}$, this discernment is the same as for an ignorant person.
Remark 2.
To account for the possibility that the agent still speculates beyond the limits set by external data, $\mathcal{G}$ could be defined as the smallest σ-algebra containing the two σ-algebras that originate from external data and internal data (the agent’s internal experiences, such as dreams and revelations), respectively. Note, however, that $\mathcal{G}$ is subjective, even when internal data are absent, since agents might interpret external data in different ways, due to the way in which they perceive such data and incorporate previous life experience.
From Bayes’ rule we find that the posterior distribution satisfies
$P(B) = \Pr\left(X\in B\mid D\right) = \dfrac{\int_B \Pr(D\mid X=x)\,dP^0(x)}{\int_{\mathcal{X}} \Pr(D\mid X=x)\,dP^0(x)}, \quad B\in\mathcal{F}.$  (7)
A couple of additional reasons reinforce the subjectivity of P: the prior might be subjective, and acquisition of data D might vary between agents [30]. Additionally, acquisition of data D will not necessarily make P more concentrated around the true world $x_0$, since it is possible that the data themselves are biased or that the agent interprets the data in a sub-optimal way.
Since the likelihood function is measurable with respect to $\mathcal{G}$, it follows from (4) that the agent’s belief P, after having observed D, does not lead to a different discernment between possible worlds beyond $\mathcal{G}$ than for an ignorant person. Given $\mathcal{G}$, together with an unlimited amount of unbiased data that the agent interprets correctly, the $\mathcal{G}$-optimal choice of P is
$P(B) = P^0\left(B\mid\mathcal{G}\right)(x_0), \quad B\in\mathcal{F};$  (8)
that is, the ignorant person’s measure conditioned on the atom of $\mathcal{G}$ that contains $x_0$. Equations (4) and (8) uniquely define the $\mathcal{G}$-optimal choice of P. Whenever $\mathcal{G}$ is a proper subset of the maximal σ-algebra $\mathcal{F}$, the measure P in (8) is not the same thing as a point mass at $x_0$. On the other hand, for an oracle with a maximal amount of knowledge about which world is true, $\mathcal{G}=\mathcal{F}$, and (8) reduces to a point mass at the true world; i.e.,
$P(B) = 1_{\{x_0\in B\}}, \quad B\in\mathcal{F}.$  (9)
Remark 3.
An extreme example of biased beliefs is a true-world-excluding probability measure P, with a support that does not include $x_0$:
$x_0 \notin \operatorname{supp}(P).$  (10)
Another example is a correct-proposition-excluding probability measure P, with a support that excludes all worlds x with a correct truth value of S:
$\operatorname{supp}(P) \cap \left\{x\in\mathcal{X}:\ f(x)=f(x_0)\right\} = \emptyset.$  (11)
Illustration 2
(Continuation of Illustration 1). Suppose data are available to the teacher (the agent) in terms of the number of correct answers of a home assignment test with 10 questions. The prior is uniform on $\mathcal{X}=\{x_1,x_2,x_3\}$, whereas data have a binomial distribution with probabilities $p_0=1/2$ and $p_1>1/2$ of answering each question correctly, for a student that either guesses or has math skills/asks for help. Let d be the observed value of D. Since data have the same likelihood ($\Pr(d\mid x_2)=\Pr(d\mid x_3)$) for a student who scores well, regardless of whether he knows how to add or gets help, it is clear that the posterior distribution
$P(\{x_i\}) = \dfrac{\Pr(d\mid x_i)}{\sum_{j=1}^{3}\Pr(d\mid x_j)}, \quad i=1,2,3,$
satisfies $P(\{x_2\})=P(\{x_3\})$. Since the teacher cannot distinguish $x_2$ from $x_3$, his sigma-algebra
$\mathcal{G} = \left\{\emptyset, \{x_1\}, \{x_2,x_3\}, \mathcal{X}\right\}$  (12)
has only four elements, whereas the full sigma-algebra $\mathcal{F}=2^{\mathcal{X}}$ consists of all eight subsets of $\mathcal{X}$. Note that Equation (3) stipulates that the teacher cannot discern between the elements of $\mathcal{F}$, beyond the limits set by $\mathcal{G}$, better than the ignorant person. In order to verify (3), since there is no sigma-algebra between $\mathcal{G}$ and $\mathcal{F}$, we only need to check this equation for $\mathcal{H}=\mathcal{G}$. To this end, let $g:\mathcal{X}\to\mathbb{R}$ be a real-valued function. Then, since $P(\{x_2\})=P(\{x_3\})$ and $P^0(\{x_2\})=P^0(\{x_3\})$, it follows that
$E_P\left(g\mid\mathcal{G}\right) = E_{P^0}\left(g\mid\mathcal{G}\right),$
in agreement with (3).
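The computations of this illustration are easy to reproduce. The following sketch assumes ten questions, a uniform prior, and a hypothetical success probability $p_1=0.9$ for a skilled or helped student (the text only requires $p_1>1/2$); it verifies that the posterior cannot separate $x_2$ from $x_3$ and evaluates the active information of (14) for $A=\{x_2,x_3\}$.

```python
import numpy as np
from scipy.stats import binom

# Illustration 2 as a numeric sketch. Assumptions: ten questions, p1 = 0.9
# for a student with math skills (x2) or external help (x3), p0 = 0.5 for a
# guessing student (x1), and a uniform prior P0 on {x1, x2, x3}.
n, d = 10, 9                    # questions and observed number correct
p = np.array([0.5, 0.9, 0.9])   # success probability in worlds x1, x2, x3
prior = np.ones(3) / 3

like = binom.pmf(d, n, p)                    # likelihood Pr(d | x)
post = like * prior / np.sum(like * prior)   # Bayes' rule (7)

# The teacher cannot separate x2 from x3: the posterior is G-measurable.
print(np.round(post, 4))                     # post[1] == post[2]

A = [1, 2]                                   # S true in worlds {x2, x3}
ain = np.log(post[A].sum() / prior[A].sum()) # active information (14)
print(f"AIN for S: {ain:.3f}")               # > 0: the teacher has learned S
```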
Illustration 3.
During the Russo-Japanese war, Czar Nicholas II was convinced that Russia would easily defeat Japan [31]. His own biases (he considered the Japanese weak and the Russians superior) and the partial information he received from his advisors blinded him to reality. In the end, Russian forces were heavily beaten by the Japanese. In this scenario, the proposition S is “Russia will beat Japan”, $\mathcal{X}$ consists of all possible future scenarios, and $f(x)=1$ for those scenarios in which Russia would win the war. As history reveals, $f(x_0)=0$. The information he received from his advisors was D, and we know it was heavily biased. Nicholas II adopted (very subjectively!) a correct-proposition-excluding probability measure, as in (11), because he did not even consider the possibility of Russia being defeated. The main reason was a dramatically poor assessment of the likelihood $\Pr(D\mid X=x)$ for scenarios $x\notin A$, on top of a prior that assigned low probability to scenarios outside A. Nicholas II’s verdict was $P(A)=1$.
3.3. Expected Posterior Beliefs
Since D is random, so is P. For this reason, the expected posterior distribution
$\bar{P}(B) = E\left[P(B)\right], \quad B\in\mathcal{F},$  (13)
will be used occasionally, with an expectation corresponding to averaging over all possible data sets D according to its distribution in the true world $x_0$. Consequently, $\bar P(A)$ represents the agent’s expected belief in S in the true world $x_0$. Note in particular that in contrast to the posterior P, the expected posterior $\bar P$ is not a purely Bayesian notion, since it depends on $x_0$.
4. Learning
4.1. Active Information for Quantifying the Amount of Learning
The active information (AIN) of an event $B\in\mathcal{F}$ is
$I^+(B) = \log\dfrac{P(B)}{P^0(B)}.$  (14)
In particular, $I^+(A)$ quantifies how much an agent has learned about whether S is true or not compared to an ignorant person. By inserting (7) into (14), we find that the AIN
$I^+(A) = \log\dfrac{P(A)}{P^0(A)} = \log\dfrac{\Pr(D\mid X\in A)}{\Pr(D)}$  (15)
is the logarithm of the ratio between how likely it is to observe data when S holds, and how likely data are when no assumption regarding S is made (see also [32]). The corresponding AIN for expected degrees of beliefs is
$\bar I^+(A) = \log\dfrac{\bar P(A)}{P^0(A)}.$  (16)
Definition 1
(Learning). Learning about S has occurred (conditionally on observed D) if the probability measure P either satisfies $I^+(A)>0$ when $f(x_0)=1$ or $I^+(A)<0$ when $f(x_0)=0$. In particular, full learning corresponds to $P(A)=1$ when $f(x_0)=1$ and $P(A)=0$ when $f(x_0)=0$. Learning is expected to occur if the probability measure $\bar P$ is such that $\bar I^+(A)>0$ when $f(x_0)=1$ or $\bar I^+(A)<0$ when $f(x_0)=0$. In particular, full learning is expected to occur if $\bar P(A)=1$ when $f(x_0)=1$ or $\bar P(A)=0$ when $f(x_0)=0$.
Remark 4.
Two extreme scenarios for the active information, when $f(x_0)=1$, are
$I^+(A) = \begin{cases} -\log P^0(A), & \text{if } P(A)=1, \\ -\infty, & \text{if } P(A)=0. \end{cases}$  (17)
According to Definition 1, the upper part of (17) represents full learning, that is, $P(A)=1$; whereas the lower part corresponds to a maximal amount of false belief about S when $f(x_0)=1$, that is, $P(A)=0$.
Remark 5.
Suppose S is a proposition that a certain entity or machine functions; then, $-\log P^0(A)$ is the functional information associated with the event A of observing such a functioning entity [33,34,35]. In our context, functional information corresponds to the maximal amount of learning about S when the machine works ($f(x_0)=1$).
4.2. Learning as Hypothesis Testing
It is possible to view the AIN in (15) as a test statistic for choosing between the two statistical hypotheses
$H_0:\ f(x_0)=1 \quad\text{versus}\quad H_1:\ f(x_0)=0,$  (18)
with the null hypothesis being rejected (conditionally on observed D) when
$I^+(A) < I,$  (19)
for some threshold I [36,37,38]. Typically, this threshold represents a lower bound of what is considered to be a significant amount of learning when S is true. Note in particular that the framework of the hypothesis test, (18) and (19), is frequentist, although we use Bayesian tools (the prior and posterior distributions) to define the test statistic.
In order to introduce performance measures of how much the agent has learnt, let $\Pr_{x_0}$ refer to probabilities when data are generated according to what one expects in the true world $x_0$. The type I and II errors of the test (18) and (19) are then defined as
$\alpha = \Pr_{x_0}\left(I^+(A) < I\right) \text{ when } f(x_0)=1, \qquad \beta = \Pr_{x_0}\left(I^+(A) \ge I\right) \text{ when } f(x_0)=0,$  (20)
respectively. Both these error probabilities are functions of $x_0$, and they quantify how much the agent has learnt about the truth (cf. Figure 1 for an illustration).
Figure 1.
Illustration of the density function of $I^+(A)$ when the data set D varies according to the likelihood of the true world parameter $x_0$, for two scenarios where S is either false (a) or true (b). The threshold of the hypothesis test (19) is I, so that $H_0$ is rejected when $I^+(A)<I$. Note that $\bar I^+(A)$ is the expected value of each density, whereas the error probabilities of type I and II correspond to the areas under the curves in (b) and (a) to the left and right of the threshold, respectively.
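For a concrete sense of the error probabilities (20), they can be computed exactly in the binomial model of Illustration 2. The threshold $I=0.2$ and $p_1=0.9$ below are assumed values for illustration; $H_0$ is rejected when the AIN falls below I.

```python
import numpy as np
from scipy.stats import binom

# Exact error probabilities (20) for the test (18)-(19) in the binomial model
# of Illustration 2. Assumptions: p1 = 0.9 and threshold I = 0.2.
n, I = 10, 0.2
p = np.array([0.5, 0.9, 0.9])   # worlds x1 (guess), x2 (skill), x3 (help)
prior = np.ones(3) / 3
A = [1, 2]                      # S true on {x2, x3}

def ain(d):
    like = binom.pmf(d, n, p)
    post = like * prior / np.sum(like * prior)
    return np.log(post[A].sum() / prior[A].sum())

d_vals = np.arange(n + 1)
rejected = np.array([ain(d) for d in d_vals]) < I   # rejection region of H0
type_I = binom.pmf(d_vals[rejected], n, p[1]).sum() # reject although S true
power = binom.pmf(d_vals[rejected], n, p[0]).sum()  # reject when S false
print(f"type I error = {type_I:.3f}, type II error = {1 - power:.3f}")
```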
4.3. The Bayesian Approach to Learning
Within a Bayesian framework, we think of $H_0$ and $H_1$ as two different models, A and $A^c$, that represent a subdivision of the parameter space into two disjoint subsets. The posterior odds
$\dfrac{P(A^c)}{P(A)} = \dfrac{P^0(A^c)}{P^0(A)} \times \dfrac{\Pr(D\mid X\in A^c)}{\Pr(D\mid X\in A)}$  (21)
factor into a product of the prior odds and the Bayes factor. Hypothesis $H_1$ is chosen whenever
$\dfrac{P(A^c)}{P(A)} \ge r$  (22)
for some threshold r. If the cost of drawing a parameter from A ($A^c$) is $c_1$ ($c_0$) when $H_1$ ($H_0$) is chosen, the optimal Bayesian decision rule corresponds to $r=c_1/c_0$. A little algebra reveals that the AIN is a monotone decreasing function
$I^+(A) = -\log\left[P^0(A)\left(1 + \dfrac{P(A^c)}{P(A)}\right)\right]$
of the posterior odds. From this, it follows that the frequentist test (19), with AIN as test statistic, is equivalent to the Bayesian test (22), whenever $r = e^{-I}/P^0(A) - 1$. However, the interpretation of the two tests differs. Whereas the aim of the Bayesian decision rule is to minimize an expected cost (or maximize an expected reward/utility), the aim of the frequentist test is to keep the error probabilities of type I and II low.
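The algebra behind this equivalence is easily checked numerically; the values of $P^0(A)$ and $P(A)$ below are arbitrary test numbers.

```python
import numpy as np

# Numeric check of the algebra relating AIN to the posterior odds: with
# O = P(A^c)/P(A), one has P(A) = 1/(1+O), hence I^+ = -log(P0(A) * (1+O)),
# a decreasing function of O (values here are arbitrary test numbers).
P0_A, P_A = 0.4, 0.75
odds = (1 - P_A) / P_A
ain_direct = np.log(P_A / P0_A)
ain_via_odds = -np.log(P0_A * (1 + odds))
print(np.isclose(ain_direct, ain_via_odds))  # True
```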
4.4. Test Statistic When A Is Unknown
Recall that the agent may or may not know the set A. In the latter case, the agent cannot determine the value of the test statistic $I^+(A)$, and hence he cannot test between $H_0$ and $H_1$ himself. This happens, for instance, for the truth function (1), with $A=\{x_0\}$, since the AIN then involves the unknown $x_0$, with $P(A)=P(\{x_0\})$ and $P^0(A)=P^0(\{x_0\})$. Although $I^+(A)$ is not known for this particular choice of A, the agent may still use the posterior distribution (7) in order to compute the expected value (conditionally on observed D)
$E_P\left[I^+(\{X\})\right] = \int_{\mathcal{X}} \log\dfrac{dP}{dP^0}(x)\,dP(x)$  (23)
of the test statistic according to his posterior beliefs. Note that (23) equals the Kullback–Leibler divergence $D(P\,\|\,P^0)$ between P and $P^0$, or the difference between the cross entropy between P and $P^0$, and the entropy of P. If we also take randomness of the data set D into account, and make use of (7), it follows that the expected AIN, for the same choice of A, equals the mutual information
$E\left[D(P\,\|\,P^0)\right] = MI(X, D)$  (24)
between X and D, when $(X,D)$ vary jointly according to the agent’s Bayesian expectations, and with X having prior distribution $P^0$.
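Both identities can be verified numerically in the binomial model of Illustration 2 with $A=\{x_0\}$; $p_1=0.9$ is again an assumed value. The sketch computes the Kullback–Leibler divergence of (23) for every possible data set and averages it into the mutual information of (24).

```python
import numpy as np
from scipy.stats import binom

# Sketch of (23) and (24) for the binomial model of Illustration 2 with
# A = {x0}: the posterior-expected AIN is the KL divergence D(P || P0), and
# its average over data sets D equals the mutual information MI(X, D).
n = 10
p = np.array([0.5, 0.9, 0.9])
prior = np.ones(3) / 3
d_vals = np.arange(n + 1)

like = binom.pmf(d_vals[:, None], n, p[None, :])   # Pr(d | x), shape (11, 3)
marg = like @ prior                                # Pr(d)
post = like * prior / marg[:, None]                # P(x | d) for each d

kl = np.sum(post * np.log(post / prior), axis=1)   # D(P || P0), one per d
mi = np.sum(marg * kl)                             # average over D: MI(X, D)
print(f"E[D(P||P0)] over data = MI(X, D) = {mi:.4f} nats")
```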
5. Knowledge Acquisition
5.1. Knowledge Acquisition Goes beyond Learning
As mentioned in the introduction, knowledge acquisition goes beyond learning, since it also requires that a true belief in S is justified (see Figure 2 for an illustration).
Figure 2.
Illustration of the difference between learning and knowledge acquisition for a scenario with a set $\mathcal{X}$ of worlds and a statement S whose truth function f is depicted to the left (a) and right (b). It is assumed that S is true ($f(x_0)=1$), and that the degrees of beliefs of an ignorant person correspond to a uniform distribution on $\mathcal{X}$. The filled histograms correspond to the density functions of two agents’ beliefs. The agent to the left (a) has learnt about S but not acquired knowledge, since $x_0$ does not belong to the support of P. The agent to the right has not only learnt about S, but also acquired knowledge, since his belief is justified, corresponding to a distribution P that is more concentrated around the true world $x_0$, compared to the ignorant person. Hence, the JTB condition is satisfied for the agent to the right, but not for the agent to the left.
It is possible, in principle, for an agent whose probability measure P corresponds to a smaller belief in $x_0$ compared to that of the ignorant person, to have a value of $P(A)$ anywhere in the range $[0,1]$ when S is true (i.e., when $f(x_0)=1$). One can think of a case in which the agent will believe in S with certainty ($P(A)=1$) if $\operatorname{supp}(P)\subset A$; but this belief in S is for the wrong reason if, for instance, the agent does not believe in the true world, i.e., if (10) holds, corresponding to the left part of Figure 2. Another less extreme situation occurs when the agent has a higher belief in A compared to the ignorant person but has lost some (but not all) confidence in the true world with respect to that of the ignorant person; in this case, the agent has not acquired new knowledge about the true world compared to the ignorant person, although he still has learned about S and has some knowledge about the true world.
5.2. A Formal Definition of Knowledge Acquisition
Knowledge acquisition is formulated using tools from statistical estimation theory. Loosely speaking, the agent acquires knowledge, based on data D, if the posterior distribution P gets more concentrated around $x_0$, compared to an ignorant person. By this we mean that each closed ball centered at $x_0$ has a probability that is at least as large under P as under $P^0$. Closed balls require, in turn, the concept of a metric or distance; that is, a function $d:\mathcal{X}\times\mathcal{X}\to[0,\infty)$. Some examples of metrics are:
- If $\mathcal{X}\subset\mathbb{R}^q$, we use the Euclidean distance $d(x,y)=\|x-y\|$ between $x,y\in\mathcal{X}$ as metric.
- If $\mathcal{X}$ consists of all binary sequences of length q, then $d(x,y)$ is the Hamming distance between x and y.
- If $\mathcal{X}$ is a finite categorical space, we put $d(x,y)=1_{\{x\ne y\}}$.
Equipped with a metric on $\mathcal{X}$, knowledge acquisition is now defined:
Definition 2
(Knowledge acquisition and full knowledge). Let $B_\epsilon = \{x\in\mathcal{X}:\ d(x,x_0)\le\epsilon\}$ be the closed ball of radius ϵ that is centered at $x_0$ with respect to some metric d. We say that an agent has acquired knowledge about S (conditionally on observed D) if learning has occurred according to Definition 1, and, in order for this learning to be justified, the following two properties are satisfied for all $\epsilon>0$:
$P(B_\epsilon) > 0,$  (25)
and
$P(B_\epsilon) \ge P^0(B_\epsilon),$  (26)
with strict inequality for at least one $\epsilon>0$. Full knowledge about S requires that (9) holds; i.e., that the agent with certainty believes that the true world is true. The agent is expected to acquire knowledge about S if learning is expected to occur, according to Definition 1, and if (25) and (26) hold with $\bar P$ instead of P. The agent is expected to acquire full knowledge about S if (9) holds with $\bar P$ instead of P.
Several remarks are in order.
Remark 6.
Property (25) ensures that $x_0$ is in the support of P ([39], p. 20). When $P^0$ is the uniform distribution (2), (25) follows from (26). Property (26) is equivalent to $I^+(B_\epsilon)\ge 0$ when $P^0(B_\epsilon)>0$. In this case, the requirement that (26) is satisfied with strict inequality for some ϵ is equivalent to learning the proposition $S_\epsilon$: “The distance of a world to the true world is less than or equal to ϵ,” corresponding to a truth function
$f_\epsilon(x) = 1_{\{d(x,x_0)\le\epsilon\}}.$  (27)
Since the agent does not know $x_0$, neither $f_\epsilon$ nor $B_\epsilon$ is known to him, even if he is able to discern between all possible worlds. If $f_\epsilon$ differs from the original truth function f, learning of $S_\epsilon$ can be viewed as meta-learning. Note also that $B_0$ corresponds to the set $A=\{x_0\}$ in (1).
Remark 7.
Suppose the truth function used to define learning and knowledge acquisition satisfies (27), i.e., $f=f_\epsilon$ for some $\epsilon\ge0$. Then (25) and (26) are sufficient for knowledge acquisition, since they imply that learning of $S_\epsilon$ has occurred, according to Definition 1. Although knowledge acquisition in general requires more than learning, the two concepts are equivalent for a truth function $f=f_0$, with $A=\{x_0\}$, as defined in (1). Indeed, in this case it is not possible to learn whether S is true or not for the wrong reason.
Remark 8.
Recall from Definition 1 that an agent has fully learnt S when
$P(A) = \begin{cases} 1, & \text{if } f(x_0)=1, \\ 0, & \text{if } f(x_0)=0. \end{cases}$  (28)
For a rational agent, the lower part of (28) should hold when data D falsifies S. In general, (28) is a necessary but not sufficient condition for full knowledge. Indeed, it follows from (9) that, for a person to have full knowledge, $P(B)=1_{\{x_0\in B\}}$ must hold for all $B\in\mathcal{F}$, not only for the set A of worlds for which S is true.
Remark 9.
Suppose a distance measure $d(P,Q)$ between probability distributions on $\mathcal{X}$ is defined. This gives rise to a different definition of knowledge acquisition, whereby the agent acquires knowledge if he has learnt about S and additionally $d(\delta_{x_0},P) < d(\delta_{x_0},P^0)$, that is, if his beliefs are closer than the ignorant person’s beliefs to the oracle’s beliefs $\delta_{x_0}$. Possible choices of distances are the Kullback–Leibler divergence $d(P,Q)=D(P\,\|\,Q)$ and the Wasserstein metric $d(P,Q)=\min E\left[d(X_P,X_Q)\right]$, where the minimum is taken over all random vectors $(X_P,X_Q)$ whose marginals have distributions P and Q, respectively. Note in particular that the KL choice of distance yields $d(\delta_{x_0},P) = -\log P(\{x_0\})$. The corresponding notion of knowledge acquisition is weaker than in Definition 2, requiring (25) and (26) to hold only for $\epsilon=0$.
Illustration 4
(Continuation of Illustration 1). To check whether learning or knowledge acquisition has occurred, according to Definitions 1 and 2, for the student who takes the math home assignment, $x_0$ must be known. The reader may think of an instructor with full information (an $\mathcal{F}$-optimal measure according to (9)) who checks whether a pupil has learned and acquired knowledge or not. However, in Illustration 1 it is the teacher who is the pupil and learns and acquires knowledge about the skills of a math student. In this context, the instructor is a supervisor of the teacher who knows whether the math student is able to add ($x_2$) or not, and in the latter case whether the student gets help ($x_3$) or not ($x_1$). Whereas the instructor’s sigma-algebra is $\mathcal{F}$, the teacher’s sigma-algebra $\mathcal{G}$ in (12) does not make it possible to discern between $x_2$ and $x_3$. Suppose $x_0=x_2$. No matter how many questions the home exam has, as long as the teacher does not get information from the instructor on whether the student solved the home exam without help or not, although the teacher learns that S is true, since the student scores well, he will never acquire full knowledge that the student knows how to add, since $P(\{x_2\})=P(\{x_3\})\le 1/2$.
6. Learning and Knowledge Acquisition Processes
The previous two sections dealt with learning and knowledge acquisition of a static belief P, corresponding to an agent who is able to discern between worlds according to one sub-σ-algebra $\mathcal{G}$ of $\mathcal{F}$, and who has access to one data set D. The setting is now extended to consider an agent who is exposed to an increasing amount of information about (or discernment between) the possible worlds in $\mathcal{X}$, and increasingly larger data sets.
6.1. The Process of Discernment and Data Collection
Mathematically, an increased ability to discern between possible worlds is expressed as a sequence of σ-algebras
$\mathcal{G}_1 \subset \mathcal{G}_2 \subset \cdots \subset \mathcal{F}.$  (29)
Typically, $\mathcal{G}_k$ is generated by a random variable $Y_k$ whose domain is in $\mathcal{X}$ for $k=1,2,\ldots$. The σ-algebras in (29) are associated with increasingly larger data sets $D_k$, with $D_1\subset D_2\subset\cdots$. Let $\Pr_k$ refer to the joint distribution of the parameter and data in step k, such that the likelihood of $D_k$ is $\mathcal{G}_k$-measurable. This implies that an agent who interprets data according to this likelihood function has beliefs (represented by the posterior probability measure $P_k$) that correspond to not being able to discern events outside of $\mathcal{G}_k$ better than an ignorant person. Mathematically, this is phrased as a requirement
$E_{P_k}\left(g\mid\mathcal{H}\right) = E_{P^0}\left(g\mid\mathcal{H}\right)$  (30)
for all $\mathcal{F}$-measurable functions $g:\mathcal{X}\to\mathbb{R}$ and sigma algebras $\mathcal{H}$ such that $\mathcal{G}_k\subset\mathcal{H}\subset\mathcal{F}$, for $k=1,2,\ldots$. The collection of pairs $(\mathcal{G}_k,D_k)$, $k=1,2,\ldots$, is referred to as a discernment and data collection process. The active information, after k steps of the discernment and data collection process, is
$I^+_k = \log\dfrac{P_k(A)}{P^0(A)}.$  (31)
Let $\bar P_k$ refer to expected degrees of belief after k steps of the information and data collection process, if data vary according to what one expects in the true world. The corresponding active information is
$\bar I^+_k = \log\dfrac{\bar P_k(A)}{P^0(A)}.$  (32)
In the following sections we will use the sequences $\{I^+_k\}$ and $\{\bar I^+_k\}$ of AINs and the posterior beliefs $\{P_k\}$ and $\{\bar P_k\}$ in order to define different notions of learning and knowledge acquisition.
6.2. Strong Learning and Knowledge Acquisition
Definition 3
(Strong learning). The probability measures $P_1,P_2,\ldots$, obtained from the discernment and data collection process, represent a learning process in the strong sense (conditionally on observed $D_1,D_2,\ldots$) if
$0 \le I^+_1 \le I^+_2 \le \cdots \text{ when } f(x_0)=1, \quad\text{or}\quad 0 \ge I^+_1 \ge I^+_2 \ge \cdots \text{ when } f(x_0)=0,$  (33)
with at least one strict inequality. Learning is expected to occur, in the strong sense, if (33) holds with $\bar I^+_k$, instead of $I^+_k$.
Definition 4
(Strong knowledge acquisition). With $B_\epsilon$ as in Definition 2, the learning process is knowledge acquiring in the strong sense (conditionally on observed $D_1,D_2,\ldots$) if, in addition to (33), we have that this learning process is justified, so that $P_k(B_\epsilon)>0$ for all $\epsilon>0$ and all k, and
$P^0(B_\epsilon) \le P_1(B_\epsilon) \le P_2(B_\epsilon) \le \cdots,$  (34)
with strict inequality for at least one step of (34) and for at least one $\epsilon>0$. Knowledge acquisition is expected to occur, in the strong sense, if learning is expected to occur in the strong sense, according to Definition 3, and additionally (34) holds with $\bar P_k$, instead of $P_k$.
Illustration 5
(Continuation of Illustration 1). Assume the teacher of the math student has a discernment and data collection process $(\mathcal{G}_1,D_1)$, $(\mathcal{G}_2,D_2)$, where in the first step, $\mathcal{G}_1$ and $D_1=d$ are obtained from a home assignment with 10 questions (as described in Section 3.2). Suppose the student knows how to add ($x_0=x_2$). It can be seen that
$P^0(\{x_2,x_3\}) < P_1(\{x_2,x_3\})$  (35)
whenever d is large enough. Assume that in a second step the teacher receives information from the instructor on whether the student used external help ($y=1$) or not ($y=0$) during the exam. Let $D_2=(d,y)$ refer to observed data after step 2. The likelihood, after the second step, then takes the form
$\Pr(D_2\mid x) = \Pr(d\mid x)\Pr(y\mid x),$
where $\Pr(y=0\mid x)=1$ for $x\in\{x_1,x_2\}$ and $\Pr(y=0\mid x_3)=0$. If the instructor correctly reports that the student did not use external help ($y=0$), it follows that
$P_2(\{x_3\}) = 0 \quad\text{and}\quad P_2(\{x_2\}) = \dfrac{P_1(\{x_2\})}{P_1(\{x_1\})+P_1(\{x_2\})}.$  (36)
We deduce from (35) and (36) that
$P^0(\{x_2\}) < P_1(\{x_2\}) < P_2(\{x_2\}),$  (37)
which suggests that knowledge acquisition has occurred if the categorical space metric is used on $\mathcal{X}$. However, since $P_2(\{x_2,x_3\}) < P_1(\{x_2,x_3\})$, neither learning nor knowledge acquisition in the strong sense has occurred. The reason is that the information from the instructor (that the student has not cheated) makes the teacher less certain as to whether the student is able to score well on the test. On the other hand, if we change the proposition to S′: “The student knows how to add,” with $A'=\{x_2\}$, then strong learning and knowledge acquisition have occurred because of (37), since $B_\epsilon = A'$ for $0\le\epsilon<1$.
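The two steps of this illustration can be reproduced numerically; as before, $p_1=0.9$ and the observed score $d=9$ are assumed values. The output exhibits the pattern of (35)–(37): the belief in $x_2$ increases from step 1 to step 2, while the belief in $A=\{x_2,x_3\}$ decreases, so the proposition S is not learned in the strong sense.

```python
import numpy as np
from scipy.stats import binom

# Two-step sketch of Illustration 5. Step 1: d = 9 of 10 answers correct.
# Step 2: the instructor (assumed reliable) reports no external help (y = 0),
# which excludes x3. Numbers (p1 = 0.9, d = 9) are assumptions.
n, d = 10, 9
p = np.array([0.5, 0.9, 0.9])           # worlds x1 (guess), x2 (skill), x3 (help)
prior = np.ones(3) / 3

like1 = binom.pmf(d, n, p)
P1 = like1 * prior / np.sum(like1 * prior)

like2 = like1 * np.array([1.0, 1.0, 0.0])   # Pr(y = 0 | x): excludes x3
P2 = like2 * prior / np.sum(like2 * prior)

A = [1, 2]   # S: "the student is expected to score well" is true on {x2, x3}
print(f"P1(A) = {P1[A].sum():.3f}, P2(A) = {P2[A].sum():.3f}")   # decreases
print(f"P1(x2) = {P1[1]:.3f}, P2(x2) = {P2[1]:.3f}")             # increases
```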
6.3. Weak Learning and Knowledge Acquisition
Learning and knowledge acquisition are often fluctuating processes, and the requirements of Definition 3 are sometimes too strict. Weaker versions of learning and knowledge acquisition are thus introduced.
Definition 5
(Weak learning). Learning in the weak sense has occurred at time n (conditionally on observed $D_1,\ldots,D_n$), if
$I^+_n > 0 \text{ when } f(x_0)=1, \quad\text{or}\quad I^+_n < 0 \text{ when } f(x_0)=0.$  (38)
Learning is expected to occur in the weak sense if (38) holds with $\bar I^+_n$ instead of $I^+_n$.
Definition 6
(Weak knowledge acquisition). Knowledge acquisition in the weak sense occurs (conditionally on observed $D_1,\ldots,D_n$) if, in addition to the weak learning condition (38), in order for this learning to be justified, it holds for all $\epsilon>0$ that $P_n(B_\epsilon)>0$ and
$P_n(B_\epsilon) \ge P^0(B_\epsilon),$  (39)
with strict inequality for at least one $\epsilon>0$. Knowledge acquisition is expected to occur in the weak sense if weak learning occurs according to Definition 5 and (39) holds with $\bar P_n$ instead of $P_n$.
7. Asymptotics
Strong and weak learning (or strong and weak knowledge acquisition) are equivalent for $n=1$. The larger n is, the more restrictive strong learning becomes in comparison to weak learning. However, for large n, neither strong nor weak learning (knowledge acquisition) is an entirely satisfactory notion. For this reason, in this section we introduce asymptotic versions of learning and knowledge acquisition, for an agent whose discernment between worlds and collected data sets increase over a long period of time.
7.1. Asymptotic Learning and Knowledge Acquisition
In order to define asymptotic learning and knowledge acquisition, as the number of steps n of the discernment and data collection process tends to infinity, we first need to introduce AIN versions of limits. Define
$I^+_{\inf} = \liminf_{n\to\infty} I^+_n,$  (40)
$I^+_{\sup} = \limsup_{n\to\infty} I^+_n,$  (41)
and when the two limits of (40) and (41) agree, we refer to the common value as $I^+_\infty$. Define also
$\bar I^+_{\inf} = \liminf_{n\to\infty} \bar I^+_n,$  (42)
$\bar I^+_{\sup} = \limsup_{n\to\infty} \bar I^+_n,$  (43)
with the common value $\bar I^+_\infty$ whenever the two limits of (42) and (43) agree. Since $I^+_\infty$ only exists when $I^+_{\inf}=I^+_{\sup}$, and $\bar I^+_\infty$ only when $\bar I^+_{\inf}=\bar I^+_{\sup}$, the following definitions of asymptotic learning and knowledge acquisition are natural:
Definition 7
(Asymptotic learning). Learning occurs asymptotically (conditionally on the observed data sequence $D_1,D_2,\ldots$) if
$I^+_{\inf} > 0 \text{ when } f(x_0)=1, \quad\text{or}\quad I^+_{\sup} < 0 \text{ when } f(x_0)=0.$  (44)
Full learning occurs asymptotically (conditionally on $D_1,D_2,\ldots$) if
$I^+_\infty = -\log P^0(A) \text{ when } f(x_0)=1, \quad\text{or}\quad I^+_\infty = -\infty \text{ when } f(x_0)=0.$  (45)
Learning is expected to occur asymptotically if (44) holds with $\bar I^+_{\inf}$ and $\bar I^+_{\sup}$, instead of $I^+_{\inf}$ and $I^+_{\sup}$, respectively. Full learning is expected to occur asymptotically, if (45) holds with $\bar I^+_\infty$ instead of $I^+_\infty$.
Definition 8
(Asymptotic knowledge acquisition). Knowledge acquisition occurs asymptotically (conditionally on $D_1,D_2,\ldots$) if, in addition to the asymptotic learning condition (44), in order for this asymptotic learning to be justified, for every $\epsilon>0$, it holds that
$\liminf_{n\to\infty} P_n(B_\epsilon) > 0,$
and
$\liminf_{n\to\infty} P_n(B_\epsilon) \ge P^0(B_\epsilon),$  (46)
with strict inequality for at least one $\epsilon>0$. Full knowledge acquisition occurs asymptotically (conditionally on $D_1,D_2,\ldots$) if (45) holds and
$\lim_{n\to\infty} P_n(B_\epsilon) = 1$  (47)
is satisfied for all $\epsilon>0$. If learning is expected to occur asymptotically according to Definition 7, and if (46) holds with $\bar P_n$ instead of $P_n$, then knowledge acquisition is expected to occur asymptotically. Full knowledge acquisition is expected to occur asymptotically if full learning is expected to occur asymptotically according to Definition 7, and if (47) holds with $\bar P_n$ instead of $P_n$.
7.2. Bayesian Asymptotic Theory
In this subsection we use Bayesian asymptotic theory in order to quantify and give conditions for when asymptotic learning and knowledge acquisition occur. Let $\Omega$ be a large space that incorporates prior beliefs and data for all $n=1,2,\ldots$. Define $X_n$ as a random variable whose distribution corresponds to the agent’s posterior beliefs $P_n$, based on data set $D_n$, which itself varies according to another random variable with distribution $\Pr_{x_0}$. Let $\Pr$ be a probability measure on subsets of $\Omega$ that induces the distributions $P_n$ and $\Pr_{x_0}$, respectively. The following proposition is a consequence of Definitions 7 and 8:
Proposition 1.
Suppose full learning is expected to occur asymptotically, in the sense of (45), with $\bar I^+_\infty$ instead of $I^+_\infty$. Then,
$\bar P_n(A) \to 1 \text{ when } f(x_0)=1, \quad\text{or}\quad \bar P_n(A) \to 0 \text{ when } f(x_0)=0,$  (48)
as $n\to\infty$. In particular, the type I and II errors of the hypothesis test (18) and (19), with threshold $I = -c\log P^0(A)$ for some $0<c<1$, satisfy
$\alpha_n \to 0 \quad\text{and}\quad \beta_n \to 0,$  (49)
respectively, as $n\to\infty$. If full knowledge acquisition occurs asymptotically, in the sense of (47), then
$X_n \stackrel{p}{\longrightarrow} x_0$  (50)
as $n\to\infty$, with $\stackrel{p}{\longrightarrow}$ referring to convergence in probability. If full knowledge acquisition is expected to occur asymptotically, in the sense of Definition 8, then
$\Pr\left(d(X_n,x_0)>\epsilon\right) \to 0 \quad\text{for all } \epsilon>0,$  (51)
as $n\to\infty$.
Remark 10.
Full asymptotic knowledge acquisition (50) is closely related to the notion of posterior consistency [40]. For our model, the latter concept is usually defined as
$\Pr_{x_0}\left(P_n(B_\epsilon)\to 1 \text{ as } n\to\infty, \text{ for all } \epsilon>0\right) = 1,$  (52)
where the probability refers to variations in the data sequence when $x_0$ holds. Thus, posterior consistency (52) means that full asymptotic knowledge acquisition (50) holds with probability 1. Let $\mathcal{L}(X_n)$ refer to the distribution of the random variable $X_n$. Then, (52) is equivalent to
$\mathcal{L}(X_n) \stackrel{a.s.}{\Longrightarrow} \delta_{x_0}$  (53)
as $n\to\infty$, with $\stackrel{a.s.}{\Longrightarrow}$ referring to almost sure weak convergence with respect to variations in the data sequence when $x_0$ holds. On the other hand, it follows from Definition 8 that if full knowledge acquisition is expected to occur asymptotically, this is equivalent to
$\mathcal{L}(X_n) \stackrel{p}{\Longrightarrow} \delta_{x_0}$  (54)
as $n\to\infty$, which is a weaker concept than posterior consistency, since almost sure weak convergence implies weak convergence in probability. However, sometimes (54), rather than (52) and (53), is used as a definition of posterior consistency.
Remark 11.
It is sometimes possible to sharpen (54) and obtain the rate at which the posterior distribution converges to $\delta_{x_0}$. The posterior distribution is said to contract at rate $\epsilon_n$ to $\delta_{x_0}$ as $n\to\infty$ (see for instance [41]), if for every sequence $M_n\to\infty$ it holds that
$P_n\left(\left\{x:\ d(x,x_0)\ge M_n\epsilon_n\right\}\right) \stackrel{p}{\longrightarrow} 0$  (55)
when $D_n$ varies according to what one expects in the true world $x_0$. Since convergence in probability is equivalent to convergence in mean for bounded random variables, it can be seen that (54) is equivalent to $\bar P_n(B_\epsilon)\to 1$ for all $\epsilon>0$, or
$\bar P_n\left(\left\{x:\ d(x,x_0)\ge M_n\epsilon_n\right\}\right) \to 0$  (56)
as $n\to\infty$ for each sequence $M_n\to\infty$. Comparing (51) with (55) and (56), we find that a contraction of the posterior towards $\delta_{x_0}$ at rate $\epsilon_n$ is equivalent to expecting full knowledge acquisition asymptotically at rate $\epsilon_n$.
It follows from Proposition 1 and Remarks 10 and 11 that Bayesian asymptotic theory can be used, within our frequentist/Bayesian framework, to give sufficient conditions for asymptotic learning and knowledge acquisition to occur. Suppose, for instance, that $D_k=(Z_1,\ldots,Z_k)$ is a sample of k independent and identically distributed random variables with a distribution $\Pr(Z\mid x_0)$ that belongs to the statistical model $\{\Pr(Z\mid x);\ x\in\mathcal{X}\}$. The likelihood function is then a product
$\Pr(D_k\mid x) = \prod_{i=1}^{k} \Pr(Z_i\mid x)$  (57)
of the likelihoods of all observations $Z_1,\ldots,Z_k$. For such a model, a number of authors [40,42,43,44,45] have provided sufficient conditions for posterior consistency (52) and (53) to occur. It follows from Remark 10 that these conditions also imply the weaker concept (54) of full, expected knowledge acquisition occurring asymptotically.
Suppose (57) holds with a parameter space $\mathcal{X}$ that is a subset of Euclidean space of dimension q. It is possible then to obtain the rate (56) at which knowledge acquisition is expected to occur. The first step is to use the Bernstein–von Mises theorem, which under appropriate conditions (see for instance [46]) approximates the posterior distribution by a normal distribution centered around the maximum likelihood (ML) estimator
$\hat x_n = \operatorname{arg\,max}_{x\in\mathcal{X}} \Pr(D_n\mid x)$  (58)
of $x_0$. More specifically, this theorem provides weak convergence
$\mathcal{L}\left(\sqrt{n}\left(X_n-\hat x_n\right)\right) \Longrightarrow N\left(0,\ J(x_0)^{-1}\right)$  (59)
as $n\to\infty$, of a re-scaled version of the distribution of $X_n$ when $D_n$ varies according to what one expects when $x_0$ holds. The limiting distribution is a q-dimensional normal distribution with mean 0 and a covariance matrix that equals the inverse of the Fisher information matrix $J(x_0)$, evaluated at the true world parameter $x_0$. On the other hand, the standard asymptotic theory of maximum likelihood estimation (see for instance [47]) implies
$\sqrt{n}\left(\hat x_n - x_0\right) \Longrightarrow N\left(0,\ J(x_0)^{-1}\right)$  (60)
as $n\to\infty$, with weak convergence referring to variations in the data sequence when $x_0$ holds. Combining Equations (59) and (60), we arrive at the following result:
Theorem 1.
Assume data $D_n=(Z_1,\ldots,Z_n)$ consists of independent and identically distributed random variables, and that the Bernstein–von Mises theorem (59) and asymptotic normality (60) of the ML-estimator hold. Then, $X_n$ converges weakly towards $x_0$ at rate $n^{-1/2}$, in the sense that
$\sqrt{n}\left(X_n - x_0\right) \Longrightarrow N\left(0,\ 2J(x_0)^{-1}\right)$  (61)
as $n\to\infty$. In particular, full knowledge acquisition is expected to occur asymptotically at rate $n^{-1/2}$.
Proof.
Let $F_n(z) = \Pr\left(\sqrt{n}(X_n-x_0)\le z\right)$ refer to the distribution function of $\sqrt{n}(X_n-x_0)$, defined for all q-dimensional vectors z. Let also $Z\sim N(0,\ 2J(x_0)^{-1})$ and denote the distribution function of Z by G. Combining (59) and (60), and making use of the fact that the convolution of two independent $N(0,\ J(x_0)^{-1})$-variables is distributed as $N(0,\ 2J(x_0)^{-1})$, we find that
$F_n(z) \to G(z)$  (62)
as $n\to\infty$, with $\le$ referring to componentwise inequality for vectors in $\mathbb{R}^q$. Since (62) holds for any z, Equation (61) follows. Moreover, in view of (56), Equation (61) implies that full knowledge acquisition is expected to occur asymptotically at rate $n^{-1/2}$. □
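The $n^{-1/2}$ contraction rate of Theorem 1 can be sketched in the simplest conjugate setting: a normal model with known variance and a standard normal prior, where the posterior is available in closed form (the true world $x_0=0.3$ and all other numbers below are assumptions for illustration).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

# Sketch of posterior contraction at rate 1/sqrt(k) (Theorem 1) for a normal
# model with known variance 1 and a conjugate N(0, 1) prior; x0 = 0.3 is an
# assumed true world. Posterior: N(k * mean / (k + 1), 1 / (k + 1)).
x0, eps = 0.3, 0.05
for k in [10, 100, 1000, 10000]:
    data = rng.normal(x0, 1.0, size=k)
    post_mean = k * data.mean() / (k + 1)
    post_sd = 1.0 / np.sqrt(k + 1)
    mass = (norm.cdf(x0 + eps, post_mean, post_sd)
            - norm.cdf(x0 - eps, post_mean, post_sd))
    print(f"k = {k:6d}: posterior sd = {post_sd:.4f}, P(B_eps) = {mass:.3f}")
```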
In general, the conditions of Theorem 1 require that data, and the agent’s interpretation of data, are unbiased. When these conditions fail (cf. Remark 2), there is no guarantee that knowledge acquisition is expected to occur asymptotically as $n\to\infty$.
8. Examples
Example 1
(Coin tossing). Let $x\in\mathcal{X}=[0,1]$ be the probability of heads when a certain coin is tossed. An agent wants to find out whether the proposition
S: “The coin is approximately symmetric”; i.e., $|x-1/2|\le\epsilon$ for some small tolerance $\epsilon>0$,
is true or not. This corresponds to a truth function $f(x)=1_{\{|x-1/2|\le\epsilon\}}$, with $A=[1/2-\epsilon,\ 1/2+\epsilon]$, that is known to the agent. Suppose the coin is tossed a large number of times ($n\to\infty$), and let $D_k=(Z_1,\ldots,Z_k)$ be a binary sequence of length k that represents the first k tosses, with tails and heads corresponding to 0 ($Z_i=0$) and 1 ($Z_i=1$), respectively. The number of heads $m=m_k=\sum_{i=1}^{k} Z_i$ after k tosses is then a sufficient statistic for estimating x based on data $D_k$. Even though $D_1\subset D_2\subset\cdots$ is an increasing sequence of data sets, we put $\mathcal{G}_k=\mathcal{F}$, the Borel σ-algebra on $[0,1]$, for $k=1,2,\ldots$. Let $P^0$ be the uniform prior distribution on $\mathcal{X}$. Since the uniform distribution is a beta distribution, and beta distributions are conjugate priors to binomial distributions, it is well known [17] that the posterior distribution
$P_k = \mathrm{Beta}\left(m+1,\ k-m+1\right)$
belongs to the beta family as well. Consequently, if $X_k$ is a random variable that reflects the agent’s degree of belief in the probability of heads after k tosses, it follows that his belief in a symmetric coin, if $A=[1/2-\epsilon,\ 1/2+\epsilon]$, is
$P_k(A) = \int_{1/2-\epsilon}^{1/2+\epsilon} \pi_k(x)\,dx,$
where
$\pi_k(x) = \dfrac{x^{m}(1-x)^{k-m}}{B(m+1,\ k-m+1)}$  (63)
is the posterior density function of the parameter x, whereas $B(\cdot,\cdot)$ is the beta function and $x^m(1-x)^{k-m}$ the likelihood function (up to a constant). From this, it follows that the AIN after k coin tosses with m heads and $k-m$ tails equals
$I^+_k = \log\dfrac{P_k(A)}{2\epsilon}.$
Since data are random, $P_k(A)$ (and hence also $I^+_k$) will fluctuate randomly up and down with probability one (see Figure 3); for this reason, $P_1,P_2,\ldots$ does not represent a learning process in the strong sense of Definition 3. On the other hand, it follows by the strong law of large numbers that $m_k/k\to x_0$ as $k\to\infty$, and from properties of the beta distribution, this implies that full learning and knowledge acquisition occur asymptotically according to Definitions 7 and 8, with probability 1. In view of Remark 10, we also have posterior consistency (52) and (53).
By analyzing $\bar P_k$ instead of $P_k$, we may also assess whether learning and knowledge acquisition are expected to occur. The expected degree of belief in a symmetric coin, after k tosses, is
$\bar P_k(A) = \int_{1/2-\epsilon}^{1/2+\epsilon} \bar\pi_k(x)\,dx,$
where
$\bar\pi_k(x) = E\left[\pi_k(x)\right]$
is the expected posterior density function of x, after k tosses of the coin, with the expectation taken over the binomially distributed number of heads $m_k$. Note in particular that
$\bar\pi_k(x) = \sum_{m=0}^{k} \binom{k}{m} x_0^{m}(1-x_0)^{k-m}\, \dfrac{x^{m}(1-x)^{k-m}}{B(m+1,\ k-m+1)}.$
It can be shown that (63) and the weak law of large numbers ($m_k/k \stackrel{p}{\to} x_0$ as $k\to\infty$, where $\stackrel{p}{\to}$ refers to convergence in probability) lead to uniform convergence
$\bar\pi_k(x) \to 0 \quad\text{uniformly for } |x-x_0|\ge\eta,$
as $k\to\infty$ for any $\eta>0$. The last four displayed equations imply $\bar P_k(A)\to 1$ and $\bar I^+_k\to-\log(2\epsilon)$ when $f(x_0)=1$, as $k\to\infty$. This and Definitions 7 and 8 imply that full learning and knowledge acquisition are expected to occur asymptotically. This result is also a consequence of posterior consistency, or of Theorem 1. Notice, however, that a purely Bayesian analysis does not allow us to conclude that knowledge acquisition occurs, or is expected to occur, asymptotically.
Figure 3.
Degree of belief $P_k(A)$ is represented as a function of the number k of coin tosses. There is no strong learning because the belief oscillates. However, there is weak learning after a few coin tosses. In particular, when the number of coin tosses is 1000, there is weak learning since $f(x_0)=1$ and $I^+_{1000}>0$.
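The behavior described in this example can be sketched by simulation. The snippet below assumes a fair coin ($x_0=1/2$, so that $f(x_0)=1$) and a tolerance $\epsilon=0.05$ in the definition of A; it tracks the posterior Beta mass of A and the corresponding AIN.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(3)

# Sketch of Example 1: a fair coin (x0 = 0.5), uniform Beta(1,1) prior, and
# S: |x - 1/2| <= eps with an assumed tolerance eps = 0.05. P_k(A) is the
# Beta posterior mass of A; the AIN fluctuates, so learning is weak, not strong.
x0, eps = 0.5, 0.05
P0_A = 2 * eps                              # uniform prior mass of A
tosses = rng.random(1000) < x0              # True = heads
heads = np.cumsum(tosses)

for k in [10, 100, 1000]:
    m = heads[k - 1]                        # heads after k tosses
    Pk_A = (beta.cdf(x0 + eps, m + 1, k - m + 1)
            - beta.cdf(x0 - eps, m + 1, k - m + 1))
    print(f"k = {k:4d}: P_k(A) = {Pk_A:.3f}, AIN = {np.log(Pk_A / P0_A):+.3f}")
```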
Example 2
(Historical events). Let $\mathcal{X}=[t_1,t_2]$ represent a time interval of the past. A person wants to find out whether his ancestor died or not during a famine that occurred in the province where the ancestor lived. Formally, this corresponds to a proposition
S: “The ancestor died during the famine.”
Let $f(x)=1$ if the famine occurred at time x, and $f(x)=0$ if not. Assume that the ancestor died at an unknown time point $x_0$ and that the time period during which the famine lasted is $A=[a_1,a_2]\subset\mathcal{X}$, where $a_1<a_2$ are known. If $\mathcal{X}$ corresponds to a fairly short time interval of the past, it is reasonable to assume that $P^0$ is the uniform distribution on $\mathcal{X}$.
In the first step of the learning process, suppose radiometric dating of a burial find from the ancestor appears. If δ represents the precision of this dating method, the corresponding σ-algebra is
$\mathcal{G}_1 = \sigma(Y),$  (64)
where Y is defined through $Y(x)=\lfloor (x-t_1)/\delta\rfloor$, and where $\lfloor u\rfloor$ is the integer part of u. Due to (3), it follows that $P_1$ has a density function
$\pi_1(x) = \dfrac{1}{\delta}\sum_{i} w_i\, 1_{\{Y(x)=i\}}$  (65)
for some non-negative probability weights $w_i$ that sum to 1. Since $\mathcal{G}_1=\sigma(Y)$, this measure is constructed from the radiometric dating data $D_1$ of the burial find from the ancestor. The $\mathcal{G}_1$-optimal probability measure is obtained from (8) as
$P_1(B) = P^0\left(B\mid Y=Y(x_0)\right), \quad B\in\mathcal{F}.$
It corresponds to dating the time of death of the ancestor correctly, given the accuracy of this dating method. On the other hand, if the radiometric dating equipment has a systematic error that exceeds δ, a truth-excluding probability measure (10) is obtained with
$P_1(B) = P^0\left(B\mid Y=y\right), \quad y\ne Y(x_0).$  (66)
In the second step of the learning process, suppose data is extended to include a piece of text from a book where the time of death of the ancestor can be found. This extra source of information increases the σ-algebra to $\mathcal{G}_2=\mathcal{F}$, and if the contents of the book are reliable, $P_2$ is $\mathcal{F}$-optimal. It follows from Definition 3 that strong learning has occurred if $(a_1-t_1)/\delta$ and $(a_2-t_1)/\delta$ are integers and
$P^0(A) \le P_1(A) \le P_2(A),$  (67)
with at least one strict inequality. Figure 4 illustrates another scenario where not only strong learning but also strong knowledge acquisition occurs. Suppose now that (66) holds, with a systematic dating error. If the misdated interval $\{x:\ Y(x)=y\}$ is still contained in A, the strong learning condition (67) is satisfied, and the weak knowledge acquisition requirement of Definition 6 holds as well. Strong knowledge acquisition has not occurred though, since $x_0\notin\operatorname{supp}(P_1)$ means that Equation (34) of Definition 4 (with $k=1$) is violated for sufficiently small ϵ. Note in particular that these conclusions about knowledge acquisition cannot be drawn from a purely Bayesian analysis.
Assume now that the contents of the book are not reliable. A probability measure $P_2$ on $\mathcal{F}$ may be chosen so that it incorporates data from the radiometric dating and data from the book. This probability measure will also include information about the way the text of the book is believed to be unreliable. If the agent trusts $D_2$ too much, it may happen that strong learning does not occur.
Figure 4.
Posterior densities $\pi_1$ and $\pi_2$ after one and two steps of the discernment and data collection process of Example 2 when S is true ($f(x_0)=1$). Since $\pi_1$ is measurable with respect to $\mathcal{G}_1$, it is piecewise-constant with step length δ. Note that strong learning and knowledge acquisition occur.
Example 3
(Future events). A youth camp with certain outdoor activities is planned for a weekend. Let $\mathcal{X}=[0,1]^2$ denote the set of possible temperatures of the two days for which the camp is planned, each normalized within a range $[0,1]$. The outdoor activities are only possible within a certain sub-range of temperatures. The proposition
S: “Outdoor activities are possible both days”
corresponds to a truth function $f(x)=1_{\{x\in A\}}$ and a set $A\subset\mathcal{X}$. The leaders have to travel to the camp five days before it starts and then make a decision on whether to bring equipment for the outdoor activities or for some other indoor activities. In the first step they consult weather forecast data $D_1$, with a σ-algebra given by
$\mathcal{G}_1 = \sigma(Y),$
the σ-algebra generated by $Y(x) = \left(\lfloor x^{(1)}/\delta_1\rfloor,\ \lfloor x^{(2)}/\delta_2\rfloor\right)$, where $\delta_1$ and $\delta_2$ represent the maximal possible accuracy of weather forecasts five and six days ahead, respectively, and $x=(x^{(1)},x^{(2)})$. Let $P^0$ be the uniform distribution on $\mathcal{X}$. Due to (3), $P_1$ has a density function
$\pi_1(x) = \sum_{i,j} w_{ij}\,\dfrac{1_{\{x\in R_{ij}\}}}{\delta_1\delta_2}$  (68)
for some non-negative probability weights $w_{ij}$ that sum to 1, with
$R_{ij} = \left[(i-1)\delta_1,\ i\delta_1\right) \times \left[(j-1)\delta_2,\ j\delta_2\right)$
a rectangular region that corresponds to the ith temperature interval the first day of the camp and the jth temperature interval the second day. Consequently, the accuracy of weather forecast data forces $\pi_1$ to be constant over each $R_{ij}$. A $\mathcal{G}_1$-optimal measure assigns full weight 1 to the rectangle $R_{ij}$ that contains the actual temperature $x_0=(x_0^{(1)},x_0^{(2)})$ of the two days. Observe then that the $\mathcal{G}_1$-optimal measure is restricted to measurements that are accurate up to $\delta_1$ and $\delta_2$; therefore, it cannot do better than assigning the temperature to the intervals with sizes $\delta_1$ and $\delta_2$ to which the actual temperatures belong; however, it cannot say what the exact temperature is. The exact prediction requires an $\mathcal{F}$-optimal measure.
In a second step, in order to get some additional information, the leaders of the camp consult a prophet. Let $P_2$ refer to the probability measure based on the weather forecast and the message of the prophet, so that $\mathcal{G}_2=\mathcal{F}$ and $D_2\supset D_1$. If the prophet always speaks the truth, and if the leaders of the camp rely on his message, they will make use of the $\mathcal{F}$-optimal measure $P_2=\delta_{x_0}$, corresponding to weak (and full) learning, and a full amount of knowledge. In general, the camp leaders’ prediction in step k is correct with a probability that depends on the accuracy of $D_k$. If this probability is less than 1 for $k=2$, the reason is either that the prophet does not always speak the truth or that the leaders do not rely solely on the message of the prophet. In particular, it follows from Definition 3 that strong learning has occurred if
$P^0(A) \le P_1(A) \le P_2(A),$  (69)
with at least one strict inequality, when $f(x_0)=1$. Suppose the weather forecast and the message of the prophet are biased, but they still correctly predict whether outdoor activities are possible or not. Then, neither weak nor strong knowledge acquisition occurs, in spite of the fact that the strong learning condition (69) holds. Note in particular that such a conclusion is not possible with a purely Bayesian analysis. Another scenario wherein neither (weak or strong) learning nor knowledge acquisition takes place is depicted in Figure 5.
Figure 5.
Posterior densities $\pi_1$ and $\pi_2$ after one and two steps of data collection for Example 3. Since $f(x_0)=0$, it is not possible to have outdoor activities the first day of the camp. The weather forecast density $\pi_1$ is supported and piecewise-constant on the four rectangles with width $\delta_1$ and height $\delta_2$, corresponding to σ-algebra $\mathcal{G}_1$. The true temperatures $x_0$ are within the support of $\pi_1$. On the other hand, the prophet incorrectly predicts that outdoor activities are possible both days; $\pi_2$ is supported on the ellipse. In this case, neither (weak or strong) learning nor knowledge acquisition takes place.
Example 4
(Replication of studies). Some researchers want to find the prevalence of the physical symptoms of a certain disease. Let refer to the possible set of values for the prevalence of the symptoms, obtained from two different laboratories. The first value corresponds to the prevalence obtained in Laboratory 1, whereas the second value is obtained when Laboratory 2 tries to replicate the study of Laboratory 1. The board members of the company to which the two laboratories belong want to find out whether the two estimates are consistent, within some tolerance level . In that case, the second study is regarded as replication of the first one. The proposition
corresponds to a truth function and
(70) The true value represents the actual prevalences of the symptoms, obtained from the two laboratories under ideal conditions. Importantly, it may still be the case that , if the prevalence of the symptoms changes between the two studies and/or the two laboratories estimate the prevalences within two different subpopulations.
Let be a data set through which Laboratory 2 receives all the data it needs from Laboratory 1 in order to set up its study properly (so that, for instance, affection status is defined in the same way in the two laboratories). We will assume , so that the corresponding σ-algebra
corresponds to full discernment, with being the Borel σ-algebra on , whereas is the Borel σ-algebra on the unit interval (see Remark 1). If is the probability measure obtained from , the probability of concluding that the second study replicated the first is
(71) In particular, when each laboratory makes use of data from all individuals in its subpopulation (which may or may not be the same for the two laboratories), the -optimal probability measure (9) corresponds to
(72) Now consider another scenario where Laboratory 2 only gets partial information from Laboratory 1. This corresponds to a data set with the same sampled individuals as in , but Laboratory 2 has incomplete information from Laboratory 1 regarding the details of how the first study was set up. For this reason, they make use of a coarser σ-algebra, by which it is only possible to quantify prevalence with precision δ. If this σ-algebra is referred to as , it follows that and
The corresponding loss of information is measured through a probability that has the same marginal distribution as for all events B that are discernible from , i.e.,
(73) Hence, it follows from (30) and (73) that
where , , and is the j-th possible region for the prevalence estimate of Laboratory 2. In particular, the probability that the second study replicates the first one is
(74) If both laboratories perform a screening and collect data from all individuals in their regions, so that (72) holds, then is a -optimal measure according to (8), with
(75) and . Making use of Definition 3, we notice that a sufficient condition for strong learning to occur is that has a uniform distribution on (so that ), that (72) and (75) hold, and
With full information transfer between the two laboratories, the replication probabilities (71) and (72) based on data only depend on ε, whereas the corresponding replication probabilities (74) and (75), under incomplete information transfer between the laboratories and data , also depend on δ. In particular, will always be less than 1 when , even when (75) holds and . Moreover, δ sets a limit on how much knowledge can be obtained from the two studies under incomplete information transfer, since
Note that this last conclusion cannot be obtained from a Bayesian analysis, since a true pair of prevalences does not belong to a purely Bayesian framework.
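In the spirit of the replication probabilities (71) and (74), the following Monte Carlo sketch contrasts full information transfer with coarsening to precision δ. The binomial sampling model, the sample sizes, the true prevalences, and the values of ε and δ are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A Monte Carlo sketch of Example 4 under assumed parameters. Each
# laboratory estimates a prevalence from a binomial sample; the second
# study "replicates" the first if the two estimates differ by at most eps.
theta = (0.30, 0.32)     # hypothetical true prevalences
n1, n2 = 500, 500        # assumed sample sizes in the two laboratories
eps = 0.05               # replication tolerance
delta = 0.10             # precision of Laboratory 2 under coarse information

def replication_prob(coarse, trials=100_000):
    t1 = rng.binomial(n1, theta[0], trials) / n1
    t2 = rng.binomial(n2, theta[1], trials) / n2
    if coarse:
        # coarser sigma-algebra: Laboratory 2's estimate is only known up
        # to an interval of width delta; represent it by the interval midpoint
        t2 = (np.floor(t2 / delta) + 0.5) * delta
    return np.mean(np.abs(t1 - t2) <= eps)

print("P(replication), full information transfer:  ",
      replication_prob(coarse=False))
print("P(replication), coarsened to precision delta:",
      replication_prob(coarse=True))
```

The coarsened probability stays bounded away from 1 whenever δ exceeds ε, in line with the limit that δ imposes on knowledge acquisition.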
Example 5
(Unmeasured confounding and causal inference). Let and . An individual is assigned a binary vector of length n, where codes for whether that person will, within five years, have symptoms associated with a certain mental disorder () or not (). The first component refers to the individual’s binary exposure, whereas the other variables are binary confounders. The truth function corresponds to symptom status, whereas
represents the vectors x of all individuals in the population with symptoms. Consider the proposition
and let be the vector associated with Adam. We will introduce a sequence of probability measures , where represents the distribution of in the whole population, whereas corresponds to the conditional distribution of , given that its first k covariates have been observed, with values equal to those of Adam’s first k covariates. Since the conditional distribution is non-random, it follows that
(76) for , whereas for . According to Definition 5, this implies that weak learning occurs with probability 1, and in particular that weak learning is expected to occur. If , we have that
(77) for . Note, in particular, that is -optimal, corresponding to error-free measurement of Adam’s first k covariates.
In order to specify the null distribution , we assume that a logistic regression model [48]
(78) holds for the probability of having the symptoms within five years, conditionally on the covariates (one exposure and confounders). It is also assumed that the regression parameters are known, so that g is known as well. It follows from Equations (76) and (78) that
(79) can be interpreted as increasingly better predictions of Adam’s symptom status five years ahead, for , whereas represents full knowledge of S. In particular, is the prevalence of the symptoms in the whole population, whereas is Adam’s predicted probability of having the symptoms when his exposure is known but none of his confounders are measured.
Suppose are sufficient for confounding control, and that the exposure and the confounders can (in principle) be assigned. Let represent a hypothetical individual for which all covariates are assigned. Under a so-called conditional exchangeability condition [16], it is possible to use a slightly different definition
of the probability measures in order to compute the counterfactual probability
of the potential outcome , under the scenario that the first k covariates were set to . In particular, it is of interest to know how much the unknown causal risk ratio of the exposure can maximally differ from the known risk ratio [49,50,51,52]. Note in particular that the corresponding logged quantities
can be expressed in terms of the active information.
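A numerical sketch of Example 5 may help fix ideas. Below, a logistic model with hypothetical regression parameters and covariate frequencies is used to compute Adam's successive symptom predictions p_k, together with the logged ratio against the population prevalence p_0 in the role of active information; every number is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# A sketch of Example 5 with assumed parameters. The logistic model
# g(x) = P(symptoms | covariates x) is taken as known; p_k is Adam's
# predicted symptom probability after conditioning on his first k
# covariates, marginalizing the rest over the population distribution.
beta0 = -1.0
beta = np.array([1.2, 0.8, -0.5, 0.6])   # hypothetical: exposure + 3 confounders
adam = np.array([1, 1, 0, 1])            # Adam's covariate vector (assumed)
marg = np.array([0.4, 0.5, 0.3, 0.2])    # assumed P(covariate i = 1) in population

def g(x):
    return 1.0 / (1.0 + np.exp(-(beta0 + x @ beta)))

def p_k(k, draws=200_000):
    # condition on Adam's first k covariates, sample the remaining ones
    x = (rng.random((draws, 4)) < marg).astype(float)
    x[:, :k] = adam[:k]
    return g(x).mean()

p = [p_k(k) for k in range(5)]            # p[0] = population prevalence
for k, pk in enumerate(p):
    # logged ratio of the k-th prediction to the ignorant baseline p_0,
    # assuming the proposition "Adam develops symptoms" is true
    print(f"k={k}: p_k={pk:.3f}, log(p_k/p_0) = {np.log(pk / p[0]):+.3f}")
```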
9. Discussion
In this paper, we studied an agent’s learning and knowledge acquisition within a mathematical framework of possible worlds. Learning is interpreted as an increased degree of true belief, whereas knowledge acquisition additionally requires that the belief is justified, corresponding to an increased belief in the correct world. The theory is put into a framework that involves elements of frequentism and Bayesianism, with possible worlds corresponding to the parameters of a statistical model, where only one parameter value is true, whereas the agent’s beliefs are obtained from a posterior distribution. We formulated learning as a hypothesis test within this framework, whereas knowledge acquisition corresponds to consistency of posterior distributions. Importantly, we argue that a hybrid frequentist/Bayesian approach is needed in order to model mathematically the way in which philosophers distinguish learning from knowledge acquisition.
Some applications of our theory were provided in the examples of Section 8. Beyond those, we argue that our framework has quite general implications for machine learning, in particular supervised learning. A typical task of machine learning is to obtain a predictor of a binary outcome variable , when only incomplete information X of is obtained from training data. The performance of a machine learning algorithm is typically assessed in terms of prediction accuracy, that is, how well approximates Y, with less focus on the closeness between X and . In our terminology, the purpose of machine learning is learning rather than knowledge acquisition. This can be a disadvantage, since knowledge acquisition often provides deeper insights than learning. For instance, full knowledge acquisition may fail asymptotically when , even when data are unbiased and interpreted correctly by the agent, if discernment between the set of possible worlds is lacking, even in the limit .
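The following toy sketch (an assumed setup, not taken from the paper) illustrates the last point: two possible worlds that differ only through an unmeasured variable induce the same distribution of the observed pairs (X, Y), so a predictor attains the best possible accuracy while the posterior over worlds never leaves the prior.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two hypothetical worlds x_A and x_B differ only through an unmeasured
# variable, so both imply the same observable rates P(Y=1|X). Prediction
# then succeeds while the posterior over worlds never moves: learning
# without knowledge acquisition.
n = 5_000
X = rng.integers(0, 2, n)
p1 = np.where(X == 1, 0.8, 0.2)          # P(Y=1|X), identical in both worlds
Y = (rng.random(n) < p1).astype(int)

# Bayesian update over the two worlds: identical likelihoods keep the
# posterior equal to the prior, however large n becomes.
loglik = np.sum(np.where(Y == 1, np.log(p1), np.log(1 - p1)))
logliks = np.array([loglik, loglik])     # same in world x_A and world x_B
post = np.exp(logliks - logliks.max()) * np.array([0.5, 0.5])
post /= post.sum()

accuracy = np.mean((p1 > 0.5) == Y)
print(f"prediction accuracy: {accuracy:.3f}; posterior over worlds: {post}")
```

The posterior stays at (0.5, 0.5) no matter how much data arrives, while accuracy approaches the optimal 0.8.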
On the other hand, it makes no sense to go beyond learning in game theory, where the purpose is to find the optimal strategy (an instance of knowledge how). In more detail, let refer to the strategy x of a player among a finite set of M possible strategies. The optimal strategy is the one that maximizes an expected reward function for the actions taken with strategy x; D refers to data from previous games that a player makes use of to estimate ; and represents the player’s maximal possible discernment between strategies. Since the objective is to find the optimal strategy, it is natural to use a truth function
(80)
with the associated set of true worlds corresponding to the upper row of (1). It follows from Remark 7 that learning and knowledge acquisition are equivalent for game theory whenever (80) is used. Various algorithms, such as reinforcement learning [53] and sequential sampling models [54,55], could be used by a player in order to generate his beliefs P about which strategy is the best.
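As a sketch of how such beliefs P might be generated, the following code assumes Bernoulli rewards with hypothetical mean values, places independent Beta posteriors on them, and estimates the probability that each strategy is optimal by Monte Carlo; this is one simple stand-in for the algorithms cited above, not a method from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Belief formation over M strategies (assumed setup): the reward of
# strategy m is Bernoulli(mu_m), D consists of past plays, and the belief
# that strategy m is optimal is computed from independent Beta posteriors
# by Monte Carlo, in the spirit of the truth function (80).
M = 3
mu = np.array([0.40, 0.55, 0.50])        # hypothetical true mean rewards
plays = np.array([200, 200, 200])        # observed games per strategy

wins = rng.binomial(plays, mu)
# Beta(1 + wins, 1 + losses) posterior for each strategy's mean reward
post = rng.beta(1 + wins[:, None], 1 + (plays - wins)[:, None],
                size=(M, 100_000))
belief_best = np.bincount(post.argmax(axis=0), minlength=M) / 100_000
print("P(strategy m is optimal | D):", belief_best)
```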
Many extensions of our work are possible. A first extension would be to generalize the framework of Theorem 1 and Example 1, where data are collected sequentially according to a Markov process with increasing state space, without requiring that are independent and identically distributed. We will mention two related models to which this framework applies. For both of these models, a student’s mastery of q skills (which represent knowledge how rather than knowledge that) is of interest. More specifically, is a binary sequence of length q, with or 0 depending on whether the student has acquired skill i or not, whereas corresponds to the exercises that are given to the student up to time k, together with the student’s answers to these exercises. It is also known which skills are required to solve each type of exercise.

The first model is Bayesian knowledge tracing (BKT) [56], which has recently been analyzed using recurrent neural networks [1]. In BKT, a tutor trains the student to learn the q skills, so that the student’s learning profile changes over time. The tutor is free to choose the exercises at time k based on previous exercises and on what the student has learnt up to time . The goal of the tutoring is to reach a state where the student has learned all skills. The most restrictive truth function (80) monitors whether the student has learned all skills or not, so that is the probability that the student has learnt all skills at time k. In view of Remark 7, there is no distinction between learning and knowledge acquisition for such a truth function. A less restrictive truth function focuses on whether the student has learnt skill i or not, so that is the probability that the student has learnt skill i at time k. The second model, the Bayesian version of Diagnostic Classification Models (DCMs) [57], can be viewed as an extension of Illustration 1. The purpose of DCMs is not to train the student (as in knowledge tracing), but rather to diagnose the student’s (or respondent’s) current vector , where or 0 depending on whether this particular student masters skill (or attribute) i or not. The exercises of a DCM are usually referred to as items. Assuming the truth function (80), is the probability that the diagnostic test has, by time k, identified which attributes the student masters. Note in particular that the student’s attribute mastery profile is fixed; it is rather the instructor who learns about as the student is tested on new items. A sketch of the BKT update is given below.
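For concreteness, a minimal single-skill BKT update follows, with hypothetical initial, learning, slip, and guess probabilities; the tutor's belief that the skill is mastered is revised after each observed answer and then propagated by the chance of learning between exercises.

```python
# A minimal Bayesian knowledge tracing (BKT) sketch for a single skill,
# with hypothetical slip/guess/learning parameters. p_L is the tutor's
# belief P(skill mastered), updated from each observed answer and then
# pushed forward by the chance of learning between exercises.
p_init, p_learn, p_slip, p_guess = 0.2, 0.15, 0.1, 0.25

def bkt_update(p_L, correct):
    if correct:       # posterior given a correct answer
        obs = p_L * (1 - p_slip) / (p_L * (1 - p_slip) + (1 - p_L) * p_guess)
    else:             # posterior given an incorrect answer
        obs = p_L * p_slip / (p_L * p_slip + (1 - p_L) * (1 - p_guess))
    return obs + (1 - obs) * p_learn      # transition: skill may be learnt

p_L = p_init
for k, ans in enumerate([0, 1, 1, 1, 1], start=1):   # assumed answer stream
    p_L = bkt_update(p_L, ans)
    print(f"after exercise {k}: P(skill mastered) = {p_L:.3f}")
```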
A second extension would be to consider opinion making and consensus formation [58] for a whole group of N agents that are connected in a social network. In this context, represents the maximal discernment between possible worlds that can be achieved after k time steps, based on external data (available to all agents) and information from other agents (which varies between agents and depends on properties of the social network). It is of interest in this context to study the dynamics of over time, where represents the belief of agent (or individual) i in proposition S after k time steps. This can be accomplished using a dynamical Bayesian network [59] with N nodes that represent individuals, associating each node i with a distribution over the set of possible worlds , corresponding to the beliefs of agent i at time k. A particularly interesting example in this vein would be to explore the degree to which social media and social networks influence learning and knowledge acquisition.
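As one simple stand-in for such dynamics, the following DeGroot-style sketch (an assumed model, much simpler than a dynamical Bayesian network) lets each agent repeatedly replace its belief in S by a weighted average of its neighbours' beliefs over a hypothetical four-agent network.

```python
import numpy as np

# DeGroot-style consensus sketch (assumed model): agent i's belief in
# proposition S is repeatedly replaced by a weighted average of the
# beliefs of its neighbours in a fixed, row-stochastic network W.
W = np.array([[0.6, 0.4, 0.0, 0.0],
              [0.2, 0.6, 0.2, 0.0],
              [0.0, 0.2, 0.6, 0.2],
              [0.0, 0.0, 0.4, 0.6]])     # hypothetical 4-agent network
beliefs = np.array([0.9, 0.7, 0.3, 0.1]) # initial beliefs in S

for k in range(20):
    beliefs = W @ beliefs
print("beliefs after 20 steps:", np.round(beliefs, 3))
```

The beliefs converge to a common value, illustrating how network structure drives consensus formation.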
The third possible extension is related to consensus formation, but with a more explicit focus on how N decentralized agents make a collective decision. In order to illustrate this, we first describe a related model of cognition in bacterial populations. Marshall [60] has concluded that “the direction of causation in biology is cognition → code → chemicals”. Cognition is observed when there is a discernment and data collection process that either optimizes code or improves the probability of a given chemical outcome. Accordingly, the strong learning process of Definition 3 can be used to model how biological cognition is attained (or at least is expected to be attained). For instance, in quorum sensing, bacteria emit a chemical signal in order to ascertain the number of neighboring bacteria [61]; once a critical density is reached, the population performs certain functions as a unit (Table 1 of [62] presents several examples of bacterial functions partially controlled by quorum sensing). The proposition under consideration here is S: “the function is performed by at least a fraction of bacteria”, where represents a critical density above which the bacteria act as a unit. The parameter is a binary sequence reflecting the way in which a population of bacteria acts, so that if bacterium i performs the function, whereas otherwise. For collective decisions, rather represents a local decision of agent i, whereas corresponds to the global decision of all agents. Learning about S at time is described by , the probability that the population acts as a unit at time k. There is a phase transition at time k if the probabilities of the population acting as a unit are essentially null, whereas becomes positive (and hence gets large) when discernment ability and data are extended from to . This is closely related to the fine-tuning of biological systems [33,35], with f a specificity function and A a set of highly specified states; fine-tuning after k steps of an algorithm that models the evolution of the system then corresponds to being large. As for the direction of causation from cognition to code, Kolmogorov complexity, which measures the complexity of an outcome as the shortest code that produces it, can be used in place of, or jointly with, active information to measure learning [63].
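The phase transition can be sketched numerically under assumed numbers: if each of N bacteria performs the function independently with probability p_k, then the belief in S is a binomial tail probability, and a modest increase of p_k across the critical fraction q makes it jump from near 0 to near 1.

```python
import numpy as np
from scipy.stats import binom

# Quorum-sensing phase transition sketch (assumed numbers): each of N
# bacteria performs the function independently with probability p_k, and
# S holds when at least a fraction q of them do, so the belief in S is a
# binomial tail probability.
N, q = 1000, 0.5
for p_k in (0.40, 0.48, 0.52, 0.60):     # hypothetical per-cell rates
    p_S = binom.sf(int(np.ceil(q * N)) - 1, N, p_k)   # P(at least q*N cells)
    print(f"p_k = {p_k:.2f}: belief in S = {p_S:.4f}")
```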
A fourth theoretical extension is to consider the case . In this case, instead of the (discrete or continuous) uniform distribution given by (2), it will be necessary to consider more general maximum entropy distributions , subject to some restrictions, in order to measure learning and knowledge acquisition [20,64,65,66].
A fifth extension is to consider models where the data sets are not nested. This is of interest, for instance, in Example 5, when non-nested subsets of confounders are used to predict Adam’s disease status. For such scenarios, it might be preferable to use information-based model selection criteria (such as maximizing AIN) in order to quantify learning [67], rather than sequentially testing various pairs of nested hypotheses by means of
in order to assess whether learning has occurred in each step k (corresponding to strong learning of Definition 3).
A sixth extension would be to compare the proposed Bayesian/frequentist notions of learning and knowledge acquisition, with purely frequentist counterparts. Since learning corresponds to choosing between the two hypotheses in (18), we may consider a test that rejects the null hypothesis when the log likelihood ratio is small enough, or equivalently, when
(81)
for some appropriately chosen threshold t. The frequentist notion of learning is then formulated in terms of error probabilities of types I and II, analogously to (20), but for the LR test (81) rather than the Bayesian/frequentist test (19) with test statistic AIN, or the purely Bayesian approach that relies on posterior odds (21). A frequentist version of knowledge acquisition corresponds to using data D in order to produce a one-dimensional class of confidence regions for , whose nominal coverage probability varies. In order to quantify how much knowledge is acquired, it is possible to use the steepness of a curve that plots the actual coverage probability as a function of the volume . A disadvantage of the frequentist versions of learning and knowledge acquisition, however, is that they do not involve degrees of belief, the philosophical starting point of this article. This is related to the critique of frequentist hypothesis testing offered in [68]. Since no prior probabilities are allowed within a frequentist setting, important notions such as the false report probability (FRP) and the true report probability (TRP) are not computable, leading to many non-replicated findings.
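A minimal simulation of the LR test (81) is sketched below for a coin-tossing setting with assumed parameters (null hypothesis theta = 0.5 against alternative theta = 0.7, n = 50 tosses, threshold t = 0); the type I and type II error rates are estimated by simulating the number of heads under each hypothesis.

```python
import numpy as np

rng = np.random.default_rng(4)

# Frequentist LR test sketch for (81) under assumed parameters:
# H0: theta = 0.5 versus H1: theta = 0.7 for a coin tossed n times.
# The test rejects H0 when the log likelihood ratio log(L0/L1) falls
# below the threshold t.
n, t = 50, 0.0
theta0, theta1 = 0.5, 0.7

def log_lr(heads):
    l0 = heads * np.log(theta0) + (n - heads) * np.log(1 - theta0)
    l1 = heads * np.log(theta1) + (n - heads) * np.log(1 - theta1)
    return l0 - l1

h0 = rng.binomial(n, theta0, 100_000)    # tosses simulated under H0
h1 = rng.binomial(n, theta1, 100_000)    # tosses simulated under H1
print("type I error  P(reject | H0):", np.mean(log_lr(h0) < t))
print("type II error P(accept | H1):", np.mean(log_lr(h1) >= t))
```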
A seventh extension is to consider multiple propositions , as in [69,70]. For each possible world , we let be a truth function such that , with (0) if is true (false) in world x. It is then of interest to develop a theory of learning and knowledge acquisition of these m propositions. To this end, for each , let refer to the set of worlds for which the truth value of is for . Learning is then a matter of determining which is true (the one for which ), whereas justified true beliefs in amount to finding as well. Learning of statements such as and can be addressed using the theory of this paper, since they correspond to binary-valued truth functions and , respectively.
Acknowledgments
The authors wish to thank Glauco Amigo at Baylor University for his help with producing Figure 3. We also appreciate the comments of three anonymous reviewers, which made it possible to improve the quality of the paper considerably.
Author Contributions
Conceptualization and methodology: O.H., D.A.D.-P., J.S.R.; writing—original draft preparation: O.H.; writing—review and editing: D.A.D.-P., J.S.R. All authors have read and agreed to the published version of the manuscript.
Conflicts of Interest
The authors declare no conflict of interest.
Funding Statement
This research received no external funding.
Footnotes
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Piech C., Bassen J., Huang J., Ganguli S., Sahami M., Guibas L.J., Sohl-Dickstein J. Deep Knowledge Tracing; Proceedings of the Neural Information Processing Systems (NIPS) 2015; Montreal, QC, Canada. 7–12 December 2015; pp. 505–513. [Google Scholar]
- 2.Pavese C. Knowledge How. In: Zalta E.N., editor. The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University; Stanford, CA, USA: 2021. [Google Scholar]
- 3.Agliari E., Pachón A., Rodríguez P.M., Tavani F. Phase transition for the Maki-Thompson rumour model on a small-world network. J. Stat. Phys. 2017;169:846–875. doi: 10.1007/s10955-017-1892-x. [DOI] [Google Scholar]
- 4.Lyons R., Peres Y. Probability on Trees and Networks. Cambridge University Press; Cambridge, UK: 2016. [Google Scholar]
- 5.Watts D.J., Strogatz S.H. Collective dynamics of ‘small-world’ networks. Nature. 1998;393:440–442. doi: 10.1038/30918. [DOI] [PubMed] [Google Scholar]
- 6.Embretson S.E., Reise S.P. Item Response Theory for Psychologists. Psychology Press; New York, NY, USA: 2000. [Google Scholar]
- 7.Stevens S.S. On the Theory of Scales of Measurement. Science. 1946;103:677–680. doi: 10.1126/science.103.2684.677. [DOI] [PubMed] [Google Scholar]
- 8.Thompson B. Exploratory and Confirmatory Factor Analysis: Understanding Concepts and Applications. American Psychological Association; Washington, DC, USA: 2004. [Google Scholar]
- 9.Gettier E.L. Is Justified True Belief Knowledge? Analysis. 1963;23:121–123. doi: 10.1093/analys/23.6.121. [DOI] [Google Scholar]
- 10.Ichikawa J.J., Steup M. The Analysis of Knowledge. In: Zalta E.N., editor. The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University; Stanford, CA, USA: 2018. [Google Scholar]
- 11.Hájek A. Probability, Logic, and Probability Logic. In: Goble L., editor. The Blackwell Guide to Philosophical Logic. Blackwell; Hoboken, NJ, USA: 2001. pp. 362–384. Chapter 16. [Google Scholar]
- 12.Demey L., Kooi B., Sack J. Logic and Probability. In: Zalta E.N., editor. The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University; Stanford, CA, USA: 2019. [Google Scholar]
- 13.Hájek A. Interpretations of Probability. In: Zalta E.N., editor. The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University; Stanford, CA, USA: 2019. [Google Scholar]
- 14.Savage L. The Foundations of Statistics. Wiley; Hoboken, NJ, USA: 1954. [Google Scholar]
- 15.Swinburne R. Epistemic Justification. Oxford University Press; Oxford, UK: 2001. [Google Scholar]
- 16.Pearl J. Causality: Models, Reasoning and Inference. 2nd ed. Cambridge University Press; Cambridge, UK: 2009. [Google Scholar]
- 17.Berger J. Statistical Decision Theory and Bayesian Analysis. 2nd ed. Springer; New York, NY, USA: 2010. [Google Scholar]
- 18.Dembski W.A., Marks R.J., II Bernoulli’s Principle of Insufficient Reason and Conservation of Information in Computer Search; Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics; San Antonio, TX, USA. 11–14 October 2009; pp. 2647–2652. [DOI] [Google Scholar]
- 19.Dembski W.A., Marks R.J., II Conservation of Information in Search: Measuring the Cost of Success. IEEE Trans. Syst. Man Cybern.-Part A Syst. Hum. 2009;39:1051–1061. doi: 10.1109/TSMCA.2009.2025027. [DOI] [Google Scholar]
- 20.Díaz-Pachón D.A., Marks R.J., II Generalized active information: Extensions to unbounded domains. BIO-Complexity. 2020;2020:1–6. doi: 10.5048/BIO-C.2020.3. [DOI] [Google Scholar]
- 21.Shafer G. Belief functions and parametric models. J. R. Stat. Soc. Ser. B. 1982;44:322–352. doi: 10.1111/j.2517-6161.1982.tb01211.x. [DOI] [Google Scholar]
- 22.Wasserman L. Prior envelopes based on belief functions. Ann. Stat. 1990;18:454–464. doi: 10.1214/aos/1176347511. [DOI] [Google Scholar]
- 23.Dubois D., Prade H. Belief functions and parametric models. Int. J. Approx. Reason. 1992;6:295–319. doi: 10.1016/0888-613X(92)90027-W. [DOI] [Google Scholar]
- 24.Denoeux T. Decision-making with belief functions: A review. Int. J. Approx. Reason. 2019;109:87–110. [Google Scholar]
- 25.Hopkins E. Two competing models of how people learn in games. Econometrica. 2002;70:2141–2166. doi: 10.1111/1468-0262.00372. [DOI] [Google Scholar]
- 26.Stoica G., Strack B. Acquired knowledge as a stochastic process. Surv. Math. Appl. 2017;12:65–70. [Google Scholar]
- 27.Taylor C.M. Ph.D. Thesis. University of Virginia; Charlottesville, VA, USA: 2002. A Mathematical Model for Knowledge Acquisition. [Google Scholar]
- 28.Popper K. The Logic of Scientific Discovery. Hutchinson; London, UK: 1968. [Google Scholar]
- 29.Jaynes E.T. Prior Probabilities. IEEE Trans. Syst. Sci. Cybern. 1968;4:227–241. doi: 10.1109/TSSC.1968.300117. [DOI] [Google Scholar]
- 30.Hössjer O. Modeling decision in a temporal context: Analysis of a famous example suggested by Blaise Pascal. In: Hasle P., Jakobsen D., Øhrstrøm P., editors. The Metaphysics of Time, Themes from Prior. Logic and Philosophy of Time. Volume 4. Aalborg University Press; Aalborg, Denmark: 2020. pp. 427–453. [Google Scholar]
- 31.Kowner R. Nicholas II and the Japanese body: Images and decision-making on the eve of the Russo-Japanese War. Psychohist. Rev. 1998;26:211–252. [Google Scholar]
- 32.Hössjer O., Díaz-Pachón D.A., Chen Z., Rao J.S. Active information, missing data, and prevalence estimation. arXiv. 2022; arXiv:2206.05120. doi: 10.48550/arXiv.2206.05120. [DOI] [Google Scholar]
- 33.Díaz-Pachón D.A., Hössjer O. Assessing, testing and estimating the amount of fine-tuning by means of active information. Entropy. 2022;24:1323. doi: 10.3390/e24101323. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Szostak J.W. Functional information: Molecular messages. Nature. 2003;423:689. doi: 10.1038/423689a. [DOI] [PubMed] [Google Scholar]
- 35.Thorvaldsen S., Hössjer O. Using statistical methods to model the fine-tuning of molecular machines and systems. J. Theor. Biol. 2020;501:110352. doi: 10.1016/j.jtbi.2020.110352. [DOI] [PubMed] [Google Scholar]
- 36.Díaz-Pachón D.A., Sáenz J.P., Rao J.S. Hypothesis testing with active information. Stat. Probab. Lett. 2020;161:108742. doi: 10.1016/j.spl.2020.108742. [DOI] [Google Scholar]
- 37.Montañez G.D. A Unified Model of Complex Specified Information. BIO-Complexity. 2018;2018:1–26. doi: 10.5048/BIO-C.2018.4. [DOI] [Google Scholar]
- 38.Yik W., Serafini L., Lindsey T., Montañez G.D. Identifying Bias in Data Using Two-Distribution Hypothesis Tests; Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society; Oxford, UK. 19–21 May 2021; New York, NY, USA: ACM; 2022. pp. 831–844. [DOI] [Google Scholar]
- 39.Kallenberg O. Foundations of Modern Probability. 3rd ed. Volume 1 Springer; New York, NY, USA: 2021. [Google Scholar]
- 40.Ghosal S., van der Vaart A. Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press; Cambridge, UK: 2017. [Google Scholar]
- 41.Shen W., Tokdar S.T., Ghosal S. Adaptive Bayesian multivariate density estimation with Dirichlet mixtures. Biometrika. 2013;100:623–640. doi: 10.1093/biomet/ast015. [DOI] [Google Scholar]
- 42.Barron A.R. Uniformly Powerful Goodness of Fit Tests. Ann. Stat. 1989;17:107–124. doi: 10.1214/aos/1176347005. [DOI] [Google Scholar]
- 43.Freedman D.A. On the Asymptotic Behavior of Bayes’ Estimates in the Discrete Case. Ann. Math. Stat. 1963;34:1386–1403. doi: 10.1214/aoms/1177703871. [DOI] [Google Scholar]
- 44.Cam L.L. Convergence of Estimates Under Dimensionality Restrictions. Ann. Stat. 1973;1:38–53. doi: 10.1214/aos/1193342380. [DOI] [Google Scholar]
- 45.Schwartz L. On Bayes procedures. Z. Wahrscheinlichkeitstheorie Verw Geb. 1965;4:10–26. doi: 10.1007/BF00535479. [DOI] [Google Scholar]
- 46.Cam L.L. Asymptotic Methods in Statistical Decision Theory. Springer; New York, NY, USA: 1986. [Google Scholar]
- 47.Lehmann E.L., Casella G. Theory of Point Estimation. 2nd ed. Springer; New York, NY, USA: 1998. [Google Scholar]
- 48.Agresti A. Categorical Data Analysis. 3rd ed. Wiley; Hoboken, NJ, USA: 2013. [Google Scholar]
- 49.Robins J.M. The analysis of Randomized and Nonrandomized AIDS Treatment Trials Using A New Approach to Causal Inference in Longitudinal Studies. In: Sechrest L., Freeman H., Mulley A., editors. Health Service Research Methodology: A Focus on AIDS. U.S. Public Health Service, National Center for Health Services Research; Washington, DC, USA: 1989. pp. 113–159. [Google Scholar]
- 50.Manski C.F. Nonparametric Bounds on Treatment Effects. Am. Econ. Rev. 1990;80:319–323. [Google Scholar]
- 51.Ding P., VanderWeele T.J. Sensitivity Analysis Without Assumptions. Epidemiology. 2016;27:368–377. doi: 10.1097/EDE.0000000000000457. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Sjölander A., Hössjer O. Novel bounds for causal effects based on sensitivity parameters on the risk difference scale. J. Causal Inference. 2021;9:190–210. doi: 10.1515/jci-2021-0024. [DOI] [Google Scholar]
- 53.Sutton R.S., Barto A.G. Reinforcement Learning: An Introduction. MIT Press; Cambridge, MA, USA: 1998. [Google Scholar]
- 54.Ratcliff R., Smith P.L. A Comparison of Sequential Sampling Models for Two-Choice Reaction Time. Psychol. Rev. 2004;111:333–367. doi: 10.1037/0033-295X.111.2.333. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Chen W.J., Krajbich I. Computational modeling of epiphany learning. Proc. Natl. Acad. Sci. USA. 2017;114:4637–4642. doi: 10.1073/pnas.1618161114. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Corbett A.T., Anderson J.R. Knowledge Tracing: Modeling the Acquisition of Procedural Knowledge. User Model. User-Adapt. Interact. 1995;4:253–278. doi: 10.1007/BF01099821. [DOI] [Google Scholar]
- 57.Oka M., Okada K. Assessing the Performance of Diagnostic Classification Models in Small Sample Contexts with Different Estimation Methods. arXiv. 2022; arXiv:2104.10975. [Google Scholar]
- 58.Hirscher T. Ph.D. Thesis. Division of Mathematics, Department of Mathematical Sciences, Chalmers University of Technology; Gothenburg, Sweden: 2014. Consensus Formation in the Deffuant Model. [Google Scholar]
- 59.Murphy K.P. Ph.D. Thesis. University of California; Berkeley, CA, USA: 2002. Dynamic Bayesian Networks: Representation, Inference and Learning. [Google Scholar]
- 60.Marshall P. Biology transcends the limits of computation. Prog. Biophys. Mol. Biol. 2021;165:88–101. doi: 10.1016/j.pbiomolbio.2021.04.006. [DOI] [PubMed] [Google Scholar]
- 61.Atkinson S., Williams P. Quorum sensing and social networking in the microbial world. J. R. Soc. Interface. 2009;6:959–978. doi: 10.1098/rsif.2009.0203. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Shapiro J.A. All living cells are cognitive. Biochem. Biophys. Res. Commun. 2020;564:134–149. doi: 10.1016/j.bbrc.2020.08.120. [DOI] [PubMed] [Google Scholar]
- 63.Ewert W., Dembski W., Marks R.J., II Algorithmic Specified Complexity in the Game of Life. IEEE Trans. Syst. Man Cybern. Syst. 2015;45:584–594. doi: 10.1109/TSMC.2014.2331917. [DOI] [Google Scholar]
- 64.Díaz-Pachón D.A., Hössjer O., Marks R.J., II Is Cosmological Tuning Fine or Coarse? J. Cosmol. Astropart. Phys. 2021;2021:020. doi: 10.1088/1475-7516/2021/07/020. [DOI] [Google Scholar]
- 65.Díaz-Pachón D.A., Hössjer O., Marks R.J., II Sometimes size does not matter. arXiv. 2022; arXiv:2204.11780. [Google Scholar]
- 66.Zhao X., Plata G., Dixit P.D. SiGMoiD: A super-statistical generative model for binary data. PLoS Comput. Biol. 2021;17:e1009275. doi: 10.1371/journal.pcbi.1009275. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Stephens P.A., Buskirk S.W., Hayward G.D., del Río C.M. Information theory and hypothesis testing: A call for pluralism. J. Appl. Ecol. 2005;42:4–12. doi: 10.1111/j.1365-2664.2005.01002.x. [DOI] [Google Scholar]
- 68.Szucs D., Ioannidis J.P.A. When Null Hypothesis Significance Testing Is Unsuitable for Research: A Reassessment. Front. Hum. Neurosci. 2017;11:390. doi: 10.3389/fnhum.2017.00390. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Cox R.T. The Algebra of Probable Inference. Johns Hopkins University Press; Baltimore, MD, USA: 1961. [Google Scholar]
- 70.Jaynes E.T. Probability Theory: The Logic of Science. Cambridge University Press; Cambridge, UK: 2003. [DOI] [Google Scholar]