Skip to main content
Entropy logoLink to Entropy
. 2018 Oct 23;20(11):815. doi: 10.3390/e20110815

The Reconciliation of Multiple Conflicting Estimates: Entropy-Based and Axiomatic Approaches

João F D Rodrigues 1,*, Michael L Lahr 2
PMCID: PMC7512377  PMID: 33266539

Abstract

When working with economic accounts it may occur that multiple estimates of a single datum exist, with different degrees of uncertainty or data quality. This paper addresses the problem of defining a method that can reconcile conflicting estimates, given best guess and uncertainty values. We proceeded from first principles, using two different routes. First, under an entropy-based approach, the data reconciliation problem is addressed as a particular case of a wider data balancing problem, and an alternative setting is found in which the multiple estimates are replaced by a single one. Afterwards, under an axiomatic approach, a set of properties is defined, which characterizes the ideal data reconciliation method. Under both approaches, the conclusion is that the formula for the reconciliation of best guesses is a weighted arithmetic average, with the inverse of uncertainties as weights, and that the formula for the reconciliation of uncertainties is a harmonic average.

Keywords: uncertainty modelling, economic accounts, conflicting estimates, entropy-based approach, axiomatix approach

1. Introduction

With improvements in information technology, the world has become more unified and interconnected. Information is now typically shared quickly and easily from all over the globe, such that barriers formed by linguistic and geographic boundaries essentially have been torn down. This has enabled people from disparate cultures and backgrounds to share ideas and information. One outcome of this regime change has been a boosting of the perceived benefits of statistical information. While some benefits of such statistical information have been known since at least Quetelet’s (1835) tome on so-called “social physics” was published, today’s massive socio-economic statistical repositories in Europe, North America, and East Asia are enabling a data revolution of sorts. Indeed, the fields of data mining and data analytics are fast becoming important fields of academic study. Mirroring the rise of data availability and the nature of some of the data itself, the term “big data” has been coined [1] to refer to the extremely voluminous and complex data sets that require specialized processing application software to deal with them.

The most prominent stewards of socio-economic data are government statistical agencies, which focus on producing and disseminating data products secured via surveys (for example the American Community Survey), censuses (such as Japan’s 2015 Population Census), and administrative procedures (like information needed to get an academic promotion in Spain). As a result, data storage is now ubiquitously electronic, replicated offsite to guard against storage failure, and measured in petabytes. Electronic storage enables low-cost dissemination of data. It also facilitates the integration of records across disparate databases—for example, into a system of national accounts, which is what countries use to generate their estimates of gross domestic product, as suggested by the United Nations. Both lead to concerns about confidentiality of data and how it can be protected [2].

Our point in broaching the above is that data producers, disseminators, and users alike can run into the problem of having access to multiple estimates for a single quantity of interest. In the particular experience of the authors, which motivated the present study, these multiple estimates are a consequence of non-disclosure by a statistical office in order to ensure confidentiality. Hence we act as what Duncan et al. [2] called “data snooper”. For concreteness, in such instances we are interested in obtaining number of employees at county level from the U.S. Bureau of Statistics’s Quarterly Census of Employment and Wages (QCEW) data and U.S. Bureau of the Census’s County Business Patterns. In both datasets these figures are suppressed for selected sectors in some counties and even states. But information is provided for larger spatial and sectoral units, so it is possible to use this higher-level information to obtain multiple estimates of the quantities of interest. It is common for official statistical data to have a hierarchical structure so this problem is quite general. Garfinkel et al. [3] note that the increasing ability of data snoopers is making ever more data stewards reluctant to provide certain data products because they are finding it increasingly difficult to ensure confidentiality to the agents from whom they obtain the data. This is despite some use of noise as a disclosure limitation [4].

In this paper, we focus on the problem of combining such multiple estimates into a single value. In the case of economic accounts, Miller and Blair [5] (pp. 384–386) have called it “the reconciliation issue”. The reconciliation issue considered here should not be confused with the more general problem of data balancing, in which a set of multiple data points need to satisfy a set of constraints: That problem is addressed in other studies, such as Kruithof [6], Stone et al. [7], Byron [8], Van Der Ploeg [9], Lahr and Mesnard [10], Chen [11]. General solutions to confidentiality disclosure or data censoring issues are provided by [12,13]. Herein we set out to assist current and future data snoopers and miners, by identifying what a data reconciliation method should be from first principles when a fairly general formulation of the reconciliation constraints is possible.

In particular, we consider that the multiple estimates for a particular datum can be characterized by a best guess and uncertainty. If we interpret each estimate as a random variable with an underlying probability distribution, the best guess is the expected value and the uncertainty is standard deviation. In the case of multiple data sources, the conflict enabling the multiple estimates is self-evident. When numbers are published with some data censored and for which estimates can be obtained using partial information [14], the conflict can arise from a higher (or lower) hierarchical spatial or sectoral level (e.g., average employee number if the number of establishments is available). To the best of our knowledge no first-principle approach to this problem has yet been published, although more heuristic approaches can be found in Bourque et al. [15], Miernyk et al. [16], Jensen and McGaurr [17], Gerking [18], Gerking [19], Weale [20], Boomsma and Oosterhaven [21], Rassier et al. [22]. We tackle the same problem from two different angles.

Using concepts and techniques from Bayesian inference [23] and in particular the minimum cross-entropy method [24], we first address the problem of data reconciliation as a particular case of more general data balancing [25]. That is, we consider there are two or more initial estimates for a particular datum, but this datum is itself embedded in a set of constraints connecting it to other data that are potentially unbalanced. We look for simplifications of the general setting under which this original problem can be transformed into another balancing problem where the multiple estimates are replaced by a single one. We prove that, if the initial uncertainty estimates are close to one another, the data reconciliation method of best guesses is a weighted arithmetic average and the data reconciliation method of uncertainties is a harmonic average.

Afterwards we address the same problem from an axiomatic perspective, laying out the desirable properties of a data reconciliation method. Such an approach has roots in different fields, from table deflation [26] and supply-use transformations [27] to environmental responsibility [28]. It turns out that the canonical data reconciliation method, i.e., the one that satisfies all required properties, is none other than a suitable generalization of the entropy-based method as derived earlier. That generalization centers on the introduction of the number of previously combined priors and a ranking of estimates by their relative quality.

2. Entropy-Based Approach

2.1. Basic Concepts

Bayesian inference was first developed by Laplace [29] and later expanded by others, such as Jeffreys [30], Jaynes [31] and Jaynes [23]. According to the Bayesian paradigm, a probability is a degree of belief about the likelihood of an event, and should reflect all relevant available information about that event. According to Weise and Woger [32], if an empirical quantity is subject to measurement errors, it must be described by a random variable, whose expectation is the best guess and whose standard-deviation is the uncertainty estimate.

More formally, a prior datum θi is characterized by a probability distribution π(qi), which expresses the degree of belief that the datum takes realization qi. The best guess is μi=E[θi] and the uncertainty is σi=Var[θi]. When multiple data are considered, e.g., θi and θj, it is necessary to introduce the correlation between them, ρij=Cov[θi,θj]/σiσj. Rodrigues [33] further provides a series of rules to determine the properties of a strictly positive prior datum, using the maximum-entropy principle [34].

The type of data we are interested in are connected to one another through accounting identities of the form:

θ0=i=1nθi,

where θ0 is an aggregate datum and the θi’s are disaggregate data. If the set of data is arranged in a vector θ of length nT, the set of nK accounting identities can be defined through a concordance matrix G, where, for a given accounting identity i, Gij=1 if θj is a disaggregate datum, Gij=1 if θj is an aggregate datum and Gij=0 otherwise.

If the prior configuration is unbalanced, then Gθ0, where 0 is a vector of zeros. Rodrigues [25] derives an analytical solution and a series of approximations that, given a concordance matrix and prior configuration, provide a posterior configuration, t, such that Gt=0. The notational convention used here is that Greek letters refer to priors while Latin cognates will refer to posteriors, i.e., mi, si and rij are, respectively, the best guess and uncertainty of ti and correlation between ti and tj.

2.2. Problem Formulation

We are now in position to formulate the data reconciliation problem. Given initial priors θ and θ and a system with nT+1 numerical data {θ1,,θnT1,θ,θ}, and nK+1 accounting identities, where accounting identity nK+1 takes the form θ=θ, our goal is to determine the final prior θ, in a new system with nT numerical data, {θ1,,θnT1,θ} and the nK first accounting identities of the original system, in which the posteriors {t1,,tnT1} are identical in both data balancing problems, and t=t=t. Conceptually, we are approaching data reconciliation as a form of preliminary data balancing, as illustrated in Figure 1. The conflicting estimates are initial priors of the same datum, and the reconciled value is a final prior. Note the following notational convention: while other variables (and their properties) are denoted with subscripts, initial priors/posteriors (and their properties) are denoted with one () or two () primes, and the final prior/posterior is denoted with neither subscripts nor primes.

Figure 1.

Figure 1

On the left-hand side balancing in a single step, with multiple initial estimates (priors) of the same datum, θ and θ, balanced to the same quantity (posterior), t=t. On the right-hand side balancing in two steps: First the reconciliation procedures combines the multiple initial estimates (initial priors), θ and θ, into a final prior, θ; afterwards the full system is balanced, leading to posterior t. We impose that the result from both procedures is the same, t=t=t.

Three situations emerge: Either the datum to be reconciled is only a disaggregate datum; it is only an aggregate datum; or it is both a disaggregate and an aggregate datum, in different accounting identities. We will deal with the three cases separately.

We now present simple systems to illustrate the three possible cases. As a benchmark consider a tabular system (i.e., with data organized in rows and columns) with no multiple estimates consisting of a 2×3 table A with row sums b and columns sums c. Furthermore, consider that the sum of both b and c is known as d. If i is a vector of ones of appropriate length, all vectors are in column format by default, and prime () adjoined to a matrix or vector denotes transpose, then the previous set of constraints means that:

Ai=b;Ai=c;bi=d;ci=d.

The vectorized form of this system and the concordance table is presented in Table 1. In the baseline system there is a total of twelve variables (columns of the concordance matrix G) and seven constraints (rows thereof). The first six variables are disaggregate values (corresponding to the initial A matrix), the following five are mixed (row and column sums b and c), and the last one is an aggregate datum (d). The first two constraints (rows of G) are the row sums of A, the following three are its columns sums, and the last two are the sums of b and c. To understand how G is constructed let us consider the first constraint, which is the row sum of A. Formally, this is:

A11+A12+A13b1=0,

hence in the first row of G the entries corresponding to the columns of A11, A12 and A13 have 1s, the entry corresponding to the column of b1 has 1 and all entries are zero.

Table 1.

Prior vector and concordance matrix, with no multiple estimates.

θ A11 A12 A13 A21 A22 A23 b1 b2 c1 c2 c3 d
G 1 1 1 0 0 0 −1 0 0 0 0 0
0 0 0 1 1 1 0 −1 0 0 0 0
1 0 0 1 0 0 0 0 −1 0 0 0
0 1 0 0 1 0 0 0 0 −1 0 0
0 0 1 0 0 1 0 0 0 0 −1 0
0 0 0 0 0 0 1 1 0 0 0 −1
0 0 0 0 0 0 0 0 1 1 1 −1

We are now in position to formalize the three situations of multiple estimates of a single datum as variants of Table 1 in which an additional row and column has been added to G.

The case of disaggregate datum occurs if the datum for which multiple estimates exist is an interior point, which for concreteness we consider to be element A23: The set of constraints is shown in Table 2. As an illustration of the case of there being two estimates of an aggregate datum consider it to be d: The set of constraints is shown in Table 3. Finally, consider as example of an element that is both aggregate and disaggregate that of b1: The set of constraints is shown in Table 4.

Table 2.

Prior vector and concordance matrix, with multiple estimates of A23.

θ A11 A12 A13 A21 A22 A23 b1 b2 c1 c2 c3 d A23
G 1 1 1 0 0 0 −1 0 0 0 0 0 0
0 0 0 1 1 1 0 −1 0 0 0 0 0
1 0 0 1 0 0 0 0 −1 0 0 0 0
0 1 0 0 1 0 0 0 0 −1 0 0 0
0 0 1 0 0 0 0 0 0 0 −1 0 1
0 0 0 0 0 0 1 1 0 0 0 −1 0
0 0 0 0 0 0 0 0 1 1 1 −1 0
0 0 0 0 0 1 0 0 0 0 0 0 −1

Table 3.

Prior vector and concordance matrix, with multiple estimates of d.

θ A11 A12 A13 A21 A22 A23 b1 b2 c1 c2 c3 d d
G 1 1 1 0 0 0 −1 0 0 0 0 0 0
0 0 0 1 1 1 0 −1 0 0 0 0 0
1 0 0 1 0 0 0 0 −1 0 0 0 0
0 1 0 0 1 0 0 0 0 −1 0 0 0
0 0 1 0 0 1 0 0 0 0 −1 0 0
0 0 0 0 0 0 1 1 0 0 0 −1 0
0 0 0 0 0 0 0 0 1 1 1 0 −1
0 0 0 0 0 0 0 0 0 0 0 1 −1

Table 4.

Prior vector and concordance matrix, with multiple estimates of b1.

θ A11 A12 A13 A21 A22 A23 b1 b2 c1 c2 c3 d b1
G 1 1 1 0 0 0 −1 0 0 0 0 0 0
0 0 0 1 1 1 0 −1 0 0 0 0 0
1 0 0 1 0 0 0 0 −1 0 0 0 0
0 1 0 0 1 0 0 0 0 −1 0 0 0
0 0 1 0 0 1 0 0 0 0 −1 0 0
0 0 0 0 0 0 0 1 0 0 0 −1 1
0 0 0 0 0 0 0 0 1 1 1 −1 0
0 0 0 0 0 0 1 0 0 0 0 0 −1

It is perhaps instructive to describe how the reconciliation problems differ from the features of the baseline system. The three variants of the baseline are constructed by adding a single variable, the conflicting estimate, which by convenience is always appended to the original system. It is also necessary to add an extra constraint, connecting the two conflicting estimates. Finally, the baseline system is also changed so that in one of the original occurrences of the datum to be reconciled is the first conflicting estimate and the second occurrence is the other conflicting estimate.

Note that in this simple example there are only two constraints affecting each datum, but that naturally is not generally the case. The number of constraints per datum is arbitrary and can be either one or larger than two. An example of what this system might represent is employment count by region and sector, with an extra dimension being type of ownership (private or local, state, or federal government), as reported in the QCEW database.

2.3. From Balancing to Reconciliation

Rodrigues [25] shows that if the posterior configuration is balanced, then its first- and second-moment constraints are:

0=Gm; (1)
0=diagGS|G|, (2)

where m and S are the posterior best-guess vector and covariance matrix, and the latter is defined as S=s^Rs^, where s is the vector of posterior uncertainties and R is the vector of posterior correlations, and ^ denotes diagonal matrix. Likewise μ and Σ are the prior best guess vector and covariance matrix, and the latter is defined as Σ=σ^Pσ^.

The analytical solution of the data-balancing problem is:

S˜1=Σ˜1+Gβ^|G|; (3)
S˜1m˜=Σ˜1μ˜+Gα. (4)

Notice that Equations (3) and (4) contain symbols adjoined with ˜ (which we refer to as Gaussian parameters) while Equations (1) and (2) do not. The connection between the Gaussian parameters and the corresponding observable quantities is described in Rodrigues [25]: When relative uncertainty, σj/μj or sj/mj, is low, then the Gaussian parameter and the observable are identical. When relative uncertainty is high, the best guess Gaussian parameter tends to and the uncertainty Gaussian parameter tends to , in such a way that if relative uncertainty is unitary, μ˜j/σ˜j2=1/μj=1/σj and m˜j/s˜j2=1/mj=1/sj. There is no closed-form expression between observables and Gaussian parameters in the multivariate case.

If both the prior uncertainty of aggregate data and initial prior correlations are high, we obtain a simplified weighted least-squares (WLS) method in which the weights are prior uncertainties:

m=μ+σ^Gα, (5)

and posterior correlations are set by considering that relative uncertainty is constant, s=mσμ, where ⊙ and ⊘ are Hadamard (or entrywise) product and division, and the update takes place in small steps.

This WLS method is a generalization of the standard biproportional balancing method (RAS) for arbitrary structure and uncertainty data [25]. However, it is in a way too simple for the data reconciliation problem, because it keeps relative uncertainty constant. In the data reconciliation problem this assumption is untenable, whenever the relative uncertainty of the initial priors differs.

Thus, we now look for a simplification of the general solution (Equations (3) and (4)) that is still feasible and that allows both for best guess and uncertainty reconciliation. Let us consider that correlations change little from prior to posterior, so that only uncertainties are adjusted. Equations (3) and (4) become:

s^1R1s1=σ^1P1σ1+Gβ;s^1R1s^1m=σ^1P1σ^1μ+Gα,

where we dropped the ˜, meaning that all variables are observables. If correlations are not adjusted, then R=P, and if variances change little sσ The previous expressions become:

s1=σ1+Pσ^Gβ;s^1m=σ^1μ+Pσ^Gα.

For convenience, consider now that a datum corresponding to entry (i,j) in the tabular matrix is tij, while the sums of row or column i is ti, and the Lagrange parameters of a row sum or column sum are adjoined with superscript R or C. For a particular entry, the previous matrix equation reads:

1sij=1σij+σiRβiR+σjCβjC+ki,jσikβkC+σkjβkR;1si=1σi+σiβiR+βiC,

where σiR=jσij and σiC=jσji. If the adjustment from prior to posterior is small, then σiRσiCσi. If βi*=σiβiR, σiσij and σiσji, then the previous expression matrix expressions simplify to:

s1=σ1+Gβ*; (6)
s^1m=σ^1μ+Gα*, (7)

where the derivation of Equation (7) follows along identical lines to that of Equation (6). We now use these expressions to obtain a tentative solution of the data reconciliation problem, even though they were derived under rather strict assumptions.

2.4. A Tentative Solution

We now examine the implications of applying Equations (6) and (7) to different data reconciliation configurations as described in Section 2.2: multiple estimates of (a) an aggregate datum; (b) a disaggregate datum; and (c) a datum that is both aggregate and disaggregate. We shall see that the same expression applies to all these problems.

For clarity, the analysis is carried out using scalar expressions, and, for brevity, only to the case of two constraints per datum. The strategy of the proof is the same for all configurations: to derive constraints connecting prior and posterior in the original problem and in a modified problem in which there is only a single datum where originally there were the conflicting estimates.

2.4.1. Aggregate Datum

Consider that there are two initial priors of a datum, θ0 and θ0 and that the datum is involved in two accounting identities, the first summing over elements 1 to n and the second summing over n+1 to n:

t0=i=1nti;t0=i=n+1n+nti;t0=t0,

where each ti, for i>0, can be affected by other accounting identities. The Lagrange parameters associated with these three expressions in Equation (6) are denoted, respectively, by β0, β0 and β0. We wish to determine a final prior θ0, such that:

t0=i=1nti;t0=i=n+1n+nti.

Equation (6) reads, for the original problem:

1si=1σi+β0+if1in;1si=1σi+β0+ifn+1in;1s0=1σ0β0+β0;1s0=1σ0β0β0,

where refers to other Lagrange parameters. And in the modified problem:

1si=1σi+β0+if1in;1si=1σi+β0+ifn+1in;1s0=1σ0β0β0.

Notice that for every datum i>0 the original and modified problem are identical. Because the posteriors of the aggregate datum are all identical, s0=s0=s0, we can write:

21s0=21σ0β0β0=1σ0+1σ0β0β0+β0β0.

A similar expression can be obtained from Equation (7) for the final prior best guess, leading to the solution:

1σ0=121σ0+1σ0;μ0σ0=12μ0σ0+μ0σ0.

Thus, both the final prior of the absolute uncertainty, σ, and the relative uncertainty, σ/μ, are obtained as the harmonic average of the initial prior absolute and relative uncertainties.

2.4.2. Disaggregate Datum

Consider now that there are two initial priors of an interior point, θ1 and θ1, which is affected by two accounting identities, such that the posteriors satisfy:

t0=t1+i=2nti;t0=t1+i=n+1n+nti;t1=t1.

The Lagrange parameters associated with these three expressions are, as before, β0, β0 and β0. We wish to determine a final prior θ1, such that:

t0=t1+i=2nti;t0=t1+i=n+1n+nti.

Equation (6) reads, for the original problem:

1si=1σi+β0+if2in;1si=1σi+β0+ifn+1in;1s0=1σ0β0+;1s0=1σ0β0+;1s1=1σ1+β0+β1;1s1=1σ1+β0β1,

and in the modified problem:

1si=1σi+β0+if2in;1si=1σi+β0+ifn+1in;1s0=1σ0β0+;1s0=1σ0β0+;1s1=1σ1+β0+β0.

As before, the data for which there are no conflicting estimates (t0, t0 and ti with i>1) are subject to the same set of constraints in the original and in the modified problem. Because the posteriors of the disaggregate datum are all identical, s1=s1=s1, we can write:

21s1=21σ1+β0+β0=1σ1+1σ1+β0+β0+β1β1.

At this stage it becomes clear that we will encounter exactly the same solution as in the case of an aggregate datum:

1σ1=121σ1+1σ1;μ1σ1=12μ1σ1+μ1σ1.

2.4.3. Mixed Datum

Consider now that there are two initial priors, θ1 and θ1, of a datum that is both aggregate and disaggregate, in different accounting identities, and whose posteriors satisfy:

t0=t1+i=2nti;t1=i=n+1nti;t1=t1.

As before the Lagrange parameters are denoted as β0, β0 and β1. We wish to determine a final prior θ1, such that:

t0=t1+i=2nti;t1=i=n+1nti.

Equation (6) reads, for the original problem:

1si=1σi+β0+if2in;1si=1σi+β0+ifn+1in;1s0=1σ0β0+;1s1=1σ1+β0+β1;1s1=1σ1β0β1,

and in the modified problem:

1si=1σi+β0+if2in;1si=1σi+β0+ifn+1in;1s0=1σ0β0+;1s1=1σ1+β0β0.

As has become routine, for datum 0 and for every datum i>1 the original and modified problem are identical. Because s1=s1=s1, we can write:

21s1=21σ1+β0β0=1σ1+1σ1+β0β0+β1β1.

Thus, it is clear that the solution is again identical.

3. Axiomatic Approach

3.1. Axiomatic Formulation

In Section 2 we obtained a data reconciliation algorithm from first principles, as an operation of data balancing under a particular structure. However, we can also reason about the data reconciliation algorithm in terms of its properties, i.e., we will not determine what it is, but what it ought to be.

If θ and θ are two initial priors, the data reconciliation algorithm is a function f(·) that generates a final prior θ=f(θ,θ), where each prior θ is characterized by a best guess, μ, an absolute uncertainty, σ, and a relative uncertainty, u=σ/μ, which can take values in the range 0u1. Let xmin=min{x,x} and xmax=max{x,x}, where x can be μ, σ or u.

We now propose a series of properties that define the data reconciliation method.

Property 1 (Lower and upper bounds).

The parameters of the final prior lie within the range set by the parameters of the initial priors, μminμμmax, σminσσmax and uminuumax.

Property 2 (Commutativity).

The order in which the initial priors are combined does not matter, f(θ,θ)=f(θ,θ).

Property 3 (Associativity).

Several initial priors can be combined and the resulting final prior is invariant to the order of reconciliation, f(θ,f(θ,θ))=f(f(θ,θ),θ).

Property 4 (Identity).

If the initial prior best guesses are identical, μ=μ then the final prior best guess is identical, μ=μ=μ. If the initial prior uncertainties are identical, σ=σ then the final prior uncertainty is identical, σ=σ=σ.

Property 5 (Monotonicity).

The relative adjustment from initial to final prior increases with the relative magnitude of initial uncertainty:

μμμμ=gσσ; (8)
σσσσ=hσσ. (9)

where dg(x)/dx>0 and dh(x)/dx>0.

Property 6 (Absorption).

If initial prior θ is known with minimal uncertainty, u=0, and θ is not, u>0, then the final prior is identical to the first initial prior, f(θ,θ)=θ. If initial prior θ is known with maximal uncertainty, u=1, and θ is not, u<1, then the final prior is identical to the second initial prior, f(θ,θ)=θ.

We believe that these six properties are uncontroversial and self-explanatory. However, it turns out that the problem as formulated here has no solution, i.e., no formula can satisfy all of the above properties. We later overcome this hurdle by generalizing the problem formulation, to include two additional concepts: A hierarchy of data quality and the number of combined priors.

3.2. The Canonical Data Reconciliation Method

The properties outlined in Section 3.1 constrain the range of data reconciliation algorithms but do not define a unique solution. However, Equations (8) and (9) suggests how it may be possible to obtain a solution. Let us consider that g(x) and h(x) take the simple yet flexible form of g(x)=axb and h(x)=cxd.

The condition of identity (Property 4), in the case of μ=μ and σ=σ leads to the indeterminacy:

μμμμ=00.

But if the limit is approached as μ=μδ and μ=μ+δ, when δ0, then:

μμμμ=δδ=1.

Thus, under the condition of identity, Equations (8) and (9) imply that:

1=a1b;1=c1d,

so a=c=1. Let us further consider the simplest possible case b=d=1, so that g(·) and h(·) are the identity lines. Applying g(x)=x and h(x)=x to Equations (8) and (9) leads to:

μμσ=μμσ;σσσ=σσσ.

Rearranging terms:

μ1σ+1σ=μσ+μσ;σ1σ+1σ=σσ+σσ.

Recalling that u=σ/μ we obtain the canonical data reconciliation method as:

μ=1/σ1/σ+1/σμ+1/σ1/σ+1/σμ; (10)
1σ=121σ+1σ. (11)

Equation (10) can be be expressed in two other ways:

μ=σ2μσ+μσ; (12)
1u=121u+1u. (13)

Thus, if the ratio of relative adjustment of best guesses and uncertainties is identical to the ratio of absolute uncertainties of the initial priors, the best-guess data reconciliation method is a weighted average, where the weights are proportional to the inverse of absolute uncertainty, and the absolute and relative uncertainty data reconciliation methods are harmonic averages.

Does this data reconciliation method satisfy the properties of Section 3.1? It is trivial to check that Properties 1, 2, 4 and 5 are satisfied. But this is not the case for Properties 3 and 6. In the following subsections we present suitable extensions of the canonical data reconciliation method to address these problems.

3.3. The Number of Combined Priors

The canonical data reconciliation method is not associative. The properties of f(θ,f(θ,θ)) are:

1u=121u+121u+1u=12u+14u+14u1σ=121σ+121σ+1σ=12σ+14σ+14σ.

While the properties of f((θ,θ),θ) are:

1u=12121u+1u+1u=14u+14u+12u1σ=12121σ+1σ+1σ=14σ+14σ+12σ.

Thus, f(θ,f(θ,θ))f((θ,θ),θ). But upon some reflection, this result is in fact reasonable. The final prior is the combination of two initial priors with equal weights. If some of these initial priors are themselves a combination of other initial priors, this information has to be considered explicitly.

Let us introduce a new quantity, n, as the number of combined priors, so that now a prior θ is defined by a best guess, μ, an absolute uncertainty, σ, and n. Consider the following data reconciliation rule:

μ=n/σn/σ+n/σμ+n/σn/σ+n/σμ; (14)
1σ=1nnσ+nσ; (15)
n=n+n. (16)

As before, Equation (14) can be be expressed in two other ways:

μ=σnnσμ+nσμ; (17)
1u=1nnu+nu. (18)

This data reconcilation rule satisfies the first five properties of Section 3.1.

3.4. Ranking of Data Quality

The canonical data reconciliation method satisfies the absorption property of minimal uncertainty. If σ=0 and σ>0, then:

1σ+1σ1σ,

and Equations (10) and (11) become:

μ1/σ1/σμ+1/σ1/σμ=(1)μ+(0)μ;σ2σ=0,

so μ=μ and σ=σ. However, it does not satisfy the absorption property of maximal uncertainty. If σ=μ and σ<μ, then u=0 and Equations (10), (11) and (13) become:

μ=μ+σμμ+σ;σ=2σμμ+σ;u=2u1+u,

and thus μμ and σσ.

In order to ensure that the absorption of maximal uncertainty is satisfied, we use the concept of data quality, introduced in Rodrigues [25]. The idea is that, besides an uncertainty estimate, which formalizes quantitatively a degree of confidence in the accuracy of the best guess of a datum, it is also possible to formalize qualitatively a degree of confidence in the accuracy of a datum relative to others.

For the purpose of data balancing, Rodrigues [25] suggests that a datum that is considered to be of higher quality should be kept fixed while lower quality data are adjusted. The natural corollary, in the problem of data reconciliation, is to consider that when one wishes to combine two initial priors of differing levels of data quality, the prior of lower quality should be disregarded.

If a datum has unitary relative uncertainty, then it is maximally uninformative, and it is reasonable to disregard it. After all, a maximally uninformative prior should only be used if no better alternative is available. We therefore suggest that, if σ=μ and σ<μ, then θ=θ directly, without using Equations (10) and (11).

3.5. Summary

We now present the expressions for the combination of n initial priors, θi, with i=1,,n into a single final prior θ. Addressing this problem requires the specification, for each prior, θi, of its best guess, μi, its absolute uncertainty, σi, and the number of previously combined priors, ni.

If all relative uncertainties, ui=σi/μi, are in the range 0<ui<1, then the final prior properties are defined as:

μ=i=1nni/σin/σμi; (19)
1σ=1ni=1nniσi; (20)
n=i=1nni. (21)

Equation (19) can be expressed as:

1u=1ni=1nniui. (22)

If some initial priors have zero relative uncertainty, ui=0, then all other initial priors should be disregarded. If some initial priors have unitary relative uncertainty, ui=1, then it is they which should be disregarded.

In Figure 2 we illustrate the behaviour of Equation (19), when n=2 and n1=n2=μ2=1. The plot shows different curves of the combined posterior best guess μ as a function of the prior best guess μ1, where each curve corresponds to a different ratio of uncertainties, σ1/σ2. When both uncertainties are identical, the posteriod best guess is the arithmetic average of the two prior best guesses. When the uncertainties differ, the best guess prior with the largest uncertainty contributes the least to combined best guess: in the limit case in which σ1σ2 the prior μ1 is ignored and the posterior is similar to μ2; when σ1σ2 the reverse occurs and μμ1.

Figure 2.

Figure 2

Best guess of the combined prior as a function of initial prior best guess, μ1, when μ2=1, for diferent values of σ1/σ2: Solid line (–) when σ1/σ2=1, solid line with circle markers (–○–) when σ1/σ2=0, and solid line with triangle markers (–△–) when σ1/σ2=.

In turn, Figure 3 we illustrate the behaviour of Equation (20). We still consider that n=2 and n1=n2 but now no explicit assumption about best guesses is necessary. Instead, the uncertainty of the second variable is fixed, σ2=1, and the curve shows the value of the combined posterior as a function of prior uncertainty σ1. The figure describes an arc slightly above the diagonal line. When both prior uncertainties are identical (σ1=1), then the posterior equals the priors, as expected. As σ1 becomes smaller than σ2 the combined prior becomes closer to σ1 than to σ2, but always larger, σ>σ1, except in the limit case σ10, in which case σσ1.

Figure 3.

Figure 3

Solid line (–) and relative uncertainty, uncertainty of the final prior, σ, as a function of initial prior uncertainty, σ1, when σ2=1. Dashed line (--) is the identity line.

4. Conclusions and Discussion

Herein we investigated using two distinct pathways the problem of reconciling multiple conflicting estimates in the course of database development. We assume that the developer (data snooper) is tooled with a best guess and uncertainty for each of those conflicting estimates.

First, we apply a maximum-entropy Bayesian inference method, under the limiting condition that the adjustment from prior to posterior uncertainties is small. Second, we obtain a canonical data reconciliation method through an axiomatic approach that is as simple as possible but satisfied important qualitative properties. Each approach verifies the other.

The resulting formula for the best guess, Equation (19), is a weighted average showing that, as the count of conflicting priors underlying a particular prior rises, the value of that prior increases in importance in terms of obtaining a solution. We get a similar result with the inverse of the uncertainty, that is, the narrower the uncertainty of an estimate the more it contributes to the final solution. The resulting formula for the uncertainty, Equation (20), is a harmonic average where the same factors are present: As the count of conflicting priors underlying a particular prior rises, the value of that prior increases in importance; and the narrower the uncertainty of a prior, the more it contributes to the final solution.

Of course, limitations to our approach must be mentioned. And the key limitation is certainly that, in some practical applications, the data snooper will lack information on, either or both, best guess and uncertainty. It may be that instead, one only has upper and lower bounds for the datum of interest to inform its best guess and uncertainty. This is certainly the case in some instances when data are censored, e.g., the anti-suppression problem of Gerking et al. [13] and Isserman and Westervelt [12]. Future work using variable ranges with externally informed priors would be a natural extension of what is presented here. Indeed, some initial forays into this line of investigation are already underway, see, e.g., Makarkina and Lahr [35].

It should be mentioned that although the focus of attention here was on conflicting estimates arising from economic accounts there are other circumstances in which a formally identical problem arises, for example in expert elicitation [36].

Acknowledgments

Any errors the paper may contain are the sole responsibility of the authors.

Author Contributions

J.R. performed the conceptualization and formal analysis, M.L. provided the motivation and both J.R. and M.L. wrote the manuscript.

Funding

Research funded by Leiden University.

Conflicts of Interest

The authors declare no conflict of interest.

References

  • 1.Lohr S. The Origins of ‘Big Data’: An Etymological Detective Story. [(accessed on 22 October 2018)];New York Times. 2013 Available online: https://bits.blogs.nytimes.com/2013/02/01/the-origins-of-big-data-an-etymological-detective-story/
  • 2.Duncan G.T., Elliot M., Salazar-Gonzalez J.J. Statistical Confidentiality: Principles and Practice. Springer; New York, NY, USA: 2011. [Google Scholar]
  • 3.Garfinkel R., Gopal R., Goes P. Privacy Protection of Binary Confidential Data against Deterministic, Stochastic, and Insider Attack. Manag. Sci. 2002;48:749–764. doi: 10.1287/mnsc.48.6.749.193. [DOI] [Google Scholar]
  • 4.Evans T., Zayatz L., Slanta J. Using Noise for Disclosure Limitation of Establishment Tabular Data. Off. Stat. 1998;14:537–551. [Google Scholar]
  • 5.Miller R.E., Blair P.D. Input-Output Analysis: Foundations and Extensions. 2nd ed. Cambridge University Press; Cambridge, UK: 2009. [Google Scholar]
  • 6.Kruithof R. Telefoonverkeersrekening. De Ingenieur. 1937;52:E15–E25. (In Dutch) [Google Scholar]
  • 7.Stone R., Meade J.E., Champernowne D.G. The Precision of National Income Estimates. Rev. Econ. Stud. 1942;9:111–125. doi: 10.2307/2967664. [DOI] [Google Scholar]
  • 8.Byron R. The Estimation of Large Social Account Matrices. Stat. Soc. Ser. A. 1978;141:359–367. doi: 10.2307/2344807. [DOI] [Google Scholar]
  • 9.Van Der Ploeg F. Reliability and the Adjustment of Sequences of Large Economic Accounting Matrices. Stat. Soc. Ser. A. 1982;145:169–194. doi: 10.2307/2981533. [DOI] [Google Scholar]
  • 10.Lahr M.L., Mesnard L.D. Biproportional Techniques in Input-Output Analysis: Table Updating and Structural Analysis. Econ. Syst. Res. 2004;16:115–134. doi: 10.1080/0953531042000219259. [DOI] [Google Scholar]
  • 11.Chen B. A Balanced System of U.S. Industry Accounts and Distribution of the Aggregate Statistical Discrepancy by Industry. Bus. Econ. Stat. 2012;30:202–211. doi: 10.1080/07350015.2012.669667. [DOI] [Google Scholar]
  • 12.Isserman A., Westervelt J. 1.5 Million Missing Numbers: Overcoming Employment Suppression in County Business Patterns Data. Int. Reg. Sci. Rev. 2006;29:311–335. doi: 10.1177/0160017606290359. [DOI] [Google Scholar]
  • 13.Gerking S., Isserman A., Hamilton W., Pickton T., Smirnov O., Sorenson D. Anti-suppressants and the Creation and Use of Non-Survey Regional Input-Output Models. In: Lahr M., Miller R., editors. Regional Science Perspectives in Economic Analysis: A Festschrift in Memory of Benjamin H. Stevens. Elsevier; New York, NY, USA: 2001. pp. 379–406. [Google Scholar]
  • 14.Rodrigues J., Marques A., Wood R., Tukker A. A network approach for assembling and linking input-output models. Econ. Syst. Res. 2016;28:518–538. doi: 10.1080/09535314.2016.1238817. [DOI] [Google Scholar]
  • 15.Bourque P.J., Chambers E.J., Chiu J.S.Y., Denman F.L., Dowdle B., Gordon G., Thomas M., Tiebout C., Weeks E.E. The Washington Economy: An Input-Output Study. University of Washington; Seattle, WA, USA: 1967. [Google Scholar]
  • 16.Miernyk W.H., Shellhammer K.L., Brown D.M., Coccari R.L., Gallagher C.J., Wineman W.H. Simulating Regional Economic Development: An Interindustry Analysis of the West Virginia Economy. D.C. Heath and Co.; Lexington, KY, USA: 1970. [Google Scholar]
  • 17.Jensen R.C., McGaurr D. Reconciliation of purchases and sales estimates in an Input-Output table. Urban Stud. 1976;13:59–65. doi: 10.1080/00420987620080081. [DOI] [Google Scholar]
  • 18.Gerking S. Reconciling ‘rows only’ and ‘columns only’ coefficients in an Input-Output model. Int. Reg. Sci. Rev. 1976;1:623–626. doi: 10.1177/016001767600100203. [DOI] [Google Scholar]
  • 19.Gerking S. Reconciling reconciliation procedures in regional Input-Output Analysis. Int. Reg. Sci. Rev. 1979;4:23–36. doi: 10.1177/016001767900400102. [DOI] [Google Scholar]
  • 20.Weale M. The reconciliation of values, volumes and prices in national accounts. Stat. Soc. Ser. A. 1988;151:211–221. doi: 10.2307/2982193. [DOI] [Google Scholar]
  • 21.Boomsma P., Oosterhaven J. A double-entry method for the construction of bi-regional Input-Output tables. Reg. Sci. 1992;32:269–284. doi: 10.1111/j.1467-9787.1992.tb00186.x. [DOI] [Google Scholar]
  • 22.Rassier D., Howells T., Morgan E., Empey N., Roesch C. Implementing a Reconciliation and Balancing Model in the U.S. Industry Accounts. Bureau of Economic Analysis; Washington, DC, USA: 2007. [Google Scholar]
  • 23.Jaynes E.T. Probability Theory: The Logic of Science. Cambridge University Press; Cambridge, UK: 2003. [Google Scholar]
  • 24.Kullback S., Leibler R.A. On Information and Sufficiency. Ann. Math. Stat. 1951;22:79–86. doi: 10.1214/aoms/1177729694. [DOI] [Google Scholar]
  • 25.Rodrigues J.F.D. A Bayesian Approach to the Balancing of Statistical Economic Data. Entropy. 2014;16:1243–1271. doi: 10.3390/e16031243. [DOI] [Google Scholar]
  • 26.Persons W.M. Fisher’s Formula for Index Numbers. Rev. Econ. Stat. 1921;3:103–113. doi: 10.2307/1928524. [DOI] [Google Scholar]
  • 27.Kop Jansen P., Raa T.T. The Choice of Model in the Construction of Input-Output Coefficients Matrices. Int. Econ. Rev. 1990;31:213–227. doi: 10.2307/2526639. [DOI] [Google Scholar]
  • 28.Schneider J.R.T.D.S.G.F. Designing an indicator of environmental responsibility. Ecol. Econ. 2006;59:256–266. [Google Scholar]
  • 29.Laplace P.S. Essai Philosophique sur les Probabilités. Courcier Imprimeur; Paris, France: 1814. [Google Scholar]
  • 30.Jeffreys H. Theory of Probability. Clarendon Press; Oxford, UK: 1939. [Google Scholar]
  • 31.Jaynes E.T. Papers on Probability, Statistics and Statistical Physics. Kluwer Academic Publishers; Dordrecht, The Netherlands: 1983. [Google Scholar]
  • 32.Weise K., Woger W. A Bayesian theory of measurement uncertainty. Meas. Sci. Technol. 1992;4:1–11. doi: 10.1088/0957-0233/4/1/001. [DOI] [Google Scholar]
  • 33.Rodrigues J.F.D. Maximum-Entropy Prior Uncertainty and Correlation of Statistical Economic Data. Bus. Econ. Stat. 2016;34:357–367. doi: 10.1080/07350015.2015.1038545. [DOI] [Google Scholar]
  • 34.Jaynes E.T. Information Theory and Statistical Mechanics I. Phys. Rev. 1957;106:620–630. doi: 10.1103/PhysRev.106.620. [DOI] [Google Scholar]
  • 35.Makarkina G.V., Lahr M.L. Estimating Nationwide Impacts using an Input-Output Model with Fuzzy Parameters; Proceedings of the 25th International Input-Output Conference; Atlantic, NJ, USA. 19–23 June 2017. [Google Scholar]
  • 36.Keane E.A.L.M.J. Inconsistency reduction in decision making via multi-objective optimisation. Eur. Oper. Res. 2018;267:212–226. [Google Scholar]

Articles from Entropy are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)

RESOURCES