Journal of Cheminformatics. 2013 Aug 23;5:37. doi: 10.1186/1758-2946-5-37

Full “Laplacianised” posterior naive Bayesian algorithm

Hamse Y Mussa 1, John BO Mitchell 2, Robert C Glen 1
PMCID: PMC3846418  PMID: 23968281

Abstract

Background

In the last decade the standard Naive Bayes (SNB) algorithm has been widely employed in multi–class classification problems in cheminformatics. This popularity is mainly due to the fact that the algorithm is simple to implement and in many cases yields respectable classification results. Using clever heuristic arguments “anchored” by insightful cheminformatics knowledge, Xia et al. have simplified the SNB algorithm further and termed it the Laplacian Corrected Modified Naive Bayes (LCMNB) approach, which has been widely used in cheminformatics since its publication.

In this note we mathematically illustrate the conditions under which Xia et al.’s simplification holds. It is our hope that this clarification could help Naive Bayes practitioners in deciding when it is appropriate to employ the LCMNB algorithm to classify large chemical datasets.

Results

A general formulation that subsumes the simplified Naive Bayes version is presented. Unlike the widely used NB method, the Standard Naive Bayes description presented in this work is discriminative (not generative) in nature, which may lead to possible further applications of the SNB method.

Conclusions

Starting from a standard Naive Bayes (SNB) algorithm, we have derived mathematically the relationship between Xia et al.’s ingenious, but heuristic algorithm, and the SNB approach. We have also demonstrated the conditions under which Xia et al.’s crucial assumptions hold. We therefore hope that the new insight and recommendations provided can be found useful by the cheminformatics community.

Keywords: Naive Bayes, Laplacian Corrected Modified Naive Bayes, Classifications, Cheminformatics

Background

Broadly speaking there are two conceptually different ways to solve statistical problems: the frequentist and the Bayesian approaches. On the pros and cons of each method there are numerous excellent review articles and text books, such as the recent book by Murphy [1]. Unlike the frequentist approach, in the Bayesian approach any a priori knowledge about the probability distribution function that one assumes might have generated the given data (in the first place) can be taken into account when estimating this distribution function from the data at hand. If the data are noise–free and “complete”, the role of the a priori information in estimating the distribution function diminishes drastically. However, the a priori information can be crucial when the data are noisy and sparse. The latter scenario is typical in realistic large chemical datasets, which, arguably, makes Bayesian based statistics a powerful data analysis tool.

Unfortunately, Bayesian statistics in its fullest form is not computationally feasible in realistic cheminformatics data analyses. However, in recent years, a simplified version of the Bayesian approach, which is commonly known as the “Naive” Bayesian algorithm, has been found to be a useful classification tool in multi–class classification problems in cheminformatics. To this end a Naive Bayesian classifier is built on a binary descriptor space. The descriptors/features xj, representing the compounds to be classified, assume binary values 0 or 1, where j = 1,2,...,L and L can typically be more than 1,000. Thus for some cheminformatics practitioners even the Naive Bayesian algorithm in its standard form is computationally prohibitive when the dataset is large. In this regard, Xia et al. [2] proposed a simpler version of the standard Naive Bayesian algorithm, albeit for binary classification problems; slight variants of this algorithm for multi–class classification can also be found in [3,4]. According to Rogers et al. [5], Rogers being a co–author of the work presented in [2], “the standard Naive Bayes was modified by considering only the effect of the presence of a feature and not its absence”. There are a few more notable aspects of this proposed simplification: (a) the authors cleverly estimate directly – albeit heuristically – the a posteriori class probability for the present feature; (b) these authors (rather ingeniously) incorporate a Laplacian–correction into the estimated posterior class probability; and (c) the authors deem absent features not discriminating enough and therefore discard their contributions to the estimation of the posterior class probability. More than anything else it is this omission of the absent features from the Standard Naive Bayes (SNB) algorithm that makes Xia et al.’s proposed Naive Bayes algorithm, termed Laplacian Corrected Modified Naive Bayes (LCMNB), (and its variants by different groups) computationally fast.

It is these three points, (a), (b) and (c), that we expound on in a mathematical setting to demonstrate under which conditions they hold – not only in an abstract sense, but also in the practical sense for a NB practitioner to make an informed decision as to when it is appropriate to employ SNB or LCMNB, in the cheminformatics context.

Methods

Naive Bayes

From Bayes’ theorem recall that [6]:

\frac{p(\omega_i|\mathbf{x})}{p(\omega_i)} = \frac{p(\mathbf{x}|\omega_i)}{p(\mathbf{x})} \qquad (1)

where x = (x1,x2,...,xL) and ωi denote the feature vectors and class labels, respectively; xj and L being as described before, whereas i is just an index for the class labels. The terms p(ωi|x), p(x|ωi), p(ωi), and p(x) refer to the posterior probability for ωi given x, the descriptor vector distribution conditioned on class ωi, the a priori probability of class ωi occurring, and the descriptor vector density function, respectively – for more details, see ref. [3,4,6].

The left hand side of Eq. 1 can be expressed as follows [1,7]

\frac{p(\omega_i|\mathbf{x})}{p(\omega_i)} = \frac{p(\omega_i|x_1,x_2,\ldots,x_L)}{p(\omega_i)} \qquad (2)

By virtue of Bayes’ theorem p(ωi|x1,x2,...,xL) can be rewritten as

p(\omega_i|x_1,x_2,\ldots,x_L) = \frac{p(x_1,x_2,\ldots,x_L|\omega_i)\,p(\omega_i)}{p(x_1,x_2,\ldots,x_L)} \qquad (3)

which in turn allows us to rewrite Eq. 2 as

\frac{p(\omega_i|\mathbf{x})}{p(\omega_i)} = \frac{p(\omega_i)\,p(x_1,x_2,\ldots,x_L|\omega_i)}{p(\omega_i)\,p(x_1,x_2,\ldots,x_L)} = \frac{p(x_1,x_2,\ldots,x_L|\omega_i)}{p(x_1,x_2,\ldots,x_L)} \qquad (4)

Making use of the chain rule of probability [1,8], we can express p(x1,x2,...,xL|ωi) as

p(x_1,x_2,\ldots,x_L|\omega_i) = p(x_1|\omega_i)\,p(x_2|\omega_i,x_1)\cdots p(x_L|\omega_i,x_1,x_2,\ldots,x_{L-1}) \qquad (5)

Plugging the right hand side of the equation above into Eq. 4 results in

\frac{p(\omega_i|\mathbf{x})}{p(\omega_i)} = \frac{p(x_1|\omega_i)\,p(x_2|\omega_i,x_1)\cdots p(x_L|\omega_i,x_1,x_2,\ldots,x_{L-1})}{p(x_1,x_2,\ldots,x_L)} \qquad (6)

In practice, it is extremely difficult to estimate p(ωi|x) or p(x|ωi). This reality inevitably forces one to make concessions over the degree of accuracy the estimated p(ωi|x) or p(x|ωi) can deliver. One widely employed scheme to obtain these probability distributions with compromised accuracy is to assume that individual descriptors xj, j = 1,2,...,L, are independent conditional on ωi. It is this naive assumption of independence among features to which the term “Naive” in “Naive Bayesian” refers.

Under this naive assumption, in Eq. 6, p(x2|ωi) = p(x2|ωi,x1), p(x3|ωi) = p(x3|ωi,x1,x2),..., p(xL|ωi) = p(xL|ωi,x1,x2...,xL−1). Thus, Eq. 6 modifies to

\frac{p(\omega_i|\mathbf{x})}{p(\omega_i)} = \frac{p(x_1|\omega_i)\,p(x_2|\omega_i)\cdots p(x_L|\omega_i)}{p(x_1,x_2,\ldots,x_L)} \qquad (7)

Multiplying the numerator and denominator of Eq. 7 by p^L(ωi) / Π_{j=1}^{L} p(xj) yields

\frac{p(\omega_i|\mathbf{x})}{p(\omega_i)} = \frac{\dfrac{p^L(\omega_i)\,p(x_1|\omega_i)\,p(x_2|\omega_i)\cdots p(x_L|\omega_i)}{\prod_{j=1}^{L} p(x_j)}}{\dfrac{p^L(\omega_i)\,p(x_1,x_2,\ldots,x_L)}{\prod_{j=1}^{L} p(x_j)}} \qquad (8)

= \frac{p^L(\omega_i)\,p(x_1|\omega_i)\,p(x_2|\omega_i)\cdots p(x_L|\omega_i)}{\prod_{j=1}^{L} p(x_j)} \times \frac{\prod_{j=1}^{L} p(x_j)}{p^L(\omega_i)\,p(x_1,x_2,\ldots,x_L)} \qquad (9)

Then, making use of the fact that p(ωi|x1) = p(ωi)p(x1|ωi)/p(x1), p(ωi|x2) = p(ωi)p(x2|ωi)/p(x2), ..., p(ωi|xL) = p(ωi)p(xL|ωi)/p(xL), we can rewrite Eq. 9 as

\frac{p(\omega_i|\mathbf{x})}{p(\omega_i)} = \frac{p(\omega_i|x_1)\,p(\omega_i|x_2)\cdots p(\omega_i|x_L)\,\prod_{j=1}^{L} p(x_j)}{p^L(\omega_i)\,p(x_1,x_2,\ldots,x_L)} \qquad (10)

or more compactly as

\frac{p(\omega_i|\mathbf{x})}{p(\omega_i)} = \frac{\prod_{j=1}^{L} p(\omega_i|x_j)}{p^L(\omega_i)} \times \frac{\prod_{j=1}^{L} p(x_j)}{p(x_1,x_2,\ldots,x_L)} \qquad (11)

Clearly Π_{j=1}^{L} p(xj) / p(x1,x2,...,xL) is common to all classes and therefore plays no role in classification. Thus, in practice (in the Naive Bayes context with which this work is concerned) one is required to estimate p(ωi|xj) and p(ωi).
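
As a quick numerical sanity check of Eq. 11, the short Python sketch below (illustrative only; the class priors and conditional probabilities are invented numbers) builds a toy two-class, three-feature problem in which the features are conditionally independent by construction, and verifies that the left- and right-hand sides of Eq. 11 agree for every possible binary descriptor vector.

import itertools
import numpy as np

# Toy model: C = 2 classes, L = 3 binary features (hypothetical numbers).
prior = np.array([0.3, 0.7])                 # p(omega_i)
mu = np.array([[0.9, 0.2, 0.5],              # p(x_j = 1 | omega_1)
               [0.1, 0.6, 0.4]])             # p(x_j = 1 | omega_2)
C, L = mu.shape

def p_x_given_class(x, i):
    # p(x | omega_i) under the conditional-independence ("naive") assumption
    return np.prod(mu[i] ** x * (1.0 - mu[i]) ** (1 - x))

for x in itertools.product([0, 1], repeat=L):
    x = np.array(x)
    p_x = sum(p_x_given_class(x, i) * prior[i] for i in range(C))          # p(x_1,...,x_L)
    p_xj = np.array([sum((mu[i, j] if x[j] else 1 - mu[i, j]) * prior[i]
                         for i in range(C)) for j in range(L)])            # p(x_j)
    for i in range(C):
        lhs = (p_x_given_class(x, i) * prior[i] / p_x) / prior[i]          # p(omega_i|x)/p(omega_i)
        p_w_xj = np.array([(mu[i, j] if x[j] else 1 - mu[i, j]) * prior[i] / p_xj[j]
                           for j in range(L)])                             # p(omega_i|x_j)
        rhs = np.prod(p_w_xj) / prior[i] ** L * np.prod(p_xj) / p_x        # Eq. 11
        assert np.isclose(lhs, rhs)
print("Eq. 11 verified for all 2**L binary descriptor vectors.")

The identity holds exactly here because the joint density p(x1,x2,x3) is itself constructed from the conditional-independence model; with real data the “naive” assumption is, of course, only an approximation.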

Since generative approaches can be informative and “simpler” than their discriminative counterparts [9], we make use of Bayes’ theorem again, i.e., p(ωi|xj) = p(ωi)p(xj|ωi)/p(xj), and then estimate p(ωi|xj) through p(ωi)p(xj|ωi)/p(xj), where p(xj) = Σ_{i=1}^{C} p(xj|ωi)p(ωi), with C referring to the number of classes. p(ωi) denotes the a priori class probability, which is relatively easy to estimate. Thus, in our Bayesian context, the estimation of p(ωi|xj) boils down in practice to estimating p(xj|ωi).

Estimation of p(xj|ωi), with xj = 1 and 0

p(xj|ωi) can be estimated using the given data and assuming a Beta distribution as an a priori distribution for p(xj|ωi) [10]. (There are other possible prior distributions from which one can choose, but we select the Beta distribution for reasons that will transpire later.) As described in Appendix A, a Beta a priori distribution Beta(αi,βi) for p(xj|ωi) results in a p(xj|ωi) estimator of the form [11]:

p(x_j=1|\omega_i) = \frac{N_{ij} + \alpha_i}{N_{\omega_i} + \beta_i + \alpha_i} \qquad (12)

and of course

p(x_j=0|\omega_i) = 1 - \frac{N_{ij} + \alpha_i}{N_{\omega_i} + \beta_i + \alpha_i} \qquad (13)

where Nωi and Nij, respectively, denote the number of compounds in class ωi and the number of compounds in this class with descriptor xj assuming the value 1. βi and αi are Beta-distribution hyper–parameters per class, and the valid range of values that these hyper–parameters can assume is defined in Appendix A. When αi and βi equal 1, the αi and βi + αi in Eqs. 12–13 can be viewed as a “Laplacian correction”.
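
For readers who prefer code to formulas, the following minimal Python/NumPy sketch implements Eqs. 12–13 under the stated assumptions; the fingerprint matrix X, the label vector y and the default hyper-parameter values αi = βi = 1 (the Laplacian-correction case) are our own illustrative choices, not part of the original papers.

import numpy as np

def class_conditional_estimates(X, y, n_classes, alpha=1.0, beta=1.0):
    """Eq. 12: p(x_j = 1 | omega_i) = (N_ij + alpha_i) / (N_omega_i + alpha_i + beta_i).

    X : (N, L) binary descriptor matrix; y : (N,) integer class labels.
    alpha, beta : Beta-prior hyper-parameters (alpha = beta = 1 gives the
    Laplacian correction mentioned in the text).
    """
    X = np.asarray(X, dtype=float)
    p1 = np.empty((n_classes, X.shape[1]))
    for i in range(n_classes):
        in_class = (y == i)
        N_wi = in_class.sum()                      # N_omega_i
        N_ij = X[in_class].sum(axis=0)             # N_ij, per feature j
        p1[i] = (N_ij + alpha) / (N_wi + alpha + beta)
    return p1                                      # Eq. 13 is simply 1 - p1

# Hypothetical usage with a toy random dataset:
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 16))
y = rng.integers(0, 2, size=100)
p_xj1_given_class = class_conditional_estimates(X, y, n_classes=2)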

Results and discussion

Estimation of p(ωi|xj = 1) and p(ωi|xj = 0)

Estimation of p(ωi|xj = 1): In Our Approach

Remark 1

Assume that we have N chemical compounds (and their activity labels) available for training, where Nωi of these compounds belong to class ωi.

Remark 2

Assume that the class a priori distribution is taken as p(ωi) = Nωi/N, where Nωi ≫ αi + βi (which is a valid assumption in any realistic large chemical dataset).

By virtue of Remark 1 and Eq. 12, the estimate of p(ωi|xj = 1) becomes

p(\omega_i|x_j=1) = \frac{p(x_j=1|\omega_i)\,p(\omega_i)}{p(x_j=1)} = \frac{\dfrac{N_{ij}+\alpha_i}{N_{\omega_i}+\alpha_i+\beta_i} \times \dfrac{N_{\omega_i}}{N}}{\sum_{i=1}^{C} \dfrac{N_{ij}+\alpha_i}{N_{\omega_i}+\alpha_i+\beta_i} \times \dfrac{N_{\omega_i}}{N}} \qquad (14)

(recall that p(xj) = Σ_{i=1}^{C} p(xj|ωi)p(ωi)).

Because of Remark 2, Eq. 14 can be simplified to

p(\omega_i|x_j=1) = \frac{N_{ij}+\alpha_i}{\sum_{i=1}^{C} N_{ij} + \sum_{i=1}^{C}\alpha_i} = \frac{N_{ij}+\alpha_i}{N_{j+} + \sum_{i=1}^{C}\alpha_i} \qquad (15)

where Nj+ = Σ_{i=1}^{C} Nij is the number of times xj assumes the value 1 across the whole training set.

Estimation of p(ωi|xj = 1): In Xia et al.’s Formulation

In the approach of Xia et al., p(ωi|xj = 1) is estimated as

p(\omega_i|x_j=1) = \frac{N_{ij}+A_i}{N_{j+}+K} \qquad (16)

where K is as defined in Xia et al.; in their paper, Ai is given as

A_i = p(\omega_i) \times K, \quad \text{with } K = \sum_{i=1}^{C} A_i \ \text{ since } \ \sum_{i=1}^{C} p(\omega_i) = 1

Eq. 16 constitutes what Xia et al. term “the Laplacian–Corrected Modified Naive Bayes (LCMNB)” estimator for p(ωi|xj = 1).

If αi in Eq. 15 is set to Ai, Eq. 15 is exactly equivalent to Xia et al.’s estimator for p(ωi|xj = 1) as can be seen in Eq. 16.

We note in passing that in Xia et al.’s case, C = 2 and p(ω2) = 1/K, which in their nomenclature is denoted by p(Active) – that is, A2 = 1 while A1 = K − 1.
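
The stated equivalence is easy to confirm numerically. In the short Python sketch below (the counts and the value of K are invented purely for illustration), setting αi = Ai = p(ωi)K in Eq. 15 reproduces Xia et al.’s estimator of Eq. 16 exactly, since Σi Ai = K.

import numpy as np

# Invented counts for one feature j in a two-class problem (C = 2).
N_ij = np.array([30.0, 5.0])     # compounds with x_j = 1 in each class
prior = np.array([0.2, 0.8])     # p(omega_i), e.g. N_omega_i / N
K = 2.0                          # Xia et al.'s smoothing constant
A = prior * K                    # A_i = p(omega_i) * K, so sum(A) = K
N_jplus = N_ij.sum()             # number of times x_j = 1 overall

eq15 = (N_ij + A) / (N_jplus + A.sum())   # Eq. 15 with alpha_i set to A_i
eq16 = (N_ij + A) / (N_jplus + K)         # Xia et al.'s LCMNB estimator, Eq. 16

assert np.allclose(eq15, eq16)
print(eq15)   # identical posterior estimates p(omega_i | x_j = 1)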

Initially we employed the Beta a priori distribution for the class conditional distribution to ascertain the equivalence of Eqs. 15 and 16. Fortunately, however, we have ended up with the general equations (Eqs. 14 – 15) that not only encapsulate the LCMNB scheme of Xia et al., but also subsume the other various variants of LCMNB, such as those discussed in Nidhi et al. and Nigsch et al.’s papers [3,4].

At any rate, let us proceed to the nub of this work: Identifying the conditions under which the LCMNB algorithm holds with respect to the SNB algorithm. But first we need to describe the estimation of p(ωi|xj = 0).

Estimation of p(ωi|xj = 0): In Our Approach

In regard to the case of xj = 0, we make use of Remark 1, Remark 2 and Eq. 13, which yield an estimator for p(ωi|xj = 0) as

p(\omega_i|x_j=0) = \frac{N_{\omega_i} - (N_{ij}+\alpha_i)}{N - \left(N_{j+} + \sum_{i=1}^{C}\alpha_i\right)} \qquad (17)

Naive Bayes: scoring function

For notational convenience let us denote (Nij + αi)/(Nj+ + Σ_{i=1}^{C} αi) in Eq. 15 and [Nωi − (Nij + αi)]/[N − (Nj+ + Σ_{i=1}^{C} αi)] in Eq. 17 by ξij and νij, respectively.

Thus, p(ωi|xj = 1) and p(ωi|xj = 0) may be written more succinctly as p(ωi|xj) = ξij^{xj} νij^{(1−xj)}, which allows us to express Eq. 11 more compactly as

\frac{p(\omega_i|\mathbf{x})}{p(\omega_i)} = \frac{\prod_{j} p(\omega_i|x_j)}{p^L(\omega_i)} \times \frac{\prod_{j} p(x_j)}{p(x_1,x_2,\ldots,x_L)} = \frac{\prod_{j} \xi_{ij}^{x_j}\,\nu_{ij}^{(1-x_j)}}{p^L(\omega_i)} \times \frac{\prod_{j} p(x_j)}{p(x_1,x_2,\ldots,x_L)} \qquad (18)

Now we come to the core of this work: under which conditions does the LCMNB algorithm hold with respect to the SNB algorithm? Before we answer this question, we deem it instructive and more insightful to map Eq. 18 monotonically to a discriminant function, a “scoring function” (so to speak).

To this end, taking the logarithm of Eq. 18 results in

S_{\omega_i}(\mathbf{x}) = \ln\frac{p(\omega_i|\mathbf{x})}{p(\omega_i)} = \ln\left[\frac{\prod_{j}\xi_{ij}^{x_j}\,\nu_{ij}^{(1-x_j)}}{p^L(\omega_i)} \times \frac{\prod_{j} p(x_j)}{p(x_1,x_2,\ldots,x_L)}\right] \qquad (19)

= \sum_{j} x_j\ln\xi_{ij} + \sum_{j}(1-x_j)\ln\nu_{ij} - L\ln p(\omega_i) + \sum_{j}\ln p(x_j) - \ln p(x_1,x_2,\ldots,x_L) \qquad (20)

Self–evidently, the term Σ_j ln p(xj) − ln p(x1,x2,...,xL) is common to all classes and therefore does not play any role in classifying a given new compound. In other words, for practical classification purposes we are only interested in the class-dependent terms, i.e.,

D_{\omega_i}(\mathbf{x}) = \sum_{j} x_j\ln\xi_{ij} - \ln p(\omega_i) + \sum_{j}(1-x_j)\ln\nu_{ij} - (L-1)\ln p(\omega_i) \qquad (21)

where Sωi(x) = Dωi(x) + Σ_j ln p(xj) − ln p(x1,x2,...,xL).
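
To make the decomposition in Eq. 21 concrete, the sketch below (a minimal illustration with our own variable names, not the authors’ code) computes, for each class, the present-feature term Σ_j xj ln ξij − ln p(ωi) that LCMNB keeps and the absent-feature term of Eq. 22 that it discards, and returns their sum Dωi(x).

import numpy as np

def nb_scores(x, xi, nu, prior):
    """Split the SNB discriminant of Eq. 21 into its two parts.

    x     : (L,) binary descriptor vector of the compound to be scored
    xi    : (C, L) matrix of xi_ij = p(omega_i | x_j = 1) estimates (Eq. 15)
    nu    : (C, L) matrix of nu_ij = p(omega_i | x_j = 0) estimates (Eq. 17)
    prior : (C,)  class priors p(omega_i)
    Returns (present_term, absent_term, D), one entry per class.
    """
    x = np.asarray(x, dtype=float)
    L = x.size
    present = (x * np.log(xi)).sum(axis=1) - np.log(prior)                    # LCMNB part
    absent = ((1.0 - x) * np.log(nu)).sum(axis=1) - (L - 1) * np.log(prior)   # Eq. 22
    return present, absent, present + absent                                  # Eq. 21

A compound is then assigned to the class with the largest Dωi(x); LCMNB instead ranks the classes by the present-feature term alone.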

Conditions

In Xia et al.’s approach, the LCMNB algorithm is none other than Σ_j xj ln ξij − ln p(ωi) in Eq. 21. This means that in Xia et al.’s scheme the contributions from the terms depending on xj = 0 for a given class, i.e.,

\sum_{j}(1-x_j)\ln\nu_{ij} - (L-1)\ln p(\omega_i), \quad \forall i,\; i=1,2,\ldots,C \qquad (22)

are discarded. To the best of our knowledge, neither in Xia et al. nor in any other paper on the LCMNB approach has it been demonstrated that (i) the contribution of Eq. 22 is zero, i.e.,

\sum_{j}(1-x_j)\ln\nu_{ij} - (L-1)\ln p(\omega_i) = 0, \quad \forall i,\; i=1,2,\ldots,C \qquad (23)

equally, in these papers, it has not been shown that (ii)

\left|\sum_{j} x_j\ln\xi_{ij} - \ln p(\omega_i)\right| \gg \left|\sum_{j}(1-x_j)\ln\nu_{ij} - (L-1)\ln p(\omega_i)\right|, \quad \forall i,\; i=1,2,\ldots,C \qquad (24)

nor has it been established that (iii)

\sum_{j}(1-x_j)\ln\nu_{ij} - (L-1)\ln p(\omega_i) = \text{constant}, \quad \forall i,\; i=1,2,\ldots,C \qquad (25)

Thus, unless one (or more) of the above – (i), (ii) and (iii) – is met, the assumption on which the Modified Naive Bayesian algorithm is based is questionable, and its practitioners should pay attention to this discrepancy; clearly it is not justifiable to discard from the outset the contribution of Σ_j (1−xj) ln νij − (L−1) ln p(ωi) simply because the features xj are absent, i.e. xj = 0.
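
In practice, one can probe conditions (ii) and (iii) empirically on one’s own training set before adopting LCMNB. The rough diagnostic below is our own suggestion (it is not part of Xia et al.’s method); for a batch of compounds it compares the magnitudes of the present- and absent-feature terms of Eq. 21 and measures how much the absent-feature term varies across classes. The inputs xi, nu and prior are assumed to have been computed as in Eqs. 15 and 17.

import numpy as np

def lcmnb_condition_check(X, xi, nu, prior):
    """Empirical check of conditions (ii) and (iii) over a compound set X of shape (N, L)."""
    X = np.asarray(X, dtype=float)
    L = X.shape[1]
    # Per-compound, per-class terms of Eq. 21
    present = X @ np.log(xi).T - np.log(prior)                    # (N, C) present-feature term
    absent = (1.0 - X) @ np.log(nu).T - (L - 1) * np.log(prior)   # (N, C) absent-feature term, Eq. 22
    ratio = np.abs(present) / np.abs(absent)                      # condition (ii): should be >> 1
    spread = absent.max(axis=1) - absent.min(axis=1)              # condition (iii): should be ~ 0
    return ratio.mean(), spread.mean()

If the mean ratio is very large, or the mean spread is close to zero, dropping the absent-feature term is unlikely to change the class ranking; otherwise the full SNB score of Eq. 21 should be retained.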

For completeness, we also consider the case of the highly popular class prior distribution p(ωi) = 1/C, i.e. p(ω1) = p(ω2) = ... = p(ωC). We hasten to add that this option was not included in the LCMNB scheme. At any rate, by simply repeating the arguments in the preceding sections, it is straightforward to show that one ends up with Eq. 21. In this scenario, though, L × ln p(ωi) is common to all classes and therefore does not play a role in classifying a new compound, i.e., Dωi(x) reduces to

D_{\omega_i}(\mathbf{x}) = \sum_{j} x_j\ln\xi_{ij} + \sum_{j}(1-x_j)\ln\nu_{ij} \qquad (26)

Conclusions

Starting from a standard Naive Bayes (SNB) algorithm, we have derived mathematically the relationship between Xia et al.’s ingenious, but heuristic, algorithm and the standard Naive Bayes approach. We have also described the conditions under which Xia et al.’s crucial assumption – that contributions from absent features can be discarded – holds. It is our hope that, with this new insight, cheminformaticians may now be able to efficiently use the modified version of the standard Naive Bayes algorithm, as proposed by Xia et al., and subsequently by Nidhi et al. and Nigsch et al.

Appendix

Appendix A: Estimator of p(xj|ωi)

Here we give, for completeness, the proof that a Beta a priori distribution leads to Eqs. 12 and 13 in the text.

For bookkeeping:

ωi: class label indexed by i, i = 1,2,...,C.

C: Number of classes.

Nωi: Number of samples in class ωi.

Nij: Number of samples in class ωi with feature xj = 1, j = 1,2,...,L.

L: Number of features.

We state from the outset that in the following derivation we follow closely the descriptions given in ref. [10]. We also note, for clarity’s sake, that in the following analyses we abuse notation and use xjk for both the random variable and its realisation.

In this work, x ∈ {0,1}^L, i.e. xj ∈ {0,1}, and we suppose that the xj are independent Bernoulli random variables (this is in fact the assumption made in the Naive Bayesian approach). Thus, in the Naive Bayesian setting p(x|ωi) can be given as

p(\mathbf{x}|\omega_i) = \prod_{j=1}^{L}\mathrm{Ber}(x_j|\mu_{ij}) = \prod_{j=1}^{L}\mu_{ij}^{x_j}(1-\mu_{ij})^{1-x_j} \qquad (27)

where μij is an estimate for the conditional probability that feature j occurs in class ωi, and is what we are trying to estimate given a set of compounds assumed to belong to class ωi. (In our context, μij is an estimator for p(xj|ωi), where p(xj|ωi) is as defined in the text.)

To estimate μij in a Bayesian framework, we first view μij as a random variable, then choose an “appropriate” prior and likelihood for the random variable μij.

Let us suppose that our a priori knowledge about the random variable μij indicates that μij is described by a Beta distribution, i.e.,

\pi(\mu_{ij}) = \frac{1}{B(\alpha_i,\beta_i)}\,\mu_{ij}^{\alpha_i-1}(1-\mu_{ij})^{\beta_i-1}, \quad 0 \le \mu_{ij} \le 1,\; \alpha_i,\beta_i > 0,\; i=1,2,\ldots,C \qquad (28)

where B(αi,βi) ensures that the Beta distribution is normalised.

Using Bayes’ theorem, the posterior probability for μij given the training data can be written as

\pi(\mu_{ij}|x_{j1},x_{j2},\ldots,x_{jN_{\omega_i}}) = \frac{f(x_{j1},x_{j2},\ldots,x_{jN_{\omega_i}}|\mu_{ij})\,\pi(\mu_{ij})}{\int_{0}^{1} f(x_{j1},x_{j2},\ldots,x_{jN_{\omega_i}}|\mu_{ij})\,\pi(\mu_{ij})\,d\mu_{ij}} \qquad (29)

where f(xj1,xj2,...,xjNωi|μij) refers to the likelihood, and xj1, xj2, ..., xjNωi denote the jth feature of the Nωi samples/compounds from class ωi. As the samples are assumed independent, f(xj1,xj2,...,xjNωi|μij) becomes Π_{k=1}^{Nωi} f(xjk|μij) = Π_{k=1}^{Nωi} μij^{xjk} (1−μij)^{1−xjk}, i.e.

\prod_{k=1}^{N_{\omega_i}} f(x_{jk}|\mu_{ij}) = \mu_{ij}^{\sum_{k=1}^{N_{\omega_i}} x_{jk}}\,(1-\mu_{ij})^{N_{\omega_i}-\sum_{k=1}^{N_{\omega_i}} x_{jk}} \qquad (30)

Thus, the posterior π(μij|xj1,xj2,...,xjNωi) in Eq. 29 modifies to

\pi(\mu_{ij}|x_{j1},x_{j2},\ldots,x_{jN_{\omega_i}}) = \frac{\mu_{ij}^{\sum_{k=1}^{N_{\omega_i}} x_{jk}}(1-\mu_{ij})^{N_{\omega_i}-\sum_{k=1}^{N_{\omega_i}} x_{jk}}\,\pi(\mu_{ij})}{\int_{0}^{1}\mu_{ij}^{\sum_{k=1}^{N_{\omega_i}} x_{jk}}(1-\mu_{ij})^{N_{\omega_i}-\sum_{k=1}^{N_{\omega_i}} x_{jk}}\,\pi(\mu_{ij})\,d\mu_{ij}} \qquad (31)

i.e.,

\pi(\mu_{ij}|x_{j1},x_{j2},\ldots,x_{jN_{\omega_i}}) \propto \mu_{ij}^{\sum_{k=1}^{N_{\omega_i}} x_{jk}}(1-\mu_{ij})^{N_{\omega_i}-\sum_{k=1}^{N_{\omega_i}} x_{jk}}\,\pi(\mu_{ij}) \qquad (32)

= \mu_{ij}^{\sum_{k=1}^{N_{\omega_i}} x_{jk}}(1-\mu_{ij})^{N_{\omega_i}-\sum_{k=1}^{N_{\omega_i}} x_{jk}} \times \mu_{ij}^{\alpha_i-1}(1-\mu_{ij})^{\beta_i-1} \qquad (33)

= \mu_{ij}^{N_{ij}+\alpha_i-1}(1-\mu_{ij})^{N_{\omega_i}-N_{ij}+\beta_i-1} \qquad (34)

Clearly, in Eq. 34, the posterior density for μij given the samples xj1, xj2, ..., xjNωi has the same form as the prior for μij [11], i.e.,

\pi(\mu_{ij}|x_{j1},x_{j2},\ldots,x_{jN_{\omega_i}}) = \frac{1}{B(N_{ij}+\alpha_i,\,N_{\omega_i}-N_{ij}+\beta_i)}\,\mu_{ij}^{N_{ij}+\alpha_i-1}(1-\mu_{ij})^{N_{\omega_i}-N_{ij}+\beta_i-1} \qquad (35)

which is none other than another Beta distribution. This means that the Bayes estimator of μij, which is the estimate we are interested in, is the mean of the posterior distribution obtained [11]:

E[\mu_{ij}|x_{j1},x_{j2},\ldots,x_{jN_{\omega_i}}] = \frac{N_{ij}+\alpha_i}{N_{\omega_i}+\alpha_i+\beta_i} \qquad (36)

In other words,

p(x_j=1|\omega_i) = \frac{N_{ij}+\alpha_i}{N_{\omega_i}+\alpha_i+\beta_i} \qquad (37)

QED.

An accessible description of the derivation of Eq. 37 can be found in ref. [10].
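
As an independent numerical check of the conjugacy argument (this is our own verification, not part of the original derivation), one can evaluate the unnormalised posterior of Eq. 32 on a grid and compare its mean with the closed form of Eq. 36; the counts and hyper-parameters below are arbitrary.

import numpy as np

# Arbitrary illustrative counts and hyper-parameters
alpha_i, beta_i = 1.0, 1.0        # Beta prior (the Laplacian-correction case)
N_wi, N_ij = 50, 12               # class size and number of compounds with x_j = 1

mu = np.linspace(0.0, 1.0, 200001)                                   # grid over mu_ij
prior = mu ** (alpha_i - 1) * (1 - mu) ** (beta_i - 1)               # Eq. 28 (unnormalised)
likelihood = mu ** N_ij * (1 - mu) ** (N_wi - N_ij)                  # Eq. 30
posterior = likelihood * prior                                       # Eq. 32 (unnormalised)
posterior_mean = np.sum(mu * posterior) / np.sum(posterior)          # E[mu_ij | data]

closed_form = (N_ij + alpha_i) / (N_wi + alpha_i + beta_i)           # Eq. 36
print(posterior_mean, closed_form)                                   # the two agree closely
assert np.isclose(posterior_mean, closed_form, atol=1e-6)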

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

HYM conceived the idea that constitutes the nub of the presented work. This author also carried out the bulk of the mathematical derivations. RCG contributed to the Bayesian aspect of the work. JBOM conceptually contributed to the derivation given in Appendix A. The three authors participated in drafting the manuscript. All authors read and approved the final manuscript.

Contributor Information

Hamse Y Mussa, Email: hym21@cam.ac.uk.

John BO Mitchell, Email: jbom@st-andrews.ac.uk.

Robert C Glen, Email: rcg28@cam.ac.uk.

Acknowledgements

We are indebted to Dr Dave Rogers for his many useful comments on the original LCMNB approach, in particular for helping us understand more about the two–class LCMNB version.

Mussa and Glen would like to thank the Unilever Centre for Molecular Sciences Informatics for its support, whereas Mitchell would like to thank the Scottish Universities Life Sciences Alliance (SULSA).

References

  1. Murphy KP. Machine Learning: A Probabilistic Perspective. Cambridge, MA: MIT Press; 2012.
  2. Xia X, Maliski EG, Gallant P, Rogers D. Classification of kinase inhibitors using a Bayesian model. J Med Chem. 2004;47:4463–4470. doi: 10.1021/jm0303195.
  3. Nidhi, Glick M, Davies JW, Jenkins JL. Prediction of biological targets for compounds using multiple-category Bayesian models trained on chemogenomics databases. J Chem Inf Model. 2006;46:1124–1133. doi: 10.1021/ci060003g.
  4. Nigsch F, Bender A, Jenkins JL, Mitchell JBO. Ligand-target prediction using winnow and naive Bayesian algorithms and the implications of overall performance statistics. J Chem Inf Model. 2008;48:2313–2325. doi: 10.1021/ci800079x.
  5. Rogers D, Brown RD, Hahn M. Using extended-connectivity fingerprints with Laplacian-modified Bayesian analysis in high-throughput screening follow-up. J Biomol Screen. 2005;10:682–686. doi: 10.1177/1087057105281365.
  6. Townsend JA, Glen RC, Mussa HY. Note on naive Bayes based on binary descriptors in Cheminformatics. J Chem Inf Model. 2012;52:2494–2500. doi: 10.1021/ci200303m.
  7. Duda RO, Hart PE. Pattern Classification and Scene Analysis. New York, NY: John Wiley & Sons; 1973.
  8. Koch KR. Introduction to Bayesian Statistics. Berlin: Springer; 2007.
  9. Bishop CM. Pattern Recognition and Machine Learning. New York: Springer; 2006.
  10. Ross SM. Introduction to Probability and Statistics for Engineers and Scientists. New York: John Wiley & Sons; 1987.
  11. Davison AC. Statistical Models (Cambridge Series in Statistical and Probabilistic Mathematics). Cambridge: Cambridge University Press; 2008.
