Explanation of data mining methods

A Bate, R Orre, M Lindquist, I R Edwards
 

Uppsala Monitoring Centre, WHO Collaborating Centre for International Drug Monitoring, S-75320 Uppsala, Sweden

A Bate
programme leader, signal research methodology

M Lindquist
head of research and development

I R Edwards
director
 

Department of Mathematics, Stockholm University, Stockholm, Sweden

R Orre
research associate

The Uppsala Monitoring Centre’s main purpose is to find new information on drug safety. Experience has shown that, if important signals on drug safety are not to be missed, the first analysis of information should be free from prejudice and a priori thinking. Data mining using a neural network is an ideal approach for finding associations, and patterns of associations, in a large amount of data.[1] Bayesian logic is intuitively correct for a process in which additional information accrues continuously and probabilities have to be reconsidered often. Human intelligence and experience are able to operate better with such a transparent method in the generation of hypotheses. Bayesian statistics implemented in a neural network are used to mine the WHO database of adverse drug reactions. Quantitative filtering of the data focuses clinical review on the potentially most important combinations of drug and adverse reaction.[1][2][3][4][5]

How a neural network is used

The network we use is called the bayesian confidence propagation neural network.[3] This is a feed forward neural network in which learning and inference follow the principles of Bayes’s law. For routine output we use it as a one layer model,[6] although it has been extended to a multilayer network.[7] Such a multilayer network can be used in further investigations of combinations of several variables in the WHO database and has already been successfully applied to areas such as diagnosis,[8] expert systems,[9] and data analysis in pulp and paper manufacturing.[10]

Why bayesian statistics are used

The information component measures dependencies between variables in the database. An estimate of precision (the standard deviation) is provided for each point estimate of the information component, so both the point estimate of unexpectedness and the level of certainty associated with it can be examined. An advantage of bayesian methods in data mining is that distributions can be constructed in such a way that they adapt quickly to the addition of new data from the first sample; in addition, the interpretation of the probability distributions is intuitive. Despite the presence of missing data, the information component and its standard deviation can be calculated for any combination of variable values.

Why a neural network is used

The network is transparent, in that it is easy to see what has been calculated. It is also robust because valid, relevant results can be generated despite missing data. This is an important advantage as most reports in the database contain some empty fields. The results are reproducible, making validation and checking simple. The network is easy to train; it takes only one pass across the data, which makes it time efficient. Only a small proportion of all possible drug and adverse reaction combinations is actually non-zero in the database, so the use of a sparse matrix method makes searches through the database quick and efficient.
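The single pass and the sparse storage described above can be illustrated with a minimal Python sketch. It is not the centre's implementation: the function name `count_reports`, the representation of a report as a pair of sets, and the toy data are all assumptions made for illustration. Only combinations that actually occur are stored, and an empty field simply contributes nothing.

```python
from collections import Counter
from itertools import product


def count_reports(reports):
    """Count drugs, reactions, and their combinations in a single pass.

    Each report is a (drugs, reactions) pair of iterables. An empty
    iterable stands for a missing field and adds nothing to the counts.
    """
    total = 0
    drug_counts = Counter()
    reaction_counts = Counter()
    pair_counts = Counter()  # sparse: a combination never seen has no entry
    for drugs, reactions in reports:
        total += 1
        drugs, reactions = set(drugs), set(reactions)
        drug_counts.update(drugs)
        reaction_counts.update(reactions)
        pair_counts.update(product(drugs, reactions))
    return total, drug_counts, reaction_counts, pair_counts


# Hypothetical toy data, not taken from the WHO database.
reports = [
    ({"drug_a"}, {"rash"}),
    ({"drug_a", "drug_b"}, {"rash", "nausea"}),
    ({"drug_b"}, set()),  # report with an empty reaction field
]
total, drug_counts, reaction_counts, pair_counts = count_reports(reports)
```

Looking up a combination that never occurred, such as `pair_counts[("drug_c", "rash")]`, returns zero without storing anything, which is the property that keeps searches over a mostly empty matrix cheap.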

Value of bayesian statistics in a neural network

The neural network provides an efficient computational model for the analysis of large amounts of data and combinations of variables, whether real, discrete, or binary. The efficiency is enhanced by the information component being the weight in the neural network. The neural network architecture allows the same framework to be used for analysis of data and data mining as well as for prediction, which is used to recognise patterns and for classification. Bayesian statistical principles fit intuitively into the framework of a neural network approach as both build on the concept of adapting on the basis of new data.
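One way to picture the information component acting as a weight is the following Python sketch of a single output unit. It is an illustration under a naive Bayes style assumption (inputs treated as independent), not a description of the network as implemented; the function name `log2_support`, the weights, and the prior are hypothetical.

```python
import math


def log2_support(prior_y, ic_weights, active_inputs):
    """Unnormalised log2 support for outcome y given the active inputs.

    Assumption: inputs are treated as independent, so the support is
    log2 p(y) plus one information component weight per active input.
    Inputs that are missing are simply left out of the sum.
    """
    return math.log2(prior_y) + sum(ic_weights[x] for x in active_inputs)


# Hypothetical weights (information components) linking two inputs to y.
weights = {"input_a": 1.0, "input_b": 0.5}
support_both = log2_support(0.25, weights, ["input_a", "input_b"])
support_one = log2_support(0.25, weights, ["input_a"])  # input_b missing
```

Because each weight is a logarithm of a probability ratio, evidence from several inputs combines by addition, and the same stored weights serve both for inspecting dependencies and for prediction.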

The method has also been extended to detect dependencies between several variables and is robust in handling missing data. Pattern recognition by the network does not depend on any a priori hypothesis, as an unsupervised learning approach is used. This is useful in detecting new syndromes, finding age profiles of patients with adverse reactions to a drug, and determining groups at high risk and dose relations. The network can thus be used to find complex dependencies that have not necessarily been considered before. Naturally, changes in patterns may also be important.[1]

How bayesian statistics have been implemented

In this bayesian analysis the probability distributions (px, py) for each marginal event are considered the random variables of interest. The use of a conjugate prior model makes the probability distributions of the events beta distributed and the joint distribution Dirichlet distributed. Priors used for the marginals are uniform beta(a, b), that is, beta with hyperparameters a = cx + 1, b = C - cy + 1, where C = total number of counts, cx = total number of reports of variable x, and cy = total number of reports of variable y. For the joint distribution an estimate of the marginal product is used as prior, that is, a Dirichlet distribution where each term is a beta with hyperparameters a = cxy + 1/(pxpy), b = C - cxy + 1/(pxpy), where cxy = total number of reports of variables x and y. The logarithm to base 2 of the quotient between the joint distribution and the product of the marginal distributions gives a density measure of the dependency relation between the marginals. This dependency relation is referred to as the information component and is related to mutual information as used in information theory. The expectation and variance (and thus the standard deviation) are then calculated by integration.
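The calculation can be sketched in Python under stated assumptions; this is not the published implementation. The sketch assumes that each marginal is beta(c + 1, C - c + 1) for its own count c, and it adopts one reading of the joint prior, in which the prior adds one pseudo-count to the combination out of 1/(px py) in total, so that the prior mean equals the marginal product. The point estimate is a plug-in of the posterior means rather than the exact expectation, and the standard deviation is approximated by independent random sampling in place of the integration described above. All function names are hypothetical.

```python
import math
import random


def posterior_means(c_xy, c_x, c_y, total):
    """Posterior means of p(x), p(y), and p(x, y).

    Assumption: marginals are beta(c + 1, total - c + 1); the joint prior
    adds 1 pseudo-count to the cell out of 1 / (p_x * p_y) in total.
    """
    p_x = (c_x + 1) / (total + 2)
    p_y = (c_y + 1) / (total + 2)
    p_xy = (c_xy + 1) / (total + 1 / (p_x * p_y))
    return p_x, p_y, p_xy


def ic_point_estimate(c_xy, c_x, c_y, total):
    """Plug-in estimate: log2 of the joint mean over the marginal product."""
    p_x, p_y, p_xy = posterior_means(c_xy, c_x, c_y, total)
    return math.log2(p_xy / (p_x * p_y))


def ic_monte_carlo(c_xy, c_x, c_y, total, draws=2000, seed=0):
    """Approximate mean and standard deviation of the information component.

    Assumption: the three beta distributions are sampled independently,
    which ignores their dependence; the source integrates instead.
    """
    rng = random.Random(seed)
    p_x, p_y, _ = posterior_means(c_xy, c_x, c_y, total)
    prior_total = 1 / (p_x * p_y)
    samples = []
    for _ in range(draws):
        s_x = rng.betavariate(c_x + 1, total - c_x + 1)
        s_y = rng.betavariate(c_y + 1, total - c_y + 1)
        s_xy = rng.betavariate(c_xy + 1, total - c_xy + prior_total - 1)
        samples.append(math.log2(s_xy / (s_x * s_y)))
    mean = sum(samples) / draws
    variance = sum((s - mean) ** 2 for s in samples) / (draws - 1)
    return mean, math.sqrt(variance)
```

With no data at all the plug-in estimate is zero, and with few reports it is pulled towards zero relative to the raw ratio of counts. The sampled standard deviation narrows as reports accumulate, which matches the role of the standard deviation as a measure of certainty.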

This approach, when routinely applied to drug and adverse reaction combinations where variable x is the drug and variable y is the adverse reaction, can be seen as the calculation of the logarithm of the ratio of the observed rate of adverse drug reactions to the rate expected under the null hypothesis of no association between drug and adverse reaction. The calculation is, however, done in a bayesian statistical framework.
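For comparison, the same ratio computed directly from raw counts, without any prior, might look like the following Python sketch. The function name and the counts are hypothetical and serve only to show the arithmetic.

```python
import math


def log2_observed_to_expected(c_xy, c_x, c_y, total):
    """log2 of the observed joint rate over the rate expected if the drug
    and the adverse reaction were reported independently of each other."""
    observed = c_xy / total
    expected = (c_x / total) * (c_y / total)
    return math.log2(observed / expected)


# Hypothetical counts: 16 joint reports, 100 for the drug, 400 for the
# reaction, 10 000 reports in total. The combination is reported about
# four times as often as expected, so the value is close to 2.
value = log2_observed_to_expected(16, 100, 400, 10_000)
```

This raw version has no finite value when a combination has never been reported, and it carries no measure of certainty; the bayesian framework supplies both a defined estimate and a standard deviation in those cases.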

  1. Hand DJ. Statistics and data mining: intersecting disciplines. SIGKDD Explorations 1999;1:16-21.
  2. Orre R, Lansner A, Bate A, Lindquist M. Bayesian neural networks with confidence estimations applied to data mining. Computational Statistics and Data Analysis 2000;34:473-93.
  3. Bate A, Lindquist M, Edwards IR, Olsson S, Orre R, Lansner A, et al. A Bayesian neural network method for adverse drug reaction signal generation. Eur J Clin Pharmacol 1998;54:315-21.
  4. Lindquist M, Edwards IR, Bate A, Fucik H, Nunes AM, Ståhl M. From association to alert—a revised approach to international signal analysis. Pharmacoepidemiol Drug Safety 1999;8:S15-25.
  5. Lindquist M, Ståhl M, Bate A, Edwards IR, Meyboom RHB. A retrospective evaluation of a data mining approach to aid finding new adverse drug reaction signals in the WHO international database. Drug Safety 2000;23:533-42.
  6. Lansner A, Ekeberg O. A one layer feedback artificial neural network with a bayesian learning rule. Int J Neural Syst 1989;1:77-87.
  7. Holst A. The use of a Bayesian neural network model for classification tasks [thesis]. Stockholm: Royal Institute of Technology, 1997.
  8. Holst A, Lansner A. A higher order neural network for classification and diagnosis. In: Gammerman A, ed. Computational learning and probabilistic reasoning. Chichester: Wiley, 1996:199-209.
  9. Holst A, Lansner A. A flexible and fault tolerant query-reply system based on a Bayesian neural network. Int J Neural Syst 1993;4:257-67.
  10. Orre R, Lansner A. Pulp quality modelling using Bayesian mixture density neural networks. J Syst Eng 1996;6:128-36.