2009 Jul 14;21:49–72. doi: 10.1007/978-1-4419-1278-7_4

Data Analysis and Outbreak Detection

Hsinchun Chen 5, Daniel Zeng 6,7, Ping Yan 6
PMCID: PMC7498921

Abstract

The analysis components of a syndromic surveillance system focus on detecting the changes in public health status, which may be indicative of disease outbreaks. At the core of these analysis components is the automated process of detecting aberration or data anomalies in the public health surveillance data, which often have prominent temporal and spatial data elements, by statistical analysis or data mining techniques. These methods are also capable of dealing with various common problems in epidemiological data such as bias, delay, lack of accuracy, and seasonality. These techniques are the focus of this chapter.

When processing public health surveillance data streams, it is often necessary to map the collected syndromic data into a small set of syndrome categories to facilitate follow-up analysis and outbreak detection. Section 4.1 discusses related syndrome classification approaches. In Sect. 4.2, we provide a taxonomy of anomaly analysis and outbreak detection methods used for biosurveillance. Sections 4.3–4.6 summarize various specific detection methods spanning from classic statistical methods to data mining approaches, which quantify the possibility of an outbreak conditioned on surveillance data.

Keywords: Anomaly Detection, Exponentially Weighted Moving Average, Recursive Least Square, Unified Medical Language System, Syndromic Surveillance



Syndrome Classification

The onset of a number of syndromes can indicate certain diseases threatening public health. For example, an influenza-like syndrome could be due to an anthrax attack, which is of particular interest to biodefense. Syndrome classification is thus one of the first and most important steps in syndromic data processing and analysis.

A substantial amount of research effort has been expended on classifying free-text chief complaints into syndromes. This classification task is difficult because different expressions, acronyms, abbreviations, and truncations are often found in free-text chief complaints (Sniegoski, 2004). For example, “chst pn,” “CP,” “c/p,” “chest pai,” “chert pain,” “chest/abd pain,” and “chest discomfort” can all mean “chest pain.” On the basis of our summary findings reported in Section 3.1, a majority of syndromic surveillance systems use chief complaints as a major source of data. Therefore, the problem of mapping each chief complaint record to a syndrome category, referred to as syndrome classification, is an important practical challenge. Another syndromic data type often used for syndromic surveillance, ICD-9 or ICD-9-CM codes, also needs to be grouped into syndrome categories; processing such information is somewhat easier because the data records are structured.
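To make the cleanup problem concrete, the sketch below normalizes a few chief-complaint variants with a small, hand-built synonym map; the map entries and function name are illustrative assumptions, not part of any deployed system.

```python
import re

# Hypothetical synonym map: each variant (after cleanup) points to a
# canonical symptom phrase. Real systems use much larger curated lists.
SYNONYMS = {
    "chst pn": "chest pain",
    "cp": "chest pain",
    "c/p": "chest pain",
    "chest pai": "chest pain",
    "chert pain": "chest pain",
    "chest discomfort": "chest pain",
}

def normalize_cc(raw: str) -> str:
    """Lowercase, trim, collapse internal whitespace, then look the result
    up in the synonym map (falling back to the cleaned text)."""
    text = raw.lower().strip()
    text = re.sub(r"\s+", " ", text)
    return SYNONYMS.get(text, text)

print(normalize_cc("CHST PN"))   # variants map to one canonical phrase
print(normalize_cc("  c/p "))
```

In practice this lookup step is only the front end of a classifier; the canonical phrases are then mapped to syndrome categories.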

A syndrome category is defined as a set of symptoms, which is an indicator of some specific diseases. For example, a short-phrase chief complaint “coughing with high fever” can be classified as the “upper respiratory” syndrome. Table 4-1 summarizes some of the most commonly-monitored syndrome categories. Note that different syndromic surveillance systems may monitor different categories. For example, in the RODS system there are seven syndrome groups of interest for biosurveillance purposes, whereas EARS defines a more detailed list of 43 syndromes. Some syndromes are of common interest across different systems, such as respiratory or gastrointestinal syndromes.

Table 4-1.

Diseases and syndrome categories commonly monitored.

Influenza-like Respiratory Dermatological
Fever Neurologic Cold
Gastrointestinal Rash Diarrhea
Hemorrhagic illness Severe illness and death Asthma
Localized cutaneous lesion Specific infection Vomit
Lymphadenitis Sepsis Other/none of the above
Constitutional
Bioterrorism agent-related diseases
Anthrax Botulism-like/botulism Plague
Tularemia Smallpox SARS (severe acute respiratory syndrome)

Syndrome Classification Approaches

The syndrome classification process can be either manual or implemented through an automated system. The BioSense system, developed by CDC (Ma et al., 2005), for instance, relies on a working group that develops syndrome mapping using CDC definitions. However, automated, computerized syndrome classification is essential to real-time syndromic surveillance. A software application that analyzes chief complaint records or ICD-9 codes and then determines appropriate syndrome categories is often known as a syndrome classifier.

Manual Grouping The BioSense system (Bradley et al., 2005; Sokolow et al., 2005) and the Syndromal Surveillance Tally Sheet program used in EDs of Santa Clara County, California, use a manual approach to classify symptoms. Medical experts in syndromic surveillance, infectious diseases, and medical informatics map laboratory test orders into 11 syndrome categories defined by a multi-agency working group (Ma et al., 2005).

Automated Classification Existing automated classification methods can be roughly categorized into three groups: supervised learning, rule-based classification, and ontology-enhanced classification. Supervised learning methods require as input a set of CC records labeled with syndromes as learning samples before they can proceed to classify unlabeled CC records. Naive Bayesian and Bayesian network-based methods are two examples of supervised learning methods (Ivanov et al., 2002; Sniegoski, 2004). For instance, the CoCo chief complaint classifier developed as part of the RODS system is a Bayesian classifier (Chapman et al., 2003). Often, a learning approach has a natural language processing (NLP) component, which classifies free-text CCs with a simplified grammar containing rules for nouns, adjectives, prepositional phrases, and conjunctions. As part of RODS, Chapman et al. adapted MPLUS, a Bayesian network-based NLP system, to classify free-text chief complaints (Wagner et al., 2004a; Chapman et al., 2005). Implementing learning algorithms is straightforward; however, collecting training records is usually costly and time-consuming. Another major disadvantage of supervised learning methods is their limited flexibility and generalizability: recoding for different syndromic definitions, or implementing the CC classification system in an environment different from the one where the original labeled training data were collected, can be costly and difficult.
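As a minimal sketch of the supervised-learning approach, the toy naive Bayes classifier below learns word likelihoods from labeled CC records and classifies new ones; the tiny training set and add-one smoothing are illustrative assumptions, not CoCo's actual data or design.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (chief_complaint_text, syndrome_label) pairs.
    Returns class priors, per-class word counts, and the vocabulary."""
    priors = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        priors[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return priors, word_counts, vocab

def classify_nb(text, priors, word_counts, vocab):
    """Pick the syndrome maximizing log P(label) + sum log P(word | label),
    with add-one (Laplace) smoothing for unseen words."""
    total = sum(priors.values())
    best, best_score = None, float("-inf")
    for label in priors:
        score = math.log(priors[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

# Invented training records for illustration only.
train = [
    ("cough fever", "respiratory"),
    ("cough sore throat", "respiratory"),
    ("vomiting diarrhea", "gastrointestinal"),
    ("nausea diarrhea", "gastrointestinal"),
]
priors, wc, vocab = train_nb(train)
print(classify_nb("fever cough", priors, wc, vocab))
```

The labeled-sample requirement discussed above is visible here: the classifier is only as good as the (costly to collect) training pairs.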

In contrast, rule-based classification does not require labeled training data. A text string searching process for syndrome category classification is a typical rule-based approach. In general, the CC records are first cleansed and then mapped to the syndrome categories according to a set of rules often predefined by medical experts following the definitions of syndromes of interest. For instance, an example rule could be “fever, if NOT animal and NOT environmental and fever.” Many applications, for example, EARS (Hutwagner et al., 2003), ESSENCE (CDC, 2003), and the National Bioterrorism Syndromic Surveillance Demonstration Program (Yih, Abrams et al., 2005), make use of such rules. Rule-based methods are relatively flexible, as the inference rules can be easily modified and updated. A major problem with rule-based classification methods is that they cannot handle symptoms not covered in the set of predefined rules.
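A rule of the form above (“fever, if NOT animal and NOT environmental and fever”) can be sketched as a required/excluded keyword test; the rule table here is an illustrative stand-in for expert-defined syndrome definitions.

```python
# Each rule: (syndrome, keywords that must appear, keywords that must NOT appear).
# These rules are invented stand-ins for expert-defined definitions.
RULES = [
    ("fever", {"fever"}, {"animal", "environmental"}),
    ("gastrointestinal", {"diarrhea"}, set()),
]

def classify_by_rules(cc: str) -> str:
    """Apply the first rule whose required keywords are all present and
    whose excluded keywords are all absent; otherwise fall through."""
    tokens = set(cc.lower().split())
    for syndrome, required, excluded in RULES:
        if required <= tokens and not (excluded & tokens):
            return syndrome
    return "other/none of the above"

print(classify_by_rules("high fever and chills"))     # matches the fever rule
print(classify_by_rules("fever after animal bite"))   # excluded keyword: falls through
```

The weakness noted above is also visible: a complaint whose symptoms match no predefined rule lands in the fall-through category.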

The third category of automated approaches, ontology-based classification, utilizes relations between medical concepts (Leroy and Chen, 2001). Two representative methods are the BioPortal CC Classifier, which relies on Unified Medical Language System (UMLS) vocabularies and semantics (Lu et al., 2006, 2008), and the BioStorm approach, which uses a vocabulary abstraction method (Crubézy et al., 2005). The BioPortal CC Classifier uses the UMLS Metathesaurus and SPECIALIST Lexicon to suggest a symptom grouping (as an intermediary representation) for a given CC record and then classifies it using rules. It provides a flexible architecture that supports easy adaptation to new syndromic categories. The BioStorm approach creates a series of intermediate abstractions, up to a syndrome category, from primitive data (e.g., signs, lab tests) for syndromes indicative of illness due to an agent of bioterrorism.

We summarize representative syndrome classification methods in Table 4-2.

Table 4-2.

Representative syndrome classification approaches.

Category Example approaches Application
Manual grouping Medical experts perform the mapping of laboratory test orders into syndrome categories (Ma et al., 2005). The BioSense system (Bradley et al., 2005; Sokolow et al., 2005) and Syndromal Surveillance Tally Sheet program in EDs of Santa Clara County, California.
Natural language processing (NLP) NLP-based approaches classify free-text CCs with simplified grammar containing rules for nouns, adjectives, prepositional phrases, and conjunctions. Critiques of NLP-based methods include lack of semantic markings in chief complaints and the amount of training needed. As part of RODS, Chapman et al. adapted the MPLUS, a Bayesian network-based NLP system, to classify the free-text chief complaints (Chapman et al., 2005; Wagner et al., 2004a).
Bayesian classifiers Bayesian classifiers, including naïve Bayesian classifiers, bigram Bayes, and their variations, can classify CCs learned from the training data consisting of labeled CCs. The CoCo Bayesian classifier from the RODS project (Chapman et al., 2003)
Text string searching A rule-based method that first uses keyword matching and synonym lists to standardize CCs. Predefined rules are then used to classify CCs or ICD-9 codes into syndrome categories. EARS (Hutwagner et al., 2003), ESSENCE (CDC, 2003), and the National Bioterrorism Syndromic Surveillance Demonstration Program (Yih et al., 2005)
Vocabulary abstraction This approach creates a series of intermediate abstractions up to a syndrome category from the individual data (e.g., signs) for syndromes due to an agent of bioterrorism. The BioStorm system (Crubézy et al., 2005; Buckeridge et al., 2002; Shahar and Musen, 1996)
Ontology-based classification A rule-based system that can generalize symptoms grouping rules based on UMLS-derived vocabularies and semantics. It provides a flexible architecture for changing or adapting new syndromic categories. The syndromic mapping component of the BioPortal system (Lu et al., 2008)

An interesting complementary method that uses both manual and natural language processing techniques to create CC classifiers is presented by Halasz et al. (2006). They applied an n-gram text processing program to a training set of ED visits for which both the CC and ICD-9 code were known. A collection of CC substrings with associated probabilities was constructed and used to generate a CC classifier program. This approach allows rapid automated creation and updating of CC classifiers based on ICD-9 groupings.

Researchers have also started working on CC classifiers for non-English CCs. There is a critical need for CC classification systems capable of processing non-English CCs, as syndromic surveillance is being increasingly practiced around the world. One design first maps non-English CCs to English CCs and then uses well-tested English CC classification systems to process the translated CCs (Lu et al., 2007a).

Performance of Syndrome Classification Approaches

On the basis of our survey, about 40% of syndromic surveillance systems use automated syndrome classification, while the other 40% rely on a manual approach (details are unknown for the remaining 20%). There is clearly room for improvement and adoption of automated methods.

Evaluation studies have been conducted to compare various classifiers' performance for selected syndrome types (Travers and Haas, 2004). For instance, experiments comparing two Bayesian classifiers for the acute gastrointestinal syndrome showed a 68% mapping success against expert classification of ED reports (Ivanov et al., 2002). In general, however, it is difficult to paint a general picture of how well syndromic classifiers perform and how they fare against each other as many systems have not been evaluated on classification accuracy. In addition, the performance of these classifiers varies with different syndrome categories, further complicating the evaluation task.

Many prior studies show that a considerable portion (30–40%) of chief complaint data is not classifiable because it is too noisy. However, combining chief complaints with the diagnostic codes (such as ICD-9) recorded during the same visit can achieve better classification accuracy (Reis and Mandl, 2004).

Another challenge facing syndrome classification is that there are no universally-accepted, standardized syndrome definitions. As a result, significant rewriting/fine-tuning efforts are needed when applying a classification approach in particular application contexts. One possible approach to deal with these difficulties is to create intermediary representations (such as symptom groups) and create explicit rules that map these intermediary representations into customized syndrome categories (Lu et al., 2006).

A Taxonomy of Outbreak Detection Methods

Syndromic surveillance systems typically make available multiple outbreak detection algorithms, as no single method can deliver superior performance across a wide range of scenarios or meet different surveillance objectives (Buckeridge et al., 2003).

Many statistical and data mining techniques for syndromic surveillance have been proposed in the literature. These methods can be broadly divided into retrospective and prospective approaches. If we instead consider the characteristics of the surveillance data analyzed, an orthogonal classification scheme is possible, dividing outbreak detection methods into temporal, spatial, and spatial-temporal analysis approaches. This subsection discusses both schemes.

Interested readers are referred to http://statpages.org/, which provides tutorials for various kinds of parametric and nonparametric statistical tests that form the statistical foundation of outbreak detection, and http://www.autonlab.org/tutorials/, which includes statistical data mining and machine learning tutorials. The review articles on data mining and its application in health and medical information (Bath, 2004; Benoit, 2002) are also good references to provide in-depth background for the material presented in this section.

Retrospective vs. Prospective Syndromic Surveillance

A number of surveillance approaches fall under the general umbrella of retrospective models, which aim at testing statistically whether events are randomly distributed over space and time for a predefined geographical region during a predetermined time period (Kulldorff, 2001). Some examples of retrospective methods include space scan statistic (Kulldorff, 1997), Nearest Neighbor Hierarchical Clustering (NNH) (Levine, 2002), and Risk-adjusted Support Vector Clustering (RSVC) (Zeng et al., 2004a). When applying retrospective methods, there is usually a clear distinction between the baseline data points and the observations of interest, where the baseline data correspond to known “normal” health status and the observations of interest are case reports to be examined for surveillance purposes. In applications where the separation between the baseline data and observations of interest can be cleanly and meaningfully done, retrospective methods can be effectively applied.
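To illustrate the scan-statistic idea in its simplest, purely temporal form (the spatial version scans geographic areas analogously), the sketch below scores every contiguous window of a count series with the standard Poisson log-likelihood ratio and returns the most anomalous window; this is a simplified illustration, not Kulldorff's full implementation.

```python
import math

def temporal_scan(counts, baseline):
    """Scan all contiguous windows of a count series and return the window
    with the highest Poisson log-likelihood ratio versus the baseline.
    counts[i] = observed cases in period i; baseline[i] = expected cases."""
    C = sum(counts)       # total observed
    E = sum(baseline)     # total expected
    best, best_llr = None, 0.0
    n = len(counts)
    for i in range(n):
        for j in range(i + 1, n + 1):
            c = sum(counts[i:j])        # observed inside the window
            e = sum(baseline[i:j])      # expected inside the window
            if c <= e or c == 0 or c == C:
                continue  # only windows with excess risk are of interest
            # LLR for "rate inside window > rate outside" vs "uniform rate"
            llr = c * math.log(c / e) + (C - c) * math.log((C - c) / (E - e))
            if llr > best_llr:
                best, best_llr = (i, j), llr
    return best, best_llr

counts = [2, 1, 2, 9, 11, 2, 1]     # invented daily counts with a mid-week spike
baseline = [2.0] * 7
window, llr = temporal_scan(counts, baseline)
print(window)   # the elevated stretch (half-open index range)
```

In a full implementation, the significance of the best window is assessed by Monte Carlo replication under the null hypothesis rather than by the raw likelihood ratio alone.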

One major limitation of retrospective methods is that they are slow in detecting emerging clusters when the separation between the baseline data and observations of interest is not obvious. The resulting manual trial-and-error interventions severely limit the applicability of retrospective methods.

Prospective surveillance often entails repeated analyses performed periodically on incoming surveillance data streams to identify statistically significant changes in an online context (Chang et al., 2005). Using such a method, the separation of the baseline data and observations of interest is no longer needed as the system automatically tries various combinations of having some time windows as the baseline and some periods after them as the time of interest.

Prospective analysis has long been used in disease surveillance applications. The CUSUM method is one of the most established methods. Other examples include Rogerson's approaches (Rogerson, 1997), Kulldorff's prospective version of time-space scan statistics (Kulldorff, 2001), and the Prospective Support Vector Clustering (PSVC) method (Chang et al., 2005).

Temporal, Spatial, and Spatial-Temporal Outbreak Detection Methods

Table 4-3 summarizes a wide range of outbreak detection methods, all of them implemented in one or more of the syndromic surveillance systems surveyed. They are divided into three groups: temporal, spatial, and spatial-temporal (Buckeridge et al., 2005b; Mandl et al., 2004). Note that this table does not attempt to exhaustively list every detection algorithm proposed in the literature. Interested readers can refer to (Brookmeyer and Stroup, 2004; Lawson and Kleinman, 2005) for recent in-depth reviews of a more comprehensive set of algorithms. The methods listed in Table 4-3 were chosen because of their connection with the syndromic surveillance systems surveyed. Although not exhaustive, the table covers most of the detection method types and provides a useful snapshot of the state of the art. The next three sections provide additional analysis of these three groups of detection methods, respectively.

Table 4-3.

Outbreak detection algorithms.

Algorithm Short description Availability and applications Features and problems
Temporal analysis
Serfling method A static cyclic regression model with predefined parameters optimized through the training data Available from RODS (Tsui et al., 2001); used by CDC for flu detection; Costagliola et al. applied Serfling's method to the French influenza-like illness surveillance (Costagliola et al., 1981) The model fits data poorly during epidemic periods. To use this method, the epidemic period has to be predefined.
Autoregressive Integrated Moving Average (ARIMA) A linear function learns parameters from historical data. Seasonal effect can be adjusted. Available from RODS Suitable for stationary environments.
Recursive Least Square (RLS) A dynamic autoregressive linear model that predicts the current count of each syndrome within a region based on the historical data; it continuously adjusts model coefficients based on prediction errors Available from RODS Suitable for dynamic environments.
Exponentially Weighted Moving Average (EWMA) Predictions based on exponential smoothing of previous several weeks of data with recent days having the highest weight (Neubauer, 1997) Available from ESSENCE Allowing the adjustment of shift sensitivity by applying different weighting factors.
Cumulative Sums (CUSUM) A control chart-based method to monitor for the departure of the mean of the observations from the estimated mean (Das et al., 2003; Grigoryan et al., 2005). It allows for limited baseline data. Widely used in current surveillance systems including BioSense, EARS (Hutwagner et al., 2003) and ESSENCE, among others This method performs well for quick detection of subtle changes in the mean (Rogerson, 2005); it is criticized for its lack of adjustability for seasonal or day-of-week effects.
Hidden Markov Models (HMM) HMM-based methods use a hidden state to capture the presence or absence of an epidemic of a particular disease and learn probabilistic models of observations conditioned on the epidemic status. Discussed in (Rath et al., 2003) A flexible model that can adapt automatically to trends, seasonality covariates (e.g., gender and age), and different distributions (normal, Poisson, etc.).
Wavelet algorithms Local frequency-based data analysis methods; they can automatically adjust to weekly, monthly, and seasonal data fluctuations. Used in NRDM to indicate zip-code areas in which OTC medication sales are substantially increased (Espino and Wagner, 2001; Zhang et al., 2003) Account for both long-term (e.g., seasonal effects) and short-term trends (e.g., day-of-week effects) (Wagner et al., 2004b).
Spatial analysis
Generalized Linear Mixed Modeling (GLMM) Evaluating whether observed counts in relatively small areas are larger than expected on the basis of the history of naturally occurring diseases (Kleinman et al., 2004, 2005a) Used in Minnesota (Yih et al., 2005) Sensitive to a small number of spatially focused cases; poor in detecting elevated counts over contiguous areas when compared with scan statistic and spatial CUSUM approaches (Kleinman et al., 2004).
SMall Area Regression and Testing (SMART) An adaptation of GLMM that takes into account multiple comparisons and includes parameters for ZIP code, day of the week, holiday, and seasonal cyclic variation. Available from BioSense and National Bioterrorism Syndromic Surveillance Demonstration Program (Yih et al., 2005) Seasonal, weekly effects, and other parameters under consideration can be adjusted during the regression process.
Spatial scan statistics and variations The basic model relies on using simply-shaped areas to scan the entire region of interest based on well-defined likelihood ratios. Its variation takes into account factors such as people mobility Widely adopted by many syndromic surveillance systems; a variation proposed in (Duczmal and Buckeridge, 2005); visualization available from BioPortal (Zeng et al., 2004a). Well-tested for various outbreak scenarios with positive results; the geometric shape of the hotspots identified is limited.
Bayesian spatial scan statistics Combining Bayesian modeling techniques with the spatial scan statistics method; outputting the posterior probability that an outbreak has occurred, and the distribution of this probability over possible outbreak regions Available from RODS (Neill et al., 2005) Computationally efficient; can easily incorporate prior knowledge such as the size and shape of outbreak or the impact on the disease infection rate.
Spatial-temporal analysis
Space-time scan statistic An extension of the space scan statistic that searches all the subregions for likely clusters in space and time with multiple likelihood ratio testing (Kulldorff, 2001). Widely used in many community surveillance systems including the National Bioterrorism Syndromic Surveillance Demonstration Program (Yih et al., 2004) Regions identified may be too large in coverage.
What is Strange About Recent Event (WSARE) Searching for groups with specific characteristics (e.g., a recent pattern of place, age, and diagnosis associated with illness that is anomalous when compared with historic patterns) (Kaufman et al., 2005) Available from RODS; Implemented in ESSENCE In contrast to traditional approaches, this method allows for use of representative features for monitoring (Wong et al., 2003; Wong et al., 2002). To use it, however, the baseline distribution has to be known.
Population-wide ANomaly Detection and Assessment (PANDA) A causal Bayesian network approach to model a population and infer the spatial-temporal probability distribution of disease for the entire population or individual patients Available from RODS (Cooper et al., 2004; Moore et al., 2002) Extensive computational effort
Prospective Support Vector Clustering (PSVC) This method uses the Support Vector Clustering method with risk adjustment as a hotspot clustering engine and a CUSUM-type design to keep track of incremental changes in spatial distribution patterns over time Developed in BioPortal (Chang et al., 2005; Zeng et al., 2004a) This method can identify hotspots with irregular shapes in an online context

Because of the importance of outbreak detection algorithms for syndromic surveillance, we review some of the critical methods in more detail below. Readers should note that the models we are about to discuss can be written in a number of mathematically equivalent ways; the forms presented in the text are one such representation.

Temporal Data Analysis

This section discusses representative temporal anomaly detection methods. Temporal anomaly detection belongs to the vast domain of time series analysis. It monitors public health events or incidents as a sequence of data points, typically measured at evenly spaced successive times. Temporal anomaly detection methods attempt to identify unusual patterns, smooth out naturally occurring (or known) variations, and distinguish the variations caused by a possible outbreak from natural variations. Such methods study either the event frequency or the intensity of adverse event occurrences (the time intervals between occurrences) to detect changes. These changes could follow different trends (e.g., linear, exponential).

Statistical Process Control (SPC)-Based Anomaly Detection

A majority of the systems surveyed employ statistical process control (SPC)-based algorithms. These algorithms were originally developed to monitor a process and its mean in industrial settings. The ability to differentiate the “out-of-control” mean from the “in-control” mean makes these methods readily applicable for anomaly detection.

The basic idea behind SPC-based algorithms is as follows. A small random sample x1, ..., xn is drawn repeatedly at certain time intervals. The sample mean is compared against given thresholds; an alarm is triggered at time s if the sample mean exceeds the control limit G(s). The alerting threshold is either theoretically defined or dynamically estimated from historical data; the latter approach has proved more robust than the former (Buckeridge et al., 2005a). The single time series analyzed often exhibits substantial day-of-week or seasonal patterns. As such, it is common practice to estimate the incidence rate using a linear or Poisson regression model and then apply an SPC-based method to the regression residuals (Buckeridge et al., 2005a).

The Cumulative Sums (CUSUM) and Exponentially Weighted Moving Average (EWMA) methods are two standard SPC-based methods that have been widely applied to outbreak detection. CUSUM keeps track of the accumulated deviation between observed and expected values. Formally, the accumulated deviation is defined as S(t) = max(0, S(t-1) + Z(t) - k), where k is a control parameter and Z(t) models the distribution of the variable of interest (e.g., Z(t) = (x(t) - μ)/σ if the variable is normally distributed) (Rogerson, 2005). Different forms of CUSUM have been developed that assume the underlying distribution is Poisson or exponential (Rogerson, 2005). Nonparametric models have also been developed, removing the need for knowledge of the underlying distribution. A deployed SPC method often incorporates a short guard band (e.g., 2 days) between the baseline period and the day to be monitored. The guard band improves sensitivity by preventing a gradually increasing outbreak from contaminating the baseline with the outbreak signal. CUSUM methods have also been specifically designed to deal with limited availability of historical data: three CUSUM algorithms used in the EARS system require less than 10 days as the baseline period. They differ from each other in the settings of the baseline period and the threshold levels, resulting in different levels of sensitivity (Hutwagner et al., 2003).
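A one-sided tabular CUSUM of the kind described above can be sketched as follows; the reference value k, decision limit h, and example counts are illustrative assumptions.

```python
def cusum(values, mean, std, k=0.5, h=4.0):
    """One-sided tabular CUSUM: S_t = max(0, S_{t-1} + Z_t - k), where
    Z_t = (x_t - mean)/std. Returns the indices t where S_t exceeds h.
    k (reference value) and h (decision limit) are in standard-deviation units."""
    s, alarms = 0.0, []
    for t, x in enumerate(values):
        z = (x - mean) / std
        s = max(0.0, s + z - k)
        if s > h:
            alarms.append(t)
    return alarms

# Invented counts: values hover near the baseline mean of 10, then drift upward.
daily = [10, 9, 11, 10, 12, 15, 18, 21, 24, 26]
print(cusum(daily, mean=10.0, std=2.0))
```

Because the statistic accumulates small deviations, the gradual drift triggers alarms well before any single day would look extreme on its own.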

The Shewhart method is another simple SPC-based method. It can be viewed as performing repeated significance tests on the deviation of each observation from a target constant. The Shewhart method performs poorly for small and moderate shifts; for large shifts, however, CUSUM converges to the Shewhart method (Lawson and Kleinman, 2005). One study used a Shewhart control chart to detect epidemics of influenza A (Quenel et al., 1994).
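A minimal Shewhart chart, assuming a known in-control mean and standard deviation and a one-sided upper limit, can be sketched as:

```python
def shewhart(values, mean, std, limit=3.0):
    """Flag observations more than `limit` standard deviations above the
    in-control mean (a one-sided upper control limit)."""
    return [t for t, x in enumerate(values) if (x - mean) / std > limit]

# Invented counts: only the single large spike exceeds the 3-sigma limit.
print(shewhart([10, 11, 9, 25, 10], mean=10.0, std=2.0))
```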

Instead of considering only the last observation, as the Shewhart method does, the exponentially weighted moving average (EWMA) method monitors all previous observations, summing their deviations in a weighted scheme that gives the most recent observation the greatest weight and earlier observations geometrically decreasing weights (Neubauer, 1997).
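A sketch of an EWMA chart with its asymptotic upper control limit follows; the smoothing weight and limit multiplier are illustrative choices, not values prescribed by any particular system.

```python
import math

def ewma_alarms(values, mean, std, lam=0.3, limit=3.0):
    """EWMA chart: z_t = lam*x_t + (1-lam)*z_{t-1}, started at the in-control
    mean. Alarm when z_t exceeds the asymptotic upper control limit
    mean + limit * std * sqrt(lam / (2 - lam))."""
    z = mean
    ucl = mean + limit * std * math.sqrt(lam / (2.0 - lam))
    alarms = []
    for t, x in enumerate(values):
        z = lam * x + (1.0 - lam) * z   # geometrically decaying weights
        if z > ucl:
            alarms.append(t)
    return alarms

# Same invented drifting counts as in the CUSUM sketch.
daily = [10, 9, 11, 10, 12, 15, 18, 21, 24, 26]
print(ewma_alarms(daily, mean=10.0, std=2.0))
```

Smaller values of lam give more weight to history (better for small sustained shifts); lam close to 1 makes the chart behave like a Shewhart chart.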

SPC-based methods are widely used in surveillance because of their simplicity, and their performance has been tested in many real settings. The BioSense, EARS, and ESSENCE syndromic surveillance systems, among others, implement CUSUM, EWMA, or both, and have reported their early aberration detection capacity for influenza-like illness and other diseases (Hutwagner et al., 2005a; Zhu et al., 2005). The details of the performance evaluation can be found in Chapter 6.

Serfling Statistic

Serfling's method uses cyclic regression to model the normal (non-epidemic) pattern of pneumonia and influenza deaths, with the objective of determining an epidemic threshold. Its use requires a clear definition of the disease, the selection of data that identify a normal pattern, and the assumption that the normal pattern is periodic.

The Serfling statistic was originally proposed for the statistical analysis of weekly pneumonia and influenza deaths in 108 US cities (Serfling, 1963). The method uses cyclic regression to establish an expected threshold for a daily statistic based on historical data excluding epidemic weeks, accounting for seasonal variations (Mandl et al., 2004). A typical form of the model is:

Y(t) = a0 + a1*t + b1*cos(2*pi*t/52) + c1*sin(2*pi*t/52) + e(t)

where Y(t) is the expected non-epidemic count for week t, a0 + a1*t captures the secular trend, the sinusoidal terms capture the annual cycle, and e(t) is the error term; the coefficients are estimated from non-epidemic historical data.

Serfling's method is regarded as a traditional modeling technique and has been applied in a number of disease surveillance practices, such as surveillance of French influenza-like-syndrome data (Costagliola et al., 1981). It has also been used in the RODS system to model hospital visit data for influenza (Tsui et al., 2003).
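A minimal sketch of how a fitted Serfling baseline is used in practice: given cyclic-regression coefficients (the values below are invented for illustration), compute the expected non-epidemic count for each week and flag weeks exceeding the baseline by more than an epidemic threshold. In practice the threshold is an upper confidence bound on the regression; it is simplified here to a fixed margin.

```python
import math

def serfling_expected(t, a0, a1, b1, c1, period=52.0):
    """Expected non-epidemic value for week t under the cyclic regression
    Y(t) = a0 + a1*t + b1*cos(2*pi*t/period) + c1*sin(2*pi*t/period)."""
    w = 2.0 * math.pi * t / period
    return a0 + a1 * t + b1 * math.cos(w) + c1 * math.sin(w)

def epidemic_weeks(observed, coeffs, threshold):
    """Flag weeks whose observed count exceeds baseline + threshold.
    coeffs = (a0, a1, b1, c1)."""
    return [t for t, y in enumerate(observed)
            if y > serfling_expected(t, *coeffs) + threshold]

coeffs = (100.0, 0.0, 20.0, 0.0)   # illustrative: seasonal peak near 120
observed = [serfling_expected(t, *coeffs) for t in range(52)]
observed[10] += 50                 # inject an epidemic excess in week 10
print(epidemic_weeks(observed, coeffs, threshold=30.0))
```

Note how the seasonal term keeps an ordinary winter peak below threshold while the injected excess in week 10 stands out against its own seasonal expectation.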

Autoregressive Model-Based Anomaly Detection

The autoregressive integrated moving average (ARIMA) method is a class of time-series models typically specified by three parameters: the order of autoregression (AR), the order of integration (I), and the order of the moving average (MA) (Box et al., 1994). These parameters determine how much of the past is used to predict the next observation and how heavily each past observation is weighted in that prediction. Higher-order models are more complex and can usually achieve a better fit to the training data set, while simpler low-order models are less likely to over-fit the training data (Reis and Mandl, 2003). A full description of the class of ARIMA methods can be found in (Box et al., 1994). Here we give an example ARIMA(1, 1, 1) model to illustrate the notation. In the following equation, μ is a constant term, φ(y(t-1) - y(t-2)) is the first-order autoregressive term applied to the differenced series, and e(t-1) is the forecast error at period t-1 (the first-order moving-average term); φ and θ are coefficients.

ŷ(t) = μ + y(t - 1) + φ[y(t - 1) - y(t - 2)] - θ e(t - 1)

ARIMA models have been applied to pneumonia and influenza deaths for detection of outbreaks (Reis and Mandl, 2003). In the Automated Epidemiologic Geotemporal Integrated Surveillance (AEGIS) program at Children's Hospital Boston and Harvard Medical School, a hybrid of ARIMA with cyclic regression was found to have excellent predictive ability (Mandl et al., 2004). These models are available in many common statistical software packages (e.g., SAS Time Series Forecasting module). One drawback of the ARIMA models is that there is no systematic way to update model parameters when new data points arrive.
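To make the notation concrete, here is a minimal sketch of one-step-ahead ARIMA(1, 1, 1) forecasting with already-estimated coefficients; the series and coefficient values are hypothetical, and in practice a statistical package (e.g., the SAS module mentioned above) would estimate μ, φ, and θ from the data.

```python
def arima_111_forecasts(y, mu, phi, theta):
    """One-step-ahead forecasts from an ARIMA(1, 1, 1) model with known
    coefficients: yhat(t) = mu + y(t-1) + phi*(y(t-1) - y(t-2)) - theta*e(t-1)."""
    forecasts, errors = [], [0.0]  # initialize the first forecast error to zero
    for t in range(2, len(y)):
        yhat = mu + y[t - 1] + phi * (y[t - 1] - y[t - 2]) - theta * errors[-1]
        forecasts.append(yhat)
        errors.append(y[t] - yhat)  # e(t), fed into the next forecast
    return forecasts, errors

# Hypothetical daily counts and coefficients.
series = [10.0, 12.0, 11.0, 13.0, 12.0]
preds, errs = arima_111_forecasts(series, mu=0.1, phi=0.5, theta=0.3)
```

Large forecast errors relative to their historical spread would then be flagged as aberrations, as in the other temporal methods of this section.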

The Recursive Least Square (RLS) algorithm is another method based on autoregressive linear models and is implemented as part of RODS (Wong et al., 2002, 2003). It learns from the time series but does not need a large training sample, and it weights recent history more heavily in its predictions, so it is well suited to surveillance for short-term events. Unlike ARIMA or the Serfling method, RLS continuously updates its parameters. RLS operates by converging on a set of coefficients (for a weighted linear equation) that best predicts historical values, uses these coefficients to predict the current value, and calculates the prediction errors between the predicted and observed values. Using the prediction errors and an algorithm threshold (expressed as a number of standard deviations), RLS computes a threshold value. This algorithm is well suited to detecting spikes of cases when there is little historical data. Using these models implies that a transformation of the data leads to a stationary time series, for which a single underlying probability distribution is assumed. These two hypotheses are not necessarily true, however; the data may present abrupt and wide changes in magnitude as well as irregular periodicity in situations such as epidemics, modifications of the case definition, screening, or vaccination (Le Strat and Carrat, 1999).
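A scalar sketch of the idea, assuming an AR(1) predictor (the actual RODS implementation and its parameter choices differ): the coefficient is updated recursively with a forgetting factor, and an alarm is raised when the prediction error exceeds a multiple of the running error standard deviation.

```python
def rls_ar1(y, lam=0.98, delta=100.0, z=3.0):
    """Recursive least squares for a scalar AR(1) predictor with
    forgetting factor lam; flags observations whose prediction error
    exceeds z running standard deviations."""
    w, p = 0.0, delta            # coefficient and inverse-covariance scalar
    errs, alarms = [], []
    for t in range(1, len(y)):
        x = y[t - 1]
        pred = w * x                    # predict the current value
        e = y[t] - pred
        k = p * x / (lam + x * p * x)   # RLS gain (scalar case)
        w += k * e                      # coefficient update
        p = (p - k * x * p) / lam       # covariance update
        errs.append(e)
        sd = (sum(v * v for v in errs) / len(errs)) ** 0.5
        if len(errs) > 5 and abs(e) > z * sd:
            alarms.append(t)
    return w, alarms

# A flat series with a one-day spike: the spike is flagged quickly.
w, alarms = rls_ar1([10.0] * 20 + [50.0] + [10.0] * 4)
```

The forgetting factor lam < 1 is what makes the fit more sensitive to recent history than to the distant past.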

Hidden Markov Model (HMM)-Based Models

The SPC-based models and the cyclic regression methods need nonepidemic data to model the baseline distribution, which is not always available without data preprocessing; this is an obstacle to automated surveillance. Researchers have therefore proposed using Hidden Markov Models (HMMs) to segment the time series of influenza indicators into epidemic and nonepidemic phases. Hidden Markov models have found major success in temporal pattern recognition tasks such as speech and handwriting recognition, as well as in bioinformatics. The basic idea behind HMM-based models is to add a layer of random signal generation in which the state of a hidden Markov process determines the conditional distribution of each observed data point.

The sequence of state transitions in an HMM is reconstructed using statistical methods to calculate the most likely trends in the surveillance data. HMM-based models are flexible enough to be adapted automatically to trends, seasonality, covariates (e.g., gender and age), and different emission distributions (normal, Poisson, Gamma, etc.). HMM-based models have been applied in a number of surveillance data time series analysis studies. For example, Le Strat and Carrat applied a univariate HMM to ILI time series surveillance in France (Le Strat and Carrat, 1999). More technical details of HMMs in disease surveillance can be found in (Madigan, 2005). The author further discussed the proper number of hidden states, multivariate extensions to the above univariate HMM, as well as HMMs with random observation times. Madigan also pointed out that a key extension to the existing research on HMM-based surveillance would be to incorporate a spatial component in the hidden layer of the models.
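The segmentation idea can be sketched with a two-state HMM with Poisson emissions (state 0 nonepidemic, state 1 epidemic) decoded by the Viterbi algorithm. The emission rates and transition probabilities below are hypothetical fixed values; in practice these parameters are estimated from the data.

```python
import math

def segment_counts(counts, rates=(10.0, 25.0), p_stay=0.95):
    """Viterbi decoding of a two-state Poisson HMM: returns the most
    likely nonepidemic (0) / epidemic (1) label for each count."""
    def log_pois(k, lam):
        return k * math.log(lam) - lam - math.lgamma(k + 1)
    lt = [[math.log(p_stay), math.log(1 - p_stay)],
          [math.log(1 - p_stay), math.log(p_stay)]]
    # assume surveillance starts in the nonepidemic state
    v = [log_pois(counts[0], rates[0]),
         math.log(1e-6) + log_pois(counts[0], rates[1])]
    back = []
    for k in counts[1:]:
        nv, ptr = [], []
        for j in (0, 1):
            score, arg = max((v[i] + lt[i][j], i) for i in (0, 1))
            nv.append(score + log_pois(k, rates[j]))
            ptr.append(arg)
        v = nv
        back.append(ptr)
    state = 0 if v[0] >= v[1] else 1   # best final state
    path = [state]
    for ptr in reversed(back):         # backtrack the optimal sequence
        state = ptr[state]
        path.append(state)
    return path[::-1]

# The run of elevated counts is labeled as the epidemic phase.
path = segment_counts([9, 11, 10, 8, 24, 27, 26, 10, 9])
```

Because the epidemic phase is inferred jointly with the baseline, no separate nonepidemic training set is needed, which is exactly the advantage discussed above.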

Spatial Data Analysis

Spatial analysis techniques are used to find the extent of "clustering" of cases across a map and have long been an important component of the surveillance analysis toolset. More specifically, spatial clustering analysis aims to detect and locate anomalies in disease occurrence by examining the surveillance data's spatial distribution, as clusters might be of insufficient size to be detected in analyses that consider only an entire region. It also allows for the possibility that some areas contain populations more likely to become sick, such as older people, or more likely to seek healthcare, as might be the case for certain cultural groups. It thus provides the capability of tracking the progression of disease outbreaks and identifying the population at risk for proper treatment and prevention.

The rationale behind spatial surveillance is that natural disease outbreaks or biological attacks are typically localized at some spatial scale. Spatial analysis in syndromic surveillance uses spatial information residing in the data, such as the patient's home residence, sometimes the workplace, and the location of the hospital where the illness is reported. The temporal analyses discussed in the earlier section can detect elevated rates across an entire region but are less sensitive to a smaller number of spatially focused cases. Furthermore, pure time series methods often ignore spatially correlated random effects, implicitly assuming that all tests are independent.

Investigations of clusters in space often build the varying population density into the null hypothesis. Denote the intensity of the disease cases (the expected number of events per unit area) by λ1(s), where s represents a location in the study area. Also denote by λ0(s) the intensity function of the population at risk. The null hypothesis of a normal spatial distribution is then a proportional intensity function, λ1(s) = ρλ0(s), where ρ is the expected number of cases divided by the expected number at risk.

One widely used spatial analysis algorithm is SMART, made available through the BioSense system and the National Bioterrorism Syndromic Surveillance Demonstration Program. Other popular methods include the GLMM algorithm (Kleinman et al., 2004); the spatial scan statistic (Kulldorff, 1999) and a number of its variations, such as the modified spatial scan statistic (Duczmal and Buckeridge, 2005); and the Risk-adjusted Support Vector Clustering (RSVC) method (Zeng et al., 2004a).

Temporal analysis methods such as CUSUM can also be adapted to analyze spatial information, either by maintaining a CUSUM chart for the surrounding neighborhood of each individual region as a local spatial statistic or by maintaining a multivariate CUSUM chart over all regions in a global setting (Lawson and Kleinman, 2005). Conversely, spatial clustering techniques can be adapted to temporal surveillance by treating time as a one-dimensional space.
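The neighborhood idea can be sketched as follows, assuming counts with a known baseline mean per region; the region layout, baseline mean, reference value k, and alarm threshold h are all hypothetical.

```python
def local_cusum(daily_counts, neighbors, mu=5.0, k=1.0, h=4.0):
    """One-sided CUSUM maintained for each region's neighborhood: the
    pooled count of a region and its neighbors is compared against the
    pooled expected count, S = max(0, S + observed - expected - k)."""
    s = {r: 0.0 for r in neighbors}
    alarms = []
    for day, counts in enumerate(daily_counts):
        for r in neighbors:
            pool = [r] + neighbors[r]
            excess = sum(counts[x] for x in pool) - mu * len(pool)
            s[r] = max(0.0, s[r] + excess - k)
            if s[r] > h:
                alarms.append((day, r))
    return alarms

# Regions on a line, A - B - C; an outbreak hits A and B on day 3.
nbrs = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
days = [{"A": 5, "B": 5, "C": 5}] * 3 + [{"A": 9, "B": 9, "C": 5}] * 3
alarms = local_cusum(days, nbrs)
```

Pooling over neighborhoods lets moderate elevations in adjacent regions accumulate into a detectable local signal.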

Generalized Linear Mixed Models and SMART Algorithm

Kleinman et al. (2004) proposed the use of Generalized Linear Mixed Model (GLMM) statistics based on a logistic regression model to estimate the probability that each subject under surveillance is a case, in each area, on a given day. The logistic regression model introduces "shrinkage" estimators to account for the varying size of the population under surveillance in each area. The proposed method treats each small area as if it were an individual, and the relative locations of the small areas are not taken into account by the model. This method in essence ignores much spatial information and cannot detect elevated counts spanning several contiguous areas.

SMART is an adaptation of the GLMM method, taking additional covariates into account to adjust for seasonal and weekly patterns, secular trends, and holiday status (Bradley et al., 2005). In this approach, generalized linear models are used to establish the expected count per ZIP code per day by regressing on the historical series of counts in each small area. The established distribution of case counts is then refined to account for multiple ZIP codes through multiple-testing adjustment. One experimental study suggested that SMART delivered slightly inferior results to the spatial scan statistic method; however, both methods achieved good performance (Kleinman et al., 2005a).
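A crude sketch of the expected-count idea: in place of SMART's generalized linear model, historical counts are averaged per (ZIP, weekday) stratum, and today's count is scored with a Poisson tail probability; a Bonferroni correction over the number of ZIP codes tested would then account for multiple testing. The ZIP codes and counts below are illustrative.

```python
import math
from collections import defaultdict

def expected_counts(history):
    """Mean historical count per (zip, weekday) stratum, a crude stand-in
    for SMART's regression-based expected counts."""
    total, days = defaultdict(float), defaultdict(int)
    for (zipcode, weekday), count in history:
        total[(zipcode, weekday)] += count
        days[(zipcode, weekday)] += 1
    return {key: total[key] / days[key] for key in total}

def poisson_tail(n, mu):
    """P(X >= n) for X ~ Poisson(mu), by direct summation of the pmf."""
    below, term = 0.0, math.exp(-mu)
    for k in range(n):
        below += term
        term *= mu / (k + 1)
    return 1.0 - below

history = [(("85701", 0), 10), (("85701", 0), 14), (("85701", 1), 6)]
mu = expected_counts(history)[("85701", 0)]   # expected Monday count
p_value = poisson_tail(25, mu)                # probability of >= 25 cases
```

A small tail probability, after correction for the number of areas tested, flags the ZIP-day as anomalous.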

Spatial Scan Statistic and Its Variations

Most syndromic surveillance systems make use of the spatial scan statistic and its variations. In these methods, a large set of circular windows of varying sizes is imposed on the map at different locations to search for clusters over the entire region. As the cluster size is unknown a priori, the scan statistic method uses a likelihood ratio test whose alternative hypothesis is that the rate within the scanning window is elevated compared with outside it. The most likely clusters can then be identified from the likelihood ratio test if the null hypothesis is rejected. For each distinct window, the likelihood ratio is proportional to (n/μ)^n [(N - n)/(N - μ)]^(N - n), where n is the number of cases inside the circle, N is the total number of cases, and μ is the expected number of cases inside the circle (Kulldorff, 1997). Other probability models, i.e., distributions from which the case incidence is assumed to be generated, have also been used for scan statistics: the Poisson model is the most common, the Bernoulli model can be used for 0/1 case-control type data, and the exponential model for survival data.
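The window likelihood ratio can be computed directly. The sketch below scores a handful of hypothetical candidate windows and keeps the most likely cluster; in practice, significance is assessed by Monte Carlo replication under the null hypothesis.

```python
import math

def scan_log_lr(n, mu, N):
    """Log likelihood ratio of the Poisson spatial scan statistic for one
    window with n observed and mu expected cases out of N total cases."""
    if n <= mu:      # only windows with elevated internal rates score
        return 0.0
    return (n * math.log(n / mu)
            + (N - n) * math.log((N - n) / (N - mu)))

# Hypothetical windows: (observed cases inside, expected cases inside).
windows = {"A": (30, 12.0), "B": (18, 15.0), "C": (50, 49.0)}
N = 200
best = max(windows, key=lambda w: scan_log_lr(*windows[w], N))
```

Window A, with a two-and-a-half-fold excess, dominates the modest excesses in B and C, which is the behavior the likelihood ratio is designed to capture.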

Scan statistic methods have several advantages. First, they avoid preselection bias regarding the size or location of clusters. Second, they can easily be adjusted for nonuniform population density as well as other factors such as age.

The spatial-temporal version of the scan statistic uses cylinders instead of circles, where the height of the cylinder represents time. As before, the circular base defines a geographic area whose radius can vary from zero to hundreds of kilometers; the height of the cylinder can represent anything from hours to years. The rest of the process is largely unchanged: a moving cylindrical window of variable size in both space and time visits all spatial-temporal locations to identify a significant excess of cases within it, until it reaches a predetermined size limit (Kulldorff, 1999, 2001). On the basis of the flexible purely spatial scan statistic, Takahashi et al. proposed a flexibly shaped space-time scan statistic for detecting irregularly shaped clusters, which may not be detected by the circular spatial scan statistic (Takahashi et al., 2008). The performance of the flexibly shaped space-time scan statistic was compared with that of the cylindrical scan statistic using a space-time power distribution developed by extending the purely spatial bivariate power distribution (Takahashi et al., 2008).

SaTScan is a freely available software package that implements various types of spatial and space-time scan statistics (2006j). It has been used in more than ten syndromic surveillance systems, according to our survey. Two commercial products, the WpiAnalyst extension for ArcView GIS from the Public Health Research Laboratories (2003d) and ClusterSeer developed by TerraSeer (2006c), contain both spatial and spatial-temporal scan statistics together with many other statistical clustering methods. The SaTScan Macro Accessory for Cartography (SMAC) package consists of four SAS macros and was designed as an easier way to run SaTScan multiple times and add graphical output. The package contains individual macros that allow the user to make the necessary input files for SaTScan, run SaTScan, and create graphical output, all from within SAS software; the macros can also be combined to do all of this in one step (Abrams and Kleinman, 2007).

A modified spatial scan statistic proposed by Duczmal and Buckeridge considers work-related factors: a term reflecting the number of "contaminations" from workers at the nearest neighbors is added to the observed cases in the residential zones (Duczmal and Buckeridge, 2005). Their simulation shows that this approach achieves greater detection power than scan statistics that do not consider population movement. Applying the approach requires workplace location information, which unfortunately is not commonly available in surveillance data sources.

There are a few known problems with spatial scan methods. First, they can only identify clusters in simple regular shapes. Second, it is difficult to incorporate prior knowledge, such as the size or shape of the outbreaks or the impact on disease infection rate. Third, exhaustive searches over a large region to perform statistical tests could be computationally expensive.

The method summarized in the next subsection deals with the first problem. To address the second and third problems, Neill et al. (2005) proposed a Bayesian spatial scan statistic that is computationally more efficient and capable of incorporating a priori knowledge about the investigated outbreak. A conjugate Gamma-Poisson model, as opposed to the Poisson model in Kulldorff's original spatial scan statistic, is used to produce a spatially smoothed map of disease rates, with a focus on computing posterior probabilities to determine the outbreak likelihood and to estimate the location and size of potential outbreaks.

Risk-Adjusted Support Vector Clustering (RSVC) Algorithm

Zeng et al. developed an approach called RSVC that combines the risk adjustment idea with a robust Support Vector Clustering (SVC) method to improve the quality of retrospective spatial-temporal analysis. Specifically, in regions with a dense baseline data distribution, data points are less likely to be grouped into anomaly clusters. Several steps are involved in the clustering process. First, the input data are implicitly mapped to a high-dimensional feature space defined by a kernel function (typically the Gaussian kernel). Second, the algorithm finds a hypersphere of minimal radius in the feature space that contains most of the data; finding this hypersphere can be formulated as a quadratic or linear programming problem, depending on the distance function used. Third, a function estimating the support of the underlying data distribution is constructed from the kernel function and the parameters learned in the second step. The width parameter of the Gaussian kernel is dynamically adjusted based on the kernel density computed from background data. When mapped back to the original space, the hypersphere splits into several clusters, which indicate high-risk outbreak areas (Zeng et al., 2004b).

Spatial-temporal Data Analysis

Rule-Based Anomaly Detection with Bayesian Network Modeling (WSARE)

WSARE performs a heuristic search over combinations of temporal and spatial features to detect irregularities in space and time. The case features analyzed by WSARE include syndrome category, age, gender, and geographical information. For example, a two-term case feature could be "Gender = Male AND Home Location = NW." The numbers of cases satisfying and not satisfying the case feature are computed and used to determine whether there is a significant discrepancy between the observed statistic for the current day and the baseline.

Historical data (e.g., the recent weeks before the day of analysis) are fed to a Bayesian network to create a baseline distribution. The network is constructed using an algorithm called optimal reinsertion (Moore et al., 2003) based on ADTrees (Moore and Lee, 1998). The benefit of the approach lies in the Bayesian network's generalization capability, which makes it possible to estimate the probability of a situation that may not have been encountered in the past. The network structure is rebuilt every month, while the parameters are updated daily. Environmental attributes such as season and day of week can be incorporated in the model as conditioning variables.

All feature-value combinations are then searched and scored. The scores are generated by conducting hypothesis tests for each feature-value combination against the baseline distribution. Instead of exhaustively searching i-term feature-value combinations with exponential complexity (i = 1, 2, …, assuming n features in total), a greedy search is used: the best 1-term case feature is found first, and another term is then added to compose a 2-term case feature, and so forth. Compared with several other algorithms that do not examine covariate information, WSARE performed better as measured by timeliness, at the expense of a slightly higher false-positive rate (Wong et al., 2002).
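The scoring step can be sketched with a 2×2 test of how often a rule matches among today's cases versus among the baseline cases. WSARE itself uses Fisher's exact test with a randomization correction for multiple testing; the plain chi-square statistic below is a simplification, and the counts are hypothetical.

```python
def rule_score(c_today, n_today, c_base, n_base):
    """Chi-square statistic for the 2x2 table [rule matches / does not]
    x [today / baseline]; larger values mean a bigger discrepancy."""
    table = [[c_today, n_today - c_today],
             [c_base, n_base - c_base]]
    total = n_today + n_base
    chi2 = 0.0
    for i in (0, 1):
        row = n_today if i == 0 else n_base
        for j in (0, 1):
            col = table[0][j] + table[1][j]
            expected = row * col / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# Rule "Gender = Male AND Home Location = NW" matches 30 of 100 cases
# today vs. 120 of 1000 on baseline days; compare against 3.84, the
# 95th percentile of the chi-square distribution with 1 degree of freedom.
suspicious = rule_score(30, 100, 120, 1000) > 3.84
```

The greedy search described above would evaluate this score for every 1-term rule, keep the best, and then try 2-term extensions of it.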

Population-Wide Anomaly Detection and Assessment

Population-wide anomaly detection and assessment (PANDA) is a causal Bayesian network-based model that constructs and infers the spatial-temporal probability distribution of disease in a population as a whole. The causal Bayesian network consists of a large set of interlinked patient-specific probabilistic causal models, each including variables that represent risk factors (e.g., infectious disease exposures of various types), disease states, and patient symptoms (Cooper et al., 2004). Simulation conducted by the RODS team showed that the model can handle a population of 1.4 million (Cooper et al., 2004).

Monitoring Multiple Data Streams

In this section and the next, we discuss two specific sets of issues concerning outbreak detection that warrant separate treatment.

In disease surveillance, multiple data sets (collected simultaneously from pharmacies, hospitals, nurse help telephone lines, and clinics) are usually available, yet the majority of implemented detection algorithms monitor individual data sources and do not cross-reference between them. The problem is that no single data source captures all the individuals in an outbreak (Kulldorff et al., 2005). One potentially fruitful detection approach is data fusion, using multiple sources of data (e.g., ED visits and OTC sales data) to perform outbreak detection. For example, MCUSUM and MEWMA (Yeh et al., 2003, 2004) were developed to increase detection sensitivity while limiting the number of false alarms. Multiple univariate statistical techniques and multivariate methods have also been used in prior studies, based on different independence assumptions among the data streams: multiple univariate methods assume independence among the streams, while multivariate methods use a covariance matrix typically estimated from a baseline period (Buckeridge et al., 2005a). In the ESSENCE II project, chief complaint data and sales of OTC medications are treated as covariates (Lombardo et al., 2004). However, to model multiple univariate signals from different data streams, an in-depth investigation and characterization of healthcare-seeking behavior is necessary.
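A two-stream MEWMA sketch, assuming the streams are expressed as deviations from their baselines and the baseline covariance matrix is known; the covariance values, smoothing constant, and alarm threshold are hypothetical.

```python
def mewma(series, lam=0.3, cov=((4.0, 1.0), (1.0, 4.0)), h=10.0):
    """Multivariate EWMA over two data streams: smooths the vector of
    deviations, then alarms when the Hotelling-type statistic
    ((2 - lam) / lam) * z' S^-1 z exceeds the threshold h."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))  # 2x2 inverse of S
    z = [0.0, 0.0]
    alarms = []
    scale = (2 - lam) / lam   # corrects for the asymptotic EWMA variance
    for t, (x1, x2) in enumerate(series):
        z[0] = lam * x1 + (1 - lam) * z[0]
        z[1] = lam * x2 + (1 - lam) * z[1]
        q = scale * (z[0] * (inv[0][0] * z[0] + inv[0][1] * z[1])
                     + z[1] * (inv[1][0] * z[0] + inv[1][1] * z[1]))
        if q > h:
            alarms.append(t)
    return alarms

# Both streams drift upward together from day 10 on.
alarms = mewma([(0.0, 0.0)] * 10 + [(5.0, 5.0)] * 10)
```

Because the statistic pools evidence across streams through the covariance matrix, a simultaneous shift in both streams is detected sooner than either univariate chart alone would detect it.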

Another approach is to monitor stratified data (e.g., by syndrome type, age group, county, or treatment facility) in parallel. The WSARE (What's Strange About Recent Events) system proposed by Wong et al. (2003) is one example; it searches for outbreaks in various groupings of age, gender, or census tract. Kulldorff et al. (2003) developed a tree-based scan statistic for surveillance over groupings that can be preclassified into a hierarchical tree structure.

In addition, during major public events, unpredictable shifts in healthcare data may occur due to changes in healthcare utilization patterns. Reis et al. addressed this problem: instead of monitoring different healthcare data streams individually, they proposed a class of epidemiological network models that monitor the interrelationships among these data streams. The integrated network-based modeling of these interrelationships allows more robust performance in the face of shifts in healthcare utilization during epidemics and major public events (Reis et al., 2007).

Simultaneous wavelet analysis over multiple time series has been practiced by Dillard and Shmueli (Shmueli and Fienberg, 2006). Rigorous comparative evaluations are needed to quantify the gain from using covariates from multiple data sources in surveillance.

Special Events Surveillance

Another challenging issue for real-time outbreak detection is that surveillance algorithms often rely on historical datasets that span a considerable length of time; few methods demonstrate reliable detection capability with short-term baseline data. This is a particular concern for special-events surveillance systems (also referred to as drop-in surveillance), which are deployed against bioterrorism attacks or natural disease outbreaks in settings such as international or national sports events and meetings that draw many participants within a short time window.

EARS was used for syndromic surveillance at several large public events in the United States, including the Democratic National Convention of 2000, the 2001 Super Bowl, and the 2001 World Series (Hutwagner et al., 2003). The RODS system was used during the 2002 Winter Olympic Games (Gesteland et al., 2002). The LEADERS system often serves as a drop-in surveillance system intended to facilitate communication and coordination within and between public health facilities (Ritter, 2002).

Summary of Data Analysis Process for Syndromic Surveillance

In this chapter, we first introduce syndrome classification as the first step of syndromic data analysis. We then summarize a large number of disease surveillance algorithms, organized along two dimensions. In the first dimension, a surveillance method is either retrospective or prospective: retrospective analysis focuses on analyzing historical data, whereas prospective analysis is more useful for processing online data streams. In the second dimension, a surveillance method can be seen as a temporal, spatial, or spatial-temporal analysis method. Methods designed for special events are discussed separately due to their unique characteristics. We also examine methods that monitor multiple data streams, which warrant further exploration due to their importance and applicability. We conclude this chapter by pointing out some technical issues to watch for when applying these surveillance methods.

First, the outbreak detection methods make a number of assumptions about the analyzed data. The distribution of the disease events is in many cases assumed, so before applying any surveillance method to the disease data, the disease's behavior, such as its outbreak patterns and event distribution, should be analyzed. Second, an algorithm's performance is related to a number of settings: (1) the availability of historical data (the data collection process discussed in Chapter 2 is thus closely related to a surveillance algorithm's performance); (2) the type of outbreak signal (e.g., slow-building or surge outbreak); and (3) the spatial granularity of the data in spatial analysis.

All the complications due to the dynamics of different diseases need to be considered and well investigated before applying a detection algorithm. In (Burkom and Murphy, 2007), the authors propose a data-adaptive method selection scheme to "suit the remedy to the case" by first evaluating a number of data discriminants, such as mean, variance, and skewness, before selecting a detection algorithm for analysis. The BioStorm research group developed an ontology-based method to incorporate a priori knowledge so that different analytical methods are assigned to different types of surveillance data in different settings (Crubézy et al., 2005).

Contributor Information

Hsinchun Chen, Email: hchen@eller.arizona.edu.

Daniel Zeng, Email: zeng@email.arizona.edu.

Ping Yan, Email: pyan@email.arizona.edu.

References

  1. Abrams AM, Kleinman KP. "A Satscan™ Macro Accessory for Cartography (SMAC) Package Implemented with Sas® Software,". International Journal of Health Geographics. 2007;6:6. doi: 10.1186/1476-072X-6-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Bath PA. "Data Mining in Health and Medical Information,". Annual Review of Information Science and Technology (ARIST) 2004;38:331–369. [Google Scholar]
  3. Benoit G. "Data Mining,". Annual Review of Information Science and Technology (ARIST) 2002;36:265–310. [Google Scholar]
  4. Bradley CA, Rolka H, Walker D, Loonsk J. "BioSense: Implementation of a National Early Event Detection and Situational Awareness System,". MMWR (CDC) 2005;54(Suppl):11–20. [PubMed] [Google Scholar]
  5. Brookmeyer R, Stroup D. Monitoring the Health of Populations. Statistical Surveillance in Public Health. New York: Oxford University Press; 2004. [Google Scholar]
  6. Buckeridge D, Burkom H, Campbell M, Hogan W, Moore A. "Algorithms for Rapid Outbreak Detection: A Research Synthesis,". Journal of Biomedical Informatics. 2005;38:99–113. doi: 10.1016/j.jbi.2004.11.007. [DOI] [PubMed] [Google Scholar]
  7. Buckeridge D, Graham J, O'Connor J, Choy MK, Tu SW, Musen M. American Medical Informatics Association Symposium. TX: San Antonio; 2002. "Knowledge-Based Bioterrorism Surveillance,". [PMC free article] [PubMed] [Google Scholar]
  8. Buckeridge, D., Musen, M., Switzer, P., and Crubezy, M. 2003. "An Analytic Framework for Space-Time Aberrancy Detection in Public Health Surveillance Data," AMIA Symposium pp. 120–124. [PMC free article] [PubMed]
  9. Burkom, H., and Murphy, S. 2007. "Data Classification for Selection of Temporal Alerting Methods for Biosurveillance." BioSurvellance workshop 2007.
  10. CDC. "HIPAA Privacy Rule and Public Health: Guidance from CDC and the US Department of Health and Human Services,". MMWR. 2003;52(Suppl):1–20. [PubMed] [Google Scholar]
  11. Chang W, Zeng D, Chen H. proceedings of the 8th IEEE International Conference on Intelligent Transportation Systems. Vienna, Austria: 2005. "Prospective Spatio-Temporal Data Analysis for Security Informatics,". [Google Scholar]
  12. Chapman WW, Christensen L, Wagner MM, Haug P, Ivanov O, Dowling J, Olszewski R. "Classifying Free-Text Triage Chief Complaints into Syndromic Categories with Natural Language Processing,". Artificial Intelligence in Medicine. 2005;33(1):31–40. doi: 10.1016/j.artmed.2004.04.001. [DOI] [PubMed] [Google Scholar]
  13. Chapman WW, Cooper GF, Hanbury P, Chapman BE, Harrison LH, Wagner MM. "Creating a Text Classifier to Detect Radiology Reports Describing Mediastinal Findings Associated with Inhalational Anthrax and Other Disorders,". Journal of the American Medical Informatics Association. 2003;10(5):494–503. doi: 10.1197/jamia.M1330. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Cooper GF, Dash DH, Levander JD, Wong WK, Hogan WR, Wagner MM. "Bayesian Biosurveillance of Disease Outbreaks," Twentieth Conference on Uncertainty in Artificial Intelligence. Alberta, Canada: Banff; 2004. pp. 94–103. [Google Scholar]
  15. Costagliola D, Flahault A, Galinec D, Garnerin P, Menares J, Valleron A. "A Routine Tool for Detection and Assessment of Epidemics of Influenza-Like Syndromes in France,". American Journal of Public Health. 1981;81(1):97–99. doi: 10.2105/ajph.81.1.97. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Crubézy M, O'Connor Pincus, Z., Musen MA. "Ontology-Centered Syndromic Surveillance for Bioterrorism,". IEEE Intelligent Systems. 2005;20(5):26–35. [Google Scholar]
  17. Das D, Weiss D, Mostashari F. "Enhanced Drop-in Syndromic Surveillance in New York City Following September 11, 2001,". J Urban Health. 2003;80:1(suppl):176–188. doi: 10.1007/PL00022318. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Duczmal L, Buckeridge D. "Using Modified Spatial Scan Statistic to Improve Detection of Disease Outbreak When Exposure Occurs in Workplace - Viginia,". MMWR (CDC) 2005;54(Suppl):187. [Google Scholar]
  19. Espino, J.U., and Wagner, M.M. 2001. "The Accuracy of ICD-9 Coded Chief Complaints for Detection of Acute Respiratory Illness," Proc AMIA Symp, pp. 164–168. [PMC free article] [PubMed]
  20. Gesteland PH, Wagner MM, Chapman WW, Espino JU, Tsui F.-C., Gardner RM, Rolfs RT, Dato VM, James BC, Haug PJ. "Rapid Deployment of an Electronic Disease Surveillance System in the State of Utah for the 2002 Olympic Winter Games,". Proceedings of AMIA Symposium. 2002;2002:285–289. [PMC free article] [PubMed] [Google Scholar]
  21. Grigoryan VV, Wagner MM, Waller K, Wallstrom GL, Hogan WR. in: RODS Laboratory Technical Report, 2005. 2005. "The Effect of Spatial Granularity of Data on Reference Dates for Influenza Outbreaks,". [Google Scholar]
  22. Halasz S, Brown P, Goodall C, Cochrane DG, Allegra` JR. "The N-gram CC Classifier: A Novel Method of Automatically Creating CC Classifiers Based on ICD-9 Groupings,". Advances in Disease Surveillance. 2006;1(30):2006. [Google Scholar]
  23. Hutwagner L, Thompson W, Seeman GM. "The Bioterrorism Preparedness and Response Early Aberration Reporting System (EARS),". J Urban Health. 2003;80(2 suppl 1):89–96. doi: 10.1007/PL00022319. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Hutwagner, L., Browne, T., Seeman, G.M., and Fleischauer, A.T. 2005a. "Comparing Aberration Detection Methods with Simulated Data," Emerg Infect Dis [serial on the Internet] (11), pp. 314–316. [DOI] [PMC free article] [PubMed]
  25. Ivanov, O., Wagner, M.M., Chapman, W.W., and Olszewski, R.T. 2002. "Accuracy of Three Classifiers of Acute Gastrointestinal Syndrome for Syndromic Surveillance," AMIA Symp, pp. 345–349. [PMC free article] [PubMed]
  26. Kaufman Z, Cohen E, Peled-Leviatan T, Lavi C, Aharonowitz G, Dichtiar R, Bromberg M, Havkin O, Shalev Y, Marom R, Shalev V, Shemer J, Green M. "Using Data on an Influenza B Outbreak to Evaluate a Syndromic Surveillance System - Israel, June 2004 [Abstract],". MMWR (CDC) 2005;54(Suppl):191. [Google Scholar]
  27. Kleinman K, Lazarus R, Platt R. "A Generalized Linear Mixed Models Approach for Detecting Incident Cluster/Signals of Disease in Small Areas, with an Application to Biological Terrorism (with Invited Commentary),". American Journal of Epidemiology. 2004;159:217–224. doi: 10.1093/aje/kwh029. [DOI] [PubMed] [Google Scholar]
  28. Kleinman K, Abrams A, Kulldorff M, Platt R. "A Model-Adjusted Spacetime Scan Statistic with an Application to Syndromic Surveillance,". Epidemiology and Infection. 2005;119:409–419. doi: 10.1017/s0950268804003528. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Kulldorff M. "A Spatial Scan Statistic,". Communications in Statistics: Theory and Methods. 1997;26:1481–1496. [Google Scholar]
  30. Kulldorff M. "Spatial Scan Statistics. Calculations, and Applications," Scan Statistics and Applications, J.B. Glaz (ed.). Birkhauser, Boston: Models; 1999. pp. 303–322. [Google Scholar]
  31. Kulldorff M. "Prospective Time Periodic Geographical Disease Surveillance Using a Scan Statistic,". Journal of the Royal Statistical Society (Series A. 2001;164:61–72. [Google Scholar]
  32. Kulldorff M, Fang Z, Walsh S. "A Tree-Based Scan Statistic for Database Disease Surveillance,". Biometrics. 2003;9:641–646. doi: 10.1111/1541-0420.00039. [DOI] [PubMed] [Google Scholar]
  33. Kulldorff, M., Mostashari, F., Duczmal, L., Yih, K., Kleinman, K., and Platt, R. 2005. "Multivariate Spatial Scan Statistics for Disease Surveillance." [DOI] [PubMed]
  34. Lawson A B, Kleinman K. Spatial & Syndromic Surveillance for Public Health. New York: Wiley; 2005. [Google Scholar]
  35. Le SY, Carrat F. "Monitoring Epidemiologic Surveillance Data Using Hidden Markov Models,". Statistics in Medicine. 1999;18:3463–3478. doi: 10.1002/(sici)1097-0258(19991230)18:24<3463::aid-sim409>3.0.co;2-i. [DOI] [PubMed] [Google Scholar]
  36. Leroy G, Chen H. "Meeting Medical Terminology Needs - the Ontology-Enhanced Medical Concept Mapper,". IEEE Transactions on Information Technology in Biomedicine. 2001;5:261–270. doi: 10.1109/4233.966101. [DOI] [PubMed] [Google Scholar]
  37. Levine N. "CrimeStat III: A Spatial Statistics Program for the Analysis of Crime Incident Locations,". Washington, DC: The National Institute of Justice; 2002. [Google Scholar]
  38. Lombardo J, Burkom H, Pavlin J. "ESSENCE II and the Framework for Evaluating Syndromic Surveillance Systems,". Syndromic Surveillance: Report from a National Conference, 2003. MMWR. 2004;53(Suppl):159–165. [PubMed] [Google Scholar]
  39. Lu H.-M., King C.-C., Wu TS, Shin F.-Y., Hsiao J.-Y., Zeng D, Chen H. "Chinese Chief Complaint Classification for Syndromic Surveillance,". In: Zeng D, Gotham I, Komatsu K, Lynch C, Thurmond M, Madigan D, Lober B, Kvach J, Chen H, editors. Intelligence and Security Informatics: BioSurveillance. Lecture Notes in Computer Science, No. 4506. New Brunswick, NJ: Springer; 2007. [Google Scholar]
  40. Lu H.-M., Zeng D, Trujillo L, Komatsu K, Chen H. "Ontology-Enhanced Automatic Chief Complaint Classification for Syndromic Surveillance,". Journal of Biomedical Informatics. 2008;41(2):340–356. doi: 10.1016/j.jbi.2007.08.009. [DOI] [PubMed] [Google Scholar]
  41. Ma H, Rolka H, Mandl K, Buckeridge D, Fleischauer A, Pavlin J. "Implementation of Laboratory Order Data in BioSense Early Event Detection and Situation Awareness System,". MMWR (CDC) 2005;54(Suppl):27–30. [PubMed] [Google Scholar]
  42. Madigan D. "Bayesian Data Mining for Health Surveillance,". In: Lawson AB, Kleinman K, editors. Spatial & Syndromic Surveillance for Public Health. New York: Wiley; 2005. [Google Scholar]
  43. Mandl KD, Overhage JM, Wagner MM, Lober WB, Sebastiani P, Mostashari F, Pavlin JA, Gesteland PH, Treadwell T, Koski E, Hutwagner L, Buckeridge DL, Aller RD, Grannis S. "Implementing Syndromic Surveillance: A Practical Guide Informed by the Early Experience,". Journal of American Medical Informatics Association. 2004;11(2):141–150. doi: 10.1197/jamia.M1356. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Moore A, Lee MS. "Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets,". Journal of Artificial Intelligence Research. 1998;8:67–91. [Google Scholar]
  45. Moore, A.W., Cooper, G., Tsui, F.-C., and Wagner, M.M. 2002. "Summary of Biosurveillance-Relevant Statistical and Data Mining Techniques," RODS Laboratory Technical Report.
  46. Neill, D., Moore, A., and Cooper, G. 2005. "A Bayesian Spatial Scan Statistic," Neural Information Processing Systems (18).
  47. Neubauer A. "The EWMA Control Chart: Properties and Comparison with Other Quality-Control Procedures by Computer Simulation,". Clinical Chemistry. 1997;43(4):594–601. [PubMed] [Google Scholar]
  48. Quenel P, Dab W, Hannoun C, Cohen J. "Sensitivity, Specificity and Predictive Values of Health Service Based Indicators for the Surveillance of Influenza-A Epidemics,". International Journal of Epidemiology. 1994;23:849–855. doi: 10.1093/ije/23.4.849. [DOI] [PubMed] [Google Scholar]
  49. Rath, T.M., Carreras, M., and Sebastiani, P. 2003. "Automated Detection of Influenza Epidemics with Hidden Markov Models," Lecture Notes in Computer Science. Berlin: Springer, pp. 521–532.
  50. Reis B, Mandl K. "Time Series Modeling for Syndromic Surveillance,". BMC Medical Informatics and Decision Making. 2003;3:2. doi: 10.1186/1472-6947-3-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Reis B, Mandl K. "Syndromic Surveillance: The Effects of Syndrome Grouping on Model Accuracy and Outbreak Detection,". Annals of Emergency Medicine. 2004;44(3):235–241. doi: 10.1016/j.annemergmed.2004.03.030. [DOI] [PubMed] [Google Scholar]
  52. Reis BY, Kohane IS, Mandl KD. "An Epidemiological Network Model for Disease Outbreak Detection,". PLoS Medicine. 2007;4(6):1019–1031. doi: 10.1371/journal.pmed.0040210. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Ritter, T. 2002. "Leaders: Lightweight Epidemiology Advanced Detection and Emergency Response System," SPIE, pp. 110–120.
  54. Rogerson PA. "Surveillance Systems for Monitoring the Development of Spatial Patterns,". Statistics in Medicine. 1997;16(18):2081–2093. doi: 10.1002/(sici)1097-0258(19970930)16:18<2081::aid-sim638>3.0.co;2-w. [DOI] [PubMed] [Google Scholar]
  55. Rogerson PA. "Spatial Surveillance and Cumulative Sum Methods,". In: Lawson AB, Kleinman K, editors. Spatial & Syndromic Surveillance for Public Health. New York: Wiley; 2005. pp. 95–113. [Google Scholar]
  56. Serfling, R.E. 1963. "Methods for Current Statistical Analysis of Excess Pneumonia Influenza Deaths," Public Health Reports (78), pp. 494–506. [PMC free article] [PubMed]
  57. Shahar, Y., and Musen, M. 1996. "Knowledge-Based Temporal Abstraction in Clinical Domains," Artificial Intelligence in Medicine (8), pp. 267–298. [DOI] [PubMed]
  58. Shmueli G, Fienberg SE. "Current and Potential Statistical Methods for Monitoring Multiple Data Streams for Bio-Surveillance,". In: Wilson A, Wilson G, Olwell DH, editors. Statistical Methods in Counter-Terrorism: Game Theory, Modeling, Syndromic Surveillance, and Biometric Authentication. Berlin: Springer; 2006. [Google Scholar]
  59. Sniegoski CA. "Automated Syndromic Classification of Chief Complaint Records,". Johns Hopkins APL Technical Digest. 2004;25(1):68–75. [Google Scholar]
  60. Sokolow LZ, Grady N, Rolka H, Walker D, McMurray P, English-Bullard R, Loonsk J. "Deciphering Data Anomalies in BioSense,". MMWR (CDC) 2005;54(Suppl):133–140. [PubMed] [Google Scholar]
  61. Takahashi K, Kulldorff M, Tango T, Yih K. "A Flexibly Shaped Space-Time Scan Statistic for Disease Outbreak Detection and Monitoring,". International Journal of Health Geographics. 2008;7:14. doi: 10.1186/1476-072X-7-14. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Travers DA, Haas SW. "Evaluation of Emergency Medical Text Processor, a System for Cleaning Chief Complaint Textual Data,". Academic Emergency Medicine. 2004;11:1170–1176. doi: 10.1197/j.aem.2004.08.012. [DOI] [PubMed] [Google Scholar]
  63. Tsui F.-C., Espino JU, Dato VM, Gesteland PH, Hutman J, Wagner MM. "Technical Description of RODS: A Real-Time Public Health Surveillance System,". Journal of American Medical Informatics Association. 2003;10:399–408. doi: 10.1197/jamia.M1345. [DOI] [PMC free article] [PubMed] [Google Scholar]
  64. Tsui, F.-C., Wagner, M.M., Dato, V.M., and Chang, C.C.H. 2001. "Value of ICD-9-Coded Chief Complaints for Detection of Epidemics," AMIA Symposium Proceedings. [PMC free article] [PubMed]
  65. Wagner MM, Espino J, Tsui FC, Gesteland P, Chapman WW, Ivanov O, Moore A, Wong W, Dowling J, Hutman J. "Syndrome and Outbreak Detection Using Chief-Complaint Data - Experience of the Real-Time Outbreak and Disease Surveillance Project,". MMWR (CDC) 2004;53(Suppl):28–32. [PubMed] [Google Scholar]
  66. Wong WK, Moore A, Cooper G, Wagner M. "WSARE: What's Strange About Recent Events?,". Journal of Urban Health. 2003;80(2 Suppl. 1):66–75. doi: 10.1007/PL00022317. [DOI] [PMC free article] [PubMed] [Google Scholar]
  67. Wong WK, Moore A, Cooper GF, Wagner M. "Rule-Based Anomaly Pattern Detection for Detecting Disease Outbreaks," Proceedings of AAAI-02. Edmonton, Alberta; 2002. pp. 217–223. [Google Scholar]
  68. Yeh AB, Lin D.K.J., Zhou H, Venkataramani C. "A Multivariate Exponentially Weighted Moving Average Control Chart for Monitoring Process Variability,". Journal of Applied Statistics. 2003;30(5):507–536. [Google Scholar]
  69. Yih W, Caldwell B, Harmon R. "The National Bioterrorism Syndromic Surveillance Demonstration Program,". MMWR (CDC) 2004;53(Suppl):43–46. [PubMed] [Google Scholar]
  70. Yih WK, Abrams A, Danila R, Green K, Kleinman K, Kulldorff M, Miller B, Nordin J, Platt R. "Ambulatory-Care Diagnoses as Potential Indicators of Outbreaks of Gastrointestinal Illness - Minnesota,". MMWR (CDC) 2005;54(Suppl):157–162. [PubMed] [Google Scholar]
  71. Zeng D, Chang W, Chen H. "A Comparative Study of Spatio-Temporal Hotspot Analysis Techniques in Security Informatics," 7th IEEE International Conference on Intelligent Transportation Systems. Washington, DC; 2004. pp. 106–111. [Google Scholar]
  72. Zhang, J., Tsui, F., Wagner, M., and Hogan, W. 2003. "Detection of Outbreaks from Time Series Data Using Wavelet Transform," AMIA Symp, pp. 748–752. [PMC free article] [PubMed]
  73. Zhu Y, Wang W, Atrubin D, Wu Y. "Initial Evaluation of the Early Aberration Reporting System - Florida,". MMWR (CDC) 2005;54(Suppl):123–130. [PubMed] [Google Scholar]

Articles from Infectious Disease Informatics are provided here courtesy of Nature Publishing Group
