Journal of the American Medical Informatics Association (JAMIA). 2005 Nov-Dec;12(6):618–629. doi: 10.1197/jamia.M1841

Generating a Reliable Reference Standard Set for Syndromic Case Classification

Wendy W Chapman 1, John N Dowling 1, Michael M Wagner 1
PMCID: PMC1294033  PMID: 16049227

Abstract

Objective To generate a reference standard set with representative cases from seven broad syndromic case definitions and several narrower syndromic definitions used for biosurveillance, and to measure its reliability.

Design From 527,228 eligible patients between 1990 and 2003, we generated a set of patients potentially positive for seven syndromes by classifying all eligible patients according to their ICD-9 primary discharge diagnoses. We selected a representative subset of the cases for chart review by physicians, who read emergency department reports and assigned values to 14 variables related to the seven syndromes.

Measurements (1) Positive predictive value of the ICD-9 diagnoses; (2) prevalence of the syndromic definitions and related variables; (3) agreement between physician raters demonstrated by κ, κ corrected for bias and prevalence, and Finn's r; and (4) reliability of the reference standard classifications demonstrated by generalizability coefficients.

Results Positive predictive value for ICD-9 classification ranged from 0.33 for botulinic to 0.86 for gastrointestinal. We generated between 80 and 566 positive cases for six of the seven syndromic definitions. Rash syndrome exhibited low prevalence (34 cases). Agreement between physician raters was high, with κ > 0.70 for most variables. Ratings showed no bias. Finn's r was >0.70 for all variables. Generalizability coefficients were >0.70 for all variables but three.

Conclusion Of the 27 syndromes generated by the 14 variables, 21 showed high enough prevalence, agreement, and reliability to be used as reference standard definitions against which an automated syndromic classifier could be compared. Syndromic definitions that showed poor agreement or low prevalence include febrile botulinic syndrome, febrile and nonfebrile rash syndrome, respiratory syndrome explained by a nonrespiratory or noninfectious diagnosis, and febrile and nonfebrile gastrointestinal syndrome explained by a nongastrointestinal or noninfectious diagnosis.


Medical informatics evaluations are beginning to place more emphasis on measurement studies that quantify the reliability or validity of reference standard diagnoses.1,2 Biosurveillance is a fairly new field, and quantitative studies of the accuracy and effectiveness of automated surveillance systems have only begun to be published. Our objective was to generate a reliable reference standard set representative of 27 case definitions of interest in syndromic surveillance. To do this, we used ICD-9 classification together with manual chart review and measured agreement and reliability of the reference standard classifications, providing quantitative validation for many of the syndromic case definitions.

Generating a reference standard set for evaluation of a biosurveillance system is difficult for several reasons. First, many of the target cases for detection have not previously been defined. For example, syndromic definitions, such as neurological and respiratory, have only recently been explicitly defined for the purpose of disease and outbreak surveillance and have not been externally validated as have other case definitions (e.g., pneumonia or influenza). Second, there exists no standard set of syndromes to monitor, and existing syndromic case definitions demonstrate substantial heterogeneity of findings constituting the definition.3 For example, existing case definitions of respiratory syndrome vary depending on whether the definition includes upper respiratory symptoms, on how severe the respiratory illness is, and on whether the patient is febrile.4,5,6,7 Third, many of the diseases and syndromes that surveillance systems attempt to detect rarely, if ever, occur. For instance, in hospitals in the United States, patients with hemorrhagic or botulinic syndrome and patients with anthrax or West Nile virus are rarely seen. Using random selection of patient records to generate a reference standard set would be infeasible because of the low prevalence of many of the target cases.

Syndromic case classification studies measure how well a classifier categorizes patients into appropriate syndromic categories, such as respiratory or gastrointestinal. We and others have performed studies to evaluate syndromic case classifiers against reference standards to determine how accurately chief complaints or ICD-9 codes can classify patients admitted to the emergency department (ED). The reference standards that we used in these studies have been limited. Several of the studies only evaluated case detection for a single case definition (e.g., gastrointestinal syndrome8 or respiratory syndrome9,10) and had a small number of representative cases. We have expanded our studies to include hundreds of cases of multiple syndromes, but those studies relied on a reference standard based on classification from lists of ICD-9 discharge diagnoses.11 ICD-9 code use is variable across institutions and even across coders, and, because ICD-9 codes are often optimized for billing purposes, ICD-9 diagnoses are not always accurate.12

We implemented a systematic method for selecting patients likely to manifest findings consistent with bioterrorism agents using the patients' ICD-9 discharge diagnoses. Physicians then read the previously dictated ED reports for the selected patients to determine which syndromic case definitions applied to each patient. The methodology that we used was designed to include patients with various syndromic presentations and to use expensive physician expert time as efficiently as possible. We describe the methodology below and compare a number of measures of interrater agreement and reliability for the resulting reference standard.

Background

Syndromic Surveillance

Rapid outbreak detection may not depend on confirmed diagnoses but rather on recognition of suspicious patterns that occur earlier in the course of a disease. Grouping cases into syndromes (e.g., respiratory syndrome) rather than into specific diagnoses (e.g., pneumonia) may provide earlier evidence of infection because many diseases in their early phase have overlapping symptoms that may not initially alarm clinicians.13,14,15,16,17,18 For example, a retrospective analysis of a Cryptosporidium contamination in Milwaukee in 1993 that killed more than 100 people showed an unrecognized increase in the number of patients with diarrhea weeks before the deaths.19

Manual20,21,22,23 and automated22,24,25,26,27,28,29,30,31,32,33,34,35,36,37 syndromic surveillance systems have been in place since 1999. Early developers of these systems found that health provider encounter data, ED data in particular, are readily available and well suited to syndromic surveillance.38 Both ED triage chief complaints and ICD-9 admit diagnosis codes have been used to automatically classify ED visits into syndromes.

Real-time Outbreak and Disease Surveillance System

The Real-time Outbreak and Disease Surveillance (RODS) system33 is an open-source39 biosurveillance system that began development in 1999. RODS collects ED registration data, including age, sex, zip code, and triage chief complaint from over 200 emergency care facilities in Pennsylvania, Utah, Ohio, New Jersey, Kentucky, Michigan, Illinois, Nevada, and California. A naïve Bayesian classifier called CoCo40 assigns every patient a syndromic category based on the patient's chief complaint. RODS currently monitors seven syndromic categories: respiratory (congestion, shortness of breath, cough, etc.); gastrointestinal (nausea, vomiting, abdominal pain, etc.); rash (any rash); botulinic (ocular abnormalities, difficulty swallowing or speaking, etc.); hemorrhagic (bleeding from any site); neurological (nonpsychiatric neurological symptoms, such as headache or seizure); and constitutional (generalized complaints like fever, chills, or malaise). Time-series detection algorithms41,42 are applied to the syndromic counts and shown in graphical form on the user interface. If the number of patients presenting with a syndromic category exceeds the number expected, RODS sends an electronic alarm to a team of researchers and public health physicians.
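
Reference 40 describes CoCo's actual design; purely to illustrate the general technique, the sketch below implements a minimal naïve Bayesian classifier over chief-complaint tokens. This is not the CoCo implementation, and the class names, toy training pairs, and test complaint are hypothetical stand-ins.

```python
# Minimal naive Bayes over chief-complaint tokens (illustrative only).
import math
from collections import Counter, defaultdict

class ChiefComplaintNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant
        self.word_counts = defaultdict(Counter)  # syndrome -> token counts
        self.class_counts = Counter()            # syndrome -> no. complaints
        self.vocab = set()

    def fit(self, complaints, labels):
        for text, label in zip(complaints, labels):
            self.class_counts[label] += 1
            for tok in text.lower().split():
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)

    def classify(self, text):
        total = sum(self.class_counts.values())
        best, best_lp = None, float("-inf")
        for s in self.class_counts:
            lp = math.log(self.class_counts[s] / total)  # log prior
            n_tokens = sum(self.word_counts[s].values())
            for tok in text.lower().split():
                # Smoothed log likelihood of each token given the syndrome.
                lp += math.log((self.word_counts[s][tok] + self.alpha) /
                               (n_tokens + self.alpha * len(self.vocab)))
            if lp > best_lp:
                best, best_lp = s, lp
        return best

clf = ChiefComplaintNB()
clf.fit(["short of breath cough", "nausea vomiting", "fever chills"],
        ["respiratory", "gastrointestinal", "constitutional"])
print(clf.classify("short of breath"))  # -> "respiratory" on this toy data
```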

We are developing and evaluating syndromic case classification applications that apply natural language-processing techniques to ED data to classify patients into the seven syndromes monitored by RODS.40,43,44 To understand the performance of these applications, we set out to obtain a rigorous and reliable reference standard that included cases from all seven syndromes and that allowed us to narrow the syndromic definitions to capture patients with problems more likely to resemble those induced by an infectious disease or a bioterrorist agent.

Methods

Our objectives for this study were twofold: (1) generate a reference standard set with representative cases from seven broad syndromic case definitions and several narrower syndromic definitions and (2) quantify interrater agreement and reliability for the reference standard classifications.

Setting

The study was conducted on data collected from the University of Pittsburgh Medical Center (UPMC) Presbyterian Hospital ED from December 1990 to September 2003. The ED at the UPMC Presbyterian Hospital admits approximately 35,000 patients per year. Patient data have been stored in the MARS database since 1990, including free-text triage chief complaints, dictated and transcribed ED reports, and coded ICD-9 discharge diagnoses.

All patients with records stored in MARS between December 1990 and September 2003 were eligible for this study, which was approved by the University of Pittsburgh's Institutional Review Board. Patients without an electronic triage chief complaint, primary ICD-9 discharge diagnosis, or electronic ED report were excluded from the study.

Objective 1: Generate a Reference Standard Set with Representative Cases for Multiple Syndromic Case Definitions

To accomplish Objective 1, we developed a systematic method for selecting patients who are likely to manifest findings consistent with bioterrorism agents and obtained the patients' syndromic classifications based on manual chart review.

Selection of Potentially Positive Patients

We generated a superset of patients who potentially exhibited symptoms or signs consistent with seven syndromes by classifying all eligible patients according to their ICD-9 primary discharge diagnoses, as described below. We then selected a representative subset for chart review by physicians.

For each of the seven syndromes monitored by RODS, a list of ICD-9 codes including findings and diseases representative of the syndrome was compiled. One author (JND), who is a physician board-certified in internal medicine and infectious diseases with over 30 years of experience, manually compiled the list of ICD-9 codes for every syndrome by electronically searching for ICD-9 codes using keywords. He included codes for the diseases potentially caused by bioterrorism threats included in Wagner et al.,7 a compilation of bioterrorist threats published by six different organizations, including the CDC and NATO. He added codes for related diseases and for findings that commonly occur with the bioterrorism threats. JND assigned every code to one of seven syndromic categories. The final list used for compiling a superset of positive patients contained 831 unique ICD-9 codes for seven syndromes (Appendix A, available as an online JAMIA supplement at www.jamia.org). Table 1 shows representative codes for each syndrome.

Table 1.

Examples of ICD-9 Codes Used for Compiling a Superset of Positive Patients

Syndrome (No. of Codes) Examples of ICD-9 Codes
Respiratory (n = 287) 020.3 (primary pneumonic plague), 021.2 (pulmonary tularemia), 011 (pulmonary tuberculosis), 480 (viral pneumonia), 487 (influenza), 033 (whooping cough), 511 (pleurisy), 786.05 (shortness of breath)
Botulinic (n = 60) 005.1 (botulism), 045.0 (acute paralytic poliomyelitis, bulbar), 357 (acute infective polyneuritis), 351.0 (Bell's palsy), 368.2 (diplopia), 374.3 (ptosis of eyelid), 787.2 (dysphagia)
Gastrointestinal (n = 119) 001 (cholera), 003.0 (Salmonella gastroenteritis), 005 (food poisoning, other bacterial), 007.4 (cryptosporidiosis), 787.91 (Diarrhea)
Neurological (n = 111) 066.4 (West Nile fever), 331.81 (Reye's syndrome), 323 (encephalitis, myelitis, and encephalomyelitis), 094.2 (syphilitic meningitis), 320 (bacterial meningitis), 780.01 (coma), 784.0 (headache)
Rash (n = 99) 022.0 (cutaneous anthrax), 050 (smallpox), 034.1 (scarlet fever), 053 (herpes zoster), 055 (measles), 684 (impetigo)
Constitutional (n = 66) 020.0 (bubonic plague), 002.0 (typhoid fever), 075 (infectious mononucleosis), 079.9 (viral infection nos), 780.6 (fever), 780.7 (malaise and fatigue)
Hemorrhagic (n = 89) 065 (arthropod hemorrhagic fever), 530.82 (esophageal hemorrhage), 535.01 (acute gastritis w/ hemorrhage), 578.0 (hematemesis), 599.7 (hematuria)

We implemented a systematic process to select a representative subset of eligible patients to be reviewed by physicians. Our selection algorithm was created with the following goals in mind: (1) ensure a potentially equivalent number of cases for every syndrome; (2) include as many patients with a discharge diagnosis code for an actual bioterrorism threat as possible; (3) aside from bioterrorism threats, avoid bias toward particular ICD-9 codes; and (4) represent a broad spectrum of disorders. We selected cases for chart review using the following algorithm (a code sketch follows the list):

  1. For every ICD-9 code representing a bioterrorism threat, randomly select up to 30 eligible patients with a discharge diagnosis matching that ICD-9 code.

  2. For all other ICD-9 codes, randomly select up to ten cases.

  3. Group the selected cases into potential syndromic categories according to the ICD-9 codes. For every syndrome:
    • if the number of selected cases (n) for that syndrome is between 200 and 250, select all cases for physician review;
    • if the number of selected cases for that syndrome is less than 200, increase n for selection of patients with ICD-9 diagnoses classified into that syndrome until the selection size is approximately 200;
    • if the number of selected cases for that syndrome is more than 250:
      • collapse nonbioterrorist threat ICD-9 codes with the same three-number prefix into one group (e.g., the ICD-9 codes for the gastrointestinal category, 009.0, 009.1, 009.2, and 009.3, would be collapsed into a single group labeled 009), with certain exceptions (e.g., signs and symptoms that could be indicative of diseases in more than one syndromic category);
      • select n cases from the nonbioterrorist threat ICD-9 codes in that syndrome, selecting from collapsed groups where relevant, rather than from every individual ICD-9 code in the collapsed group;
      • adjust n until the selected size is approximately 200.

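The sketch below renders the selection procedure in Python under stated assumptions: it takes a hypothetical mapping from each ICD-9 code to its eligible patients and a hypothetical mapping from each code to its syndrome and bioterrorism-threat status; the multi-syndrome grouping exceptions and the resampling loop for undersized syndromes are elided.

```python
# Sketch of the case-selection algorithm. Assumed (hypothetical) inputs:
#   patients_by_code: ICD-9 code -> list of eligible patient IDs with that
#                     code as primary discharge diagnosis
#   code_info:        ICD-9 code -> (syndrome, is_bioterrorism_threat)
import random
from collections import defaultdict

def select_cases(patients_by_code, code_info, target=200, bt_cap=30, cap=10):
    selected = defaultdict(list)  # syndrome -> [(code, patient_id), ...]
    # Steps 1-2: up to 30 patients per bioterrorism-threat code,
    # up to 10 patients per remaining code.
    for code, patients in patients_by_code.items():
        syndrome, is_bt = code_info[code]
        quota = bt_cap if is_bt else cap
        picks = random.sample(patients, min(quota, len(patients)))
        selected[syndrome].extend((code, p) for p in picks)
    # Step 3: adjust each syndrome toward roughly `target` cases.
    for syndrome, cases in selected.items():
        if len(cases) < target:
            pass  # raise the per-code quota and resample (not shown)
        elif len(cases) > 250:
            # Collapse non-threat codes that share a three-number prefix
            # (e.g., 009.0-009.3 -> "009") and sample per group instead.
            groups = defaultdict(list)
            for code, pid in cases:
                _, is_bt = code_info[code]
                key = code if is_bt else code.split(".")[0]
                groups[key].append((code, pid))
            per_group = max(1, target // len(groups))
            selected[syndrome] = [
                c for g in groups.values()
                for c in random.sample(g, min(per_group, len(g)))
            ]
    return selected
```
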
Applying the above algorithm, we selected a subset of 1,557 patients as the final record set for chart review by physicians. We evaluated how well the ICD-9 codes identified positive syndromic cases by measuring sensitivity and positive predictive value of the ICD-9 code classification of the 1,557 patients against classifications made by the physicians.
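
Sensitivity and positive predictive value follow their usual definitions; taking the respiratory syndrome row of Table 3 (presented under Results) as a worked instance, where TP counts patients who both carried a respiratory ICD-9 code and were judged respiratory-positive by the reference standard:

```latex
\text{sensitivity} = \frac{TP}{TP + FN} = \frac{232}{607} \approx 0.38, \qquad
\text{PPV} = \frac{TP}{TP + FP} = \frac{232}{312} \approx 0.74
```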

Physician Review of Potentially Positive Patients

Nine physicians board certified in general internal medicine participated as reference standard experts. Every ED report for the 1,557 selected patients was read by at least two randomly designated physicians, who individually assigned values to the 14 variables shown in Table 2, including general syndromic classification, detailed syndromic features, and fever status. Every variable was classified as either “present” or “absent” except for fever status, which was assessed as either “present,” “absent,” or “no information (in the record).”

Table 2.

Estimated Prevalence of Actually Positive Cases in the Record Set of 1,557 Cases (Column 1) and within the Relevant Syndrome or Feature (Column 2)

Variable No. (%) of Positive Cases in Record Set No. (%) of Positive Cases within Superseding Syndrome or Feature
Respiratory syndrome 517 (33.2) N/A
    Lower respiratory 327 (21.0) 327/517 (63.2)
        Radiological evidence 164 (10.5) 164/327 (50.2)
    Explained by nonrespiratory or noninfectious diagnosis 32 (2.1) 32/517 (6.2)
Gastrointestinal syndrome 566 (36.4) N/A
    Diarrhea 152 (9.8) 152/566 (26.9)
    Explained by nongastrointestinal or noninfectious diagnosis 30 (1.9) 30/566 (5.3)
Constitutional syndrome 283 (18.2) N/A
Rash syndrome 34 (2.2) N/A
Hemorrhagic syndrome 261 (16.8) N/A
Neurological syndrome 470 (30.2) N/A
    Meningoencephalitic 51 (3.3) 51/470 (10.9)
Botulinic syndrome 80 (5.1) N/A
Fever status 315 (20.2) N/A

N/A = Not Applicable.

Instructions for the physicians' classification task are included as Appendix B (available as an online JAMIA supplement at www.jamia.org). The authors jointly composed the case definitions contained in the first draft of instructions. Physicians individually applied the case definitions to ten pilot cases, and we provided feedback on their classifications compared with the classifications of JND. Based on classification mistakes and on questions asked by the physicians during the pilot process, we clarified and refined the case definitions. Changes to the case definitions included adding examples of findings supportive of a positive classification encountered in the pilot reports (e.g., adding physical findings of pneumonia as evidence of respiratory syndrome), clarifications on ambiguous findings (e.g., cough could be considered lower or upper respiratory, depending on the patient's other findings), and exceptions to positive classification (e.g., hemorrhagic syndrome is bleeding from any site except the central nervous system or into the conjunctiva). Physicians classified a second pilot set of six cases, after which we provided feedback and made further refinements to the case definitions.

After the pilot stage, physicians followed the instructions in Appendix B, which represent the final case definitions, to assign values to the variables for each of the 1,557 cases. The physicians determined which of the seven general syndromic case definitions described the patient at the time of discharge from the ED. None, one, or any number of syndromes could be scored as “present” for each record. If respiratory syndrome, gastrointestinal syndrome, or neurological syndrome were present, physicians assigned values to relevant syndromic features that would allow us to group patients based on more specific syndromic case definitions. For example, if a physician determined that a patient had respiratory syndrome, we asked him or her to also determine whether the respiratory symptoms affected the lower respiratory tract and whether the patient had a positive chest radiograph. A patient with lower respiratory signs or symptoms is more likely to be a respiratory case of interest in outbreak detection, and a patient with lower respiratory symptoms and a positive chest radiograph is even more likely to be of interest.

The resulting classifications by physicians allowed us to group patients into 27 different syndromic definitions, including seven general syndromic definitions (e.g., respiratory syndrome), seven narrow syndromes (e.g., lower respiratory syndrome, fever), and 13 febrile syndromes (e.g., febrile respiratory syndrome, febrile lower respiratory syndrome). We estimated the prevalence of the 27 syndromic definitions within the entire record set by counting the number of times the two initial readers agreed that the syndromic definition was present.

When a record yielded disagreement between two physicians for any variable, a third physician read the report. A tenth physician was added to examine some of the reports with disagreements. The reference standard classification of the patients for every variable was the result of majority vote of the physicians' individual ratings.
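
A minimal sketch of this adjudication and majority-vote logic, with hypothetical function and variable names:

```python
# Reference standard assignment for one record: two initial reads, a third
# read only if any of the 14 variables is disputed, then a per-variable
# majority vote. Function and variable names are hypothetical.
def reference_standard(read1, read2, get_third_read):
    """read1, read2: dicts of variable -> rating (e.g., 'present'/'absent').
    get_third_read: callable that returns a third physician's ratings."""
    if all(read1[v] == read2[v] for v in read1):
        return dict(read1)       # unanimous: no third reader needed
    read3 = get_third_read()     # disagreement on >= 1 variable
    final = {}
    for v in read1:
        votes = [read1[v], read2[v], read3[v]]
        # Majority of three; a three-way split (possible only for the
        # three-valued fever status) would need a tie-break rule not
        # specified here.
        final[v] = max(set(votes), key=votes.count)
    return final
```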

Objective 2: Evaluate Agreement and Reliability of Physician Classifications

There are no standard methods for selecting which syndromic case definitions are best for a syndromic surveillance system. The first step in validating different syndromic case definitions may be quantifying how well experts agree when classifying cases into the definitions. A case definition that exhibits poor agreement may not be appropriate for surveillance systems because automated methods can hardly be expected to do what physicians themselves cannot.

We measured interrater agreement and reliability for the 27 different case definitions formed from combinations of the variables assigned by physicians. According to Tinsley and Weiss,45 interrater agreement represents the extent to which different judges tend to make exactly the same judgments about the rated objects. Perfect agreement among raters means that the raters assigned exactly the same values when rating the same object. In contrast, interrater reliability represents the degree to which the ratings of different judges are proportional when expressed as deviations from their means. Interrater reliability is usually reported in terms of analysis of variance (ANOVA) or correlational indices and quantifies the reproducibility or precision of the reference standard.2 Both agreement and reliability statistics have been applied to reference standard generation in medical informatics research.2,46,47,48

Agreement

We applied a number of common agreement metrics to the physician classifications. For each variable, we measured the proportion of the initial two readers who agreed in their classification, along with the positive specific agreement (Ppos) and the negative specific agreement (Pneg), defined as the proportion of positive or negative cases for which the raters agreed.2 We calculated the κ value by both the method of Cohen49 and that of Siegel and Castellan,50 as suggested by Di Eugenio and Glass,51 in order to account for possible bias in the data set. The standard error and confidence intervals for Cohen's κ were obtained as described by Fleiss et al.52 We also measured 2P(A) − 1, where P(A) is the observed agreement among raters. This metric adjusts Cohen's κ for differences in prevalence of the possible classification responses (e.g., present or absent).51 We computed Finn's r53 using the computationally simpler but numerically identical result for the within-subjects variance when the ratings are assigned randomly (S_c²), as proposed by Tinsley and Weiss.45 The variable fever status was analyzed both with the three possible responses and with the “no information” and “absent” replies combined. We justified merging the two responses on the assumption that any evidence of fever during an ED visit was likely to be reported in the dictated note, so missing information about fever status in the report implied absence of fever.
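
A minimal sketch of these statistics for one binary variable rated by two physicians follows; the a/b/c/d cell names for the 2 × 2 table are an assumption of this sketch, not notation from the paper. For binary two-rater data, Finn's r reduces algebraically to 2P(A) − 1, consistent with the identical values of those two columns in Table 5.

```python
# Two-rater agreement statistics for one binary variable.
# 2x2 cells: a = both rated present, d = both rated absent,
# b and c = the two disagreement cells.
def agreement_stats(a, b, c, d):
    n = a + b + c + d
    po = (a + d) / n                         # observed proportion agreement
    p_pos = 2 * a / (2 * a + b + c)          # positive specific agreement
    p_neg = 2 * d / (2 * d + b + c)          # negative specific agreement
    # Cohen's kappa: chance agreement from each rater's own marginals.
    pe_c = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)
    kappa_c = (po - pe_c) / (1 - pe_c)
    # Siegel & Castellan's kappa: chance agreement from pooled marginals.
    p_present = (2 * a + b + c) / (2 * n)    # share of all ratings 'present'
    pe_sc = p_present ** 2 + (1 - p_present) ** 2
    kappa_sc = (po - pe_sc) / (1 - pe_sc)
    pabak = 2 * po - 1                       # kappa adjusted for prevalence
    # Finn's r: 1 minus within-subject variance over its expectation under
    # random rating; for two raters and a binary variable this equals 2Po-1.
    finn_r = 2 * po - 1
    return dict(po=po, p_pos=p_pos, p_neg=p_neg, kappa_cohen=kappa_c,
                kappa_sc=kappa_sc, pabak=pabak, finn_r=finn_r)
```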

For all variables, we used a univariate ANOVA to measure agreement among the nine individual physicians based on the number of times each physician concurred with the other reader of the same record. Then, we compared every combination of pairs of physicians using Tukey's honest significant difference test, which adjusts the observed significance level for the fact that multiple comparisons are being made, and is more powerful than the Bonferroni test when testing a large number of pairs of means (SPSS 12.0 for Windows. Release 12.0.0, 2003).
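
The analysis above was run in SPSS; a rough Python analogue using scipy and statsmodels is sketched below, with hypothetical per-read concurrence data standing in for the study's.

```python
# One-way ANOVA across physicians on per-record concurrence, followed by
# Tukey's honest significant difference test. The (physician, agreed) pairs
# are toy stand-ins; one observation per report read.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

reads = [("md1", 1), ("md1", 0), ("md2", 1), ("md2", 1),
         ("md3", 0), ("md3", 1)]
physicians = np.array([p for p, _ in reads])
agreed = np.array([a for _, a in reads], dtype=float)

groups = [agreed[physicians == p] for p in np.unique(physicians)]
f_stat, p_value = stats.f_oneway(*groups)   # between-physician variance
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.3f}")

# Tukey's HSD adjusts the significance level for all pairwise comparisons.
print(pairwise_tukeyhsd(endog=agreed, groups=physicians, alpha=0.05))
```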

Reliability

To measure reliability of the reference standard classifications, we applied an extension of classic reliability theory to calculate the generalizability coefficients for every syndromic case definition, as described in Hripcsak et al.46 We calculated generalizability coefficients from the output of a two-way ANOVA for all patient records read by two physicians and separately for those records scored by three physicians because of disagreement between the first two readers.46,54 We performed separate generalizability analyses for each variable, as suggested by Shavelson et al.54 and Hripcsak et al.46 The numbers of raters required to attain a generalizability coefficient of at least 0.70 were obtained using the formula46

N_{p'} = \frac{p'(1 - \rho)}{\rho(1 - p')}

where N_{p'} = number of raters required, p′ = target reliability (e.g., 0.70), and ρ = the single-rater generalizability coefficient. This formula is numerically equivalent to the Spearman-Brown prophecy formula.55
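
Below is a sketch of one standard crossed persons × raters formulation of these computations, assuming a complete ratings matrix per variable; Hripcsak et al.46 describe the method actually used. As a check, raters_needed reproduces Table 7: per-rater coefficients of 0.29 and 0.22 yield six and nine raters, respectively.

```python
# Generalizability coefficient for a crossed subjects-x-raters design,
# computed per variable from a complete n_subjects x n_raters matrix of
# 0/1 ratings. One common formulation, not a line-for-line reproduction
# of reference 46.
import math
import numpy as np

def g_coefficient(ratings, n_raters_target=None):
    x = np.asarray(ratings, dtype=float)
    n_s, n_r = x.shape
    grand = x.mean()
    # Mean squares from the two-way ANOVA without replication.
    ms_subj = n_r * ((x.mean(axis=1) - grand) ** 2).sum() / (n_s - 1)
    resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0) + grand
    ms_resid = (resid ** 2).sum() / ((n_s - 1) * (n_r - 1))
    var_subj = max((ms_subj - ms_resid) / n_r, 0.0)  # subject variance
    k = n_raters_target or n_r
    # Subject variance over subject variance plus residual error averaged
    # over k raters (the relative G coefficient).
    return var_subj / (var_subj + ms_resid / k)

def raters_needed(rho_single, target=0.70):
    # Spearman-Brown inversion: raters needed to reach `target` reliability
    # from the single-rater coefficient rho_single.
    return math.ceil(target * (1 - rho_single) / (rho_single * (1 - target)))

print(raters_needed(0.29), raters_needed(0.22))  # -> 6 9, matching Table 7
```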

Results

From 527,228 eligible patients, we selected a superset of potentially positive patients that included 96,818 records representing 355 unique ICD-9 codes from the initial 831 codes. We selected a final record set of 1,557 patients for chart review using the algorithm described above. Of the 1,557 records, 1,022 had disagreement between the first two readers on at least one variable and were read by a third physician.

Objective 1: Generate a Reference Standard Set with Representative Cases for Multiple Syndromic Case Definitions

We approximated the prevalence of actual positive instances of each variable represented in the record set by examining the number of times that the two initial readers agreed that the variable was present. Nested features (e.g., lower respiratory) could only be judged as present if the relevant superior syndrome or feature was present. Table 2 shows the estimated prevalence of every variable in the record set, along with the prevalence of the nested variables within the associated superior variable (e.g., the presence of lower respiratory within cases positive for respiratory syndrome). In the entire record set, gastrointestinal syndrome, respiratory syndrome, and neurological syndrome were the most prevalent, whereas botulinic syndrome and rash syndrome were the least prevalent. Over half of the positive lower respiratory cases had radiological evidence of lower respiratory disease. Of the patients with gastrointestinal syndrome, just over a fourth of them had diarrhea. Eleven percent of patients with neurological syndrome exhibited meningoencephalitic symptoms.

Table 3 shows the predictive performance of the ICD-9 groups at identifying positive syndromic cases when compared to the reference standard classifications. More than half of the 1,557 patients with an ICD-9 code classified as respiratory, gastrointestinal, hemorrhagic, or neurological syndrome were classified as positive by the reference standard. ICD-9 codes for botulinic, constitutional, and rash received the lowest positive predictive values.

Table 3.

Sensitivity and Positive Predictive Value (PPV) of ICD-9 Codes Used to Select Potentially Positive Patients

Feature Sensitivity PPV
Respiratory syndrome 0.38 (232/607) 0.74 (232/312)
Gastrointestinal syndrome 0.30 (188/653) 0.86 (188/219)
Constitutional syndrome 0.24 (93/390) 0.43 (93/217)
Rash syndrome 0.66 (91/138) 0.48 (91/189)
Hemorrhagic syndrome 0.45 (147/328) 0.71 (147/208)
Neurological syndrome 0.29 (163/471) 0.79 (163/207)
Botulinic syndrome 0.55 (67/122) 0.33 (67/205)

Table 4 shows the prevalence of fever coexistent with other case definitions in the entire record set and of fever within each of the 13 syndromic case definitions. Fever was present in 20% of the records and ranged in prevalence within the syndromes from 5% in botulinic to 63% in constitutional. Fever was one of the elements that could be used to define constitutional syndrome, so a 63% prevalence of fever in the constitutional syndrome merely indicates how often fever was used to classify a record as positive for that syndrome.

Table 4.

Estimated Prevalence of Actually Positive Cases of Febrile Syndromes in the Record Set of 1,557 Cases (Column 1) and of Fever in Positive Cases of the Relevant Syndrome or Feature (Column 2)

Classification No. (%) in Record Set with Positive Classification of Syndrome and Fever No. (%) in Superseding Syndrome or Feature with Positive Classification of Fever
Respiratory syndrome 163 (10.5) 163/517 (31.5)
    Lower respiratory 99 (6.4) 99/327 (30.3)
        Radiological evidence 66 (4.2) 66/164 (40.2)
    Explained by nonrespiratory or noninfectious diagnosis 3 (0.2) 3/32 (9.4)
Gastrointestinal syndrome 151 (9.7) 151/566 (26.7)
    Diarrhea 56 (3.6) 56/152 (36.8)
    Explained by nongastrointestinal or noninfectious diagnosis 1 (0.1) 1/30 (3.3)
Constitutional syndrome 178 (11.4) 178/283 (62.9)
Rash syndrome 8 (0.5) 8/34 (23.5)
Hemorrhagic syndrome 34 (2.2) 34/261 (13.0)
Neurological syndrome 119 (7.6) 119/470 (25.3)
    Meningoencephalitic 29 (1.9) 29/51 (56.9)
Botulinic syndrome 4 (0.3) 4/80 (5.0)

Aside from constitutional syndrome, fever was judged to be most prevalent in patients with meningoencephalitic symptoms and diarrhea and in patients with respiratory syndrome, with or without lower respiratory or radiological evidence of lower respiratory disease. Fever showed a distinctly low prevalence in botulinic syndrome and in both respiratory and gastrointestinal syndromes explained by a nonrespiratory, nongastrointestinal, or noninfectious disease.

Objective 2: Evaluate Agreement and Reliability of Physician Classifications

Agreement

Each physician agreed with the other reader of the same record on the presence or absence of all 14 variables between 29% and 41% of the time, with an average of 35%. ANOVA of disagreements for one or more of the 14 variables revealed that the between-expert variance was statistically significant (p = 0.001). Multiple comparisons between pairs of physicians showed that one physician disagreed with the readings of three other physicians with p-values of 0.002, 0.005, and 0.047. No other pairings of the nine physicians demonstrated a statistically significant disagreement in the reading of the same records.

Table 5 shows the measures of agreement between two physician raters on the 14 variables for the 1,557 records. The κ values obtained by the methods of Cohen (κc) and of Siegel and Castellan (κs&c) are virtually identical for every variable; κ varied from 0.22 to 0.86. The κ values for five variables were < 0.70. However, the κ for botulinic syndrome (0.66) was not significantly different from 0.70 (95% confidence interval: 0.61–0.72). κ was < 0.30 for three variables: respiratory syndrome explained by a nonrespiratory or noninfectious diagnosis, gastrointestinal syndrome explained by a noninfectious or nongastrointestinal diagnosis, and rash syndrome. When κ was adjusted for prevalence by the calculation of 2P(A) − 1, both κc and κs&c increased to 0.79 or greater for these three variables. Finn's r, which also adjusts for prevalence, was >0.70 for all variables. Calculation of positive and negative agreement (Ppos and Pneg) between the two raters showed that Pneg was >0.90 in all instances. Ppos was slightly higher than κ for all variables and was low for the three variables for which κc and κs&c were < 0.30.

Table 5.

Agreement Statistics for All Variables

Variable Proportion of Agreement Ppos Pneg κC κS&C 2P(A)−1 Finn's r
Respiratory syndrome 0.89 0.86 0.91 0.77 0.77 0.78 0.78
    Lower respiratory 0.89 0.79 0.92 0.72 0.72 0.78 0.78
        Radiological evidence 0.95 0.80 0.97 0.77 0.77 0.89 0.89
    Explained by nonrespiratory or noninfectious diagnosis 0.92 0.33 0.96 0.29 0.29 0.83 0.83
Gastrointestinal syndrome 0.91 0.89 0.93 0.82 0.82 0.83 0.83
    Diarrhea 0.97 0.85 0.98 0.83 0.83 0.93 0.93
    Explained by nongastrointestinal or noninfectious diagnosis 0.90 0.28 0.95 0.22 0.22 0.80 0.80
Constitutional syndrome 0.89 0.77 0.93 0.70 0.70 0.78 0.78
Rash syndrome 0.89 0.29 0.94 0.23 0.23 0.79 0.79
Hemorrhagic syndrome 0.93 0.82 0.95 0.77 0.77 0.85 0.85
Neurological syndrome 0.89 0.84 0.91 0.75 0.75 0.77 0.77
    Meningoencephalitic 0.96 0.60 0.98 0.58 0.58 0.91 0.91
Botulinic syndrome 0.95 0.69 0.98 0.66 0.66 0.91 0.91
    Fever status 0.89 0.89 0.92 0.75 0.75 0.78 0.78
    Fever status (combined) 0.95 0.89 0.97 0.86 0.86 0.90 0.90
Total of all syndromes and features (fever combined) 0.92 0.79 0.95 0.74 0.74 0.84 0.84

Ppos is the percentage of agreement on positive cases; Pneg is the percentage of agreement on negative cases. κc is Cohen's κ,49 κs&c is the κ of Siegel and Castellan,50 and 2P(A)−1 is κc corrected for prevalence. Fever status (combined) combines the values of absent and no information given.

The agreement statistics for the presence of fever are all quite high (Table 6). Only the κ values for gastrointestinal explained by a nongastrointestinal or noninfectious diagnosis and botulinic syndrome were < 0.70, and in both of these instances κc was not significantly different from 0.70 (95% confidence intervals 0.11–1.0 and 0.19–1.0, respectively). These two variables in combination with fever also had relatively low Ppos values.

Table 6.

Agreement Statistics for All Febrile Syndromes

Classification Proportion of Agreement Ppos Pneg κc κs&c 2P(A)−1 Finn's r
Febrile respiratory syndrome 0.92 0.89 0.94 0.84 0.84 0.93 0.95
    Febrile lower respiratory 0.95 0.92 0.96 0.88 0.88 0.89 0.98
        Febrile radiological evidence 0.95 0.94 0.95 0.89 0.89 0.89 0.99
    Febrile explained by nonrespiratory or noninfectious diagnosis 0.97 0.86 0.98 0.84 0.84 0.94 1.00
Febrile gastrointestinal syndrome 0.95 0.92 0.97 0.88 0.88 0.91 0.97
    Febrile diarrhea 0.94 0.93 0.95 0.88 0.88 0.88 0.99
    Febrile explained by nongastrointestinal or noninfectious diagnosis 0.97 0.67 0.98 0.65 0.65 0.93 1.00
Febrile constitutional syndrome 0.90 0.93 0.85 0.77 0.77 0.80 0.96
Febrile rash syndrome 0.94 0.89 0.96 0.85 0.85 0.88 1.00
Febrile hemorrhagic syndrome 0.95 0.83 0.97 0.80 0.80 0.89 0.98
Febrile neurological syndrome 0.96 0.92 0.97 0.89 0.89 0.91 0.97
    Febrile meningoencephalitic 0.96 0.97 0.95 0.92 0.92 0.92 1.00
Febrile botulinic syndrome 0.95 0.67 0.97 0.64 0.64 0.90 1.00
Total of all syndromes and features (except constitutional) 0.95 0.91 0.96 0.87 0.87 0.89 0.91

Ppos is the percentage of agreement on positive cases; Pneg is the percentage of agreement on negative cases. κc is Cohen's κ,49 κs&c is the κ of Siegel and Castellan,50 and 2P(A)−1 is κc corrected for prevalence.

Reliability

Table 7 shows the generalizability coefficients for the first two physician raters' judgments on each variable in the 1,557 records. The generalizability coefficient for two raters ranged from 0.37 (for gastrointestinal syndrome explained by a noninfectious or nongastrointestinal diagnosis) to 0.92 (for neurological syndrome and for fever status [combined]). The generalizability coefficient was < 0.50 for respiratory syndrome explained by a nonrespiratory or noninfectious diagnosis, gastrointestinal syndrome explained by a noninfectious or nongastrointestinal diagnosis, and rash syndrome. To obtain a target generalizability coefficient of 0.70 for these three variables would require six, nine, and eight raters, respectively.

Table 7.

Generalizability Coefficients for One and Two Raters for All Variables

Variable Generalizability Coefficient per Rater Generalizability Coefficient for Two Raters No. of Experts Required for ρ ≥ 0.70
Respiratory syndrome 0.77 0.87 1
    Lower respiratory 0.72 0.83 1
        Radiological evidence 0.77 0.87 1
    Explained by nonrespiratory or noninfectious diagnosis 0.29 0.45 6
Gastrointestinal syndrome 0.82 0.90 1
    Diarrhea 0.83 0.91 1
    Explained by a nongastrointestinal or noninfectious diagnosis 0.22 0.37 9
Constitutional syndrome 0.71 0.83 1
Rash syndrome 0.23 0.38 8
Hemorrhagic syndrome 0.77 0.87 1
Neurological syndrome 0.86 0.92 1
    Meningoencephalitic 0.61 0.76 2
Botulinic syndrome 0.67 0.80 2
Fever status 0.58 0.73 2
Fever status (combined) 0.86 0.92 1

Fever status (combined) combines the values of absent and no information given.

The generalizability coefficients per rater and for three raters derived from the analysis of the records judged by three experts (n = 1,022) are shown in Table 8. The coefficients were lowest for gastrointestinal syndrome explained by a noninfectious or nongastrointestinal diagnosis (0.45) and highest for fever status (combined) (0.93). Comparable to the analysis of records judged by two raters, the generalizability coefficient was < 0.65 for respiratory syndrome explained by a nonrespiratory or noninfectious diagnosis, gastrointestinal syndrome explained by a noninfectious or nongastrointestinal diagnosis, and rash syndrome. To reach a target generalizability coefficient of 0.70 for these three features would require six, nine, and five raters, respectively.

Table 8.

Generalizability Coefficients for One and Three Raters for All Variables

Feature Generalizability Coefficient per Rater Generalizability Coefficient for Three Raters No. of Raters Required for ρ ≥ 0.70
Respiratory syndrome 0.69 0.87 2
    Lower respiratory 0.63 0.83 2
        Radiological evidence 0.62 0.83 2
    Explained by nonrespiratory or noninfectious diagnosis 0.31 0.58 6
Gastrointestinal syndrome 0.73 0.89 1
    Diarrhea 0.74 0.90 1
    Explained by a nongastrointestinal or noninfectious diagnosis 0.21 0.45 9
Constitutional syndrome 0.61 0.82 2
Rash syndrome 0.36 0.63 5
Hemorrhagic syndrome 0.67 0.86 2
Neurological syndrome 0.68 0.86 2
    Meningoencephalitic 0.52 0.77 3
Botulinic syndrome 0.57 0.80 2
Fever status (absent, present, not given) 0.55 0.78 2
Fever status (absent or not given, present) 0.81 0.93 1

Discussion

Our objectives for this study were twofold: (1) generate a reference standard set with representative cases from seven broad syndromic case definitions and several narrower syndromic definitions, and (2) quantify interrater agreement and reliability for the reference standard classifications. We discuss our results in light of these objectives below.

Objective 1: Generate a Reference Standard Set with Representative Cases for Multiple Syndromic Case Definitions

We applied a systematic technique using ICD-9 discharge diagnoses in an attempt to ensure approximately equal representation of all seven general syndromes in the test set. The technique was not completely successful. Each syndrome was represented in 2.2% to 36.4% of the 1,557 records. In particular, respiratory syndrome (33.2%), gastrointestinal syndrome (36.4%), neurological syndrome (30.2%), constitutional syndrome (18.2%), and hemorrhagic syndrome (16.8%) were reasonably well represented in the test set, whereas botulinic syndrome (5.1%) and rash syndrome (2.2%) were less well represented. Part of the reason for lower representation of botulinic and rash syndrome was relatively low positive predictive values for the botulinic (33%) and rash (48%) ICD-9 codes at identifying positive cases.

Most of the signs and symptoms designated as possibly representing botulinic syndrome, such as cranial nerve palsies and limb paralysis, could logically be interpreted by a physician as neurological. Of the 205 records included in the data set with ICD-9 codes consistent with botulinic syndrome, only 52 were classified positively by both physicians, and 22 of the 52 were classified as both botulinic and neurological syndromes. Of the 153 not ultimately classified as botulinic syndrome, 63 were labeled as neurological syndrome. Patients with ICD-9 diagnoses such as myasthenia gravis (358.0), myoneural disorders (358.9), and Bell's palsy (351.0) could potentially resemble a patient with botulism in the earliest stages of the disease, but these conditions become distinctly unlike botulism as they progress. It may be that cases with these discharge diagnoses did not actually resemble a patient with botulism, or it may be that in spite of instructions to the contrary (Appendix B), physicians did not consistently mark a record with a plausible botulinic feature together with a disease thought to be neurological as an instance of both syndromes.

That relatively few records were classified into rash syndrome is more difficult to understand. Rash ICD-9 classification identified positive cases of rash with a positive predictive value of 0.48. For some of the rash ICD-9 codes, there was a substantial discrepancy between the assigned codes and the physicians' designation of rash syndrome. For example, 67 cases were included with ICD-9 diagnoses for varicella, zoster, and herpes simplex infections, with or without complications. For only one of 67 did two physicians agree that rash syndrome was present. The physicians assigned the other 66 records about equally to respiratory syndrome, constitutional syndrome, gastrointestinal syndrome, and neurological syndrome. In contrast, six of the ten records with ICD-9 codes denoting scarlet fever were classified as rash syndrome. It is unlikely that multiple physicians would consistently miss a rash-producing illness if it were manifest in the medical record. Perhaps the herpesvirus diagnoses were mentioned in the record as occurring in the recent or remote past and yet were coded as a primary discharge diagnosis.

There was no attempt to represent the nested variables, such as radiological evidence of lower respiratory symptoms, in the 1,557 records. Two readers agreed that lower respiratory disease was present in 63% of the 517 instances of respiratory syndrome and that radiological evidence was present in 50% of the 327 instances of lower respiratory disease. A lower respiratory syndrome case definition would likely be of value in the detection of diseases of bioterrorist and public health importance that often manifest as a pneumonia. Likewise, evaluation of classification into diarrhea and meningoencephalitic neurological illness is feasible because diarrhea was present in 152 (27%) of the cases positive for gastrointestinal syndrome, and meningoencephalitic symptoms or signs were present in 51 (11%) cases positive for neurological syndrome.

In contrast, two physicians infrequently agreed that a case of gastrointestinal syndrome or respiratory syndrome could be explained by a nongastrointestinal, nonrespiratory, or noninfectious diagnosis. An attempt to define specific case definitions for infectious respiratory or gastrointestinal diseases by the exclusion of patients with alternative explanations for their signs and symptoms does not yield a reasonably sized set of classifications.

Fever was judged by two physicians to be present in 315 (20%) of the 1,557 records. Fever is an important diagnostic sign for infectious diseases and is an element of some syndromic definitions. For example, the Centers for Disease Control and Prevention case definition of respiratory syndrome includes lower or upper respiratory symptoms and a fever (i.e., febrile respiratory illness). More than 25% of patients classified as having respiratory, gastrointestinal, or neurological syndrome were considered to have fever. The classification of fever from these records would allow researchers to generate a moderate-size set of patients matching febrile syndromic case definitions. Fever was present in only 5% of 80 records classified as botulinic syndrome. Patients with botulism are almost universally afebrile.56 In fact, the lack of fever in a patient with compatible neurological features is taken as evidence in favor of the diagnosis.

Objective 2: Evaluate Agreement and Reliability of Physician Classifications

Tukey's honest significant difference test showed that the physicians had high agreement with each other with the exception of one physician. Sometimes an outlier is the physician who is most correct. However, our experience with this physician led us to attribute the disagreements to inability to remember case definitions and insecurity with ambiguity in the record. Having a third reader on cases in which there was disagreement is important in tempering suspect responses from an outlying physician.

Cohen's κ measures interrater agreement not due to chance and is susceptible to two sources of error: skewing in the distribution of the variable values in the data set (prevalence) and the degree to which coders disagree in their overall behavior toward the choices (bias). To account for these errors in κ calculations, Byrt et al.57 suggested that intercoder agreement be reported as three parameters: κc and two adjustments of κc, one with prevalence removed and the other with bias removed. Adjusted for prevalence, κc yields a measure equal to 2P(A) − 1, where P(A) is the observed agreement among the coders.51
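
As a worked instance of the prevalence adjustment, take the meningoencephalitic row of Table 5, where the displayed proportion of agreement is 0.96:

```latex
2P(A) - 1 = 2(0.96) - 1 = 0.92
```

which matches the reported value of 0.91 up to rounding of the displayed P(A).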

The value of κc adjusted for bias is κs&c. κs&c was identical to κc for all 14 variables, indicating a lack of bias among the raters. When Cohen's κ was adjusted for prevalence, the meningoencephalitic feature and botulinic syndrome, which showed κ values of 0.58 and 0.66, respectively, both demonstrated a 2P(A) − 1 of 0.91. Moreover, the values of 2P(A) − 1 were 0.80 or greater for the three variables with a κ < 0.30. Finn's r, which also accounts for an unbalanced sample, was likewise >0.70 for the five variables for which κ was low.

There are several possible explanations for low κ values on these variables. First, determining whether respiratory syndrome or gastrointestinal syndrome is explained by a noninfectious etiology requires hypothesis building, interpretation of data that are either not specifically mentioned or are ambiguous in the ED record, reliance on experience, and highly developed analytical skills. Conditions that require substantial interpretation are likely to produce greater disagreement.46

Second, the prevalence of positive cases was low for all of the variables with low κ values. We measured positive and negative specific agreement (Ppos and Pneg) between the raters, as suggested by Cicchetti and Feinstein.58 Pneg was >0.90 in all instances. Ppos tracked the κ values rather closely for the three variables for which κc and κs&c were < 0.30 (respiratory syndrome explained by a nonrespiratory or noninfectious diagnosis, gastrointestinal syndrome explained by a noninfectious or nongastrointestinal diagnosis, and rash syndrome). Hripcsak et al.2 have pointed out that a severely unbalanced sample does not contain enough information to accurately distinguish good from bad raters, and κ values can be artificially low for unbalanced samples.53

Finn's r, like κ, can be interpreted as the proportion of the observed ratings not due to chance, but Finn's r is not affected by the distribution of the ratings59 and thus is not affected by low prevalence. Finn's r was identical to 2P(A) − 1 for all 14 variables.

However, corrections for prevalence may actually mask poor discriminatory ability of the raters.2 The reliability of the ratings may provide insight into the contradiction of a low κ and a high Finn's r. Because a high generalizability coefficient connotes high reliability, a low κ, high Finn's r, and high generalizability coefficient may indicate that the ratings were indeed reliable and that the source of the low κ values was skewed prevalence. Similarly, a low κ, high Finn's r, and low generalizability coefficient may mean that because the ratings have low reliability, κ represents the true discriminatory ability of the raters better than the values correcting for prevalence.

We measured the reliability of all variables by calculating their generalizability coefficients. Generalizability coefficients for records read by two or three readers were similar but were sometimes lower with three experts. This is not unexpected since the records read by a third expert were a subset of the records judged by two in which the original readers disagreed on responses to at least one variable. Disagreements on one or more variables occurred in 66% of all the records.

A generalizability coefficient of ≥0.70 is considered adequate if the reference standard is going to be used only to estimate the overall performance of a system46 because individual mistakes will be averaged over many cases. The variables for which the generalizability coefficients were < 0.70 were the same in the analyses of records rated by two and three physicians (Tables 7 and 8): respiratory syndrome explained by a nonrespiratory or noninfectious diagnosis, gastrointestinal syndrome explained by a noninfectious or nongastrointestinal diagnosis, and rash syndrome. Therefore, the three variables that showed poor agreement also showed poor reliability. The three variables with low reliability would require six, nine, and five to eight raters, respectively, for the generalizability coefficient to be >0.70 (Tables 7 and 8). It may be that low prevalence was not the cause for low κ values for these three variables but that rater agreement was truly low.

The agreement statistics and generalizability coefficients (Tables 6 and 9) indicate that physicians are quite congruent in judging the presence of febrile syndromes in a medical record. There is no evidence for prevalence error or bias in the data. In all variables except gastrointestinal syndrome explained by a noninfectious or nongastrointestinal diagnosis and botulinic syndrome, a single expert would be sufficient to judge the presence or absence of fever in a report. The moderate κ values, Ppos values, and generalizability coefficients for fever combined with gastrointestinal explained by a nongastrointestinal or noninfectious diagnosis and botulinic syndrome are probably due to the low number of positive instances of the combination of fever with these two attributes in the data set (1 and 4, respectively).

Table 9.

Generalizability Coefficients for One and Two Raters for Febrile Syndromes

Classification Generalizability Coefficient per Rater Generalizability Coefficient for Two Raters No. of Experts Required for ρ ≥ 0.70
Febrile respiratory syndrome 0.84 0.91 1
    Febrile lower respiratory 0.88 0.93 1
        Febrile radiological evidence 0.88 0.94 1
    Febrile explained by nonrespiratory or noninfectious diagnosis 0.84 0.91 1
Febrile gastrointestinal syndrome 0.88 0.94 1
    Febrile diarrhea 0.88 0.94 1
    Febrile explained by nongastrointestinal or noninfectious diagnosis 0.66 0.79 2
Febrile constitutional syndrome 0.78 0.87 1
Febrile rash syndrome 0.86 0.92 1
Febrile hemorrhagic syndrome 0.80 0.89 1
Febrile neurological syndrome 0.89 0.94 1
    Febrile meningoencephalitic 0.92 0.96 1
Febrile botulinic syndrome 0.64 0.78 2
Total of all syndromes and features (except constitutional) 0.86 0.92 1

All but three of the 14 variables in the reference standard can be reliably discerned by two expert physicians and used to judge the overall performance of a system. In fact, a single physician would be sufficient to construct a reference standard for gastrointestinal syndrome, diarrhea, and fever status (combined). In practice, using a single rater for these variables in a new study would require demonstrating that the task was the same and that the rater was similar to raters in this study (perhaps by comparing the variability of the single rater against the variability of those participating in this study).46

Significance of This Work

This work potentially improves the ability to perform syndromic case detection evaluations in several ways. First, we compiled a reliable reference standard set for multiple syndromic definitions, ranging from very broad definitions including all possible signs and symptoms of potential bioterrorist threats to signs and symptoms that are more likely to be due to an infectious cause. Most studies in case classification and outbreak detection have been limited to respiratory and gastrointestinal syndromes, which are the most commonly occurring syndromes. We will use this set to evaluate various syndromic classification techniques, which will help answer open questions regarding whether syndromic surveillance can successfully detect outbreaks. This set can be used to answer questions such as “With what sensitivity and specificity can we detect cases of common and rare syndromes?” and “Can we detect febrile syndromic definitions from chief complaint input data?”

Second, we used a two-step process for selecting positive patients in seven broad syndromic groups that represent signs and symptoms for all potential bioterrorist threats listed by major government associations.7 The methodology could be duplicated to develop similar reference standard sets from other institutions. We quantified the positive predictive value of the syndromic lists of ICD-9 codes, which can help researchers determine how many potentially positive cases they would need to select using this list in order to have a particular number of actually positive cases. We also developed syndromic case definitions reliably applied by general internists to classify patients based on their ED reports, and we calculated how many physicians would be required to generate a similarly reliable reference standard set.

The case definitions that we developed are quite comprehensive but are also flexible and could be useful both for institutions using the RODS system (which monitors the broadest syndromic definitions) and for applications that monitor more specific case definitions, such as febrile lower respiratory syndrome or meningoencephalitic neurological syndrome.

Future Work

We measured the positive predictive value of the ICD-9 syndromic lists and would like to expand that analysis to learn which combinations of ICD-9 codes are best for identifying positive patients according to physician classifications. We will apply machine learning methods to the data generated in this study to induce a more precise list of ICD-9 codes for syndromic classification.

Limitations

The design of this study was chosen to minimize the expense and time required to develop a reference standard set by having all records read by two experts and employing a third rater only in the cases in which the first two physicians disagreed in their responses to one or more variables. The total number of physician hours required for generating the reference standard was 346. Because the number of variables to be judged was high, 1,022 of the 1,557 (66%) records required a third reader, resulting in a slight cost savings of approximately 11%. There are trade-offs with this type of design. First, although some expensive physician time was saved, the authors expended substantial effort randomly assigning third reads to physicians who had not already read the reports. Second, analysis of the data was not as straightforward as it would have been if the same number of experts rated every record. Because some records were read by two physicians and some by three, we were not able to account for variance due to the number of raters as a source of error in the generalizability formula and had to calculate generalizability coefficients separately for two and three ratings. However, there was no substantive difference between the analyses of two and three judges (Tables 7 and 8), except an improvement in the generalizability coefficient for the rash syndrome with a third judge.

The major limitation of the reference standard resulting from this study is the underrepresentation and low reliability of rash syndrome, for which we have no certain explanation. For all other general syndromes, we have constructed a reliable reference standard set from 1,557 records read by two or three physicians.

This measurement study addressed reliability of a reference standard but did not deal with validity. High agreement among the raters, evidenced by high κ values and high generalizability coefficients, indicates good reliability but does not imply that the physicians' answers were valid. However, the use of experts who were board-certified internists and inspection of the instruments by one of the authors (JND) support content or face validity.46

Validation of syndromic definitions based on physicians' ability to agree on syndromic case classification is only the first step to determining which syndromic definitions should be monitored by an automated syndromic surveillance system. The next steps include (1) determining whether the data source used to automatically classify patients into the syndromic categories contains sufficient information (for example, monitoring febrile respiratory syndrome from chief complaints would require that the complaints consistently represent both a respiratory sign and fever) and (2) evaluating automated methods for assigning syndromic categories to ensure that they perform as well as physicians do at classifying patients.

Conclusion

We applied a systematic method for generating a reference standard set with representative cases from seven broad syndromic case definitions and several narrower definitions and quantified interrater agreement and reliability for the reference standard classifications. The sampling method that we used did not provide an equal distribution of all syndromic categories but did provide multiple positive cases for rarely occurring syndromes that could not be obtained through random sampling, including 80 cases of botulinic syndrome and 261 cases of hemorrhagic syndrome. Although the sampling method did not attempt to identify patients with fever, an important sign of infection, we ended up with a reasonable number of cases of fever within every general syndrome except for botulinic, which is typically afebrile.

We applied multiple metrics of agreement and reliability to the reference standard set and believe that the combination of metrics was valuable in understanding the quality of the reference standard. Measuring specific negative and positive agreement, Cohen's κ, and metrics that adjust for the susceptibility of κ to bias and prevalence provided insight into expert agreement and into the relationship of discriminatory ability and prevalence. Calculating pairwise agreement between experts and calculating generalizability coefficients provided insight into the similarity of the experts and the reliability of the resulting reference standard.
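For readers who wish to apply the same battery of metrics, the sketch below computes Cohen's κ,49 positive and negative specific agreement,58 and the prevalence- and bias-adjusted κ (PABAK)57 from a 2 × 2 table of two raters' binary judgments. The counts are fabricated for illustration only.

```python
# Agreement metrics for two raters' yes/no judgments, computed from a 2x2
# table. The counts below are fabricated for illustration; the formulas
# follow refs 49 (Cohen), 57 (Byrt et al.), and 58 (Cicchetti and Feinstein).

a, b, c, d = 40, 5, 3, 52  # a = both yes, b and c = disagreements, d = both no
n = a + b + c + d

p_obs = (a + d) / n                                     # observed agreement
p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2  # agreement expected by chance
kappa = (p_obs - p_exp) / (1 - p_exp)                   # Cohen's kappa

p_pos = 2 * a / (2 * a + b + c)  # positive specific agreement
p_neg = 2 * d / (2 * d + b + c)  # negative specific agreement
pabak = 2 * p_obs - 1            # prevalence- and bias-adjusted kappa

print(f"kappa = {kappa:.2f}, p_pos = {p_pos:.2f}, "
      f"p_neg = {p_neg:.2f}, PABAK = {pabak:.2f}")
# Output: kappa = 0.84, p_pos = 0.91, p_neg = 0.93, PABAK = 0.84
```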

Of the 27 syndromic definitions generated by the 14 variables, 21 showed high enough agreement and reliability to be used as reference standard definitions against which an automated syndromic classifier could be compared. All general syndromic definitions were reliably classified by physicians except for rash. Syndromic definitions that did not generate a large enough set of cases or that showed poor agreement include febrile botulinic syndrome, febrile and nonfebrile rash syndrome, respiratory syndrome explained by a nonrespiratory or noninfectious diagnosis, and febrile and nonfebrile gastrointestinal syndrome explained by a nongastrointestinal or noninfectious diagnosis.

Supplementary Material

Appendix A and B

This work was funded by Defense Advanced Research Projects Agency (DARPA) Cooperative Agreement No. F30602-01-2-0550, Pennsylvania Department of Health grant no. ME-01-737, and AHRQ grant no. 290-00-0009.

The authors thank the physician raters: Karen Barnard, Amber Barnato, Peter Bulova, Rebecca Drayer, Gary Fischer, Mandy Garber, Robin Gehris, Franziska Jovin, Chris Rihn, and David Segel.

References

1. Friedman CP. Toward a measured approach to medical informatics. J Am Med Inform Assoc. 1999;6:176–7.
2. Hripcsak G, Heitjan DF. Measuring agreement in medical informatics reliability studies. J Biomed Inform. 2002;35:99–110.
3. Graham J, Buckeridge D, Choy M, Musen M. Conceptual heterogeneity complicates automated syndromic surveillance for bioterrorism. Proc AMIA Annu Fall Symp. 2002:1030.
4. Brillman J, Joyce E, Forslund D, Picard R, Umland E, Koster F, et al. The bio-surveillance analysis, feedback, evaluation and response (B-SAFER) system. National Syndromic Surveillance Conference, 2002. Available from: http://www.nyam.org. Accessed Aug. 23, 2004.
5. Friedman C, Hripcsak G. Evaluating natural language processors in the clinical domain. Methods Inf Med. 1998;37:334–44.
6. Reis BY, Mandl KD. Syndromic surveillance: the effects of syndrome grouping on model accuracy and outbreak detection. Ann Emerg Med. 2004;44:235–41.
7. Wagner MM, Dato V, Dowling JN, Allswede M. Representative threats for research in public health surveillance. J Biomed Inform. 2003;36:177–88.
8. Ivanov O, Wagner MM, Chapman WW, Olszewski RT. Accuracy of three classifiers of acute gastrointestinal syndrome for syndromic surveillance. Proc AMIA Symp. 2002:345–9.
9. Espino JU, Wagner MM. Accuracy of ICD-9-coded chief complaints and diagnoses for the detection of acute respiratory illness. Proc AMIA Symp. 2001:164–8.
10. Beitel AJ, Olson KL, Reis BY, Mandl KD. Use of emergency department chief complaint and diagnostic codes for identifying respiratory illness in a pediatric population. Pediatr Emerg Care. 2004;20:355–60.
11. Chapman WW, Dowling JN, Wagner MM. Classification of emergency department chief complaints into seven syndromes: a retrospective analysis of 527,228 patients. Ann Emerg Med. (in press).
12. Fisher ES, Whaley FS, Krushat WM, Malenka DJ, Fleming C, Baron JA, et al. The accuracy of Medicare's hospital claims data: progress has been made, but problems remain. Am J Public Health. 1992;82:243–8.
13. Kuehnert MJ, Doyle TJ, Hill HA, Bridges CB, Jernigan JA, Dull PM, et al. Clinical features that discriminate inhalational anthrax from other acute respiratory illnesses. Clin Infect Dis. 2003;36:328–36.
14. Crook LD, Tempest B. Plague. A clinical review of 27 cases. Arch Intern Med. 1992;152:1253–6.
15. Jernigan JA, Stephens DS, Ashford DA, Omenaca C, Topiel MS, Galbraith M, et al. Bioterrorism-related inhalational anthrax: the first 10 cases reported in the United States. Emerg Infect Dis. 2001;7:933–44.
16. Recognition of illness associated with the intentional release of a biologic agent. MMWR Morb Mortal Wkly Rep. 2001;50:893–7.
17. Lee N, Hui D, Wu A, Chan P, Cameron P, Joynt GM, et al. A major outbreak of severe acute respiratory syndrome in Hong Kong. N Engl J Med. 2003;348:1986–94.
18. Considerations for distinguishing influenza-like illness from inhalational anthrax. MMWR Morb Mortal Wkly Rep. 2001;50:984–6.
19. Proctor ME, Blair KA, Davis JP. Surveillance data for waterborne illness detection: an assessment following a massive waterborne outbreak of Cryptosporidium infection. Epidemiol Infect. 1998;120:43–54.
20. Centers for Disease Control and Prevention. Recognition of illness associated with the intentional release of a biologic agent. MMWR Morb Mortal Wkly Rep. 2001;50:893–7.
21. Kortepeter M, Pavlin J, Gaydos J, Rowe J, Kelley P, Ludwig G, et al. Surveillance at US military installations for bioterrorist and emerging infectious disease threats. Mil Med. 2000;165:ii–iii.
22. Matsui T, Takahashi H, Ohyama T, Tanaka T, Kaku K, Osaka K, et al. [An evaluation of syndromic surveillance for the G8 Summit in Miyazaki and Fukuoka, 2000]. Kansenshogaku Zasshi. 2002;76:161–6.
23. Moran G, Talan D. Update on emerging infections: news from the Centers for Disease Control and Prevention. Syndromic surveillance for bioterrorism following the attacks on the World Trade Center—New York City, 2001. Ann Emerg Med. 2003;41:414–8.
24. Lober W, Karras B, Wagner M, Overhage J, Davidson A, Fraser H, et al. Roundtable on bioterrorism detection: information system-based surveillance. J Am Med Inform Assoc. 2002;9:105–15.
25. Zelicoff A, Brillman J, Forslund D, George J, Zink S, Koenig S, et al. The rapid syndrome validation project (RSVP). Proc AMIA Symp. 2001:771–5.
26. Brillman J, Joyce E, Forslund D, Picard R, Umland E, Koster F, et al. The bio-surveillance analysis, feedback, evaluation and response (B-SAFER) system [poster]. National Syndromic Surveillance Conference, 2002. Available from: http://www.nyam.org. Accessed Aug. 23, 2004.
27. Brinsfield K, Gunn J, Barry M, McKenna V, Dyer K, Sulis C. Using volume-based surveillance for an outbreak early warning system. Acad Emerg Med. 2001;8:492.
28. Reis B, Mandl K. Time series modeling for syndromic surveillance. BMC Med Inform Decis Mak. 2003;3:2–12.
29. Lazarus R, Kleinman K, Dashevsky I, DeMaria A, Platt R. Using automated medical records for rapid identification of illness syndromes (syndromic surveillance): the example of lower respiratory infection. BMC Public Health. 2001;1:9.
30. Lazarus R, Kleinman K, Dashevsky I, Adams C, Kludt P, DeMaria A Jr, et al. Use of automated ambulatory-care encounter records for detection of acute illness clusters, including potential bioterrorism events. Emerg Infect Dis. 2002;8:753–60.
31. Irvin C, Nouhan P, Rice K. Syndromic analysis of computerized emergency department patients' chief complaints: an opportunity for bioterrorism and influenza surveillance. Ann Emerg Med. 2003;41:447–52.
32. Gesteland P, Wagner M, Chapman W, Espino J, Tsui F, Gardner R, et al. Rapid deployment of an electronic disease surveillance system in the state of Utah for the 2002 Olympic Winter Games. Proc AMIA Symp. 2002:285–9.
33. Tsui F, Espino J, Dato V, Gesteland P, Hutman J, Wagner M. Technical description of RODS: a real-time public health surveillance system. J Am Med Inform Assoc. 2003;10:399–408.
34. Lewis M, Pavlin J, Mansfield J, O'Brien S, Boomsma L, Elbert Y, et al. Disease outbreak detection system using syndromic data in the greater Washington DC area. Am J Prev Med. 2002;23:180–6.
35. Buckeridge D, O'Connor M, Graham J, Choy M, Pincus Z, Musen M. Knowledge based public health surveillance. National Syndromic Surveillance Conference, 2002. Available from: http://www.nyam.org. Accessed Aug. 23, 2004.
36. Foldy S, Biedrzycki P, Barthell E, Haney-Healey N, Baker B, Howe D, et al. Milwaukee surveillance project: real-time syndromic surveillance using secure regional Internet. National Syndromic Surveillance Conference, 2002. Available from: http://www.nyam.org. Accessed Aug. 23, 2004.
37. Chapman WW, Wagner M, Ivanov O, Olszewski R, Dowling JN. Syndromic surveillance from free-text triage chief complaints. J Urban Health. 2003;80(suppl):i120.
38. Barthell EN, Aronsky D, Cochrane DG, Cable G, Stair T. The frontlines of medicine project progress report: standardized communication of emergency department triage data for syndromic surveillance. Ann Emerg Med. 2004;44:247–52.
39. Espino JU, Wagner MM, Tsui F, Su H, Olszewski RT, Lie Z, et al. The RODS Open Source Project: removing a barrier to syndromic surveillance. Medinfo. 2004;2004:1192–6.
40. Olszewski RT. Bayesian classification of triage diagnoses for the early detection of epidemics. In: Proceedings of the FLAIRS Conference, 2003. Menlo Park, CA: AAAI Press; 2003. p. 412–6.
41. Tsui FC, Wagner MM, Dato V, Chang CC. Value of ICD-9 coded chief complaints for detection of epidemics. Proc AMIA Symp. 2001:711–5.
42. Wong W, Moore AW, Cooper G, Wagner M. Rule-based anomaly pattern detection for detecting disease outbreaks. Presented at the 18th National Conference on Artificial Intelligence (AAAI-02), 2002.
43. Chapman WW, Fiszman M, Dowling JN, Chapman BE, Rindflesch TC. Identifying respiratory findings in emergency department reports for biosurveillance using MetaMap. Medinfo. 2004;2004:487–91.
44. Chapman WW, Christensen LM, Wagner MM, Haug PJ, Ivanov O, Dowling JN, et al. Classifying free-text triage chief complaints into syndromic categories with natural language processing. Artif Intell Med. 2005;33:31–40.
45. Tinsley HE, Weiss DJ. Interrater reliability and agreement of subjective judgments. J Counseling Psychol. 1975;22:358–76.
46. Hripcsak G, Kuperman GJ, Friedman C, Heitjan DF. A reliability study for evaluating information extraction from radiology reports. J Am Med Inform Assoc. 1999;6:143–50.
47. Hripcsak G, Wilcox A. Reference standards, judges, and comparison subjects: roles for experts in evaluating system performance. J Am Med Inform Assoc. 2002;9:1–15.
48. Fiszman M, Chapman WW, Aronsky D, Evans RS, Haug PJ. Automatic detection of acute bacterial pneumonia from chest X-ray reports. J Am Med Inform Assoc. 2000;7:593–604.
49. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20:37–46.
50. Siegel S, Castellan N Jr. Nonparametric statistics for the behavioral sciences. 2nd ed. New York: McGraw-Hill; 1988.
51. Di Eugenio B, Glass M. The kappa statistic: a second look. Comput Ling. 2004;30:95–101.
52. Fleiss J, Cohen J, Everitt B. Large sample standard errors of kappa and weighted kappa. Psychol Bull. 1969;72:323–7.
53. Finn P. A note on estimating the reliability of categorical data. Educ Psychol Meas. 1970;30:71–6.
54. Shavelson RJ, Webb NM. Generalizability theory: a primer. Newbury Park, CA: Sage; 1991.
55. Friedman C, Wyatt JC. Evaluation methods for medical informatics. New York: Springer-Verlag; 1997. p. 99–100.
56. Arnon SS, Schechter R, Inglesby TV, Henderson DA, Bartlett JG, Ascher MS, et al. Botulinum toxin as a biological weapon: medical and public health management. JAMA. 2001;285:1059–70.
57. Byrt T, Bishop J, Carlin J. Bias, prevalence and kappa. J Clin Epidemiol. 1993;46:423–9.
58. Cicchetti D, Feinstein A. High agreement but low kappa. II. Resolving the paradoxes. J Clin Epidemiol. 1990;43:551–8.
59. Whitehurst GJ. Interrater agreement for journal manuscript reviews. Am Psychol. 1984;39:22–8.
