Abstract
The sensitivity and specificity of syndrome definitions used in early event detection (EED) systems affect the usefulness of the system for end-users. The ability to calculate these values aids system designers in the refinement of syndrome definitions to better meet public health needs. Utilizing a stratified sampling method and expert review to create a gold standard dataset for the calculation of sensitivity and specificity, we describe how varying syndrome structure impacts these statistical parameters and discuss the relevance of this to outbreak detection and investigation.
Introduction
Efforts to improve the earlier detection of health threats have focused on syndromic surveillance—the use of syndromes (constellations of patient signs and symptoms) that predict or herald certain disease states of interest. Although a framework for the development of syndromic surveillance systems has been established,1 no standard methodology has emerged for the development of syndrome definitions. These definitions are typically developed from expert consensus and local need, then refined through an iterative approach derived from experience with both the system and the syndromes.
One of the challenges that has faced early event detection (EED) systems to date has been their validation. There is a lack of standard methodology for the development of syndrome definitions and the choice of gold standards against which to measure the validity of syndrome definitions.2–4 Intricately tied to validation is the question of how to evaluate sensitivity and specificity, since it is difficult to define a gold standard data set that is independent from the testing data set. We have previously used the ICD-9-CM code for influenza as the reference standard in order to improve the sensitivity of an influenza-like-illness syndrome query.5,6 The drawback to using ICD-9-CM codes for this purpose (without expert review) is that different clinicians have varying thresholds for assigning syndrome-related diagnoses based on clinical suspicion (i.e. without confirmatory testing), resulting in additional false positive and false negative cases being considered as gold standard true positive records. Other investigators have defined reference standards using ICD-9-CM diagnosis codes to identify records for expert review.7 The method of arriving at the gold standard dataset for the purpose of calculating sensitivity and specificity affects the validity of these calculations.
While high sensitivity is important to meet the empiric needs of electronic EED systems, high specificity is also important so that public health professionals can allocate their time efficiently. False positive syndrome hits may result in system alarms, which can waste time due to unnecessary signal investigation. In this paper, we describe a method for measuring the sensitivity and specificity of syndrome classification in order to evaluate ways to improve EED system performance through the creation of a gold standard data set.
Methods
Sample Selection
The basis of our methodology was to randomly select a stratified sample of records, basing sample allocation on estimated prevalence, sensitivity, and specificity. Since only a small percentage of records in an emergency department (ED) database are positive for any one syndrome, we created a less stringent version of our standard respiratory syndrome definition from which we could then draw our samples. The composition of both these queries is described below. Clinical experts then independently reviewed the sampled records and assigned a syndrome status to each record according to our syndrome case definition, thereby creating a gold-standard dataset of true positive syndrome cases.
We created a static database containing approximately 1.1 million visits from the North Carolina Disease Event Tracking and Epidemiologic Collection Tool8 (NC DETECT, www.ncdetect.org), a population-based EED system that uses ED data as one of its primary data sources. These visits occurred between Oct 1, 2004 and Sep 30, 2005 (this date range was chosen over a calendar year so as to include an entire influenza season). Once we removed all visits with an injury-related primary diagnosis code, we were left with 956,015 records.
The NC DETECT syndrome queries search the chief complaint and triage note data elements of each record for free-text strings in order to find terms associated with a syndrome. The queries operationalize clinical case definitions, which are derived from the CDC’s text-based case definitions for bioterrorism syndromes.9 The case definitions have been modified over several years in an iterative fashion according to the knowledge and experience of the NC DETECT Syndrome Definition Workgroup. Syndrome queries are written in structured query language (SQL) and are designed to identify these terms, as well as common synonyms, acronyms, abbreviations, truncations, and misspellings, in the free-text data.
Chief complaint (CC) is defined by the Centers for Disease Control and Prevention (CDC) as “(the) patient’s reason for seeking care or attention (in the ED)”. The triage note (TN) field contains more detail on reason for ED visit as documented by the triage nurse at the time of initial patient presentation. NC DETECT’s standard respiratory query looks for a respiratory symptom (e.g. cough, shortness of breath) and a constitutional symptom (e.g. fever, fatigue) in either the CC or TN fields. The less stringent respiratory query only requires a respiratory symptom, without a requirement for a constitutional term. Standard and less stringent respiratory syndrome queries were applied to the entire data set and each record was given a corresponding syndrome designation.
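The two query rules can be sketched as simple text matching. The following is an illustrative Python sketch, not the production SQL: the term lists are hypothetical stand-ins for NC DETECT's much larger sets of terms, synonyms, abbreviations, and misspellings.

```python
# Illustrative term lists; hypothetical stand-ins for the full SQL term sets.
RESP_TERMS = ("cough", "shortness of breath", "sob", "wheez")
CONST_TERMS = ("fever", "fatigue", "chills", "malaise")

def has_term(text, terms):
    """Case-insensitive substring match, mimicking a SQL LIKE search."""
    text = text.lower()
    return any(t in text for t in terms)

def standard_query(cc, tn=""):
    """Standard rule: a respiratory AND a constitutional term in CC or TN."""
    combined = cc + " " + tn
    return has_term(combined, RESP_TERMS) and has_term(combined, CONST_TERMS)

def less_stringent_query(cc, tn=""):
    """Less stringent rule: a respiratory term alone suffices."""
    return has_term(cc + " " + tn, RESP_TERMS)
```

Because the standard rule requires two term classes while the less stringent rule requires one, every standard-query positive is necessarily a less-stringent-query positive, which is what makes the less stringent query usable as a sampling frame.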
In order to select the dataset for expert review, the data were divided into four strata according to their less stringent respiratory syndrome query results and the availability of TN. Because the proportion of respiratory syndrome cases was expected to differ across strata, a stratified random sample was ideal. We needed to determine the sample size and sample allocation among the strata that were most efficient given the required accuracy of the desired estimates. For our purposes, accuracy was measured through the standard errors of the estimated prevalence, sensitivity, and specificity, all of which are important when evaluating a classification method. Given that we used the query classification as our reference, we specified our accuracy by requiring the standard error of the estimated prevalence, sensitivity, and specificity of the query classification to be bounded by certain pre-specified values.
Given the resources available, we set the bounds for the estimated variances to 0.01% for sensitivity, so that the 95% confidence interval for its estimate would be (point estimate - 2%, point estimate + 2%), and to 0.0025% for specificity and prevalence, so that the corresponding 95% confidence intervals would be (point estimate - 1%, point estimate + 1%). The larger variance allowed for sensitivity reflects the fact that reducing it increases the required sample size more rapidly than reducing the variances of the other parameters. To choose the smallest sample size and optimal sample allocation, we used the algorithm developed by Chromy.10 To apply this algorithm, we used estimates of the proportions of respiratory syndrome cases in each stratum obtained from a pilot study in which 1,000 cases were sampled from earlier data. The estimated proportions in the four strata are quite different, implying that the stratified sample should be much more efficient than a simple random sample.
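As a rough sketch of the arithmetic involved: a variance bound translates into a 95% confidence half-width of about 2·SE, and the sample can then be spread across strata to respect such bounds. Chromy's multi-criteria algorithm is not reproduced here; `neyman_allocation` below is a simpler, single-objective stand-in (classic Neyman allocation for one binary outcome), and its stratum sizes and prevalences in the test usage are hypothetical.

```python
import math

# The variance bounds quoted above imply these 95% CI half-widths (~2 * SE):
assert round(2 * math.sqrt(0.0001), 4) == 0.02    # sensitivity: +/- 2%
assert round(2 * math.sqrt(0.000025), 4) == 0.01  # specificity, prevalence: +/- 1%

def neyman_allocation(n_total, stratum_sizes, stratum_prevalences):
    """Allocate n_total sample units across strata in proportion to
    N_h * S_h, where S_h = sqrt(p_h * (1 - p_h)) is the stratum standard
    deviation of a binary outcome. A single-objective simplification of
    the multi-objective optimization performed by Chromy's algorithm."""
    weights = [N * math.sqrt(p * (1 - p))
               for N, p in zip(stratum_sizes, stratum_prevalences)]
    total = sum(weights)
    return [round(n_total * w / total) for w in weights]
```

The intuition matches the study design: strata with more records and prevalences nearer 50% receive larger allocations, which is why a stratified design needs far fewer reviewed records than a simple random sample for the same precision.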
Using the estimated variances based on these estimated proportions as inputs to the SAS code developed by Chromy to implement his algorithm, we obtained the minimum sample size and best sample allocation that satisfied the constraints on the variances of the estimated prevalence, sensitivity, and specificity. The result was a total sample size of 3,699 records allocated into the four strata based on less stringent respiratory definition status and presence/absence of triage notes (Table 1). Samples from each stratum were selected independently of the others and delivered to the case reviewers for the clinical review process to begin. SAS version 9.1 (SAS Institute, Cary, NC) was used for statistical calculations.
Table 1.
Allocation of records sampled from each of four categories based on less stringent respiratory definition status and presence/absence of triage notes.
| | Less Stringent Respiratory Definition Positive | Less Stringent Respiratory Definition Negative |
|---|---|---|
| Triage Notes Present | 503 (S1) | 418 (S3) |
| Triage Notes Absent | 585 (S2) | 2,193 (S4) |
Expert Review
Records were classified according to respiratory syndrome status based on the expert’s review of all available data elements. The NC DETECT case definition for respiratory syndrome was used as a guideline. In addition to the CC, TN and initial ED temperature data elements utilized by the automated syndrome classifier, the expert reviewers also had access to patient age, blood pressure, pulse rate, oxygen saturation and ED diagnosis codes, when available. All cases were reviewed by two clinical experts and a kappa statistic was calculated. A third reviewer adjudicated any reviewer disagreements. The kappa statistic for the chart review by the two initial reviewers was 76.1%.11
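Cohen's kappa measures how much the two reviewers agreed beyond what chance alone would produce. A minimal sketch for a 2×2 agreement table follows; the counts in the usage line are hypothetical illustrations, not the study's data.

```python
def cohens_kappa(table):
    """Cohen's kappa for a 2x2 agreement table:
    table[i][j] = count of records rated i by reviewer 1 and j by reviewer 2
    (index 0 = syndrome positive, 1 = syndrome negative)."""
    n = sum(sum(row) for row in table)
    p_observed = (table[0][0] + table[1][1]) / n
    # Chance agreement expected from the two reviewers' marginal rates.
    r1_pos = (table[0][0] + table[0][1]) / n
    r2_pos = (table[0][0] + table[1][0]) / n
    p_expected = r1_pos * r2_pos + (1 - r1_pos) * (1 - r2_pos)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical counts for illustration only (approximately 0.625):
kappa = cohens_kappa([[20, 5], [10, 65]])
```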
Once this gold standard data set was defined, we processed our standard respiratory syndrome query against these records in order to estimate the sensitivity and specificity. Since these results were based on a comparison of 3,699 expertly reviewed records, we calculated the weighted sensitivity and specificity for the total 956,015 records. For example, the sensitivity is estimated by the weighted total of test true positives divided by the weighted total of all true positives. The sample weight for a stratum is the reciprocal of its sampling fraction (the stratum population total divided by the stratum sample size). A sampled visit from the stratum of less stringent respiratory syndrome with TN carries a weight of 39,698/503 = 78.92 because the stratum total is 39,698 and the sample size for that stratum is 503.
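Using the weighted 2×2 counts that result from this procedure (the "All" rows of Table 4), the four performance estimates reduce to simple ratios:

```python
# Weighted counts for the standard query, gold standard vs. query result
# (Table 4, "All" rows):
tp, fn = 16_657, 55_477   # gold standard positive: query +, query -
fp, tn = 12_167, 871_714  # gold standard negative: query +, query -

sensitivity = tp / (tp + fn)  # weighted true positives / all weighted positives
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)          # positive predictive value
npv = tn / (tn + fn)          # negative predictive value
# These reproduce the "Standard Query" rows of Tables 2 and 5:
# Sn 23.09%, Sp 98.62%, PPV 57.79%, NPV 94.02%.
```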
We then processed several different variations of the standard respiratory query to look for the most ideal combination of sensitivity and specificity (Table 2). The first alternate query (A1) ignored TN, even when it was present. The second alternate query (A2) used the less stringent respiratory definition (no requirement for a constitutional term) and ignored TN. The third alternate query (A3) used the less stringent respiratory definition, and used TN when available. The fourth alternate query (A4) used the standard definition (requirement for a constitutional term) when TN was available, and the less stringent definition when TN was not available.
Table 2.
Weighted sensitivity (Sn) and specificity (Sp) results for the five different queries.
| Query | Weighted Sn (%) | Weighted Sp (%) |
|---|---|---|
| Standard Query | 23.09 | 98.62 |
| Alternate Query 1 (A1) | 9.49 | 99.76 |
| Alternate Query 2 (A2) | 43.78 | 95.14 |
| Alternate Query 3 (A3) | 53.30 | 92.78 |
| Alternate Query 4 (A4) | 45.86 | 94.87 |
Results
The unweighted counts of the 3,699 expert reviewed cases categorized by NC DETECT standard respiratory query are displayed in Table 3. These results are further classified by presence or absence of triage note data. Table 4 shows these results weighted for the 956,015 non-injury records in the database.
Table 3.
Counts of gold standard review results by NC DETECT standard respiratory query.
| | Gold Standard | Query + | Query − | Total |
|---|---|---|---|---|
| TN | + | 90 | 81 | 171 |
| TN | − | 112 | 638 | 750 |
| No TN | + | 89 | 245 | 334 |
| No TN | − | 31 | 2,413 | 2,444 |
| All | + | 179 | 326 | 505 |
| All | − | 143 | 3,051 | 3,194 |
| Total | | 322 | 3,377 | 3,699 |
Table 4.
Weighted counts of gold standard review results by NC DETECT standard respiratory query.
| | Gold Standard | Query + | Query − | Total |
|---|---|---|---|---|
| TN | + | 7,103 | 10,865 | 17,968 |
| TN | − | 8,839 | 189,682 | 198,521 |
| No TN | + | 9,554 | 44,612 | 54,166 |
| No TN | − | 3,328 | 682,032 | 685,360 |
| All | + | 16,657 | 55,477 | 72,134 |
| All | − | 12,167 | 871,714 | 883,881 |
| Total | | 28,824 | 927,191 | 956,015 |
For each of the five queries, the sensitivity and specificity estimates weighted for the database of 956,015 non-injury records are shown in Table 2, while the weighted estimates of the positive predictive value (PPV) and negative predictive value (NPV) are shown in Table 5.
Table 5.
Weighted positive predictive value (PPV) and negative predictive value (NPV) results for the five different queries
| Query | Weighted PPV (%) | Weighted NPV (%) |
|---|---|---|
| Standard Query | 57.79 | 94.02 |
| Alternate Query 1 (A1) | 76.18 | 93.11 |
| Alternate Query 2 (A2) | 42.35 | 95.40 |
| Alternate Query 3 (A3) | 37.59 | 96.05 |
| Alternate Query 4 (A4) | 42.19 | 95.55 |
Discussion
We identified several challenges in creating a gold standard data set to evaluate the sensitivity and specificity of syndrome definitions used in an EED system. One of our foremost concerns was how to draw a truly representative sample of records for expert review that was manageable in size and accurately represented the prevalence, sensitivity and specificity of respiratory cases within the larger data set.
Only 2.6% of records in the NC DETECT database match the respiratory query. A simple random sample would therefore require expert review of far more records than is feasible in order to obtain an adequate number of gold-standard positive cases. The estimated sensitivity and specificity of a query are functions of the proportions of true syndrome cases within the query outcome classes. To estimate sensitivity and specificity more efficiently, the review sample was stratified by the less stringent respiratory query outcome and the availability of TN, since we expected these two factors to strongly influence true syndrome status. This method allowed us to reduce the number of records for expert review to a manageable level while maintaining accurate assessment of sensitivity and specificity.
When applied to all non-injury records within our test database, the standard NC DETECT respiratory query (which requires both a respiratory and a constitutional term in either the CC or TN field) returned a weighted sensitivity of 23.09% and a weighted specificity of 98.62%. However, only eight (14.8%) of the 54 hospitals that contributed ED visit information to the static database used in this project submitted TN data (all submitted chief complaint). Because hospitals that submit TN are typically larger hospitals with higher emergency department volume, 19.3% of non-injury patient visits had a TN available for analysis. We therefore applied a variation on the standard respiratory query (A1) in which only the chief complaint was searched for both a constitutional term and a respiratory term. The resulting weighted sensitivity (9.49%) and weighted specificity (99.76%) reflect the performance our standard respiratory query can be expected to achieve when the triage note is not available, as is the case in many early event detection systems. The observed decrease in sensitivity can be attributed to the limited amount of searchable text typically contained within the chief complaint field.
We hypothesized that removing the requirement for a constitutional term would address this problem when TN is unavailable. Another modified query (A2), in which only a respiratory term was required, resulted in a weighted sensitivity of 43.78% and weighted specificity of 95.14% when applied to only the chief complaint data element. When this same query was applied to both chief complaint and triage note data elements (A3), weighted sensitivity improved to 53.3% while weighted specificity fell to 92.78%. Finally, a hybrid syndrome construct (A4) designed to take advantage of the increased specificity realized by applying the standard query when TN is available and the increased sensitivity seen by applying the less stringent query when TN data are not available returned a weighted sensitivity and specificity of 45.86% and 94.87%, respectively (Table 2).
When determining the optimal sensitivity and specificity for an early event detection system, many factors must be considered. Signal generation depends on the detection algorithm used to identify unusual increases in the number of syndromic cases observed. In general, higher sensitivity results in identification of more syndromic cases and a decreased likelihood of missing an event. The tradeoff for this is an increase in false alarms due to an increase in false positive results. Conversely, increasing specificity minimizes false alarms and increases the likelihood of missing an event. The appropriate balance between these statistical parameters is a function of both the early event detection system under consideration and the event or syndrome being monitored. For events or disease processes of great concern to public health, higher sensitivity may be desired. The public health agency operating the EED system must have the capacity to investigate the increased number of signals resulting from this, whether it is by electronic “reach back” capability12 or by more traditional outbreak investigation methods. For those systems with limited time and personnel assigned to signal investigation, higher specificity may be more desirable.
Positive predictive value of a definition reflects the probability that a case categorized as positive matches the associated case definition for the syndrome. Conversely, negative predictive value reflects the probability that a case not identified by the definition is truly not representative of the syndrome. The performance of the syndrome definitions reported here will vary when applied within the context of a syndromic surveillance system and will be affected by factors such as the detection algorithm employed.1 The relationship between the various parameters will, however, remain constant. All other factors being equal, modifying a syndrome to increase its sensitivity will result in a decrease in PPV, decreasing the likelihood of missing a real event at the cost of increased likelihood that any given alert is a false alarm.
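This tradeoff follows directly from the standard relationship between predictive values, test accuracy, and prevalence. A small sketch, using the exact weighted values behind the "Standard Query" rows of Tables 2 and 4, and the rounded A3 values from Table 2 for comparison:

```python
def ppv(sn, sp, prevalence):
    """Positive predictive value from sensitivity, specificity and
    prevalence: PPV = Sn*p / (Sn*p + (1 - Sp)*(1 - p))."""
    p = prevalence
    return sn * p / (sn * p + (1 - sp) * (1 - p))

# Exact weighted inputs for the standard query (Table 4):
sn = 16_657 / 72_134
sp = 871_714 / 883_881
prev = 72_134 / 956_015   # weighted prevalence of the respiratory syndrome

# Raising sensitivity at the cost of specificity (query A3, rounded values
# from Table 2) lowers PPV, as described above:
assert ppv(0.5330, 0.9278, prev) < ppv(sn, sp, prev)
```

Because the syndrome is rare (about 7.5% weighted prevalence), even a small loss of specificity adds many false positives relative to the extra true positives gained, which is why PPV falls as the definitions are loosened.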
By using the stratified sampling method described here to create our gold standard dataset, we were able to calculate the sensitivity and specificity of our standard respiratory query as well as determine how variations in the structure of this query affect its performance. By weighting the results, we were able to limit the size of the sample needed for expert review while maintaining the ability to extrapolate these findings to the entire dataset.
Conclusions
We believe that the methodology described here will allow the development of practical and useful gold standard datasets for the evaluation of the sensitivity and specificity of various syndrome definitions. This method may have particular utility in a priori crafting of case definitions for outbreak investigations. During epidemic investigations, the test characteristics most valued in a case definition vary in a predictable manner, depending upon the phase of the outbreak. During the early stages of an outbreak, health officials typically use a sensitive case definition, in order to find all possible cases that may provide insights into source and causality; once the outbreak has been confirmed, a more specific case definition usually enables more efficient use of scarce resources to implement control measures. Use of this methodology will allow for refinement of syndromic surveillance queries to meet public health needs.
Acknowledgments
The authors would like to thank Jennifer MacFarquhar, Dennis Falls, Aaron Kipp and John Crouch for their contributions to this project.
Funding for this study was provided by the CDC through grant R01 PH000038-0.
References
1. Buehler JW, Hopkins RS, Overhage JM, Sosin DM, Tong V. Framework for evaluating public health surveillance systems for early detection of outbreaks. MMWR. 2004;53(RR05):1–11.
2. Sosin DM, DeThomasis J. Evaluation challenges for syndromic surveillance: making incremental progress. MMWR. 2004;53S:125–129.
3. Chapman WW, Dowling JN, Wagner MM. Fever detection from free-text clinical records for biosurveillance. J Biomed Inform. 2004;37:120–127. doi:10.1016/j.jbi.2004.03.002.
4. Espino JU, Wagner MM. Accuracy of ICD-9-coded chief complaints and diagnoses for the detection of acute respiratory illness. Proc AMIA Symp. 2001:164–168.
5. Scholer MJ, MacFarquhar J, Sickbert-Bennett E, Kipp A, Travers DA, Waller AE. Reverse engineering of a syndrome definition for influenza. Adv Dis Surv. 2006;1:64.
6. Scholer MJ, Waller AE, Falls D, Johnson K. Development of a syndrome definition for influenza-like illness. Abstract/oral presentation at the 2004 American Public Health Association Meeting; Washington, DC; November 2004.
7. Chapman WW, Dowling JN, Wagner MM. Generating a reliable reference standard set for syndromic case classification. JAMIA. 2005;12(6):618–629. doi:10.1197/jamia.M1841.
8. Li M, Ising A, Waller A, Falls D, Eubanks T, Kipp A. North Carolina bioterrorism and emerging infection prevention system. Adv Dis Surv. 2006;1:80.
9. Centers for Disease Control and Prevention. Syndrome definitions for diseases associated with critical bioterrorism-associated agents. October 2003. http://www.bt.cdc.gov/surveillance/syndromedef/index.asp. Accessed June 24, 2005.
10. Chromy JR. Design optimization with multiple objectives. In: Proceedings of the Survey Research Methods Section. American Statistical Association; 1987:194–199.
11. Ghneim GS, Wu S, Westlake M, Scholer MJ, Travers DA, Waller AE, Wetterhall SF. Defining and applying a method for establishing gold standard sets of emergency room visit data. Adv Dis Surv. 2007;2:9.
12. Cline JS. Embedding reachback capability in emergency-department syndromic surveillance. Abstract/oral presentation at the 2005 National Syndromic Surveillance Conference; Seattle, WA; September 14, 2005. http://thci.org/_Documents/temp/IMCAbstract_NSS_7.15.05.doc. Accessed March 15, 2007.