Journal of the American Medical Informatics Association (JAMIA). 2008 Mar–Apr;15(2):227–234. doi:10.1197/jamia.M2493

Individual and Joint Expert Judgments as Reference Standards in Artifact Detection

Marion Verduijn, Niels Peek, Nicolette F de Keizer, Erik-Jan van Lieshout, Anne-Cornelie JM de Pont, Marcus J Schultz, Evert de Jonge, Bas AJM de Mol
PMCID: PMC2274798  PMID: 18096912

Abstract

Objective

To investigate the agreement among clinical experts in their judgments of monitoring data with respect to artifacts, and to examine the effect of reference standards that consist of individual and joint expert judgments on the performance of artifact filters.

Design

Individual judgments of four physicians, a majority vote judgment, and a consensus judgment were obtained for 30 time series of three monitoring variables: mean arterial blood pressure (ABPm), central venous pressure (CVP), and heart rate (HR). The individual and joint judgments were used to tune three existing automated filtering methods and to evaluate the performance of the resulting filters.

Measurements

The interrater agreement was calculated in terms of positive specific agreement (PSA). The performance of the artifact filters was quantified in terms of sensitivity and positive predictive value (PPV).

Results

PSA values between 0.33 and 0.85 were observed among clinical experts in their selection of artifacts, with relatively high values for CVP data. Artifact filters developed using judgments of individual experts were found to generalize moderately to new time series and other experts; sensitivity values ranged from 0.40 to 0.60 for ABPm and HR filters (PPV: 0.57–0.84), and from 0.63 to 0.80 for CVP filters (PPV: 0.71–0.86). Higher performance values were found for all three variable types when joint judgments were used for tuning the filtering methods.

Conclusion

Given the disagreement among experts in their individual judgment of monitoring data with respect to artifacts, the use of joint reference standards obtained from multiple experts is recommended for development of automatic artifact filters.

Introduction

Evaluation studies of medical informatics systems that are designed to carry out clinical tasks automatically are often complicated by the lack of an objective gold standard. 1 In medical informatics, clinical domain experts therefore play an important role in the evaluation of these systems. They may be employed to generate a reference standard, to judge the output of the system, or to serve as comparison subjects against which the system’s performance is gauged. 2 The quality of the reference standard, however, may have an important impact on the generalizability of the findings in evaluation studies, especially when subjective expert judgments are used.

This article presents a study on reference standards obtained from clinical experts for automated artifact detection in monitoring data. In the intensive care unit (ICU), automated monitoring systems measure many physiological variables with high frequency to continuously check the patient’s condition. In modern, well-equipped ICUs, these measurements are automatically recorded in ICU information systems. Monitoring data, however, often contain inaccurate and erroneous measurements, also called artifacts. 3 Data artifacts hamper interpretation and analysis of the data, as they do not reflect the true state of the patient. In practice, experienced clinicians ignore particular measurements that they consider unreliable when inspecting and using monitoring data. Computerized medical assistants, such as decision support systems, which are increasingly implemented in ICU information systems, 4–6 may provide inaccurate support based on monitoring data when they do not discern artifacts in these measurements. This has prompted research on methods for automated artifact detection, in order to exclude artifacts (data filtering) or to repair them given the available data. 7–9

Except for measurements that take theoretically impossible values (e.g., negative blood pressures), defining which measurements should be considered artifacts is difficult, primarily because the concept of ‘artifact’ is vague and hard to formalize. Individual clinicians may therefore differ in their interpretation of monitoring data with respect to identifying artifacts. 3 Nevertheless, judgments obtained from a single clinical expert have been used in several studies on automated artifact detection in monitoring data. 8–10 The individual judgments generally serve as reference standards to tune methods for automated artifact detection on a training sample, and to validate the resulting filters on a test sample or in a cross-validation design.

The objective of this study is threefold. Our first aim is to investigate the agreement among experts in their judgments of monitoring data with respect to artifacts. Second, we examine how the use of individual judgments as reference standards affects the performance of artifact filters developed with these standards. Reference standards that join the judgments of individual experts are considered to be more reliable. 2 Our final aim is therefore to investigate the performance of artifact filters that have been developed with joint judgments.

To answer these research questions, we obtained individual judgments from four clinical experts on a sample of time series of three monitoring variables (mean arterial blood pressure, central venous pressure, and heart rate), as well as two joint judgments (a majority vote judgment and a consensus judgment). We used the judgments to tune three artifact detection methods that have been proposed in the literature, and to validate the resulting filters.

Data and Methods

Monitoring Data

In this study, we used monitoring data from the department of Intensive Care Medicine of the Academic Medical Center (AMC) in Amsterdam, The Netherlands. At this department, critically ill patients are monitored by IntelliVue Monitor MP90 systems (Philips Medical Systems, Eindhoven, The Netherlands). The monitoring data are recorded at a frequency of one measurement per minute and stored in the Metavision ICU information system (iMDsoft, Tel Aviv, Israel).

Our study is restricted to three physiological variables that concern the cardiovascular system: mean arterial blood pressure (ABPm), central venous pressure (CVP), and heart rate (HR). The latter variable is obtained by electrocardiogram; the HR values as presented by the monitor are derived from six heartbeats. The blood pressures are measured by separate probes; these independently measured variables therefore do not contain correlated artifacts due to probe malfunction. The three variables are recorded in the ICU information system with equal frequency, but they differ greatly in their variability. For instance, arterial pressure and heart rate are much more prone to sudden changes than venous pressure, and in heart rate patterns such sudden changes may persist for certain episodes.

For our experiment, 30 time series of the three cardiovascular variables were selected from a research database of monitoring data of 367 patients who underwent cardiac surgery at the AMC in the period of April 2002 to June 2003. The time series were selected for their relatively rough course using visual inspection of the data. Each of these series included several hundred measurements (a duration of two to five hours); they originated from 18 different patients. Descriptive statistics of the selected ABPm, CVP, and HR data are listed in Table 1.

Table 1. Descriptive Statistics of the Selected ABPm, CVP, and HR Time Series

Variable (unit)  | No. of Time Series | Mean Duration (min) | Total No. of Measurements | Mean | Min | Max | SD
ABPm (mmHg)      | 10                 | 271.0               | 2701                      | 80.5 | −5  | 328 | 18.48
CVP (mmHg)       | 13                 | 247.1               | 3193                      | 13.8 | −22 | 183 | 9.28
HR (beats/min)   | 7                  | 286.7               | 2005                      | 85.1 | 0   | 142 | 15.28

Generating Reference Standards

For each time series, three types of reference standards were developed: four individual judgments, a majority vote judgment, and a consensus judgment. Four experienced ICU physicians from the AMC (where the data were recorded) were asked to inspect the series and point out individual data points that they considered to be artifacts. All physicians were internist-intensivists; their postgraduate experience as internist ranged from 8 to 16 years, and as intensivist from 5 to 13 years.

We prepared the time series to be judged by visualizing the raw measurements on paper. To improve the visualization, all measurements with values that are theoretically impossible independent of the clinical context (e.g., negative blood pressures) were excluded from the series. For that purpose, the four physicians defined a domain of theoretically possible values for each variable type. The excluded measurements were considered to be judged as artifacts by each physician. We will refer to these measurements as ‘range errors’ in the remainder of this article. Range errors were only excluded during scoring of the time series by the physicians; they were part of the series during development and evaluation of the artifact filters.

In addition, we provided the physicians with relevant context information for the time series to be judged by visualizing other physiological variables that were recorded simultaneously for the same patient. These variables included the ABPm, CVP, and HR time series (depending on the series to be judged), as well as the patient’s body temperature, fraction of inspired oxygen, and respiration pressure. Moreover, we provided the physicians with data on concurrent therapy (medication and fluid administration) by presenting the time point, duration, and amount of therapy that was given. All context information was also provided on paper.

First, the four physicians were asked individually, for each of the time series, to mark data points they judged to be artifacts. The formal rule was to mark data points that they suspected did not reflect the actual health status of the patient at the time of measurement, and that they would therefore neglect in clinical practice. Removal of these points would thus not result in a loss of information with respect to the patient’s health status, but rather rid the data of disturbances that clinicians would ignore anyway.

Subsequently, we combined the initial judgments of the four physicians in two different ways. First, we automatically derived a majority vote judgment of each time series by marking as artifacts all measurements that were judged as such by at least three of the four physicians. Second, a consensus meeting was organized in which the four physicians involved were asked to harmonize their individual judgments into a consensus judgment. The same context information was provided, as well as the initial judgments of all four physicians. In this meeting, the physicians re-inspected the time series, one series at a time: they compared and discussed the individual judgments of each time series to come to a consensus judgment. During this meeting, they increasingly specified for each monitoring variable which types of measurement have to be judged as artifacts and which measurements can be regarded as reliable and informative data. Simultaneously, they considered whether they should revise the consensus judgments of time series that had been discussed earlier in the meeting. Two additional researchers (MV, NP) were present during this meeting to safeguard consistency in the judgments.
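As an illustration of the majority vote rule described above (a measurement counts as an artifact when at least three of the four physicians marked it as such), the following minimal Python sketch combines binary judgments; the function name, array layout, and example values are ours and not part of the study.

import numpy as np

def majority_vote(judgments, threshold=3):
    # judgments: one row per expert, one column per measurement (1 = artifact).
    # A measurement becomes an artifact in the joint judgment when at least
    # `threshold` experts marked it as such.
    judgments = np.asarray(judgments, dtype=int)
    return judgments.sum(axis=0) >= threshold

# Hypothetical example: four experts judging six measurements.
votes = [
    [0, 1, 1, 0, 0, 1],  # expert 1
    [0, 1, 0, 0, 0, 1],  # expert 2
    [0, 1, 1, 0, 1, 1],  # expert 3
    [0, 0, 1, 0, 0, 1],  # expert 4
]
print(majority_vote(votes))  # [False  True  True False False  True]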

This resulted in six judged versions of each of the 30 time series (four individual judgments, one majority vote judgment, and one consensus judgment) in which each measurement is marked with true (artifact) or false (non-artifact).

Measurement of Agreement among Physicians

We investigated the agreement among physicians in their judgment of monitoring data with respect to artifacts. For that purpose, we quantified the interrater agreement for the individual judgments of each pair of physicians by calculating positive specific agreement. 11 Specific agreement is recommended in case of class imbalance, 12,13 which is clearly the case in our study, as non-artifacts highly dominate the data (>95%). Specific agreement quantifies the degree of agreement for the positive and negative classes separately. Positive specific agreement (PSA) between two raters is defined as

PSA = 2a / (b + c),

where a in this study denotes the number of measurements judged as artifacts by both raters, and b and c denote the total numbers of measurements considered as artifacts by each of the two raters individually. It takes the value 0 in the case of complete disagreement on artifacts and 1 in the case of complete agreement on artifacts. Negative specific agreement is defined in a similar way for non-artifacts. We did not calculate this latter measure, as it would only take values around 0.99 due to the extreme class imbalance. 14
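For concreteness, the following Python sketch computes PSA exactly as defined above; the example values are hypothetical.

def positive_specific_agreement(rater1, rater2):
    # rater1, rater2: binary artifact judgments of the same time series
    # (1 = artifact). PSA = 2a / (b + c), with a the number of measurements
    # marked as artifacts by both raters, and b and c the total numbers of
    # artifacts marked by each rater individually.
    a = sum(1 for x, y in zip(rater1, rater2) if x and y)
    b, c = sum(rater1), sum(rater2)
    return 2 * a / (b + c) if (b + c) > 0 else float("nan")

# Hypothetical example: 2 shared artifacts; the raters marked 3 and 4 in total.
r1 = [1, 1, 1, 0, 0, 0, 0, 0]
r2 = [1, 1, 0, 1, 1, 0, 0, 0]
print(round(positive_specific_agreement(r1, r2), 2))  # 2*2/(3+4) = 0.57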

Automated Filtering Methods

In this study on reference standards in automated artifact detection, we validated artifact filters that were developed using three methods that have been proposed in the literature for filtering monitoring data. The first method is the well-known and often applied moving median filtering. 14–16 Second, we applied the method ArtiDetect as proposed in the work of C. Cao et al., 9 and third, we applied the method of multiple signal integration by tree induction as proposed in the work of C.L. Tsien et al. 10 We refer to the Appendix, which is available as a JAMIA online-only data supplement at www.jamia.org, for a brief description of the filtering methods and their application in this study for development of filters for each type of monitoring variable.
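As a rough illustration of the first of these methods, the sketch below flags measurements that deviate from a centered moving median by more than a fixed threshold. This is a common formulation of moving median filtering; the window size, threshold, and example series are illustrative assumptions and do not correspond to the tuned parameterizations used in the study (see the Appendix for those).

import numpy as np

def moving_median_filter(series, window=5, threshold=10.0):
    # Flag a measurement as an artifact when it deviates from the median of a
    # centered window by more than `threshold` (in the variable's unit).
    series = np.asarray(series, dtype=float)
    half = window // 2
    flags = np.zeros(len(series), dtype=bool)
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        flags[i] = abs(series[i] - np.median(series[lo:hi])) > threshold
    return flags

# Hypothetical ABPm-like series (mmHg) with one spike.
abpm = [82, 80, 81, 180, 79, 78, 80]
print(moving_median_filter(abpm, window=5, threshold=20))
# [False False False  True False False False]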

Validating Artifact Filters Developed Using Standards of Individual Experts

We examined the internal and external validity of artifact filters developed using individual judgments as reference standards. Internal validity of a filter is its ability to detect artifacts when judgments of the same expert are used for developing and testing the filter. In the literature, this is known as validation of the reproducibility of a filter. 17 External validity is the ability of a filter to detect artifacts when judgments of different experts are used for developing and testing the filter. This can be seen as a validation of the transportability of a filter. 17

For each variable type and filtering method, we performed the following experiments to assess the internal validity of the resulting artifact filters. First, we tuned the filtering method on a training sample using the individual judgments of each physician as reference standard, resulting in four filters, one for each physician. We subsequently applied the filters to a test sample and evaluated their performance using the judgments of the corresponding physician as reference standard. Figure 1 shows a diagram of the design of these experiments, where expert A and B represent the same expert. This resulted in four experiments in our study. The overall performance was calculated as the mean performance of the filters over the four experiments.

Figure 1. The internal and external validation study design of a filter developed using a particular filtering method, where expert A and B are the same expert when quantifying the internal validity of the filter (four experiments), and expert A and B are different experts when quantifying the external validity of the filter (twelve experiments). Similar to Hripcsak and Wilcox, 2 rounded rectangles indicate tasks, observations, or measurements, ovals indicate actions performed by an expert or a filter, and diamonds indicate actions that require no domain expertise.

The external validity of the four filters was subsequently assessed by evaluating their performance on the test sample using the judgments of each of the three other physicians as reference standards. Figure 1 also shows the design of these experiments; expert A and B now represent different experts. This resulted in twelve experiments in our study. We calculated the overall performance as the mean performance of the filters over these experiments.

To make optimal use of the available data, we evaluated the performance of the filters in a 10-fold cross-validation design. We used this design for both the internal and the external validation of the filters. Although in the external validation the standards used for tuning and testing were obtained from different physicians, these standards are correlated rather than independent, because the experts judged the same data; separate training and test samples therefore remain necessary.

We quantified the performance of a filter in each experiment by calculating the sensitivity (i.e., the proportion of artifacts that have been classified as such by the filter) and the positive predictive value (i.e., the proportion of measurements that have been classified as artifacts by the filter that are artifacts according to clinical judgment); these measures are analogous, respectively, to recall and precision in the evaluation literature. 13 As non-artifacts dominate the time series, we do not report the specificity and negative predictive value.
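A minimal Python sketch of these two performance measures, computed from a reference standard and a filter’s output (both binary, 1 = artifact); the example values are hypothetical.

def sensitivity_ppv(reference, predicted):
    # Sensitivity = detected artifacts / all artifacts in the reference;
    # PPV = detected artifacts / all measurements flagged by the filter.
    tp = sum(1 for r, p in zip(reference, predicted) if r and p)
    fn = sum(1 for r, p in zip(reference, predicted) if r and not p)
    fp = sum(1 for r, p in zip(reference, predicted) if not r and p)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sens, ppv

# Hypothetical example: 4 reference artifacts, 3 detected, 1 false alarm.
ref = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(sensitivity_ppv(ref, pred))  # (0.75, 0.75)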

Validating Artifact Filters Developed Using Joint Standards

We investigated whether artifact filters generalize better when they are developed using joint judgments instead of individual judgments of experts. For that purpose, we performed two additional experiments for each variable type and each filtering method, using the majority vote judgments and the consensus judgments of the set of time series as reference standards. These experiments were only performed in the internal validation design, using the same type of standard for developing and testing the artifact filter, because joint judgments of other raters were not available in the study. Again, we validated the filters using 10-fold cross validation, and quantified the performance in terms of sensitivity and positive predictive value.

Results

Generating Reference Standards

The four physicians defined range errors as measurements outside the following domains: ABPm 25–200 mmHg, CVP 0–45 mmHg, and HR 0–300 beats/min. Table 2 lists the number of range errors for the three types of time series, in addition to the total number of measurements judged as artifacts in the individual judgments, the majority vote judgments, and the consensus judgments. Figure 2 illustrates these results for a mean arterial blood pressure series. Four measurements in this series were judged as artifacts by more than two physicians, while four additional points were considered as artifacts in the consensus judgment. The measurements representing a drop at approximately 650 minutes were not considered as artifacts by the physicians, as they represented a decreasing trend over multiple minutes.
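A minimal sketch of range error flagging based on the domains listed above; the dictionary and function names are our own and the example values are hypothetical.

# Theoretically possible domains as defined by the four physicians.
RANGE_LIMITS = {
    "ABPm": (25, 200),   # mmHg
    "CVP": (0, 45),      # mmHg
    "HR": (0, 300),      # beats/min
}

def flag_range_errors(series, variable):
    # A measurement outside the theoretically possible domain is a range error
    # and thus an artifact regardless of clinical context.
    lo, hi = RANGE_LIMITS[variable]
    return [not (lo <= value <= hi) for value in series]

print(flag_range_errors([80.5, -5, 328, 95], "ABPm"))  # [False, True, True, False]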

Table 2. The Number of Measurements Considered as Range Errors and the Total Number of Measurements Considered as Artifacts in the Individual (I), Majority Vote (MV), and Consensus (C) Judgments; This Latter Number Includes the Range Errors

Variable | Total No. of Measurements | No. of Range Errors | I (Expert 1) | I (Expert 2) | I (Expert 3) | I (Expert 4) | MV | C
ABPm     | 2701                      | 8                   | 20           | 65           | 54           | 57           | 46 | 30
CVP      | 3193                      | 48                  | 66           | 66           | 99           | 121          | 68 | 70
HR       | 2005                      | 0                   | 46           | 58           | 42           | 16           | 26 | 46

Figure 2. The individual judgments of a series of 500 mean arterial blood pressure measurements. Small shaded circles represent data points that one or more ICU physicians considered as artifacts. The associated numbers correspond to the number of physicians having that judgment; each data point regarded as an artifact by at least three physicians was judged as an artifact in the majority vote judgment. Large unshaded circles represent data points considered as artifacts in the consensus judgment.

Measurement of Agreement among Physicians

The interrater agreement for each pair of physicians, quantified in terms of positive specific agreement, is listed in Table 3. The table shows relatively high agreement values among the physicians for the CVP data (PSA ≥ 0.65); for the ABPm and HR data, we found large variation in the interrater agreement. Furthermore, the table shows that no pair of physicians reached high agreement values for all three variables.

Table 3. Positive Specific Agreement Among the Physicians, with the Number of Artifacts They Agreed upon in Parentheses

Variable | Expert | vs. Expert 2 | vs. Expert 3 | vs. Expert 4
ABPm     | 1      | 0.33 (14)    | 0.38 (14)    | 0.42 (16)
ABPm     | 2      | —            | 0.76 (45)    | 0.80 (49)
ABPm     | 3      | —            | —            | 0.83 (46)
CVP      | 1      | 0.83 (55)    | 0.74 (61)    | 0.65 (61)
CVP      | 2      | —            | 0.76 (63)    | 0.70 (65)
CVP      | 3      | —            | —            | 0.80 (88)
HR       | 1      | 0.85 (44)    | 0.56 (25)    | 0.42 (13)
HR       | 2      | —            | 0.49 (25)    | 0.41 (15)
HR       | 3      | —            | —            | 0.33 (10)

Figure 3 visualizes the intersection of the individual and joint judgments of the ABPm, CVP, and HR time series using scaled rectangle diagrams. 18 The CVP diagram shows that almost all data points that were judged as artifacts by experts 1 and 2 were included in the data points regarded as artifacts by experts 3 and 4. The HR diagram shows that the majority of data points considered as artifacts by expert 4, a conservative rater for the HR data, were also judged as artifacts by the other experts. Furthermore, this diagram shows that a number of HR data points that were initially considered reliable by all experts were marked as artifacts in the consensus judgment (left upper part of the consensus rectangle); this was a result of the discussion during the consensus meeting on how to characterize HR artifacts. This shows that developing a consensus judgment does not necessarily narrow the set of data points judged as artifacts. This phenomenon did not occur for the ABPm and CVP data.

Figure 3. The intersection of individual and joint judgments of the ABPm, CVP, and HR time series visualized in scaled rectangle diagrams.

Validating Artifact Filters Developed Using Standards of Individual Experts

Table 4 summarizes the results of the experiments investigating the generalizability of artifact filters developed using standards obtained from individual experts. Each table row first shows the variable type and the filtering method that was applied to develop the filters. Columns 3–4 present the internal validity of the filters: they list the mean sensitivity and positive predictive value (PPV) over four experiments (one per expert). Columns 5–6 show the mean values of these statistics over the twelve external validation experiments (three per expert).

Table 4. Internal and External Validity of Artifact Filters Developed Using Individual Standards

Variable | Filtering Method | Internal Sensitivity | Internal PPV     | External Sensitivity | External PPV
ABPm     | Median filtering | 0.48 (0.32–0.64)     | 0.81 (0.62–0.93) | 0.44 (0.29–0.59)     | 0.74 (0.56–0.84)
ABPm     | ArtiDetect       | 0.60 (0.45–0.74)     | 0.84 (0.64–0.91) | 0.55 (0.42–0.67)     | 0.70 (0.49–0.79)
ABPm     | Tree induction   | 0.59 (0.45–0.74)     | 0.65 (0.50–0.80) | 0.55 (0.40–0.69)     | 0.59 (0.43–0.73)
CVP      | Median filtering | 0.79 (0.69–0.87)     | 0.80 (0.70–0.88) | 0.78 (0.68–0.85)     | 0.78 (0.69–0.86)
CVP      | ArtiDetect       | 0.80 (0.70–0.88)     | 0.86 (0.77–0.93) | 0.77 (0.68–0.84)     | 0.83 (0.74–0.89)
CVP      | Tree induction   | 0.68 (0.57–0.78)     | 0.79 (0.68–0.88) | 0.63 (0.52–0.72)     | 0.71 (0.60–0.81)
HR       | Median filtering | 0.52 (0.35–0.69)     | 0.83 (0.61–0.94) | 0.48 (0.31–0.63)     | 0.73 (0.55–0.85)
HR       | ArtiDetect       | 0.40 (0.24–0.57)     | 0.65 (0.41–0.82) | 0.40 (0.25–0.55)     | 0.63 (0.38–0.78)
HR       | Tree induction   | 0.57 (0.37–0.71)     | 0.70 (0.49–0.86) | 0.44 (0.27–0.60)     | 0.57 (0.38–0.73)

The performance is quantified in terms of the 10-fold cross-validated sensitivity and positive predictive value (PPV) in the set of time series; 95% confidence intervals are given in parentheses.

The table shows moderate results for the internal validity of the filters for the ABPm and HR time series for each filtering method, and a relatively high performance for the CVP data. Furthermore, the results show a decrease in the mean performance of all nine filters when they were validated for other experts.

Validating Artifact Filters Developed using Joint Standards

The results of the experiments in which we examined the internal validity of artifact filters developed using joint standards are listed in Table 5. Compared to the individual standards (Table 4, internal validity), higher sensitivity values for unseen time series were found for the majority vote judgment in six out of nine filters, and for the consensus judgment in seven out of nine filters, including all CVP filters. The PPV was higher for eight and five out of nine filters, respectively. Table 5 shows varying results when comparing the performance statistics for the majority vote judgment and the consensus judgment. All ABPm filters developed with a majority vote judgment as reference standard had equal or higher PPV, two out of three CVP filters were more sensitive, and two out of three HR filters were superior on both statistics. For the consensus judgment, two out of three ABPm filters had higher sensitivity values, and two out of three CVP filters showed higher PPV, compared to the filters developed using a majority vote judgment.

Table 5. Internal Validity of Artifact Filters Developed Using Joint Standards

Variable | Filtering Method | Majority Vote Sensitivity | Majority Vote PPV | Consensus Sensitivity | Consensus PPV
ABPm     | Median filtering | 0.44 (0.29–0.59)          | 0.91 (0.71–0.99)  | 0.67 (0.47–0.83)      | 0.91 (0.71–0.99)
ABPm     | ArtiDetect       | 0.52 (0.37–0.67)          | 0.86 (0.67–0.96)  | 0.77 (0.58–0.90)      | 0.64 (0.46–0.79)
ABPm     | Tree induction   | 0.67 (0.52–0.81)          | 0.84 (0.72–0.92)  | 0.60 (0.41–0.77)      | 0.69 (0.48–0.86)
CVP      | Median filtering | 0.85 (0.75–0.93)          | 0.74 (0.63–0.84)  | 0.87 (0.77–0.94)      | 0.71 (0.60–0.80)
CVP      | ArtiDetect       | 0.88 (0.78–0.95)          | 0.87 (0.77–0.94)  | 0.84 (0.74–0.92)      | 0.97 (0.89–1.00)
CVP      | Tree induction   | 0.77 (0.65–0.86)          | 0.83 (0.71–0.91)  | 0.73 (0.52–0.81)      | 0.84 (0.68–0.94)
HR       | Median filtering | 0.89 (0.70–0.98)          | 0.79 (0.60–0.92)  | 0.54 (0.39–0.69)      | 0.86 (0.80–1.00)
HR       | ArtiDetect       | 0.27 (0.12–0.48)          | 1.00 (0.59–1.00)  | 0.33 (0.20–0.48)      | 0.63 (0.41–0.81)
HR       | Tree induction   | 0.65 (0.44–0.83)          | 0.74 (0.52–0.90)  | 0.57 (0.41–0.71)      | 0.65 (0.50–0.78)

The performance is quantified in terms of the 10-fold cross-validated sensitivity and positive predictive value (PPV) in the set of time series; 95% confidence intervals are given in parentheses.

Discussion and Conclusions

This study shows that clinical experts disagree in their individual judgments of ABPm and HR data with respect to artifacts. Furthermore, we have shown that artifact filters for these variables generalize poorly to other experts when judgments from single experts are used for tuning the filtering methods; the internal validity of these filters was also relatively low. Relatively high agreement among experts was found for CVP data, and filters for these data resulting from individual judgments were found to generalize well. Artifact filters showed higher performance values for all three monitoring variables when joint judgments of groups of experts were used.

Few studies have compared the judgments of monitoring data by different clinical experts. Most studies on automated artifact detection methods 8–10 use judgments obtained from individual experts. In the study of S. Cunningham et al., judgments were obtained from three experts to assess the effect of artifact removal on the mean and median values of time series. 3 Similar to our study, large differences were found in the number of measurements that were considered as artifacts by the individual experts; the differences were traced back to different perceptions of what constitutes an artifact. Compared to the study by Cunningham et al., we investigated the agreement among the judgments of experts in more detail, and considered the use of these judgments in the development of artifact filters.

According to C.P. Friedman and J.C. Wyatt, training of raters is an important requirement for obtaining reliable reference standards. 1 Training is even more important for ambiguous rating tasks, such as artifact detection. Authors of studies on automated artifact detection are generally vague about the instructions that were given to experts for rating monitoring data (e.g., a definition of artifacts). We provided the four physicians with the simple instruction to mark all data points that they suspected did not reflect the actual health status of the patient at the time of measurement and that they would therefore neglect in clinical practice. Effective training of raters for artifact detection is complicated by the fact that the concept of ‘artifact’ is vague. It appeared from the consensus meeting that it may be impossible to develop a general definition of ‘artifact’; context-specific definitions, e.g., pertaining to a specific variable, can probably be formulated. We recommend preceding the individual rating of time series with a meeting in which a number of series and artifact definitions are discussed; this may contribute to a higher quality reference standard.

Given the different levels of agreement that we observed among experts, an interesting question is how many experts are necessary to obtain a reliable joint standard. The assessment of a reliability coefficient of ratings using, for instance, Cronbach’s alpha is an important subject in measurement studies. 1 The Spearman-Brown prophecy formula can be used to estimate the effect of increasing the number of judges on the reliability coefficient. In this study, four physicians rated the time series of the three monitoring variables. As good agreement was observed for CVP data, fewer experts may be needed for judging CVP time series than for ABPm and HR series in order to obtain a reliable joint standard. How much reliability, and corresponding effort from experts, is required of a reference standard depends on its use; lower reliability might be sufficient in pilot studies, while better reference standards and more thorough evaluation methodologies are required when developing systems (i.e., artifact filters in this study) as clinical end products.
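For illustration, the Spearman-Brown prophecy formula predicts the reliability of a composite judgment of n parallel raters from the single-rater reliability; the reliability value in the sketch below is hypothetical and not taken from the study.

def spearman_brown(single_rater_reliability, n_raters):
    # Predicted reliability of a composite judgment of n parallel raters,
    # given the reliability of a single rater.
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)

# Hypothetical example: a single-rater reliability of 0.6 yields a predicted
# reliability of about 0.86 for a joint standard of four experts.
print(round(spearman_brown(0.6, 4), 2))  # 0.86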

A limitation of the study is that no formal method was used for development of the consensus judgments, such as the Delphi method 19 or the nominal group technique. 20 The use of such methods would probably improve the consistency of consensus judgments and might reveal whether this type of joint judgment is superior to majority vote judgments. The consensus judgments in this study were developed in a meeting during which the experts discussed and compared their individual judgments. Since consistency in the consensus judgments was guarded by the presence of two additional researchers, we believe the joint judgments of this study approximate those obtained with more formal methods.

A second limitation of the study is that the transportability of the filters was not validated for the majority vote judgment and the consensus judgment. Externally validating these judgments would have required obtaining individual judgments from a new group of experienced ICU physicians, and organizing an additional consensus meeting to harmonize their judgments. An additional experiment that could be done with the available expert judgments is to compare the performance of filters developed using majority vote judgments of three experts against the judgment of the fourth individual expert. This experiment falls outside the primary scope of this article, which is to use the same type of reference standard (individual judgment or joint judgment) for both development and evaluation of automated filters; moreover, the resulting performance would also reflect the extent to which individual experts disagree with groups of other experts in judging monitoring data.

We investigated the generalizability of the artifact filters developed using individual and joint judgments by comparing the (mean) sensitivity and positive predictive value calculated in the different types of experiments. We did not statistically test the differences in these statistics, and therefore cannot formally claim that joint judgments are more suitable reference standards than individual judgments; the results were, however, consistent over the three monitoring variables and filtering methods. Furthermore, we gave equal weight to both performance statistics in this study. In practice, the relative importance of the sensitivity and the positive predictive value of an artifact filter depends on the specific use of the filtered data by computerized medical assistants, physicians, and data analysts.

The lower performance values observed in the internal validation of the filters do not reveal whether they are due to a noisy reference standard (low consistency), or to the artifact detection ‘rules’ of the expert or expert group being too complex to be captured by the machine learning algorithm. The latter explanation would indicate a failure of the filtering methods in question, and not of the experts that provided the judgments. To exclude this possibility, we used three filtering methods that operate in highly different manners and allow for varying degrees of complexity in the resulting filters; together they are representative of the field of automated artifact detection in monitoring data. Because the findings are consistent over these three methods, we believe that the complexity of the underlying rules did not influence our results.

The study was limited to three monitoring variables that each concern the cardiovascular system, and as such provides no information on the agreement among experts for other monitoring variables or on the performance of artifact filters developed for them using expert judgments. Moreover, our findings were obtained in a single-center study. We cannot exclude the possibility that the agreement among experts from a single hospital is larger than the agreement among experts from different hospitals, due to similar education and joint discussions of the condition of patients based on monitoring data. The agreement among clinical experts in their judgments of artifacts may therefore be overestimated in this study, as may the transportability of artifact filters to individuals in other hospitals.

The selection of series for their relatively rough course is a potential source of bias in this study, as stable time series were underrepresented. Due to this selection bias, the agreement among physicians as measured in this study presumably underestimates the agreement in their judgment of monitoring data in general; high agreement can be assumed for clinical judgment of stable time series. However, the comparison of the performance of artifact filters developed using individual and joint standards was performed on the same set of time series. We therefore assume that the selection bias has not affected our conclusions on reference standards in automated artifact detection. A similar argument holds for the number of selected time series per variable type and the length of the series, both of which varied in this study without a specific rationale. Furthermore, the 30 time series were not obtained from 30 different patients. We assume that this too has not affected the results of the study, as ABPm, CVP, and HR data were included in the context information provided to the experts for each time series to be judged. Moreover, development and evaluation of the artifact filters was performed separately for each variable type.

In conclusion, the main implication of this study is that reference standards obtained from individual experts appear to be less suitable for the development of artifact filters than reference standards composed of joint judgments, as the transportability of the resulting filters to other experts is poor. This also implies that one should be cautious in deploying filters from the literature that were tuned using judgments of a single expert. Filters developed using joint judgments tended to perform better for artifact detection in new time series. A majority vote judgment appears to be as effective in this respect as a consensus judgment, which is more difficult and time consuming to obtain.

References

1. Friedman CP, Wyatt JC. Evaluation Methods in Biomedical Informatics. 2nd ed. New York: Springer; 2006.
2. Hripcsak G, Wilcox A. Reference standards, judges, and comparison subjects: roles for experts in evaluating system performance. J Am Med Inform Assoc 2002;9:1-15.
3. Cunningham S, Symon AG, McIntosh N. The practical management of artifact in computerised physiological data. Int J Clin Monit Comput 1994;11:211-216.
4. Miksch S, Horn W, Popow C, Paky F. Utilizing temporal data abstraction for data validation and therapy planning for artificially ventilated newborn infants. Artif Intell Med 1996;8:543-576.
5. Michel A, Junger A, Benson M, Brammen DG, Hempelmann G, Dudeck J, et al. A data model for managing drug therapy within a patient data management system for intensive care units. Comp Meth Progr Biomed 2003;70:71-79.
6. Charbonnier S. On line extraction of temporal episodes from ICU high-frequency data: a visual support for signal interpretation. Comp Meth Progr Biomed 2005;78:115-132.
7. Horn W, Miksch S, Egghart G, Popow C, Paky F. Effective data validation of high-frequency data: time-point-, time-interval-, and trend-based methods. Comput Biol Med (Special Issue: Time-Oriented Systems in Medicine) 1997;27:389-409.
8. Imhoff M, Bauer M, Gather U, Löhlein D. Statistical pattern detection in univariate time series of intensive care on-line monitoring data. Intensive Care Med 1998;24:1305-1314.
9. Cao C, McIntosh N, Kohane IS, Wang K. Artifact detection in the PO2 and PCO2 time series monitoring data from preterm infants. J Clin Monit Comput 1999;15:369-378.
10. Tsien CL, Kohane IS, McIntosh N. Multiple signal integration by decision tree induction to detect artifacts in the neonatal intensive care unit. Artif Intell Med 2000;19:189-202.
11. Fleiss JL. Measuring agreement between two judges on the presence or absence of a trait. Biometrics 1975;31:651-659.
12. Cicchetti DV, Feinstein AR. High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol 1990;43:551-558.
13. Hripcsak G, Heitjan DF. Measuring agreement in medical informatics reliability studies. J Biomed Inform 2002;35:99-110.
14. Mäkivirta A, Koski E, Kari A, Sukuvaara T. The median filter as a preprocessor for a patient monitor limit alarm system in intensive care. Comp Meth Progr Biomed 1991;34:139-144.
15. Jakob S, Korhonen I, Ruokonen E, Virtanen T, Kogan A, Takala J. Detection of artifacts in monitored trends in intensive care. Comp Meth Progr Biomed 2000;63:203-209.
16. Hoare SW, Beatty PCW. Automatic artifact identification in anaesthesia patient record keeping: a comparison of techniques. Med Eng Phys 2000;22:547-553.
17. Justice AC, Covinsky KE, Berlin JA. Assessing the generalizability of prognostic information. Ann Intern Med 1999;130:515-524.
18. Marshall RJ. Scaled rectangle diagrams can be used to visualize clinical and epidemiological data. J Clin Epidemiol 2005;58:974-981.
19. Dalkey NC, Helmer O. An experimental study of group opinion: the Delphi method. Futures 1969;1:408-426.
20. Delbecq A, van de Ven A. A group process model for problem identification and program planning. J Appl Behav Res 1971;7:467-492.
