Author manuscript; available in PMC: 2017 Apr 1.
Published in final edited form as: J Biomed Inform. 2016 Feb 10;60:199–209. doi: 10.1016/j.jbi.2016.02.005

Table 4.

Stratified analysis of crowd performance (Mean AUC)

(a) CrowdFlower: low-cost, low-quality
Not Useful One Useful Two Useful All
Pre-Delphi 0.08 0.13
Post-Delphi 0.53 0.47
Near-Agreement 0.78 1.00 0.63
All 0.57 0.07 0.98 0.52
(b) CrowdFlower: high-cost, low-quality
Not Useful One Useful Two Useful All
Pre-Delphi 0.90 0.94
Post-Delphi 0.73 0.74
Near-Agreement 0.52 1.00 0.66
All 0.70 0.91 1.00 0.73
(c) CrowdFlower: low-cost, high-quality
Not Useful One Useful Two Useful All
Pre-Delphi 0.84 0.79
Post-Delphi 0.74 0.73
Near-Agreement 0.49 0.31 0.51
All 0.56 0.84 0.25 0.58
(d) CrowdFlower: high-cost, high-quality
Not Useful One Useful Two Useful All
Pre-Delphi 0.38 0.31
Post-Delphi 0.53 0.53
Near-Agreement 0.55 0.07 0.70
All 0.63 0.35 0.43 0.62
(e) Mechanical Turk: low-cost, low-quality
Not Useful One Useful Two Useful All
Pre-Delphi 0.65 0.62
Post-Delphi 0.41 0.40
Near-Agreement 0.38 0.11 0.47
All 0.49 0.63 0.35 0.48
(f) Mechanical Turk: high-cost, low-quality
Not Useful One Useful Two Useful All
Pre-Delphi 0.60 0.68
Post-Delphi 0.50 0.55
Near-Agreement 0.41 0.43 0.62
All 0.58 0.67 0.70 0.60
(g) Mechanical Turk: low-cost, high-quality
Not Useful One Useful Two Useful All
Pre-Delphi 0.32 0.42
Post-Delphi 0.36 0.39
Near-Agreement 0.57 0.20 0.50
All 0.44 0.40 0.58 0.44
(h) Mechanical Turk: high-cost, high-quality
Not Useful One Useful Two Useful All
Pre-Delphi 0.30 0.33
Post-Delphi 0.48 0.48
Near-Agreement 0.37 0.06 0.38
All 0.45 0.31 0.44 0.44

All pairs significant except: (Near-Agreement, Two Useful) vs. (All, Two Useful)

For each configuration (one per subtable), we investigated worker performance on the stratified relationships shown in Table 3 (see Table 3 for row and column definitions). We measured crowd performance in each stratum by bootstrapped AUC. Next, we performed a two-way ANOVA to measure the effect of expert agreement, definition utility, and their interaction on AUC. In addition, within each configuration, we compared the strata pairwise to identify where crowd performance differed significantly.
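As an illustration of the bootstrapped AUC mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes each relationship has a binary expert label (1 = correct, 0 = incorrect) and a crowd score such as the fraction of workers who judged it correct. The function names, the number of resamples, and the choice to skip resamples that contain only one class are all assumptions made for this example.

```python
import random


def auc(labels, scores):
    """Rank-based AUC: the probability that a correct item outscores an
    incorrect one, with ties counted as 0.5. Returns None when the stratum
    lacks either class, since AUC is undefined in that case."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_mean_auc(labels, scores, n_boot=1000, seed=0):
    """Mean AUC over bootstrap resamples drawn with replacement.
    Assumption: resamples containing a single class are skipped."""
    if auc(labels, scores) is None:
        return None  # corresponds to a blank cell in the table
    rng = random.Random(seed)
    n = len(labels)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        a = auc([labels[i] for i in idx], [scores[i] for i in idx])
        if a is not None:
            values.append(a)
    return sum(values) / len(values) if values else None


# Hypothetical stratum: two correct and two incorrect relationships.
example_labels = [1, 1, 0, 0]
example_scores = [0.9, 0.8, 0.2, 0.1]
mean_auc = bootstrap_mean_auc(example_labels, example_scores, n_boot=200)
```

A stratum that contains only correct or only incorrect relationships yields `None` here, which mirrors the blank cells in the subtables.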

All pairs significant except: (Pre-Delphi, One Useful) vs. (All, All)

All pairs significant except: (All, Two Useful) vs. (All, All)

Task difficulty, definition quality, and their interaction have a significant effect on crowd AUC for every configuration (p < 0.05, two-way ANOVA). Crowd AUC differs significantly between every pair of strata except where noted in the subtable footnotes. Blank cells indicate strata that lack at least one correct and one incorrect relationship, so AUC cannot be calculated.
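For reference, a two-way ANOVA with interaction is conventionally written as below. This is the standard textbook form, given under the assumption that bootstrapped AUC values are the observations. The symbols are introduced here for illustration and do not come from the source: $\alpha_i$ is the effect of the expert-agreement stratum (row), $\beta_j$ the effect of the definition-utility stratum (column), and $(\alpha\beta)_{ij}$ their interaction.

```latex
\mathrm{AUC}_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad \varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)

% Null hypotheses tested, each at the 0.05 level:
H_0^{(A)}:\ \alpha_i = 0 \ \ \forall i, \qquad
H_0^{(B)}:\ \beta_j = 0 \ \ \forall j, \qquad
H_0^{(AB)}:\ (\alpha\beta)_{ij} = 0 \ \ \forall i, j
```

Under this reading, the caption's statement that both factors and their interaction are significant corresponds to rejecting all three null hypotheses in every configuration.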