Abstract
This comparative effectiveness research study assesses the discriminatory ability of diagnostic criteria for pyoderma gangrenosum.
Pyoderma gangrenosum (PG) is a rare, frequently misdiagnosed neutrophilic dermatosis.1 Although 3 diagnostic criteria (Su et al,2 Delphi,3 and PARACELSUS4) have been proposed, few validation studies have assessed their sensitivity and specificity in independent cohorts of cases and controls. The aim of this study was to systematically evaluate the discriminatory ability of PG diagnostic criteria in a sample of ulcerative PG cases and controls.
Methods
In this comparative effectiveness research study, we searched MEDLINE for reports of ulcerative PG and PG mimickers (eMethods and eTable in Supplement 1) from January 1, 2000, to March 1, 2022. Inclusion criteria were patients older than 18 years and clinicopathological data consistent with the diagnosis. Cases used to develop published criteria and nonulcerative PG cases were excluded. Two independent reviewers (A.J.-X. and W.L.) screened articles and assigned diagnostic scores. Discrepancies were resolved by an independent reviewer (M.-C.B.). The University of California, Davis Institutional Review Board deemed this study exempt from review and waived the informed consent requirement because only published and deidentified data were used. We followed the STARD reporting guideline. A 2-sided P < .05 was considered statistically significant.
Results
We identified 162 cases, including 93 patients with ulcerative PG and 69 with PG mimickers classified based on published diagnostic categories1 (Table 1). The Su criteria exhibited the lowest sensitivity (86.21%) and intermediate specificity (69.57%) and area under the receiver operating characteristic curve (AUC, 0.74) for diagnosis of ulcerative PG (Table 1). Delphi had intermediate sensitivity (88.00%) but the highest specificity (90.48%) and AUC (0.89). PARACELSUS displayed the highest sensitivity (95.89%) but the lowest specificity (4.65%) and AUC (0.55). All differences in sensitivity, specificity, and AUC were statistically significant (Table 1). Results were confirmed with imputation to explore biases due to missing data. Cutoff value analysis of PARACELSUS revealed that increasing the cutoff from the original value of 10 to 14 or 15 raised specificity from 4.65% to 76.19% or 88.89%, respectively, while lowering sensitivity from 95.89% to 80.36% or 71.70% (Table 2). Multirater agreement between diagnostic criteria using Fleiss κ was fair (0.36; 95% CI, 0.14-0.59). Pairwise agreements using Cohen κ revealed highest agreement between Su and Delphi (0.56; 95% CI, 0.36-0.77), intermediate agreement between Su and PARACELSUS (0.42; 95% CI, 0.20-0.95), and lowest agreement between Delphi and PARACELSUS (0.15; 95% CI, 0.02-0.28).
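To make the pairwise agreement statistic concrete, the following is a minimal Python sketch of Cohen κ for 2 sets of criteria applied to the same patients. The function name `cohen_kappa` and the 10 binary calls are hypothetical illustrations, not study data, and the sketch does not reproduce the confidence intervals reported above.

```python
from collections import Counter


def cohen_kappa(calls_a, calls_b):
    """Cohen's kappa for two sets of binary calls on the same patients."""
    if len(calls_a) != len(calls_b) or not calls_a:
        raise ValueError("need two non-empty call lists of equal length")
    n = len(calls_a)
    # Observed agreement: fraction of patients given the same call by both.
    observed = sum(a == b for a, b in zip(calls_a, calls_b)) / n
    # Chance agreement: sum over labels of the product of marginal call rates.
    freq_a, freq_b = Counter(calls_a), Counter(calls_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[k] * freq_b[k] for k in labels) / n**2
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical calls (1 = meets criteria for PG, 0 = does not); not study data.
criteria_a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
criteria_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
kappa = cohen_kappa(criteria_a, criteria_b)
print(f"Cohen kappa = {kappa:.2f}")
```

In this toy example the 2 sets of criteria agree on 7 of 10 patients (observed agreement, 0.70) against a chance agreement of 0.50, giving κ = 0.40. The gap between 0.70 and 0.40 shows why raw percentage agreement overstates concordance.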
Table 1. Discrimination Statistics for Published Diagnostic Criteria for Ulcerative Pyoderma Gangrenosumᵃ
Diagnostic criteria | Sensitivity, mean (SE) [95% CI], % | Specificity, mean (SE) [95% CI], % | AUC (95% CI)ᵇ | Correctly classified, %
---|---|---|---|---
Su et al2 | | | |
Complete case data | 86.21 (0.05) [77.33-95.08] | 69.57 (0.07) [56.27-82.86] | 0.74 (0.67-0.81) | 78.21
Imputed case data | 86.39 (0.04) [79.32-93.46] | 70.41 (0.06) [59.50-81.31] | NA | 79.41
Delphi3 | | | |
Complete case data | 88.00 (0.05) [78.99-97.01] | 90.48 (0.05) [81.60-99.35] | 0.89 (0.83-0.94) | 89.13
Imputed case data | 86.72 (0.05) [77.26-96.18] | 90.49 (0.05) [81.10-99.89] | NA | 89.07
PARACELSUS4 | | | |
Complete case data | 95.89 (0.02) [91.33-100] | 4.65 (0.03) [0-10.95] | 0.55 (0.48-0.62) | 61.02
Imputed case data | 91.31 (0.03) [85.48-97.15] | 14.36 (0.06) [2.25-26.48] | NA | 60.36
P valuesᶜ | | | |
Su et al vs Delphi | .02 | <.001 | <.001 | <.001
Su et al vs PARACELSUS | .001 | <.001 | <.001 | <.001
Delphi vs PARACELSUS | .001 | <.001 | <.001 | <.001
Abbreviations: AUC, area under receiver operating characteristic curve; NA, not applicable.
ᵃ Case data were considered complete when all information was available to fully score all 3 diagnostic criteria (see eMethods in the Supplement). Missing data were accounted for using the multiple imputation by chained equations framework under both missing completely at random and missing at random assumptions (imputed case data). Cases included the following published diagnostic categories: vascular occlusive or venous, 19; primary infection, 16; cancer, 16; drug-induced or exogenous injury, 7; calciphylaxis, 6; and vasculitis, 5.
ᵇ AUC is a value that measures the overall diagnostic performance of a binary classifier (eg, a set of diagnostic criteria). An AUC value of 0.5 indicates that the diagnostic criteria have no discriminatory ability (random classifier), and an AUC value of 1.0 indicates perfect discriminatory ability (perfect classifier).
ᶜ P values for sensitivity and specificity were calculated with a 2-sided McNemar test. P values for AUCs were calculated using the DeLong nonparametric approach for comparing AUCs under 2 or more receiver operating characteristic curves.
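The paired comparison behind the sensitivity and specificity P values can be illustrated with a short Python sketch of a 2-sided McNemar test. The study does not state which variant was used. The exact binomial form shown here is an assumption, and the function name `mcnemar_exact` and the discordant counts are hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar P value from the two discordant-pair counts.

    b: patients classified correctly by criteria A but not by criteria B
    c: patients classified correctly by criteria B but not by criteria A
    Concordant pairs carry no information about the difference and are ignored.
    """
    if b < 0 or c < 0:
        raise ValueError("counts must be non-negative")
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    # Under the null hypothesis, each discordant pair falls on either side
    # with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts; not taken from the study.
p_value = mcnemar_exact(b=1, c=9)
print(f"P = {p_value:.3f}")
```

Only patients on whom the 2 sets of criteria disagree enter the test, which is why it suits comparisons in which every patient is scored by each set of criteria.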
Table 2. Sensitivity and Specificity of the PARACELSUS Score at Different Point Threshold Requirements for Diagnosing Ulcerative Pyoderma Gangrenosum
Point threshold requirement | Sensitivity, mean (SE) [95% CI], % | Specificity, mean (SE) [95% CI], %
---|---|---
9 | 97.47 (0.02) [94.00-100] | 2.00 (0.02) [0-5.88]
10ᵃ | 95.89 (0.02) [91.33-100] | 4.65 (0.03) [0-10.95]
11 | 94.03 (0.03) [88.36-99.70] | 10.00 (0.05) [0.70-19.30]
12 | 91.31 (0.03) [85.48-97.15] | 14.36 (0.06) [2.25-26.48]
13 | 88.68 (0.04) [80.15-97.21] | 48.57 (0.08) [32.01-65.13]
14 | 80.36 (0.05) [69.96-90.76] | 76.19 (0.07) [63.31-89.07]
15 | 71.70 (0.06) [59.57-83.83] | 88.89 (0.05) [79.71-98.07]
16 | 54.00 (0.07) [40.19-67.82] | 96.23 (0.03) [91.10-100]
17 | 31.92 (0.07) [18.59-45.24] | 100 (0.03) [100-100]
ᵃ Threshold used in original publication.4
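One way to summarize the trade-off across thresholds in Table 2 is the Youden index (sensitivity + specificity − 1). The study did not report this index. The Python sketch below applies it to the Table 2 point estimates purely as an illustration, and the names `TABLE_2` and `youden_index` are hypothetical.

```python
# Point estimates from Table 2: threshold -> (sensitivity %, specificity %).
TABLE_2 = {
    9: (97.47, 2.00),
    10: (95.89, 4.65),  # threshold used in the original PARACELSUS publication
    11: (94.03, 10.00),
    12: (91.31, 14.36),
    13: (88.68, 48.57),
    14: (80.36, 76.19),
    15: (71.70, 88.89),
    16: (54.00, 96.23),
    17: (31.92, 100.00),
}


def youden_index(sensitivity_pct, specificity_pct):
    """Youden J on the 0-1 scale from sensitivity and specificity in percent."""
    return (sensitivity_pct + specificity_pct - 100.0) / 100.0


# Compute J for every threshold and pick the threshold with the largest J.
j_by_threshold = {t: youden_index(se, sp) for t, (se, sp) in TABLE_2.items()}
best_threshold = max(j_by_threshold, key=j_by_threshold.get)

for threshold, j in sorted(j_by_threshold.items()):
    print(f"threshold {threshold:>2}: J = {j:.4f}")
print(f"largest J at threshold {best_threshold}")
```

On these point estimates the index is largest at a threshold of 15 (J ≈ 0.61), with 14 close behind (J ≈ 0.57), compared with roughly 0.01 at the original threshold of 10. The Youden index weights sensitivity and specificity equally, so a clinical choice of cutoff may reasonably differ.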
Discussion
In this study, Delphi yielded intermediate sensitivity and the highest specificity and AUC. This finding is possibly attributable to its requirement for histopathology showing neutrophilic infiltrate, which may lower sensitivity while increasing specificity. PARACELSUS was the most sensitive but least specific set of criteria. Although the high sensitivity of PARACELSUS could render it useful for screening, its low specificity indicates that a more stringent cutoff may be needed when using PARACELSUS for diagnosis.
One study5 that assessed the sensitivity of these criteria in PG cases found that PARACELSUS had the highest sensitivity (89%), followed by Delphi and Su (both 74%), consistent with findings of the present study. However, because of the lack of a control group, specificity and AUC could not be assessed. Another study6 found that PARACELSUS was more sensitive than Delphi (99% vs 32%) and both displayed similar specificity (60% vs 57%); however, Su was not evaluated. High specificity and AUC are crucial in evaluating suspected PG because PG is a rare condition at high risk of base rate neglect (ie, ignoring the prevalence of PG during the diagnostic process). Suboptimal specificity can mean misdiagnosis of alternative conditions as PG and iatrogenic harm from immunosuppression, including infection exacerbation or cancer.
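The effect of a low base rate on a positive result can be worked through with Bayes theorem. The Python sketch below uses the complete-case sensitivity and specificity estimates from Table 1. The 10% prevalence among patients evaluated for suspected PG is a hypothetical value chosen only for illustration, and the function name `positive_predictive_value` is likewise an assumption.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability of disease given a positive result (all inputs on 0-1 scale)."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    total_positive_rate = true_positive_rate + false_positive_rate
    if total_positive_rate == 0.0:
        raise ValueError("no positive results are possible with these inputs")
    return true_positive_rate / total_positive_rate


# Complete-case point estimates from Table 1: (sensitivity, specificity).
CRITERIA = {
    "Su et al": (0.8621, 0.6957),
    "Delphi": (0.8800, 0.9048),
    "PARACELSUS": (0.9589, 0.0465),
}

# Hypothetical prevalence of PG among evaluated patients; not a study value.
ASSUMED_PREVALENCE = 0.10

ppv = {
    name: positive_predictive_value(se, sp, ASSUMED_PREVALENCE)
    for name, (se, sp) in CRITERIA.items()
}
for name, value in ppv.items():
    print(f"{name}: PPV = {value:.1%}")
```

Under this assumed prevalence, about half of patients meeting Delphi would have PG, compared with roughly 1 in 4 for Su and 1 in 10 for PARACELSUS at its original cutoff. These figures depend entirely on the assumed prevalence and on estimates drawn from published case reports, so they indicate only the direction of the effect.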
We assessed PG diagnostic criteria performance in an independent cohort of PG cases and controls. Although PARACELSUS exhibited the highest sensitivity, Delphi provided the highest specificity and AUC. Limitations of the study included the small sample size, retrospective design, and possible selection bias due to the use of case reports. Furthermore, there is currently no gold standard for PG diagnosis, and not all PG cases exhibit classic histopathological findings. Prospective studies are needed to assess the discriminatory ability of these criteria in better controlled settings and harmonize existing diagnostic criteria for this rare condition.
References
1. Weenig RH, Davis MD, Dahl PR, Su WP. Skin ulcers misdiagnosed as pyoderma gangrenosum. N Engl J Med. 2002;347(18):1412-1418. doi:10.1056/NEJMoa013383
2. Su WP, Davis MD, Weenig RH, Powell FC, Perry HO. Pyoderma gangrenosum: clinicopathologic correlation and proposed diagnostic criteria. Int J Dermatol. 2004;43(11):790-800. doi:10.1111/j.1365-4632.2004.02128.x
3. Maverakis E, Ma C, Shinkai K, et al. Diagnostic criteria of ulcerative pyoderma gangrenosum: a Delphi consensus of international experts. JAMA Dermatol. 2018;154(4):461-466. doi:10.1001/jamadermatol.2017.5980
4. Jockenhöfer F, Wollina U, Salva KA, Benson S, Dissemond J. The PARACELSUS score: a novel diagnostic tool for pyoderma gangrenosum. Br J Dermatol. 2019;180(3):615-620. doi:10.1111/bjd.16401
5. Haag C, Hansen T, Hajar T, et al. Comparison of three diagnostic frameworks for pyoderma gangrenosum. J Invest Dermatol. 2021;141(1):59-63. doi:10.1016/j.jid.2020.04.019
6. Min MS, Kus K, Wei N, et al. Evaluating the role of histopathology in diagnosing pyoderma gangrenosum using Delphi and PARACELSUS criteria: a multicentre, retrospective cohort study. Br J Dermatol. 2022;186(6):1035-1037. doi:10.1111/bjd.20967