Gynecologic Oncology Reports
Letter. 2016 Sep 14;18:49–50. doi: 10.1016/j.gore.2016.09.003

Flawed external validation study of the ADNEX model to diagnose ovarian cancer

B Van Calster a,b,⁎,1, EW Steyerberg b,1, T Bourne a,c,d, D Timmerman a,c, GS Collins e,1; on behalf of TG6 of the STRATOS initiative
PMCID: PMC5154673  PMID: 27995172

Dear Editor,

External validation studies of prediction models are essential for assessing the performance of a prediction model in different locations (Altman et al., 2009). We therefore read with interest the recent external validation study of the ADNEX model (Szubert et al., 2016).

For patients with a persistent adnexal tumor who are scheduled for surgery, the ADNEX model predicts the risk of five tumor types: benign, borderline malignant, stage I cancer, stage II–IV cancer, or secondary metastatic cancer (Van Calster et al., 2014). The model was developed on data from 5909 patients collected at 24 centers in 10 countries between 1999 and 2012. ADNEX aims to assist clinicians in making appropriate clinical decisions for patients presenting with an adnexal mass. When validating the ADNEX model, it is natural to first evaluate the prediction of malignancy, followed by the multiclass prediction of malignancy subtypes, in a similar way to other validation studies of multiclass models (Steyerberg et al., 1998). This approach is followed in the recent paper, but there are a number of important issues around the design, analysis, and reporting that we wish to raise.

First, validation studies should be designed to reliably assess performance in terms of discrimination and calibration (Steyerberg, 2009). In this particular case, the authors report a sample size calculation for testing the hypothesis that the AUC of the model is higher than 0.5. Assuming an AUC of 0.94 leads to a very low required sample size (n = 22). This approach is at odds with methodological guidance and the result is that the precision of performance measures will be low: for dichotomous prediction, previous studies have suggested that at least 100, and preferably at least 200 individuals with the event (in this case ovarian malignancy) are required for a meaningful validation (Steyerberg, 2009, Vergouwe et al., 2005, Collins et al., 2016). Here, center 1 has 70 malignant tumors, whilst center 2 has only 34, leading to unreliable per center results. Validation would therefore best be done on all patients, with center-specific results as an exploratory addition. Furthermore, statistical tests to compare results between centers are provided throughout the text. Although heterogeneity of performance across locations is important (Riley et al., 2016), p-values to compare two specific centers are uninformative. It is useful to observe that the AUCs were 0.955 and 0.907, since this is in line with the center-specific values reported in the original publication describing the ADNEX model (Van Calster et al., 2014). A detailed investigation of heterogeneity should however involve a larger dataset with patients from many different centers. Furthermore, subgroup analyses by menopausal status become very unreliable when stratified by center.
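The precision argument above can be made concrete with the standard Hanley–McNeil approximation for the standard error of an AUC estimate. The sketch below is illustrative only: the AUC of 0.94 is the value assumed in the sample size calculation being criticized, and the non-event counts are hypothetical, chosen solely to contrast a small-event validation with the recommended 200 events.

```python
import math

def hanley_mcneil_se(auc, n_events, n_nonevents):
    """Approximate standard error of an AUC estimate (Hanley-McNeil formula)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_events - 1) * (q1 - auc ** 2)
           + (n_nonevents - 1) * (q2 - auc ** 2)) / (n_events * n_nonevents)
    return math.sqrt(var)

# Assumed AUC of 0.94; the non-event counts are hypothetical.
se_small = hanley_mcneil_se(0.94, 34, 100)   # as few events as center 2
se_large = hanley_mcneil_se(0.94, 200, 600)  # recommended event count
print(f"SE with  34 events: {se_small:.3f}")
print(f"SE with 200 events: {se_large:.3f}")
```

With only 34 events the standard error is roughly 2.5 times larger than with 200 events, so the resulting confidence interval for the AUC is far too wide to characterize center-specific performance reliably.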

Second, the authors have not adequately described their population and results. The prevalence of each of the five tumor types is not clearly provided, and the prevalence of stage I cancer and stage II–IV cancer can only be derived from the confusion matrix. The ADNEX model has variants with and without the serum marker CA125 as a predictor. The authors mix both variants depending on the availability of CA125, such that it is unclear which variant the reported performance refers to.

Third, the calibration of the predicted risk of malignancy has not been investigated, i.e. whether observed frequencies of malignancy correspond to predicted risks, especially around the risk threshold of 10%. Unfortunately, this aspect of risk prediction models is often overlooked despite its importance (Steyerberg, 2009).
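A basic calibration check of the kind we have in mind can be sketched as follows: group patients into predicted-risk bins (with a cutpoint at the 10% clinical threshold) and compare the mean predicted risk with the observed malignancy frequency in each bin. The data below are synthetic and the cutpoints illustrative; this is not the analysis from either publication.

```python
def calibration_by_bins(risks, outcomes, cutpoints=(0.10, 0.30)):
    """Compare mean predicted risk with observed event frequency per risk bin.

    The first cutpoint (0.10) matches the clinical risk threshold of 10%.
    Returns (lower, upper, n, mean predicted, observed frequency) per bin.
    """
    edges = (0.0,) + tuple(cutpoints) + (1.0,)
    rows = []
    for lo, hi in zip(edges, edges[1:]):
        group = [(r, y) for r, y in zip(risks, outcomes)
                 if lo <= r < hi or (hi == 1.0 and r == 1.0)]
        if not group:
            continue
        mean_pred = sum(r for r, _ in group) / len(group)
        observed = sum(y for _, y in group) / len(group)
        rows.append((lo, hi, len(group), round(mean_pred, 3), round(observed, 3)))
    return rows

# Synthetic, perfectly calibrated example data (hypothetical patients).
risks = [0.05] * 20 + [0.20] * 10 + [0.60] * 10
outcomes = [0] * 19 + [1] + [0] * 8 + [1] * 2 + [0] * 4 + [1] * 6
for row in calibration_by_bins(risks, outcomes):
    print(row)
```

In a well-calibrated model the last two columns agree in every bin; a systematic gap, particularly around the 10% threshold where management decisions change, would indicate miscalibration that discrimination measures alone cannot detect.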

Finally, the ‘multiclass’ performance evaluation is fundamentally flawed. The key problem is the confusion matrix, which classifies patients into one of the five tumor types by choosing the group with the highest predicted risk. The baseline risk, or prevalence, of each tumor type varies substantially: among 327 patients, 223 are benign tumors (68%), 16 borderline (5%), 14 stage I primary cancers (4%), 64 stage II–IV primary cancers (20%), and 10 secondary metastatic cancers (3%). Given these large differences in prevalence, it is unlikely that ADNEX-based risk predictions for secondary metastatic cancer will be larger than those for a benign tumor. As a result, the confusion matrix will rarely classify a tumor as a metastatic cancer, resulting in near zero sensitivity for this tumor type. Analogous arguments apply to borderline tumors and stage I primary cancers. Such results are misleading, since they are unrelated to the model's ability to discriminate between tumor types. More generally, it makes little clinical sense to classify patients into only one category. It is much more relevant to monitor which risks are high or increased, and to act upon them accordingly. For example, the predicted risk of advanced-stage ovarian cancer and the risk of secondary metastasis might both be increased (although the latter will usually be smaller than the former due to the lower prevalence). In such cases the clinician may focus management decisions on both tumor types. An elevated risk of a metastatic tumor may trigger planning additional preoperative diagnostic tests, such as gastroscopy, X-ray mammography, or full-body MRI. Instead of a confusion matrix, concordance or c statistics for subgroup discrimination should be given. We would advise presenting pairwise c statistics using the conditional risk method (Van Calster et al., 2012, Van Calster et al., 2014), although other approaches could be followed.
Nevertheless, we warn that in this study the sample size is far too small to draw meaningful conclusions, although we realize that it would require a very large sample to have information on 100 secondary metastatic cancers, as in the IOTA collaboration (Van Calster et al., 2014).
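The contrast between argmax classification and pairwise discrimination can be illustrated with a small sketch. The probability vectors below are hypothetical, constructed only to mimic a common class (benign) and a rare class (secondary metastatic) whose predicted risk is elevated but never the largest; the pairwise c statistic follows the conditional risk method cited above.

```python
def pairwise_c(probs, labels, a, b):
    """Pairwise c statistic via the conditional risk method: restrict to
    patients from classes a and b, and rank them on p_a / (p_a + p_b)."""
    restricted = [(p[a] / (p[a] + p[b]), y) for p, y in zip(probs, labels)
                  if y in (a, b)]
    cases = [r for r, y in restricted if y == a]
    controls = [r for r, y in restricted if y == b]
    wins = sum(1.0 if c > d else 0.5 if c == d else 0.0
               for c in cases for d in controls)
    return wins / (len(cases) * len(controls))

# Hypothetical three-class example: class 0 = benign (common),
# class 2 = secondary metastatic (rare); probabilities are illustrative.
probs = [(0.90, 0.05, 0.05)] * 20 + [(0.55, 0.05, 0.40)] * 3
labels = [0] * 20 + [2] * 3

# Argmax never selects the rare class, so its "sensitivity" is 0 ...
argmax_sens = sum(max(range(3), key=p.__getitem__) == 2
                  for p, y in zip(probs, labels) if y == 2) / 3
# ... even though the model separates the two classes perfectly.
c_met_vs_benign = pairwise_c(probs, labels, 2, 0)
print(argmax_sens, c_met_vs_benign)  # 0.0 1.0
```

Here the confusion matrix would report zero sensitivity for the metastatic class, while the pairwise c statistic correctly shows perfect discrimination between metastatic and benign tumors: the near-zero sensitivities reflect prevalence, not the model.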

In conclusion, we are happy to observe the excellent discrimination between benign and malignant tumors seen in this study, in line with the original publication (Van Calster et al., 2014). However, the analysis does not allow us to draw any reliable conclusions with respect to multiclass discrimination. To improve reporting of prediction model studies, the TRIPOD guidelines have recently been introduced (Moons et al., 2015). These guidelines highlight the need for adequate sample size, assessment of calibration and transparent reporting of key information such as number of events in each category. Although we recognize that validation of multiclass models involves additional difficulties, it is clear that the TRIPOD recommendations should be followed to ensure all key information is clearly reported.

Conflict of interest

The authors declare no conflicts of interest.

References

  1. Altman D.G., Vergouwe Y., Royston P., Moons K.G. Prognosis and prognostic research: validating a prognostic model. BMJ. 2009;338:b605. doi: 10.1136/bmj.b605.
  2. Collins G.S., Ogundimu E.O., Altman D.G. Sample size considerations for the external validation of a multivariable prognostic model: a resampling study. Stat. Med. 2016;35:214–226. doi: 10.1002/sim.6787.
  3. Moons K.G., Altman D.G., Reitsma J.B., Ioannidis J.P., Macaskill P., Steyerberg E.W. Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): explanation and elaboration. Ann. Intern. Med. 2015;162:W1–73. doi: 10.7326/M14-0698.
  4. Riley R.D., Ensor J., Snell K.I., Debray T.P., Altman D.G., Moons K.G. External validation of clinical prediction models using big datasets from e-health records or IPD meta-analysis: opportunities and challenges. BMJ. 2016;353:i3140. doi: 10.1136/bmj.i3140.
  5. Steyerberg E.W. Clinical Prediction Models. A Practical Approach to Development, Validation, and Updating. Springer-Verlag; New York: 2009.
  6. Steyerberg E.W., Gerl A., Fossa S.D., Sleijfer D.T., de Wit R., Kirkels W.J. Validity of predictions of residual retroperitoneal mass histology in nonseminomatous testicular cancer. J. Clin. Oncol. 1998;16:269–274. doi: 10.1200/JCO.1998.16.1.269.
  7. Szubert S., Wojtowicz A., Moszynski R., Zywica P., Dyczkowski K., Stachowiak A. External validation of the IOTA ADNEX model performed by two independent gynecologic centers. Gynecol. Oncol. 2016. doi: 10.1016/j.ygyno.2016.06.020.
  8. Van Calster B., Vergouwe Y., Looman C.W., Van Belle V., Timmerman D., Steyerberg E.W. Assessing the discriminative ability of risk models for more than two outcome categories. Eur. J. Epidemiol. 2012;27:761–770. doi: 10.1007/s10654-012-9733-3.
  9. Van Calster B., Van Hoorde K., Valentin L., Testa A.C., Fischerova D., Van Holsbeke C. Evaluating the risk of ovarian cancer before surgery using the ADNEX model to differentiate between benign, borderline, early and advanced stage invasive, and secondary metastatic tumours: prospective multicentre diagnostic study. BMJ. 2014;349:g5920. doi: 10.1136/bmj.g5920.
  10. Vergouwe Y., Steyerberg E.W., Eijkemans M.J., Habbema J.D. Substantial effective sample sizes were required for external validation studies of predictive logistic regression models. J. Clin. Epidemiol. 2005;58:475–483. doi: 10.1016/j.jclinepi.2004.06.017.
