Table 2.
Required
Report level-specific specificity and sensitivity: When a classification has more than 2 levels, Sleep Health requires that the terms ‘specificity’ and ‘sensitivity’ be used in relation to each level of the classification (e.g., a 4-level sleep classification with separate specificity and sensitivity outputs for wake, N1 + N2, N3, and REM sleep).
Employ evaluation methods that go beyond accuracy alone: Accuracy is affected by disproportionate numbers of cases underlying sensitivity and specificity. To avoid characterizing a new technology as highly accurate when its sensitivity and specificity are imbalanced, Sleep Health requires that accuracy be reported together with sensitivity and specificity, and/or other methods for balancing.
Sample-based confusion matrix: Sleep Health requires that the confusion matrix be calculated at the individual level and then summarized across individuals by providing the mean, SD, and 95% CI⁹.
Recommended
Do not use “validation”: This terminology is deprecated and should usually be avoided. Sleep Health recommends the term “evaluation” instead, to avoid the improper use of “validation” to state that a new technology is ‘valid’ simply because a study has evaluated its performance against a reference. Exceptions include, but are not limited to, machine learning and deep learning algorithms, where the term “validation” has a different meaning.
Use independent samples: When the performance of an empirically derived algorithm is tested, Sleep Health recommends that independent samples be used for training and testing. This can include, but is not limited to, both internal cross-validation and external validation in a separate sample. For machine learning algorithms, a k-fold process for selecting repeated, balanced training and testing datasets is an accepted standard method. Further evolution of rigorous methods is expected.
Report proportional bias and limits of agreement (LOAs): When reporting Bland-Altman plots and statistics, Sleep Health recommends distinguishing between uniform bias (differences are uniformly distributed across the size of measurement) and proportional bias (differences are proportional to the range of measurement), and between homoscedastic LOAs (bias ± 1.96 × SD of the differences) and heteroscedastic LOAs (expected LOAs proportional to the range of measurement).
Report level-specific receiver operating characteristic (ROC): Sleep Health recommends, where appropriate, plotting a ROC curve and reporting the corresponding area under the curve (AUC) for each level of the relevant classification (e.g., wake, N1 + N2, N3, and REM sleep).
Report measures of classification agreement not due to chance: Examples include the kappa (κ) coefficient and the prevalence-adjusted bias-adjusted kappa (PABAK) coefficient.
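The required items in the table (level-specific sensitivity and specificity, accuracy reported alongside them, and a confusion matrix computed per individual and then summarized) can be illustrated with a minimal Python sketch. It is not a Sleep Health reference implementation. The function names, the label strings, and the one-vs-rest treatment of each level are illustrative assumptions, and the 95% CI uses a normal approximation (mean ± 1.96 × SD/√n) because the table does not specify how the interval should be computed.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical label strings for the 4-level example named in the table.
LEVELS = ["wake", "N1+N2", "N3", "REM"]


def level_metrics(reference, device, level):
    """One-vs-rest sensitivity, specificity, and accuracy for one level
    in one individual's epoch-by-epoch recording."""
    tp = fp = tn = fn = 0
    for ref, dev in zip(reference, device):
        if ref == level and dev == level:
            tp += 1
        elif ref == level:
            fn += 1
        elif dev == level:
            fp += 1
        else:
            tn += 1
    return {
        # NaN when the individual has no epochs of (or outside) this level.
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        # One-vs-rest accuracy; this differs from overall multi-class accuracy.
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


def summarize(values):
    """Mean, SD, and 95% CI of the mean across individuals.
    Assumption: normal-approximation CI; needs at least two defined values."""
    vals = [v for v in values if v == v]  # drop NaN entries
    m, sd = mean(vals), stdev(vals)
    half = 1.96 * sd / sqrt(len(vals))
    return {"mean": m, "sd": sd, "ci95": (m - half, m + half)}


def evaluate(subjects):
    """subjects: list of (reference_epochs, device_epochs), one pair per
    individual. Metrics are computed per individual first, then summarized."""
    out = {}
    for level in LEVELS:
        per_subject = [level_metrics(ref, dev, level) for ref, dev in subjects]
        out[level] = {
            name: summarize([m[name] for m in per_subject])
            for name in ("sensitivity", "specificity", "accuracy")
        }
    return out
```

Calling `evaluate` on a list of per-individual epoch sequences returns, for each level, the mean, SD, and 95% CI of sensitivity, specificity, and accuracy. Because the metrics are computed within each individual before they are summarized, individuals with long recordings do not dominate the result.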
In studies evaluating the performance of sleep technology, we frequently encounter inconsistent use of evaluation terminology. We define recommended and required elements of a standardized terminology to address commonly “misused” language in the field of technology evaluation. Please refer to Menghini and colleagues¹³ for a comprehensive discussion of the topic and a detailed description and operationalization of the key analytics and outcomes. We recognize that the field is fluid and will update our Guide for Authors and Template as these metrics and methods evolve.
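The Bland-Altman recommendation in Table 2 (uniform versus proportional bias, homoscedastic versus heteroscedastic LOAs) can be sketched as follows. This is a rough illustration under stated assumptions, not the procedure of the cited reference: the function names are hypothetical, differences are taken as device minus reference, the reference value stands in for the "size of measurement" (the mean of the two methods is the classic alternative), and the slopes are reported without significance tests, which a real analysis would need.

```python
from statistics import mean, stdev


def ols(x, y):
    """Ordinary least-squares (intercept, slope) of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope


def bland_altman(device, reference):
    """Bias, homoscedastic LOAs, and two diagnostic slopes.
    device, reference: paired per-individual measurements (e.g., minutes)."""
    diffs = [d - r for d, r in zip(device, reference)]
    size = list(reference)  # assumption: reference as size of measurement

    bias, sd = mean(diffs), stdev(diffs)

    # Slope of the differences on the size of measurement:
    # near 0 suggests uniform bias, otherwise proportional bias.
    b0, b1 = ols(size, diffs)

    # Slope of the absolute residuals on the size of measurement:
    # near 0 suggests homoscedastic LOAs, otherwise heteroscedastic LOAs.
    resid = [d - (b0 + b1 * s) for d, s in zip(diffs, size)]
    _, c1 = ols(size, [abs(e) for e in resid])

    return {
        "bias": bias,
        "loa": (bias - 1.96 * sd, bias + 1.96 * sd),  # valid if homoscedastic
        "bias_slope": b1,
        "abs_resid_slope": c1,
    }
```

If `bias_slope` is clearly different from zero, a single bias value misrepresents the agreement and the bias is better reported as a function of the size of measurement. The same reasoning applies to `abs_resid_slope` and the constant-width LOAs.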