Abstract
Diagnostic tests are used in clinical practice to identify, with high accuracy, the disease of a particular patient and thus to provide early and proper treatment. Reporting high-quality results of diagnostic tests, for both basic and advanced methods, is solely the responsibility of the authors. Despite the existence of recommendations and standards regarding the content and format of statistical results, the quality of statistical reporting in the assessment of diagnostic tests varies from excellent to very poor. This article briefly reviews the steps in the evaluation of a diagnostic test, from its anatomy to its role in clinical practice and the statistical methods used to demonstrate its performance. The statistical approaches are linked with the phase, clinical question, and objective of the study and are accompanied by examples. More details are provided for phase I and II studies, while the statistical treatment of phase III and IV studies is presented only briefly. Several free online resources useful for calculating some of these statistics are also given.
1. Introduction
An accurate and timely diagnosis, with the smallest probability of misdiagnosis, missed diagnosis, or delayed diagnosis, is crucial in the management of any disease [1, 2]. Diagnosis is an evolving process, since both the disease (its likelihood and severity) and the diagnostic approaches evolve [3]. In clinical practice, it is essential to correctly identify the diagnostic test that is useful to a specific patient with a specific condition [4–6]. Over- or underdiagnosis translates directly into unnecessary treatment or no treatment and harms both the subjects and the health-care systems [3].
Statistical methods used to assess a sign or a symptom in medicine depend on the phase of the study and are directly related to the research question and the design of the experiment (Table 1) [7].
Table 1.
Anatomy on phases of a diagnostic test.
Phase | What? | Design |
---|---|---|
I | Determination of normal ranges (pharmacokinetics, pharmacodynamics, and safe doses) | Observational studies on healthy subjects |
II | Evaluation of diagnostic accuracy | Case-control studies on healthy subjects and subjects with known (by a gold standard test) or suspected disease of interest: (i) phase IIa: healthy subjects and subjects with the known disease of interest, all diagnosed by a gold standard method; (ii) phase IIb: testing the relevance of disease severity (evaluation of how the test works under ideal conditions); (iii) phase IIc: assessment of the predictive values among subjects with suspected disease |
III | Evaluation of the clinical consequences (beneficial and harmful effects) of introducing a diagnostic test | Randomized controlled trials; randomization determines whether a subject receives the diagnostic test or not |
IV | Determination of the long-term consequences of introducing a new diagnostic test into clinical practice | Cohort studies of consecutive participants evaluating whether the diagnostic accuracy of a test in practice corresponds to predictions from systematic reviews of phase III trials |
Adapted from [7].
A significant effort has been made to develop standards for reporting clinical studies, both for primary (e.g., case-control studies, cohort studies, and clinical trials) and secondary (e.g., systematic reviews and meta-analyses) research. This effort led to 412 guidelines being available on the EQUATOR Network as of April 20, 2019 [8]. Each guideline is accompanied by a short checklist describing the information that needs to be present in each section and also includes some requirements on the presentation of statistical results (what to report, e.g., mean (SD), where SD is the standard deviation, and how to report it, e.g., the number of decimals). These guidelines are also used as support in the critical evaluation of an article in evidence-based clinical practice. However, insufficient attention has been granted to the minimum set of items or methods and to their quality in reporting the results. Different designs of experiments have received more attention, and several statistical guidelines, especially for clinical trials, have been developed to standardize the content of the statistical analysis plan [9], for phase III clinical trials in myeloid leukemia [10], pharmaceutical industry-sponsored clinical trials [11], subgroup analysis [12], or graphics and statistics for cardiology [13]. The SAMPL Guidelines provide general principles for reporting statistical methods and results [14].
SAMPL recommends reporting numbers with the appropriate degree of precision; the sample size; numerators and denominators for percentages; mean (SD) (where SD = standard deviation) for approximately normally distributed data, and medians with interpercentile ranges otherwise; verification of the assumptions of statistical tests; the name of the test and whether it is one- or two-tailed; the significance level (α); P values, whether statistically significant or not; any adjustment(s) for multivariate analysis; the statistical package used in the analysis; missing data; regression equations with regression coefficients for each explanatory variable, associated confidence intervals, and P values; and models' goodness of fit (coefficient of determination) [14]. In regard to diagnostic tests, standards are available for reporting accuracy (QUADAS [15], QUADAS-2 [16], STARD [17, 18], and STARD 2015 [19]), diagnostic predictive models (TRIPOD [20]), systematic reviews and meta-analyses (AMSTAR [21] and AMSTAR 2 [22]), and recommendations and guidelines (AGREE [23], AGREE II [24], and RIGHT [25]). These requirements highlight what and how to report (by example), with an emphasis on the design of the experiment, which is mandatory to assure the validity and reliability of the reported results. Several studies have been conducted to evaluate whether the available reporting standards are followed. The proportion of articles that adequately report accuracy ranges from low [26–28] to satisfactory [29], but not excellent, still leaving much room for improvement [30–32].
Diagnostic tests are frequently reported in the scientific literature, and clinicians must know what a good report looks like in order to apply only the highest-quality information from the scientific literature to decisions related to a particular patient. This review aims to present the statistical methods most frequently used in the evaluation of a diagnostic test, linking the statistical treatment of data with the phase of the evaluation and the clinical question.
2. Anatomy of a Diagnostic Test
A diagnostic test could be used in clinical settings for confirmation/exclusion, triage, monitoring, prognosis, or screening (Table 2) [19, 38]. Table 2 presents the role of a diagnostic test, its aim, and a real-life example.
Table 2.
Anatomy of the role of a diagnostic test.
Role | What? | Example (ref.) |
---|---|---|
Confirmation/exclusion | Confirm (rule-in) or exclude (rule-out) the disease | Brain natriuretic peptide: diagnostic for left ventricular dysfunction [33] |
Triage | An initial test that could be rapidly applied and have a small number of false-positive results | Renal Doppler resistive index: hemorrhagic shock in polytrauma patients [34] |
Monitoring | A repeated test that allows assessing the efficacy of an intervention | Glycohemoglobin (A1c Hb): overall glycemic control of patients with diabetes [35] |
Prognosis | Assessment of an outcome or the disease progression | PET/CT scan in the identification of distant metastasis in cervical and endometrial cancer [36] |
Screening | Presence of the disease in apparently asymptomatic persons | Cytology test: screening of cervical uterine cancer [37] |
Different statistical methods are used to support the results of a diagnostic test according to the question, phase, and study design. The statistical analysis depends on the test outcome type. Table 3 presents the most common types of diagnostic test outcome and provides some examples.
Table 3.
Diagnosis test result: type of data.
Data | Example (ref.) |
---|---|
Qualitative dichotomial | Positive/negative or abnormal/normal: (i) endovaginal ultrasound in the diagnosis of normal intrauterine pregnancy [39]; (ii) QuantiFERON-TB test for the determination of tubercular infection [40] |
Qualitative ordinal | (i) Prostate bed after radiation therapy: definitely normal/probably normal/uncertain/probably abnormal/definitely abnormal [41]. (ii) Scores: Apgar score (assessment of infants after delivery): 0 (no activity, pulse absent, floppy grimace, skin blue or pale, and respiration absent) to 10 (active baby, pulse over 100 bpm, prompt response to stimulation, pink skin, and vigorous cry) [42]; Glasgow coma score: eye opening (from 1 = no eye opening to 4 = spontaneous), verbal response (from 1 = none to 5 = patient oriented), and motor response (from 1 = none to 6 = obeys commands) [43]; Alvarado score (the risk of appendicitis): evaluates 6 clinical items and 2 laboratory measurements, with an overall score from 0 (no appendicitis) to 10 ("very probable" appendicitis) [44]; and sonoelastographic scoring systems in the evaluation of lymph nodes [45]. (iii) Scales: quality-of-life scales (SF-36 [46], EQ-5D [47, 48], VascuQoL [49, 50], and CIVIQ [51]) and pain scales (e.g., 0 (no pain) to 10 (the worst pain)) [52] |
Qualitative nominal | (i) Apolipoprotein E gene (ApoE) genotypes: E2/E2, E2/E3, E2/E4, E3/E3, E3/E4, and E4/E4 [53, 54]. (ii) SNPs (single-nucleotide polymorphisms) of IL-6: at position −174 (rs1800795), −572 (rs1800796), −596 (rs1800797), and T15A (rs13306435) [55] |
Quantitative discrete | (i) Number of bacteria in urine or other fluids [56]. (ii) Number of products contaminated with different bacteria [57]. (iii) Glasgow aneurysm score (= age in years + 17 for shock + 7 for myocardial disease + 10 for cerebrovascular disease + 14 for renal disease) [58] |
Quantitative continuous | (i) Biomarkers: chitotriosidase [59], neopterin [60], urinary cotinine [61], and urinary cadmium levels [61]. (ii) Measurements: resistivity index [62], ultrasound thickness [63], and interventricular septal thickness [64] |
The result of an excellent diagnostic test must be accurate (the measured value is as close as possible to the true value) and precise (the measurement is repeatable and reproducible) [65]. An accurate and precise measurement is the primary characteristic of a valid diagnostic test.
The reference range (or reference interval), i.e., the range of normal values determined in healthy persons, is also essential to classify a measurement as a positive or negative result and generally refers to continuous measurements. Under the assumption of a normal distribution, the reference interval of a diagnostic measurement has a lower reference limit/lower limit of normal (LRL) and an upper reference limit/upper limit of normal (URL) [66–71]. Frequently, the reference interval covers the central 95% of a reference population, but exceptions from this rule are observed (e.g., cTn (cardiac troponins) [72] and glucose levels [73], with <5% deviation from reference intervals) [74, 75]. The reference ranges could differ among laboratories [76, 77], genders and/or ages [78], and populations [79] (with variations even within the same population [80, 81]), and with physiological conditions (e.g., pregnancy [82], time of sample collection, or posture). Within-subject biological variation is smaller than between-subject variation, so reference change values could better reflect changes in measurements for an individual as compared to reference ranges [83]. Furthermore, a call for establishing clinical decision limits (CDLs) with the involvement of laboratory professionals has also been made [84].
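As an illustration of the central-95% convention, the sketch below computes a reference interval from simulated measurements in a hypothetical healthy reference population, both parametrically (mean ± 1.96 SD, valid under normality) and nonparametrically (2.5th and 97.5th percentiles); all values are illustrative assumptions, not data from any cited study.

```python
import numpy as np

# Simulated biomarker measurements from a hypothetical healthy population.
rng = np.random.default_rng(7)
values = rng.normal(loc=100.0, scale=10.0, size=500)

# Parametric 95% reference interval (assumes a normal distribution):
# the central 95% lies within mean +/- 1.96 standard deviations.
lrl_param = values.mean() - 1.96 * values.std(ddof=1)
url_param = values.mean() + 1.96 * values.std(ddof=1)

# Nonparametric alternative: 2.5th and 97.5th percentiles,
# usable regardless of the underlying distribution.
lrl_np, url_np = np.percentile(values, [2.5, 97.5])

print(f"parametric interval:    ({lrl_param:.1f}, {url_param:.1f})")
print(f"nonparametric interval: ({lrl_np:.1f}, {url_np:.1f})")
```

For approximately normal data the two intervals nearly coincide; for skewed data, the percentile-based interval is the safer choice.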
The Z-score (standardized value, standardized score, or Z-value; Z-score = (measurement − μ)/σ) is a dimensionless metric used to evaluate how many standard deviations (σ) a measurement lies from the population mean (μ) [85]. A Z-score of 3 refers to 3 standard deviations and would mean that more than 99% of the population is covered by the Z-score [86]. The Z-score is properly used under the assumption of a normal distribution and when the parameters of the population are known [87]. It has the advantage of allowing the comparison of different methods of measurement [87]. Z-scores are used for measurements in pediatric populations [88, 89] or fetuses [90], but not exclusively (e.g., bone density tests [91]).
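The Z-score formula can be applied directly; in the sketch below, the measurement and the population parameters (μ, σ) are illustrative assumptions only, not reference values from the literature.

```python
# Z-score: how many population standard deviations a measurement lies
# from the population mean; mu and sigma are assumed known population
# parameters (hypothetical values for illustration).
def z_score(measurement, mu, sigma):
    return (measurement - mu) / sigma

# Example: a measurement of 2.9 against a hypothetical reference
# population with mean 2.0 and standard deviation 0.3.
z = z_score(2.9, mu=2.0, sigma=0.3)
print(f"Z = {z:.1f}")
```

A result of Z = 3.0 would place the measurement beyond roughly 99.7% of a normally distributed population.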
3. Diagnostic Tests and Statistical Methods
The usefulness of a diagnostic test is directly related to its reproducibility (the result is the same when two different medical staff apply the test), accuracy (the same result is obtained if the diagnostic test is applied more than once), feasibility (the diagnostic method is accessible and affordable), and the effect of the diagnostic test result on the clinical decision [92]. Specific statistical methods are used to support the utility of a diagnostic test, and several examples linking the phase of a diagnostic test with the clinical question, design, and statistical analysis methods are provided in Table 4 [101].
Table 4.
Statistical methods in the assessment of the utility of a diagnostic test.
Phase | Clinical question | Objective(s) | Statistics for results | Example (ref.) |
---|---|---|---|---|
I | Which are the normal ranges of values of a diagnostic test? | Determination of the range of values on healthy subjects | Centrality and dispersion (descriptive) metrics: (i) mean (SD), where SD = standard deviation, if data follow the normal distribution; (ii) otherwise, median (Q1–Q3), where Q1 = 25th percentile and Q3 = 75th percentile | (i) Levels of hepcidin and prohepcidin in healthy subjects [93]; (ii) plasma pro-gastrin-releasing peptide (ProGRP) levels in healthy adults [94] |
I | Is the test reproducible? | Variability: (i) intra- and interobserver; (ii) intra- and interlaboratory | (i) Agreement analysis: % (95% confidence interval) and agreement coefficients (dichotomial data: Cohen's kappa; ordinal data: weighted kappa; numerical data: Lin's concordance correlation coefficient and Bland and Altman diagram); (ii) variability analysis: coefficient of variation, distribution of differences | (i) Intra- and interobserver variability of uterine measurements [95]; (ii) interlaboratory variability of cervical cytopathology [96]; (iii) concordance between tuberculin skin test and QuantiFERON in children [40] |
II | Is the test accurate? Which are the performances of the diagnostic test? | Determine the accuracy as compared to a gold standard test | (i) Metrics (dichotomial outcome): Se (sensitivity), Sp (specificity), PPV (positive predictive value), NPV (negative predictive value), and DOR (diagnostic odds ratio); (ii) clinical performances (dichotomial outcome): PLR (positive likelihood ratio) and NLR (negative likelihood ratio); (iii) threshold identification (numerical or ordinal outcome with a minimum of five classes): ROC (receiver operating characteristic) curve analysis | (i) Digital breast tomosynthesis for benign and malignant lesions in breasts [97]; (ii) chitotriosidase as a marker of inflammatory status in critical limb ischemia [59]; (iii) sonoelastographic scores to discriminate between benign and malignant cervical lymph nodes [45] |
III | Which are the costs, risks, and acceptability of a diagnostic test? | (i) Evaluation of beneficial and harmful effects; (ii) cost-effectiveness analysis | Retrospective or prospective studies: (i) beneficial (e.g., improvement of clinical outcome) or harmful effects (e.g., morbidity and mortality) by proportions, risk ratio, odds ratio, hazard ratio, number needed to treat, and rates and ratios of desirable or undesirable outcomes; (ii) cost-effectiveness analysis (mean cost and quality-adjusted life years (QALYs)) | (i) Computed tomography in children, the associated radiation exposure, and the risk of cancer [98]; (ii) healthcare benefit and cost-effectiveness of a screening strategy for colorectal cancer [99] |
IV | Which are the consequences of introducing a new diagnostic test into clinical practice? | Does the test result affect the clinical decision? | (i) Studies of pre- and posttest clinical decision-making; (ii) %: abnormal results, discrepant results, tests leading to changes in the clinical decisions; (iii) costs: per abnormal result, per decision change | Does the interferon-gamma release assay (IGRA) change the clinical management of patients with latent tuberculosis infection (LTBI)? [100] |
3.1. Descriptive Metrics
A cross-sectional study is frequently used to establish the normal range of values. Whenever data follow the normal distribution (normality tests such as the Shapiro–Wilk [102] or Kolmogorov–Smirnov test [103, 104] provide valid results whenever the sample size exceeds 29), the mean and standard deviation are reported [105], and comparisons between groups are tested with parametric tests such as Student's t-test (2 groups) or ANOVA (more than 2 groups). If the experimental data do not follow the normal distribution or the sample size is less than 30 [105], medians and quartiles (Q1–Q3) are expected to be reported, and comparisons are made with nonparametric tests. Continuous data are reported with one or two decimals (sufficient to assure the accuracy of the result), while P values are reported with four decimals, whether or not the significance threshold is reached [106].
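The decision rule above (test normality first, then choose parametric or nonparametric summaries) can be sketched as follows, using SciPy's Shapiro–Wilk test on simulated, deliberately skewed data (all values are hypothetical).

```python
import numpy as np
from scipy import stats

# Hypothetical skewed biomarker data (lognormal), n = 80.
rng = np.random.default_rng(1)
sample = rng.lognormal(mean=0.0, sigma=0.6, size=80)

# Shapiro-Wilk test: a small P value is evidence against normality.
w_stat, p_value = stats.shapiro(sample)

mean, sd = sample.mean(), sample.std(ddof=1)
q1, median, q3 = np.percentile(sample, [25, 50, 75])

if p_value >= 0.05:
    # Approximately normal: mean (SD), compared with t-test/ANOVA.
    print(f"mean (SD): {mean:.2f} ({sd:.2f})")
else:
    # Not normal: median (Q1-Q3), compared with nonparametric tests.
    print(f"median (Q1-Q3): {median:.2f} ({q1:.2f}-{q3:.2f})")
```

The same branching logic determines the choice of comparison test (Student's t-test/ANOVA versus Mann–Whitney/Kruskal–Wallis).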
These norms of good practice are not always seen in the scientific literature, as studies are frequently more complex (e.g., investigation of changes in the values of biomarkers with age or comparison of healthy subjects with subjects with a specific disease). One example is given by Koch and Singer [107], who aimed to determine the range of normal values of plasma B-type natriuretic peptide (BNP) from infancy to adolescence. One hundred ninety-five healthy infants, children, and adolescents were evaluated. Even though the values of BNP varied considerably, the results were improperly reported as mean (standard deviation) for the investigated subgroups; the subgroups, however, were correctly compared using nonparametric tests [107, 108]. Taheri et al. compared the serum levels of hepcidin (a low-molecular-weight protein with a role in iron metabolism) and prohepcidin between hemodialysis patients (44 patients) and healthy subjects (44 subjects) [93]. They reported the values of hepcidin and prohepcidin as mean and standard deviation, suggesting a normal distribution of the data, but compared them using nonparametric tests, implying the absence of a normal distribution of the experimental data [93]. Furthermore, they correlated these two biomarkers, although no reason exists for this analysis, since one is derived from the other [93].
Zhang et al. [94] determined the reference values for plasma pro-gastrin-releasing peptide (ProGRP) levels in healthy Han Chinese adults. They tested the distribution of ProGRP, identified that it is not normally distributed, and correctly reported the medians, ranges, and 2.5th, 5th, 50th, 95th, and 97.5th percentiles for two age subgroups. Spearman's correlation coefficient was correctly used to test the relation between ProGRP and age, but the symbol used for this correlation coefficient was r (the symbol attributed to Pearson's correlation coefficient) instead of ρ. The differences in ProGRP among groups were accurately tested with the Mann–Whitney test (two groups) and the Kruskal–Wallis test (more than two groups). The authors reported the age-dependent reference interval for this specific population, without significant differences between genders [94].
The influence of toner particles on seven biomarkers (serum C-reactive protein (CRP), IgE, interleukins (IL-4, IL-6, and IL-8), serum interferon-γ (IFN-γ), and urine 8-hydroxy-2′-deoxyguanosine (8OHdG)) was investigated by Murase et al. [109]. They conducted a prospective cohort study (toner exposed and unexposed) with a five-year follow-up and measured the biomarkers annually. The reference values of the studied biomarkers were correctly reported as median and 25th–75th percentiles as well as the 2.5th–97.5th percentiles (as recommended by the Clinical and Laboratory Standards Institute [108]).
3.2. Variability Analysis
Two different approaches are used to test the variability of quantitative data in phase I studies, both reflecting repeated measurements (with the same or a different device or examiner): variation analysis (coefficient of variation, CV) and agreement analysis (agreement coefficients).
3.2.1. Variation Analysis
The coefficient of variation (CV), also known as relative standard deviation (RSD), is a standardized measure of dispersion used to express the precision of an assay (an intra-assay CV (the same sample assayed in duplicate) < 10% is considered acceptable; an interassay CV (comparison of results across assay runs) < 15% is deemed acceptable) [110–112]. The coefficient of variation was introduced by Karl Pearson in 1896 [113] and can also be used to test the reliability of a method (the smaller the CV, the higher the reliability) [114] and to compare methods (the smallest CV belongs to the better method) or variables expressed in different units [115]. The CV is defined as the ratio of the standard deviation to the mean, expressed as a percentage [116], and is correctly calculated on quantitative data measured on a ratio scale [117]. The coefficient of quartile variation/dispersion (CQV/CQD) was introduced as a preferred measure of dispersion when data do not follow the normal distribution [118] and is defined based on the third and first quartiles as (Q3 − Q1)/(Q3 + Q1) × 100 [119]. In survey analysis, the CQV is used as a measure of convergence in experts' opinions [120].
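Both dispersion measures can be computed in a few lines from their definitions above; the assay readings below are hypothetical.

```python
import numpy as np

def cv_percent(x):
    """Coefficient of variation: SD / mean, as a percentage (ratio-scale data)."""
    x = np.asarray(x, dtype=float)
    return 100.0 * x.std(ddof=1) / x.mean()

def cqv_percent(x):
    """Coefficient of quartile variation: (Q3 - Q1) / (Q3 + Q1) * 100."""
    q1, q3 = np.percentile(x, [25, 75])
    return 100.0 * (q3 - q1) / (q3 + q1)

# Hypothetical duplicate assay readings of the same sample.
readings = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]
print(f"CV  = {cv_percent(readings):.1f}%")   # intra-assay CV < 10% acceptable
print(f"CQV = {cqv_percent(readings):.1f}%")
```

For these readings the CV is 4.0%, well under the 10% intra-assay threshold quoted above.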
The confidence interval associated with the CV is expected to be reported to provide readers with sufficient information for a correct interpretation of the reported results; several online implementations are available (Table 5).
Table 5.
Online resources for confidence intervals calculation: coefficient of variation.
What? | URL (accessed on August 26, 2018) |
---|---|
Two-sided confidence interval (CI) for a CVa | https://www1.fpl.fs.fed.us/covnorm.dcd.html and https://community.jmp.com/kvoqx44227/attachments/kvoqx44227/scripts/77/1/CI%20for%20CV%202.jsl |
One-sided CIa | https://www1.fpl.fs.fed.us/covlow.html (lower bound) and https://www1.fpl.fs.fed.us/covup.html (upper bound) |
Two-sided CI for a CVb | https://www1.fpl.fs.fed.us/covln.html |
Ratio of two CVsa | https://www1.fpl.fs.fed.us/covratio.html |
aNormal distribution; blognormal distribution.
Inference on CVs can be made using specific statistical tests, according to the distribution of the data. For normal distributions, tests are available to compare two [121] or more than two CVs (the Feltz and Miller test [122] or the Krishnamoorthy and Lee test [123], the latter also implemented in R [124]).
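As a sketch, the Feltz and Miller asymptotic test, as commonly described (and as implemented, for example, in the R package cvequality), compares the CVs of k groups with a chi-square statistic on k − 1 degrees of freedom; the laboratory values below are hypothetical.

```python
import numpy as np
from scipy import stats

def feltz_miller(samples):
    """Asymptotic equality-of-CVs test (Feltz & Miller, 1996), as commonly
    described; returns the chi-square statistic and its P value."""
    m = np.array([len(s) - 1 for s in samples], dtype=float)  # df per group
    cv = np.array([np.std(s, ddof=1) / np.mean(s) for s in samples])
    cv_pooled = np.sum(m * cv) / np.sum(m)                    # df-weighted CV
    d_ad = np.sum(m * (cv - cv_pooled) ** 2) / (cv_pooled**2 * (0.5 + cv_pooled**2))
    p = stats.chi2.sf(d_ad, df=len(samples) - 1)
    return d_ad, p

# Hypothetical measurements of the same analyte from two laboratories.
lab_a = [10.1, 9.8, 10.4, 10.0, 9.7, 10.2]
lab_b = [20.5, 18.9, 21.7, 19.4, 22.1, 20.0]
stat, p = feltz_miller([lab_a, lab_b])
print(f"D'AD = {stat:.3f}, P = {p:.3f}")
```

A small P value indicates that the relative dispersion differs between the groups even though their means differ.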
Reporting CVs with their associated 95% confidence intervals allows a proper interpretation of the point estimate (CV). Schafer et al. [125] investigated the laboratory reproducibility of urine N-telopeptide (NTX) and serum bone-specific alkaline phosphatase (BAP) measurements across six labs over eight months and correctly reported the CVs with associated 95% confidence intervals. Furthermore, they also compared the CVs between two assays and between labs and highlighted the need for improvements in the analytical precision of both NTX and BAP biomarkers [125]. They concluded by stressing the importance of making laboratory performance reports available to clinicians and institutions, along with the need for proficiency testing and standardized guidelines to improve marker reproducibility [125].
However, good practice in reporting CVs is not always observed. The inter- and intra-assay CVs within laboratories reported by Calvi et al. [126] for measurements of cortisol in saliva are given only as point estimates; neither confidence intervals nor statistical tests are provided. Reed et al. [127] reported the variability of measurements (thirty-three laboratories with fifteen repeated measurements in each lab) of human serum antibodies against Bordetella pertussis antigens by ELISA using just the CVs (no associated 95% confidence intervals), in relation to the expected fraction of pairs of those measurements that differ by at least a given factor (k).
3.2.2. Agreement Analysis
Percentage agreement (po), the number of agreements divided by the number of cases, is the easiest agreement coefficient to calculate but may be misleading. Several agreement coefficients that adjust the proportion of agreement for the agreement expected by chance have been introduced:
Nominal or ordinal scale: Cohen's kappa coefficient (nominal scale, inclusive dichotomial such as positive/negative test result), symbol κ [128], and its derivatives (Fleiss' generalized kappa [129], Conger's generalized kappa [130], and weighted kappa (ordinal scale test result)) [131]
Numerical scale: interclass (Pearson's correlation coefficient (r)) [132] and intraclass correlation coefficient (ICC) [133], Lin's concordance correlation coefficient (ρc) [134, 135], and the Bland and Altman diagram (B&A plot) [136, 137]
Cohen's kappa coefficient has three assumptions: (i) the units are independent; (ii) the categories of the nominal scale are independent and mutually exclusive; and (iii) the readers/raters are independent [128]. Cohen's kappa coefficient takes values between −1 (perfect disagreement) and 1 (perfect agreement). The empirical rules used to interpret Cohen's kappa coefficient [138] are as follows: no agreement for κ ≤ 0.20, minimal agreement for 0.21 ≤ κ ≤ 0.39, weak agreement for 0.40 ≤ κ ≤ 0.59, moderate agreement for 0.60 ≤ κ ≤ 0.79, strong agreement for 0.80 ≤ κ ≤ 0.90, and almost perfect agreement for κ > 0.90. The minimum acceptable interrater agreement for clinical laboratory measurements is 0.80. The 95% CI must accompany the value of κ for a proper interpretation, and the empirical interpretation rules must be applied to the lower bound of the confidence interval.
The statistical significance of κ can also be tested, but most implementations only test whether κ is significantly different from zero (H0 (null hypothesis): κ = 0). Since the clinically significant value is 0.80, a test with the null hypothesis H0: κ = 0.79 vs. H1 (one-sided alternative hypothesis): κ > 0.79 should instead be applied.
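For a 2 × 2 rater-agreement table, κ and an approximate large-sample 95% confidence interval can be computed as sketched below; the counts are hypothetical, and the simple standard-error formula shown is an approximation.

```python
import math

# Hypothetical 2x2 agreement table between two raters:
#                rater B positive   rater B negative
a, b = 46, 4    # rater A positive
c, d = 6, 44    # rater A negative
n = a + b + c + d

po = (a + d) / n                                      # observed agreement
pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2   # chance agreement
kappa = (po - pe) / (1 - pe)

# Approximate large-sample standard error and 95% CI for kappa.
se = math.sqrt(po * (1 - po) / (n * (1 - pe) ** 2))
lo, hi = kappa - 1.96 * se, kappa + 1.96 * se
print(f"kappa = {kappa:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

Here the point estimate (0.80) reaches the clinical threshold, but the lower confidence bound (0.68) does not, illustrating why the empirical interpretation rules should be applied to the lower bound of the interval.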
Weighted kappa is used to discriminate between different readings of ordinal diagnostic test results (a different degree of disagreement exists between good and excellent than between poor and excellent). The weights, reflecting the importance of agreement (linear, proportional to the number of categories apart, or quadratic, proportional to the square of the number of categories apart), must be established by the researcher [131].
Intra- and interclass correlation coefficients (ICCs) are used as measures of the reliability of measurements and have their utility in the evaluation of a diagnostic test. Interrater reliability (two or more raters measure the same group of individuals), test-retest reliability (the variation in measurements by the same instrument on the same subject under the same conditions), and intrarater reliability (the variation of data measured by one rater across two or more trials) are commonly used [139]. McGraw and Wong [140] defined in 1996 ten forms of ICC based on the model (1-way random effects, 2-way random effects, or 2-way fixed effects), the number of raters/measurements (single rater/measurement or the mean of k raters/measurements), and the hypothesis (consistency or absolute agreement). McGraw and Wong also discuss how to select the correct ICC and recommend reporting ICC values along with their 95% CIs [140].
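As one concrete form among the ten, ICC(1,1) (1-way random effects, single measurement) can be computed from the one-way ANOVA mean squares; the sketch below uses hypothetical ratings, and a full analysis would also report the 95% CI.

```python
import numpy as np

def icc_1_1(ratings):
    """ICC(1,1): one-way random-effects, single-rater ICC from the
    one-way ANOVA mean squares (illustrative sketch)."""
    ratings = np.asarray(ratings, dtype=float)   # shape: (subjects, raters)
    n, k = ratings.shape
    subject_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()
    # Between-subjects and within-subjects mean squares.
    msb = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    msw = np.sum((ratings - subject_means[:, None]) ** 2) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical measurements of 5 subjects by 2 raters.
data = [[10.0, 10.2], [12.1, 11.9], [9.5, 9.8], [14.0, 13.7], [11.2, 11.4]]
print(f"ICC(1,1) = {icc_1_1(data):.2f}")
```

Because the raters agree closely relative to the spread between subjects, the ICC here is close to 1.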
Lin's concordance correlation coefficient (ρc) measures the concordance between two observations, one of which is the gold standard. The range of values of Lin's concordance correlation coefficient is the same as for Cohen's kappa coefficient. The interpretation of ρc takes into account the scale of measurement, with more strictness for continuous measurements (Table 6) [141, 142]. For intra- and interobserver agreement, Martins and Nastri [142] introduced a metric called limits of agreement (LoA) and proposed a cutoff < 5% for very good reliability/agreement.
Table 6.
Intra- and interclass correlation coefficients and concordance correlation coefficient: an empirical assessment of the strength of agreement.
Agreement | Continuous measurement | Ultrasound fetal measurements | Semiautomated measurements |
---|---|---|---|
Very good | ρc > 0.99 | ρc > 0.998 | ρc > 0.90 |
Good | 0.95 < ρc ≤ 0.99 | 0.99 < ρc ≤ 0.998 | 0.80 < ρc ≤ 0.90 |
Moderate | 0.90 < ρc ≤ 0.95 | 0.98 < ρc ≤ 0.99 | 0.65 < ρc ≤ 0.80 |
Poor | 0.70 < ρc ≤ 0.90 | 0.95 < ρc ≤ 0.98 | ρc < 0.65 |
Very poor | ρc < 0.70 | ρc < 0.95 | |
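Lin's ρc can be computed from its standard definition; the sketch below uses hypothetical paired measurements, and a full analysis would also report the associated 95% confidence interval.

```python
import numpy as np

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient between two sets of
    measurements (e.g., a new method vs. a gold standard)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxy = np.mean((x - x.mean()) * (y - y.mean()))
    # Penalizes both poor correlation and location/scale shifts.
    return 2 * sxy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

# Hypothetical paired measurements: reference method vs. a new device.
reference = [2.1, 3.4, 2.8, 4.0, 3.1, 2.5]
new_method = [2.2, 3.3, 2.9, 4.1, 3.0, 2.6]
print(f"CCC = {lins_ccc(reference, new_method):.3f}")
```

Unlike Pearson's r, the CCC drops below 1 when one method is systematically shifted or scaled relative to the other, which is why it is preferred for concordance.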
Reporting the ICC and/or CCC along with associated 95% confidence intervals is good practice for agreement analysis. The results are reported in both primary (such as reliability analysis of the Microbleed Anatomical Rating Scale in the evaluation of microbleeds [143], automatic analysis of relaxation parameters of the upper esophageal sphincter [144], and the use of signal intensity weighted centroid in magnetic resonance images of patients with discs degeneration [145]) and secondary research studies (systematic review and/or meta-analysis: evaluation of the functional movement screen [146], evaluation of the Manchester triage scale on an emergency department [147], reliability of the specific physical examination tests for the diagnosis of shoulder pathologies [148], etc.).
Altman and Bland criticized the use of correlation (a measure of association; it is not correct to infer from it that two methods can be used interchangeably), linear regression analysis (the method has several assumptions that must be checked before application, and the assessment of residuals is mandatory for a proper interpretation), and the difference between means as methods for comparing two measurements of the same quantity [136, 149, 150]. They proposed a graphical method, the B&A plot, to analyze the agreement between two quantitative measurements by studying the mean difference and constructing limits of agreement [136, 137]. Whenever a gold standard method exists, the difference between the two methods is plotted against the reference values [151]. Although the B&A plot provides the limits of agreement, no information regarding the acceptability of these boundaries is supplied; the acceptable limits must be defined a priori based on clinical significance [150]. The B&A plot is informally interpreted in terms of bias (How big is the average discrepancy between the investigated methods? Is the difference large enough to be clinically relevant?), equivalence (How wide are the limits of agreement? Limits wider than those defined clinically indicate ambiguous results, while narrow limits and small bias suggest that the two methods are equivalent), and trend and variability (Are the dots homogeneous?).
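The quantities behind a B&A plot, the bias and the 95% limits of agreement, can be computed as below; the paired measurements are hypothetical, and plotting is described in the comments.

```python
import numpy as np

# Hypothetical paired measurements of the same quantity by two methods.
method_a = np.array([5.1, 6.3, 4.8, 7.0, 5.9, 6.5, 5.4, 6.1])
method_b = np.array([5.0, 6.5, 4.9, 6.8, 6.1, 6.4, 5.2, 6.3])

diff = method_a - method_b
mean_pair = (method_a + method_b) / 2      # x-axis of the B&A plot

bias = diff.mean()                         # average discrepancy
sd_diff = diff.std(ddof=1)
loa_low, loa_high = bias - 1.96 * sd_diff, bias + 1.96 * sd_diff

# The B&A plot scatters `diff` against `mean_pair` and draws horizontal
# lines at `bias`, `loa_low`, and `loa_high`; whether these limits are
# acceptable must be judged against a priori clinical criteria.
print(f"bias = {bias:.3f}, 95% limits of agreement ({loa_low:.3f}, {loa_high:.3f})")
```

A bias near zero with narrow limits of agreement (relative to the clinically defined acceptable difference) supports the equivalence of the two methods.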
The 95% confidence intervals associated with the ICC, CCC, or kappa statistics and the corresponding significance tests are implemented in commercial or free statistical programs (such as SPSS, MedCalc, SAS, STATA, R, and PASS-NCSS) or can be found freely available online (e.g., VassarStats © Richard Lowry 2001–2018, http://vassarstats.net/kappa.html; KappaCalculator © Statistics Solutions 2018, http://www.statisticssolutions.com/KappaCalculator.html; and KappaAcc, Bakeman's Programs, http://bakeman.gsucreate.org/kappaacc/; all accessed August 27, 2018).
3.3. Accuracy Analysis
The accuracy of a diagnostic test reflects the extent to which the test gives the right answer, and the evaluation is done relative to the best available test (also known as the gold standard or reference test, a hypothetical ideal test with sensitivity (Se) = 100% and specificity (Sp) = 100%) able to reveal the right answer. Microscopic examination is considered the gold standard in the diagnosis process but cannot be applied to every disease (e.g., stable coronary artery disease [152], rheumatologic diseases [153], psychiatric disorders [154], and rare diseases without a fully developed histological assessment [155]).
The factors that could affect the accuracy of the diagnostic test can be summarized as follows [156, 157]: sampling bias, imperfect gold standard test, artefactual variability (e.g., changes in prevalence due to inappropriate design) or clinical variability (e.g., patient spectrum and “gold-standard” threshold), subgroups differences, or reader expectations.
Several metrics calculated from the 2 × 2 contingency table are frequently used to assess the accuracy of a diagnostic test. A gold standard or reference test is used to classify each subject either in the group with the disease or in the group without the disease of interest. Whatever the type of data of the diagnostic test, a 2 × 2 contingency table can be created and used to compute the accuracy metrics. The generic structure of a 2 × 2 contingency table is presented in Table 7; if the diagnostic test has high accuracy, a significant association with the reference test is observed (significant Chi-square test or equivalent; for details, see [158]).
Table 7.
2 × 2 contingency generic table.
Diagnostic test result | Disease present | Disease absent | Total |
---|---|---|---|
Positive | TP (true positive) | FP (false positive) | TP + FP |
Negative | FN (false negative) | TN (true negative) | FN + TN |
Total | TP + FN | FP + TN | n = TP + FP + FN + TN |
Total on the rows represents the number of subjects with positive and respectively negative test results; total on the columns represents the number of subjects with (disease present) and respectively without (disease absent) the disease of interest; and the classification as test positive/test negative is done using the cutoff value for ordinal and continuous data.
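The Chi-square statistic mentioned above can be computed directly from the four cell counts of the generic table in Table 7. A minimal sketch (not part of the source; the function name and the example counts are illustrative):

```python
def chi_square_2x2(tp, fp, fn, tn):
    """Pearson chi-square statistic (1 degree of freedom) for a 2 x 2
    contingency table laid out as in Table 7
    (rows: test result; columns: disease status)."""
    n = tp + fp + fn + tn
    return n * (tp * tn - fp * fn) ** 2 / (
        (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)
    )
```

A value above the critical threshold of 3.84 (alpha = 0.05, 1 df) indicates a significant association between the diagnostic test and the reference classification.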
Several standard indicators and three additional metrics useful in the assessment of the accuracy of a diagnostic test are briefly presented in Tables 8 and 9.
Table 8.
Standard statistic indicators used to evaluate diagnostic accuracy.
Statistic (Abb) | Formula | Remarks |
---|---|---|
Sensitivity (Se) | TP/(TP + FN) | (i) The higher the Se, the smaller the number of false-negative results (ii) High Se: (a) a negative result rules out (SnNOUT) (b) suitable for screening (ruling out) |
| ||
Specificity (Sp) | TN/(TN + FP) | (i) The higher the Sp, the smaller the number of false-positive results (ii) High Sp: (a) a positive result rules in (SpPIN) (b) suitable for diagnosis (ruling in) |
| ||
Accuracy index (AI) | (TP + TN)/(TP + FP + FN + TN) | (i) Gives information regarding the proportion of cases with the right diagnosis (ii) It is difficult to convert its value to a tangible clinical concept (iii) It is affected by the prevalence of the disease |
| ||
Youden's index (J) [159] | Se + Sp − 1 | (i) Reflects the cases wrongly classified by the diagnostic test (ii) Assesses the overall performance of the test: J = 0 if the proportion of positive tests is the same in the groups with and without the disease; J = 1 if no FPs or FNs exist (iii) Misleading interpretation when comparing the effectiveness of two tests (iv) Used to identify the best cutoff in ROC analysis: its maximum value corresponds to the largest distance from the diagonal |
| ||
Positive predictive value (PPV)∗ | TP/(TP + FP) | (i) Answers the question “What is the chance that a person with a positive test truly has the disease?” (ii) Clinical applicability for a particular subject with a positive test result (iii) It is affected by the prevalence of the disease |
| ||
Negative predictive value (NPV)∗ | TN/(TN + FN) | (i) Answers the question “What is the chance that a person with a negative test truly does not have the disease?” (ii) Clinical applicability for a particular subject with a negative test result (iii) It is affected by the prevalence of the disease |
| ||
Positive likelihood ratio (PLR/LR+)∗ | Se/(1 − Sp) | (i) Indicates how much the odds of the disease increase when a test is positive (indicator to rule-in) (ii) PLR (the higher, the better) (a) > 10 ⟶ convincing diagnostic evidence (b) 5 < PLR < 10 ⟶ strong diagnostic evidence |
| ||
Negative likelihood ratio (NLR/LR−)∗ | (1 − Se)/Sp | (i) Indicates how much the odds of the disease decrease when a test is negative (indicator to rule out) (ii) NLR (the lower, the better) (a) < 0.1 ⟶ convincing diagnostic evidence (b) 0.1 < NLR < 0.2 ⟶ strong diagnostic evidence |
Diagnostic odds ratio (DOR)∗∗ [160] | (TP/FN)/(FP/TN) [Se/(1 − Se)]/[(1 − Sp)/Sp] [PPV/(1 − PPV)]/[(1 − NPV)/NPV] PLR/NLR |
(i) A high DOR indicates better diagnostic test performance (DOR ranges from 0 to infinity); a value of 1 indicates a test unable to discriminate between those with and those without the disease (ii) Combines the strengths of Se and Sp (iii) Useful to compare different diagnostic tests (iv) Less useful when the aim is to rule in or rule out (v) A convenient indicator in meta-analysis |
| ||
Posttest odds (PTO)∗ Posttest probability (PTP)∗ |
Pretest odds (prevalence/(1 − prevalence)) × LR PTO/(PTO + 1) |
(i) Gives the odds that the patient has the target disorder after the test is carried out (ii) Gives the proportion of patients with that particular test result who have the target disorder |
All indicators excepting J are reported with associated 95% confidence intervals; ROC = receiver operating characteristic; ∗patient-centered indicator; TP = true positive; FP = false positive; FN = false negative; TN = true negative; and PPV and NPV depend on the prevalence (to be used only if (no. of subjects with the disease)/(no. of subjects without the disease) is equivalent to the prevalence of the disease in the studied population).
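The indicators in Table 8 can be obtained directly from the four cell counts of the 2 × 2 table. A minimal sketch, not from the source (function names are ours; the Wilson score interval is one common choice for the 95% confidence interval of proportions such as Se and Sp):

```python
from math import sqrt

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for a proportion k/n (e.g., Se or Sp)."""
    p = k / n
    centre = (p + z * z / (2 * n)) / (1 + z * z / n)
    half = (z / (1 + z * z / n)) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

def diagnostic_metrics(tp, fp, fn, tn):
    """Point estimators of the standard indicators in Table 8."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    return {
        "Se": se, "Sp": sp,
        "PPV": tp / (tp + fp), "NPV": tn / (tn + fn),
        "PLR": se / (1 - sp), "NLR": (1 - se) / sp,
        "AI": (tp + tn) / (tp + fp + fn + tn),
        "J": se + sp - 1,
        "DOR": (tp / fn) / (fp / tn),
    }
```

For the cell counts TP = 67, FP = 24, FN = 33, TN = 176 used in the worked example below, the sketch returns Se = 0.67, Sp = 0.88, PLR ≈ 5.58, and NLR = 0.375.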
Table 9.
Other metrics used to evaluate diagnosis accuracy.
Statistic (Abb) | Formula | Remarks |
---|---|---|
Number needed to diagnose (NND) [161] | 1/[Se − (1 − Sp)] = 1/J | (i) The number of patients who need to be tested to give one correct positive test result (ii) Used to compare the costs of different tests |
| ||
Number needed to misdiagnose (NNM) [162] | 1/[1 − (TP + TN)/n] | (i) The higher the NNM, the better the diagnostic test |
| ||
Clinical utility index (CUI) [163, 164] | CUI+ = Se × PPV CUI− = Sp × NPV |
(i) Gives the degree to which a diagnostic test is useful in clinical practice (ii) Interpretation: CUI > 0.81 ⟶ excellent utility; 0.64 ≤ CUI < 0.81 ⟶ good utility; 0.49 ≤ CUI < 0.64 ⟶ fair utility; 0.36 ≤ CUI < 0.49 ⟶ poor utility; and CUI < 0.36 ⟶ very poor utility |
Abb = abbreviation; all indicators excepting J are reported with associated 95% confidence intervals; TP = true positive; FP = false positive; FN = false negative; and TN = true negative.
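The three additional metrics in Table 9 are one-line computations. A hedged sketch (function names are ours), following the formulas in the table:

```python
def nnd(se, sp):
    """Number needed to diagnose = 1/[Se - (1 - Sp)] = 1/J."""
    return 1 / (se - (1 - sp))

def nnm(tp, fp, fn, tn):
    """Number needed to misdiagnose = 1/(1 - accuracy index)."""
    n = tp + fp + fn + tn
    return 1 / (1 - (tp + tn) / n)

def cui(se, sp, ppv, npv):
    """Clinical utility indexes, returned as (CUI+, CUI-)."""
    return se * ppv, sp * npv
```

With Se = 0.67 and Sp = 0.88, for instance, NND = 1/J = 1/0.55 ≈ 1.82, i.e., roughly two patients must be tested per correct positive result.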
The reflection of a positive or negative diagnosis on the probability that a patient has or does not have a particular disease can be investigated using Fagan's diagram [165]. Fagan's nomogram is frequently referred to in the context of evidence-based medicine, reflecting the decision-making for a particular patient [166]. The Bayes' theorem nomogram was published in 2011, the method incorporating in the prediction of the posttest probability the following metrics: pretest probability, pretest odds (for and against), PLR or NLR, posttest odds (for and against), and posttest probability [167]. The latest form of Fagan's nomogram, called the two-step Fagan's nomogram, considers the pretest probability, Se (Se of the test for PLR), LRs, and Sp (Sp of the test for NLR) in predicting the posttest probability [166].
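Fagan's nomogram is the graphical counterpart of a short Bayes computation: convert the pretest probability to odds, multiply by the likelihood ratio, and convert back. A minimal sketch, not from the source:

```python
def posttest_probability(pretest_p, lr):
    """Bayes update behind Fagan's nomogram:
    probability -> odds -> multiply by LR -> back to probability."""
    pretest_odds = pretest_p / (1 - pretest_p)
    posttest_odds = pretest_odds * lr
    return posttest_odds / (1 + posttest_odds)
```

For example, a pretest probability of 0.2 combined with PLR = 5.58 raises the probability of disease to about 0.58, while the same pretest probability with NLR = 0.375 lowers it to about 0.09.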
Receiver operating characteristic (ROC) analysis is conducted to investigate the accuracy of a diagnostic test when the outcome is quantitative or ordinal with at least five classes [168, 169]. ROC analysis evaluates the ability of a diagnostic test to discriminate positive from negative cases. Several metrics are reported in ROC analysis for the evaluation of a diagnostic test, and the most frequently used are described in Table 10 [170, 171]. The closer the curve is to the upper-left corner of the graph, the better the test. Different metrics are used to choose the cutoff with the optimum Se and Sp, such as Youden's index (J, maximum), d² ((1 − Se)² + (1 − Sp)², minimum), the weighted number needed to misdiagnose (maximum; considers the pretest probability and the cost of a misdiagnosis) [172], and the Euclidean index [173]. The metric used to identify the best cutoff value is a matter of methodology and is not expected to be reported as a result (reporting a J index of 0.670 for discrimination in small invasive lobular carcinoma [174] is not informative because the same J could be obtained for different pairs of Se and Sp: 0.97/0.70, 0.70/0.97, 0.83/0.84, etc.). Youden's index has been reported as the best metric for choosing the cutoff value [173] but is not able to differentiate between differences in sensitivity and specificity [175]. Furthermore, Youden's index can be used as an indicator of quality when reported with its associated 95% confidence interval, poor quality being indicated by a confidence interval that contains 0.5 [175].
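The nonparametric (empirical) AUC described in Table 10 equals the Mann-Whitney probability that a randomly chosen diseased case scores higher than a randomly chosen control, with ties counted as one half. A minimal sketch with toy scores (not from the source):

```python
def empirical_auc(cases, controls):
    """Empirical AUC as the Mann-Whitney probability that a diseased
    case scores higher than a control (ties count as one half)."""
    wins = ties = 0
    for x in cases:
        for y in controls:
            if x > y:
                wins += 1
            elif x == y:
                ties += 1
    return (wins + 0.5 * ties) / (len(cases) * len(controls))
```

This pair-counting form makes explicit why AUC ≈ 0.5 corresponds to random classification: wins and losses then occur in roughly equal proportions.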
Table 10.
Metrics for global test accuracy evaluation or comparisons of performances of two tests.
Statistic (Abb) | Method | Remarks |
---|---|---|
Area under the ROC curve (AUC) | (i) Nonparametric (no assumptions): empirical method (the estimated AUC is biased if only a few points are on the curve) and smoothed-curve methods such as the kernel density method (not reliable near the extremes of the ROC curve) (ii) Parametric (the distributions of the cases and controls are normal): binormal method (tighter asymptotic confidence bounds for samples of less than 100) |
(i) AUC = 1 ⟶ perfect diagnostic test (perfect accuracy) (ii) AUC ∼ 0.5 ⟶ random classification (iii) 0.9 < AUC ≤ 1 ⟶ excellent accuracy classification (iv) 0.8 < AUC ≤ 0.9 ⟶ good accuracy (v) 0.7 < AUC ≤ 0.8 ⟶ worthless |
| ||
Partial area under the curve (pAUC) | (i) Nonparametric (no assumptions) (ii) Parametric: using the binormal assumption |
(i) Looks at the portion of the AUC over a predefined range of interest (ii) Depends on the scale of possible values in the range of interest (iii) Has less statistical precision compared to the AUC |
| ||
Diagnostic odds ratio (DOR) | (i) Must use the same fixed cutoff (ii) Most useful in a meta-analysis when two or more tests are compared |
(i) DOR = 1 ⟶ the test does not discriminate (ii) DOR increases as the ROC curve moves closer to the top left-hand corner of the ROC plot (iii) The same DOR can be obtained for different combinations of Se and Sp |
| ||
TP fraction for a given FP fraction (TPFFPF) | (i) Need the same false-positive fraction | (i) Useful to compare two different tests at a specific FPF (decided based on clinical reasoning), especially when the ROC curves cross |
| ||
Comparison of two tests | (i) Comparison of AUC of two different tests (ii) Absolute difference (SeA − SeB) or ratio (SeA/SeB), where A is one diagnostic test and B is another diagnostic test |
(i) Apply the proper statistical test; each AUC must be done relative to the “gold-standard” test (ii) Test A better than B if absolute difference is > 0; ratio > 1 |
Abb = abbreviation; all indicators are reported with associated 95% confidence intervals; ∗patient-centered indicator; TP = true positive; FP = false positive; FN = false negative; and TN = true negative.
3.4. Performances of a Diagnostic Test by Examples
The body mass index (BMI) was identified as a predictive marker of breast cancer risk in an Iranian population [176], with an AUC of 0.79 (95% CI: 0.74 to 0.84).
A simulated dataset was used to illustrate how the performances of a diagnostic test can be evaluated, treating the BMI as a marker for breast cancer. The data were simulated from normal distributions for 100 cases with malignant breast tumors and 200 cases with benign breast tumors, with a BMI mean difference of 5.7 kg/m2 (Student's t-test assuming unequal variances: t-stat = 9.98, p < 0.001). The BMI, expressed in kg/m2, varied from 20 to 44 kg/m2, and the ROC curve with the associated AUC is presented in Figure 1.
Figure 1.
Summary receiver operating characteristic (ROC) curve for BMI as an anthropometric marker to distinguish benign from malignant breast tumors. The red line shows an equal proportion of correctly classified breast cancer samples and incorrectly classified samples without breast cancer (random classification). The J max (max(Se + Sp − 1)) corresponds to Se = 0.67 and Sp = 0.88 for a cutoff > 29.5 kg/m2 (BMI) for the breast cancer sample.
The ROC curve graphically represents the pairs of Se and (1 − Sp) for different cutoff values. The AUC of 0.825 proved significantly different from 0.5 (p < 0.001), and the point estimator indicates a good accuracy, but if the evaluation is based on the interpretation of the 95% lower bound, the BMI turns out to be a worthless test for breast cancer. J reached its maximum at a cutoff of 29.5 kg/m2 and corresponded to a Se of 0.67, a Sp of 0.88, and an AI of 0.81. The PLR of 5.58 indicates that the BMI provides strong diagnostic evidence, but this classification is not supported by the value of the NLR, which exceeds 0.2 (Table 11). A BMI > 29.5 kg/m2 usually occurs in those with breast cancer, while a BMI ≤ 29.5 kg/m2 often occurs in those without breast cancer. At a cutoff of 29.5 kg/m2, the marker is very poor for finding those with breast cancer but is good for screening.
The performance metrics varied with the cutoff value (Table 11). A low cutoff is chosen whenever the aim is to minimize the number of false negatives, assuring a Se of 1 (19.5 kg/m2, TP = 100, Table 11). If a test able to correctly classify the true negatives is desired, the cutoff must be high (38.5 kg/m2, TN = 200, Table 11), assuring a Sp of 1.
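The cutoff scan behind this kind of table can be sketched as a small loop that evaluates Se, Sp, and J at each candidate threshold and keeps the J-maximizing one. The data below are toy values, not the simulated BMI dataset; "score > cutoff" is counted as a positive test, as in the example:

```python
def best_cutoff(diseased, healthy, cutoffs):
    """Return (cutoff, J, Se, Sp) maximizing Youden's J,
    with 'score > cutoff' counted as a positive test."""
    best = None
    for c in cutoffs:
        se = sum(x > c for x in diseased) / len(diseased)
        sp = sum(x <= c for x in healthy) / len(healthy)
        j = se + sp - 1
        if best is None or j > best[1]:
            best = (c, j, se, sp)
    return best
```

Reporting only the winning J is uninformative, as noted earlier; the corresponding Se and Sp (and their confidence intervals) should accompany it.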
Table 11.
Performances metrics for body mass index (BMI) as an anthropometric marker for breast cancer.
Indicator | Cutoff–BMI (kg/m2) | ||||||
---|---|---|---|---|---|---|---|
19.5 | 22.5 | 25.5 | 29.5 | 32.5 | 35.5 | 38.5 | |
TP (true positives) | 100 | 96 | 87 | 67 | 43 | 25 | 13 |
FP (false positives) | 200 | 176 | 117 | 24 | 3 | 1 | 0 |
TN (true negatives) | 0 | 24 | 83 | 176 | 197 | 199 | 200 |
FN (false negatives) | 0 | 4 | 13 | 33 | 57 | 75 | 87 |
Se (sensitivity) | 1 | 1 | 0.87 | 0.67 | 0.43 | 0.25 | 0.13 |
Sp (specificity) | 0 | 0.10 | 0.42 | 0.88 | 0.99 | 0.99 | 1 |
PPV (positive predictive value) | 0.33 | 0.40 | 0.43 | 0.74 | 0.94 | 0.96 | 1 |
NPV (negative predictive value) | n.a. | 0.90 | 0.87 | 0.84 | 0.78 | 0.73 | 0.70 |
PLR (positive likelihood ratio) | 1.00 | 1.10 | 1.49 | 5.58 | 28.7 | 50.0 | n.a. |
NLR (negative likelihood ratio) | n.a. | 0.30 | 0.31 | 0.38 | 0.58 | 0.75 | 0.84 |
AI (accuracy index) | 0.33 | 0.40 | 0.57 | 0.81 | 0.80 | 0.75 | 0.71 |
CUI+ (clinical utility index positive) | 0.33 | 0.30 | 0.37 | 0.47 | 0.40 | 0.24 | 0.13 |
CUI− (clinical utility index negative) | n.a. | 0.10 | 0.36 | 0.74 | 0.76 | 0.72 | 0.70 |
The analysis of the performance metrics for our simulated dataset showed that the maximum CUI+ and CUI− values were obtained for the cutoff value identified by the J index, supporting the usefulness of the BMI for screening, not for case finding.
Accuracy analysis is frequently reported in the scientific literature, in both primary and secondary studies. Different actors, such as authors, reviewers, and editors, contribute to the quality of the reported statistics. The evaluation of plasma chitotriosidase as a biomarker in critical limb ischemia reported the AUC with the associated 95% confidence interval and cutoff values [59], but no information on patient-centered metrics or utility indications was provided. Similar parameters to those reported by Ciocan et al. [59] have also been reported in the evaluation of sonoelastographic scores in the differentiation of benign from malignant cervical lymph nodes [45]. Lei et al. conducted a secondary study to evaluate the accuracy of digital breast tomosynthesis versus digital mammography in discriminating between malignant and benign breast lesions and correctly reported Se, Sp, PLR, NLR, and DOR for both the studies included in the analysis and the pooled values [97]. However, insufficient details are provided regarding the ROC analysis (e.g., no confidence intervals for the AUCs are reported) or any utility index [97]. Furthermore, Lei et al. reported the Q∗ index, which reflects the point on the SROC (summary receiver operating characteristic) curve at which Se equals Sp and which can be useful in specific clinical situations [97].
The number needed to diagnose (NND) and the number needed to misdiagnose (NNM) are currently used in the identification of the cutoff value for continuous diagnostic test results [172, 177], in methodological articles, and in teaching materials [161, 178, 179]. The NND and NNM are less frequently reported in the evaluation of the accuracy of a diagnostic test. Several examples identified in the available scientific literature are as follows: color duplex ultrasound in the diagnosis of carotid stenosis [180], culture-based diagnosis of tuberculosis [181], prostate-specific antigen [182, 183], endoscopic ultrasound-guided fine needle biopsy with a 19-gauge flexible needle [184], number needed to screen-prostate cancer [185, 186], the integrated positron emission tomography/magnetic resonance imaging (PET/MRI) for segmental detection/localization of prostate cancer [187], serum malondialdehyde in the evaluation of exposure to chromium [188], the performances of matrix metalloproteinase-7 (MMP-7) in the diagnosis of epithelial injury and of biliary atresia [189], lactate as a diagnostic marker of pleural and abdominal exudate [190], the Gram stain from a joint aspiration in the diagnosis of pediatric septic arthritis [191], and the performances of a sepsis algorithm in an emergency department [192]. Unfortunately, the NND or NNM point estimators are not always reported with the associated 95% confidence intervals [161, 180, 181, 186, 187, 190, 191].
Reporting of the clinical utility index (CUI) is more frequently seen in the evaluation of questionnaires. The grades, not the values, of the CUIs were reported by Michell et al. [193] in the assessment of a semistructured diagnostic interview as a diagnostic tool for major depressive disorder. Johansson et al. [194] reported both the CUI+ value and its interpretation in cognitive evaluation using Cognistat. The CUI+/CUI− reported by Michell et al. [195] for the patient health questionnaires for depression in primary care (PHQ-9 and PHQ-2) are given as values with associated 95% confidence intervals as well as interpretations. The CUI+ and CUI− values and associated confidence intervals were also reported by Fereshtehnejad et al. [196] in the evaluation of the screening questionnaire for Parkinsonism, but just for the significant items. Fereshtehnejad et al. [196] also used the values of CUI+ and CUI− to select the optimal screening items, whenever the point estimator exceeded 0.63. Bartoli et al. [197] represented the CUI values graphically as column bars (not necessarily correct, since the CUI is a single value and a column could suggest a range of values) in the evaluation of a questionnaire for alcohol use disorder on different subgroups. The accurate reporting of CUIs as values with associated confidence intervals can also be seen in some articles [198, 199], but it is not a common practice [200–207].
Besides the commercial statistical programs able to assist researchers in conducting an accuracy analysis for a diagnostic test, several free online (Table 12) or offline applications exist (CATmaker [208] and CIcalculator [209]).
Table 12.
Online applications for diagnostic tests: characteristics.
Name | Input | Output |
---|---|---|
Diagnostic test calculatora | TP, FP, TN, FN OR Prevalence AND Se AND Sp AND sample size OR Prevalence AND PLR AND NLR AND sample size |
Prevalence AND Se AND Sp AND PLR AND NLR Fagan diagram |
| ||
Diagnostic test calculator evidence-based medicine toolkitb | TP, FP, TN, FN | Se, Sp, PPV, NPV, PLR, NLR with associated 95% confidence intervals Posttest probability graph |
| ||
MedCalc: Bayesian analysis modelc | Prevalence AND Se AND Sp OR TP, FP, TN, FN |
PPV, NPV, LPR, NLR, posttest probability |
| ||
MedCalcd | TP, FP, TN, FN | Se, Sp, PPV, NPV, PLR, NLR, prevalence, AI with associated 95% confidence intervals |
| ||
Clinical calculator 1e | TP, FP, TN, FN | Se, Sp, PPV, NPV, PLR, NLR, prevalence, AI with associated 95% confidence intervals |
| ||
Clinical utility index calculatorf | TP, TN, total number of cases, the total number of noncases | Se, Sp, PPV, NPV, PLR, NLR, prevalence, AI with associated 95% confidence intervals |
| ||
DiagnosticTestg | Number of positive and negative gold standard results for each level of the new diagnostic test | Se, Sp, PPV, NPV, PLR, NLR, AI, DOR, Cohen's kappa, entropy reduction, and a bias index; ROC curve if > 2 levels, for all possible cutoffs |
| ||
Simple ROC curve analysish | Absolute frequencies for false positive and the true positive for up to ten diagnostic levels | Cumulative rates (false positive and true positive) and ROC curve (equation, R2, and AUC) |
| ||
ROC analysisi | Five different types of input data: an example for each type is provided | Se, Sp, AI, positive cases missed, negative cases missed, AUC, ROC curve |
| ||
AUSVET: EpiToolsj | TP, FP, TN, FN | Different tools from basic accuracy to comparison of two diagnostic tests to ROC analysis |
All URLs were retrieved on April 20, 2019. TP = true positive; FP = false positive; FN = false negative; TN = true negative; Se = sensitivity; Sp = specificity; AI = accuracy index; PPV = positive predictive value; NPV = negative predictive value; PLR = positive likelihood ratio; NLR = negative likelihood ratio; DOR = diagnostic odds ratio; ROC = receiver operating characteristic; AUC = area under the ROC curve; ahttp://araw.mede.uic.edu/cgi-bin/testcalc.pl; bhttps://ebm-tools.knowledgetranslation.net/calculator/diagnostic/; chttp://www.medcalc.com/bayes.html; dhttps://www.medcalc.org/calc/diagnostic_test.php; ehttp://vassarstats.net/clin1.html; fhttp://www.psycho-oncology.info/cui.html; ghttp://www.openepi.com/DiagnosticTest/DiagnosticTest.htm; hhttp://vassarstats.net/roc1.html; ihttp://www.rad.jhmi.edu/jeng/javarad/roc/JROCFITi.html; and jhttp://epitools.ausvet.com.au/content.php?page=TestsHome.
Smartphone applications have also been developed to assist daily clinical practice. The free DocNomo application for iPhone/iPad [210] allows calculation of the posttest probability using the two-step Fagan nomogram. Other available applications are Bayes' posttest probability calculator, EBM Tools app, and EBM Stats Calc. Allen et al. [211] and Power et al. [212] implemented two online tools for the visual examination of the effect of Se, Sp, and prevalence on TP, FP, FN, and TN values and for the evaluation of the clinical accuracy and utility of a diagnostic test [213]. Furthermore, they have under construction a tool for the evaluation of the uncertainties in assessing test accuracy when the reference standard is not perfect, as support for evidence-based practice.
4. Cost-Benefit Analysis
The studies conducted in phases III and IV of the investigation of a diagnostic test can be covered under the generic name of cost-benefit analysis. Different aspects of the benefit can be investigated, such as societal impact (the impact on the society), cost-effectiveness (affordability), clinical efficacy or effectiveness (effects on the outcome), cost-consequence analysis, cost-utility analysis, sensitivity analysis (probability of disease and/or recurrence, cost of tests, impact on QALY (quality-adjusted life-year), and impact of treatment), and analytical performances (precision, linearity, and cost-effectiveness ratio) [214]. Thus, the benefits of a diagnostic test can be investigated from different perspectives (e.g., societal, health-care system, and health-care provider) and considering different items (e.g., productivity, patient and family time, medication, and physician time) [215]. Furthermore, an accurate comparison of two diagnostic tests must consider both the accuracy and the benefit/harm in the assessment of clinical utility [216, 217]. Generally, the cost-benefit analysis employs multivariate and multifactorial analysis using different designs of experiment, including survival analysis, and the statistical approach is selected according to the aim of the study. Analysis of relationships is done using correlation methods (Pearson's correlation coefficient (r) when the two variables are quantitative and normally distributed and a linear relation is assumed between them; Spearman's (ρ) or Kendall's (τ) correlation coefficient otherwise; it is recommended to use Kendall's tau instead of Spearman's rho when the data have ties [218]) or regression analysis when the nature of the relationship is of interest and an outcome variable exists [219]. The statistical methods applied when a cost-benefit analysis is of interest are not discussed in detail here, but the basic requirements in reporting the results are as follows [220–225]:
Correlation analysis: give summary statistics according to the distribution of the data (with associated 95% confidence intervals when appropriate, for both baseline and outcome data), provide a graphical representation as a scatter plot, use the correct symbol for the correlation coefficient and report the P value along with the sample size, report missing data, and report the check for influential points/outliers.
Multivariate or multifactorial analysis: summarize the check of assumptions (plots, tests, and indicators), provide the plot of the model, give the model with its coefficients, the standard errors of the coefficients and associated P values or 95% confidence intervals, the determination coefficient of the model, the standard error of the model, and the statistic and P value of the model; provide the sample size, give the number of missing data for each predictor, and report adjusted and unadjusted metrics (e.g., OR in logistic regression and HR (hazard ratio) in survival analysis).
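The correlation coefficients named above can be computed without external libraries: Pearson's r from the raw values and Spearman's ρ as Pearson's r applied to midranks (ties sharing the average rank). A minimal sketch with illustrative data, not from the source:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x)
                      * sum((b - my) ** 2 for b in y))

def midranks(v):
    """Ranks with ties replaced by their average (midranks)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rho = Pearson's r applied to the midranks."""
    return pearson_r(midranks(x), midranks(y))
```

The ties handled by the midranks are exactly the situation in which Kendall's tau is recommended over Spearman's rho, as noted above.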
Miglioretti et al. [98] investigated the link between the radiation exposure of children through CT examination and the risk of cancer. They reported a trend in CT use, which increased from 1996 to 2005, reached a plateau between 2005 and 2007, and then decreased until 2010. The number of CT scans was reported per 1,000 children. For each anatomical CT scan, the average effective doses were expressed as means and percentiles (25th, 50th, 75th, and 95th), while the doses exceeding 20 mSv were reported as percentages. The mean organ dose was also reported, as well as the lifetime attributable risk of solid cancer or leukemia and the number of CT scans leading to one case of cancer per 10,000 scans [98]. The reported numbers and risks were not accompanied by 95% confidence intervals [98], excepting the estimated total number of future radiation-induced cancers related to pediatric CT use (which they named an uncertainty limit).
Dinh et al. [99] evaluated the effectiveness of a combined screening test (fecal immunological test and colonoscopy) for colorectal cancer using the Archimedes model (human physiology, diseases, interventions, and health-care systems [226]). The reported results, besides frequently used descriptive metrics, are the health utility score [227], cost per person, quality-adjusted life-years (QALYs) gained per person, and cost per QALY gained, given as numerical point estimators not accompanied by 95% confidence intervals.
Westwood et al. [228] conducted a secondary study to evaluate the performances of high-sensitivity cardiac troponin (hs-cTn) assays in ruling out patients with acute myocardial infarction (AMI). Clinical effectiveness, using metrics such as Se, Sp, NLR, and PLR (for both any threshold and the 99th percentile threshold), was reported with associated 95% confidence intervals. As cost-effectiveness metrics, the long-term costs, cost per life-year (LY) gained, quality-adjusted life-years (QALYs), and cost/QALY were reported with associated 95% confidence intervals for different Tn testing methods. Furthermore, the incremental cost-effectiveness ratio (ICER) was used to compare the mean costs of two Tn testing methods, along with multivariate analysis (reported as estimates, standard errors of the estimates, and the distribution of the data).
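The ICER mentioned above is the extra cost per extra QALY when moving from one testing strategy to another. A minimal sketch with made-up numbers (not the values reported by Westwood et al.):

```python
def icer(cost_new, qaly_new, cost_old, qaly_old):
    """Incremental cost-effectiveness ratio:
    extra cost per extra QALY of the new strategy versus the old one."""
    return (cost_new - cost_old) / (qaly_new - qaly_old)
```

For instance, a strategy costing 12,000 and yielding 8.5 QALYs, compared with one costing 10,000 and yielding 8.0 QALYs, gives an ICER of 4,000 per QALY gained, which is then judged against a willingness-to-pay threshold.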
Tiernan et al. [100] reported the changes in the clinical practice for the diagnosis of latent tuberculosis infection (LTBI) with interferon-gamma release assay, namely, QuantiFERON-TB Gold In-Tube (QFT, Cellestis, Australia). Unfortunately, the reported outcome was limited to the number of changes in practice due to QFT as absolute frequency and percentages [100].
5. Limitations and Perspectives
The current paper presents neither details regarding the research methodology for diagnostic studies nor the critical appraisal of papers presenting the performances of a diagnostic test, because these are beyond its aim. Extensive scientific literature exists regarding both the design of experiments for diagnostic studies [4, 15, 92, 229, 230] and the critical evaluation of a diagnostic paper [231–234]. As a consequence, neither the effect of the sample size on the accuracy parameters, nor the a priori computation of the sample size needed to reach the level of significance for a specific research question, nor the a posteriori calculation of the power of the diagnostic test is discussed. Approaches for sample size calculation in diagnostic studies are presented in the scientific literature [235–238], but they must be used with caution because the calculations are sensitive and input data from one population are not a reliable solution for another population; hence, the input data for a sample size calculation should come from a pilot study. This paper also does not treat how to select a diagnostic test in clinical practice, a topic treated by evidence-based medicine and clinical decision making [239–241].
Health-care practice is a dynamic field that records rapid changes due to the evolution of known diseases, the emergence of new pathologies, the life expectancy of the population, progress in information theory, communication and computer sciences, and the development of new materials and approaches as solutions for medical problems. The concept of personalized medicine changes the way of health care: the patient becomes the core of the decisional process, and the applied diagnostic methods and/or treatment closely fit the needs and particularities of the patient [242]. Different diagnostic or monitoring devices, such as wearable health monitoring systems [243, 244], liquid biopsy or associated approaches [245, 246], wireless ultrasound transducers [247], or other point-of-care testing (POCT) methods [248, 249], are being introduced and need proper analysis and validation. Furthermore, the availability of big data opens a new pathway in analyzing medical data, and artificial intelligence approaches will probably change the way of imaging diagnosis and monitoring [250, 251]. The ethical aspects must be considered [252, 253], and valid and reliable methods for the assessment of old and new diagnostic approaches are required. Space for methodological improvement exists, from designing the experiments to analyzing the experimental data, for both observational and interventional approaches.
6. Concluding Remarks
Any diagnostic test falls between a perfect and a useless test, and no diagnostic test can tell us with certainty whether a patient has a particular disease. No ideal diagnostic test exists, so every test has false-positive and false-negative results.
The metrics reported in the assessment of the precision (variability analysis) or accuracy of a diagnostic test must be presented as point estimates with their associated 95% confidence intervals, and the thresholds for interpretation should be applied to the confidence intervals.
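As an illustration, a point estimate such as sensitivity can be accompanied by a Wilson score 95% confidence interval, which behaves well even near 0 or 1. This is a minimal sketch with hypothetical counts; the Wilson interval is only one of several valid interval methods for proportions (Clopper–Pearson exact intervals are a common alternative):

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score 95% confidence interval for a proportion
    (e.g., sensitivity = true positives / diseased subjects)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Hypothetical example: 90 true positives among 100 diseased subjects
lo, hi = wilson_ci(90, 100)
print(f"Sensitivity 0.90, 95% CI ({lo:.3f}, {hi:.3f})")
```

Interpretation thresholds (e.g., "sensitivity above 0.80") would then be checked against the interval bounds, not against the point estimate alone.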
The correct evaluation of two methods measuring the same outcome is done with the Bland–Altman plot (which evaluates the bias of the difference between the two methods), not with correlation or agreement analysis (which assesses the association between two measurements).
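The core Bland–Altman quantities, the bias (mean difference) and the 95% limits of agreement, can be sketched as follows. The paired readings below are hypothetical; a full analysis would also plot the differences against the pairwise means and report confidence intervals for the bias and the limits:

```python
import statistics

def bland_altman(m1, m2):
    """Bias (mean difference) and 95% limits of agreement (bias +/- 1.96 SD)
    for two methods measuring the same quantity on the same subjects."""
    diffs = [a - b for a, b in zip(m1, m2)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical paired readings from two devices
method_a = [5.1, 4.8, 6.0, 5.5, 4.9, 5.7]
method_b = [5.0, 4.9, 5.8, 5.6, 4.7, 5.5]
bias, loa_low, loa_high = bland_altman(method_a, method_b)
print(f"bias={bias:.3f}, LoA=({loa_low:.3f}, {loa_high:.3f})")
```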
A gold standard test is mandatory in the evaluation of the accuracy of a test. Sensitivity and specificity, each with its 95% confidence interval, are reported together to allow a proper interpretation of accuracy. Based on these values, the clinical utility index is used to support rule-in and/or rule-out decisions and thus the usefulness of a diagnostic test for identifying the disease or for screening, respectively.
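A minimal sketch of the clinical utility indexes, assuming the common definitions CUI+ = sensitivity × PPV (rule-in, case finding) and CUI− = specificity × NPV (rule-out, screening); the input values below are hypothetical:

```python
def clinical_utility_index(se, sp, ppv, npv):
    """Clinical utility indexes under the assumed definitions:
    CUI+ = sensitivity x PPV (rule-in), CUI- = specificity x NPV (rule-out)."""
    return se * ppv, sp * npv

# Hypothetical values: Se=0.90, Sp=0.95, PPV=0.885, NPV=0.957
cui_pos, cui_neg = clinical_utility_index(0.90, 0.95, 0.885, 0.957)
print(f"CUI+={cui_pos:.3f}, CUI-={cui_neg:.3f}")
```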
Positive and negative predictive values can be correctly interpreted only if the prevalence of the disease is known.
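This dependence on prevalence follows from Bayes' theorem. A minimal sketch with hypothetical sensitivity and specificity shows how the PPV collapses at low prevalence while the NPV stays high:

```python
def predictive_values(se, sp, prev):
    """PPV and NPV from sensitivity, specificity, and disease prevalence (Bayes)."""
    ppv = se * prev / (se * prev + (1 - sp) * (1 - prev))
    npv = sp * (1 - prev) / (sp * (1 - prev) + (1 - se) * prev)
    return ppv, npv

# Hypothetical test with Se = 0.90 and Sp = 0.95, at two prevalences
for prev in (0.01, 0.30):
    ppv, npv = predictive_values(0.90, 0.95, prev)
    print(f"prevalence={prev:.2f}: PPV={ppv:.3f}, NPV={npv:.3f}")
```

The same test that is convincing in a high-prevalence clinic may be almost uninformative as a population screening tool, which is exactly why predictive values cannot be interpreted without the prevalence.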
Sensitivity and specificity must be reported whenever Youden's index is given. Report the ROC analysis by providing the AUC with its associated 95% confidence interval, the threshold selected according to Youden's index, and the corresponding sensitivity and specificity with their 95% confidence intervals.
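Selecting the threshold that maximizes Youden's index J = Se + Sp − 1 can be sketched over a set of ROC operating points; the points below are hypothetical (threshold, sensitivity, specificity) triples for a continuous marker:

```python
def youden_threshold(points):
    """Pick the ROC operating point maximizing Youden's index J = Se + Sp - 1.
    `points` is a list of (threshold, sensitivity, specificity) tuples."""
    best = max(points, key=lambda t: t[1] + t[2] - 1)
    threshold, se, sp = best
    return threshold, se, sp, se + sp - 1

# Hypothetical ROC operating points
roc = [(1.0, 0.98, 0.40), (2.0, 0.90, 0.75), (3.0, 0.70, 0.92), (4.0, 0.45, 0.99)]
thr, se, sp, j = youden_threshold(roc)
print(f"optimal threshold={thr}, Se={se}, Sp={sp}, J={j:.2f}")
```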
Report the full descriptive and inferential statistics associated with the benefit analysis. Multivariate or multifactorial analysis can be used to test the cost-benefit of a diagnostic test, and good practice in reporting such analyses must be strictly followed by providing the full model with the values of the coefficients associated with the predictors and their measures of variability, the significance of both the model and each coefficient, and risk metrics with associated 95% confidence intervals where appropriate (e.g., relative risk and hazard ratio).
Conflicts of Interest
The author declares that she has no conflicts of interest.
References
- 1.Singh H. Helping health care organizations to define diagnostic errors as missed opportunities in diagnosis. Joint Commission Journal on Quality and Patient Safety. 2014;40(3):99–101. doi: 10.1016/S1553-7250(14)40012-6. [DOI] [PubMed] [Google Scholar]
- 2.Schiff G. D., Hasan O., Kim S., et al. Diagnostic error in medicine: analysis of 583 physician-reported errors. Archives of Internal Medicine. 2009;169(20):1881–1887. doi: 10.1001/archinternmed.2009.333. [DOI] [PubMed] [Google Scholar]
- 3.Zwaan L., Singh H. The challenges in defining and measuring diagnostic error. Diagnosis. 2015;2(2):97–103. doi: 10.1515/dx-2014-0069. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Sackett D. L., Haynes R. B., Guyatt G. H., Tugwell P. Clinical Epidemiology, A Basic Science for Clinical Medicine. 2nd. Boston, MA, USA: Little Brown; 1991. [Google Scholar]
- 5.Jaeschke R., Guyatt G., Sackett D. L. Users’ guides to the medical literature. III. How to use an article about a diagnostic test. A. Are the results of the study valid? Evidence-Based Medicine Working Group. JAMA. 1994;271(5):389–391. doi: 10.1001/jama.1994.03510290071040. [DOI] [PubMed] [Google Scholar]
- 6.Jaeschke R., Guyatt G. H., Sackett D. L. Users’ guides to the medical literature. III. How to use an article about a diagnostic test. B. What are the results and will they help me in caring for my patients? The Evidence-Based Medicine Working Group. JAMA. 1994;271(9):703–707. doi: 10.1001/jama.1994.03510330081039. [DOI] [PubMed] [Google Scholar]
- 7.Gluud C., Gluud L. L. Evidence based diagnostics. BMJ. 2005;330:724–726. doi: 10.1136/bmj.330.7493.724. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. EQUATOR network, Enhancing the QUAlity and Transparency of health Research, 2019, http://www.equator-network.org.
- 9.Gamble C., Krishan A., Stocken D., et al. Guidelines for the content of statistical analysis plans in clinical trials. JAMA. 2017;318(23):2337–2343. doi: 10.1001/jama.2017.18556. [DOI] [PubMed] [Google Scholar]
- 10.Guilhot J., Baccarani M., Clark R. E., et al. Definitions, methodological and statistical issues for phase 3 clinical trials in chronic myeloid leukemia: a proposal by the European LeukemiaNet. Blood. 2012;119(25):5963–5971. doi: 10.1182/blood-2011-10-383711. [DOI] [PubMed] [Google Scholar]
- 11.Matcham J., Julious S., Pyke S., et al. Proposed best practice for statisticians in the reporting and publication of pharmaceutical industry-sponsored clinical trials. Pharmaceutical Statistics. 2011;10(1):70–73. doi: 10.1002/pst.417. [DOI] [PubMed] [Google Scholar]
- 12.Wang R., Lagakos S. W., Ware J. H., Hunter D. J., Drazen J. M. Statistics in medicine--reporting of subgroup analyses in clinical trials. New England Journal of Medicine. 2007;357(21):2189–2194. doi: 10.1056/NEJMsr077003. [DOI] [PubMed] [Google Scholar]
- 13.Boers M. Graphics and statistics for cardiology: designing effective tables for presentation and publication. Heart. 2018;104:192–200. doi: 10.1136/heartjnl-2017-311581. [DOI] [PubMed] [Google Scholar]
- 14.Lang T. A., Altman D. G. Basic statistical reporting for articles published in biomedical journals: the “statistical analyses and methods in the published literature” or the SAMPL guidelines. International Journal of Nursing Studies. 2015;52(1):5–9. doi: 10.1016/j.ijnurstu.2014.09.006. [DOI] [PubMed] [Google Scholar]
- 15.Whiting P., Rutjes A., Reitsma J., Bossuyt P., Kleijnen J. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Medical Research Methodology. 2003;3(1) doi: 10.1186/1471-2288-3-25. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Whiting P. F., Rutjes A. W. S., Westwood M. E., et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Annals of Internal Medicine. 2011;155(8):529–536. doi: 10.7326/0003-4819-155-8-201110180-00009. [DOI] [PubMed] [Google Scholar]
- 17.Bossuyt P. M., Reitsma J. B., Bruns D. E., et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Clinical Chemistry. 2003;49(1):1–6. doi: 10.1373/49.1.1. [DOI] [PubMed] [Google Scholar]
- 18.Bossuyt P. M., Reitsma J. B., Bruns D. E., et al. Standards for reporting of diagnostic accuracy. Annals of Internal Medicine. 2003;138(1):p. W1. doi: 10.7326/0003-4819-138-1-200301070-00012-w1. [DOI] [PubMed] [Google Scholar]
- 19.Cohen J. F., Korevaar D. A., Altman D. G., et al. STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration. BMJ Open. 2016;6(11) doi: 10.1136/bmjopen-2016-012799.e012799 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Collins G. S., Reitsma J. B., Altman D. G., Moons K. G. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ. 2015;350 doi: 10.1136/bmj.g7594.g7594 [DOI] [PubMed] [Google Scholar]
- 21.Shea B. J., Grimshaw J. M., Wells G. A., et al. Development of AMSTAR: a measurement tool to assess the methodological quality of systematic reviews. BMC Medical Research Methodology. 2007;7(1) doi: 10.1186/1471-2288-7-10. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Shea B. J., Reeves B. C., Wells G., et al. AMSTAR 2: a critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. BMJ. 2017;358 doi: 10.1136/bmj.j4008.j4008 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.The AGREE Collaboration. Writing Group, Cluzeau F. A., Burgers J. S., et al. Development and validation of an international appraisal instrument for assessing the quality of clinical practice guidelines: the AGREE project. Quality and Safety in Health Care. 2003;12(1):18–23. doi: 10.1136/qhc.12.1.18. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Brouwers M. C., Kerkvliet K., Spithoff K. The AGREE reporting checklist: a tool to improve reporting of clinical practice guidelines. BMJ. 2016;352 doi: 10.1136/bmj.i1152.i1152 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Chen Y., Yang K., Marušić A., et al. A reporting tool for practice guidelines in health care: the RIGHT statement. Annals of Internal Medicine. 2017;166(2):128–132. doi: 10.7326/M16-1565. [DOI] [PubMed] [Google Scholar]
- 26.Wilczynski N. L. Quality of reporting of diagnostic accuracy studies: no change since STARD statement publication--before-and-after study. Radiology. 2008;248(3):817–823. doi: 10.1148/radiol.2483072067. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Korevaar D. A., van Enst W. A., Spijker R., Bossuyt P. M., Hooft L. Reporting quality of diagnostic accuracy studies: a systematic review and meta-analysis of investigations on adherence to STARD. Evidence Based Medicine. 2014;19(2):47–54. doi: 10.1136/eb-2013-101637. [DOI] [PubMed] [Google Scholar]
- 28.Gallo L., Hua N., Mercuri M., Silveira A., Worster A. Adherence to standards for reporting diagnostic accuracy in emergency medicine research. Academic Emergency Medicine. 2017;24(8):914–919. doi: 10.1111/acem.13233. [DOI] [PubMed] [Google Scholar]
- 29.Maclean E. N., Stone I. S., Ceelen F., Garcia-Albeniz X., Sommer W. H., Petersen S. E. Reporting standards in cardiac MRI, CT, and SPECT diagnostic accuracy studies: analysis of the impact of STARD criteria. European Heart Journal Cardiovascular Imaging. 2014;15(6):691–700. doi: 10.1093/ehjci/jet277. [DOI] [PubMed] [Google Scholar]
- 30.Chiesa C., Pacifico L., Osborn J. F., Bonci E., Hofer N., Resch B. Early-onset neonatal sepsis: still room for improvement in procalcitonin diagnostic accuracy studies. Medicine. 2015;94:30. doi: 10.1097/MD.0000000000001230.e1230 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Choi Y. J., Chung M. S., Koo H. J., Park J. E., Yoon H. M., Park S. H. Does the reporting quality of diagnostic test accuracy studies, as defined by STARD 2015, affect citation? Korean Journal of Radiology. 2016;17(5):706–714. doi: 10.3348/kjr.2016.17.5.706. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Hong P. J., Korevaar D. A., McGrath T. A., et al. Reporting of imaging diagnostic accuracy studies with focus on MRI subgroup: Adherence to STARD 2015. Journal of Magnetic Resonance Imaging. 2018;47(2):523–544. doi: 10.1002/jmri.25797. [DOI] [PubMed] [Google Scholar]
- 33.Talwar S., Sieberhofer A., Williams B., Ng L. Influence of hypertension, left ventricular hypertrophy, and left ventricular systolic dysfunction on plasma N terminal proBNP. Heart. 2000;83:278–282. doi: 10.1136/heart.83.3.278. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Corradi F., Brusasco C., Vezzani A., et al. Hemorrhagic shock in polytrauma patients: early detection with renal doppler resistive index measurements. Radiology. 2011;260(1):112–118. doi: 10.1148/radiol.11102338. [DOI] [PubMed] [Google Scholar]
- 35.Razavi Z., Ahmadi M. Efficacy of thrice-daily versus twice-daily insulin regimens on glycohemoglobin (Hb A1c) in type 1 diabetes mellitus: a randomized controlled trial. Oman Medical Journal. 2011;26(1):10–13. doi: 10.5001/omj.2011.03. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Gee M. S., Atri M., Bandos A. I., Mannel R. S., Gold M. A., Lee S. I. Identification of distant metastatic disease in uterine cervical and endometrial cancers with FDG PET/CT: analysis from the ACRIN 6671/GOG 0233 multicenter trial. Radiology. 2018;287(1):176–184. doi: 10.1148/radiol.2017170963. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Rerucha C. M., Caro R. J., Wheeler V. L. Cervical cancer screening. American Family Physician. 2018;97(7):441–448. [PubMed] [Google Scholar]
- 38.Badrick T. Evidence-based laboratory medicine. Clinical Biochemist Reviews. 2013;34(2):43–46. [PMC free article] [PubMed] [Google Scholar]
- 39.Rodgers S. K., Chang C., DeBardeleben J. T., Horrow M. M. Normal and abnormal US findings in early first-trimester pregnancy: review of the society of radiologists in ultrasound 2012 consensus panel recommendations. RadioGraphics. 2015;35(7):2135–2148. doi: 10.1148/rg.2015150092. [DOI] [PubMed] [Google Scholar]
- 40.Bua A., Molicotti P., Cannas S., Ruggeri M., Olmeo P., Zanetti S. Tuberculin skin test and QuantiFERON in children. New Microbiologica. 2013;36(2):153–156. [PubMed] [Google Scholar]
- 41.Liauw S. L., Pitroda S. P., Eggener S. E., et al. Evaluation of the prostate bed for local recurrence after radical prostatectomy using endorectal magnetic resonance imaging. International Journal of Radiation Oncology. 2013;85(2):378–384. doi: 10.1016/j.ijrobp.2012.05.015. [DOI] [PubMed] [Google Scholar]
- 42.American Academy of Pediatrics Committee on Fetus and Newborn. The Apgar score. Pediatrics. 2015;136(4):819–822. doi: 10.1542/peds.2015-2651. [DOI] [PubMed] [Google Scholar]
- 43.Teasdale G., Jennett B. Assessment of coma and impaired consciousness. The Lancet. 1974;304(7872):81–84. doi: 10.1016/S0140-6736(74)91639-0. [DOI] [PubMed] [Google Scholar]
- 44.Alvarado A. A practical score for the early diagnosis of acute appendicitis. Annals of Emergency Medicine. 1986;15(5):557–564. doi: 10.1016/S0196-0644(86)80993-3. [DOI] [PubMed] [Google Scholar]
- 45.Lenghel L. M., Botar Jid C., Bolboacă S. D., et al. Comparative study of three sonoelastographic scores for differentiation between benign and malignant cervical lymph nodes. European Journal of Radiology. 2015;84(6):1075–1082. doi: 10.1016/j.ejrad.2015.02.017. [DOI] [PubMed] [Google Scholar]
- 46.Ware J. E. J., Sherbourne C. D. The MOS 36-item short-form health survey (SF-36): I. conceptual framework and item selection. Medical Care. 1992;30:473–483. [PubMed] [Google Scholar]
- 47.EuroQol Group. EuroQol-a new facility for the measurement of health-related quality of life. Health Policy. 1990;16(3):199–208. doi: 10.1016/0168-8510(90)90421-9. [DOI] [PubMed] [Google Scholar]
- 48.Rabin R., de Charro F. EQ-5D: a measure of health status from the EuroQol group. Annals of Medicine. 2001;33(5):337–343. doi: 10.3109/07853890109002087. [DOI] [PubMed] [Google Scholar]
- 49.Morgan M. B., Crayford T., Murrin B., Fraser S. C. Developing the vascular quality of life questionnaire: a new disease-specific quality of life measure for use in lower limb ischemia. Journal of Vascular Surgery. 2001;33(4):679–687. doi: 10.1067/mva.2001.112326. [DOI] [PubMed] [Google Scholar]
- 50.Nordanstig J., Wann-Hansson C., Karlsson J., Lundström M., Pettersson M., Morgan M. B. Vascular quality of life questionnaire-6 facilitates health-related quality of life assessment in peripheral arterial disease. Journal of Vascular Surgery. 2014;59(3):700–707. doi: 10.1016/j.jvs.2013.08.099. [DOI] [PubMed] [Google Scholar]
- 51.Launois R., Reboul-Marty J., Henry B. Construction and validation of a quality of life questionnaire in chronic lower limb venous insufficiency (CIVIQ) Quality of Life Research. 1996;5(6):539–554. doi: 10.1007/BF00439228. [DOI] [PubMed] [Google Scholar]
- 52.Korff M., Ormel J., Keefe F. J., Dworkin S. F. Grading the severity of chronic pain. Pain. 1992;50:133–149. doi: 10.1016/0304-3959(92)90154-4. [DOI] [PubMed] [Google Scholar]
- 53.Corbo R. M., Scacchi R. Apolipoprotein E (apoE) allele distribution in the world. Is apoE∗4 a ‘thrifty’ allele? Annals of Human Genetics. 1999;63:301–310. doi: 10.1046/j.1469-1809.1999.6340301.x. [DOI] [PubMed] [Google Scholar]
- 54.Boulenouar H., Mediene Benchekor S., Meroufel D. N., et al. Impact of APOE gene polymorphisms on the lipid profile in an Algerian population. Lipids in Health and Disease. 2013;12(1) doi: 10.1186/1476-511X-12-155. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Yu Y., Wang W., Zhai S., Dang S., Sun M. IL6 gene polymorphisms and susceptibility to colorectal cancer: a meta-analysis and review. Molecular Biology Reports. 2012;39(8):8457–8463. doi: 10.1007/s11033-012-1699-4. [DOI] [PubMed] [Google Scholar]
- 56.Urquhart G. E. D., Gould J. C. Simplified technique for counting the number of bacteria in urine and other fluids. Journal of Clinical Pathology. 1965;18(4):480–482. doi: 10.1136/jcp.18.4.480. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Postollec F., Mathot A.-G., Bernard M., Divanach M.-L., Pavan S., Sohier D. Tracking spore-forming bacteria in food: From natural biodiversity to selection by processes. International Journal of Food Microbiology. 2012;158:1–8. doi: 10.1016/j.ijfoodmicro.2012.03.004. [DOI] [PubMed] [Google Scholar]
- 58.Özen A., Unal E. U., Mola S., et al. Glasgow aneurysm score in predicting outcome after ruptured abdominal aortic aneurysm. Vascular. 2015;23(2):120–123. doi: 10.1177/1708538114533539. [DOI] [PubMed] [Google Scholar]
- 59.Ciocan R. A., Drugan C., Gherman C. D., et al. Evaluation of Chitotriosidase as a marker of inflammatory status in critical limb ischemia. Annals of Clinical & Laboratory Science. 2017;47(6):713–719. [PubMed] [Google Scholar]
- 60.Drugan C., Drugan T. C., Miron N., Grigorescu-Sido P., Nascu I., Catana C. Evaluation of neopterin as a biomarker for the monitoring of Gaucher disease patients. Hematology. 2016;21(6):379–386. doi: 10.1080/10245332.2016.1144336. [DOI] [PubMed] [Google Scholar]
- 61.Sánchez-Rodríguez J. E., Bartolomé M., Cañas A. I., et al. Anti-smoking legislation and its effects on urinary cotinine and cadmium levels. Environmental Research. 2015;136:227–233. doi: 10.1016/j.envres.2014.09.033. [DOI] [PubMed] [Google Scholar]
- 62.Nahar N., Khan N., Chakraborty R. K., et al. Color doppler sonography and resistivity index in the differential diagnosis of hepatic neoplasm. Mymensingh Medical Journal. 2014;23(1):35–40. [PubMed] [Google Scholar]
- 63.Botar Jid C., Bolboacă S. D., Cosgarea R., et al. Doppler ultrasound and strain elastography in the assessment of cutaneous melanoma: preliminary results. Medical Ultrasonography. 2015;17(4):509–514. doi: 10.11152/mu.2013.2066.174.dus. [DOI] [PubMed] [Google Scholar]
- 64.Hăşmăşanu M. G., Bolboacă S. D., Matyas M., Zaharie G. C. Clinical and Echocardiographic Findings in Newborns of Diabetic Mothers. Acta Clinica Croatica. 2015;54(4):458–466. [PubMed] [Google Scholar]
- 65.BS ISO 5725-1. Accuracy (Trueness and Precision) of Measurement Methods and Results-Part 1: General Principles and Definitions. 1994. https://www.evs.ee/preview/iso-5725-1-1994-en.pdf. [Google Scholar]
- 66.Solberg H. E. International Federation of Clinical Chemistry (IFCC), Scientific Committee, Clinical Section, Expert Panel on Theory of Reference Values, and International Committee for Standardization in Haematology (ICSH), Standing Committee on Reference Values. Approved Recommendation (1986) on the theory of reference values. Part 1. The concept of reference values. Journal of Clinical Chemistry and Clinical Biochemistry. 1987;25(5):337–342. [PubMed] [Google Scholar]
- 67.International Federation of Clinical Chemistry (IFCC). Scientific Committee, Clinical Section. Expert Panel on Theory of Reference Values (EPTRV). IFCC Document (1982) stage 2, draft 2, 1983-10-07 with a proposal for an IFCC recommendation. The theory of reference values. Part 2. Selection of individuals for the production of reference values. Clinica Chimica Acta. 1984;139(2):205F–213F. [PubMed] [Google Scholar]
- 68.Solberg H. E., PetitClerc C. International Federation of Clinical Chemistry (IFCC), Scientific Committee, Clinical Section, Expert Panel on Theory of Reference Values. Approved recommendation (1988) on the theory of reference values. Part 3. Preparation of individuals and collection of specimens for the production of reference values. Journal of Clinical Chemistry and Clinical Biochemistry. 1988;26(9):593–598. [PubMed] [Google Scholar]
- 69.Solberg H. E., Stamm D. IFCC recommendation: The theory of reference values. Part 4. Control of analytical variation in the production, transfer and application of reference values. Journal of Automat Chemistry. 1991;13(5):231–234. doi: 10.1155/S146392469100038X. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Solberg H. E. Approved recommendation (1987) on the theory of reference values. Part 5. Statistical treatment of collected reference values. Determination of reference limits. Journal of Clinical Chemistry and Clinical Biochemistry. 1987;25:645–656. [PubMed] [Google Scholar]
- 71.Dybkær R., Solberg H. E. Approved recommendation (1987) on the theory of reference values. Part 6. Presentation of observed values related to reference values. Journal of Clinical Chemistry and Clinical Biochemistry. 1987;25:657–662. [Google Scholar]
- 72.Thygesen K., Alpert J. S., Jaffe A. S., et al. Third universal definition of myocardial infarction. Journal of the American College of Cardiology. 2012;60(16):1581–1598. doi: 10.1016/j.jacc.2012.08.001. [DOI] [PubMed] [Google Scholar]
- 73.Weitgasser R., Gappmayer B., Pichler M. Newer portable glucose meters--analytical improvement compared with previous generation devices? Clinical Chemistry. 1999;45(10):1821–1825. [PubMed] [Google Scholar]
- 74.Sacks D. B., Arnold M., Bakris G. L., et al. Guidelines and recommendations for laboratory analysis in the diagnosis and management of diabetes mellitus. Diabetes Care. 2011;34(6):e61–e99. doi: 10.2337/dc11-9998. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75.Clerico A., Ripoli A., Masotti S., et al. Evaluation of 99th percentile and reference change values of a high-sensitivity cTnI method: A multicenter study. Clinica Chimica Acta. 2019;493:156–161. doi: 10.1016/j.cca.2019.02.029. [DOI] [PubMed] [Google Scholar]
- 76.Le M., Flores D., May D., Gourley E., Nangia A. K. Current practices of measuring and reference range reporting of free and total testosterone in the United States. Journal of Urology. 2016;195(5):1556–1561. doi: 10.1016/j.juro.2015.12.070. [DOI] [PubMed] [Google Scholar]
- 77.Alnor A. B., Vinholt P. J. Paediatric reference intervals are heterogeneous and differ considerably in the classification of healthy paediatric blood samples. European Journal of Pediatrics. 2019 doi: 10.1007/s00431-019-03377-w. In press. [DOI] [PubMed] [Google Scholar]
- 78.McGee C., Hoehn A., Hoenshell C., McIlrath S., Sterling H., Swan H. Age- and gender-stratified adult myometric reference values of isometric intrinsic hand strength. Journal of Hand Therapy. 2019;(18) doi: 10.1016/j.jht.2019.03.005. pii: S0894-1130.30352 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79.Zhang X. T., Qi K. M., Cao J., Xu K. L., Cheng H. [Survey and Establishment of the Hematological Parameter Reference Intervals for Adult in Xuzhou Area of China] Zhongguo Shi Yan Xue Ye Xue Za Zhi. 2019;27(2):549–556. doi: 10.19746/j.cnki.issn.1009-2137.2019.02.039. [DOI] [PubMed] [Google Scholar]
- 80.Adeli K., Higgins V., Nieuwesteeg M., et al. Biochemical marker reference values across pediatric, adult, and geriatric ages: establishment of robust pediatric and adult reference intervals on the basis of the Canadian Health Measures Survey. Clinical Chemistry. 2015;61:1049–1062. doi: 10.1373/clinchem.2015.240515. [DOI] [PubMed] [Google Scholar]
- 81.Addai-Mensah O., Gyamfi D., Duneeh R. V., et al. Determination of haematological reference ranges in healthy adults in three regions in Ghana. BioMed Research International. 2019;2019 doi: 10.1155/2019/7467512.7467512 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 82.Li A., Yang S., Zhang J., Qiao R. Establishment of reference intervals for complete blood count parameters during normal pregnancy in Beijing. Journal of Clinical Laboratory Analysis. 2017;31(6) doi: 10.1002/jcla.22150.e22150 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 83.Siest G., Henny J., Gräsbeck R., et al. The theory of reference values: an unfinished symphony. Clinical Chemistry and Laboratory Medicine. 2013;51(1):47–64. doi: 10.1515/cclm-2012-0682. [DOI] [PubMed] [Google Scholar]
- 84.Ozarda Y., Sikaris K., Streichert T., Macri J., IFCC Committee on Reference intervals and Decision Limits (C-RIDL) Distinguishing reference intervals and clinical decision limits-A review by the IFCC committee on reference intervals and decision limits. Critical Reviews in Clinical Laboratory Sciences. 2018;55(6):420–431. doi: 10.1080/10408363.2018.1482256. [DOI] [PubMed] [Google Scholar]
- 85.Colan S. D. The why and how of Z scores. Journal of the American Society of Echocardiography. 2013;26(1):38–40. doi: 10.1016/j.echo.2012.11.005. [DOI] [PubMed] [Google Scholar]
- 86.Field A. An Adventure in Statistics: The Reality Enigma. London, UK: Sage; 2016. pp. 189–214. [Google Scholar]
- 87.Hazra A., Gogtay N. Biostatistics series module 7: the statistics of diagnostic tests. Indian Journal of Dermatology. 2017;62(1):18–24. doi: 10.4103/0019-5154.198047. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 88.Chubb H., Simpson J. M. The use of Z-scores in paediatric cardiology. Annals of Pediatric Cardiology. 2012;5(2):179–184. doi: 10.4103/0974-2069.99622. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 89.Curtis A. E., Smith T. A., Ziganshin B. A., Elefteriades J. A. The mystery of the Z-score. Aorta. 2016;4(4):124–130. doi: 10.12945/j.aorta.2016.16.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 90.Mao Y. K., Zhao B. W., Zhou L., Wang B., Chen R., Wang S.S. Z-score reference ranges for pulsed-wave doppler indices of the cardiac outflow tracts in normal fetuses. International Journal of Cardiovascular Imaging. 2019;35(5):811–825. doi: 10.1007/s10554-018-01517-1. [DOI] [PubMed] [Google Scholar]
- 91.Gregson C. L., Hardcastle S. A., Cooper C., Tobias J. H. Friend or foe: high bone mineral density on routine bone density scanning, a review of causes and management. Rheumatology. 2013;52(6):968–985. doi: 10.1093/rheumatology/ket007. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 92.Newman T. B., Browner W. S., Cummings S. R., Hulley S. B. Designing studies of medical tests. In: Hulley S. B., Cummings S. R., Browner W. S., Grady D. G., Newman T. B., editors. Designing Clinical Research. 4th. Philadelphia, PA, USA: Lippincott Williams & Wilkins; 2013. pp. 171–191. [Google Scholar]
- 93.Taheri N., Roshandel G., Mojerloo M., et al. Comparison of serum levels of hepcidin and pro-hepcidin in hemodialysis patients and healthy subjects. Saudi Journal of Kidney Diseases and Transplantation. 2015;26(1):34–38. doi: 10.4103/1319-2442.148730. [DOI] [PubMed] [Google Scholar]
- 94.Zhang J., Zhao Y., Chen Y. Reference intervals for plasma pro-gastrin releasing peptide (ProGRP) levels in healthy adults of Chinese Han ethnicity. International Journal of Biological Markers. 2014;29(4):e436–e439. doi: 10.5301/jbm.5000084. [DOI] [PubMed] [Google Scholar]
- 95.Saravelos S. H., Li T. C. Intra- and inter-observer variability of uterine measurements with three-dimensional ultrasound and implications for clinical practice. Reproductive Biomedicine Online. 2015;31(4):557–564. doi: 10.1016/j.rbmo.2015.06.024. [DOI] [PubMed] [Google Scholar]
- 96.Azara C. Z., Manrique E. J., Alves de Souza N. L., Rodrigues A. R., Tavares S. B., Amaral R. G. External quality control of cervical cytopathology: interlaboratory variability. Acta Cytologica. 2013;57(6):585–590. doi: 10.1159/000353843. [DOI] [PubMed] [Google Scholar]
- 97.Lei J., Yang P., Zhang L., Wang Y., Yang K. Diagnostic accuracy of digital breast tomosynthesis versus digital mammography for benign and malignant lesions in breasts: a meta-analysis. European Radiology. 2014;24(3):595–602. doi: 10.1007/s00330-013-3012-x. [DOI] [PubMed] [Google Scholar]
- 98.Miglioretti D. L., Johnson E., Williams A., et al. The use of computed tomography in pediatrics and the associated radiation exposure and estimated cancer risk. JAMA Pediatrics. 2013;167(8):700–707. doi: 10.1001/jamapediatrics.2013.311. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 99.Dinh T., Ladabaum U., Alperin P., Caldwell C., Smith R., Levin T. R. Health benefits and cost-effectiveness of a hybrid screening strategy for colorectal cancer. Clinical Gastroenterology and Hepatology. 2013;11(9):1158–1166. doi: 10.1016/j.cgh.2013.03.013. [DOI] [PubMed] [Google Scholar]
- 100.Tiernan J. F., Gilhooley S., Jones M. E., et al. Does an interferon-gamma release assay change practice in possible latent tuberculosis? QJM. 2013;106(2):139–146. doi: 10.1093/qjmed/hcs185. [DOI] [PubMed] [Google Scholar]
- 101.Newman T. B., Browner W. S., Cummings S. R., Hulley S. B. Chapter 12. Designing studies of medical tests. In: Hulley S. B., Cummings S. R., Browner W. S., Grady D. G., Newman T. B., editors. Designing Clinical Research. 4th. Philadelphia, PA, USA: Lippincott Williams & Wilkins; 2007. pp. 170–191. [Google Scholar]
- 102.Shapiro S. S., Wilk M. B. An analysis of variance test for normality (complete samples) Biometrika. 1965;52(3–4):591–611. doi: 10.1093/biomet/52.3-4.591. [DOI] [Google Scholar]
- 103.Kolmogorov A. N. Sulla determinazione empirica di una legge di distribuzione. Giornale dell’Istituto Italiano degli Attuari. 1933;4:83–91. [Google Scholar]
- 104.Smirnov N. Table for estimating the goodness of fit of empirical distributions. Annals of Mathematical Statistics. 1948;19:279–281. doi: 10.1214/aoms/1177730256. [DOI] [Google Scholar]
- 105.Arifin W. N., Sarimah A., Norsaadah B., et al. Reporting statistical results in medical journals. Malaysian Journal of Medical Science. 2016;23(5):1–7. doi: 10.21315/mjms2016.23.5.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 106.Peacock J. L., Kerry S. M., Balise R. R. Presenting Medical Statistics from Proposal to Publication. 2nd. UK: Oxford University Press; 2017. Chapter 5. Introduction to presenting statistical analysis; pp. 48–51. [Google Scholar]
- 107.Koch A., Singer H. Normal values of B type natriuretic peptide in infants, children, and adolescents. Heart. 2003;89(8):875–878. doi: 10.1136/heart.89.8.875. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 108. CLSI, Defining, Establishing, and Verifying Reference Intervals in the Clinical Laboratory; Approved Guideline—Third Edition, CLSI document EP28-A3c, Clinical and Laboratory Standards Institute, Wayne, PA, USA, 2008, https://clsi.org/media/1421/ep28a3c_sample.pdf.
- 109.Murase T., Kitamura H., Kochi T., et al. Distributions and ranges of values of blood and urinary biomarker of inflammation and oxidative stress in the workers engaged in office machine manufactures: evaluation of reference values. Clinical Chemistry and Laboratory Medicine. 2013;51(2):421–428. doi: 10.1515/cclm-2012-0330. [DOI] [PubMed] [Google Scholar]
- 110. Calculating Inter- and Intra-Assay Coefficients of Variability, 2018, https://www.salimetrics.com/calculating-inter-and-intra-assay-coefficients-of-variability/
- 111.Chard T. An Introduction to Radioimmunoassay and Related Techniques. Amsterdam, Netherlands: Elsevier; 1995. [Google Scholar]
- 112.Wild D. The Immunoassay Handbook: Theory and Applications of Ligand Binding, ELISA and Related Techniques. 4th. Amsterdam, Netherlands: Elsevier; 2013. [Google Scholar]
- 113.Pearson K. Mathematical Contributions to the Theory of Evolution. III. Regression, Heredity, and Panmixia. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character. 1896;187:253–318. doi: 10.1098/rsta.1896.0007. [DOI] [Google Scholar]
- 114.Steel R. G. D., Torrie J. H. Principles and Procedures of Statistics. 2nd. New York, NY, USA: McGraw-Hill; 1980. [Google Scholar]
- 115.Sangnawakij P., Niwitpong S. Confidence intervals for coefficients of variation in two-parameter exponential distributions. Communications in Statistics-Simulation and Computation. 2016;46(8):6618–6630. doi: 10.1080/03610918.2016.1208236. [DOI] [Google Scholar]
- 116.Everitt B. The Cambridge Dictionary of Statistics. Cambridge, UK: Cambridge University Press; 1998. ISBN 0521593468. [Google Scholar]
- 117.Zar J. H. Biostatistical Analysis. 2nd. Englewood Cliffs, NJ, USA: Prentice Hall, Inc.; 1984. p. 32. [Google Scholar]
- 118.Zwillinger D., Kokoska S. Standard Probability and Statistical Tables and Formula. Boca Raton, FL, USA: Chapman & Hall; 2000. p. 17. [Google Scholar]
- 119.Bonett D. G. Confidence interval for a coefficient of quartile variation. Computational Statistics & Data Analysis. 2006;50(11):2953–2957. doi: 10.1016/j.csda.2005.05.007. [DOI] [Google Scholar]
- 120.Quyên Đ. T. N. Developing university governance indicators and their weighting system using a modified Delphi method. Procedia-Social and Behavioral Sciences. 2014;141:828–833. doi: 10.1016/j.sbspro.2014.05.144. [DOI] [Google Scholar]
- 121.Forkman J. Estimator and tests for common coefficients of variation in normal distributions. Communications in Statistics-Theory and Methods. 2009;38:233–251. doi: 10.1080/03610920802187448. [DOI] [Google Scholar]
- 122.Feltz C. J., Miller G. E. An asymptotic test for the equality of coefficients of variation from k populations. Statistics in Medicine. 1996;15(6):647–658. doi: 10.1002/(sici)1097-0258(19960330)15:6<647::aid-sim184>3.0.co;2-p. [DOI] [PubMed] [Google Scholar]
- 123.Krishnamoorthy K., Lee M. Improved tests for the equality of normal coefficients of variation. Computational Statistics. 2014;29(1-2):215–232. doi: 10.1007/s00180-013-0445-2. [DOI] [Google Scholar]
- 124.Marwick B., Krishnamoorthy K. R Software Package Version 0.1.3. 2018. cvequality: Tests for the Equality of Coefficients of Variation from Multiple Groups. https://github.com/benmarwick/cvequality. [Google Scholar]
- 125.Schafer A. L., Vittinghoff E., Ramachandran R., Mahmoudi N., Bauer D. C. Laboratory reproducibility of biochemical markers of bone turnover in clinical practice. Osteoporosis International. 2010;21(3):439–445. doi: 10.1007/s00198-009-0974-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 126.Calvi J. L., Chen F. R., Benson V. B., et al. Measurement of cortisol in saliva: a comparison of measurement error within and between international academic-research laboratories. BMC Research Notes. 2017;10(1) doi: 10.1186/s13104-017-2805-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 127.Reed G. F., Lynn F., Meade B. D. Use of coefficient of variation in assessing variability of quantitative assays. Clinical and Diagnostic Laboratory Immunology. 2002;9(6):1235–1239. doi: 10.1128/CDLI.9.6.1235-1239.2002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 128.Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement. 1960;20(1):37–46. doi: 10.1177/001316446002000104. [DOI] [Google Scholar]
- 129.Fleiss J. L. Measuring nominal scale agreement among many raters. Psychological Bulletin. 1971;76(5):378–382. doi: 10.1037/h0031619. [DOI] [Google Scholar]
- 130.Conger A. J. Integration and generalization of kappas for multiple raters. Psychological Bulletin. 1980;88(2):322–328. doi: 10.1037//0033-2909.88.2.322. [DOI] [Google Scholar]
- 131.Cohen J. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin. 1968;70(4):213–220. doi: 10.1037/h0026256. [DOI] [PubMed] [Google Scholar]
- 132.Pearson K. Notes on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London. 1895;58(347-352):240–242. doi: 10.1098/rspl.1895.0041. [DOI] [Google Scholar]
- 133.Bartko J. J. The intraclass correlation coefficient as a measure of reliability. Psychological Reports. 1966;19(1):3–11. doi: 10.2466/pr0.1966.19.1.3. [DOI] [PubMed] [Google Scholar]
- 134.Lin L. I.-K. A concordance correlation coefficient to evaluate reproducibility. Biometrics. 1989;45(1):255–268. doi: 10.2307/2532051. [DOI] [PubMed] [Google Scholar]
- 135.Lin L. I.-K. A note on the concordance correlation coefficient. Biometrics. 2000;56:324–325. doi: 10.1111/j.0006-341X.2000.00324.x. [DOI] [Google Scholar]
- 136.Altman D. G., Bland J. M. Measurement in medicine: the analysis of method comparison studies. Statistician. 1983;32:307–317. doi: 10.2307/2987937. [DOI] [Google Scholar]
- 137.Bland J. M., Altman D. G. Measuring agreement in method comparison studies. Statistical Methods in Medical Research. 1999;8:135–160. doi: 10.1177/096228029900800204. [DOI] [PubMed] [Google Scholar]
- 138.McHugh M. L. Interrater reliability: the kappa statistic. Biochemia Medica. 2012;22(3):276–282. doi: 10.11613/bm.2012.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 139.Koo T. K., Li M. Y. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine. 2016;15(2):155–163. doi: 10.1016/j.jcm.2016.02.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 140.McGraw K. O., Wong S. P. Forming inferences about some intraclass correlation coefficients. Psychological Methods. 1996;1:30–46. doi: 10.1037/1082-989X.1.1.30. [DOI] [Google Scholar]
- 141.McBride G. B. A proposal for strength-of-agreement criteria for Lin’s concordance correlation coefficient. NIWA Report HAM2005-062, 2018, https://www.medcalc.org/download/pdf/McBride2005.pdf.
- 142.Martins W. P., Nastri C. O. Interpreting reproducibility results for ultrasound measurements. Ultrasound in Obstetrics & Gynecology. 2014;43(4):479–480. doi: 10.1002/uog.13320. [DOI] [PubMed] [Google Scholar]
- 143.Gregoire S. M., Chaudhary U. J., Brown M. M., et al. The Microbleed Anatomical Rating Scale (MARS): reliability of a tool to map brain microbleeds. Neurology. 2009;73(21):1759–1766. doi: 10.1212/WNL.0b013e3181c34a7d. [DOI] [PubMed] [Google Scholar]
- 144.Lee T. H., Lee J. S., Hong S. J., et al. High-resolution manometry: reliability of automated analysis of upper esophageal sphincter relaxation parameters. Turkish Journal of Gastroenterology. 2014;25(5):473–480. doi: 10.5152/tjg.2014.8021. [DOI] [PubMed] [Google Scholar]
- 145.Abdollah V., Parent E. C., Battié M. C. Is the location of the signal intensity weighted centroid a reliable measurement of fluid displacement within the disc? Biomedizinische Technik. Biomedical Engineering. 2018;63(4):453–460. doi: 10.1515/bmt-2016-0178. [DOI] [PubMed] [Google Scholar]
- 146.Cuchna J. W., Hoch M. C., Hoch J. M. The interrater and intrarater reliability of the functional movement screen: a systematic review with meta-analysis. Physical Therapy in Sport. 2016;19:57–65. doi: 10.1016/j.ptsp.2015.12.002. [DOI] [PubMed] [Google Scholar]
- 147.Parenti N., Reggiani M. L., Iannone P., Percudani D., Dowding D. A systematic review on the validity and reliability of an emergency department triage scale, the Manchester Triage System. International Journal of Nursing Studies. 2014;51(7):1062–1069. doi: 10.1016/j.ijnurstu.2014.01.013. [DOI] [PubMed] [Google Scholar]
- 148.Lange T., Matthijs O., Jain N. B., Schmitt J., Lützner J., Kopkow C. Reliability of specific physical examination tests for the diagnosis of shoulder pathologies: a systematic review and meta-analysis. British Journal of Sports Medicine. 2017;51(6):511–518. doi: 10.1136/bjsports-2016-096558. [DOI] [PubMed] [Google Scholar]
- 149.Bland J. M., Altman D. G. Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet. 1986;327(8476):307–310. doi: 10.1016/S0140-6736(86)90837-8. [DOI] [PubMed] [Google Scholar]
- 150.Giavarina D. Understanding Bland Altman analysis. Biochemia Medica. 2015;25(2):141–151. doi: 10.11613/BM.2015.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 151.Krouwer J. S. Why Bland-Altman plots should use X, not (Y + X)/2 when X is a reference method. Statistics in Medicine. 2008;27:778–780. doi: 10.1002/sim.3086. [DOI] [PubMed] [Google Scholar]
- 152.Montalescot G., Sechtem U., Achenbach S., et al. 2013 ESC guidelines on the management of stable coronary artery disease: the Task Force on the management of stable coronary artery disease of the European Society of Cardiology. European Heart Journal. 2013;34(38):2949–3003. doi: 10.1093/eurheartj/eht296. [DOI] [PubMed] [Google Scholar]
- 153.Pincus T., Sokka T. Laboratory tests to assess patients with rheumatoid arthritis: advantages and limitations. Rheumatic Disease Clinics of North America. 2009;35(4):731–734. doi: 10.1016/j.rdc.2009.10.007. [DOI] [PubMed] [Google Scholar]
- 154.Aboraya A., France C., Young J., Curci K., LePage J. The validity of psychiatric diagnosis revisited. Psychiatry. 2005;2(9):48–55. [PMC free article] [PubMed] [Google Scholar]
- 155.den Bakker M. A. [Is histopathology still the gold standard?] Ned Tijdschr Geneeskd. 2017;160:D981. [PubMed] [Google Scholar]
- 156.Leeflang M. M., Bossuyt P. M., Irwig L. Diagnostic test accuracy may vary with prevalence: implications for evidence-based diagnosis. Journal of Clinical Epidemiology. 2009;62(1):5–12. doi: 10.1016/j.jclinepi.2008.04.007. [DOI] [PubMed] [Google Scholar]
- 157.Leeflang M. M., Deeks J. J., Takwoingi Y., Macaskill P. Cochrane diagnostic test accuracy reviews. Systematic Reviews. 2013;2(1) doi: 10.1186/2046-4053-2-82. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 158.Bolboacă S. D., Jäntschi L., Sestraş A. F., Sestraş R. E., Pamfil D. C. Pearson-Fisher chi-square statistic revisited. Information. 2011;2(3):528–545. doi: 10.3390/info2030528. [DOI] [Google Scholar]
- 159.Youden W. J. Index for rating diagnostic tests. Cancer. 1950;3:32–35. doi: 10.1002/1097-0142(1950)3:1<32::aid-cncr2820030106>3.0.co;2-3. [DOI] [PubMed] [Google Scholar]
- 160.Glas A. S., Lijmer J. G., Prins M. H., Bonsel G. J., Bossuyt P. M. The diagnostic odds ratio: a single indicator of test performance. Journal of Clinical Epidemiology. 2003;56(11):1129–1135. doi: 10.1016/S0895-4356(03)00177-X. [DOI] [PubMed] [Google Scholar]
- 161.Galloway M. J., Reid M. M. Is the practice of haematology evidence based? III. Evidence based diagnostic testing. Journal of Clinical Pathology. 1998;51:489–491. doi: 10.1136/jcp.51.7.489. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 162.Habibzadeh F., Yadollahie M. Number needed to misdiagnose: a measure of diagnostic test effectiveness. Epidemiology. 2013;24(1):p. 170. doi: 10.1097/EDE.0b013e31827825f2. [DOI] [PubMed] [Google Scholar]
- 163.Mitchell A. J. The clinical significance of subjective memory complaints in the diagnosis of mild cognitive impairment and dementia: a meta-analysis. International Journal of Geriatric Psychiatry. 2008;23(11):1191–1202. doi: 10.1002/gps.2053. [DOI] [PubMed] [Google Scholar]
- 164.Mitchell A. J. Sensitivity × PPV is a recognized test called the clinical utility index (CUI+) European Journal of Epidemiology. 2011;26(3):251–252. doi: 10.1007/s10654-011-9561-x. [DOI] [PubMed] [Google Scholar]
- 165.Fagan T. J. Nomogram for Bayes theorem. New England Journal of Medicine. 1975;293(5):p. 257. doi: 10.1056/NEJM197507312930513. [DOI] [PubMed] [Google Scholar]
- 166.Caraguel C. G. B., Vanderstichel R. The two-step Fagan’s nomogram: ad hoc interpretation of a diagnostic test result without calculation. BMJ Evidence-Based Medicine. 2013;18(4):125–128. doi: 10.1136/eb-2013-101243. [DOI] [PubMed] [Google Scholar]
- 167.Marasco J., Doerfler R., Roschier L. Doc, what are my chances? UMAP Journal. 2011;32:279–298. [Google Scholar]
- 168.Zweig M. H., Campbell G. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clinical Chemistry. 1993;39(4):561–577. [PubMed] [Google Scholar]
- 169.Metz C. E. ROC methodology in radiologic imaging. Investigative Radiology. 1986;21(9):720–733. doi: 10.1097/00004424-198609000-00009. [DOI] [PubMed] [Google Scholar]
- 170.Lasko T. A., Bhagwat J. G., Zou K. H., Ohno-Machado L. The use of receiver operating characteristic curves in biomedical informatics. Journal of Biomedical Informatics. 2005;38(5):404–415. doi: 10.1016/j.jbi.2005.02.008. [DOI] [PubMed] [Google Scholar]
- 171.Ma H., Bandos A. I., Rockette H. E., Gur D. On use of partial area under the ROC curve for evaluation of diagnostic performance. Statistics in Medicine. 2013;32(20):3449–3458. doi: 10.1002/sim.5777. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 172.Habibzadeh F., Habibzadeh P., Yadollahie M. On determining the most appropriate test cut-off value: the case of tests with continuous results. Biochemia Medica. 2016;26(3):297–307. doi: 10.11613/BM.2016.034. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 173.Hajian-Tilaki K. The choice of methods in determining the optimal cut-off value for quantitative diagnostic test evaluation. Statistical Methods in Medical Research. 2018;27(8):2374–2383. doi: 10.1177/0962280216680383. [DOI] [PubMed] [Google Scholar]
- 174.Chiorean A. R., Szep M. B., Feier D. S., Duma M., Chiorean M. A., Strilciuc Ş. Impact of strain elastography on BI-RADS classification in small invasive lobular carcinoma. Medical Ultrasonography. 2018;18(2):148–153. doi: 10.11152/mu-1272. [DOI] [PubMed] [Google Scholar]
- 175.Carter J. V., Pan J., Rai S. N., Galandiuk S. ROC-ing along: Evaluation and interpretation of receiver operating characteristic curves. Surgery. 2016;159(6):1638–1645. doi: 10.1016/j.surg.2015.12.029. [DOI] [PubMed] [Google Scholar]
- 176.Hajian-Tilaki K. O., Gholizadehpasha A. R., Bozorgzadeh S., Hajian-Tilaki E. Body mass index and waist circumference are predictor biomarkers of breast cancer risk in Iranian women. Medical Oncology. 2011;28(4):1296–1301. doi: 10.1007/s12032-010-9629-6. [DOI] [PubMed] [Google Scholar]
- 177.Javidi H. Psychometric properties of GHQ-12 in Goa, India. Asian Journal of Psychiatry. 2017;30:p. 141. doi: 10.1016/j.ajp.2017.08.018. [DOI] [PubMed] [Google Scholar]
- 178. How Good is that Test? II [online] © copyright 1994–2007, Bandolier–“Evidence-based thinking about health care,” 2018, http://www.bandolier.org.uk/band27/b27-2.html.
- 179.Habibzadeh F. How to report the results of public health research. Journal of Public Health and Emergency. 2017;1:p. 90. doi: 10.21037/jphe.2017.12.02. [DOI] [Google Scholar]
- 180.Dippel D. W., de Kinkelder A., Bakker S. L., van Kooten F., van Overhagen H., Koudstaal P. J. The diagnostic value of colour duplex ultrasound for symptomatic carotid stenosis in clinical practice. Neuroradiology. 1999;41(1):1–8. doi: 10.1007/s002340050694. [DOI] [PubMed] [Google Scholar]
- 181.Demers A.-M., Verver S., Boulle A., et al. High yield of culture-based diagnosis in a TB-endemic setting. BMC Infectious Diseases. 2012;12(1) doi: 10.1186/1471-2334-12-218. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 182.Schröder F. H., Hugosson J., Roobol M. J., et al. The European randomized study of screening for prostate cancer–prostate cancer mortality at 13 years of follow-up. Lancet. 2014;384(9959):2027–2035. doi: 10.1016/S0140-6736(14)60525-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 183.Hugosson J., Godtman R. A., Carlsson S. V., et al. Eighteen-year follow-up of the Göteborg randomized population-based prostate cancer screening trial: effect of sociodemographic variables on participation, prostate cancer incidence and mortality. Scandinavian Journal of Urology. 2018;52(1):27–37. doi: 10.1080/21681805.2017.1411392. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 184.Attili F., Fabbri C., Yasuda I., et al. Low diagnostic yield of transduodenal endoscopic ultrasound-guided fine needle biopsy using the 19-gauge flex needle: a large multicenter prospective study. Endoscopic Ultrasound. 2017;6(6):402–408. doi: 10.4103/eus.eus_54_17. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 185.Gulati R., Mariotto A. B., Chen S., Gore J. L., Etzioni R. Long-term projections of the number needed to screen and additional number needed to treat in prostate cancer screening. Journal of Clinical Epidemiology. 2011;64(12):1412–1417. doi: 10.1016/j.jclinepi.2011.06.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 186.Hugosson J., Carlsson S., Aus G., et al. Mortality results from the Göteborg randomised population-based prostate-cancer screening trial. Lancet Oncology. 2010;11(8):725–732. doi: 10.1016/S1470-2045(10)70146-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 187.Lee M. S., Cho J. Y., Kim S. Y., et al. Diagnostic value of integrated PET/MRI for detection and localization of prostate cancer: Comparative study of multiparametric MRI and PET/CT. Journal of Magnetic Resonance Imaging. 2017;45(2):597–609. doi: 10.1002/jmri.25384. [DOI] [PubMed] [Google Scholar]
- 188.Mozafari P., Azari R. M., Shokoohi Y., Sayadi M. Feasibility of biological effective monitoring of chrome electroplaters to chromium through analysis of serum malondialdehyde. International Journal of Occupational and Environmental Medicine. 2016;7(4):199–206. doi: 10.15171/ijoem.2016.782. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 189.Lertudomphonwanit C., Mourya R., Fei L., et al. Large-scale proteomics identifies MMP-7 as a sentinel of epithelial injury and of biliary atresia. Science Translational Medicine. 2017;9(417):eaan8462. doi: 10.1126/scitranslmed.aan8462. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 190.Porta G., Numis F. G., Rosato V., et al. Lactate determination in pleural and abdominal effusions: a quick diagnostic marker of exudate-a pilot study. Internal and Emergency Medicine. 2018;13(6):901–906. doi: 10.1007/s11739-017-1757-y. [DOI] [PubMed] [Google Scholar]
- 191.Bram J. T., Baldwin K. D., Blumberg T. J. Gram stain is not clinically relevant in treatment of pediatric septic arthritis. Journal of Pediatric Orthopaedics. 2018;38(9):e536–e540. doi: 10.1097/BPO.0000000000001226. [DOI] [PubMed] [Google Scholar]
- 192.Shetty A. L., Brown T., Booth T., et al. Systemic inflammatory response syndrome-based severe sepsis screening algorithms in emergency department patients with suspected sepsis. Emergency Medicine Australasia. 2016;28:287–294. doi: 10.1111/1742-6723.12578. [DOI] [PubMed] [Google Scholar]
- 193.Mitchell A. J., McGlinchey J. B., Young D., Chelminski I., Zimmerman M. Accuracy of specific symptoms in the diagnosis of major depressive disorder in psychiatric out-patients: data from the MIDAS project. Psychological Medicine. 2009;39:1107–1116. doi: 10.1017/S00332917080046. [DOI] [PubMed] [Google Scholar]
- 194.Johansson M. M., Kvitting A. S., Wressle E., Marcusson J. Clinical utility of cognistat in multiprofessional team evaluations of patients with cognitive impairment in Swedish primary care. International Journal of Family Medicine. 2014;2014:649253. doi: 10.1155/2014/649253. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 195.Mitchell A. J., Yadegarfar M., Gill J., Stubbs B. Case finding and screening clinical utility of the patient health questionnaire (PHQ-9 and PHQ-2) for depression in primary care: a diagnostic meta-analysis of 40 studies. British Journal of Psychiatry Open. 2016;2:127–138. doi: 10.1192/bjpo.bp.115.001685. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 196.Fereshtehnejad S.-M., Shafieesabet M., Rahmani A., et al. A novel 6-item screening questionnaire for parkinsonism: validation and comparison between different instruments. Neuroepidemiology. 2014;43:178–193. doi: 10.1159/000368167. [DOI] [PubMed] [Google Scholar]
- 197.Bartoli F., Crocamo C., Biagi E., et al. Clinical utility of a single-item test for DSM-5 alcohol use disorder among outpatients with anxiety and depressive disorders. Drug and Alcohol Dependence. 2016;165:283–287. doi: 10.1016/j.drugalcdep.2016.06.003. [DOI] [PubMed] [Google Scholar]
- 198.Hoti K., Atee M., Hughes J. D. Clinimetric properties of the electronic Pain Assessment Tool (ePAT) for aged-care residents with moderate to severe dementia. Journal of Pain Research. 2018;2018(11):1037–1044. doi: 10.2147/JPR.S158793. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 199.Mitchell A. J., Shukla D., Ajumal H. A., Stubbs B., Tahir T. A. The mini-mental state examination as a diagnostic and screening test for delirium: systematic review and meta-analysis. General Hospital Psychiatry. 2014;36(6):627–633. doi: 10.1016/j.genhosppsych.2014.09.003. [DOI] [PubMed] [Google Scholar]
- 200.Gothlin M., Eckerstrom M., Rolstad S., Kettunen P., Wallin A. Better prognostic accuracy in younger mild cognitive impairment patients with more years of education. Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring. 2018;10:402–412. doi: 10.1016/j.dadm.2018.05.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 201.Fernandes K. V., Jhobta A., Sarkar M., Thakur S., Sood R. G. Comparative evaluation of 64-slice multidetector CT virtual bronchoscopy with fiberoptic bronchoscopy in the evaluation of tracheobronchial neoplasms. International Journal of Medical and Health Research. 2017;3(8):40–47. [Google Scholar]
- 202.Cugy E., Sibon I. Stroke-associated pneumonia risk score: validity in a French stroke unit. Journal of Stroke and Cerebrovascular Diseases. 2017;26(1):225–229. doi: 10.1016/j.jstrokecerebrovasdis.2016.09.015. [DOI] [PubMed] [Google Scholar]
- 203.Pawar S. D., Naik J. D., Prabhu P., Jatti G. M., Jadhav S. B., Radhe B. K. Comparative evaluation of Indian diabetes risk score and Finnish diabetes risk score for predicting risk of diabetes mellitus type II: A teaching hospital-based survey in Maharashtra. Journal of Family Medicine and Primary Care. 2017;6:120–125. doi: 10.4103/2249-4863.214957. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 204.Gur-Ozmen S., Leibetseder A., Cock H. R., Agrawal N., von Oertzen T. J. Screening of anxiety and quality of life in people with epilepsy. Seizure. 2017;45:107–113. doi: 10.1016/j.seizure.2016.11.026. [DOI] [PubMed] [Google Scholar]
- 205.Scott J., Marwaha S., Ratheesh A., et al. Bipolar at-risk criteria: an examination of which clinical features have optimal utility for identifying youth at risk of early transition from depression to bipolar disorders. Schizophrenia Bulletin. 2017;43(4):737–744. doi: 10.1093/schbul/sbw154. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 206.Albadareen R., Gronseth G., Goeden M., Sharrock M., Lechtenberg C., Wang Y. Paraneoplastic autoantibody panels: sensitivity and specificity, a retrospective cohort. International Journal of Neuroscience. 2017;127(6):531–538. doi: 10.1080/00207454.2016.1207644. [DOI] [PubMed] [Google Scholar]
- 207.Diestro J. D. B., Pasco P. M. D., Lee L. V., XDP Study Group of the Philippine Children’s Medical Center Validation of a screening questionnaire for X-linked dystonia parkinsonism: The first phase of the population-based prevalence study of X-linked dystonia parkinsonism in Panay. Neurology and Clinical Neuroscience. 2017;5:79–85. doi: 10.1111/ncn3.12113. [DOI] [Google Scholar]
- 208.Dawes C. Oxford–Centre for Evidence-Based Medicine. 2019. https://www.cebm.net/2014/06/catmaker-ebm-calculators/ [Google Scholar]
- 209.Herbert R. Confidence Interval Calculator. 2019. https://www.pedro.org.au/english/downloads/confidence-interval-calculator/ [Google Scholar]
- 210.Caraguel C., Wohlers-Reichel R., Vanderstichel R. DocNomo: the app to add evidence to your diagnosis. Proceedings of the Australian Veterinary Association Annual Conference; May 2016; Adelaide, Australia. [Google Scholar]
- 211.Allen J., Graziadio S., Power M. A Shiny Tool to Explore Prevalence, Sensitivity, and Specificity on TP, FP, FN, and TN. Newcastle upon Tyne, UK: NIHR Diagnostic Evidence Co-operative Newcastle; 2017. https://micncltools.shinyapps.io/TestAccuracy/ [Google Scholar]
- 212.Power M., Graziadio S., Allen J. A ShinyApp Tool To Explore Dependence of Rule-In And Rule-Out Decisions on Prevalence, Sensitivity, Specificity, and Confidence Intervals. Newcastle upon Tyne, UK: NIHR Diagnostic Evidence Co-operative Newcastle; 2017. https://micncltools.shinyapps.io/ClinicalAccuracyAndUtility/ [Google Scholar]
- 213.Fanshawe T. R., Power M., Graziadio S., Ordóñez-Mena J. M., Simpson J., Allen J. Interactive visualisation for interpreting diagnostic test accuracy study results. BMJ Evidence-Based Medicine. 2018;23(1):13–16. doi: 10.1136/ebmed-2017-110862. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 214.Mushlin A. I., Ruchlin H. S., Callahan M. A. Cost effectiveness of diagnostic tests. The Lancet. 2001;358(9290):1353–1355. doi: 10.1016/s0140-6736(01)06417-0. [DOI] [PubMed] [Google Scholar]
- 215.Drummond M. Economic evaluation of health interventions. BMJ. 2008;337:a1204. doi: 10.1136/bmj.a1204. [DOI] [PubMed] [Google Scholar]
- 216.Pennello G., Pantoja-Galicia N., Evans S. Comparing diagnostic tests on benefit-risk. Journal of Biopharmaceutical Statistics. 2016;26(6):1083–1097. doi: 10.1080/10543406.2016.1226335. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 217.Vickers A. J., Van Calster B., Steyerberg E. W. Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ. 2016;352:i6. doi: 10.1136/bmj.i6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 218.Bolboacă S., Jäntschi L. Pearson versus Spearman, Kendall’s tau correlation analysis on structure-activity relationships of biologic active compounds. Leonardo Journal of Sciences. 2006;9:179–200. [Google Scholar]
- 219.Jäntschi L., Balint D., Bolboacă S. D. Multiple linear regressions by maximizing the likelihood under assumption of generalized Gauss-Laplace distribution of the error. Computational and Mathematical Methods in Medicine. 2016;2016:8578156. doi: 10.1155/2016/8578156. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 220.Peacock J. L., Kerry S. M., Basile R. R. Presenting Medical Statistics from Proposal to Publication. 2nd. Oxford, UK: Oxford University Press; 2017. [Google Scholar]
- 221.Bolboacă S. D., Jäntschi L. Sensitivity, specificity, and accuracy of predictive models on phenols toxicity. Journal of Computational Science. 2014;5(3):345–350. doi: 10.1016/j.jocs.2013.10.003. [DOI] [Google Scholar]
- 222.Bolboacă S. D., Jäntschi L. The effect of leverage and/or influential on structure-activity relationships. Combinatorial Chemistry & High Throughput Screening. 2013;16(4):288–297. doi: 10.2174/1386207311316040003. [DOI] [PubMed] [Google Scholar]
- 223.Bolboacă S. D., Jäntschi L. Quantitative structure-activity relationships: linear regression modelling and validation strategies by example. International Journal on Mathematical Methods and Models in Biosciences. 2013;2(1):1309089. doi: 10.11145/j.biomath.2013.09.089. [DOI] [Google Scholar]
- 224.Bolboacă S. D., Jäntschi L. Distribution fitting 3. analysis under normality assumptions. Bulletin of University of Agricultural Sciences and Veterinary Medicine Cluj-Napoca Horticulture. 2009;66(2):698–705. doi: 10.15835/buasvmcn-hort:4446. [DOI] [Google Scholar]
- 225.Bolboacă S. D., Jäntschi L. Modelling the property of compounds from structure: statistical methods for models validation. Environmental Chemistry Letters. 2008;6:175–181. doi: 10.1007/s10311-007-0119-9. [DOI] [Google Scholar]
- 226.Schlessinger L., Eddy D. M. Archimedes: a new model for simulating health care systems—the mathematical formulation. Journal of Biomedical Informatics. 2002;35(1):37–50. doi: 10.1016/S1532-0464(02)00006-0. [DOI] [PubMed] [Google Scholar]
- 227.Horsman J., Furlong W., Feeny D., Torrance G. The Health Utilities Index (HUI®): concepts, measurement properties and applications. Health and Quality of Life Outcomes. 2003;1(1):p. 54. doi: 10.1186/1477-7525-1-54. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 228.Westwood M., van Asselt T., Ramaekers B., et al. High-sensitivity troponin assays for the early rule-out or diagnosis of acute myocardial infarction in people with acute chest pain: a systematic review and cost-effectiveness analysis. Health Technology Assessment. 2015;19(44):1–234. doi: 10.3310/hta19440. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 229.Daya S. Study design for the evaluation of diagnostic tests. Seminars in Reproductive Endocrinology. 1996;14(2):101–109. doi: 10.1055/s-2007-1016317. [DOI] [PubMed] [Google Scholar]
- 230.Glasser S. P. Research methodology for studies of diagnostic tests. In: Glasser S. P., editor. Essentials of Clinical Research. Dordrecht, Netherlands: Springer; 2008. pp. 245–257. [Google Scholar]
- 231.Lijmer J. G., Mol B. W., Heisterkamp S., et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA. 1999;282(11):1061–1066. doi: 10.1001/jama.282.11.1061. [DOI] [PubMed] [Google Scholar]
- 232. Centre for Evidence Based Medicine, 2019, https://www.cebm.net/
- 233.McGinn T., Jervis R., Wisnivesky J., Keitz S., Wyer P. C., Evidence-based Medicine Teaching Tips Working Group Tips for teachers of evidence-based medicine: clinical prediction rules (CPRs) and estimating pretest probability. Journal of General Internal Medicine. 2008;23(8):1261–1268. doi: 10.1007/s11606-008-0623-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 234.Richardson W. S., Wilson M. C., Keitz S. A., Wyer P. C., EBM Teaching Scripts Working Group Tips for teachers of evidence-based medicine: making sense of diagnostic test results using likelihood ratios. Journal of General Internal Medicine. 2008;23(1):87–92. doi: 10.1007/s11606-007-0330-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 235.Carley S., Dosman S., Jones S., Harrison M. Simple nomograms to calculate sample size in diagnostic studies. Emergency Medicine Journal. 2005;22(3):180–181. doi: 10.1136/emj.2003.011148. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 236.de Bekker-Grob E. W., Donkers B., Jonker M. F., Stolk E. A. Sample size requirements for discrete-choice experiments in healthcare: a practical guide. Patient. 2015;8(5):373–384. doi: 10.1007/s40271-015-0118-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 237.Bujang M. A., Adnan T. H. Requirements for minimum sample size for sensitivity and specificity analysis. Journal of Clinical & Diagnostic Research. 2016;10(10):YE01–YE06. doi: 10.7860/JCDR/2016/18129.8744. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 238.Hajian-Tilaki K. Sample size estimation in diagnostic test studies of biomedical informatics. Journal of Biomedical Informatics. 2014;48:193–204. doi: 10.1016/j.jbi.2014.02.013. [DOI] [PubMed] [Google Scholar]
- 239.Realdi G., Previato L., Vitturi N. Selection of diagnostic tests for clinical decision making and translation to a problem oriented medical record. Clinica Chimica Acta. 2008;393(1):37–43. doi: 10.1016/j.cca.2008.03.024. [DOI] [PubMed] [Google Scholar]
- 240.Kosack C. S., Page A. L., Klatser P. R. A guide to aid the selection of diagnostic tests. Bulletin of the World Health Organization. 2017;95(9):639–645. doi: 10.2471/BLT.16.187468. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 241.Buehler A. M., de Oliveira Ascef B., de Oliveira Júnior H. A., Ferri C. P., Fernandes J. G. Rational use of diagnostic tests for clinical decision making. Revista da Associação Médica Brasileira. 2019;65(3):452–459. doi: 10.1590/1806-9282.65.3.452. [DOI] [PubMed] [Google Scholar]
- 242.Di Sanzo M., Cipolloni L., Borro M., et al. Clinical applications of personalized medicine: a new paradigm and challenge. Current Pharmaceutical Biotechnology. 2017;18(3):194–203. doi: 10.2174/1389201018666170224105600. [DOI] [PubMed] [Google Scholar]
- 243.Liu Y., Wang H., Zhao W., Zhang M., Qin H., Xie Y. Flexible, stretchable sensors for wearable health monitoring: sensing mechanisms, materials, fabrication strategies and features. Sensors. 2018;18(2):p. 645. doi: 10.3390/s18020645. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 244.Masson J.-F. Surface plasmon resonance clinical biosensors for medical diagnostics. ACS Sensors. 2017;2(1):16–30. doi: 10.1021/acssensors.6b00763. [DOI] [PubMed] [Google Scholar]
- 245.Méhes G. Liquid biopsy for predictive mutational profiling of solid cancer: the pathologist’s perspective. Journal of Biotechnology. 2019;297:66–70. doi: 10.1016/j.jbiotec.2019.04.002. [DOI] [PubMed] [Google Scholar]
- 246.Ma X., He S., Qiu B., Luo F., Guo L., Lin Z. Noble metal nanoparticle-based multicolor immunoassays: an approach toward visual quantification of the analytes with the naked eye. ACS Sensors. 2019;4(4):782–791. doi: 10.1021/acssensors.9b00438. [DOI] [PubMed] [Google Scholar]
- 247.Haridy R. News Atlas. 2017. Pocket-sized, affordably-priced ultrasound connects to an iPhone. https://newatlas.com/butterfly-iq-smartphone-ultrasound/51962/ [Google Scholar]
- 248.Katoba J., Kuupiel D., Mashamba-Thompson T. P. Toward improving accessibility of point-of-care diagnostic services for maternal and child health in low-and middle-income countries. Point of Care. 2019;18(1):17–25. doi: 10.1097/POC.0000000000000180. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 249.Gong Y., Zheng Y., Jin B., et al. A portable and universal upconversion nanoparticle-based lateral flow assay platform for point-of-care testing. Talanta. 2019;201(15):126–133. doi: 10.1016/j.talanta.2019.03.105. [DOI] [PubMed] [Google Scholar]
- 250.Thrall J. H., Li X., Li Q., et al. Artificial intelligence and machine learning in radiology: opportunities, challenges, pitfalls, and criteria for success. Journal of the American College of Radiology. 2018;15(3):504–508. doi: 10.1016/j.jacr.2017.12.026. [DOI] [PubMed] [Google Scholar]
- 251.Gallix B., Chong J. Artificial intelligence in radiology: who’s afraid of the big bad wolf? European Radiology. 2019;29(4):1637–1639. doi: 10.1007/s00330-018-5995-9. [DOI] [PubMed] [Google Scholar]
- 252.Dowdy D. W., Gounder C. R., Corbett E. L., Ngwira L. G., Chaisson R. E., Merritt M. W. The ethics of testing a test: randomized trials of the health impact of diagnostic tests for infectious diseases. Clinical Infectious Diseases. 2012;55(11):1522–1526. doi: 10.1093/cid/cis736. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 253.Bulboacă A. E., Bolboacă S. D., Bulboacă A. C. Ethical considerations in providing an upper limb exoskeleton device for stroke patients. Medical Hypotheses. 2017;101:61–64. doi: 10.1016/j.mehy.2017.02.016. [DOI] [PubMed] [Google Scholar]