Author manuscript; available in PMC: 2024 Feb 1.
Published in final edited form as: Acad Radiol. 2022 Sep 27:S1076-6332(22)00484-6. doi: 10.1016/j.acra.2022.08.031

A Framework for Evaluating the Technical Performance of Multiparameter Quantitative Imaging Biomarkers (mp-QIBs)

Nancy A Obuchowski 1, Erich Huang 2, Nandita M deSouza 3, David Raunig 4, Jana Delfino 5, Andrew Buckler 6, Charles Hatt 7, Xiaofeng Wang 8, Chaya Moskowitz 9, Alexander Guimaraes 10, Maryellen Giger 11, Timothy J Hall 12, Paul Kinahan 13, Gene Pennello 14
PMCID: PMC9825639  NIHMSID: NIHMS1833806  PMID: 36180328

Abstract

Multiparameter quantitative imaging incorporates anatomical, functional, and/or behavioral biomarkers to characterize tissue, detect disease, identify phenotypes, define longitudinal change, or predict outcome. Multiple imaging parameters are sometimes considered separately but ideally are evaluated collectively. Often, they are transformed as Likert interpretations, ignoring the correlations of quantitative properties that may result in better reproducibility or outcome prediction. In this paper we present three use cases of multiparameter quantitative imaging: i) multidimensional descriptor, ii) phenotype classification, and iii) risk prediction. A fourth application based on data-driven markers from radiomics is also presented. We describe the technical performance characteristics and their metrics common to all use cases, and provide a structure for the development, estimation, and testing of multiparameter quantitative imaging. This paper serves as an overview for a series of individual articles on the four applications, providing the statistical framework for multiparameter imaging applications in medicine.

Keywords: quantitative imaging biomarkers, multiparametric imaging, radiomics, QIBA

Introduction

Multi-component or multivariable imaging (often referred to in the medical literature as “multiparameter” or “multiparametric” imaging) incorporates multiple anatomical and functional measurands, as well as measurands defining disease appearance, behavior, and response to therapy, allowing more comprehensive tissue characterization than any single measurand alone. Multiparameter imaging is critical as a drug development tool, as it is able to non-invasively characterize multidimensional serial changes in phenotypes or risk as treatment progresses, thereby revealing relationships between multiple cooperating biological processes. In contrast, single measurands often cannot capture how the variables work together, such as compensating reactions which have evolved in the biological system in reaction to disease. The characterization and quantification of such relationships has the potential to provide powerful information for many intents of use, including but not limited to diagnosis, prognosis, risk assessment, prediction of treatment response, safety assessment, and dosing.

While multiparameter imaging is currently being used both in clinical trials and clinical care to better understand patients’ disease and outcome, much if not most multiparameter imaging is interpreted as either multiple binary endpoints or a reader’s Likert interpretations, thus failing to take full advantage of the quantitative properties of the biomarkers. Furthermore, little attention has been paid to the technical performance of multiple imaging biomarkers when combined to characterize disease. For example, some biomarkers may only be meaningful if others are positive or negative, or the interpretation of disease may depend on the combinations of biomarkers rather than individual ones.

Until recently, the Quantitative Imaging Biomarker Alliance (QIBA) has focused its attention on single quantitative biomarkers [1-4]; however, as QIBA prepares to expand its profiles to include multiparameter quantitative imaging biomarkers (mp-QIBs), the appropriate methodologic pathway is unclear. mp-QIBs have many varied applications, each with its own unique methodologic issues, such as correlated measurement errors and the often-overlooked effects of different scanners, scanning parameters, image analysis software, and the timing of the imaging. To address these issues, QIBA assembled the Multiparametric Metrology Working Group, which started meeting in 2018 and was expanded in 2019.

In this paper we present three use cases for mp-QIBs, as well as a fourth application involving data-driven markers that may be helpful in clinical practice but which may not articulate a specific relationship to a biologic correlate. We present special considerations associated with each application, compare and contrast metrics of technical performance, and present a general framework for estimating and testing technical performance. We conclude with a discussion of the intersection of the clinical utility of an mp-QIB-based decision tool with the technical performance of its output and of the component imaging biomarkers. This series on mp-QIBs includes four separate articles presenting a detailed discussion of each use case [5-8].

Multiparameter Use Cases

A quantitative imaging biomarker (QIB) is defined as “an objective characteristic derived from an in vivo image measured on a ratio or interval scale as indicators of normal biological processes, pathogenic processes or a response to a therapeutic intervention” [1]. A multiparameter QIB (mp-QIB) is defined as two or more QIBs that can be used collectively or combined (e.g., through a mathematical expression to define a “classification,” “score,” or “index”) to diagnose, prognose, or monitor a patient’s disease; the utility of an mp-QIB often changes with ongoing studies. Figure 1 illustrates an mp-QIB from a clinical ultrasound imaging system of a breast nodule. Brightness-mode (B-mode) images (panel a) display the local echo signal amplitude (higher signal amplitudes suggest higher tissue ‘echogenicity’). While useful for surveys of tissue structure, they are highly system- and operator-dependent. The ‘elasticity’ (quasi-static strain) image shown in panel b is also operator-dependent, but objective parameters derived from it can be informative: for example, the ratio of the nodule area in the elasticity image to that in the B-mode image is a strong indicator of malignancy [9]. Similarly, objective quantification of blood flow (panel c) in a breast nodule is a strong indicator of malignancy [10]. The ultrasound attenuation image (panel d) has lower spatial resolution than the others because of the larger region of support required for the parameter estimate, but attenuation in combination with backscatter coefficients (a system-independent quantification of tissue ‘echogenicity’) has been shown to be useful for discriminating among breast tissue types [11, 12]. Combining these parameters, and potentially others, should further improve classification of nodule type.

Figure 1:

Shown are four parametric images of a breast nodule derived from a clinical ultrasound imaging system including (a) a B-mode image, (b) a strain elasticity image, (c) a “Doppler” image demonstrating local blood flow, and (d) an image of the local acoustic attenuation coefficient. (from [13], with permission).

For the purpose of discussing and comparing methodologic approaches, we present four applications of mp-QIBs based on their intended clinical use and whether they meet the definition of a biomarker. Table 1 summarizes these applications, along with examples of each.

Table 1:

Four Multiparameter Use Cases

| Use case | Intended use | Description | Example |
| --- | --- | --- | --- |
| 1 | Multidimensional descriptor | Panel of individual, but related, biomarkers, each of importance. | CT assessment of response to treatment for gastrointestinal tract tumors with Choi criteria [14]; noninvasive CT assessment of atherosclerotic plaque tissue [15]. |
| 2 | Phenotype classification | Multiple biomarkers used in combination to classify cases into distinct phenotypes that represent different manifestations of pathology (polychotomous endpoint). The clinical purpose is to categorize patients in terms of the likely course of disease and/or to tailor therapeutics based on likely response to differing drug mechanisms of action. | Classification of breast cancers into benign/malignant using DCE-MRI [19]; US elastography and MRI for staging liver disease [20, 21]. |
| 3 | Risk prediction | Multiple biomarkers used in combination to predict patients’ current or future outcome or risk (binary or ordinal endpoint). The clinical purpose is to stratify patients in terms of the relative urgency of certain types of treatments, such as differentiating candidates for procedural invasive interventions versus optimized medical therapy. | Prediction of survival in patients with resected brain metastases using DW-MRI [23]. |
| 4 | Data-driven markers | Computer extraction of potentially large numbers of derived metrics, or “radiomic features”, for prediction or other clinical purposes where it is not critical for the metrics to be tied to an objective ground truth. | CT features to develop a radiomics signature that predicts immunotherapy response [26]; CT features of pulmonary nodules to predict cancer [28]; classification of molecular subtypes of breast cancer using DCE-MRI [27]. |

CT = computed tomography, DCE-MRI = dynamic contrast-enhanced magnetic resonance imaging, US = ultrasound, DW-MRI=diffusion-weighted magnetic resonance imaging

In Use Case 1, the QIBs are treated as a multivariate vector for diagnosis and prediction, as combining them into a scalar summary over-compresses the information, obscuring important aspects of the full clinical context from a clinician or researcher. For example, in the Choi criteria, used for assessing treatment response in gastrointestinal tract (and potentially other) tumors [14], a partial response to treatment is defined as either a 10% reduction in tumor size or a 15% reduction in density during the portal venous phase of contrast on a CT image. Another example is the assessment of atherosclerotic plaque by examining multiple characteristics of the tissue including its geometry (volume, stenosis area) and composition (calcification, lipid-rich necrotic core), both of which provide complementary information needed by physicians [15, 16]. In these two examples, the QIBs are not combined into a composite endpoint, nor translated through modeling to classify or predict, but rather used to provide complementary information about different aspects, or domains, of the current disease state, thus illustrating the multivariate concept for Use Case 1.
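As an illustration of keeping the QIBs as a panel rather than a single score, the Choi rule described above can be sketched in a few lines. This is a minimal sketch, not the published implementation: the function name, argument names, and example values are hypothetical, and it assumes the stated percentages mean "at least" that much reduction relative to baseline.

```python
def choi_partial_response(size_before, size_after, density_before, density_after):
    """Apply the two partial-response criteria quoted in the text.

    Assumptions (not from the source): reductions are relative to the
    baseline value, thresholds are inclusive, and inputs are positive.
    """
    size_reduction = (size_before - size_after) / size_before
    density_reduction = (density_before - density_after) / density_before
    return {
        # Both components are reported, so the panel is not collapsed
        # into one number.
        "size_reduction": size_reduction,
        "density_reduction": density_reduction,
        "partial_response": size_reduction >= 0.10 or density_reduction >= 0.15,
    }


# Hypothetical lesion: size 100 -> 95, density 60 -> 50 (arbitrary units).
result = choi_partial_response(100.0, 95.0, 60.0, 50.0)
print(result)
```

Returning both reductions alongside the flag reflects the point of Use Case 1: each QIB remains visible to the reader.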

The other use cases involve mapping the QIB measurements into a scalar summary quantity (a score or index) via a statistical model to identify underlying characteristics and forecast clinically important events. We define a model as a computational procedure that relates input variables such as individual QIBs and standard clinical variables to a score that maps to an endpoint such as a clinical outcome or phenotype [17]. Such a model might be used to develop an understanding of the quantitative relationships between the input variables and the endpoint. More generally, the model may output a vector of scores that maps to one or more clinical outcomes or phenotypes.

In Use Case 2, the individual QIBs are used in a statistical model to classify abnormal tissues (lesions) into phenotypes, namely the observable physiological, developmental, and/or behavioral characteristics of a lesion/organism arising from its genetic predisposition and response to its environment [18]. Examples of these characteristics include presence or absence of malignancy, disease stage, lesion type (cyst or solid), molecular subtype, or expression of a receptor. We emphasize that a phenotype is not defined based on radiographic appearance (i.e. an “imaging phenotype”), as it implies something idiosyncratic to the imaging process rather than an emphasis on classifying the object, and it is not the same as clinical outcome, such as survival, or prediction of future events or risk, which we define as Use Case 3 (described below). Rather, the phenotype is ascribed to the lesion itself. Phenotypes have a ground truth in their biological correlates, but phenotypes are not necessarily QIBs themselves as they are not ratio variables [1].

A statistical description of a model for classifying cases into phenotypes is given in Equation 1 based on three QIBs, where the dependent variable, Z, is the predicted score that would be partitioned to classify the cases into phenotypes, and g is some class of computational procedures that may include generalized linear models (e.g. binomial or multinomial logistic regression model) or more complex ones such as non-linear regression models, neural nets, or other machine learning models. The Xi’s are the measurements of the QIBs, and βi’s are their weights.

$$Z = g(\beta_1 X_1, \beta_2 X_2, \beta_3 X_3) \tag{1}$$

For a binary phenotype, the predicted phenotype would come from dichotomizing the score Z. If there are more than two nominal or ordinal categories, the predicted phenotype is a distinct pattern or group derived from Z.
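One concrete instance of Equation 1 is a binomial logistic model, in which g is the logistic function of a weighted sum of the QIBs. The sketch below is illustrative only: the weights, intercept, cutoff, and QIB values are hypothetical placeholders, not fitted values from any study.

```python
import math


def phenotype_score(x, beta, intercept=0.0):
    """Logistic instance of Equation 1: Z = g(beta1*X1, beta2*X2, beta3*X3)."""
    linear = intercept + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-linear))


def classify(z, cutoff=0.5):
    """Dichotomize the score Z for a binary phenotype (cutoff is an assumption)."""
    return "positive" if z >= cutoff else "negative"


# Hypothetical measurements of three QIBs and hypothetical weights.
qibs = [1.0, 2.0, 0.5]
weights = [0.8, -0.3, 1.2]
z = phenotype_score(qibs, weights, intercept=-0.5)
print(round(z, 4), classify(z))
```

In practice the weights would be estimated from a training sample, and the cutoff would be chosen with the intended use in mind.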

An example is the study by Fusco et al [19] where the investigators combined lesion morphology and dynamic features from DCE-MRI via a decision tree and Bayesian classifier to differentiate benign breast lesions from malignant ones.

In Use Case #3, multiple biomarkers are used together in a model to generate a risk score, a prediction, or other quantitative outcome. The endpoint is a clinical outcome, defined as a variable that describes or reflects how an individual feels, functions or survives [22]. The dependent variable here might be a probability (e.g. recurrence by a specified follow-up time), an ordinal variable (e.g. risk stratification), or a continuous outcome (e.g. backscatter, modeled as a function of attenuation coefficient and sound wave speed). A statistical description with three QIBs is given in Equation 2, where the dependent variable is denoted f(O) (e.g. Probability(O = 1) for binary outcomes, or Probability(O > t) for time-to-event outcomes), g is some decision rule, Xi’s are the measurements of the QIBs, and βi’s are their weights. The dependent variable may or may not have a ground truth (e.g., an individual’s true probability of recurrence is generally unascertainable), but researchers can test for concordance between the predicted outcome and subjects’ observed outcomes.

$$f(O) = g(\beta_1 X_1, \beta_2 X_2, \beta_3 X_3) \tag{2}$$

The study by Zakaria et al [23] is an example where the investigators built a model to predict survival for patients with brain metastases. The outcome was the time until death and the inputs were traditional clinical scores, treatment, and apparent diffusion coefficient (ADC) from diffusion-weighted MRI (DW-MRI). Cox proportional hazards models were built to determine whether ADC improved prediction over the clinical scores.

Note that sometimes use case 2 is a precursor of use case 3, where the risk of an event is tied to the patient’s phenotype. An example is plaque morphology: in use case 2, a descriptor of the current pathology is obtained, which is then used to identify the likelihood of a future event, given the pathology, in use case 3. The clinical objective would be to recognize the phenotype known to respond to a given drug therapy and apply it if time permits; the risk stratification would indicate whether there is adequate time to do so or whether a more invasive procedural intervention is imminently indicated.

The fourth use case focuses on data-driven markers, which do not fall under the definition of a true biomarker in the mathematical sense [1]. Often these data-driven markers are derived through radiomics, namely the practice of converting images into data that can be mined to develop models that may potentially improve diagnostic, prognostic, and predictive accuracy [24]. Radiomics includes automated and semi-automated extraction and analysis of textures and quantitative relationships of voxels within lesions. Automation of experiments is used such that large-scale repetition is feasible. In comparison with the studies in use cases 1, 2 and 3, such computationally derived markers do not have a biological, physiological, or pathologic correlate [25]. For example, measures of heterogeneity, which summarize the distribution of voxel intensities within a tumor, are indirect or derived statistical measures that may correlate strongly with presence/absence of malignancy but have no corresponding ground truth value based on histologic characteristics of the imaged tissue. Rather, measures of heterogeneity are scanner-dependent and require harmonization across imaging methods to derive a reproducible feature that can be used for prognosis and prediction of outcomes.

An example is the study by Sun et al [26], who combined features extracted from CT images with genomic data from tumor biopsies to develop a radiomics signature to predict immunotherapy response (i.e. CD8 cell tumor infiltration). As another example, Li et al [27] built a model to identify the molecular classification of breast cancers from DCE-MRI features related to morphology and texture, as well as dynamic features; from 88 initial features, 24 were selected for the final model. There are also many applications of radiomics used to predict the malignancy of pulmonary nodules from unenhanced CT images [28]. Note that radiomics can also be used in use cases 1-3 to extract true biomarkers from images (e.g. lesion size); however, we will discuss the practice of radiomics as part of use case 4.

Measures of Performance

Technical Performance

For a single QIB, technical performance is usually characterized by precision of measurement (repeatability, reproducibility), bias, and the property of linearity [13]. Technical performance of individual QIBs is also important for multiparameter applications. In use case #1, these metrics, along with estimates of the correlation between the QIBs in the panel, serve to describe the various QIBs in the multidimensional panel and determine which changes are likely attributable to actual underlying physiological changes rather than just noise. In use cases #2-4, these metrics, particularly the precision metrics, are sometimes used for establishing criteria for selecting variables as inputs in statistical models [29, 30].

For use cases #2-4, however, additional metrics may be needed for characterizing the performance of the output variable of the model. This entails the repeatability and reproducibility of the output values as well as the ability of the model to predict a phenotype or outcome of interest, a result akin to assessing the agreement between an individual QIB and a reference standard value. Various metrics are used to describe model performance. Often, maximizing discrimination, defined as the ability to separate different characteristics or outcomes, is of primary interest. Much methodological development has been devoted to maximizing discrimination from combinations of biomarkers [31-41]. The area or partial area under the receiver operating characteristic (ROC) curve is often used as the discrimination metric. Two related metrics of particular relevance for use case #2 are classification accuracy, which is the ability of a model to correctly classify subjects by phenotype, and prediction accuracy, which is the ability of a model to predict a phenotype. For example, sensitivity and specificity are classification probabilities commonly used to evaluate the accuracy of a binary phenotype; these metrics describe the probability of correct test results, given a current phenotype (e.g. probability of a positive/negative test result, given that disease is present/absent). In contrast, positive and negative predictive values are predictive probabilities commonly used to evaluate the accuracy for predicting a binary phenotype (i.e. likelihood of disease, given a current condition or test result).
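The distinction between classification probabilities (conditioned on the true phenotype) and predictive probabilities (conditioned on the test result), together with the area under the ROC curve, can be made concrete with a small sketch. The data below are hypothetical, the 0.5 cutoff is an assumption, and the AUC is computed with the rank-based (Mann-Whitney) estimator rather than any method specific to the cited literature.

```python
def classification_metrics(truth, calls):
    """Sensitivity/specificity condition on truth; PPV/NPV condition on the call."""
    tp = sum(1 for t, c in zip(truth, calls) if t == 1 and c == 1)
    fn = sum(1 for t, c in zip(truth, calls) if t == 1 and c == 0)
    fp = sum(1 for t, c in zip(truth, calls) if t == 0 and c == 1)
    tn = sum(1 for t, c in zip(truth, calls) if t == 0 and c == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def auc(truth, scores):
    """Empirical AUC: chance that a positive case outscores a negative one."""
    pos = [s for t, s in zip(truth, scores) if t == 1]
    neg = [s for t, s in zip(truth, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical reference-standard labels and model scores for ten lesions.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1]
calls = [1 if s >= 0.5 else 0 for s in scores]  # assumed cutoff of 0.5

metrics = classification_metrics(truth, calls)
print(metrics, auc(truth, scores))
```

Note that the predictive values depend on the prevalence in the sample, whereas sensitivity, specificity, and the AUC do not; the sketch does not guard against empty classes.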

Calibration

Some models output scores where the score itself doesn’t have an intrinsic meaning; here, the aim is to maximize discrimination. In contrast, a probability or risk score from a prediction model does have a meaning: it is the estimated probability of a future event occurring, or probability of having a certain phenotype. It’s important to remember that this type of output can be utilized by the user (e.g. physician, patient) to develop an interpretation that affects medical decisions. Probability or risk scores may discriminate well but may not necessarily match individual true risks, that is, may not be well calibrated. Calibration is defined as closeness of agreement between a predicted probability or risk and the true probability or risk. It can be assessed by how well the numbers of events predicted by a model agree with the observed number of events that arise in a cohort. A risk model is moderately well-calibrated when, in general, the frequency of events among subjects with a common risk score equals the risk score. At least four different levels of calibration (mean, weak, moderate, and strong calibration) have been described [35]. Some authors [36, 43] suggest that a model that discriminates well but is not well calibrated is inappropriate because individuals’ calculated risk could be misleading. Moreover, the consequences of using a poorly calibrated risk prediction for medical decision making could be dire. Thus, an assessment of the calibration of predicted risk is also critical.

Calibration of a risk score can be visualized with the calibration plot, which is a plot of estimated risk vs. actual risk [36] for subjects binned into risk score deciles (or other quantiles). The coordinate for each decile should fall on or near the 45-degree line if the estimated risks are well calibrated [36, 42]. It’s important that calibration be assessed separately for key subgroups (defined by demographic and disease characteristics, or imaging protocols) to ensure adequate calibration across subgroups. Furthermore, when a model changes or new subgroups are identified, calibration should be re-assessed.
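The tabulation behind such a calibration plot can be sketched as follows. This is a minimal sketch under stated assumptions: subjects are sorted by predicted risk and split into equal-sized quantile bins, the observed event frequency stands in for the "actual risk", and the function name and data are hypothetical.

```python
def calibration_table(predicted, observed, n_bins=10):
    """Return (mean predicted risk, observed event rate) per quantile bin."""
    pairs = sorted(zip(predicted, observed))
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # more bins than subjects
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(o for _, o in chunk) / len(chunk)
        rows.append((mean_pred, obs_rate))
    return rows


# Hypothetical cohort: ten subjects predicted at 10% risk, ten at 50% risk.
predicted = [0.1] * 10 + [0.5] * 10
observed = [1] + [0] * 9 + [1] * 5 + [0] * 5

for mean_pred, obs_rate in calibration_table(predicted, observed, n_bins=2):
    print(f"predicted {mean_pred:.2f}  observed {obs_rate:.2f}")
```

Points where the two columns agree would fall on the 45-degree line; the same function can be run separately on each subgroup of interest.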

The predictiveness of a risk score is the distribution of risk (probability) as a function of quantiles of the risk score (Z) as displayed in a predictiveness curve (see Figure 2). For example, suppose the risk for a binary phenotype is estimated from a logistic regression model with one or more QIBs included as predictors. The model-estimated risk is plotted against the risk score quantile. Now suppose a QIB-based risk score for a phenotype is well-calibrated across subgroups, and clinical thresholds are available for ruling in and ruling out the phenotype. By superimposing the risk thresholds on the predictiveness curve, the proportion of individuals meeting these thresholds can be approximated from their intersections with the predictiveness curve.

Figure 2.

Model-based estimate of the predictiveness curve (solid red line) with 95% confidence band (solid light red lines) for a QIB to screen subjects for a phenotype with 0.65% prevalence (horizontal dotted black line). The rule-in risk threshold of 3.25% (horizontal dotted red line) for referring subjects for further work-up and the rule-out risk threshold of 0.13% (horizontal dotted blue line) are superimposed. For this QIB, the phenotype can be ruled out for 55% of subjects and ruled-in for 5% of subjects, with the remaining 40% of subjects having risks not meeting either threshold and thus in whom the clinical decision is equivocal. This hypothetical predictiveness curve is based on screening for colorectal cancer which is estimated to have a prevalence of about 0.65% in asymptomatic subjects in the United States.
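The reading of a predictiveness curve against rule-in and rule-out thresholds can be sketched empirically. The thresholds below are those quoted in the Figure 2 caption; the list of risks is hypothetical and was constructed only so that the proportions match the caption, and the function names are illustrative.

```python
def predictiveness_curve(risks):
    """Empirical curve: (risk-score quantile, risk) pairs, sorted by risk."""
    ordered = sorted(risks)
    n = len(ordered)
    return [((i + 1) / n, r) for i, r in enumerate(ordered)]


def triage_proportions(risks, rule_out, rule_in):
    """Fractions of subjects at or below rule_out, at or above rule_in, or between."""
    n = len(risks)
    n_out = sum(1 for r in risks if r <= rule_out)
    n_in = sum(1 for r in risks if r >= rule_in)
    return {
        "ruled_out": n_out / n,
        "ruled_in": n_in / n,
        "equivocal": (n - n_out - n_in) / n,
    }


# Hypothetical model-estimated risks for 20 subjects.
risks = [0.001] * 11 + [0.01] * 8 + [0.05]
props = triage_proportions(risks, rule_out=0.0013, rule_in=0.0325)
print(props)
```

These proportions are only meaningful if the risk score is well calibrated, as the text emphasizes.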

Output from models built with QIBs as input can also lack precision, which here is interpreted as the closeness in agreement of output values from the same model but with QIB inputs measured on different occasions and possibly with different imaging methods. Precision is not a measure of agreement with the true output, but rather agreement among repeated outputs from the model. A prediction model can have good discrimination and good calibration, but poor precision when the predictions are noisy (i.e. do not exactly agree). Figure 3 illustrates the distinctions between discrimination, calibration, and precision, showing how a single model can differ in the quality of these aspects.

Figure 3:

Illustration of Three Technical Performance Characteristics: Discrimination, Calibration, Precision. Data from three subjects (illustrated in red, blue, and green) are illustrated, with five replicates for each subject. Each subject has a different truth value (1, 2, or 3). Good vs. poor discrimination is displayed in the middle two columns, respectively. Good vs. poor calibration (i.e. “strong” calibration, as defined in [35]) is displayed in the last two columns, respectively. Good vs. poor precision is displayed in the top vs. bottom rows, respectively. The x-axis was jittered for clarity.

We have discussed performance metrics that do not rely on defining risk thresholds. These metrics are helpful for comparing multiple models/tools to select ones for further evaluation. Then, decision thresholds based on assessments of the costs and benefits of the tool and its performance, such as the standardized net benefit and net reclassification index, can be constructed in order to evaluate the clinical value of a final model [42-45].

Measurement Uncertainty in Model Outputs

While uncommon to report, the uncertainty of a test result (e.g., a quantitative result, risk prediction, or phenotype call) is important to communicate to the user of the test to facilitate proper use of test results in clinical decision making. Suppose a test result Z is defined by the equation Z = g(X1, X2), where X1 and X2 are quantitative imaging biomarker variables with uncertainties u(X1) and u(X2) akin to a standard deviation in repeated measurement. By the delta method, the uncertainty of Z is given by the error propagation formula in Equation 3.

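$$u(Z) = u(g(X_1, X_2)) = \sqrt{\left(u(X_1)\,\frac{\partial Z}{\partial X_1}\right)^2 + \left(u(X_2)\,\frac{\partial Z}{\partial X_2}\right)^2 + 2\,\mathrm{Cov}(X_1, X_2)\,\frac{\partial Z}{\partial X_1}\,\frac{\partial Z}{\partial X_2}} \tag{3}$$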

This formula assumes that u(X1), u(X2), and Cov(X1, X2) are constant across the domain of (X1, X2), which is often not true for quantitative imaging biomarkers. Because Z = g(X1, X2) is a many-to-one function, the same value Z = z could arise from two different sets of values for (X1, X2), that is, z = g(x10, x20) = g(x11, x21), where (x10, x20) ≠ (x11, x21). The uncertainty u(z) may differ depending on whether z was obtained with g(x10, x20) or g(x11, x21) if u(x11) ≠ u(x10) and/or u(x21) ≠ u(x20).
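First-order (delta-method) error propagation for a two-input model can be sketched numerically. The sketch assumes the standard form in which the covariance term is weighted by the product of the two first partial derivatives, approximates those derivatives by central differences, and uses a hypothetical linear model g; the function name and step size h are illustrative choices.

```python
import math


def propagated_uncertainty(g, x1, x2, u1, u2, cov12=0.0, h=1e-6):
    """First-order uncertainty of Z = g(X1, X2), evaluated at (x1, x2)."""
    # Central-difference approximations of dZ/dX1 and dZ/dX2.
    d1 = (g(x1 + h, x2) - g(x1 - h, x2)) / (2 * h)
    d2 = (g(x1, x2 + h) - g(x1, x2 - h)) / (2 * h)
    variance = (u1 * d1) ** 2 + (u2 * d2) ** 2 + 2 * cov12 * d1 * d2
    return math.sqrt(max(variance, 0.0))  # guard against tiny negative values


def model(a, b):
    """Hypothetical model: Z = 2*X1 + 3*X2."""
    return 2 * a + 3 * b


u_indep = propagated_uncertainty(model, 1.0, 1.0, u1=1.0, u2=2.0)
u_corr = propagated_uncertainty(model, 1.0, 1.0, u1=1.0, u2=2.0, cov12=0.5)
print(u_indep, u_corr)
```

Because the derivatives are evaluated at a specific point, calling the function at two input pairs that give the same Z can return different uncertainties when g is nonlinear or when u1 and u2 vary across the domain, which is the situation the paragraph above describes.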

The uncertainty reported for a measurement could incorporate measurement bias (systematic difference) as well as imprecision. Error propagation formulae that include bias as well as imprecision terms are given in many international standards, including ISO 21748 [46] and CLSI EP29-A [47]. Once the measurement uncertainty has been characterized, an uncertainty interval can be formed that covers values that could reasonably be attributed to the measurand.

In the bottom-up modeling approach, a careful, comprehensive dissection of the measurement is performed by identifying its potential sources of uncertainty, and the uncertainties from each source are combined by error propagation to obtain the overall measurement uncertainty. Sources of uncertainty may be studied in the same or separate experiments. If the uncertainty of each source is studied separately, then their covariation cannot be estimated. In the error propagation formula, a covariance of 0 between two source input variables is sometimes assumed without knowing if it is approximately true or not. The bottom-up approach is useful for quantifying the contributions of each source to the overall uncertainty. The knowledge of these contributions can be used to identify potential opportunities for reducing overall uncertainty.

In the top-down modeling approach, the overall uncertainty of a measurement is directly estimated from repeated measurements of selected samples. The basic assumption is that the selected samples are representative of the intended use population of samples. In the top-down approach, individual sources of uncertainty that contribute to the overall uncertainty are not identified. Thus, the approach cannot in principle be used to identify which sources contribute the most to the overall uncertainty.
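A top-down estimate can be sketched from replicate outputs alone. The sketch assumes a simple test-retest design with at least two replicates per subject, pools within-subject variances into a within-subject standard deviation, and reports a repeatability coefficient using the conventional 1.96·√2 multiplier; the data and function names are hypothetical.

```python
import math
import statistics


def within_subject_sd(replicates):
    """Pooled within-subject SD from repeated measurements per subject."""
    variances = [statistics.variance(r) for r in replicates]
    return math.sqrt(sum(variances) / len(variances))


def repeatability_coefficient(replicates):
    """Bound on the absolute difference between two replicates (about 95%)."""
    return 1.96 * math.sqrt(2) * within_subject_sd(replicates)


# Hypothetical model outputs: three subjects, two repeated scans each.
replicates = [[1.0, 3.0], [2.0, 2.0], [5.0, 7.0]]
wsd = within_subject_sd(replicates)
rc = repeatability_coefficient(replicates)
print(round(wsd, 4), round(rc, 4))
```

Consistent with the text, this estimate says nothing about which source (scanner, software, operator) contributes most; it only summarizes the total variation observed in the selected samples. Pooling by a simple average of variances assumes equal numbers of replicates per subject.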

Development of Decision Tools Based on Multiparameter QIBs

Many authors have proposed guidelines or recommendations for biomarker development [48-53], which are crucial for translating results from discovery and development into accurate and effective tools in populations. Here we adapt the general framework proposed in these guidelines and recommendations for use with multiparameter quantitative imaging. The four general phases are discovery (Phase I), development (Phase II), external validation (Phase III), and evaluation of clinical benefit (Phase IV).

Phase I is the Discovery phase. Here we identify QIBs associated with the construct of interest by establishing biological plausibility or empirical relationships with the clinical outcome of interest. We plan the intended use of the QIBs. Intended use is simply the planned use for the multidimensional panel of QIBs (use case 1) or model/decision tool built from QIBs or data-derived markers (use cases 2-4) [54]. For example, a statistical model built from chest CT biomarkers (e.g. biomarkers of airway inflammation, small vessel volume, gas-trapping) might have as an intended use to predict lung cancer. A related term is the context of use which describes the actual conditions under which the model/decision tool will be used in a normal setting [54]. For this example, the context of use for the statistical model might be as a screening tool for people with respiratory symptoms from chronic conditions like COPD. In the development of a decision tool based on mp-QIBs, the intended use must be known for product development, but often the context of use is unclear until the tool has been further developed and validated in a variety of subpopulations.

In addition to investigating the biological plausibility of the potential biomarkers, it is critical that the technical performance characteristics of the biomarkers, particularly precision, are known. Biomarkers with poor repeatability or reproducibility are often discarded from the pool of potential predictors. Several authors have proposed methods of filtering out potential predictors based on poor reproducibility, particularly when the number of predictors greatly exceeds the number of cases [29, 30]; these methods should be applied independently of the model outcome (e.g. phenotype or risk score).
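One simple way to filter predictors on reproducibility, without looking at the outcome, is to compute an intraclass correlation coefficient (ICC) per candidate feature from test-retest data and keep only features above a cutoff. This is a sketch of that general idea, not the specific methods of the cited authors: it uses the one-way random-effects ICC, and the 0.8 cutoff, feature names, and data are hypothetical.

```python
def icc_oneway(data):
    """One-way random-effects ICC; data is a list of per-subject replicate lists."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    means = [sum(row) / k for row in data]
    # Between-subject and within-subject mean squares.
    msb = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    msw = sum((x - m) ** 2 for row, m in zip(data, means) for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


def reproducible_features(test_retest, cutoff=0.8):
    """Keep features whose ICC meets the cutoff; the outcome is never used."""
    return [name for name, data in test_retest.items() if icc_oneway(data) >= cutoff]


# Hypothetical test-retest measurements: three subjects, two scans each.
test_retest = {
    "volume": [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],   # perfectly repeatable
    "texture": [[1.0, 3.0], [3.0, 1.0], [2.0, 2.0]],  # dominated by noise
}
selected = reproducible_features(test_retest)
print(selected)
```

Because the filter uses only repeated measurements of the features, it does not leak outcome information into variable selection.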

Too often clinical decision making uses newly discovered biomarkers, combined in an ad-hoc, non-statistically-rigorous way. For example, in one RCT the selection of patients for lung valve implantation was based on fissure completeness >90%, quantitative emphysema >40%, and quantitative emphysema heterogeneity >10% [55]. This application is ideal for the Development Phase, rather than a clinical use phase, because it is essential to construct a scientifically rigorous model from the QIBs.

Phase II is the Model/Decision Tool Development phase. Model development is the process of choosing a model that approximates associations between variables. We need to first identify the training sample(s) which will be used to specify all aspects of the model (i.e. for variable selection, model fitting, selection of any tuning parameters, and internal validation). In choosing and/or building a training sample(s), prospectively determined subject eligibility and standard operating procedures for image acquisition, processing, and QIB extraction are critical; convenience samples too often lead to highly biased results [56]. In addition, there are many issues to consider specifically for quantitative imaging:

  • Have standardized imaging methods (e.g. scanners, image analysis software) been used to generate the QIBs?

  • Is the precision (repeatability, reproducibility) of the QIB measurements known or do technical performance studies need to be performed to establish these?

  • Do the QIB measurements come from a representative sample of imaging scanners and software analysis tools?

  • Are the magnitudes of the QIB measurements representative of the range of plausible values from the intended target population?

  • Are QIBs evaluated equally at all parts of the range or proportionally to the expected prevalence in the population?

  • Were the QIB measurements made from a single scan or multiple scans and how will correlation within a scan be handled?

  • If there are multiple scans, is the timing between scans standardized and were imaging parameters modified based on results from prior scans?

Conventional statistical models are commonly used for building decision tools, but machine learning techniques [57], including deep learning neural networks and nearest neighbor methods, are also available. In building the tool, distinctions should be made between i) predictors or clinicopathological covariates that affect outcome, e.g. age, and ii) confounders that may modify the apparent performance of a tool, such as the site where the imaging was performed [41, 42]. Some covariates may have both effects. For the former, we need to include covariates in the model based on available clinical evidence in the literature; for the latter, we need to assess the performance of the decision tool, e.g. its discrimination ability, as a function of the covariate, rather than including the covariate in the model [45]. For imaging biomarkers, a model might sometimes include correction factor(s) for QIB measurements that are well understood to under- or over-estimate the true value; for poorly understood effects or random effects, no correction is possible [58]. When there exists an effect that might alter multiple QIBs measured together (e.g. a scanning problem), the model might need to include a variable that defines the shared effect [58]. Note that the simplest model may not be the best performing model, but a highly mathematical model may not be easy to interpret. For classification (e.g. use case 2), the bias-variance trade-off often tilts toward living with some bias in exchange for dramatically reduced variance, suggesting that simpler, even naïve, models will often perform better than more complex models [59]. The benefits of a more complex model are most prominent when the data are highly nonlinear, the signal-to-noise ratio is high, and the number of samples available for model fitting is large.
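To illustrate assessing performance as a function of a confounder rather than entering it into the model, the following Python sketch computes the area under the ROC curve separately within each imaging site. The site labels, scores, and outcomes are hypothetical, and the Mann-Whitney estimator of the AUC is our choice for the illustration.

```python
def auc(scores, labels):
    """Mann-Whitney estimate of the area under the ROC curve (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_by_stratum(records):
    """records: iterable of (stratum, score, label). Returns {stratum: AUC}."""
    grouped = {}
    for stratum, score, label in records:
        scores, labels = grouped.setdefault(stratum, ([], []))
        scores.append(score)
        labels.append(label)
    return {k: auc(s, y) for k, (s, y) in grouped.items()}

# Hypothetical model outputs from two imaging sites (label 1 = diseased).
records = [
    ("site_A", 0.9, 1), ("site_A", 0.8, 1), ("site_A", 0.3, 0), ("site_A", 0.2, 0),
    ("site_B", 0.6, 1), ("site_B", 0.4, 1), ("site_B", 0.5, 0), ("site_B", 0.3, 0),
]
per_site = auc_by_stratum(records)
print(per_site)
```

A marked difference in discrimination between strata, as in this toy example, would suggest that the site modifies the apparent performance of the tool and warrants investigation before the tool is locked down.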

Another consideration is the practical use of the model. A highly complex model with good fit may be impractical to apply or difficult to interpret, whereas a simpler model may be more useful while fitting only marginally less well. “Interpretable” models are often preferred where the mechanism of the model itself may be rigorously understood as biologically plausible [59]. The choice of a model is a balance of priorities between model fit, usability, and trustworthiness [58].

Internal testing of the model/tool is an important part of the development phase. Internal validation, or testing, is the assessment of model performance (e.g. discrimination and calibration) within the same dataset used to develop the test. There are many internal testing methods, including cross-validation and bootstrapping, that can be applied. Once a near final model is established, a test should be performed on a sequestered dataset, i.e. a held-out sample from the dataset used in training; this is still considered internal testing or validation. At the end of the development phase, a locked-down tool is established.
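The cross-validation step can be sketched in Python as follows. This is an illustration under stated assumptions, not a recommended pipeline: the nearest-centroid classifier is a toy stand-in for whatever model is being developed, accuracy stands in for the performance metric of interest, the folds are not stratified by class, and the data are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 once with a fixed seed and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_centroids(X, y):
    """Toy model: the per-class mean of each QIB."""
    def centroid(c):
        rows = [x for x, lab in zip(X, y) if lab == c]
        return [sum(col) / len(rows) for col in zip(*rows)]
    return centroid(0), centroid(1)

def predict(model, x):
    """Assign the class whose centroid is closer in squared Euclidean distance."""
    c0, c1 = model
    d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
    d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
    return int(d1 < d0)

def cross_validated_accuracy(X, y, k=4, seed=0):
    correct = 0
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        # All fitting uses the training folds only; the held-out fold is scored once.
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in held_out)
    return correct / len(X)

# Hypothetical two-QIB measurements for eight subjects (label 1 = diseased).
X = [(1.0, 2.0), (1.2, 1.8), (0.9, 2.1), (1.1, 2.2),
     (3.0, 4.0), (3.2, 3.8), (2.9, 4.1), (3.1, 4.2)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
print(cross_validated_accuracy(X, y, k=4))
```

Any variable selection or tuning would need to be repeated inside each training fold in the same way; otherwise the held-out performance is optimistically biased.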

Phase III is Independent External Validation. Validation is defined as “a process to establish that the performance of a test, tool, or instrument is acceptable for its intended purpose” [22]. In this manuscript we are careful to draw a distinction between internal and external validation. Generally, internal validation aims to optimize model performance and is often conducted using a dataset which is not fully independent from the development dataset, whereas external validation aims to generalize and establish performance of the model in a patient population different from the one used for model development (i.e. data collected from sites not in the training sample). A mp-QIB developed on a training dataset will be fit to measurement error patterns and other peculiarities in the dataset, as well as to actual signal. Thus, internal validation on the same dataset on which the mp-QIB is trained, even if done properly, may not be robust. Internal validation is generally not enough when deriving a data-driven model (use case #4). External validation in independent datasets is a necessary component of rigorous model assessment and should also be conducted.

External validation is performed on the locked-down tool in a dataset independent of the dataset(s) used to develop the test (i.e. dataset collected at clinical sites and imaging facilities different than the training dataset) [61]. External validation establishes confidence that the observed discrimination and calibration performance of a mp-QIB tool will reproduce in new subjects. Sometimes, in the absence of sufficient data for external validation, a model is declared validated after only an internal validation (i.e. using a sequestered sample of the training data); in this situation, authors need to be clear that phase III/External validation is lacking.
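A minimal Python sketch of applying a locked-down tool to an external dataset is given below. The logistic coefficients and the external data are hypothetical; the point is that nothing is re-estimated on the external sample, and that calibration summaries (here the mean predicted risk versus the observed event rate, and the Brier score) are reported alongside discrimination.

```python
import math

# Hypothetical locked-down coefficients; nothing is re-estimated on external data.
INTERCEPT, COEFS = -4.0, (0.05, 1.2)

def predicted_risk(qibs):
    """Logistic risk from the locked linear predictor."""
    z = INTERCEPT + sum(b * x for b, x in zip(COEFS, qibs))
    return 1.0 / (1.0 + math.exp(-z))

def external_validation(X, y):
    """Summaries of calibration and overall accuracy on an independent dataset."""
    risks = [predicted_risk(x) for x in X]
    n = len(y)
    return {
        "mean_predicted": sum(risks) / n,   # compare with observed_rate
        "observed_rate": sum(y) / n,
        "brier": sum((p - o) ** 2 for p, o in zip(risks, y)) / n,
    }

# Hypothetical external subjects: two QIB values each, with observed outcomes.
X_ext = [(20, 1.0), (60, 2.5), (35, 0.5), (80, 3.0)]
y_ext = [0, 1, 0, 1]
result = external_validation(X_ext, y_ext)
print(result)
```

A large gap between the mean predicted risk and the observed event rate in the external sample would indicate miscalibration even when discrimination is preserved.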

Phase IV is the Evaluation of Clinical Benefit. Clinical benefit includes the range of possible benefits or risks to individuals and populations. Clinical benefit is assessed in a clinical outcome study, which could have a retrospective design, where the model or decision tool does not direct patient care, or a prospective design, where the model either does not influence patient care (i.e. integrated mp-QIB) or is used as part of patient care (i.e. integral mp-QIB). The goal is to determine if use of the model in the intended use population as per instructions for use will lead to a net improvement in health outcome, spare patients from suboptimal or unnecessary treatments, help patients avoid toxicity and excessive expense, or provide useful information about diagnosis, treatment efficacy, management, or prevention of a disease [22].

The assessment of clinical benefit should also consider factors related to standard treatment practices in the patient population, the expected outcomes on these treatments, and other tests already addressing the same clinical issue (e.g. measurements of estrogen receptor expression through immunohistochemistry and simple blood markers like PSA in prostate cancer or CA125 in ovarian cancer). A mp-QIB decision tool that distinguishes a low-risk subgroup of patients with high progression-free survival on standard therapy from a higher-risk subgroup with substantially worse outcomes is useful for directing the low-risk subgroup to a less intensive therapy regimen and avoiding treatment-related toxicity. Meanwhile, the decision tool would be much less useful if both groups had poor progression-free survival, even if the difference between them were statistically significant, or if the recommended course of treatment for the two groups was the same. If other tests are currently available for addressing the clinical issue, the mp-QIB decision tool would be preferred if it had superior performance or if it had non-inferior performance but reduced cost, invasiveness, or inconvenience to the patient.

Designing external validation studies and clinical benefit studies to be generalizable and cover the full impact on health can be challenging and requires an in-depth understanding of how the mp-QIB decision tool would be used by health care providers and the clinical context for use [62]. Therefore, the focus of this series on mp-QIBs will be on the statistical considerations in the Discovery and Development phases of the mp-QIB tools.

Technical Performance Claims for mp-QIB Profiles

Claims about the technical performance of mp-QIBs should be developed similarly to those for individual QIBs. A clear intended use of a decision tool in clinical care, a rationale for its potential adoption into practice, and how it might influence disease management should be stated in the Profile. Standard operating procedures for QIB measurement acquisition should be established. The computational procedure underlying the model should be specified and locked down to the extent possible. Results regarding the ability of the model to predict an outcome of interest and the technical performance of model outputs form the claim language. Detailed descriptions of the data upon which these analyses were based and the statistical methodology used to assess prediction and technical performance should be provided in sufficient detail to allow an independent statistician to potentially reproduce the analyses if the data were provided. Details of each of these components are provided in the subsequent subsections.

Intended Use and Rationale

A statement of the intended use of a decision tool built upon the mp-QIBs should include its role in clinical care and the target population. For example, risk prediction models can be used for risk assessment (forecasting future onset of a condition in a non-diseased subject based on known and/or putative risk factors), prognosis (forecasting a clinical outcome in a diseased subject in the absence of therapy or on a standard treatment regimen all patients are likely to receive), treatment response assessment (either as an early indication of the effectiveness of a treatment or as a more quickly or inexpensively ascertainable replacement for a more definitive outcome), screening for disease onset, and surveillance for disease recurrence or progression [63]. Furthermore, how the mp-QIBs are used to guide disease management decisions should be included. For example, a mp-QIB-based model with prognostic ability may be used to identify patients for whom outcomes on standard therapy are expected to be good, favoring a less intensive regimen. Meanwhile, one intended for early assessment of treatment response may quickly identify patients for whom the outcome on a given treatment is expected to be poor, allowing them to switch to a more effective regimen.

Background information justifying the investigation of the QIBs in this intended use should be included in the Profile. Often, this will include the scientific rationale behind the use of the QIBs, results from previous studies regarding their analytic validity, and preliminary evidence of their association with and ability to predict relevant endpoints (e.g., univariate associations with survival or initial attempts to construct and evaluate a model of likelihood of disease recurrence).

QIB Measurement Acquisition Standard Operating Procedures

Standard operating procedures for image acquisition and processing and QIB extraction should be established to reduce variability of the resulting measurement values that may result from differences in factors such as operator or specialist technique, imaging parameters, or device or software used. Those for image acquisition should include imaging parameters and normalization and harmonization protocols consistent with current conventions in practice. Procedures for QIB extraction should include protocols on operator-dependent steps such as those for manual segmentation and specification of software platforms for semi- or fully automated segmentation algorithms or computation of the measurement values by human readers.

These standard operating procedures should also be accompanied by evidence of adequate analytic validity of individual QIB measurement values. Typically, this would include measures of precision such as test-retest repeatability and reproducibility, strength of agreement of the measurements with some reference standard (i.e. bias), and assessment of the nature of the relationship between the QIB values and the underlying measurand (e.g., linearity and slope of the relationship between the two quantities) [14].
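The analytic validity measures named above can be sketched in Python as follows. The test-retest pairs and the reference (e.g. phantom) values are hypothetical. The repeatability coefficient is taken as 1.96·√2 times the within-subject standard deviation, and linearity is summarized by the ordinary least squares slope of measured on reference values; these are common choices that we assume here rather than requirements of any particular Profile.

```python
from statistics import mean

def repeatability_coefficient(pairs):
    """RC = 1.96 * sqrt(2) * wSD, with wSD estimated from test-retest pairs."""
    wsd = mean((a - b) ** 2 / 2 for a, b in pairs) ** 0.5
    return 1.96 * 2 ** 0.5 * wsd

def bias_and_slope(measured, truth):
    """Mean bias against a reference standard and the OLS slope of measured on truth."""
    bias = mean(m - t for m, t in zip(measured, truth))
    mt, mm = mean(truth), mean(measured)
    slope = (sum((t - mt) * (m - mm) for m, t in zip(measured, truth))
             / sum((t - mt) ** 2 for t in truth))
    return bias, slope

# Hypothetical test-retest pairs and reference-standard comparison.
pairs = [(10.0, 10.4), (20.0, 19.6), (30.0, 30.4)]
rc = repeatability_coefficient(pairs)

truth = [10, 20, 30, 40]
measured = [11, 21, 31, 41]
bias, slope = bias_and_slope(measured, truth)
print(round(rc, 3), bias, slope)
```

In this toy example the measurements carry a constant offset of one unit with a slope of one, so a simple additive correction would remove the bias.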

QIB Model Specification

The computation of Z or f(O) in Equations 1 and 2 should be as fully specified as possible in the claim. If g is based on some linear or nonlinear regression model, the full closed-form expression of g should be included. If g is a function without a closed-form expression (e.g., a classifier based on k nearest neighbors or Random Forests), the software or code used to compute g should be referenced; aspects such as seeds for the random number generators should be held fixed or under the control of the researcher. Any preprocessing steps for the model inputs such as imputation procedures for missing measurement values and any transformations (e.g., logarithmic) applied should also be detailed. Possible values of Z or f(O) should also be included, along with their meanings in the context of the intended use.
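One way to make such a specification concrete is to record every element in a single immutable object. The Python sketch below does so for a hypothetical linear model; the class name, field names, coefficient values, and seed are all illustrative assumptions and are not taken from Equations 1 and 2.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class LockedModel:
    """Hypothetical fully specified linear model Z = b0 + sum_j b_j * t_j(x_j)."""
    intercept: float
    coefs: tuple          # one coefficient per QIB, in a fixed order
    log_transform: tuple  # True where the input is log-transformed first
    impute: tuple         # raw-scale value substituted when a measurement is missing
    seed: int = 12345     # recorded so that any stochastic step can be replayed

    def z(self, qibs):
        """Apply imputation, then the transformation, then the linear predictor."""
        total = self.intercept
        for b, use_log, fill, x in zip(self.coefs, self.log_transform,
                                       self.impute, qibs):
            x = fill if x is None else x
            total += b * (math.log(x) if use_log else x)
        return total

model = LockedModel(intercept=-1.0, coefs=(0.5, 2.0),
                    log_transform=(False, True), impute=(10.0, 1.0))

print(model.z((4.0, math.e)))   # both QIBs observed
print(model.z((None, 1.0)))     # first QIB missing, so the imputation value is used
```

Declaring the dataclass as frozen means that an attempt to alter a coefficient after lock-down raises an error, which mirrors the requirement that the computational procedure be held fixed.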

The claim should include an estimate of the performance of the model, e.g. discrimination ability denoted θ̂, and, if possible, 100(1 − α)% confidence intervals (see Equation 3). The choice of θ and its target value θ0 will depend on the outcome variables and the intended use of the mp-QIBs, as described in Section 3. The claim should also include results on the repeatability and reproducibility of Z or f(O) to provide evidence that any mp-QIB-based decision tools built upon the model will produce similar results regardless of where or by whom the imaging and measurement extraction was performed. Thresholds for acceptable repeatability and reproducibility metrics will be dependent on the disease type, modality, and intended use of the mp-QIBs [14].
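As one illustration of attaching an interval to the discrimination estimate, the Python sketch below computes a percentile bootstrap confidence interval for the AUC with a fixed seed. It is not a reproduction of Equation 3; the bootstrap is simply one option that we assume here, subjects are resampled without stratifying by class, and the scores, labels, and number of resamples are hypothetical.

```python
import random

def auc(scores, labels):
    """Mann-Whitney estimate of the area under the ROC curve (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(scores, labels, alpha=0.05, n_boot=2000, seed=0):
    """Percentile bootstrap 100(1 - alpha)% interval, resampling subjects."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # the AUC is defined only if both classes are present
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical model outputs for eight diseased and eight non-diseased subjects.
scores = [0.9, 0.8, 0.75, 0.7, 0.6, 0.55, 0.45, 0.35,
          0.65, 0.5, 0.4, 0.3, 0.25, 0.2, 0.15, 0.1]
labels = [1] * 8 + [0] * 8

theta_hat = auc(scores, labels)
lower, upper = bootstrap_ci(scores, labels)
print(theta_hat, lower, upper)
```

A claim of the form "discrimination exceeds θ0" could then be examined by checking whether the lower confidence limit lies above the pre-specified target value.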

Discussion

We have presented four use cases of multiparameter quantitative imaging biomarkers from a statistical framework, along with definitions and a general process for biomarker development. Four additional papers are included in this series: Raunig et al [5] describe statistical methods for developing, defining change, and testing a multidimensional QIB panel (use case #1); Delfino et al [6] present analytical methods for developing and testing models to determine subjects’ phenotype using QIBs as input (use case #2); Huang et al [7] review approaches to building models for predicting subjects’ risk and outcome while addressing measurement error in the QIB inputs (use case #3); and Wang et al [8] discuss and illustrate methods for different types of engineered quantitative features derived from radiomics applications with imaging inputs (use case #4). In each of these papers the authors present approaches to developing technical performance claims and testing conformance to claims, along with an illustrative example.

The Quantitative Imaging Biomarker Alliance (QIBA), as well as other groups, has been working on development of standard operating procedures for image acquisition and processing, and QIB extraction. This standardization is essential so that separate mp-QIB models do not need to be developed for each combination of imaging device and software. In the past QIBA has focused on writing profiles that are both practical and achieve certain technical performance claims related to bias and precision. These claims on bias and precision have been generalizable across populations due to the standardization of the imaging methods. QIBA’s venture into the mp-QIB space, however, necessitates a stronger tie to clinical purpose. Indeed, clinical perspective and context are vital at the beginning of the mp-QIB biomarker discovery phase with establishment of an intended use, and construction of a representative training sample for development of tools (phase II) also requires us to keep the clinical perspective in sight. Similarly, in the external validation phase (phase III) ensuring generalizability of the validation dataset(s) requires an understanding of the potential clinical use of the tool. Performance metrics such as discrimination and calibration which are measured in phase III are strongly population-dependent. Finally, the evaluation of the clinical benefit of the mp-QIB tools (phase IV) requires adoption of the tools, initially as integrated and later as integral tools, in clinical trials. There is ongoing discussion within QIBA about its role in the development of mp-QIBs and how tools developed by QIBA will be translated into clinical practice. Future work will focus on this important translational piece.

Some potential limitations in the development and validation of mp-QIB tools include biases in study designs, missing data, model overfitting, and lack of robustness across imaging platforms. These limitations should not deter us; the opportunities afforded by mp-QIBs to improve treatment options and patient outcomes are immense. Keeping our focus on the measurement of biomarkers that objectively answer clinically important questions will be key to the future of QIBs and mp-QIB tools.

Acknowledgement:

The RSNA provided administrative support for the working group conference calls where this manuscript was discussed. Drs. Hall, Guimaraes, and Obuchowski were paid consultants for QIBA during the writing of this manuscript.

Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Nancy Obuchowski reports that financial support was provided by the Radiological Society of North America.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Declaration of Interest: None

Contributor Information

Nancy A Obuchowski, Quantitative Health Sciences /JJN3, Cleveland Clinic Foundation, 9500 Euclid Ave. Cleveland, OH 44195.

Erich Huang, Biometric Research Program, Division of Cancer Treatment and Diagnosis – National Cancer Institute, National Institutes of Health.

Nandita M deSouza, Division of Radiotherapy and Imaging, The Institute of Cancer Research and Royal Marsden NHS Foundation Trust, London, United Kingdom and European Imaging Biomarkers Alliance (EIBALL), European Society of Radiology (ESR), Am Gestade 1, Vienna, Austria.

David Raunig, Data Science Institute, Takeda.

Jana Delfino, Center for Devices and Radiological Health, US Food and Drug Administration.

Andrew Buckler, Elucid Bioimaging, Inc.

Charles Hatt, University of Michigan.

Xiaofeng Wang, Quantitative Health Sciences, Cleveland Clinic Foundation.

Chaya Moskowitz, Memorial Sloan Kettering Cancer Institute.

Alexander Guimaraes, Department of Radiology, Oregon Health and Science University.

Maryellen Giger, Department of Radiology, University of Chicago.

Timothy J Hall, Department of Medical Physics, University of Wisconsin.

Paul Kinahan, University of Washington.

Gene Pennello, Division of Biostatistics, Center for Devices and Radiological Health, FDA.

References

  • 1. Kessler LG, Barnhart HX, Buckler AJ, et al. The Emerging Science of Quantitative Imaging Biomarkers Terminology and Definitions for Scientific Studies and Regulatory Submissions. SMMR 2015; 24: 9–26.
  • 2. Raunig DL, McShane LM, Pennello G, et al. Quantitative Imaging Biomarkers: A Review of Statistical Methods for Technical Performance Assessment. SMMR 2015; 24: 27–67.
  • 3. Obuchowski NA, Reeves AP, Huang EP, et al. Quantitative imaging biomarkers: A review of statistical methods for computer algorithm comparisons. SMMR 2015; 24: 68–106.
  • 4. Huang EP, Wang XF, Choudhury KR, et al. Meta-analysis of the technical performance of an imaging procedure: guidelines and statistical methodology. SMMR 2015; 24: 141–174.
  • 5. Raunig et al. Multidimensional quantitative imaging biomarkers as a multivariate descriptor of health. Submitted to Acad Radiol.
  • 6. Delfino et al. Multiparametric Quantitative Imaging Biomarkers in Phenotype Classification. Submitted to Acad Radiol.
  • 7. Huang et al. Multiparametric Quantitative Imaging Biomarkers in Risk Prediction: Recommendations for data acquisition, technical performance assessment, and model development and validation. Submitted to Acad Radiol.
  • 8. Wang et al. Multiparametric Data-driven Imaging Markers: Guidelines for Development, Application and Reporting of Model Outputs in Radiomics. Submitted to Acad Radiol.
  • 9. Hall TJ, Zhu Y, Spalding CS. In vivo real-time freehand palpation imaging. Ultrasound in Medicine & Biology 2003; 29: 427–435.
  • 10. LeCarpentier GL, Roubidoux MA, Fowlkes JB, et al. Suspicious breast lesions: assessment of 3D Doppler US indexes for classification in a test population and fourfold cross-validation scheme. Radiology 2008; 249: 463–470.
  • 11. d’Astous FT and Foster FS. Frequency dependence of ultrasound attenuation and backscatter in breast tissue. Ultrasound in Medicine and Biology 1986; 12: 795–808.
  • 12. Nam K, Zagzebski JA, and Hall TJ. Quantitative assessment of in vivo breast masses using ultrasound attenuation and backscatter. Ultrasonic Imaging 2013; 35: 146–161.
  • 13. Rosado-Mendez IM. Advanced spectral analysis methods for quantification of coherent ultrasound scattering: Applications in the breast. 2014. (Doctoral dissertation, The University of Wisconsin-Madison).
  • 14. Choi H, Charnsangavej C, Faria SC, et al. Correlation of computed tomography and positron emission tomography in patients with metastatic gastrointestinal stromal tumor treated at a single institution with imatinib mesylate: proposal of new computed tomography response criteria. J Clin Oncol 2007; 25: 1753–1759.
  • 15. Sheahan M, Ma X, Paik D, et al. Atherosclerotic plaque tissue: noninvasive quantitative assessment of characteristics with software-aided measurements from conventional CT angiography. Radiology 2018; 286: 622–631.
  • 16. Chrencik MT, Khan AA, Luther L, et al. Quantitative assessment of carotid plaque morphology (geometry and tissue composition) using computed tomography angiography. J Vasc Surg 2019; 70: 858–868.
  • 17. Institute of Medicine of the National Academies. Evolution of translational omics: lessons learned and the path forward. Report Brief. The National Academies Press, Washington, DC, 2012.
  • 18. Hoehndorf R, Schofield PN, Gkoutos GV. Analysis of the human diseasome using phenotype similarity between common, genetic, and infectious diseases. Nature Scientific Reports 2015; 5: 1–15.
  • 19. Fusco R, Di Marzo M, Sansone C, et al. Breast DCE-MRI: lesion classification using dynamic and morphological features by means of a multiple classifier system. Eur Radiol Exp 2017; 1: 10.
  • 20. Barr RG, Ferraioli G, Palmeri ML, et al. Elastography Assessment of Liver Fibrosis: Society of Radiologists in Ultrasound Consensus Conference Statement. Radiology 2015; 276: 845–861.
  • 21. McDonald N, Eddowes PJ, Hodson J, et al. Multiparameter magnetic resonance imaging for quantitation of liver disease: a two-centre cross-sectional observational study. Sci Rep 2018; 15: 8.
  • 22. FDA-NIH Biomarker Working Group. BEST (Biomarkers, EndpointS, and other Tools) Resource. Silver Spring (MD): Food and Drug Administration (US); 2016-. Available from: https://www.ncbi.nlm.nih.gov/books/NBK326791/. Co-published by National Institutes of Health (US), Bethesda (MD).
  • 23. Zakaria R, Chen YJ, Hughes DM, et al. Does the application of diffusion weighted imaging improve the prediction of survival in patients with resected brain metastases? A retrospective multicenter study. Cancer Imaging 2020; 20: 16.
  • 24. Gillies RJ, Kinahan PE, Hricak H. Radiomics: Images Are More than Pictures, They Are Data. Radiology 2016; 278: 563–577.
  • 25. Fournier L, Costaridou L, Bidaut L, et al. Incorporating radiomics into clinical trials: expert consensus on considerations for data-driven compared to biologically-driven quantitative biomarkers. Eur Radiology 2020; 31: 6001–6012.
  • 26. Sun R, Limkin EJ, Vakalopoulou M, et al. A radiomics approach to assess tumour-infiltrating CD8 cells and response to anti-PD-1 or anti-PD-L1 immunotherapy: an imaging biomarker, retrospective multicohort study. Lancet Oncol 2018; 19: 1180–1191.
  • 27. Fan M, Li H, Wang S, et al. Radiomic analysis reveals DCE-MRI features for prediction of molecular subtypes of breast cancer. PLoS One 2017; 12.
  • 28. Hassani C, Varghese BA, Nieva J, et al. Radiomics in Pulmonary Lesion Imaging. American Journal of Roentgenology 2019; 212: 497–504.
  • 29. Hackstadt AJ, Hess AM. Filtering for increased power for microarray data analysis. BMC Bioinformatics 2009; 10: 1471–2105.
  • 30. Park JE, Park SY, Kim HJ, et al. Reproducibility and generalizability in radiomics modeling: possible strategies in radiologic and statistical perspectives. Korean J Radiol 2019; 20: 1124–1137.
  • 31. Su JQ and Liu JS. Linear combinations of multiple diagnostic markers. JASA 1993; 88: 1350–1355.
  • 32. McIntosh MW, Pepe MS. Combining several screening tests: optimality of the risk score. Biometrics 2002; 58: 657–664.
  • 33. Liu A, Schisterman EF, Zhu Y. On linear combinations of biomarkers to improve diagnostic accuracy. Statistics in Medicine 2005; 24: 37–47.
  • 34. Pepe MS, Cai T, Longton G. Combining predictors for classification using the area under the receiver operating characteristic curve. Biometrics 2006; 62: 221–229.
  • 35. van Calster B, Steyerberg EW. Calibration of prognostic risk score. Wiley StatsRef: Statistics Reference Online. Aug 2018: 1–10.
  • 36. Pepe M, Feng Z, Huang Y, et al. Integrating the predictiveness of a marker with its performance as a classifier. Am J Epidemiol 2008; 167: 362.
  • 37. Jin H and Lu Y. The optimal linear combination of multiple predictors under the generalized linear models. Statistics & Probability Letters 2009; 79: 2321–2327.
  • 38. Kang L, Liu A, Tian L. Linear combination methods to improve diagnostic/prognostic accuracy on future observations. SMMR 2013; 25: 1359–1380.
  • 39. Hsu MJ, Hsueh HM. The linear combinations of biomarkers which maximize the partial area under the ROC curves. Comput Stat 2013; 28: 647–666.
  • 40. Ma H, Halabi S, Liu A. On the use of min-max combination of biomarkers to maximize the partial area under the ROC curve. Journal of Probability and Statistics 2019; 2019: 1–13.
  • 41. Pepe MS, Janes H. Methods for evaluating prediction performance of biomarkers and tests. Risk Assessment and Evaluation of Predictions. Lecture Notes in Statistics 2013, Springer Science + Business Media, New York.
  • 42. Pepe M, Kerr K, Longton G, et al. Testing for improvement in prediction model performance. Statistics in Medicine 2013; 32: 1467–1482.
  • 43. Pencina MJ, D’Agostino RB Sr, D’Agostino RB Jr, et al. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Statistics in Medicine 2008; 27: 157–172. With comments by Pepe M, Feng Z, Gu J.
  • 44. Pencina MJ, D’Agostino RB Sr, Steyerberg EW. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Statistics in Medicine 2011; 30: 11–21.
  • 45. Huang Y, Pepe M. A parametric ROC model-based approach for evaluating the predictiveness of continuous markers in case-control studies. Biometrics 2009; 65: 1133–1144.
  • 46. EP29-A., C.A.G.C.d., Expression of Measurement Uncertainty in Laboratory Medicine. 2012, Clinical and Laboratory Standards Institute; Wayne, PA.
  • 47. International Organization for Standardization. Guidance for the use of repeatability, reproducibility and trueness estimates in measurement uncertainty estimation (ISO Standard No 21748:2017(E)). 2017: Geneva, Switzerland.
  • 48. Institute of Medicine, Board on Health Care Services, Board on Health Sciences Policy, Committee on the review of omics-based tests for predicting patient outcomes in clinical trials. Evolution of Translational Omics: Lessons learned and the path forward. 2012.
  • 49. Pepe MS, Etzioni R, Feng Z, et al. Phases of biomarker development for early detection of cancer. J Natl Cancer Inst 2001; 93: 1054–61.
  • 50. Pepe MS, Feng Z, Janes H, et al. Pivotal evaluation of the accuracy of a biomarker used for classification or prediction: standards for study design. J Natl Cancer Inst 2008; 100: 1432–1438.
  • 51. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning, 2nd edition. Springer Science+Business Media, New York. 2008.
  • 52. Fryback DG, Thornbury JR. The efficacy of diagnostic imaging. Med Decis Making 1991; 11: 88–94.
  • 53. O’Connor JPB, Aboagye EO, Waterton JC. Imaging biomarker roadmap for cancer studies. Nat Rev Clin Onc 2017; 14: 169–186.
  • 54. https://www.fda.gov/drugs/biomarker-qualification-program/context-use. Last viewed on February 4, 2022.
  • 55. Criner GJ, Delage A, Voelker K, et al. Improving Lung Function in Severe Heterogenous Emphysema with the Spiration Valve System (EMPROVE). A Multicenter, Open-Label Randomized Controlled Clinical Trial. American Journal of Respiratory and Critical Care Medicine 2019; 200: 1354–1362.
  • 56. Simon RM, Paik S, and Hayes DF. Use of archived specimens in evaluation of prognostic and predictive biomarkers. JNCI 2009; 101: 1446–1452.
  • 57. Kuhn M, Johnson K. Applied predictive modeling. New York: Springer; 2013.
  • 58. Uncertainty of measurements – Part 6: Developing and using measurement models. International Electrotechnical Commission (IEC) 2020; ISO/IEC FDGuide 98–6:2020, Switzerland; available at www.iso.org.
  • 59. Dudoit S, Fridlyand J, Speed TP. Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. JASA 2002; 97: 77–87.
  • 60. Rudin C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 2019; 1: 206–215.
  • 61. Altman DG, Royston P. What do we mean by validating a prognostic model? Statist Med 2000; 19: 453–473.
  • 62. Bossuyt PMM, Lijmer JG, Mol BWJ. Randomised comparisons of medical tests: sometimes invalid, not always efficient. Lancet 2000; 356: 1844–47.
  • 63. Huang EP, Lin FI, Shankar LK. Beyond correlations, sensitivities, and specificities: a roadmap for demonstrating utility of advanced imaging in oncology treatment and clinical trial design. Acad Radiol 2017; 24: 1036–1049.
