Author manuscript; available in PMC: 2022 Aug 1.
Published in final edited form as: Int J Radiat Oncol Biol Phys. 2021 Mar 1;110(5):1451–1465. doi: 10.1016/j.ijrobp.2021.02.030

Multi-block Discriminant Analysis of Integrative 18F-FDG-PET/CT Radiomics for Predicting Circulating Tumor Cells in Early Stage Non-small Cell Lung Cancer Treated with Stereotactic Body Radiation Therapy

Sang Ho Lee 1, Gary D Kao 1, Steven J Feigenberg 1, Jay F Dorsey 1, Melissa A Frick 1, Samuel Jean-Baptiste 1, Chibueze Z Uche 1, Keith A Cengel 1, William P Levin 1, Abigail T Berman 1, Charu Aggarwal 2, Yong Fan 3, Ying Xiao 1
PMCID: PMC8286285  NIHMSID: NIHMS1678654  PMID: 33662459

Abstract

Purpose:

To integrate 18F-FDG-PET/CT radiomics with multi-block discriminant analysis for predicting circulating tumor cells (CTCs) in early stage non-small cell lung cancer (ES-NSCLC) treated with stereotactic body radiation therapy (SBRT).

Methods:

Fifty-six patients with stage I NSCLC treated with SBRT underwent 18F-FDG-PET/CT imaging pre-SBRT and post-SBRT (median, 5 months; range, 3–10 months). CTCs were assessed via a telomerase-based assay before and within 3 months after SBRT and dichotomized at 5 and 1.3 CTCs/mL, respectively. Pre-SBRT, post-SBRT and delta PET/CT radiomics features (n=1,548×3 for PET and 1,562×3 for CT) were extracted from the gross tumor volume. Seven feature blocks were constructed, including clinical parameters (n=12). Multi-block data integration was performed using block sparse partial least squares-discriminant analysis (sPLS-DA), referred to as DIABLO, to identify key signatures by maximizing common information between different feature blocks while discriminating CTC levels. Optimal input blocks were identified using a pairwise combination method. DIABLO performance for predicting pre-SBRT and post-SBRT CTCs was evaluated using combined AUC (area under the curve, averaged across different blocks) analysis with 20×5-fold cross-validation (CV), and compared with that of concatenation-based sPLS-DA, which combines all features into one block. CV prediction scores of the two classes were compared using the Wilcoxon rank sum test.

Results:

For predicting pre-SBRT CTCs, DIABLO achieved the best performance with combined pre-SBRT PET radiomics and clinical feature blocks, showing a CV-AUC of 0.875 (p=0.009). For predicting post-SBRT CTCs, DIABLO achieved the best performance with combined post-SBRT CT and delta CT radiomics feature blocks, showing a CV-AUC of 0.883 (p=0.001). In contrast, no single-block sPLS-DA model attained a CV-AUC higher than 0.7.

Conclusions:

Multi-block integration with discriminant analysis of 18F-FDG-PET/CT radiomics has the potential for predicting pre-SBRT and post-SBRT CTCs. Radiomics and CTC analysis may complement and together help guide the subsequent management of patients with ES-NSCLC.

Keywords: Circulating tumor cell, radiomics, 18F-FDG-PET/CT, early stage non-small cell lung cancer, stereotactic body radiation therapy

Introduction

Circulating tumor cells (CTCs) have gained interest as a biologically relevant biomarker in non-small cell lung cancer (NSCLC), and CTC counts appear to reflect oncologic outcomes following radiation therapy (RT) in patients with localized NSCLC (1). The quantity of CTCs in patients with lung cancer can be used to predict higher risk of failure after treatment as well as serve as a surveillance tool to give lead time notice of recurrence or progression of cancer (2). Because CTCs are detected in peripheral blood samples of cancer patients, serial assays can be performed as “liquid biopsies”, with no risk and little discomfort to patients (3).

Patients with early stage non-small cell lung cancer (ES-NSCLC) treated with stereotactic body radiation therapy (SBRT) have excellent local control rates (4), although they have a moderate risk of regional and distant recurrences similar to surgery (5, 6). Recently, elevated pre-SBRT CTCs or persistently detectable post-SBRT CTCs have been found to predict a significantly increased risk of regional and distant recurrence outside the treatment volume (7). Radiomics biomarkers associated with CTCs may have additional value in predicting these recurrences, as they may offer a non-invasive diagnostic alternative when CTC analysis is unavailable. For example, advanced imaging that permits radiomics may be widely available in the community, far from tertiary care centers, even at sites that lack access to, or the infrastructure for, CTC analysis. Radiomic features have furthermore been linked to oncogenic expression (8), which in turn may contribute to CTCs. Consequently, it is inherently valuable to identify radiomic features that are tightly correlated with CTCs. Such radiomic features may in turn suggest testable biological hypotheses for how early stage cancer can transform into regional and metastatic lethal disease, e.g., complex or heterogeneous image shapes may reflect multiclonal or biologically aggressive cancer. Moreover, as CTC analysis attracts increasing interest as a compelling prognostic factor for NSCLC (1, 7, 9, 10), biasing selection towards CTC-dependent radiomics features may serve as a means of sorting out promising candidate prognostic imaging biomarkers in a translationally compelling way, which in turn may lead to a radiomics-based model that is effective for predicting clinical outcomes.

The rapid rise of imaging computing power has provided a wealth of multi-radiomics data coming from multimodality imaging such as CT and PET (11). The large number of radiomics features compared to the limited amount of available patient data presents a computational challenge when identifying key radiomic signatures of disease (12, 13). The most commonly used data integration framework that enables the identification of multi-radiomics signatures in a data-driven analysis is the concatenation-based integration that combines multiple datasets into a single large dataset (14, 15). However, this approach cannot account for the relationship between multiple sources of information captured via different data types, and thus limits our deciphering of interactions between multi-radiomics phenotypes (16). Therefore, there is a crucial need for a novel integrative method that can identify a multi-radiomics signature by borrowing discriminatory strength from complementary information across multi-radiomics data, while providing better insight into disease mechanisms.

In this study, we applied an integrative approach for identifying key signatures that addresses, in a holistic manner, the correlated nature of the different radiomics phenotypes (anatomic and metabolic) derived from 18F-FDG-PET/CT. The purpose of this study was to evaluate the effect of FDG-PET/CT radiomics data integration with multi-block discriminant analysis on predicting pre-selected high-risk CTC values in patients with ES-NSCLC treated with SBRT.

Methods and Materials

Dataset and Definition of CTC Positivity

Between October 15, 2012 and June 28, 2017, eligible patients with stage I NSCLC who underwent SBRT were enrolled in a prospective, institutional review board-approved biomarker trial. The trial included both subjects with biopsy-proven disease and those with a presumed, clinically diagnosed NSCLC (AJCC 7th edition). The clinical diagnosis of NSCLC was determined by multidisciplinary consensus for those subjects who were ineligible for biopsy because of comorbidities or who had indeterminate pathology results. Factors such as smoking status, lesion size, radiographic appearance of lesions (i.e., spiculations, lack of calcifications), lesion growth on serial imaging, and PET avidity were considered in designating clinically diagnosed lesions as ES-NSCLC, in keeping with recent guidelines from our group (17). Patient eligibility criteria were described in detail in a previous publication (7). SBRT was delivered to all subjects to a median dose of 50 Gy (range, 50–60 Gy) in 4 or 5 fractions (range, 4–20); median biologically effective dose (α/β = 10) was 100 Gy (range, 78–112.5 Gy). For each patient, a pre-SBRT peripheral blood sample was collected at initial consult or simulation for radiation therapy for quantitative and qualitative analysis of CTCs. Post-SBRT blood samples were obtained at follow-up within 3 months after SBRT. A total of 56 patients who underwent 18F-FDG-PET/CT imaging pre-SBRT and post-SBRT (median, 5 months; range, 3 to 10 months) were available for analysis at baseline and early post-SBRT follow-up. Detailed information on PET/CT imaging protocols is provided as a supplementary document (supplementary document 1). Table 1 compares the clinical characteristics of the patients enrolled with those of the patients available for analysis. CTCs were assessed and enumerated via a telomerase-based assay. CTC counts were then dichotomized at 5 CTCs/mL before SBRT and at 1.3 CTCs/mL within 3 months after SBRT, the thresholds for high-risk CTC positivity established by our previous investigation (7). There were 8 high-risk CTC-positive patients at baseline, and 15 high-risk CTC-positive patients in the early post-SBRT period.
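The biologically effective dose values quoted above can be checked with the standard linear-quadratic formula, BED = D × (1 + d / (α/β)), where D is the total dose and d the dose per fraction. The sketch below is illustrative only: the three schedules are hypothetical combinations picked from within the reported dose and fraction ranges, and the "at or above threshold" rule for CTC positivity is an assumption, since the text does not say whether the cut-off is inclusive.

```python
def biologically_effective_dose(total_dose_gy, n_fractions, alpha_beta_gy=10.0):
    """Linear-quadratic BED = D * (1 + d / (alpha/beta)), with d = D / n."""
    dose_per_fraction_gy = total_dose_gy / n_fractions
    return total_dose_gy * (1.0 + dose_per_fraction_gy / alpha_beta_gy)


def is_high_risk(ctc_per_ml, threshold_per_ml):
    """Assumption: a count at or above the threshold is called high risk."""
    return ctc_per_ml >= threshold_per_ml


PRE_SBRT_THRESHOLD = 5.0   # CTCs/mL, from the text
POST_SBRT_THRESHOLD = 1.3  # CTCs/mL, from the text

# Hypothetical schedules picked from within the reported ranges.
for dose, fractions in [(50, 5), (50, 4), (60, 20)]:
    bed = biologically_effective_dose(dose, fractions)
    print(f"{dose} Gy in {fractions} fractions: BED = {bed:.1f} Gy")
```

Under these assumptions, 50 Gy in 5 fractions gives 100 Gy, 50 Gy in 4 fractions gives 112.5 Gy, and 60 Gy in 20 fractions gives 78 Gy, which is consistent with the reported median and range.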

Table 1.

Patient characteristics of clinical parameters. Statistical comparison between dataset for patients enrolled and dataset for patients available for analysis was computed using Chi square (categorical variables) or Wilcoxon rank sum test (continuous variables).

Clinical parameter | Patients enrolled (n=92) | Patients available for analysis (n=56) | P-value
Gender [n (%)] | | | 0.678
  Male | 39 (42%) | 21 (37%) |
  Female | 53 (58%) | 35 (63%) |
Age [median (range) in years] | 71 (55–93) | 70 (55–91) | 0.524
Race [n (%)] | | | 0.727
  White | 74 (80%) | 44 (78%) |
  African-American | 14 (15%) | 11 (20%) |
  Asian | 1 (1%) | 0 (0%) |
  Other | 3 (3%) | 1 (2%) |
Body mass index [median (range)] | 26 (16–43) | 25 (17–43) | 0.640
Smoking status [n (%)] | | | 0.917
  Former | 69 (75%) | 41 (73%) |
  Current | 19 (20%) | 13 (23%) |
  Never | 4 (5%) | 2 (4%) |
Pack-years [median (range)] | 40 (0–200) | 40 (3.5–165) | 0.636
Tumor size [median (range) in cm] | 1.7 (0.5–5) | 1.8 (0.7–4.7) | 0.820
Tumor SUV [median (range)] | 4.3 (0.8–19.9) | 4.2 (0.8–19.9) | 0.753
AJCC stage [n (%)] | | | 0.683
  IA | 81 (88%) | 48 (86%) |
  IB | 11 (12%) | 8 (14%) |
T stage [n (%)] | | | 0.876
  1a | 59 (64%) | 36 (64%) |
  1b | 21 (23%) | 12 (22%) |
  2a | 12 (13%) | 8 (14%) |
Histology [n (%)] | | | 0.932
  Adenocarcinoma | 22 (24%) | 13 (23%) |
  Squamous cell carcinoma | 15 (16%) | 8 (14%) |
  Not pathologically confirmed | 55 (60%) | 35 (63%) |
Previous NSCLC [n (%)] | | | 0.628
  Yes | 17 (19%) | 13 (23%) |
  No | 75 (81%) | 43 (77%) |

Radiomics Features

Pre-SBRT, post-SBRT and delta radiomics features were extracted in 3D from the gross tumor volume (GTV) of the original, wavelet-filtered and Laplacian of Gaussian (LoG)-filtered CT images and standardized uptake value (SUV) maps on 18F-FDG PET, including shape features (from CT), statistical local texture features (from PET and CT) that can be categorized into the first-order statistics (one pixel), second-order statistics (pair of pixels) and higher-order statistics (three or more pixels), and global texture features (from PET and CT) that can characterize the entire texture (all pixels) of an object independently from the object’s position, orientation, scale or rotation (18). A total of 1,562×3 CT and 1,548×3 PET radiomics features were extracted for constructing pre-SBRT, post-SBRT and delta radiomics datasets, respectively. The delta radiomics features were defined as the relative change of the original radiomics features from pre- to post-SBRT. The shape and local texture features were extracted using an open-source Python package, PyRadiomics (19), and the global texture features were extracted using in-house code written in Python. We did not interpolate patient images to uniform voxel sizes in all three dimensions or between patients, and used fixed bin widths of 25 HU and 0.1 SUV for discretization of CT and PET SUV images (20), respectively, and wavelets with a starting bin edge of 0, the default for PyRadiomics. A 3D discrete and single-stage wavelet transform was used to decompose volumetric images into eight decomposed volumes of images, labeled as LLL, LLH, LHL, LHH, HLL, HLH, HHL and HHH, where L and H are low- and high-frequency signals. For instance, LLH is a volume of images transformed by using low-pass filters along the x and y axes, and a high-pass filter along the z axis. For the LoG preprocessing, we chose kernel sizes with σ=1, 3, and 5 mm. More detailed explanations of feature calculations can be found in the PyRadiomics documentation (19). 
The global texture features were extracted after image intensity values were rescaled to [0, 255], which consisted of the second-, third- and fourth-order normalized central moments (NCMs) and moment invariants (MIs). Refer to Lee et al. (18) for details about their computational formulas. Table 2 summarizes radiomics features used in this study.
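To make the delta-feature and discretization steps concrete, the following is a minimal sketch and not the authors' code. The relative-change formula (post − pre) / pre is an assumption, since the text only says "relative change", and the handling of a zero baseline is hypothetical. The binning rule follows the fixed-bin-width formula given in the PyRadiomics documentation, floor(x / w) − floor(min(x) / w) + 1.

```python
import math


def relative_change(pre, post):
    """Delta feature as relative change from pre- to post-SBRT (assumed form)."""
    if pre == 0:
        return float("nan")  # hypothetical choice: undefined for a zero baseline
    return (post - pre) / pre


def discretize_fixed_bin_width(values, bin_width):
    """Fixed-bin-width gray-level discretization; bins are numbered from 1."""
    lowest_bin = math.floor(min(values) / bin_width)
    return [math.floor(v / bin_width) - lowest_bin + 1 for v in values]


# Toy CT intensities in HU, binned with the 25 HU width used in the study.
print(discretize_fixed_bin_width([-30, 0, 24, 25, 60], 25))
# A feature that rises from 2.0 to 3.0 has a relative change of 0.5.
print(relative_change(2.0, 3.0))
```

With a fixed bin width the number of gray levels varies with the intensity range inside each GTV, whereas a fixed bin count would rescale every lesion to the same number of levels.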

Table 2.

Radiomics features used in this study.

Feature type | Feature class | Feature name | No. of features
Texture (global) | 3D moments | 2nd-, 3rd- and 4th-order normalized central moments (NCMs, n=31) and moment invariants (MIs, n=5) | 36
Texture (higher-order) | Neighboring gray tone difference matrix (NGTDM) | Coarseness, Complexity, Strength, Contrast, Busyness | 5
Texture (higher-order) | Gray level size zone matrix (GLSZM) | GrayLevelVariance, SmallAreaHighGrayLevelEmphasis, GrayLevelNonUniformityNormalized, SizeZoneNonUniformityNormalized, SizeZoneNonUniformity, GrayLevelNonUniformity, LargeAreaEmphasis, ZoneVariance, ZonePercentage, LargeAreaLowGrayLevelEmphasis, LargeAreaHighGrayLevelEmphasis, HighGrayLevelZoneEmphasis, SmallAreaEmphasis, LowGrayLevelZoneEmphasis, ZoneEntropy, SmallAreaLowGrayLevelEmphasis | 16
Texture (higher-order) | Gray level run length matrix (GLRLM) | ShortRunLowGrayLevelEmphasis, GrayLevelVariance, LowGrayLevelRunEmphasis, GrayLevelNonUniformityNormalized, RunVariance, GrayLevelNonUniformity, LongRunEmphasis, ShortRunHighGrayLevelEmphasis, RunLengthNonUniformity, ShortRunEmphasis, LongRunHighGrayLevelEmphasis, RunPercentage, LongRunLowGrayLevelEmphasis, RunEntropy, HighGrayLevelRunEmphasis, RunLengthNonUniformityNormalized | 16
Texture (second-order) | Gray level dependence matrix (GLDM) | GrayLevelVariance, HighGrayLevelEmphasis, DependenceEntropy, DependenceNonUniformity, GrayLevelNonUniformity, SmallDependenceEmphasis, SmallDependenceHighGrayLevelEmphasis, DependenceNonUniformityNormalized, LargeDependenceEmphasis, LargeDependenceLowGrayLevelEmphasis, DependenceVariance, LargeDependenceHighGrayLevelEmphasis, SmallDependenceLowGrayLevelEmphasis, LowGrayLevelEmphasis | 14
Texture (second-order) | Gray level co-occurrence matrix (GLCM) | JointAverage, SumAverage, JointEntropy, ClusterShade, MaximumProbability, Idmn, JointEnergy, Contrast, DifferenceEntropy, InverseVariance, DifferenceVariance, Idn, Idm, Correlation, Autocorrelation, SumEntropy, MCC, SumSquares, ClusterProminence, Imc2, Imc1, DifferenceAverage, Id, ClusterTendency | 24
Texture (first-order) | Gray level histogram | InterquartileRange, Skewness, Uniformity, Median, Energy, RobustMeanAbsoluteDeviation, MeanAbsoluteDeviation, TotalEnergy, Maximum, RootMeanSquared, 90Percentile, Minimum, Entropy, Range, Variance, 10Percentile, Kurtosis, Mean | 18
Shape | | Elongation, Flatness, LeastAxisLength, MajorAxisLength, Maximum2DDiameterColumn, Maximum2DDiameterRow, Maximum2DDiameterSlice, Maximum3DDiameter, MeshVolume, MinorAxisLength, Sphericity, SurfaceArea, SurfaceVolumeRatio, VoxelVolume | 14
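As a consistency check, the per-class counts in Table 2 can be combined with the image types described in the text (original, eight wavelet sub-bands, three LoG kernels) to recover the reported totals. The sketch below assumes that every texture and first-order class is computed on each of the 12 image types and that shape features are computed once, from CT only. That assumption is not stated outright in the text, but it reproduces the reported 1,548 PET and 1,562 CT features per time point.

```python
from itertools import product

# Feature counts per class, taken from Table 2.
per_image_counts = {
    "global 3D moments": 36,
    "NGTDM": 5,
    "GLSZM": 16,
    "GLRLM": 16,
    "GLDM": 14,
    "GLCM": 24,
    "first-order": 18,
}
shape_count = 14  # shape features, extracted from CT only

# Eight single-stage 3D wavelet sub-bands: LLL, LLH, ..., HHH.
wavelet_bands = ["".join(p) for p in product("LH", repeat=3)]
log_sigmas_mm = [1, 3, 5]
image_types = (
    ["original"]
    + [f"wavelet-{band}" for band in wavelet_bands]
    + [f"log-sigma-{sigma}mm" for sigma in log_sigmas_mm]
)

per_image_total = sum(per_image_counts.values())
pet_total = per_image_total * len(image_types)
ct_total = pet_total + shape_count

print(len(image_types), per_image_total, pet_total, ct_total)
```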

Data Analysis

Integration Methodology

The most popular data integration framework for the identification of radiomics signatures in a data-driven analysis is concatenation-based integration, which combines multiple datasets into a single large dataset and performs feature selection or dimension reduction on the whole dataset. However, this approach is not equipped to effectively handle correlated features across different data types that are expected to have different relevance to the outcome. A joint data analysis that considers combined interactions between features of different data types may help in understanding the underlying mechanisms of the disease of interest. Therefore, we applied an integrative approach for identifying key signatures from high-throughput multi-radiomics assays derived from PET/CT that addresses the correlated nature of the different data types, in order to make use of the synergy between the metabolic and structural information provided by PET and CT data and to identify robust and reliable multi-radiomics signatures for predicting CTCs. Radiomics data integration was performed using block sparse partial least squares-discriminant analysis (sPLS-DA), referred to as Data Integration Analysis for Biomarker discovery using Latent cOmponents (DIABLO) (21), to maximize common information between PET and/or CT radiomics features or between radiomics and clinical features, and to identify key signatures while discriminating CTC levels. The sPLS-DA underlying DIABLO was originally developed for the supervised analysis of one data set (22). Unlike principal component analysis (PCA), which focuses on variance maximization of the features alone, sPLS-DA maximizes the covariance between the matrix of features and the categorical outcome of interest to estimate the parameters of a linear regression model, and thus provides a regression extension of PCA. 
The block sPLS-DA, i.e., DIABLO enables the integration of the same samples measured on different omics platforms, and extends a sparse generalized canonical correlation analysis (sGCCA) (23) to a supervised classification framework that could detect the latent relations of multi-view data with sparse structures. sGCCA is a multivariate dimension reduction technique that uses singular value decomposition and selects correlated features from different omics datasets. sGCCA integrates multiple datasets by finding principal components (latent variables) that maximize the covariance of scores between different datasets and the categorical outcome of interest. To extend sGCCA for a classification framework, one omics dataset is substituted with a dummy indicator matrix that indicates the class membership of each sample. Dimension reduction is achieved by projecting the data into the smaller dimensional subspace spanned by the principal components, where each sample is assigned a score on each new principal component dimension and this score is calculated as a linear combination of original features to which a weight is applied. The weight vectors used to calculate the linear combinations are called the loading vectors. The selection of correlated features across omics levels is performed internally with LASSO penalty (i.e., l1 penalization) for the loading vectors. The LASSO penalty parameter can be replaced with the number of features to be selected in each dataset and each component as there is a direct correspondence between both parameters. Then, the resulting loading vectors are constrained to give discriminants that correlate between these datasets.
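The link between the LASSO penalty, the number of features kept, and the component scores can be illustrated with a deliberately simplified, single-block, single-component sketch. It is not the mixOmics algorithm: it assumes a two-class outcome coded as ±1, takes the loading vector proportional to the feature-outcome cross-product, soft-thresholds it so that at most `keep` features survive, and computes each sample's score as the weighted sum of its features. All names and toy data are hypothetical.

```python
import math


def soft_threshold(vector, keep):
    """Shrink entries toward zero so that at most `keep` stay non-zero."""
    magnitudes = sorted((abs(v) for v in vector), reverse=True)
    cutoff = magnitudes[keep] if keep < len(vector) else 0.0
    return [math.copysign(max(abs(v) - cutoff, 0.0), v) for v in vector]


def sparse_loading(X, y, keep):
    """One sparse loading vector: soft-thresholded X^T y with unit l2 norm."""
    n_features = len(X[0])
    cross = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(n_features)]
    shrunk = soft_threshold(cross, keep)
    norm = math.sqrt(sum(v * v for v in shrunk))
    return [v / norm for v in shrunk] if norm > 0 else shrunk


def component_scores(X, loading):
    """Score of each sample: linear combination of its features."""
    return [sum(x * a for x, a in zip(row, loading)) for row in X]


# Toy standardized data: 4 samples, 3 features; only feature 0 tracks y.
X = [[1.0, 0.2, -0.1], [0.8, -0.3, 0.2], [-0.9, 0.1, 0.0], [-0.9, 0.0, -0.1]]
y = [1, 1, -1, -1]
loading = sparse_loading(X, y, keep=1)
print(loading)
print(component_scores(X, loading))
```

Choosing `keep` plays the role of the penalty parameter here, which mirrors the correspondence described above between the LASSO penalty and the number of features selected per block and component.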

Design Matrix in DIABLO

Before applying the DIABLO procedure, a design matrix can be specified to indicate which datasets should be correlated; its entries range from zero (datasets are not connected) to one (datasets are fully connected). Thus, the DIABLO model can be constrained to consider specific pairwise covariances by setting the design matrix accordingly, which makes it possible to model a connection between pairs of datasets. We evaluated the performance of three different designs in order to choose among them when seeking a predictive model. For example, when combining two feature blocks, a 3 × 3 design matrix C can be constructed that includes the outcome dummy indicator as its last row and column: 1) a full design, which maximizes correlation between datasets, i.e., $C=\begin{bmatrix}0&1&1\\1&0&1\\1&1&0\end{bmatrix}$; 2) a null design, which maximizes separation between the outcome classes, i.e., $C=\begin{bmatrix}0&0&1\\0&0&1\\1&1&0\end{bmatrix}$; and 3) a full weighted design, which provides a compromise between correlation and discrimination, i.e., $C=\begin{bmatrix}0&0.5&1\\0.5&0&1\\1&1&0\end{bmatrix}$.
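The three designs differ only in the weight placed on the link between feature blocks, so they can be generated from a single parameter. The helper below is a hypothetical illustration in plain Python (the study itself used the R mixOmics package); it places the outcome indicator in the last row and column and always connects it to every feature block with weight 1.

```python
def design_matrix(n_blocks, between_block_weight):
    """Square design of size n_blocks + 1; the last index is the outcome."""
    size = n_blocks + 1
    design = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i == j:
                continue  # a block is never linked to itself
            involves_outcome = i == n_blocks or j == n_blocks
            design[i][j] = 1.0 if involves_outcome else between_block_weight
    return design


full = design_matrix(2, 1.0)           # maximizes correlation between blocks
null = design_matrix(2, 0.0)           # focuses on class separation only
full_weighted = design_matrix(2, 0.5)  # compromise between the two

for name, matrix in [("full", full), ("null", null), ("weighted", full_weighted)]:
    print(name, matrix)
```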

Performance Evaluation

We used pre-SBRT PET/CT radiomics and clinical features for predicting pre-SBRT CTCs, while we used pre-SBRT, post-SBRT, delta PET/CT radiomics and clinical features for predicting post-SBRT CTCs. All features were standardized to zero mean and unit variance to normalize the data. The CT and PET radiomics datasets were each composed of three different feature blocks, i.e., pre-SBRT, post-SBRT and delta radiomics blocks. Clinical data were considered an additional separate feature block. We performed data analysis with a multi-view data combination strategy (24) to identify optimally combined input blocks based on DIABLO by using every possible combination of pairs of different feature blocks. Note that we defined each feature block as a single view unit and restricted our analysis to the pairwise block combinations for the sake of simplicity. For input data, categorical variables in clinical features were converted into dummy variables, and each feature was centered and scaled internally. A total of four components were used to extract latent information from each feature block and discriminate the CTC levels. The number of features per block and per component that led to the best prediction accuracy of the DIABLO model was determined by a grid search with 20×5-fold cross-validation (CV). The grid in each block was composed of a small number of features (≤4, with a step of 1) for each component to suppress potential model overfitting to limited samples. Prediction accuracy was evaluated using the balanced error rate (BER) to account for the class imbalance of CTC levels in patients; the BER was calculated on the left-out samples during the CV procedure and averaged across the repeated CV runs. Once an optimal number of components and an optimal list of selected features for each component were chosen, the final performance was obtained with 20×5-fold CV. 
During the CV, and using the given DIABLO model with its set of selected features on each component, an AUC was calculated for each CV fold for each component in each block. These AUC values were averaged over CV folds to provide an average AUC for the CV for each component in each block. After all repeats, the AUCs were averaged over repeats for all components of each block. A combined AUC for each component was then calculated by averaging over the two different feature blocks, and the component that achieved the highest AUC was selected. CV prediction scores of the two classes were compared using the Wilcoxon rank sum test. Bonferroni correction for multiple comparisons was used to adjust the significance level as α/n, where α=0.05 and n was the number of tests with different inputs; taking DIABLO for example, α/n=0.05/3≅0.017 for predicting pre-SBRT CTCs while α/n=0.05/21≅0.002 for predicting post-SBRT CTCs. Of the full, full weighted and null design matrices, the best design matrix in the DIABLO model was chosen based on the highest AUC for each combination of different feature blocks. The performance of DIABLO models for predicting pre-SBRT and post-SBRT CTCs was compared with that of concatenation-based sPLS-DA models derived from combining all features in a pair of blocks into a single block. The performance of sPLS-DA for each single block was also tested. For building the single-block sPLS-DA models, a set of features per component was selected by grid search with 20×5-fold CV. For fair comparison with DIABLO, the number of search features in the single-block sPLS-DA was set to the sum of the number of search features for each block in DIABLO. All data analyses were implemented with the R mixOmics package (version 6.10.6) (25). 
An example overview of the single concatenated omics and integrative omics strategies along with their associated algorithms implemented for handling multi-omics features is shown in Figure 1. A workflow diagram representing the process for data acquisition, radiomics feature extraction and the overview of multi-block feature integration with DIABLO is shown in Figure S1(a), and an example of ROC curves illustrating the process of being averaged for 20×5-fold CV on each of three components derived from two different feature datasets with DIABLO in Figure S1(b).
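The evaluation metrics described above can be sketched in a few lines. The code below is an illustration rather than the mixOmics implementation: it computes the balanced error rate as the mean of the per-class error rates, the AUC in its rank-based form (the probability that a positive sample scores higher than a negative one, with ties counted as one half), and a combined AUC as the plain average over blocks. The per-block AUC values and prediction scores in the example are hypothetical; the class sizes (8 positive of 56) are the baseline counts reported in this study.

```python
def balanced_error_rate(y_true, y_pred):
    """Mean of per-class error rates, so each class counts equally."""
    per_class_errors = []
    for label in sorted(set(y_true)):
        predictions = [p for t, p in zip(y_true, y_pred) if t == label]
        per_class_errors.append(sum(p != label for p in predictions) / len(predictions))
    return sum(per_class_errors) / len(per_class_errors)


def rank_auc(scores_positive, scores_negative):
    """Rank-based AUC: P(positive > negative) + 0.5 * P(tie)."""
    wins = sum(
        (p > n) + 0.5 * (p == n) for p in scores_positive for n in scores_negative
    )
    return wins / (len(scores_positive) * len(scores_negative))


def combined_auc(auc_per_block):
    """Combined AUC for one component: average over the feature blocks."""
    return sum(auc_per_block) / len(auc_per_block)


# With 8 CTC-positive patients out of 56, predicting "negative" for everyone
# looks accurate on raw error but is no better than chance on the BER.
y_true = [1] * 8 + [0] * 48
y_pred = [0] * 56
raw_error = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(raw_error, 3), balanced_error_rate(y_true, y_pred))
print(rank_auc([0.9, 0.8, 0.4], [0.7, 0.3]), combined_auc([0.80, 0.90]))
```

The rank-based form also shows why the Wilcoxon rank sum test is a natural companion to the AUC: both depend only on how the prediction scores of the two classes are ordered.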

Figure 1.


Two strategies and their associated algorithms implemented for handling multi-omics datasets in this study. Single concatenated omics (left) combines multiple datasets into a single large dataset, while integrative omics (right) addresses the correlated nature of different data types. In the single concatenated omics approach, sparse partial least squares discriminant analysis (sPLS-DA) maximizes covariance only between latent components of a dataset and the dichotomized CTC counts. In the integrative omics approach, the multi-block version of sPLS-DA, referred to as Data Integration Analysis for Biomarker discovery using Latent cOmponents (DIABLO), maximizes the sum of every pairwise covariance between two components, to identify key features by borrowing discriminatory strength from complementary information across different datasets. Note that the penalty terms (for both sPLS-DA and DIABLO) and the design matrix (for DIABLO) in the objective function were omitted for simplicity.
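For reference, the objective with the penalty terms and design matrix restored can be written in the usual sparse generalized canonical correlation form. The notation here is an assumption and does not come from the figure: $X^{(q)}$ denotes the $q$-th block (with the outcome dummy matrix counted as one of the $Q$ blocks), $a^{(q)}$ its loading vector, $c_{ij}$ the entries of the design matrix, and $\lambda^{(q)}$ the LASSO penalty parameter of block $q$.

```latex
\max_{a^{(1)},\dots,a^{(Q)}}
  \sum_{\substack{i,j=1 \\ i \neq j}}^{Q}
  c_{ij}\,\operatorname{cov}\!\left(X^{(i)} a^{(i)},\; X^{(j)} a^{(j)}\right)
\quad \text{subject to} \quad
  \lVert a^{(q)} \rVert_2 = 1
  \;\text{ and }\;
  \lVert a^{(q)} \rVert_1 \le \lambda^{(q)},
  \qquad q = 1,\dots,Q .
```

Setting $c_{ij}=0$ for every pair of feature blocks recovers the null design, in which only the covariances with the outcome block contribute to the sum.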

Results

Table 3 summarizes 20×5-fold CV AUCs for the single-block sPLS-DA model and the DIABLO models derived with three different design matrices for different input feature blocks. For predicting pre-SBRT CTCs, the performance of the DIABLO model was highest with combined pre-SBRT PET radiomics and clinical feature blocks, achieving a statistically significant CV AUC of 0.875 (p=0.009) with the full weighted design matrix. For predicting post-SBRT CTCs, the highest performance of the DIABLO model was achieved with combined post-SBRT CT and delta CT radiomics feature blocks, showing a statistically significant CV AUC of 0.883 (p=0.001) with the null design matrix. Average ROC curves for 20×5-fold CV comparing full, full weighted and null designs in DIABLO for the best input feature blocks for predicting pre-SBRT and post-SBRT CTCs are shown in Figure S2. Overall, the best design matrix in DIABLO varied according to the type of input feature blocks. The CV AUCs of all DIABLO models were much higher than those of their corresponding single concatenated block sPLS-DA models. The CV AUCs of all single-block sPLS-DA models were less than 0.7.

Table 3.

Performances of the single concatenated block sPLS-DA model and the DIABLO models derived with three different (full, full weighted and null) design matrices for predicting pre-SBRT and post-SBRT CTCs, as evaluated with every possible pairwise combination of different input blocks.

CTC Radiomics dataset Clinical data 20×5-fold CV AUC (p-value in the Wilcoxon test)


CT PET sPLS-DA DIABLO


Pre Post Delta Pre Post Delta Full Full weighted Null
Pre V V 0.368 (0.268) 0.841 (0.018) 0.737 (0.130) 0.640 (0.346)
V V 0.325 (0.182) 0.657 (0.267) 0.653 (0.274) 0.653 (0.274)
V V 0.434 (0.532) 0.848 (0.017) 0.875 (0.009) 0.668 (0.274)

V 0.336 (0.192) N/A
V 0.434 (0.532)
V 0.485 (0.574)

Post V V 0.413 (0.419) 0.833 (0.003) 0.636 (0.232) 0.723 (0.056)
V V 0.439 (0.493) 0.633 (0.253) 0.652 (0.214) 0.647 (0.230)
V V 0.431 (0.455) 0.655 (0.159) 0.655 (0.155) 0.655 (0.162)
V V 0.449 (0.534) 0.601 (0.393) 0.609 (0.369) 0.603 (0.377)
V V 0.444 (0.500) 0.773 (0.018) 0.604 (0.330) 0.592 (0.392)
V V 0.457 (0.518) 0.749 (0.035) 0.748 (0.039) 0.659 (0.177)
V V 0.420 (0.429) 0.855 (0.003) 0.864 (0.003) 0.883 (0.001)
V V 0.452 (0.521) 0.788 (0.015) 0.689 (0.108) 0.803 (0.018)
V V 0.452 (0.530) 0.794 (0.018) 0.794 (0.019) 0.794 (0.019)
V V 0.450 (0.496) 0.646 (0.221) 0.828 (0.008) 0.783 (0.020)
V V 0.529 (0.514) 0.653 (0.195) 0.640 (0.220) 0.846 (0.002)
V V 0.433 (0.501) 0.799 (0.011) 0.708 (0.086) 0.653 (0.187)
V V 0.444 (0.475) 0.800 (0.019) 0.645 (0.308) 0.621 (0.331)
V V 0.439 (0.491) 0.608 (0.339) 0.616 (0.323) 0.699 (0.102)
V V 0.445 (0.531) 0.668 (0.196) 0.745 (0.039) 0.840 (0.003)
V V 0.425 (0.446) 0.703 (0.112) 0.650 (0.200) 0.766 (0.032)
V V 0.407 (0.407) 0.602 (0.375) 0.603 (0.373) 0.605 (0.351)
V V 0.527 (0.601) 0.780 (0.011) 0.670 (0.132) 0.679 (0.116)
V V 0.431 (0.397) 0.779 (0.026) 0.783 (0.021) 0.623 (0.312)
V V 0.489 (0.674) 0.810 (0.008) 0.672 (0.173) 0.694 (0.123)
V V 0.479 (0.672) 0.692 (0.152) 0.692 (0.152) 0.692 (0.152)

V 0.452 (0.528) N/A
V 0.538 (0.481)
V 0.523 (0.557)
V 0.478 (0.498)
V 0.445 (0.502)
V 0.415 (0.443)
V 0.671 (0.062)

Note – Shaded cells indicate the best performance input feature blocks for pre-SBRT and post-SBRT CTCs. Bold numbers show the corresponding CV AUCs of DIABLO and p-values in the Wilcoxon test of which the significance level is adjusted by Bonferroni method for correction of multiple comparisons with different inputs (p<0.05/3≅0.017 for pre-SBRT CTCs, while p<0.05/21≅0.002 for post-SBRT CTCs).
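The Bonferroni thresholds in the note follow from the number of pairwise block combinations, assuming (as the counts of 3 and 21 suggest) that one test was run per pair of input blocks. The block labels below are shorthand introduced for this sketch, and the two p-values are those reported for the best-performing models.

```python
from itertools import combinations

pre_blocks = ["pre CT", "pre PET", "clinical"]
post_blocks = [
    "pre CT", "post CT", "delta CT",
    "pre PET", "post PET", "delta PET",
    "clinical",
]


def bonferroni_threshold(blocks, alpha=0.05):
    """Return (number of block pairs, alpha divided by that number)."""
    n_tests = len(list(combinations(blocks, 2)))
    return n_tests, alpha / n_tests


n_pre, threshold_pre = bonferroni_threshold(pre_blocks)
n_post, threshold_post = bonferroni_threshold(post_blocks)
print(n_pre, round(threshold_pre, 4), n_post, round(threshold_post, 4))

# Reported p-values of the best DIABLO models against these thresholds.
print(0.009 < threshold_pre, 0.001 < threshold_post)
```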

Table 4 summarizes the features selected in the DIABLO model derived with the best design matrix for the best input feature blocks for predicting pre-SBRT and post-SBRT CTCs. The DIABLO model for predicting pre-SBRT CTCs selected a neighborhood gray tone difference matrix (NGTDM) contrast calculated on the pre-SBRT LLH frequency band wavelet-filtered PET SUV image (denoted as wavelet.LLH_ngtdm_Contrast) and age from the clinical parameters. For predicting post-SBRT CTCs with the best input feature blocks (i.e., post-SBRT CT and delta CT radiomics features), the DIABLO model selected a gray level size zone matrix (GLSZM) large area low gray level emphasis (LALGLE) calculated on the post-SBRT LoG-filtered CT image with σ=5 mm (denoted as log.sigma.5.0.mm.3D_glszm_LALGLE), a delta GLSZM LALGLE calculated between the pre-SBRT and post-SBRT LLL frequency band wavelet-filtered CT GLSZM LALGLE (denoted as wavelet.LLL_glszm_LALGLE), and a delta gray level run length matrix (GLRLM) low gray level run emphasis (LGLRE) calculated between the pre-SBRT and post-SBRT original CT GLRLM LGLRE (denoted as original_glrlm_LGLRE).

Table 4.

Selected features from DIABLO with the best design matrix for the best performance input blocks for predicting pre-SBRT and post-SBRT CTCs.

CTC | Input feature blocks | Selected features (left block) | Selected features (right block)
Pre | Pre-SBRT PET radiomics (left) and clinical data (right) | wavelet.LLH_ngtdm_Contrast | Age
Post | Post-SBRT CT radiomics (left) and delta CT radiomics (right) | log.sigma.5.0.mm.3D_glszm_LALGLE | wavelet.LLL_glszm_LALGLE, original_glrlm_LGLRE

Note – LLH=low-low-high, LLL=low-low-low, log=Laplacian of Gaussian, ngtdm=neighborhood gray tone difference matrix, glszm=gray level size zone matrix, glrlm=gray level run length matrix, LALGLE=large area low gray level emphasis, and LGLRE=low gray level run emphasis.

Figure 2 shows box plots that compare the values of the selected features in the DIABLO model between CTC-positive and CTC-negative groups. Based on the median, both the pre-SBRT PET wavelet.LLH_ngtdm_Contrast and age were lower in the pre-SBRT CTC-positive group. The post-SBRT CT log.sigma.5.0.mm.3D_glszm_LALGLE was slightly higher in the post-SBRT CTC-positive group, while delta CT wavelet.LLL_glszm_LALGLE and delta CT original_glrlm_LGLRE were both slightly lower in that group, although the medians were very similar between the post-SBRT CTC-positive and CTC-negative groups. Interquartile ranges of the pre-SBRT PET wavelet.LLH_ngtdm_Contrast and age were both smaller in the pre-SBRT CTC-positive group, while those of the post-SBRT CT log.sigma.5.0.mm.3D_glszm_LALGLE, delta CT wavelet.LLL_glszm_LALGLE and delta CT original_glrlm_LGLRE were all larger in the post-SBRT CTC-positive group.

Figure 2.


Box plots comparing the values of selected features in the DIABLO model between CTC-positive and CTC-negative groups. Plots (a) pre-SBRT PET wavelet.LLH_ngtdm_Contrast and (b) age show selected features in the best DIABLO model for predicting pre-SBRT CTCs, derived with pre-SBRT PET radiomics and clinical parameter input blocks. Plots (c) post-SBRT CT log.sigma.5.0.mm.3D_glszm_LALGLE, (d) delta CT wavelet.LLL_glszm_LALGLE and (e) delta CT original_glrlm_LGLRE show selected features in the best DIABLO model for predicting post-SBRT CTCs, derived with post-SBRT CT and delta CT radiomics input blocks. LLH=low-low-high, LLL=low-low-low, log=Laplacian of Gaussian, ngtdm=neighborhood gray tone difference matrix, glszm=gray level size zone matrix, glrlm=gray level run length matrix, LALGLE=large area low gray level emphasis, and LGLRE=low gray level run emphasis.

Discussion

Radiomics is an emerging field that translates medical images into mineable high dimensional data by extracting mathematically quantitative imaging features. Radiomic approaches may allow noninvasive assessment of both molecular and clinical characteristics of tumors, and thus have the potential to contribute to clinical decision-making by systematically analyzing standard-of-care medical images (26). Because of the potential for “personalizing” medicine, an increasing number of studies have investigated associations between radiomics and tumor biology in different cancer types (8, 27, 28), more recently in lung cancer. In the context of lung cancer, radiomic features have been suggested to predict clinical endpoints such as survival and treatment response in ES-NSCLC patients treated with SBRT. For example, Huynh et al. have demonstrated that pretreatment CT radiomics features have the potential to be prognostic for distant metastasis that conventional imaging metrics (tumor volume and diameter) could not predict in ES-NSCLC patients treated with SBRT (29). Lafata et al. have found that pretreatment CT radiomics features may be more associated with local failure than non-local failure following SBRT for stage I NSCLC (30). Li et al. have shown that recurrence following SBRT treatment could be prognosticated prior to treatment from imaging features derived from planning CT (31). They have also reported that disease progression could be prognosticated as early as 3 months after SBRT using imaging features derived from the first follow-up CT in ES-NSCLC patients (32). Yu et al. have shown that radiomics signatures could predict several clinical outcomes and allowed pretreatment risk stratification in stage I NSCLC undergoing either surgery or SBRT (33).

Our work is among the first to apply an integrative multi-radiomics approach to the imaging of patients with ES-NSCLC, which is accompanied by unique challenges. Our dataset is exclusively composed of single discrete lesions without lymph node involvement, most of which are in older patients with co-morbidities who are thus not operative candidates. Many of the patients do not have tumor tissue available, due to the inaccessibility of the lesions and/or the high risk of invasive procedures; however, CTC analysis in these patients is unimpeded because of the low risk associated with peripheral blood sampling, and thus the CTCs supply biological confirmation of tumor and an independent indication of its aggressiveness. These features made the imaging dataset well suited for integration with CTC analysis.

Tumor growth depends on the constant interactions that take place between tumor cells and the tumor microenvironment. The unique ability of imaging to examine a tumor as a whole enables intratumoral heterogeneity to be measured by high-throughput feature extraction of quantitative information from images. The spatial heterogeneity of tumors has been recognized as a major cause of treatment failure and a modulator of intrinsic tumor aggressiveness (34–36). Our results suggest that radiomics could be used as a surrogate for extrapolating CTC levels. CTC count is associated with prognosis and response to therapy (7, 37), and all these are likely influenced by tumor heterogeneity. Radiomics may thus serve as a virtual biopsy for tumor heterogeneity that would provide complementary information to the liquid or conventional biopsy. Medical image analysis allows tumor monitoring across time, with images being routinely obtained throughout the course of treatment. Therefore, radiomic biomarkers associated with CTCs may constitute an essential component in patient stratification and serve as an alarm signal in a non-invasive manner in patients with a high likelihood of recurrence and metastasis.

One of the principal challenges in radiomics studies is the optimal integration of diverse multimodal data sources in a quantitative manner that delivers unambiguous clinical predictions and enables accurate and robust outcome prediction (38). PET and CT capture different properties (metabolic vs anatomic) of a tumor. Previous single concatenated PET/CT radiomics studies (39, 40) do not focus on incorporating common information between PET and CT into their classification algorithms, and thus the derived discriminatory biomarkers may not be phenotypes that are mechanistically coupled between the metabolic behaviors and anatomic structures of tumors. Our integrative multi-radiomics approach using DIABLO may address this concern because it not only enables the identification of discriminatory radiomics signatures from each radiomics dataset, but also provides a plausible way to identify functional-anatomic correlates between them. Regarding the connection between different feature blocks, we evaluated three different scenarios by using a full design, a full weighted design and a null design. Our results showed that the performance of DIABLO with the full design or the full weighted design tended to be better than DIABLO with the null design for predicting CTCs when combined CT and PET radiomics feature blocks were used as input (see Table 3). For example, DIABLO with the full design for combining pre-SBRT CT and PET radiomics features as well as DIABLO with the full weighted design for combining post-SBRT CT and delta PET radiomics features yielded better performance than DIABLO with the null design for predicting pre-SBRT and post-SBRT CTCs, respectively. 
This may imply that, when the best pair of CT and PET radiomics feature blocks is used as input, building a predictive model that takes into account their correlated information in the classification task would improve the performance for predicting CTCs as compared to constructing a predictive model on each individual feature block before combining the model predictions. By contrast, DIABLO with the null design for combining post-SBRT CT and delta CT radiomics features attained the best performance for predicting post-SBRT CTCs among all models investigated. It is also notable that the DIABLO models with the full weighted design and the full design for the same input achieved the second and third best performances (CV AUC=0.864 and 0.855). This may imply that correlated information between the best pair of different CT radiomics feature blocks would be less necessary for the prediction of post-SBRT CTCs with DIABLO. The American Society of Clinical Oncology recommends that PET and PET/CT scans not be used to watch for a cancer recurrence in patients with no symptoms of recurrence who have finished treatment that was intended to eliminate the cancer. In addition, the National Comprehensive Cancer Network (NCCN) clinical practice guidelines (Treatment of Cancer by Site) and NCCN Imaging Appropriate Use Criteria for various malignancies often note that PET scans are not recommended in asymptomatic individuals. If surveillance imaging is recommended, it does not include PET scans, but CT, MRI, or other forms of imaging. Also, PET and especially PET/CT scans expose a patient to high levels of radiation. Therefore, it is noteworthy that the best performance of DIABLO for predicting post-SBRT CTCs can be achieved using the CT-only radiomics features, since CT imaging is the optimal imaging modality in cancer post-treatment surveillance (41). 
In contrast, the best performance of DIABLO for predicting pre-SBRT CTCs was achieved with the full weighted design when pre-SBRT PET radiomics and clinical features were used as input. This may imply that baseline pre-SBRT PET is still recommended, and clinical characteristics of patients need to be considered to accurately predict pre-SBRT CTCs.
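
The three connection scenarios can be pictured as block-by-block design matrices. The sketch below is only illustrative: it assumes that a design is a square matrix with zeros on the diagonal, that the full design links every pair of blocks with weight 1, that the null design uses weight 0, and that a weighted design uses an intermediate value. The block labels and the weight of 0.5 are hypothetical and are not taken from this study.

```python
def design_matrix(block_names, weight):
    """Square block-by-block matrix: `weight` off the diagonal, 0 on it."""
    n = len(block_names)
    return [[0.0 if i == j else float(weight) for j in range(n)]
            for i in range(n)]

# Hypothetical labels for two input feature blocks.
blocks = ["post_SBRT_CT", "delta_CT"]

full_design = design_matrix(blocks, 1.0)      # blocks fully connected
null_design = design_matrix(blocks, 0.0)      # blocks modeled independently
weighted_design = design_matrix(blocks, 0.5)  # 0.5 is an illustrative weight only

for name, design in [("full", full_design),
                     ("weighted", weighted_design),
                     ("null", null_design)]:
    print(name, design)
```

With the null design each block contributes its own discriminant component, and only the predictions are combined afterwards, which corresponds to the comparison drawn in the text.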

Our results also demonstrated that DIABLO significantly outperformed the single-block sPLS-DA. The sPLS-DA for the single concatenated blocks might be more seriously affected than DIABLO by the large number of features relative to the number of samples, as well as by the absence of any modeled relationship between different feature blocks in the classification framework. Because collinearity among arbitrary feature subsets and high dimensionality deteriorate the performance of LASSO (42), they might have a detrimental effect on the performance of sPLS-DA, which relies on LASSO penalization of the loading vectors to perform feature selection. In addition, the fact that the sPLS-DA for each lower dimensional single block did not improve the prediction performance may reflect the need for the integrated multi-block data analysis by which significant performance improvement was achieved with DIABLO. DIABLO might lessen the need for large datasets because the LASSO penalty applies to each block of the combined multi-block data when DIABLO incorporates the latent relation between different blocks into the classification framework.

It seems notable that wavelet-filtered or LoG-filtered image features were identified as key PET/CT radiomics features in the multi-block discriminant analysis with DIABLO to correlate with the results of the CTC analyses, which indicates that wavelet-filtered or LoG-filtered PET/CT images may effectively serve as a measure of tumor image heterogeneity. A 3D discrete wavelet transformation decomposes original images into one approximation and seven detailed images, which are mutually orthogonal sets of wavelets, representing high frequency (heterogeneity) and low frequency (homogeneity) contents of the images (43). For example, the wavelet LLH and LHL images emphasize the internal homogeneity, while the wavelet HLH image emphasizes the internal heterogeneity in 2 out of 3 dimensions. Particularly, the wavelet LLH filtered CT image has played an important role in the prediction of distant metastasis in lung cancer (29, 44), and the wavelet HLH filtered PET SUV image in the prediction of progression-free survival in malignant pleural mesothelioma patients (45). In our study, the wavelet LLH filtered PET SUV image and the wavelet LLL filtered CT image contributed to making better predictions of pre- and post-SBRT CTCs, respectively. On the other hand, a LoG filter involves applying a Gaussian filter to an image to remove random noise while a Laplacian filter is employed to enhance edges on the image. Especially, the LoG-filtered CT image has yielded high performance prognostic features for distant metastasis and survival in patients with lung adenocarcinoma (44). 
This seems to be in accord with our results (see Table 4), where the post-SBRT LoG-filtered CT image obtained at a coarse level yielded a feature that contributed to the best performance DIABLO model for predicting post-SBRT CTCs, suggesting that the edges within tumor found by the LoG filter using a wider Gaussian with σ=5 mm might be more informative than edges produced using a narrower Gaussian with σ=1 or 3 mm. This may be because edges found using a narrow Gaussian are more susceptible to noise in the input image. Using the LoG filter with a proper Gaussian kernel size is associated with image noise reduction, and can improve the utility of the heterogeneity measures (46). Considering that the best performance DIABLO model for predicting post-SBRT CTCs included radiomics features from the wavelet LLL filtered CT image as well as the coarse LoG-filtered CT image, low frequency contents of the CT image might play a more important role than higher frequency components for measuring tumor heterogeneity and improving prediction performance.
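
The link between Gaussian width and noise sensitivity can be illustrated with a small one-dimensional sketch. It is not the filter implementation used in this study: it assumes a sampled second derivative of a Gaussian as a 1D stand-in for the 3D LoG kernel, with sigma expressed in sample units rather than millimeters, and it computes the standard deviation of the filter output for unit-variance white noise (the square root of the sum of squared kernel weights).

```python
import math

def log_kernel_1d(sigma, radius=None):
    """Sampled second derivative of a Gaussian, shifted so it sums to zero."""
    if radius is None:
        radius = int(math.ceil(4 * sigma))  # truncate at about 4 sigma
    norm = math.sqrt(2 * math.pi) * sigma
    kernel = [((x * x - sigma ** 2) / sigma ** 4)
              * math.exp(-x * x / (2 * sigma ** 2)) / norm
              for x in range(-radius, radius + 1)]
    mean = sum(kernel) / len(kernel)
    # Zero-sum kernel: a constant (edge-free) signal gives zero response.
    return [v - mean for v in kernel]

def noise_gain(kernel):
    """Output standard deviation for unit-variance, uncorrelated input noise."""
    return math.sqrt(sum(v * v for v in kernel))

gains = {sigma: noise_gain(log_kernel_1d(sigma)) for sigma in (1.0, 3.0, 5.0)}
for sigma, gain in gains.items():
    print(f"sigma={sigma}: noise gain={gain:.4f}")
```

In this toy setting the noise gain falls steeply as sigma grows, which is consistent with the suggestion that edges found with a narrow Gaussian are more susceptible to noise in the input image.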

It is also remarkable that the best performance DIABLO models for predicting pre- and post-SBRT CTCs selected higher order local texture features (see Table 4), which are calculated from several texture matrices (i.e., NGTDM, GLSZM and GLRLM) computed based on interrelationships of 3 or more voxels (47). The NGTDM quantifies the sum of differences between the gray level of a voxel and the mean gray level of its neighboring voxels within a predefined distance and is thought to closely resemble the human experience of the image (48). For example, the NGTDM contrast relates to the dynamic range of intensity levels in an image and the level of local intensity variation, which is high when both the dynamic range and the spatial change rate are high, i.e., an image with a large range of gray levels, with large changes between voxels and their neighborhood. The GLSZM counts the number of groups (or zones) of linked voxels (49). Voxels are linked if the neighboring voxel has an identical discretized grey level. Whether a voxel classifies as a neighbor depends on its connectedness (e.g., 26-connectedness in 3D). A more homogeneous texture will result in a wider and flatter matrix. For example, the GLSZM LALGLE measures the proportion in the image of the joint distribution of larger size zones with lower gray-level values. The GLRLM assesses the distribution of discretized grey levels in terms of run lengths (50). A run length is defined as the length of a consecutive sequence of voxels with the same gray level along a fixed image direction. By default, the value of a feature is calculated on the GLRLM for each angle separately, after which the mean of these values is returned. For example, GLRLM LGLRE measures the distribution of low gray-level values, with a higher value indicating a greater concentration of low gray-level values in the image. 
This indicates that higher-order local statistical properties of texture within tumor may play an important role in quantifying intratumoral heterogeneity and improving performance for predicting CTCs. Our results also showed that the best performance DIABLO model for predicting pre-SBRT CTCs selected age from clinical parameters. So far, no convincing evidence has indicated that age is associated with CTC levels in lung cancer. Although further studies are needed to evaluate the external validity of our finding, we were able to identify clinical-radiomic correlates associated with CTC levels by using DIABLO.
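
As a concrete illustration of one of these features, the sketch below computes LGLRE on a toy 2D image. It makes simplifying assumptions that differ from the study pipeline: runs are counted along the row direction only (no averaging over angles), gray levels are already discretized to integers starting at 1, and LGLRE is taken as the sum over runs of 1/(gray level)^2 divided by the number of runs. The two toy images are hypothetical.

```python
from collections import Counter

def horizontal_runs(image):
    """Count (gray_level, run_length) pairs along rows (one direction only)."""
    runs = Counter()
    for row in image:
        start = 0
        for idx in range(1, len(row) + 1):
            # Close the current run at the row end or when the level changes.
            if idx == len(row) or row[idx] != row[start]:
                runs[(row[start], idx - start)] += 1
                start = idx
    return runs

def lglre(runs):
    """Low gray level run emphasis: mean of 1 / level^2 over all runs."""
    n_runs = sum(runs.values())
    return sum(count / level ** 2 for (level, _), count in runs.items()) / n_runs

# Hypothetical discretized images: one dominated by low levels, one by high.
dark = [[1, 1, 1, 2],
        [1, 1, 2, 2],
        [1, 1, 1, 1]]
bright = [[4, 4, 4, 3],
          [4, 4, 3, 3],
          [4, 4, 4, 4]]

lglre_dark = lglre(horizontal_runs(dark))
lglre_bright = lglre(horizontal_runs(bright))
print(lglre_dark, lglre_bright)
```

The image dominated by low gray levels yields the larger LGLRE, in line with the interpretation that a higher value indicates a greater concentration of low gray-level values.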

Our study followed the recommendations of a previous study in order to avoid the vulnerabilities of the radiomics models (51): 1) The open-source software PyRadiomics was used to extract shape and local texture features. An additional solution to extract global texture features using 3D moments was provided with the source code available at https://github.com/lsho76/GlobalTextureFeatures, which can easily be combined with the PyRadiomics platform. In addition, another open-source tool, the mixOmics R package, was used to integrate different types of datasets with feature selection by using single-block sPLS-DA or DIABLO, while discriminating CTC levels. Example R code for the complete model-building and CV pipeline of the single-block sPLS-DA and DIABLO was provided at https://github.com/lsho76/mixOmicsExamples, so that independent researchers can replicate the exact process by which our results were generated and rapidly leverage the code when they attempt to validate the results in their own cohorts (52). 2) Multi-view data analysis using clinical and radiomics features was performed to identify optimal input feature blocks based on the DIABLO that could lead to the best predictive performance. We demonstrated that an integrative approach using radiomics features in DIABLO significantly improves the performance for predicting CTCs as compared to that of the single-block sPLS-DA for the clinical features alone. 3) The DIABLO method builds on the sGCCA, which integrates multiple datasets by finding principal components (latent variables) that maximize the covariance of scores between different datasets and the categorical outcome of interest. The resulting loading vectors are then constrained to give discriminants that correlate between these datasets. A more detailed explanation of how the DIABLO method deals with multicollinearity of features, in comparison with the single-block sPLS-DA, is given below in the discussion of multicollinearity. 
4) Repeated CV was conducted to tune the optimal number of components and the number of features to select on each component for each block in DIABLO as well as to assess the combined AUC performance (averaged across different blocks) of the DIABLO model on each component. 5) Radiomics and clinical features were all standardized to zero means and unit variances to normalize data. 6) GTV contours were manually drawn on only CT images by radiation oncologists, and then, propagated to 18F-FDG-PET images.
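
For readers less familiar with AUC averaging, the sketch below computes a rank-based AUC per block and then takes the simple mean across blocks. This is one plausible reading of "averaged across different blocks" and is an assumption; the exact way mixOmics combines block-level predictions is not restated here. The scores, labels and block names are made-up toy values.

```python
def auc(scores, labels):
    """Rank-based AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half a win, matching the Mann-Whitney U statistic.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute an AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy held-out prediction scores from two hypothetical feature blocks.
labels = [1, 1, 1, 0, 0, 0]
block_scores = {
    "block_A": [0.9, 0.8, 0.4, 0.5, 0.3, 0.2],
    "block_B": [0.7, 0.6, 0.65, 0.6, 0.2, 0.1],
}

per_block_auc = {name: auc(s, labels) for name, s in block_scores.items()}
combined_auc = sum(per_block_auc.values()) / len(per_block_auc)
print(per_block_auc, combined_auc)
```

Because the AUC is a rescaled Mann-Whitney U statistic, it is closely related to the Wilcoxon rank sum test used to compare CV prediction scores between classes.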

Multicollinearity of features should be investigated during model development with a training dataset (51). As high throughput data are characterized by thousands of radiomics features and a small number of samples, they often imply a high degree of multicollinearity, and, in turn, lead to severely ill-conditioned problems. Another challenge is that combining different types of data increases the number of analyzed features while keeping the number of samples constant. In a supervised classification framework, one solution is to reduce the dimensionality of the data either by conducting feature selection, or by introducing artificial variables through a linear combination of original features that summarize most of the information, essentially as PCA does. Considering that even individual radiomics datasets can be very high dimensional, their integration may be difficult without severe dimensionality reduction or regularization, which boils down to choosing the most informative features while removing the noisy ones. It should be noted that univariate feature selection does not account for multicollinearity between features and generalizes poorly. A multivariate feature selection method such as LASSO (53) may be preferable for working with high dimensional data. LASSO helps to increase model interpretability by eliminating irrelevant features that are not associated with the response variable, which also reduces overfitting. However, LASSO cannot fully handle multicollinearity among predictors. For example, if two features are highly correlated, LASSO will choose only one of them by chance and set the coefficient of the other one to zero. This kind of selection could be problematic if the feature that was dropped has more relevant biophysical meaning than the one that was selected by LASSO. 
The collinearity and high dimensionality add difficulty to the problem of feature selection and deteriorate the performance of the LASSO (54), as demonstrated by the poor performance of the single-block sPLS-DA in this study. Here, we used DIABLO to handle common problems in integrative analysis, such as heterogeneity between different data types, a number of samples much smaller than the number of features to assess, multicollinearity, and the need for sparseness to facilitate the interpretation of the results. This method is based on multi-block sPLS-DA that performs feature selection and classification in a single step procedure (22), where sPLS conducts simultaneous feature selection in different datasets by applying a LASSO (L1) penalization on the PLS loading vectors associated with the datasets when computing the singular value decomposition (55). Similar to the single-block sPLS-DA, the L1 penalization improves the interpretability of the component scores, which are defined only on the subset of features with non-zero coefficients in each dataset. However, contrary to the single-block sPLS-DA, DIABLO identifies a signature composed of highly correlated features across different types of omics, by modeling the relationship between the omics datasets. Therefore, DIABLO may yield improved biological insights of multi-omics signatures, as it allows the identification of hidden mechanisms that were missed when analyzing single omics data types individually.
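
The effect of an L1 penalty on a loading vector can be sketched with the soft-thresholding operator, which is the standard closed-form solution of an L1-penalized least-squares update. The loading values and the penalty of 0.1 below are hypothetical; in practice the penalty may be set indirectly through the number of features to keep per component, and this sketch does not reproduce the full iterative sPLS algorithm.

```python
import math

def soft_threshold(vector, penalty):
    """Shrink each entry toward zero by `penalty`; small entries become 0."""
    return [math.copysign(abs(v) - penalty, v) if abs(v) > penalty else 0.0
            for v in vector]

def unit_norm(vector):
    """Rescale to unit Euclidean length (unchanged if all entries are zero)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

# Hypothetical dense loading vector over five features.
loadings = [0.62, -0.05, 0.48, 0.03, -0.55]

sparse = soft_threshold(loadings, penalty=0.1)
selected = [i for i, v in enumerate(sparse) if v != 0.0]
sparse_unit = unit_norm(sparse)

print(sparse)    # entries with |value| <= 0.1 are set to zero
print(selected)  # indices of the features that remain in the signature
```

Only the features with non-zero loadings contribute to the component scores, which is why the penalized components are easier to interpret.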

There are some limitations in this study. The number of patients investigated in this study is small, although it serves as a proof-of-concept comparative and hypothesis-generating study. K-fold CV is a common solution when available datasets are limited. However, it can suffer from bias or variance depending on the size and number of splits (56). Because only subsets of the datasets are used to fit the model, the validation set error rate can be highly variable when we use this method. Unfortunately, there exists no universal (valid under all distributions) unbiased estimator of the variance of k-fold CV (57), but repeating k-fold CV can increase the precision of the estimates while still maintaining a small bias (58, 59). Though our preliminary results are encouraging, a large patient cohort analysis is necessary to further validate the clinical applicability of the proposed multi-block data integration approach using DIABLO in ES-NSCLC. We would also like to point out that the multi-radiomics signatures selected from PET/CT radiomics may have a limited value in unfolding the biological mechanism or underlying pathophysiology of tumor, although it may give us new insight about the anatomic-metabolic correlates to discriminate CTC levels.
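
The repeated k-fold scheme discussed above can be sketched as follows. The sketch assumes stratified folds (each class spread evenly over the folds), which is an assumption rather than a detail stated in this section, and the class split of 20 versus 36 is a hypothetical example for a cohort of 56 patients, not the observed CTC distribution.

```python
import random

def repeated_stratified_kfold(labels, n_splits=5, n_repeats=20, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) with classes spread over folds."""
    rng = random.Random(seed)
    for rep in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for cls in sorted(set(labels)):
            members = [i for i, y in enumerate(labels) if y == cls]
            rng.shuffle(members)  # a new random partition in every repeat
            for pos, i in enumerate(members):
                folds[pos % n_splits].append(i)
        for k in range(n_splits):
            test_idx = sorted(folds[k])
            train_idx = sorted(i for j, fold in enumerate(folds)
                               if j != k for i in fold)
            yield rep, k, train_idx, test_idx

# Hypothetical labels: 20 CTC-positive and 36 CTC-negative patients.
labels = [1] * 20 + [0] * 36
splits = list(repeated_stratified_kfold(labels))
print(len(splits))  # 20 repeats x 5 folds = 100 train/test splits
```

Averaging a performance metric over all 100 splits reduces the dependence of the estimate on any single random partition, at the cost of additional computation.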

Conclusions

To summarize, our results demonstrated the potential usefulness of 18F-FDG-PET/CT radiomics approach in the context of intratumoral heterogeneities to predict CTCs in patients with ES-NSCLC. Multi-block data integration with discriminant analysis of 18F-FDG-PET/CT radiomics has the potential for predicting pre-SBRT and post-SBRT CTCs. Incorporating latent relations between different feature types into the classification framework may improve the discriminatory power of CTC levels. Clinical-radiomic correlates derived by addressing common information between pre-SBRT PET radiomics and clinical features may provide surrogate biomarkers for pre-SBRT CTCs. Post-SBRT CT and delta CT radiomics signatures may serve as surrogate biomarkers for post-SBRT CTCs. This integrative framework may open a promising avenue in the multi-block radiomics study where we may need to give a special consideration to the relationship between different feature categories. A large-scale study that integrates features obtained across imaging, genomic, molecular, and clinical modalities may illuminate more clearly the underlying tumor biology of radiomics features and provide valuable insights into pathophysiology and personalized disease management.

Supplementary Material

Figure S1a: Overview of workflow for multi-block feature integration with DIABLO.

Figure S1b: Examples of ROC curves averaged across two different datasets and 20×5-fold CV on each of 3 components in DIABLO.

Figure S2a: Average ROC curves for 20×5-fold CV comparing full, full weighted and null designs in DIABLO for the best input feature blocks for predicting pre-SBRT CTCs.

Figure S2b: Average ROC curves for 20×5-fold CV comparing full, full weighted and null designs in DIABLO for the best input feature blocks for predicting post-SBRT CTCs.

Supplementary Document 1

Acknowledgments:

A portion of these results were presented and selected as an award winner at the Science Council Session of the 2020 American Association of Physicists in Medicine (AAPM) annual meeting.

Funding statement:

This project was supported by grants R01CA201071 (to GDK), U24CA180803 (IROC) and U10CA180868 (NRG), all from the National Cancer Institute.

Footnotes

Data sharing statement:

Research data are not available at this time.

Authors responsible for data integrity:

MAF and GDK had full access to all the data in the study and take responsibility for the integrity of the data.

Conflict of interest:

GDK and JFD are co-founders and have equity in Liquid Biotech USA, Inc., a University of Pennsylvania PCI-developed company through the UPStart program.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

  • 1.Dorsey JF, Kao GD, MacArthur KM, Ju M, Steinmetz D, Wileyto EP, et al. Tracking viable circulating tumor cells (CTCs) in the peripheral blood of non-small cell lung cancer (NSCLC) patients undergoing definitive radiation therapy: Pilot study results. Cancer. 2015. January 1;121(1):139–49.
  • 2.Tanaka F, Yoneda K, Kondo N, Hashimoto M, Takuwa T, Matsumoto S, et al. Circulating tumor cell as a diagnostic marker in primary lung cancer. Clin Cancer Res. 2009. November 15;15(22):6980–6.
  • 3.Fiorelli A, Accardo M, Carelli E, Angioletti D, Santini M, Di Domenico M. Circulating tumor cells in diagnosing lung cancer: Clinical and morphologic analysis. Ann Thorac Surg. 2015. June;99(6):1899–905.
  • 4.Shah JL, Loo BW Jr., Stereotactic ablative radiotherapy for early-stage lung cancer. Semin Radiat Oncol. 2017. July;27(3):218–28.
  • 5.Bradley JD, El Naqa I, Drzymala RE, Trovo M, Jones G, Denning MD. Stereotactic body radiation therapy for early-stage non-small-cell lung cancer: The pattern of failure is distant. Int J Radiat Oncol Biol Phys. 2010. July 15;77(4):1146–50.
  • 6.Stephans KL, Woody NM, Reddy CA, Varley M, Magnelli A, Zhuang T, et al. Tumor control and toxicity for common stereotactic body radiation therapy dose-fractionation regimens in stage I non-small cell lung cancer. Int J Radiat Oncol Biol Phys. 2018. February 1;100(2):462–9.
  • 7.Frick MA, Feigenberg SJ, Jean-Baptiste SR, et al. Circulating tumor cells are associated with recurrent disease in patients with early-stage non-small cell lung cancer treated with stereotactic body radiotherapy. Clin Cancer Res 2020;26:2372–2380.
  • 8.Aerts HJ, Velazquez ER, Leijenaar RT, Parmar C, Grossmann P, Carvalho S, et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun. 2014. June 3;5:4006.
  • 9.Frick MA, Kao GD, Aguarin L, Chinniah C, Swisher-McClure S, Berman AT, et al. Circulating tumor cell assessment in presumed early stage non-small cell lung cancer patients treated with stereotactic body radiation therapy: A prospective pilot study. Int J Radiat Oncol Biol Phys. 2018. November 1;102(3):536–42.
  • 10.Chinniah C, Aguarin L, Cheng P, Decesaris C, Cutillo A, Berman AT, et al. Early detection of recurrence in patients with locally advanced non-small-cell lung cancer via circulating tumor cell analysis. Clin Lung Cancer. 2019. September;20(5):384,390.e2.
  • 11.Wei L, Osman S, Hatt M, El Naqa I. Machine learning for radiomics-based multimodality and multiparametric modeling. Q J Nucl Med Mol Imaging. 2019. December;63(4):323–38.
  • 12.Liu Z, Wang S, Dong D, Wei J, Fang C, Zhou X, et al. The applications of radiomics in precision diagnosis and treatment of oncology: Opportunities and challenges. Theranostics. 2019. February 12;9(5):1303–22.
  • 13.Li H, Galperin-Aizenberg M, Pryma D, Simone CB 2nd, Fan Y. Unsupervised machine learning of radiomic features for predicting treatment response and overall survival of early stage non-small cell lung cancer patients treated with stereotactic body radiation therapy. Radiother Oncol. 2018. November;129(2):218–26.
  • 14.Oikonomou A, Khalvati F, Tyrrell PN, Haider MA, Tarique U, Jimenez-Juan L, et al. Radiomics analysis at PET/CT contributes to prognosis of recurrence and survival in lung cancer treated with stereotactic body radiotherapy. Sci Rep. 2018. March 5;8(1):4003,018–22357-y.
  • 15.Lv W, Yuan Q, Wang Q, Ma J, Feng Q, Chen W, et al. Radiomics analysis of PET and CT components of PET/CT imaging integrated with clinical parameters: Application to prognosis for nasopharyngeal carcinoma. Mol Imaging Biol. 2019. Oct;21(5):954–64.
  • 16.Zanfardino M, Franzese M, Pane K, Cavaliere C, Monti S, Esposito G, et al. Bringing radiomics into a multi-omics framework for a comprehensive genotype-phenotype characterization of oncological diseases. J Transl Med. 2019. Oct 7;17(1):337,019–2073-2.
  • 17.Berman AT, Jabbour SK, Vachani A, et al. Empiric radiotherapy for lung cancer collaborative group multi-institutional evidence-based guidelines for the use of empiric stereotactic body radiation therapy for non-small cell lung cancer without pathologic confirmation. Transl Lung Cancer Res 2019;8:5–14
  • 18.Lee SH, Kim JH, Cho N, Park JS, Yang Z, Jung YS, et al. Multilevel analysis of spatiotemporal association features for differentiation of tumor enhancement patterns in breast DCE-MRI. Med Phys. 2010. August;37(8):3940–56.
  • 19.van Griethuysen JJM, Fedorov A, Parmar C, Hosny A, Aucoin N, Narayan V, et al. Computational radiomics system to decode the radiographic phenotype. Cancer Res. 2017. November 1;77(21):e104–7.
  • 20.Leijenaar RT, Nalbantov G, Carvalho S, van Elmpt WJ, Troost EG, Boellaard R, et al. The effect of SUV discretization in quantitative FDG-PET radiomics: The need for standardized methodology in tumor texture analysis. Sci Rep. 2015. Aug 5;5:11075.
  • 21.Singh A, Shannon CP, Gautier B, Rohart F, Vacher M, Tebbutt SJ, et al. DIABLO: An integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics. 2019. September 1;35(17):3055–62.
  • 22.Lê Cao KA, Boitard S, Besse P. Sparse PLS discriminant analysis: Biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics. 2011. June 22;12:253,2105–12-253.
  • 23.Tenenhaus A, Philippe C, Guillemot V, Le Cao KA, Grill J, Frouin V. Variable selection for generalized canonical correlation analysis. Biostatistics. 2014. Jul;15(3):569–83.
  • 24.Lee SH, Han P, Hales R, Voong KR, Noro K, Sugiyama S, et al. Multi-view radiomics and dosiomics analysis with machine learning for predicting acute-phase weight loss in lung cancer patients treated with radiotherapy. Phys Med Biol. 2020. Apr 1.
  • 25.Rohart F, Gautier B, Singh A, Lê Cao K. mixOmics: An R package for ‘omics feature selection and multiple data integration. bioRxiv. 2017.
  • 26.Grossmann P, Stringfield O, El-Hachem N, Bui MM, Rios Velazquez E, Parmar C, et al. Defining the biological basis of radiomic phenotypes in lung cancer. Elife. 2017. July 21;6:10.7554/eLife.23421.
  • 27.Sanduleanu S, Woodruff HC, de Jong EEC, van Timmeren JE, Jochems A, Dubois L, et al. Tracking tumor biology with radiomics: A systematic review utilizing a radiomics quality score. Radiother Oncol. 2018. June;127(3):349–60.
  • 28.Yoon HJ, Sohn I, Cho JH, Lee HY, Kim JH, Choi YL, et al. Decoding tumor phenotypes for ALK, ROS1, and RET fusions in lung adenocarcinoma using a radiomics approach. Medicine (Baltimore). 2015. October;94(41):e1753.
  • 29.Huynh E, Coroller TP, Narayan V, Agrawal V, Hou Y, Romano J, et al. CT-based radiomic analysis of stereotactic body radiation therapy patients with lung cancer. Radiother Oncol. 2016. August;120(2):258–66.
  • 30.Lafata KJ, Hong JC, Geng R, Ackerson BG, Liu JG, Zhou Z, et al. Association of pre-treatment radiomic features with lung cancer recurrence following stereotactic body radiation therapy. Phys Med Biol. 2019. January 8;64(2):025007,6560/aaf5a5.
  • 31.Li Q, Kim J, Balagurunathan Y, Liu Y, Latifi K, Stringfield O, et al. Imaging features from pretreatment CT scans are associated with clinical outcomes in nonsmall-cell lung cancer patients treated with stereotactic body radiotherapy. Med Phys. 2017. August;44(8):4341–9.
  • 32.Li Q, Kim J, Balagurunathan Y, Qi J, Liu Y, Latifi K, et al. CT imaging features associated with recurrence in non-small cell lung cancer patients after stereotactic body radiotherapy. Radiat Oncol. 2017. September 25;12(1):158,017–0892-y.
  • 33.Yu W, Tang C, Hobbs BP, Li X, Koay EJ, Wistuba II, et al. Development and validation of a predictive radiomics model for clinical outcomes in stage I non-small cell lung cancer. Int J Radiat Oncol Biol Phys. 2018. November 15;102(4):1090–7.
  • 34.Heppner GH. Tumor heterogeneity. Cancer Res. 1984. June;44(6):2259–65.
  • 35.Junttila MR, de Sauvage FJ. Influence of tumour micro-environment heterogeneity on therapeutic response. Nature. 2013. September 19;501(7467):346–54.
  • 36.O’Connor JP, Rose CJ, Waterton JC, Carano RA, Parker GJ, Jackson A. Imaging intratumor heterogeneity: Role in therapy response, resistance, and clinical outcome. Clin Cancer Res. 2015. January 15;21(2):249–57.
  • 37.Lindsay CR, Faugeroux V, Michiels S, Pailler E, Facchinetti F, Ou D, et al. A prospective examination of circulating tumor cell profiles in non-small-cell lung cancer molecular subgroups. Ann Oncol. 2017. July 1;28(7):1523–31.
  • 38.Lambin P, Leijenaar RTH, Deist TM, Peerlings J, de Jong EEC, van Timmeren J, et al. Radiomics: The bridge between medical imaging and personalized medicine. Nat Rev Clin Oncol. 2017. Dec;14(12):749–62. [DOI] [PubMed] [Google Scholar]
  • 39.Dissaux G, Visvikis D, Da-Ano R, Pradier O, Chajon E, Barillot I, et al. Pretreatment (18)F-FDG PET/CT radiomics predict local recurrence in patients treated with stereotactic body radiotherapy for early-stage non-small cell lung cancer: A multicentric study. J Nucl Med. 2020. Jun;61(6):814–20. [DOI] [PubMed] [Google Scholar]
  • 40.Ou X, Wang J, Zhou R, Zhu S, Pang F, Zhou Y, et al. Ability of (18)F-FDG PET/CT radiomic features to distinguish breast carcinoma from breast lymphoma. Contrast Media Mol Imaging. 2019. Feb 25;2019:4507694. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Schneider BJ, Ismaila N, Aerts J, Chiles C, Daly ME, Detterbeck FC, et al. Lung cancer surveillance after definitive curative-intent therapy: ASCO guideline. JCO. 2020. 03/01; 2020/12;38(7):753–66. [DOI] [PubMed] [Google Scholar]
  • 42.Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2008. 11/01; 2020/07;70(5):849–911. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Wang Y, Che X, Ma S. Nonlinear filtering based on 3D wavelet transform for MRI denoising. EURASIP Journal on Advances in Signal Processing. 2012. 02/21;2012(1):40. [Google Scholar]
  • 44.Coroller TP, Grossmann P, Hou Y, Rios Velazquez E, Leijenaar RT, Hermann G, et al. CT-based radiomic signature predicts distant metastasis in lung adenocarcinoma. Radiother Oncol. 2015. March;114(3):345–50. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Pavic M, Bogowicz M, Kraft J, Vuong D, Mayinger M, Kroeze SGC, et al. FDG PET versus CT radiomics to predict outcome in malignant pleural mesothelioma patients. EJNMMI Res. 2020. Jul 13;10(1):81,020–00669-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Ganeshan B, Miles KA. Quantifying tumour heterogeneity with CT. Cancer Imaging. 2013. March 26;13:140–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Chicklore S, Goh V, Siddique M, Roy A, Marsden PK, Cook GJ. Quantifying tumour heterogeneity in 18F-FDG PET/CT imaging by texture analysis. Eur J Nucl Med Mol Imaging. 2013. January;40(1):133–40. [DOI] [PubMed] [Google Scholar]
  • 48.Amadasun M, King R. Textural features corresponding to textural properties. IEEE Transactions on Systems, Man, and Cybernetics. 1989;19(5):1264–74. [Google Scholar]
  • 49.Thibault G, Angulo J, Meyer F. Advanced statistical matrices for texture characterization: Application to cell classification. IEEE Transactions on Biomedical Engineering. 2014;61(3):630–7. [DOI] [PubMed] [Google Scholar]
  • 50.Galloway MM. Texture analysis using gray level run lengths. Computer Graphics and Image Processing. 1975. June 1975;4(2):172–9. [Google Scholar]
  • 51.Welch ML, McIntosh C, Haibe-Kains B, Milosevic MF, Wee L, Dekker A, et al. Vulnerabilities of radiomic signature development: The need for safeguards. Radiother Oncol. 2019. January;130:2–9. [DOI] [PubMed] [Google Scholar]
  • 52.Norgeot B, Quer G, Beaulieu-Jones B, Torkamani A, Dias R, Gianfrancesco M, et al. Minimum information about clinical artificial intelligence modeling: The MI-CLAIM checklist. Nat Med. 2020. 09/01;26(9):1320–4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society.Series B (Methodological). 1996. 2020/11;58(1):267–88. [Google Scholar]
  • 54.Fan J, Shao Q, Zhou W. Are discoveries spurious? distributions of maximum spurious correlations and their applications. Ann Statist. 2018. 06;46(3):989–1017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Kim-Anh Lê Cao, Rossouw D, Christèle Robert-Granié, Besse P. A sparse PLS for variable selection when integrating omics data. Statistical Applications in Genetics and Molecular Biology. 2008;7(1). [DOI] [PubMed] [Google Scholar]
  • 56.Kononenko I, Kukar M. Machine learning and data mining: Introduction to principles and algorithms. Horwood Publishing Limited; 2007. [Google Scholar]
  • 57.Bengio Y, Grandvalet Y. No unbiased estimator of the variance of K-fold cross-validation. Journal of machine learning research. 2004;5:1089–105. [Google Scholar]
  • 58.Molinaro AM, Simon R, Pfeiffer RM. Prediction error estimation: A comparison of resampling methods. Bioinformatics. 2005. August 1;21(15):3301–7. [DOI] [PubMed] [Google Scholar]
  • 59.Kim J. Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap. Comput Stat Data Anal. 2009. 1 September 2009;53(11):3735–45. [Google Scholar]

Associated Data

Supplementary Materials

Figure S1a: Overview of workflow for multi-block feature integration with DIABLO.

Figure S1b: Examples of ROC curves averaged across two different datasets and 20×5-fold CV on each of 3 components in DIABLO.

Figure S2a: Average ROC curves for 20×5-fold CV comparing full, full weighted and null designs in DIABLO for the best input feature blocks for predicting pre-SBRT CTCs.

Figure S2b: Average ROC curves for 20×5-fold CV comparing full, full weighted and null designs in DIABLO for the best input feature blocks for predicting post-SBRT CTCs.

Supplementary Document 1
