Author manuscript; available in PMC 2022 Mar 1.
Published in final edited form as: Comput Methods Programs Biomed. 2021 Jan 15;200:105937. doi: 10.1016/j.cmpb.2021.105937

Applying a random projection algorithm to optimize machine learning model for predicting peritoneal metastasis in gastric cancer patients using CT images

Seyedehnafiseh Mirniaharikandehei 1, Morteza Heidari 1, Gopichandh Danala 1, Sivaramakrishnan Lakshmivarahan 2, Bin Zheng 1
PMCID: PMC7920928  NIHMSID: NIHMS1665003  PMID: 33486339

Abstract

Background and Objective:

Non-invasively predicting the risk of cancer metastasis before surgery can play an essential role in determining which patients can benefit from neoadjuvant chemotherapy. This study aims to investigate and test the advantages of applying a random projection algorithm to develop and optimize a radiomics-based machine learning model to predict peritoneal metastasis in gastric cancer patients using a small and imbalanced computed tomography (CT) image dataset.

Methods:

A retrospective dataset involving CT images acquired from 159 patients is assembled, including 121 and 38 cases with and without peritoneal metastasis, respectively. A computer-aided detection scheme is first applied to segment primary gastric tumor volumes and initially compute 315 image features. Then, five gradient boosting machine (GBM) models, each embedded with one of five feature selection or regeneration methods (random projection algorithm, principal component analysis, least absolute shrinkage and selection operator, maximum relevance and minimum redundancy, and recursive feature elimination) along with a synthetic minority oversampling technique, are built to predict the risk of peritoneal metastasis. All GBM models are trained and tested using a leave-one-case-out cross-validation method.

Results:

Results show that the GBM model embedded with a random projection algorithm yields a significantly higher prediction accuracy (71.2%) than the other four GBM models (p<0.05). The precision, sensitivity, and specificity of this optimal GBM model are 65.78%, 43.10%, and 87.12%, respectively.

Conclusions:

This study demonstrates that CT images of the primary gastric tumors contain discriminatory information to predict the risk of peritoneal metastasis, and that a random projection algorithm is a promising method to generate optimal feature vectors and improve the performance of machine learning-based prediction models.

Keywords: Gastric Cancer, Quantitative features, Computed tomography, Random projection, Feature dimensionality reduction

1. Introduction

Although the incidence of gastric cancer has declined in recent years, it remains the third leading cause of cancer-related death worldwide [1]. While surgery remains the only curative treatment option, preoperative neoadjuvant chemotherapy (NAC) has demonstrated favorable results, with increased therapeutic resection rates and improved survival [2]. To prevent the adverse effects of NAC, patients at different disease stages must be distinguished from one another [3], because the treatment differs for each stage of the disease [4]. Recent studies demonstrated that applying preoperative NAC to advanced gastric cancer patients with peritoneal metastasis (PM) yielded much better clinical outcomes and enhanced the overall survival rate [5, 6]. Thus, an accurate assessment of the presence of PM is essential for selecting appropriate patients for NAC. Since subjective reading of endoscopic ultrasound and computed tomography (CT) images is not completely reliable [3, 4], an alternative technique is needed to facilitate the assessment of tumor stages and the risk of PM.

Recently, radiomics techniques have been applied to extract quantitative information from medical images in the form of large pools of image features, and mining these feature pools offers a promising approach to building machine learning (ML) models that predict clinical outcomes [7, 8]. Although several radiomics-based ML models have been reported to differentiate and stage gastric cancer patients [9, 10], these studies computed radiomics features from tumor regions that were manually segmented on a single CT slice selected by a radiologist. In addition, a correlation analysis-based method was used to determine a small set of image features, which cannot eliminate redundancy among the selected features. Thus, the discriminatory power and prediction accuracy of these ML models were limited. To overcome such limitations, in this study we propose to develop and evaluate a new computer-aided detection (CAD) scheme that predicts the risk of PM among gastric cancer patients. First, our scheme segments the primary gastric tumor volume in 3D CT image data, which allows image features related to tumor heterogeneity to be computed more accurately. Second, to reduce the dimensionality of the feature space and better identify orthogonal or non-redundant image features from the large pool of initially computed radiomics features, we investigate and apply a random projection algorithm (RPA). Third, to avoid bias in generating the feature vector, RPA is embedded in a multi-feature fusion-based ML model to predict the risk of PM, which is trained and tested using (1) a synthetic minority oversampling technique (SMOTE) to balance the numbers of cases in the two classes and (2) a leave-one-case-out (LOCO) cross-validation method. The details of the study design, experimental procedures, data analysis results, and discussion are presented in the following sections of this article.

2. Materials and Methods

2.1. Image Dataset

In this study, we use a retrospective dataset of abdominal computed tomography (CT) images. To avoid potential case selection bias, the dataset initially contains 219 consecutive patients who were diagnosed with and treated for gastric cancer. After excluding cases that were unresectable, undetectable on CT examinations, or of poor image quality as determined by the radiologists in the retrospective review, 159 cases are included in the study dataset. Among these patients, 121 cases have PM and 38 cases do not have PM. Table 1 summarizes the distribution of general demographic information and several related clinical results of these 159 patients.

Table 1.

Distribution of demographic information and several related clinical results of study cases in the dataset

Category | Subgroup | Cases with PM (n = 121) | Cases without PM (n = 38)

Total cases |  | 121 | 38
Age (years) | < 45 | 11 (6.9%) | 5 (3.1%)
Age (years) | 45–65 | 72 (45.2%) | 23 (14.4%)
Age (years) | > 65 | 38 (23.8%) | 10 (6.2%)
Age (years) | Mean ± SD | 59.49 ± 11.97 | 59.11 ± 8.75
Age (years) | Median | 61 | 60
Gender | Men | 94 (59.1%) | 30 (18.8%)
Gender | Women | 27 (16.9%) | 8 (5.0%)
Tumor location | Upper | 37 (23.2%) | 19 (11.9%)
Tumor location | Medium | 20 (12.6%) | 7 (4.4%)
Tumor location | Lower | 50 (31.4%) | 12 (7.5%)
Tumor location | Diffuse | 14 (8.8%) | 0
Pathological staging after surgery | I | 0 | 38 (23.9%)
Pathological staging after surgery | II | 26 (16.4%) | 0
Pathological staging after surgery | III | 32 (20.1%) | 0
Pathological staging after surgery | IV | 63 (39.6%) | 0
Borrmann type | 1 | 1 (0.6%) | 0
Borrmann type | 2 | 21 (13.2%) | 11 (6.9%)
Borrmann type | 3 | 94 (59.1%) | 25 (15.7%)
Borrmann type | 4 | 5 (3.1%) | 2 (1.3%)

(Percentages are computed over all 159 cases.)

Each patient had an abdominal CT examination during the original cancer diagnosis before surgery. All CT examinations were performed using a multidetector CT scanner (GE Discovery CT750 HD, GE Healthcare). Each patient fasted overnight and drank 600–1000 ml of water orally to distend the stomach before the CT examination. Contrast-enhanced CT images were obtained with delays of 28 seconds (arterial phase), 55 seconds (portal phase), and 120 seconds (venous phase) after intravenous administration of an iodinated contrast agent (Optiray 320 mg I/mL, Bayer Schering Pharma) at a dose of 1.5 ml/kg body weight and a flow rate of 2.5 ml/s. The CT scanning parameters included (1) tube voltage switching between 120 kVp and 140 kVp in spectral imaging mode, (2) tube current automatically optimized with a maximum limit of 200 mA, (3) tube rotation time of 0.76–0.80 seconds, (4) detector collimation of 64 × 0.625 mm, (5) field of view of 350–500 mm, and (6) an image matrix of 512 × 512 pixels with a reconstruction thickness of 2.5 mm. The venous phase CT images were selected and used to segment tumors, compute image features, and build the machine learning prediction model in this study.

2.2. Tumor Segmentation

Recognizing the heterogeneity of tumors in clinical images, we modified and implemented a hybrid tumor segmentation scheme that uses a dynamic programming method [11, 12] to adaptively identify the growing thresholds of a multi-layer topographic region growing algorithm and the initial contour for an active contour algorithm. Specifically, the tumor segmentation scheme involves the following steps. First, a Wiener filter is applied to reduce image noise. Second, an initial seed is placed at the center of the tumor region on the CT slice in which the tumor has its largest area. To reduce inter-operator variability in choosing the initial seed and increase the robustness of the segmentation results, as demonstrated in a previous study [13], a predefined 5 × 5 window around the initial seed is automatically created, and the pixel with the minimum value inside the window is selected as the first seed point. Third, to automatically determine the first threshold value for the region growing algorithm, a new predefined 5 × 5 window, sized so that it lies fully inside all tumor regions in our dataset and avoids the potential risk of growing leakage at the first growth layer, is created around the new seed point. The scheme then computes the pixel value differences between the center pixel and the boundary pixels and identifies the maximum difference. Subsequently, the region growing threshold is determined as T1 = Vc + 0.25 × Dmax, where Vc is the value of the center pixel and Dmax is the maximum pixel value difference inside the bounding window. This threshold value is applied to define the first layer of region growing to segment the tumor region depicted on one CT image slice.

Fourth, after determining the first layer of tumor region growth, the growing threshold of the second layer is T2 = T1 + βC1, where C1 is the computed contrast of the first layer and β is a coefficient (i.e., 0.5). This multi-layer region growing continues until the region grown between two adjacent layers becomes more than twice the size of the last growing layer. Last, after the region growing algorithm stops, the scheme selects the boundary contour of the last region growing layer as the initial contour, and an active contour algorithm is then applied to expand or shrink the contour curve to best fit the tumor boundary. Figure 1 and Figure 2 illustrate the block diagram of this tumor segmentation scheme and an image example of applying the above steps to segment a tumor region depicted on one CT slice, respectively.

Figure 1. The block diagram of the 2D tumor region segmentation.

Figure 2. The process of 2D tumor region segmentation.
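To make the threshold rules above concrete, the following minimal NumPy sketch implements the two rules, T1 = Vc + 0.25 × Dmax for the first layer and T(n+1) = Tn + β·Cn for subsequent layers. It is a sketch only: the function names and window handling are assumptions for illustration, and the full multi-layer region growing and active contour steps are omitted.

```python
import numpy as np

def first_threshold(roi, seed, window=5, alpha=0.25):
    """First-layer growing threshold: T1 = Vc + alpha * Dmax, where Vc is the
    seed pixel value and Dmax is the largest absolute difference between the
    seed and the boundary pixels of a (window x window) box around it."""
    r, c = seed
    half = window // 2
    box = roi[r - half:r + half + 1, c - half:c + half + 1].astype(float)
    vc = float(roi[r, c])
    boundary = np.concatenate([box[0, :], box[-1, :], box[1:-1, 0], box[1:-1, -1]])
    d_max = np.abs(boundary - vc).max()
    return vc + alpha * d_max

def next_threshold(t_prev, layer_contrast, beta=0.5):
    """Threshold of the next growth layer: T_(n+1) = T_n + beta * C_n,
    where C_n is the computed contrast of the current layer."""
    return t_prev + beta * layer_contrast
```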

Subsequently, after segmenting the tumor region on one CT slice, the CAD scheme continues to perform tumor region segmentation by scanning in both the up and down directions until no tumor region is detected in the next adjacent CT slice. In this process, the central point of the tumor region detected in the adjacent CT slice is mapped into the new CT slice as the initial region growing seed, and the tumor region segmentation in this target slice is then automatically performed from the mapped seed. Additionally, a boundary condition derived from the adjacent slice is used to constrain the multi-layer region-growing process and avoid growth leakage. Figure 3 shows an example of the segmentation of tumor regions depicted on several CT image slices of one case. In this way, the 3D tumor volume can be segmented and computed.

Figure 3. An example of 3D segmentation of a lesion in 3 different slices.

2.3. Feature Extraction

Once the 3D tumor volume is segmented, the CAD scheme is applied to compute a large set of radiomics-based image features, namely 315 features extracted and computed from each segmented 2D tumor region (ROI) depicted on a CT image slice. These features are categorized into four main groups: (a) gray-level run-length matrix (GLRLM) features, of which 44 two-dimensional features are extracted; (b) gray-level difference method (GLDM) probability density function features, in which four features (mean, median, standard deviation, and variance) are computed from each probability density function representing the statistical texture of the ROI; (c) wavelet domain features, in which the image is first decomposed by a wavelet transform into four components comprising low- and high-scale decompositions in the X and Y directions [14], and the GLCM features [15], as well as 21 tumor density features [16] and GLDM features [17], are then extracted from those components; and (d) Laplacian of Gaussian (LoG) features, in which a Gaussian smoothing filter is first applied to reduce sensitivity to noise, the Laplacian filter then sharpens the image edges and highlights rapid intensity changes inside the region [18], and the mean, median, and standard deviation are computed from the points extracted after applying the LoG filters. Figure 4 shows the flow diagram of the feature extraction process.

Figure 4. Diagram of the feature extraction process.
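As one concrete example from group (d), the sketch below computes LoG summary features for a segmented 2D tumor region using SciPy's Gaussian-Laplace filter. The sigma value, function name, and the choice to summarize the filter response inside the tumor mask are assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np
from scipy import ndimage

def log_features(roi, mask, sigma=2.0):
    """Group (d): Gaussian smoothing followed by the Laplacian (LoG filter),
    then summary statistics of the response inside the segmented tumor mask."""
    response = ndimage.gaussian_laplace(roi.astype(float), sigma=sigma)
    values = response[mask > 0]
    return {
        "log_mean": float(np.mean(values)),
        "log_median": float(np.median(values)),
        "log_std": float(np.std(values)),
    }
```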

After computing the 2D features of all segmented tumor regions on the N involved CT image slices, the CAD scheme computes each 3D feature ($F_{3D_k}$) as

$F_{3D_k} = \sum_{i=1}^{N} w_i \times F_{2D_{k,i}}$  (1)

where $w_i$ is the ratio of the segmented tumor volume on the i-th slice to the whole tumor volume segmented on all N involved CT slices, and $F_{2D_{k,i}}$ is the k-th 2D feature computed on the i-th slice. The segmented tumor volume on the i-th slice is computed by multiplying the segmented 2D region size by the CT slice thickness. Finally, all 315 computed 3D feature values are normalized between 0 and 1 to reduce case-based dependence and weight all features evenly.
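A minimal sketch of Formula (1) and the final normalization step is shown below, assuming the per-slice values of one 2D feature are stored in an array; the function names and placeholder inputs are hypothetical.

```python
import numpy as np

def aggregate_3d_feature(feat_2d, region_areas, slice_thickness=2.5):
    """Formula (1): volume-weighted sum of the per-slice 2D feature values.

    feat_2d         : (N,) values of one 2D feature on the N involved slices
    region_areas    : (N,) segmented tumor area on each slice
    slice_thickness : CT reconstruction thickness in mm
    """
    volumes = np.asarray(region_areas, dtype=float) * slice_thickness
    weights = volumes / volumes.sum()
    return float(weights @ np.asarray(feat_2d, dtype=float))

def minmax_normalize(feature_matrix):
    """Normalize every 3D feature column to [0, 1] across all cases."""
    x = np.asarray(feature_matrix, dtype=float)
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    return (x - x_min) / np.where(x_max > x_min, x_max - x_min, 1.0)
```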

2.4. Feature Dimensionality Reduction Using Random Projection Algorithm

Since the initial feature pool contains 315 image features, many of them can be redundant (highly correlated) or irrelevant (with low discriminatory power). Hence, selecting a small set of optimal features to reduce the feature dimensionality and enhance learning accuracy is vital. In this study, we investigate and apply a novel image feature regeneration method, the random projection algorithm (RPA). Theoretical analysis has indicated that RPA has advantages in simplicity, high performance, and robustness compared to other feature reduction methods; however, empirical results are sparse [19]. Meanwhile, RPA has been investigated and tested in many engineering applications, such as text [20] and face and object recognition [21], and has yielded results comparable to conventional feature regeneration methods like principal component analysis (PCA) [22]. Moreover, an advantage of random projection methods over these alternatives is that they generate more robust results and are computationally inexpensive [19, 23].

In this study, we apply RPA to generate optimal features from the original large pool of radiomics features. The following is a brief introduction to the RPA method. Considering each case as a point in a k-dimensional space, where k represents the number of features, the Euclidean distance between two points can be expressed as follows:

$|MN| = \sqrt{\sum_{i=1}^{k} (m_i - n_i)^2}$  (2)

In Formula (2), $M = (m_1, \ldots, m_k)$ and $N = (n_1, \ldots, n_k)$ are two points in the k-dimensional space. Likewise, the volume V of a sphere with radius r in k-dimensional space is given by Formula (3) [24]:

$V(k) = \dfrac{r^k \, \pi^{k/2}}{\frac{k}{2}\,\Gamma\!\left(\frac{k}{2}\right)}$  (3)

Normalizing the feature matrix to [0, 1] implies that all data can be enclosed in a sphere of radius 1. The important fact about a sphere with unit radius is that, as the dimension increases, its volume decreases (Formula 4), while the maximum possible distance between two points remains 2 [24].

$\lim_{k \to \infty} \left( \dfrac{\pi^{k/2}}{\frac{k}{2}\,\Gamma\!\left(\frac{k}{2}\right)} \right) = 0$  (4)

Additionally, according to the theory of heavy-tailed distributions, consider a case $M = (m_1, \ldots, m_k)$ in the feature space whose features are, to an acceptable approximation, independent (i.e., nearly perpendicular variables mapped to different axes), with $E(m_i) = p_i$, $\sum_{i=1}^{k} p_i = \mu$, and $E\left|(m_i - p_i)^d\right| \le p_i$ for $d = 2, 3, \ldots, t^2/(6\mu)$. Then a probability bound can be computed using Formula (5) [24]:

$\mathrm{prob}\left( \left| \sum_{i=1}^{k} m_i - \mu \right| \ge t \right) \le \mathrm{Max}\left( 3e^{-t^2/(12\mu)},\; 4 \times 2^{-t/e} \right)$  (5)

The larger the value of t, the smaller the chance that a point lies farther than t from the mean; thus, M is concentrated around the mean value. In particular, according to Formulas 4 and 5, to a satisfactory approximation, all data are contained in a sphere of unit size and are concentrated around their mean value. As a result, as the dimension increases, the volume of the sphere approaches zero, and the differences between cases become too small for accurate classification.
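As a quick numerical illustration of Formulas (3) and (4), the snippet below evaluates the unit-radius sphere volume for increasing k; the particular dimensions printed are an arbitrary choice for illustration.

```python
import numpy as np
from scipy.special import gamma

def unit_ball_volume(k):
    """Formula (3) with r = 1: V(k) = pi^(k/2) / ((k/2) * Gamma(k/2))."""
    return np.pi ** (k / 2) / ((k / 2) * gamma(k / 2))

for k in (2, 5, 10, 50, 315):
    print(f"k = {k:3d}   V(k) = {unit_ball_volume(k):.3e}")
# The volume collapses toward zero as k grows (Formula 4), while the diameter
# of the unit ball stays fixed at 2.
```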

According to the above analysis, the larger the initial feature vector, the higher the dimension of the space; hence most of the data is concentrated around the center, leaving less separation between cases. Consequently, a powerful dimensionality reduction technique is one that reduces the number of features while preserving the distances between the points, which roughly preserves most of the information. If we choose a d-dimensional subspace of the initial feature space at random, all the projected distances in the new space are expected to lie within a determined scale factor of those in the initial k-dimensional space [25]. Thus, after a conventional feature selection method simply removes redundant features, the accuracy will likely not increase, because the divergence between the points is not large enough to yield a robust model.

To address the concern discussed above and optimize the feature space, the Johnson-Lindenstrauss lemma can be applied in RPA [26]. The lemma states that for any 0 < ε < 1 and any number of cases t, regarded as points in the k-dimensional space $\mathbb{R}^k$, let d be a positive integer satisfying Formula (6) [26]:

$d \ge \dfrac{4 \ln t}{\epsilon^2/2 - \epsilon^3/3}$  (6)
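For reference, scikit-learn exposes this same bound, and the short sketch below evaluates the minimum target dimension d for a few accuracy levels ε, using t = 159 (the number of cases here) purely as an illustrative input. The worst-case bound is conservative for a dataset of this size; in this study the features are projected to 20 components, well below it.

```python
from sklearn.random_projection import johnson_lindenstrauss_min_dim

# Formula (6): the smallest target dimension d that guarantees all pairwise
# distances among t points are preserved within a (1 +/- eps) factor.
for eps in (0.3, 0.5, 0.8):
    d = johnson_lindenstrauss_min_dim(n_samples=159, eps=eps)
    print(f"eps = {eps:.1f}  ->  d >= {d}")
```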

Then, for any set W of t points in $\mathbb{R}^k$ and all $z, w \in W$, there exists a map, or random projection function, $f: \mathbb{R}^k \to \mathbb{R}^d$ that preserves the distances as specified in Formula (7) [26]:

$(1-\epsilon)\,|z-w|^2 \le |f(z)-f(w)|^2 \le (1+\epsilon)\,|z-w|^2$  (7)

The above approximation can also be written as Formula (8) [26]:

$\dfrac{|f(z)-f(w)|^2}{1+\epsilon} \le |z-w|^2 \le \dfrac{|f(z)-f(w)|^2}{1-\epsilon}$  (8)

As Formula (8) demonstrates, the distances between points in the lower-dimensional space remain close to the distances in the high-dimensional space. The lemma thus states that it is feasible to project a set of points from a high-dimensional space into a lower-dimensional space while approximately preserving the distances between the points.

As a result, the above analysis suggests that if the initial set of features is projected into a lower-dimensional subspace using the random projection method, the distances between points are preserved with better contrast. Hence, it may improve the classification accuracy between the two classes, representing cases with or without PM, with a low risk of overfitting the ML model.

In this study, we also investigate whether RPA yields better results than several feature dimensionality reduction methods commonly used in the medical imaging informatics field, including principal component analysis (PCA) [27], least absolute shrinkage and selection operator (LASSO) [28], maximum relevance and minimum redundancy (MRMR) [29], and recursive feature elimination (RFE) [30]. All features extracted in the above section are fed into RPA, PCA, LASSO, MRMR, and RFE, and each method generates 20 optimal features out of the initial pool of 315 features.
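A minimal sketch of the 315-to-20 reduction for two of the five methods, RPA (via a Gaussian random projection) and PCA, is shown below; the other three methods (LASSO, MRMR, RFE) select rather than regenerate features and are fitted analogously. The random matrix stands in for the normalized 3D radiomics features, and all parameter choices are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.random((159, 315))   # placeholder for the normalized 3D radiomics feature matrix

# RPA: regenerate 20 features as random linear combinations of the 315 originals.
rpa = GaussianRandomProjection(n_components=20, random_state=0)
X_rpa = rpa.fit_transform(X)

# PCA baseline with the same target dimensionality.
pca = PCA(n_components=20)
X_pca = pca.fit_transform(X)

print(X_rpa.shape, X_pca.shape)   # (159, 20) (159, 20)
```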

2.5. Machine Learning Model

To classify the study cases with or without PM, we build a multi-feature fusion-based machine learning model. However, due to the imbalance of our dataset, which includes 121 PM cases and 38 non-PM cases, we apply a synthetic minority oversampling technique (SMOTE) algorithm [31] to rebalance the original image dataset. The advantages of using SMOTE to develop machine learning models for medical images have been well investigated and demonstrated in many previous studies, including those conducted by researchers in our lab [32–34]. In this study, we apply the SMOTE method to generate 83 synthetic non-PM cases; thus, the dataset is expanded to 242 cases, including 121 PM cases and 121 non-PM cases.
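A minimal sketch of this rebalancing step using the SMOTE implementation in the imbalanced-learn package is shown below; the random placeholder features and parameter choices are assumptions for illustration.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.random((159, 20))              # placeholder for the reduced feature vectors
y = np.array([1] * 121 + [0] * 38)     # 1 = PM, 0 = non-PM

# Oversample the minority (non-PM) class with synthetic samples until both
# classes contain 121 cases (83 synthetic non-PM cases are generated).
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y_bal))              # [121 121]
```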

After addressing the dataset imbalance, we select and implement the gradient boosting machine (GBM) to train an optimal machine learning model to predict the risk of advanced gastric cancer patients having PM. The GBM is a popular machine learning algorithm that has proven effective at classifying complex datasets and is often best in class in predictive accuracy [35]. With hyperparameter tuning, the GBM model is implemented to achieve low computational cost and high robustness in detection results as well. Additionally, to decrease case partition and feature selection (or generation) bias, we use a leave-one-case-out (LOCO) based cross-validation method to train and test the GBM model. In each LOCO cycle, RPA and SMOTE are embedded in the training process. Then, the one case not involved in the training cycle is tested by the GBM model trained using all other cases in the dataset. The model produces a prediction score for each testing case, ranging from 0 to 1, and a higher score indicates a higher risk of PM. The prediction performance is evaluated using a receiver operating characteristic (ROC) method after discarding all SMOTE-generated non-PM training samples. The area under the ROC curve (AUC) and the overall prediction accuracy after applying an operating threshold (T = 0.5) to the GBM-generated prediction scores are used as two performance evaluation indices. Additionally, Cohen's kappa coefficient is computed to evaluate the performance of the CAD scheme; a high kappa value (the coefficient ranges from zero to one) indicates high robustness and less randomness in the predicted results [36, 37].
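The sketch below mirrors the training and evaluation pipeline described above: a leave-one-case-out loop in which RPA and SMOTE are refit on the training cases of each fold, a GBM classifier is trained, and the held-out score is recorded; AUC, accuracy at T = 0.5, and Cohen's kappa are then computed over the real cases only. The synthetic data, default GBM hyperparameters, and use of scikit-learn / imbalanced-learn classes are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score, roc_auc_score
from sklearn.model_selection import LeaveOneOut
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.random((159, 315))            # placeholder for the normalized 3D radiomics features
y = np.array([1] * 121 + [0] * 38)    # 1 = PM, 0 = non-PM

scores = np.zeros(len(y))
for train_idx, test_idx in LeaveOneOut().split(X):
    # RPA and SMOTE are refit inside every fold, so the held-out case never
    # influences feature generation or oversampling.
    rpa = GaussianRandomProjection(n_components=20, random_state=0)
    X_tr = rpa.fit_transform(X[train_idx])
    X_tr, y_tr = SMOTE(random_state=0).fit_resample(X_tr, y[train_idx])

    gbm = GradientBoostingClassifier(random_state=0)
    gbm.fit(X_tr, y_tr)
    scores[test_idx] = gbm.predict_proba(rpa.transform(X[test_idx]))[:, 1]

# Evaluation uses only the real (non-synthetic) cases held out in each fold.
preds = (scores >= 0.5).astype(int)   # operating threshold T = 0.5
print("AUC      :", roc_auc_score(y, scores))
print("Accuracy :", accuracy_score(y, preds))
print("Kappa    :", cohen_kappa_score(y, preds))
```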

In summary, Figure 5 shows the complete flow chart of using our CAD scheme to process images, compute optimal features, and train the GBM model, in which RPA and SMOTE are embedded inside the LOCO process. In this study, the segmentation and feature extraction steps were performed using MATLAB R2019a, and the feature reduction and classification steps were performed using Python 3.7.

Figure 5. The flowchart of the proposed CAD scheme.

3. Results

Figure 6 presents five ROC curves generated by the GBM models embedded with five different feature reduction methods (LASSO, PCA, RFE, RPA, and MRMR). Table 2 shows the performance comparison between RPA and the other four feature selection methods. The AUC value and the overall prediction accuracy of the GBM model trained using RPA with 3D image features as input are 0.69±0.019 and 71.2%, respectively. Moreover, the precision, sensitivity, and specificity of the proposed method are 65.78%, 43.10%, and 87.12%, respectively. The results indicate that RPA generates an optimal image feature vector that builds a GBM model with significantly higher prediction accuracy (p < 0.05) than the GBM models optimized using the other four feature optimization methods.

Figure 6. Comparison of five ROC plots generated using GBM models optimized using five different feature selection or reduction methods.

Table 2.

The performance comparison of five GBM models optimized using five different feature selection and reduction methods.

Method | Precision | Sensitivity | Specificity | Accuracy | AUC

LASSO | 38.9% | 31.1% | 80.0% | 65.8% | 0.59±0.013
PCA | 38.5% | 64.1% | 65.5% | 65.2% | 0.58±0.021
RFE | 56.5% | 62.5% | 51.2% | 56.9% | 0.60±0.020
MRMR | 50.0% | 32.7% | 82.0% | 64.5% | 0.60±0.017
RPA | 65.8% | 43.1% | 87.1% | 71.2% | 0.69±0.019

Figure 7 shows two ROC curves, and Table 3 reports the prediction performance values to compare two GBM models trained using 2D features computed from the largest tumor region segmented from one CT image slice and the 3D features computed from the segmented tumor volumes. In these two GBM models, the RPA method is used to select and generate optimal features. The results demonstrate that using 3D image features yields significantly higher performance than using 2D features (p < 0.05) in predicting the risk of gastric cancer cases with PM.

Figure 7. Comparison of two ROC plots generated by two GBM models optimized using 2D and 3D features generated using the RPA method, respectively.

Table 3.

The comparison of two GBM model performance between using 2D and 3D image features generated using the RPA method.

Feature type | AUC | Accuracy

2D features | 0.66±0.017 | 68.4%
3D features | 0.69±0.019 | 71.2%

In addition, we also build and compare several other types of ML models, including logistic regression, support vector machine (SVM), random forest, and decision tree. All models are trained and tested using the same LOCO cross-validation method embedded with the RPA and SMOTE schemes. Table 4 and Figure 8 present the results comparing the prediction performance of the five ML models, which show that GBM yields higher accuracy than the other four ML models. However, the AUC values of the GBM, SVM, and logistic regression-based ML models are not statistically significantly different (p > 0.05).

Table 4.

Comparison of prediction performance of five ML models.

Model | AUC | Accuracy

SVM | 0.66 | 64.55%
Logistic Regression | 0.68 | 61.93%
Random Forest | 0.63 | 69.03%
Decision Tree | 0.56 | 65.16%
GBM | 0.69 | 71.15%

Figure 8. Comparison of ROC plots of five ML models.

4. Discussion

CT is the most popular imaging modality for detecting and diagnosing gastric cancer, and it may also provide a non-invasive alternative for predicting the risk of PM in advanced gastric cancer patients. Despite these potential advantages, the performance of radiologists in reading and interpreting CT images to detect PM is insufficient [38]. Although studies have suggested that developing and applying CAD schemes integrated with the radiomics concept and ML models is beneficial and may provide radiologists with a second opinion to more accurately detect and diagnose different abnormalities [39], developing ML models from a large number of radiomics features and a small training dataset remains a difficult task. In this study, we explore a new approach to developing a CAD scheme or ML model with several unique characteristics and novel ideas in feature extraction and ML model optimization to improve accuracy in detecting advanced gastric cancer patients with PM.

First, in a previous study conducted in this area, the authors performed manual segmentation of gastric cancer tumor regions from single CT image slices [40]. However, manual segmentation of tumor regions is often inconsistent, with large inter-observer variability due to the fuzzy boundaries of the tumor regions, which makes the computed image features inconsistent or not reproducible; thus, the prediction accuracy may be affected or not robust. To address this issue, we developed an interactive CAD scheme with a graphical user interface (GUI) to initiate the segmentation of tumor regions from CT images. A user only needs to place an initial seed near the center of the tumor region with the largest size on one CT slice, and the CAD scheme then segments the tumor regions on all involved CT image slices automatically. The segmentation results can also be visually inspected in the GUI window. Although the GUI includes a correction function that the user can activate to instruct the CAD scheme to correct any segmentation errors, the results of this study show that the CAD scheme achieves satisfactory results in automatically segmenting all 3,305 tumor regions from the 159 cases in our dataset.

Second, although a previous study [41] reported developing a radiomics-based ML model to detect and diagnose gastric cancer using CT images, the authors used image features computed from only one manually selected CT image slice, which may not accurately represent the image features of the entire tumor. To address this issue, we conduct the first study that develops and tests a new ML model using 3D image features. Our study results support our hypothesis that 2D image features extracted from only one CT slice are not sufficient to represent the heterogeneous characteristics of the tumors, while 3D image features yield significantly higher prediction performance. Specifically, in this study, we performed 3D tumor segmentation and extracted 3D image features to detect or predict the risk of advanced gastric cancer patients having PM. As shown in Table 3, the GBM model trained using 3D features yields an AUC of 0.69±0.019 and an accuracy of 71.2%, which are significantly higher than those of the GBM model trained using 2D features (AUC of 0.66±0.017 and accuracy of 68.4%; p < 0.05).

Third, in developing CAD schemes to train ML models, identifying a small and efficient set of image features plays a critical role [42, 43]; therefore, different feature dimensionality reduction methods have been investigated in previous studies [44, 45]. Although these studies made many improvements in optimizing feature vectors, it remains a significant challenge to obtain small feature vectors that represent the complex and non-linear image feature space. In this study, we investigate the feasibility of applying RPA in the medical imaging informatics field to optimize the CAD scheme or ML model. Our results show that RPA is a promising technique to reduce the dimensionality of a set of points in Euclidean space for very heterogeneous feature data, which commonly occurs in medical images, with the advantages of high robustness in classification and a low risk of overfitting. Figure 6 illustrates that the GBM model embedded with RPA yields significantly higher prediction performance than the GBM models embedded with the other four popular feature reduction methods (PCA, LASSO, MRMR, and RFE). As presented in Table 2, the GBM model using RPA reached the highest AUC (0.69±0.019) and the highest prediction accuracy (71.2%) among the five feature reduction methods. Moreover, the computed Cohen's kappa coefficient is 0.68, which indicates the reliability or robustness of the GBM model optimized using the RPA method.

Fourth, since many ML models have been developed and used in the medical imaging informatics or CAD fields, selecting an ML model can also be a challenging issue. In this study, we also compare the prediction performance of five popular ML models. The results show that many different ML models can yield very comparable performance, as shown in Table 4 and Figure 8. However, comparing these with the data presented in Table 2, we find that selecting or generating optimal features plays a more critical role than choosing a different ML model. Thus, combining the above observations of this study, we demonstrate that, given the very complicated distribution of radiomics features computed from medical images, RPA is a promising and powerful technique for generating optimal feature vectors to better train the ML models used in CAD schemes for medical images.

Despite the encouraging results, we also note some limitations of this study. First, the dataset used in this study is relatively small; hence, larger datasets are required to validate the results before testing in future prospective clinical studies. Second, although we used synthetic data to balance the dataset and reduce the impact of class imbalance, the SMOTE technique is only effective for low-dimensional data and may not be appropriate or optimal for high-dimensional data [46]. Third, in the initial feature pool, we extracted only a limited number (315) of statistical and textural features, which is far fewer than the number of features computed using recently developed radiomics concepts and technology in other studies [47]. Thus, more texture features can be explored in future studies to increase the diversity of the initial feature pool, which may also increase the chance of selecting or generating more optimal features and significantly improve the accuracy of the ML model for predicting the risk of PM. To overcome the above limitations, more studies and progress are needed in this field.

In summary, despite the above limitations, this is a valid proof-of-concept study that reveals a new and promising approach to identify and generate optimal feature vectors for training ML models implemented in CAD schemes for medical images. Since optimizing the feature vector is one of the critical steps in building an optimal ML model using the radiomics concept, the presented method is not limited to the detection of advanced gastric cancer patients with PM; it can also benefit other medical imaging studies developing ML models to detect different types of cancers or abnormalities in the future.

Highlights.

  • Gastric tumor volume automatically segmented from CT images.

  • Introducing a new image feature regeneration method to the field of medical imaging informatics that optimizes the extracted features and achieves higher performance than popular feature reduction methods in this field.

  • Achieving small feature vectors representing the complex and non-linear image feature space.

Acknowledgment

This study is supported in part by research grant R01 CA197150 from the National Cancer Institute. The authors also thank the support from the Stephenson Cancer Center, University of Oklahoma.

Footnotes

5. Conflict of interest statement

The authors declare that they have no competing interests.


References

  • 1.Bray F, et al. , Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: a Cancer Journal for Clinicians, 2018. 68(6): p. 394–424. [DOI] [PubMed] [Google Scholar]
  • 2.Biondi A, et al. , Neo-adjuvant chemo (radio) therapy in gastric cancer: current status and future perspectives. World Journal of Gastrointestinal Oncology, 2015. 7(12): p. 389. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Fukagawa T, et al. , A prospective multi-institutional validity study to evaluate the accuracy of clinical diagnosis of pathological stage III gastric cancer (JCOG1302A). Gastric Cancer, 2018. 21(1): p. 68–73. [DOI] [PubMed] [Google Scholar]
  • 4.Wang F-H, et al. , The Chinese Society of Clinical Oncology (CSCO): clinical guidelines for the diagnosis and treatment of gastric cancer. Cancer Communications, 2019. 39(1): p. 1–31. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Coccolini F, et al. , Intraperitoneal chemotherapy in advanced gastric cancer. Meta-analysis of randomized trials. European Journal of Surgical Oncology (EJSO), 2014. 40(1): p. 12–26. [DOI] [PubMed] [Google Scholar]
  • 6.Ishigami H, et al. , Phase III trial comparing intraperitoneal and intravenous paclitaxel plus S-1 versus cisplatin plus S-1 in patients with gastric cancer with peritoneal metastasis: PHOENIX-GC trial. Journal of Clinical Oncology, 2018. 36(19): p. 1922–1929. [DOI] [PubMed] [Google Scholar]
  • 7.Lambin P, et al. , Radiomics: extracting more information from medical images using advanced feature analysis. European journal of cancer, 2012. 48(4): p. 441–446. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Aerts HJ, et al. , Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nature Communications, 2014. 5(1): p. 1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Sun Z-Q, et al. , Radiomics study for differentiating gastric cancer from gastric stromal tumor based on contrast-enhanced CT images. Journal of X-ray Science and Technology, 2019. 27(6): p. 1021–1031. [DOI] [PubMed] [Google Scholar]
  • 10.Wang L, et al. CT-based radiomics nomogram for preoperative prediction of No.10 lymph nodes metastasis in advanced proximal gastric cancer. European Journal of Surgical Oncology, 2020. DOI: 10.1016/j.ejso.2020.11.132. [DOI] [PubMed] [Google Scholar]
  • 11.Zheng B, et al. , Interactive computer-aided diagnosis of breast masses: computerized selection of visually similar image sets from a reference library. Academic Radiology, 2007. 14(8): p. 917–927. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Danala G, et al. , Classification of breast masses using a computer-aided diagnosis scheme of contrast enhanced digital mammograms. Annals of Biomedical Engineering, 2018. 46(9): p. 1419–1431. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Gundreddy RR, et al. , Assessment of performance and reproducibility of applying a content- based image retrieval scheme for classification of breast lesions. Medical Physics, 2015. 42(7): p. 4241–4249. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Rajaei A and Rangarajan L, Wavelet features extraction for medical image classification. International Journal of Engineering Sciences, 2011. 4: p. 131–141. [Google Scholar]
  • 15.Hazra D, Texture recognition with combined GLCM, wavelet and rotated wavelet features. International Journal of Computer and Electrical Engineering, 2011. 3(1): p. 146. [Google Scholar]
  • 16.Mirniaharikandehei S, et al. , Developing a quantitative ultrasound image feature analysis scheme to assess tumor treatment efficacy using a mouse model. Scientific Reports, 2019. 9(1): p. 1–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Ahmadi N and Akbarizadeh G, Iris tissue recognition based on GLDM feature extraction and hybrid MLPNN-ICA classifier. Neural Computing and Applications, 2020. 32(7): p. 2267–2281. [Google Scholar]
  • 18.Zhao F and Desilva CJ. Use of the Laplacian of Gaussian operator in prostate ultrasound image processing. in Proceedings of the 20th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. Vol. 20 Biomedical Engineering Towards the Year 2000 and Beyond (Cat. No. 98CH36286). 1998. IEEE. [Google Scholar]
  • 19.Bingham E and Mannila H. Random projection in dimensionality reduction: applications to image and text data. in Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. 2001. [Google Scholar]
  • 20.Wang Q, et al. , Hierarchical feature selection for random projection. IEEE Transactions on Neural Networks and Learning Systems, 2018. 30(5): p. 1581–1586. [DOI] [PubMed] [Google Scholar]
  • 21.Mekhalfi ML, et al. , Fast indoor scene description for blind people with multiresolution random projections. Journal of Visual Communication and Image Representation, 2017. 44: p. 95–105. [Google Scholar]
  • 22.Suhaimi NFM and Htike ZZ. Comparison of Machine Learning Classifiers for dimensionally reduced fMRI data using Random Projection and Principal Component Analysis. in 2019 7th International Conference on Mechatronics Engineering (ICOM). 2019. IEEE. [Google Scholar]
  • 23.Xie H, Li J, and Xue H, A survey of dimensionality reduction techniques based on random projection. arXiv preprint arXiv:1706.04371, 2017. [Google Scholar]
  • 24.Aggarwal CC, Hinneburg A, and Keim DA. On the surprising behavior of distance metrics in high dimensional space. in International Conference on Database Theory. 2001. Springer. [Google Scholar]
  • 25.Saunders C, et al. , Subspace, Latent Structure and Feature Selection. Statistical and Optimization Perspectives Workshop, SLSFS 2005 Bohinj, Slovenia, February 23–25, 2005, Revised Selected Papers. Vol. 3940. 2006: Springer. [Google Scholar]
  • 26.Dasgupta S and Gupta A, An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 2003. 22(1): p. 60–65. [Google Scholar]
  • 27.Pechenizkiy M, Tsymbal A, and Puuronen S. PCA-based feature transformation for classification: issues in medical diagnostics. in Proceedings. 17th IEEE Symposium on Computer-Based Medical Systems. 2004. IEEE. [Google Scholar]
  • 28.Tibshirani R, Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 1996. 58(1): p. 267–288. [Google Scholar]
  • 29.Peng H, Long F, and Ding C, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005. 27(8): p. 1226–1238. [DOI] [PubMed] [Google Scholar]
  • 30.Zeng X, et al. Feature selection using recursive feature elimination for handwritten digit recognition. in 2009 Fifth International Conference on Intelligent Information Hiding and Multimedia Signal Processing. 2009. IEEE. [Google Scholar]
  • 31.Fernández A, et al. , SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. Journal of Artificial Intelligence Research, 2018. 61: p. 863–905. [Google Scholar]
  • 32.Wang KJ, et al. , A hybrid classifier combining Borderline-SMOTE with AIRS algorithm for estimating brain metastasis from lung cancer: A case study in Taiwan. Computer Methods and Programs in Biomedicine, 2015. 119(2): p. 63–76. [DOI] [PubMed] [Google Scholar]
  • 33.Yan S, et al. , Improving lung cancer prognosis assessment by incorporating synthetic minority oversampling technique and score fusion method. Medical Physics, 2016. 43(6Part1): p. 2694–2703. [DOI] [PubMed] [Google Scholar]
  • 34.Aghaei F, et al. , Applying a new quantitative global breast MRI feature analysis scheme to assess tumor response to chemotherapy. Journal of Magnetic Resonance Imaging, 2016. 44(5): p. 1099–1106. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Hu R, Li X, and Zhao Y. Gradient boosting learning of Hidden Markov models. in 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings. 2006. IEEE. [Google Scholar]
  • 36.McHugh ML, Interrater reliability: the kappa statistic. Biochemia medica: Biochemia Medica, 2012. 22(3): p. 276–282. [PMC free article] [PubMed] [Google Scholar]
  • 37.Heidari M, et al. , Improving the performance of CNN to predict the likelihood of COVID-19 using chest X-ray images with preprocessing algorithms. International Journal of Medical Informatics, 2020. 144: p. 104284. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Seevaratnam R, et al. , How useful is preoperative imaging for tumor, node, metastasis (TNM) staging of gastric cancer? A meta-analysis. Gastric Cancer, 2012. 15(1): p. 3–18. [DOI] [PubMed] [Google Scholar]
  • 39.Gonçalves VM, Delamaro ME, and Nunes F.d.L.d.S., A systematic review on the evaluation and characteristics of computer-aided diagnosis systems. Revista Brasileira de Engenharia Biomédica, 2014. 30(4): p. 355–383. [Google Scholar]
  • 40.Liu S, et al. , CT textural analysis of gastric cancer: correlations with immunohistochemical biomarkers. Scientific Reports, 2018. 8(1): p. 1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Li R, et al. , Detection of gastric cancer and its histological type based on iodine concentration in spectral CT. Cancer Imaging, 2018. 18(1): p. 1–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Kuhn M and Johnson K, An introduction to feature selection, in Applied Predictive Modeling. 2013, Springer. p. 487–519. [Google Scholar]
  • 43.Tan M, Pu J, and Zheng B, Optimization of breast mass classification using sequential forward floating selection (SFFS) and a support vector machine (SVM) model. International Journal of Computer Assisted Radiology and Surgery, 2014. 9(6): p. 1005–1020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Khalid S, Khalil T, and Nasreen S. A survey of feature selection and feature extraction techniques in machine learning. in 2014 Science and Information Conference. 2014. IEEE. [Google Scholar]
  • 45.Chandrashekar G and Sahin F, A survey on feature selection methods. Computers & Electrical Engineering, 2014. 40(1): p. 16–28. [Google Scholar]
  • 46.Blagus R and Lusa L, SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics, 2013. 14: p. 106–106. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Wang T, et al. , Correlation between CT based radiomics features and gene expression data in non-small cell lung cancer. Journal of X-ray Science and Technology, 2019. 27(5): p. 773–803. [DOI] [PubMed] [Google Scholar]
