J. Am. Soc. Mass Spectrom. 2024 Oct 25;35(12):2801–2814. doi: 10.1021/jasms.4c00324

The Application of a Random Forest Classifier to ToF-SIMS Imaging Data

Mariya A. Shamraeva, Theodoros Visvikis, Stefanos Zoidis, Ian G. M. Anthony, Sebastiaan Van Nuffel†,‡,*
PMCID: PMC11622239  PMID: 39455427

Abstract


Time-of-flight secondary ion mass spectrometry (ToF-SIMS) imaging is a potent analytical tool that provides spatially resolved chemical information on surfaces at the microscale. However, the hyperspectral nature of ToF-SIMS datasets can be challenging to analyze and interpret. Both supervised and unsupervised machine learning (ML) approaches are increasingly useful to help analyze ToF-SIMS data. Random Forest (RF) has emerged as a robust and powerful algorithm for processing mass spectrometry data. This machine learning approach offers several advantages, including accommodating nonlinear relationships, robustness to outliers in the data, managing the high-dimensional feature space, and mitigating the risk of overfitting. The application of RF to ToF-SIMS imaging facilitates the classification of complex chemical compositions and the identification of features contributing to these classifications. This tutorial aims to assist nonexperts in either machine learning or ToF-SIMS to apply Random Forest to complex ToF-SIMS datasets.

Keywords: ToF-SIMS, Random Forest, Machine Learning

1. Introduction

Time-of-flight secondary ion mass spectrometry (ToF-SIMS) has been used extensively for several decades for the surface analysis of a wide range of inorganic and organic material systems due to its chemical specificity and sensitivity.1–6 Mass spectrometry images are generated by raster scanning a sample in two dimensions using a focused primary ion beam and acquiring mass spectra for each pixel.7 Additionally, it is possible to include a third dimension of depth by removing material by sputtering and performing scans at different depths within the sample, thereby creating a 3D image stack. The ToF-SIMS spectra of most organic materials are complex because primary ion beams often cause significant fragmentation.8 In numerous cases, molecular ions have low intensities or are not observed at all in the ToF-SIMS spectra. Thus, much of the information about the surface chemistry is obtained via fragment ions, which complicates the data interpretation. The superposition of various fragmentation patterns, especially when analytes share similar structural features, leads to convoluted datasets. Furthermore, secondary ion yields do not always correspond to analyte abundance in the sample.9 As a result, ToF-SIMS is nonquantitative, especially in the case of molecular imaging below the static limit.8,9 Nonetheless, ToF-SIMS imaging has become a potent tool that enables untargeted analysis of molecular species, for example in the contexts of material sciences and biomedical research.10–12

The analysis of ToF-SIMS images poses several challenges. First, electric field effects introduced by surface topography and charging can introduce mass shifts and signal loss that may further complicate data analysis.13–17 These field effects can be reduced via instrument designs with a pulsed analyzer or via the use of an extraction delay in instruments with pulsed primary ions.18 Surface charging can also be compensated using dedicated charge neutralization hardware.19,20 Second, the nature of the noise should also be considered; there is a consensus that data should be scaled in a manner that is consistent with its noise.21–24 When techniques rely on ion counting, as is the case for some time-of-flight mass analyzers,22–24 the noise follows a Poisson distribution, making it proportional to the square root of the average number of counts and, with enough counts, approaching normality.24 Unfortunately, ToF-SIMS data, especially individual pixels in ToF-SIMS images, frequently deviate from normality because of low ion counts.25,26 Third, nonlinearity can be introduced by detector saturation and matrix effects. Care should be taken to avoid detector saturation during data collection. Counting statistics of a Poisson process with dead time refer to the statistical treatment of event counting where the events follow a Poisson distribution but are subject to a dead time during which the detection system is unable to register subsequent events. Dead time corrections (using Poisson statistics) can be used to reduce such nonlinearity issues; this correction is essential for maintaining the accuracy and reliability of quantitative measurements in systems where event rates can be high, such as in mass spectrometry imaging.23 Matrix effects are, however, still poorly understood and can be difficult to avoid.27

Conventionally, ToF-SIMS imaging requires user interpretation, which benefits from a priori knowledge of the sample. As user interpretation is often slow and can require extensive training, machine learning (ML) can be used to help analyze ToF-SIMS data.28 These approaches are increasing in use because mass spectrometry imaging (MSI) datasets, and ToF-SIMS MSI data in particular, are highly multidimensional, typically with 10^4–10^6 bins per pixel, which can cause individual images to be many gigabytes or even terabytes in size.29 MSI datasets, and ToF-SIMS datasets in particular, pair well with ML because ML methods are designed for such large-sample problems.

Unsupervised ML methods have several potential advantages in exploratory research, namely, visualization, dimensionality reduction, image segmentation, unmixing, pattern extraction, and denoising.29 Unsupervised ML methods can be divided into several specific subbranches: factorization methods, partitioning and clustering methods, and manifold learning methods. Examples of factorization methods are principal component analysis (PCA),30–32 weighted PCA (w-PCA),23,33 independent component analysis (ICA),34 maximum autocorrelation factor (MAF),25,35,36 non-negative matrix factorization (NMF),37–41 multivariate curve resolution (MCR)42,43 and multivariate curve resolution-alternating least-squares (MCR-ALS),44,45 probabilistic latent semantic analysis (pLSA),41,46 CX/CUR matrix decomposition,47 and dictionary learning or molecular dictionary learning (MOLDL),48 as well as others.49,50 PCA is among the most used multivariate analysis techniques for SIMS-based MSI. Using PCA, high-dimensional data can be decomposed into a lower-dimensional space.26 PCA can be effective for the efficient retrieval of sources of variation in the data and the investigation of linear correlations and trends between mass peaks within an MSI dataset.31,51,52 However, PCA provides limited spectral information compared to other matrix decomposition methods and can be difficult to interpret.29 Partitioning and clustering methods are a second widely used class of algorithms for exploratory MSI analysis, including k-means,32,53–55 hierarchical clustering (HC),29,56,57 bisecting k-means,58 high dimensional data clustering (HDDC),59,60 and soft segmentation techniques such as fuzzy c-means clustering (FCM),61 AMASS,62 latent Dirichlet allocation,63 and spatial shrunken centroids.64 The linear nature of matrix factorization methods makes them less appropriate for nonlinear data. Nonlinear data can be analyzed by manifold learning methods, which include t-distributed stochastic neighbor embedding (tSNE),65 uniform manifold approximation and projection (UMAP),66 self-organizing maps (SOMs), autoencoders (AE),67 and Kohonen (neural) networks.68 It has been demonstrated that AE classifies MALDI imaging data from biological samples more effectively than PCA and NMF. Additionally, Matsuda showed that AE segmented ToF-SIMS data of human skin with greater detail than PCA or MCR.69 Furthermore, Aoyagi et al. reported that AE offered more accurate results for quantitatively analyzing SIMS data from organic mixtures with matrix effects.70

Supervised ML methods are widely used for tasks such as molecular signature identification and disease diagnosis71,72 and include partial least-squares (PLS) regression,73,74 Random Forest (RF),46,74,75 Markov random field (MRF),74 logistic regression (LR),71 gradient boosting,76 support vector machine (SVM),77 and others.76,78 Supervised machine learning models the relationship between input data and output data and, in contrast to unsupervised ML, requires human annotation. Another subset of ML is deep learning and neural network methods based on artificial neural networks (ANN).79–82 For instance, an ANN-based supervised learning method is useful for both quantitative and qualitative analysis of SIMS data from organic mixture samples affected by matrix effects.83 All of the above-mentioned ML methods have been discussed in reviews written by Mehta et al.,21 Graham and Castner,26 Verbeeck et al.,29 Jetybayeva et al.,78 and Gardner et al.81

Even though the RF algorithm has relatively low robustness to the choice of features and samples, it has many advantages, such as moderate robustness to overfitting, outliers, and mislabeled data. An example of the application of RF is an extensive interlaboratory study that collected over 1,000 spectra from six model peptides using 27 different ToF-SIMS instruments across 25 institutes worldwide. The RF algorithm was trained with 20 amino acid labels to classify and identify the peptides from the spectral data. The method demonstrated its effectiveness in determining the amino acid composition of unknown peptides and its potential to uncover novel chemical features within ToF-SIMS spectra.75

In summary, RF provides high classification accuracy and information about feature importance. Furthermore, it can handle the nonlinearity introduced by, for example, matrix effects.84,85 This tutorial seeks to aid nonexperts in either RF or ToF-SIMS. We summarize decision tree and Random Forest theory and provide guidelines on their application to ToF-SIMS imaging data. We also present a practical example of the application of RF to a ToF-SIMS dataset.

2. Theoretical Background

2.1. Decision Trees

Decision trees were first published by Morgan and Sonquist in 1963,86 who presented an automatic interaction detector (AID) tree-based technique for managing multivariate nonadditive effects in survey data. This publication was followed by several more advancements.87,88 Throughout the 1970s, Breiman,89 Friedman,90 and Quinlan91 independently proposed similar algorithms for the induction of tree-based models. A decision tree is a type of supervised learning algorithm with advantages such as handling heterogeneous data, robustness to outliers and to noise due to feature selection, and applicability to both classification and regression tasks.92–96 Classification and Regression Trees (CART),92 Iterative Dichotomiser 3 (ID3),97 and C4.5(98) are examples of decision tree algorithms. In this tutorial, we focus on CART with special attention to classification trees.92 The idea behind the decision-tree model involves approximating the Bayes model partition by recursively splitting the input space X into subspaces and then assigning constant prediction values. Let us define a rooted tree as a graph G = (V, E), where the set of edges (E) is directed away from the root (Figure 1). Any two vertices (V) (or nodes) in the graph are connected by only one path. A tree’s root node splits into two or more sets (or subpopulations) with increased homogeneity. If there exists an edge from t1 to t2 (i.e., (t1, t2) ∈ E), then node t1 is the parent of node t2, while node t2 is the child of node t1. These new subpopulations, or internal nodes, which have one or more children, continue splitting or branching until a homogeneous terminal node with no children, also referred to as a leaf node, is reached. The objective is to train a decision tree to generate rules that predict the target variable.

Figure 1. Schematic overview of decision trees. (a) The binary tree (Xt0) splits into two internal nodes, with the node Xt1 being the parent of Xt2. This process continues, with each node branching further, until a terminal node without children (leaf node) is reached. (b) The entropy function as an impurity measure: (1) highest when the distribution of node proportions is uniform (pi values are equal), (2) lowest when the probability of a specific class pi is 1, and (3) symmetric under permutation of the pi values. (c) Pruning tackles overfitting by reducing the size of decision trees, recombining a large tree upward (i, ii, iii), thus decreasing the sequence of subtrees. By eliminating unnecessary branches and nodes, classification accuracy can improve.

In general terms, an impurity measure i(t), using the framework of Breiman in 1984,92 can be defined as a function that assesses the goodness of any node t.

The most common impurity criteria used for classification trees are the Gini impurity, based on the Gini index,99 and the Shannon entropy.100 However, other metrics, such as cross-entropy101 and mutual information,102 are also important. Impurity criteria are used to measure how well a split separates the classes in a dataset. The Gini impurity measures how often a randomly selected element of a set would be mislabeled if it were labeled randomly and independently based on the label distribution within the set. It is calculated by summing, over each class label, the product of the probability of choosing an item of that class and the probability of miscategorizing that item. Shannon entropy is another such measure, quantifying the uncertainty or impurity in the data. Shannon-entropy-based criteria measure the uncertainty of the target variable within a given node. As the entropy-based impurity decreases, the information gained about the target variable by splitting the node increases, which is commonly known as information gain.103,104 Cross-entropy measures the difference between two probability distributions: the true distribution of the data and the predictions made by the model. In the context of decision trees, cross-entropy can be used to evaluate how well the tree’s splits classify the data; lower cross-entropy indicates a better-performing split. Mutual information quantifies the amount of information gained about one variable through another variable. In decision trees, it helps determine which feature to split on by measuring how much knowing the value of a feature reduces the uncertainty of the target variable.
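As a minimal illustration (not taken from the original study), the two most common criteria can be computed directly from the class proportions in a node; the helper functions below are a sketch using NumPy:

```python
import numpy as np

def gini_impurity(p):
    """Gini impurity for a vector of class proportions p (summing to 1)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def shannon_entropy(p):
    """Shannon entropy (in bits) for a vector of class proportions p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # by convention, 0 * log(0) = 0
    return -np.sum(p * np.log2(p))

# A pure node has zero impurity; a uniform 50/50 node is maximally impure.
print(gini_impurity([1.0, 0.0]), shannon_entropy([1.0, 0.0]))  # both ~0
print(gini_impurity([0.5, 0.5]), shannon_entropy([0.5, 0.5]))  # 0.5 and 1.0
```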

The Classification and Regression Trees (CART) algorithm uses the Gini impurity to decide where to split the data. However, other algorithms apply different methods to evaluate the purity of a node. These algorithms include Oblique Classifier 1 (OC1),105 Chi-square automatic interaction detection (CHAID),93,106 multivariate adaptive regression splines (MARS),107 and the Conditional Inference Tree.108 Each of these algorithms uses its own impurity criterion, exhibiting the necessary properties of an impurity function, to determine the best way to split the data in a decision tree.109

In decision tree classification, the likelihood of incorrect predictions can be measured by calculating the misclassification rate. As more splits are added to a tree, this rate typically decreases, suggesting that the tree becomes more accurate, but this may lead to overfitting. Conversely, incorporating too few terminal nodes may lead to underfitting. Finding a tree that is neither too deep nor too shallow is therefore essential.

Both overfitting and underfitting result in errors that hinder supervised learning algorithms from generalizing beyond their training set. Overfitting leads to high variance, which is an error caused by the algorithm’s sensitivity to small fluctuations in the training set. High variance can occur when the algorithm models the random noise present in the training data. Underfitting leads to high bias, which causes errors from incorrect assumptions made by the learning algorithm. High bias can cause the algorithm to overlook the relevant relationships between features and the target outputs. This is commonly known as the bias-variance trade-off.110 User-defined “hyperparameters” (parameters that specify details of the learning process) can be tuned to find the right trade-off.

Pruning can be used to tackle the issue of overfitting in decision trees. This technique involves reducing the size of decision trees by recombining a large tree upward, thereby decreasing the sequence of subtrees.111 The objective of pruning is to determine the optimal model complexity that minimizes both error sources simultaneously. There are two techniques for pruning a decision tree, namely, postpruning (backward pruning) and prepruning.112–114 However, as we will see in the next section, pre- or postpruning is no longer necessary to achieve good generalization performance in the context of an ensemble of decision trees. The CART system92 employs a tree pruning method that is based on trading off predictive accuracy versus tree complexity; this trade-off is governed by a parameter that is optimized using cross-validation (CV) (Figure S1), which can partition the dataset in different ways.115,116
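For concreteness, the sketch below shows cost-complexity pruning with scikit-learn's CART implementation, selecting the complexity penalty (ccp_alpha) by 5-fold CV; the synthetic dataset and all settings are illustrative assumptions, not those of the original work:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for a (pixels x peaks) ToF-SIMS feature matrix with class labels.
X, y = make_classification(n_samples=500, n_features=50, random_state=0)

# CART-style cost-complexity pruning: candidate alphas from the pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Pick the alpha (complexity penalty) with the best 5-fold CV accuracy.
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                          X, y, cv=5).mean() for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```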

CV schemes are employed to increase the variation in the training and testing data and to reduce the influence of the data split on the output testing statistics.117 There are two main classes of cross-validation methods, namely, nonexhaustive CV and exhaustive CV.118,119 Nested CV has become a popular method for performing external CVs and improving the estimation of unbiased performance.116,120 An example of an exhaustive CV is leave-one-out CV (LOOCV) (Figure S1).120 Since there is a good chance of finding similarities between the training set and the testing sets, LOOCV is known to generate overoptimistic estimates.115
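In scikit-learn terms, the two classes of schemes correspond to, for example, the KFold and LeaveOneOut iterators; a minimal sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(6).reshape(-1, 1)                          # six samples
kfold = KFold(n_splits=3, shuffle=True, random_state=0)  # nonexhaustive CV
loo = LeaveOneOut()                                      # exhaustive CV
print(kfold.get_n_splits(X), "k-fold splits;", loo.get_n_splits(X), "LOOCV splits")
# -> 3 k-fold splits; 6 LOOCV splits
```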

2.2. From Trees to Random Forest

An ensemble of decision trees is referred to as a Random Forest.121–125 The main differences among ensemble methods, of which Random Forest is one, lie in how the baseline models are trained and combined. The most widely used ensemble methods are bagging, stacking, and boosting. There are many reviews in the literature about ensemble techniques.127–130 However, in this tutorial, we will focus on the specific ensemble method described by Breiman in 2001, which is referred to as Random Forests.131

The bias-variance decomposition for the squared error loss of the generalization error was first proposed by Geman132 in the context of neural networks. Similar decompositions for the expected generalization error based on the zero-one loss have been proposed in the literature, drawing a direct analogy with the bias-variance decomposition for the squared error loss by redefining the concepts of bias and variance in the case of classification.122,126,133–136 The bias-variance decomposition is a useful tool for diagnosing underfitting and overfitting (Figure 2).

Figure 2. Bias-variance decomposition as a function of model complexity. The blue line represents the squared bias, which is high for simple models (underfitting) and decreases as the model complexity increases. The red line represents the variance, which is low for simple models but increases with model complexity, indicating overfitting. The green line represents the total error, which is the sum of the squared bias and the variance. Vertical dashed lines mark the regions of under- and overfitting. Underfitting occurs at low model complexity with high bias and low variance, while overfitting occurs at high model complexity with low bias and high variance. The optimal model complexity balances bias and variance, resulting in the lowest total error.

Reducing the prediction variance via ensemble methods such as Random Forest is a reasonable strategy for decreasing the generalization error, provided that the corresponding bias can be maintained at the same level or is not increased excessively. Breiman showed in 1996 that the average model has a lower expected generalization error.126 Intrinsically, bagging works by averaging models built from bootstrap samples Lm for m = 1, ..., M of the training set. Each Lm is a replicate of L: given a training dataset L with N observations (x, y) and a binary target variable, N observations are drawn at random with replacement from L.137 When numerous bootstrap samples are generated from a large sample after N draws with replacement, the probability of an observation never having been selected is (1 − 1/N)^N ≈ e^−1 ≈ 0.368; therefore, each bootstrap sample contains approximately 1 − e^−1 ≈ 63.2% of the original observations.137,138 Because bagging produces unique models that vary from one bootstrap sample to another, they are more likely to benefit from the process of averaging. Therefore, this approach enhances model performance by reducing the variance without increasing the bias. Breiman demonstrated that classification accuracy and stability were significantly enhanced when averaging classification outcomes obtained from multiple bootstrap samples of the original training set. However, if the learning set L is small, subsampling roughly two-thirds of the objects may result in an increase in bias due to a decrease in model complexity. This bias may be too large to compensate for the decrease in variance, leading to a worse performance overall. Despite this flaw, bagging is an effective method. Subsequently, the use of bagging was expanded to any kind of model (i.e., not only decision trees) and to generalized statistical analysis, which showed that sampling with and without replacement yielded equivalent improvements.
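The ≈63.2% figure is easy to verify numerically; the following sketch draws one bootstrap sample and counts the unique observations (illustrative, with an arbitrary sample size):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
bootstrap = rng.integers(0, N, size=N)           # N draws with replacement
unique_fraction = np.unique(bootstrap).size / N  # -> ~0.632, i.e., 1 - 1/e
print(f"{unique_fraction:.3f}")
```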

In 2001, Breiman131 combined bagging with feature bagging,123 which resulted in a method known as Random Forests. Feature bagging is an altered tree learning algorithm that chooses a random subset of features at each potential split. Consequently, each tree is generated from a bootstrap sample of the training data and uses a random sample of features for each split. This approach effectively reduces both variance and bias. The improved reduction in variance is attributed to the increased independence of the trees, which is achieved through the combination of bootstrap samples and the random selection of features. Similarly, the bias is decreased because there are often a few dominant features that consistently outperform their counterparts during the decision tree fitting process. By employing feature bagging, a vast number of predictors can be considered, allowing local feature predictors to contribute to the construction of the tree. The construction of a classification tree is similar to the construction of Breiman’s original decision tree.131 It begins with the root, which contains all of the training samples. A subset of features is chosen randomly at each node, and the feature in this subset that permits the greatest class separation in the sample set at that node is identified. Next, the node is split into two child nodes. The process is then repeated until the tree is fully grown; that is, all leaf nodes contain samples from one class only. The trees grown are not pruned.131 This entire process is repeated many times. After many trees are generated, each observation is assigned a final class by a plurality vote.131 Breiman empirically demonstrated that Random Forests outperform boosting125 and arcing algorithms,123 both of which are geared toward reducing bias, whereas forests concentrate on reducing variance.

As we sample with replacement, approximately N·e^−1 ≈ 0.368N observations will not be part of a given bootstrap sample; these are known as out-of-bag (OOB) observations. These observations can be considered a test dataset and dropped down the tree, enabling us to estimate the prediction error. The out-of-bag error refers to the average prediction error on each training sample using only the trees that did not include that specific training sample in their bootstrap sample. The proximity measure139 between two sample points is another useful feature of tree-based ensemble methods.
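In scikit-learn, the OOB error can be obtained directly by setting oob_score=True; the snippet below is a sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=100, random_state=0)

# oob_score=True evaluates each sample only on the trees that did not see it.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
print("OOB error:", 1 - rf.oob_score_)
```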

The importance of the different features for the classification can be estimated with the permutation accuracy criterion.131 However, little is known about the variable importance calculated by Random Forest, and to the best of our knowledge, Ishwaran’s work in 2007 is the most fundamental work devoted to the theoretical analysis of tree-based variable importance measures.140 Commonly used methods to estimate the prediction accuracy or variable importance include permutation importance,131 impurity importance (when the Gini index is used as the impurity function),141 actual impurity reduction importance,142 and others.141
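As a sketch of the two importance flavors discussed here, scikit-learn exposes the impurity-based importance as feature_importances_ and the permutation criterion via sklearn.inspection.permutation_importance (synthetic data, illustrative settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Impurity-based importance (fast, but biased toward high-cardinality features).
impurity_imp = rf.feature_importances_
# Permutation importance on held-out data (Breiman's permutation criterion).
perm_imp = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print(perm_imp.importances_mean[:5])
```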

It is worth mentioning that Random Forest, which is a bagging-based ensemble method, is one but not the only example of the application of ensembles of decision trees. There are also Extremely Randomized Trees. Similarly to Random Forest, a random subset of possible features is used, but instead of finding the optimal thresholds, threshold values are randomly generated for each candidate feature, and the best of these randomly generated thresholds is chosen as the rule for splitting the node. This usually reduces the model variance slightly at the expense of a slightly larger increase in the bias. Boosting is another ensemble strategy for generating a set of predictors, but the accuracy of RF is comparable to that of boosting.143
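A minimal comparison of the two ensembles in scikit-learn (ExtraTreesClassifier implements Extremely Randomized Trees; the data and settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
for clf in (RandomForestClassifier(n_estimators=200, random_state=0),
            ExtraTreesClassifier(n_estimators=200, random_state=0)):
    print(type(clf).__name__, cross_val_score(clf, X, y, cv=5).mean())
```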

There are also several enhancements to boosting based on different boosting algorithms, such as AdaBoost, Gradient Boosting, XGBoost, LightGBM, and CatBoost. Stacking, or stacked generalization, is also an ensemble technique. While specific details regarding these algorithms are beyond the scope of this tutorial, the application of Random Forest is discussed in the following section.

2.3. Application of Random Forest to ToF-SIMS Data

Random Forest has been applied to mass spectrometry imaging datasets obtained by ToF-SIMS71,75 and by other MS methods such as MALDI,46,144,145 among others.78,146,147

As mentioned in the Introduction, acquiring good quality data using an appropriate experimental setup can already avoid issues such as field effects introduced by topography and surface charging and nonlinearities caused by detector saturation. Crucial preprocessing steps can include, but are not limited to, mass calibration, normalization, baseline correction, denoising, peak picking, and dimensionality reduction. The nature of ToF-SIMS data must be considered during all of these preprocessing steps. Incorrect mass calibration would not affect the Random Forest algorithm directly, but a correct calibration of the mass spectra is, of course, important for the interpretation of the output by the analyst. Green et al. provide a useful guide for ToF-SIMS mass calibration.148 Regarding mass calibration, peak picking, and mass bin width, Madiona et al.149 evaluated bin size using Shannon entropy, while Lang et al.76 proposed a conversion method for mass spectra. Normalization by the total ion count (TIC) per pixel effectively mitigates variations in the secondary ion signal due to differences in topography, sample charging, or instrumental conditions such as variations in primary ion current or detector efficiency, facilitating comparison across measurements taken on different days. Nevertheless, caution is advised when applying TIC normalization because it may produce misleading results. Baseline correction is usually unnecessary in ToF-SIMS, unlike in other MS techniques, due to the low levels of chemical noise.150 Denoising might also be considered as a preprocessing step because the performance of ML models is driven by the bias-variance trade-off. In particular, Haar wavelet denoising, or the down-binning of specific m/z images, is a commonly used technique for ToF-SIMS imaging.151,152 When performing peak picking, it is also important to take into account the large disparity in signal-to-noise ratio (SNR); noise remains a problem for automated peak identification in SIMS.150 The use of derivative spectrometry based on the continuous wavelet transform (CWT) usually facilitates peak detection in ToF-SIMS.151 Collinear features, such as the m/z channels of a single peak or fragments of the same molecule, can split the importance among themselves, reducing the clarity of results. If a SIMS imaging dataset is acquired on a system with a detector that follows Poisson statistics, then before any dimensionality reduction, a valid preprocessing step should include Poisson scaling,23 because principal component analysis (PCA)30–32 works optimally on data with a Gaussian distribution and weighted PCA (w-PCA)23,33 alleviates the effects of the Poisson-distributed noise. However, it is important to consider one potential drawback of scaling, namely, its tendency to reduce sparsity in MSI datasets, preventing the use of efficient sparse representations that would significantly reduce computational demands.153
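To make the two scaling steps concrete, the sketch below applies TIC normalization and the Poisson scaling of Keenan and Kotula23 to a toy pixels × channels count matrix (the matrix and the guard constants are our assumptions):

```python
import numpy as np

# Toy (pixels x mass channels) count matrix standing in for a ToF-SIMS image.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=2.0, size=(1000, 500)).astype(float)

# TIC normalization: divide each pixel spectrum by its total ion count.
tic = counts.sum(axis=1, keepdims=True)
tic[tic == 0] = 1.0                      # guard against empty pixels
tic_normalized = counts / tic

# Poisson (Keenan-Kotula) scaling before PCA: divide each element by the
# square root of (mean of its pixel) x (mean of its mass channel).
mean_image = counts.mean(axis=1)         # per-pixel means
mean_spectrum = counts.mean(axis=0)      # per-channel means
poisson_scaled = counts / np.sqrt(np.outer(mean_image, mean_spectrum) + 1e-12)
```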

For MSI datasets, the mass channels of a mass spectrum can be considered the features, and the individual pixels are considered separate samples. Class labels need to be assigned to each pixel in a training dataset. One important consideration is the correct selection of a data-partitioning scheme. The output from the RF model may serve as an input for subsequent analyses, and a bad data-partitioning scheme can adversely affect these downstream processes, leading to potential misinterpretation. As a rule of thumb, a training set, which should be independent of the testing set, is usually taken as 70% of the samples, with the remaining 30% used for testing.154 However, some data-partitioning schemes, notably the single train–test split, are only effective if the training and testing sets are large enough and representative of the parameter space.115,116 When a dataset is small, the k-fold cross-validation estimate is usually preferred over the test sample estimate. Another important consideration concerning the application of RF to ToF-SIMS data is feature extraction and selection. One strategy is to use the entire mass spectrum, either by using the individual mass channels or by down-binning them to reduce their number; in this case, noise is included in the feature space. The alternative strategy is to perform a peak search, enabling the creation of a better-performing model, although this might introduce some user bias if done incorrectly. Although RF classifiers exhibit notable robustness to overfitting, outliers, and mislabeling, enough extreme outliers and mislabeled samples in the training dataset may still impact the performance of the RF model. Moreover, they can lead to overfitting or cause certain (sub)trees within the RF to focus on features that are not representative of the classification problem. Outliers can compel trees within the Random Forest to grow deeper to isolate these points, thus increasing the computational complexity. The removal of outliers promotes more balanced trees, streamlining the model training process and reducing computational demands. Furthermore, the optimal hyperparameters of the RF should be determined during training. The hyperparameters that should be considered when training a model include the following: n_estimators, the number of trees in the forest; criterion, the impurity function (“gini”, “log_loss”, or “entropy”); max_features, the number of features to consider when looking for the best split (for classification tasks, the default value is √n, where n is the number of features); min_samples_leaf, the minimum number of samples required to be at a leaf node; and max_depth, the maximum depth of the tree. As an example, various numbers of trees (in the range of 10–1000) have been reported for robust RF.155–157 The number of trees can be tuned, for instance, through consecutive repetitions, calculating quality metrics such as the mean value and standard deviation of the out-of-bag error,158 or through a CV.159 Other hyperparameters, such as the number of variables used at each node, can be optimized similarly.156,160,161 Linear combinations of variables (for example, obtained using dimensionality reduction algorithms) can also be used to reduce the number of features to improve results and computation times at the cost of interpretability.147,158 It is sometimes crucial to decrease the number of variables because of the high collinearity of ToF-SIMS data to obtain a robust and reproducible model.162 An optimal number of variables can be determined via a nested cross-validation process.163
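The rule-of-thumb split and the hyperparameters listed above map directly onto scikit-learn's RandomForestClassifier; the following is a sketch on a synthetic stand-in for a pixels × peaks matrix (all values shown are illustrative, not tuned recommendations):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in: rows are pixels, columns are picked peaks (hypothetical data).
rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(2000, 300)).astype(float)
y = rng.integers(0, 2, size=2000)

# Conventional 70/30 train-test split; stratify keeps class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,      # number of trees in the forest
    criterion="gini",      # impurity criterion ("gini", "entropy", or "log_loss")
    max_features="sqrt",   # sqrt(n) features per split (classification default)
    min_samples_leaf=1,    # minimum samples required at a leaf node
    max_depth=None,        # grow trees fully; lower this value to regularize
    n_jobs=-1,
    random_state=0,
).fit(X_train, y_train)
```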

3. Practice Example Using a ToF-SIMS Dataset

In this section, we provide a simple demonstration of Random Forest applied to ToF-SIMS image data. In a previous publication, Random Forest was successfully used to identify potential marker ions for pulmonary arterial hypertension (PAH) in human lung arteries in a MALDI dataset.144 In that study by Van Nuffel et al., a ToF-SIMS imaging dataset of control and PAH-related arteries was also collected but had not been analyzed using Random Forest.

3.1. Experimental Section

3.1.1. Biological Sample Preparation and ToF-SIMS Analyses

For the sample preparation of the human lung tissue sections and the instrumental setup, including experimental parameters, we refer the reader to the previous study,144 where they are described in detail. Table S1 provides an overview of the samples used in the example. For this demonstration, 8 negative polarity ToF-SIMS images of 4 control arteries and 4 occluded PAH arteries were selected.144

3.1.2. Data Preprocessing

The ToF-SIMS datasets were first internally calibrated with the following ions: CN−, CNO−, C14H27O2−, C16H31O2−, and C18H35O2−, using SurfaceLab software (IONTOF GmbH, Germany) (Figure S2). Then, the calibrated ion images were converted to imzML and loaded into Python (3.12.3), where the processing was performed using custom-made scripts. The script iterates through the files, loading the data matrices and spatial coordinates, while also assigning labels based on the filenames, labeling “Control” data as 0 and “PAH” data as 1. After collecting all data matrices, it combines them into a single sparse matrix (combined_matrix). The data analysis pipeline included peak picking with subsequent extraction of the peaks. The peak list includes 608 peaks in the m/z range of 0–1850 Da. Additionally, the script compiles the mass axis, spatial coordinates, and labels and returns them for subsequent analysis steps. This consolidated dataset, along with the corresponding metadata, forms the foundation for the later stages of the analysis pipeline.
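A sketch of such a loading loop using the pyimzML parser is shown below; the filenames are hypothetical, and a shared mass axis (continuous-mode imzML) is assumed so that spectra can be stacked row-wise into a sparse matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack
from pyimzml.ImzMLParser import ImzMLParser  # pip install pyimzml

files = ["Control_artery_1.imzML", "PAH_artery_1.imzML"]  # hypothetical names

rows, coords, labels = [], [], []
for f in files:
    parser = ImzMLParser(f)
    label = 0 if "Control" in f else 1          # 0 = Control, 1 = PAH
    for idx, (x, y, z) in enumerate(parser.coordinates):
        mzs, intensities = parser.getspectrum(idx)
        rows.append(csr_matrix(intensities))
        coords.append((x, y))
        labels.append(label)

combined_matrix = vstack(rows)   # pixels x m/z channels, sparse
mass_axis = mzs                  # common mass axis (continuous imzML assumed)
labels = np.asarray(labels)
```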

Finally, the Compressed Sparse Row (CSR) matrix was normalized by scaling its values to fall within the range of 0 to 1. The script does this by first identifying the maximum and minimum values within the matrix, which define the data range; each element in the matrix is then scaled by subtracting the minimum value and dividing by the difference between the maximum and minimum values. The ion image of palmitic acid at m/z 255.2 is used as an indicator of biological tissue to remove background pixels from the image data. The signal of palmitic acid is extracted from the data matrix by selecting the m/z values within a defined tolerance around the target m/z value (±0.5). This ion image is then thresholded based on a specified intensity threshold (0.05), creating a binary mask that identifies the pixels with ion intensity above the threshold. The function returns the thresholded image and the indices of the “active pixels” that meet the threshold criterion. These results are used to create visual representations of the ion image and the thresholded image: the data are reshaped into a 512 × 512 pixel grid, corresponding to the spatial coordinates of the sample, and the HoloViews library is used to create two images, one showing the original ion intensities and another highlighting the thresholded regions in binary format (Figure S3).
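The normalization and tissue-masking steps can be sketched as follows; the function names are ours, and the minimum is assumed to be zero (true for count data), so min-max scaling reduces to division by the maximum, which also preserves sparsity:

```python
import numpy as np

def minmax_scale_csr(matrix):
    """Scale a nonnegative sparse count matrix into [0, 1].
    Assumes the minimum is 0, so only division by the maximum is needed."""
    out = matrix.astype(float).copy()
    out.data /= matrix.max()
    return out

def threshold_ion_image(matrix, mass_axis, target_mz, tol=0.5, threshold=0.05):
    """Sum the channels within +/-tol of target_mz and return the ion image
    plus the indices of 'active' (tissue) pixels above the threshold."""
    selected = np.abs(np.asarray(mass_axis) - target_mz) <= tol
    ion_image = np.asarray(matrix[:, selected].sum(axis=1)).ravel()
    return ion_image, np.flatnonzero(ion_image > threshold)

# e.g., the palmitic acid signal at m/z 255.2 as a tissue indicator:
# ion_img, active_idx = threshold_ion_image(combined_matrix, mass_axis, 255.2)
```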

3.1.3. Random Forest Implementation

The Random Forest is implemented using the scikit-learn library via the RandomForestClassifier class. The script begins by splitting the input dataset (active pixels) and labels (active labels) into training and testing sets; the RF training procedure used 10% of the active pixels. The script then initializes the RandomForestClassifier with predetermined hyperparameters and fits the model to the training data.
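A sketch of this setup, continuing the hypothetical active_pixels/active_labels variables from the sketches above (the hyperparameters shown are placeholders for the predetermined values):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical variables from the preprocessing sketches above, e.g.:
# active_pixels = combined_matrix[active_idx]; active_labels = labels[active_idx]
X_train, X_test, y_train, y_test = train_test_split(
    active_pixels, active_labels,
    train_size=0.10,             # 10% of the active pixels used for training
    stratify=active_labels, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```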

To optimize the hyperparameters, stratified k-fold cross-validation is used to evaluate the model’s performance for varying values of specific hyperparameters, namely, the number of trees (n_estimators), the maximum depth of the trees (max_depth), and the minimum number of samples required at a leaf node (min_samples_leaf). For this purpose, the script iterates over a grid of values for the number of trees and trains the RandomForestClassifier with each value, tracking the model’s accuracy on both the training and cross-validation sets. It identifies the optimal number of trees that results in the highest cross-validation accuracy. Similarly, the max_depth and min_samples_leaf optimizations iterate over different values for the maximum tree depth and the minimum samples per leaf, respectively, each time identifying the best-performing model based on cross-validation accuracy. For each hyperparameter setting, the script collects and calculates metrics, including training accuracy, cross-validation accuracy, and the standard deviation of the accuracies. These metrics are then used to generate plots, which visually represent the model’s performance across the different hyperparameter values (Figures S4–S9). The plots include training and cross-validation accuracy as well as error bands to show variability. Each optimization returns the best classifier trained during the process, providing an optimized model configuration for further use.
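The same sweep can be written compactly with scikit-learn's validation_curve and a stratified 5-fold splitter; the sketch below tunes n_estimators and applies unchanged to max_depth and min_samples_leaf (the parameter grid is illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, validation_curve

param_range = [10, 50, 100, 300, 600]
train_scores, cv_scores = validation_curve(
    RandomForestClassifier(random_state=0), X_train, y_train,
    param_name="n_estimators", param_range=param_range,
    cv=StratifiedKFold(n_splits=5), n_jobs=-1)

# Mean +/- std across folds, as plotted in Figures S4-S9.
cv_mean, cv_std = cv_scores.mean(axis=1), cv_scores.std(axis=1)
best = param_range[int(np.argmax(cv_mean))]
print("best n_estimators:", best)
```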

To address multicollinearity, a visualization function is designed to analyze its effects by iteratively removing highly correlated features and observing the changes in feature importance and model accuracy. For each threshold value that defines the level of acceptable multicollinearity, the function reduces the feature set by eliminating highly correlated features. It identifies and removes features from the training dataset that exhibit high collinearity based on a chosen correlation threshold; here, 0.30 and 0.90 were used. It starts by computing the correlation matrix of the input feature matrix, which measures the pairwise correlations among all features. Then, it extracts the upper triangle of this matrix, excluding the diagonal, to focus only on the unique feature pairs. Using the specified threshold, the function identifies pairs of features with a correlation coefficient greater than the threshold, indicating strong collinearity. It then determines which features to remove to reduce redundancy in the dataset. The function returns a new feature matrix containing only the remaining, less correlated features, along with the indices of the removed features. The reduced datasets are then used to train and evaluate the model multiple times, capturing performance metrics such as accuracy, training time, and feature importance. These metrics are tracked across the different thresholds to understand the impact of multicollinearity on the model’s performance. The function then compiles the feature importance scores into a data frame and assigns a consistent color map for visualization. It plots the feature importance to show how the significance of individual features changes with varying multicollinearity thresholds. Additionally, it tracks the top features across thresholds, highlighting how the importance of these key features evolves as more correlated features are removed. Finally, the function plots performance metrics to provide insights into how multicollinearity affects the model’s accuracy and training efficiency (Figure S9).
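The core of this reduction, computing the correlation matrix, keeping its upper triangle, and dropping one feature from every pair above the threshold, can be sketched as follows (a dense array is assumed; convert a sparse matrix with .toarray() first):

```python
import numpy as np
import pandas as pd

def drop_collinear_features(X, threshold=0.90):
    """Remove one feature from every pair whose |Pearson r| exceeds threshold."""
    corr = pd.DataFrame(X).corr().abs()
    # Keep the upper triangle (k=1 excludes the diagonal) so each pair counts once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return np.delete(X, to_drop, axis=1), to_drop
```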

After all the optimizations, the function predicts the labels for the test set and evaluates the model’s performance by calculating the accuracy and generating a detailed classification report. The top important features are visualized by extracting the feature importance from the classifier, which indicates how much each feature contributes to the model’s decision-making process (Figures S10 and S11). For further visualization, the function takes the trained optimized Random Forest classifier and uses it to predict the classes for all active pixels in the dataset. It starts by extracting the relevant pixel data from the combined data matrix and applying the classifier to obtain predicted labels. The function then iterates over each image, using the spatial coordinates to place the classified pixels in their correct positions on a blank image grid. Pixels predicted as “healthy” (label 0) are colored green, while “unhealthy” pixels are colored red (Figure 3).
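A sketch of the map reconstruction for a single image, using matplotlib instead of HoloViews for brevity (rf, combined_matrix, active_idx, and coords are the hypothetical variables from the sketches above; imzML coordinates are 1-indexed):

```python
import numpy as np
import matplotlib.pyplot as plt

predicted = rf.predict(combined_matrix[active_idx])  # classify all active pixels
rgb = np.zeros((512, 512, 3))
for i, label in zip(active_idx, predicted):
    x, y = coords[i]                                  # 1-indexed imzML coordinates
    rgb[y - 1, x - 1] = (1.0, 0.0, 0.0) if label == 1 else (0.0, 1.0, 0.0)

plt.imshow(rgb)   # green = classified as control, red = classified as PAH
plt.axis("off")
plt.show()
```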

Figure 3. RF classification results of human lung tissue (green pixels classified as normal; red pixels classified as PAH-related). Top row: normal control arteries; bottom row: occluded PAH arteries.

3.2. Results and Discussion

The CV accuracy should ideally reach an asymptote at a certain number of trees. However, it was possible to achieve 100% accuracy on the training set with a peak list consisting of 608 peaks, which illustrates the problem of overfitting (Figure S4). To avoid overfitting, regularization parameters must be added to the model, starting with the maximum depth (max_depth) parameter (Figure S5). The max_depth hyperparameter helps to regularize the model, and consequently, the model overfits less. Another important parameter is min_samples_leaf, which also serves as a regularization parameter (Figure S6). Overall, the optimal hyperparameters were concluded to be n_estimators = 100, max_depth = 17, min_samples_leaf = 1, and max_features = 20. In this case, the validation does not result in an accuracy gain, but overfitting can be significantly reduced while maintaining an accuracy above 94%.

The trends of the validation curves have been illustrated above, but the optimal hyperparameters can also be found by a cross-validated grid search over a parameter grid or a cross-validated search over parameter settings (randomized search, Bayesian optimization, etc.). Using the OOB estimation, which provides an unbiased error estimate, it is theoretically possible to avoid using a separate test set or cross-validation to evaluate the model. However, OOB error estimation, while useful, is not always a substitute for full cross-validation, especially in the following cases: (1) the dataset has a complex structure or distribution that may not be fully accounted for by bootstrapping, (2) the model is evaluated on a small dataset, where the OOB estimation may be unstable, and (3) the hyperparameters of the model need to be tuned carefully.
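For example, a cross-validated grid search over the three hyperparameters tuned above can be written with GridSearchCV (the grid is illustrative, and X_train/y_train are the hypothetical variables from before):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 300],
              "max_depth": [10, 17, None],
              "min_samples_leaf": [1, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```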

Due to the nature of the ToF-SIMS dataset, namely, the highly collinear data points (m/z), performing variable selection before pattern recognition is beneficial to obtain more robust and generalizable models. As mentioned above, collinearity can “dilute” the importance of features and complicate the interpretation of the results. When the number of features decreases, the feature importance that was previously “diluted” increases for the remaining features (Figure S9). It can also be noticed that what seems like a high-importance feature might be misleading. The collinearity between the fragments can also be decreased via a nested cross-validation process. Namely, an optimal number of variables can be determined using CV to assess the cross-validated prediction performance of a model. This method assesses the prediction performance of models by sequentially reducing the number of predictors (ranked by variable importance) through a nested cross-validation process. In that case, the process is repeated N times using k-fold cross-validation (usually 5-fold), and in each step the number of predictors is reduced, resulting in an optimal number of variables. When collinear features are removed, the model accuracy should not be majorly affected. This can also be visualized as a boxplot of the OOB and prediction accuracy of the RF model with all variables and with decreased numbers of variables.
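One concrete realization of this importance-ranked reduction, though not necessarily the exact procedure used here, is scikit-learn's recursive feature elimination with cross-validation (RFECV), which repeatedly drops the lowest-importance variables and keeps the count that maximizes CV accuracy:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Drop the 10% lowest-importance variables per iteration; keep the feature
# count with the best stratified 5-fold CV accuracy.
selector = RFECV(RandomForestClassifier(n_estimators=100, random_state=0),
                 step=0.1, cv=StratifiedKFold(n_splits=5), n_jobs=-1)
selector.fit(X_train, y_train)
print("optimal number of variables:", selector.n_features_)
```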

The importance of variables can be evaluated by calculating the mean decrease in accuracy (MDA) in a permutation manner to determine each variable’s significance in the classification process. MDA considers the difference between the out-of-bag error resulting from randomly permuting the values of a variable and the OOB error from the original dataset. To obtain a more accurate estimate of the MDA for each variable, the permutation is usually repeated several times, and the average MDA is calculated. The variables with the highest MDA values can be retained in the final models, resulting in a reduced number of variables. Similarly, the reduced set of variables obtained after applying the collinearity threshold can be retained in the final model. The ion at m/z 885.62 has one of the highest predictor importance estimates of all the ions in the high mass range (Figure S11). Based on the previous publication, this is known to be a PI C18:0/C20:4 species that was identified as a potential marker for PAH.144 The classification accuracy for the entire dataset was assessed by constructing images where green pixels indicate healthy tissue and red pixels indicate unhealthy tissue (Figure 3). In the images, the control tissue primarily appears green, with a minority of pixels incorrectly classified as unhealthy. Conversely, in the PAH-affected arteries, red dominates, indicating the presence of the disease. The fact that some pixels are still misclassified highlights the importance of obtaining ground truth labels, which could enhance accuracy by refining the classification process and reducing misclassification.

4. Conclusion

The Random Forest algorithm proves to be a powerful tool for ToF-SIMS data, but there are some points of attention. Even though RF can handle outliers and noise in the data efficiently due to random sampling, the choice of descriptors for training is paramount in the analysis of ToF-SIMS data, highlighting the criticality of preprocessing. The algorithm is also prone to overfitting, especially on noisy data, necessitating validation. Moreover, as demonstrated, multicollinearity among the features is an aspect to be addressed when working with ToF-SIMS data: if a dataset contains groups of correlated features that have similar significance for the labels, preference is given to small groups over large ones, within which the importance is diluted. Reducing multicollinear features can significantly improve the model’s performance, reliability, and interpretability by emphasizing the truly impactful variables and minimizing misleading feature importance. Identifying the most relevant features through nested cross-validation can help in selecting a subset of variables that enhances the model’s robustness. Finally, achieving relevant classification accuracy in ToF-SIMS analysis requires accurate ground truth labels, which can enable a more precise evaluation and adjustment of the model.

On the other hand, RF is insensitive to feature scaling and other monotonic transformations of the feature values, because tree splits depend only on the ordering of the values, which enhances its utility for ToF-SIMS datasets. RF is also worthwhile for its high parallelizability and scalability, making it suitable for the increasingly larger ToF-SIMS imaging datasets. The application of RF to ToF-SIMS imaging facilitates the classification of complex chemical compositions and the identification of significant features that contribute to these classifications. With consideration of the nature of the ToF-SIMS dataset, RF enhances the understanding of complex systems, paving the way for new applications in materials science, biology, and surface chemistry, where detailed surface chemical mapping is essential.

Acknowledgments

The authors thank the Chan Zuckerberg Initiative (DAF2023-32124) and the Atoms2Anatomy Fund, part of the University Fund Limburg/SWOL, for funding. The authors would like to thank Dr. Alain Brunelle, Dr. Sylvia Cohen-Kaminsky, and Prof. Marc Humbert for agreeing to the reuse of data from a previously published study (ref (144)) for the example presented. The authors thank Sven Kayser and Matthias Kleine-Boymann (IONTOF) for providing access to the software SurfaceLab 7.3 which allowed us to read the original datafiles and export the datasets as imzML files. The authors would also like to thank Dr. Caroline Bouvier, Dr. Edith Sandström, and Kimberly G. Garcia for the insightful discussions.

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/jasms.4c00324.

  • Visualization of cross-validation technique and visualization of 5-fold cross-validation; visualization of leave-one-out cross-validation; examples of ToF-SIMS ion images; thresholded ion images; learning curves and model performance of Random Forest; figures of feature importance; table of samples used in the example (PDF)

Author Contributions

Conceptualization: M.S., S.V.N., Data Acquisition: S.V.N., Data Curation: M.S., T.V., S.Z., Formal Analysis: M.S., T.V., Visualization: M.S., T.V., S.Z., Supervision: S.V.N., I.G.M.A., Writing–original draft: M.S., Writing–review and editing: T.V., S.Z., S.V.N., I.G.M.A., Secured funding: S.V.N., I.G.M.A. The manuscript was written through the contributions of all authors. All authors have given approval to the final version of the manuscript.

The example presented in this tutorial uses ToF-SIMS images of human lung tissue. Every patient signed an informed consent for participation in the study (Protocol N8CO-08–003, ID RCB: 2008-A00485-50), as indicated in ref (144).

The authors declare no competing financial interest.

Special Issue

Published as part of the Journal of the American Society for Mass Spectrometry special issue “Advanced Data Analysis in Secondary Ion Mass Spectrometry (SIMS)”.


References

  1. Andersen C. A.; Hinthorne J. R. Ion microprobe mass analyzer. Science 1972, 175 (4024), 853–860. 10.1126/science.175.4024.853. [DOI] [PubMed] [Google Scholar]
  2. Drowart J.; Honig R. E. Mass Spectrometric Study of Copper, Silver, and Gold. J. Chem. Phys. 1956, 25 (3), 581–582. 10.1063/1.1742974. [DOI] [Google Scholar]
  3. Van Vaeck L.; Adriaens A.; Gijbels R. Static secondary ion mass spectrometry (S-SIMS) Part 1: methodology and structural interpretation. Mass Spectrom. Rev. 1999, 18 (1), 1–47. . [DOI] [Google Scholar]
  4. Chabala J. M.; Soni K. K.; Li J.; Gavrilov K. L.; Levi-Setti R. High-Resolution Chemical Imaging with Scanning Ion Probe SIMS. International Journal of Mass Spectrometry and Ion Processes 1995, 143, 191–212. 10.1016/0168-1176(94)04119-R. [DOI] [Google Scholar]
  5. Todd P. J.; Schaaff T. G.; Chaurand P.; Caprioli R. M. Organic Ion Imaging of Biological Tissue with Secondary Ion Mass Spectrometry and Matrix-Assisted Laser Desorption/Ionization. J. Mass Spectrom 2001, 36 (4), 355–369. 10.1002/jms.153. [DOI] [PubMed] [Google Scholar]
  6. Belu A. M.; Graham D. J.; Castner D. G. Time-of-Flight Secondary Ion Mass Spectrometry: Techniques and Applications for the Characterization of Biomaterial Surfaces. Biomaterials 2003, 24 (21), 3635–3653. 10.1016/S0142-9612(03)00159-5. [DOI] [PubMed] [Google Scholar]
  7. Buchberger A. R.; DeLaney K.; Johnson J.; Li L. Mass Spectrometry Imaging: A Review of Emerging Advancements and Future Insights. Anal. Chem. 2018, 90 (1), 240–265. 10.1021/acs.analchem.7b04733. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Vickerman J. C.; Winograd N. SIMS—A Precursor and Partner to Contemporary Mass Spectrometry. Int. J. Mass Spectrom. 2015, 377, 568–579. 10.1016/j.ijms.2014.06.021. [DOI] [Google Scholar]
  9. Seah M. P.; Shard A. G. The Matrix Effect in Secondary Ion Mass Spectrometry. Appl. Surf. Sci. 2018, 439, 605–611. 10.1016/j.apsusc.2018.01.065. [DOI] [Google Scholar]
  10. Miyamoto S.; Hsu C.-C.; Hamm G.; Darshi M.; Diamond-Stanic M.; Declèves A.-E.; Slater L.; Pennathur S.; Stauber J.; Dorrestein P. C.; Sharma K. Mass Spectrometry Imaging Reveals Elevated Glomerular ATP/AMP in Diabetes/Obesity and Identifies Sphingomyelin as a Possible Mediator. EBioMedicine 2016, 7, 121–134. 10.1016/j.ebiom.2016.03.033. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Tuck M.; Blanc L.; Touti R.; Patterson N. H.; Van Nuffel S.; Villette S.; Taveau J.-C.; Römpp A.; Brunelle A.; Lecomte S.; Desbenoit N. Multimodal Imaging Based on Vibrational Spectroscopies and Mass Spectrometry Imaging Applied to Biological Tissue: A Multiscale and Multiomics Review. Anal. Chem. 2021, 93 (1), 445–477. 10.1021/acs.analchem.0c04595. [DOI] [PubMed] [Google Scholar]
  12. Paudel B.; Dhas J. A.; Zhou Y.; Choi M.-J.; Senor D. J.; Chang C.-H.; Du Y.; Zhu Z. ToF-SIMS in Material Research: A View from Nanoscale Hydrogen Detection. Mater. Today 2024, 75, 149. 10.1016/j.mattod.2024.03.003. [DOI] [Google Scholar]
  13. Tyler B. J.ToF-SIMS: Surface Analysis by Mass Spectrometry; Surface Spectra/IM Publications: Chichester/Manchester, 2001; 475–493. [Google Scholar]
  14. Wagner M. S.; Graham D. J.; Ratner B. D.; Castner D. G. Maximizing Information Obtained from Secondary Ion Mass Spectra of Organic Thin Films Using Multivariate Analysis. Surf. Sci. 2004, 570 (1), 78–97. 10.1016/j.susc.2004.06.184. [DOI] [Google Scholar]
  15. Pachuta S. J.; Vlasak P. R. Postacquisition Mass Resolution Improvement in Time-of-Flight Secondary Ion Mass Spectrometry. Anal. Chem. 2012, 84 (3), 1744–1753. 10.1021/ac203229m. [DOI] [PubMed] [Google Scholar]
  16. McDonnell L. A.; Mize T. H.; Luxembourg S. L.; Koster S.; Eijkel G. B.; Verpoorte E.; de Rooij N. F.; Heeren R. M. A. Using Matrix Peaks to Map Topography: Increased Mass Resolution and Enhanced Sensitivity in Chemical Imaging. Anal. Chem. 2003, 75 (17), 4373–4381. 10.1021/ac034401j. [DOI] [PubMed] [Google Scholar]
  17. Ziegler G.; Hutter H. Correction of Topographic Artefacts of ToF-SIMS Element Distributions. Surf. Interface Anal. 2013, 45 (1), 457–460. 10.1002/sia.5127. [DOI] [Google Scholar]
  18. Vanbellingen Q. P.; Elie N.; Eller M. J.; Della-Negra S.; Touboul D.; Brunelle A. Time-of-Flight Secondary Ion Mass Spectrometry Imaging of Biological Samples with Delayed Extraction for High Mass and High Spatial Resolutions. Rapid Commun. Mass Spectrom. 2015, 29 (13), 1187–1195. 10.1002/rcm.7210. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Edwards L.; Mack P.; Morgan D. J. Recent Advances in Dual Mode Charge Compensation for XPS Analysis. Surf. Interface Anal. 2019, 51 (9), 925–933. 10.1002/sia.6680. [DOI] [Google Scholar]
  20. Baer D. R.; Artyushkova K.; Cohen H.; Easton C. D.; Engelhard M.; Gengenbach T. R.; Greczynski G.; Mack P.; Morgan D. J.; Roberts A. XPS Guide: Charge Neutralization and Binding Energy Referencing for Insulating Samples. Journal of Vacuum Science & Technology A 2020, 38 (3), 031204 10.1116/6.0000057. [DOI] [Google Scholar]
  21. Mehta P.; Bukov M.; Wang C.-H.; Day A. G. R.; Richardson C.; Fisher C. K.; Schwab D. J. A High-Bias, Low-Variance Introduction to Machine Learning for Physicists. Phys. Rep. 2019, 810, 1–124. 10.1016/j.physrep.2019.03.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Foi A.; Trimeche M.; Katkovnik V.; Egiazarian K. Practical Poissonian-Gaussian noise modeling and fitting for single-image raw-data. IEEE transactions on image processing 2008, 17 (10), 1737–1754. 10.1109/TIP.2008.2001399. [DOI] [PubMed] [Google Scholar]
  23. Keenan M. R.; Kotula P. G. Accounting for Poisson Noise in the Multivariate Analysis of ToF-SIMS Spectrum Images. Surf. Interface Anal. 2004, 36 (3), 203–212. 10.1002/sia.1657. [DOI] [Google Scholar]
  24. Jiang B.; Meng K.; Youcef-Toumi K. Quantification and Reduction of Poisson-Gaussian Mixed Noise Induced Errors in Ellipsometry. Opt. Express, OE 2021, 29 (17), 27057–27070. 10.1364/OE.432793. [DOI] [PubMed] [Google Scholar]
  25. Tyler B. J.; Rayal G.; Castner D. G. Multivariate Analysis Strategies for Processing ToF-SIMS Images of Biomaterials. Biomaterials 2007, 28 (15), 2412–2423. 10.1016/j.biomaterials.2007.02.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Graham D. J.; Castner D. G. Multivariate Analysis of ToF-SIMS Data from Multicomponent Systems: The Why, When, and How. Biointerphases 2012, 7 (1), 1–12. 10.1007/s13758-012-0049-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Shard A. G.; Rafati A.; Ogaki R.; Lee J. L. S.; Hutton S.; Mishra G.; Davies M. C.; Alexander M. R. Organic Depth Profiling of a Binary System: The Compositional Effect on Secondary Ion Yield and a Model for Charge Transfer during Secondary Ion Emission. J. Phys. Chem. B 2009, 113 (34), 11574–11582. 10.1021/jp904911n. [DOI] [PubMed] [Google Scholar]
  28. Ievlev A. V.; Belianinov A.; Jesse S.; Allison D. P.; Doktycz M. J.; Retterer S. T.; Kalinin S. V.; Ovchinnikova O. S. Automated Interpretation and Extraction of Topographic Information from Time of Flight Secondary Ion Mass Spectrometry Data. Sci. Rep. 2017, 7 (1), 17099. 10.1038/s41598-017-17049-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Verbeeck N.; Caprioli R. M.; Van de Plas R. Unsupervised Machine Learning for Exploratory Data Analysis in Imaging Mass Spectrometry. Mass Spectrom. Rev. 2020, 39 (3), 245–291. 10.1002/mas.21602. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Graham D. J.; Wagner M. S.; Castner D. G. Information from Complexity: Challenges of TOF-SIMS Data Interpretation. Appl. Surf. Sci. 2006, 252 (19), 6860–6868. 10.1016/j.apsusc.2006.02.149. [DOI] [Google Scholar]
  31. Van Nuffel S.; Ang K. C.; Lin A. Y.; Cheng K. C. Chemical Imaging of Retinal Pigment Epithelium in Frozen Sections of Zebrafish Larvae Using ToF-SIMS. J. Am. Soc. Mass Spectrom. 2021, 32 (1), 255–261. 10.1021/jasms.0c00300. [DOI] [PubMed] [Google Scholar]
  32. Brulet M.; Seyer A.; Edelman A.; Brunelle A.; Fritsch J.; Ollero M.; Laprévote O. Lipid Mapping of Colonic Mucosa by Cluster TOF-SIMS Imaging and Multivariate Analysis in Cftr Knockout Mice. J. Lipid Res. 2010, 51 (10), 3034–3045. 10.1194/jlr.M008870. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Wang M. Depth Profile Analysis of Molybdenum Disulfide Film by Glow Discharge Mass Spectrometry. At. Spectrosc. 2021, 42 (3), 183. 10.46770/AS.2021.070. [DOI] [Google Scholar]
  34. Siy P. W.; Moffitt R. A.; Parry R. M.; Chen Y.; Liu Y.; Sullards M. C.; Merrill A. H.; Wang M. D. Matrix Factorization Techniques for Analysis of Imaging Mass Spectrometry Data. Proc. IEEE Int. Symp. BioInformatics Bioeng. 2008, 1–6. 10.1109/BIBE.2008.4696797. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Larsen R. Decomposition Using Maximum Autocorrelation Factors. Journal of Chemometrics 2002, 16 (8–10), 427–435. 10.1002/cem.743. [DOI] [Google Scholar]
  36. Henderson A.; Fletcher J. S.; Vickerman J. C. A Comparison of PCA and MAF for ToF-SIMS Image Interpretation. Surf. Interface Anal. 2009, 41 (8), 666–674. 10.1002/sia.3084. [DOI] [Google Scholar]
  37. Trindade G. F.; Williams D. F.; Abel M.-L.; Watts J. F. Analysis of Atmospheric Plasma-Treated Polypropylene by Large Area ToF-SIMS Imaging and NMF. Surf. Interface Anal. 2018, 50 (11), 1180–1186. 10.1002/sia.6378. [DOI] [Google Scholar]
  38. Trindade G. F.; Abel M.-L.; Lowe C.; Tshulu R.; Watts J. F. A Time-of-Flight Secondary Ion Mass Spectrometry/Multivariate Analysis (ToF-SIMS/MVA) Approach To Identify Phase Segregation in Blends of Incompatible but Extremely Similar Resins. Anal. Chem. 2018, 90 (6), 3936–3941. 10.1021/acs.analchem.7b04877. [DOI] [PubMed] [Google Scholar]
  39. Race A. M.; Palmer A. D.; Dexter A.; Steven R. T.; Styles I. B.; Bunch J. SpectralAnalysis: Software for the Masses. Anal. Chem. 2016, 88 (19), 9451–9458. 10.1021/acs.analchem.6b01643. [DOI] [PubMed] [Google Scholar]
  40. Borodinov N.; Lorenz M.; King S. T.; Ievlev A. V.; Ovchinnikova O. S. Toward Nanoscale Molecular Mass Spectrometry Imaging via Physically Constrained Machine Learning on Co-Registered Multimodal Data. npj Comput. Mater. 2020, 6 (1), 1–8. 10.1038/s41524-020-00357-9. [DOI] [Google Scholar]
  41. Ding C.; Li T.; Peng W. On the Equivalence between Non-Negative Matrix Factorization and Probabilistic Latent Semantic Indexing. Computational Statistics & Data Analysis 2008, 52 (8), 3913–3927. 10.1016/j.csda.2008.01.011. [DOI] [Google Scholar]
  42. Jaumot J.; Tauler R. Potential Use of Multivariate Curve Resolution for the Analysis of Mass Spectrometry Images. Analyst 2015, 140 (3), 837–846. 10.1039/C4AN00801D. [DOI] [PubMed] [Google Scholar]
  43. Lee J. L. S.; Gilmore I. S.; Fletcher I. W.; Seah M. P. Multivariate Image Analysis Strategies for ToF-SIMS Images with Topography. Surf. Interface Anal. 2009, 41 (8), 653–665. 10.1002/sia.3070. [DOI] [Google Scholar]
  44. Yuan S.; Wang R.; Zhang H.; Li Y.; Liu L.; Fu Y. Investigation of Mineral Phase Transformation Technology Followed by Magnetic Separation for Recovery of Iron Values from Red Mud. Sustainability 2022, 14 (21), 13787. 10.3390/su142113787. [DOI] [Google Scholar]
  45. Keenan M. R.; Windig W.; Arlinghaus H. Framework for Alternating-Least-Squares-Based Multivariate Curve Resolution with Application to Time-of-Flight Secondary Ion Mass Spectrometry Imaging. Journal of Vacuum Science & Technology A 2015, 33 (5), 05E123. 10.1116/1.4927528. [DOI] [Google Scholar]
  46. Hanselmann M.; Kirchner M.; Renard B. Y.; Amstalden E. R.; Glunde K.; Heeren R. M. A.; Hamprecht F. A. Concise Representation of Mass Spectrometry Images by Probabilistic Latent Semantic Analysis. Anal. Chem. 2008, 80 (24), 9649–9658. 10.1021/ac801303x. [DOI] [PubMed] [Google Scholar]
  47. Yang J.; Rübel O.; Prabhat; Mahoney M. W.; Bowen B. P. Identifying Important Ions and Positions in Mass Spectrometry Imaging Data Using CUR Matrix Decompositions. Anal. Chem. 2015, 87 (9), 4658–4666. 10.1021/ac5040264. [DOI] [PubMed] [Google Scholar]
  48. Harn Y.-C.; Powers M. J.; Shank E. A.; Jojic V. Deconvolving Molecular Signatures of Interactions between Microbial Colonies. Bioinformatics 2015, 31 (12), i142–i150. 10.1093/bioinformatics/btv251. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Gelb L. D.; Milillo T. M.; Walker A. V. On Including Nonlinearity in Multivariate Analysis of Imaging SIMS Data. Surf. Interface Anal. 2014, 46 (S1), 221–224. 10.1002/sia.5653. [DOI] [Google Scholar]
  50. Chen Y.; Tang F.; Li T.; He J.; Abliz Z.; Liu L.; Wang X. Application of Factor Analysis in Imaging Mass Spectrometric Data Analysis. Chinese Journal of Analytical Chemistry 2014, 42, 1099–1103. 10.1016/S1872-2040(14)60757-X. [DOI] [Google Scholar]
  51. Van Nuffel S.; Parmenter C.; Scurr D. J.; Russell N. A.; Zelzer M. Multivariate Analysis of 3D ToF-SIMS Images: Method Validation and Application to Cultured Neuronal Networks. Analyst 2016, 141 (1), 90–95. 10.1039/C5AN01743B. [DOI] [PubMed] [Google Scholar]
  52. Van Nuffel S.; Elie N.; Yang E.; Nouet J.; Touboul D.; Chaurand P.; Brunelle A. Insights into the MALDI Process after Matrix Deposition by Sublimation Using 3D ToF-SIMS Imaging. Anal. Chem. 2018, 90 (3), 1907–1914. 10.1021/acs.analchem.7b03993. [DOI] [PubMed] [Google Scholar]
  53. Race A. M.; Steven R. T.; Palmer A. D.; Styles I. B.; Bunch J. Memory Efficient Principal Component Analysis for the Dimensionality Reduction of Large Mass Spectrometry Imaging Data Sets. Anal. Chem. 2013, 85 (6), 3071–3078. 10.1021/ac302528v. [DOI] [PubMed] [Google Scholar]
  54. Konicek A. R.; Lefman J.; Szakal C. Automated Correlation and Classification of Secondary Ion Mass Spectrometry Images Using a k -Means Cluster Method. Analyst 2012, 137 (15), 3479–3487. 10.1039/c2an16122b. [DOI] [PubMed] [Google Scholar]
  55. Palmer A. D.; Bunch J.; Styles I. B. The Use of Random Projections for the Analysis of Mass Spectrometry Imaging Data. J. Am. Soc. Mass Spectrom. 2015, 26 (2), 315–322. 10.1007/s13361-014-1024-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Urbini M.; Petito V.; de Notaristefani F.; Scaldaferri F.; Gasbarrini A.; Tortora L. ToF-SIMS and Principal Component Analysis of Lipids and Amino Acids from Inflamed and Dysplastic Human Colonic Mucosa. Anal. Bioanal. Chem. 2017, 409 (26), 6097–6111. 10.1007/s00216-017-0546-9. [DOI] [PubMed] [Google Scholar]
  57. Deininger S.-O.; Becker M.; Suckau D. Tutorial: Multivariate Statistical Treatment of Imaging Data for Clinical Biomarker Discovery. In Mass Spectrometry Imaging: Principles and Protocols; Rubakhin S. S., Sweedler J. V., Eds.; Humana Press: Totowa, NJ, 2010; pp 385–403. 10.1007/978-1-60761-746-4_22. [DOI] [PubMed] [Google Scholar]
  58. Trede D.; Kobarg J. H.; Oetjen J.; Thiele H.; Maass P.; Alexandrov T. On the Importance of Mathematical Methods for Analysis of MALDI-Imaging Mass Spectrometry Data. Journal of Integrative Bioinformatics (JIB) 2012, 9 (1), 1–11. 10.1515/jib-2012-189. [DOI] [PubMed] [Google Scholar]
  59. Alexandrov T.; Becker M.; Deininger S.-O.; Ernst G.; Wehder L.; Grasmair M.; von Eggeling F.; Thiele H.; Maass P. Spatial Segmentation of Imaging Mass Spectrometry Data with Edge-Preserving Image Denoising and Clustering. J. Proteome Res. 2010, 9 (12), 6535–6546. 10.1021/pr100734z. [DOI] [PubMed] [Google Scholar]
  60. Alexandrov T.; Becker M.; Guntinas-Lichius O.; Ernst G.; von Eggeling F. MALDI-Imaging Segmentation Is a Powerful Tool for Spatial Functional Proteomic Analysis of Human Larynx Carcinoma. J. Cancer Res. Clin Oncol 2013, 139 (1), 85–95. 10.1007/s00432-012-1303-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  61. Jones E. A.; van Remoortere A.; van Zeijl R. J. M.; Hogendoorn P. C. W.; Bovee J. V. M. G.; Deelder A. M.; McDonnell L. A. Multiple Statistical Analysis Techniques Corroborate Intratumor Heterogeneity in Imaging Mass Spectrometry Datasets of Myxofibrosarcoma. PLoS One 2011, 6 (9), e24913. 10.1371/journal.pone.0024913. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Bruand J.; Alexandrov T.; Sistla S.; Wisztorski M.; Meriaux C.; Becker M.; Salzet M.; Fournier I.; Macagno E.; Bafna V. AMASS: Algorithm for MSI Analysis by Semi-Supervised Segmentation. J. Proteome Res. 2011, 10 (10), 4734–4743. 10.1021/pr2005378. [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Chernyavsky I.; Alexandrov T.; Maass P.; Nikolenko S. I. A Two-Step Soft Segmentation Procedure for MALDI Imaging Mass Spectrometry Data. In German Conference on Bioinformatics 2012 (GCB 2012), OASIcs; Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2012. 10.4230/OASIcs.GCB.2012.39. [DOI] [Google Scholar]
  64. Bemis K. D.; Harry A.; Eberlin L. S.; Ferreira C. R.; van de Ven S. M.; Mallick P.; Stolowitz M.; Vitek O. Probabilistic Segmentation of Mass Spectrometry (MS) Images Helps Select Important Ions and Characterize Confidence in the Resulting Segments. Molecular & Cellular Proteomics 2016, 15 (5), 1761–1772. 10.1074/mcp.O115.053918. [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. Abdelmoula W. M.; Škrášková K.; Balluff B.; Carreira R. J.; Tolner E. A.; Lelieveldt B. P. F.; van der Maaten L.; Morreau H.; van den Maagdenberg A. M. J. M.; Heeren R. M. A.; McDonnell L. A.; Dijkstra J. Automatic Generic Registration of Mass Spectrometry Imaging Data to Histology Using Nonlinear Stochastic Embedding. Anal. Chem. 2014, 86 (18), 9204–9211. 10.1021/ac502170f. [DOI] [PubMed] [Google Scholar]
  66. Smets T.; Verbeeck N.; Claesen M.; Asperger A.; Griffioen G.; Tousseyn T.; Waelput W.; Waelkens E.; De Moor B. Evaluation of Distance Metrics and Spatial Autocorrelation in Uniform Manifold Approximation and Projection Applied to Mass Spectrometry Imaging Data. Anal. Chem. 2019, 91 (9), 5706–5714. 10.1021/acs.analchem.8b05827. [DOI] [PubMed] [Google Scholar]
  67. Thomas S. A.; Race A. M.; Steven R. T.; Gilmore I. S.; Bunch J. Dimensionality Reduction of Mass Spectrometry Imaging Data Using Autoencoders. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI); 2016; pp 1–7. 10.1109/SSCI.2016.7849863. [DOI]
  68. Franceschi P.; Wehrens R. Self-organizing maps: A versatile tool for the automatic analysis of untargeted imaging datasets. Proteomics 2014, 14 (7–8), 853–861. 10.1002/pmic.201300308. [DOI] [PubMed] [Google Scholar]
  69. Matsuda K.; Aoyagi S. Sparse Autoencoder-Based Feature Extraction from TOF-SIMS Image Data of Human Skin Structures. Anal. Bioanal. Chem. 2022, 414 (2), 1177–1186. 10.1007/s00216-021-03744-3. [DOI] [PubMed] [Google Scholar]
  70. Aoyagi S.; Matsuda K. Quantitative analysis of ToF-SIMS data of a two organic compound mixture using an autoencoder and simple artificial neural networks. Rapid Commun. Mass Spectrom. 2023, 37 (4), e9445. 10.1002/rcm.9445. [DOI] [PubMed] [Google Scholar]
  71. Zhao Y.; Otto S.-K.; Lombardo T.; Henss A.; Koeppe A.; Selzer M.; Janek J.; Nestler B. Identification of Lithium Compounds on Surfaces of Lithium Metal Anode with Machine-Learning-Assisted Analysis of ToF-SIMS Spectra. ACS Appl. Mater. Interfaces 2023, 15 (43), 50469–50478. 10.1021/acsami.3c09643. [DOI] [PMC free article] [PubMed] [Google Scholar]
  72. Koshute P.; Hagan N.; Jameson N. J. Machine Learning Model for Detecting Fentanyl Analogs from Mass Spectra. Forensic Chemistry 2022, 27, 100379. 10.1016/j.forc.2021.100379. [DOI] [Google Scholar]
  73. Hook A. L.; Williams P. M.; Alexander M. R.; Scurr D. J. Multivariate ToF-SIMS Image Analysis of Polymer Microarrays and Protein Adsorption. Biointerphases 2015, 10 (1), 019005. 10.1116/1.4906484. [DOI] [PubMed] [Google Scholar]
  74. Sun R.; Gardner W.; Winkler D. A.; Muir B. W.; Pigram P. J. Exploring the Performance of Linear and Nonlinear Models of Time-of-Flight Secondary Ion Mass Spectrometry Spectra. Anal. Chem. 2024, 96 (19), 7594–7601. 10.1021/acs.analchem.4c00456. [DOI] [PubMed] [Google Scholar]
  75. Aoyagi S.; Fujiwara Y.; Takano A.; Vorng J.-L.; Gilmore I. S.; Wang Y.-C.; Tallarek E.; Hagenhoff B.; Iida S.; Luch A.; Jungnickel H.; Lang Y.; Shon H. K.; Lee T. G.; Li Z.; Matsuda K.; Mihara I.; Miisho A.; Murayama Y.; Nagatomi T.; Ikeda R.; Okamoto M.; Saiga K.; Tsuchiya T.; Uemura S. Evaluation of Time-of-Flight Secondary Ion Mass Spectrometry Spectra of Peptides by Random Forest with Amino Acid Labels: Results from a Versailles Project on Advanced Materials and Standards Interlaboratory Study. Anal. Chem. 2021, 93 (9), 4191–4197. 10.1021/acs.analchem.0c04577. [DOI] [PubMed] [Google Scholar]
  76. Lang Y.; Zhou L.; Imamura Y. Development of Machine-Learning Techniques for Time-of-Flight Secondary Ion Mass Spectrometry Spectral Analysis: Application for the Identification of Silane Coupling Agents in Multicomponent Films. Anal. Chem. 2022, 94 (5), 2546–2553. 10.1021/acs.analchem.1c04436. [DOI] [PubMed] [Google Scholar]
  77. Luts J.; Ojeda F.; Van de Plas R.; De Moor B.; Van Huffel S.; Suykens J. A. K. A Tutorial on Support Vector Machine-Based Methods for Classification Problems in Chemometrics. Anal. Chim. Acta 2010, 665 (2), 129–145. 10.1016/j.aca.2010.03.030. [DOI] [PubMed] [Google Scholar]
  78. Jetybayeva A.; Borodinov N.; Ievlev A. V.; Haque M. I. U.; Hinkle J.; Lamberti W. A.; Meredith J. C.; Abmayr D.; Ovchinnikova O. S. A Review on Recent Machine Learning Applications for Imaging Mass Spectrometry Studies. J. Appl. Phys. 2023, 133 (2), 020702. 10.1063/5.0100948. [DOI] [Google Scholar]
  79. Madiona R. M. T.; Welch N. G.; Russell S. B.; Winkler D. A.; Scoble J. A.; Muir B. W.; Pigram P. J. Multivariate Analysis of ToF-SIMS Data Using Mass Segmented Peak Lists. Surf. Interface Anal. 2018, 50 (7), 713–728. 10.1002/sia.6462. [DOI] [Google Scholar]
  80. Matsuda K.; Aoyagi S. Time-of-Flight Secondary Ion Mass Spectrometry Analysis of Hair Samples Using Unsupervised Artificial Neural Network. Biointerphases 2020, 15 (2), 021013. 10.1116/6.0000044. [DOI] [PubMed] [Google Scholar]
  81. Gardner W.; Hook A. L.; Alexander M. R.; Ballabio D.; Cutts S. M.; Muir B. W.; Pigram P. J. ToF-SIMS and Machine Learning for Single-Pixel Molecular Discrimination of an Acrylate Polymer Microarray. Anal. Chem. 2020, 92 (9), 6587–6597. 10.1021/acs.analchem.0c00349. [DOI] [PMC free article] [PubMed] [Google Scholar]
  82. Dexter A.; Thomas S. A.; Steven R. T.; Robinson K. N.; Taylor A. J.; Elia E.; Nikula C.; Campbell A. D.; Panina Y.; Najumudeen A. K.; Murta T.; Yan B.; Grabowski P.; Hamm G.; Swales J.; Gilmore I. S.; Yuneva M. O.; Goodwin R. J. A.; Barry S.; Sansom O. J.; Takats Z.; Bunch J. Training a Neural Network to Learn Other Dimensionality Reduction Removes Data Size Restrictions in Bioinformatics and Provides a New Route to Exploring Data Representations. bioRxiv, 2020. 10.1101/2020.09.03.269555. [DOI]
  83. Aoyagi S.; Cant D. J. H.; Dürr M.; Eyres A.; Fearn S.; Gilmore I. S.; Iida S.; Ikeda R.; Ishikawa K.; Lagator M.; Lockyer N.; Keller P.; Matsuda K.; Murayama Y.; Okamoto M.; Reed B. P.; Shard A. G.; Takano A.; Trindade G. F.; Vorng J.-L. Quantitative and Qualitative Analyses of Mass Spectra of OEL Materials by Artificial Neural Network and Interface Evaluation: Results from a VAMAS Interlaboratory Study. Anal. Chem. 2023, 95 (40), 15078–15085. 10.1021/acs.analchem.3c03173. [DOI] [PMC free article] [PubMed] [Google Scholar]
  84. Auret L.; Aldrich C. Interpretation of Nonlinear Relationships between Process Variables by Use of Random Forests. Minerals Engineering 2012, 35, 27–42. 10.1016/j.mineng.2012.05.008. [DOI] [Google Scholar]
  85. Steyrl D.; Scherer R.; Faller J.; Müller-Putz G. R. Random Forests in Non-Invasive Sensorimotor Rhythm Brain-Computer Interfaces: A Practical and Convenient Non-Linear Classifier. Biomedical Engineering/Biomedizinische Technik 2016, 61 (1), 77–86. 10.1515/bmt-2014-0117. [DOI] [PubMed] [Google Scholar]
  86. Morgan J. N.; Sonquist J. A. Problems in the Analysis of Survey Data, and a Proposal. Journal of the American Statistical Association 1963, 58 (302), 415–434. 10.1080/01621459.1963.10500855. [DOI] [Google Scholar]
  87. Sonquist J. A. Multivariate Model Building: The Validation of a Search Strategy; Survey Research Center, University of Michigan, 1970. [Google Scholar]
  88. Messenger R.; Mandell L. A Modal Search Technique for Predictive Nominal Scale Multivariate Analysis. Journal of the American Statistical Association 1972, 67 (340), 768–772. 10.1080/01621459.1972.10481290. [DOI] [Google Scholar]
  89. Breiman L.; Stone C. J. Parsimonious Binary Classification Trees; Tech. Rep. TSC-CSD-TN-004; Technology Service Corporation: Santa Monica, CA, 1978.
  90. Friedman J. H. A Recursive Partitioning Decision Rule for Nonparametric Classification. IEEE Trans. Comput. 1977, C-26 (4), 404–408. 10.1109/TC.1977.1674849. [DOI] [Google Scholar]
  91. Quinlan J. R. Discovering Rules by Induction from Large Collections of Examples. In Expert Systems in the Micro-Electronic Age; Michie D., Ed.; Edinburgh University Press, 1979. [Google Scholar]
  92. Breiman L.; Friedman J.; Stone C. J.; Olshen R. A. Classification and Regression Trees; CRC Press, 1984. [Google Scholar]
  93. Kass G. V. An Exploratory Technique for Investigating Large Quantities of Categorical Data. Journal of the Royal Statistical Society. Series C (Applied Statistics) 1980, 29 (2), 119–127. 10.2307/2986296. [DOI] [Google Scholar]
  94. Hunt E. B.; Marin J.; Stone P. J. Experiments in Induction; Academic Press, 1966. [Google Scholar]
  95. Quinlan J. R. Learning Efficient Classification Procedures and Their Application to Chess End Games. In Machine Learning: An Artificial Intelligence Approach; Morgan Kaufmann, 1983; pp 463–482. [Google Scholar]
  96. Fürnkranz J. Decision Tree. In Encyclopedia of Machine Learning and Data Mining; Sammut C., Webb G. I., Eds.; Springer US: Boston, MA, 2017; pp 330–335. 10.1007/978-1-4899-7687-1_66. [DOI] [Google Scholar]
  97. Chanmee S.; Kesorn K. Semantic Decision Trees: A New Learning System for the ID3-Based Algorithm Using a Knowledge Base. Advanced Engineering Informatics 2023, 58, 102156. 10.1016/j.aei.2023.102156. [DOI] [Google Scholar]
  98. Hssina B.; Merbouha A.; Ezzikouri H.; Erritali M. A Comparative Study of Decision Tree ID3 and C4.5. International Journal of Advanced Computer Science and Applications, Special Issue 2014, 4 (2). 10.14569/SpecialIssue.2014.040203. [DOI] [Google Scholar]
  99. Gini C. Variabilità e Mutabilità; Pizetti E., Salvemini T., Eds.; Libreria Eredi Virgilio Veschi: Rome, 1912. [Google Scholar]
  100. Shannon C. E.; Weaver W. The Mathematical Theory of Communication; University of Illinois Press: Urbana, IL, 1949.
  101. Mao A.; Mohri M.; Zhong Y. Cross-Entropy Loss Functions: Theoretical Analysis and Applications. arXiv, June 19, 2023. 10.48550/arXiv.2304.07288. [DOI]
  102. Koeman M.; Heskes T. Mutual Information Estimation with Random Forests. In Neural Information Processing; Loo C. K., Yap K. S., Wong K. W., Teoh A., Huang K., Eds.; Springer International Publishing: Cham, 2014; pp 524–531. 10.1007/978-3-319-12640-1_63. [DOI] [Google Scholar]
  103. Quinlan J. R. Induction of Decision Trees. Mach. Learn. 1986, 1 (1), 81–106. 10.1007/BF00116251. [DOI] [Google Scholar]
  104. Quinlan J. R. C4.5: Programs for Machine Learning; Morgan Kaufmann, 1993. [Google Scholar]
  105. Murthy S. K.; Kasif S.; Salzberg S. A System for Induction of Oblique Decision Trees. Journal of Artificial Intelligence Research 1994, 2, 1–32. 10.1613/jair.63. [DOI] [Google Scholar]
  106. McArdle J. J., Ritschard G., Eds. Contemporary Issues in Exploratory Data Mining in the Behavioral Sciences; Routledge, 2013. [Google Scholar]
  107. Friedman J. H. Fast MARS, Technical Report No. 110; Department of Statistics, Stanford University: Stanford, CA, 1993. https://purl.stanford.edu/vr602hr6778
  108. Hothorn T.; Hornik K.; Zeileis A. Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics 2006, 15, 651. 10.1198/106186006X133933. [DOI] [Google Scholar]
  109. Larose D. T.; Larose C. D. Discovering Knowledge in Data: An Introduction to Data Mining; John Wiley & Sons, 2014. [Google Scholar]
  110. Belkin M.; Hsu D.; Ma S.; Mandal S. Reconciling Modern Machine-Learning Practice and the Classical Bias–Variance Trade-Off. Proc. Natl. Acad. Sci. U. S. A. 2019, 116 (32), 15849–15854. 10.1073/pnas.1903070116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  111. Quinlan J. R.; Rivest R. L. Inferring Decision Trees Using the Minimum Description Length Principle. Information and Computation 1989, 80 (3), 227–248. 10.1016/0890-5401(89)90010-2. [DOI] [Google Scholar]
  112. Mansour Y. Pessimistic Decision Tree Pruning Based on Tree Size. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML); Morgan Kaufmann Publishers, 1997; pp 195–201. [Google Scholar]
  113. Frank E. Pruning Decision Trees and Lists. Doctoral Dissertation, The University of Waikato, 2000. [Google Scholar]
  114. Bellman R. Dynamic Programming. Science 1966, 153 (3731), 34–37. 10.1126/science.153.3731.34. [DOI] [PubMed] [Google Scholar]
  115. Baumann D.; Baumann K. Reliable Estimation of Prediction Errors for QSAR Models under Model Uncertainty Using Double Cross-Validation. Journal of Cheminformatics 2014, 6 (1), 47. 10.1186/s13321-014-0047-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  116. Krstajic D.; Buturovic L. J.; Leahy D. E.; Thomas S. Cross-Validation Pitfalls When Selecting and Assessing Regression and Classification Models. Journal of Cheminformatics 2014, 6 (1), 10. 10.1186/1758-2946-6-10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  117. Blockeel H.; Struyf J. Efficient Algorithms for Decision Tree Cross-Validation. arXiv, October 17, 2001. 10.48550/arXiv.cs/0110036. [DOI]
  118. Varma S.; Simon R. Bias in Error Estimation When Using Cross-Validation for Model Selection. BMC Bioinformatics 2006, 7 (1), 91. 10.1186/1471-2105-7-91. [DOI] [PMC free article] [PubMed] [Google Scholar]
  119. Arlot S.; Celisse A. A Survey of Cross-Validation Procedures for Model Selection. Statist. Surv. 2010, 4, 40. 10.1214/09-SS054. [DOI] [Google Scholar]
  120. Mathai N.; Chen Y.; Kirchmair J. Validation Strategies for Target Prediction Methods. Briefings in Bioinformatics 2020, 21 (3), 791–802. 10.1093/bib/bbz026. [DOI] [PMC free article] [PubMed] [Google Scholar]
  121. Kwok S. W.; Carter C. Multiple Decision Trees. arXiv, March 27, 2013. 10.48550/arXiv.1304.2363. [DOI]
  122. Dietterich T. G.; Kong E. B. Machine Learning Bias, Statistical Bias, and Statistical Variance of Decision Tree Algorithms; Technical Report; Department of Computer Science, Oregon State University: Corvallis, OR, 1995.
  123. Amit Y.; Geman D.; Wilder K. Joint Induction of Shape Features and Tree Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 1997, 19 (11), 1300–1305. 10.1109/34.632990. [DOI] [Google Scholar]
  124. Ho T. K. The Random Subspace Method for Constructing Decision Forests. IEEE Trans. Pattern Anal. Machine Intell. 1998, 20 (8), 832–844. 10.1109/34.709601. [DOI] [Google Scholar]
  125. Freund Y.; Schapire R. E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. In Computational Learning Theory; Vitányi P., Ed.; Springer: Berlin, Heidelberg, 1995; pp 23–37. 10.1007/3-540-59119-2_166. [DOI] [Google Scholar]
  126. Breiman L. Bagging Predictors. Mach. Learn. 1996, 24 (2), 123–140. 10.1007/BF00058655. [DOI] [Google Scholar]
  127. Panov P.; Džeroski S. Combining Bagging and Random Subspaces to Create Better Ensembles. In Advances in Intelligent Data Analysis VII; Berthold M. R., Shawe-Taylor J., Lavrač N., Eds.; Springer: Berlin, Heidelberg, 2007; pp 118–129. 10.1007/978-3-540-74825-0_11. [DOI] [Google Scholar]
  128. Krawczyk B.; Minku L. L.; Gama J.; Stefanowski J.; Woźniak M. Ensemble Learning for Data Stream Analysis: A Survey. Information Fusion 2017, 37, 132–156. 10.1016/j.inffus.2017.02.004. [DOI] [Google Scholar]
  129. Sagi O.; Rokach L. Ensemble Learning: A Survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2018, 8 (4), e1249. 10.1002/widm.1249. [DOI] [Google Scholar]
  130. Dong X.; Yu Z.; Cao W.; Shi Y.; Ma Q. A Survey on Ensemble Learning. Front. Comput. Sci. 2020, 14 (2), 241–258. 10.1007/s11704-019-8208-z. [DOI] [Google Scholar]
  131. Breiman L. Random Forests. Machine Learning 2001, 45 (1), 5–32. 10.1023/A:1010933404324. [DOI] [Google Scholar]
  132. Geman S.; Bienenstock E.; Doursat R. Neural Networks and the Bias/Variance Dilemma. Neural Computation 1992, 4 (1), 1–58. 10.1162/neco.1992.4.1.1. [DOI] [Google Scholar]
  133. Kohavi R.; Wolpert D. H. Bias plus variance decomposition for zero-one loss functions. ICML 1996, 96, 275–283. [Google Scholar]
  134. Kohavi R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI); Morgan Kaufmann, 1995. [Google Scholar]
  135. Tibshirani R. Bias, Variance and Prediction Error for Classification Rules; Technical Report; Department of Statistics, University of Toronto, 1996. [Google Scholar]
  136. Domingos P. A Unified Bias-Variance Decomposition. In Proceedings of the 17th International Conference on Machine Learning (ICML); Morgan Kaufmann: Stanford, CA, 2000. [Google Scholar]
  137. Efron B. Bootstrap Methods: Another Look at the Jackknife. In Breakthroughs in Statistics: Methodology and Distribution; Springer: New York, 1992; pp 569–593. [Google Scholar]
  138. Chernick M. R.; LaBudde R. A. An Introduction to Bootstrap Methods with Applications to R; John Wiley & Sons, 2014. [Google Scholar]
  139. Breiman L. Manual on Setting Up, Using, and Understanding Random Forests, v3.1; Statistics Department, University of California, Berkeley: Berkeley, CA, USA, 2002; pp 3–42. [Google Scholar]
  140. Ishwaran H. Variable Importance in Binary Regression Trees and Forests. Electron. J. Statist. 2007, 1, 519. 10.1214/07-EJS039. [DOI] [Google Scholar]
  141. Nembrini S.; König I. R.; Wright M. N. The Revival of the Gini Importance?. Bioinformatics 2018, 34 (21), 3711–3718. 10.1093/bioinformatics/bty373. [DOI] [PMC free article] [PubMed] [Google Scholar]
  142. Sandri M.; Zuccolotto P. A Bias Correction Algorithm for the Gini Variable Importance Measure in Classification Trees. Journal of Computational and Graphical Statistics 2008, 17 (3), 611–628. 10.1198/106186008X344522. [DOI] [Google Scholar]
  143. Roßbach P. Neural Networks vs. Random Forests – Does It Always Have to Be Deep Learning?; Frankfurt School of Finance and Management: Germany, 2018. [Google Scholar]
  144. Van Nuffel S.; Quatredeniers M.; Pirkl A.; Zakel J.; Le Caer J.-P.; Elie N.; Vanbellingen Q. P.; Dumas S. J.; Nakhleh M. K.; Ghigna M.-R.; Fadel E.; Humbert M.; Chaurand P.; Touboul D.; Cohen-Kaminsky S.; Brunelle A. Multimodal Imaging Mass Spectrometry to Identify Markers of Pulmonary Arterial Hypertension in Human Lung Tissue Using MALDI-ToF, ToF-SIMS, and Hybrid SIMS. Anal. Chem. 2020, 92 (17), 12079–12087. 10.1021/acs.analchem.0c02815. [DOI] [PubMed] [Google Scholar]
  145. Fujimura Y.; Miura D. MALDI Mass Spectrometry Imaging for Visualizing In Situ Metabolism of Endogenous Metabolites and Dietary Phytochemicals. Metabolites 2014, 4 (2), 319–346. 10.3390/metabo4020319. [DOI] [PMC free article] [PubMed] [Google Scholar]
  146. Tian X.; Zhang G.; Shao Y.; Yang Z. Towards Enhanced Metabolomic Data Analysis of Mass Spectrometry Image: Multivariate Curve Resolution and Machine Learning. Anal. Chim. Acta 2018, 1037, 211–219. 10.1016/j.aca.2018.02.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  147. Appley M. G.; Beyramysoltan S.; Musah R. A. Random Forest Processing of Direct Analysis in Real-Time Mass Spectrometric Data Enables Species Identification of Psychoactive Plants from Their Headspace Chemical Signatures. ACS Omega 2019, 4 (13), 15636–15644. 10.1021/acsomega.9b02145. [DOI] [PMC free article] [PubMed] [Google Scholar]
  148. Green F. M.; Gilmore I. S.; Seah M. P. TOF-SIMS: Accurate Mass Scale Calibration. J. Am. Soc. Mass Spectrom. 2006, 17 (4), 514–523. 10.1016/j.jasms.2005.12.005. [DOI] [PubMed] [Google Scholar]
  149. Madiona R. M. T.; Alexander D. L. J.; Winkler D. A.; Muir B. W.; Pigram P. J. Information Content of ToF-SIMS Data: Effect of Spectral Binning. Appl. Surf. Sci. 2019, 493, 1067–1074. 10.1016/j.apsusc.2019.07.044. [DOI] [Google Scholar]
  150. Moore J. Computational Approaches for the Interpretation of ToF-SIMS Data; The University of Manchester: United Kingdom, 2014. [Google Scholar]
  151. Zheng Y.; Tian D.; Liu K.; Bao Z.; Wang P.; Qiu C.; Liu D.; Fan R. Peak Detection of TOF-SIMS Using Continuous Wavelet Transform and Curve Fitting. Int. J. Mass Spectrom. 2018, 428, 43–48. 10.1016/j.ijms.2018.03.001. [DOI] [Google Scholar]
  152. Tyler B. J.; Kassenböhmer R.; Peterson R. E.; Nguyen D. T.; Freitag M.; Glorius F.; Ravoo B. J.; Arlinghaus H. F. Denoising of Mass Spectrometry Images via Inverse Maximum Signal Factors Analysis. Anal. Chem. 2022, 94 (6), 2835–2843. 10.1021/acs.analchem.1c04564. [DOI] [PubMed] [Google Scholar]
  153. Graham D. J.; Gamble L. J. Back to the Basics of Time-of-Flight Secondary Ion Mass Spectrometry of Bio-Related Samples. I. Instrumentation and Data Collection. Biointerphases 2023, 18 (2), 021201. 10.1116/6.0002477. [DOI] [PMC free article] [PubMed] [Google Scholar]
  154. Guyon I. A Scaling Law for the Validation-Set Training-Set Size Ratio; AT&T Bell Laboratories: 1.11, 1997. [Google Scholar]
  155. Díaz-Uriarte R.; Alvarez de Andrés S. Gene Selection and Classification of Microarray Data Using Random Forest. BMC Bioinformatics 2006, 7 (1), 3. 10.1186/1471-2105-7-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  156. Colditz R. R. An Evaluation of Different Training Sample Allocation Schemes for Discrete and Continuous Land Cover Classification Using Decision Tree-Based Algorithms. Remote Sensing 2015, 7 (8), 9655–9681. 10.3390/rs70809655. [DOI] [Google Scholar]
  157. Guan H.; Li J.; Chapman M.; Deng F.; Ji Z.; Yang X. Integration of Orthoimagery and Lidar Data for Object-Based Urban Thematic Mapping Using Random Forests. International Journal of Remote Sensing 2013, 34 (14), 5166–5186. 10.1080/01431161.2013.788261. [DOI] [Google Scholar]
  158. Lebanov L.; Tedone L.; Ghiasvand A.; Paull B. Random Forests Machine Learning Applied to Gas Chromatography - Mass Spectrometry Derived Average Mass Spectrum Data Sets for Classification and Characterisation of Essential Oils. Talanta 2020, 208, 120471. 10.1016/j.talanta.2019.120471. [DOI] [PubMed] [Google Scholar]
  159. Zhu N.; Zhu C.; Zhou L.; Zhu Y.; Zhang X. Optimization of the Random Forest Hyperparameters for Power Industrial Control Systems Intrusion Detection Using an Improved Grid Search Algorithm. Applied Sciences 2022, 12 (20), 10456. 10.3390/app122010456. [DOI] [Google Scholar]
  160. Belgiu M.; Drăguţ L. Random Forest in Remote Sensing: A Review of Applications and Future Directions. ISPRS Journal of Photogrammetry and Remote Sensing 2016, 114, 24–31. 10.1016/j.isprsjprs.2016.01.011. [DOI] [Google Scholar]
  161. Gislason P. O.; Benediktsson J. A.; Sveinsson J. R. Random Forests for Land Cover Classification. Pattern Recognition Letters 2006, 27 (4), 294–300. 10.1016/j.patrec.2005.08.011. [DOI] [Google Scholar]
  162. Gromski P. S.; Xu Y.; Correa E.; Ellis D. I.; Turner M. L.; Goodacre R. A Comparative Investigation of Modern Feature Selection and Classification Approaches for the Analysis of Mass Spectrometry Data. Anal. Chim. Acta 2014, 829, 1–8. 10.1016/j.aca.2014.03.039. [DOI] [PubMed] [Google Scholar]
  163. Ma L.; Ma F.; Ji Z.; Gu Q.; Wu D.; Deng J.; Ding J. Urban Land Use Classification Using LiDAR Geometric, Spatial Autocorrelation and Lacunarity Features Combined with Postclassification Processing Method. Canadian Journal of Remote Sensing 2015, 41 (4), 334–345. 10.1080/07038992.2015.1102630. [DOI] [Google Scholar]
