Anal Sci Adv. 2022 Sep 7;3(9-10):278–287. doi: 10.1002/ansa.202200018

Cross‐validated permutation feature importance considering correlation between features

Hiromasa Kaneko
PMCID: PMC10989554  PMID: 38716264

Abstract

In molecular design, material design, process design, and process control, it is important not only to construct a model with high predictive ability between explanatory features x and objective features y using a dataset but also to interpret the constructed model. One index of the importance of the features in x is permutation feature importance (PFI), which can be combined with any regressor or classifier. However, PFI becomes unstable when the number of samples is low because the dataset must be divided into training and validation data to calculate it. Additionally, when there are strongly correlated features in x, the PFI of these features is estimated to be low. Hence, a cross-validated PFI (CVPFI) method is proposed. CVPFI can be calculated stably, even with a small number of samples, because model construction and feature evaluation are repeated based on cross-validation. Furthermore, by considering the absolute correlation coefficients between the features, feature importance can be evaluated appropriately even when there are strongly correlated features in x. Case studies using numerical simulation data and actual compound data showed that feature importance can be evaluated more appropriately with CVPFI than with PFI when the number of samples is low, when linear and nonlinear relationships are mixed between x and y, when there are strong correlations between features in x, and when quantised and biased features exist in x. Python codes for CVPFI are available at https://github.com/hkaneko1985/dcekit.

Keywords: correlation, cross‐validation, feature importance, model interpretation, permutation importance


Abbreviations

CV, cross-validation
CVPFI, cross-validated permutation feature importance
DT, decision tree
GMR, Gaussian mixture regression
GPR, Gaussian process regression
LIME, local interpretable model-agnostic explanations
PFI, permutation feature importance
PLS, partial least squares
RF, random forests
SHAP, Shapley additive explanations
SVR, support vector regression
VD, validation data

1. INTRODUCTION

In molecular design, material design, process design, process control, and process management, it is common to utilise mathematical models y = f(x) constructed between objective features y and explanatory features x using a dataset. An important objective is to construct models with high prediction accuracy. Classification and regression methods include linear discriminant analysis, logistic regression, 1 partial least squares (PLS) regression, 2 ridge regression, least absolute shrinkage and selection operator, elastic net, 3 support vector regression (SVR), 1 decision tree, 4 random forests (RF), 5 Gaussian process regression (GPR), 1 gradient boosting decision tree, 6 extreme gradient boosting, 7 light gradient boosting machine, 8 , 9 , 10 CatBoost, 11 , 12 deep neural network, 13 and Gaussian mixture regression (GMR). 14 , 15 Since there exists no optimal classification method or optimal regression method, a method that is appropriate for each dataset should be used.

It is also important to interpret the constructed models and analyse the relationship between x and y to elucidate the mechanism by which the physical properties and activities are expressed. In local interpretable model‐agnostic explanations (LIME) 16 and Shapley additive explanations (SHAP), 17 which can be combined with any regression method, the slope of x with respect to y around a sample point is determined by obtaining an approximate expression for the shape of the model at the sample point. LIME and SHAP can be used to discuss the local contribution or direction of x to y. For example, for a sample with a maximum value of y, we can discuss the direction of x to further improve the y value. However, feature importance, which is the degree of influence of each x on y, in the entire dataset is the focus of this study.

Several methods exist for evaluating the feature importance of RF models, 18 , 19 , 20 , 21 , 22 , 23 , 24 , 25 , 26 such as mean decrease impurity and permutation feature importance (PFI). The feature importance of x is usually calculated considering the entire range of y values, but the importance can differ when the y value is high, medium, or low. Shimizu and Kaneko (2021) proposed a decision tree (DT) and RF hybrid model in which the RF importance was calculated for each leaf node of the DT model; thus, DT provided a global interpretation of the entire dataset, and RF provided local interpretations for each cluster. 27

PFI can be combined with various classification and regression methods and is conveniently available in libraries such as scikit-learn. 28 However, training data and validation data (VD) are required to calculate the PFI, and the feature importance calculation becomes unstable when the number of samples is low. Additionally, the PFI of strongly correlated features is estimated to be lower than that of comparable independent features.

Therefore, in this study, cross‐validated permutation feature importance (CVPFI) is proposed to solve the above problems and calculate the feature importance appropriately. Because CVPFI is calculated in an iterative manner based on cross‐validation (CV), the feature importance can be calculated stably by increasing the number of divisions for CV when the number of samples is low. Additionally, when randomly shuffling a feature, it is possible to estimate the importance of strongly correlated features appropriately by shuffling other correlated features with probability based on the absolute correlation coefficients between the features.

The performance of CVPFI is verified, in comparison with PFI, using numerical simulation data generated for cases where x and y are linearly related and the number of samples is small, where x and y are nonlinearly related, where highly correlated features exist in x, and where quantised features whose values are unbalanced exist in x. The CVPFI of each descriptor is then discussed using a dataset of actual compounds.

The paper is organised as follows. Section 2 describes the methods used in the paper. Section 3 presents and discusses the results of the empirical analysis. Lastly, Section 4 concludes the study and provides implications of the work.

2. METHOD

2.1. Permutation feature importance

When the number of iterations is J, the algorithm for calculating the PFI is as follows:

  1. Construct a model using training data.

  2. Calculate the reference score rs of the model on VD. The score is the accuracy for a classifier and the coefficient of determination r2 for a regressor.

  3. For each feature i, that is, the ith column of VD, and for each repetition j in 1, 2, …, J, randomly shuffle column i of VD to generate a corrupted version of VD, CVD_i,j, and calculate the score s_i,j of the model on CVD_i,j.

  4. Calculate the importance of PFIi for the ith feature as follows:
    \mathrm{PFI}_i = rs - \frac{1}{J}\sum_{j=1}^{J} s_{i,j} \qquad (1)
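
For illustration, the following is a minimal sketch of this procedure for a scikit-learn-style estimator; the function name is hypothetical, and model.score is assumed to return accuracy for a classifier and r2 for a regressor, matching the scoring described above.

```python
import numpy as np

def permutation_feature_importance(model, x_val, y_val, n_repeats=5, random_state=0):
    """Minimal PFI sketch: the drop in score after shuffling each feature of the validation data."""
    rng = np.random.default_rng(random_state)
    reference_score = model.score(x_val, y_val)  # rs: accuracy (classifier) or r2 (regressor)
    importances = np.zeros(x_val.shape[1])
    for i in range(x_val.shape[1]):
        scores = []
        for _ in range(n_repeats):  # J repetitions
            x_corrupted = x_val.copy()
            x_corrupted[:, i] = rng.permutation(x_corrupted[:, i])  # shuffle the ith feature only
            scores.append(model.score(x_corrupted, y_val))          # s_i,j on CVD_i,j
        importances[i] = reference_score - np.mean(scores)          # Equation (1)
    return importances
```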

2.2. Cross‐validated permutation feature importance

Figure 1 shows the basic concept of the CVPFI. In PFI, the score si , j was calculated using a single set of VD; however, in CVPFI, the score was calculated using CV. For CV, because all samples become VD after the iterative calculation, the feature importance will be calculated stably even when the number of samples is low. However, when predicting VD in CV, if the ith feature of VD is randomly shuffled, as is the case with PFI, the feature will not be shuffled effectively when the number of divisions or folds of the CV is large. For example, in leave‐one‐out CV, the ith feature cannot be shuffled because there exists only one sample in VD per fold. Therefore, in CVPFI, instead of shuffling the ith feature of VD only, the ith feature of VD is randomly sampled without duplication from the original dataset. The proposed method increases the number of samples for model construction and model evaluation compared to the conventional method; thus, feature importance can be calculated stably using the proposed method.

FIGURE 1. Basic concept of cross-validated permutation feature importance

Additionally, in PFI, only the ith feature is shuffled when calculating s_i,j; in CVPFI, not only the ith feature but also other features correlated with the ith feature are randomly sampled without duplication from the original dataset. In general, when two features are correlated and one of them changes, the other also changes according to the correlation, and this behaviour is replicated in CVPFI. The higher the correlation with the ith feature, the higher the probability of being randomly sampled; thus, in CVPFI, the probability is set to the absolute value of the correlation coefficient with the ith feature. However, it is necessary to consider chance correlation, particularly when the number of samples is low. When the absolute value of the correlation coefficient is higher than zero even though there is essentially no correlation between the features, it becomes noise, leading to false feature importance. Therefore, interval estimation of the population correlation coefficient 29 is performed before the probability is set to the absolute correlation coefficient. When the estimated interval includes zero, that is, when the product of the lower and upper limits of the correlation coefficient is negative, the correlation coefficient is set to zero and the features are treated as uncorrelated.

In the interval estimation of the population correlation coefficient, the absolute correlation coefficient between the pth and qth features, r_p,q, is converted to z_p,q by the Fisher z-transformation 29 using the following equation:

z_{p,q} = \frac{1}{2}\log\frac{1 + r_{p,q}}{1 - r_{p,q}} \qquad (2)

Because z can be assumed to follow a normal distribution with a mean of z_p,q and a variance of 1/(m − 3), where m is the number of samples, the range of z whose probability is α (the confidence level) can be estimated. The lower and upper limits of this range are denoted by Lz_p,q and Uz_p,q, respectively. In this study, scipy.stats.norm.interval 30 was used to calculate them.

Lz_p,q and Uz_p,q of z are converted to Lr_p,q and Ur_p,q of r, respectively, as follows:

Lr_{p,q} = \frac{\exp(2Lz_{p,q}) - 1}{\exp(2Lz_{p,q}) + 1} \qquad (3)
Ur_{p,q} = \frac{\exp(2Uz_{p,q}) - 1}{\exp(2Uz_{p,q}) + 1} \qquad (4)

When Lr_p,q < 0 < Ur_p,q, the correlation may be a chance correlation; thus, r_p,q is set to zero. When feature A is important and feature B is correlated with A, then B is also important, at least to the extent of its correlation coefficient with A. Of course, correlations between pairs of features cannot capture all relationships among features in a multivariate dataset; however, the correlation coefficients represent some of the necessary relationships between features, and the CVPFI is calculated by considering these relationships.
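
As an illustration, the following sketch computes the interval-filtered absolute correlation matrix with scipy.stats.norm.interval, as referenced above; the function name, the clipping constant used to avoid infinities on the diagonal, and the default α of 0.999 (the value used later in the case studies) are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import norm

def filtered_absolute_correlations(x, alpha=0.999):
    """Absolute correlation matrix with entries set to zero when the interval estimate
    of the population correlation coefficient includes zero."""
    m = x.shape[0]                                # number of samples
    r = np.corrcoef(x, rowvar=False)              # correlation coefficients between all features
    r_clipped = np.clip(r, -0.999999, 0.999999)   # avoid infinities on the diagonal
    z = 0.5 * np.log((1 + r_clipped) / (1 - r_clipped))            # Equation (2)
    lz, uz = norm.interval(alpha, loc=z, scale=1 / np.sqrt(m - 3))
    lr = (np.exp(2 * lz) - 1) / (np.exp(2 * lz) + 1)               # Equation (3)
    ur = (np.exp(2 * uz) - 1) / (np.exp(2 * uz) + 1)               # Equation (4)
    r[(lr < 0) & (ur > 0)] = 0                    # possible chance correlation -> no correlation
    return np.abs(r)
```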

When the number of iterations is J, the algorithm for calculating CVPFI is given as follows:

  1. Calculate the correlation coefficients between all the features.

  2. Calculate Lr_p,q and Ur_p,q between all features and set r_p,q = 0 when Lr_p,q < 0 < Ur_p,q.

  3. In CV, for n = 1, 2, …, N, where N is the number of folds of CV, the following procedures are conducted using training data and VD at each fold.

    3‐1 Construct a model using the nth training data.

    3‐2 Estimate the y values of the nth VD (VD_n) using the model.

    3‐3 For each feature i, that is, the ith column of VD_n, and for each repetition j in 1, 2, …, J, randomly sample column i of the original dataset without duplication; for each feature m (the mth column of VD_n) for which r_i,m is higher than zero, randomly sample column m of the original dataset without duplication with a probability of r_i,m to generate a corrupted version of the dataset, CVD_n,i,j, and estimate the y values with the model.

  4. Integrate the y-values estimated in CV for VD_1, VD_2, …, and VD_N, and calculate the reference score rs_cv with the integrated y-values. The score is the accuracy for a classifier and the coefficient of determination r2 for a regressor.

  5. Integrate the y-values estimated in CV for CVD_1,i,j, CVD_2,i,j, …, and CVD_N,i,j, and calculate the score scv_i,j with the integrated y-values.

  6. Calculate the importance of CVPFIi for the ith feature as follows:
    \mathrm{CVPFI}_i = rs_{\mathrm{cv}} - \frac{1}{J}\sum_{j=1}^{J} scv_{i,j} \qquad (5)
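
The following condensed sketch illustrates steps 3 to 6 for a regressor, assuming numpy arrays, a scikit-learn-style estimator, and the interval-filtered absolute correlation matrix from the previous sketch; names and fold handling are illustrative, and the reference implementation is the one in DCEKit.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

def cvpfi(model, x, y, abs_corr, n_splits=5, n_repeats=5, random_state=0):
    """Sketch of CVPFI for a regressor.
    abs_corr: absolute correlation coefficients with chance correlations already set to zero."""
    rng = np.random.default_rng(random_state)
    n_samples, n_features = x.shape
    y_cv = np.zeros(n_samples)                                  # integrated CV estimates (step 4)
    y_corrupted = np.zeros((n_samples, n_features, n_repeats))  # estimates on corrupted data (step 5)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for train_idx, val_idx in folds.split(x):
        fold_model = clone(model).fit(x[train_idx], y[train_idx])   # step 3-1
        y_cv[val_idx] = fold_model.predict(x[val_idx])              # step 3-2
        for i in range(n_features):                                 # step 3-3
            for j in range(n_repeats):
                x_val = x[val_idx].copy()
                for m in range(n_features):
                    # resample feature i always, and correlated features with probability |r(i, m)|
                    if m != i and rng.random() >= abs_corr[i, m]:
                        continue
                    rows = rng.choice(n_samples, size=len(val_idx), replace=False)
                    x_val[:, m] = x[rows, m]
                y_corrupted[val_idx, i, j] = fold_model.predict(x_val)
    rs_cv = r2_score(y, y_cv)
    return np.array([rs_cv - np.mean([r2_score(y, y_corrupted[:, i, j]) for j in range(n_repeats)])
                     for i in range(n_features)])                   # Equation (5)
```

For a classifier, an accuracy score would replace r2_score in this sketch.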

Because CVPFI integrates the estimated y-values over all CV folds, the number of samples used to calculate the accuracy for a classifier or r2 for a regressor is larger than that in PFI; therefore, the importance can be calculated stably in CVPFI.

Python code for CVPFI is available at https://github.com/hkaneko1985/dcekit. In this code, the maximal information coefficient 31 can be used instead of the correlation coefficient r.

3. RESULTS AND DISCUSSION

To validate the proposed CVPFI, it was compared with the conventional PFI using numerical simulation data and actual compound data. J was set to 5 for both PFI and CVPFI. For PFI, the data were randomly split so that the training data and validation data contained 75% and 25% of the samples, respectively. For CVPFI, α was set to 0.999. A leave-one-out CV was used when the number of samples was less than 30, a 10-fold CV when the number of samples was higher than 30 but less than 100, and a 5-fold CV when the number of samples was higher than 100. In this paper, PLS (a linear regression method), SVR with a Gaussian kernel and GPR (nonlinear regression methods), and GMR (a nonlinear regression method enabling direct inverse analysis) were used. The kernel function used in GPR is given as follows:

K\left(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\right) = \theta_0 \exp\left(-\frac{\theta_1}{2}\left\|\mathbf{x}^{(i)} - \mathbf{x}^{(j)}\right\|^2\right) + \theta_2 \qquad (6)

where x^(i) is the ith sample of x, and θ0, θ1, and θ2 are the hyperparameters.
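
For reference, this kernel corresponds, up to reparameterisation, to a constant kernel multiplied by an RBF kernel plus an additive constant; the scikit-learn sketch below is one possible way to set this up and is not taken from the paper (θ1 maps to 1/length_scale², and the added WhiteKernel for observation noise is an extra assumption).

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

# theta_0 * exp(-theta_1 / 2 * ||x_i - x_j||^2) + theta_2 expressed with scikit-learn kernels;
# the length scale of the RBF kernel corresponds to 1 / sqrt(theta_1).
kernel = ConstantKernel() * RBF() + ConstantKernel() + WhiteKernel()
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
```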

In numerical simulation data, the following four case studies were conducted:

  • Case study 1: There are linear relationships between 10 features in x and y, five features unrelated to y are included in x, and the number of samples is low.

  • Case study 2: There are linear and non‐linear relationships between 10 features in x and y, five features unrelated to y are included in x, and the number of samples is low.

  • Case study 3: There is a linear relationship between x and y, and x contains features that are strongly correlated.

  • Case study 4: There is a linear relationship between x and y, and x contains quantised features with biased values.

In case study 1, the number of features in x was set to 15, and samples of x were generated as uniform random numbers between 0 and 1. The first 10 features in x had weights to y, and all the weights were one. The remaining five features in x made no contribution to y; that is, their weights were all zero. Normal random numbers with a standard deviation of 10% were added to y. The sample size was set to 20.
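
A sketch of how such data can be generated is shown below; the seed is arbitrary, and interpreting the "10%" noise as 10% of the standard deviation of y is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_informative = 20, 15, 10

x = rng.uniform(0, 1, size=(n_samples, n_features))
weights = np.r_[np.ones(n_informative), np.zeros(n_features - n_informative)]  # last 5 weights are zero
y = x @ weights
y += rng.normal(0, 0.1 * y.std(), size=n_samples)  # assumed: noise with 10% of the std of y
```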

Figure 2 shows the feature importance of each regression method when the PFI and CVPFI are used in case study 1. The results for the first to 10th features, which are related to y, are shown as blue bars, and the results for the 11th to 15th features, which are not related to y, are shown as black bars. The PFI results in Figure 2A,C,E,G show that, among the first to 10th features, some features have high importance, but the importance of several features is lower than the maximum importance of the features not related to y. In contrast, Figure 2B,D,F,H shows that the importance of the features related to y is sufficiently high compared to the maximum importance of the features not related to y in all results except the SVR result. Because the number of samples was low and PFI was unstable, the importance of some features not related to y was higher than that of features related to y, although the importance of the 10th feature was high. In contrast, the proposed CVPFI could evaluate feature importance appropriately, and the importance of the features unrelated to y was lower than that of the features related to y; that is, the proposed method can properly identify the important features. In SVR, because the number of samples was small and there were three hyperparameters, the optimisation of the hyperparameters presumably failed, and SVR could not model the relationship between x and y on this dataset. It was found that the feature importance could be appropriately evaluated even with a small number of samples using the proposed CVPFI.

FIGURE 2. Feature importance in case study 1. Blue and black bars correspond to significant x and non-significant x, respectively. (A) PFI in PLS, (B) CVPFI in PLS, (C) PFI in SVR, (D) CVPFI in SVR, (E) PFI in GPR, (F) CVPFI in GPR, (G) PFI in GMR and (H) CVPFI in GMR. Note: PFI, permutation feature importance; PLS, partial least squares; CVPFI, cross-validated permutation feature importance; SVR, support vector regression; GPR, Gaussian process regression; GMR, Gaussian mixture regression

In case study 2, the number of features in x was set to 15, and samples of x were generated as uniform random numbers between 0 and 1. The first five features in x contribute to y as follows:

y = 3(x_1 - 1.5)^2 + 2(x_2 - 1)^3 + \exp(x_3 - 0.5) + 2\log(x_4 + 1) + 1.5\sin(x_5) \qquad (7)

The next five features in x had weights to y, and all the weights were one. The remaining five features in x made no contribution to y; that is, their weights were all zero. Normal random numbers with a standard deviation of 10% were added to y. The sample size was set to 60.
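
A sketch of the corresponding data generation, using the reconstructed form of Equation (7) above (the seed and the 10% noise interpretation are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 60, 15
x = rng.uniform(0, 1, size=(n_samples, n_features))

# Nonlinear contributions of x1-x5 (Equation 7) plus linear contributions of x6-x10 with unit weights
y = (3 * (x[:, 0] - 1.5) ** 2 + 2 * (x[:, 1] - 1) ** 3 + np.exp(x[:, 2] - 0.5)
     + 2 * np.log(x[:, 3] + 1) + 1.5 * np.sin(x[:, 4]))
y += x[:, 5:10].sum(axis=1)
y += rng.normal(0, 0.1 * y.std(), size=n_samples)  # assumed: noise with 10% of the std of y
```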

Figure 3 shows the feature importance of each regression method when the PFI and CVPFI are used in case study 2. The results for the first to 10th features, which are related to y, are shown as blue bars, and the results for the 11th to 15th features, which are not related to y, are shown as black bars. The PFI results in Figure 3A,C,E,G show that the number of important features whose importance exceeds that of the 15th feature, which is unrelated to y, is low, and many important features in x are considered less important than this unimportant feature. In contrast, Figure 3B,D,F,H shows that the importance of the features related to y is higher than the maximum importance of the features unrelated to y for all the nonlinear regression methods, that is, SVR, GPR, and GMR. In the linear PLS, although the feature importance of x1, which has a nonlinear relationship to y, was low, the importance of the other important features in x was appropriately high. It was found that the combination of the proposed CVPFI and a nonlinear regression method can properly evaluate feature importance even when linear and nonlinear relationships between x and y are mixed in small data.

FIGURE 3. Feature importance in case study 2. Blue and black bars correspond to significant x and non-significant x, respectively. (A) PFI in PLS, (B) CVPFI in PLS, (C) PFI in SVR, (D) CVPFI in SVR, (E) PFI in GPR, (F) CVPFI in GPR, (G) PFI in GMR and (H) CVPFI in GMR. Note: PFI, permutation feature importance; PLS, partial least squares; CVPFI, cross-validated permutation feature importance; SVR, support vector regression; GPR, Gaussian process regression; GMR, Gaussian mixture regression

In case study 3, the number of features in x was set to 10, and the first five features were generated as uniform random numbers between 0 and 1; then, normal random numbers with a standard deviation of 10% were added to each feature. These features are independent of one another. For the remaining five features to be highly correlated with each other, the ith feature x_i was generated as follows:

x_i = u + 0.1 \times N(0, 1) \qquad (8)

Here, u is a uniform random number between 0 and 1 and N(0, 1) is a standard normal random number. After standardising all 10 features, all the weights to y were set to one, so the contributions to y were equivalent for all the features. Normal random numbers with a standard deviation of 10% were added to y. The sample size was 100.
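
A sketch of this data generation (the seed is arbitrary, and the 10% feature noise is assumed to mean a standard deviation of 0.1):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100

x_independent = rng.uniform(0, 1, size=(n_samples, 5)) + rng.normal(0, 0.1, size=(n_samples, 5))
u = rng.uniform(0, 1, size=n_samples)                                          # shared latent variable
x_correlated = u[:, np.newaxis] + 0.1 * rng.normal(0, 1, size=(n_samples, 5))  # Equation (8)

x = np.hstack([x_independent, x_correlated])
x_std = (x - x.mean(axis=0)) / x.std(axis=0)        # standardise all 10 features
y = x_std.sum(axis=1)                               # unit weights for every feature
y += rng.normal(0, 0.1 * y.std(), size=n_samples)   # assumed: noise with 10% of the std of y
```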

Figure 4 shows the feature importance of each regression method when the PFI and CVPFI are used in case study 3. The results for the first to fifth features, which are uncorrelated with each other, are shown as blue bars, and the results for the sixth to 10th features, which are highly correlated with each other, are shown as red bars. The PFI results in Figure 4A,C,E,G show that the importance of the features that are highly correlated with each other is much lower than that of the uncorrelated features in all the regression methods, although the weights to y are the same for all the features in x. This could be because PFI does not consider correlation in x at all, and the other correlated features in x can still explain y after permutation. In contrast, Figure 4B,D,F,H shows that by using CVPFI, which considers the correlation between features in permutation, the features were evaluated at the same level of importance in all the regression methods, regardless of whether they were uncorrelated or strongly correlated with each other. It was confirmed that, using the proposed CVPFI, feature importance can be evaluated in the same way as for independent features, even when features that are highly correlated with each other are included in x.

FIGURE 4. Feature importance in case study 3. Blue and red bars correspond to independent x and strongly correlated x, respectively. (A) PFI in PLS, (B) CVPFI in PLS, (C) PFI in SVR, (D) CVPFI in SVR, (E) PFI in GPR, (F) CVPFI in GPR, (G) PFI in GMR and (H) CVPFI in GMR. Note: PFI, permutation feature importance; PLS, partial least squares; CVPFI, cross-validated permutation feature importance; SVR, support vector regression; GPR, Gaussian process regression; GMR, Gaussian mixture regression

In case study 4, the number of features in x was set to 10, and the first five features were generated as uniform random numbers between 0 and 1. The remaining five features were generated to take only the values 0 or 1 and to be biased such that the percentage of ones was 10%. After standardising all features, all weights to y were set to one, and y was calculated. Normal random numbers with a standard deviation of 10% were added to y. The sample size was 100.
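
A sketch of this data generation (seed and noise interpretation assumed as before):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100

x_continuous = rng.uniform(0, 1, size=(n_samples, 5))
x_quantised = (rng.uniform(0, 1, size=(n_samples, 5)) < 0.1).astype(float)  # roughly 10% ones

x = np.hstack([x_continuous, x_quantised])
x_std = (x - x.mean(axis=0)) / x.std(axis=0)        # standardise all features
y = x_std.sum(axis=1)                               # unit weights
y += rng.normal(0, 0.1 * y.std(), size=n_samples)   # assumed: noise with 10% of the std of y
```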

Figure 5 shows the feature importance of each regression method when the PFI and CVPFI are used in case study 4. The results for the first to fifth features, which are continuous, are shown as blue bars, and the results for the sixth to 10th features, which are quantised with biased values, are shown as green bars. For both PFI and CVPFI, continuous and quantised features were evaluated as having the same level of importance. The variation of importance between features was lower for CVPFI than for PFI, indicating that CVPFI was more stable than PFI, because the dataset in case study 4 was generated so that the weights of x to y were the same. It was found that the proposed CVPFI can stably evaluate feature importance even when quantised features with biased values exist.

FIGURE 5. Feature importance in case study 4. Blue and green bars correspond to continuous x and quantised x, respectively. (A) PFI in PLS, (B) CVPFI in PLS, (C) PFI in SVR, (D) CVPFI in SVR, (E) PFI in GPR, (F) CVPFI in GPR, (G) PFI in GMR and (H) CVPFI in GMR. Note: PFI, permutation feature importance; PLS, partial least squares; CVPFI, cross-validated permutation feature importance; SVR, support vector regression; GPR, Gaussian process regression; GMR, Gaussian mixture regression

Next, to examine the performance of the proposed method, a dataset of boiling points 32 was used as the actual compound dataset. RDKit 33 was used to calculate the molecular descriptors, and only interpretable descriptors were selected to test the interpretability of the model. Features for which 80% or more of the training samples shared the same value were excluded. For each pair of features whose absolute correlation coefficient was one, one of the two features was subsequently deleted. The descriptors used in this study are listed in Table 1.

TABLE 1.

Molecular descriptors of RDKit used in this study

Name Description
MolWt Average molecular weight of the molecule
HeavyAtomMolWt Average molecular weight of the molecule ignoring hydrogens
ExactMolWt Exact molecular weight of the molecule
NumValenceElectrons Number of valence electrons that the molecule has
HeavyAtomCount Number of heavy atoms in a molecule
NHOHCount Number of NHs or OHs
NOCount Number of nitrogens and oxygens
NumAromaticRings Number of aromatic rings
NumHAcceptors Number of hydrogen bond acceptors
NumHDonors Number of hydrogen bond donors
NumHeteroatoms Number of heteroatoms
NumRotatableBonds Number of rotatable bonds
RingCount Number of rings
MolLogP Wildman‐Crippen LogP value
MolMR Wildman‐Crippen MR value
fr_C_O Number of carbonyl O
fr_C_O_noCOO Number of carbonyl O, excluding COOH
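
The descriptors in Table 1 can be computed with RDKit's Descriptors module; the sketch below is illustrative (the SMILES strings are examples, and the helper function is not part of the paper's code).

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

DESCRIPTOR_NAMES = ['MolWt', 'HeavyAtomMolWt', 'ExactMolWt', 'NumValenceElectrons',
                    'HeavyAtomCount', 'NHOHCount', 'NOCount', 'NumAromaticRings',
                    'NumHAcceptors', 'NumHDonors', 'NumHeteroatoms', 'NumRotatableBonds',
                    'RingCount', 'MolLogP', 'MolMR', 'fr_C_O', 'fr_C_O_noCOO']

def calculate_descriptors(smiles_list):
    """Compute the Table 1 descriptors for a list of SMILES strings."""
    functions = dict(Descriptors.descList)  # name -> descriptor function mapping provided by RDKit
    rows = [[functions[name](Chem.MolFromSmiles(smiles)) for name in DESCRIPTOR_NAMES]
            for smiles in smiles_list]
    return pd.DataFrame(rows, columns=DESCRIPTOR_NAMES)

descriptors = calculate_descriptors(['CCO', 'c1ccccc1'])  # for example, ethanol and benzene
```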

Figure 6 illustrates the feature importance of each regression method when PFI and CVPFI were used for the boiling point dataset. Although molecular weight is one of the important factors affecting the boiling point, the PFI results (Figure 6A,C,E,G) show that the importance of all features related to molecular weight is low, except for GMR. Additionally, the features related to the numbers of atoms and substructures are also important in explaining the boiling point; however, they became less important for all regression methods. These features, including molecular weight and the numbers of atoms and substructures, are considered less important owing to the correlations between the features. In contrast, Figure 6B,D,F,H shows that by using CVPFI, the importance of the features related to molecular weight and the numbers of atoms and substructures increased, and the results were reasonable for all regression methods. It was found that the proposed CVPFI can properly evaluate the importance of features in a real dataset where correlations exist between the features.

FIGURE 6. Feature importance for the boiling point dataset. (A) PFI in PLS, (B) CVPFI in PLS, (C) PFI in SVR, (D) CVPFI in SVR, (E) PFI in GPR, (F) CVPFI in GPR, (G) PFI in GMR and (H) CVPFI in GMR. Note: PFI, permutation feature importance; PLS, partial least squares; CVPFI, cross-validated permutation feature importance; SVR, support vector regression; GPR, Gaussian process regression; GMR, Gaussian mixture regression

4. CONCLUSION

In this study, CVPFI was proposed to properly evaluate feature importance in machine learning models. Compared with the conventional PFI, CVPFI can calculate feature importance stably and appropriately because model construction and feature evaluation are repeated based on CV. Furthermore, since features other than the target feature of the permutation are randomly sampled based on the correlation coefficients between the features, the importance of strongly correlated features can be evaluated as appropriately as that of independent features.

Through case studies using numerical simulation data, it was confirmed that CVPFI evaluates feature importance more appropriately than the conventional PFI in all cases examined: where the number of samples is low, where linear and nonlinear relationships are mixed between x and y, where strongly correlated features exist in x, and where quantised features with biased values exist in x. Furthermore, when the actual boiling point dataset was used, feature importance could be properly evaluated in the presence of correlations between molecular descriptors. Although CVPFI was applied only to regression analysis in this study, it can also be used for classification by changing the evaluation index of the model from r2 to a classification index, such as accuracy or Cohen's kappa.

Although CVPFI can be combined with any regression method, the number of y variables that can be considered simultaneously depends on the regression method. For example, when using SVR, only one y can be considered; however, when using GMR, multiple y variables can be considered simultaneously. The proposed CVPFI is expected to facilitate the interpretation of data-driven models, the explanation of phenomena, and the clarification of mechanisms in datasets.

Python codes for CVPFI are available. 34

CONFLICT OF INTEREST

The author declares that there is no conflict of interest.

ACKNOWLEDGEMENTS

This work was supported by a Grant‐in‐Aid for Scientific Research (KAKENHI) [grant numbers 19K15352, 20H02553 and 20H04538] from the Japan Society for the Promotion of Science.

Kaneko H. Cross‐validated permutation feature importance considering correlation between features. Anal Sci Adv. 2022;3:278–287. 10.1002/ansa.202200018

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are available in reference number [32].

REFERENCES

  • 1. Bishop CM. Pattern Recognition and Machine Learning. Springer; 2006. [Google Scholar]
  • 2. Wold S, Sjöström M, Eriksson L. PLS‐regression: a basic tool of chemometrics. Chemom Intel Lab Syst. 2001;58(2):109‐130. 10.1016/S0169-7439(01)00155-1 [DOI] [Google Scholar]
  • 3. Li ZT, Sillanpaa MJ. Overview of LASSO‐related penalized regression methods for quantitative trait mapping and genomic selection. Theor Appl Genet. 2012;125(3):419‐435. 10.1007/s00122-012-1892-9 [DOI] [PubMed] [Google Scholar]
  • 4. Bruce CL, Melville JL, Pickett SD, Hirst JD. Contemporary QSAR classifiers compared. J Chem Inf Model. 2007;47(1):219‐227. 10.1021/ci600332j [DOI] [PubMed] [Google Scholar]
  • 5. Palmer DS, O'Boyle NM, Glen RC, Mitchell JBO. Random forest models to predict aqueous solubility. J Chem Inf Model. 2007;47(1):150‐158. 10.1021/ci060164k [DOI] [PubMed] [Google Scholar]
  • 6. Natekin A, Knoll A. Gradient boosting machines, a tutorial. Front Neurorobot. 2013;7:21. 10.3389/fnbot.2013.00021 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Chen T, Guestrin C, XGBoost: a scalable tree boosting system. 2016. 10.1145/2939672.2939785 [DOI]
  • 8. Ke G, Meng Q, Finley T, et al. LightGBM: a highly efficient gradient boosting decision tree. Adv Neural Info Proc Syst. 2017;30:3149‐3157. [Google Scholar]
  • 9. Meng Q, Ke G, Wang T, et al. A communication‐efficient parallel algorithm for decision tree. Adv Neural Info Proc Syst. 2016;29:1279‐1287. [Google Scholar]
  • 10. Zhang H, Si S, Hsieh CJ. GPU acceleration for large‐scale tree boosting. SysML Conference. 2018. arXiv:1706.08359.
  • 11. Dorogush AV, Gulin A, Gusev G, Kazeev N, Prokhorenkova LO, Vorobev A. Fighting biases with dynamic boosting. 2017. arXiv:1706.09516.
  • 12. Dorogush AV, Ershov V, Gulin A. CatBoost: gradient boosting with categorical features support. Workshop on ML Systems at NIPS. 2017.
  • 13. Goh GB, Hodas NO, Vishnu A. Deep learning for computational chemistry. J Comput Chem. 2017;38(16):1291‐1307. 10.1002/jcc.24764 [DOI] [PubMed] [Google Scholar]
  • 14. Kaneko H. Adaptive design of experiments based on Gaussian mixture regression. Chemom Intell Lab Syst. 2021;208:104226. 10.1016/j.chemolab.2020.104226 [DOI] [Google Scholar]
  • 15. Kaneko H. Lifting the limitations of gaussian mixture regression through coupling with principal component analysis and deep autoencoding. Chemom Intell Lab Syst. 2021; 218: 104437. 10.1016/j.chemolab.2021.104437 [DOI] [Google Scholar]
  • 16. Ribeiro MT, Singh S, Guestrin C. “Why should I trust you?”: explaining the predictions of any classifier. 2016. arXiv:1602.04938v3.
  • 17. Lundberg S, Lee SI. A unified approach to interpreting model predictions. 2017. arXiv:1705.07874v2.
  • 18. Breiman L. Random forests. Mach Learn. 2001;45(1):5‐32. 10.1023/A:1010933404324 [DOI] [Google Scholar]
  • 19. Strobl C, Boulesteix AL, Zeileis A, Hothorn T. Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics. 2007;8(1):1‐21. 10.1186/1471-2105-8-257 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Strobl C, Boulesteix AL, Kneib T, Augustin T, Zeileis A. Conditional variable importance for random forests. BMC Bioinformatics. 2008;9(1):1‐11. 10.1186/1471-2105-9-307 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Ishwaran H. Variable importance in binary regression trees and forests. Electron J Statist. 2007;1:519‐537. 10.1214/07-EJS039 [DOI] [Google Scholar]
  • 22. Archer KJ, Kimes RV. Empirical characterization of random forest variable importance measures. Comput Stat Data An. 2008;52(4):2249‐2260. 10.1016/j.csda.2007.08.015 [DOI] [Google Scholar]
  • 23. Genuer R, Poggi JM, Malot CT. Variable selection using random forests. Pattern Recogn Lett. 2010;31(14):2225‐2236. 10.1016/j.patrec.2010.03.014 [DOI] [Google Scholar]
  • 24. Gregorutti B, Michel B, Pierre PS. Correlation and variable importance in random forests. Stat Comput. 2017;27(3):659‐678. [Google Scholar]
  • 25. Gregorutti B, Michel B, Pierre PS. Grouped variable importance with random forests and application to multiple functional data analysis. 2015. https://arxiv.org/abs/1411.4170
  • 26. Louppe G. Understanding random forests: from theory to practice. 2014. https://arxiv.org/abs/1407.7502
  • 27. Shimizu N, Kaneko H. Constructing regression models with high prediction accuracy and interpretability based on decision tree and random forests. J Comput Chem Jpn. 2021;20(2):71‐87. [Google Scholar]
  • 28. Scikit-learn: sklearn.inspection.permutation_importance. Accessed February 8, 2022. https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html
  • 29. Wikipedia . Fisher transformation. Accessed February 8, 2022. https://en.wikipedia.org/wiki/Fisher_transformation
  • 30. Scipy.stats.norm . The SciPy community. Accessed February 8, 2022. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html
  • 31. Reshef DN, Reshef YA, Finucane HK, et al. Detecting novel associations in large data sets. Science. 2011;334(6062):1518‐1524. 10.1126/science.1205438 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Hall LH, Story CT. Boiling point and critical temperature of a heterogeneous data set: qSAR with atom type electrotopological state indices using artificial neural networks. J Chem Inf Comput Sci. 1996;36(5):1004‐1014. 10.1021/ci960375x [DOI] [Google Scholar]
  • 33. RDKit: open‐source cheminformatics software . Accessed February 8, 2022. https://www.rdkit.org/
  • 34. DCEKit (Data chemical engineering toolkit) . Accessed February 8, 2022. https://github.com/hkaneko1985/dcekit
