Journal of Cheminformatics. 2026 Jan 17;18:23. doi: 10.1186/s13321-025-01096-z

Universal feature selection for simultaneous interpretability of multitask datasets

Matt Raymond 1, Jacob Charles Saldinger 2,3, Paolo Elvati 3, Angela Violi 1,2,3
PMCID: PMC12896148  PMID: 41547940

Abstract

Extracting meaningful features from complex, high-dimensional datasets across scientific domains remains challenging. Current methods often struggle with scalability, limiting their applicability to large datasets, or make restrictive assumptions about feature-property relationships, hindering their ability to capture complex interactions. BoUTS’s general and scalable feature selection algorithm surpasses these limitations by identifying both universal features relevant to all datasets and task-specific features predictive for specific subsets. Evaluated on seven diverse chemical regression datasets, BoUTS achieves state-of-the-art feature sparsity while generally maintaining prediction accuracy comparable to specialized methods. Notably, BoUTS’s universal features enable domain-specific knowledge transfer between datasets, and we expect these results to be broadly useful to manually-guided inverse problems. Beyond its current application, BoUTS holds potential for elucidating data-poor systems by leveraging information from similar data-rich systems.

Scientific Contribution: BoUTS selects nonlinear, universally informative features across multiple datasets. We identify crucial “universal features” across seven real-world chemistry datasets, which enhance cross-dataset interpretability and selection stability. BoUTS is highly scalable and is applicable to tabular data from many domains, and our results identify connections between seemingly unrelated chemical domains.

Graphical Abstract


Keywords: Variable selection, Multi-output, Multi-source, Dimensionality reduction

Introduction

Multitask learning (MTL) is a rich subfield within machine learning that exploits commonalities across tasks (i.e., datasets) to build robust and generalizable models. MTL has been applied to various domains such as natural language processing and computer vision. MTL trains models on multiple related learning tasks while applying a common regularization to all models (e.g., weight sharing). The underlying idea is that models can share information and representations, leading to better performance on all tasks compared to training the models independently.

MTL plays an important role across various research domains due to the prevalence of datasets that are either large and generic or small and specific. In fields such as biomedicine, drug discovery, and personalized medicine, predicting individual responses is challenging, as large datasets for common diseases poorly capture specific mutations or rare conditions [1–3].

Within this context, multitask feature selection focuses on choosing the most relevant features for these multiple tasks. Primarily, this approach enhances interpretability, which can help us gain valuable insights into the underlying relationships between different phenomena. Indeed, feature selection and its cousin feature engineering have directly led to novel scientific discoveries [4, 5]. Additionally, selecting features in a structured way improves the generalizability and efficiency of the multitask learning model. At the same time, it can also improve model performance and reduce computational costs during training and inference. However, it is important to emphasize that interpretability by domain experts, not performance, is often the primary goal of feature selection.

In this study, we investigate the existence and selection of universal features that are predictive across all datasets under consideration. The goal is to enable knowledge transfer from well-established research areas to those that are less investigated. While numerous methods have been devised to select features in multitask problems, they all have limitations that make them poorly suited for this task.

As discussed later, current methods such as Group LASSO “relax” universal features to “common” features—those of importance to a subset of tasks—thereby leading to larger, less interpretable feature sets. Moreover, tree-based methods [6, 7] require a positive correlation among model outputs across tasks, an assumption that may not always hold. Additionally, kernel methods [1, 8–11] face challenges with scalability. Collectively, such limitations make these approaches ill-suited for application to many real-world datasets.

Therefore, we introduce Boosted Universal and Task-specific Selection (BoUTS), a scalable algorithm designed to perform non-linear universal feature selection that sidesteps existing restrictions on task structure. BoUTS identifies universal (common to all tasks) and task-specific features in task subsets, providing insights into unique mechanisms relevant to specific outcomes.

BoUTS has two stages. First, it performs boosting using “multitask trees,” which select universal features based on the minimum feature importance (impurity decrease) across all outputs. This approach ensures that universal features are predictive for all outputs. Second, task-specific features are selected by independently correcting each output of the multitask tree using regular boosted trees [12]. BoUTS penalizes adding new features during tree growth to ensure small feature sets [13]. Final predictions are made by summing the individual predictions from the multitask and single-task boosting stages.
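The two-stage summation described above can be sketched in a few lines. This is a hypothetical illustration (the function and variable names are ours, not the authors' implementation), assuming each boosting round holds one regressor per task:

```python
# Hypothetical sketch of BoUTS's two-stage prediction: stage 1 is multitask
# boosting (trees shared across tasks), stage 2 is per-task boosting fit on the
# residuals of stage 1. The final estimate sums every round from both stages.
def bouts_predict(x, multitask_rounds, singletask_rounds, task):
    """Sum the contributions of all boosting rounds from both stages for one task."""
    pred = 0.0
    for round_trees in multitask_rounds:   # each round: dict task -> regressor
        pred += round_trees[task](x)
    for tree in singletask_rounds[task]:   # per-task residual correction
        pred += tree(x)
    return pred

# toy check: one multitask round predicting 1.0, one residual round predicting 0.5
mt = [{0: lambda x: 1.0}]
st = {0: [lambda x: 0.5]}
assert bouts_predict(None, mt, st, task=0) == 1.5
```

The key point is that the single-task stage never changes the multitask stage's output; it only corrects its residuals, so universal features remain responsible for the shared signal.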

We evaluated the performance of BoUTS using seven chemistry datasets that span three distinct molecular classes and employ six different molecular properties as outputs. Appendix B contains several ancillary experiments on nonchemical datasets. In all evaluated cases, BoUTS outperforms existing multitask selection methods in model flexibility, feature sparsity, stability [14], and capacity for selecting universal features. BoUTS's universal features, even when generalizing across different properties and molecule types, remain competitive with deep learning models [15] and sometimes even surpass dataset-tailored methodologies [16]. Moreover, the identified universal and task-specific features are consistent with established chemical knowledge, highlighting the potential of BoUTS to enhance the analysis of complex datasets and to promote the transfer of domain knowledge.

Methodology

We begin by introducing the BoUTS algorithm and describing our data selection and preprocessing methodology. Next, we discuss how we performed statistical analysis of our results, implemented competing methods, and performed training.

BoUTS algorithm

The BoUTS algorithm combines new and existing methods to select concise sets of universal and task-specific features without universal feature relaxation or positive task correlation. Our two-part strategy is first to select universal features and then select task-specific features (Fig. 1a). This approach is a greedy approximation of a globally optimal feature set, as shown in Appendix C. In principle, universal features may be selected using standard Gradient Boosted Trees (GBTs) by independently fitting GBTs on each task and comparing the feature-wise information gained between tasks. However, feature correlation may cause the independent GBTs to (1) miss universal features, and (2) pick excessive, redundant features. To address the first issue, we select universal features using multitask trees. Our multitask trees greedily select features that maximize the minimum impurity decrease across all tasks, ensuring that all trees agree on which (possibly correlated) features to select (Fig. 1b). The second problem is addressed using penalized impurity functions as defined by Xu et al. [13]. We penalize the use of new features at each split when selecting universal or task-specific features. Notably, this approach makes no assumptions about the correlation between task outputs, meaning that BoUTS applies to a wider range of multitask problems than competing methods. Alg. 3 contains the algorithmic details of our approach. BoUTS’s greedy application of multi- and single-task trees and a penalized impurity function underlies its ability to select universal features from nonlinear and uncorrelated tasks.

Fig. 1.

Overview of BoUTS algorithm and datasets: a illustrates the BoUTS algorithm for the case where T=3. The boosted multitask trees are trained on all multitask datasets (circles) to estimate (upper diamonds) each task output. Single-task boosting estimates the residuals of the multitask trees (middle diamonds). We sum over multi- and single-task outputs for the final estimate (lower diamonds). In b, we show the splitting process for multitask trees. The improvement (impurity decrease) is computed for each task/feature combination (square), and f is selected as the feature with the maximin improvement. f splits each dataset (partial circles), and we repeat until a stopping condition is reached. c illustrates the assignment of datasets to categories, with square size indicating the logarithm of the dataset size. For each dataset (starting at the top row), n = [11,079, 777, 1,185, 2,143, 147, 206, 3,071]. d shows the correlation between dataset outputs. Proteins are not included because we use only one protein dataset. n values for the lower triangles, grouped by column, are logP: n = [777, 1,185, 2,143]; logHs: n = [479, 614]; Tb: n = [822]; zeta potential: n = [119]. e shows a t-SNE plot of each data point (using the complete feature set), colored by molecule type. For small molecules, NPs, and proteins, n = [11,079, 234, 3,071], respectively

Splitting single-task trees

For selecting task-specific features, we use an adaptation of the classic classification and regression tree (CART) splitting criteria [12] from Xu et al. [13]. Because the universal feature selection extends this algorithm, we reproduce the details here as a precursor to the "Splitting multitask trees" section.

The CART algorithm defines the notion of impurity, which is used to grow either classification or regression trees. Only regression trees are necessary for gradient boosting, as we learn the real-valued loss gradients. Intra-node variance is the canonical example of regression impurity [12], but one can use any convex function that achieves its minimum when the predicted and ground-truth values match exactly. For notational simplicity, we use $i(\eta)$ to denote the impurity evaluated on node $\eta$ from this point forward. This impurity function can then be used to perform greedy tree growth.

The CART algorithm grows trees by recursively performing splits that greedily reduce the impurity of each node. Let $\eta_{f>v}$ and $\eta_{f\le v}$ represent the right and left child nodes after splitting on feature $f \in F$ at location $v \in \mathbb{R}$ for a $d$-dimensional feature set $F$, and define $D_t$ as the set of samples for task $t$. Further, let $|\cdot|$ indicate the number of samples in a node and $2^S$ indicate the power set of $S$. Then, the impurity decrease [12] of that split is defined as

$$\Delta i(\eta, f, v) \equiv i(\eta) - \frac{|\eta_{f \le v}|}{|\eta|}\, i(\eta_{f \le v}) - \frac{|\eta_{f > v}|}{|\eta|}\, i(\eta_{f > v}) \qquad (1)$$

where $\Delta i : 2^{D_t} \times F \times \mathbb{R} \to [0,\infty)$, and the optimal split is the tuple $(f, v)$ that provides the largest impurity decrease. We start from the root node, which contains all samples and has maximum impurity. Then, the tree is grown by iteratively applying the optimal split until a stopping criterion is reached. In practice, splitting is usually stopped once $\Delta i(\eta, f, v)$ or $|\eta|$ falls below a predefined threshold or once the tree reaches a predefined depth or number of leaf nodes.
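The greedy split search above can be sketched directly from Eq. (1). This is a minimal illustration (our own code, not the authors'), using intra-node variance as the impurity and midpoints of sorted unique values as candidate thresholds:

```python
import numpy as np

# Variance impurity i(eta) and the impurity decrease of Eq. (1).
def impurity(y):
    return float(np.var(y)) if len(y) else 0.0

def impurity_decrease(x, y, v):
    """Delta i(eta, f, v) for node samples (x, y) split at threshold v."""
    left, right = y[x <= v], y[x > v]
    n = len(y)
    return impurity(y) - len(left) / n * impurity(left) - len(right) / n * impurity(right)

def best_split(x, y):
    """Greedy search over midpoints of sorted unique feature values."""
    xs = np.unique(x)
    candidates = (xs[:-1] + xs[1:]) / 2
    gains = [impurity_decrease(x, y, v) for v in candidates]
    best = int(np.argmax(gains))
    return candidates[best], gains[best]

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
v, gain = best_split(x, y)  # the split at 1.5 removes all variance
```

Here the root impurity is 0.25 and both children are pure, so the optimal split at v = 1.5 achieves the maximum possible decrease of 0.25.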

We add a greedy approximation of the $\ell_1$ penalty to the impurity term to minimize the number of features utilized. CART trees may be used to select features, but they select many redundant features since there is no feature selection penalty. Thus, Xu et al. [13] modify the CART algorithm to use a penalized impurity function. Let $\mathbb{1}_f^t$ indicate whether feature $f$ is not used by a tree in task $t$. Then, the penalized impurity is

$$i_b^t(\eta) \equiv i(\eta) + \lambda_t \sum_{f \in F} \mathbb{1}_f^t \qquad (2)$$

for boosting round b of task t on node η. Then, the penalized impurity decrease at a split is defined as

$$\Delta i_b^t(\eta, f, v) \equiv \Delta i(\eta, f, v) - \lambda_t \mathbb{1}_f^t, \quad f \in F \qquad (3)$$

with $\Delta i_b^t : 2^{D_t} \times F \times \mathbb{R} \to \mathbb{R}$. For a given node $\eta$, we choose the split feature $f$ and location $v$ that result in the maximum information gain (Fig. 1b, Alg. C.2). Note that the penalized impurity decrease is no longer restricted to $[0,\infty)$. This is not an issue in practice because of the threshold for minimum impurity decrease. In practice, we use MSE with an improvement score ([17], Equation 35) for selecting splits. The Friedman MSE attempts to maximize the difference between node predictions while maintaining an equal number of samples per node, which is known to improve performance in some settings [18]. This penalized impurity decrease enables sparse task-specific feature selection using GBTs.
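The effect of Eq. (3) is simply to handicap features the ensemble has not yet used. A minimal sketch (our own naming, assuming the penalty applies only when the candidate feature is new):

```python
# Sketch of the penalized impurity decrease of Eq. (3): a split on a feature the
# ensemble already uses keeps its raw gain; a split on a new feature pays lambda.
def penalized_decrease(raw_decrease, feature, used_features, lam):
    """Subtract lambda_t only when `feature` is not yet used by this task's trees."""
    return raw_decrease - (lam if feature not in used_features else 0.0)

used = {"radius"}  # hypothetical feature names for illustration
assert penalized_decrease(0.30, "radius", used, lam=0.1) == 0.30
assert abs(penalized_decrease(0.30, "vdw_volume", used, lam=0.1) - 0.20) < 1e-12
```

A new feature therefore wins a split only when its raw gain exceeds the best already-used feature's gain by at least λ, which is what keeps the selected set small.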

Splitting multitask trees

Universal feature selection requires additional modifications to Breiman et al.’s CART trees [12]. For a feature to be universal, it must simultaneously be selected by all T tasks during greedy tree growth. Hence, we assume all T trees from a single boosting round are nearly identical. The only difference is that each tree may choose its own splitting location for the feature used in each non-terminal node. Thus, we must derive a splitting criterion that (1) chooses a universal feature on which to split all task trees, and (2) allows each task tree to choose its own splitting threshold for that feature.

We ensure universal features are important to all tasks via maximin optimization. Let $\eta_t \subseteq D_t$ indicate the data points in a tree node currently under consideration for task $t$. At boosting round $b$, we find the feature $f$ and split locations $v_1, \ldots, v_T$ that solve

$$\max_{f \in F}\ \min_{t \in \{1,\ldots,T\}}\ \max_{v_t \in \mathbb{R}} \Delta i_b^t(\eta_t, f, v_t), \qquad (4)$$

where $v_t$ is a split location on feature $f$ for task $t$. Note that the split location is found independently for each task. This method selects the feature that maximizes the minimum impurity decrease across all tasks while allowing unique split locations, covering requirements (1) and (2) from above. This approach ensures that all universal features are important for all tasks while keeping the selected universal feature set concise.
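Given each task's best per-feature gain (the inner max over $v_t$), the maximin choice of Eq. (4) reduces to a min over tasks followed by an argmax over features. A minimal sketch, assuming the per-task, per-feature gains are precomputed:

```python
import numpy as np

# Sketch of the maximin feature choice in Eq. (4): for each feature take the
# worst task's gain, then keep the feature whose worst case is best.
def maximin_feature(gains):
    """gains: T x d array; gains[t, f] is task t's best impurity decrease on f."""
    worst_case = gains.min(axis=0)       # min over tasks for every feature
    return int(np.argmax(worst_case))    # feature with the best worst case

gains = np.array([[0.9, 0.4],   # task 1: feature 0 looks excellent, feature 1 modest
                  [0.1, 0.3]])  # task 2: feature 0 is nearly useless
f = maximin_feature(gains)       # feature 1 wins: it helps *both* tasks
```

This is exactly why a feature that is outstanding for one task but useless for another is never selected as universal.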

Datasets

To quantify BoUTS’s performance on real-world data, we evaluate it on datasets covering various chemical scales and properties used to screen molecules for industrial or medical applications. Our data include seven datasets (Table 1): four related to small molecules, two to nanoparticles (NP), and one to proteins. Overall, we consider six different properties: the octanol/water partition coefficient (logP), Henry’s law constant (logHs), the melting (Tm) and boiling (Tb) temperature, solubility in water, and the zeta potential. Further experiments on non-chemical datasets are available in Appendix B.

Table 1.

Specialized, application-specific methods for each property and system combination. Note that these differ from the competing feature selection methods

System Property Method
Small Molecule logP ChemProp [15]
 logHs ChemProp [15]
 Tm ChemProp [15]
 Tb ChemProp [15]
Nanoparticle logP PubViNaS [16]
 Zeta Potential PubViNaS [16]
Protein Solubility GraphSol [29]

Construction

We chose datasets to ensure wide coverage of multiple properties and molecular scales: small molecules, proteins, and larger metal NPs. Our small molecule logP dataset was taken from Popova et al. [19], which provided SMILES and associated experimental logP values. For small molecule logHs, Tb, and Tm, we used the 2017 EPISuite [20] to extract experimentally-measured properties using the SMILES from the logP dataset. Not every SMILES was associated with a value for all properties, so SMILES from the latter three datasets form overlapping subsets of the logP SMILES. The cardinalities of these sets are detailed in Fig. 1c. For proteins, we use solubility values from Han et al. [21] and PDB structures from the Protein Data Bank [22] and AlphaFold Protein Structure Database [23]. To create the NP dataset, we use the PubViNaS  [16] database to obtain logP and zeta potential properties and structural information as PDB files. Overall, this resulted in seven different datasets across three different molecular scales.

Similarly, our full feature set was designed to capture as much nanoscale chemistry as possible. Such features were based on the radius, solvent accessible surface area, van der Waals surface area [36], atomic property distributions [24], depth from the convex hull of the molecule, WHIM descriptors [25], and tessellation descriptors [16]. Atomic weightings for these descriptors were similar to those used in previous works [16, 24]. Notably, not all features are computable for all molecular scales. For example, a feature describing volume will return NaN when run on a flat molecule. However, we don’t want to universally exclude such features, as they may be useful if defined for multiple datasets. Instead of dropping such features from all datasets, we individually drop features that are NaN or constant for each dataset in a group of datasets we call a category.

Construction of categories

We perform ablation tests on BoUTS by evaluating its performance on three categories of datasets, as shown in Fig. 1c: property, scale, and all. property contains three datasets that share similar solubility-related target properties but span different chemical spaces. scale contains three datasets that span similar chemical spaces of small molecules but have different target properties. all contains all seven datasets covering all chemical spaces and target properties used in this work. We use features that are well-defined and non-constant for all datasets as the candidate feature set for each category. These categories constitute an ablation test for BoUTS’s generalization capabilities.

Performing splits with overlapping datasets

In some (but not all) cases, the same molecule occurs in multiple datasets, so we perform a modified stratified split. We perform stratified splits on the overlaps of each dataset, which allows us to use all available data while preserving the correct train/validation/test split ratios for all datasets and preventing data leakage. Because the overlap of dataset samples depends on the datasets under consideration, the same dataset may be split differently when included in different categories. For example, small molecule logP is partitioned differently in each category since it overlaps with different datasets (see n in Fig. 2a–c). To address this issue, we re-evaluate competing methods for each category using the same splits described here.
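One way to realize the overlap-aware split described above is to group samples by which datasets contain them and then split each membership group at the target ratios, so a shared molecule never lands in train for one dataset and test for another. This is an illustrative sketch under that reading (all names are ours, not the authors' code):

```python
import random

# Hedged sketch of a stratified split over dataset overlaps: samples with the
# same membership pattern (set of datasets containing them) are split together,
# preserving the train/val/test ratios for every dataset and preventing leakage.
def overlap_split(membership, seed=0, ratios=(0.7, 0.2, 0.1)):
    """membership: dict sample_id -> frozenset of dataset names containing it."""
    groups = {}
    for sid, datasets in membership.items():
        groups.setdefault(datasets, []).append(sid)
    split = {"train": [], "val": [], "test": []}
    rng = random.Random(seed)
    for sids in groups.values():
        sids.sort()
        rng.shuffle(sids)
        n = len(sids)
        n_train, n_val = round(ratios[0] * n), round(ratios[1] * n)
        split["train"] += sids[:n_train]
        split["val"] += sids[n_train:n_train + n_val]
        split["test"] += sids[n_train + n_val:]
    return split

# toy example: 10 molecules only in logP, 10 shared between logP and Tb
membership = {i: frozenset({"logP"} if i < 10 else {"logP", "Tb"}) for i in range(20)}
s = overlap_split(membership)
```

Because membership patterns depend on which datasets are present, the same dataset is partitioned differently in each category, matching the behavior described above.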

Fig. 2.

Ablation tests and analysis of BoUTS-selected features: Feature ablation tests for BoUTS are shown in a, b, and c, and compare our selected features to specialized prediction methods. Violin plots show the performance distribution; the inner bars indicate the 25th and 75th percentiles, and the outer bars indicate the 5th and 95th percentiles. The white dot indicates the median performance. The top of d shows the dataset size (top axis) and the selection stability of single-task gradient-boosted feature selection (bottom axis). The bars indicate the 95% confidence interval. In the bottom section, the upper bar shows the mean stability across all tasks. The lower bar shows the stability of BoUTS's universal features, with the 95% confidence interval as a black bar. e shows the absolute Spearman correlation between the universal features as a graph, with clusters indicated by gray circles and node colors indicating the categories that selected that feature. An alternative visualization is provided in Fig. 9

Statistical comparison of stabilities

We use Nogueira et al. [14]’s measure of stability and a statistical test to quantify the selection stability of BoUTS. For BoUTS’s universal feature selection and single-task gradient-boosted feature selection, we perform feature selection on $M = 100$ randomized train/validation/test splits. We then encode the binary masks for universal features and each of $T$ independent tasks in matrices $Z_U, Z_1, \ldots, Z_T \in \{0,1\}^{M \times d}$, where $d$ is the number of features. In this encoding, $Z_{m,f}$ indicates whether feature $f$ was selected during the $m$th trial for matrix $Z$. Define $\hat{p}_f(Z) \equiv M^{-1}\sum_{i=1}^{M} Z_{i,f}$ as the empirical probability of selecting feature $f$, and $s_f^2(Z) \equiv \frac{M}{M-1}\hat{p}_f(Z)\left(1-\hat{p}_f(Z)\right)$ as “the unbiased sample variance of the selection of the $f$th feature.” Finally, define $\bar{k}_t$ as the average number of features selected for task $t$. Then, Nogueira et al. [14] define an estimate of feature selection stability as

$$\hat{\Phi}(Z) = 1 - \frac{\frac{1}{d}\sum_{f=1}^{d} s_f^2(Z)}{\frac{\bar{k}}{d}\left(1-\frac{\bar{k}}{d}\right)} \qquad (5)$$

Since the statistic $\hat{\Phi}$ weakly converges to a normal distribution [14], the $(1-\alpha)$ confidence interval is computed as

$$\left[\hat{\Phi}(Z) - F_N^{-1}\!\left(1-\tfrac{\alpha}{2}\right)\sqrt{v(\hat{\Phi}(Z))},\;\; \hat{\Phi}(Z) + F_N^{-1}\!\left(1-\tfrac{\alpha}{2}\right)\sqrt{v(\hat{\Phi}(Z))}\right] \qquad (6)$$

for the standard normal cumulative distribution function $F_N$ and the variance of $\hat{\Phi}(Z)$, denoted $v(\hat{\Phi}(Z))$. To compare the stability of the two models, we compute $Z_U$ and $Z_t$ as above, then calculate the test statistic for a two-sample, two-sided test of equality of means [14]. The test statistic is defined as

$$T_M = \frac{\hat{\Phi}(Z_U) - \hat{\Phi}(Z_t)}{\sqrt{v(\hat{\Phi}(Z_U)) + v(\hat{\Phi}(Z_t))}}, \qquad (7)$$

where $T_M$ asymptotically follows a normal distribution. Thus, the p-value is easily computed as $p = 2\left(1 - F_N(|T_M|)\right)$ [14].

To quantify the effect size between BoUTS and independent gradient boosted feature selection, we compute Cohen’s d [26] for task t as

$$\frac{\sqrt{2}\left(\hat{\Phi}(Z_U) - \hat{\Phi}(Z_t)\right)}{\sqrt{v(\hat{\Phi}(Z_U)) + v(\hat{\Phi}(Z_t))}} \qquad (8)$$

since we take an equal number of samples for each model.
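The stability pipeline above is compact enough to sketch end to end. This is our own illustrative code for Eqs. (5) and (7), operating on a binary selection-mask matrix $Z \in \{0,1\}^{M \times d}$:

```python
import numpy as np

# Sketch of Nogueira et al.'s stability estimate (Eq. 5) on a binary mask
# matrix Z: rows are trials, columns are features, Z[m, f] = 1 if feature f
# was selected in trial m.
def stability(Z):
    M, d = Z.shape
    p = Z.mean(axis=0)                       # empirical selection probabilities
    s2 = M / (M - 1) * p * (1 - p)           # unbiased per-feature variance
    k_bar = Z.sum(axis=1).mean()             # average number of selected features
    return 1 - s2.mean() / (k_bar / d * (1 - k_bar / d))

# Two-sample statistic of Eq. (7), given each estimate and its variance.
def z_statistic(phi_u, var_u, phi_t, var_t):
    return (phi_u - phi_t) / np.sqrt(var_u + var_t)

Z = np.array([[1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]])
phi = stability(Z)  # identical masks across all trials -> perfectly stable
```

When every trial selects exactly the same features, all per-feature variances vanish and the estimate attains its maximum of 1.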

Implementation of all methods

We utilized community implementations of each method when possible and reimplemented or modified algorithms when necessary. For Dirty LASSO [27], we used the community implementation by Janati [28]. Since this implementation [28] requires every task to have the same number of samples, we modified the code to mask the loss for outputs whose true label is undefined (e.g., protein Tm), preventing them from affecting the parameter updates. This masking is unoptimized and incurs a runtime penalty proportional to the number of datasets being analyzed. For MultiBoost [6], we reimplemented the algorithm from scratch using SciKit-Learn trees. We use the official GraphSol [29] implementation; however, we change the output activation from sigmoidal to exponential since our protein solubility dataset is defined over $(0, \infty)$ rather than (0, 1]. We implement the models for PubViNaS [16] using SciKit-Learn [18]. ChemProp is a graph neural network for small molecules, and we use the implementation provided by Heid et al. [15]. BoUTS was implemented by modifying SciKit-Learn’s gradient boosted trees.

Training and evaluation

All feature selection methods were trained across different selection penalties to create regularization paths. For each category, we perform a randomized 0.7/0.2/0.1 train/validation/test split using the method from the "Performing splits with overlapping datasets" section and rescale the feature vectors and labels such that each training dataset has a mean of 0 and a standard deviation of 1. Then, we perform a sweep over the penalty parameters on the training set for each selection method. For MultiBoost, we evaluate $2^n$ models for $n \in \{0, \ldots, 8\}$ to limit the number of features its trees select. For BoUTS, we set all universal and task-specific feature penalties equal to $\lambda_n$ and tested penalties that were equally spaced in log-space, with regularization $\log \lambda_n = 8(n-1)/19 - 4$ for $n \in \{1, \ldots, 20\}$. Dirty LASSO required a more fine-tuned balance between common and task-specific penalties, so we similarly took equally log-spaced penalties such that $\log \lambda_n^s \in [-4, -1]$ and $\log \lambda_n^b \in [-3.85, -1]$, where $n = 20$ and $\lambda_n^s, \lambda_n^b$ indicate the sparse and group-sparse penalties, respectively. This approach creates a regularization path for each method, which we later use to select a feature selection penalty.
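The BoUTS penalty grid above is a one-liner. A minimal sketch, assuming the natural logarithm:

```python
import numpy as np

# Penalty sweep: log(lambda_n) = 8(n - 1)/19 - 4 for n in {1, ..., 20},
# i.e., 20 penalties equally spaced in log-space from e^-4 to e^4.
n = np.arange(1, 21)
log_lambda = 8 * (n - 1) / 19 - 4
penalties = np.exp(log_lambda)
```

The endpoints are log λ₁ = −4 and log λ₂₀ = 4, so the sweep covers roughly three orders of magnitude in penalty strength.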

We use a unique approach to extract the selected features from each model in each regularization path. For MultiBoost, we consider a feature to be universal if it is used by at least one tree in each task or by the correlated task structure. BoUTS is similar, but we consider the multitask boosted trees instead of the correlated task structure. For both methods, a feature is task-specific if it is used by at least one but not all tasks. For Dirty LASSO, we consider a feature common if the group-sparse weight matrix uses it and task-specific if it is used only in the sparse weight matrix. We use a weight threshold of $\varepsilon = 10^{-4}$ to determine membership. This procedure provides the features selected by each model in the regularization path.
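For the tree-based methods, the universal/task-specific labeling rule above reduces to counting which tasks' trees use each feature. A hypothetical sketch (names are ours):

```python
# Sketch of the feature-labeling rule for tree-based methods: a feature is
# universal if trees from every task use it, task-specific if at least one but
# not all tasks use it, and unselected otherwise.
def label_features(usage, n_tasks):
    """usage: dict feature -> set of task indices whose trees use that feature."""
    labels = {}
    for feature, tasks in usage.items():
        if len(tasks) == n_tasks:
            labels[feature] = "universal"
        elif tasks:
            labels[feature] = "task-specific"
        else:
            labels[feature] = "unselected"
    return labels

# hypothetical feature names, 3 tasks
usage = {"radius": {0, 1, 2}, "mass_ccno": {1}, "group_5": set()}
labels = label_features(usage, n_tasks=3)
```

For Dirty LASSO the same rule applies, but "used" means the corresponding weight exceeds the threshold ε in the group-sparse or sparse matrix.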

We then choose a cutoff point to provide similar performance across all feature selection methods. For every subset of features selected by each model (9 + 20 + 20 subsets total), performance is measured by training LightGBM on the training set using only the selected features. LightGBM is used for both linearly and non-linearly selected features; more details can be found in Appendices D and E. For each method, we choose the feature penalty that directly precedes a 10% decrease in the explained variance on any task/training dataset. We then find the optimal inference model by performing a grid search with LightGBM on the training and validation sets, and we report performance based on the test set. Because the original labels for each dataset exist on different scales, we report the absolute error of the rescaled labels (0 mean and unit variance), which we call the normalized absolute error. This choice results in similar performance for each selection method and allows us to concentrate only on feature sparsity.
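The cutoff rule above admits a simple reading: walk the regularization path from weakest to strongest penalty and keep the last model before any task's explained variance drops more than 10% below its value at the start of the path. This sketch encodes that one reading (the function and threshold handling are our assumptions, not the authors' code):

```python
# Hedged sketch of the cutoff rule: choose the penalty index that directly
# precedes the first model where any task's explained variance falls more than
# 10% below the path's starting value for that task.
def choose_cutoff(path_scores, tol=0.10):
    """path_scores: list of dicts task -> explained variance, ordered by penalty."""
    baseline = path_scores[0]
    chosen = 0
    for i, scores in enumerate(path_scores):
        if any(scores[t] < (1 - tol) * baseline[t] for t in baseline):
            break
        chosen = i
    return chosen

path = [{"logP": 0.80, "Tb": 0.70},
        {"logP": 0.79, "Tb": 0.69},
        {"logP": 0.60, "Tb": 0.68}]  # logP drops more than 10% here
idx = choose_cutoff(path)            # keeps the second model (index 1)
```

Whether the 10% is measured against the start of the path or the preceding model is ambiguous in the text; the sketch assumes the former.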

Finally, we evaluate the specialized, application-specific methods on the same datasets to ensure a fair comparison. To train PubViNaS, we use a cross-validated grid search on the union of the training and validation sets. Model selection for GraphSol was not computationally feasible. We find that GraphSol is missing one protein from our training dataset (cfa), so this protein is skipped during GraphSol training. We use the default parameters for ChemProp [15] to train a multi-output GNN for 100 epochs to predict all small-molecule tasks in a given category. Prediction is performed on the same hold-out test sets as the feature selection methods.

Results

Feature selection using BoUTS

To verify that BoUTS’s results are not an artifact of highly correlated tasks, we computed the Spearman correlation of the target properties, as shown in Fig. 1d. The mean absolute correlation of our datasets is low (0.37) [30], and only two pairs of datasets have an absolute correlation greater than 0.50 [(Tm, logHs) and (Tm, Tb)]. This mixture of low positive and negative correlations ensures sufficient differences between tasks to render MultiBoost ineffective and to demonstrate the generality of BoUTS.

Moreover, we visualize the feature-space diversity of our datasets to show that BoUTS’s apparent generality is not an artifact of datasets with significant feature-space overlap. Fig. 1e shows a t-SNE plot [31] of the union of all datasets computed using the candidate features. Different types of molecules span distinct regions of the feature space, with small molecules separated from NPs and proteins. While NPs and proteins partially overlap, the NPs form distinct clumps along the space that proteins cover, since our larger protein dataset allows for a more thorough covering of the feature space. The disjointness of our datasets in feature space suggests that our later results represent the true generalization capabilities of our model.

While BoUTS selects universal features by construction (as they are used only if they improve the performance of every task), there is no guarantee about the number or predictive performance of the selected features. Therefore, we tested BoUTS’s universal feature selection on three different “categories” of data, property, scale, and all (see "Construction of categories" section and Fig. 1c): the first two categories contain datasets with similar properties or scales, while the third category contains all datasets.

Comparing BoUTS’s feature selection across categories, we see that it selects 8 out of 1,437 features for property, 6 out of 1,651 for scale, and 9 out of 1,205 for all (see "Construction of categories" section and Table 2). This small set of universal features (less than 1% for any category) is nearly as predictive as the original feature set, as seen in Fig. 2a–c. The specialized techniques (Table 1) each use a different, optimized feature space (which may not overlap with ours) and show the performance achievable without performing universal feature selection. Additionally, we compare the performance of our original feature set, universal and task-specific features, and only universal features on all three categories. When using universal and task-specific features, the median error is within the interquartile range of all specialized methods except for small molecule logP, even outperforming them on datasets such as NP logP (Fig. 2a–c). ChemProp achieves an RMSE of 0.42 on logP prediction versus an RMSE of 0.88 for LGBM using all features. Compared to specialized methods, BoUTS’s feature selection improves cross-dataset interpretability while frequently maintaining a comparable level of performance.

Table 2.

Universal features selected for each category: They are presented in alphabetical order and grouped across categories

Property Scale All
atomic_volume_cccc atomic_volume_cccc atomic_volume_cccc
atomic_volume_cccn
atomic_volume_ccco
atomic_volume_ccno
atomic_volume_oooo
covalent_radius_2
covalent_radius_3
electron_affinity_mean
electron_affinity_variance
electron_affinity_whim_axis2
electron_affinity_whim_d
electronegativity_mean
electronegativity_variance
evaporation_heat_convhull_median
group_5
ionization_energy1_mean
ionization_energy1_variance
mass_ccno
polarizability_whim_ax_density3
vdw_volume_variance vdw_volume_variance

BoUTS’s multitask trees also significantly improve selection stability [14] over simpler greedy approximations of the $\ell_1$ penalty (Fig. 2d). For example, we find no universal features if we run single-task GBT feature selection on each task independently. Additionally, if we run 100 randomized BoUTS replicates on all seven datasets, the universal features have a stability of 0.36 (higher is better, bounded by [0, 1]). By contrast, using GBT feature selection for nanoparticle zeta potential alone achieves a stability of only 0.14. We find a Cohen’s d of 18 and a p-value of 0.0 (to numerical precision) when using a two-sided Z-test. Indeed, BoUTS’s universal feature selection stability is comparable to that of the Protein dataset, which has 10 times as many samples. Full tables are available in Supplemental Tables 3, 4, 5. These results demonstrate that BoUTS’s multitask trees improve the worst-case feature selection stability compared to similar greedy optimization methods.

Table 3.

Summary of universal features selected by BoUTS: Stability is bounded by [0, 1], and higher is better

Type Stability Variance 95% confidence interval # Features Feature penalty
Universal Features 0.3577 0.0002 [0.3287,0.3867] 6.020 2.5

n=100 random train/validation/test splits

Table 4.

Stability of single-task GBT feature selection: computed for all molecule types and properties in the all category

| Type | Property | Stability | Variance | 95% CI | # Features | # Samples | Feature penalty |
|---|---|---|---|---|---|---|---|
| SM | logP | 0.6403 | 0.0001 | [0.6202, 0.6603] | 7.72 | 11,079 | 50 |
| Prot | Solubility | 0.3691 | 0.0002 | [0.3412, 0.3970] | 6.29 | 2,149 | 10 |
| SM | Boiling | 0.3003 | 0.0002 | [0.2742, 0.3264] | 7.50 | 1,185 | 10 |
| SM | logH | 0.2455 | 0.0001 | [0.2274, 0.2637] | 6.42 | 777 | 7.5 |
| SM | Melting | 0.2148 | 0.0001 | [0.1916, 0.2381] | 6.53 | 2,143 | 20 |
| NP | logP | 0.1754 | 0.0002 | [0.1477, 0.2031] | 4.80 | 147 | 2.5 |
| NP | Zeta Pot. | 0.1429 | 0.0001 | [0.1257, 0.1601] | 7.03 | 206 | 2.5 |
| Mean | | 0.2983 | | | 6.61 | | |
| Std | | 0.1688 | | | 0.97 | | |

SM, Prot, and NP stand for “Small Molecule,” “Protein,” and “Nanoparticle,” respectively. Stability is bounded by [0, 1], and higher is better. n=100 random train/validation/test splits. “# Samples" indicates the number of samples available per dataset

Table 5.

Comparing the stability of BoUTS’s universal features to single-task GBT feature selection: computed for all molecule types and properties in the all category

| Type | Property | Cohen’s d | p-value |
|---|---|---|---|
| Small molecule | logP | −22.21 | 0.000 |
| Protein | Solubility | −0.788 | 0.5775 |
| Small molecule | Boiling point | 4.073 | 3.977·10⁻³ |
| Small molecule | logH | 9.084 | 1.332·10⁻¹⁰ |
| Small molecule | Melting point | 10.65 | 5.085·10⁻¹⁴ |
| Nanoparticle | logP | 12.60 | 0.000 |
| Nanoparticle | Zeta potential | 17.66 | 0.000 |

Stability is bounded by [0, 1], and higher is better. n=100 random train/validation/test splits. Cohen’s d calculated using stability and variance from Supplemental Tables 3 and 4. p-value computed as described in the methodology. Note that some values are zero within numerical precision (i.e., Python returned 0)

Notably, BoUTS’s universal features form correlated clusters, suggesting further stability in the mechanisms selected. We create a correlation matrix for BoUTS’s universal features, where each entry indicates the average Spearman correlation between two selected features across datasets. We ignore datasets where at least one feature in a pair is undefined. Spectral clustering is performed on this matrix to find five groups of features (Fig. 2e). Three of the five clusters contain at least one feature from each category. The all category is represented in all clusters, and the property and scale categories are each missing from one cluster. This overlap indicates that similar information is captured for all categories, even when the selected features do not match exactly. Thus, we anticipate BoUTS to be even more stable when considering the mechanisms governing the selected features rather than the features themselves.
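The clustering procedure above can be sketched as follows. The helper is illustrative rather than the paper’s implementation: it assumes each dataset is an array of feature columns, with NaN columns marking features undefined for that dataset, and it averages absolute Spearman correlations only over datasets where both features in a pair are defined.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.cluster import SpectralClustering

def cluster_universal_features(feature_tables, n_clusters=5, seed=0):
    """feature_tables: one (n_samples, n_features) array per dataset,
    with NaN columns marking features undefined for that dataset."""
    d = feature_tables[0].shape[1]
    corr_sum = np.zeros((d, d))
    counts = np.zeros((d, d))
    for X in feature_tables:
        defined = np.where(~np.isnan(X).any(axis=0))[0]
        ranks = rankdata(X[:, defined], axis=0)        # Spearman = Pearson on ranks
        rho = np.abs(np.corrcoef(ranks, rowvar=False))
        corr_sum[np.ix_(defined, defined)] += rho
        counts[np.ix_(defined, defined)] += 1
    # Average only over datasets where both features of a pair are defined
    affinity = np.divide(corr_sum, counts,
                         out=np.zeros_like(corr_sum), where=counts > 0)
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                               random_state=seed)
    return model.fit_predict(affinity)
```

Monotonically related features (high absolute Spearman correlation) land in the same cluster even when their raw values differ nonlinearly.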

Comparing BoUTS to other selection methods

BoUTS is not the first method to model both shared and independent task structures, but current approaches are unsuitable for selecting universal features from real-world scientific datasets. We compare against MultiBoost and Dirty LASSO as they embody the major shortcomings in the literature.

MultiBoost [6] adapts gradient boosting to model correlated and independent task structures. The features used to model the correlated task structure may be considered universal. However, MultiBoost cannot model tasks with uncorrelated outputs, so it cannot select universal features for common scientific datasets like ours (Fig. 1d). Indeed, MultiBoost models no correlated structure for our datasets and selects only two or three universal features via coincidental overlap. Additionally, the lack of a sparsity penalty results in the selection of multiple redundant features. Smaller sets are easier to hold in memory [32], so large, redundant feature sets are less interpretable.

Dirty LASSO [27] employs a “group-sparse plus sparse” [33] penalty to linearly model a common (not universal) and task-specific structure. This approach results in larger feature sets because common features may not describe all tasks and must be supplemented with additional features. Furthermore, Dirty LASSO’s linear models may select multiple features that are related by a nonlinear function, which increases the feature set even more.

By comparing the feature sets selected by each method, we find that BoUTS selects fewer features than either MultiBoost or Dirty LASSO while achieving comparable performance. We perform the same evaluations as in Fig. 2a–c, this time comparing BoUTS and the competing methods in Fig. 3a–c. To establish an objective comparison, we chose feature penalties that lead to similar performance metrics for each method (see the "Training and evaluation" section for details). However, Fig. 3a–c show that BoUTS’s feature set is often much smaller than that of either competing method and frequently contains fewer total features than the common features selected by Dirty LASSO. Overall, BoUTS selects fewer total features than its competitors in 10 out of 13 cases, sometimes by a large margin. For example, despite comparable performance, for NP zeta potential (Fig. 3a) BoUTS selects only nine universal features to Dirty LASSO’s 182 and MultiBoost’s 72.

Fig. 3.

Fig. 3

Comparing the performance of BoUTS and competing selection methods: The top half of a shows the performance of all evaluated feature selection algorithms compared to specialized methods. The violin plots are defined in Fig. 2a. The bottom half shows the number of features selected by each method. No features are indicated for the specialized methods, as they are not selection methods. The hatched section indicates the universal or common features that are selected, and the remaining features are task-specific. Plots b and c are defined similarly for the property and scale categories

Figure 3a–c show that competing methods only outperform BoUTS when they select significantly more features, like for NP logP, where the number of features differs by nearly two orders of magnitude. We presume that this difference is caused by MultiBoost’s lack of an explicit sparsity term and Dirty LASSO’s linearity assumption. Nevertheless, this combination of similar performance and smaller feature sets suggests that BoUTS is more capable of providing insight into the physical mechanisms controlling the feature-property relationships of interest. This comparison shows that BoUTS is significantly more interpretable than competing methods.

Additionally, we observe that the number of task-specific features selected by BoUTS closely aligns with theoretical assumptions regarding universal feature behavior. BoUTS generally selects fewer task-specific features when a task has fewer samples. As described in "Splitting multitask trees" section, BoUTS adds features to the universal set by repeatedly splitting on the feature that maximizes the minimum information gain (prediction improvement) across all tasks. Adding new features is penalized, so BoUTS selects universal features only if they are predictive for all tasks. Because weak feature-property relationships may only be detectable in larger datasets, the smallest datasets will limit the universal feature set. However, tasks with larger datasets will compensate by selecting additional task-specific features, which is the desired behavior of universal and task-specific feature selection.
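The maximin selection described above can be illustrated with a minimal sketch: each candidate feature is scored by the minimum, over tasks, of its best single-split variance reduction, minus a penalty for features not yet in the universal set. The function names and the gain measure are simplified stand-ins for the actual implementation.

```python
import numpy as np

def split_gain(x, y, threshold):
    """Variance reduction from splitting samples at x <= threshold."""
    mask = x <= threshold
    left, right = y[mask], y[~mask]
    if left.size == 0 or right.size == 0:
        return 0.0
    n = y.size
    return y.var() - left.size / n * left.var() - right.size / n * right.var()

def best_universal_feature(tasks, used, penalty):
    """Maximin selection: the winning feature maximizes the minimum
    best-split gain across all tasks, minus a penalty if it is new."""
    n_features = tasks[0][0].shape[1]
    best, best_score = None, -np.inf
    for f in range(n_features):
        per_task = []
        for X, y in tasks:
            thresholds = np.unique(X[:, f])[:-1]
            per_task.append(max((split_gain(X[:, f], y, t) for t in thresholds),
                                default=0.0))
        score = min(per_task) - (0.0 if f in used else penalty)
        if score > best_score:
            best, best_score = f, score
    return best
```

Because the score is the worst case over tasks, a feature wins only if it is informative for every task, which is exactly the behavior described for universal features.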

Analysis of selected features

Analysis of BoUTS’s task-specific features showcases its ability to recover well-known and quantitative feature-property relationships. For example, selecting the “total molecular mass” feature for small molecule Tb reflects an established correlation with intermolecular dispersion forces [34]. Similarly, BoUTS selects solvent-accessible surface area for small molecule logP, commonly used to compute non-polar contributions to the free energy of solvation [35]. Such findings illustrate that BoUTS’s sparse feature selection can recover scientifically meaningful feature-property relationships.

Further, we find that BoUTS’s features are highly specialized to the types of molecules and properties under consideration. We note that half of the universal features for the scale category are VSA style descriptors [36], which are not universal for either the property or all categories. This observation corroborates previous findings on VSA descriptors’ effectiveness for solubility and Tb in small molecules [36]. Additionally, their absence from the property and all categories is expected since VSA descriptors are not anticipated to apply to other length scales or molecular motifs.

Finally, BoUTS’s universal features provide insight into factors controlling chemical properties across multiple scales. When analyzing the property category, we note the lack of task-specific features for NP, which suggests that the universal features are sufficiently descriptive for logP predictions of metal-cored NPs. Although an NP’s solubility may partially depend on its metal core, metal-related features are weakly predictive for small molecules and proteins. Similarly, tessellation descriptors [16], linked to carbon, nitrogen, and oxygen atom subgroups, are the only extrinsic descriptors chosen for the property category. If we compute the information gain provided by each feature the ensemble uses, we find that tessellation features contribute 46% of the total information gain. Because the model’s competitive NP logP predictions rely heavily on tessellation, we suspect the model focuses on the NPs’ organic ligand surface, which is in contact with the solvent. A researcher may correctly conclude from these results that ligating known soluble molecules to a nanoparticle will improve the nanoparticle’s solubility.

Discussion

BoUTS overcomes the limitations of existing feature selection methods, which often struggle with the complexities of real-world data, particularly nonlinearities and uncorrelated tasks. Through our case study, we find that BoUTS selects far fewer features than existing methods (typically less than 1% of features) while generally maintaining good performance. We find that approximately ten universal features are predictive for all tasks, and that task-specific features further improve performance (Fig. 2a–c). BoUTS exhibits desirable sparsity patterns, such as selecting more task-specific features for tasks with larger datasets. This remarkable sparsity enhances interpretability, especially in smaller datasets where understanding the underlying drivers is crucial. By focusing on the essential, universal features, BoUTS offers researchers a clearer path to uncovering the unifying principles underlying their datasets.

We attribute BoUTS’s performance to its universal, nonlinear, and more general approach to feature selection. LASSO-like methods, such as Dirty LASSO, relax universal features to “common” features, so more features must be selected to compensate for underrepresented tasks. Additionally, linearity likely leads to the redundant selection of non-linearly related features. Similarly, current tree ensemble methods like MultiBoost require datasets to follow restrictive assumptions, such as having correlated outputs. Since our dataset outputs are not correlated, MultiBoost cannot be applied. BoUTS performs nonlinear feature selection with neither relaxation nor such assumptions, leading to sparser feature sets without performance degradation.

As demonstrated by our results, BoUTS’s unique, domain-agnostic ability to select universal features from uncorrelated tasks may help researchers identify unifying principles across multiple datasets. For example, BoUTS correctly identifies the connection between surface chemistry and solubility in proteins, nanoparticles, and small molecules without any a priori knowledge of chemistry. Beyond solubility, our universal features are predictive across all seven datasets. These results raise the broader question of whether BoUTS can systematically uncover analogous structures across diverse real-world domains.

Although many works discuss chemistry datasets and how they relate to the underlying “chemical space,” our universal feature selection is most comparable to the “consensus chemical space” introduced by Medina-Franco et al. [37]. Given multiple types of molecules with different chemical features, this space is composed of pooled versions of the original features that have been distilled into a shared feature space. Indeed, BoUTS may be interpreted as an approach to identifying a small and performant consensus chemical space that we call “universal features.” Future research may investigate whether universal features have general predictive performance outside of their original predictive tasks.

BoUTS’s universal features allow domain-specific knowledge transfer between datasets. We note that applying universal feature selection to small datasets improves the stability of the selection (p = 0 to numerical precision) while providing an explicit connection between datasets. This stability increases the likelihood that the selected features are optimal for the true data distribution [38], and cross-dataset connections enable researchers to apply domain knowledge from well-researched areas (e.g., proteins) to poorly-understood areas (e.g., nanoparticles) [39]. Additionally, by finding universal features for multiple properties of a class of chemical structures (e.g., small molecules), it may be possible to anticipate how structural modifications that optimize one property may unintentionally alter another.

While BoUTS offers significant advantages, it is crucial to acknowledge its limitations and ongoing research efforts. BoUTS’s primary limitation is its reliance on greedy optimization. Although it outperforms existing methods, BoUTS’s greedy feature selection, greedy tree growth, and boosting may still yield a feature set that is either larger or less predictive than the optimal one. Such a mis-specified set may exclude features that are only important once conditioned on another feature (Appendix G) or include multiple partially-related features. Additionally, the greedy tree growth algorithm may select features that are important for only a subset of the feature space rather than the whole space. However, in Appendix G, we investigate BoUTS’s sensitivity to conditionally-dependent features using a toy dataset and find that BoUTS outperforms Dirty LASSO by a factor of two.

Our multitask trees also require that each task be split using the same features in the same nodes, a seemingly restrictive assumption. However, there are two reasons we expect BoUTS to work well in practice. First, BoUTS empirically outperforms the non-greedy Dirty LASSO. Second, limiting BoUTS to trees of depth one (decision stumps) would eliminate this restriction, and decision stumps can be boosted to high accuracy [40]; we would therefore expect deeper trees (though biased) to help BoUTS, which is what we found in preliminary experiments. Thus, further evaluation is required to determine whether the same-branch-split restriction harms performance.

Finally, BoUTS is more computationally expensive than linear methods like Dirty LASSO, which may be significant for large datasets (see Appendix A), though it is still significantly more scalable than kernel methods. BoUTS is also currently limited to single-output regression trees; this could potentially be addressed by adapting the multi-output regression trees from Segal and Xiao [41] to the BoUTS problem setting.

Graph neural networks, like ChemProp, may achieve higher predictive accuracy than feature-based approaches, like BoUTS, but are not universally applicable. Specifically, we note two potential reasons that ChemProp predicts logP so well: the number of training points and the fact that there is precedent for modeling logP as a sum of atomic contributions. Our logP dataset has approximately seven times as many data points as the other datasets, meaning that the GNN is less likely to overfit this dataset than the others. Additionally, logP has previously been estimated using a sum over atomic contributions [42, 43], which is similar to the global node aggregation performed by ChemProp. However, we note that ChemProp cannot be applied to structures that lack covalent bonds (such as our metal-cored nanoparticles), so it cannot be used to derive universal features analogously to BoUTS. It remains to be seen whether less restrictive GNN architectures, such as those proposed by Satorras et al. [44], can be applied in an interpretable manner across diverse chemical datasets.

While demonstrating superior selection capabilities in our chemical datasets, the true potential of BoUTS lies in its scalability and broader applicability. Future research efforts may optimize BoUTS using histogram trees, GPU acceleration, multiprocessing [7, 45], and parallel feature processing, paving the way for massive datasets with millions of features and samples, a hallmark of modern scientific research. Additionally, BoUTS is not an application-specific approach, so by leveraging BoUTS’s scalability, sparsity, and unique ability to perform universal feature selection, researchers across fields may gain deeper insights into their data.

In summary, BoUTS offers unique sparsity, interpretability, and universal feature selection advantages. Its universal features facilitate the transfer of domain knowledge between datasets and improve feature selection stability on small datasets. Moreover, BoUTS’s universal features perform competitively despite being predictive for all datasets. We anticipate these capabilities will empower researchers to analyze high-dimensional or poorly understood datasets, accelerating progress across various scientific fields.

Language model

During the preparation of this work, the authors used GitHub Copilot as a programming assistant and ChatGPT4 to generate simple functions. GitHub Copilot was also used to assist in LaTeX typesetting. ChatGPT4 and Claude2 were used to review the manuscript’s early drafts and brainstorm ideas, including the name “BoUTS.” After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Acknowledgements

The authors thank Dr. C. Scott from the University of Michigan for insightful discussions. We acknowledge Advanced Research Computing, a division of Information and Technology Services at the University of Michigan, for computational resources and services provided for the research.

Abbreviations

MTL

Multitask learning

BoUTS

Boosted universal and task-specific selection

GBT

Gradient boosted tree

NP

Nanoparticle

CART

Classification and regression tree

Appendix

Appendix A: Runtime comparisons between BoUTS and competing methods

Because it uses a tree-based algorithm, BoUTS may be slower than linear methods such as Dirty LASSO; however, their relative speeds are largely a consequence of the chosen algorithms. Indeed, two points should be noted:

  1. Our modified implementation [28] of Dirty LASSO is slow, as optimizing it was not our primary goal. We expect a better implementation to be N× faster, where N is the number of datasets being used. Additionally, Nesterov acceleration may improve the convergence rate from O(1/k) to O(1/k2) where k is the number of iterations [46].

  2. Our implementations of BoUTS and MultiBoost [18] do not use any parallelization or acceleration methods that are typical of tree-based boosting methods [7, 45], which are shown to achieve >100× speed-up.
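The acceleration mentioned in point 1 can be illustrated with FISTA, the standard Nesterov-accelerated proximal gradient method, which improves the convergence rate from O(1/k) to O(1/k²). The sketch below applies it to a plain LASSO objective rather than the full Dirty LASSO penalty, so it is only a schematic of the speed-up being referenced.

```python
import numpy as np

def fista_lasso(X, y, lam, iters=500):
    """FISTA (Nesterov-accelerated ISTA) for 0.5*||Xw - y||^2 + lam*||w||_1.
    Plain proximal gradient converges at O(1/k); FISTA at O(1/k^2)."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    w = z = np.zeros(X.shape[1])
    t = 1.0
    for _ in range(iters):
        grad = X.T @ (X @ z - y)
        v = z - grad / L
        w_new = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0)  # soft-threshold
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        z = w_new + (t - 1) / t_new * (w_new - w)                # momentum step
        w, t = w_new, t_new
    return w
```

On a noiseless sparse problem, the iterate recovers the true support with only the small bias induced by the soft-threshold.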

With these caveats in mind, we show a comparison between methods in Fig. 4.

Fig. 4.

Fig. 4

The runtime of each feature selection method (in seconds) on each category of datasets

Appendix B Evaluating BoUTS on non-chemical datasets

All of our evaluations in the main text are for chemistry datasets. However, it is important to confirm that BoUTS’s performance generalizes to other, nonchemical datasets. We demonstrate that BoUTS outperforms Dirty LASSO on (1) wine quality prediction [47] (Fig. 5a), (2) inverse dynamics of a robotic arm [48] (Fig. 5b), and (3) predicting the price of rental housing [49] (Fig. 5c).

Fig. 5.

Fig. 5

Examples of BoUTS and Dirty LASSO on non-chemical datasets: The performance is quantified using R², and the bars are hatched to indicate that universal features are being used

These datasets are preprocessed by standardizing features and regression targets and by converting set inputs (e.g., types of animals allowed in the building) to boolean membership flags. We select the same number of universal features using both BoUTS and Dirty LASSO, then train LightGBM on the selected features using default hyperparameters. To make the wine dataset multitask, we split the red and white wines into different tasks. To make the rental dataset multitask, we group entries by state and keep the ten states with the most units for rent. For the latter, we also drop incorrectly formatted rows, timestamps, city names, and columns that contain natural language (e.g., the rental description). Finally, we log-transform the square footage and rental price of all samples so that they more closely follow a normal distribution.
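A hypothetical sketch of this preprocessing for the rental dataset; the column names (`state`, `price`, `square_feet`, `pets_allowed`) are illustrative stand-ins, not the dataset's actual schema:

```python
import numpy as np
import pandas as pd

def build_rental_tasks(df, top_k=10):
    """Illustrative preprocessing: set columns become boolean flags,
    skewed columns are log-transformed, numeric columns standardized,
    and each of the top-k states becomes one task."""
    df = df.dropna(subset=["state", "price", "square_feet"]).copy()
    # Set-valued column -> boolean membership flags
    flags = df["pets_allowed"].str.get_dummies(sep=",").astype(bool)
    df = pd.concat([df.drop(columns="pets_allowed"), flags], axis=1)
    # Log-transform skewed columns so they are closer to normal
    df["price"] = np.log(df["price"])
    df["square_feet"] = np.log(df["square_feet"])
    # One task per state, keeping the states with the most listings
    top_states = df["state"].value_counts().head(top_k).index
    tasks = {}
    for state in top_states:
        sub = df[df["state"] == state].drop(columns="state").copy()
        num = sub.select_dtypes(include="number").columns
        sub[num] = (sub[num] - sub[num].mean()) / sub[num].std()  # standardize
        tasks[state] = sub
    return tasks
```

Each returned task shares the same feature columns, which is the layout multitask selection methods expect.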

The code for this experiment can be found in si_notebooks/other_multitask.ipynb in the associated repository.

Appendix C: Derivation of BoUTS boosting procedure

We begin by defining our problem statement and the global objective that we want to optimize. Define $[n] \equiv \{1, \dots, n\}$, $\odot$ as the Hadamard product, and $|s|$ as the cardinality of a set. We further let $t$ indicate the $t$-th task, $a_i$ indicate index $i$ for vector $a$, $a \ll b$ indicate that $a$ is much less than $b$, and subscripts $U$ and $\ell$ indicate universal and task-specific features, respectively. Let $\mathcal{X} \subseteq \mathbb{R}^d$ be the feature space and $\mathcal{Y}$ be an abstract output space for regression or classification. Let each task $t \in [T]$ have a dataset $D_t \equiv \{(x, y) \mid x \in \mathcal{X},\, y \in \mathcal{Y}\}$. In our setting, the data is described by a small subset of features, which can be disjointly split into universal and task-specific feature sets. Let $\|\cdot\|_0$ indicate the $\ell_0$-pseudo-norm, which counts the number of nonzero elements in a vector. Thus, in our setting, $P(y \mid x) \equiv P(y \mid (m_t + m_U) \odot x)$ for $\|m_t + m_U\|_0 \ll d$ and $m_U \odot m_t = 0$, where $m_t, m_U \in \{0, 1\}^d$ indicate binary masks for task $t$ and universal features, respectively. Then, given a family of parametric functions $\phi_\theta \in \Phi$ such that $\phi_\theta : \mathcal{X} \to \mathcal{Y}$ for $\theta \in \Theta$, a loss $\mathcal{L} : \mathcal{Y} \times \mathbb{R} \to [0, \infty)$, and a regularizer $R : \Theta \to [0, \infty)$, we aim to find

$$\underset{\substack{\theta_1, \dots, \theta_T \\ m_1, \dots, m_T,\, m_U}}{\arg\min} \; \sum_{t \in [T]} \left[ \sum_{(x, y) \in D_t} \mathcal{L}\big(y,\, \phi_{\theta_t}\!\big((m_t + m_U) \odot x\big)\big) + \lambda_t \|m_t\|_0 + \alpha R(\theta_t) \right] + \lambda_U \|m_U\|_0. \tag{9}$$

Here, $\alpha, \lambda_t, \lambda_U \in [0, \infty)$ weight the regularization of the parameters $\theta_t$, the task-specific feature sparsity of task $t$, and the universal feature sparsity of all tasks, respectively.

Now, we specify the function class $\Phi$ over which we optimize the total loss. Define $\mathcal{T}$ as the class of all regression trees, with tree $\tau \in \mathcal{T} : \mathcal{X} \to \mathbb{R}$. For the purpose of analysis, we make the assumption that this set is finite [6, 50]. This holds in practice, where we learn trees of finite depth using finite datasets. Indeed, given a finite dataset, all depth-limited trees can be grouped into a finite number of sets, where each set contains trees that provide identical outputs when evaluated on that dataset. For the purposes of optimization, we only need one tree from each set. Define $\tau(x) \equiv [\tau_1(x) \cdots \tau_{|\mathcal{T}|}(x)] \in \mathbb{R}^{|\mathcal{T}|}$ for $\tau_i \in \mathcal{T}$ as the vector of “tree-transformed features” [6]. Then, we define $\Phi$ as the set of all additive tree ensembles, with $\phi_{\theta^t}(x) \equiv \sum_{i \in [|\mathcal{T}|]} \theta_i^t \tau_i(x) = \langle \theta^t, \tau(x) \rangle$ for $\theta^t \in \Theta \equiv \mathbb{R}_+^{|\mathcal{T}|}$, as $\mathcal{T}$ is closed under negation. Note that the feature transformation $\tau(\cdot)$ is the same across all tasks. Finally, we make the assumption that $\|\theta^t\|_0 \le B$ (i.e., the optimal $\theta^t$ is $B$-sparse) [13]. In other words, given a feature vector $x$, we non-linearly map it to the space of “tree-transformed features” as $\tau(x)$ and learn a sparse linear predictor in this “latent space.” Note that this method admits both regression and classification, depending on the loss selected. Thus, we aim to optimize over the class of all additive tree ensembles.

To simplify the selection of universal features, we select universal and task-specific features sequentially. Recall that we aim to select universal features that are important to all tasks and task-specific features that are important to one or more (but not all) tasks. Direct optimization of our global objective, Eq. (9), is intractable, so we optimize greedily: the first stage selects universal features without interference from task-specific features, and the second stage allows each task to independently select task-specific features while reusing the previously selected universal features. Each stage is itself optimized greedily via gradient boosting, which yields an efficient boosting-style approximation to our global objective.

For the first stage, we show that the trees that select only universal features can be greedily optimized using gradient boosting. Define $\mathcal{T}_U \subseteq \mathcal{T}$ as the trees that only use features that are predictive for all tasks (training datasets), where $\theta_U^t \in \mathbb{R}_+^{|\mathcal{T}|}$ are the weights for trees in $\mathcal{T}_U$. Then, we first optimize

$$\underset{\theta_U^1, \dots, \theta_U^T}{\arg\min} \; \sum_{t \in [T]} \left[ \sum_{(x, y) \in D_t} \mathcal{L}\big(y,\, \langle \tau(x), \theta_U^t \rangle\big) + \alpha \|\theta_U^t\|_1 \right] + \lambda_U \|m_U\|_0 \tag{10}$$

to select universal features. Note that there is no need to optimize $m_U$ directly, since it is determined by $\theta_U^t,\, t \in [T]$. The following derivation is provided in [13] but is reproduced here for completeness. We write the subgradient of this total loss for a task $t$ with respect to the parameter $\theta_U^t$ as

$$\sum_{(x, y) \in D_t} \partial_{\theta_U^t} \mathcal{L}\big(y,\, \langle \tau(x), \theta_U^t \rangle\big) + \partial_{\theta_U^t} \Big[ \lambda_U \|m_U\|_0 + \alpha \|\theta_U^t\|_1 \Big]. \tag{11}$$

We first derive the subgradient of the regularizers. Let $F$ indicate a $d$-dimensional feature space, and $1_f^t$ indicate that feature $f$ is previously unused but is included in the current tree for task $t$. Because the selection of a tree $\tau \in \mathcal{T}_U$ adds $\beta$ to one dimension of the parameter vector $\theta_U^t$ at each boosting round, similar to [13], we can differentiate the universal feature regularization as

$$\partial_{\theta_{U,j}^t} \Big[ \lambda_t \|m_U\|_0 + \alpha \|\theta_U^t\|_1 \Big] = \lambda_t \sum_{f \in F} 1_f^t + \alpha \beta, \tag{12}$$

where $\tau_j$ is the tree under consideration, indexed by $\theta_{U,j}^t$. This penalizes both the addition of new features and of new trees to the model. Now, we focus on the gradient of the loss, which we rewrite as

$$\frac{\partial \mathcal{L}}{\partial \theta_U^t} = \frac{\partial \mathcal{L}}{\partial \langle \tau(x), \theta_U^t \rangle} \, \frac{\partial \langle \tau(x), \theta_U^t \rangle}{\partial \theta_U^t}, \tag{13}$$

using the chain rule. By linearity, $\frac{\partial \langle \tau(x), \theta_U^t \rangle}{\partial \theta_{U,j}^t} = \tau_j(x)$. Define $g_{U,b}^t(x) \equiv -\frac{\partial \mathcal{L}}{\partial \langle \tau(x), \theta_U^t \rangle}$. Then, the subgradient of our total loss is

$$\sum_{(x, y) \in D_t} -g_{U,b}^t(x)\, \tau_j(x) + \lambda_t \sum_{f \in F} 1_f^t + \alpha \beta. \tag{14}$$

Here, we assume that $\sum_{(x, y) \in D_t} \tau_j(x)^2$ is constant [6]. In theory, this can be achieved by normalizing the output of each tree during optimization. However, because the trees are grown greedily, we do not normalize them in practice. Noting that $g_{U,b}^t(x)$ is a constant with respect to $\tau_j$, we rewrite the optimization problem as

$$\begin{aligned}
j_{b+1}^t &= \underset{\tau_j \in \mathcal{T}_U}{\arg\min} \sum_{(x, y) \in D_t} -g_{U,b}^t(x)\,\tau_j(x) + \lambda_t \sum_{f \in F} 1_f^t + \alpha\beta \\
&= \underset{\tau_j \in \mathcal{T}_U}{\arg\min} \sum_{(x, y) \in D_t} -2\,g_{U,b}^t(x)\,\tau_j(x) + 2\Big(\lambda_t \sum_{f \in F} 1_f^t + \alpha\beta\Big) \\
&= \underset{\tau_j \in \mathcal{T}_U}{\arg\min} \sum_{(x, y) \in D_t} \Big[ g_{U,b}^t(x)^2 - 2\,g_{U,b}^t(x)\,\tau_j(x) + \tau_j(x)^2 \Big] + 2\Big(\lambda_t \sum_{f \in F} 1_f^t + \alpha\beta\Big) \\
j_{b+1}^t &= \underset{\tau_j \in \mathcal{T}_U}{\arg\min} \sum_{(x, y) \in D_t} \big(g_{U,b}^t(x) - \tau_j(x)\big)^2 + \lambda_t \sum_{f \in F} 1_f^t + \alpha\beta,
\end{aligned} \tag{15}$$

where the last line absorbs the constants into the regularization hyperparameters and $j_{b+1}^t$ is the index of the selected tree for task $t$ at the current boosting round. In the setting where our loss is the squared error, the term $g_{U,b}^t(x)$ is the residual of the previous prediction [17]. As in [13], we note that the first term is also the squared error, an impurity function. Thus, the loss structure lends itself to optimization via a boosting-style algorithm. Using a constant step size, this gives us an additive model defined recursively as $\theta_{U,b+1}^t = \theta_{U,b}^t + \beta e_{j_{b+1}^t}$ with $\theta_{U,0}^t = 0$, where $\theta_{U,b}^t$ is the parameter vector at boosting round $b$ and $e_j$ is the $j$-th canonical basis vector. As in [13], we note that each gradient boosting round increases $\|\theta_U^t\|_1$ by $\beta$. Then, after $r$ boosting rounds, we have $\|\theta_U^t\|_1 = \beta r$, meaning that $\ell_1$ regularization is equivalent to early stopping [17]. Thus, we drop the explicit $\ell_1$-regularization terms and limit the number of boosting rounds instead. With this in mind, we consider $\theta_U^t$ and $\theta_{U,r}^t$ to be equivalent. This ends our restatement of the derivation from [13].
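For squared error, this recursion amounts to repeatedly fitting the current residuals with the best candidate tree and taking a fixed-size step. A minimal sketch over a finite candidate set (arbitrary callables stand in for trees here, and the feature penalty is omitted for brevity):

```python
import numpy as np

def boost(trees, X, y, rounds, step=0.1):
    """Functional-gradient boosting with squared error: each round picks
    the candidate whose fit to the current residuals is best, then takes
    a constant-size step. `trees` is a finite candidate set of callables,
    mirroring the finite-|T| assumption in the text."""
    pred = np.zeros(len(y))
    weights = np.zeros(len(trees))       # theta: one weight per candidate tree
    for _ in range(rounds):              # the l1 norm of theta grows by `step`
        residuals = y - pred             # each round, so stopping early acts
        scores = [((residuals - t(X)) ** 2).sum() for t in trees]  # as l1 reg.
        j = int(np.argmin(scores))
        weights[j] += step
        pred += step * trees[j](X)
    return weights, pred
```

Each round adds `step` to one coordinate of `weights`, so after r rounds the weight vector’s ℓ1 norm is exactly `step * r`, illustrating the early-stopping-as-ℓ1 argument above.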

Next, we show that the trees that use universal and/or task-specific features can be similarly optimized. We define $\theta_\ell^t \in \mathbb{R}_+^{|\mathcal{T}|}$ as the weights for trees in $\mathcal{T}$, which may use either universal or task-specific features, where the final weight vector will be $\theta^t \equiv \theta_U^t + \theta_\ell^t$. Given $\theta_U^t$, the optimization of task-specific features is independent for each task $t \in [T]$ and can be written as

$$\underset{\theta_\ell^t}{\arg\min} \sum_{(x, y) \in D_t} \mathcal{L}\big(y,\, \langle \tau(x), \theta_U^t + \theta_\ell^t \rangle\big) + \lambda_t \|m_t\|_0 + \alpha \|\theta_\ell^t\|_1, \tag{16}$$

where, as above, $m_t$ does not need to be directly optimized. We find the optimal $\theta_\ell^t$ similarly to $\theta_U^t$. We meet the $B$-sparse assumption by letting $B$ equal the total number of boosting rounds (both universal and task-specific). More generally, we can use $B_U$ and $B_\ell$ to indicate the number of universal and task-specific boosting rounds, where $B = B_U + B_\ell$.

We greedily learn $\theta_U^t$ and $\theta_\ell^t$ using gradient boosted CART trees with a penalized impurity function. $\theta_\ell^t$ is learned using single-task GBT feature selection, which adds a penalty function to the impurity function, as described in the "Splitting single-task trees" section, Eq. (2). However, $\theta_U^t$ requires a specialized tree-splitting criterion to ensure that all tasks agree on the universal features. Thus, we use maximin optimization to grow trees for all tasks at once, as described in the "Splitting multitask trees" section, Eq. (4). Pseudocode for the universal (SI Alg. 1) and task-specific (SI Alg. 2) splitting conditions and the BoUTS boosting algorithm (SI Alg. 3) are provided below.

Algorithm 1.

Algorithm 1

Universal splitting condition

Algorithm 2.

Algorithm 2

Task-specific splitting condition

Algorithm 3.

Algorithm 3

BoUTS boosting algorithm

Appendix D: Mixing linear feature selection and nonlinear regression

In this paper, we evaluate the performance of several feature selection algorithms by training a LightGBM model on the features selected by each of these models. Since Dirty LASSO is a linear model, one might think that the selected features will perform better using a linear model than a nonlinear model. In Fig. 6, we plot the error for predictions made using both ridge regression (linear) and LightGBM (nonlinear) using the features selected by Dirty LASSO. For all tests, we use the default parameters for each method. As seen in the figure, LightGBM outperforms ridge regression, so the experiments shown in Fig. 3 provide the best-case scenario for Dirty LASSO.
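A sketch of such a linear-versus-nonlinear comparison, using scikit-learn's GradientBoostingRegressor as a stand-in for LightGBM so the example stays self-contained:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

def compare_downstream_models(X, y, selected, seed=0):
    """Fit a linear and a nonlinear model on the selected feature columns
    and report held-out R^2 for each. GradientBoostingRegressor stands in
    for LightGBM so the sketch needs only scikit-learn."""
    Xs = X[:, selected]
    X_tr, X_te, y_tr, y_te = train_test_split(Xs, y, random_state=seed)
    scores = {}
    for name, model in [("ridge", Ridge()),
                        ("gbt", GradientBoostingRegressor(random_state=seed))]:
        model.fit(X_tr, y_tr)
        scores[name] = r2_score(y_te, model.predict(X_te))
    return scores
```

On targets with nonlinear feature-property relationships, the boosted model scores markedly higher than ridge regression, mirroring the trend in Fig. 6.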

Fig. 6.

Fig. 6

Demonstrating that LightGBM outperforms ridge regression: a, b, and c show comparisons between LightGBM and ridge regression for features selected by Dirty LASSO. Violin plots show the performance distribution; the inner bars indicate the 25th and 75th percentiles, and the outer bars indicate the 5th and 95th percentiles. The white dot indicates the median performance. Hatched plots indicate that only the universal features were used, and unhatched plots indicate that universal and task-specific features were used

The code for this experiment can be found in si_notebooks/linear_vs_nonlinear.ipynb in the associated repository.

Appendix E: Why use gradient boosted models?

After feature selection, we refit a gradient boosted model (LightGBM) using the features selected by each feature selection method. We chose gradient boosting because this class of models tends to have the highest performance on tabular datasets [51], and LightGBM [7] specifically because of its computational efficiency and scalability. Here, we also demonstrate the superior performance of LightGBM.

In general, Figs. 7 and 8 show that ridge regression performs the worst, and tree-based methods (random forest and LGBM) perform the best. LGBM is the best overall. Simpler nonlinear methods such as kernel ridge regression and KNN are competitive when fewer samples are available (e.g., Nanoparticle logP, Small Molecule Tm and logHs), and the NN generally performs well but results in some highly inaccurate predictions (e.g., Small Molecule logP). These results are roughly consistent with existing work [52].

Fig. 7.

Fig. 7

Demonstrating that LightGBM is the best overall model on the original feature set: a, b, and c show comparisons between multiple machine learning algorithms on the chemistry datasets used in this study. Here, we evaluate the performance of each method without feature selection to highlight their innate performance. Violin plots show the performance distribution; the inner bars indicate the 25th and 75th percentiles, and the outer bars indicate the 5th and 95th percentiles. The white dot indicates the median performance. “KNN” indicates “k-nearest neighbors,” “Ridge ” indicates “ridge regression,” “Kernel Ridge” indicates “kernel ridge regression,” and “NN” indicates a “neural network”

Fig. 8.

Fig. 8

Demonstrating that LightGBM is the best overall model using the universal features selected by BoUTS: a, b, and c show comparisons between multiple machine learning algorithms on the chemistry datasets used in this study. Here, we evaluate the performance of each method with feature selection to highlight their innate performance. Violin plots show the performance distribution; the inner bars indicate the 25th and 75th percentiles, and the outer bars indicate the 5th and 95th percentiles. The white dot indicates the median performance. “KNN” indicates “k-nearest neighbors,” “Ridge” indicates “ridge regression,” “Kernel Ridge” indicates “kernel ridge regression,” and “NN” indicates a “neural network”

The code for this experiment can be found in si_notebooks/compare_different_models.ipynb in the associated repository.

Appendix F: BoUTS selected universal features

In Table 2, we list the universal features selected by BoUTS for each category. We do not include the task-specific features, as there are too many, and they are provided with the associated code.

Recall from Fig. 2d that BoUTS’s selected universal features are not entirely stable across different runs or splits, despite being more stable than independently-selected features. This result, combined with the fact that each category contains different datasets, means that the selected universal features are expected to differ between categories. For example, given a set of correlated features, different groups of datasets may lead to different features being selected from this correlated set.

Appendix G: Universal feature stability

Here, we provide stability metrics computed for BoUTS and single-task GBT feature selection. Stability and variance are computed using code from [14]. All hyperparameters are identical for every model in this table except for the feature penalty, which was adjusted so that BoUTS and GBT feature selection return approximately the same number of features. We observe that, in general, stability tends to increase with dataset size. Thus, universal features significantly improve stability for datasets with few samples, without reducing predictive performance. Details are provided in SI Tables 3 and 5, and visualizations are provided in Figs. 2c and 9.
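The stability measure of [14] can be computed directly from a binary matrix recording which features each run selected. The sketch below is our own minimal re-implementation of that metric, not the referenced authors' code:

```python
import numpy as np

def nogueira_stability(Z):
    """Feature-selection stability in the sense of Nogueira et al. [14].

    Z: (M, d) binary matrix; Z[m, f] = 1 if run m selected feature f.
    Returns a value <= 1; 1 means every run selected the same feature set.
    """
    Z = np.asarray(Z, dtype=float)
    M, d = Z.shape
    p_hat = Z.mean(axis=0)                  # selection frequency per feature
    s2 = M / (M - 1) * p_hat * (1 - p_hat)  # unbiased per-feature variance
    k_bar = Z.sum(axis=1).mean()            # average number of selected features
    denom = (k_bar / d) * (1 - k_bar / d)   # variance under random selection
    return 1 - s2.mean() / denom

# Perfectly stable: every run selects the same 2 of 4 features.
stable = nogueira_stability([[1, 1, 0, 0]] * 5)
# Unstable: runs disagree about which 2 features to keep.
unstable = nogueira_stability([[1, 1, 0, 0], [0, 0, 1, 1],
                               [1, 0, 1, 0], [0, 1, 0, 1]])
print(stable, unstable)
```

Identical selections give a stability of 1, while runs that disagree completely can score below zero, which is why the per-dataset values in SI Tables 3 and 5 are directly comparable across methods.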

Fig. 9.


Alternative visualization of Fig. 2e with included feature names. The edges indicate the absolute Spearman correlation between the universal features selected for each category, with clusters indicated by circular brackets on the outside of the graph. The colors in each node indicate the categories that selected that feature

Appendix H: Selecting universal features that are conditionally important

It is possible that a universal feature is conditionally important given a task-specific feature, but is not important on its own. More specifically, suppose we have two tasks that depend on random vectors $x_1 = [t_1, u_1, v_1]$ and $x_2 = [v_2, u_2, t_2]$, where $u_i$, $t_i$, and $v_i$ indicate universal, task-specific, and uninformative features, respectively, for $i \in \{1, 2\}$. Furthermore, suppose that each random vector $x_i$ is paired with an output $y_i = f_i(x_i)$ for some function $f_i$. We say that $y_i$ is conditionally dependent on, but marginally independent of, $u_i$ when $y_i \overset{d}{=} y_i | u_i$ and $y_i \overset{d}{\neq} y_i | u_i, t_i$, where $a | b$ indicates that $a$ is conditioned on $b$. Fig. 10 shows a uniformly-weighted mixture of Gaussian distributions that demonstrates this situation, with two 3d Gaussian distributions producing the output $-1$ and two producing the output $1$. In this case, we see that $E[y_i] = 0$, $E[y_i | u_i] = E[y_i | t_i] = 0$, and $E[y_i | u_i, t_i] \in [-1, 1]$, depending on the values of $u_i$ and $t_i$. Additionally, we note that $y_i$ is independent of $v_i$. Because we select universal features before task-specific features, it seems possible that BoUTS will not be able to identify such conditionally dependent universal features. We evaluate this claim on the dataset below, and find that BoUTS outperforms Dirty LASSO in this setting.
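A construction along these lines can be generated directly. The cluster centers, spreads, and the random-forest probe below are our own illustrative choices, not the paper's exact parameters:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 1000  # samples per cluster

def make_task():
    """XOR-like mixture of four 3d Gaussians: y = +1/-1 depends on the
    (u, t) quadrant, so y depends on (u, t) jointly but on neither alone."""
    blocks, ys = [], []
    for cu, ct, label in [(-2, -2, 1.0), (2, 2, 1.0), (-2, 2, -1.0), (2, -2, -1.0)]:
        u = rng.normal(cu, 0.5, n)
        t = rng.normal(ct, 0.5, n)
        v = rng.normal(0.0, 0.5, n)  # uninformative feature
        blocks.append(np.column_stack([u, t, v]))
        ys.append(np.full(n, label))
    return np.vstack(blocks), np.concatenate(ys)

X, y = make_task()
u = X[:, 0]

# Marginally, y is (nearly) uncorrelated with the universal feature u...
print("corr(y, u) =", np.corrcoef(y, u)[0, 1])

# ...but (u, t) jointly predict y almost perfectly (shuffled folds so
# every training split sees all four clusters).
cv = KFold(n_splits=3, shuffle=True, random_state=0)
r2 = cross_val_score(RandomForestRegressor(n_estimators=50, random_state=0),
                     X[:, :2], y, cv=cv, scoring="r2").mean()
print("R^2 using (u, t):", r2)
```

The near-zero marginal correlation alongside the near-perfect joint fit is exactly the marginal-independence/conditional-dependence pattern described above.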

Fig. 10.


An example of conditionally-important universal features: Here, we plot a synthetic 3d dataset with 1000 samples in each cluster. Both tasks share one universally important feature, and each task has one feature that is important for that task but has zero predictive power for the other task. Notably, the universal and task-specific features are only predictive when both features are selected. Red indicates a y value of 1, and blue a y value of −1. The right plot has been rotated around the vertical axis to improve visualization

It should be noted that our example is a nonlinear regression problem, which violates the assumptions of linear models. This is because such an (in)dependence condition for universal features is not possible for linear functions. If we define $y_i = a u_i + b t_i + c v_i + d$ for constants $a, b, c, d$ and variables $u_i, t_i, v_i$, we can see that, in general, $y_i \overset{d}{=} y_i | u_i$ is only true when $a = 0$, in which case $u_i$ is not identified as a universal feature. Thus, in principle, this setting cannot be solved using linear methods like Dirty LASSO.
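This argument is easy to verify numerically: on an XOR-style target, least squares assigns a near-zero coefficient to the universal feature. A small self-contained check (our own construction, not the paper's experiment):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# XOR-style target: the sign of u*t, which no linear function of (u, t) captures.
u = rng.choice([-1.0, 1.0], 2000) + rng.normal(0, 0.2, 2000)
t = rng.choice([-1.0, 1.0], 2000) + rng.normal(0, 0.2, 2000)
y = np.sign(u * t)

lin = LinearRegression().fit(np.column_stack([u, t]), y)
print("coefficients:", lin.coef_)  # both near zero: no linear signal in u or t
print("R^2:", lin.score(np.column_stack([u, t]), y))  # near zero
```

With both coefficients forced toward zero, a sparsity-inducing linear method such as Dirty LASSO would simply discard the universal feature, as argued above.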

We find that, in the extremely challenging (and quite unrealistic) situation where every cluster contains exactly the same number of samples, our algorithm does indeed fail to reliably select the correct universal features. Specifically, across 1000 random seeds, BoUTS selects the correct number of features 30% of the time when a single sample count n, drawn from the range [500, 1000], is assigned to all clusters.

However, independently selecting the number of samples within each cluster from the range [500, 1000] results in a 68% chance of selecting the correct universal feature with BoUTS. Although these numbers are not impressive on their own, it should be noted that BoUTS outperforms the current state-of-the-art (Dirty LASSO) by >2× in both cases. Indeed, Dirty LASSO selects the universal feature 0% and 24% of the time for the same experiments.

The code for this experiment can be found in si_notebooks/cond_univ_sel.ipynb in the associated repository.

Author contributions

Matt Raymond: methodology, software, validation, formal analysis, investigation, writing - original draft, writing - review & editing, visualization. Jacob Saldinger: methodology, software, formal analysis, investigation, data curation, writing - original draft. Paolo Elvati: conceptualization, methodology, software, writing - original draft, writing - review & editing. Angela Violi: conceptualization, methodology, writing - original draft, writing - review & editing, supervision, funding acquisition.

Funding

This work was supported by the BlueSky Initiative, funded by the University of Michigan College of Engineering (P.I. Violi); ECO-CBET Award Number (FAIN): 2318495; the National Science Foundation Graduate Research Fellowship Program DGE 1256260 (J. Saldinger); and the J. Robert Beyster Computational Innovation Graduate Fellowship (M. Raymond).

Data availability

Our precomputed features for the chemistry datasets and the source data for Figs. 1, 2, and 3 are available at the following URL/DOI: https://doi.org/10.7302/f33q-0741. Publicly hosted implementations of all machine learning code are available under the GPL3 license: https://gitlab.eecs.umich.edu/mattrmd-public/gendesc/gendesc-main-repo.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Matt Raymond, Email: mattrmd@umich.edu.

Angela Violi, Email: avioli@umich.edu.

References

  • 1.Dereli O, Oğuz C, Gönen M (2019) A multitask multiple kernel learning algorithm for survival analysis with application to cancer biology. In: Chaudhuri K, Salakhutdinov R (eds) Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol 97. PMLR, Long Beach, CA, USA, pp 1576–1585. https://proceedings.mlr.press/v97/dereli19a.html
  • 2.Valmarska A, Miljkovic D, Konitsiotis S, et al (2017) Combining multitask learning and short time series analysis in parkinson’s disease patients stratification. In: ten Teije A, Popow C, Holmes JH, et al (eds) 16th Conference on Artificial Intelligence in Medicine, vol 10259. Springer International Publishing, Vienna, Austria, pp 116–125. 10.1007/978-3-319-59758-4_13
  • 3.Yuan H, Paskov I, Paskov H et al (2016) Multitask learning improves prediction of cancer drug sensitivity. Sci Rep 6(1):31619. 10.1038/srep31619 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Sun X, Araujo RB, Santos EC et al (2024) Advancing electrocatalytic reactions through mapping key intermediates to active sites via descriptors. Chem Soc Rev 53:7392–7425. 10.1039/D3CS01130E [DOI] [PubMed] [Google Scholar]
  • 5.Weng B, Song Z, Zhu R et al (2020) Simple descriptor derived from symbolic regression accelerating the discovery of new perovskite catalysts. Nat Commun 11(1):3513. 10.1038/s41467-020-17263-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Chapelle O, Shivaswamy P, Vadrevu S et al (2010) Boosted multi-task learning. Mach Learn 85(1):149–173. 10.1007/s10994-010-5231-6 [Google Scholar]
  • 7.Ke G, Meng Q, Finley T, et al (2017) LightGBM: A highly efficient gradient boosting decision tree. In: Guyon I, Von Luxburg U, Bengio S, et al (eds) Proceedings of the 31st International Conference on Neural Information Processing Systems. Curran Associates Inc., Red Hook, NY, USA, pp 3149–3157, 10.5555/3294996.3295074
  • 8.Argyriou A, Evgeniou T, Pontil M (2008) Convex multi-task feature learning. Mach Learn 73(3):243–272. 10.1007/s10994-007-5040-8 [Google Scholar]
  • 9.Jebara T (2004) Multi-task feature and kernel selection for svms. In: Brodley C (ed) Proceedings of the 21st International Conference on Machine Learning. Association for Computing Machinery, New York, NY, USA, p 55, 10.1145/1015330.1015426
  • 10.Li C, Georgiopoulos M, Anagnostopoulos GC (2014) A unifying framework for typical multitask multiple kernel learning problems. IEEE Trans Neural Netw Learn Syst 25(7):1287–1297. 10.1109/TNNLS.2013.2291772 [Google Scholar]
  • 11.Peng J, An L, Zhu X, et al (2016) Structured sparse kernel learning for imaging genetics based alzheimer’s disease diagnosis. In: MICCAI 2016, vol 9901. Springer, Athens, Greece, pp 70–78, 10.1007/978-3-319-46723-8_9 [DOI] [PMC free article] [PubMed]
  • 12.Breiman L, Friedman J, Stone CJ et al (1984) Classification and regression trees, 1st edn. Taylor & Francis Group, Boca Raton. Accessed 7 Dec 2022 [Google Scholar]
  • 13.Xu Z, Huang G, Weinberger KQ, et al (2014) Gradient boosted feature selection. In: Macskassy S, Perlich C, Leskovec J, et al (eds) Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, NY, USA, pp 522–531. 10.1145/2623330.2623635
  • 14.Nogueira S, Sechidis K, Brown G (2018) On the stability of feature selection algorithms. J Mach Learn Res 18(174):1–54 [Google Scholar]
  • 15.Heid E, Greenman KP, Chung Y et al (2024) Chemprop: a machine learning package for chemical property prediction. J Chem Inf Model 64(1):9–17. 10.1021/acs.jcim.3c01250 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Yan X, Sedykh A, Wang W et al (2020) Construction of a web-based nanomaterial database by big data curation and modeling friendly nanostructure annotations. Nat Commun 11(1):2519. 10.1038/s41467-020-16413-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Statist 29(5):1189–1232 [Google Scholar]
  • 18.Pedregosa F, Varoquaux G, Gramfort A, et al (2023) Scikit-learn: Machine learning in Python. https://github.com/scikit-learn/scikit-learn [DOI] [PMC free article] [PubMed]
  • 19.Popova M, Isayev O, Tropsha A (2018) Deep reinforcement learning for de novo drug design. Sci Adv 4(7):eaap7885. 10.1126/sciadv.aap7885. Accessed 19 June 2024 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Estimation Programs Interface Suite (2006) Environmental Protection Agency
  • 21.Han X, Wang X, Zhou K (2019) Develop machine learning-based regression predictive models for engineering protein solubility. Bioinformatics 35(22):4640–4646. 10.1093/bioinformatics/btz294 [DOI] [PubMed] [Google Scholar]
  • 22.Berman H, Henrick K, Nakamura H (2003) Announcing the worldwide protein data bank. Nat Struct Mol Biol 10(12):980. 10.1038/nsb1203-980 [DOI] [PubMed] [Google Scholar]
  • 23.Jumper J, Evans R, Pritzel A et al (2021) Highly accurate protein structure prediction with alphafold. Nature 596(7873):583–589. 10.1038/s41586-021-03819-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Isayev O, Oses C, Toher C et al (2017) Universal fragment descriptors for predicting properties of inorganic crystals. Nat Commun 8(1):15679. 10.1038/ncomms15679 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Todeschini R, Gramatica P (1997) The whim theory: new 3d molecular descriptors for qsar in environmental modelling. SAR QSAR Environ Res 7(1–4):89–115. 10.1080/10629369708039126 [Google Scholar]
  • 26.Cohen J (1988) Statistical power analysis for the behavioral sciences, 2nd edn. Lawrence Erlbaum Associates, New York [Google Scholar]
  • 27.Jalali A, Sanghavi S, Ruan C et al (2010) A dirty model for multi-task learning. In: Lafferty J, Williams C, Shawe-Taylor J et al (eds) Advances in neural information processing systems 23. Curran Associates, Inc., Vancouver, pp 1–9 [Google Scholar]
  • 28.Janati H (2021) MuTaR: Multi-task regression in python. https://github.com/hichamjanati/mutar
  • 29.Chen J, Zheng S, Zhao H et al (2021) Structure-aware protein solubility prediction from sequence through graph convolutional network and predicted contact map. J Cheminform 13(1):1–10. 10.1186/s13321-021-00488-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Kaptein M, Heuvel E (2022) Statistics for data scientists: an introduction to probability, statistics, and data analysis, 1st edn. Springer Nature Switzerland, Cham. 10.1007/978-3-030-10531-0 [Google Scholar]
  • 31.Maaten L, Hinton GE (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605 [Google Scholar]
  • 32.Miller GA (1956) The magical number seven, plus or minus two: some limits on our capacity for processing information: psychological Review. Psychol Rev 63(2):81–97. 10.1037/h0043158. Accessed 7 Mar 2024 [PubMed] [Google Scholar]
  • 33.Rao N, Cox C, Nowak R, et al (2013) Sparse overlapping sets lasso for multitask learning and its application to fMRI analysis. In: Burges C, Bottou L, Welling M, et al (eds) Proceedings of Advances in Neural Information Processing Systems 26. Curran Associates, Inc., Lake Tahoe, CA, USA, pp 1–9, https://papers.nips.cc/paper_files/paper/2013/hash/a1519de5b5d44b31a01de013b9b51a80-Abstract.html
  • 34.Wessel MD, Jurs PC (1995) Prediction of normal boiling points of hydrocarbons from molecular structures. J Chem Inf Comput Sci 35(1):68–76. 10.1021/ci00023a010 [Google Scholar]
  • 35.Liu C, Elvati P, Majumder S et al (2019) Predicting the time of entry of nanoparticles in lipid membranes. ACS Nano 13(9):10221–10232. 10.1021/acsnano.9b03434. Accessed 19 June 2024 [DOI] [PubMed] [Google Scholar]
  • 36.Labute P (2000) A widely applicable set of descriptors. J Mol Graph Model 18(4):464–477. 10.1016/S1093-3263(00)00068-1 [DOI] [PubMed] [Google Scholar]
  • 37.Medina-Franco JL, Chávez-Hernández AL, López-López E et al (2022) Chemical multiverse: an expanded view of chemical space. Mol Inf 41(11):e2200116. 10.1002/minf.202200116. Accessed 04 Mar 2024 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Lee HW, Lawton C, Na YJ et al (2013) Robustness of chemometrics-based feature selection methods in early cancer detection and biomarker discovery. Stat Appl Genet Mol 12(2):207–223. 10.1515/sagmb-2012-0067 [DOI] [PubMed] [Google Scholar]
  • 39.Saldinger JC, Raymond M, Elvati P et al (2023) Domain-agnostic predictions of nanoscale interactions in proteins and nanoparticles. Nat Comput Sci 3(5):393–402. 10.1038/s43588-023-00438-x [DOI] [PubMed] [Google Scholar]
  • 40.Freund Y, Schapire RE (1995) A desicion-theoretic generalization of on-line learning and an application to boosting. In: Vitányi P (ed) Comput Learn Theory. Springer, Berlin, pp 23–37 [Google Scholar]
  • 41.Segal M, Xiao Y (2011) Multivariate random forests. WIREs Data Min Knowl Discov 1(1):80–87. 10.1002/widm.12 [Google Scholar]
  • 42.Wildman SA, Crippen GM (1999) Prediction of physicochemical parameters by atomic contributions. J Chem Inf Comput Sci 39(5):868–873. 10.1021/ci990307l. Accessed 19 June 2024 [Google Scholar]
  • 43.Zeng X, Ye X, Liu D et al (2025) A new simple and efficient molecular descriptor for the fast and accurate prediction of log p. J Materi Inform. 10.20517/jmi.2024.61 [Google Scholar]
  • 44.Satorras VG, Hoogeboom E, Welling M (2021) E(n) equivariant graph neural networks. In: Meila M, Zhang T (eds) Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol 139. PMLR, pp 9323–9332. https://proceedings.mlr.press/v139/satorras21a.html
  • 45.Chen T, Guestrin C (2016) XGBoost: A scalable tree boosting system. In: Krishnapuram B, Shah M, Smola A, et al (eds) Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, NY, USA, Kdd ’16, pp 785–794. 10.1145/2939672.2939785.
  • 46.Nesterov YE (1983) A method of solving a convex programming problem with convergence rate O(1/k²). Dokl Akad Nauk SSSR 269(3):543–547 [Google Scholar]
  • 47.Cortez P, Cerdeira A, Almeida F et al (2009) Modeling wine preferences by data mining from physicochemical properties. Decis Support Syst 47(4):547–553. 10.1016/j.dss.2009.05.016 [Google Scholar]
  • 48.Vijayakumar S, Schaal S (2000) Lwpr: An O(n) algorithm for incremental real time learning in high dimensional space. In: Proceedings of the 17th International Conference on Machine Learning (ICML), pp 1079–1086
  • 49.Anonymous (2019) Apartment for Rent Classified. UCI Machine Learning Repository. 10.24432/C5X623
  • 50.Ustimenko A, Beliakov A, Prokhorenkova L (2023) Gradient boosting performs gaussian process inference. In: Liu Y, Kim B, Nickel M, et al (eds) The 11th International Conference on Learning Representations. OpenReview.net, Kigali, Rwanda, pp 1–29. https://openreview.net/forum?id=3VKiaagxw1S
  • 51.Shwartz-Ziv R, Armon A (2022) Tabular data: deep learning is not all you need. Inf Fusion 81:84–90. 10.1016/j.inffus.2021.11.011 [Google Scholar]
  • 52.Caruana R, Niculescu-Mizil A (2006) An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd International Conference on Machine Learning. Association for Computing Machinery, New York, NY, USA, ICML ’06, pp 161–168. 10.1145/1143844.1143865
