Abstract
Identifying the relevant input features which contribute to the output of a clinical prediction model can enhance model explainability. To make the explanations more personalized, instance-wise feature selection (IWFS) methods can be adopted, where features are selected specifically for each input instance. Existing IWFS methods often grapple with feature selection instability, and thus unreliable interpretation. As the relevant features of different instances in a dataset often overlap, feature grouping tricks have been proposed to regularize the selection, but often at the expense of downstream prediction accuracy. To this end, we propose a novel instance-wise feature grouping method called FlexGPC to achieve robust and stable selection by learning i) flexible representations of feature groups, and ii) flexible combinations of feature groups, both implemented using neural networks. To evaluate the effectiveness of FlexGPC, we explore various feature group combination schemes and conduct extensive experiments for performance comparison using real-world electronic health records (EHR) data. Our experimental results show that FlexGPC outperforms all the SOTA baselines in terms of accuracy and feature selection stability for both downstream mortality and next-admission diagnosis prediction tasks. We also illustrate that computational phenotyping can be achieved at the same time, with the identified feature groups being potential phenotypes.
Keywords: Explainability, Feature selection, Deep learning, Electronic health records, Predictive analytics
Introduction
Machine learning (ML) methods have been found promising for clinical prediction tasks. Other than achieving high accuracy, their explainability is always emphasized due to the safety-critical nature of healthcare applications and the increasing legal and ethical concerns around AI models [1–3]. Identifying relevant input features which contribute to the prediction outcome can allow clinicians to discover the key factors which lead to the outcome based on the ML model. This is essentially a feature selection problem which has been well studied in the literature [4, 5], and widely used for explaining relevant risk factors for different diseases [6–10]. For example, [9] adopted a bagging-based feature selection framework and identified smoking habits, lack of exercise, and unbalanced diet of both mothers and children as the key risk factors of childhood obesity.
Conventional feature selection methods can only identify a common feature subset for all data instances. Yet there are many cases where the relevant feature subset varies over the instances in the dataset. For example, the relevant subset of clinical features for medical diagnosis should depend on the specific health condition of an individual patient. To support more personalized explainability, instance-wise feature selection (IWFS) methods can be adopted to identify the relevant feature subset per data instance [11, 12]. As the selected feature subset is considered relevant to the specific output of the downstream task, the IWFS model can be seen as an “explainer” that explains the prediction results through the selected features.
Despite the promising results, existing IWFS methods often grapple with feature selection instability, and thus unreliable interpretation. It is always desirable for a feature selection method to select a consistent subset of features under reasonable variations of an input [13, 14]. Achieving that for IWFS, however, is challenging since many more selection variables need to be estimated. Neural network models (e.g., MLPs) have been proposed to represent the selection mapping [11, 15–17], yet learning the selection network often suffers from overfitting.
With the observation that the relevant features of individual data instances in fact overlap to varying degrees, feature grouping tricks can be explored to regularize the feature selection and improve the selection stability. Instance-wise feature grouping (IWFG) methods have recently been proposed [16, 18] to carry out the instance-wise selection over a set of feature groups instead of individual features. In addition, the feature groups obtained via learning also form interpretable feature grouping patterns (e.g., phenotypes from EHR data), which have also been understood to be salient for the prediction task. Introducing the grouping for the gain in explainability and stability, however, often results in degradation of the downstream prediction accuracy due to the constraints imposed for the regularization.
To this end, we propose Flexible Group-wise Combination (FlexGPC), which aims to explain the prediction output by learning i) a set of more flexibly represented feature groups and ii) a selection network to achieve flexible combination (selection) of the feature groups. Compared with the existing IWFG methods, FlexGPC tries to “soften” the feature group representation so that the relative importance of each feature within a feature group can be captured, and to “soften” their combination to open up more options for combining the feature groups to form the feature mask. The learning objective of FlexGPC is also designed to further emphasize the model interpretability. In this paper, two particular feature group combination schemes, namely convex combination and restricted affine combination, are adopted and integrated into the proposed FlexGPC. The former is a natural choice, while the latter enables a more flexible combination by allowing some features to be “de-selected”. Figure 1 illustrates the overall architecture of FlexGPC, which can be incorporated into different model architectures for various clinical prediction tasks.
Fig. 1.
FlexGPC achieves instance-wise feature selection via adaptive feature group combination for enhancing clinical prediction model explainability. $G_j$ are the feature groups; FlexGPC combines them for each data point using the mixture weights $s$
We conduct extensive experiments to evaluate the effectiveness of the proposed FlexGPC for mortality and next-admission diagnosis prediction tasks using real-world EHR datasets. By properly relaxing the assumptions commonly used in existing IWFG methods, we demonstrate that FlexGPC can achieve substantial improvement in both prediction accuracy and selection stability over the SOTA baselines we tested. The improvement becomes more obvious when the data missingness rate is high, which is common for EHR data. Among the two feature group combination schemes, restricted affine combination gives substantially better performance. We provide interpretation of the feature groups identified from the real-world EHR datasets. To illustrate its applicability to other problem domains, we also apply FlexGPC to image data and gene expression data to demonstrate its effectiveness. The key contributions of this paper can be summarized as follows:
We propose a novel IWFG model called FlexGPC that enables more flexible feature representation and grouping to achieve robust, stable, and more explainable instance-wise feature selection.
We design the selection network to implement convex and restricted affine combinations for feature grouping in FlexGPC, and the corresponding learning algorithm to balance the objectives of flexibility of feature grouping and model interpretability.
We show that FlexGPC can be incorporated into different clinical analytics models for mortality and next-admission diagnosis prediction, with significant performance improvement over the SOTA IWFS methods in terms of both prediction accuracy and feature selection stability.
Related Work
This section provides a brief review of methods proposed for instance-wise feature selection and grouping. With the objective of enhancing model explainability, a number of them were evaluated based on EHR data analytics tasks.
Instance-wise Feature Selection
Conventional feature selection aims at identifying a (global) relevant subset of features for the whole dataset. The problem has been well studied under different settings for the selection task, including supervised, semi-supervised, and unsupervised [19, 20]. IWFS instead tries to identify a distinct feature subset for each data instance. The selector-predictor approach is commonly adopted, in which a selector network is learned to map each input to a specific feature selection mask and the masked input is fed to a predictor network. For instance, L2X [11] performs instance-wise feature selection for explaining black-box models by maximizing the mutual information between the selected features and the response variable. INVASE [12] extends L2X so that the size of the selected feature subset can also be inferred based on the input instance. As the use of mutual information cannot capture causal influence [15], relative entropy distance together with sparse and class-discriminative features was adopted in [17]. Also, LSPIN [21] was proposed for selecting features in low-sample-size data.
Instance-wise Feature Grouping
Grouping highly correlated features is effective in reducing the feature selection solution space [22]. Existing methods for global feature grouping [23, 24] identify feature groups using clustering algorithms, and representative features per cluster can then be selected [25, 26]. This feature grouping idea has also been extended to the instance-wise setting. Instance-Wise Feature Grouping (IWFG) aims to identify feature groups from the data that can be specifically combined according to the input instance to form the feature mask. gI is an IWFG method that formulates the problem using two notions of feature redundancies based on information theory [16]. Additionally, GroupFS is a group-wise feature selection method proposed for supervised tasks [24], which first clusters the data instances and then identifies the cluster-specific feature subset.
Feature Acquisition
Feature acquisition involves sequentially selecting a subset of features to achieve optimal prediction performance. [27] and [28] utilize Q-learning to sequentially select features. [29] propose a generative surrogate model that captures dependencies among input features to evaluate the potential information gained from the acquisitions. [30] propose an amortized optimization approach as an alternative to the reinforcement learning method, which is notoriously difficult to train. [31] select features in batches rather than individually to reduce query costs. In contrast to feature acquisition, instance-wise feature selection assumes all features are available upfront, allowing for improved accuracy.
To better situate FlexGPC within the literature, Table 1 summarizes the representative instance-wise feature selection (IWFS) and instance-wise feature grouping (IWFG) approaches.
Table 1.
Comparison of representative IWFS/IWFG methods
| Method | Selector type | Grouping | Interpretability | Weaknesses |
|---|---|---|---|---|
| L2X | MLP | No | Salient features | Unstable, no grouping |
| INVASE | Actor–critic policy net | No | Adaptive subset size | High variance, unstable |
| LSPIN | Locally sparse NN | No | Sparse local masks | Misses global structure |
| gI | Info-theoretic redundancy | Yes | Learns groups | High complexity |
| GroupFS | MoE with discrete gating | Yes | Cluster-level subsets | Rigid, low flexibility |
FlexGPC is designed to address their limitations by enhancing expressiveness (via RAC), stability (via regularization and clamping), and robustness under missingness.
Problem Formulation
Key Challenges
While instance-wise feature selection (IWFS) enables more personalized and interpretable predictions, several fundamental challenges remain:
Stability under perturbations. IWFS models are highly sensitive to small variations in the input, which can lead to inconsistent feature subsets being selected for nearly identical instances. This instability reduces the reliability of the explanations in safety-critical domains such as healthcare.
Accuracy–interpretability trade-off. Introducing grouping or sparsity constraints improves interpretability but often comes at the cost of predictive accuracy. Balancing these objectives remains a central challenge in designing explainable selection models.
Flexibility limits of convex mixing. Existing IWFG methods that rely on convex combination of groups can only span a limited subset of possible feature masks. This lack of expressiveness prevents them from capturing more nuanced or counter-pattern structures that are needed in clinical settings.
Robustness to missingness. Real-world EHR datasets are plagued by high rates of missing values. A practical IWFS method must generate stable and meaningful masks even when large portions of the feature space are absent.
We denote $X \in \mathbb{R}^{n \times d}$ as the input data (e.g., clinical features) and $Y$ as the prediction labels (e.g., clinical outcomes), where $n$ is the number of data instances (e.g., patient records) and $d$ is the number of features. IWFS aims at learning a specific feature selection mask $m_i$ for each data instance $x_i$, where $m_i \in \{0, 1\}^d$. The masked input $\tilde{x}_i$ is obtained by taking the element-wise product $\tilde{x}_i = x_i \odot m_i$. The expected outcome of IWFS is to obtain $m_i$ for each $x_i$ so that the masked input $\tilde{x}_i$ corresponds to the relevant features, which in principle can give a better prediction result than when the original input $x_i$ is used.
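As a minimal sketch of this masking operation (our toy tensors and shapes, not from the paper):

```python
import torch

n, d = 4, 6                              # 4 instances, 6 features
X = torch.rand(n, d)                     # input data X (e.g., clinical features)
M = (torch.rand(n, d) > 0.5).float()     # one binary mask m_i per instance
X_masked = X * M                         # element-wise product x_i ⊙ m_i
```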
In this paper, we propose a novel IWFG method called FlexGPC, which comprises a feature group matrix and a selection network for estimating the values of the feature selection variables, as shown in Fig. 1. In the context of clinical prediction, the feature groups can be interpreted as phenotypes of different disorders. The role of the selection network is to select and combine the relevant phenotypes, and the masked input is the resulting set of relevant clinical features for the downstream prediction. Compared to the existing IWFG methods, FlexGPC allows soft feature groups and relaxes the assumptions typically imposed on the selection variables (e.g., m-hot representation) for the feature grouping. In particular, we study two combination schemes, namely convex combination and restricted affine combination. A selection network can be designed to estimate the selection variables accordingly.
The proposed FlexGPC module can be integrated with a prediction model, and the overall model can be learned end-to-end (to be detailed in the sequel). We design our learning objective so that the masked input contains both informative and relevant features. Also, regularization terms are adopted to encourage sparsity for both the feature mask and the feature groups to enhance the interpretability of FlexGPC.
Overall Framework of FlexGPC
We first describe two key components of FlexGPC: i) feature groups and ii) feature group selection network for implementing different feature group combination schemes.
Feature Groups
Let $G \in [0, 1]^{k \times d}$ represent a set of feature groups, where $k$ is the number of groups and $d$ is the number of features. The elements of $G$ can take values in the range $[0, 1]$. The $j$-th row $g_j \in [0, 1]^d$ is a vector representing the $j$-th feature group, with its elements indicating the feature importance within the group.
Feature Group Combination
Let $s_i \in \mathbb{R}^k$ be a vector of selection variables indicating the extent to which the $k$ different feature groups should be selected for the data instance $x_i$. The value of $s_i$ is computed by feeding $x_i$ into a selection network which is to be learned. Instead of assuming an m-hot representation for $s_i$ as in [16], we allow $s_i$ to take continuous values. The “selection” step essentially becomes an input-specific combination process. In the sequel, we will refer to the selection variables as selection weights.

In particular, we investigate two combination schemes for representing a feature mask, namely convex combination and restricted affine combination.
Definition 1

Convex Combination (CC) aggregates a set of feature groups $\{g_1, \ldots, g_k\}$ for an input $x_i$ with the corresponding selection weights $s_i$, i.e., $m_i = \sum_{j=1}^{k} s_{ij}\, g_j$, where $\sum_{j=1}^{k} s_{ij} = 1$ and $s_{ij} \ge 0$ for all $j$.
The value of $s_i$ is computed by feeding $x_i$ to a selection network designed accordingly. Instead of using a multi-hot representation, we allow $s_{ij} \in [0, 1]$ for $j = 1, \ldots, k$. Thus, the selection network is essentially performing an input-specific convex combination process. We can also explore other ways of combination.
Definition 2

Restricted Affine Combination (RAC) aggregates a set of feature groups $\{g_1, \ldots, g_k\}$ for an input $x_i$ via a weighted sum $m_i = \sum_{j=1}^{k} s_{ij}\, g_j$ with the corresponding weights $s_i$, where $\sum_{j=1}^{k} |s_{ij}| = 1$ and $|s_{ij}| \le 1$ for all $j$.
RAC deliberately allows the selection weights computed by the selection network to take negative values. This means that we allow one feature group to de-select its associated features which are selected due to other feature groups.
Figure 2 illustrates the potential benefit of introducing feature de-selection in the combination process. If only positive values are allowed for the selection weights, we need four feature groups to represent the five hypothetical masked inputs. If negative selection weights are allowed, only three feature groups are needed. In general, given a fixed number of feature groups, RAC can allow more feature masks to be represented compared to CC. It is not difficult to theoretically show that RAC can span a larger subspace of feature masks than CC.
Fig. 2.

Illustration of masked inputs which require 4 feature groups based on CC (left) but only 3 groups using RAC by allowing negative selection weights (right)
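To make the de-selection effect concrete, here is a toy contrast between CC and RAC weights (our illustration; the groups and weights are made up):

```python
import numpy as np

# Three toy feature groups over d = 5 features (rows of G, values in [0, 1]).
G = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.0],   # g1: selects features 0, 1
    [0.0, 0.0, 1.0, 1.0, 0.0],   # g2: selects features 2, 3
    [0.0, 1.0, 0.0, 0.0, 0.0],   # g3: overlaps g1 on feature 1
])

# CC: non-negative weights summing to 1 -- can only blend groups.
s_cc = np.array([0.5, 0.5, 0.0])
print(s_cc @ G)                       # [0.5 0.5 0.5 0.5 0. ]

# RAC: absolute weights summing to 1, negatives allowed -- g3 partially
# de-selects the feature 1 that g1 brought in.
s_rac = np.array([0.5, 0.25, -0.25])
print(np.clip(s_rac @ G, 0.0, 1.0))   # [0.5  0.25 0.25 0.25 0.  ]
```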
Theorem 1

(Expressiveness of Restricted Affine Combination). Suppose $k_1 > 1$ and $k_2 < k_1$. For any space spanned by a matrix $G_1 \in \mathbb{R}^{k_1 \times d}$ with convex combination, there exists a $G_2 \in \mathbb{R}^{k_2 \times d}$ that spans a larger or equal space with restricted affine combination, if the following holds:

Condition: There is a feature group separation of $G_1$ into $G_a$ and $G_b$ such that the rows of $G_b$ are pairwise differences of the rows of $G_a$.
Proof

The space spanned by $G_1$ with convex combination is a subspace of $\mathbb{R}^d$; we will denote it by $\mathcal{S}_1$. By the feature group separation condition, we can write any vector in $\mathcal{S}_1$ as a convex combination of the vectors in $G_a$ and the pairwise differences in $G_b$. We also set $G_2 = G_a$, and we denote the space spanned by $G_2$ with restricted affine combination by $\mathcal{S}_2$.

We can express any vector in $\mathcal{S}_1$ as a restricted affine combination of the vectors in $G_2$. This is due to the condition that the rows of $G_b$ are pairwise differences of the rows of $G_a$, and restricted affine combination allows for negative weights, which can express differences between vectors.

Furthermore, any convex combination of the vectors in $G_a$ can also be expressed as a restricted affine combination of the vectors in $G_2$, as they are the same, by choosing the weights such that they sum up to one and no weight is less than zero, effectively creating a convex combination.

So, any vector in the space spanned by $G_1$ can be expressed as a restricted affine combination of the vectors in $G_2$. Therefore, $\mathcal{S}_1 \subseteq \mathcal{S}_2$. □
Feature Mask m

For each instance $i$, we compute the feature mask using the restricted affine combination $m_i = U\!\left(\sum_{j=1}^{k} s_{ij}\, g_j\right)$, where $g_j$ is the $j$-th feature group of $G$ and $s_i$ is the vector of selection weights. $U(w)$ is a clamping function that maps $w$ into $[0, 1]$ element-wise: $U(w) = \min(\max(w, 0), 1)$.

We define the clamped restricted affine combination (CRAC) as $U\!\left(\sum_j s_{ij}\, g_j\right)$ where $s_i$ is defined as in Definition 2. Similarly, we define the clamped convex combination as $U\!\left(\sum_j s_{ij}\, g_j\right)$ where $s_i$ is defined as in Definition 1.

The clamped restricted affine combination is more expressive than the restricted affine combination when some feature groups overlap, as shown below:
Theorem 2

(Expressiveness of Clamped Restricted Affine Combination). We denote the space spanned by $G$ with restricted affine combination as $\mathcal{S}_{RAC}$, and the space spanned by $G$ with clamped restricted affine combination as $\mathcal{S}_{CRAC}$. Then $\mathcal{S}_{RAC} \subseteq \mathcal{S}_{CRAC}$, and the inclusion is strict if some feature groups overlap, where we say two feature groups $g_i$ and $g_j$ overlap if $g_i \odot g_j \neq \mathbf{0}$ and $g_i \neq g_j$.
Proof

For any vector $v$ in $\mathcal{S}_{RAC}$, we can construct it using the clamped restricted affine combination with exactly the same set of weights: since $v \in [0, 1]^d$, the clamping function $U(\cdot)$ has no effect on it. Therefore, $\mathcal{S}_{RAC} \subseteq \mathcal{S}_{CRAC}$.

Then, we show how to construct a $v$ in $\mathcal{S}_{CRAC}$ but not in $\mathcal{S}_{RAC}$. Suppose $g_i$ and $g_j$ are two feature groups that overlap; we calculate the difference between them as $g_i - g_j$. This difference contains negative elements and thus cannot be constructed by restricted affine combination, as it violates the positivity of the mask. With $U(\cdot)$, we can map the negative values to 0, so $U(g_i - g_j) \in \mathcal{S}_{CRAC}$. Thus, $\mathcal{S}_{RAC} \subsetneq \mathcal{S}_{CRAC}$. □
Apart from expressiveness, the stability of the learned masks is an important criterion, ensuring that the differences among the masks learned in different runs are small. In the following, we give definitions and properties related to stability and show the stability of the proposed method.
Definition: Uniform Stability

A method is said to be $\beta$-uniformly stable if, for any two datasets $S$ and $S'$ that differ by one point, and for any input $x$, the computed feature masks $f_S(x)$ and $f_{S'}(x)$ from $S$ and $S'$ respectively differ by no more than $\beta$. Mathematically, this can be expressed as:

$$\left\| f_S(x) - f_{S'}(x) \right\| \le \beta$$

for all $x$ in the input space, where $f$ is the method being used (in this case, the feature mask computation using combined feature groups), and $\|\cdot\|$ is the Euclidean norm. The constant $\beta$ represents the maximum allowed change in the feature mask due to a single change in the dataset.
Assumption 1

(Feature Group Learning Stability). Suppose we have two datasets $S$ and $S'$ that differ by one point. Let $G$ and $G'$ be the feature groups learned from $S$ and $S'$ respectively. We assume that the maximum difference between corresponding feature groups in $G$ and $G'$ is bounded by a constant $\Delta$:

$$\max_i \left\| g_i - g'_i \right\| \le \Delta$$

Here, $g_i$ and $g'_i$ represent the $i$-th feature group in $G$ and $G'$ respectively, $\|\cdot\|$ is the Euclidean norm, and $\Delta$ is a constant representing the maximum change in the feature groups due to a single change in the data. This assumption essentially states that the feature group learning method is stable, in the sense that a small change in the data results in at most a $\Delta$-sized change in the feature groups.
Property: Lipschitz Continuity of the Restricted Affine Combination

The restricted affine combination of a set of vectors $\{g_1, \ldots, g_k\}$ with corresponding scalar weights $\{s_1, \ldots, s_k\}$ is Lipschitz continuous in the feature groups due to its linearity. This is mathematically expressed as follows:

$$\left\| \sum_{j} s_j\, g_j - \sum_{j} s_j\, g'_j \right\| \le L \max_j \left\| g_j - g'_j \right\|$$

for all vectors $g_j$ and $g'_j$, where $L$ is the Lipschitz constant and $\|\cdot\|$ is the Euclidean norm. Given the constraints on the weights ($\sum_j |s_j| = 1$ and $|s_j| \le 1$), the Lipschitz constant of this operation is 1, meaning that it is 1-Lipschitz continuous. Therefore, distances between points in the input space are not increased by this operation.
Theorem 3

(Uniform Stability of the Method). Suppose we have two datasets $S$ and $S'$ that differ by one point, and corresponding feature groups $G$ and $G'$ such that $\max_i \| g_i - g'_i \| \le \Delta$. Let $f_S(x)$ and $f_{S'}(x)$ be the feature masks computed for any input $x$ under datasets $S$ and $S'$ respectively, where $f$ represents our method. Let the Lipschitz constant of the restricted affine combination used in the method be $L$.

Then the method is $L\Delta$-uniformly stable, i.e., for any input $x$, the difference between the computed feature masks is bounded by $L\Delta$:

$$\left\| f_S(x) - f_{S'}(x) \right\| \le L\Delta$$

The above inequality means that a small change in the input data (at most one data point) leads to a bounded change in the computed feature masks, thus demonstrating the stability of the method.
Proof

We start with two datasets $S$ and $S'$ which differ by at most one point. From Assumption 1, we have the stability of feature group learning, which implies that $\max_i \| g_i - g'_i \| \le \Delta$ for the corresponding feature groups $G$ and $G'$.

Now, consider an input $x$ for which we compute the feature masks $f_S(x)$ and $f_{S'}(x)$ using datasets $S$ and $S'$, respectively. Using the Lipschitz continuity of the restricted affine combination (the property above), the difference in the computed feature masks is bounded by the Lipschitz constant multiplied by the difference in the feature groups, i.e., $\| f_S(x) - f_{S'}(x) \| \le L \max_i \| g_i - g'_i \|$.

Substituting the upper bound $\Delta$ from Assumption 1 into this inequality, we have $\| f_S(x) - f_{S'}(x) \| \le L\Delta$.

Thus, we have shown that for any input $x$, the difference between the computed feature masks $f_S(x)$ and $f_{S'}(x)$ is bounded by $L\Delta$, which completes the proof of the $L\Delta$-uniform stability of the method. □
Selection Network Implementation
We implement the selection network using an MLP. For CC, the selection network, denoted as $f_{CC}$, computes the output $s_i$ using the function:

$$s_i = \operatorname{softmax}\!\left( W_2\, \sigma(W_1 x_i) \right) \tag{1}$$

where $W_1$ and $W_2$ are the parameters of the selection network and $\sigma$ is the hidden-layer activation. The softmax function is used to guarantee $s_{ij} \ge 0$ for all $j$ and $\sum_j s_{ij} = 1$.

For RAC, the selection network, denoted as $f_{RAC}$, computes the output $s_i$ using the function:

$$s_i = \frac{W_2\, \sigma(W_1 x_i)}{\left\| W_2\, \sigma(W_1 x_i) \right\|_1} \tag{2}$$

where $W_1$ and $W_2$ are the parameters of the selection network $f_{RAC}$, and $\|\cdot\|_1$ denotes the L1-norm. This formula guarantees $|s_{ij}| \le 1$ for all $j$ and $\sum_j |s_{ij}| = 1$.
Feature Mask via Group Selection

Finally, the feature mask $m_i$ for $x_i$ can be computed as $m_i = U(G^\top s_i)$. To achieve a more robust mask, we adopt the LSPIN mapping [21] during model training:

$$m_i = \min\!\left( \max\!\left( G^\top s_i + \epsilon,\; 0 \right),\; 1 \right) \tag{3}$$

where $\epsilon$ is drawn from $\mathcal{N}(0, \sigma^2 I)$ during model training, and $\sigma$ is a hyperparameter for setting the noise level. The mapping has the benefit of pushing the elements of $m_i$ to take values closer to either 0 or 1. Adding noise helps the model explore more combinations of feature groups during training, which can in turn improve the model robustness. After training, the mask can be computed using (3) by setting $\sigma$ to zero.
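Continuing the sketch, the mask computation of (3) might look as follows (again an assumption-laden rendering; the default noise level is ours):

```python
def feature_mask(G: torch.Tensor, s: torch.Tensor, sigma: float = 1.0,
                 training: bool = True) -> torch.Tensor:
    """Eq. (3): clamped group combination with Gaussian exploration noise.

    G: (k, d) feature group matrix; s: (batch, k) selection weights.
    """
    m = s @ G                                  # (batch, d) raw combination G^T s_i
    if training and sigma > 0:
        m = m + sigma * torch.randn_like(m)    # explore nearby group mixes
    return m.clamp(0.0, 1.0)                   # U(.): push values into [0, 1]

# Usage: mask the input before the downstream predictor.
# x_masked = x * feature_mask(G, SelectionNet(d, 400, k)(x), training=False)
```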
Prediction Models with FlexGPC Incorporated
FlexGPC can be incorporated into different prediction models, and the combined models can be trained end-to-end to achieve robust and stable instance-wise feature grouping and selection, as well as high prediction accuracy. In particular, we apply FlexGPC to two prediction tasks: mortality prediction and next-admission diagnosis prediction. Depending on whether the sequential relationship of the features is exploited or not, the two tasks can readily be formulated as sequence-to-sequence classification or just standard classification.
For standard classification, as shown in Fig. 3, we can first feed the input feature vector to FlexGPC and then the masked input to an MLP to form FlexGPC-MLP. For sequence-to-sequence classification, we can integrate FlexGPC with sequential models like the Transformer. Specifically, given an input sequence $X_i = (x_{i,1}, \ldots, x_{i,T})$ where $x_{i,t} \in \mathbb{R}^d$ and the corresponding output label sequence $Y_i = (y_{i,1}, \ldots, y_{i,T})$, we can feed each element $x_{i,t}$ to FlexGPC to obtain its masked version $\tilde{x}_{i,t}$. Then, we feed the masked input sequence $(\tilde{x}_{i,1}, \ldots, \tilde{x}_{i,T})$ to a Transformer, and then to an MLP to form FlexGPC-Trans-MLP.
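A hedged sketch of this wiring, reusing SelectionNet and feature_mask from the earlier sketches (the one-layer, single-head Transformer follows the parameter settings reported below; the rest of the composition is our assumption):

```python
import torch
import torch.nn as nn

class FlexGPCTransMLP(nn.Module):
    """Masks each visit with FlexGPC, then runs a Transformer over the sequence."""
    def __init__(self, d: int, k: int, n_classes: int, hidden: int = 200):
        super().__init__()
        self.selector = SelectionNet(d, 400, k, mode="rac")
        self.G = nn.Parameter(torch.rand(k, d))   # feature groups, clamped to [0, 1]
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=1, dim_feedforward=hidden,
                                       batch_first=True),
            num_layers=1)
        self.head = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                  nn.Linear(hidden, n_classes))

    def forward(self, X: torch.Tensor) -> torch.Tensor:   # X: (batch, T, d)
        s = self.selector(X)                               # (batch, T, k)
        m = feature_mask(self.G.clamp(0, 1), s, training=self.training)
        return self.head(self.encoder(X * m))              # per-visit predictions
```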
Fig. 3.
FlexGPC incorporated into classification and sequence-to-sequence prediction models
Model Training
Both FlexGPC-MLP and FlexGPC-Trans-MLP can be trained end-to-end to achieve good feature selection and prediction performance at the same time. For the training, we adopt the following objective function with three loss terms:

$$\mathcal{L} = \mathcal{L}_{pred} + \lambda_1 \mathcal{L}_{recon} + \lambda_2 \mathcal{L}_{reg} \tag{4}$$

a) Prediction Loss is measured by cross-entropy:

$$\mathcal{L}_{pred} = -\sum_{i=1}^{n} \sum_{c} y_{ic} \log \hat{y}_{ic} \tag{5}$$

to guide the model learning to give high prediction accuracy given the relevant features selected.

b) Reconstruction Loss measures the discrepancy between $x_i$ and its reconstructed version based on the masked input $\tilde{x}_i$:

$$\mathcal{L}_{recon} = \sum_{i=1}^{n} \left\| x_i - r(\tilde{x}_i) \right\|_2^2 \tag{6}$$

where $r(\cdot)$ is implemented using a two-layer MLP. It encourages the informative features that can recover the redundant features to be captured and retained.

c) Regularization Term comprises two parts:

$$\mathcal{L}_{reg} = \Omega_m + \Omega_G, \qquad \Omega_m = \sum_{i=1}^{n} \left\| m_i \right\|_1, \qquad \Omega_G = \left\| G \right\|_1 \tag{7}$$

where $\Omega_m$ encourages sparse masks, and $\Omega_G$ encourages discriminative and sparse feature groups. Both are introduced to enhance FlexGPC's interpretability.

Note that before training directly on $\mathcal{L}$, we pre-train the feature masks $m_i$ based on the following pre-training loss:

$$\mathcal{L}_{pre} = \sum_{i=1}^{n} \left\| m_i - x_i \right\|_2^2 \tag{8}$$

We find that using this loss for pre-training forces the feature masks $m_i$ to be similar to the input $x_i$. It allows the parameters of the selection network to be initialized less randomly before the subsequent model training, thereby further improving the selection stability.
Experiment Setup
Datasets
To evaluate the effectiveness of FlexGPC for clinical prediction tasks, two real-world EHR datasets, namely MIMIC-III (Medical Information Mart for Intensive Care) [32] and eICU [33], are used.
MIMIC-III is a public dataset containing data of over 46,000 patients admitted to intensive care units (ICU). eICU is a multi-center database containing over 200,000 ICU admissions. We filter out patients with fewer than 2 hospital admissions to allow next-diagnosis prediction to be carried out. As a result, we extract 6,453 patients with 2.7 admissions per patient on average for MIMIC-III, and 12,293 patients with 2.2 admissions on average for eICU. The average numbers of diagnoses and medications per admission are 12.0 and 38.9 respectively for MIMIC-III, and 14.6 and 21.4 for eICU.
To further demonstrate that FlexGPC is also applicable to other problem domains, we conduct additional experiments using a gene expression dataset and the MNIST image dataset. For the gene expression dataset, we use one that contains cells collected from the human pancreas, with 1,937 cells, 20,125 genes, and 14 cell types.¹ We follow the pre-processing adopted in [34], where genes expressed in fewer than three cells are excluded from further analysis and the gene expression counts per cell are normalized. MNIST is a database of handwritten digits, which contains 60,000 training images and 10,000 test images. Each image is a 28×28 pixel grayscale image of a handwritten digit (0 through 9).
The statistics of the datasets are summarized in Table 2.
Table 2.
Statistics of datasets
| Data set | MIMIC-III | eICU | Gene | MNIST |
|---|---|---|---|---|
| # of Samples | 6,453 | 12,293 | 1,937 | 60,000 |
| # of Features | 6,054 | 3,353 | 20,125 | 784 |
| Average visits per patient | 2.7 | 2.2 | / | / |
Parameter Setting
For implementing the selection networks $f_{CC}$ and $f_{RAC}$ and the reconstruction MLP $r$, the size of the hidden layer is 400. For the prediction MLP in FlexGPC-MLP and FlexGPC-Trans-MLP, the size of the hidden layer is 200. The Transformer used in FlexGPC-Trans-MLP has one layer, a single head, and a dimension of 200. We tested different numbers of feature groups $k$ from 50 to 300. For model training, $\sigma$ in (3) is set to 1 and the batch size is 100. The data is split into training, validation, and test sets. We run our experiments on a server with four NVIDIA Tesla V100-PCIE-32GB GPUs, 250 GB of memory, and Intel(R) Silver 4114 CPUs. The Adam optimizer is used for training, with five repetitions per experiment.
Performance Evaluation
We test the performance of FlexGPC first on two clinical prediction tasks: i) next-admission diagnosis prediction [35], and ii) mortality prediction [36]. They are formulated as sequence-to-sequence classification problems with FlexGPC-Trans-MLP adopted. We further test the effectiveness of FlexGPC on two other problems: iii) cell type identification [34], and iv) handwritten digit recognition. They are formulated as standard classification problems where FlexGPC-MLP is adopted.
Metrics for Prediction Accuracy
For the next-admission diagnosis prediction [37, 38], we first derive the ground-truth labels by grouping the diagnoses in the next admissions into 793 groups based on the first three digits of their ICD-9 codes, and then carry out multi-label classification accordingly. The prediction accuracy is measured by:

$$\text{Accuracy@}k = \frac{\left|\, \text{top-}k \text{ predicted diagnosis groups} \cap \text{ground-truth diagnosis groups} \,\right|}{\min\!\left(k,\; \left| \text{ground-truth diagnosis groups} \right|\right)}$$

For mortality prediction, we use the area under the ROC curve (AUC). For cell type identification and handwritten digit recognition, we use Accuracy@1.
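The Accuracy@k computation itself is straightforward (our sketch of the formula above):

```python
import numpy as np

def accuracy_at_k(scores: np.ndarray, truth: set, k: int = 20) -> float:
    """Fraction of ground-truth diagnosis groups recovered in the top-k predictions."""
    topk = set(np.argsort(scores)[::-1][:k].tolist())
    return len(topk & truth) / min(k, len(truth))
```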
Metric for Feature Selection Stability
We evaluate the stability of feature selection by:

$$\text{Stability} = \frac{2}{B(B-1)} \sum_{b_1 < b_2} \operatorname{sim}\!\left(m^{(b_1)}, m^{(b_2)}\right)$$

It measures the similarity of the feature masks learned in $B$ runs (5 in our experiments), where $\operatorname{sim}\!\left(m^{(b_1)}, m^{(b_2)}\right)$ gives the ranking similarity of the two masks $m^{(b_1)}$ and $m^{(b_2)}$ obtained in the $b_1$-th and $b_2$-th runs respectively. The higher the value, the better the stability.
Baselines
We compare the performance of FlexGPC-Trans-MLP and FlexGPC-MLP with a number of baselines.
INVASE [12] consists of a selector network, a predictor network, and a baseline network and uses the actor-critic methodology for training.
LSPIN [21] is a locally sparse neural network where the local sparsity is learned to identify the relevant feature subset for each data instance.
gI [16] learns and combines feature groups to form the feature mask by minimizing the loss of representation and relevant redundancies.
GroupFS [24] is a group-wise feature selection method that uses the Mixture of Experts (MoE) model with discrete gating to select feature masks.
FlexGPC-CC replaces the restricted affine combination (RAC) component of FlexGPC with the convex combination (CC).
For fair comparison, we modify all the baselines so that they adopt the same prediction network as in FlexGPC-Trans-MLP or FlexGPC-MLP, depending on the prediction task.
For the two sequence-to-sequence prediction tasks, we report also the performance of Hi-BEHRT [39] which adopts a more sophisticated hierarchical Transformer to capture information of long sequences in the EHR.
Results
This section presents the results of performance comparison with the baselines based on the four prediction tasks.
Performance on EHR Data
The performance comparison results on next-admission diagnosis prediction and mortality prediction are summarized in Table 3 (MIMIC-III) and Table 4 (eICU). We observe that the proposed FlexGPC generally outperforms the baselines under different degrees of data missingness. Using RAC for combining the feature groups outperforms the use of CC in most cases. This shows the benefit of RAC in allowing feature groups to be learned for “de-selection”. Compared with Hi-BEHRT, which adopts a more sophisticated Transformer, FlexGPC with only a simple Transformer obtains better performance. Also, Fig. 4 shows that for next-admission diagnosis prediction, FlexGPC consistently outperforms FlexGPC-CC for the different numbers of feature groups tested. The same results are observed for mortality prediction. In general, FlexGPC, which uses RAC for the feature group combination, requires a smaller number of feature groups to achieve higher accuracy as compared to FlexGPC-CC.
Table 3.

Performance comparison based on MIMIC-III

| Model | 0% missing | 20% missing | 40% missing |
|---|---|---|---|
| Next-admission diagnosis prediction (Accuracy@20) | | | |
| INVASE | 0.571 ± 0.015 | 0.609 ± 0.019 | 0.589 ± 0.017 |
| LSPIN | 0.615 ± 0.022 | 0.600 ± 0.023 | 0.598 ± 0.025 |
| gI | 0.602 ± 0.014 | 0.600 ± 0.017 | 0.591 ± 0.018 |
| GroupFS | 0.592 ± 0.012 | 0.589 ± 0.012 | 0.586 ± 0.014 |
| FlexGPC-CC | 0.610 ± 0.019 | 0.592 ± 0.019 | 0.586 ± 0.018 |
| FlexGPC-Avg | 0.589 ± 0.011 | 0.576 ± 0.016 | 0.571 ± 0.012 |
| FlexGPC-NN | 0.575 ± 0.009 | 0.572 ± 0.011 | 0.561 ± 0.015 |
| FlexGPC | 0.646 ± 0.016 | 0.621 ± 0.015 | 0.614 ± 0.017 |
| Hi-BEHRT | 0.631 ± 0.013 | 0.618 ± 0.012 | 0.610 ± 0.011 |
| Mortality prediction (AUC) | | | |
| INVASE | 0.872 ± 0.021 | 0.839 ± 0.019 | 0.812 ± 0.020 |
| LSPIN | 0.944 ± 0.010 | 0.880 ± 0.015 | 0.877 ± 0.016 |
| gI | 0.917 ± 0.014 | 0.835 ± 0.014 | 0.810 ± 0.017 |
| GroupFS | 0.845 ± 0.011 | 0.837 ± 0.013 | 0.823 ± 0.013 |
| FlexGPC-CC | 0.859 ± 0.008 | 0.848 ± 0.010 | 0.836 ± 0.012 |
| FlexGPC-Avg | 0.896 ± 0.015 | 0.892 ± 0.017 | 0.881 ± 0.014 |
| FlexGPC-NN | 0.862 ± 0.017 | 0.855 ± 0.022 | 0.853 ± 0.013 |
| FlexGPC | 0.945 ± 0.010 | 0.879 ± 0.012 | 0.878 ± 0.011 |
| Hi-BEHRT | 0.865 ± 0.021 | 0.832 ± 0.025 | 0.824 ± 0.023 |

FlexGPC-Trans-MLP is abbreviated as FlexGPC. FlexGPC-CC, FlexGPC-Avg, and FlexGPC-NN denote the variants using convex combination, simple averaging, and a one-layer selection network without grouping, respectively
Table 4.

Performance comparison based on eICU

| Model | 0% missing | 20% missing | 40% missing |
|---|---|---|---|
| Next-admission diagnosis prediction (Accuracy@20) | | | |
| INVASE | 0.900 ± 0.011 | 0.878 ± 0.012 | 0.856 ± 0.015 |
| LSPIN | 0.885 ± 0.009 | 0.876 ± 0.010 | 0.854 ± 0.010 |
| gI | 0.882 ± 0.007 | 0.879 ± 0.007 | 0.856 ± 0.009 |
| GroupFS | 0.870 ± 0.010 | 0.862 ± 0.012 | 0.835 ± 0.011 |
| FlexGPC-CC | 0.873 ± 0.007 | 0.863 ± 0.008 | 0.845 ± 0.010 |
| FlexGPC-Avg | 0.869 ± 0.006 | 0.872 ± 0.009 | 0.849 ± 0.008 |
| FlexGPC-NN | 0.865 ± 0.008 | 0.860 ± 0.007 | 0.841 ± 0.013 |
| FlexGPC | 0.896 ± 0.009 | 0.882 ± 0.010 | 0.863 ± 0.012 |
| Hi-BEHRT | 0.872 ± 0.019 | 0.867 ± 0.017 | 0.842 ± 0.018 |
| Mortality prediction (AUC) | | | |
| INVASE | 0.731 ± 0.015 | 0.710 ± 0.014 | 0.703 ± 0.016 |
| LSPIN | 0.719 ± 0.012 | 0.701 ± 0.013 | 0.695 ± 0.013 |
| gI | 0.712 ± 0.015 | 0.710 ± 0.017 | 0.687 ± 0.018 |
| GroupFS | 0.638 ± 0.012 | 0.617 ± 0.012 | 0.552 ± 0.015 |
| FlexGPC-CC | 0.736 ± 0.007 | 0.731 ± 0.008 | 0.687 ± 0.011 |
| FlexGPC-Avg | 0.738 ± 0.012 | 0.725 ± 0.009 | 0.682 ± 0.013 |
| FlexGPC-NN | 0.722 ± 0.005 | 0.724 ± 0.007 | 0.667 ± 0.009 |
| FlexGPC | 0.740 ± 0.006 | 0.740 ± 0.010 | 0.729 ± 0.011 |
| Hi-BEHRT | 0.716 ± 0.018 | 0.695 ± 0.021 | 0.695 ± 0.019 |
Fig. 4.

Effect of the number of feature groups on the accuracy of FlexGPC (red dashed line) and FlexGPC-CC (blue line). FlexGPC outperforms FlexGPC-CC in most cases
To further confirm the effectiveness of CC and RAC for feature grouping, we test two further variants of FlexGPC for selecting features and combining feature groups. FlexGPC-Avg aggregates the feature groups via simple averaging, while FlexGPC-NN learns the feature mask using a one-layer neural network without feature grouping. As shown in Tables 3 and 4, FlexGPC can outperform both by a large margin. For example, for the next-admission diagnosis prediction, the Accuracy@20 on MIMIC-III drops from 0.646 for FlexGPC to 0.589 for FlexGPC-Avg and 0.575 for FlexGPC-NN.
Regarding feature selection stability, Table 5 shows the stability scores of FlexGPC-CC and FlexGPC for next-admission diagnosis prediction. To also test their robustness, we conduct the test with different levels of Gaussian noise added to the input. Both outperform all the baselines. gI, which also adopts feature grouping, is the next best. Similar improvement is also observed on the eICU data, and for mortality prediction.
Table 5.

Stability comparison on MIMIC-III for next-admission diagnosis prediction

| Model / Noise level | 0 | 0.1 | 0.5 |
|---|---|---|---|
| INVASE | 0.815 | 0.801 | 0.785 |
| LSPIN | 0.809 | 0.792 | 0.773 |
| gI | 0.835 | 0.817 | 0.799 |
| GroupFS | 0.826 | 0.811 | 0.794 |
| FlexGPC-CC | 0.878 | 0.843 | 0.806 |
| FlexGPC | 0.882 | 0.852 | 0.795 |
Phenotypes Extracted from MIMIC-III
Figure 5 shows three feature groups (phenotypes) ($G_1$, $G_2$, $G_3$) extracted from MIMIC-III using FlexGPC. With reference to a specific patient, $G_1$ and $G_2$ are identified by FlexGPC as positive (selected) feature groups, and $G_3$ as a negative (de-selected) group. Figure 5a illustrates how the three feature groups can be combined using the RAC weights estimated by the selection network to give the feature mask shown in Fig. 5b.
Fig. 5.

Illustration of a feature mask obtained by FlexGPC for a patient record in MIMIC-III
The feature mask suggests that the patient is dealing with a serious, potentially metastatic cancer, accompanied by psychological (depression) and metabolic (hyperlipidemia) comorbidities and a high-risk vascular condition (aneurysm). $G_1$ is a group consisting of mental health issues related to the patient (311, V667) and some neoplasm-related diseases. However, not all diseases in $G_1$ are in the patient's record; e.g., the patient did not develop 1578 (malignant neoplasm of other specified sites of pancreas) or 1970 (secondary malignant neoplasm of lung). The model can de-select them from $G_1$ by subtracting the feature group $G_3$, which contains those diseases. Similarly, $G_2$ is related to bleeding issues and contains diseases related to the patient (2724, 4414, 311). The unrelated diseases (2851, 5781) can again be de-selected by subtracting $G_3$. The capability of identifying feature groups which can be combined via group selection and de-selection to represent the whole dataset is a key feature enabled by FlexGPC.
Performance on Gene Expression and MNIST Datasets
To demonstrate the applicability of the proposed FlexGPC to other problem domains, we apply FlexGPC-MLP to the human pancreas gene expression data for cell type identification and to the MNIST dataset for the handwritten digit recognition task. The results are summarized in Table 6.
Table 6.

Accuracy and stability comparison on the human pancreas gene expression and MNIST datasets

| | Cell Type Identification | | Handwritten Digit Recognition | |
|---|---|---|---|---|
| Model | Accuracy@1 | Stability | Accuracy@1 | Stability |
| INVASE | 0.948 ± 0.012 | 0.831 | 0.977 ± 0.090 | 0.810 |
| LSPIN | 0.943 ± 0.010 | 0.829 | 0.963 ± 0.011 | 0.833 |
| gI | 0.747 ± 0.013 | 0.861 | 0.965 ± 0.012 | 0.857 |
| GroupFS | 0.769 ± 0.013 | 0.881 | 0.958 ± 0.015 | 0.823 |
| FlexGPC-CC | 0.943 ± 0.011 | 0.872 | 0.963 ± 0.011 | 0.866 |
| FlexGPC | 0.959 ± 0.012 | 0.899 | 0.979 ± 0.011 | 0.852 |
FlexGPC obtains the highest accuracy for the cell type identification task. INVASE and LSPIN achieve slightly worse accuracy of about 0.94, while gI is significantly worse, obtaining an accuracy of 0.75. We also report the stability scores in the same table. FlexGPC gives the most stable feature selection performance. Similar conclusions can be drawn from the handwritten digit recognition task, where FlexGPC obtains the highest accuracy and stability.
Figure 6a shows the learned feature masks based on different approaches where we observe that RAC can identify more salient regions in the image for the prediction as compared to the others. Figure 6b shows that the feature masks obtained by FlexGPC can effectively select the important pixels corresponding to the same digits written in different styles.
Fig. 6.
Comparison of learned feature masks for digits in MNIST
Visualization of Groupings of Feature Masks
Figure 7 shows the visualization of the feature masks learned by FlexGPC for the MIMIC-III dataset. To facilitate the visualization, we first apply K-means clustering to the dataset based on the Hamming distance of the ground-truth next-admission diagnoses to obtain a disease group label for each patient visit. The feature masks for the patient visits under the same disease group label are then presented together row-wise. In Fig. 7, we see that the masks under the same group learned by FlexGPC share similar clinical features (histories of diagnoses and medications). For cell type identification, we group together the cells with the same ground-truth cell type label. Figure 8 shows a snapshot of the feature masks learned from the gene expression data. Again, each row shows a feature mask, and the feature masks are grouped according to their ground-truth cell type. We observe that prominent and distinct gene blocks can be discovered for each cell type. This implies that FlexGPC can effectively discover biomarkers for cells of the same type.
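A sketch of how such a grouping can be produced; since scikit-learn's k-means does not support Hamming distance directly, we substitute agglomerative clustering over precomputed Hamming distances (a stand-in for the paper's k-means step; the function name and cluster count are ours):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AgglomerativeClustering

def group_masks_by_diagnoses(masks: np.ndarray, next_dx: np.ndarray,
                             n_groups: int = 10):
    """Order feature masks row-wise by clusters of ground-truth diagnoses."""
    D = squareform(pdist(next_dx, metric="hamming"))   # pairwise Hamming distances
    labels = AgglomerativeClustering(
        n_clusters=n_groups, metric="precomputed",
        linkage="average").fit_predict(D)
    order = labels.argsort()                           # co-locate same-cluster rows
    return masks[order], labels[order]
```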
Fig. 7.
Feature masks learned from the MIMIC-III data by FlexGPC
Fig. 8.
Feature masks learned from the human pancreas data by FlexGPC
Conclusion
In this paper, we propose a novel instance-wise feature grouping model called FlexGPC. We incorporated FlexGPC into different prediction models and trained them end-to-end for various analytics tasks based on electronic health records (EHR), gene expression, and image data. Across all tested tasks, FlexGPC enhanced both downstream prediction accuracy and feature selection stability. Additionally, FlexGPC offers fine-grained interpretation through the use of feature groups and increased expressiveness due to the flexible combination of feature groups. We demonstrated how FlexGPC, with a selection network implementing restricted affine combination, supports the selection and de-selection of feature groups, thereby enhancing the robustness and stability of Instance-Wise Feature Grouping (IWFG), as confirmed by extensive experiments. For future work, we plan to consider temporal information from sequential data to infer the feature mask, allowing for the exploration of dynamic interactions among input features. Additionally, developing more explicit methods to handle missing data is another direction towards achieving more robust instance-wise feature selection.
Author Contributions
William K. Cheung and Ivor Tsang supervised the project. Chin Wang Cheong designed the model, implemented the code, and conducted the experiments. William K. Cheung, Chin Wang Cheong, and Kejing Yin contributed to the drafting of the manuscript. Ivor Tsang commented on and reviewed the manuscript. All authors discussed the results and approved the final version before submission.
Funding
Open access funding provided by Hong Kong Baptist University Library. This research is partially supported by the Research Matching Grant Scheme RMGS2021_8_06 from the Hong Kong Government, the National Natural Science Foundation of China (NSFC) under Grant 62302413 and the Health and Medical Research Fund (HMRF) under Grant 23220312.
Data Availability
The MIMIC-III and eICU data sets can be downloaded from https://physionet.org/content/mimiciii-demo/1.4/ and https://eicu-crd.mit.edu/gettingstarted/access/ respectively. The gene expression data set is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2230757, and the MNIST data set is available at https://www.kaggle.com/datasets/hojjatk/mnist-dataset.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes

1. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2230757
References
- 1. Higgins D, Madai VI (2020) From bit to bedside: A practical framework for artificial intelligence product development in healthcare. Adv Intell Syst 2(10):2000052. 10.1002/aisy.202000052
- 2. Amann J, Blasimme A, Vayena E, Frey D, Madai V (2020) Explainability for artificial intelligence in healthcare: a multidisciplinary perspective. BMC Med Inf Decis Making 20. 10.1186/s12911-020-01332-6
- 3. Yang CC (2022) Explainable artificial intelligence for predictive modeling in healthcare. J Healthcare Inf Res 6(2):228–239. 10.1007/s41666-022-00114-1
- 4. Scheurwegs E, Cule B, Luyckx K, Luyten L, Daelemans W (2017) Selecting relevant features from the electronic health record for clinical code prediction. J Biomed Inf 74:92–103. 10.1016/j.jbi.2017.09.004
- 5. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
- 6. Pathan MS, Nag A, Pathan MM, Dev S (2022) Analyzing the impact of feature selection on the accuracy of heart disease prediction. Healthcare Anal 2:100060
- 7. Ebrahimi A, Wiil UK, Naemi A, Mansourvar M, Andersen K, Nielsen AS (2022) Identification of clinical factors related to prediction of alcohol use disorder from electronic health records using feature selection methods. BMC Med Inf Decis Making 22
- 8. Chen Y, Zhang J, Qin X (2022) Interpretable instance disease prediction based on causal feature selection and effect analysis. BMC Med Inform Decis Mak 22(1):51. 10.1186/s12911-022-01788-8
- 9. Shi X, Nikolic G, Epelde G, Arrúe M, Bidaurrazaga Van-Dierdonck J, Bilbao R, De Moor B (2021) An ensemble-based feature selection framework to select risk factors of childhood obesity for policy decision making. BMC Med Inf Decis Making 21
- 10. Bosschieter TM, Xu Z, Lan H, et al (2024) Interpretable predictive models to understand risk factors for maternal and fetal outcomes. J Healthcare Inf Res 8(1):65–87. 10.1007/s41666-023-00151-4
- 11. Chen J, Song L, Wainwright M, Jordan M (2018) Learning to explain: An information-theoretic perspective on model interpretation. In: Proceedings of the 35th international conference on machine learning, pp 883–892
- 12. Yoon J, Jordon J, Schaar M (2019) INVASE: Instance-wise variable selection using neural networks. In: Proceedings of international conference on learning representations
- 13. Xin B, Hu L, Wang Y, Gao W (2015) Stable feature selection from brain sMRI. In: Proceedings of the AAAI conference on artificial intelligence, pp 1910–1916
- 14. He Z, Yu W (2010) Stable feature selection for biomarker discovery. Comput Biol Chem 34(4):215–225. 10.1016/j.compbiolchem.2010.07.002
- 15. Chang S, Zhang Y, Yu M, Jaakkola T (2020) Invariant rationalization. In: Proceedings of the international conference on machine learning, pp 1448–1458
- 16. Masoomi A, Wu C, Zhao T, Wang Z, Castaldi P, Dy JG (2020) Instance-wise feature grouping. In: Proceedings of neural information processing systems, pp 13374–13386
- 17. Panda P, Kancheti SS, Balasubramanian V (2021) Instance-wise causal feature selection for model interpretation. In: Proceedings of IEEE conference on computer vision and pattern recognition workshops, pp 1756–1759
- 18. Chormunge S, Jena S (2018) Correlation based feature selection with clustering for high dimensional data. J Electr Syst Inf Technol 5(3):542–549. 10.1016/j.jesit.2017.06.004
- 19. Witten DM, Tibshirani R (2010) A framework for feature selection in clustering. J Am Stat Assoc 105(490):713–726
- 20. Akhiat Y, Asnaoui Y, Chahhou M, Zinedine A (2020) A new graph feature selection approach. In: Proceedings of the 6th IEEE congress on information science and technology, pp 156–161
- 21. Yang J, Lindenbaum O, Kluger Y (2021) Locally sparse neural networks for tabular biomedical data. In: Proceedings of international conference on machine learning, pp 25123–25153
- 22. Kuzudisli C, Bakir-Gungor B, Bulut N, Qaqish B, Yousef M (2023) Review of feature selection approaches based on grouping of features. PeerJ 11
- 23. Dai Y, Gao Z, Zhu Y, Zhang W, Li H, Wang Y, Li Z (2022) Feature grouping for no-reference image quality assessment. In: Proceedings of the 7th international conference on automation, control and robotics engineering, pp 204–208
- 24. Xiao Q, Li H, Tian J, Wang Z (2022) Group-wise feature selection for supervised learning. In: Proceedings of IEEE international conference on acoustics, speech and signal processing, pp 3149–3153
- 25. Sahu B, Dehuri S, Jagadev AK (2017) Feature selection model based on clustering and ranking in pipeline for microarray data. Inf Med Unlocked 9:107–122
- 26. Alimoussa M, Porebski A, Vandenbroucke N, Thami ROH, El Fkihi S (2021) Clustering-based sequential feature selection approach for high dimensional data classification. In: Proceedings of the 16th international joint conference on computer vision, imaging and computer graphics theory and applications, pp 122–132
- 27. Shim H, Hwang SJ, Yang E (2018) Joint active feature acquisition and classification with variable-size set encoding. In: Proceedings of advances in neural information processing systems, vol 31
- 28. Janisch J, Pevný T, Lisý V (2019) Classification with costly features using deep reinforcement learning. In: Proceedings of the AAAI conference on artificial intelligence 33. 10.1609/aaai.v33i01.33013959
- 29. Li Y, Oliva JB (2020) Active feature acquisition with generative surrogate models. In: Proceedings of international conference on machine learning
- 30. Covert I, Qiu W, Lu M, Kim N, White N, Lee S-I (2023) Learning to maximize mutual information for dynamic feature selection. In: Proceedings of the 40th international conference on machine learning
- 31. Asgaonkar V, Jain A, De A (2024) Generator assisted mixture of experts for feature acquisition in batch. In: Proceedings of the 38th AAAI conference on artificial intelligence
- 32. Johnson AEW, Pollard TJ, Shen L, Lehman L-wH, Feng M, Ghassemi M, Moody B, Szolovits P, Anthony Celi L, Mark RG (2016) MIMIC-III, a freely accessible critical care database. Scientific Data 3:160035
- 33. Pollard T, Johnson A, Raffa J, Celi L, Mark R, Badawi O (2018) The eICU collaborative research database, a freely available multi-center database for critical care research. Scientific Data 5:180178. 10.1038/sdata.2018.178
- 34. Xu K, Cheong C, Veldsman W, Lyu A, Cheung W, Zhang L (2023) Accurate and interpretable gene expression imputation on scRNA-seq data using IGSimpute. Briefings in Bioinf 24
- 35. Nguyen P, Tran T, Wickramasinghe N, Venkatesh S (2016) Deepr: a convolutional net for medical records. IEEE J Biomed Health Inf 22–30
- 36. Sha Y, Wang MD (2017) Interpretable predictions of clinical outcomes with an attention-based recurrent neural network. In: Proceedings of the 8th ACM international conference on bioinformatics, computational biology, and health informatics, pp 233–240
- 37. Choi E, Bahadori MT, Song L, Stewart WF, Sun J (2017) GRAM: Graph-based attention model for healthcare representation learning. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp 787–795
- 38. Song L, Cheong CW, Yin K, Cheung WK, Fung BCM, Poon J (2019) Medical concept embedding with multiple ontological representations. In: Proceedings of the 28th international joint conference on artificial intelligence, pp 4613–4619
- 39. Li Y, Mamouei M, Salimi-Khorshidi G, Rao S, Hassaine A, Canoy D, Lukasiewicz T, Rahimi K (2023) Hi-BEHRT: Hierarchical transformer-based model for accurate prediction of clinical events using multimodal longitudinal electronic health records. IEEE J Biomed Health Inform 27:1106–1117