Advances in Knowledge Discovery and Data Mining. 2020 Apr 17;12084:818–829. doi: 10.1007/978-3-030-47426-3_63

Cross-data Automatic Feature Engineering via Meta-learning and Reinforcement Learning

Jianyu Zhang, Jianye Hao, Françoise Fogelman-Soulié
Editors: Hady W. Lauw, Raymond Chi-Wing Wong, Alexandros Ntoulas, Ee-Peng Lim, See-Kiong Ng, Sinno Jialin Pan
PMCID: PMC7206177

Abstract

Feature Engineering (FE) is one of the most beneficial, yet most difficult and time-consuming, tasks of machine learning projects, and requires strong expert knowledge. It is thus valuable to design generalized ways to perform FE. The primary difficulties arise from the multiform information to consider, the potentially infinite number of possible features, and the high computational cost of feature generation and evaluation. We present a framework called Cross-data Automatic Feature Engineering Machine (CAFEM), which formalizes the FE problem as an optimization problem over a Feature Transformation Graph (FTG). CAFEM contains two components: a FE learner (FeL) that learns fine-grained FE strategies on a single dataset by Double Deep Q-learning (DDQN), and a Cross-data Component (CdC) that speeds up FE learning on an unseen dataset using the generalized FE policies learned by meta-learning on a collection of datasets. We compare the performance of FeL with several existing state-of-the-art automatic FE techniques on a large collection of datasets. The results show that FeL outperforms existing approaches and is robust to the choice of learning algorithm. Further experiments also show that CdC not only speeds up FE learning but also increases learning performance.

Introduction

As machine learning becomes more and more widespread, it has been recognized that feature engineering (FE) is the most critical factor for model performance [1]. Various researchers have demonstrated the benefit of using additional features [11]. FE aims at reducing the model error and making learning easier by deriving, through mathematical functions (operators), new features from the original ones. Normally a data scientist iterates over feature generation, feature selection and model evaluation, generating a long sequence of decisions before obtaining the "optimal" set of derived features. This process heavily relies on expert domain knowledge, intuition and technical expertise to handle complex feedback and make the best decisions. As a result, the process is difficult, time-consuming and hard to automate.

Most existing methods of automatic FE either generate a large set of possible features using predefined transformation operators followed by feature selection [3, 7, 15], or apply simple supervised learning (a simple algorithm and/or simple meta-features derived from the FE process) to recommend a potentially useful feature [4, 5, 9]. The former makes the process computationally expensive, which is even worse for complex features, while the latter significantly limits the performance boost.

A recently proposed FE approach [5] is based on Reinforcement Learning (RL). It treats all features in the dataset as a union, then applies traditional Q-learning [14] on FE-augmented examples to learn a strategy for automating FE under a given computing budget. RL is promising for providing general FE solutions. However, this work uses Q-learning with linear approximation and 12 simple manual features, which limits the power of automatic FE. Furthermore, it ignores the differences between features and applies a transformation operator to all of them at each step. Because it does not discriminate between features, it is computationally expensive, especially for large datasets and complex transformation operators.

To address the above limitations, in this work, we propose FeL (Feature Engineering Learner) and CAFEM (Cross-data Automatic Feature Engineering Machine). The former is a novel approach for automatic FE for one particular dataset based on off-policy Deep Reinforcement Learning (DRL). In order to speed up the FE process and take advantage of the FE knowledge learned from a large set of datasets, the latter extends FeL to cross-data level by Meta-Learning.

We define a Feature Transformation Graph (FTG), a directed graph representing relationships between different transformed versions of features, to organize the FE process. FeL sequentially trains an agent for each feature by DRL algorithms to learn a feature engineering strategy on one dataset and the corresponding FTG representation. We thus view the goal of FE as maximizing model accuracy by searching for a set of features $\mathcal{F}^+$ to generate and a set of features $\mathcal{F}^-$ to eliminate. CAFEM extends this process to the cross-data level by training one agent on a large set of datasets so that the learned policy performs well on unseen datasets.

Background and Problem Formulation

In this section we review the Reinforcement Learning (RL) [10] background and describe the problem formulation.

Reinforcement Learning

RL is a family of algorithms that formalizes the interaction of an agent with its environment as a Markov Decision Process (MDP) and allows it to devise an optimal sequence of actions. An MDP is defined by a tuple $\langle S, A, T, R, \gamma \rangle$, where $S$ is a set of states, $A$ a set of actions, $T: S \times A \times S \to [0, 1]$ a transition function that maps each state-action pair to a probability distribution over the possible successor states, $R: S \times A \to \mathbb{R}$ a reward function, and $\gamma \in [0, 1]$ a discount factor controlling the importance of future rewards. A policy $\pi: S \to A$ is a mapping from states to actions. At every time step $t$, an agent in state $s_t$ produces an action $a_t = \pi(s_t)$. Based on the transition function $T$, the agent moves to the next state $s_{t+1}$ with probability $T(s_t, a_t, s_{t+1})$ and obtains an immediate reward $r_t = R(s_t, a_t)$. The goal of an agent is to find an optimal policy $\pi^*$ maximizing its expected discounted cumulative reward $\mathbb{E}[G_t]$, where $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the discounted sum of future rewards.

Q-learning is a well-known model-free RL algorithm for finding an optimal policy $\pi^*$ for any finite MDP. In Q-learning, we define the Q-function (or action-value function) as $Q^{\pi}(s, a) = \mathbb{E}[G_t \mid s_t = s, a_t = a, \pi]$.

Given an optimal policy $\pi^*$, we are interested in the optimal function $Q^{\pi^*}$, or $Q^*$ for short, where $Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)$. As a result, $Q^*$ satisfies the Bellman optimality equation:

$$Q^*(s, a) = \mathbb{E}_{s'}\big[ R(s, a) + \gamma \max_{a'} Q^*(s', a') \big] \qquad (1)$$

Double Deep Q-Network (DDQN) [12] is a model-free RL algorithm which approximates the state-action value function with a deep neural network $g$ with parameters $\theta$. It uses an $\epsilon$-greedy policy to select the next action.

During training, the tuples $(s_t, a_t, r_t, s_{t+1})$ generated by the $\epsilon$-greedy policy are stored in $R$, the so-called replay buffer. The neural network is then trained by sampling mini-batches from the replay buffer and performing gradient descent on the loss $L(\theta) = \mathbb{E}\big[(y_t - g(s_t, a_t; \theta))^2\big]$, where the target $y_t = r_t + \gamma\, g\big(s_{t+1}, \arg\max_{a'} g(s_{t+1}, a'; \theta); \theta^-\big)$ is computed with the network $g$ under target parameters $\theta^-$.
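To make the update concrete, here is a minimal sketch of the Double-DQN loss in PyTorch; the network and tensor names are our own, not the paper's code.

```python
import torch
import torch.nn.functional as F

def ddqn_loss(online_net, target_net, batch, gamma=0.99):
    """Double-DQN loss on a mini-batch (s, a, r, s_next, done) sampled
    from the replay buffer. Names and shapes are illustrative."""
    s, a, r, s_next, done = batch
    # Q(s_t, a_t) for the actions actually taken.
    q = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Double-DQN: select the next action with the online network ...
        a_star = online_net(s_next).argmax(dim=1, keepdim=True)
        # ... but evaluate it with the target network (parameters theta^-).
        q_next = target_net(s_next).gather(1, a_star).squeeze(1)
        y = r + gamma * (1.0 - done) * q_next
    return F.mse_loss(q, y)
```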

Meta-learning

The goal of meta-learning is to quickly train a model for a new task with the help of data from many other similar tasks.

Model-Agnostic Meta-Learning (MAML) [2] is one of the most effective meta-learning algorithms trained by gradient descent. We denote by $\mathcal{T}$ a set of tasks. For each task $T_i \in \mathcal{T}$, MAML performs one gradient descent step on the loss $L_{T_i}$ of a network $g$ with parameters $\theta$, obtaining adapted parameters $\theta_i'$ as in Equation (2). It then performs a second, meta-level gradient descent step on the loss of the adapted networks with respect to $\theta$, as in Equation (3). In this way, MAML finds parameters $\theta$ that are close to the optimal parameters of every task.

$$\theta_i' = \theta - \alpha \nabla_{\theta} L_{T_i}(g_{\theta}) \qquad (2)$$

where $\alpha$ is the learning rate for each task $T_i$.

$$\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{T_i \in \mathcal{T}} L_{T_i}(g_{\theta_i'}) \qquad (3)$$

where $\beta$ is the meta step size.
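A minimal PyTorch sketch of Equations (2) and (3), assuming each task provides a support batch for the inner step and a query batch for the meta-loss; `functional_call` evaluates the network under substituted parameters. This is an illustration, not the paper's implementation.

```python
import torch
from torch.func import functional_call

def maml_step(model, loss_fn, tasks, alpha=0.01, beta=0.001):
    """One MAML meta-update. `tasks` yields ((x_s, y_s), (x_q, y_q))."""
    theta = dict(model.named_parameters())
    meta_loss = 0.0
    for (x_s, y_s), (x_q, y_q) in tasks:
        # Inner step (Eq. 2): one gradient step on the task's support set.
        inner = loss_fn(functional_call(model, theta, (x_s,)), y_s)
        grads = torch.autograd.grad(inner, list(theta.values()), create_graph=True)
        theta_i = {n: p - alpha * g for (n, p), g in zip(theta.items(), grads)}
        # Meta-loss (Eq. 3): query loss under the adapted parameters.
        meta_loss = meta_loss + loss_fn(functional_call(model, theta_i, (x_q,)), y_q)
    meta_grads = torch.autograd.grad(meta_loss, list(theta.values()))
    with torch.no_grad():
        for p, g in zip(theta.values(), meta_grads):
            p -= beta * g  # meta step (beta is the meta step size)
```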

Problem Formulation

We consider a collection of typical supervised learning tasks (binary classification or regression) $\mathcal{T} = \{T_1, T_2, \ldots\}$, where each task $T_i$ can be represented as $\langle D, L, m \rangle$: $D = \langle \mathcal{F}, y \rangle$ is a dataset with a set of features $\mathcal{F} = \{f_1, \ldots, f_n\}$ and a corresponding target variable $y$; $L$ is a learning algorithm (e.g. Random Forest, Logistic Regression, Neural Network) to be applied to dataset $D$; and $m$ is an evaluation measure (e.g. log-loss, relative absolute error, F1-score) of performance.

We use $P_m^L(D)$, or $P(D)$ for short, to denote the cross-validation performance of learning algorithm $L$ under evaluation measure $m$ on dataset $D$. The goal of each task is to maximize $P(D)$.
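As a concrete example of $P(D)$, the following sketch scores a feature set with 5-fold cross-validation in scikit-learn; the model hyper-parameters and seed handling are our assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def performance(X, y, seed=0):
    """P(D): mean 5-fold cross-validated F1-score of a Random Forest."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    model = RandomForestClassifier(random_state=seed)
    return cross_val_score(model, X, y, scoring="f1", cv=cv).mean()

# Gain from adding one Order-2 product feature f_i * f_j:
# X_new = np.column_stack([X, X[:, 0] * X[:, 1]])
# gain = performance(X_new, y) - performance(X, y)
```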

A transformation operator $o \in O$ in FE is a function applied to a set of features $\mathcal{F}_o$ to generate a new feature $f^+ = o(\mathcal{F}_o)$, where the order of the operator is the number of features in $\mathcal{F}_o$. We denote the set of derived features by $\mathcal{F}^+$. For instance, a product transformation applied to two features (Order-2) generates a new feature $f^+ = f_i \times f_j$. We use $O$ to denote the set of all transformation operators.

Feature engineering aims at constructing a feature set $\hat{\mathcal{F}} = (\mathcal{F}_0 \cup \mathcal{F}^+) \setminus \mathcal{F}^-$, where $\mathcal{F}_0$ is the set of original features in dataset $D$, $\mathcal{F}^+$ the set of derived features, and $\mathcal{F}^-$ the set of features that we decide to drop from the original features. For a given dataset $D$, a feature engineering policy $\pi$ specifies a derived feature set $\hat{\mathcal{F}}_\pi$. The goal of feature engineering is to find a policy $\pi^*$ that maximizes the model performance for a given algorithm $L$ and measure $m$ on a dataset $D$:

$$\pi^* = \arg\max_{\pi} P_m^L(D_{\hat{\mathcal{F}}_\pi}) \qquad (4)$$

Method

In this section, we present a new framework called Cross-data Automatic Feature Engineering Machine (CAFEM). In order to exploit the differences between features and integrate feature generation and feature selection effectively, we propose a Feature Transformation Graph (FTG) to represent the FE process at the feature level. Based on the FTG, CAFEM can perform feature engineering for each particular feature using the information related to it. It thus avoids the drawback of generating a large set of features at each step, as in [5], which is especially costly for complex features and large numbers of features.

One component of CAFEM, called FE Learner (FeL), uses Reinforcement Learning to find the optimal feature set $\hat{\mathcal{F}}$ for each feature iteratively, instead of using an expensive graph search algorithm [6]. FeL focuses on one particular supervised learning task, which gives it the ability to dig deeply into that task. However, it loses the opportunity to learn and integrate useful experience from other tasks, which could speed up the FE process on a similar task. In order to balance performance and speed, another component of CAFEM, called the Cross-data Component (CdC), applies Model-Agnostic Meta-Learning (MAML) [2], originally designed for supervised learning and on-policy reinforcement learning, to off-policy reinforcement learning, speeding up FE learning on a particular dataset by integrating FE knowledge from a set of datasets.

Feature Transformation Graph

We propose a structure called the Feature Transformation Graph (FTG) $G$, a directed acyclic dynamic graph, to represent the FE process. Each node $f$ in the FTG corresponds to either one original feature in $\mathcal{F}_0$ or one feature derived from original features. An edge from node $f_i$ to $f_j$ ($i \ne j$) with label $o$ indicates that feature $f_j$ is transformed from feature $f_i$ by transformation operator $o$ (e.g. $f_j = \log(f_i)$), or transformed partially from $f_i$ by an Order-2 operator $o$ (e.g. $f_j = f_i \times f_k$). At the start of FE, $G$ contains $n$ nodes corresponding to the $n$ original features $f_1, \ldots, f_n$. As the FE process goes on, the FTG grows dynamically (adding more nodes and edges), so we denote the FTG at time step $t$ by $G_t$. An illustrative example is given in Fig. 1.
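The FTG itself is just a labelled directed graph; below is a minimal sketch with networkx (the attribute names are our own choices):

```python
import networkx as nx

def initial_ftg(feature_names):
    """G_0: one node per original feature, no edges yet."""
    g = nx.DiGraph()
    g.add_nodes_from(feature_names)
    return g

def apply_operator(g, op, parents, new_feature):
    """Grow the FTG: add the derived feature and one labelled edge per
    parent (one parent for Order-1 operators, two for Order-2)."""
    g.add_node(new_feature)
    for p in parents:
        g.add_edge(p, new_feature, op=op)

# g = initial_ftg(["f1", "f2"])
# apply_operator(g, "log", ["f1"], "log(f1)")          # Order-1
# apply_operator(g, "product", ["f1", "f2"], "f1*f2")  # Order-2
```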

Fig. 1. Example of FTG

MDP Formulation

So far, we have introduced the representation of FE with the FTG in our automatic FE framework. What remains is to find a suitable strategy to control the growth of the FTG. An important property is that the FTG is not designed for any particular strategy, but to be a general representation of an FE process. As a result, many different strategies can be applied on top of the FTG, such as graph search or RL. In this paper, we choose RL to learn a strategy that makes a sequence of decisions on top of the FTG, due to its efficiency.

Consider the FE process with the FTG on one dataset $D$ as an MDP defined by a tuple $\langle S, A, T, R, \gamma \rangle$. At each time step $t$, a state $s_t \in S$ consists of the feature transformation graph $G_t$ and the features $f_t$ we are working on. Depending on the order of the transformation operators, $f_t$ may contain one or more features: one feature for Order-1 operators (e.g. log, square), two features for Order-2 operators (e.g. product, sum).

An action $a_t \in A$ comes from one of the following two groups of actions:

  • $A^+$ is a set of feature generation actions, each of which applies a transformation $o \in O$ to the current features $f_t$ to derive one new feature.

  • $A^-$ contains a single feature selection action, which drops the current feature $f_t$ and moves back to the previous feature. As a special case, if the current feature $f_t$ belongs to the original features, the feature selection action drops it and stops the current FE process.

The learning objective is to find a state $s^*$ with feature set $\hat{\mathcal{F}}$ in the FTG that maximizes the model performance $P(D_{\hat{\mathcal{F}}})$. The trajectory from an original feature to a new feature $f^+$ defines the final feature engineering strategy for $f^+$.

Since the target of FE is to maximize the performance $P(D)$, the reward $r_t$ of this FE problem in the FTG at time step $t$ is set to the resulting performance improvement:

$$r_t = P(D_{s_{t+1}}) - P(D_{s_t}) \qquad (5)$$
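In code, one MDP transition then looks roughly as follows; `apply_action` (growing or pruning the FTG) and the `state` bundle are hypothetical helpers, and `evaluate` is the cross-validation performance P(D) sketched earlier.

```python
def step(state, action, evaluate):
    """One FE transition: the reward is the change in P(D) (Eq. 5)."""
    before = evaluate(state.features)
    next_state = apply_action(state, action)  # generate or drop a feature in the FTG
    after = evaluate(next_state.features)
    reward = after - before
    return next_state, reward
```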

CAFEM Framework

Until now, we have introduced the organization of the FE process and the MDP formulation of the FE problem. The most critical part is the algorithm to find a good FE strategy. We introduce the CAFEM framework, which mainly contains two parts: 1) an algorithm called FeL that applies an off-policy DRL algorithm (such as DQN [8] or Double DQN [12]) on the FTG of one particular dataset to perform automatic FE; 2) an extended version of the model-agnostic meta-learning [2] algorithm for off-policy DRL, which speeds up FE learning by taking advantage of the generalized FE strategies learned from a set of datasets. These algorithms are called off-policy because the policy being learned can be different from the policy being executed.

In the following subsections, we introduce the details of these two parts.

[Algorithm 1. Feature Engineering Learner (FeL); shown as an image in the original publication.]

Feature Engineering Learner (FeL): Although FeL works as a component of CAFEM in this paper, it is also a complete algorithm that sequentially optimizes FE strategies for each feature on one particular dataset. The details are shown in Algorithm 1. Given a supervised learning task $T$ with $n$ features $f_1, \ldots, f_n$ and $n$ off-policy DRL agents $g_1, \ldots, g_n$, FeL sequentially optimizes an FE policy for each feature (line 2 in Algorithm 1). As in the traditional training stage of off-policy RL algorithms, FeL starts by performing $M$ episodes of the FE process with an $\epsilon$-greedy policy and stores the corresponding transitions in a replay buffer (lines 3–10). In this process, FeL either generates a new feature $f'$ from feature $f$ by a generation action ($a \in A^+$) or drops the current feature $f$ and moves back to the previous feature ($a \in A^-$). FeL then trains the corresponding agent $g_i$ by performing gradient descent on a mini-batch sampled from the replay buffer $R_i$ (lines 11–14). During the test stage, the same FE method as in Algorithm 1, with $\epsilon = 0$, is used to perform FE for each feature sequentially. Note that the operators in the transformation operator set $O$ are not all of the same complexity level; for example, unary transformations (e.g. $\log(f)$) are less complex than binary ones (e.g. $f_i \times f_j$).

As in [15], we introduce features in order of complexity, deriving simple features first (e.g. unary features) and then complex ones (e.g. binary features), as sketched below.
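A high-level sketch of the FeL training loop under our own naming; `complexity`, `env` and the agent interface are assumptions, and each per-feature agent is a DDQN agent trained with the loss shown earlier.

```python
def fel_train(features, agents, env, episodes=50, eps=0.3):
    """FeL (Algorithm 1), sketched: one off-policy agent per feature,
    trained sequentially, simple features first."""
    for f, agent in zip(sorted(features, key=complexity), agents):
        for _ in range(episodes):
            state, done = env.reset(f), False
            while not done:
                action = agent.act(state, eps)  # epsilon-greedy over A+ and A-
                next_state, reward, done = env.step(action)
                agent.buffer.append((state, action, reward, next_state, done))
                state = next_state
            agent.train_on_minibatch()  # gradient step on the DDQN loss
    # Test stage: rerun the same loop with eps = 0 (greedy policy).
```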

Cross-data Component: In order to speed up the FE process and take advantage of a large set of datasets, we apply Model-Agnostic Meta-Learning [2] to off-policy RL to perform cross-data automatic feature engineering. The details of the Cross-data Component (CdC) are shown in Algorithm 2. Given a set of datasets $\mathcal{D}$ and an off-policy RL agent represented by parameters $\theta$ (we use DDQN here, as it achieves good performance on many tasks [12]), the Cross-data Component samples a batch of features $f_i$ with their corresponding datasets $D_i$ and constructs a batch of supervised learning tasks $T_i$ (line 2). For each task $T_i$, CdC uses the RL agent with $\epsilon$-greedy exploration to perform $M$ episodes on $T_i$ and stores the corresponding transitions in a replay buffer $R_i$ (lines 4–5). CdC then samples $K$ transitions from $R_i$ and computes one gradient descent step as in Algorithm 2 (lines 7–8), where the loss is the same as in Algorithm 1. Finally, we sample a batch of transitions and perform the meta-update (lines 9–11).

[Algorithm 2. Cross-data Component (CdC); shown as an image in the original publication.]
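A sketch of one CdC meta-update, mirroring the MAML step above but with the off-policy DDQN loss computed on transitions sampled from each task's replay buffer; the task/buffer interface and batch size are assumptions.

```python
import torch

def cdc_meta_step(q_params, tasks, ddqn_loss_on, K=64, alpha=1e-3, beta=1e-4):
    """One meta-update over a batch of feature-engineering tasks.
    `q_params` is a dict of Q-network parameters (requires_grad=True);
    `ddqn_loss_on(params, batch)` is the differentiable DDQN loss."""
    meta_loss = 0.0
    for task in tasks:
        # Lines 4-5: M epsilon-greedy episodes have already filled task.buffer.
        inner = ddqn_loss_on(q_params, task.buffer.sample(K))
        grads = torch.autograd.grad(inner, list(q_params.values()), create_graph=True)
        # Lines 7-8: per-task adapted parameters (one inner gradient step).
        adapted = {n: p - alpha * g for (n, p), g in zip(q_params.items(), grads)}
        # Lines 9-11: meta-loss on fresh transitions, under adapted parameters.
        meta_loss = meta_loss + ddqn_loss_on(adapted, task.buffer.sample(K))
    meta_grads = torch.autograd.grad(meta_loss, list(q_params.values()))
    with torch.no_grad():
        for p, g in zip(q_params.values(), meta_grads):
            p -= beta * g
```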

Network Design: So far we have discussed the details of the FeL algorithm and the Cross-data Component. The remaining part is the structure of the neural network approximating the Q-values of DDQN in the FeL algorithm. In this work, instead of building one approximation function with its own parameters for each action $a$ [5], we use one unified function, approximated by a single neural network, for all actions. Thus, we only need to train one DRL model.

As discussed in the MDP formulation, the state $s_t$ at time $t$ comprises the FTG $G_t$ and the features $f_t$ being worked on at time $t$. In order to cover both parts of this information, we represent each state $s_t$ with the following features:

  1. Extended Quantile Sketch Array (ExQSA) representation of features. The Quantile Sketch Array (QSA) uses a quantile data sketch [13] to represent feature values associated with a class label. For each feature $f$ and binary target $y$, QSA builds equi-width bins for $f$ with target $y = 0$ and $y = 1$ separately. For regression problems, we extend QSA (ExQSA) by splitting the numeric target into two groups (low and high values) and building equi-width bins for $f$ in each group separately (see the sketch after this list).

  2. Previous N-step FE history on FTG.

  3. The number of times each transformation operator has been used in $G_t$.

  4. The number of times the next node has been visited, for each action.

  5. The number of times each operator is used on the path from $f_t$ to its root.

  6. The node depth of the feature in the FTG.

  7. Average performance improvement of each action.
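A simplified sketch of the QSA representation for a binary target; the bin count and normalization are our simplifications of [9, 13].

```python
import numpy as np

def qsa(feature, target, bins=10):
    """Per-class equi-width histograms of the min-max-normalized feature,
    concatenated into one fixed-length vector."""
    f = (feature - feature.min()) / (feature.max() - feature.min() + 1e-12)
    h0, _ = np.histogram(f[target == 0], bins=bins, range=(0.0, 1.0))
    h1, _ = np.histogram(f[target == 1], bins=bins, range=(0.0, 1.0))
    return np.concatenate([h0 / max(h0.sum(), 1), h1 / max(h1.sum(), 1)])
```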

In total, we use 293 features to represent each state. A neural network with three fully connected hidden layers (128, 128 and 64 neurons) and ReLU activations is used to approximate the Q-values, as sketched below.
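The described architecture in PyTorch; the number of actions (output units) is an assumption, e.g. one Q-value per transformation operator plus one for the drop action.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """293 state features in; 128-128-64 ReLU hidden layers; one Q-value
    per action out, as described above."""
    def __init__(self, state_dim=293, n_actions=14):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)
```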

Experiments

This section describes our experimental results. First, we introduce our experimental settings and training procedure. Then we use the F1-score (for classification) and 1 minus the Relative Absolute Error (1-RAE, for regression) to compare the performance of the FeL algorithm with several state-of-the-art automatic FE techniques. After that, we evaluate the robustness of our algorithm with respect to different learning algorithms (Random Forest, Logistic Regression). Finally, we show the efficiency of CAFEM on different supervised learning tasks by comparing it with FeL. Somewhat surprisingly, CAFEM also improves prediction performance. Source code is available on GitHub (https://github.com/TjuJianyu/CAFEM.git).

Experimental Settings

We randomly collect 120 binary classification or regression datasets from OpenML, restricted to datasets without missing values and without overly many features or instances. We randomly split them into 100 datasets for training and 20 for testing. Following [5, 9], we choose 13 transformation operators (the set $O$), including Order-1: Log, Round, Sigmoid, Tanh, Square, Square Root, ZScore, Min-Max-Normalization; and Order-2: Sum, Difference, Product, Division. Following [9], we choose Random Forest and Logistic Regression (Lasso for regression) from Scikit-learn (http://scikit-learn.org) as our learning algorithms and use F1-score/1-RAE to measure performance. A 5-fold cross-validation (same seed for all experiments) using random stratified sampling is used to measure average performance. FeL is evaluated on the 20 testing datasets directly, while CAFEM is first trained on the 100 training datasets by meta-learning. For Order-2 operators, as the number of candidate features is very large, FeL randomly samples a small batch (100) at each step.

To showcase the ability of different FE algorithms, we compare the performance of FeL with the following approaches:

  • Baseline: applies the learning algorithm on the original dataset (features) directly.

  • Random-FeL (RS): applies a random strategy on the FTG, rather than the strategy learned by RL as in CAFEM, to find a set of features that maximizes P(D). This shows the effect of the FTG without RL and meta-learning, and can be seen as a random graph search on the FTG. As graph search algorithms such as depth-first search (DFS) or breadth-first search (BFS) are extremely time-consuming [5], we do not compare FeL with DFS or BFS in this paper.

  • Brute-force (BF): inspired by DSM [3], OneBM [7] and [15], it applies all transformation operators to all original features and then performs feature selection on the augmented dataset (a top-down approach).

  • LFE [9]: uses QSA to generate the representation of each feature in classification problems. Following [9], a neural network with one hidden layer, L2 regularization and dropout is used to predict whether applying a transformation operator to a feature will yield at least a 1% model performance improvement.

  • FERL [5]: organizes the FE problem into a Transformation Graph, where each node is either the original dataset D or a dataset transformed from D, and uses Q-learning with linear approximation. We use the same settings as [5]. For Order-2 transformation operators, vanilla FERL is extremely computationally expensive, since the number of new features is very large; during the training stage, we prune branches of the Transformation Graph that would generate more than 10,000 new features in the next step, to make it trainable.

As the source code of these methods is not publicly available and some experimental details are not provided (such as the random seed of the learning algorithm and the train-test split), we implemented all the benchmarks ourselves. For all the FE approaches except Baseline, we evaluate the performance with Order-1 and with Order-1 & Order-2 transformation operators, to compare their ability to handle simple and complex transformations.

Performance Comparison of FeL

Table 1 compares the model performance of our automatic FE approach FeL with other state-of-the-art FE approaches on the 20 test datasets. The first four columns report the dataset, the number of instances (rows), the number of original features, and the baseline performance (F1-score/1-RAE with 5-fold cross-validation). The number of instances ranges from 506 to 14,240 and the number of features from 4 to 971. The middle five columns compare the automatic FE approaches with Order-1 transformation operators, and the last five columns with Order-1 & Order-2 transformation operators.

Table 1.

Comparing performance by F1-score/1-RAE, with Random Forest and 5-fold cross-validation ("-" indicates the run could not finish within 36 h; "x" indicates the algorithm cannot handle the corresponding dataset).

Datasets | #Rows | #Features | Baseline | Order-1: FeL BF LFE RS FERL | Order-1 & 2: FeL BF LFE RS FERL
Balance_scale 625 5 88.2% 88.3% 86.4% 88.2% 88.2% 88.6% 95.0% 97.0% 95.1% 92.7% -
Boston 506 21 88.2% 90.2% 86.7% 89.2% 89.5% 88.7% 89.9% 85.6% 88.2% 89.8% -
ClimateModel 540 21 95.5% 96.0% 95.6% 95.5% 95.7% 95.9% 96.1% 95.5% 95.5% 96.1% -
Cpu_small 8,192 13 86.3% 87.1% 84.5% 85.8% 86.6% 86.8% 87.1% 86.2% 86.3% 87.0% -
Credit card 14,240 31 50.5% 68.7% 64.8% 50.5% 63.8% 64.0% 71.4% 65.1% 65.1% 64.6% -
Disclosure_x 662 4 44.8% 51.7% 46.6% 46.8% 49.7% 49.8% 51.4% 46.4% 46.4% 51.4% 51.8%
Disclosure_z 662 4 53.8% 57.7% 55.6% 53.1% 55.6% 57.0% 57.0% 53.8% 55.0% 56.7% 56.9%
fri_c1_1000_25 1,000 26 84.9% 87.7% 85.8% 85.8% 86.7% 88.0% 87.1% 77.9% 82.1% 87.1% -
Fri_c2_100_10 1,000 11 86.3% 89.7% 85.8% 86.8% 88.6% 89.3% 91.0% 87.2% 86.7% 89.3% -
Fri_c3_100_5 1,000 6 88.2% 89.2% 88.5% 88.2% 88.4% 89.4% 90.7% 87.3% 87.1% 89.3% -
fri_c3_1000_50 1,000 51 79.7% 83.7% 88.5% 80.9% 80.7% 87.8% 83.1% 88.4% 78.3% 80.8% -
Gina_agnostic 3,468 971 92.3% 92.8% 78.9% 92.3% 92.8% 93.5% 92.8% - 92.5% 92.8% -
Hill-valley 1,212 101 57.5% 61.7% 59.2% 57.5% 60.8% 61.1% 100% 100% 57.5% 99.9% -
Ilpd 583 11 41.3% 45.7% 38.7% 38.9% 43.6% 44.9% 45.9% 45.9% 42.4% 44.8% -
Kc1 2,109 22 40.4% 44.5% 35.3% 38.9% 42.0% 42.7% 44.4% 39.9% 38.8% 43.4% -
openml_589 1,000 25 66.9% 67.7% 55.0% X 67.2% 72.6% 75.0% 76.9% X 68.1% -
Pc4 1,458 38 47.7% 57.0% 36.2% 45.3% 53.8% 58.4% 58.1% 50.1% 55.1% 56.5% -
Pc3+C14 1,563 38 25.9% 33.4% 27.9% 23.0% 30.3% 32.0% 33.3% 24.6% 27.4% 31.6% -
Spectrometer 531 103 77.3% 83.9% 80.0% 75.2% 80.4% 83.0% 82.7% 90.8% 73.2% 81.8% -
Strikes 625 7 96.6% 99.5% 98.7% 97.8% 99.1% 98.9% 99.5% 97.8% 93.4% 99.4% 98.9%

With Order-1 transformation operators, FeL outperforms all other approaches on most datasets. On average, FeL improves performance by 4.2% on the test datasets. In the two best cases, Credit card and Pc4, FeL even improves baseline performance by 18.2% and 9.3%. One interesting observation is that the Random method (random graph search on the FTG) obtains relatively high performance on some datasets. This indicates that the FTG represents the FE process in an effective way and significantly contributes to the subsequent learning of FE strategies.

With Order-1 & Order-2 transformation operators, the complexity of FE increases significantly, so an inefficient method can easily run out of time or memory, or fail outright. FeL improves performance by 6.9% on average compared with existing approaches. In the two best cases, Pc4 and Ilpd, it improves by 42.5% and 20.9%. As mentioned above, some FE approaches are strongly limited as the complexity of the transformation operators increases. Comparing each approach's performance on Order-1 & Order-2 with its performance on Order-1, we find that LFE and Brute-force perform worse (by 1.54% on average) on half of the datasets, while FeL shows no performance decrease. The FERL approach is computationally very expensive here: on most datasets it runs out of time (36 h).

Robustness of FeL on Different Learning Algorithms

In order to showcase the robustness of FeL, we evaluate its performance with two learning algorithms on the 20 test datasets: Random Forest (a tree-based ensemble algorithm) and Logistic Regression (Lasso for regression; a generalized linear algorithm). FeL gains 10.8% and 4.2% performance increase on average with Logistic Regression and Random Forest, respectively. The improvement with Logistic Regression ranges from 0.2% to 25.8%; with Random Forest, from 0% to 18.2%. This shows that our algorithm is robust with respect to different learning algorithms.

Performance of Cross-data Component

One main aim of the Cross-data Component is to speed up FE learning. We evaluate FeL and CAFEM on the test datasets and, due to space limitations, show the comparison on 4 randomly chosen datasets. Figure 2 shows that CAFEM increases model performance more rapidly (it gains a high score within the first epoch and outperforms the best of FeL within around ten epochs). Somewhat surprisingly, CAFEM also reaches a better final model performance than FeL in most cases. We hypothesize that this is because CAFEM has learned general FE rules from a large set of datasets, which help the agent quickly adapt to a new dataset and regularize its behavior.

Fig. 2. CAFEM vs FeL over 4 different datasets

Conclusion

In this paper, we presented a novel framework called CAFEM to perform automatic feature engineering (FE) and transfer FE experience from a set of datasets to a particular one. It contains a Feature Transformation Graph (FTG) that organizes the FE process, a single-data FE learner (FeL) and a Cross-data Component. On most datasets, the framework outperforms state-of-the-art automatic FE approaches for both simple and complex transformation operators. With the help of the Cross-data Component, CAFEM can speed up FE and increase FE performance. Moreover, the framework is robust to the choice of learning algorithm.

Acknowledgments

The work is supported by the National Natural Science Foundation of China (Grant Nos.: 61702362, U1836214).

Contributor Information

Hady W. Lauw, Email: hadywlauw@smu.edu.sg

Raymond Chi-Wing Wong, Email: raywong@cse.ust.hk.

Alexandros Ntoulas, Email: antoulas@di.uoa.gr.

Ee-Peng Lim, Email: eplim@smu.edu.sg.

See-Kiong Ng, Email: seekiong@nus.edu.sg.

Sinno Jialin Pan, Email: sinnopan@ntu.edu.sg.

Jianyu Zhang, Email: edzhang@tju.edu.cn.

Jianye Hao, Email: jianye.hao@tju.edu.cn.

Françoise Fogelman-Soulié, Email: francoise.soulie@hub-franceia.fr.

References

  • 1. Domingos, P.: A few useful things to know about machine learning. Commun. ACM 55(10), 78–87 (2012). doi: 10.1145/2347736.2347755
  • 2. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 1126–1135. JMLR.org (2017)
  • 3. Kanter, J.M., Veeramachaneni, K.: Deep feature synthesis: towards automating data science endeavors. In: IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 1–10. IEEE (2015)
  • 4. Katz, G., Shin, E.C.R., Song, D.: ExploreKit: automatic feature generation and selection. In: Proceedings of the IEEE 16th International Conference on Data Mining (ICDM 2016), pp. 979–984. IEEE (2016)
  • 5. Khurana, U., Samulowitz, H., Turaga, D.: Feature engineering for predictive modeling using reinforcement learning. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
  • 6. Khurana, U., Turaga, D., Samulowitz, H., Parthasrathy, S.: Cognito: automated feature engineering for supervised learning. In: Proceedings of the IEEE 16th International Conference on Data Mining Workshops (ICDMW 2016), pp. 1304–1307. IEEE (2016)
  • 7. Lam, H.T., Thiebaut, J.-M., Sinn, M., Chen, B., Mai, T., Alkan, O.: One button machine for automating feature engineering in relational databases. arXiv preprint arXiv:1706.00327 (2017)
  • 8. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529 (2015). doi: 10.1038/nature14236
  • 9. Nargesian, F., Samulowitz, H., Khurana, U., Khalil, E.B., Turaga, D.: Learning feature engineering for classification. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), pp. 2529–2535 (2017)
  • 10. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
  • 11. Töscher, A., Jahrer, M., Bell, R.M.: The BigChaos solution to the Netflix grand prize. Netflix prize documentation, pp. 1–52 (2009)
  • 12. Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: AAAI 2016, Phoenix, AZ, vol. 2, p. 5 (2016)
  • 13. Wang, L., Luo, G., Yi, K., Cormode, G.: Quantiles over data streams: an experimental study. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp. 737–748. ACM (2013)
  • 14. Watkins, C.J., Dayan, P.: Q-learning. Mach. Learn. 8(3–4), 279–292 (1992)
  • 15. Zhang, J., Fogelman-Soulié, F., Largeron, C.: Towards automatic complex feature engineering. In: Hacid, H., Cellary, W., Wang, H., Paik, H.-Y., Zhou, R. (eds.) Web Information Systems Engineering, WISE 2018, pp. 312–322. Springer, Cham (2018)
