Optimal Sparse Regression Trees

Rui Zhang; Rui Xin; Margo Seltzer; Cynthia Rudin

doi:10.1609/aaai.v37i9.26334

. Author manuscript; available in PMC: 2024 Apr 22.

Published in final edited form as: Proc AAAI Conf Artif Intell. 2023 Jun;37(9):11270–11279. doi: 10.1609/aaai.v37i9.26334

Optimal Sparse Regression Trees

Rui Zhang ^1,^*, Rui Xin ^1,^*, Margo Seltzer ², Cynthia Rudin ¹

PMCID: PMC11034802 NIHMSID: NIHMS1982232 PMID: 38650922

Abstract

Regression trees are one of the oldest forms of AI models, and their predictions can be made without a calculator, which makes them broadly useful, particularly for high-stakes applications. Within the large literature on regression trees, there has been little effort towards full provable optimization, mainly due to the computational hardness of the problem. This work proposes a dynamic-programming-with-bounds approach to the construction of provably-optimal sparse regression trees. We leverage a novel lower bound based on an optimal solution to the k-Means clustering algorithm on one dimensional data. We are often able to find optimal sparse trees in seconds, even for challenging datasets that involve large numbers of samples and highly-correlated features.

1. Introduction

Regression trees are one of the oldest and most popular forms of machine learning model, dating back to the 1963 AID algorithm of Morgan and Sonquist (1963). Since then, there has been a vast amount of work on regression trees, the overwhelming majority of which involves greedy tree induction and greedy pruning (Breiman et al. 1984; Quinlan 1993; Payne and Meisel 1977; Loh 2002). In these approaches, trees are grown from the top down, with greedy splitting at each branch node, and greedy pruning afterwards. These techniques are easy and fast, but their trees have no notion of global optimality. Greedily-grown trees can be much larger than necessary, sacrificing interpretability, and their performance suffers when compared to other machine learning approaches. Thus, questions remain – is it possible to create optimal regression trees? Would they be competitive with other machine learning algorithms if they were fully optimized? Certainly there would be many uses for sparse interpretable regression trees if we could create them with accuracy comparable to that of other machine learning approaches.

While the quest for fully-optimal decision trees began in the mid-90’s with the work of Bennett and Blue (1996), fully optimal decision tree learning was rarely attempted over the last three decades, owing to the computational hardness of the problem. Works that did attempt it (Dobkin et al. 1997; Farhangfar, Greiner, and Zinkevich 2008; Narodytska et al. 2018; Janota and Morgado 2020; Shati, Cohen, and McIlraith 2021; Hu et al. 2020; Avellaneda 2020) had strong constraints, such as shallow depth or perfect classification accuracy. For classification (rather than regression), scientists have had recent success in producing fully optimal trees (Günlük et al. 2021; Blanquero et al. 2020; Hu, Rudin, and Seltzer 2019; Verwer and Zhang 2019; Angelino et al. 2017; Lin et al. 2020; McTavish et al. 2022; Farhangfar, Greiner, and Zinkevich 2008; Nijssen and Fromont 2007, 2010; Aghaei, Gomez, and Vayanos 2020; Verhaeghe et al. 2019; Nijssen, Schaus et al. 2020; Nijssen and Fromont 2010; Demirović et al. 2022) using mathematical programming or dynamic programming. However, building sparse optimal classification trees is a much easier problem, since the 0-1 loss has natural discrete lower bounds, and binary integer programming can be used; this is not true of regression, which uses (real-valued) mean squared error as its loss function.

Let us discuss the few works that do address challenges resembling optimal regression trees. The works of Blanquero et al. (2022) and Bertsimas, Dunn, and Wang (2021) do not construct traditional sparse trees with constant predictions in the leaves; their leaf nodes contain linear or polynomial classifiers, thus the formula for producing predictions is quite complex. The former (Blanquero et al. 2022) uses $ℓ_{\infty} + ℓ_{1}$ regularization for the linear models within the nodes and the latter (Bertsimas, Dunn, and Wang 2021) uses $ℓ_{2}$ regularization for polynomial models in the leaves. Neither of these regularize the number of leaves. The evtree algorithm (Grubinger, Zeileis, and Pfeiffer 2014) claims to construct globally optimal trees, but since it is purely an evolutionary method (no bounds are used to reduce the search space), there is no guarantee of reaching optimality, and one never knows whether optimality has already been reached. Dunn (2018) and Verwer and Zhang (2017) provide mathematical programming formulas for optimal regression trees, but no open source code is available; regardless, mathematical programming solvers are generally slow. Interpretable AI (2022) provides proprietary software that requires a license, but it is not possible to ascertain whether it uses local search or mathematical programming; we suspect it uses local search heuristics, despite the claim of optimal solutions. In other words, as far as we know, there is no other prior peer-reviewed work that directly produces sparse, provably-optimal regression trees with publicly available code.

Our goal is to design optimal sparse regression trees in the classical sense, with a small number of leaves, a single condition at each split, and a constant prediction in each leaf. This makes the predictions easy to understand and compute, even for people who cannot understand equations. Given a trained tree, one can print it on an index card and compute a prediction without adding or multiplying any numbers, which makes these models easy to troubleshoot and use – even in high-stakes settings. An example tree for the seoul bike dataset (VE and Cho 2020; Sathishkumar, Park, and Cho 2020; Dua and Graff 2017) constructed by our method is shown in Figure 1.

Figure 1: — Optimal regression tree for *seoul bike* dataset with $λ = 0.05$ , *max depth* = 5. This dataset predicts the number of bikes rented in an hour. It is binarized by splitting each feature into four categories.

Our formulation is a dynamic-programming-with-bounds approach, where the search space is either reduced or searched methodically. Such approaches have been highly successful for classification trees (Angelino et al. 2017; Lin et al. 2020; Nijssen, Schaus et al. 2020) but have not been previously used for regression trees. An important novel element of our formulation is a lower bound that we call the “k-Means equivalent points lower bound.” To reduce the search space, we need as tight a bound as possible on the objective. Our bound makes use of the observation that any high-quality decision tree of $C$ leaves will perform as bad or worse than the performance of fully-optimal C-Means clustering on the labels alone (without any features). We discuss this in Section 3.

Our main results are: (1) The first algorithm with publicly available code for optimal sparse regression trees in the classical sense, with a proof of optimality. We call this algorithm Optimal Sparse Regression Trees (OSRT). (2) A substantial speedup over evtree, owing to our analytical bounds that reduce the search space. Evtree globally optimizes models, but does not provide a proof of optimality as OSRT does.

2. Notation and Objective

We denote the training dataset ( $X$ , $y$ ) as ${(x_{i}, y_{i})}_{i = 1}^{N}$ , where $x_{i} \in {0, 1}^{M}$ is a binary feature vector and $y_{i} \in R$ is a target variable. (Real-valued features in the raw dataset can be transformed into binary features in many different ways, e.g., splitting the domain of the feature into equalsized buckets, splitting between every two realized values of the variable in the training set, using splits from a reference model as in McTavish et al. (2022); we use the first technique.)

We denote $ℒ (t, X, y) ≔ \frac{1}{N} \sum_{i = 1}^{N} {(y_{i} - {\hat{y}}_{i})}^{2}$ as the loss of tree $t$ on the training dataset, where ${\hat{y}}_{i}$ is the prediction of $x_{i}$ by tree $t$ , i.e., we use mean squared error (MSE) as the loss function. We define the objective function of tree $t$ , $R (t, X, y)$ as a combination of tree loss and penalty on complexity:

ℒ (t, X, y) + λ \cdot complexity (t)

where the complexity penalty is $H_{t}$ , the number of leaves in tree $t$ :

R (t, X, y) ≔ ℒ (t, X, y) + {λ H}_{t} .

(1)

Computationally, it is easier when a depth constraint is added:

ℒ (t, X, y) + λ \cdot complexity (t), s.t. depth (t) \leq d .

(2)

Adding a depth constraint dramatically reduces the search space, but it can lead to suboptimal values of the objective if the depth constraint is smaller than the depth of the optimal solution. Unlike all previous approaches, our algorithm can find provably-optimal trees that globally minimize Equation (1) without a depth constraint.

3. Bounds

Following Hu, Rudin, and Seltzer (2019); Lin et al. (2020), we represent a tree as a set of leaves. Trees with identical leaves, regardless of different internal branching nodes, are considered equivalent. This representation allows us to save memory and avoid duplicate computation during tree construction.

Our algorithm, like that of Lin et al. (2020) for classification, is a dynamic-programming-with-bounds algorithm. This algorithm searches the whole space of trees systematically from smaller to larger trees. If the algorithm determines (through the use of bounds) that the current partial tree it is constructing can never be extended to form an optimal full tree, it stops exploring that part of the search space. Thus, the tighter the bounds, the more the algorithm reduces the search space and the more quickly it converges to the optimal solution. Thus, we present a series of tight bounds that reduce computation by reducing the search space.

We start with notation. A tree $t$ is represented as a set of $H_{t}$ distinct leaves: $t = {l_{1}, l_{2}, \dots, l_{H_{t}}}$ . It can also be written as:

t = (t_{fix}, δ_{fix}, t_{split}, δ_{split}, K, H_{t})

where $t_{fix} = {l_{1}, l_{2}, \dots, l_{K}}$ are a set of $K$ fixed leaves that are not allowed to be further split in this part of the search space, $δ_{fix} = {{\hat{y}}_{l_{1}}, {\hat{y}}_{l_{2}}, \dots, {\hat{y}}_{l_{K}}} \in R^{K}$ are predicted targets for the fixed leaves, $t_{split} = {l_{K + 1}, l_{K + 2}, \dots, l_{H_{t}}}$ are $H_{t} - K$ splitting leaves that can be further split in this part of the search space, and their predicted targets are $δ_{split} = {{\hat{y}}_{l_{K + 1}}, {\hat{y}}_{l_{K + 2}}, \dots, {\hat{y}}_{H_{t}}} \in R^{H_{t} - K}$ .

We generate new trees by splitting different subsets of splitting leaves in tree $t$ . We define $t^{'} = (t_{fix}^{'}, δ_{fix}^{'}, t_{split}^{'}, δ_{split}^{'}, K^{'}, H_{t^{'}})$ as a child tree of $t$ if and only if $t_{fix}^{'}$ is a superset of $t_{fix}$ , and $t_{split}^{'}$ is generated through splitting a subset of $t_{split}$ . We denote $σ (t)$ as the set of all child trees of $t$ .

The following bounds start out analogous to those of Lin et al. (2020) for classification and diverge entirely when we get to the new k-Means Lower Bound.

3.1. Lower Bounds

The loss of a tree has contributions from its two parts: fixed leaves and splitting leaves. Since the fixed leaves cannot be further split in this part of the search space, their contribution provides a lower bound for tree $t$ and all of its child trees. Define the objective lower bound of tree $t$ as

R (t, X, y) \geq ℒ (t_{fix}, X, y) + {λ H}_{t},

where $ℒ (t_{fix}, X, y)$ is the sum of losses for fixed leaves:

ℒ (t_{fix}, X, y) = \frac{1}{N} \sum_{i = 1}^{N} {(y_{i} - {\hat{y}}_{i})}^{2} \cdot 1_{cap (t_{fix}, x_{i})}

(3)

$1_{cap (t_{fix}, x_{i})}$ is 1 when one of the leaves in $t_{fix}$ captures $x_{i}$ , 0 otherwise. ( $t_{fix}$ captures $x_{i}$ when $x_{i}$ falls into one of the fixed leaves of $t$ .) If splitting leaves have 0 loss, then the tree’s loss is equal to the lower bound.

We denote the current best objective we have seen so far as $R^{c}$ . If the objective lower bound of $t$ is worse than $R^{c}$ , i.e., $ℒ (t_{fix}, X, y) + λ H_{t} > R^{c}$ , then $t$ cannot be an optimal tree, nor can any of its children, and the search space can be pruned. To show this, we need the following bound, stating that the child trees of $t$ all obey the same lower bound from the fixed leaves. Note that all proofs are in the appendix.

Theorem 3.1. (Hierarchical Objective Lower Bound). Any tree $t^{'} = (t_{fix}^{'}, δ_{fix}^{'}, t_{split}^{'}, δ_{split}^{'}, K^{'}, H_{t^{'}}) \in σ (t)$ in the child tree set of tree $t = (t_{fix}, δ_{fix}, t_{split}, δ_{split}, K, H_{t})$ obeys:

R (t^{'}, X, y) \geq ℒ (t_{fix}, X, y) + {λ H}_{t} .

That is, the objective lower bound of the parent tree holds for all its child trees. This bound ensures that we do not further explore child trees if the parent tree can be pruned via the lower bound.

The next bound removes all of a tree’s child trees from the search space, even if the tree itself could not be eliminated by the previous bound.

Theorem 3.2. (Objective Lower Bound with One-step Lookahead). Let $t = (t_{fix}, δ_{fix}, t_{split}, δ_{split}, K, H_{t})$ be a tree with $H_{t}$ leaves. If $ℒ (t_{fix}, X, y) + λ H_{t} + λ > R^{c}$ , even if its objective lower bound obeys $ℒ (t_{fix}, X, y) + λ H_{t} \leq R^{c}$ , then for any child tree $t^{'} \in σ (t)$ , $R (t^{'}, X, y) > R^{c}$ .

That is, even if a parent tree cannot be pruned via its objective lower bound, if $ℒ (t_{fix}, X, y) + λ H_{t} + λ > R^{c}$ , all of its child trees are sub-optimal and can be pruned (and never explored).

3.2. Equivalent Points

Before making the lower bound of the objective tighter, let us introduce equivalent points. We define equivalent points as samples with identical features but possibly different target values. It is impossible to partition these samples into different leaves in any tree; a leaf that captures a set of equivalent points that have different targets can never achieve zero loss. Our bound exploits this fact.

Let $u$ be a set of equivalent points where samples have exactly the same feature vector $x$ , such that $\forall j_{1}, j_{2}, \dots j_{∣ u ∣} \in u$ :

x_{j_{1}} = x_{j_{2}} = \dots = x_{j_{∣ u ∣}} .

We define the equivalence loss $ℰ_{u}$ as the sum of squares error for set $u$ when the estimate of the leaf is the best possible, namely the mean of targets for points in $u$ . Define ${\bar{y}}_{u} = \frac{1}{∣ u ∣} \sum_{(x_{i}, y_{i}) \in u} y_{i}$ :

ℰ_{u} = \frac{1}{N} \sum_{(x_{i}, y_{i}) \in u} {(y_{i} - {\bar{y}}_{u})}^{2} .

(4)

Theorem 3.3. (Equivalent Points Lower Bound). Let $t = (t_{fix}, δ_{fix}, t_{split}, δ_{split}, K, H_{t})$ be a tree with K fixed leaves and $H_{t} - K$ splitting leaves. For any child tree $t^{'} = (t_{fix}^{'}, δ_{fix}^{'}, t_{split}^{'}, δ_{split}^{'}, K^{'}, H_{t^{'}}) \in σ (t)$ :

R (t^{'}, X, y) \geq ℒ (t_{fix}, X, y) + {λ H}_{t} + \sum_{u \in U} ℰ_{u} \cdot 1_{c a p (t_{split}, u)},

(5)

where $U$ is the set of equivalent points sets in training dataset $(X, y)$ and $1_{cap (t_{split}, u)}$ is 1 when $t_{split}$ captures set $u$ , 0 otherwise.

Combining with the idea of Theorem 3.2, we have:

R (t^{'}, X, y) \geq ℒ (t_{fix}, X, y) + {λ H}_{t} + λ + \sum_{u \in U} ℰ_{u} \cdot 1_{cap (t_{split}, u)} .

(6)

The bound we introduce next, one of the main novel elements of the paper, is much tighter than the Equivalent Points Lower Bound.

3.3. k-Means Lower Bound

Let us consider the points within each leaf of a regression tree. The smallest possible losses within a leaf are achieved when the label values within the leaf are all similar to each other. If we know we will construct a tree with $C$ leaves and we could rearrange the points into any of the leaves, how would we arrange them to minimize loss? The best loss we could possibly achieve would come from grouping points with the most similar targets together in the same leaf. This procedure is equivalent to computing an optimal clustering of the targets (in 1 dimension) that minimizes the sum of squared errors between each point and the position of its cluster center (the mean of the cluster). The solution to this clustering problem gives the lowest loss we can possibly achieve for any regression tree with $C$ leaves. We can use this as a lower bound on the loss for $t_{split}$ by setting $C$ equal to the $H_{t} - K$ number of unsplittable leaves. There exists a deterministic algorithm that takes linear time for computing the optimal k-Means loss on one dimensional data, which takes advantage the fact that the number line is totally ordered (Song and Zhong 2020).

Definition 3.1. (k-Means Problem for 1D targets) Given a set of $N^{'}$ 1D points $y^{'}$ and a number of clusters $C$ , the goal is to assign points into $C$ clusters so that the sum of squared Euclidean distances between each point and its cluster mean is minimized. Define k-Means( $C$ , $y^{'}$ ) to be the optimal objective of the k-Means algorithm for clustering 1D points $y^{'}$ of size $N^{'}$ into $C$ clusters $C \geq 1$ :

k - M e a n s (C, y^{'}) ≔ min_{z, A} \sum_{i = 1}^{N^{'}} {(y_{i}^{'} - z_{A (y_{i}^{'})})}^{2} .

(7)

$A (y_{i}^{'})$ is a function that specifies the cluster assignment of $y_{i}^{'}$ among $c_{1}$ , $c_{2}, \dots, c_{C}$ , and $z_{c}$ is the centroid of cluster $c$ , which is the mean of the points assigned to that cluster.

z_{c} ≔ \frac{\sum_{A (y_{i}^{'}) = c} y_{i}^{'}}{\sum_{A (y_{i}^{'}) = c} 1} .

(8)

We note here that for an assignment, $A$ , of points to a tree’s $C$ leaves, choosing the mean $z_{c}$ as the predicted label in each leaf $c$ yields the following for the k-Means objective, which is optimized over $z$ for a fixed $A$ :

k - M e a n s - o b j (C, y^{'}, A) ≔ min_{z} \sum_{i = 1}^{N^{'}} {(y_{i}^{'} - z_{A (y_{i}^{'})})}^{2} .

(9)

That is, minimizing the regression loss (sum of squares to the mean target in each leaf) also yields the k-Means’ choice of cluster center as the mean of the targets for points belonging to a leaf. Clearly k-Means-obj( $C$ , $y^{'}$ , $A$ ) ≥ k-Means( $C$ , $y^{'}$ ) since the latter is minimized over the assignment of points to clusters without regard to the tree structure at all. This logic is used in our lower bound.

Theorem 3.4. (k-Means Lower Bound). Consider tree $t = (t_{fix}, δ_{fix}, t_{split}, δ_{split}, K, H_{t})$ . and any child tree $t^{'} = (t_{fix}^{'}, δ_{fix}^{'}, t_{split}^{'}, δ_{split}^{'}, K^{'}, H_{t^{'}}) \in σ (t)$ . Let $(X_{t_{split}}, y_{t_{split}})$ be samples captured by the splitting leaves $t_{split}$ . Then,

R (t^{'}, X, y) \geq ℒ (t_{fix}, X, y) + λ K + min_{C} (\frac{1}{N} k - M e a n s (C, y_{t_{split}}) + λ C) .

3.4. k-Means Equivalent Points Lower Bound

We can make the bound from the last section even tighter. In fact, in the k-Means lower bound above, we ignored information inherent to the regression tree problem, because we ignored all of the features $X$ . We can achieve a tighter bound if we leverage our knowledge of $X$ to again consider equivalent points. Specifically, all points with the same features must be assigned to the same leaf. We first present the definition of a modified k-Means problem and then state our theorem.

Definition 3.2. (Constrained k-Means Problem for 1D targets) Given a set of $N^{'}$ 1D target points $y^{'}$ with feature vector $X^{'}$ and number of clusters $C$ , the goal is to assign points into $C$ clusters so that the sum of squared Euclidean distances between each point and its cluster mean is minimized, under the constraint that all points with the same feature vector $x^{'}$ must be assigned to one cluster.

C o n s t r a i n e d_k - M e a n s (C, X^{'}, y^{'}) = min_{z, A} \sum_{i = 1}^{N^{'}} {(y_{i}^{'} - z_{A (y_{i}^{'})})}^{2} s . t . i f x_{i} = x_{i^{'}}, t h e n A (y_{i}^{'}) = A (y_{i^{'}}^{'}) .

(10)

Adding this constraint makes the k-Means Lower Bound tighter.

Theorem 3.5. (k-Means Equivalent Points Lower Bound). Consider tree $t = (t_{fix}, δ_{fix}, t_{split}, δ_{split}, K, H_{t})$ . and any child tree $t^{'} = (t_{fix}^{'}, δ_{fix}^{'}, t_{split}^{'}, δ_{split}^{'}, K^{'}, H_{t^{'}}) \in σ (t)$ . Let $(X_{t_{split}}, y_{t_{split}})$ be samples captured by the splitting leaves $t_{split}$ . Then,

R (t^{'}, X, y) \geq ℒ (t_{fix}, X, y) + λ K + {min}_{C} (\frac{1}{N} C o n s t r a i n e d_k - M e a n s (C, X_{split}, y_{t_{split}}) + λ C)

(11)

where Constrained_k-Means is defined in Equation 10.

3.5. Computing k-Means Equivalent Points Bound

We now define a weighted version of the k-Means problem, where each sample point is associated with a weight. We derive these weights later as sizes of the equivalent sets.

Definition 3.3. (Weighted k-Means Problem) Given a set of $N^{'}$ 1D points $y^{'}$ with weights $w \in R^{N^{'}}$ and number of clusters $C$ , the goal is to assign points into $C$ clusters so that the weighted sum of squared Euclidean distances between each point and its cluster centroid is minimized. Define Weighted_k-Means( $C$ , $y^{'}$ , $w$ ) as the optimal objective of the k-Means algorithm clustering 1D points $y^{'}$ of size $N^{'}$ into $C$ clusters ( $C \geq 1$ ):

W e i g h t e d_k - M e a n s (C, y^{'}, w) = min_{z, A} \sum_{i = 1}^{N^{'}} w_{i} \cdot {(y_{i}^{'} - z_{A (y_{i}^{'})})}^{2} .

(12)

z_{c_{j}} = \frac{\sum_{A (y_{i}^{'}) = c_{j}} w_{i} \cdot y_{i}^{'}}{\sum_{A (y_{i}^{'}) = c_{j}} w_{i}}

(13)

which is similar to the one defined by Song and Zhong (2020).

Song and Zhong (2020) present an efficient $O (k N)$ solution to this weighted k-Means problem, where $k$ is the number of clusters and $N$ the number of data samples. We leverage this algorithm for the k-Means Equivalent Points Lower Bound (Theorem 3.5). In the following theorem, we show that solving this weighted k-Means problem is equivalent to solving a constrained k-Means problem on a modified dataset.

Theorem 3.6. (Constrained k-Means with Equivalent Points is equivalent to weighted k-Means) Recall in Definition 3.2, we have $N^{'}$ 1D target points $y^{'}$ with features $X^{'}$ and number of clusters $C$ . We also have a constraint that all points in any equivalent set u must be assigned to the same leaf. Define a modified dataset $(X_{m o d}, y_{m o d}, w_{m o d})$ , where all points of equivalent set $u$ in the original dataset $(X^{'}, y^{'})$ are represented by a single point $(x_{u}, y_{u}, w_{u})$ , where $x_{u}$ is the same as the feature vector of equivalent set $u$ ,

y_{u} = \frac{1}{∣ u ∣} \sum_{(x_{i}^{'}, y_{i}^{'}) \in u} y_{i}^{'},

(14)

and the weight is the size of the equivalent set $u$

w_{u} = ∣ u ∣ .

(15)

An optimal clustering of the modified dataset will directly provide an optimal clustering of the original dataset with the equivalent points constraint from Equation 10. (All points from the original dataset contributing to a weighted point in the modified dataset will be assigned to its cluster.)

That is, solving the Weighted k-Means problem produces the same solution(s) as solving the Constrained k-Means problem. Thus, solving the weighted k-Means problem on the modified dataset provides the same result as solving the constrained k-Means on the original dataset.

In Equation 11, we observe that computing the k-Means Equivalent Points Lower bound requires that we find the minimum of Constrained_k-Means across all possible $C$ . One can easily see that it is sufficient to iterate $C$ from 1 to $∣ y_{t_{split}} ∣$ , where every data point is in its own cluster. However, this would be costly when dealing with large datasets. The following theorem, as proved in Aggarwal, Schieber, and Tokuyama (1994), shows that the loss improved from adding more clusters decreases as the number of clusters increases. It means we do not need to generate k-Means solutions for all $C$ up to the size of the subproblem, we can stop as soon as the objective improvement from adding new clusters becomes less than the regularization $λ$ .

Theorem 3.7. (Convexity of Weighted_k-Means in number of clusters, from Aggarwal, Schieber, and Tokuyama 1994) Recall Weighted_k-Means( $C$ , $y^{'}$ , $w$ ) from Definition 3.3 for number of clusters $C$ , 1D points $y^{'}$ , and weights $w$ . We have

W e i g h t e d_k - M e a n s (C - 1, y^{'}, w) + W e i g h t e d_k - M e a n s (C + 1, y^{'}, w) \geq 2 \times W e i g h t e d_k - M e a n s (C, y^{'}, w) .

(16)

Other bounds that help reduce the search space (e.g. Leaf Bounds, Splitting Bounds, Permutation Bound, Subset Bound) can be found in Appendix B.

4. Algorithm

We implemented OSRT based on the GOSDT (Lin et al. 2020) framework, which uses a dynamic-programming-with-bounds formulation. Each subproblem in this formulation is identified by a support set $s = {s_{1}, s_{2}, \dots, s_{N}}$ , where $s_{i}$ is a boolean value indicating whether point $i$ is in the support set $s$ . Each leaf and branching node corresponds to a subproblem, recording which samples traverse through that node (or leaf). GOSDT records and updates lower and upper bounds of the objective for each subproblem and stores them in a dependency graph. The dependency graph summarizes the relationship among subproblems. In dynamic programming formulations, finding tight bounds is crucial in reducing the runtime of the algorithm, because that is the key to eliminating large portions of search space. Our k-Means-based bounds for regression are tight and substantially reduce time-to-optimality, as we show in Section 6.4 and Appendix J.1. Like GOSDT, our method finds the optimal trees when the lower and upper bounds of the objective converge. Algorithm 1 below is a subroutine of OSRT.

Compute_Lower_Bound (Algorithm 1):

This algorithm implements the k-Means Equivalent Points Lower Bound as defined in Theorem 3.5. We leveraged a k-Means solver from Song and Zhong (2020), which is a dynamic programming formulation that fills in a $C$ by $N$ matrix, where $C$ represents the number of clusters and $N$ corresponds to the number of samples. We do not assume a maximum value for $C$ and instead grow the table one row at a time, using the fill_kmeans_dp function from their implementation. Each point ( $a$ , $b$ ) in the table represents the optimal k-Means loss using $a$ clusters and the first $b$ datapoints.

Line 1-3: Compute equivalent target set by grouping equivalent points together, and gather all of their labels. Lines 4-5: Compute weight $w$ and value $v$ that defines the k-Means problem. Lines 6-8: Initialize current loss, loss, previous loss, $l o s s^{'}$ , number of clusters used, nClusters, and dynamic programming table, dp_table. Lines 9-17: Solve weighted k-Means problem by adding clusters one at a time. Line 11: Retrieve loss using nClusters clusters from the last entry of the last filled row of dynamic programming table. Lines 12-14: Terminate algorithm if we can no longer benefit from adding more clusters as the reduction of loss by adding one cluster is monotonically decreasing. See Theorem 3.7. Line 18: Compute constant correction term, correction, that restores weighted k-Means to constrained k-Means problem (see Theorem 3.6).

5. Comparison of Regression Tree Optimization Methods

Unlike other methods, OSRT can optimize regression trees without a hard depth constraint and support mean absolute error (L1 loss). Table 1 summarizes the comparison of different regression tree optimization methods. Blue cells are comparative advantages, and red cells are comparative disadvantages.

Table 1:

Comparison of OSRT, IAI (Interpretable AI 2022), Evtree (Grubinger, Zeileis, and Pfeiffer 2014), GUIDE (Loh 2002), CART (Breiman et al. 1984), ORT (Dunn 2018) and DTIP (Verwer and Zhang 2017). Executables for IAI and GUIDE are available, but their source code is not. DPB is dynamic programming with bounds, MIO is mixed integer optimization.

	OSRT	IAI	Evtree	GUIDE	CART	ORT	DTIP
Guarantee optimality	Yes	No	No	No	No	Yes	Yes
Optimization strategy	DPB	Local Search	Evolutionary	Greedy search	Greedy Search	MIO	MIO
Can optimize without depth constraint	Yes	No	No	Yes	Yes	No	No
Support (weighted) least absolute deviation	Yes	No	No	No	Yes	Unknown	Unknown
Implementation available	Yes	Yes (Executable Only)	Yes	Yes (Executable Only)	Yes	No	No

Open in a new tab

6. Experiments

We ran experiments on 12 datasets; the details are described in Appendix C.1. Our evaluation answers the following:

Are trees generated by existing regression tree optimization methods truly optimal? How well do optimal sparse regression trees generalize? How far from optimal are greedy-approach models? (§6.1)
Does each method yield consistently high-quality results? (§6.2)
How fast does OSRT converge, given that it guarantees optimality? (§6.3)
How much do our novel bounds contribute to the performance of OSRT? (§6.4)
What do optimal regression trees look like? (§6.5)

Algorithm 1: compute_lower_bound(dataset, sub,

λ

)
→ lower_bound
// For a subproblem sub and regularization

λ

, compute its
Equivalent k-Means Lower Bound

\begin{matrix} 1 : Let U = the set of unique sample x_{i} \in s u b \\ ∕ ∕ F o r e a c h u n i q u e s a m p l e i n U, c r e a t e a s e t o f a l l t a r g e t s y \\ ∕ ∕ c o r r e s p o n d i n g t o c o p i e s o f t h a t s a m p l e i n s u b \\ ∕ ∕ (e q u i v a l e n t p o i n t s e t s) \\ 2 : E = \emptyset \\ 3 : \forall x_{i} \in U, E_{x_{i}} . a p p e n d ({y_{j}} ∣ \forall x_{j} \in s u b, y_{j} i f x_{i} = x_{j}) \\ ∕ ∕ F o r e a c h u n i q u e s a m p l e i n U, c o m p u t e t h e n u m b e r o f \\ ∕ ∕ i d e n t i c a l s a m p l e s t o i t (p r o d u c i n g t h e v e c t o r w) a n d t h e \\ ∕ ∕ a v e r a g e o f a l l t a r g e t s (p r o d u c i n g t h e v e c t o r v) \\ 4 : w \leftarrow {∣ E_{x} ∣ ∣ E_{x} \in E} where E_{x} \subset y is a set of \\ targets for one unique x \in U . \\ 5 : v \leftarrow {\bar{E_{x}} ∣ E_{x} \in E} where \bar{E_{x}} denotes a v e r a g e of E_{x} \\ 6 : l o s s, l o s s^{'} \leftarrow inf \\ 7 : n C l u s t e r s \leftarrow 1 \\ ∕ ∕ W e i n i t i a l i z e t h e d y n a m i c p r o g r a m m i n g t a b l e w i t h \\ ∕ ∕ n o r o w s, b u t o n e c o l u m n f o r e a c h e l e m e n t o f U \\ 8 : d p_t a b l e \leftarrow [] [∣ E ∣] \\ 9 : while t r u e do \\ ∕ ∕ F i l l i n t h e {(n C l u s t e r s)}^{t h} r o w o f d p_t a b l e \\ 10 : d p_t a b l e \leftarrow f i l l_k m e a n s_d p (n C l u s t e r s - 1, w, v, \\ d p_t a b l e) \\ 11 : l o s s \leftarrow d p_t a b l e [(n C l u s t e r - 1, ∣ E ∣ - 1)] \\ 12 : if l o s s^{'} - l o s s \leq λ then \\ ∕ ∕ A d d i n g t h i s c l u s t e r d o e s n o t r e d u n c e l o s s e n o u g h \\ ∕ ∕ t o j u s t i f y a d d i t i o n o f a n o t h e r c l u s t e r \\ 13 : break \\ 14 : end \\ 15 : n C l u s t e r s \leftarrow n C l u s t e r s + 1 \\ 16 : l o s s^{'} \leftarrow l o s s \\ 17 : end \\ ∕ ∕ C o r r e c t f r o m w e i g h t e d k - M e a n s t o c o n s t r a i n e d k - M e a n s \\ 18 : c o r r e c t i o n \leftarrow \sum_{x_{j} \in s u b} y_{j}^{2} - \sum_{x_{i} \in U} w_{i} v_{i}^{2} \\ 19 : return l o s s^{'} + λ \times n C l u s t e r s + c o r r e c t i o n \end{matrix}

Open in a new tab

6.1. Optimality and Generalization

We compare trees produced by CART (Breiman et al. 1984), GUIDE (Loh 2002), IAI (Interpretable AI 2022), Evtree (Grubinger, Zeileis, and Pfeiffer 2014) and OSRT, trained on various datasets. For each method, we swept a range of hyperparameters to illustrate the relationship between loss and sparsity (IAI, Evtree, and OSRT all penalize the number of leaves). Optimization experiments in Appendix D and cross-validation experiments in Appendix H, along with a demonstration of these results in Figure 2 show: (1) trees produced by other methods are usually sub-optimal even if they claim optimality (they do not prove optimality), and only our method can consistently find the optimal trees, which are the most efficient frontiers that optimize the trade-off between loss and sparsity, (2) OSRT has the best generalization performance among methods, and (3) we can now quantify how far from optimal other methods are.

Figure 2: — Training and testing loss achieved by IAI, Evtree, GUIDE, CART, OSRT on dataset *airfoil*, $d = 5$ .

6.2. Controllability

Unlike IAI and Evtree, our method does not rely on random seeds. The results returned by OSRT are consistently high quality, while those of IAI and Evtree are not. Figure 3 shows the stochasticity of various methods. Trees produced by IAI and Evtree have large variance in complexity and accuracy if we do not fix the random seed. High variance of loss and sparsity can result in inaccuracy and overfitting. Details and results of this experiment can be found in Appendix F.

Figure 3: — Variance (horizontal and vertical lines) of trees generated by IAI, Evtree, OSRT using 10 different random seeds on dataset *real-estate.*

6.3. Speed and Scalability

Our method is one of the fastest regression tree optimization methods and the only one that also guarantees optimality. Figure 4 shows that OSRT performs well in run time, and Figure 5 shows its outstanding scalability when tackling a large dataset with over 2 million samples. As the number of sample increases, Evtree slows down more than other methods and cannot converge within a 30-minute time limit when the sample size exceeds 50,000. More results are shown in Appendices G and I.

Figure 5: — Training time of CART, GUIDE, IAI, Evtree and OSRT as a function of sample size on dataset *household*, $d = 5$ , $λ = 0.035$ . (30-minutes time limit; Evtree timed out when sample size is beyond 50,000)

6.4. Value of k-Means Lower Bound

The squared error used in regression tasks tends to make the equivalent points lower bound loose, preventing us from pruning more of the search space. The novel k-Means lower bound allows us to aggressively prune the search space, and Figure 6 shows that for the airfoil data set, the k-Means lower bound converged in less than one-fourth the time it took the equivalent points bound to converge. More results can be found in Appendix J.1.

Figure 6: — The time saved by k-Means lower bound (blue) over equivalent points bound (yellow), using $λ = 0.005$ . The optimal solution is found when the lower bound equals objective. The k-Means bound converges in under a second.

6.5. Optimal Trees

Figure 7 presents two optimal trees generated by OSRT on dataset servo, with and without a depth constraint respectively, using the same regularization parameter. It shows that imposing a depth constraint sacrifices the global optimality of Equation 1. More results regarding the ablation study of depth limit can be found in Appendix J.2, and Appendix L compares optimal trees generated by OSRT and sub-optimal trees generated by other methods.

7. Conclusion

We provide the first method to find provably-optimal regression trees within a reasonable time. Our method quickly and consistently finds an optimal sparse model that tends to generalize well. Our method also scales well even for large datasets. OSRT provides a naturally human-interpretable option for solving regression problems in contrast to other, uninterpretable methods such as ridge regression, support vector regression, ensemble methods and neural networks.

Supplementary Material

NIHMS1982232-supplement-1.pdf^{(4.7MB, pdf)}

Acknowledgments

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), the National Institute on Drug Abuse (NIDA) under grant DA054994, and the National Science Foundation (NSF) under grant IIS-2130250.

Footnotes

Code Availability

The implementation of OSRT is available at https://github.com/ruizhang1996/optimal-sparse-regression-tree-public. Our experiment code is available at https://github.com/ruizhang1996/regression-tree-benchmark.

References

Aggarwal A; Schieber B; and Tokuyama T 1994. Finding a minimum-weightk-link path in graphs with the concave monge property and applications. Discrete & Computational Geometry, 12(3): 263–280. [Google Scholar]
Aghaei S; Gomez A; and Vayanos P 2020. Learning Optimal Classification Trees: Strong Max-Flow Formulations. arXiv e-print arXiv:2002.09142. [Google Scholar]
Angelino E; Larus-Stone N; Alabi D; Seltzer M; and Rudin C 2017. Learning certifiably optimal rule lists for categorical data. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). [Google Scholar]
Avellaneda F. 2020. Efficient inference of optimal decision trees. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 3195–3202. [Google Scholar]
Bennett KP; and Blue JA 1996. Optimal decision trees. Rensselaer Polytechnic Institute Math Report, 214: 24. [Google Scholar]
Bertsimas D; Dunn J; and Wang Y 2021. Near-optimal Nonlinear Regression Trees. Operations Research Letters, 49(2): 201–206. [Google Scholar]
Blanquero R; Carrizosa E; Molero-Río C; and Morales DR 2020. Sparsity in optimal randomized classification trees. European Journal of Operational Research, 284(1): 255–272. [Google Scholar]
Blanquero R; Carrizosa E; Molero-Río C; and Morales DR 2022. On sparse optimal regression trees. European Journal of Operational Research, 299(3): 1045–1054. [Google Scholar]
Breiman L; Friedman JH; Olshen RA; and Stone CJ 1984. Classification and Regression Trees. Wadsworth. [Google Scholar]
Demirović E; Lukina A; Hebrard E; Chan J; Bailey J; Leckie C; Ramamohanarao K; and Stuckey PJ 2022. MurTree: Optimal Decision Trees via Dynamic Programming and Search. Journal of Machine Learning Research, 23(26): 1–47. [Google Scholar]
Dobkin D; Fulton T; Gunopulos D; Kasif S; and Salzberg S 1997. Induction of shallow decision trees. IEEE Transactions on Pattern Analysis and Machine Intelligence. [Google Scholar]
Dua D; and Graff C 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml. Accessed: 2022-04-01. [Google Scholar]
Dunn J. 2018. Optimal Trees for Prediction and Prescription. Ph.D. thesis, Massachusetts Institute of Technology. [Google Scholar]
Farhangfar A; Greiner R; and Zinkevich M 2008. A Fast Way to Produce Optimal Fixed-Depth Decision Trees. In International Symposium on Artificial Intelligence and Mathematics (ISAIM). [Google Scholar]
Grubinger T; Zeileis A; and Pfeiffer K-P 2014. evtree: Evolutionary learning of globally optimal classification and regression trees in R. Journal of Statistical Software, 61: 1–29. [Google Scholar]
Günlük O; Kalagnanam J; Li M; Menickelly M; and Scheinberg K 2021. Optimal decision trees for categorical data via integer programming. Journal of Global Optimization, 1–28. [Google Scholar]
Hu H; Siala M; Hebrard E; and Huguet M-J 2020. Learning optimal decision trees with maxsat and its integration in adaboost. In IJCAI-PRICAI 2020, 29th International Joint Conference on Artificial Intelligence and the 17th Pacific Rim International Conference on Artificial Intelligence. [Google Scholar]
Hu X; Rudin C; and Seltzer M 2019. Optimal Sparse Decision Trees. In Proceedings of Conference on Neural Information Processing Systems (NeurIPS). [Google Scholar]
Interpretable AI, L. 2022. Interpretable AI Documentation. https://www.interpretable.ai. Accessed: 2022-04-01.
Janota M; and Morgado A 2020. Sat-based encodings for optimal decision trees with explicit paths. In International Conference on Theory and Applications of Satisfiability Testing, 501–518. Springer. [Google Scholar]
Lin J; Zhong C; Hu D; Rudin C; and Seltzer M 2020. Generalized and scalable optimal sparse decision trees. In Proceedings of International Conference on Machine Learning (ICML), 6150–6160. [Google Scholar]
Loh W-Y 2002. Regression tress with unbiased variable selection and interaction detection. Statistica Sinica, 361–386. [Google Scholar]
McTavish H; Zhong C; Achermann R; Karimalis I; Chen J; Rudin C; and Seltzer M 2022. Fast Sparse Decision Tree Optimization via Reference Ensembles. In Proceedings of AAAI Conference on Artificial Intelligence. [DOI] [PMC free article] [PubMed] [Google Scholar]
Morgan JN; and Sonquist JA 1963. Problems in the analysis of survey data, and a proposal. J. Amer. Statist. Assoc, 58: 415–434. [Google Scholar]
Narodytska N; Ignatiev A; Pereira F; and Marques-Silva J 2018. Learning Optimal Decision Trees with SAT. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), 1362–1368. [Google Scholar]
Nijssen S; and Fromont E 2007. Mining optimal decision trees from itemset lattices. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 530–539. ACM. [Google Scholar]
Nijssen S; and Fromont E 2010. Optimal constraint-based decision tree induction from itemset lattices. Data Mining and Knowledge Discovery, 21(1): 9–51. [Google Scholar]
Nijssen S; Schaus P; et al. 2020. Learning Optimal Decision Trees Using Caching Branch-and-Bound Search. In Proceedings of AAAI Conference on Artificial Intelligence (AAAI). [Google Scholar]
Payne HJ; and Meisel WS 1977. An algorithm for constructing optimal binary decision trees. IEEE Transactions on Computers, C-26(9): 905–916. [Google Scholar]
Quinlan JR 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann. [Google Scholar]
Sathishkumar V; Park J; and Cho Y 2020. Using data mining techniques for bike sharing demand prediction in metropolitan city. Computer Communications, 153: 353–366. [Google Scholar]
Shati P; Cohen E; and McIlraith S 2021. SAT-based approach for learning optimal decision trees with non-binary features. In 27th International Conference on Principles and Practice of Constraint Programming (CP 2021). Schloss Dagstuhl-Leibniz-Zentrum für Informatik. [Google Scholar]
Song M; and Zhong H 2020. Efficient weighted univariate clustering maps outstanding dysregulated genomic zones in human cancers. Bioinformatics, 36(20): 5027–5036. [DOI] [PMC free article] [PubMed] [Google Scholar]
VE S; and Cho Y 2020. A rule-based model for Seoul Bike sharing demand prediction using weather data. European Journal of Remote Sensing, 53(sup1): 166–183. [Google Scholar]
Verhaeghe H; Nijssen S; Pesant G; Quimper C-G; and Schaus P 2019. Learning optimal decision trees using constraint programming. In The 25th International Conference on Principles and Practice of Constraint Programming (CP2019). [Google Scholar]
Verwer S; and Zhang Y 2017. Learning decision trees with flexible constraints and objectives using integer optimization. In International Conference on AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems, 94–103. Springer. [Google Scholar]
Verwer S; and Zhang Y 2019. Learning optimal classification trees using a binary linear program formulation. In Proceedings of AAAI Conference on Artificial Intelligence (AAAI). [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

NIHMS1982232-supplement-1.pdf^{(4.7MB, pdf)}

[R1] Aggarwal A; Schieber B; and Tokuyama T 1994. Finding a minimum-weightk-link path in graphs with the concave monge property and applications. Discrete & Computational Geometry, 12(3): 263–280. [Google Scholar]

[R2] Aghaei S; Gomez A; and Vayanos P 2020. Learning Optimal Classification Trees: Strong Max-Flow Formulations. arXiv e-print arXiv:2002.09142. [Google Scholar]

[R3] Angelino E; Larus-Stone N; Alabi D; Seltzer M; and Rudin C 2017. Learning certifiably optimal rule lists for categorical data. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). [Google Scholar]

[R4] Avellaneda F. 2020. Efficient inference of optimal decision trees. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 3195–3202. [Google Scholar]

[R5] Bennett KP; and Blue JA 1996. Optimal decision trees. Rensselaer Polytechnic Institute Math Report, 214: 24. [Google Scholar]

[R6] Bertsimas D; Dunn J; and Wang Y 2021. Near-optimal Nonlinear Regression Trees. Operations Research Letters, 49(2): 201–206. [Google Scholar]

[R7] Blanquero R; Carrizosa E; Molero-Río C; and Morales DR 2020. Sparsity in optimal randomized classification trees. European Journal of Operational Research, 284(1): 255–272. [Google Scholar]

[R8] Blanquero R; Carrizosa E; Molero-Río C; and Morales DR 2022. On sparse optimal regression trees. European Journal of Operational Research, 299(3): 1045–1054. [Google Scholar]

[R9] Breiman L; Friedman JH; Olshen RA; and Stone CJ 1984. Classification and Regression Trees. Wadsworth. [Google Scholar]

[R10] Demirović E; Lukina A; Hebrard E; Chan J; Bailey J; Leckie C; Ramamohanarao K; and Stuckey PJ 2022. MurTree: Optimal Decision Trees via Dynamic Programming and Search. Journal of Machine Learning Research, 23(26): 1–47. [Google Scholar]

[R11] Dobkin D; Fulton T; Gunopulos D; Kasif S; and Salzberg S 1997. Induction of shallow decision trees. IEEE Transactions on Pattern Analysis and Machine Intelligence. [Google Scholar]

[R12] Dua D; and Graff C 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml. Accessed: 2022-04-01. [Google Scholar]

[R13] Dunn J. 2018. Optimal Trees for Prediction and Prescription. Ph.D. thesis, Massachusetts Institute of Technology. [Google Scholar]

[R14] Farhangfar A; Greiner R; and Zinkevich M 2008. A Fast Way to Produce Optimal Fixed-Depth Decision Trees. In International Symposium on Artificial Intelligence and Mathematics (ISAIM). [Google Scholar]

[R15] Grubinger T; Zeileis A; and Pfeiffer K-P 2014. evtree: Evolutionary learning of globally optimal classification and regression trees in R. Journal of Statistical Software, 61: 1–29. [Google Scholar]

[R16] Günlük O; Kalagnanam J; Li M; Menickelly M; and Scheinberg K 2021. Optimal decision trees for categorical data via integer programming. Journal of Global Optimization, 1–28. [Google Scholar]

[R17] Hu H; Siala M; Hebrard E; and Huguet M-J 2020. Learning optimal decision trees with maxsat and its integration in adaboost. In IJCAI-PRICAI 2020, 29th International Joint Conference on Artificial Intelligence and the 17th Pacific Rim International Conference on Artificial Intelligence. [Google Scholar]

[R18] Hu X; Rudin C; and Seltzer M 2019. Optimal Sparse Decision Trees. In Proceedings of Conference on Neural Information Processing Systems (NeurIPS). [Google Scholar]

[R19] Interpretable AI, L. 2022. Interpretable AI Documentation. https://www.interpretable.ai. Accessed: 2022-04-01.

[R20] Janota M; and Morgado A 2020. Sat-based encodings for optimal decision trees with explicit paths. In International Conference on Theory and Applications of Satisfiability Testing, 501–518. Springer. [Google Scholar]

[R21] Lin J; Zhong C; Hu D; Rudin C; and Seltzer M 2020. Generalized and scalable optimal sparse decision trees. In Proceedings of International Conference on Machine Learning (ICML), 6150–6160. [Google Scholar]

[R22] Loh W-Y 2002. Regression tress with unbiased variable selection and interaction detection. Statistica Sinica, 361–386. [Google Scholar]

[R23] McTavish H; Zhong C; Achermann R; Karimalis I; Chen J; Rudin C; and Seltzer M 2022. Fast Sparse Decision Tree Optimization via Reference Ensembles. In Proceedings of AAAI Conference on Artificial Intelligence. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] Morgan JN; and Sonquist JA 1963. Problems in the analysis of survey data, and a proposal. J. Amer. Statist. Assoc, 58: 415–434. [Google Scholar]

[R25] Narodytska N; Ignatiev A; Pereira F; and Marques-Silva J 2018. Learning Optimal Decision Trees with SAT. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), 1362–1368. [Google Scholar]

[R26] Nijssen S; and Fromont E 2007. Mining optimal decision trees from itemset lattices. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 530–539. ACM. [Google Scholar]

[R27] Nijssen S; and Fromont E 2010. Optimal constraint-based decision tree induction from itemset lattices. Data Mining and Knowledge Discovery, 21(1): 9–51. [Google Scholar]

[R28] Nijssen S; Schaus P; et al. 2020. Learning Optimal Decision Trees Using Caching Branch-and-Bound Search. In Proceedings of AAAI Conference on Artificial Intelligence (AAAI). [Google Scholar]

[R29] Payne HJ; and Meisel WS 1977. An algorithm for constructing optimal binary decision trees. IEEE Transactions on Computers, C-26(9): 905–916. [Google Scholar]

[R30] Quinlan JR 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann. [Google Scholar]

[R31] Sathishkumar V; Park J; and Cho Y 2020. Using data mining techniques for bike sharing demand prediction in metropolitan city. Computer Communications, 153: 353–366. [Google Scholar]

[R32] Shati P; Cohen E; and McIlraith S 2021. SAT-based approach for learning optimal decision trees with non-binary features. In 27th International Conference on Principles and Practice of Constraint Programming (CP 2021). Schloss Dagstuhl-Leibniz-Zentrum für Informatik. [Google Scholar]

[R33] Song M; and Zhong H 2020. Efficient weighted univariate clustering maps outstanding dysregulated genomic zones in human cancers. Bioinformatics, 36(20): 5027–5036. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R34] VE S; and Cho Y 2020. A rule-based model for Seoul Bike sharing demand prediction using weather data. European Journal of Remote Sensing, 53(sup1): 166–183. [Google Scholar]

[R35] Verhaeghe H; Nijssen S; Pesant G; Quimper C-G; and Schaus P 2019. Learning optimal decision trees using constraint programming. In The 25th International Conference on Principles and Practice of Constraint Programming (CP2019). [Google Scholar]

[R36] Verwer S; and Zhang Y 2017. Learning decision trees with flexible constraints and objectives using integer optimization. In International Conference on AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems, 94–103. Springer. [Google Scholar]

[R37] Verwer S; and Zhang Y 2019. Learning optimal classification trees using a binary linear program formulation. In Proceedings of AAAI Conference on Artificial Intelligence (AAAI). [Google Scholar]

PERMALINK

Optimal Sparse Regression Trees

Rui Zhang

Rui Xin

Margo Seltzer

Cynthia Rudin

Abstract

1. Introduction

Figure 1:

2. Notation and Objective

3. Bounds

3.1. Lower Bounds

3.2. Equivalent Points

3.3. k-Means Lower Bound

3.4. k-Means Equivalent Points Lower Bound

3.5. Computing k-Means Equivalent Points Bound

4. Algorithm

Compute_Lower_Bound (Algorithm 1):

5. Comparison of Regression Tree Optimization Methods

Table 1:

6. Experiments

6.1. Optimality and Generalization

Figure 2:

6.2. Controllability

Figure 3:

6.3. Speed and Scalability

Figure 4:

Figure 5:

6.4. Value of k-Means Lower Bound

Figure 6:

6.5. Optimal Trees

Figure 7:

7. Conclusion

Supplementary Material

Acknowledgments

Footnotes

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases