A Random Forest Approach for Bounded Outcome Variables

Leonie Weinhold; Matthias Schmid; Richard Mitchell; Kelly O Maloney; Marvin N Wright; Moritz Berger

doi:10.1080/10618600.2019.1705310

. Author manuscript; available in PMC: 2021 Jun 11.

Published in final edited form as: J Comput Graph Stat. 2020;29(3):639–658. doi: 10.1080/10618600.2019.1705310

A Random Forest Approach for Bounded Outcome Variables

Leonie Weinhold ¹, Matthias Schmid ¹, Richard Mitchell ², Kelly O Maloney ³, Marvin N Wright ⁴, Moritz Berger ¹

PMCID: PMC8193767 NIHMSID: NIHMS1700853 PMID: 34121830

Abstract

Random forests have become an established tool for classification and regression, in particular in high-dimensional settings and in the presence of non-additive predictor-response relationships. For bounded outcome variables restricted to the unit interval, however, classical modeling approaches based on mean squared error loss may severely suffer as they do not account for heteroscedasticity in the data. To address this issue, we propose a random forest approach for relating a beta dis-tributed outcome to a set of explanatory variables. Our approach explicitly makes use of the likelihood function of the beta distribution for the selection of splits dur-ing the tree-building procedure. In each iteration of the tree-building algorithm it chooses one explanatory variable in combination with a split point that maximizes the log-likelihood function of the beta distribution with the parameter estimates de-rived from the nodes of the currently built tree. Results of several simulation studies and an application using data from the U.S.A. National Lakes Assessment Survey demonstrate the properties and usefulness of the method, in particular when compared to random forest approaches based on mean squared error loss and parametric regression models.

Keywords: Beta distribution, Bounded outcome variables, Random forests, Regression modeling

1. Introduction

In observational studies one frequently encounters bounded outcome variables. Important examples are (i) relative frequency measures restricted to the unit interval (0,1), like the household coverage rate (Fang and Ma, 2013), the percentage of body fat (Fang et al., 2018) or DNA methylation levels (Weinhold et al., 2016), and (ii) continuous response scales bounded between 0 and 100, like visual analogue scales (Bilcke et al., 2017) or health-related quality of life (HRQoL) scales (Hunger et al., 2012; Mayr et al., 2018). The objective of statistical analyses is often to build a regression model that relates the bounded outcome to a set of explanatory variables X = (X₁, …, X_p)^⊤. When modeling bounded outcomes a problem arises when the variance is not constant, which implies heteroscedasticity and violates the assumption of Gaussian regression (Carroll and Ruppert, 1988; Lesaffre et al., 2007). Therefore, a popular approach is to apply a transformation, e.g. logistic transformation or arcsine square root transformation, and to fit a Gaussian regression model to the transformed outcome, see, for example, Box and Cox (1964). Given the bounded outcome y ∈ (0, 1) the logistic function, defined by log(y/(1 − y)), maps y to the interval (−∞, ∞), and the arcsine square root function, defined by arcsine $(\sqrt{y})$ , maps y to the interval $(0, \frac{π}{2})$ . Both transformations are variance-stabilizing functions, which means that the variance becomes approximately constant after transformation (Warton and Hui, 2011). A remaining limitation of an analysis based on transformed data is that inference is not possible on the original scale. This complicates the interpretation of the estimated predictor-response relationships and may lead to biased predictions (see below).

A more flexible method that does not require variable transformations is beta regression, see Ferrari and Cribari-Neto (2004), Cribari-Neto and Zeileis (2010) and the extensions developed by Smithson and Verkuilen (2006), Simasand et al. (2010) and Grün et al. (2012). One benefit that makes beta regression a popular tool for modeling bounded outcome variables is the flexibility of the shape of the probability density function (p.d.f.) of the beta distribution, which includes symmetrical shape and left or right skewness. Furthermore, beta regression allows for a simple interpretation of predictor-response relationships in terms of multiplicative increases/decreases of the expected outcome, which is analogous to logistic regression for a binary outcome. An additional benefit of beta regression is illustrated by the results of a small simulation presented in Figure 1. We illustrate that Gaussian models (based on transformed data) can lead to a notable bias in the predicted values on the original scale. For the simulation we first generated right-skewed outcome variables y ∈ (0, 1), which were related to one normally distributed explanatory variable x (1000 observations, 100 replications, see Appendix A for a more detailed description of the simulation design). Next we fitted Gaussian regression models to untransformed data, to logistic transformed data and to arcsine square root transformed data, as well as a beta regression model. The left panel of Figure 1 shows average differences between the true conditional mean and the fitted mean on the original scale obtained from the four model fits. The right panel depicts the data of one randomly selected replication and the corresponding fitted mean functions. It is seen that the fitted values of the three Gaussian models (particularly those obtained from the model fitted to the untransformed data) strongly deviated from the true values, whereas the beta regression model yielded unbiased results.

Figure 1: — Boxplots of the average differences between the true mean and the fitted mean obtained from 100 replications (left panel). Simulated data (*y, x*) of one randomly selected replication and the corresponding fitted mean functions (right panel). The true mean function is marked by the black line.

A large part of the methodology on beta regression refers to parametric regression models using simple linear combinations of the explanatory variables, that is, one assumes that the effects of the explanatory variables on the outcome are linear. In many applications, however, this assumption is too restrictive, for example, when higher-order interactions between explanatory variables are present. Also, the specification of a parametric model may result in a large number of parameters relative to the sample size, which may affect estimation accuracy. When the number of explanatory variables exceeds the number of observations, classical maximum likelihood estimation in a beta regression model is infeasible.

These issues can be addressed by the use of recursive partitioning (also termed tree-based modeling), which has its roots in automatic interaction detection. The most popular version is due to Breiman et al. (1984) and is known by the name classification and regression trees (CART). A disadvantage of tree-based methods, however, is that the resulting trees are often affected by a large variance, i.e. that minor changes in the data may result in completely different trees. Therefore, when the focus is on prediction accuracy, it is often worth stabilizing the results obtained from single trees by applying ensemble methods like random forests (Breiman, 2001; Ishwaran, 2007). In general, the idea of random forests is to reduce the variance of the predictions by averaging over an ensemble of many trees.

In this article, we propose a random forest approach tailored to the modeling of bounded outcome variables. In contrast to classical approaches, which use impurity measures (for classification) or the mean squared error (for regression), we propose to use the likelihood of the beta distribution as splitting criterion for tree building. In each iteration of the tree-building procedure, the algorithm chooses one explanatory variable in combination with a split point that maximizes the log-likelihood function of the beta distribution, with the parameter estimates directly derived from the nodes of the currently built tree. The new approach is very efficient, as the optimization of the log-likelihood does not require an iterative procedure, but is simply based on the estimates of the mean and the variance (i.e. the first and second moments) in the nodes. Related approaches that use the likelihood function as splitting criterion for tree building have been among others proposed by Su et al. (2004) for continuous outcomes and by Bou-Hamad et al. (2009) and Bou-Hamad et al. (2011) for discrete time-to-event data. Our proposed method has been implemented in the R add-on package ranger (Wright and Ziegler, 2017), available at https://github.com/imbs-hl/ranger (version 0.11.3) and will be part of the next ranger version on CRAN. The package is also part of the supplementary materials of this paper.

The remainder of the article is organized as follows: In Section 2 we present the notation and methodology of the proposed random forest approach for modeling bounded outcomes, along with a description of classical parametric beta regression models. Section 3 compares the properties and the performance of the proposed method to classical beta regression models and alternative random forest approaches in a simulation study considering different data-generating models. In Section 4 we tested our method using data collected for the 2007 U.S.A. National Lakes Assessment (NLA) Survey (US Environmental Protection Agency (USEPA), 2009), which surveyed on the condition of lakes, ponds and reservoirs across the U.S. The main findings of the article are summarized in Section 5.

2. Methods

2.1. Notation and Methodology

Let Y be the bounded outcome variable of interest. Without loss of generality, we assume that Y ∈ (0, 1) takes values on the unit interval, and that Y follows a beta distribution with p.d.f.

f_{Y} (y; μ, ϕ) = \frac{Γ (ϕ)}{Γ (μ ϕ) Γ ((1 - μ) ϕ)} y^{μ ϕ - 1} {(1 - y)}^{(1 - μ) ϕ - 1},

(1)

where Γ(·) denotes the gamma function. The p.d.f. in (1) is defined by two parameters, namely the location parameter μ ∈ (0, 1) and the precision parameter ϕ > 0. With this parametrization, the mean and variance of Y are, respectively,

E (Y) = μ and var (Y) = \frac{μ (1 - μ)}{ϕ + 1} .

(2)

Depending on the values of the parameters μ and ϕ, the p.d.f. (1) can take on a number of different shapes, including symmetrical shape and left or right skewness, or even uniform shape (Cribari-Neto and Zeileis, 2010).

To relate Y to the vector of explanatory variables X = (X₁, …, X_p)^T, one usually assumes that the relationship between the conditional mean μ|X := E(Y |X) and X is given by

μ | X = h (X^{T} β) = h (η (X)),

(3)

where h(·) is a monotonic, twice differentiable response function, e.g. the inverse logistic function, η(·) is the predictor function, and β = (β₁, …, β_p)^T is a vector of unknown real-valued coefficients (Ferrari and Cribari-Neto, 2004). When setting h equal to the inverse logistic function, the terms exp(β₁), …, exp(β_p) have a simple interpretation in terms of multiplicative increase or decrease of the conditional mean μ|X. For example, if β_k > 0, k ∈ {1, …, p}, increasing X_k by one unit implies that the odds of the outcome (μ/(1 − μ)) is increased by the factor exp(β_k). The model based on the inverse logistic function will be used for comparison purposes in the simulation study in Section 3.

Estimates of the regression parameters β are usually obtained by maximum likelihood estimation. Let y_i, i = 1, …, n, be a set of independent realizations of Y with means μ_i and precision parameters ϕ_i, and let x_i = (x_i1, …, x_ip)^T, i = 1, …, n, be real-valued observations on p explanatory variables. Then the log-likelihood function of model (3) is defined as

l (y_{1}, \dots, y_{n}, μ_{1}, \dots, μ_{n}, ϕ_{1}, \dots, ϕ_{n}) = \sum_{i = 1}^{n} {\log (Γ (ϕ_{i})) - \log (Γ (μ_{i} ϕ_{i})) - \log (Γ ((1 - μ_{i}) ϕ_{i})) + (μ_{i} ϕ_{i} - 1) \log (y_{i}) + ((1 - μ_{i}) ϕ_{i} - 1) \log (1 - y_{i})} .

(4)

Typically, in beta regression models, the precision parameter is assumed to be constant for all observations (ϕ_i ≡ ϕ for i = 1, …, n). This is analogous to the assumption for the variance σ² in a Gaussian regression model. The more general form of the log-likelihood function specified in Equation (4) is exploited by the proposed approach (see Section 2.2).

The parametric model (3) is linear in β, implying that the explanatory variables have a linear effect on the transformed outcome. However, in practice, this assumption may often be too restrictive, because explanatory variables may show complex interactions. In principle the linear relationship in (3) could be extended by incorporating interaction terms. However, for modeling interactions in classical regression approaches, the specification of the corresponding terms in the predictor function η(X) of the model is required, i.e. they need to be specified beforehand. This is a major challenge in real applications, due to the mostly unknown nature of these terms. Yet another problem when applying classical regression approaches arises when the number of parameters exceeds the number of observations (which is a likely scenario when many interaction terms are present), in particular when p > n. This leads to an overdetermined system for which unregularized maximum likelihood estimation is bound to fail.

2.2. Tree-Based Beta Regression

To address the aforementioned problems arising in classification and Gaussian regression, several statistical methods were developed, including the popular Classification and Regression Trees (CART; Breiman et al., 1984). CART are built by recursively dividing the predictor space (defined by the explanatory variables X) into disjoint regions (i.e. subgroups of the observations) using binary splits. The tree-building procedure starts from the top node (called root node), which contains all observations. In each splitting step, a single explanatory variable X_k, k = 1, …, p, and a corresponding split point is selected, along which the node (denoted by M) is divided into two subsets M₁ and M₂ (called children nodes). The scale of the variable X_k defines the form of the binary split rule. For a metrically scaled or ordinal variable X_k, the partition into two subsets has the form M₁ = M ∩{X_k ≤ c} and M₂ = M ∩{X_k > c}, where c denotes the split point. For multi-categorical variables without ordering X_k ∈ {1, …, r}, the partition has the form M₁ = M ∩ C₁ and M₂ = M ∩ C₂, where C₁ ⊂ {1, …, r} and C₂ ⊂ {1, …, r}\C₁ are disjoint, non-empty subsets. The splitting is repeated in each newly created child node until a specified stopping criterion is met (see Section 2.3). In each resulting terminal node, the conditional mean μ|X of all observations belonging to this node is fitted by a constant, e.g. the mean of the outcome values in regression trees and the most frequent class in classification trees.

For classification trees, one of the most common splitting criteria (to select the explanatory variables and splitting rules in each step), is the Gini impurity, which is minimal if all the observations are perfectly separated by their class. For regression trees, the usual splitting criterion is the mean squared error (MSE). In the presence of a beta distributed outcome variable, however, the MSE causes problems. This is because the variance depends on the mean parameter μ, which violates the homoscedasticity assumption, see Equation (2). Hence, use of the mean squared error as splitting criterion may lead to less accurate predictions, since splits are more strongly affected by data with high variability. More specifically, because regression trees seek to minimize the within-node variance, there will be a tendency to split nodes with high variance, which may result in poor predictions in low-variance nodes (Moisen, 2008).

To address this problem and to account for heteroscedasticity in Y, we introduce an alternative splitting criterion based on the log-likelihood of the beta distribution. This criterion forms the building block of the proposed random forest approach (see Section 2.3). Given the data (y_i, x_i), i = 1, …, n, the log-likelihood (4) can be evaluated by estimating the distribution parameters μ_i and ϕ_i for each observation i. During the tree-building procedure, the estimation of the two parameters is conducted node-wise: In each node M one estimates the covariate-free location parameter μ_M and the covariate-free precision parameter ϕ_M of the beta distribution by

μ_{M} = \frac{1}{| M |} \sum_{i \in A_{M}} y_{i} and ϕ_{M} = \frac{μ_{M} (1 - μ_{M})}{\frac{1}{| M | - 1} \sum_{i \in A_{M}} {(y_{i} - μ_{M})}^{2}} - 1,

(5)

respectively, where A_M = {i ∈ {1, …, n}|x_i ∈ M} denotes the set of observations in node M. Correspondingly, $| M | = \sum_{i = 1}^{n} 1_{x_{i} \in M}$ is the number of observations in M. Given Q nodes of a currently built tree, the individual distribution parameters for each observation i are then obtained by

μ_{i} = μ_{M} \forall i \in A_{M}, M = 1, \dots, Q, i = 1, \dots, n,

(6)

ϕ_{i} = ϕ_{M} \forall i \in A_{M}, M = 1, \dots, Q, i = 1, \dots, n .

(7)

In each iteration of the proposed tree-building procedure, the algorithm chooses one explanatory variable X_k in combination with one split point that maximizes the log-likelihood function (4) when splitting node M into the children nodes M₁ and M₂. After termination of the tree-building algorithm the fitted conditional mean μ|X for each observation is obtained by Equation (6).

Algorithm 1:

Schematic overview of the proposed beta forest. During tree building, the algorithm applies the split rule introduced in Section 2.2.

Initialization: Fix num.trees, mtry and min.node.size.

Bootstrapping: Draw num.trees bootstrap samples from the original data.

Tree Building: For each of the num.trees bootstrap samples fit the tree-based beta regression model (as described in Section 2.2). More specifically, in each node of the trees,

– draw mtry candidate variables out of p variables,

– select the candidate variable and the split point that maximize the log-likelihood of the beta distribution and split the data into two children nodes,

– continue tree growing as long as the number of observations in each node is larger than min.node.size.

Prediction: For a new observation, drop the values of the explanatory variables down to the final nodes of the num.trees trees built in step ‘Tree Building’. Compute the ensemble estimate of the conditional mean by averaging the num.trees estimates of μ|X.

Open in a new tab

2.3. Random Forests for Beta Regression

A major advantage of tree-based methods is the simple and intuitive interpretability of the model. This is particularly important when the aim is to build an easy-to-interpret prediction formula and to quantify the effect of a specific explanatory variable on the outcome. A drawback, however, is that the resulting tree estimators are often very unstable. This means that a small variation in the data can result in very different splits, which negatively affects prediction accuracy. To overcome this problem, we propose the beta forest, which is a random forest algorithm that is based on the splitting criterion introduced in Section 2.2.

The main principle of the beta forest is to generate a fixed number of samples from the original data (denoted by num.trees) using bootstrap procedures, and to apply the tree-building algorithm to each of the samples. To mitigate the similarity of the resulting trees, the set of explanatory variables that are available for splitting in each node is a random subset (of predefined size mtry < p) of all available explanatory variables. During tree building, each node is split until the number of observations falls below a (predefined) minimal node size (denoted by min.node.size). The algorithm terminates when all current nodes are flagged as terminal nodes. As in classical random forest approaches we suggest to grow large trees realized by a small minimal node size (defined by min.node.size). Analogous to the MSE-based random forest approach, the fitted conditional mean μ|X for an observation i is obtained by averaging the estimates of the conditional means over all trees of the random forest. A schematic overview of the beta forest algorithm is provided in Algorithm 1.

The proposed algorithm has been implemented in the R add-on package ranger (Wright and Ziegler, 2017). The default values of the tuning parameters of the beta forest are set to num.trees = 500 and $m t r y = \sqrt{p}$ , as for all the previously implemented regression methods.

3. Simulation Study

In this section we present results of several simulations to evaluate the performance of the beta forest. The aims of the simulations were: (i) to analyze the accuracy of the beta forest in terms of detecting the informative variables in higher dimensional settings with a large number of non-influential variables, (ii) to compare the beta forest to classical MSE-based random forest approaches in the presence of a bounded outcome variable, and (iii) to compare the beta forest to parametric beta regression models in higher dimensional settings and in the presence of interactions between the explanatory variables.

3.1. Simulation Design

Each simulated data set consisted of n_train = 500 observations. The outcome variable of each simulated data set was randomly drawn from a beta distribution with a constant precision parameter ϕ and the conditional mean μ|X = h(η(X₁, …, X₄)). Here, h(·) was defined as the inverse logistic function, and the predictor function

η (X_{1}, \dots, X_{4}) = β_{0} + β_{1} X_{1} + β_{2} X_{2} + β_{3} X_{3} + β_{4} X_{4} + β_{5} X_{1} X_{2} + β_{6} X_{2} X_{3} + β_{7} X_{3} X_{4} + β_{8} X_{1} X_{4},

(8)

was composed of four independent binary explanatory variables X_k ∈ {1, 2}, with P (X_k = 1) = P (X_k = 2) = 0.5, k = 1, …, 4. By definition, the true predictor function (8) contained four linear main effects and four two-factor interactions.

We simulated symmetrically distributed data with average location parameter value μ|X = 0.5 (mean over all n_train observations) by setting β = (0.2, 0.3, 0.4, −0.1, −0.3, −0.3,−0.4, 0.1, 0.3)^⊤. For the precision parameter we used the values ϕ = 2 (low), ϕ = 4 (moderate) and ϕ = 8 (high), yielding average variances of 0.08, 0.05 and 0.03, respectively. Furthermore, to assess the robustness of the methods against the degree of noise in the data we added independent, non-informative explanatory variables X_k ∈ {1, 2} to the predictor space. Including the four informative variables, we considered scenarios with p = 4 (informative), p = 10 (low-dimensional), p = 100 (moderate-dimensional) and p = 200 (high-dimensional) explanatory variables. In total this resulted in 3 × 4 = 12 different scenarios. Each simulation scenario was replicated 100 times.

3.2. Accuracy of the Beta Forest

In a first step of the analysis we focused on the three moderate-dimensional scenarios (p = 100) and evaluated whether the beta forest was able to detect the four informative variables. Figure 2 shows the proportion of trees (related to num.trees=500) in which the variables X₁, …, X₁₀₀ were selected for splitting during tree building. It is seen that the algorithm worked quite well. In particular in the high precision scenario (ϕ = 8), variables X₁, …, X₄ were selected in far more than 50% of the fitted trees, whereas variables X₅, …, X₁₀₀ were only chosen randomly. Similar pictures resulted for p=10 and p=200 (not shown).

Figure 2: — Results of the simulations, evaluating the selection accuary of beta forest. The boxplots depict the proportion of trees in which the variables X₁, …, X₁₀₀ were selected for splitting for the three scenarios with 100 variables. Only the first four variables were informative. Note that the precision parameter ϕ varies across the rows.

3.3. Model Comparison

In a second step of the analysis we considered all 12 scenarios and compared the beta forest (i) to alternative random forest approaches differing with regard to the transformation applied to the outcome variable Y, and (ii) to parametric beta regression models differing with regard to the pre-specified predictor function. Specifically, we examined:

the proposed beta forest (beta-rF),
the classical random forest approach with splitting criterion MSE on the original, untransformed data (rF),
the classical random forest approach with splitting criterion MSE on arcsine square root transformed data (asin-rF),
the classical random forest approach with splitting criterion MSE on logistic-transformed data (logit-rF),
a fully specified parametric beta regression model with linear predictor η(X) = β₀ + X^⊤β (linear-bR),
a parametric beta regression model without any explanatory variable, i.e. predictor η(X) = β₀ (int-bR), which served as a null model, and
a parametric beta regression model with predictor function (8) of the true underlying data generating process (true-bR) which served as gold standard.

All random forest models (a) – (d) were fitted with the R package ranger (Wright and Ziegler, 2017) using the default values of num.trees. Since the proposed beta forest relies on numerically stable estimates of the location and precision parameter in each node, we set min.node.size = 10 for all random forest approaches, which is a little higher than the default value min.node.size = 5 for regression trees in ranger. The parametric beta regression models (e) – (g) were fitted using the R package betareg (Cribari-Neto and Zeileis, 2010).

We evaluated the performance of the modeling approaches with respect to predicting the outcomes of future observations by computing the predictive log-likelihood values (based on the beta, Gaussian, logistic-normal or arcsine-normal distribution) and the MSE on an independently drawn test data set of n_test = 500 observations in each replication. The optimal value of the parameter mtry for models (a) – (d) was respectively determined beforehand using an independent sample of 1000 observations. To obtain the precision parameter ϕ of model (a), the log-likelihood function of the beta distribution was optimized on the sample of n_train = 500 observations in each replication, after plugging in the conditional mean estimates of the random forest. The variance parameters σ² of the models (b) – (d) were obtained accordingly.

3.4. Comparison to Parametric Beta Regression

The accuracy with regard to the predictive log-likelihood values obtained from the beta forest and the parametric beta regression models (e) – (g) for the 12 scenarios with varying ϕ (rows) and p (columns) are shown in Figure 3.

Figure 3: — Results of the simulations, comparing the beta forest to parametric beta regression models. Predictive log-likelihood values of the parametric beta regression models and the beta forest for the 12 scenarios with symmetrically distributed outcome (average location parameter μ = 0.5). The precision parameter ϕ varies across the rows, the number of variables p varies across the columns. The gray lines refer to the median values of the true model.

It is seen that the performance of the fully specified beta regression model (linear-bR) was comparable to the beta forest in the scenarios with only informative explanatory variables as well as in the scenarios with moderate-dimensional data (first and second column), with a slight advantage of the beta forest in the informative, high precision scenario (p = 4, ϕ = 8). Further, the performance of the linear-bR substantially deteriorated with an increasing number of non-informative variables (third and fourth column of Figure 3), strongly favoring the beta forest. The difference became even more apparent for smaller values of the precision parameter ϕ. As was to be expected and throughout all settings, the model that served as gold standard (true-bR) outperformed all other models. The relatively good performance of the simple int-bR model in the high-dimensional scenarios revealed that all models had difficulties to retrieve the information contained in these data sets.

The better performance of the beta forest in the moderate- and high-dimensional scenarios is mainly explained by the variable selection mechanism, which is enforced when fitting random forests. Moreover, the result in the informative, high precision scenario (where the performance is almost equal to the true-bR) suggests that the beta forest adequately accounted for the interaction terms contained in the data-generating model (8).

The comparison of the different modeling approaches with regard to the MSE is shown in the upper panel of Figure 10 in Appendix B. The results agree closely with those in Figure 3. The int-bR performed worst and the true-bR performed best in terms of the MSE. In the informative and low-dimensional scenarios the linear-bR and the beta-rF yielded very similar results. In contrast to that, the beta-rF outperformed the linear-bR in the moderate- and high-dimensional scenarios.

3.5. Comparison of Random Forest Approaches

Figure 4 shows the accuracy with regard to the predictive log-likelihood values of the random forest approaches (a) – (d) for the 12 scenarios. Throughout all scenarios, the beta forest outperformed the other random forest approaches. Generally, all approaches gained in prediction accuracy with an increasing value of the precision parameter ϕ and a decreasing number of non-informative variables. The worst performance was obtained for the classical random forest on the untransformed data (rF), even though the difference vanished with increasing precision ϕ. The models based on the arcsine square root transformation (asin-rF) and the logistic transformation (logit-rF) performed roughly equally well in all scenarios. The largest differences to the beta forest were seen in the four low precision scenarios with ϕ = 2 (first row of Figure 4).

When evaluating the performance of the random forest approaches with regard to the MSE, there was no remarkable difference between the four random forest approaches throughout all scenarios (lower panel of Figure 10 in Appendix B). On the other hand, it appears to be questionable whether the MSE is a suitable measure for quantifying model performance as it treats each residual equally and does not account for any heteroscedasticity in the data (Gelfand, 2015). Specifically, Gelfand (2015) reported a degradation of all considered methods when using the MSE for performing model choice in simulations. This result was also confirmed in our simulation study. Therefore, we focus our considerations to the predictive log-likelihood in the following.

3.6. Further Simulation Results

We conducted further simulations to evaluate the beta forest. First, we simulated right-skewed beta distributed data with average location parameter value μ|X = 0.2 by setting β = (0.2, 0.3, 0.4, −0.1, −0.3, −0.3, −0.4, −0.1, −0.3)^⊤, using the predictor function (8) as in the previous simulations. Analogous to Section 3.1, we simulated 12 different scenarios using the same values for the precision parameter ϕ (2, 4 and 8) and the same number of explanatory variables (4, 10, 100 and 200). Again, each simulation scenario was replicated 100 times.

The results for the 12 scenarios with right-skewed data are shown in Figure 11 in Appendix B. Regarding the beta regression models there were only minor differences to the results with symmetrically distributed data throughout all scenarios. One notable difference in these scenarios was the prediction performance of the logit-rF: In the low precision scenarios (first row in the lower panel of Figure 11), the logit-rF yielded comparable results to the beta-rF and even outperformed the beta-rF in the scenarios with only informative explanatory variables and low-dimensional data. Furthermore, the performance of the logit-rF strongly deviated from the asin-rF in all scenarios.

To assess the robustness of the beta forest we additionally simulated data from the logit-normal distribution. We used again predictor function (8) and simulated logit-normal distributed symmetric and right-skewed data with average location parameters μ|X = 0.5 and μ|X = 0.2 by setting β = −(0.04, 0.08, 0.12, 0.16, 0.20, 0.16, 0.12, 0.08, 0.04)^⊤ and β = (0, −1.0, 1.0, −1.0, 1.0, −0.5, 0.5, −0.5, 0.5)^⊤, respectively. According to the different scenarios described in Section 3.1 we simulated data with three different variance parameters (σ² ∈ {0.5, 1.3, 2}) and four different dimensions (p ∈ {4, 10, 100, 200}). Instead of fitting parametric beta regression models, we replaced models (e) - (g) by parametric Gaussian regression models fitted to the logit-transformed data (abbreviated by linear-logit, true-logit and int-logit). All the results are shown in Figures 12 and 13 in Appendix B.

In both simulations, with symmetrically and right-skewed distributed outcome variables, and throughout all scenarios the true-logit model yielded the best results in terms of the predictive log-likelihood. While in the informative and low-dimensional scenarios the beta-rF performed worse than the linear-logit model, beta-rF performed better in the moderate- and high-dimensional scenarios (independent of the variance). A similar tendency was observed when comparing the random forest approaches (lower panels of Figures 12 and 13 in Appendix B). In the simulations with symmetrically logit-normal distributed outcome variables the logit-rF performed best in the informative and low-dimensional scenarios, but the beta-rF showed equal or better performance in the moderate- and high- dimensional scenarios. In all scenarios the classical rF approach performed worse than the alternative methods. In the simulations with right-skewed outcome, the logit-rF resulted in the best prediction performance throughout all scenarios, except of the high-dimensional, low variance scenario, where the beta-rF yielded the highest predictive log-likelihood values.

4. Application

To illustrate the application of the proposed beta forest, we present the results of an analysis of a data set collected for the 2007 U.S.A. National Lakes Assessment (NLA) Survey (US Environmental Protection Agency (USEPA), 2009). The same data set was previously considered in Schmid et al. (2013).

4.1. The NLA Database

In summer 2007 the NLA survey sampled 1,157 lakes across the conterminous U.S. to assess the condition of lakes, ponds and reservoirs. One of the aims of the study was to investigate the associations between the condition of lakes and explanatory variables, such as data on habitat condition (e.g., riparian vegetation condition, shoreline substrate, fish cover), human activities (e.g., docks, roads, buildings), lake watershed conditions (e.g., land use/land cover, precipitation, elevation), biological and chemical measures of a sample taken at the deepest point of the lake (index site) as well as estimates of lake connectivity and colonization sources (number and total surface areal coverage of other lakes within prespecified radii of the sampling location). These associations are particularly important for the prediction of the conditions of unsurveyed lakes that could not be included in the NLA survey due to logistic and cost reasons.

The condition of lakes often is measured by indicators, such as percentages of the sampled biological community, mainly those considered intolerant or tolerant of stressors. One indicator, which is routinely used in the evaluation of stream and lake health and which constitutes the outcome variable in the following analysis, is the percentage of benthic macroinvertebrate taxa in the order Ephemeroptera (mayflies, here denoted as EPHEptax) (Barbour et al., 1999; Maloney and Feminella, 2006). Figure 5 shows the unconditional distribution of EPHEptax. Altogether, 80 explanatory variables were incorporated for statistical analysis. The exclusion of all observations with missing values in any of the explanatory variables resulted in an analysis data set with n = 994 lakes. The explanatory variables with a highly right-skewed distribution were log transformed before fitting models for EPHEptax. The full list of explanatory variables can be found in Appendix C and was also given in Schmid et al. (2013).

Figure 5: — Analysis of the NLA database. Distribution of the bounded outcome variable EPHEptax in the analysis data set (n = 994), which exhibits a right-skewed shape. The median percentage was 0.049.

4.2. Comparison of Methods

In a first analysis step, we considered a parametric beta regression model with linear predictor η(X) = β₀ +X^⊤β. This resulted in a very large number of parameters compared to the number of observations in the data and resulted in numerical problems when fitting the model using betareg. Moreover, the additional incorporation of possibly relevant interactions became numerically infeasible. Therefore, in a second step of the analysis we reduced our considerations to the random forest approaches as described in Section 3.3 (a)-(d) that enforce variable selection and account for interactions between the explanatory variables.

We evaluated the goodness-of-fit of the models with respect to their predictive performance. This was done by computing the predictive log-likelihood values using repeated 3-fold cross validation (100 replications). The results in Figure 6 show that the beta forest performed best among the random forest approaches. As expected, the performance of the classical random forest based on the original, untransformed data (rF) was inferior to the predictive performance of the asin-rF and the beta-rF. Surprisingly, the logit-rF was the worst-performing method. The latter result can be explained by the specific characteristics of the data: Fitting a covariate-free beta regression model yielded the estimate $\hat{ϕ} = 17.4$ . Along the same line, the last row of Figure 11 in Appendix B demonstrates that for right-skewed distributed outcome the performance of the logit-rF strongly suffered from an increasing value of ϕ compared to the alternative random forest approaches. Thus, the performance of the logit-rF applied to the NLA database is in line with the pattern that has already been observed in the simulations.

Figure 6: — Analysis of the NLA database. Predictive log-likelihood values of the different random forest appraoches (a)-(d) based on 3-fold cross validation and 100 replications.

4.3. Results of the Beta Forest

In the final step of the analysis we fitted the beta forest to the entire NLA data set. We computed the variable importance and used partial dependence plots (Cutler et al., 2007; Hastie et al., 2009) to examine the association between EPHEptax and the explanatory variables. The beta forest variable importance was calculated using a permutation based approach (Breiman, 2001) and was derived as follows: For each explanatory variable x_k, k = 1, …, p, and for each of the num.trees trees we computed the difference between the predictive beta log-likelihood in the out-of-bag sample and the predictive beta log-likelihood in the out-of-bag sample with permuted values of x_k. If x_k was strongly associated with the outcome variable, this resulted in a large (mean) difference of the predictive log-likelihood values among the num.trees trees. The latter difference, which naturally reflects the splitting criterion of the beta forest algorithm, was then used to measure the importance of x_k.

In addition, we calculated p-values for the variable importance measures by repeatedly permuting the values of the outcome variable EPHEptax in the original data (1000 replications) and recomputing the variable importance measures as described above (Altmann et al., 2010). For each k, this resulted in an approximation of the distribution under the null hypothesis that there is no association to the outcome and a corresponding p-value.

The variable importance values and the corresponding p-values for all 80 explanatory variables are shown in Figure 7. Using the significance level α = 0.05 resulted in a total of 34 variables with a strong evidence for an effect. This number was comparable to the results described by Schmid et al. (2013), who found 27 variables to be associated with EPHEptax. Four of the top five identified important variables (conductivity, potassium, sulfate and sodium) highlight the importance of salinity on freshwater ecosystems (Entrekin et al., 2018; Kaushal et al., 2018) and Ephemeroptera in particular (Beermann et al., 2018; Kefford, 2018). The partial dependence plot show conductivity exhibiting an inverse U-shaped effect on EPHEptax, with a peak of 5.75% at 600 uS/cm (left panel of Figure 8). As further example we consider the predictor-response relationship for ba_developed (proportion of developed land in catchment), which has previously been shown to be associated with stream condition (Allan, 2004; Maloney and Weller, 2011). The marginal estimate of ba_developed showed a rapid declining effect on EPHEptax up to 8% after which the declining pattern flattened (right panel of Figure 8). Such sensitivity of benthic macroinvertebates to developed land (i.e., urbanization) in upslope catchments has been frequently reported in streams (Schueler et al., 2009) and suggests Ephemeroptera in lakes are similarly sensitive (Schmid et al., 2013).

Figure 8: — Analysis of the NLA database. Marginal estimates of the conditional mean μ|X for conductivity and ba_developed (developed land in catchment). Each tick mark above the x-axis represents a 10^th decile of the variable, respectively. The values of the other explanatory variables where respectively set to the observed mean value or the most frequent class. For the purpose of illustration only the first 9 deciles were shown for each variable.

In Figure 9 the conditional means of EPHEptax for the basin-climate regions are shown. As can be distinctly seen, the Pacific Northwest Region (warm temperature), the south Atlantic-Gulf Region (warm temperature), the Lower Colorado Region (warm temperature) as well as UMissouri (snow) had the lowest percentage of Ephemeroptera. The marginal estimates in these regions ranged between 4.91 % and 5.08 %. The value of EPHEptax was substantially higher in the remaining 29 regions, without any essential differences. In these regions the predicted value of EPHEptax was ranging between 5.50 % and 5.76 %. Similar results on the effects of the basin-climate regions on EPHEptax were reported by Schmid et al. (2013).

5. Concluding Remarks

The beta forest method developed in this paper is a recursive partitioning technique for modeling bounded outcome variables. It provides a convenient solution to the stability problems that arise in regression trees when the classical MSE splitting criterion is applied to heteroscedastic and/or skewed data. At the same time, the beta forest inherit the favorable characteristics of the classical random forest approach, in particular the ability to model non-linear predictor-response relationships, the opportunity to apply the method to high-dimensional data sets, and the automatic incorporation of higher-order interaction terms. Analogous to classical random forests, the beta forest allows for the implementation of variable importance measures, which can be used to rank the explanatory variables in terms of their effects on the outcome variable (as applied in the analysis of the NLA database).

The simulation study in Section 3 showed that the beta forest method (i) yielded comparable or more accurate predictions (depending on the true data-generating model) of bounded outcomes than classical MSE-based random forest approaches in the considered scenarios, and (ii) outperformed classical parametric beta regression models in the presence of higher-dimensional data. Compared to MSE-based random forests with a transformed outcome variable, the beta forest benefit from the fact that it relates the untransformed mean of the bounded outcome to the explanatory variables, thereby avoiding a systematic bias in the predictions caused by back-transformation of the predicted values to the original scale (cf. Figure 1). Clearly, this property is of particular interest in random forest modeling, which is specifically designed to obtain accurate predictions of future and/or unobserved outcome values.

Recently, Schlosser et al. (2018) proposed a related approach that uses the likelihood function as splitting criterion for probabilistic precipitation forecasting in meteorology. In this work, trees and forests were used for prediction modeling in the generalized additive models for location, scale and shape framework (Rigby and Stasinopoulos, 2005). More specifically, Schlosser et al. (2018) considered a zero-censored Gaussian distribution and related all distribution parameters of the respective p.d.f. to explanatory variables using tree-based splits. Their approach is not based on the CART methodology but on conditional inference trees (Hothorn et al., 2006) in which the associations between the outcome variable and the explanatory variables are evaluated using permutation tests. As performance measure, Schlosser et al. (2018) considered the continuous ranked probability score (Gneiting and Raftery, 2007), which constitutes an alternative to the predictive log-likelihood and the MSE used in this paper. Another tree-based approach for beta regression was pro-posed by Grün et al. (2012), who embedded beta regression in the model-based recursive partitioning framework proposed earlier by Zeileis et al. (2008). With this approach, a set of beta regression models (each including one or more explanatory variables) is fitted to subsets of the covariate space that are determined by recursive partitioning techniques.

The random forest algorithm developed in this paper could, in principle, be extended in many ways. For example, one could use subsampling without replacement instead of the bootstrap procedure considered in our simulation study. Differences between these two sampling schemes were, for example, investigated by Strobl et al. (2007) with a focus on the reliability of the resulting variable importance measures. While there is a need to analyze the beta forest with regard to this issue in more detail, we generally expect our method to show a similar behavior as the classical MSE-based methods. Similarly, one could adjust the likelihood-based variable importance measure defined in Section 3 along the lines of Nembrini et al. (2018).

We finally emphasize the high computational efficiency of the beta forest approach, which is attained by the optimization of the likelihood-based splitting criterion using simple moment estimators. The method is implemented in the R add-on package ranger (Wright and Ziegler, 2017), which is based on a fast C++ implementation, and can be readily applied by researchers and practitioners.

Supplementary Material

Sup3

NIHMS1700853-supplement-Sup3.txt^{(3.1KB, txt)}

Sup2

NIHMS1700853-supplement-Sup2.r^{(26.9KB, r)}

Sup1

NIHMS1700853-supplement-Sup1.gz^{(100.4KB, gz)}

Sup6

NIHMS1700853-supplement-Sup6.r^{(11.7KB, r)}

Sup5

NIHMS1700853-supplement-Sup5.r^{(4.2KB, r)}

Sup4

NIHMS1700853-supplement-Sup4.r^{(13KB, r)}

Sup7

NIHMS1700853-supplement-Sup7.r^{(3.4KB, r)}

Sup8

NIHMS1700853-supplement-Sup8.rda^{(187.6KB, rda)}

Sup10

NIHMS1700853-supplement-Sup10.rda^{(3KB, rda)}

Sup9

NIHMS1700853-supplement-Sup9.rda^{(317.2KB, rda)}

Sup13

NIHMS1700853-supplement-Sup13.rda^{(2.5KB, rda)}

Sup11

NIHMS1700853-supplement-Sup11.rda^{(188.1KB, rda)}

Sup12

NIHMS1700853-supplement-Sup12.rda^{(325.1KB, rda)}

Sup16

NIHMS1700853-supplement-Sup16.rda^{(2.2KB, rda)}

Sup14

NIHMS1700853-supplement-Sup14.rda^{(185.9KB, rda)}

Sup15

NIHMS1700853-supplement-Sup15.rda^{(319.7KB, rda)}

Sup19

NIHMS1700853-supplement-Sup19.rda^{(2.4KB, rda)}

Sup17

NIHMS1700853-supplement-Sup17.rda^{(188.6KB, rda)}

Sup18

NIHMS1700853-supplement-Sup18.rda^{(324.9KB, rda)}

Acknowledgements

Any use of trade, product, website, or firm names in the publication is for descriptive purposes only and does not imply endorsement by the U.S. Government.

A. Bias of Back-Transformed Predicted Values

In this small simulation study we generated a right-skewed, beta distributed outcome variable y ~ (0, 1) that was related to one normally distributed explanatory variable x ~ N(0, 1). The predictor function and the conditional mean were defined by

η (x) = β_{0} + β_{1} x and μ | x = \frac{\exp (η (x))}{1 + \exp (η (x))},

respectively, where β₀ = −1 and β₁ = −0.5. The precision parameter was set to ϕ = 8. Each simulated data set consisted of 1000 observations and we performed 100 replications. In each replication we fitted Gaussian regression models to the untransformed data, to the logistic transformed data and to the arcsine square root transformed data, as well as a beta regression model to the untransformed data.

In order to obtain estimates on the original scale of the models based on logistic transformed data and the models based on arcsine square root transformed data, we have applied the following approach: Let g(·) either be the logistic or arcsine square root transformation function, then the model is defined by g(y) = x^T β + ϵ. For each model we randomly drew $\hat{ϵ} = (ϵ_{1}, \dots, ϵ_{K})$ , K = 100, 000 where $\hat{ϵ_{k}} ~ N (0, \hat{σ^{2}})$ with $\hat{σ^{2}} = V a r (ϵ)$ the estimate derived from the respective model. Next, we calculated the expected value of y given x by $E (y | x) = a v e (g^{- 1} (\hat{g (y)}))$ , where $\hat{g (y)} = x^{T} \hat{β} + \hat{ϵ}$ . Finally, we calculated the differences between the true mean and the fitted means on the orignal scale for each model.

B. Further Simulation Results

Figure 10: — simulation study, comparing the beta forest to alternative models. The MSE of the parametric beta regression models and the random forest approaches for the 12 scenarios with symmetrically distributed outcome (average location parameter μ = 0.5). The precision parameter ϕ varies across the rows, the number of variables p varies across the columns. The gray lines refer to the median values of the true model.

Figure 11: — Results of the simulation study, comparing the beta forest to alternative modeling approaches. Predictive log-likelihood values of the parametric beta regression models and the random forest approaches for the 12 scenarios with right-skewed distributed outcome (average location parameter μ = 0.2). The precision parameter ϕ varies across the rows, the number of variables p varies across the columns. The gray lines refer to the median values of the true model.

Figure 12: — Results of the simulation study, comparing the beta forest to alternative modeling approaches. Predictive log-likelihood values of all modeling approaches for the 12 scenarios with symmetrically logit-normal distributed outcome (average location parame-ter μ = 0.5). The variance parameter σ varies across the rows, the number of variables p varies across the columns. The gray lines refer to the median values of the true model.

Figure 13: — Results of the simulation study, comparing the beta forest to alternative modeling approaches. Predictive log-likelihood values of all modeling approaches for the 12 scenarios with right-skewed logit-normal distributed outcome (average location parameter μ = 0.2). The variance parameter σ varies across the rows, the number of variables p varies across the columns. The gray lines refer to the median values of the true model.

C. Predictor Variables of NLA Data

Table 1 contain the full list of explanatory variables used for modeling the percentage of benthic macroinvertebrate taxa (EPHEptax) in Section 4 of the article.

Table 1:

Explanatory variables used for modeling EPHEptax in the application section of the article.

variable name	description	unit / levels

basin_area	basin area	km²
ba_developed	percent of basin area as Developed (NLCD21+ NLCD22+NLCD23+NLCD24)
ba_planted	percent of basin area as Planted/Cultivated (NLCD81+NLCD82)
ba_forest	percent of basin area as Forested Upland (NLCD41+NLCD42+NLCD43)
ba_wetlands	percent of basin area as Wetlands (woody + herbaceous)
longitude	Longitude obtained from NHF (NAD83)	decimal degrees
latitude	Latitude obtained from NHF (NAD83)	decimal degrees
lakearea	lake polygon area from NHD	km²
max_depth	maximum observed lake depth	m
elevation	site elevation from the National Elevation Data Set	m
ph_field	field pH from Profile DO data
conductivity	conductivity	μS/cm @25 °C
anc	Acid Neutralizing Capacity (ANC)	μ eq/L
turbidity	turbidity	Nephelometric Turbidity Units (NTUs)
doc	dissolved organic carbon	mg/L
nitrate	nitrate	mg/L
phosphorus	total phosporus	μg/L
chloride	chloride	mg/L
sulfate	sulfate	mg/L
calcium	calcium	mg/L
potassium	potassium	mg/L
nitrogen	total nitrogen	mg/L
sodium	sodium	mg/L
magnesium	magnesium	mg/L
silica	silica	mg/L SiO_2
ammonium	ammonium	mg/L
chlorophyllA	chlorophyll a concentration	μg/L
sl_index	shoreline development index (lake polygon perimeter/ $2 \sqrt{lake polygon area \cdot π}$ )	m/m2
lake_origin	lake origin	Factor with levels “man-made” and “natural” (including natural lakes augmented by dams)
depth	mean station depth	m
lit_depth	coefficient of variation of littoral depth
bcfc_sandsilt	fractional areal cover of bottom substrate that is silt and sand
bcfc_silt	fractional areal cover of bottom substrate that is silt
bcfc_organic	fractional areal cover of bottom substrate that is organic
bcfc_bedrock	fractional areal cover of bottom substrate that is bedrock
bcfc_boulders	fractional areal cover of bottom substrate that is boulders
bs_dia	mean log10-transformed bottom substrate diameter (mineral)	mm
ami_total	fractional areal cover of littoral floating + emergent macrophytes
amfc_flt_emg	mean log10-transformed shoreline substrate diameter (mineral)	mm
ssx_ldia	mean log10-transformed shoreline substrate diameter (mineral)	mm
hii_nonag	index of nonagricultural human influences (= sum of individual weighted means of nonagricultural influences)
hii_ag	index of agricultural human influences(= sum of individual weighted means of agricultural influences)
rvi_wood	index of total riparian areal cover from woody vegatation
rvn_gnd_innu	count of values of riparian ground areal cover from standing water/innudated vegetation
fci_natural	index of littoral fish cover from natural structures
fci_all	index of total littoral fish cover
fcfc_brush	fractional cover of littoral fish cover that is brush
fcfc_snag	fractional cover of littoral fish cover that is snags
rvn_can_big	count of values of riparian canopy areal cover from large trees (> 30 cm dbh)
ssfc_bedrock	fractional areal cover of shoreline substrate from bedrock
ssfc_boulders	fractional areal cover of shoreline substrate from boulders
hi_weighted	weighted presence of all human influences
horiz_dist	mean horizontal distance to highwater mark	m
vert_dist	mean vertical height to highwater mark	m
lake_perim	lake polygon perimeter from NHD	km
basin_lake_ratio	ratio of drainage basin area to lake surface area
awc_ave	watershed mean of the high values of available water capacity (fraction) of soils from the State Soil Geographic (STATSGO) Database
bd_ave	watershed mean of the high values of soil bulk density of soils from the State Soil Geographic (STATSGO) Database	g/cm³
om_ave	watershed mean of the high value of organic matter content of soils from State Soil Geographic (STATSGO) Database	percent by weight
p_ave	watershed mean of the high values of permeability of soils from the State Soil Geographic (STATSGO) Database	inches / hour
rdh_ave	watershed mean of the high values of depth to bedrock of soils from the State Soil Geographic (STATSGO) Database	inches
sdmntry	percent of the bedrock geology in the watershed classified as sedimentary forms derived from a simplified version of the Generalized Geologic Map of the Conterminous U.S.
dom_geol	geology type with largest percent coverage within the watershed derived from a simplified version of the Generalized Geologic Map of the Conterminous United States	factor with levels “Gneiss”, “Granitic”, “Mafic_UltraMaf”, “Quaternary”, “Sedimentary” and “Volcanic”
kfct_ave	watershed mean of the soil erodibility factor of soils from the State Soil Geographic (STATSGO) Database
sump_pt	sampling point long-term annual precipitation, values based on 30 years (1971–2000) of PRISM climate estimates	mm
tmax	sampling point maximum temperature	°C
tmin	sampling point minimum temperature	°C
tmean	average temperature of the specific summer that field sampling was done at site	°C
psum	total average precipitation for the specific summer that field sampling was done at site	mm
psum_year	total precipitation for previous year at sampling point (estimated total precipitation for the 12 months prior to the field sampling season)	mm
np_ratio	N:P ratio (Total Nitogen/Total Phosphorus)
neardist	distance to the nearest NHDplus waterbody	m
neararea	surface area of nearest NHDplus waterbody	km²
neardist_large	distance to the nearest large (> 1 km² surface area) NHDplus waterbody	m
neararea_large	surface area of nearest large (> 1 km² surface area) NHDplus waterbody	km²
no_lakes1k	number of NHDplus waterbodies within a 1 km radius of sampling site
area_lakes1k	total surface area of NHDplus waterbodies within a 1 km radius of sampling site	km²
no_lakes20k	number of NHDplus waterbodies within a 20 km radius of sampling site
area_lakes20k	total surface area of NHDplus waterbodies within a 20 km radius of sampling site	km²
huc2_climate	NHDplus HUC2 drainage basin intersected with Köoppen-Geiger Climate Classification	factor variable (see Kottek et al. (2006) for definitions of 6 factor levels)

Open in a new tab

References

Allan JD (2004). Landscapes and riverscapes: the influence of land use on stream ecosystems. Annual Review of Ecology, Evolution, and Systematics 35, 257–284. [Google Scholar]
Altmann A, Toloşi L, Sander O, and Lengauer T (2010). Permutation importance: a corrected feature importance measure. Bioinformatics 26, 1340–1347. [DOI] [PubMed] [Google Scholar]
Barbour MT, Gerritsen J, Snyder BD, Stribling JB, et al. (1999). Rapid bioassessment protocols for use in streams and wadeable rivers: periphyton, benthic macroinvertebrates and fish, Volume 339. US Environmental Protection Agency, Office of Water; Washington, DC. [Google Scholar]
Beermann AJ, Elbrecht V, Karnatz S, Ma L, Matthaei CD, Piggott JJ, and Leese F (2018). Multiple-stressor effects on stream macroinvertebrate communities: a mesocosm experiment manipulating salinity, fine sediment and flow velocity. Science of the Total Environment 610, 961–971. [DOI] [PubMed] [Google Scholar]
Bilcke J, Hens N, and Beutels P (2017). Quality-of-life: a many-splendored thing? Belgian population norms and 34 potential determinants explored by beta regression. Quality of Life Research 26, 2011–2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bou-Hamad I, Larocque D, and Ben-Ameur H (2011). Discrete-time survival trees and forests with time-varying covariates: application to bankruptcy data. Statistical Modelling 11, 429–446. [Google Scholar]
Bou-Hamad I, Larocque D, Ben-Ameur H, Mˆasse LC, Vitaro F, and Tremblay RE (2009). Discrete-time survival trees. Canadian Journal of Statistics 37, 17–32. [Google Scholar]
Box GE and Cox DR (1964). An analysis of transformations. Journal of the Royal Statistical Society: Series B (Methodological) 26, 211–243. [Google Scholar]
Breiman L (2001). Random forests. Machine Learning 45, 5–32. [Google Scholar]
Breiman L, Friedman J, Olshen R, and Stone C (1984). CART: Classification and Regression Trees. New York: Chapman and Hall, Wadsworth. [Google Scholar]
Carroll RJ and Ruppert D (1988). Transformation and Weighting in Regression. New York: Chapman and Hall. [Google Scholar]
Cribari-Neto F and Zeileis A (2010). Beta regression in R. Journal of Statistical Software 34, 1–24. [Google Scholar]
Cutler DR, Edwards TC, Beard KH, Cutler A, Hess KT, Gibson J, and Lawler JJ (2007). Random forests for classification in ecology. Ecology 88, 2783–2792. [DOI] [PubMed] [Google Scholar]
Entrekin SA, Clay NA, Mogilevski A, Howard-Parker B, and Evans-White MA (2018). Multiple riparian–stream connections are predicted to change in response to salinization. Philosophical Transactions of the Royal Society B 374:20180042. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fang K, Fan X, Lan W, and Wang B (2018). Nonparametric additive beta regression for fractional response with application to body fat data. Annals of Operations Research. 10.1007/s10479-018-2875-2. [DOI] [Google Scholar]
Fang K and Ma S (2013). Three-part model for fractional response variables with application to chinese household health insurance coverage. Journal of Applied Statistics 40, 925–940. [Google Scholar]
Ferrari S and Cribari-Neto F (2004). Beta regression for modelling rates and proportions. Journal of Applied Statistics 31, 799–815. [Google Scholar]
Gelfand SJ (2015). Understanding the impact of heteroscedasticity on the predictive ability of modern regression methods. Graduating Extended Essay, published online, https://summit.sfu.ca/item/15679. [Google Scholar]
Gneiting T and Raftery AE (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102, 359–378. [Google Scholar]
Grün B, Kosmidis I, and Zeileis A (2012). Extended beta regression in R: shaken, stirred, mixed, and partitioned. Journal of Statistical Software 48, 1–25. [Google Scholar]
Hastie T, Tibshirani R, and Friedman J (2009). The Elements of Statistical Learning (2nd Ed.). New York: Springer. [Google Scholar]
Hothorn T, Hornik K, Van De Wiel MA, and Zeileis A (2006). A lego system for conditional inference. The American Statistician 60, 257–263. [Google Scholar]
Hunger M, Döring A, and Holle R (2012). Longitudinal beta regression models for analyzing health-related quality of life scores over time. BMC Medical Research Methodology 12:144. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ishwaran H (2007). Variable importance in binary regression trees and forests. Electronic Journal of Statistics 1, 519–537. [Google Scholar]
Kaushal SS, Likens GE, Pace ML, Utz RM, Haq S, Gorman J, and Grese M (2018). Freshwater salinization syndrome on a continental scale. Proceedings of the National Academy of Sciences 115:201711234. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kefford BJ (2018). Why are mayflies (ephemeroptera) lost following small increases in salinity? three conceptual osmophysiological hypotheses. Philosophical Transactions of the Royal Society B 374:20180021. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kottek M, Grieser J, Beck C, Rudolf B, and Rubel F (2006). World map of the Köppen-Geiger climate classification updated. Meteorologische Zeitschrift 15, 259–263. [Google Scholar]
Lesaffre E, Rizopoulos D, and Tsonaka R (2007). The logistic transform for bounded outcome scores. Biostatistics 8, 72–85. [DOI] [PubMed] [Google Scholar]
Maloney KO and Feminella JW (2006). Evaluation of single-and multi-metric benthic macroinvertebrate indicators of catchment disturbance over time at the fort benning military installation, georgia, usa. Ecological Indicators 6, 469–484. [Google Scholar]
Maloney KO and Weller DE (2011). Anthropogenic disturbance and streams: land use and land-use change affect stream ecosystems via multiple pathways. Freshwater Biology 56, 611–626. [Google Scholar]
Mayr A, Weinhold L, Hofner B, Titze S, Gefeller O, and Schmid M (2018). The betaboost package – a software tool for modelling bounded outcome variables in potentially high-dimensional epidemiological data. International Journal of Epidemiology 47, 1383–1388. [DOI] [PubMed] [Google Scholar]
Moisen G (2008). Classification and regression trees. In Jørgensen SE; Fath BD (Editor-in-Chief). Encyclopedia of Ecology, Volume 1, pp. 582–588. Oxford, UK: Elsevier. [Google Scholar]
Nembrini S, König IR, and Wright MN (2018). The revival of the Gini importance? Bioinformatics 34, 3711–3718. [DOI] [PMC free article] [PubMed] [Google Scholar]
Rigby RA and Stasinopoulos DM (2005). Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society, Series C (Applied Statistics) 54, 507–554. [Google Scholar]
Schlosser L, Hothorn T, Stauffer R, and Zeileis A (2018). Distributional regression forests for probabilistic precipitation forecasting in complex terrain. arXiv:1804.02921. [Google Scholar]
Schmid M, Wickler F, Maloney KO, Mitchell R, Fenske N, and Mayr A (2013). Boosted beta regression. PLOS ONE 8:e61623. [DOI] [PMC free article] [PubMed] [Google Scholar]
Schueler TR, Fraley-McNeal L, and Cappiella K (2009). Is impervious cover still important? review of recent research. Journal of Hydrologic Engineering 14, 309–315. [Google Scholar]
Simasand A, Barreto-Souza W, and Rocha A (2010). Improved estimators for a general class of beta regression models. Computational Statistics & Data Analysis 54, 348–366. [Google Scholar]
Smithson M and Verkuilen J (2006). A better lemon squeezer? Maximum-likelihood regression with beta-distributed dependent variables. Psychological Methods 11, 54–71. [DOI] [PubMed] [Google Scholar]
Strobl C, Boulesteix A-L, Zeileis A, and Hothorn T (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8:25. [DOI] [PMC free article] [PubMed] [Google Scholar]
Su X, Wang M, and Fan J (2004). Maximum likelihood regression trees. Journal of Computational and Graphical Statistics 13, 586–598. [Google Scholar]
US Environmental Protection Agency (USEPA) (2009). National Lakes Assessment: A Collaborative Survey of the Nation’s Lakes. Washington, D.C.: U.S. Environmental Protection Agency; Office of Water and Office of Research and Development. EPA 841-R-09–001. [Google Scholar]
Warton DI and Hui FKC (2011). The arcsine is asinine: The analysis of proportions in ecology. Ecology 92, 3–10. [DOI] [PubMed] [Google Scholar]
Weinhold L, Wahl S, Pechlivanis S, Hoffmann P, and Schmid M (2016). A statistical model for the analysis of beta values in DNA methylation studies. BMC Bioinformatics 17:480. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wright MN and Ziegler A (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77, 1–17. [Google Scholar]
Zeileis A, Hothorn T, and Hornik K (2008). Model-based recursive partitioning. Journal of Computational and Graphical Statistics 17, 492–514. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Sup3

NIHMS1700853-supplement-Sup3.txt^{(3.1KB, txt)}

Sup2

NIHMS1700853-supplement-Sup2.r^{(26.9KB, r)}

Sup1

NIHMS1700853-supplement-Sup1.gz^{(100.4KB, gz)}

Sup6

NIHMS1700853-supplement-Sup6.r^{(11.7KB, r)}

Sup5

NIHMS1700853-supplement-Sup5.r^{(4.2KB, r)}

Sup4

NIHMS1700853-supplement-Sup4.r^{(13KB, r)}

Sup7

NIHMS1700853-supplement-Sup7.r^{(3.4KB, r)}

Sup8

NIHMS1700853-supplement-Sup8.rda^{(187.6KB, rda)}

Sup10

NIHMS1700853-supplement-Sup10.rda^{(3KB, rda)}

Sup9

NIHMS1700853-supplement-Sup9.rda^{(317.2KB, rda)}

Sup13

NIHMS1700853-supplement-Sup13.rda^{(2.5KB, rda)}

Sup11

NIHMS1700853-supplement-Sup11.rda^{(188.1KB, rda)}

Sup12

NIHMS1700853-supplement-Sup12.rda^{(325.1KB, rda)}

Sup16

NIHMS1700853-supplement-Sup16.rda^{(2.2KB, rda)}

Sup14

NIHMS1700853-supplement-Sup14.rda^{(185.9KB, rda)}

Sup15

NIHMS1700853-supplement-Sup15.rda^{(319.7KB, rda)}

Sup19

NIHMS1700853-supplement-Sup19.rda^{(2.4KB, rda)}

Sup17

NIHMS1700853-supplement-Sup17.rda^{(188.6KB, rda)}

Sup18

NIHMS1700853-supplement-Sup18.rda^{(324.9KB, rda)}

[R1] Allan JD (2004). Landscapes and riverscapes: the influence of land use on stream ecosystems. Annual Review of Ecology, Evolution, and Systematics 35, 257–284. [Google Scholar]

[R2] Altmann A, Toloşi L, Sander O, and Lengauer T (2010). Permutation importance: a corrected feature importance measure. Bioinformatics 26, 1340–1347. [DOI] [PubMed] [Google Scholar]

[R3] Barbour MT, Gerritsen J, Snyder BD, Stribling JB, et al. (1999). Rapid bioassessment protocols for use in streams and wadeable rivers: periphyton, benthic macroinvertebrates and fish, Volume 339. US Environmental Protection Agency, Office of Water; Washington, DC. [Google Scholar]

[R4] Beermann AJ, Elbrecht V, Karnatz S, Ma L, Matthaei CD, Piggott JJ, and Leese F (2018). Multiple-stressor effects on stream macroinvertebrate communities: a mesocosm experiment manipulating salinity, fine sediment and flow velocity. Science of the Total Environment 610, 961–971. [DOI] [PubMed] [Google Scholar]

[R5] Bilcke J, Hens N, and Beutels P (2017). Quality-of-life: a many-splendored thing? Belgian population norms and 34 potential determinants explored by beta regression. Quality of Life Research 26, 2011–2023. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] Bou-Hamad I, Larocque D, and Ben-Ameur H (2011). Discrete-time survival trees and forests with time-varying covariates: application to bankruptcy data. Statistical Modelling 11, 429–446. [Google Scholar]

[R7] Bou-Hamad I, Larocque D, Ben-Ameur H, Mˆasse LC, Vitaro F, and Tremblay RE (2009). Discrete-time survival trees. Canadian Journal of Statistics 37, 17–32. [Google Scholar]

[R8] Box GE and Cox DR (1964). An analysis of transformations. Journal of the Royal Statistical Society: Series B (Methodological) 26, 211–243. [Google Scholar]

[R9] Breiman L (2001). Random forests. Machine Learning 45, 5–32. [Google Scholar]

[R10] Breiman L, Friedman J, Olshen R, and Stone C (1984). CART: Classification and Regression Trees. New York: Chapman and Hall, Wadsworth. [Google Scholar]

[R11] Carroll RJ and Ruppert D (1988). Transformation and Weighting in Regression. New York: Chapman and Hall. [Google Scholar]

[R12] Cribari-Neto F and Zeileis A (2010). Beta regression in R. Journal of Statistical Software 34, 1–24. [Google Scholar]

[R13] Cutler DR, Edwards TC, Beard KH, Cutler A, Hess KT, Gibson J, and Lawler JJ (2007). Random forests for classification in ecology. Ecology 88, 2783–2792. [DOI] [PubMed] [Google Scholar]

[R14] Entrekin SA, Clay NA, Mogilevski A, Howard-Parker B, and Evans-White MA (2018). Multiple riparian–stream connections are predicted to change in response to salinization. Philosophical Transactions of the Royal Society B 374:20180042. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] Fang K, Fan X, Lan W, and Wang B (2018). Nonparametric additive beta regression for fractional response with application to body fat data. Annals of Operations Research. 10.1007/s10479-018-2875-2. [DOI] [Google Scholar]

[R16] Fang K and Ma S (2013). Three-part model for fractional response variables with application to chinese household health insurance coverage. Journal of Applied Statistics 40, 925–940. [Google Scholar]

[R17] Ferrari S and Cribari-Neto F (2004). Beta regression for modelling rates and proportions. Journal of Applied Statistics 31, 799–815. [Google Scholar]

[R18] Gelfand SJ (2015). Understanding the impact of heteroscedasticity on the predictive ability of modern regression methods. Graduating Extended Essay, published online, https://summit.sfu.ca/item/15679. [Google Scholar]

[R19] Gneiting T and Raftery AE (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102, 359–378. [Google Scholar]

[R20] Grün B, Kosmidis I, and Zeileis A (2012). Extended beta regression in R: shaken, stirred, mixed, and partitioned. Journal of Statistical Software 48, 1–25. [Google Scholar]

[R21] Hastie T, Tibshirani R, and Friedman J (2009). The Elements of Statistical Learning (2nd Ed.). New York: Springer. [Google Scholar]

[R22] Hothorn T, Hornik K, Van De Wiel MA, and Zeileis A (2006). A lego system for conditional inference. The American Statistician 60, 257–263. [Google Scholar]

[R23] Hunger M, Döring A, and Holle R (2012). Longitudinal beta regression models for analyzing health-related quality of life scores over time. BMC Medical Research Methodology 12:144. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] Ishwaran H (2007). Variable importance in binary regression trees and forests. Electronic Journal of Statistics 1, 519–537. [Google Scholar]

[R25] Kaushal SS, Likens GE, Pace ML, Utz RM, Haq S, Gorman J, and Grese M (2018). Freshwater salinization syndrome on a continental scale. Proceedings of the National Academy of Sciences 115:201711234. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] Kefford BJ (2018). Why are mayflies (ephemeroptera) lost following small increases in salinity? three conceptual osmophysiological hypotheses. Philosophical Transactions of the Royal Society B 374:20180021. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R27] Kottek M, Grieser J, Beck C, Rudolf B, and Rubel F (2006). World map of the Köppen-Geiger climate classification updated. Meteorologische Zeitschrift 15, 259–263. [Google Scholar]

[R28] Lesaffre E, Rizopoulos D, and Tsonaka R (2007). The logistic transform for bounded outcome scores. Biostatistics 8, 72–85. [DOI] [PubMed] [Google Scholar]

[R29] Maloney KO and Feminella JW (2006). Evaluation of single-and multi-metric benthic macroinvertebrate indicators of catchment disturbance over time at the fort benning military installation, georgia, usa. Ecological Indicators 6, 469–484. [Google Scholar]

[R30] Maloney KO and Weller DE (2011). Anthropogenic disturbance and streams: land use and land-use change affect stream ecosystems via multiple pathways. Freshwater Biology 56, 611–626. [Google Scholar]

[R31] Mayr A, Weinhold L, Hofner B, Titze S, Gefeller O, and Schmid M (2018). The betaboost package – a software tool for modelling bounded outcome variables in potentially high-dimensional epidemiological data. International Journal of Epidemiology 47, 1383–1388. [DOI] [PubMed] [Google Scholar]

[R32] Moisen G (2008). Classification and regression trees. In Jørgensen SE; Fath BD (Editor-in-Chief). Encyclopedia of Ecology, Volume 1, pp. 582–588. Oxford, UK: Elsevier. [Google Scholar]

[R33] Nembrini S, König IR, and Wright MN (2018). The revival of the Gini importance? Bioinformatics 34, 3711–3718. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R34] Rigby RA and Stasinopoulos DM (2005). Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society, Series C (Applied Statistics) 54, 507–554. [Google Scholar]

[R35] Schlosser L, Hothorn T, Stauffer R, and Zeileis A (2018). Distributional regression forests for probabilistic precipitation forecasting in complex terrain. arXiv:1804.02921. [Google Scholar]

[R36] Schmid M, Wickler F, Maloney KO, Mitchell R, Fenske N, and Mayr A (2013). Boosted beta regression. PLOS ONE 8:e61623. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R37] Schueler TR, Fraley-McNeal L, and Cappiella K (2009). Is impervious cover still important? review of recent research. Journal of Hydrologic Engineering 14, 309–315. [Google Scholar]

[R38] Simasand A, Barreto-Souza W, and Rocha A (2010). Improved estimators for a general class of beta regression models. Computational Statistics & Data Analysis 54, 348–366. [Google Scholar]

[R39] Smithson M and Verkuilen J (2006). A better lemon squeezer? Maximum-likelihood regression with beta-distributed dependent variables. Psychological Methods 11, 54–71. [DOI] [PubMed] [Google Scholar]

[R40] Strobl C, Boulesteix A-L, Zeileis A, and Hothorn T (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8:25. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R41] Su X, Wang M, and Fan J (2004). Maximum likelihood regression trees. Journal of Computational and Graphical Statistics 13, 586–598. [Google Scholar]

[R42] US Environmental Protection Agency (USEPA) (2009). National Lakes Assessment: A Collaborative Survey of the Nation’s Lakes. Washington, D.C.: U.S. Environmental Protection Agency; Office of Water and Office of Research and Development. EPA 841-R-09–001. [Google Scholar]

[R43] Warton DI and Hui FKC (2011). The arcsine is asinine: The analysis of proportions in ecology. Ecology 92, 3–10. [DOI] [PubMed] [Google Scholar]

[R44] Weinhold L, Wahl S, Pechlivanis S, Hoffmann P, and Schmid M (2016). A statistical model for the analysis of beta values in DNA methylation studies. BMC Bioinformatics 17:480. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R45] Wright MN and Ziegler A (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77, 1–17. [Google Scholar]

[R46] Zeileis A, Hothorn T, and Hornik K (2008). Model-based recursive partitioning. Journal of Computational and Graphical Statistics 17, 492–514. [Google Scholar]

PERMALINK

A Random Forest Approach for Bounded Outcome Variables

Leonie Weinhold

Matthias Schmid

Richard Mitchell

Kelly O Maloney

Marvin N Wright

Moritz Berger

Abstract

1. Introduction

Figure 1:

2. Methods

2.1. Notation and Methodology

2.2. Tree-Based Beta Regression

Algorithm 1:

2.3. Random Forests for Beta Regression

3. Simulation Study

3.1. Simulation Design

3.2. Accuracy of the Beta Forest

Figure 2:

3.3. Model Comparison

3.4. Comparison to Parametric Beta Regression

Figure 3:

3.5. Comparison of Random Forest Approaches

Figure 4:

3.6. Further Simulation Results

4. Application

4.1. The NLA Database

Figure 5:

4.2. Comparison of Methods

Figure 6:

4.3. Results of the Beta Forest

Figure 7:

Figure 8:

Figure 9:

5. Concluding Remarks

Supplementary Material

Acknowledgements

A. Bias of Back-Transformed Predicted Values

B. Further Simulation Results

Figure 10:

Figure 11:

Figure 12:

Figure 13:

C. Predictor Variables of NLA Data

Table 1:

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases