Abstract
Background:
The generalized propensity score (GPS) addresses selection bias due to observed confounding variables and provides a means to estimate causal effects of continuous treatment doses in propensity score analyses. Estimating the GPS with parametric models requires researchers to meet conditions that are often unrealistic in practice, such as correct model specification, normally distributed variables, and large sample sizes.
Objectives:
The purpose of this Monte Carlo simulation study is to examine the performance of neural networks as compared to full factorial regression models to estimate GPS in the presence of Gaussian and skewed treatment doses and small to moderate sample sizes.
Research design:
A detailed conceptual introduction of neural networks is provided, as well as an illustration of selection of hyperparameters to estimate GPS. An example from public health and nutrition literature uses residential distance as a treatment variable to illustrate how neural networks can be used in a propensity score analysis to estimate a dose–response function of grocery spending behaviors.
Results:
Comparing the true GPS with the estimated scores, we found substantially higher correlations and lower mean squared error values for neural networks. The implication is that more selection bias was removed using GPS estimated with neural networks than using GPS estimated with classical regression.
Conclusions:
This study proposes a new methodological procedure, neural networks, to estimate GPS. Neural networks are not sensitive to the assumptions of linear regression and other parametric models and have been shown to be a contender against parametric approaches to estimate propensity scores for continuous treatments.
Keywords: data mining, propensity scores, selection bias
In secondary data analyses, researchers often adjust for the observed covariates and assume that unobservables are balanced. While this assumption is usually untestable in practice (Imbens & Rubin, 2015), neural networks (NN) may have advantages over conventional approaches such as ordinary least squares regression. This article provides a tutorial on the use of NN to estimate generalized propensity scores (GPS) for continuous treatments. Accurate GPS estimation is essential to subsequent steps in propensity score analyses (e.g., weighting). Rosenbaum and Rubin (1983, 1984) presented proof that if the propensity score model is correctly specified, it will balance the distribution of observed covariates and result in an unbiased estimate of the treatment effect.
NN have advantages over conventional methods for estimating GPS in that they can approximate any polynomial function, regardless of the number of interaction terms (Westreich et al., 2010). Many social scientists are unfamiliar with NN and with training machine learning models; those scientists can use this article as an introductory guide for estimating GPS with NN. Given this primary aim, we first present a Monte Carlo simulation study to compare NN with well-established propensity score estimation methods. We then evaluate grocery spending habits in communities designated by the U.S. Department of Agriculture Economic Research Service (USDA, 2015) as having “low access to supermarket,” applying the NN from the Monte Carlo simulation.
Our findings suggest that NN may be an alternative method for GPS estimation for continuous treatments, not a replacement for existing methods. Furthermore, it is common practice in propensity score analysis to apply a set of alternative methods for propensity score estimation and compare results (e.g., Collier & Leite, 2020; Setoguchi et al., 2008).
Theoretical Framework
Neyman’s (1923) notation for potential outcomes in randomized experiments was repurposed by Rubin (1973) for observational studies. For treatments Z defined on an interval, the set of potential outcomes is infinite (Hirano & Imbens, 2004). Making causal inferences in observational studies for continuous and categorical treatments alike is difficult because only the potential outcome corresponding to the dose actually received is observed, and Z and the pretreatment covariates X are not typically independent. Such cases are prone to biased inferences (Imai & Van Dyk, 2005).
GPS
Hirano and Imbens (2004) expanded Imbens’s (2000) study on categorical multivalued treatments by demonstrating that GPS shared qualities of Rosenbaum and Rubin’s (1983) propensity score for binary treatments. Such qualities included removing all biases associated with differences in the covariates and balancing properties that can be used to assess the adequacy of particular specifications of the score. The propensity score was generalized to treatments Z with intervals [z0, z1] by way of the conditional density of the normally distributed treatment doses given the covariates x:
$$r(z, x) = f_{Z \mid X}(z \mid x) \tag{1}$$
where the GPS is R = r(z, x) (Hirano & Imbens, 2004).
Hirano and Imbens’s (2004) procedures for Gaussian distributed treatment doses are as follows:
- Model the continuous treatment indicator as a function of covariates:

$$Z_i \mid X_i \sim N\!\left(\beta_0 + \beta_1' X_i,\; \sigma^2\right), \tag{2}$$

- Estimate the GPS as the density of the normal distribution given the covariates:

$$\hat{R}_i = \frac{1}{\sqrt{2\pi\hat{\sigma}^2}} \exp\!\left(-\frac{\left(Z_i - \hat{\beta}_0 - \hat{\beta}_1' X_i\right)^2}{2\hat{\sigma}^2}\right), \tag{3}$$

- Model the outcome as a function of Z and the GPS:

$$E\!\left[Y_i \mid Z_i, R_i\right] = \gamma_0 + \gamma_1 Z_i + \gamma_2 Z_i^2 + \gamma_3 R_i + \gamma_4 R_i^2 + \gamma_5 Z_i R_i, \tag{4}$$

where γ are the coefficients for the outcome model,

- Evaluate the covariate balance, and
- Estimate the average potential outcome at each treatment dose of interest:

$$\widehat{E\left[Y(z)\right]} = \frac{1}{N}\sum_{i=1}^{N}\left[\hat{\gamma}_0 + \hat{\gamma}_1 z + \hat{\gamma}_2 z^2 + \hat{\gamma}_3\,\hat{r}(z, X_i) + \hat{\gamma}_4\,\hat{r}(z, X_i)^2 + \hat{\gamma}_5\, z\,\hat{r}(z, X_i)\right] \tag{5}$$

(Hirano & Imbens, 2004).
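To make these steps concrete, the following Python sketch walks through the parametric version of the procedure on toy simulated data; the data-generating model, the quadratic outcome specification, and all variable names are illustrative assumptions rather than the specification used later in this article. The balance-evaluation step is omitted for brevity.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(1)
n, k = 500, 4
X = rng.normal(size=(n, k))                                    # pretreatment covariates
z = X @ np.array([0.4, 0.3, 0.2, 0.1]) + rng.normal(size=n)    # continuous treatment dose
y = 2 * z + X.sum(axis=1) + rng.normal(size=n)                 # outcome

# Step 1: model the treatment dose as a function of covariates (Equation 2)
dose_model = sm.OLS(z, sm.add_constant(X)).fit()

# Step 2: GPS = normal density of each observed dose given the covariates (Equation 3)
sigma2 = dose_model.mse_resid
gps = norm.pdf(z, loc=dose_model.fittedvalues, scale=np.sqrt(sigma2))

# Step 3: model the outcome as a function of the dose and the GPS (Equation 4);
# Step 4 (balance checking) is omitted here for brevity
design = np.column_stack([z, z**2, gps, gps**2, z * gps])
outcome_model = sm.OLS(y, sm.add_constant(design)).fit()

# Step 5: average potential outcome at a dose of interest (Equation 5)
def mean_potential_outcome(dose):
    r = norm.pdf(dose, loc=dose_model.fittedvalues, scale=np.sqrt(sigma2))
    d = np.column_stack([np.full(n, dose), np.full(n, dose ** 2), r, r ** 2, dose * r])
    return outcome_model.predict(sm.add_constant(d, has_constant="add")).mean()

print(mean_potential_outcome(np.median(z)))
```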
Also in 2004, Imai and van Dyk expanded Imbens’s (2000) and Joffe and Rosenbaum’s (1999) studies to establish causal effects in observational studies where the treatment is categorical, ordinal, continuous, semicontinuous, or multivariate with a large number of covariates. Although Imai and van Dyk’s (2004) methods were closely related to the work of the previously mentioned authors, they primarily focused on subclassification rather than matching. They also demonstrated how their methods can improve balance in a randomized design.
Inverse probability weighting can also be used to remove selection bias with continuous treatments (Robins et al., 2000). Inverse probability weights (IPW) for continuous treatment dose have the GPS as the denominator,
$$IPW_i = \frac{f_Z(Z_i)}{r(Z_i, X_i)} \tag{6}$$
(Robins et al., 2000). The IPW remove bias by creating a pseudo-population within which a set of individuals assigned the same dose of treatment and another set receiving a different dose have similarly distributed covariates (Robins et al., 2000). Under the assumption that GPS are correctly estimated, the treatment dose will be uncorrelated with the covariates in the pseudo-population generated by weighting observations by the IPW (Leite, 2016). Since these foundational studies on propensity scores for continuous treatments, researchers across an array of disciplines have conducted propensity score methods using the GPS (e.g., Bia & Mattei, 2008; Foster, 2003; Fryges, 2009; Kluve, 2012; Moodie & Stephens, 2012).
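As a hedged illustration of this weighting logic (not the authors' implementation), the following Python sketch computes stabilized weights for a toy data set, using a normal approximation of the marginal dose density in the numerator and the estimated GPS in the denominator, and then checks the weighted dose–covariate correlations in the pseudo-population.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(2)
n, k = 500, 4
X = rng.normal(size=(n, k))
z = X @ np.array([0.4, 0.3, 0.2, 0.1]) + rng.normal(size=n)

# Denominator: GPS, the conditional density of the dose given covariates
dose_model = sm.OLS(z, sm.add_constant(X)).fit()
gps = norm.pdf(z, loc=dose_model.fittedvalues, scale=np.sqrt(dose_model.mse_resid))

# Numerator: marginal density of the dose (stabilizes the weights)
marginal = norm.pdf(z, loc=z.mean(), scale=z.std(ddof=1))

ipw = marginal / gps   # Equation 6 with a stabilizing numerator

# In the weighted pseudo-population, the dose should be nearly uncorrelated
# with each covariate; a quick check via weighted correlations:
zc = z - np.average(z, weights=ipw)
for j in range(k):
    xc = X[:, j] - np.average(X[:, j], weights=ipw)
    wcov = np.average(zc * xc, weights=ipw)
    wcorr = wcov / np.sqrt(np.average(zc**2, weights=ipw) * np.average(xc**2, weights=ipw))
    print(f"weighted corr(dose, X{j + 1}) = {wcorr:.3f}")
```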
GPS Estimation
Parametric estimation.
The most common approach to estimate GPS, demonstrated in Hirano and Imbens (2004), is with ordinary least squares using a parametric model of the relationship between treatment dose and covariates:
$$Z_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + e_i \tag{7}$$
where $e_i \sim N(0, \sigma^2)$. Parametric models require assumptions regarding variable selection, the functional form of relationships between covariates and treatment dose, and the specification of interactions (Lee et al., 2010). Potential drawbacks of ordinary least squares and other generalized linear models (GLM) include sensitivity to violations of these assumptions and to extreme/outlying values. If any of these assumptions are violated, covariate balance may not be achieved by conditioning on the propensity score, which may result in biased effect estimates (Rubin, 2010). If covariate balance is not achieved, researchers should change the model (e.g., add covariates, include interactions) to achieve balance before using any propensity score method.
Algorithmic estimation.
Data mining encompasses a wide range of research techniques that include more traditional options such as database queries and more recent developments in machine learning and language technology. The drawbacks of ordinary least squares regression to estimate GPS can be overcome by algorithmic data mining procedures, which do not rely on parametric model assumptions. The application of data mining to propensity score methods is not new to educational research (Beane et al., 2017; d’Agostino, 1998; Rubin, 1997; Stukel et al., 2007), but more literature is needed on the estimation of GPS using data mining techniques.
NN are a machine learning technique inspired by biological neural networks in the brain (Zhu, 2017). Like nerves in the brain, processing elements known as perceptrons connect to other processing elements (Walczak, 2018). Perceptrons are arranged in a layer or vector, with the output of one layer serving as the input to the next layer (see Figure 1). Many algorithms used in machine learning are based on the notion of the perceptron (Rosenblatt, 1962). For example, deep learning is a NN that uses multiple layers between the input and output layers. If x1 … xk are independent variables and W is a vector of weights, the perceptron computes the sum of weighted inputs, and the output passes through an adjustable threshold (Kotsiantis et al., 2007). Each layer is associated with a loss function, corresponding distributions, and activation functions, all of which make up the model’s architecture. Layers can be used in NN to search through training data to find weights that work for the entire data set. The final layer produces its output by applying an activation function to that layer’s input activation a_q(X) (Witten et al., 2016). The rectified linear unit (ReLU) is an example of an activation function for the output layer when estimating GPS. The ReLU is:
$$f(x) = \max(0, x) \tag{8}$$
Figure 1.
Artificial neural network.
The ReLU has been shown to yield superior results with numerous data distributions (Witten et al., 2016).
The weights in NN are computed by an algorithm such as backpropagation, which activates the neurons, the circles within the layers shown in Figure 1 (Hagan & Menhaj, 1994). Weights are applied to each link between neurons and provide NN with flexibility for detecting interactions. Each arrow has an associated weight w that is multiplied by its input (Gupta, 2013). If the weight is high, it makes the associated input strong (Gupta, 2013).
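As a small illustration of this computation, the sketch below passes three inputs through two hidden perceptrons and one output perceptron using arbitrary, hand-picked weights; the ReLU of Equation 8 serves as the activation. All values are illustrative assumptions.

```python
import numpy as np

def relu(a):
    """Rectified linear unit (Equation 8): max(0, a), applied elementwise."""
    return np.maximum(0.0, a)

# Illustrative inputs x1...x3 and weights for two layers (values are arbitrary)
x = np.array([0.5, -1.2, 2.0])
W1 = np.array([[0.2, -0.4, 0.1],
               [0.7,  0.3, -0.6]])   # 2 hidden perceptrons, 3 inputs each
b1 = np.array([0.1, -0.2])
w2 = np.array([0.5, -0.3])           # output perceptron weights
b2 = 0.05

h = relu(W1 @ x + b1)     # each hidden perceptron: weighted sum of inputs -> ReLU
output = w2 @ h + b2      # output perceptron: weighted sum of the hidden activations
print(h, output)
```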
Optimizing a NN for a particular propensity score application and standardizing general procedures whereby NN can be routinely used to estimate propensity scores take substantial investigation and effort (Westreich et al., 2010). NN should be trained while avoiding overfitting, which occurs when the error on the training set is driven to a very small value but the error is large when new data are presented to the network (Srivastava et al., 2014). In such a case, the network has memorized the training examples but has not learned to generalize to new situations.
Deep learning emerged from the intersection of research on NN, artificial intelligence, graphical modeling, optimization, pattern recognition, and signal processing (Deng & Yu, 2014). As with most concepts drawing on multiple disciplines, deep learning has a number of definitions relative to its use and the aims of the researcher. Commonalities across definitions of deep learning are (1) models with multiple layers of nonlinear information processing and (2) supervised and unsupervised learning procedures with successively more abstract layers (Deng & Yu, 2014). Figure 1 is an example of a deep learning NN because it has multiple hidden layers, but all NN have an input layer, at least one hidden layer, and an output layer.
Few studies have investigated algorithmic data mining procedures to estimate propensity scores (e.g., Keller et al., 2013; McCaffrey et al., 2004; Setoguchi et al., 2008; Westreich et al., 2010). Data mining methods that have been studied include decision trees, generalized boosted regression, NN, and support vector machines. However, most of these applications were specific to estimating propensity scores for categorical treatments. Within this limited body of research, there are even fewer studies on propensity score analysis with normal, lognormal, or gamma distributed treatment doses. Prior to the current study, no research had extended deep learning NN to continuous treatments or compared NN with GLM.
Research Questions
Research Question 1: What is an architecture for an NN to accurately estimate GPS?
Research Question 2: How do NN and GLM compare with respect to the correlation between estimated and true GPS and mean squared error across different sample sizes, numbers of covariates, and distributions of treatment dose?
Hypotheses
Because the first research question is descriptive, the hypotheses below refer to the second research question:
Hypothesis 1: Correlations will be higher and mean squared error values will be lower for GLM than for NN when treatments are Gaussian distributed and sample sizes are large.
Hypothesis 2: NN will have higher correlations and lower mean squared error than GLM when treatments are nonnormally distributed.
Method
The research questions are addressed through a Monte Carlo simulation study. We manipulated the following conditions in this Monte Carlo simulation study: (1) two levels of covariates (eight and 16), (2) three treatment distributions (Gaussian, moderately skewed, and highly skewed), (3) three sample sizes (500, 1,000, and 5,000), and (4) two methods to estimate GPS (GLM and NN). Then, we trained and tested six NN to estimate GPS for eight and 16 covariates by three distributions (2 × 3).
In this “Method” section, we begin by discussing the simulation procedures, then explain procedures used to estimate GPS with NN and GLM, and last explain methods to analyze the results.
Monte Carlo Simulation Study
All simulations and GLM analyses were performed in R Version 3.2.3, while the NN were implemented using Python Version 3.6.2, running on a Linux operating system. For each combination of simulated conditions, we generated 1,000 data sets. The steps to build the model for treatment assignment in this simulation study were as follows:
Step 1. Covariates were simulated from a normal distribution with a mean of zero and standard deviation of one. All covariates were uncorrelated. Applications of propensity score estimation in social science data generally range between six and 32 covariates (Ansong et al., 2015; Cho et al., 2012; Monlezun et al., 2015; Morgan et al., 2010; Sullivan & Field, 2013). We simulated eight and 16 covariates to mimic the general range of applied studies.
Step 2. We randomly selected half of the covariates to transform into categorical variables and kept the other half continuous. Half of the categorical variables were binary, and the remaining half were transformed into groups of equal size by randomly selecting tertiles, quartiles, or quintiles. We varied the effects and the number of categories of the covariates to prevent the models from overtraining on a single scale.
Step 3. The population model for the continuous treatment dose was varied randomly with each iteration. A full model for a case of eight main effects takes the form

$$Z_i = \beta_0 + \sum_{k=1}^{8}\beta_k X_{ki} + \sum_{k}\beta_{kk} X_{ki}^2 + \sum_{k<l}\beta_{kl} X_{ki} X_{li} + \sum_{k<l<m}\beta_{klm} X_{ki} X_{li} X_{mi} + e_i,$$

where X1, X2, …, X8 are covariates and X5, X6, …, X8 are categorical covariates with c categories.
We created models with main effects, quadratic effects, two-way interactions, and three-way interactions for the relationship between treatment assignment and the covariates in each of the number-of-covariates conditions. The number of main effects was equal to the number of covariates simulated, but the number of remaining effects was randomly chosen. We randomly selected regression coefficients from a uniform distribution with a range of 0 to .5. The uniform distribution ensured a constant probability for all coefficient values within our sampling interval and thus removed potential effects of coefficient size on estimation performance. Because this study seeks to accurately predict propensity scores rather than to estimate the coefficients of the population model, there is no advantage to a fixed population model. Varying the population model also challenges the data mining methods to detect a different set of interactions and terms in each replication.
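A minimal Python sketch of these data-generation steps is given below. It assumes eight covariates and, for readability, a small fixed set of quadratic and interaction terms rather than the randomly drawn set described above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n, k = 1000, 8

# Step 1: uncorrelated standard-normal covariates
X = pd.DataFrame(rng.normal(size=(n, k)), columns=[f"X{j + 1}" for j in range(k)])

# Step 2: half the covariates become categorical (two binary, two with equal-size groups)
X["X5"] = (X["X5"] > 0).astype(int)
X["X6"] = (X["X6"] > 0).astype(int)
X["X7"] = pd.qcut(X["X7"], 3, labels=False)   # tertiles
X["X8"] = pd.qcut(X["X8"], 4, labels=False)   # quartiles

# Step 3: treatment model with main effects plus a few illustrative
# quadratic, two-way, and three-way terms; coefficients drawn from U(0, .5)
terms = [X[c] for c in X.columns]
terms += [X["X1"] ** 2, X["X1"] * X["X2"], X["X3"] * X["X7"], X["X1"] * X["X2"] * X["X4"]]
coefs = rng.uniform(0, 0.5, size=len(terms))
linear_pred = sum(b * t for b, t in zip(coefs, terms))

z_gaussian = linear_pred + rng.normal(size=n)   # Gaussian treatment dose
```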
Gaussian and gamma-distributed treatment doses.
Treatment doses are not always normally distributed in applied research studies. We studied different treatment distributions to make this simulation study more informative for both applied and methodological researchers. The conditional normal density of the Gaussian distributed treatment dose Zi given x1, x2, …, xk is
$$f\!\left(z_i \mid x_{1i}, \ldots, x_{ki}; \theta\right) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{\left(z_i - \beta_0 - \beta_1 x_{1i} - \cdots - \beta_k x_{ki}\right)^2}{2\sigma^2}\right) \tag{9}$$
where θ = (β, σ2) (Fong et al., 2016). The probability density of a gamma-distributed treatment dose given parameters αi and βi is
$$f\!\left(z_i; \alpha_i, \beta_i\right) = \frac{\beta_i^{\alpha_i}}{\Gamma(\alpha_i)}\, z_i^{\alpha_i - 1} e^{-\beta_i z_i} \tag{10}$$
For each gamma-distributed treatment, the independent variables were exp(X), with α = 1 and α = 5 for strong and moderate skewness, respectively.
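The sketch below illustrates one way the gamma-distributed doses described here could be generated, assuming the mean of each dose is tied to the exponential of a covariate-based linear predictor; that mapping, and the placeholder predictor, are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
linear_pred = rng.normal(size=n)   # stands in for the covariate-based linear predictor

def gamma_dose(alpha, linear_pred, rng):
    """Draw doses with shape alpha and mean exp(linear_pred); scale = mean / alpha."""
    mean = np.exp(linear_pred)
    return rng.gamma(shape=alpha, scale=mean / alpha)

z_highly_skewed = gamma_dose(1.0, linear_pred, rng)      # alpha = 1: strong skew
z_moderately_skewed = gamma_dose(5.0, linear_pred, rng)  # alpha = 5: moderate skew
```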
Sample size.
We kept our sample sizes close to those of other Monte Carlo simulation studies that evaluated propensity score estimation methods (e.g., Arpino & Mealli, 2011; Drake, 1993; Kim & Seltzer, 2007; Leite et al., 2015; Rosenbaum & Rubin, 1983). We simulated sample sizes of 500, 1,000, and 5,000.
Analysis of Simulated Data Sets
Training and testing the neural networks.
Hyperparameters, such as hidden units and layers, are often tuned heuristically by hand (Witten et al., 2016, p. 432). In this study, we hand-tuned each NN’s hyperparameters (Bergstra & Bengio, 2012). We determined the computation for a network to estimate GPS with L hidden layers
$$h_l(\mathbf{x}) = \mathrm{ReLU}\!\left(W_l\, h_{l-1}(\mathbf{x}) + \mathbf{b}_l\right), \quad l = 1, \ldots, L, \quad h_0(\mathbf{x}) = \mathbf{x}; \qquad \hat{R} = W_{L+1}\, h_L(\mathbf{x}) + \mathbf{b}_{L+1} \tag{11}$$
where the ReLU activation function is applied to the output of preceding hidden layers hl (Witten et al., 2016, p. 423). Each NN had four hidden layers.
We trained the NN on diverse input patterns to increase their generalizability to real-world applications. Eighty percent of each simulated data set was used for training, while the remaining 20% was used to test the models, with mean squared error (MSE) as the performance measure. Optimal splits of data for training typically range from 40% to 80% (Dobbin & Simon, 2011). NN were implemented in Python using the scikit-learn 0.19 library (Pedregosa et al., 2011).
Because our NN had only a few hyperparameters to tune, it was not difficult to select reasonable values by hand (Dahl, Sainath, & Hinton, 2013). Using backpropagation, we trained the NN with a ReLU activation function, an adaptive learning rate beginning at .01, a stochastic gradient-based optimizer, and four hidden layers (Hagan & Menhaj, 1994). One hundred neurons, 1,000 iterations, and a momentum of 0.9 were used to train the NN to estimate continuous treatments. The solver for the NN weights was scikit-learn's default stochastic gradient-based optimizer, which was proposed by Kingma and Ba (2014).
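A hedged sketch of how this configuration might map onto scikit-learn's MLPRegressor follows, assuming 100 neurons in each of the four hidden layers and toy placeholder data. Note that the adaptive learning rate and momentum settings are honored only under the 'sgd' solver used here for illustration, whereas scikit-learn's default solver is the Kingma and Ba (2014) optimizer ('adam'); the solver choice is therefore an assumption rather than a documented detail.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)
X_covariates = rng.normal(size=(1000, 8))   # placeholder simulated covariates
true_gps = rng.uniform(size=1000)           # placeholder "true" GPS for illustration

# 80% of each data set for training, 20% for testing (as described above)
X_train, X_test, r_train, r_test = train_test_split(
    X_covariates, true_gps, test_size=0.20, random_state=0)

nn = MLPRegressor(
    hidden_layer_sizes=(100, 100, 100, 100),  # four hidden layers of 100 neurons (assumed)
    activation="relu",
    solver="sgd",              # momentum / adaptive learning rate apply to 'sgd';
                               # scikit-learn's default 'adam' is the Kingma & Ba optimizer
    learning_rate="adaptive",
    learning_rate_init=0.01,
    momentum=0.9,
    max_iter=1000,
    random_state=0,
)
nn.fit(X_train, r_train)

test_mse = mean_squared_error(r_test, nn.predict(X_test))   # Equation 12 on the test set
print(f"test MSE = {test_mse:.4f}")
```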
The MSE, defined as

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{R}_i - R_i\right)^2, \tag{12}$$

where $\hat{R}$ is the vector of predicted GPS, $R$ is the vector of true GPS, and $n$ is the number of individual cases, was implemented as the loss function. MSE is one of the most common performance measures in NN research (Hecht-Nielsen, 1992; Ruck et al., 1990; Specht, 1991; Tenti, 2017; Yadav & Sahu, 2017).
GLM.
The GLM were fitted with a full factorial model specification in order to include all possible interactions of the confounding variables. The GLM were fitted using the whole data set. Models to estimate GPS for Gaussian distributed treatments were estimated with linear regression, while GPS for moderately and highly skewed treatments were estimated with the gamma regression model
$$\log\!\left(E\!\left[Z_i \mid X_i\right]\right) = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} \tag{13}$$
(Stacy, 1962) and the probability density presented in Equation 10.
For skewed treatments, we used the gamma.shape function within the MASS package version 7.3–53 in R (R Core Team, 2017). The function estimates the shape parameter and then adjusts coefficient estimations and predictions in the generalized linear model.
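The GLM benchmark in this study was fitted in R. As a rough, assumption-laden Python analogue (not the authors' code), the sketch below fits a gamma GLM with a log link via statsmodels on toy data, backs out a shape estimate from the model dispersion in the spirit of MASS::gamma.shape, and evaluates the gamma density of Equation 10 to obtain GPS. The two-covariate formula and the dispersion-based shape estimate are assumptions intended to mirror, not reproduce, the R workflow.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import gamma as gamma_dist

rng = np.random.default_rng(6)
n = 1000
df = pd.DataFrame({"X1": rng.normal(size=n), "X2": rng.normal(size=n)})
mu = np.exp(0.3 * df["X1"] + 0.2 * df["X2"])
df["z"] = rng.gamma(shape=2.0, scale=(mu / 2.0).to_numpy())   # skewed treatment dose

# Full factorial gamma regression with a log link (illustrative two-covariate version)
model = smf.glm("z ~ X1 * X2", data=df,
                family=sm.families.Gamma(link=sm.families.links.Log())).fit()

# Shape estimate from the fitted dispersion (analogous in spirit to MASS::gamma.shape)
shape = 1.0 / model.scale
rate = shape / model.fittedvalues

# GPS: gamma density of each observed dose given its covariates (Equation 10)
gps_glm = gamma_dist.pdf(df["z"], a=shape, scale=1.0 / rate)
```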
Analysis of Monte Carlo Simulation Results
We compared NN and GLM using MSE and Pearson's correlation coefficient. To assist in the interpretation of the results, we ran split-plot analyses of variance (ANOVA) in which Pearson's correlation coefficient and the log of MSE served as the outcomes. MSE combines bias and variance (Roberts & Vandenplas, 2017). We took the log of MSE because the raw values were extremely nonnormal, and ANOVA is only robust to moderate nonnormality. Sample size, number of covariates, and distribution of treatments were included as between-subject factors, and the GPS estimation method was a within-subject factor. We also included all possible interactions of these factors. Last, the generalized eta squared (η²) effect size measure (Olejnik & Algina, 2003) was calculated and interpreted as substantial if it was equal to or greater than .01.
Results
Following the procedures discussed above, we simulated Gaussian, moderately skewed, and highly skewed treatments, eight and 16 confounding variables, and sample sizes equal to 500, 1,000, and 5,000. After estimating GPS with GLM and NN, we compared the performance of each data mining technique with Pearson’s correlation coefficient and MSE.
Pearson’s Correlation Coefficient
In general, the correlations for estimates from GLM and NN were most similar for Gaussian distributed treatments with 16 covariates. For example, when treatments were Gaussian distributed, there were 16 covariates, and the sample size was 1,000, the average correlations were equal for GLM and NN (.81). The results indicate that, with a few exceptions, GPS estimated with NN had higher average correlation coefficients than GPS estimated with GLM. Table 1 illustrates each of these differences.
Table 1.
Average Pearson Correlation Coefficients for Estimated Generalized Propensity Scores.
| Covariates | Treatments | Sample Size | Generalized Linear Models | Neural Networks |
|---|---|---|---|---|
| 8 | Gaussian | 500 | .78 | .76 |
| 8 | Gaussian | 1,000 | .89 | .77 |
| 8 | Gaussian | 5,000 | .98 | .77 |
| 8 | Moderately skewed | 500 | .61 | .90 |
| 8 | Moderately skewed | 1,000 | .68 | .90 |
| 8 | Moderately skewed | 5,000 | .86 | .89 |
| 8 | Highly skewed | 500 | .66 | .92 |
| 8 | Highly skewed | 1,000 | .76 | .92 |
| 8 | Highly skewed | 5,000 | .90 | .92 |
| 16 | Gaussian | 500 | .67 | .80 |
| 16 | Gaussian | 1,000 | .81 | .81 |
| 16 | Gaussian | 5,000 | .84 | .81 |
| 16 | Moderately skewed | 500 | .58 | .95 |
| 16 | Moderately skewed | 1,000 | .67 | .94 |
| 16 | Moderately skewed | 5,000 | .85 | .94 |
| 16 | Highly skewed | 500 | .58 | .95 |
| 16 | Highly skewed | 1,000 | .70 | .94 |
| 16 | Highly skewed | 5,000 | .87 | .94 |
In Table 1, the average Pearson correlation coefficient between actual and estimated GPS is reported for conditions in which the sample size was 500, 1,000, or 5,000, the distribution of treatments was Gaussian, moderately skewed, or strongly skewed, and the number of covariates was equal to eight or 16.
Some conditions in which Pearson’s correlation coefficient was higher for GLM compared with NN are also displayed. These cases tended to occur with Gaussian distributed treatments, eight covariates, as well as when sample sizes were 5,000.
Manipulated conditions with substantial effects.
This section describes which manipulated conditions impacted the average Pearson correlation coefficients. Table 2 displays the effect sizes for conditions that affect the correlations between estimated and actual GPS. Overall in terms of correlation between actual and estimated GPS, NN performed better than GLM under the conditions manipulated in this study.
Table 2.
Table of Generalized Eta Squared (η²) Effect Sizes for Main Effects and Interactions of Pearson’s Correlation Coefficient.
| Effect | GES |
|---|---|
| (Intercept) | .979 |
| DMT | .202 |
| Treatments: DMT | .187 |
| Sample Size: DMT | .146 |
| Sample Size | .136 |
| Covariates: DMT | .041 |
| Treatments | .013 |
| Treatments: Sample Size: DMT | .010 |
| Treatments: Sample Size | .008 |
| Treatments: Covariates | .007 |
| Treatments: Covariates: DMT | .006 |
| Covariates | .003 |
| Treatments: Sample Size: Covariates: DMT | .002 |
| Treatments: Sample Size: Covariates | .001 |
| Sample Size: Covariates: DMT | .001 |
| Sample Size: Covariates | .000 |
Note. The colon in between conditions indicates an interaction term. DMT = data mining technique; GES = generalized eta squared.
Main effects.
The data mining technique employed, GLM or NN, had the largest effect on the correlation values (generalized η² = .20). Across conditions, the mean correlation was .82 for GLM and .91 for NN. Pearson correlations ranged between 0 and .997 for GLM and between .538 and .993 for NN. There were also substantial main effects of sample size (generalized η² = .14) and of the distribution of treatments (.01).
Interaction effects.
Each possible two-way interaction of data mining technique with the other manipulated conditions (distribution of treatments, sample size, and number of covariates) also notably affected the correlations between estimated and actual GPS. The largest two-way interaction was between data mining technique and the distribution of treatments (generalized η² = .19). Average correlations estimated with GLM ranged between .58 and .90 for skewed distributions (see Table 1), while NN yielded values no lower than .90 under the same conditions. Overall, for moderately and highly skewed treatments, NN had higher correlations than GLM. Furthermore, NN had higher Pearson correlation coefficients with skewed treatments than with Gaussian treatments.
There was also a three-way interaction among treatment distribution, sample size, and data mining technique (generalized η² = .01). Table 1 shows that, with GLM, the correlations decreased as sample size decreased and skewness increased. With NN, in contrast, the correlations were lowest for the Gaussian distribution, and within each distribution the correlations did not change as the sample size increased.
Mean Squared Error
A visual examination of the MSE across the manipulated conditions and interactions reveals noticeably different means for GLM and NN when treatments were skewed. As shown in Table 3, MSE ranged between 0 and 206,728,676 for GPS estimated with GLM and between 0 and 195.61 for GPS estimated with NN.
Table 3.
Summary Statistics of Mean Squared Errors.
| DMT | Minimum | First Quartile | Median | Mean | Third Quartile | Maximum |
|---|---|---|---|---|---|---|
| GLM | .00 | .00 | 20.00 | 44,199.00 | 127.00 | 206,728,676.00 |
| Neural networks | .00 | .01 | 6.49 | 10.25 | 15.59 | 195.61 |
For conditions with Gaussian distributed treatments, the average MSE values were negligible (≤ .01) for both methods (see Table 4). MSE was higher for both GLM and NN when estimating GPS of skewed treatments. In general, when treatments were skewed, MSE values were consistent across sample sizes when GPS were estimated with NN; for example, highly skewed treatments with eight covariates had an MSE of 8.24 across all sample sizes. MSE was higher for GPS estimated with NN with 16 covariates than with eight covariates. Patterns were more difficult to discern when treatments were skewed and GPS were estimated with GLM. For example, MSE appears to decrease with increases in sample size with 16 covariates, yet with eight covariates the highest MSE values were obtained at the medium sample size.
Table 4.
Average Mean Squared Errors for Estimated Generalized Propensity Scores.
| Covariates | Treatments | Sample Size | Generalized Linear Models | Neural Networks |
|---|---|---|---|---|
| 8 | Gaussian | 500 | 0.01 | 0.01 |
| 8 | Gaussian | 1,000 | 0.00 | 0.01 |
| 8 | Gaussian | 5,000 | 0.00 | 0.01 |
| 8 | Highly skewed | 500 | 8,393.53 | 8.24 |
| 8 | Highly skewed | 1,000 | 8,834.67 | 8.24 |
| 8 | Highly skewed | 5,000 | 461.62 | 8.24 |
| 8 | Moderately skewed | 500 | 5,506.39 | 9.15 |
| 8 | Moderately skewed | 1,000 | 30,840.69 | 9.53 |
| 8 | Moderately skewed | 5,000 | 27,732.04 | 9.29 |
| 16 | Gaussian | 500 | 0.01 | 0.00 |
| 16 | Gaussian | 1,000 | 0.00 | 0.00 |
| 16 | Gaussian | 5,000 | 0.00 | 0.00 |
| 16 | Highly skewed | 500 | 561,100.18 | 20.44 |
| 16 | Highly skewed | 1,000 | 107,845.35 | 23.14 |
| 16 | Highly skewed | 5,000 | 238.58 | 22.29 |
| 16 | Moderately skewed | 500 | 39,460.74 | 20.44 |
| 16 | Moderately skewed | 1,000 | 4,776.07 | 23.14 |
| 16 | Moderately skewed | 5,000 | 388.80 | 22.29 |
Manipulated conditions with substantial effects.
The average MSE values across the manipulated conditions are presented in Table 4, and the corresponding effect sizes are presented in Table 5. All manipulated factors and most of their interactions had a substantial effect on MSE. The MSE values reported in Tables 3 and 4 are raw. However, a log transformation of the raw values was applied to reduce the extreme skewness of the MSE distribution before conducting the ANOVA and calculating the generalized η² effect sizes. Table 5 is derived from those transformed MSE values.
Table 5.
Table of Generalized Eta Squared (η²) Effect Sizes for Main Effects and Interactions of Mean Squared Error.
| Effect | GES |
|---|---|
| (Intercept) | .15 |
| Treatments | .94 |
| Sample Size | .17 |
| Covariates | .19 |
| DMT | .25 |
| Treatments: Sample Size | .02 |
| Treatments: Covariates | .05 |
| Sample Size: Covariates | .00 |
| Treatments: DMT | .23 |
| Sample Size: DMT | .19 |
| Covariates: DMT | .03 |
| Treatments: Sample Size: Covariates | .01 |
| Treatments: Sample Size: DMT | .02 |
| Treatments: Covariates: DMT | .01 |
| Sample Size: Covariates: DMT | .00 |
| Treatments: Sample Size: Covariates: DMT | .01 |
Note. The colon in between conditions indicates an interaction term. DMT = data mining technique.
Main effects.
There were substantial main effects of all manipulated conditions in this study: distribution of treatments (generalized η² = .94), sample size (.17), number of covariates (.19), and data mining technique (.25). These effects were expected given the patterns of average MSE in Table 4 and correspond to the effect sizes reported in Table 5.
Interaction effects.
There were substantial two-way effects of treatments and sample size (generalized η² = .02), treatments and covariates (.05), treatments and data mining technique (.23), sample size and data mining technique (.19), and covariates and data mining technique (.03), as well as a three-way effect of treatments, covariates, and data mining technique (.01). In Table 4, the average MSE values decrease in most cases as sample size increases, yet MSE remained constant across sample sizes for Gaussian distributed treatments.
The interaction between the distribution of the treatment and the number of covariates can also be seen in Table 4. Of note, there was no change between eight and 16 covariates when the treatments were Gaussian distributed. However, MSE averaged higher values for 16 covariates than for eight covariates when the treatments were highly skewed. Under most simulated conditions with Gaussian distributed treatments, both data mining techniques performed nearly equivalently. However, average MSE values were lower for NN than for GLM when treatments were skewed. MSE values varied across sample sizes more for GPS estimated with GLM than for GPS estimated with NN. When treatments were Gaussian distributed, NN estimated GPS with slightly higher MSE with eight covariates than with 16. For gamma-distributed treatments, NN estimated GPS with higher MSE with 16 covariates than with eight. The patterns were not as clear-cut for GPS estimated with GLM but are explained by the substantial three-way effect of treatments, sample size, and data mining technique (generalized η² = .02). In Table 4, GPS estimated with GLM averaged the highest MSE when there were 16 covariates, the sample size was 500, and the simulated treatments were highly skewed.
Under these circumstances, MSE decreased as sample size increased. However, under similar conditions with eight covariates, MSE values increased with sample size.
Empirical Study
We sought to quantify the effects of distance (i.e., miles between permanent residence and main grocery store) on consumer shopping behavior. Specifically, we analyzed data from the Food in our Neighborhood Study (FIONS; Karpyn et al., 2020) to estimate GPS and to predict dose–response functions (d.r.f.), that is, the response function of grocery store expenses for each level of distance between grocery stores and permanent residences.
Prior to implementing propensity score analyses, we hypothesized that people spend more money at the grocery store when they live farther away from it. We estimated GPS with ordinary least squares regression and contrasted them with GPS estimated by NN, the methodology newly proposed here in the context of propensity score analysis. Estimated GPS from both methods were employed in propensity score analyses to measure the impact of the treatment variable, distance, on the outcome variable, monthly grocery store expenses. This demonstration aims to introduce applied researchers and methodologists to the promise of NN in propensity score analysis.
Sample
Door-to-door surveys of residents in food deserts, areas with limited or no access to grocery stores as defined by the USDA (2015), were administered to collect data on the primary outcome measures and covariates (e.g., distance, marital status, general health, race/ethnicity, income, type of home, number of residents in the home, body mass index, education, and allocated government assistance). The sample includes 796 men and women aged 18+ who identified themselves as the primary food shopper in their household. The analytic sample (N = 743) was constrained to shoppers with complete case responses on all relevant pretreatment covariates (c = 16). Participants were asked to report the store where they did their primary food shopping, and a Euclidean distance was calculated from their home address to that location, forming the “distance to store” treatment variable. An average per person per household total monthly grocery expenditure was calculated by taking the response to the question “How much do you spend per month on groceries?” and dividing it by the response to the question “How many people does this feed?”
Analysis
The raw distance variable was skewed and was consequently transformed with the common logarithm to better meet the normality assumption of ordinary least squares (OLS) regression. We then estimated the conditional densities of the distance treatment variable with OLS and NN. The GPS model estimated with OLS included all combinations of two-way and three-way interactions. Table 6 displays the summary statistics of the GPS estimated by both methods.
Table 6.
Summary of Estimated GPS.
| DMT | Minimum | First Quartile | Median | Mean | Third Quartile | Maximum |
|---|---|---|---|---|---|---|
| OLS | .02366 | .30374 | .40603 | .4425 | .52355 | 1.51347 |
| NN | .2371 | .3865 | .4406 | .4443 | .4963 | .6739 |
Note. OLS = ordinary least squares; NN = neural networks; GPS = generalized propensity score; DMT = data mining technique.
GPS estimated with OLS had a greater range than GPS estimated with NN; medians and means were comparable. Next, we evaluated covariate balance by stratifying on the GPS and fitting one regression for each covariate, with the treatment dose as the outcome and the GPS strata and the covariate as predictors. The standardized regression coefficient of each covariate was used as a measure of its effect size on the treatment dose, and covariate balance was considered adequate if the coefficient was lower than .25 (Stuart, 2010). A sketch of this balance check is given below, and the resulting standardized coefficients are shown in Table 7.
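Because the original balance-check code is not reproduced here, the following Python sketch is a hedged reconstruction of the procedure just described; the stratum count, the standardization, and the placeholder object names (distance, gps_hat, df_covariates) are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def balance_check(dose, gps, covariates, n_strata=5):
    """For each covariate, regress the standardized dose on GPS-stratum dummies
    plus the standardized covariate and return the covariate's coefficient."""
    strata = pd.qcut(gps, n_strata, labels=False)
    dummies = pd.get_dummies(strata, prefix="gps_stratum", drop_first=True).astype(float)
    z = (dose - dose.mean()) / dose.std()
    results = {}
    for name in covariates.columns:
        x = (covariates[name] - covariates[name].mean()) / covariates[name].std()
        design = sm.add_constant(pd.concat([dummies, x.rename(name)], axis=1))
        fit = sm.OLS(z, design).fit()
        results[name] = fit.params[name]   # standardized coefficient; balance if < .25
    return pd.Series(results)

# Usage with placeholder objects `distance` (Series), `gps_hat` (Series/array),
# and `df_covariates` (DataFrame):
# balance_table = balance_check(distance, gps_hat, df_covariates)
```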
Table 7.
Standardized Coefficients of Regressions of Treatment Dose on Covariates.
| Variable | OLS Coef. | NN Coef. |
|---|---|---|
| Married | 0.17133927 | 0.0124522 |
| General health | 0.0158966 | 0.06338545 |
| Drinker | 0.03353069 | 0.00417273 |
| Smoker | 0.21545781 | 0.08434758 |
| BMI category | 0.06309498 | 0.05036563 |
| Poverty | **0.35228525** | 0.08275315 |
| Education | 0.04714866 | 0.00901092 |
| Employment | **0.27542781** | 0.0808399 |
| SNAP | 0.21252535 | 0.09437242 |
| WIC | 0.0902212 | 0.08886782 |
| Gender | 0.06125878 | 0.08645785 |
| Hispanic | 0.06229547 | 0.02982312 |
| Black | 0.01184347 | 0.00666584 |
| People to feed | 0.02830187 | 0.03433847 |
| Home type | 0.03005369 | 0.01341057 |
Note. OLS = ordinary least squares; NN = neural networks; SNAP = Supplemental Nutrition Assistance Program; WIC = Women, Infants, and Children.
All but two covariates (bold in Table 7) achieved the desired level of balance with OLS, whereas all covariates achieved balance with NN. Following Hirano and Imbens’s (2004) methods, the first outcome model predicted grocery expenses as a function of distance, poverty, employment, and the GPS obtained with OLS. The second outcome model predicted grocery expenses with distance and the GPS obtained from NN but without covariates, because all of the covariates met the balancing criterion.
In the final step, predicted values from the outcome models were plotted to display the d.r.f. Figures 2 and 3 show the treatment dose effects on monthly grocery expenses, with confidence intervals indicated by dashed lines. We calculated the confidence intervals by first estimating individual treatment effects at the 1st through 100th percentiles of the treatment dose and then adding and subtracting 1.96 times the standard errors from the mean grocery expenses to obtain the upper and lower limits. Figure 2 is derived from the first outcome model, with GPS estimated with OLS, and Figure 3 from the second outcome model, with GPS estimated with NN.
Figure 2.
Estimated dose–response function post estimation of generalized propensity score with ordinary least squares.
Figure 3.
Estimated dose–response function post estimation of generalized propensity score with neural networks.
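A minimal Python sketch of how such a dose–response curve and its ±1.96 SE band might be produced is shown below. It assumes dose_model is a fitted statsmodels OLS of the (log) dose on the covariates and that outcome_model regresses expenses on an intercept, the dose, and the GPS only; the percentile grid and the standard error of the mean prediction follow the description above, and all object names are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def dose_response_curve(dose_model, outcome_model, z, percentiles=np.arange(1, 101)):
    """Average predicted outcome (and +/- 1.96 SE band) at each dose percentile,
    re-evaluating the GPS at that dose for every case (Hirano & Imbens, 2004)."""
    sigma = np.sqrt(dose_model.mse_resid)
    doses = np.percentile(z, percentiles)
    means, ses = [], []
    for d in doses:
        r = norm.pdf(d, loc=dose_model.fittedvalues, scale=sigma)     # GPS at dose d
        design = np.column_stack([np.ones_like(r), np.full_like(r, d), r])
        preds = outcome_model.predict(design)                          # outcome ~ 1 + dose + GPS
        means.append(preds.mean())
        ses.append(preds.std(ddof=1) / np.sqrt(len(preds)))
    means, ses = np.array(means), np.array(ses)
    plt.plot(doses, means)
    plt.plot(doses, means - 1.96 * ses, linestyle="--")
    plt.plot(doses, means + 1.96 * ses, linestyle="--")
    plt.xlabel("log distance to supermarket")
    plt.ylabel("predicted monthly grocery expenses")
    plt.show()

# Usage with placeholder fitted models and the observed dose vector `log_distance`:
# dose_response_curve(dose_model, outcome_model, log_distance)
```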
There were little to no differences between the d.r.f. in Figures 2 and 3. The confidence intervals are narrower on the left side of the distribution of log distance to the supermarket because more shoppers were available to estimate the treatment dose effects in that region. It is also noticeable that average grocery expenses increase with increases in log distance to the supermarket. This positive effect of distance from the grocery store supports the hypothesis that people spend more money at the grocery store when they live farther away from it. Despite differences in the range and distribution of GPS estimated with OLS and NN, following Hirano and Imbens’s (2004) methods to determine the d.r.f. yielded nearly equivalent findings.
Discussion
Accurate estimation of GPS is critical to the application of propensity score methods to estimate the effect of continuous treatments. GLM have model assumptions, require complex manual specification of interactions and quadratic terms, and are susceptible to overfitting compared with nonparametric data mining techniques (Setoguchi et al., 2008; Westreich et al., 2010). We investigated the accuracy of GPS estimated with GLM and NN from simulated data with medium to low sample sizes and complex models without changing the specification of interactions. Additionally, we performed propensity score analyses on an empirical data set using traditional and newly developed approaches.
This study is the first to establish literature on fitting NN for GPS estimation with continuous treatment doses. All studied approaches were trained and tested with MSE. Although the loss was low with our NN architecture, other hyperparameter settings may yield even better performance. Future research should compare NN trained with quasi-Newton methods against NN trained with other algorithms (e.g., gradient descent, conjugate gradient, and Levenberg–Marquardt; Fazayeli & Banerjee, 2016; Kingma & Ba, 2014; Le et al., 2015). We fitted GLM to estimate GPS with ordinary least squares for Gaussian distributed treatments and maximum likelihood estimation for gamma-distributed treatments. Other estimators are also available for GLM (Amiguet et al., 2017).
Our study extends the approach to estimate GPS proposed by Hirano and Imbens (2004). Previous research on estimating propensity scores for categorical treatments has favored nonparametric data mining techniques over traditional methods (Collier & Leite, 2020; McCaffrey et al., 2004; Pirracchio et al., 2015; Setoguchi et al., 2008). This study aligns with previous research, having found higher correlations and lower MSE values after comparing the true GPS with the GPS estimated by NN. Kreif et al. (2015) applied and proposed the “Super-Learner” as a new technique to estimate GPS and the d.r.f. for continuous treatments. Similar to this article, they found nearly equivalent GPS and outcomes after comparing the results of the machine learning technique to parametric implementations of the GPS. Kreif et al. (2015) also used MSE as the loss function for the “Super Learner.”
GLM
We demonstrated that GLM can estimate GPS for skewed treatments that are moderately correlated with the actual GPS. However, the MSE values were large. Inspection of the results showed that the correlations and MSE were consistent: when the correlation was very low, the MSE was very high. Furthermore, these cases occurred only when GPS were estimated with GLM. The minimum correlation between actual and estimated GPS was 0 for GLM and .54 for NN. The implication is that if the correlation between true and estimated GPS is low, selection bias will not be removed.
Distributions in GLM.
Theoretically, the gamma distribution was the right choice in our simulation because the treatments ranged from 0 to infinity, the variance of the treatments was proportional to the square of the mean response, and the treatments were generated at each covariate from a gamma distribution with the mean equal to the exponential of a linear function of the covariates (Davidian, 2017, p. 437; Thom, 1958). However, other models, such as the inverse Gaussian and lognormal, have been recommended for similarly scaled data. For example, Barber and Thompson (2004) favored the gamma over the inverse Gaussian, and the opposite was concluded by Moran et al. (2007), who assessed prediction error with mean absolute error and root mean squared error. Johnson (2014) fitted GLM to gamma distributed data with Gaussian and gamma specifications and found that parameter estimates and predicted values from the two models were similar. GLM using lognormal or inverse Gaussian response distributions may be advantageous for estimating GPS when treatments are skewed.
Estimating Treatment Effects
Dose–response functions.
The GPS estimated with NN and GLM can be used to plot the d.r.f., as demonstrated in our empirical example. Graham et al. (2015) extended the augmented regression approach of Scharfstein et al. (1999) from binary treatments to derive a Taylor approximation for continuous d.r.f. In this extension, an outcome regression model is augmented with a set of inverse GPS covariates to correct for potential misspecification bias. Users of the R statistical software can estimate average dose–response functions with the causaldrf package version 0.3 (Schafer & Galagate, 2015). The package offers a range of existing and new estimators, including those of Schafer and Galagate (2015), Bia et al. (2014), Flores et al. (2012), Imai and Van Dyk (2005), and Hirano and Imbens (2004).
IPW.
We previously stated that IPW can also be used to remove selection bias with continuous treatments (Robins et al., 2000). The weights can be implemented as survey weights when estimating the average treatment effect (ATE) after calculating GPS with GLM. This study demonstrated how to calculate the denominator of the IPW with NN. Had we also trained the NN to predict treatment dose, the same NN could perhaps have estimated the numerator, with covariate history as a predictor of treatment. Gruber et al. (2015) calculated the numerator of the IPW with a NN with one hidden layer containing two nodes; however, those authors did not outline their training techniques or the details of their NN hyperparameters. Future work should compare IPW with alternative weighting techniques to reduce bias and estimate the ATE.
Limitations and Future Research
We did not test multiple NN architectures to determine the optimal approach to estimate propensity scores. Zhu et al. (2015) proposed a boosting algorithm to estimate GPS and provided a method to determine optimization, and there are several established methods to optimize NN. It is unclear what performance other NN architectures would have yielded, and this should be examined in future work. Another limitation is that the confounding variables were all uncorrelated; varying the simulated relationships among them would have added more complexity to each iteration and, ultimately, to the study.
As noted above, continuous treatments are common in behavioral and educational research. The conditions of our simulation study suggest that researchers implementing NN can estimate GPS that are approximately unbiased at small and moderate sample sizes for skewed treatments. On average, the MSE values decreased with increases in skewness. Researchers who are exploring model specification can use NN as an alternative to ordinary least squares models because NN are nonparametric and less prone to specification errors (Drake, 1993). Variable selection methods, such as stepwise deletion, have produced unstable estimates in previous research and yielded highly variable treatment effect estimates (Breiman, 1996; McCaffrey et al., 2004). Despite the limited research on implementing NN in the context of propensity score analysis, we believe the current study provides improvements for estimation with continuous treatments. The differences in GPS estimation may not translate into differences in treatment effect estimates under the simulated conditions, and additional research is needed to understand when one method would be preferable. This article serves solely as an introduction, and we hope a gentle introduction is more engaging, particularly for readers who are not machine learning experts.
With regard to the data used for the empirical study, we acknowledge that the distance to the store variable was limited to a sample of shoppers living in a food desert. This may in turn limit interpretation of findings to similar, food desert circumstances. However, given the importance of food retail to community well-being and health, such information can support future efforts to understand the consequences of food deserts.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Biographies
Zachary K. Collier is an assistant professor in the Educational Statistics and Research Methods program and the Data Science Institute at the University of Delaware. He is also the director of the Methods for Unstructured and Difficult to use Data (MUDD) Lab. Collier’s research expertise generally falls into the advancement and application of two areas: finite mixture modeling and data-mining-type search algorithms in the context of propensity score analysis and structural equation modeling.
Walter L. Leite’s research program at the University of Florida explores how data mining and machine learning methods may assist in statistical modeling for theory development and causal inference. He focuses on data from virtual learning environments, state-level data systems and large national surveys.
Allison Karpyn is co-director at the Center for Research in Education and Social Policy and associate professor in the Department of the Human Development and Family Sciences at the University of Delaware. Dr. Karpyn, in her 20 years of practice, has published widely in journals including Pediatrics, Preventive Medicine, and Health Affairs on program evaluation methods; topics related to hunger, obesity, school food, supermarket access, food insecurity, healthy corner stores; and, strategies to develop and maintain farmer’s markets in low-income areas.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
References
- Ansong D, Wu S, & Chowa G (2015). The role of child and parent savings in promoting expectations for university education among middle school students in Ghana: A propensity score analysis. Children and Youth Services Review, 58, 265–273. 10.1016/j.childyouth.2015.08.009 [DOI] [Google Scholar]
- Amiguet M, Marazzi A, Valdora M, & Yohai V (2017). Robust estimators for generalized linear models with a dispersion parameter. arXiv preprint arXiv: 1703.09626.
- Arpino B, & Mealli F (2011). The specification of the propensity score in multilevel observational studies. Computational Statistics & Data Analysis, 55, 1770–1780. 10.1016/j.csda.2010.11.008 [DOI] [Google Scholar]
- Beane JD, House MG, Pitt SC, Zarzaur B, Kilbane EM, Hall BL, Riall TS, & Pitt HA (2017). Pancreatoduodenectomy with venous or arterial resection: A NSQIP propensity score analysis. HPB, 19(3), 254–263. [DOI] [PubMed] [Google Scholar]
- Bergstra J, & Bengio Y (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb), 281–305. [Google Scholar]
- Bia M, & Mattei A (2008). A Stata package for the estimation of the dose-response function through adjustment for the generalized propensity score. The Stata Journal, 8(3), 354–373. [Google Scholar]
- Breiman L (1996). Bagging predictors. Machine Learning, 24(2), 123–140. [Google Scholar]
- Cho EJ, Park HC, Yoon HB, Ju KD, Kim H, Oh YK, Yang J, Hwang Y-H, & Ahn C (2012). Effect of multidisciplinary pre-dialysis education in advanced chronic kidney disease: Propensity score matched cohort analysis. Nephrology, 17(5), 472–479. [DOI] [PubMed] [Google Scholar]
- Collier ZK, & Leite WL (2020). A tutorial on artificial neural networks in propensity score analysis. The Journal of Experimental Education, 1–18. [Google Scholar]
- d’Agostino R (1998). Tutorial in biostatistics: Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Statistics in Medicine, 17(19), 2265–2281. [DOI] [PubMed] [Google Scholar]
- Dahl GE, Sainath TN, & Hinton GE (2013, May 26–31). Improving deep neural networks for LVCSR using rectified linear units and dropout. In Acoustics, speech and signal processing (ICASSP), 2013 IEEE International Conference (pp. 8609–8613). IEEE. [Google Scholar]
- Davidian M (2017). Nonlinear models for repeated measurement data. Routledge. [Google Scholar]
- Deng L, & Yu D (2014). Deep learning: methods and applications. Foundations and trends in signal processing, 7(3–4), 197–387. [Google Scholar]
- Dobbin KK, & Simon RM (2011). Optimally splitting cases for training and testing high dimensional classifiers. BMC Medical Genomics, 4(1), 31. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Drake C (1993). Effects of misspecification of the propensity score on estimators of treatment effect. Biometrics, 1231–1236. [Google Scholar]
- Fazayeli F, & Banerjee A (2016). Generalized direct change estimation in ising model structure. In International Conference on Machine Learning (pp. 2281–2290). PMLR. [Google Scholar]
- Flores CA, Flores-Lagunes A, Gonzalez A, & Neumann TC (2012). Estimating the effects of length of exposure to instruction in a training program: The case of job corps. Review of Economics and Statistics, 94(1), 153–171. [Google Scholar]
- Fong C, Hazlett C, & Imai K (2016). Covariate balancing propensity score for a continuous treatment: Application to the efficacy of political advertisements. Department of Politics, Princeton University. [Google Scholar]
- Foster EM (2003). Propensity score matching: An illustrative analysis of dose response. Medical Care, 41(10), 1183–1192. [DOI] [PubMed] [Google Scholar]
- Fryges H (2009). The export–growth relationship: Estimating a dose-response function. Applied Economics Letters, 16(18), 1855–1859. [Google Scholar]
- Graham DJ, McCoy EJ, & Stephens DA (2015). Doubly robust dose-response estimation for continuous treatments via generalized propensity score augmented outcome regression. arXiv preprint arXiv:1506.04991.
- Gruber S, Logan RW, Jarrín I, Monge S, & Hernán MA (2015). Ensemble learning of inverse probability weights for marginal structural modeling in large observational datasets. Statistics in Medicine, 34(1), 106–117. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gupta N (2013). Artificial neural network. Network and Complex Systems, 3(1), 24–28. [Google Scholar]
- Hagan MT, & Menhaj MB (1994). Training feed forward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks, 5(6), 989–993. [DOI] [PubMed] [Google Scholar]
- Hecht-Nielsen R (1992). Theory of the back propagation neural network. In Wechsler H (Ed.), Neural networks for perception (pp. 65–93). Academic Press. [Google Scholar]
- Hirano K, & Imbens GW (2004). The propensity score with continuous treatments. Applied Bayesian modeling and causal inference from incomplete-data perspectives, 226164, 73–84. [Google Scholar]
- Imai K, & Van Dyk DA (2004). Causal inference with general treatment regimes: Generalizing the propensity score. Journal of the American Statistical Association, 99(467), 854–866. [Google Scholar]
- Imai K, & Van Dyk DA (2005). A Bayesian analysis of the multinomial probit model using marginal data augmentation. Journal of Econometrics, 124(2), 311–334. [Google Scholar]
- Imbens GW (2000). The role of the propensity score in estimating dose-response functions. Biometrika, 87(3), 706–710. [Google Scholar]
- Imbens GW, & Rubin DB (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press. [Google Scholar]
- Joffe MM, & Rosenbaum PR (1999). Invited commentary: propensity scores. American journal of epidemiology, 150(4), 327–333. [DOI] [PubMed] [Google Scholar]
- Karpyn A, Young CR, Collier Z, & Glanz K (2020). Correlates of healthy eating in Urban Food desert communities. International Journal of Environmental Research and Public Health, 17(17), 6305. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Keller B, Kim J, & Steiner P (2013). Abstract: Data mining alternatives to logistic regression for propensity score estimation: Neural networks and support vector machines. Multivariate Behavioral Research, 48(1), 164–164. 10.1080/00273171.2013.752263 [DOI] [PubMed] [Google Scholar]
- Kim J, & Seltzer M (2007). Causal inference in multilevel settings in which selection processes vary across schools. Center for Study of Evaluation. [Google Scholar]
- Kingma DP, & Ba J (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kluve J, Schneider H, Uhlendorff A, & Zhao Z (2012). Evaluating continuous training programmes by using the generalized propensity score. Journal of the Royal Statistical Society: Series A (Statistics in Society), 175(2), 587–617. [Google Scholar]
- Kotsiantis SB, Zaharakis I, & Pintelas P (2007). Supervised machine learning: A review of classification techniques. Emerging artificial intelligence applications in computer engineering, 160(1), 3–24. [Google Scholar]
- Kreif N, Grieve R, Díaz I, & Harrison D (2015). Evaluation of the effect of a continuous treatment: a machine learning approach with an application to treatment for traumatic brain injury. Health economics, 24(9), 1213–1228. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Le QV, Jaitly N, & Hinton GE (2015). A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941.
- Lee BK, Lessler J, & Stuart EA (2010). Improving propensity score weighting using machine learning. Statistics in Medicine, 29(3), 337–346. 10.1002/sim.3782 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Leite WL, Jimenez F, Kaya Y, Stapleton LM, MacInnes JW, & Sandbach R (2015). An evaluation of weighting methods based on propensity scores to reduce selection bias in multilevel observational studies. Multivariate Behavioral Research, 50(3), 265–284. [DOI] [PubMed] [Google Scholar]
- Leite W (2016). Practical propensity score methods using R. Sage Publications. [Google Scholar]
- McCaffrey DF, Ridgeway G, & Morral AR (2004). Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods, 9(4), 403–425. [DOI] [PubMed] [Google Scholar]
- Monlezun DJ, Leong B, Joo E, Birkhead AG, Sarris L, & Harlan TS (2015). Novel longitudinal and propensity score matched analysis of hands-on cooking and nutrition education versus traditional clinical education among 627 medical students. Advances in Preventive Medicine, 2015, 656780. 10.1155/2015/656780 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Moodie EE, & Stephens DA (2012). Estimation of dose–response functions for longitudinal data using the generalised propensity score. Statistical Methods in Medical Research, 21(2), 149–166. [DOI] [PubMed] [Google Scholar]
- Moran JL, Solomon PJ, Peisach AR, & Martin J (2007). New models for old questions: Generalized linear models for cost prediction. Journal of Evaluation in Clinical Practice, 13(3), 381–389. [DOI] [PubMed] [Google Scholar]
- Morgan PL, Frisco ML, Farkas G, & Hibel J (2010). A propensity score matching analysis of the effects of special education services. The Journal of Special Education, 43(4), 236–254. 10.1177/0022466908323007 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Neyman JS (1923). On the application of probability theory to agricultural experiments. essay on principles. section 9 (translated and edited by dm dabrowska and tp speed, statistical science [1990], 5, 465–480). Annals of Agricultural Sciences, 10, 1–51. [Google Scholar]
- Noreen EW (1989). Computer-intensive methods for testing hypotheses. Wiley. [Google Scholar]
- Olejnik S, & Algina J (2003). Generalized eta and omega squared statistics: measures of effect size for some common research designs. Psychological Methods, 8(4), 43. [DOI] [PubMed] [Google Scholar]
- Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Müller A, Nothman J, Louppe G, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, & Duchesnay E (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830. [Google Scholar]
- Pirracchio R, Petersen ML, Carone M, Rigon MR, Chevret S, & van der Laan MJ (2015). Mortality prediction in intensive care units with the Super ICU Learner Algorithm (SICULA): a population-based study. The Lancet Respiratory Medicine, 3(1), 42–52. [DOI] [PMC free article] [PubMed] [Google Scholar]
- R Core Team. (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/ [Google Scholar]
- Roberts C, & Vandenplas C (2017). Estimating components of mean squared error to evaluate the benefits of mixing data collection modes. Journal of Official Statistics, 33(2), 303–334. [Google Scholar]
- Robins JM, Hernan MA, & Brumback B (2000). Marginal structural models and causal inference in epidemiology. Epidemiology, 11(5), 550–560. [DOI] [PubMed] [Google Scholar]
- Rosenbaum PR, & Rubin DB (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55. 10.1093/biomet/70.1.41 [DOI] [Google Scholar]
- Rosenbaum PR, & Rubin DB (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79, 516–524. [Google Scholar]
- Rosenblatt F (1962). Principles of neurodynamics. Spartan. [Google Scholar]
- Rubin DB (1973). The use of matching and regression adjustment to remove bias in observational studies. Biometrics, 29, 185–203. [Google Scholar]
- Rubin DB (1997). Estimating causal effects from large data sets using propensity scores. Annals of Internal Medicine, 127(8 Part2), 757–763. [DOI] [PubMed] [Google Scholar]
- Rubin DB (2010). Propensity score methods. American Journal of Ophthalmology, 149(1), 7–9. [DOI] [PubMed] [Google Scholar]
- Ruck DW, Rogers SK, Kabrisky M, Oxley ME, & Suter BW (1990). The multilayer perceptron as an approximation to a Bayes optimal discriminant function. IEEE Transactions on Neural Networks, 1(4), 296–298. [DOI] [PubMed] [Google Scholar]
- Schafer J (2015). causaldrf: tools for estimating causal dose response functions. R package version 0.3. https://CRAN.R-project.org/package=causaldrf
- Schafer J, & Galagate D (2015). Causal inference with a continuous treatment and outcome: Alternative estimators for parametric dose-response models. Advance online publication. http://hdl.handle.net/1903/18170
- Scharfstein DO Rotnitzky A, & Robins JM (1999). Adjusting for nonignorable dropout using semiparametric nonresponse models. Journal of the American Statistical Association, 94(448), 1096–1120 (with rejoinder 1135–1146). [Google Scholar]
- Setoguchi S, Schneeweiss S, Brookhart MA, Glynn RJ, & Cook EF (2008). Evaluating uses of data mining techniques in propensity score estimation: A simulation study. Pharmacoepidemiology and Drug Safety, 17(6), 546–555. 10.1002/pds.1555 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Specht DF (1991). A general regression neural network. IEEE Transactions on Neural Networks, 2(6), 568–576. [DOI] [PubMed] [Google Scholar]
- Srivastava N, Hinton GE, Krizhevsky A, Sutskever I, & Salakhutdinov R (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958. [Google Scholar]
- Stacy EW (1962). A generalization of the gamma distribution. The Annals of Mathematical Statistics, 33(3), 1187–1192. [Google Scholar]
- Stuart EA (2010). Matching methods for causal inference: A review and a look forward. Statistical Science: A Review Journal of the Institute of Mathematical Statistics, 25(1), 1–21. 10.1214/09-STS313 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stukel TA, Fisher ES, Wennberg DE, Alter DA, Gottlieb DJ, & Vermeulen MJ (2007). Analysis of observational studies in the presence of treatment selection bias: Effects of invasive cardiac management on AMI survival using propensity score and instrumental variable methods. JAMA, 297(3), 278–285. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sullivan A, & Field S (2013). Do preschool special education services make a difference in kindergarten reading and mathematics skills? A propensity score weighting analysis. Journal of School Psychology, 51(2), 243–260. 10.1016/j.jsp.2012.12.004 [DOI] [PubMed] [Google Scholar]
- Tenti P (2017). Forecasting foreign exchange rates using recurrent neural networks. In Slade S (Ed.), Artificial intelligence applications on wall street (pp. 567–580). Routledge. [Google Scholar]
- Thom HC (1958). A note on the gamma distribution. Monthly Weather Review, 86(4), 117–122. [Google Scholar]
- U.S. Department of Agriculture ERS. USDA Food Access Research Atlas. U.S. Department of Agriculture, Economic Research Service. (2015). Retrieved December 21, 2020, from https://www.ers.usda.gov/data-products/food-access-research-atlas/go-to-the-atlas/ [Google Scholar]
- Walczak S (2018). Artificial neural networks. In Khosrow-Pour M (Ed.), Encyclopedia of information science and technology (4th ed., pp. 120–131). IGI Global. [Google Scholar]
- Westreich D, Lessler J, & Funk MJ (2010). Propensity score estimation: Machine learning and classification methods as alternatives to logistic regression. Journal of Clinical Epidemiology, 63(8), 826–833. 10.1016/j.jclinepi.2009.11.020 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Witten IH, Frank E, Hall MA, & Pal CJ (2016). Data mining: Practical machine learning tools and techniques. Morgan Kaufmann. [Google Scholar]
- Yadav A, & Sahu K (2017). Wind forecasting using artificial neural networks: A survey and taxonomy. International Journal of Research in Science & Engineering, 3, 148–155. [Google Scholar]
- Zhu A (2017). Artificial neural networks. In Castree N, Goodchild MF, Kobayashi A, Marston RA, Liu W, & Richardson D (Eds.), The international encyclopedia of geography. John Wiley & Sons. [Google Scholar]
- Zhu Y, Coffman DL, & Ghosh D (2015). A boosting algorithm for estimating generalized propensity scores with continuous treatments. Journal of Causal Inference, 3(1), 25–40. [DOI] [PMC free article] [PubMed] [Google Scholar]