Author manuscript; available in PMC: 2026 Mar 21.
Published before final editing as: J Am Stat Assoc. 2026 Mar 17:10.1080/01621459.2026.2637894. doi: 10.1080/01621459.2026.2637894

Inference for Deep Neural Network Estimators in Generalized Nonparametric Models

Xuran Meng *, Yi Li
PMCID: PMC13003751  NIHMSID: NIHMS2150846  PMID: 41869283

Abstract

While deep neural networks (DNNs) are used for prediction, inference on DNN-estimated subject-specific means for categorical or exponential family outcomes remains underexplored. We address this by proposing a DNN estimator under generalized nonparametric regression models (GNRMs) and developing a rigorous inference framework. Unlike existing approaches that assume independence between estimation errors and inputs to establish the error bound, a condition often violated in GNRMs, we allow for dependence and our theoretical analysis demonstrates the feasibility of drawing inference under GNRMs. To implement inference, we consider an Ensemble Subsampling Method (ESM) that leverages U-statistics and the Hoeffding decomposition to construct reliable confidence intervals for DNN estimates. We show that, under GNRM settings, ESM enables model-free variance estimation and accounts for heterogeneity among individuals in the population. Through simulations under nonparametric logistic, Poisson, and binomial regression models, we demonstrate the effectiveness and efficiency of our method. We further apply the method to the electronic Intensive Care Unit (eICU) dataset, a large scale collection of anonymized health records from ICU patients, to predict ICU readmission risk and offer patient-centric insights for clinical decision making.

Keywords: Deep Neural Networks, Inference, Generalized Nonparametric Regression Models, Ensemble Subsampling Method

1. Introduction

Given an outcome variable $y$ and covariates $\mathbf{x}$, estimating the conditional expectation $E(y \mid \mathbf{x})$ is central to supervised learning. Under squared loss, $E(y \mid \mathbf{x})$ is the optimal estimator that minimizes the expected estimation error among all measurable functions, making it a natural and widely adopted target in statistical modeling and machine learning. Even when optimality does not hold under more general losses, $E(y \mid \mathbf{x})$ remains a valuable target, offering insight into the relationship between covariates and outcomes. It serves as a foundational component for constructing more complex quantities across domains such as conditional variances, counterfactual means, and risk-adjusted estimates (Hastie et al. 2009). Methods that estimate $E(y \mid \mathbf{x})$ include linear, parametric and nonparametric regression, kernel regression (Nadaraya 1964), support vector regression (Drucker et al. 1996), random forests (Breiman 2001) and deep neural networks (DNNs) (LeCun et al. 2015).

With the rapid advancement of artificial intelligence, DNN-based estimators have gained widespread popularity. However, the statistical inference surrounding the DNN estimator of $E(y \mid \mathbf{x})$, especially when $y$ is categorical or belongs to the exponential family, remains underdeveloped. We address this gap by establishing a rigorous framework for inference on $E(y \mid \mathbf{x})$ estimated by DNNs under a generalized nonparametric regression model (GNRM) setting [see Eq. (2.3)].

Modern inference approaches include distribution-free conformal estimation inference (Lei et al. 2018, Angelopoulos & Bates 2021, Huang et al. 2024), extensions of jackknife inference within conformal estimation (Alaa & Van Der Schaar 2020, Kim et al. 2020, Barber et al. 2021), debiasing methods (Athey et al. 2018, Guo et al. 2021), residual-based bootstrap techniques (Zhang & Politis 2023), and the use of U-statistics in regressions, random forests and neural networks (Mentch & Hooker 2016, Wager & Athey 2018, Schupbach et al. 2020, Wang & Wei 2022, Fei & Li 2024). Several distinctions and limitations remain. For example, conformal prediction constructs distribution-free prediction sets for $y$, targeting outcome coverage rather than inference on $E(y \mid \mathbf{x})$. Its intervals thus absorb both outcome noise and model uncertainty, making them conservative and inefficient for mean inference. Clinically, inference on risk classifications derived from patients’ risk profiles, which are based on the estimated conditional mean (i.e., risk score), is more actionable, as medical decisions (e.g., triage or treatment intensity) rely on risk stratifications rather than prediction sets (Lundberg et al. 2018, Van Calster et al. 2019). Jackknife and related leave-one-out methods assume stable, nearly independent residuals, conditions violated by DNNs whose estimates depend heavily on individual samples. Debiasing approaches (Athey et al. 2018, Guo et al. 2021) and residual bootstraps (Zhang & Politis 2023) remain limited to linear models, while resampling-based inference for high-dimensional GLMs (Fei & Li 2021) is fully parametric. Recent work on uncertainty quantification for random forests (Mentch & Hooker 2016, Wager & Athey 2018) and neural networks (Schupbach et al. 2020, Fei & Li 2024) offers valuable insights but remains largely confined to continuous outcomes, leaving inference for GNRMs such as logistic and Poisson regressions unresolved.

Theoretically, for valid inference on $E(y \mid \mathbf{x})$, we need to establish the convergence rate of neural networks in a GNRM framework. Since the universal approximation theorem (Hornik et al. 1989), there has been much progress in understanding the convergence rates of neural networks (McCaffrey & Gallant 1994, Kohler & Krzyżak 2005, Yarotsky 2017, Shen et al. 2019, Kohler & Langer 2021, Lu et al. 2021, Shen et al. 2021, Jiao et al. 2023, Fan et al. 2024, Wang et al. 2024, Bhattacharya et al. 2024, Yan et al. 2025). For instance, Shen et al. (2021) established the convergence rate under nonparametric regression assuming the $p$-th moment of the response is bounded for some $p>1$; Fan et al. (2024) explored nonasymptotic error bounds of the estimators under several loss functions, such as the least-absolute deviation loss and Huber loss; Wang et al. (2024) proposed an efficient and robust nonparametric regression estimator based on an estimated maximum likelihood estimator; Fan & Gu (2024) and Bhattacharya et al. (2024) investigated deep neural network estimators for nonparametric interaction models with diverging input dimensions, focusing on theoretical properties such as minimax optimality, approximation error control, and sparsity-driven regularization. However, quantifying the uncertainty of the DNN-estimated $E(y \mid \mathbf{x})$ within the GNRM framework introduces additional challenges. The existing approaches typically assume a nonparametric regression framework, where the distribution of $y - E(y \mid \mathbf{x})$ does not depend on $\mathbf{x}$. This assumption simplifies the theoretical analysis and the derivation of convergence results (Takezawa 2005, Györfi et al. 2006, Fan 2018). In generalized nonparametric regression settings, however, the assumption often fails due to heteroskedasticity, as the distribution of $y - E(y \mid \mathbf{x})$ may still depend on $\mathbf{x}$. Consequently, the standard techniques for bounding error terms and controlling variance no longer apply, and the established results on uniform convergence or concentration inequalities (Bartlett et al. 2020) are invalid. As a result, it is unclear whether these methods are applicable for drawing inference on GNRMs estimated by DNNs.

Our methodological and theoretical contributions are as follows. First, we propose a DNN estimator for subject-specific means under GNRMs and develop an inference framework that explicitly accounts for covariate-dependent error distributions induced by heteroskedasticity. We provide new proof techniques, including a new criterion loss function, covering-number based concentration and a truncation strategy that handles exponential-tailed, heteroskedastic noise in Section D in the Supplementary Material, improving the results of Schmidt-Hieber (2020). For example, with the general loss induced by GNRMs, the quadratic-based arguments of Schmidt-Hieber (2020) no longer apply. This necessitates a complete redevelopment of the error bounds, derived instead from the intrinsic curvature and Lipschitz properties in the new GNRM setting. As a result, all estimation error analyses are novel and move beyond Gaussian approximations. Additionally, the presence of exponential-tailed, heteroskedastic noise complicates uniform convergence, which we address through a truncation strategy. The truncation scheme partitions the analysis into bounded and unbounded noise regimes, where standard convergence arguments apply in the former and concentration inequalities control the latter, yielding sharper error bounds. These theoretical tools enable us to extend the continuous outcome setting of Schmidt-Hieber (2020), accommodating both continuous and discrete outcomes, and thereby generalizing their framework to the broader GNRM class. Second, to quantify the uncertainty of the estimates, we propose an ensemble subsampling method (ESM) under GNRMs, developed from U-statistics (Frees 1989, Hoeffding 1992, Lee 2019, Boroskikh 2020) and Hoeffding decomposition (Hoeffding 1992). Our model-free variance estimator leverages the intrinsic properties of Hoeffding decomposition and accommodates varying population levels as well as individual variance structures. Moreover, we employ incomplete U-statistics to reduce computational cost, enabling the resample size to grow proportionally with the sample size while limiting the total number of resamples. To mitigate the resulting Monte Carlo bias, we refine the variance estimator of Wager & Athey (2018) for bias correction.

In Section 2, we introduce deep neural networks (DNNs), generalized nonparametric regression models (GNRMs) and the ensemble subsampling method (ESM). Section 3 presents theoretical results for the convergence rate of DNNs under the GNRM framework, as well as results for statistical inference using ESM. In Section 4, we perform simulations to evaluate the finite sample performance of our method. Section 5 demonstrates the proposed methodology using the electronic Intensive Care Unit (eICU) dataset to estimate both the probability and expected number of ICU readmissions based on patient characteristics, with accompanying statistical inference. The eICU dataset is a multi-center resource comprising anonymized clinical data from ICU patients, developed through a collaboration between the Massachusetts Institute of Technology and multiple healthcare partners. The findings may support the identification of high-risk individuals and guide the delivery of personalized, risk-informed care. Section 6 concludes the paper with discussions. Additional experiments and proofs are provided in the Supplementary Material.

Notation.

We employ lowercase letters (e.g. $a$) for scalars and boldface letters for vectors and matrices (e.g. $\mathbf{a}$ and $\mathbf{A}$). For any matrix $\mathbf{A}$, we use $\|\mathbf{A}\|_0$ and $\|\mathbf{A}\|_\infty$ to denote its zero norm and infinity norm, respectively. The sets of natural and real numbers are denoted by $\mathbb{N}$ and $\mathbb{R}$, respectively. For any function $f$ mapping a vector to $\mathbb{R}$, we use $\|f\|_\infty$ to denote its infinity norm. For an integer $n$, $[n]=\{1,\ldots,n\}$. We write $X_1(n)=O(X_2(n))$ or $X_1(n)\lesssim X_2(n)$ or $X_2(n)\gtrsim X_1(n)$ if there exist $C>0$ and $N_0>0$ such that $X_1(n)\le C X_2(n)$ when $n>N_0$. We denote $X_2(n)\asymp X_1(n)$ if $X_1(n)=O(X_2(n))$ and $X_2(n)=O(X_1(n))$. We denote $X_1(n)=o(X_2(n))$ if $X_1(n)/X_2(n)\to 0$, and $X_1(n)\sim X_2(n)$ if $X_1(n)/X_2(n)$ converges to 1. When clear from context, we add the index $p$ (for probability) such as $O_p$ for a random variable and its realized value to avoid redundancy.

2. The Preamble

2.1. Deep Neural Networks

Let $L$ be a positive integer denoting the depth of a neural network, and $\mathbf{p}=(p_0,\ldots,p_L,p_{L+1})$ be a vector of positive integers specifying the width of each layer, with $p_0$ corresponding to the input dimension, $p_1,\ldots,p_L$ the dimensions of the $L$ hidden layers, and $p_{L+1}$ the dimension of the output layer. An $(L+1)$-layer DNN with layer-width $\mathbf{p}$ is defined as

$$f(\mathbf{x})=\mathbf{W}_L f_L(\mathbf{x})+\mathbf{v}_L, \tag{2.1}$$

with the recursive expression $f_l(\mathbf{x})=\sigma(\mathbf{W}_{l-1}f_{l-1}(\mathbf{x})+\mathbf{v}_{l-1})$, $f_1(\mathbf{x})=\sigma(\mathbf{W}_0\mathbf{x}+\mathbf{v}_0)$. Here, $\mathbf{W}_l\in\mathbb{R}^{p_{l+1}\times p_l}$ and vectors $\mathbf{v}_l\in\mathbb{R}^{p_{l+1}}$ $(l\in[L])$ are the parameters of this DNN, and $\sigma:\mathbb{R}\to\mathbb{R}$ is the activation function, applied element-wise to vectors. A commonly used activation function is the rectified linear unit (ReLU) function (Nair & Hinton 2010), or $\sigma(x)=\max(x,0)$. We focus on ReLU due to its piecewise linearity and projection properties, with potential extensions to other activation functions as discussed in Section 6.
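As a minimal sketch of the recursion in (2.1), the following standard-library Python evaluates a ReLU network given its weight matrices and shift vectors. The function names (`relu`, `affine`, `dnn_forward`) and the list-of-rows matrix representation are illustrative assumptions, not part of the paper.

```python
def relu(v):
    """ReLU activation sigma(x) = max(x, 0), applied element-wise."""
    return [max(z, 0.0) for z in v]

def affine(W, v, h):
    """Return W h + v, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(W, v)]

def dnn_forward(x, weights, biases):
    """Evaluate f(x) = W_L f_L(x) + v_L as in (2.1).

    weights = [W_0, ..., W_L] and biases = [v_0, ..., v_L]; every layer
    except the last is followed by ReLU.
    """
    h = list(x)
    for W, v in zip(weights[:-1], biases[:-1]):
        h = relu(affine(W, v, h))               # f_{l+1} = sigma(W_l f_l + v_l)
    return affine(weights[-1], biases[-1], h)   # final affine map, no activation

# Toy network with p = (2, 1, 1): one hidden unit, scalar output.
weights = [[[1.0, -1.0]], [[2.0]]]
biases = [[0.0], [1.0]]
out_a = dnn_forward([3.0, 1.0], weights, biases)  # hidden unit active
out_b = dnn_forward([1.0, 3.0], weights, biases)  # hidden unit clipped to 0
```

In the toy network, the hidden unit computes $\max(x_1-x_2,0)$, so the two inputs exercise the active and inactive branches of the ReLU.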

Overparameterized DNNs can memorize noise and overfit (Cao et al. 2022), so sparsity is used to control complexity. We focus on the following class of s-sparse and F-bounded networks for our theoretical analysis:

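$$\mathcal{F}(L,\mathbf{p},s,F)=\Big\{f \text{ of form (2.1)}: \|\mathbf{W}_l\|_\infty,\|\mathbf{v}_l\|_\infty\le 1,\ \sum_{l=1}^{L}\big(\|\mathbf{W}_l\|_0+\|\mathbf{v}_l\|_0\big)\le s,\ \|f\|_\infty\le F\Big\}. \tag{2.2}$$

Here, $s\in\mathbb{N}_+$, and $F>0$ is a constant. $\|\cdot\|_0$ is the number of nonzero entries of a matrix or vector, and $\|f\|_\infty$ is the sup-norm of the function $f$.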

2.2. Generalized Nonparametric Regression

Consider outcomes from the exponential family, including commonly used distributions such as normal, Bernoulli, Poisson, and binomial. The density (or mass) function for a random variable $y$ in the (single-parameter) exponential family is $p(y\mid\theta)=h(y)\exp\{\theta y-\psi(\theta)\}$, where $\theta$ is the canonical parameter, $0<h(y)<\infty$ ensures normalization, and $\psi(\theta)$ is convex and smooth, satisfying $E(y\mid\theta)=\psi'(\theta)$, where $\psi'(\cdot)$ denotes the derivative of $\psi(\cdot)$ and the inverse function of $\psi'(\cdot)$ is known as the link function (Brown 1986, Fei & Li 2021).

Suppose we observe $n$ independent and identically distributed (iid) samples $\{(\mathbf{x}_i,y_i)\}_{i=1}^n$ from $(\mathbf{x},y)$, where $\mathbf{x}\in\mathbb{R}^p$ is a covariate vector and $y\in\mathbb{R}$ is the response variable. Given $\mathbf{x}$, the distribution of $y$ is modeled as

$$p(y\mid\mathbf{x})=h(y)\exp\{y f_0(\mathbf{x})-\psi(f_0(\mathbf{x}))\}, \tag{2.3}$$

where $f_0:\mathbb{R}^p\to\mathbb{R}$ is a nonparametric function with $f_0(\mathbf{x})$ representing the mean parameter of the response variable $y$ given $\mathbf{x}$. Model (2.3) is referred to as a Generalized Nonparametric Regression Model (GNRM), where the outcomes follow a distribution from the exponential family. Given an independent test point $(\mathbf{x}_{new},y_{new})$ such that $y_{new}$, given $\mathbf{x}_{new}$, follows (2.3), we aim to estimate $E(y_{new}\mid\mathbf{x}_{new})$, which is $\psi'(f_0(\mathbf{x}_{new}))$. Consequently, it is of interest to estimate $f_0$. By the universal approximation theorem (Hornik et al. 1989), DNNs are suitable candidates for estimating $f_0$. We consider an optimal DNN estimator, $\hat f_n^{opt}$, which minimizes the negative log-likelihood of (2.3) over the class of DNNs defined in (2.2), that is,

$$\hat f_n^{opt}=\operatorname*{argmin}_{f\in\mathcal{F}(L,\mathbf{p},s,F)}\frac{1}{n}\sum_{i=1}^{n}\big\{-y_i f(\mathbf{x}_i)+\psi(f(\mathbf{x}_i))\big\}. \tag{2.4}$$

The estimator (2.4) is applicable to commonly used models, including those listed below.

Nonparametric Gaussian Regression:

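Given $\mathbf{x}$, we assume that the response variable follows a Gaussian distribution with mean $f_0(\mathbf{x})$ and variance $\sigma^2$. In this setting, $\sigma^2$ does not affect the estimation of the mean function $f_0(\mathbf{x})$, which can be seen by examining the likelihood and optimization. The likelihood function for a nonparametric Gaussian regression model is given by $p(y\mid\mathbf{x})=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y-f_0(\mathbf{x}))^2}{2\sigma^2}}$. An application of (2.4) to this setting yields:

$$\hat f_n^{opt}=\operatorname*{argmin}_{f\in\mathcal{F}(L,\mathbf{p},s,F)}\frac{1}{n}\sum_{i=1}^{n}\Big\{-y_i f(\mathbf{x}_i)+\frac{1}{2}f^2(\mathbf{x}_i)\Big\}=\operatorname*{argmin}_{f\in\mathcal{F}(L,\mathbf{p},s,F)}\frac{1}{n}\sum_{i=1}^{n}\big(y_i-f(\mathbf{x}_i)\big)^2,$$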

corresponding to mean squared error minimization as in Schmidt-Hieber (2020).

Nonparametric Logistic Regression:

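To model binary outcomes $y\in\{0,1\}$, we set $h(y)=1$, $\psi(\theta)=\log(1+e^{\theta})$ in (2.3). The estimator (2.4) in this setting can be written as:

$$\hat f_n^{opt}=\operatorname*{argmin}_{f\in\mathcal{F}(L,\mathbf{p},s,F)}\frac{1}{n}\sum_{i=1}^{n}\big\{-y_i f(\mathbf{x}_i)+\log\big(1+\exp(f(\mathbf{x}_i))\big)\big\}.$$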

Nonparametric Poisson Regression:

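Considering $y\in\mathbb{N}$ (non-negative integers), we define $h(y)=\frac{1}{y!}$ and $\psi(\theta)=e^{\theta}$ in (2.3). The estimator (2.4) has the following form:

$$\hat f_n^{opt}=\operatorname*{argmin}_{f\in\mathcal{F}(L,\mathbf{p},s,F)}\frac{1}{n}\sum_{i=1}^{n}\big\{\exp(f(\mathbf{x}_i))-y_i f(\mathbf{x}_i)\big\}.$$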

Nonparametric Binomial Regression:

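Given covariates $\mathbf{x}$, the response variable $y$ follows a Binomial distribution with a fixed number of trials $n_{trial}$ and the success probability $p(\mathbf{x})=\frac{1}{1+e^{-f_0(\mathbf{x})}}$. This corresponds to $h(y)=\binom{n_{trial}}{y}$ and $\psi(\theta)=n_{trial}\log(1+e^{\theta})$. In this setting, the estimator (2.4) is specified as

$$\hat f_n^{opt}=\operatorname*{argmin}_{f\in\mathcal{F}(L,\mathbf{p},s,F)}\frac{1}{n}\sum_{i=1}^{n}\big\{-y_i f(\mathbf{x}_i)+n_{trial}\log\big(1+e^{f(\mathbf{x}_i)}\big)\big\}.$$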
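The four special cases above differ only in the log-partition function $\psi$. The sketch below, in standard-library Python, evaluates the empirical objective in (2.4) for given fitted values; the dictionary `PSI` and the helper names are hypothetical and introduced only for illustration.

```python
import math

def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

# Log-partition functions psi(theta) for the models listed above.
PSI = {
    "gaussian": lambda t: 0.5 * t * t,   # psi(theta) = theta^2 / 2
    "logistic": softplus,                # psi(theta) = log(1 + e^theta)
    "poisson": math.exp,                 # psi(theta) = e^theta
}

def binomial_psi(n_trial):
    """psi(theta) = n_trial * log(1 + e^theta) for a fixed number of trials."""
    return lambda t: n_trial * softplus(t)

def empirical_risk(psi, y, f_values):
    """(1/n) * sum_i { -y_i f(x_i) + psi(f(x_i)) }, the objective in (2.4)."""
    return sum(-yi * fi + psi(fi) for yi, fi in zip(y, f_values)) / len(y)

risk_logistic = empirical_risk(PSI["logistic"], [0, 1], [0.0, 0.0])
risk_poisson = empirical_risk(PSI["poisson"], [2], [0.0])
risk_binomial = empirical_risk(binomial_psi(3), [1], [0.0])
```

Here `f_values` stands for the network outputs $f(\mathbf{x}_i)$; in practice they would come from the fitted DNN rather than being supplied by hand.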

2.3. Empirical Risk Minimization and Function Smoothness

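Recall the pointwise Kullback–Leibler divergence between two distributions of the form (2.3), one specified by $f$ and the other by the true $f_0$, conditional on $\mathbf{x}$:

$$E\big[-yf(\mathbf{x})+\psi(f(\mathbf{x}))+yf_0(\mathbf{x})-\psi(f_0(\mathbf{x}))\mid\mathbf{x}\big]=-\psi'(f_0(\mathbf{x}))\big(f(\mathbf{x})-f_0(\mathbf{x})\big)+\psi(f(\mathbf{x}))-\psi(f_0(\mathbf{x})),$$

where we use $E[y\mid\mathbf{x}]=\psi'(f_0(\mathbf{x}))$. This equals the Bregman divergence between $f(\mathbf{x})$ and $f_0(\mathbf{x})$, which is nonnegative and vanishes if and only if $f(\mathbf{x})=f_0(\mathbf{x})$. We then quantify the estimation error of $\hat f$, any estimator of $f_0$, by evaluating the pointwise estimation error at an independent test point $\mathbf{X}$: $\mathcal{E}(\mathbf{X};\hat f,f_0)=-\psi'(f_0(\mathbf{X}))\big(\hat f(\mathbf{X})-f_0(\mathbf{X})\big)+\psi(\hat f(\mathbf{X}))-\psi(f_0(\mathbf{X}))$. We further define the average estimation error, our main theoretical target, as

$$R_n(\hat f,f_0)\triangleq E\big[\mathcal{E}(\mathbf{X};\hat f,f_0)\big], \tag{2.5}$$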

where the expectation is taken with respect to all involved random variables.
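To make the pointwise divergence concrete, it can be written out for the three log-partition functions used in Section 2.2. The label $D_\psi$ is introduced here only for illustration, with $f$ and $f_0$ evaluated at the same point:

```latex
\begin{align*}
D_\psi(f,f_0) &= -\psi'(f_0)\,(f-f_0)+\psi(f)-\psi(f_0),\\[4pt]
\text{Gaussian } \big(\psi(\theta)=\tfrac12\theta^2\big):\quad
D_\psi(f,f_0) &= -f_0(f-f_0)+\tfrac12 f^2-\tfrac12 f_0^2=\tfrac12\,(f-f_0)^2,\\[4pt]
\text{Poisson } \big(\psi(\theta)=e^{\theta}\big):\quad
D_\psi(f,f_0) &= e^{f}-e^{f_0}-e^{f_0}(f-f_0),\\[4pt]
\text{Logistic } \big(\psi(\theta)=\log(1+e^{\theta})\big):\quad
D_\psi(f,f_0) &= p_0\log\frac{p_0}{p}+(1-p_0)\log\frac{1-p_0}{1-p},
\qquad p=\frac{1}{1+e^{-f}},\; p_0=\frac{1}{1+e^{-f_0}}.
\end{align*}
```

The Gaussian case recovers half the squared error, and the logistic case is the Kullback–Leibler divergence between two Bernoulli distributions, consistent with the nonnegativity noted above.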

Computing the exact minimizer $\hat f_n^{opt}$ in (2.4) is challenging due to the nonconvex loss (Fan & Gu 2024). Instead, we obtain an approximate minimizer $\hat f_n$ using optimization algorithms such as gradient descent. To quantify the optimization error, we introduce

$$\Delta_n(\hat f_n,\hat f_n^{opt})=E\Big\{\frac{1}{n}\sum_{i=1}^{n}\big(-y_i\hat f_n(\mathbf{x}_i)+\psi(\hat f_n(\mathbf{x}_i))\big)-\frac{1}{n}\sum_{i=1}^{n}\big(-y_i\hat f_n^{opt}(\mathbf{x}_i)+\psi(\hat f_n^{opt}(\mathbf{x}_i))\big)\Big\} \tag{2.6}$$

that measures the difference between the expected empirical risk of $\hat f_n$ and $\hat f_n^{opt}$. For notational ease, we write $\Delta_n(\hat f_n)\equiv\Delta_n(\hat f_n,\hat f_n^{opt})$ when the reference to $\hat f_n^{opt}$ is clear. We will show that a small empirical deviation $\Delta_n(\hat f_n)$, combined with controlled model complexity of the neural network class, implies a bound on the average excess risk $R_n(\hat f_n,f_0)$, forming the foundation for the inference theory developed in subsequent sections.

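For desirable properties of $\hat f_n$, we assume $f_0$ belongs to a Hölder class (Schmidt-Hieber 2020, Fan & Gu 2024) with parameters $\gamma,K>0$, and domain $D\subset\mathbb{R}^r$:

$$\mathcal{G}_r^{\gamma}(D,K)=\Big\{g:D\to\mathbb{R}:\sum_{\boldsymbol\beta:\|\boldsymbol\beta\|_1<\gamma}\|\partial^{\boldsymbol\beta}g\|_\infty+\sum_{\boldsymbol\beta:\|\boldsymbol\beta\|_1=\lfloor\gamma\rfloor}\sup_{\mathbf{x},\mathbf{y}\in D,\,\mathbf{x}\ne\mathbf{y}}\frac{|\partial^{\boldsymbol\beta}g(\mathbf{x})-\partial^{\boldsymbol\beta}g(\mathbf{y})|}{\|\mathbf{x}-\mathbf{y}\|_\infty^{\gamma-\lfloor\gamma\rfloor}}\le K\Big\},$$

where $\lfloor\gamma\rfloor$ is the largest integer strictly smaller than $\gamma$, and $\partial^{\boldsymbol\beta}=\partial^{\beta_1}\cdots\partial^{\beta_r}$ with $\boldsymbol\beta=(\beta_1,\ldots,\beta_r)\in\mathbb{N}^r$. We further assume that $f_0$ is a $(q+1)$-composition ($q\in\mathbb{N}$) of several Hölder functions. That is, for some vectors $\mathbf{d}=(d_0,\ldots,d_{q+1})\in\mathbb{N}_+^{q+2}$, $\mathbf{t}=(t_0,\ldots,t_q)\in\mathbb{N}_+^{q+1}$ and $\mathbf{m}=(m_0,\ldots,m_q)\in\mathbb{R}_+^{q+1}$, $f_0\in\mathcal{G}(q,\mathbf{d},\mathbf{t},\mathbf{m},K)$, where

$$\mathcal{G}(q,\mathbf{d},\mathbf{t},\mathbf{m},K)=\Big\{f=g_q\circ\cdots\circ g_0: g_i=(g_{ij})_j:[a_i,b_i]^{d_i}\to[a_{i+1},b_{i+1}]^{d_{i+1}},\ g_{ij}\in\mathcal{G}_{t_i}^{m_i}\big([a_i,b_i]^{t_i},K\big)\ \text{with}\ |a_i|,|b_i|\le K\Big\}.$$

The motivation for the assumptions on $f_0$ is that the Hölder class provides a flexible framework for quantifying smoothness, capturing both integer and fractional orders, and serves as a natural benchmark for assessing neural network approximation (Yarotsky 2017), while the compositional structure mirrors the architectural design of deep neural networks with each layer applying a transformation to the output of the previous one (Yarotsky 2017). We also note that $\mathbf{d}$ and $\mathbf{t}$ characterize the dimensions of the function, while $\mathbf{m}$ represents the measure of smoothness. For instance, if

$$f_0(\mathbf{x})=g_{21}\Big(g_{11}\big(g_{01}(x_1,x_2,x_3,x_4),g_{02}(x_5,x_6,x_7,x_8)\big),\ g_{12}\big(g_{03}(x_9,x_{10},x_{11},x_{12}),g_{04}(x_{13},x_{14},x_{15},x_{16})\big),\ g_{13}\big(g_{05}(x_{17},x_{18},x_{19},x_{20}),g_{06}(x_{21},x_{22},x_{23},x_{24})\big)\Big),\quad \mathbf{x}\in[0,1]^{24},$$

and $g_{ij}$ are twice continuously differentiable, then smoothness $\mathbf{m}=(2,2,2)$, dimensions $\mathbf{d}=(24,6,3,1)$ and $\mathbf{t}=(4,2,3)$. We further define $m_j^*=m_j\prod_{l=j+1}^{q}(m_l\wedge 1)$ and $\phi_n=\max_{j=0,\ldots,q}n^{-2m_j^*/(2m_j^*+t_j)}$.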
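For the 24-dimensional example above, the rate $\phi_n$ can be computed directly. The sketch below assumes the product in $m_j^*$ runs over $l=j+1,\ldots,q$ (the convention of Schmidt-Hieber 2020); the function names are hypothetical.

```python
def effective_smoothness(m):
    """m_j^* = m_j * prod_{l=j+1}^{q} min(m_l, 1)."""
    out = []
    for j, mj in enumerate(m):
        prod = 1.0
        for ml in m[j + 1:]:
            prod *= min(ml, 1.0)
        out.append(mj * prod)
    return out

def rate_exponents(m, t):
    """Exponents 2 m_j^* / (2 m_j^* + t_j), one per composition level j."""
    return [2 * ms / (2 * ms + tj) for ms, tj in zip(effective_smoothness(m), t)]

def phi_n(n, m, t):
    """phi_n = max_j n^{-2 m_j^* / (2 m_j^* + t_j)}."""
    return max(n ** (-e) for e in rate_exponents(m, t))

m, t = (2, 2, 2), (4, 2, 3)      # smoothness and effective dimensions of the example
exponents = rate_exponents(m, t)  # [4/8, 4/6, 4/7]
rate = phi_n(10_000, m, t)        # governed by the smallest exponent, 1/2
```

In this example every $m_j^*$ equals 2, so the rate is driven by the level with the largest effective dimension $t_0=4$, giving $\phi_n=n^{-1/2}$ regardless of the ambient dimension 24.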

2.4. Ensemble Subsampling Method (ESM)

Consider a scenario where $\mathbf{x}$ is drawn from an unknown probability measure $P_x$ supported on $\mathcal{X}$, and for an input $\mathbf{x}_{new}=\mathbf{x}_*\in\mathcal{X}$, the unobserved outcome $y_{new}$ follows (2.3). Our goal is to estimate $E(y_{new}\mid\mathbf{x}_{new}=\mathbf{x}_*)=\psi'(f_0(\mathbf{x}_*))$ and its uncertainty. For this, we introduce the Ensemble Subsampling Method (ESM), based on subsampling techniques (Wager & Athey 2018, Fei & Li 2024), to construct ensemble estimators and confidence intervals. ESM consists of the following components, which are summarized in Figure 1 as well.

Figure 1: Overview of the Ensemble Subsampling Method (ESM).

  • (Subsampling) Let $\mathcal{I}=\{1,\ldots,n\}$ denote the index set of the training dataset $\mathcal{D}_n$. We construct subsets of size $r$ $(<n)$, yielding $B^*=\binom{n}{r}$ unique combinations. Denote the $b$-th subset by $\mathcal{I}^b=\{i_1,\ldots,i_r\}$, where $i_1<\cdots<i_r$ and $b\in\{1,\ldots,B^*\}$.

  • (Estimator Construction) For each $1\le b\le B^*$, we aim to minimize the objective function in (2.4) within $\mathcal{F}(L,\mathbf{p},s,F)$ by using a gradient-based optimization algorithm such as Stochastic Gradient Descent (SGD) on the observations indexed by $\mathcal{I}^b$. Let $\hat f^b$ denote the resulting approximate minimizer satisfying
    $\Delta_r(\hat f^b,\hat f^{b,opt})\le\Delta_b^{opt}$, (2.7)
    where $\Delta_b^{opt}$ quantifies the optimization error (Fan & Gu 2024), and $\hat f^{b,opt}$ and $\Delta_r(\hat f^b,\hat f^{b,opt})$ are defined respectively in (2.4) and (2.6) on the subsample indexed by $\mathcal{I}^b$.
  • (Ensemble Estimation) Using the estimators $\{\hat f^b:b=1,\ldots,B^*\}$, we compute the ensemble estimate for any $\mathbf{x}_*\in\mathcal{X}$ as:
    $\hat f^{B^*}(\mathbf{x}_*)=\frac{1}{B^*}\sum_{b=1}^{B^*}\hat f^b(\mathbf{x}_*)$,
    and estimate $E(y_{new}\mid\mathbf{x}_{new}=\mathbf{x}_*)$ as $\psi'(\hat f^{B^*}(\mathbf{x}_*))$.
  • (Confidence Interval Construction) To quantify uncertainty, we estimate the variance, denoted $\hat\sigma_*^2$, using the infinitesimal jackknife method. The resulting confidence interval (CI) is defined as:
    $\mathrm{CI}(\mathbf{x}_*)=\big[\psi'\big(\hat f^{B^*}(\mathbf{x}_*)-c_\alpha\hat\sigma_*\big),\ \psi'\big(\hat f^{B^*}(\mathbf{x}_*)+c_\alpha\hat\sigma_*\big)\big]$, (2.8)
    where $c_\alpha$ is a constant controlling the $1-\alpha$ confidence level.

The distribution of $\hat f^b(\mathbf{x}_*)$ for a fixed $\mathbf{x}_*$ is intractable. However, since $\hat f^b(\mathbf{x}_*)$ is permutation symmetric with respect to $\mathcal{I}^b$, the ensemble estimator $\hat f^{B^*}(\mathbf{x}_*)$ forms a generalized U-statistic, which we later show to be asymptotically normal by leveraging U-statistics theory. In practice, we allow $r$ to grow with $n$, but evaluating all $B^*=\binom{n}{r}$ models becomes computationally infeasible. To overcome this, we use a stochastic approximation by randomly drawing $B$ size-$r$ subsamples from $\mathcal{D}_n$. This involves independently sampling indices $b_1,\ldots,b_B$ from 1 to $\binom{n}{r}$, where each subsample is indexed by $\mathcal{I}^{b_j}$, and computing

$$\hat f^{B}(\mathbf{x}_*)=\frac{1}{B}\sum_{j=1}^{B}\hat f^{b_j}(\mathbf{x}_*). \tag{2.9}$$
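The stochastic approximation in (2.9) can be sketched as follows. Drawing a size-$r$ subset uniformly without replacement is equivalent to drawing its index uniformly from the $\binom{n}{r}$ combinations. The learner `fit_constant_logit` is a deliberately trivial, hypothetical stand-in for the DNN training step (it ignores the covariates), used only so the example runs with the standard library.

```python
import math
import random

def draw_subsamples(n, r, B, rng):
    """Draw B independent size-r index sets, each without replacement."""
    return [sorted(rng.sample(range(n), r)) for _ in range(B)]

def fit_constant_logit(sub):
    """Hypothetical stand-in learner: smoothed log-odds of y, ignoring x."""
    p = (sum(y for _, y in sub) + 0.5) / (len(sub) + 1.0)
    logit = math.log(p / (1.0 - p))
    return lambda x: logit

def ensemble_predict(fit, data, x_star, r, B, rng):
    """Average the B subsample fits at x_star, as in (2.9)."""
    subsets = draw_subsamples(len(data), r, B, rng)
    preds = [fit([data[i] for i in idx])(x_star) for idx in subsets]
    return sum(preds) / B, preds, subsets

rng = random.Random(0)
data = [((rng.gauss(0.0, 1.0),), rng.randint(0, 1)) for _ in range(40)]
f_ens, preds, subsets = ensemble_predict(
    fit_constant_logit, data, (0.0,), r=20, B=30, rng=rng
)
```

The per-subsample predictions `preds` and index sets `subsets` are returned alongside the ensemble mean because the variance estimator needs both.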

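As (2.9) provides the estimator for $f_0$, the variance $\hat\sigma_*^2$ in (2.8) in the Confidence Interval Construction procedure is given as

$$\hat\sigma_*^2=\underbrace{\frac{n(n-1)}{(n-r)^2}\sum_{i=1}^{n}\hat V_i^2}_{\text{original }\hat\sigma^2}-\underbrace{\frac{n(n-1)}{(n-r)^2}\frac{1}{B(B-1)}\sum_{i=1}^{n}\sum_{j=1}^{B}\big(Z_{b_j i}-\hat V_i\big)^2}_{\text{bias correction}}. \tag{2.10}$$

Here, $Z_{b_j i}$ and $\hat V_i$ are defined as

$$Z_{b_j i}=\big(J_{b_j i}-J_{\cdot i}\big)\big(\hat f^{b_j}(\mathbf{x}_*)-\hat f^{B}(\mathbf{x}_*)\big),\qquad \hat V_i=\frac{\sum_{j=1}^{B}Z_{b_j i}}{B},$$

where $J_{b_j i}=I(i\in\mathcal{I}^{b_j})$ and $J_{\cdot i}=\frac{1}{B}\sum_{j=1}^{B}J_{b_j i}$. Extending Peng et al. (2024), (2.10) jointly corrects Monte Carlo bias and finite-sample effects using the factor $n(n-1)/(n-r)^2$, enabling valid inference with $B\asymp n$. Its construction relies on the dominance of the first order term in the Hoeffding decomposition, and the bias correction term adequately captures its resampling variability, regardless of the specific scaling between $r$ and $n$.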
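A minimal sketch of the variance estimator (2.10) follows, reading the garbled display as the uncorrected infinitesimal-jackknife term minus the Monte Carlo correction, both scaled by $n(n-1)/(n-r)^2$. The function name and the choice to return both versions are assumptions made for illustration.

```python
def esm_variance(preds, subsets, n, r):
    """Return (bias-corrected, uncorrected) variance estimates as in (2.10).

    preds[j]   : prediction of the j-th subsample fit at the test point
    subsets[j] : indices (0..n-1) of the observations in the j-th subsample
    """
    B = len(preds)
    f_bar = sum(preds) / B
    scale = n * (n - 1) / (n - r) ** 2
    members = [set(s) for s in subsets]
    raw, corr = 0.0, 0.0
    for i in range(n):
        J = [1.0 if i in s else 0.0 for s in members]   # inclusion indicators J_{b_j i}
        J_bar = sum(J) / B                              # J_{.i}
        Z = [(J[j] - J_bar) * (preds[j] - f_bar) for j in range(B)]
        V = sum(Z) / B                                  # V_i
        raw += V * V
        corr += sum((z - V) ** 2 for z in Z)
    uncorrected = scale * raw
    corrected = uncorrected - scale * corr / (B * (B - 1))
    return corrected, uncorrected

# Tiny hand-checkable case: n = 3, r = 1, two subsamples {0} and {1}.
var_c, var_raw = esm_variance([1.0, 3.0], [[0], [1]], n=3, r=1)
# Identical predictions carry no resampling variability.
zero_c, zero_raw = esm_variance([2.0, 2.0, 2.0], [[0, 1], [1, 2], [0, 2]], n=3, r=2)
```

Because the correction term is a sum of squares, the corrected estimate never exceeds the uncorrected one; with small $B$ it can become negative, which this sketch does not guard against.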

3. Theoretical Results

Let $z_i=(y_i,\mathbf{x}_i)$ represent an independent observation of sample points. For studying the asymptotic properties of $\hat f^{B}$ in (2.9), we introduce

$$\xi_{1,r}(\mathbf{x})=\mathrm{Cov}\big(\hat f(\mathbf{x};z_1,z_2,\ldots,z_r),\ \hat f(\mathbf{x};z_1,z_2',\ldots,z_r')\big). \tag{3.1}$$

Here $\mathbf{x}\in\mathcal{X}$ is in the support of the probability measure $P_x$; $\hat f(\mathbf{x};z_1,z_2,\ldots,z_r)$ means that we obtain $\hat f$ from the subsample $z_1,z_2,\ldots,z_r$ and then apply it to the point $\mathbf{x}$, and $\xi_{1,r}$ quantifies the covariance between estimates based on two overlapping subsets of size $r$ that share a common point $z_1$ but differ in the remaining elements. $\xi_{1,r}$ represents a second-order component in the Hoeffding decomposition of a U-statistic (Hoeffding 1992), capturing the pairwise dependence structure within the subsamples used for constructing the estimator. The terms $z_i$ and $z_i'$ are independently generated from the same data generation process. We impose some regularity assumptions for our theoretical guarantees.

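Assumption 3.1. Suppose that (2.3) holds and the true function $f_0\in\mathcal{G}(q,\mathbf{d},\mathbf{t},\mathbf{m},K)$. The support of $y$, denoted by $\mathrm{supp}(h(y))$, is fixed and does not depend on $\mathbf{x}$. Moreover, the sufficient statistic $T(y)=y$ is not degenerate and can take at least two distinct values in $\mathrm{supp}(h)$. For any $\mathbf{x}$, the conditional distribution $p(y\mid\mathbf{x})$ is sub-exponential. Specifically, with bounded $f_0$ there exists a universal constant $\kappa>0$ such that for all $t>0$,

$$P\big(|y-E(y\mid\mathbf{x})|\ge t\mid\mathbf{x}\big)\le 2\exp\Big(-\frac{t}{\kappa}\Big).$$

Assumption 3.2. The network class $\mathcal{F}(L,\mathbf{p},s,F)$ with $L$, $\mathbf{p}$ and $s$ satisfies i) $F\ge\max(K,1)$, ii) $\sum_{i=0}^{q}\log_2(4t_i\vee 4m_i)\log_2 n\le L\lesssim\log^{\alpha}n$ for some $\alpha>1$, iii) $n\phi_n\lesssim\min_{i=1,\ldots,L}p_i\le\max_{i=1,\ldots,L}p_i\lesssim n$, iv) $s\asymp n\phi_n\log n$. Here, $\phi_n=\max_{j=0,\ldots,q}n^{-2m_j^*/(2m_j^*+t_j)}$, where $m_j^*=m_j\prod_{l=j+1}^{q}(m_l\wedge 1)$. The values $t_i$, $m_i$ and $p_i$ are the $(i+1)$-th elements in the vectors $\mathbf{t}$, $\mathbf{m}$ and $\mathbf{p}$, respectively.

Assumption 3.3. For each subsample with index $b$ and size $r$, there exists a constant $C>0$ such that the optimization error $\Delta_b^{opt}$ defined in (2.7) satisfies $\Delta_b^{opt}\le C\phi_r L\log^2 r$.

Assumption 3.4. The covariance $\xi_{1,r}(\mathbf{x})$ defined in (3.1) satisfies $\inf_{\mathbf{x}\in\mathcal{X}}\xi_{1,r}(\mathbf{x})\ge Cr^{-(1+\varepsilon)}$ for some $\varepsilon>0$. The subsample size $r=n^{\gamma}$ satisfies $1/\big(1+\min_{j=0,\ldots,q}2m_j^*/(2m_j^*+t_j)\big)<\gamma<1$, and the number of subsampling times $B$ is large such that $B\gtrsim n$.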

Assumption 3.1 ensures that $p(y\mid\mathbf{x})$ is not degenerate, while the concentration of $y$ does not deviate too widely. Additionally, that $T(y)=y$ takes at least two distinct values in $\mathrm{supp}(h)$ guarantees a non-zero variance of $y$, ensuring the log-partition function $\psi(f_0(\mathbf{x}))$ exhibits local convexity. The boundedness of $f_0$ guarantees the existence of $\kappa$. This condition, weaker than the sub-Gaussian tail bound assumed in Fan et al. (2024), is satisfied by most common exponential family distributions, including Beta, Bernoulli, Poisson, and Gaussian. Assumption 3.2 is a technical assumption similar to Schmidt-Hieber (2020), which specifies the requirements on the network parameters relative to the sample size $n$ and smoothness parameters $\phi_n$, $t_i$ and $m_i$, including the number of layers $L$ and the sparsity $s$. Assumption 3.3, related to the training dynamics of DNNs, stipulates that DNNs return an estimator sufficiently close to the global minimum. We note that the rate in Assumption 3.3 pertains to the prediction error in Theorem 3.5. Our intent is not to assert an explicit convergence rate for SGD in practice, but to impose a sufficient condition that separates optimization error from the inferential analysis. This perspective follows standard practice in the statistical literature (Chernozhukov et al. 2018, Schmidt-Hieber 2020, Fan & Gu 2024), where such assumptions serve as technical devices to ensure valid inference rather than as primary objects of study. Assumption 3.4 on $\xi_{1,r}(\mathbf{x})$ is not meant to serve as a precise characterization of neural networks. Rather, it ensures that the first-order Hoeffding projection remains non-degenerate and dominates the higher-order components, as in the random forest setting (Wager & Athey 2018). We choose the subsample size $r$ much smaller than $n$ to ensure first-order dominance but large enough to limit bias, and let the number of subsamples $B$ scale with $n$ to control computation.
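To illustrate the constraint on the subsample exponent in Assumption 3.4, the sketch below computes the lower bound $1/(1+\min_j 2m_j^*/(2m_j^*+t_j))$ for the compositional example of Section 2.3, with smoothness $(2,2,2)$ and dimensions $(4,2,3)$. The function names are hypothetical, and the product in $m_j^*$ is assumed to run over $l>j$.

```python
def effective_smoothness(m):
    """m_j^* = m_j * prod_{l>j} min(m_l, 1)."""
    out = []
    for j, mj in enumerate(m):
        prod = 1.0
        for ml in m[j + 1:]:
            prod *= min(ml, 1.0)
        out.append(mj * prod)
    return out

def gamma_lower_bound(m, t):
    """Lower bound on gamma in r = n^gamma from Assumption 3.4."""
    smallest = min(2 * ms / (2 * ms + tj)
                   for ms, tj in zip(effective_smoothness(m), t))
    return 1.0 / (1.0 + smallest)

lower = gamma_lower_bound((2, 2, 2), (4, 2, 3))   # 1 / (1 + 1/2) = 2/3
admissible = [g for g in (0.6, 0.8, 0.85) if lower < g < 1.0]
r_example = 400 ** 0.8                            # subsample size for n = 400, gamma = 0.8
```

Under these assumed smoothness values, any exponent strictly between $2/3$ and 1 is admissible, so 0.8 and 0.85 qualify while 0.6 does not.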

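Theorem 3.5. Suppose that $f_0\in\mathcal{G}(q,\mathbf{d},\mathbf{t},\mathbf{m},K)$ and $\hat f_n\in\mathcal{F}(L,\mathbf{p},s,F)$. If Assumptions 3.1–3.2 hold, then it holds that

$$\frac{1}{2}\Delta_n(\hat f_n)-c\,\phi_n L\log^2 n\le R_n(\hat f_n,f_0)\le 2\Delta_n(\hat f_n)+C\phi_n L\log^2 n$$

for some constants $c,C>0$. Here, $\Delta_n(\hat f_n)$ is defined in (2.6).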

Theorem 3.5 establishes a bound on the out-of-sample estimation error. Unlike previous studies (Schmidt-Hieber 2020, Fan & Gu 2024), which primarily focus on bounding the in-sample mean squared error, i.e., $\frac{1}{n}\sum_{i=1}^{n}(y_i-f(\mathbf{x}_i))^2$, to derive bounds for the estimation error, our approach applies to a wide class of training losses (2.6) under GNRMs. The heteroskedasticity inherent in GNRMs necessitates a more refined analysis, and a truncation technique is introduced to address this challenge by separating the noise into different regimes. Specifically, this result enables the analysis of non-Gaussian outcomes with heteroskedasticity depending on covariates, overcoming key limitations of prior methods (Schmidt-Hieber 2020, Fan & Gu 2024) that rely on the independence of $y_i-E(y_i\mid\mathbf{x}_i)$ and $f_0(\mathbf{x}_i)$ to get estimation error bounds.

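Proposition 3.6. Suppose that the conditions of Theorem 3.5 hold, the neural network fitted on subsamples belongs to $\mathcal{F}(L,\mathbf{p},s,F)$, Assumptions 3.2–3.4 hold for each subsample with size $r$, and the sparsity satisfies $s\asymp r^{1+\delta}\phi_r\log r$ for some constant $\delta>0$. Then for a fixed $\mathbf{x}_*$, if the pointwise bias is dominated by the best approximation bias, i.e., there exists a constant $C_{\mathbf{x}_*}$ depending only on $\mathbf{x}_*$ such that $|E\hat f^b(\mathbf{x}_*)-f_0(\mathbf{x}_*)|\le C_{\mathbf{x}_*}\inf_{f\in\mathcal{F}(L,\mathbf{p},s,F)}|f(\mathbf{x}_*)-f_0(\mathbf{x}_*)|$, it holds that $\sqrt{\frac{n}{r^2\inf_{\mathbf{x}\in\mathcal{X}}\xi_{1,r}(\mathbf{x})}}\,\big|E\hat f^{B}(\mathbf{x}_*)-f_0(\mathbf{x}_*)\big|\to 0$.

Proposition 3.6 implies asymptotic unbiasedness, i.e., $E\hat f^{B}(\mathbf{x}_*)-f_0(\mathbf{x}_*)=o\Big(\sqrt{\frac{r^2\xi_{1,r}(\mathbf{x}_*)}{n}}\Big)$ for a fixed $\mathbf{x}_*$. Thus, the bias of $\hat f^{B}$ is asymptotically negligible, which enables us to focus on $\hat f^{B}(\mathbf{x}_*)-E\hat f^{B}(\mathbf{x}_*)$ and establish the following results.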

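Theorem 3.7. Suppose the data $\mathcal{D}_n$ are generated from Model (2.3) with an unknown $f_0\in\mathcal{G}(q,\mathbf{d},\mathbf{t},\mathbf{m},K)$, the neural networks used belong to $\mathcal{F}(L,\mathbf{p},s,F)$, and Assumptions 3.1–3.4 hold (Assumption 3.2 holds for each subsample with size $r$). Then with the results of Proposition 3.6, there exists a positive sequence, $\delta_1,\delta_2,\ldots$, with $\delta_n\to 0$ and a corresponding set $\mathcal{A}_{\delta_n}\subset\mathcal{X}$ with $P_x(\mathcal{A}_{\delta_n})\ge 1-\delta_n$, such that for any fixed $\mathbf{x}_*\in\mathcal{A}_{\delta_n}$, it holds that

$$\frac{\sqrt{n}\,\big(\hat f^{B}(\mathbf{x}_*)-f_0(\mathbf{x}_*)\big)}{\sqrt{r^2\xi_{1,r}(\mathbf{x}_*)}}\xrightarrow{d}\mathcal{N}(0,1).$$

Theorem 3.7 establishes the asymptotic normality of the ensembled estimator $\hat f^{B}$ for any fixed $\mathbf{x}_*\in\mathcal{A}_{\delta_n}$, i.e., the inference result holds with high probability for almost all $\mathbf{x}\in\mathcal{X}$. Here, Theorem 3.7 holds by the fact that the first-order projection of $\hat f^{B}(\mathbf{x}_*)-E\hat f^{B}(\mathbf{x}_*)$ dominates the other terms. Moreover, as the analytic form of $\xi_{1,r}(\mathbf{x}_*)$ is unavailable, the following theorem provides a consistent estimator of it.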

Theorem 3.8. Under the same setting as in Theorem 3.7 and with $\hat\sigma_*^2$ defined in (2.10), it holds that for any fixed $\mathbf{x}_*\in\mathcal{X}$,

$$\frac{r^2\xi_{1,r}(\mathbf{x}_*)}{n\hat\sigma_*^2}\xrightarrow{p}1.$$

Compared to Wager & Athey (2018) and Fei & Li (2024), since we allow the number of subsamples $B$ to grow at the same rate as the sample size $n$, the Monte Carlo bias in the variance estimation becomes non-negligible. As pointed out by Wager et al. (2014), this bias is typically of order $n/B$, and does not vanish when $B\asymp n$. Different from the correction proposed in Wager et al. (2014), which assumes near asymptotic independence between the inclusion indicators $J_{bi}$ and the estimates $\hat f^b$, we propose a bias-correction term that directly accounts for the dependency between subsample structure and estimates. Specifically, we utilize the covariance form of the variance estimator and subtract a U-statistic-based empirical correction that estimates the within-subject variability due to repeated subsampling. This yields a debiased estimator that is asymptotically unbiased even when $B\asymp n$. For the confidence interval in (2.8), Theorem 3.8 and the Slutsky theorem ensure that, as $n\to\infty$, it holds that

$$P\Big(E(y_{new}\mid\mathbf{x}_*)\in\big[\hat L(\mathbf{x}_*),\hat U(\mathbf{x}_*)\big]\Big)\to 1-\alpha,$$

where $\hat L(\mathbf{x}_*)=\psi'\big(\hat f^{B}(\mathbf{x}_*)-z_{1-\alpha/2}\hat\sigma_*\big)$, $\hat U(\mathbf{x}_*)=\psi'\big(\hat f^{B}(\mathbf{x}_*)+z_{1-\alpha/2}\hat\sigma_*\big)$, and $z_{1-\alpha/2}$ is the $(1-\alpha/2)$-th quantile of the standard normal distribution. In practice, we recommend using this back-transformed interval, as it preserves the parameter’s natural range.
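A short sketch of the back-transformed interval follows, assuming the endpoints are formed on the scale of $f$ and then mapped through the mean function $\psi'$, which is increasing because $\psi$ is convex. The helper names are hypothetical; the normal quantile uses `statistics.NormalDist` from the Python standard library.

```python
import math
from statistics import NormalDist

def sigmoid(t):
    """Mean function psi'(t) for the logistic model."""
    return 1.0 / (1.0 + math.exp(-t))

def back_transformed_ci(f_hat, sigma_hat, mean_fn, alpha=0.05):
    """Map [f_hat -/+ z * sigma_hat] through the increasing mean function."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # z_{1 - alpha/2}
    return mean_fn(f_hat - z * sigma_hat), mean_fn(f_hat + z * sigma_hat)

# Logistic model: the interval for a probability stays inside (0, 1).
lo_p, hi_p = back_transformed_ci(0.0, 1.0, sigmoid)
# Poisson model: psi'(t) = exp(t), so the interval for a rate stays positive.
lo_r, hi_r = back_transformed_ci(0.0, 1.0, math.exp)
```

Both examples use illustrative values for the point estimate and standard error; the design point is that transforming the endpoints, rather than adding a symmetric margin on the mean scale, cannot leave the parameter's range.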

Remark.

Our inference framework relies on a well-trained network; when training is insufficient or overly regularized, inference may become unreliable. For analytical tractability, we adopt an idealized $s$-sparse, $F$-bounded network class (Schmidt-Hieber 2020), while in practice, regularization techniques such as weight decay and dropout promote approximate sparsity, keeping the trained model close to this class. The optimization gap $\Delta_n(\hat f_n)$ in Theorem 3.5 quantifies the discrepancy between empirical and population risks, filters out degenerate training cases, and accounts for residual optimization and approximation errors. Our theory demonstrates that as long as $\Delta_n(\hat f_n)$ remains small, inference validity is preserved, consistent with Chernozhukov et al. (2018), Fan & Gu (2024).

4. Numerical Experiments

We conduct simulations under logistic and Poisson generalized nonparametric regression models to evaluate the performance of the ensemble subsampling method (ESM) in point estimation, variance estimation $\hat\sigma_*$, coverage, and interval length. For comparison, we include several alternative methods capable of producing confidence intervals, including HulC (Kuchibhotla et al. 2024), Bayesian Neural Networks (Sun et al. 2021), Ensemble Methods (Lakshminarayanan et al. 2017), and the naive bootstrap. Additional experiments are presented in the Supplementary Material, namely, Section A.1 (binomial regression), Section A.2 (comparison with random forests (Breiman 2001)), Section A.3 (kernel regression), Section A.4 (varying covariate dimensions and nonlinearities), and Sections A.5–A.9 (sensitivity analyses for optimization parameters, subsample size $r$, number of subsamples $B$, network depth $L$, and dropout rate).

We present our basic settings. Define gx=x1+0.25x22+0.1arctan0.5x3-0.3. Here, xj is the j-th element in the vector x. We generate xi from 𝒩0,Ip with p=10, i.e., the first 3 covariates are signals and the rest are noise variables. For each i=1,,n and given xi, we independently simulate yi under the following models.

(logistic model) $P(y_i = 1 \mid x_i) = \dfrac{1}{1 + \exp(-g(x_i))}$, with $f_0(x) = g(x)$ and $\psi'(x) = (1 + \exp(-x))^{-1}$.

(Poisson model) With $k \in \{0, 1, 2, \dots\}$, $P(y_i = k \mid x_i) = \dfrac{e^{-\lambda(x_i)}\lambda(x_i)^k}{k!}$, where we set $\lambda(x_i) = \log(1 + e^{g(x_i)})$ to ensure that $\lambda(x_i) > 0$. In this model, we have $f_0(x_i) = \log\lambda(x_i)$ and $\psi'(x) = \exp(x)$.

For evaluation, we generate a total of 80 independent test points $x_{\mathrm{test},i} \sim \mathcal{N}(0, I_{10})$, $i = 1, \dots, 80$, which remain fixed throughout the following experiment. We then apply (2.9) and (2.10) to obtain the point estimate $\hat f^B(x_{\mathrm{test},i})$ and its associated variance estimate $\hat\sigma_*(x_{\mathrm{test},i})$. To assess the accuracy of the variance estimates and the empirical coverage of the confidence intervals (2.8), we repeat the entire procedure 300 times, each time independently regenerating the training data. This yields replicated estimates $\hat f^B_s(x_{\mathrm{test},i})$ and $\hat\sigma_{*,s}(x_{\mathrm{test},i})$ for $s = 1, \dots, 300$. Final evaluation is based on the average performance across the fixed test set, using the metrics summarized in Table 1.
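The two data-generating models above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the function names (`g`, `sample_poisson`, `simulate`) and the seeding scheme are hypothetical, the arctan term is read as $\arctan(0.5x_3 - 0.3)$, and the Poisson sampler is a textbook multiplication method rather than anything prescribed by the paper.

```python
import math
import random


def g(x):
    """Signal function; only the first three covariates enter.

    Assumption: the arctan term is read as arctan(0.5 * x3 - 0.3).
    """
    return x[0] + 0.25 * x[1] ** 2 + 0.1 * math.atan(0.5 * x[2] - 0.3)


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate by multiplying uniforms (fine for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate(n, model, p=10, seed=0):
    """Generate n pairs (x_i, y_i) with x_i ~ N(0, I_p) under the chosen model."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        if model == "logistic":
            prob = 1.0 / (1.0 + math.exp(-g(x)))      # P(y = 1 | x)
            y = 1 if rng.random() < prob else 0
        elif model == "poisson":
            lam = math.log1p(math.exp(g(x)))          # lambda(x) = log(1 + e^{g(x)}) > 0
            y = sample_poisson(lam, rng)
        else:
            raise ValueError("model must be 'logistic' or 'poisson'")
        xs.append(x)
        ys.append(y)
    return xs, ys


# One training set of size n = 400 per model; a new seed gives a new replication.
x_train, y_train = simulate(400, "logistic", seed=1)
x_pois, y_pois = simulate(400, "poisson", seed=1)
```

Regenerating the training data with a different seed for each of the 300 repetitions, while drawing the 80 test points once, would mirror the replication scheme described above.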

Table 1:

Simulation results for different regressions under varying sample sizes $n$ and subsample sizes $r$, using the following metrics: $\mathrm{Bias}_f$ and $\mathrm{MAE}_f$ are, respectively, the average bias and the mean absolute error of $\hat f^B_s$ relative to $f_0$; $\mathrm{Bias}_\psi$ and $\mathrm{MAE}_\psi$ are the corresponding quantities on the transformed scale, comparing $\psi'(\hat f^B_s)$ with $\psi'(f_0)$; EmpSD denotes the empirical standard deviation of $\hat f^B_s(x_{\mathrm{test},i})$ computed across the 300 repetitions and then averaged over the 80 test points; SE and $\mathrm{SE}_c$ denote the average standard errors without and with bias correction, respectively, averaged over the 80 test points; CP is the average, across test points, of the coverage probability that the 95% confidence interval (2.9) contains the truth over 300 repetitions, and AIL is the mean length of these intervals. Numbers in brackets indicate the standard deviation of each metric across test points.

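| Model | $n$ | $r$ | $\mathrm{Bias}_f$ | $\mathrm{MAE}_f$ | $\mathrm{Bias}_\psi$ | $\mathrm{MAE}_\psi$ | EmpSD | SE | $\mathrm{SE}_c$ | CP | AIL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Logistic | 400 | $n^{0.8}$ | 0.11(0.72) | 0.57(0.44) | 0.02(0.14) | 0.11(0.08) | 0.65 | 0.78 | 0.66 | 91.7% | 0.46(0.13) |
| Logistic | 400 | $n^{0.85}$ | 0.06(0.59) | 0.47(0.36) | 0.01(0.12) | 0.10(0.07) | 0.55 | 0.68 | 0.58 | 94.0% | 0.43(0.11) |
| Logistic | 400 | $n^{0.9}$ | 0.07(0.50) | 0.40(0.31) | 0.01(0.10) | 0.08(0.06) | 0.48 | 0.59 | 0.50 | 94.6% | 0.39(0.09) |
| Logistic | 700 | $n^{0.8}$ | 0.08(0.41) | 0.33(0.26) | 0.02(0.09) | 0.07(0.05) | 0.39 | 0.51 | 0.39 | 93.6% | 0.31(0.08) |
| Logistic | 700 | $n^{0.85}$ | 0.05(0.36) | 0.28(0.22) | 0.01(0.08) | 0.06(0.05) | 0.34 | 0.44 | 0.34 | 93.9% | 0.28(0.06) |
| Logistic | 700 | $n^{0.9}$ | 0.06(0.33) | 0.26(0.21) | 0.01(0.07) | 0.06(0.05) | 0.31 | 0.40 | 0.31 | 92.9% | 0.26(0.06) |
| Poisson | 400 | $n^{0.8}$ | −0.32(0.38) | 0.38(0.31) | −0.16(0.23) | 0.22(0.18) | 0.35 | 0.43 | 0.36 | 88.6% | 0.86(0.45) |
| Poisson | 400 | $n^{0.85}$ | −0.23(0.36) | 0.33(0.27) | −0.12(0.25) | 0.21(0.18) | 0.34 | 0.43 | 0.36 | 91.9% | 0.95(0.52) |
| Poisson | 400 | $n^{0.9}$ | −0.14(0.34) | 0.29(0.24) | −0.06(0.25) | 0.19(0.18) | 0.33 | 0.42 | 0.35 | 94.3% | 1.02(0.59) |
| Poisson | 700 | $n^{0.8}$ | −0.17(0.26) | 0.25(0.20) | −0.10(0.19) | 0.16(0.14) | 0.25 | 0.34 | 0.26 | 90.7% | 0.67(0.34) |
| Poisson | 700 | $n^{0.85}$ | −0.10(0.26) | 0.21(0.17) | −0.06(0.19) | 0.15(0.14) | 0.24 | 0.33 | 0.25 | 93.2% | 0.71(0.38) |
| Poisson | 700 | $n^{0.9}$ | −0.04(0.24) | 0.19(0.15) | −0.02(0.19) | 0.14(0.13) | 0.23 | 0.32 | 0.24 | 94.5% | 0.72(0.38) |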
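To make the metric definitions in the caption of Table 1 concrete, the following sketch computes bias, MAE, EmpSD, average SE, CP, and AIL from replicated estimates. It is a minimal illustration, not the authors' evaluation code: the function name `summarize` and the nested-list layout are hypothetical, the intervals are assumed to be symmetric normal intervals of the form estimate ± z × SE, and everything is computed on whatever scale the inputs are supplied.

```python
from statistics import NormalDist, mean, stdev


def summarize(f_hat, se, f0, level=0.95):
    """Summarize replicated estimates at fixed test points.

    f_hat[s][i], se[s][i]: estimate and standard error at test point i in
    repetition s; f0[i]: true value at test point i.
    Assumes symmetric normal intervals f_hat +/- z * se.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    n_rep, n_test = len(f_hat), len(f0)
    bias, mae, emp_sd, avg_se, cp, ail = [], [], [], [], [], []
    for i in range(n_test):
        est = [f_hat[s][i] for s in range(n_rep)]
        ses = [se[s][i] for s in range(n_rep)]
        bias.append(mean(est) - f0[i])                      # average bias
        mae.append(mean(abs(e - f0[i]) for e in est))       # mean absolute error
        emp_sd.append(stdev(est))                           # SD across repetitions
        avg_se.append(mean(ses))                            # average estimated SE
        covered = [abs(e - f0[i]) <= z * s_ for e, s_ in zip(est, ses)]
        cp.append(mean(1.0 if c else 0.0 for c in covered)) # coverage at point i
        ail.append(mean(2 * z * s_ for s_ in ses))          # interval length
    # Average each per-point metric over the test set.
    return {
        "Bias": mean(bias), "MAE": mean(mae), "EmpSD": mean(emp_sd),
        "SE": mean(avg_se), "CP": mean(cp), "AIL": mean(ail),
    }


# Tiny illustrative input: 3 repetitions, 2 test points.
demo = summarize(
    f_hat=[[0.1, 1.0], [-0.1, 1.2], [0.0, 0.8]],
    se=[[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],
    f0=[0.0, 1.0],
)
```

Comparing the `EmpSD` entry with the `SE` entry is the check used in the text to judge whether a variance estimator is well calibrated.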

In the experiment, we randomly select sub-datasets of size $r$ to train neural networks, varying $r$ from $n^{0.8}$ to $n^{0.9}$, with $n = 400$ and $700$. These choices of $r$ are based on our experiments, as they provide enough data for neural network convergence while remaining asymptotically smaller than the total sample size $n$. To ensure computational feasibility, we set $B = 1400$ and use a three-layer DNN with architecture $\mathbf{p} = (p, 128, 64, 1)$, a learning rate of 0.1, and ReLU activation. We apply early stopping (500 epochs), a 10% dropout rate, $\ell_2$ weight decay with rate 0.02, and output truncation ($F = 3$) to effectively control the sparsity $s$ and boundedness $F$ of the function class defined in (2.2), aligning the implemented network with the theoretical assumptions.
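For orientation, the subsample sizes implied by $r = n^{\gamma}$ can be tabulated directly. The snippet below is a small sketch; the helper name `subsample_size` is hypothetical, and rounding down to an integer is an assumption, since the text does not state how non-integer values of $n^{\gamma}$ are handled.

```python
import math


def subsample_size(n, gamma):
    """Integer subsample size for r = n**gamma (rounding down is an assumption)."""
    return math.floor(n ** gamma)


# Subsample sizes for the simulation settings n in {400, 700}, gamma in {0.8, 0.85, 0.9}.
sizes = {
    (n, gamma): subsample_size(n, gamma)
    for n in (400, 700)
    for gamma in (0.8, 0.85, 0.9)
}
for (n, gamma), r in sorted(sizes.items()):
    print(f"n={n}, gamma={gamma}: r={r}, r/n={r / n:.2f}")
```

The printed ratio $r/n$ shows how the subsample shrinks relative to the full sample as $n$ grows, which is the sense in which $r$ remains asymptotically smaller than $n$.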

Table 1 shows that increasing the sample size $n$ reduces bias and lowers the mean absolute error (MAE), both on the scale of $f$ and on the transformed scale. Figure 2a displays the average estimated $E(y \mid x)$ and its variation across 300 repetitions for various test points, in the Bernoulli case with $n = 700$ and $r = n^{0.9}$. The estimated means closely align with the true values. To further assess the quality of uncertainty quantification, we also examine the empirical standard deviation across repetitions. The corrected standard errors $\mathrm{SE}_c$ align much more closely with the empirical standard deviations (EmpSD) than the uncorrected standard errors (SE), confirming the effectiveness of the correction term introduced in (2.10). This is further illustrated in Figure 2b, where the variance estimator with bias correction closely matches the empirical standard deviation, while the uncorrected version overestimates the uncertainty. As a result, our method achieves coverage probabilities (CP) close to the nominal level.

Figure 2:

Estimation and inference in simulation samples: Logistic Model with $n = 700$, $r = n^{0.9}$ and $B = 1400$. Figure 2a shows the average estimated $E(y \mid x)$ with variability across 300 repetitions (gray band) over test points. Figure 2b compares corrected and uncorrected standard errors and their variability (gray and blue bands, respectively) to the empirical standard deviations of estimates across all test samples.

We present the comparison results in Table 2. For ESM, we report our results with $\gamma = 0.9$, i.e., $r = n^{0.9}$. For the HulC method proposed by Kuchibhotla et al. (2024), although the coverage probability is close to the nominal level of 95%, the average interval length is very wide. For Bayesian neural networks (Sun et al. 2021), we observe that the performance is highly sensitive to the tuning of hyperparameters. Due to the difficulty of selecting appropriate priors and optimization settings, the coverage probabilities and average interval lengths in our experiments are worse than those of the proposed ESM, with noticeably wider intervals and less stable coverage. For the ensemble methods (Lakshminarayanan et al. 2017) and the naive bootstrap approach, the intervals are wider than those of ESM and, as the sample size $n$ increases, their coverage probabilities tend to exceed the nominal 95%. Overall, ESM achieves competitive coverage with the narrowest intervals, although it may not always provide coverage closest to the nominal level among the compared methods. Thus, in some cases, using estimators without bias correction or using bootstrap procedures may also have merit, as the presence of bias-correction terms in finite samples can lead to slight undercoverage.

Table 2:

Results of comparison methods.

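| Setting | Method | $\mathrm{Bias}_f$ | $\mathrm{MAE}_f$ | $\mathrm{Bias}_\psi$ | $\mathrm{MAE}_\psi$ | EmpSD | SE | $\mathrm{SE}_c$ | CP | AIL |
|---|---|---|---|---|---|---|---|---|---|---|
| Logistic, $n=400$ | HulC | - | - | - | - | - | - | - | 94.8% | 0.71(0.21) |
| Logistic, $n=400$ | BNN | 0.06(0.61) | 0.50(0.35) | 0.02(0.14) | 0.12(0.08) | 0.27 | - | - | 97.8% | 0.68(0.13) |
| Logistic, $n=400$ | Ensemble | 0.06(0.84) | 0.67(0.52) | 0.01(0.16) | 0.13(0.09) | 0.80 | 0.90 | - | 95.5% | 0.59(0.15) |
| Logistic, $n=400$ | Naive bootstrap | 0.05(0.81) | 0.64(0.50) | 0.01(0.16) | 0.13(0.09) | 0.77 | 0.85 | - | 95.4% | 0.57(0.15) |
| Logistic, $n=400$ | ESM | 0.07(0.50) | 0.40(0.31) | 0.01(0.10) | 0.08(0.06) | 0.48 | 0.59 | 0.50 | 94.6% | 0.39(0.09) |
| Logistic, $n=700$ | HulC | - | - | - | - | - | - | - | 94.9% | 0.67(0.21) |
| Logistic, $n=700$ | BNN | 0.01(0.64) | 0.55(0.34) | 0.00(0.15) | 0.13(0.07) | 0.11 | - | - | 65.7% | 0.31(0.10) |
| Logistic, $n=700$ | Ensemble | 0.04(0.48) | 0.38(0.30) | 0.01(0.10) | 0.08(0.06) | 0.47 | 0.53 | - | 96.8% | 0.41(0.10) |
| Logistic, $n=700$ | Naive bootstrap | 0.03(0.48) | 0.37(0.29) | 0.01(0.10) | 0.08(0.06) | 0.46 | 0.52 | - | 96.7% | 0.40(0.09) |
| Logistic, $n=700$ | ESM | 0.06(0.33) | 0.26(0.21) | 0.01(0.07) | 0.06(0.05) | 0.31 | 0.40 | 0.31 | 92.9% | 0.26(0.06) |
| Poisson, $n=400$ | HulC | - | - | - | - | - | - | - | 90.0% | 1.40(0.89) |
| Poisson, $n=400$ | BNN | 0.15(0.49) | 0.41(0.30) | 0.03(0.36) | 0.30(0.21) | 0.20 | - | - | 86.3% | 1.55(1.18) |
| Poisson, $n=400$ | Ensemble | −0.22(0.44) | 0.37(0.31) | −0.09(0.30) | 0.24(0.21) | 0.42 | 0.50 | - | 96.0% | 1.33(0.71) |
| Poisson, $n=400$ | Naive bootstrap | −0.21(0.44) | 0.37(0.31) | −0.08(0.30) | 0.24(0.21) | 0.42 | 0.48 | - | 95.3% | 1.28(0.69) |
| Poisson, $n=400$ | ESM | −0.14(0.34) | 0.29(0.24) | −0.06(0.25) | 0.19(0.18) | 0.33 | 0.42 | 0.35 | 94.3% | 1.02(0.59) |
| Poisson, $n=700$ | HulC | - | - | - | - | - | - | - | 92.8% | 1.37(0.84) |
| Poisson, $n=700$ | BNN | 0.26(0.43) | 0.40(0.31) | 0.12(0.31) | 0.28(0.18) | 0.09 | - | - | 73.2% | 0.92(0.34) |
| Poisson, $n=700$ | Ensemble | −0.10(0.30) | 0.25(0.20) | −0.04(0.23) | 0.17(0.16) | 0.29 | 0.37 | - | 97.9% | 1.06(0.56) |
| Poisson, $n=700$ | Naive bootstrap | −0.10(0.31) | 0.25(0.20) | −0.04(0.24) | 0.18(0.16) | 0.30 | 0.35 | - | 97.1% | 1.03(0.54) |
| Poisson, $n=700$ | ESM | −0.04(0.24) | 0.19(0.15) | −0.02(0.19) | 0.14(0.13) | 0.23 | 0.32 | 0.24 | 94.5% | 0.72(0.38) |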

We also investigate the performance of narrower yet deeper architectures under our method. We design a deeper network with $L = 5$ and $\mathbf{p} = (p, 10, 15, 20, 30, 1)$. As deep networks may cause gradient explosion or vanishing during backpropagation (Glorot & Bengio 2010), we incorporate batch normalization layers after the second and second-to-last layers, aiming to mitigate gradient instability while maintaining the network’s ability to learn complex representations. Table 3 shows that key metrics, including EmpSD, SE, and $\mathrm{SE}_c$, remain largely consistent with those observed for shallower network architectures. Nevertheless, coverage probabilities drop to around 91%. Figure 3 presents representative results for $n = 400$ and $r = n^{0.8}$. As shown in Figure 3a, the fitted values exhibit a slight systematic deviation from the diagonal, indicating bias relative to the true function $f_0$. Nevertheless, Figure 3b confirms that the corrected variance estimates remain accurate. This bias shifts the confidence intervals, leading to reduced coverage despite precise variance estimation. Similar bias-related undercoverage has been reported for random forests in Wager & Athey (2018). In our case, greater network depth may induce complex interactions that, without sufficient data, amplify estimation bias.
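One way to see how the two architectures differ in capacity is to count their trainable weights and biases. The sketch below does this for fully connected layers only; the helper `count_parameters` is hypothetical, and batch-normalization and dropout layers are deliberately ignored, so the figures are indicative rather than exact counts for the trained models.

```python
def count_parameters(widths):
    """Weights plus biases of a fully connected network with the given layer widths.

    Assumption: dense layers only; batch-normalization parameters are not counted.
    """
    return sum(a * b + b for a, b in zip(widths, widths[1:]))


p = 10                                              # covariate dimension in the simulations
wide = count_parameters((p, 128, 64, 1))            # wider, shallower network
deep = count_parameters((p, 10, 15, 20, 30, 1))     # narrower, deeper network
print(f"wide: {wide} parameters, deep: {deep} parameters")
```

Under this count the deeper network has far fewer parameters than the wider one, so the two designs trade width for depth rather than simply adding capacity.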

Table 3:

Simulation results for different models under varying sample sizes n and subsampling ratios r with different DNNs.

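| Model | $n$ | $r$ | $\mathrm{Bias}_f$ | $\mathrm{MAE}_f$ | $\mathrm{Bias}_\psi$ | $\mathrm{MAE}_\psi$ | EmpSD | SE | $\mathrm{SE}_c$ | CP | AIL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Logistic | 400 | $n^{0.8}$ | 0.20(0.51) | 0.44(0.33) | 0.04(0.11) | 0.10(0.07) | 0.49 | 0.80 | 0.51 | 90.2% | 0.40(0.11) |
| Logistic | 400 | $n^{0.85}$ | 0.20(0.52) | 0.44(0.34) | 0.04(0.11) | 0.09(0.07) | 0.50 | 0.86 | 0.52 | 90.2% | 0.41(0.11) |
| Logistic | 400 | $n^{0.9}$ | 0.19(0.53) | 0.45(0.34) | 0.04(0.11) | 0.10(0.07) | 0.51 | 0.95 | 0.53 | 91.0% | 0.41(0.11) |
| Logistic | 700 | $n^{0.8}$ | 0.16(0.39) | 0.34(0.26) | 0.04(0.09) | 0.08(0.06) | 0.37 | 0.84 | 0.38 | 90.5% | 0.31(0.08) |
| Logistic | 700 | $n^{0.85}$ | 0.20(0.39) | 0.35(0.27) | 0.04(0.09) | 0.08(0.06) | 0.37 | 0.78 | 0.38 | 91.1% | 0.31(0.08) |
| Logistic | 700 | $n^{0.9}$ | 0.16(0.39) | 0.34(0.26) | 0.04(0.09) | 0.07(0.06) | 0.37 | 0.84 | 0.38 | 91.4% | 0.31(0.08) |
| Poisson | 400 | $n^{0.8}$ | −0.35(0.38) | 0.38(0.28) | −0.19(0.21) | 0.23(0.18) | 0.30 | 0.50 | 0.31 | 86.1% | 0.70(0.35) |
| Poisson | 400 | $n^{0.85}$ | −0.26(0.33) | 0.33(0.26) | −0.14(0.21) | 0.20(0.16) | 0.31 | 0.55 | 0.32 | 90.0% | 0.78(0.39) |
| Poisson | 400 | $n^{0.9}$ | −0.17(0.32) | 0.28(0.23) | −0.08(0.23) | 0.18(0.16) | 0.31 | 0.56 | 0.32 | 91.8% | 0.86(0.45) |
| Poisson | 700 | $n^{0.8}$ | −0.22(0.24) | 0.26(0.20) | −0.13(0.17) | 0.16(0.14) | 0.23 | 0.47 | 0.23 | 88.0% | 0.57(0.27) |
| Poisson | 700 | $n^{0.85}$ | −0.13(0.25) | 0.21(0.18) | −0.07(0.18) | 0.14(0.13) | 0.23 | 0.51 | 0.23 | 92.3% | 0.63(0.29) |
| Poisson | 700 | $n^{0.9}$ | −0.05(0.25) | 0.20(0.16) | −0.02(0.19) | 0.14(0.13) | 0.23 | 0.56 | 0.24 | 93.4% | 0.68(0.34) |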

Figure 3:

Estimation and inference in simulation samples: Logistic Model with $n = 400$ and $r = n^{0.8}$ under a deeper network. Figure 3a displays the average estimated $E(y \mid x)$ and its deviation (gray band) across 300 runs at various test points. Figure 3b compares corrected and uncorrected SEs and their deviations (gray and blue bands) across 300 runs with empirical SDs across test samples.

5. Analysis of eICU data

Preventing ICU readmissions is vital as readmitted patients face higher mortality risks and increased expenses. Identifying high-risk individuals enables more personalized care and optimized resource use. The eICU Collaborative Research Database (eICU-CRD) (Pollard et al. 2018) is a large-scale, multi-center collection of anonymized ICU patient data, created through collaboration between the MIT Laboratory for Computational Physiology and healthcare organizations. It includes detailed information on patients’ physiological measurements, clinical data, lab results, medication usage, and outcomes during their ICU stays. Our analysis focuses on Hospital #188, a major hospital with 2,632 patients admitted to ICU during 2014–2015. We include ten covariates in our analysis, selected for their clinical relevance and frequent use in prior predictive modeling studies. The features include six lab tests (Hct, chloride, WBC, Hgb, RBC, glucose), patient age, Emergency Department admission status, and two APACHE-based severity indicators: anticipated hospital stay and mortality likelihood. Among the 2,632 patients admitted to the ICU, 1,476 were not readmitted, 1,046 were readmitted once, 77 were readmitted twice, 26 were readmitted three times, and 7 were readmitted five or more times. We apply nonparametric logistic and Poisson models within our ESM framework to estimate, respectively, the probability of ICU readmission and total number of ICU readmissions using the laboratory and demographic information. We validate our results using 5-fold cross-validation. The dataset is split into five equal-sized folds; the model is trained on four folds (n=2105), and the ESM method is applied to estimate the subject-specific means and confidence intervals for each subject in the held-out fold, using only covariates while withholding true outcomes. Repeating this process across all folds yields point estimates and inference intervals for all observations.
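The 5-fold scheme described above can be sketched as follows. This is an illustrative splitter written with the standard library only; the function name `k_fold_indices`, the shuffling seed, and the way the two leftover observations are assigned to folds are assumptions, not details taken from the analysis.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train, held_out) index lists for k roughly equal-sized folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)            # random assignment of subjects to folds
    folds = [idx[j::k] for j in range(k)]       # fold sizes differ by at most one
    for j in range(k):
        held_out = folds[j]
        train = [i for m, fold in enumerate(folds) if m != j for i in fold]
        yield train, held_out


# With 2,632 patients, each training set has 2,105 or 2,106 subjects.
splits = list(k_fold_indices(2632, k=5, seed=0))
for train, held_out in splits:
    print(len(train), len(held_out))
```

In each round, the model would be fitted on `train` and used to produce point estimates and intervals for the subjects in `held_out`, so that every patient receives exactly one out-of-fold prediction.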

To estimate the probability of ICU readmission, we define a binary outcome variable: $y = 1$ if any readmission occurs during follow-up, i.e., 2014–2015, and $y = 0$ otherwise. When implementing our method, we set the subsampling size to $r = n^{0.9} = 979$ and the number of subsampling iterations to $B = 3000$. We use a (10, 128, 64, 1) architecture with ReLU activation, selected for its balance of accuracy and efficiency via simulations. Training is performed over 300 epochs with a tuned learning rate of $\eta = 0.1$.

As shown in Figure 4, random forests (RFs) and neural networks achieve similar predictive performance (AUC = 0.84), but their uncertainty estimates differ substantially. Neural networks yield shorter average confidence intervals (0.123 vs. 0.225 for RFs). Figures 4b and 4d highlight these differences by displaying predicted readmission probabilities with 95% intervals across patients sorted by risk. The variability in interval widths suggests heteroskedasticity, likely driven by differences in data quality or patient characteristics. Notably, Figure 4b reveals a subset of patients (indices between 800 and 1600) with unusually high uncertainty under DNNs; further details are provided in Section B.1.

Figure 4:

Evaluation of the nonparametric logistic model estimator. Figures 4a and 4c show the ROC curves, both with an AUC of 0.84. Figures 4b and 4d display estimated subject-level probabilities of ICU readmission and confidence intervals, illustrating heteroskedasticity across individuals.

To estimate subject-level ICU readmissions during 2014–2015, we fit a nonparametric Poisson regression using a DNN to model $E(y \mid x^*)$, offering a complementary view of patient well-being. Figure 5 shows a lift curve around 0.72, indicating comparable predictive accuracy for DNNs and RFs. However, uncertainty estimates differ: DNNs yield average interval lengths of 0.124, compared to 0.274 for RFs. Figures 5b and 5d illustrate individual predictions with 95% intervals, again showing heteroskedasticity. Additional analyses (Section B.2) reveal that in some settings, RFs produce narrower intervals than DNNs, highlighting model sensitivity to data structure and signal-to-noise characteristics. We also assess the transferability of ESM by applying models trained on Hospital #188 directly to data from Hospital #458; see Section B.3 in the Supplementary Material. The results indicate that the model remains transferable in this setting.

Figure 5:

Evaluation of the nonparametric logistic model estimator. Figures 5a and 5c show the ROC curves, both with an AUC of 0.72. Figures 5b and 5d display estimated subject-level probabilities of ICU readmission and confidence intervals, illustrating heteroskedasticity across individuals.

6. Discussion

This paper presents a general framework for inference on subject-level means estimated by deep neural networks for categorical and exponential family outcomes. We construct a DNN estimator by minimizing the loss function induced from the generalized nonparametric regression model (GNRM). To address a key gap in leveraging this general loss function, we analyze the convergence rates of DNNs under the GNRM framework and establish connections with U-statistics and Hoeffding decomposition theory. Building on these results, we introduce an ensemble subsampling method (ESM) to enable valid inference. Numerical studies and an application to the eICU dataset demonstrate ESM’s utility in quantifying predictive uncertainty and supporting personalized care.

Our methodology may have broad applicability as it enables valid inference for any continuously differentiable functional of $f_0(x)$, such as $\mathrm{Var}(y \mid x) = \psi''(f_0(x)) = g(f_0(x))$, where $g(u) = \psi''(u)$, the second derivative of $\psi(u)$. We can estimate $\mathrm{Var}(y \mid x)$ via the plug-in estimator $g(\hat f^B(x))$ and apply the delta method to give

$$\sqrt{\frac{n}{r^2\,\xi_{1,r}(x)}}\,\bigl(g(\hat f^B(x)) - g(f_0(x))\bigr) = \sqrt{\frac{n}{r^2\,\xi_{1,r}(x)}}\; g'(f_0(x))\,\bigl(\hat f^B(x) - f_0(x)\bigr) + o_p(1),$$

which will lead to estimated confidence intervals. Our method may also naturally extend to semiparametric causal inference settings where the target parameter can be expressed as a functional of the conditional means. Specifically, to estimate the Conditional Average Treatment Effect (CATE), $\tau(x) = E[y(1) \mid x] - E[y(0) \mid x]$, we apply ESM separately to treated and control groups under standard unconfoundedness. Multiple subsamples are drawn for each group, predictive models are trained to estimate $E[y(g) \mid x]$, and ensemble means $\hat\mu_1(x)$ and $\hat\mu_0(x)$ are obtained. The CATE estimator is $\hat\tau(x) = \hat\mu_1(x) - \hat\mu_0(x)$. ESM’s variance framework provides inference by obtaining $\hat\sigma_g^2(x) = \widehat{\mathrm{Var}}(\hat\mu_g(x))$ for $g = 0, 1$, yielding pointwise confidence intervals $\hat\tau(x) \pm z_{1-\alpha/2}\sqrt{\hat\sigma_1^2(x) + \hat\sigma_0^2(x)}$. Detailed justifications warrant investigation.
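The two interval constructions discussed above can be illustrated numerically. The sketch below is hedged in several ways: the function names are hypothetical, the first function specializes the delta method to the logistic model (where $\psi'$ is the sigmoid, so $g(u) = \psi''(u) = \psi'(u)(1 - \psi'(u))$), and both functions take an already-computed standard error as input rather than implementing the ESM variance estimator itself.

```python
import math
from statistics import NormalDist


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def variance_interval_logistic(f_hat, se_f, alpha=0.05):
    """Delta-method interval for Var(y | x) = g(f0(x)) in the logistic model.

    f_hat: point estimate of f0(x); se_f: its standard error (assumed given).
    Uses g(u) = s(u) * (1 - s(u)) and g'(u) = g(u) * (1 - 2 * s(u)).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    s = sigmoid(f_hat)
    g = s * (1.0 - s)
    g_prime = g * (1.0 - 2.0 * s)
    half = z * abs(g_prime) * se_f       # first-order (delta-method) half-width
    return g - half, g + half


def cate_interval(mu1, se1, mu0, se0, alpha=0.05):
    """Interval for tau(x) = mu1(x) - mu0(x), with the two arms fitted separately."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    tau = mu1 - mu0
    half = z * math.sqrt(se1 ** 2 + se0 ** 2)
    return tau - half, tau + half


# Illustrative inputs only; these numbers are not taken from the paper.
var_lo, var_hi = variance_interval_logistic(f_hat=1.0, se_f=0.2)
tau_lo, tau_hi = cate_interval(mu1=0.6, se1=0.03, mu0=0.4, se0=0.04)
```

Note that at $\hat f^B(x) = 0$ the derivative $g'$ vanishes in the logistic case, so the first-order interval collapses to a point; a higher-order expansion would be needed there.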

Our theoretical analysis focuses on ReLU activations, where monomial-based approximations are tractable. We envision that ESM can be extended to networks with smooth activation functions such as tanh or GELU, although this requires careful consideration. Unlike ReLU, smooth activations are less flexible in local adaptations, which may result in slower approximation rates and influence convergence behavior. To preserve inference validity in this setting, it may be necessary to employ alternative analytical tools or apply stronger regularization to control model complexity and prevent saturation. Moreover, our theoretical analysis relies on several assumptions that are broadly consistent with recent work on implicit bias and convergence in overparameterized models (Chizat & Bach 2020, Chatterji et al. 2021, Cao et al. 2022). We acknowledge, however, that some of these assumptions may appear strong. In particular, the conditions imposed on $\xi_{1,r}(x)$, which ensure the validity of the first-order Hoeffding projection, can be further improved. An important direction for future work is to relax these conditions by linking them more directly to properties of the neural network class, such as its sparsity or other structural characteristics.

Supplementary Material

Supplement

Acknowledgements.

We are deeply grateful to the Editor, Associate Editor, two referees and one reproducibility reviewer for insightful reviews that have greatly improved the manuscript. This work was supported by NIH/NCI grants 2R01CA249096-5 and R01CA269398.

Data Availability Statement.

The data used in this study are publicly available from the eICU Collaborative Research Database (https://eicu-crd.mit.edu/) upon completion of a data use agreement and approval through the PhysioNet credentialed access process.

References

  1. Alaa A & Van Der Schaar M (2020), Discriminative jackknife: Quantifying uncertainty in deep learning via higher-order influence functions, in ‘International Conference on Machine Learning’, pp. 165–174.
  2. Angelopoulos AN & Bates S (2021), ‘A gentle introduction to conformal prediction and distribution-free uncertainty quantification’, arXiv preprint arXiv:2107.07511.
  3. Athey S, Imbens GW & Wager S (2018), ‘Approximate residual balancing: debiased inference of average treatment effects in high dimensions’, Journal of the Royal Statistical Society Series B: Statistical Methodology 80(4), 597–623.
  4. Barber RF, Candes EJ, Ramdas A & Tibshirani RJ (2021), ‘Predictive inference with the jackknife+’, The Annals of Statistics 49(1), 486–507.
  5. Bartlett PL, Long PM, Lugosi G & Tsigler A (2020), ‘Benign overfitting in linear regression’, Proceedings of the National Academy of Sciences 117(48), 30063–30070.
  6. Bhattacharya S, Fan J & Mukherjee D (2024), ‘Deep neural networks for nonparametric interaction models with diverging dimension’, The Annals of Statistics 52(6), 2738–2766.
  7. Boroskikh YV (2020), U-statistics in Banach Spaces, Walter de Gruyter GmbH & Co KG.
  8. Breiman L (2001), ‘Random forests’, Machine Learning 45(1), 5–32.
  9. Brown LD (1986), ‘Fundamentals of statistical exponential families: with applications in statistical decision theory’, Lecture Notes-Monograph Series 9, i–279.
  10. Cao Y, Chen Z, Belkin M & Gu Q (2022), ‘Benign overfitting in two-layer convolutional neural networks’, Advances in Neural Information Processing Systems 35, 25237–25250.
  11. Chatterji NS, Long PM & Bartlett PL (2021), ‘When does gradient descent with logistic loss find interpolating two-layer networks?’, Journal of Machine Learning Research 22(159), 1–48.
  12. Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W & Robins J (2018), ‘Double/debiased machine learning for treatment and structural parameters’, The Econometrics Journal 21(1), C1–C68.
  13. Chizat L & Bach F (2020), Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss, in ‘Conference on Learning Theory’, pp. 1305–1338.
  14. Drucker H, Burges CJ, Kaufman L, Smola A & Vapnik V (1996), ‘Support vector regression machines’, Advances in Neural Information Processing Systems 9, 155–161.
  15. Fan J (2018), Local Polynomial Modelling and Its Applications: Monographs on Statistics and Applied Probability 66, Routledge.
  16. Fan J & Gu Y (2024), ‘Factor augmented sparse throughput deep ReLU neural networks for high dimensional regression’, Journal of the American Statistical Association 119(548), 2680–2694.
  17. Fan J, Gu Y & Zhou W-X (2024), ‘How do noise tails impact on deep ReLU networks?’, The Annals of Statistics 52(4), 1845–1871.
  18. Fei Z & Li Y (2021), ‘Estimation and inference for high dimensional generalized linear models: A splitting and smoothing approach’, Journal of Machine Learning Research 22(58), 1–32.
  19. Fei Z & Li Y (2024), ‘U-learning for prediction inference via combinatory multisubsampling: With applications to lasso and neural networks’, arXiv preprint arXiv:2407.15301.
  20. Frees EW (1989), ‘Infinite order U-statistics’, Scandinavian Journal of Statistics 16(1), 29–45.
  21. Glorot X & Bengio Y (2010), Understanding the difficulty of training deep feedforward neural networks, in ‘Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics’, JMLR Workshop and Conference Proceedings, pp. 249–256.
  22. Goodfellow I, Bengio Y & Courville A (2016), Deep Learning, Vol. 1, MIT Press, Cambridge.
  23. Guo Z, Rakshit P, Herman DS & Chen J (2021), ‘Inference for the case probability in high-dimensional logistic regression’, Journal of Machine Learning Research 22(254), 1–54.
  24. Györfi L, Kohler M, Krzyzak A & Walk H (2006), A Distribution-Free Theory of Nonparametric Regression, Springer Science & Business Media.
  25. Hastie T, Tibshirani R & Friedman JH (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer.
  26. Hoeffding W (1992), ‘A class of statistics with asymptotically normal distribution’, Breakthroughs in Statistics: Foundations and Basic Theory pp. 308–334.
  27. Hornik K, Stinchcombe M & White H (1989), ‘Multilayer feedforward networks are universal approximators’, Neural Networks 2(5), 359–366.
  28. Huang J, Xi H, Zhang L, Yao H, Qiu Y & Wei H (2024), Conformal prediction for deep classifier via label ranking, in ‘International Conference on Machine Learning’, pp. 20331–20347.
  29. Jiao Y, Shen G, Lin Y & Huang J (2023), ‘Deep nonparametric regression on approximate manifolds: Nonasymptotic error bounds with polynomial prefactors’, The Annals of Statistics 51(2), 691–716.
  30. Kim B, Xu C & Barber R (2020), ‘Predictive inference is free with the jackknife+-after-bootstrap’, Advances in Neural Information Processing Systems 33, 4138–4149.
  31. Kohler M & Krzyżak A (2005), ‘Adaptive regression estimation with multilayer feedforward neural networks’, Nonparametric Statistics 17(8), 891–913.
  32. Kohler M & Langer S (2021), ‘On the rate of convergence of fully connected deep neural network regression estimates’, The Annals of Statistics 49(4), 2231–2249.
  33. Kou Y, Chen Z, Chen Y & Gu Q (2023), Benign overfitting in two-layer ReLU convolutional neural networks, in ‘International Conference on Machine Learning’, pp. 17615–17659.
  34. Kuchibhotla AK, Balakrishnan S & Wasserman L (2024), ‘The HulC: Confidence regions from convex hulls’, Journal of the Royal Statistical Society Series B: Statistical Methodology 86(3), 586–622.
  35. Lakshminarayanan B, Pritzel A & Blundell C (2017), ‘Simple and scalable predictive uncertainty estimation using deep ensembles’, Advances in Neural Information Processing Systems 30.
  36. LeCun Y, Bengio Y & Hinton G (2015), ‘Deep learning’, Nature 521(7553), 436–444.
  37. Lee AJ (2019), U-statistics: Theory and Practice, Routledge.
  38. Lei J, G’Sell M, Rinaldo A, Tibshirani RJ & Wasserman L (2018), ‘Distribution-free predictive inference for regression’, Journal of the American Statistical Association 113(523), 1094–1111.
  39. Lu J, Shen Z, Yang H & Zhang S (2021), ‘Deep network approximation for smooth functions’, SIAM Journal on Mathematical Analysis 53(5), 5465–5506.
  40. Lundberg SM, Nair B, Vavilala MS, Horibe M, Eisses MJ, Adams T, Liston DE, Low DK-W, Newman S-F, Kim J et al. (2018), ‘Explainable machine-learning predictions for the prevention of hypoxaemia during surgery’, Nature Biomedical Engineering 2(10), 749–760.
  41. McCaffrey DF & Gallant AR (1994), ‘Convergence rates for single hidden layer feedforward networks’, Neural Networks 7(1), 147–158.
  42. Mentch L & Hooker G (2016), ‘Quantifying uncertainty in random forests via confidence intervals and hypothesis tests’, Journal of Machine Learning Research 17(26), 1–41.
  43. Nadaraya EA (1964), ‘On estimating regression’, Theory of Probability & Its Applications 9(1), 141–142.
  44. Nair V & Hinton GE (2010), Rectified linear units improve restricted Boltzmann machines, in ‘International Conference on Machine Learning’, pp. 807–814.
  45. Peng W, Mentch L & Stefanski L (2024), ‘Bias, consistency, and alternative perspectives of the infinitesimal jackknife’, Statistica Sinica.
  46. Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG & Badawi O (2018), ‘The eICU collaborative research database, a freely available multi-center database for critical care research’, Scientific Data 5(1), 1–13.
  47. Schmidt-Hieber AJ (2020), ‘Nonparametric regression using deep neural networks with ReLU activation function’, The Annals of Statistics 48(4), 1875–1897.
  48. Schupbach J, Sheppard JW & Forrester T (2020), Quantifying uncertainty in neural network ensembles using U-statistics, in ‘2020 International Joint Conference on Neural Networks (IJCNN)’, IEEE, pp. 1–8.
  49. Shen G, Jiao Y, Lin Y & Huang J (2021), ‘Robust nonparametric regression with deep neural networks’, arXiv preprint arXiv:2107.10343.
  50. Shen Z, Yang H & Zhang S (2019), ‘Nonlinear approximation via compositions’, Neural Networks 119, 74–84.
  51. Sun Y, Xiong W & Liang F (2021), ‘Sparse deep learning: A new framework immune to local traps and miscalibration’, Advances in Neural Information Processing Systems 34, 22301–22312.
  52. Takezawa K (2005), Introduction to Nonparametric Regression, John Wiley & Sons.
  53. Van Calster B, McLernon DJ, Van Smeden M, Wynants L, Steyerberg EW & Topic Group ‘Evaluating diagnostic tests and prediction models’ of the STRATOS initiative (2019), ‘Calibration: the Achilles heel of predictive analytics’, BMC Medicine 17(1), 230.
  54. Wager S & Athey S (2018), ‘Estimation and inference of heterogeneous treatment effects using random forests’, Journal of the American Statistical Association 113(523), 1228–1242.
  55. Wager S, Hastie T & Efron B (2014), ‘Confidence intervals for random forests: The jackknife and the infinitesimal jackknife’, Journal of Machine Learning Research 15(1), 1625–1651.
  56. Wang Q & Wei Y (2022), ‘Quantifying uncertainty of subsampling-based ensemble methods under a U-statistic framework’, Journal of Statistical Computation and Simulation 92(17), 3706–3726.
  57. Wang X, Zhou L & Lin H (2024), ‘Deep regression learning with optimal loss function’, Journal of the American Statistical Association pp. 1–13.
  58. Yan S, Yao F & Zhou H (2025), ‘Deep regression for repeated measurements’, Journal of the American Statistical Association pp. 1–12.
  59. Yarotsky D (2017), ‘Error bounds for approximations with deep ReLU networks’, Neural Networks 94, 103–114.
  60. Zhang Y & Politis DN (2023), ‘Bootstrap prediction intervals with asymptotic conditional validity and unconditional guarantees’, Information and Inference: A Journal of the IMA 12(1), 157–209.
