Published before final editing as: IEEE Trans. Neural Netw. Learn. Syst., Jun. 30, 2022, doi: 10.1109/TNNLS.2022.3185742.

Significance Tests of Feature Relevance for a Black-Box Learner

Ben Dai 1, Xiaotong Shen 2, Wei Pan 3

Abstract

An exciting recent development is the uptake of deep neural networks in many scientific fields, where the main objective is outcome prediction with a black-box model. Significance testing is promising for addressing the black-box issue and for exploring novel scientific insights and interpretations of the decision-making process of a deep learning model. However, testing for a neural network is challenging because of its black-box nature and the unknown limiting distributions of parameter estimates, while existing methods require strong assumptions or excessive computation. In this article, we derive one-split and two-split tests that relax the assumptions and reduce the computational complexity of existing black-box tests and that extend to examining the significance of a collection of features of interest in a dataset of a possibly complex type, such as an image. The one-split test estimates and evaluates a black-box model on estimation and inference subsets obtained through sample splitting and data perturbation. The two-split test further splits the inference subset into two but requires no perturbation. We also develop their combined versions by aggregating the p-values from repeated sample splitting. By deflating the bias-sd-ratio, we establish asymptotic null distributions of the test statistics and their consistency in terms of the Type II error. Numerically, we demonstrate the utility of the proposed tests on seven simulated examples and six real datasets. Accompanying this article is our Python library dnn-inference (https://dnninference.readthedocs.io/en/latest/) that implements the proposed tests.

Keywords: Adaptive splitting, black-box tests, combining, computational constraints, feature relevance

I. Introduction

DEEP neural networks [1] are representative black-box models, in which the learned relationship between features and outcomes is usually difficult to track due to the lack of knowledge about the complex hidden patterns inside. The primary goal of deep learning (DL) is to fit a deep neural network that predicts outcomes with high accuracy. Driven by its superior prediction performance [1], scientists now seek accountability and interpretability beyond prediction accuracy. In particular, they demand significance tests based on DL to explore novel discoveries in scientific domain knowledge, for example, whether a specific lung region is significantly associated with COVID-19.

Given a dataset, the goal of a statistical significance test is to examine whether a collection of features of interest is associated with the outcome. A problem of this kind frequently occurs in classical parametric or nonblack-box models, for instance, in statistical genetics such as Alzheimer's disease (AD) studies, where a gene is routinely examined and tested for AD association based on a linear model. Yet, significance testing based on a black-box model for a more complicated dataset remains understudied. In Section I-A, we discuss the existing methods and their limitations. In Section I-B, we summarize our contributions and highlight the novelty of the proposed methods in addressing the existing issues.

A. Existing Methods and Their Limitations

In the existing literature, inference methods can be categorized into two groups: nonblack-box tests and black-box tests. Nonblack-box tests, such as the Wald test [2] and the likelihood-ratio test [3], [4], perform hypothesis testing for hypothesized features (HFs) (i.e., features of interest) based on the asymptotic distribution of the estimated parameters in a parametric model, such as a linear model. Black-box tests focus on a model-free hypothesis, such as Model-X knockoffs, conditional randomization tests (CRT) [5], holdout randomization test (HRT) [6], permutation test (PT) [7], conditional PT (CPT) [8], and leave-one-covariate-out test (LOCO) [9]. Specifically, Model-X knockoffs conduct variable selection with false discovery rate (FDR) control based on a specified variable importance measure on each of the individual features. CRT, HRT, and CPT examine the independence between the outcome and each feature conditional on the remaining features (at least for their simulations and implementation). PT examines the marginal independence between the outcome and HFs. LOCO introduces the excess prediction error for each feature to measure its importance for a given dataset.

Limitations of Existing Works:

Despite the merits of the methods developed, they have their limitations. 1) First, for nonblack-box tests, it is difficult to derive the asymptotic distribution of the parameter estimates from black-box models, especially for over-parametrized neural networks. Moreover, the explicit feature–parameter correspondence may be lost for a black-box model, such as a convolutional neural network (CNN) [10] using shared weights for spatial pixels and a recurrent neural network (RNN) [11] using shared weights for subsequent states. 2) Second, most existing black-box tests focus on variable importance or inference for a single feature, yet simultaneous testing of a collection of features is more desirable in some applications. For example, in image analysis, it is more interesting to examine patterns captured by multiple pixels in a region, where the impact of every single pixel is negligible. 3) Third, CRT, CPT, and HRT rely on a strong assumption that the conditional feature distribution is known or well estimated, so that a test statistic can be constructed based on samples generated from the null distribution. However, the complete conditionals may not be known or easy to estimate in practice, especially for complex datasets such as images or texts. 4) Finally, PT, CPT, and CRT require massive computing power to refit a model many times, which is infeasible for complex deep neural networks. More detailed discussion and numerical results about the connections and differences between the existing tests and the proposed tests can be found in Sections II-E and V-A.

B. Our Contributions

This article proposes one-split and two-split tests to address the existing issues 1)–4) in Section I-A. Our main contributions are summarized as follows.

  1. To address issues 1) and 2), we propose a flexible risk invariance null hypothesis for a general loss in (2), which measures the impact of a collection of HFs on prediction. Its relation to conditional independence is given in Lemma 1.

  2. To address issues 3) and 4), the one-split and two-split tests bypass the requirement of estimating the conditional distributions of features by testing based on the differenced empirical loss with sample splitting, subject to computational constraints.

  3. We provide theoretical guarantees for the proposed tests, cf. Theorems 2 and 4 and Theorems A.1–A.4 in Appendix A in the Supplementary Material. The theory is illustrated by extensive simulations.

  4. We compare the proposed tests with other existing tests and demonstrate their utility on benchmark datasets with various deep neural networks in Section VI. We develop a Python library dnn-inference to implement the proposed tests.

Overall, the proposed tests relax the assumptions and reduce the computational cost, providing more practical, feasible, and reliable testing for a black-box model on a complex dataset.

This article is structured as follows. Section II introduces the one-split test as well as its combined test. Section III performs Type II error analysis and establishes the consistency of the proposed tests. Section IV develops sample splitting schemes. Section V is devoted to simulation studies, and Section VI presents applications to six real datasets. The Appendix in the Supplementary Material encompasses the two-split test, additional numerical examples, and technical proofs.

II. Proposed Black-Box Tests

In DL, a deep neural network f(X) is fit to predict an outcome Y based on features X ∈ ℝ^d, and its prediction performance is evaluated by a loss function l(f(X), Y). Our objective is to test the significance of a subset of features X_𝓢 = {X_j : j ∈ 𝓢} for the prediction of Y, where 𝓢 is an index set of HFs and X_{𝓢^c} = {X_j : j ∉ 𝓢}, with 𝓢^c indicating the complement of 𝓢. Note that X_𝓢 can be a collection of weak features in that none of these features is individually significant for prediction, but collectively they are. For example, in image analysis, the impact of each pixel is negligible, but a pattern formed by a collection of pixels (e.g., in a region) may instead become significant.

To formulate the proposed hypothesis, we first generate dual data (Z, Y) by replacing X_𝓢 with some irrelevant constant Z_𝓢, such as Z_𝓢 = 0; that is, for j = 1,…,d

Z_j = M, if j ∈ 𝓢;  Z_j = X_j, otherwise  (1)

where M is an arbitrary deterministic constant. Note that the dual data Z satisfy Z_𝓢 ⫫ Y | Z_{𝓢^c} and Z_{𝓢^c} = X_{𝓢^c}, and thus, we aim to use differences between (X, Y) and (Z, Y) to measure the impact of X_𝓢 on the prediction of the outcome Y. To proceed, we introduce the corresponding risks

R(f)=E(l(f(X),Y)),R𝓢(g)=E(l(g(Z),Y)).

Then, the differenced risk R(f) − R_𝓢(g) is defined to measure the significance of X_𝓢; that is, it compares the best achievable prediction performance with and without the HFs X_𝓢, in the presence of X_{𝓢^c}. Here, f = argmin_f R(f) and g = argmin_g R_𝓢(g) are the optimal prediction functions at the population level.

To determine whether X_𝓢 is significantly relevant to the prediction of Y, consider the null hypothesis H_0 and alternative hypothesis H_a

H_0: R(f) − R_𝓢(g) = 0,  H_a: R(f) − R_𝓢(g) < 0.  (2)

Rejection of H0 suggests that the feature set X𝓢 is relevant to the prediction of Y. It is emphasized that in (2), the targets are the two true or population-level functions f and g, instead of their estimates (based on a given sample) as implemented in some existing tests.
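To make the construction of the dual data in (1) concrete, the masking simply overwrites the hypothesized columns with a constant. The following is a minimal sketch, assuming tabular features stored in a NumPy array and taking M = 0; the function name mask_features is ours and not part of any library.

```python
import numpy as np

def mask_features(X, S, M=0.0):
    """Generate the dual features Z in (1): overwrite the hypothesized
    columns indexed by S with the constant M and keep the rest."""
    Z = X.copy()
    Z[:, S] = M
    return Z

# Example: mask the first three features of a 5-dimensional design matrix.
X = np.random.randn(8, 5)
Z = mask_features(X, S=[0, 1, 2])
```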

In Section II-A, we establish the relation between the proposed hypothesis and the independence hypotheses. More discussion about the differences between our hypothesis and those of HRT and LOCO can be found in Sections II-E and V-A.

A. Connection to Independence

This subsection illustrates the relationships among the risk invariance hypothesis in (2), marginal independence, and conditional independence; the latter two are defined as:

Marginal independence: Y ⫫ X_𝓢
Conditional independence: Y ⫫ X_𝓢 | X_{𝓢^c}.

Lemma 1:

For any loss function, conditional independence implies the proposed risk invariance, that is,

Y ⫫ X_𝓢 | X_{𝓢^c} ⟹ R(f) − R_𝓢(g) = 0.

Moreover, if the negative log-likelihood or the cross-entropy l(f(X), Y) = −𝟙_Y^⊤ log(f(X)) is used as the loss function in (2), then H_0 is equivalent to conditional independence almost surely under the marginal distribution of X, that is,

R(f) − R_𝓢(g) = 0 ⟺ ℙ(Y = y | X_𝓢, X_{𝓢^c}) = ℙ(Y = y | X_{𝓢^c}) for any y.

As suggested by Lemma 1, conditional independence always implies risk invariance, but the two are equivalent (almost surely) only for particular loss functions. Hence, at any significance level, a rejection of the null hypothesis of risk invariance implies a rejection of the null hypothesis of conditional independence. Yet, such a relationship does not exist for marginal independence. Next, we present three cases with disparate loss functions to illustrate their relationships.

Case 1 (Constant loss): l(f(X), Y) = C for a constant C.

Case 2 (The L2-loss in regression): l(f(X), Y) = (Y − f(X))^2 for Y ∈ ℝ.

Case 3 (The cross-entropy loss in multiclass classification): l(f(X), Y) = −𝟙_Y^⊤ log(f(X)) for Y ∈ {1,…,K}.

As shown in Fig. 1, conditional independence implies risk invariance in Cases 1 and 2 while they are equivalent in Case 3, as suggested by Lemma 1. In general, conditional independence or risk invariance does not yield marginal independence and vice versa.

Fig. 1.

Three cases illustrate different relations among marginal independence, conditional independence, and risk invariance.

It is worth mentioning that different loss functions can lead to different conclusions, and a significance test should be interpreted according to the loss function being used. For example, consider the misclassification error (MCE) and the cross-entropy loss for testing the relevance of X_𝓢. In the presence of X_{𝓢^c}, H_0 under the cross-entropy loss indicates that the HFs are irrelevant to the conditional distribution of Y given X, whereas H_0 under the MCE indicates that the HFs are irrelevant to classification accuracy.

B. One-Split Test

Given a dataset 𝒟_N = {(X_i, Y_i)}_{i=1,…,N}, we first split it into an estimation subset 𝓔_n = {(X_i, Y_i)}_{i=1,…,n} and an inference subset 𝓘_m = {(X_j, Y_j)}_{j=n+1,…,n+m}, where N is the total sample size, n = ζN and m = N − n are the sample sizes of the estimation and inference subsets, and ζ is a splitting ratio. On this ground, the dual estimation subset {(Z_i, Y_i)}_{i=1,…,n} and the dual inference subset {(Z_j, Y_j)}_{j=n+1,…,n+m} are generated by the masking process in (1). The sample splitting intends to reduce the potential bias and to prevent overfitting, especially for an over-parametrized black-box model, and has been considered elsewhere for a different purpose in [4], [12], [13].

To assess the null hypothesis in (2), we conduct a two-level estimation of R(f) − R_𝓢(g): the (dual) estimation subset is used to estimate the predictive models (f, g), and the (dual) inference subset is used to empirically estimate the two risks R(·) and R_𝓢(·). Specifically, given the (dual) estimation subset, we obtain an estimator (f̂_n, ĝ_n) of (f, g), for example, by minimizing a regularized empirical loss of a deep neural network over the (dual) estimation subset. Then, the differenced empirical loss is evaluated on the (dual) inference subset based on the estimator (f̂_n, ĝ_n):

(1/m) ∑_{j=1}^m [l(f̂_n(X_{n+j}), Y_{n+j}) − l(ĝ_n(Z_{n+j}), Y_{n+j})].

As a remark, for flexibility, we do not specify the estimation procedure for (f̂_n, ĝ_n); the only explicit condition is summarized in Assumption A, which requires that (f̂_n, ĝ_n) is a consistent estimator of (f, g).

One difficulty in inference is that, under H_0, the bias of R(f̂_n) − R_𝓢(ĝ_n) in approximating R(f) − R_𝓢(g) could dominate its standard error; that is, the ratio of the bias to the standard deviation, called the bias-sd-ratio, could be severely inflated, making the asymptotic distribution of R(f̂_n) − R_𝓢(ĝ_n) invalid for inference. This aspect is explained in detail in Section II-D. To circumvent this difficulty, we present the one-split test with data perturbation, which guards against the potentially inflated bias-sd-ratio by adding independent noise:

Λ_n^(1) = ∑_{j=1}^m Δ_{n,j}^(1) / (√m σ̂_n^(1)),   Δ_{n,j}^(1) = l(f̂_n(X_{n+j}), Y_{n+j}) − l(ĝ_n(Z_{n+j}), Y_{n+j}) + ρ_n ε_j  (3)

where σ̂_n^(1) is the sample standard deviation of {Δ_{n,j}^(1)}_{j=1,…,m} given f̂_n and ĝ_n:

σ̂_n^(1) = ((1/(m−1)) ∑_{j=1}^m (Δ_{n,j}^(1) − Δ̄_n^(1))^2)^{1/2},   Δ̄_n^(1) = (1/m) ∑_{j=1}^m Δ_{n,j}^(1)

and ε_j ∼ N(0,1), j = 1,…,m, are independent noise variables, and ρ_n > 0 is the perturbation size. Note that the proposed test is in principle similar to classical hypothesis tests based on a single test statistic. For example, if the negative log-likelihood is used as the loss function, it can be regarded as an extension of the likelihood ratio test (LRT) [14] to a black-box model.

According to the asymptotic null distribution of Λ_n^(1) in Theorem 2, we calculate the p-value P^(1) = Φ(Λ_n^(1)), where Φ(·) is the cumulative distribution function of N(0,1).
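As an illustration, the statistic in (3) and its p-value can be computed from the per-sample losses on the inference subset. The sketch below is our own and assumes the losses l(f̂_n(X), Y) and l(ĝ_n(Z), Y) have already been evaluated; it is not the dnn-inference API.

```python
import numpy as np
from scipy.stats import norm

def one_split_pvalue(loss_f, loss_g, rho, rng=None):
    """One-split statistic (3) and p-value P = Phi(Lambda_n).

    loss_f, loss_g: arrays of per-sample losses of f_hat and g_hat on the
    inference subset; rho: perturbation size rho_n."""
    rng = np.random.default_rng(rng)
    m = len(loss_f)
    delta = np.asarray(loss_f) - np.asarray(loss_g) + rho * rng.standard_normal(m)
    stat = delta.sum() / (np.sqrt(m) * np.std(delta, ddof=1))
    return norm.cdf(stat)  # a small p-value is evidence against H0
```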

Note that m depends on n, and m → ∞ as n → ∞. To derive the asymptotic null distribution of Λ_n^(1), we make the following assumptions.

Assumption A (Estimation Consistency):

For some constant γ>0, (f^n, g^n) satisfies

|R(f̂_n) − R(f) − (R_𝓢(ĝ_n) − R_𝓢(g))| = O_p(n^{−γ})  (4)

where O_p(·) denotes stochastic boundedness [15].

Assumption A concerns the rate of convergence of the differenced regret, where R(f̂_n) − R(f) ≥ 0 is known as the prediction regret of f̂_n with respect to a loss function l(·,·). Note that

|R(f̂_n) − R(f) − (R_𝓢(ĝ_n) − R_𝓢(g))| ≤ max(R(f̂_n) − R(f), R_𝓢(ĝ_n) − R_𝓢(g))  (5)

which says that the rate n^{−γ} is no worse than the least favorable of the regrets of f̂_n and ĝ_n. In the literature, convergence rates for the right-hand side of (5) have been extensively investigated. For example, the rate is n^{−ξ/(2ξ+d)} for nonparametric regression [16] and d n^{−2ξ/(2ξ+1)} log^3 n for a regularized ReLU neural network [17], where ξ is the degree of smoothness of the d-dimensional true regression function. Note that an over-parametrized model may slow the convergence rate (a smaller γ), whereas an under-parametrized model may violate Assumption A since the approximation error does not vanish. This fact is supported by Example 7 in the simulations (Section V-B).

Assumption B (Lyapounov Condition for Λn(1)):

Assume that

m^{−μ} E(|Δ_{n,1}^(1)|^{2(1+μ)} | 𝓔_n) →_p 0, as n → ∞

for some constant μ > 0, where Δ_{n,1}^(1) is defined in (3), and E(· | 𝓔_n) denotes the conditional expectation over the inference sample given the estimation sample.

Assumption C (Variance Condition for Λn(1)):

Assume that

Var(Δ_{n,1}^(1) | 𝓔_n) →_p (σ^(1))^2 > 0, as n → ∞

where Var(· | 𝓔_n) denotes the conditional variance over the inference sample given the estimation sample.

Assumptions B and C are used in applying the central limit theorem for triangular arrays [18]; they are verifiable under some mild conditions, cf. Lemma C.1 in Appendix C in the Supplementary Material.

The asymptotic null distribution of Λ_n^(1) is given in Theorem 2.

Theorem 2 (Asymptotic Null Distribution of Λ_n^(1)):

In addition to Assumptions A–C, if m = o(n^{2γ}), then under H_0,

Λ_n^(1) →_d N(0,1), as n → ∞  (6)

where →_d denotes convergence in distribution.

Theorem 2 says that the proposed test is valid under the splitting condition m = o(n^{2γ}). As a result, the estimation/inference splitting ratio needs to be suitably controlled. In Section IV, we propose a "log-ratio" splitting scheme in which the splitting condition is automatically satisfied, cf. Lemma 6.

As an alternative, we present the two-split test in Appendix A in the Supplementary Material to address the bias-sd-ratio issue; it divides the inference sample further into two equal subsets, in which case no data perturbation is needed.

C. Combining p-Values Over Repeated Random Splitting

Combining p-values via repeated random sample splitting can strengthen the one-split test (3). First, it stabilizes the testing result. Second, it can often empirically compensate for the power loss due to sample splitting by combining evidence across different splits, as illustrated in [19] and [20] and in our simulations in Section V. Subsequently, we use the order statistics of the p-values to combine the evidence from different splits, though other types of combining, such as the corrected arithmetic and geometric means [21], [22], could also be applied.

Given a splitting ratio, we repeat the random splitting scheme U ≥ 2 times; that is, each time, we randomly split the original dataset into an estimation subset and an inference subset. In practice, U cannot be large due to computational constraints and is usually 3–10 for large-scale applications. Then we compute the p-value P_u^(1) on the uth split, u = 1,…,U, and combine them in two ways: the q-order and Hommel's weighted average of the order statistics [23]. Specifically,

(q-order)  P̄^(1) = min((U/q) P_(q)^(1), 1);   (Hommel)  P̄^(1) = min(C_U min_{1≤q≤U} (U/q) P_(q)^(1), 1)  (7)

where C_U = ∑_{q=1}^U 1/q, and P_(q)^(1) is the qth order statistic of P_1^(1),…,P_U^(1).

The q-order combined test in (7) is a generalized Bonferroni test with the Bonferroni correction factor U/q. The Hommel combined test renders robust aggregation and yields better control of the Type I error, where C_U is a normalizing constant.
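For concreteness, a minimal sketch of the two combining rules in (7) is given below; combine_pvalues is our own helper name rather than a library function.

```python
import numpy as np

def combine_pvalues(pvals, method="hommel", q=1):
    """Combine p-values from U random splits as in (7).

    "q-order": generalized Bonferroni min((U/q) * P_(q), 1);
    "hommel":  min(C_U * min_q (U/q) * P_(q), 1), with C_U = sum_{q=1}^U 1/q."""
    p = np.sort(np.asarray(pvals, dtype=float))
    U = len(p)
    scaled = U / np.arange(1, U + 1) * p      # (U/q) * P_(q) for q = 1..U
    if method == "q-order":
        return min(scaled[q - 1], 1.0)
    c_U = np.sum(1.0 / np.arange(1, U + 1))
    return min(c_U * scaled.min(), 1.0)
```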

In Theorems 3 and 5, we further generalize the result of [23] to control Type I and Type II errors of the proposed tests asymptotically. A computational scheme for the combined tests is summarized in Algorithm 1.

Theorem 3 (Type I Error for the Combined One-Split Test):

Under Assumptions A–C, if m = o(n^{2γ}), then under H_0, for any 0 < α < 1 and any U ≥ 2, the combined one-split test for (3) achieves

lim_{n→∞} ℙ(P̄^(1) ≤ α | H_0) ≤ α

where P̄^(1) is defined in (7).

D. Role of Data Perturbation

This subsection discusses the role of data perturbation in the one-split test. Consider the one-split test without perturbation, that is, Λ_n^(1) in (3) with ρ_n = 0. We decompose Λ_n^(1) into three terms:

Λ_n^(1) = (√m/σ̂_n^(1)) (1/m) ∑_{j=1}^m (Δ_{n,j}^(1) − E(Δ_{n,j}^(1) | 𝓔_n)) + (√m/σ̂_n^(1)) (R(f̂_n) − R(f) − (R_𝓢(ĝ_n) − R_𝓢(g))) + (√m/σ̂_n^(1)) (R(f) − R_𝓢(g)) ≡ T_1 + T_2 + T_3.

Under H_0, T_3 = 0, and T_2 is the bias-sd-ratio introduced in Section II-B. Specifically, under H_0 with ρ_n = 0, σ̂_n^(1) →_p 0 as n → ∞, as opposed to σ̂_n^(1) →_p σ^(1) > 0 in Assumption C. As a result, T_1 may not satisfy the conditions of the central limit theorem. Furthermore, T_2 may not converge to zero; for example, T_2 = O_p(m^{1/2}) when σ̂_n^(1) and the differenced regret R(f̂_n) − R(f) − (R_𝓢(ĝ_n) − R_𝓢(g)) vanish at the same order. Thus, the asymptotic null distribution in (6) breaks down since Λ_n^(1) is dominated by T_2.

By comparison, with data perturbation ρ_n ≡ ρ > 0, we have (σ^(1))^2 = Var(l(f(X), Y) − l(g(Z), Y)) + ρ^2 > 0. By Assumption A,

|T_2| = (√m/σ̂_n^(1)) |R(f̂_n) − R_𝓢(ĝ_n) − (R(f) − R_𝓢(g))| = O_p(m^{1/2} n^{−γ})

which implies that T_2 →_p 0 under the splitting condition m = o(n^{2γ}). Hence, the asymptotic null distribution of Λ_n^(1) in (6) remains valid. Moreover, the "log-ratio" sample splitting scheme proposed in (8) satisfies the splitting condition automatically, as indicated in Lemma 6.

In later simulations (cf. Table VII), we show numerically that omitting data perturbation in the one-split test leads to increasingly inflated Type I errors as the dataset grows, for a neural network model.

TABLE VII.

Type I Errors of the One-Split Test With/Without Perturbation (PTB) and the Two-Split Test in Section V-C

Test N=2000 N=6000 N=10000
One-split without PTB 0.083 0.109 0.193
One-split with PTB 0.057 0.053 0.061
Two-split 0.048 0.051 0.047

E. Comparison With Existing Black-Box Tests

The one-split test in (3) has some characteristics that distinguish it from other existing black-box tests, including CRT [5], HRT [6], CPT [8] and LOCO tests [9].

CRT, CPT, and HRT test the conditional independence of a single feature individually, and the LOCO test measures the increase in prediction error due to not using a specified feature in a given dataset. The differences between the proposed tests and the existing tests can be summarized as follows. First, for the CRT, CPT, HRT, and LOCO tests, it is unclear how to test a set of multiple features X_𝓢, which is the target of our tests. Second, the significance of relevance is defined in different ways: the LOCO test performs a significance test for the estimated model on a given dataset using the mean absolute error, whereas CRT, CPT, HRT, and the proposed tests perform testing at the population level; CRT, CPT, and HRT examine conditional independence, while ours focuses on the risk invariance specified in (2). Third, CRT, CPT, and HRT require well-estimated conditional probabilities of every feature given the rest, which is often difficult in practice. Finally, the proposed tests are advantageous over CRT and CPT in terms of reduced computational cost by avoiding a large number of model refits.

III. Type II Error Analysis

This section performs Type II error analysis of the one-split test (3) and its combined version (7).

Consider an alternative hypothesis H_a: R(f) − R_𝓢(g) = −m^{−1/2} δ < 0 for δ > 0. The Type II errors of the one-split test and its combined test can be written as

β_n(δ) = ℙ(P^(1) > α | H_a),   β̄_n(δ) = ℙ(P̄^(1) > α | H_a)

where ℙ(· | H_a) denotes the probability under H_a, and α > 0 is the nominal level, or level of significance.

Theorems 4 and 5 suggest that the one-split test and its combined test are consistent in that their asymptotic Type II errors tend to zero as δ → ∞.

Theorem 4 (Limiting Type II Error of the One-Split Test):

Suppose that the one-split test (3) satisfies Assumptions A–C and m = o(n^{2γ}). Then

lim sup_{n→∞} β_n(δ) = Φ(z_α − δ/σ^(1)),   lim_{δ→∞} lim sup_{n→∞} β_n(δ) = 0

where z_α = Φ^{−1}(1 − α) is the z-multiplier of the standard normal distribution.

Given the results of Theorem A.3 in Appendix A in the Supplementary Material, we note that the one-split test is more powerful than the two-split test in terms of the asymptotic Type II error.

Theorem 5 (Limiting Type II Error of the Combined Tests):

Suppose that the one-split test (3) satisfies Assumptions A–C and m = o(n^{2γ}). Then for P̄^(1) defined as the q-order combined test in (7), we have

lim sup_{n→∞} β̄_n(δ) ≤ min((Uα/q) Γ, 1),   lim_{δ→∞} lim sup_{n→∞} β̄_n(δ) = 0

and for P̄^(1) defined as the Hommel combined test in (7), we have

lim sup_{n→∞} β̄_n(δ) ≤ min{C_U (Uα/q) Γ, 1; q = 1,…,U},   lim_{δ→∞} lim sup_{n→∞} β̄_n(δ) = 0

where

Γ = ((q − 1)/(U − q + 1)) (Φ(h_0) − Φ^2(h_0) − 2T(h_0, √3/3)) + Φ(h_0),   h_0 = −δ/(2σ^(1))

and T(h,a) is Owen’s T function [24].

Note that the upper bound in Theorem 5 can be further improved if the explicit dependency structures of the p-values from repeated sample splitting are known.

IV. Sample Splitting

The one-split and two-split tests require the sample splitting ratio ζ to satisfy m = o(n^{2γ}) so as to control the Type I error. In this section, we develop two computing schemes, namely the "log-ratio" and "data-adaptive" tuning schemes, to determine ζ and, for the one-split test, the perturbation size ρ.

A. Log-Ratio Sample Splitting Scheme

This subsection proposes a log-ratio splitting scheme that automatically ensures the splitting condition m = o(n^{2γ}). Specifically, given a sample size N ≥ N_0, where N_0 is a minimal sample size required for the hypothesis testing, the estimation and inference sample sizes n and m are obtained as

n = ⌈x_0⌉,   m = N − n  (8)

where x_0 is the solution of x + (N_0/(2 log(N_0/2))) log(x) = N (cf. Table I).

TABLE I.

Illustration of Split Sample Sizes (n, m) Using the Log-Ratio Splitting Scheme (8) as the Total Sample Size N Increases From 2000 to 100 000 While N0=2000 Is Fixed

Total sample size (N)        2000   5000   10000   20000   50000   100000
Estimation sample size (n)   1000   3807    8688   18578   48439    98336
Inference sample size (m)    1000   1193    1312    1422    1561     1664
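As a numerical check, the defining equation of the log-ratio scheme can be solved with a standard root finder; under the assumption that n is obtained as ⌈x_0⌉, the sketch below reproduces the (n, m) pairs in Table I.

```python
import numpy as np
from scipy.optimize import brentq

def log_ratio_split(N, N0=2000):
    """Log-ratio splitting (8): n = ceil(x0), m = N - n, where x0 solves
    x + N0 / (2 * log(N0 / 2)) * log(x) = N."""
    c = N0 / (2.0 * np.log(N0 / 2.0))
    x0 = brentq(lambda x: x + c * np.log(x) - N, 1.0, N)
    n = int(np.ceil(x0))
    return n, N - n

# For instance, log_ratio_split(10000) gives (8688, 1312) as in Table I.
```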

Lemma 6:

Suppose that the estimation/inference sample sizes (n, m) are determined by the log-ratio sample splitting scheme in (8). Then they satisfy the splitting condition m = o(n^{2γ}) for any γ > 0 in Assumption A.

B. Heuristic Data-Adaptive Splitting (Tuning) Scheme

The log-ratio splitting scheme in (8) is relatively conservative in that the inference sample size m increases only logarithmically in the estimation sample size n. To further increase the test's power, we develop a heuristic data-adaptive tuning scheme as an alternative.

The data-adaptive tuning scheme selects (ζ, ρ) by controlling the estimated Type I error on permuted datasets. To proceed, we define the permutation of the HFs; that is, for j = 1,…,d:

X̃_{i,j} = X_{π(i),j}, if j ∈ 𝓢;   X̃_{i,j} = X_{i,j}, otherwise  (9)

where π is a permutation mapping. Note that the HFs X̃_{i,𝓢} are conditionally independent of the outcome Y_i in the permuted sample; in other words, the null hypothesis H_0 holds for permuted datasets. On this ground, we use the proportion of rejections over T permutations as an estimate of the Type I error and select (ζ, ρ) to control this estimated Type I error. Ideally, refitting and re-evaluation would be required for each permuted dataset. To reduce the computational cost, we fit (f̂, ĝ) only once, on a permuted estimation subset, and estimate the Type I error by re-evaluating them on T permutations of the inference subset. The detailed procedure is summarized in the following Steps 1–4.

Step 1 (Sample Splitting):

Given a splitting ratio ζ, split the original sample into the estimation and inference samples.

Step 2 (Permutation):

Permute HFs of estimation/inference samples via (9).

Step 3 (Fitting):

Generate the dual estimation subset via (1), and fit (f^n, g^n) based on (dual) permuted estimation subsets.

Step 4 (Estimate Type I Error):

Permute the inference subset T times, and generate the corresponding dual samples via (1). For the fixed estimators (f̂_n, ĝ_n), compute the (combined) p-values for each permuted (dual) inference sample under the perturbation size ρ, denoted by (P^(1,t))_{t=1,…,T}.

Then an estimated Type I error is computed as:

Êrr^(1)(ρ, ζ) = T^{−1} ∑_{t=1}^T I(P^(1,t) ≤ α).  (10)

The splitting ratio ζ controls the tradeoff between Type I and Type II errors. Specifically, a small ζ yields biased estimators (f̂_n, ĝ_n), leading to an inflated Type I error, yet it could reduce the Type II error because of an enlarged inference subset. The perturbation size ρ, as mentioned earlier, controls the bias-sd-ratio to ensure the validity of the asymptotic null distribution.

For the one-split test, the data-adaptive scheme estimates (ζ, ρ) as the smallest values in some candidate sets that control the estimated Type I error. In searching over the candidate sets, the procedure stops once the termination criterion Êrr^(1)(ρ, ζ) ≤ α is met, which reduces the computational cost. In particular,

(ρ̂, ζ̂) = min{ρ ∈ 𝒫, ζ ∈ 𝒵 : Êrr^(1)(ρ, ζ) ≤ α}  (11)

where Êrr^(1)(ρ, ζ) is the estimated Type I error computed via (10), and 𝒫 and 𝒵 are the candidate sets of ρ and ζ values, respectively.
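The permutation-based estimate (10) for a fixed pair (ρ, ζ) can be sketched as follows, reusing mask_features and one_split_pvalue from the earlier sketches; loss_fn, f_hat, and g_hat are hypothetical callables (a per-sample loss and the fitted predictors), so this illustrates Steps 2–4 rather than the library implementation.

```python
import numpy as np

def estimated_type1_error(loss_fn, f_hat, g_hat, X_inf, y_inf, S,
                          rho, T=100, alpha=0.05, rng=None):
    """Estimate the Type I error (10) by re-evaluating the fixed estimators
    (f_hat, g_hat) on T permuted versions of the inference subset."""
    rng = np.random.default_rng(rng)
    rejections = 0
    for _ in range(T):
        Xp = X_inf.copy()
        Xp[:, S] = X_inf[rng.permutation(len(X_inf))][:, S]  # permute HFs as in (9)
        Zp = mask_features(Xp, S)                            # dual sample via (1)
        p = one_split_pvalue(loss_fn(f_hat(Xp), y_inf),
                             loss_fn(g_hat(Zp), y_inf), rho, rng)
        rejections += int(p <= alpha)
    return rejections / T
```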

Overall Computational Cost:

Algorithm 1 summarizes the computational scheme of the one-split test. For the noncombined test in Algorithm 1, the data-adaptive scheme usually requires two to three rounds of training and evaluation since the loop tuning (ζ, ρ) usually terminates in one or two iterations. For the combined test, the data-adaptive scheme based on five random splits usually requires about ten rounds of training and evaluation. The running times of the proposed tests are reported in Tables III and B.1 in Appendix B in the Supplementary Material.

TABLE III.

Empirical Type I/II Errors and Runtime of the One-/Two-Split Tests and Their Combined Tests in Example 1 at α=0.05

Splitting method Test Sample size Type I error Type II error Time (Second)

Log-ratio One-split 2000 0.004 (0.78, 0.12, 0.08) 8.1(0.4)
6000 0.004 (0.58, 0.00, 0.00) 9.6(0.6)
10000 0.010 (0.55, 0.00, 0.00) 11.7(0.4)

Two-split 2000 0.026 (0.89, 0.66, 0.65) 8.4(0.4)
6000 0.036 (0.91, 0.55, 0.58) 9.7(0.5)
10000 0.034 (0.84, 0.54, 0.57) 11.4(0.2)

Comb, one-split 2000 0.016 (0.76, 0.05, 0.05) 42.1(1.6)
6000 0.012 (0.49, 0.00, 0.00) 45.6(1.3)
10000 0.010 (0.33, 0.00, 0.00) 56.3(0.8)

Comb, two-split 2000 0.018 (0.90, 0.70, 0.68) 40.8(1.6)
6000 0.024 (0.91, 0.51, 0.53) 45.5(1.2)
10000 0.018 (0.92, 0.59, 0.58) 56.5(1.1)

Data-adaptive One-split 2000 0.043 (0.75, 0.21, 0.15) 15.2(0.1)
6000 0.050 (0.39, 0.01, 0.00) 41.2(0.3)
10000 0.049 (0.11, 0.00, 0.00) 66.0(0.4)

Two-split 2000 0.050 (0.89, 0.74, 0.69) 14.0(0.1)
6000 0.035 (0.82, 0.49, 0.42) 37.0(0.2)
10000 0.040 (0.81, 0.23, 0.25) 61.6(0.4)

Comb, one-split 2000 0.034 (0.74, 0.00, 0.05) 37.9(0.1)
6000 0.046 (0.14, 0.00, 0.00) 68.3(0.3)
10000 0.045 (0.00, 0.00, 0.00) 107.2(0.7)

Comb, two-split 2000 0.015 (0.91, 0.74, 0.71) 38.0(0.1)
6000 0.030 (0.90, 0.30, 0.35) 76.3(0.5)
10000 0.014 (0.87, 0.07, 0.08) 110.3(0.5)

Algorithm 1.

One-Split Test for Region Significance

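Algorithm 1 appears only as an image in this manuscript version. For readability, a rough end-to-end sketch of the noncombined one-split workflow is given below; fit_model and loss_fn are hypothetical callables (a black-box learner returning a predictor and a per-sample loss), and the default splitting ratio and perturbation size are placeholders rather than the tuned values produced by the schemes of Section IV.

```python
import numpy as np

def one_split_test(X, y, S, fit_model, loss_fn, zeta=0.8, rho=0.1, rng=None):
    """Sketch of the noncombined one-split test: split, mask, fit
    (f_hat, g_hat) on the estimation subset, and evaluate (3) on the
    inference subset (reusing mask_features and one_split_pvalue above)."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(y))
    n = int(zeta * len(y))
    est, inf = idx[:n], idx[n:]
    Z = mask_features(X, S)                      # dual data via (1)
    f_hat = fit_model(X[est], y[est])            # fit on estimation subset
    g_hat = fit_model(Z[est], y[est])            # fit on dual estimation subset
    return one_split_pvalue(loss_fn(f_hat(X[inf]), y[inf]),
                            loss_fn(g_hat(Z[inf]), y[inf]), rho, rng)
```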

V. Numerical Examples

This section examines the proposed tests for their capability of controlling Type I and Type II errors in both simulated and real examples. All tests are implemented in our Python library dnn-inference (https://github.com/statmlben/dnn-inference).

A. Numerical Comparison With Existing Black-Box Tests

This subsection presents a simple example to illustrate the differences between the proposed tests and other existing black-box tests, including the HRT [6], the LOCO [9], the PT [7], [25], and the holdout PT (HPT) [6]. For the PT, we use the scheme of [7] to permute multiple HFs X_𝓢 and refit the model, with a permutation size of 100. Algorithm 2 in Appendix B in the Supplementary Material summarizes the procedure for the PT. Note that we exclude the CRT here due to its enormously expensive computation from refitting a model many times.

To alleviate the high computational cost of refitting, the HPT splits the data into a training sample and a test sample. It then fits the model only once on the training data and performs the PT over the test sample with the trained model. In our context, we extend the HPT in [6] by simultaneously permuting multiple HFs X_𝓢.

One issue with the PT and HPT is that permuting the HFs usually alters the dependence structure between X_𝓢 and X_{𝓢^c}. As a result, the sampling distribution based on permuted samples may differ from the null distribution. For example, the simulated example in Appendix B.2 in the Supplementary Material indicates that both the HPT and PT lead to dramatically inflated Type I errors.

In this section, we generate a random sample of size N = 1000. First, X = (X_1,…,X_5) follows a uniform distribution on [−1,1] with pairwise correlations ρ_{ij} = 0.5^{|i−j|}, i, j = 1,…,5. Second, the outcome Y is generated as Y = 0.02(X_1 + X_2 + X_3) + 0.05ϵ, where ϵ ∼ N(0,1).
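One way to generate the correlated uniform features above is through a Gaussian copula; the sketch below is an assumption on our part (the paper does not specify the construction), and the induced correlations only approximate 0.5^{|i−j|}.

```python
import numpy as np
from scipy.stats import norm

def simulate_toy_data(N=1000, rho=0.5, rng=None):
    """Gaussian-copula sketch of the toy design in Section V-A:
    uniform features on [-1, 1] with approximate correlation rho^|i-j|
    and Y = 0.02*(X1 + X2 + X3) + 0.05*eps."""
    rng = np.random.default_rng(rng)
    d = 5
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
    G = rng.multivariate_normal(np.zeros(d), Sigma, size=N)
    X = 2.0 * norm.cdf(G) - 1.0                        # map to [-1, 1]
    y = 0.02 * X[:, :3].sum(axis=1) + 0.05 * rng.standard_normal(N)
    return X, y
```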

A simulation study is performed for the one-split and two-split tests, HRT, LOCO, and HPT. For the HRT, we use the code of [6] available on GitHub with a default mixture density network with two components. For the other methods, we fit a linear function by stochastic gradient descent (SGD) with the same fitting parameters, that is, epochs is 100, batch_size is 32, and early stopping with validation_split of 0.2 and patience of 10, where patience is the number of epochs until termination if no progress is made on the validation set. For the HRT, LOCO, HPT, and PT, the sample splitting ratio is fixed at 0.8, and the data-adaptive scheme is used for the proposed tests.

The returned values are summarized in Table II: the one-split and two-split tests return valid p-values for the hypothesis in (2) with 𝓢 = {1,2,3}; the HRT and LOCO return p-values for individual features, testing conditional independence and error invariance on a given dataset, respectively; and the PT and HPT provide p-values for marginal independence. Therefore, the proposed tests are the only ones targeting the specified null hypothesis in (2).

TABLE II.

Returned Values of the One-Split/Two-Split Tests and Other Existing Black-Box Tests. Here, One-Split, Two-Split, HRT, LOCO, PT, and HPT Denote the Proposed Tests in Algorithm 1 and in Appendix A in the Supplementary Material, the HRT [6], the LOCO [9], the PT, and the HPT [6]

Test             Return                   H0                                                      p-value(s)

One-split        p-value                  risk invariance R(f) = R_𝓢(g)                           0.003
Two-split        p-value                  risk invariance R(f) = R_𝓢(g)                           0.018
HRT              p-values for all feats   conditional indep. X_j ⫫ Y | X_{−j}                     (0.840, 0.045, 0.064, 0.900, 0.158)
LOCO             p-values for all feats   equal errors with/without feat. j for a given dataset   (0.132, 0.791, 0.180, 0.435, 0.342)
PT               p-value                  marginal indep. X_𝓢 ⫫ Y                                 0.010
HPT              p-value                  marginal indep. X_𝓢 ⫫ Y                                 0.001

B. Simulations

Consider a nonparametric regression model

Y = f(X) + ϵ,   ϵ ∼ N(0,1)  (12)

where f(x) is an unknown function on x ∈ [−1,1]^d. It is known that f(x) = g(z) depends only on a subset of features of x, in which z_{𝓢_0} = 0 and z_{𝓢_0^c} = x_{𝓢_0^c} with 𝓢_0 = {1,…,|𝓢_0|}. Given a hypothesized index set 𝓢, our goal is to test whether X_𝓢 is relevant to predicting the outcome Y, as specified in (2).

For illustration, we set the regression function as a neural network f(x) = A(W_L A(W_{L−1} ⋯ A(W_1 x))), where A(·) is the ReLU activation function, W_l = (w_{l,ij}) ∈ ℝ^{d_l × d_{l−1}} is a weight matrix, ‖w_{l,j}‖_2 = τ/d_{l−1}^{1/2}, w_{l,j} is the jth column of W_l, τ > 0 is a constant, d_l is the width of the lth layer with d_0 = d, d_L = 1, d_1 = ⋯ = d_{L−1} = ϖ, and L is the depth of the network. Clearly, f ∈ 𝓗, where 𝓗 is a candidate class defined as

𝓗 = {f(x) = A(W_L A(W_{L−1} ⋯ A(W_1 x))) : ‖W_l‖_2 ≤ τ, ‖W_l‖_{2,1} ≤ τ}.

We perform simulations under the model in (12), where X ∼ N(0, BΣ), Σ_{ij} = r^{|i−j|}, r ∈ [0,1) represents the correlation coefficient of the features, B controls the magnitude of the features, (L, ϖ, τ) denotes the depth, width, and L2-norm of the neural network, 𝓢_0 = {1,…,|𝓢_0|} is the index set of the true nondiscriminative features, and 𝓢 is the index set of HFs.
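For reference, the ground-truth network can be instantiated as follows; drawing the weight entries from a Gaussian before rescaling each column to norm τ/√d_{l−1} is our own assumption, since the text only specifies the column norms.

```python
import numpy as np

def make_relu_truth(d, width, depth, tau=1.0, rng=None):
    """Sketch of the ground-truth ReLU network f(x) = A(W_L ... A(W_1 x)),
    with each weight column rescaled to L2-norm tau / sqrt(d_{l-1})."""
    rng = np.random.default_rng(rng)
    dims = [d] + [width] * (depth - 1) + [1]
    Ws = []
    for l in range(depth):
        W = rng.standard_normal((dims[l + 1], dims[l]))
        W *= tau / (np.sqrt(dims[l]) * np.linalg.norm(W, axis=0, keepdims=True))
        Ws.append(W)

    def f(x):                           # x has shape (N, d)
        h = x.T
        for W in Ws:
            h = np.maximum(W @ h, 0.0)  # ReLU at every layer
        return h.ravel()
    return f
```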

For hypotheses in (2), we examine four index sets of HFs 𝓢.

  1. 𝓢 = {1,…,|𝓢_0|}.

  2. 𝓢 = {|𝓢_0|/2,…,|𝓢_0|/2 + |𝓢_0|}.

  3. 𝓢 = {p/2,…,p/2 + |𝓢_0|}.

  4. 𝓢 = {p − |𝓢_0|,…,p}.

These four sets are illustrated in Fig. 2. Note that 𝓢 ∩ 𝓢_0 = 𝓢_0 in 1), implying that it is for Type I error analysis, while 𝓢 ∩ 𝓢_0 ≠ 𝓢_0 in 2)–4), corresponding to Type II error analysis. From 2) to 4), the distance between the HFs 𝓢 and the nondiscriminative features in 𝓢_0 increases (and their correlation decreases); thus, the Type II error is expected to go down. Seven examples are considered for the HF sets 1)–4).

Fig. 2.

Illustration of four index sets of HFs in simulations: (i) type I error analysis, (ii)–(iv): type II error analysis. Note that the impact of the HFs 𝓢 on 𝓢0 decreases, while the Type II error is expected to decrease from (ii) to (iv).

Example 1 (Impact of the Sample Size and Splitting Method):

This example (Table III) concerns the performance of the proposed tests in relation to the sample size N based on log-ratio and data-adaptive splitting methods, where N ranges from 2000 to 10000, B=0.4, r=0.25, p=100, ϖ=128, L=3, and |𝓢0|=5.

Example 2 (Impact of the Strength of Features of Interest):

This example (Table IV) concerns the performance of the proposed tests with respect to the magnitude of features B, where B=0.2,0.4,0.6, N=6000, p=100, r=0.25, ϖ=128, L=3, and |𝓢0|=5. The data-adaptive tuning scheme is applied for this example.

TABLE IV.

Empirical Type I/II Errors of the (Combined) One-Split and Two-Split Tests in Example 2

Test B Type I error Type II error

One-split 0.2 0.057 (0.76, 0.32, 0.12)
0.4 0.050 (0.29, 0.01, 0.00)
0.6 0.057 (0.03, 0.00, 0.00)

Two-split 0.2 0.049 (0.94, 0.88, 0.86)
0.4 0.035 (0.82, 0.49, 0.42)
0.6 0.041 (0.63, 0.03, 0.02)

Comb, one-split 0.2 0.027 (0.73, 0.07, 0.07)
0.4 0.046 (0.14, 0.00, 0.00)
0.6 0.033 (0.00, 0.00, 0.00)

Comb, two-split 0.2 0.019 (1.00, 1.00, 0.97)
0.4 0.030 (0.90, 0.30, 0.35)
0.6 0.012 (0.55, 0.00, 0.00)

Example 3 (Impact of the Depth and Width of a Neural Network):

This example (Table B.1 in Appendix B in the Supplementary Material) concerns the performance of the proposed tests in terms of the width ϖ and depth L of a neural network, where N=6000, L=2,3,4, ϖ=32,64,128, B=0.4, r=0.25, p=100, and |𝓢0|=5.

Example 4 (Impact of the Number of Hypothesized Features):

This example (Table B.2 in Appendix B in the Supplementary Material) concerns the proposed tests with respect to the number of HFs |𝓢0|, where |𝓢0|=5,10,15, N=6000, B=0.4, p=100, ϖ=128, r=0.25 and L=3.

Example 5 (Impact of Feature Correlations):

This example (Table B.3 in Appendix B in the Supplementary Material) concerns the proposed tests in terms of the feature correlation r, where r=0.00,0.25,0.50, N=6000, B=0.4, p=100, ϖ=128, and L=3.

Example 6 (Impact of Different Modes of Combining P-Values):

This example (Table B.4 in Appendix B in the Supplementary Material) concerns the combined tests with different ways of combining p-values. Type I/II errors are examined in two simulated settings: 1) N=6000, r=0.25, B=0.2, L=3, and ϖ=128; 2) N=6000, r=0.25, B=0.4, L=4, and ϖ=32.

Example 7 (Impact of Over/Under-Parameterized Models):

This example (Table V) concerns the impact of over-/under-parameterized learning models on the proposed tests. Specifically, we set the ground-truth function f as a neural network with ϖ=64 and L=3, and consider both under-parameterized and over-parameterized learning models with ϖ=32, 64, 128.

TABLE V.

Empirical Type I/II Errors of the (Combined) One-/Two-Split Tests in Example 7 Based on ϖ_0=64 (Width of the True Model) and Different ϖ (Width of the Learning Model)

Test ϖ Type I error Type II error

One-split 32 0.067 (0.80, 0.36, 0.37)
64 0.025 (0.81, 0.32, 0.29)
128 0.017 (0.80, 0.28, 0.26)

Two-split 32 0.017 (0.97, 0.94, 0.93)
64 0.020 (0.97, 0.94, 0.93)
128 0.033 (0.96, 0.93, 0.93)

Comb, one-split 32 0.140 (0.58, 0.16, 0.11)
64 0.030 (0.81, 0.17, 0.14)
128 0.013 (0.85, 0.18, 0.16)

Comb, two-split 32 0.013 (0.96, 0.93, 0.91)
64 0.027 (0.96, 0.93, 0.94)
128 0.007 (0.97, 0.95, 0.97)

For a test’s Type I and II errors, we compute the proportions of its rejecting H0 out of 1000 simulations under H0 and out of 100 simulations under Ha, respectively.

When implementing the log-ratio splitting scheme, (n, m) is determined by (8) with N_0=1000, and ρ=0.01; for the data-adaptive scheme, the grid of ζ is {0.2, 0.4, 0.6, 0.8}, and the grid for searching the optimal perturbation size is {0.01, 0.05, 0.1, 0.5, 1.0}. For the combined tests, the number of repeated random splits is set to 5. The hyperparameters for fitting a neural network are the same as in Section V-A.

1). Type I/II Errors of the (Combined) One-Split/Two-Split Tests:

As indicated in Tables III and IV and Tables B.1–B.5 in the Supplementary Material, the one-split/two-split tests perform well in all examples with respect to controlling Type I/II errors. In particular, Type I errors are close to the nominal level α=0.05, whereas Type II errors decrease to 0 as the sample size N increases. As expected, the one-split test outperforms the two-split test in terms of Type II error, which agrees with Theorem 4 and Theorem A.3 in Appendix A in the Supplementary Material. The combined tests consistently improve the performance in terms of both Type I/II errors.

2). Runtime:

The combined tests may double the runtime of their noncombined counterparts under the data-adaptive tuning scheme. This result suggests that the one-split/two-split tests and their combined versions are practically feasible for black-box testing subject to computational constraints, as in applying deep neural networks to large data.

3). Combining P-Values:

As suggested by Table B.4 in the Appendix in the Supplementary Material, the Hommel combining method controls the Type I error while retaining reasonably good power in reducing the Type II error. The Bonferroni and Cauchy methods fail to control the Type I error, whereas the other combining methods are conservative in the first case of Example 6.

4). Over-/Under-Parameterized Models:

As suggested by Table V, an under-parameterized model (ϖ=32) has inflated Type I errors, which agrees with the discussion of Assumption A in Section II-B. An over-parameterized model (ϖ=128) is able to control the Type I error and yields power (Type II error) similar to that of the correctly specified model (with exactly the same network structure as the ground truth), partially because early stopping acts as regularization for over-parameterized models. For the (combined) two-split test, both the over- and under-parameterized models perform similarly to the correctly specified model; one plausible explanation for the absence of Type I error inflation is that the two-split test is conservative in finite samples.

We summarize the advantages of the different tests and the combining/tuning methods in Table VI.

TABLE VI.

Advantage for Different Tests, Combining, and Tuning Methods

Category   Option          Advantage                                            Evidence
Test       One-split       More powerful                                        Tables III, IV, and B.1–B.4 in Appendix B
           Two-split       No need to perturb data                              Appendix A
Combining  Comb.           More powerful                                        Tables III, IV, and B.1–B.4 in Appendix B
           Non-comb.       Less computation time                                Table III
Ratio      Data-adaptive   More powerful                                        Tables III, IV, and B.1–B.4 in Appendix B
           Log-ratio       No need to tune the ratio; less computation time     Lemma 6, Table III

C. One-Split Test and Perturbation

Consider the regression model in (12), where 𝓢_0 = {1, 2, 3}, X ∼ N(0, BΣ) with Σ_{1j} = Σ_{j1} = 0.1 for j = 1,…,p, and Σ_{ij} = 0 if i, j ≠ 1 and i ≠ j. In this case, let 𝓢 = 𝓢_0; then H_0 is true at the population level. Furthermore, only partial features are observed in a dataset {(x_i^(N), y_i^(N))}_{i=1}^N, where x_i^(N) = (x_{i1},…,x_{i d_N}), y_i^(N) is generated as y_i^(N) = f(x̃_i^(N)) + ϵ_i, d_N ≤ d is the number of observed features with d_N → d as N → ∞, and x̃_i^(N) = (x_{i1},…,x_{i d_N}, 0,…,0) is a d-dimensional dummy vector.

Then, we simulate datasets {(x_i^(N), y_i^(N))}_{i=1}^N with d = 100, d_N = d(1 − 1/log(N)), and N = 2000, 6000, 10000. For implementation, we set ζ = 0.2 for the one-split and two-split tests and ρ = 1.0 for the one-split test; the fitting parameters are as before. The Type I errors of the two-split test and of the one-split test with and without perturbation are reported in Table VII.

As indicated in Table VII, the two-split test and the one-split test with perturbation approximately control the Type I error across all settings, whereas the one-split test without perturbation has inflated Type I errors significantly exceeding the nominal level α = 0.05.

VI. Real Application

A. MNIST Handwritten Digits

This subsection applies the proposed tests to the MNIST handwritten digits dataset [10]. The MNIST dataset is a standard benchmark for explainable artificial intelligence (XAI) methods [26], in part because the detection results can easily be evaluated by human visual intuition. In particular, we extract 14251 images with labels '7' and '9' to discriminate between these two digits. Our primary goal is to test whether certain image features differentiate digit '7' from digit '9', where a marked region of an image specifies the HFs.

In this application, we consider three different types of hypothesized (masked) regions (HRs), as displayed in Fig. 3.

Fig. 3.

HRs in Cases 1–3 for differentiating digits 7 and 9 in Section VI-A. Case 1: the HR is (19:28, 13:20), for which H_0 is true; Case 2: the HR is (21:28, 4:13), for which H_0 is true; Case 3: the HR is (7:16, 9:16), for which H_a is true. The p-values at the top are given by the one-split test.

To proceed, we specify the underlying model as the default convolutional neural network (CNN) provided by Keras for the MNIST dataset. Finally, we apply the one-split test, the two-split test, and their combined tests based on the data-adaptive tuning scheme at a significance level of α=0.05.

As suggested by Table VIII, the (combined) one-/two-split tests all fail to reject H_0 when it is true in Cases 1 and 2, but all reject H_0 in Case 3 when it is false. Overall, the test results confirm our intuition that the HRs in Cases 1 and 2 are visually indistinguishable, whereas that in Case 3 is visually discriminative, as illustrated in Fig. 3.

TABLE VIII.

P-Values of (Combined) One-/Two-Split Tests in the MNIST Dataset. Significant P-Values for Testing Feature Irrelevance are Underlined at a Nominal Level α=0.05

Test p-values (cases 1–3)

One-split (1.74e-1, 3.29e-1, 1.37e-13)
Two-split (9.59e-1, 5.69e-1, 1.10e-05)
Comb, one-split (3.85e-1, 1.00e-0, 4.43e-18)
Comb, two-split (5.44e-1, 1.92e-1, 2.25e-09)

B. Mechanisms of Action Prediction for New Drugs

This subsection applies the proposed tests to examine the significance of "treatment," "gene expression," and "cell viability" features for mechanisms of action (MoA) prediction of new drugs. The dataset consists of 23814 drug–MoA annotation pairs with three types of features ("treatment," "gene expression," and "cell viability") and 207 binary labels indicating multiple targets of MoA responses, as illustrated in Fig. 4. Specifically, "treatment" includes "treatment duration" (continuous) and "treatment dose" (categorical); "gene expression" and "cell viability" include 773 gene expression features (continuous) and 100 human cell responses to drugs (continuous) [27], [28], respectively.

Fig. 4.

Features (treatment features, gene expression, and cell viability) and targets in MoA dataset. Three cases with three different types of HFs are considered. Case 1 (Treatment): HFs are “treatment duration” and “treatment dose;” Case 2 (Gene): HFs are “g-0”–“g-772;” and Case 3 (Cell): HFs are “c-0”–“c-99.”

In this application, we consider the significance of those three types of feature sets, as displayed in Fig. 4.

For implementation, we use TabNet [29] as the predictive model for the proposed tests. The results are summarized in Table IX, which indicates that all tests fail to reject H_0 at α=0.05 for the "gene expression" features. For Cases 1 and 3, all tests consistently reject H_0, identifying "treatment" and "cell viability" as significant feature sets for MoA prediction.

TABLE IX.

P-Values of (Combined) One-/Two-Split Tests in the MoA Prediction Dataset

Test p-values (cases 1–3) (‘treatment’, ‘gene exp’, ‘cell viability’)

One-split (1.42e-2, 1.34e-1, 9.69e-4)
Two-split (2.17e-2, 2.52e-1, 3.19e-4)
Comb, one-split (4.72e-2, 3.81e-1, 1.02e-3)
Comb, two-split (1.13e-3, 1.01e-1, 1.20e-5)

C. Chest X-Rays for Pneumonia Diagnosis

This subsection illustrates the application of the proposed tests to chest X-ray images in a pneumonia diagnosis dataset [30]. This dataset consists of 5863 X-ray images, each labeled as "Pneumonia" or "Normal." To proceed, we crop each image to a version that focuses on the lung fields, based on DeepXR, using a square cropping region that retains the important areas containing the parenchymal and retrocardiac anatomy.

For implementation, we specify the learning model as a CNN and apply the one-split test, the two-split test, and their combined tests based on the data-adaptive tuning scheme at a significance level of α=0.05. Similarly, we consider three different types of HRs, as displayed in Fig. 5.

Fig. 5.

HRs in Cases 1–3 for discriminating "Normal" (first row) versus "Pneumonia" (second row) X-ray images in Section VI-C. Case 1: the HR is (50:200, 20:110), for which H_0 is likely to be false; Case 2: the HR is (50:200, 100:150), for which H_0 is likely to be true; Case 3: the HR is (50:200, 150:240), for which H_0 is likely to be false. The p-values at the top are given by the one-split test.

As suggested by Table X, all tests fail to reject H_0 at α=0.05 in Case 2, where H_0 is likely to be true. For Cases 1 and 3, only the (combined) one-split test rejects H_0 in both cases, whereas the other tests fail to do so even though H_0 is likely to be false. In agreement with the earlier results, the one-split test appears more powerful in detecting a discriminative region.

TABLE X.

P-Values of the (Combined) One-/Two-Split Tests in the Chest X-Ray Dataset

Test p-values (cases 1–3) (‘left lung’, ‘null region’, ‘right lung’)

One-split (2.61e-2, 9.95e-1, 2.12e-2)
Two-split (2.12e-1, 5.61e-1, 6.51e-2)
Comb, one-split (4.14e-2, 6.35e-1, 7.52e-2)
Comb, two-split (5.36e-2, 7.54e-1, 8.37e-2)

D. Significance of Keypoints to Facial Expression Recognition

This section examines the significance of five keypoints (left eye, right eye, eyes, nose, and mouth) for recognizing seven facial expressions ("angry," "disgust," "fear," "happy," "sad," "surprise," and "neutral") on the FER2013 dataset, which consists of 48 × 48 pixel grayscale facial images. The facial images have been automatically registered, and each image is labeled with one of the seven expressions. Given a facial image, we produce the keypoints with the existing facial landmark detection libraries dlib and open-cv. The primary goal is to assess the significance of the keypoints for facial expression recognition.

After preprocessing, we obtain 11709 triples of images, labels, and keypoints. A scatter plot of the keypoints is provided in Fig. 6, from which we consider five collections of HRs corresponding to the five keypoints: left eye, right eye, eyes, nose, and mouth, respectively. Illustrative examples of the HRs are displayed in Fig. 7.

Fig. 6.

Scatter plot of the keypoints (left eye, right eye, nose, and mouth) in the FER2013 facial expression recognition dataset, showing that the HRs in Cases 1–4 cover the corresponding keypoints in most faces.

Fig. 7.

HRs in Cases 1–5 for discriminating seven facial expressions in rows (including "angry," "disgust," "fear," "happy," "sad," "surprise," and "neutral"). Case 1 (Left eye): the HR is (14:22, 9:22); Case 2 (Right eye): the HR is (14:22, 28:41); Case 3 (Eyes): the HR is (14:22, 9:22 ∪ 28:41); Case 4 (Nose): the HR is (24:32, 20:29); and Case 5 (Mouth): the HR is (34:45, 18:30). The p-values at the top are given by the combined one-split test.

For implementation, we use the same VGG deep neural network and the same training hyperparameters as in [31]. Note that the VGG network adopted in [31] is one of the state-of-the-art facial expression recognition methods (ranked 4th) on the FER2013 Papers-with-Code leaderboard [32].

As suggested by Table XI, all tests fail to reject H_0 in Cases 1, 2, and 4. For Cases 1 and 2, this is partly because the predictive information in the left/right eye is mirrored in the other eye. For Case 4, the result confirms the visual intuition that the nose is not a discriminative keypoint for facial expression. For Cases 3 and 5, all tests consistently reject H_0, suggesting that the eyes and mouth are discriminative regions, which is visually confirmed by the illustrative samples in Fig. 7. Note that the proposed tests are equally applicable to more substantial computer vision applications, for which the testing results could provide instructive information for visual sensor management and construction.

TABLE XI.

P-Values of the (Combined) One-/Two-Split Tests in the FER2013 Dataset Based on Five Keypoints: Left Eye, Right Eye, Eyes, Nose, and Mouth

Test p-values (cases 1–5) (‘left eye’, ‘right eye’, ‘eyes’, ‘nose’, ‘mouth’)

One-split (1.58e-1, 3.55e-1, 6.14e-3, 6.89e-1, 1.27e-3)
Two-split (9.25e-2, 2.87e-1, 1.75e-2, 2.86e-1, 3.46e-2)
Comb, one-split (6.03e-1, 1.43e-1, 7.88e-5, 8.23e-1, 1.09e-7)
Comb, two-split (5.91e-2, 6.81e-2, 4.42e-2, 1.33e-1, 2.29e-2)

E. Evaluating Significance of Localization in CIFAR100

Note that the proposed methods are equally applicable to significance tests with instance-adaptive HFs. Therefore, they can be used to evaluate the effectiveness of discriminative localization methods, such as class activation maps (CAM) [33] and Grad-CAM [34]. In this section, we demonstrate a significance test on the CIFAR100 dataset based on adaptive HFs localized by Grad-CAM.

Specifically, on the training set, we apply Grad-CAM to a fitted AlexNet to produce importance heatmaps of the features/pixels of all images; see six demonstrative examples in Fig. 8. Then, four cases of HFs are constructed by taking the top-5%, top-10%, top-15%, and top-30% most important features as HFs. Next, the proposed one-/two-split tests are conducted on the test set with a ResNet50 network, and the resulting p-values are summarized in Table XII.
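For illustration, the instance-adaptive HFs can be obtained by thresholding a Grad-CAM heatmap at its top fraction of pixels; topk_hf_mask is our own helper, and the selected pixels would then be overwritten with the constant M via (1) to form the dual image.

```python
import numpy as np

def topk_hf_mask(heatmap, frac=0.05):
    """Boolean mask of the top `frac` fraction of pixels ranked by a
    Grad-CAM importance heatmap (instance-adaptive HFs)."""
    k = max(1, int(round(frac * heatmap.size)))
    thresh = np.partition(heatmap.ravel(), -k)[-k]
    return heatmap >= thresh

# Illustrative usage: hf = topk_hf_mask(cam, 0.10); dual = img.copy(); dual[hf] = 0
```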

Fig. 8.

Demonstrative adaptive HRs in CIFAR100 dataset, localized by Grad-CAM. Cases 1–4: the percentages of HFs are 5%, 10%, 15%, and 30% in rows from top to bottom, corresponding to the top important features ranked by Grad-CAM localization heatmaps.

TABLE XII.

P-Values of the (Combined) One-/Two-Split Tests in the CIFAR100 Dataset. The Percentages of HFs Are 5%, 10%, 15%, and 30%, Corresponding to the Top Important Features Ranked by Grad-CAM Localization Heatmaps

Test p-values (cases 1–4) (Top-5%, Top-10%, Top-15%, Top-30%)

One-split (3.13e-1, 6.44e-3, 5.33e-3, 4.25e-8)
Two-split (9.04e-1, 2.58e-1, 7.90e-1, 2.59e-4)
Comb, one-split (5.56e-2, 4.08e-3, 1.92e-5, 1.12e-7)
Comb, two-split (5.81e-1, 1.59e-1, 2.20e-2, 9.68e-5)

Overall, the test results confirm our intuition; the inconsistent results in Cases 2 and 3 between the one-split and two-split tests may be caused by the power loss of the two-split test. It is worth mentioning that the sequence of pairs (top hypothesized/localized features, p-values) produced by the proposed tests can serve as an evaluation of the effectiveness of a localization method.

F. Significance of Keywords in Sentiment Analysis

This section examines the significance of keywords in sentiment classification based on the IMDB dataset [35], which provides 50000 highly polar movie reviews for binary sentiment classification. We also obtain lists of positive, negative, and neutral opinion words from [36]. In this application, we apply the proposed tests to examine the significance of positive/negative/neutral words for sentiment analysis. For illustration, we report the results based on the 350 most frequent positive- and negative-sentiment words and 350 randomly selected neutral-sentiment words in the IMDB dataset.

For implementation, we use a bidirectional LSTM as the prediction model for sentiment classification and apply the one-split/two-split tests and their combined tests based on the log-ratio splitting scheme at a significance level of α=0.05.

Overall, the test results in Table XIII confirm our intuition: the positive- and negative-sentiment words contribute significantly to sentiment analysis, whereas the neutral-sentiment words do not. The inconsistent results in Cases 1 and 2 between the one-split and two-split tests may be caused by the power loss of the two-split test.

TABLE XIII.

P-Values of the (Combined) One-/Two-Split Tests in the IMDB Dataset With HFs as: Case 1: the 350 Most Frequent Positive-Sentiment Words; Case 2: the 350 Most Frequent Negative-Sentiment Words; Case 3: 350 Randomly Selected Neutral-Sentiment Words

Test p-values (cases 1–3) (positive, negative, neutral)

One-split (2.92e-2, 1.20e-3, 3.37e-1)
Two-split (9.61e-2, 1.61e-1, 1.14e-1)
Comb, one-split (2.53e-5, 6.87e-3, 1.29e-1)
Comb, two-split (2.98e-1, 2.24e-1, 6.20e-1)
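
For reference, the "Comb" rows in Tables XII and XIII aggregate p-values across repeated sample splits. The sketch below shows one valid aggregation rule, twice the arithmetic mean of arbitrarily dependent p-values [21]; the authors' exact combiner may differ.

```python
# Illustrative combiner for p-values from repeated random splits of the
# same test: 2 * mean(p-values), capped at 1, is a valid p-value under
# arbitrary dependence (Vovk and Wang [21]).
import numpy as np

def combine_pvalues(pvals):
    """pvals: sequence of p-values from repeated sample splits."""
    return min(1.0, 2.0 * float(np.mean(pvals)))

# Example: combine_pvalues([0.04, 0.12, 0.02]) returns 0.12.
```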

VII. Conclusion

This article proposes two novel risk-invariance tests, the one-split and two-split tests, to assess the impact of a collection of HFs on prediction. Theoretically, we have established the asymptotic null distributions of the test statistics and their consistency in terms of Type I/II errors. Numerically, we have demonstrated the utility of the proposed tests on simulated and real datasets. Next, we summarize some strengths and limitations of the proposed tests.

Strengths:

1) The proposed tests provide a practical inference tool for black-box models on complex data and considerably relax the assumptions made in the existing literature. For example, CRT and HRT require a well-estimated conditional probability model for the features, which is often impractical to obtain.
2) The proposed tests perform general risk-invariance testing on a collection of features of interest, which encompasses the conditional independence test when the log-likelihood loss is used; see the sketch after this list.
3) The proposed tests require only a limited number of model refits, making them suitable for large-scale problems.
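
As a schematic illustration of point 2), the sketch below computes a two-split-style risk-difference z-test with the cross-entropy (log-likelihood) loss. It is written only from the high-level description in this article; the authors' exact statistic, bias correction, and splitting scheme may differ.

```python
# Schematic risk-difference z-test: compare the empirical risk of the full
# model with that of the model refit without the hypothesized features,
# each evaluated on its own disjoint inference subset.
import numpy as np
from scipy.stats import norm

def cross_entropy(p_hat, y, eps=1e-12):
    """Per-sample negative log-likelihood for binary labels y in {0, 1}."""
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def risk_difference_pvalue(loss_full, loss_restricted):
    """loss_full: per-sample losses of the model using all features;
    loss_restricted: losses of the model refit with the HFs removed."""
    diff = loss_restricted.mean() - loss_full.mean()
    se = np.sqrt(loss_full.var(ddof=1) / len(loss_full)
                 + loss_restricted.var(ddof=1) / len(loss_restricted))
    # one-sided test: removing truly relevant HFs inflates the risk
    return 1.0 - norm.cdf(diff / se)
```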

Limitations:

1) The one-split/two-split tests split the original dataset, at the expense of reduced power or an increased Type II error.
2) The log-ratio splitting scheme is conservative in that it prefers a large estimation subset and a small inference subset.

Supplementary Material

supp1-3185742

Acknowledgments

This work was supported in part by NSF under Grant DMS-1712564, Grant DMS-1721216, and Grant DMS-1952539; in part by NIH under Grant R01GM126002, Grant R01AG069895, Grant R01AG065636, Grant R01AG074858, and Grant U01AG073079; and in part by The Chinese University of Hong Kong Faculty of Science Direct Grant.

Biographies


Ben Dai (Member, IEEE) received the B.S. degree in mathematics and applied mathematics from Hangzhou Dianzi University, Hangzhou, China, in 2015, and the Ph.D. degree in data science from The City University of Hong Kong, Hong Kong, in 2019.

He is currently an Assistant Professor with the Department of Statistics, The Chinese University of Hong Kong. His research interests include statistical machine learning, learning theory, statistical XAI, recommender systems, and deep learning.


Xiaotong Shen received the B.S. degree in mathematics from Peking University, Beijing, China, in 1985, and the Ph.D. degree in statistics from the University of Chicago, Chicago, IL, USA, in 1991.

He is currently the John Black Johnston Distinguished Professor with the University of Minnesota, Minneapolis, MN, USA. His research interests include machine learning and data science, high-dimensional inference, nonparametric and semiparametric inference, causal graphical models, recommender systems, and nonconvex minimization.


Dr. Shen is a fellow of the American Association for the Advancement of Science, the American Statistical Association, and the Institute of Mathematical Statistics.

Wei Pan received the B.S. degree in computer engineering and in applied mathematics from Tsinghua University, Beijing, China, in 1989, and the Ph.D. degree in statistics from the University of Wisconsin–Madison, Madison, WI, USA, in 1997. He is currently a Professor with the Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA. His research interests include statistical genetics, bioinformatics, and deep learning.

Dr. Pan is a fellow of the American Statistical Association and the Institute of Mathematical Statistics.

Contributor Information

Ben Dai, Department of Statistics, The Chinese University of Hong Kong, Hong Kong.

Xiaotong Shen, School of Statistics, University of Minnesota, Minneapolis, MN 55455 USA.

Wei Pan, Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455 USA.

References

• [1] Schmidhuber J, "Deep learning in neural networks: An overview," Neural Netw., vol. 61, pp. 85–117, Jan. 2015.
• [2] Fahrmeir L, Kneib T, Lang S, and Marx B, Regression. Berlin, Germany: Springer, 2007.
• [3] King G, Unifying Political Methodology: The Likelihood Theory of Statistical Inference. Cambridge, U.K.: Cambridge Univ. Press, 1989.
• [4] Wasserman L, Ramdas A, and Balakrishnan S, "Universal inference," Proc. Nat. Acad. Sci. USA, vol. 117, no. 29, pp. 16880–16890, 2020.
• [5] Candès E, Fan Y, Janson L, and Lv J, "Panning for gold: 'Model-X' knockoffs for high dimensional controlled variable selection," J. Roy. Stat. Soc. B, Stat. Methodol., vol. 80, no. 3, pp. 551–577, 2018.
• [6] Tansey W, Veitch V, Zhang H, Rabadan R, and Blei DM, "The holdout randomization test: Principled and easy black box feature selection," J. Comput. Graph. Statist., vol. 31, no. 1, pp. 151–162, 2021.
• [7] Ojala M and Garriga GC, "Permutation tests for studying classifier performance," J. Mach. Learn. Res., vol. 11, no. 62, pp. 1833–1863, Jun. 2010.
• [8] Berrett TB, Wang Y, Barber RF, and Samworth RJ, "The conditional permutation test for independence while controlling for confounders," J. Roy. Stat. Soc. B, Stat. Methodol., vol. 82, no. 1, pp. 175–197, Feb. 2020.
• [9] Lei J, G'Sell M, Rinaldo A, Tibshirani RJ, and Wasserman L, "Distribution-free predictive inference for regression," J. Amer. Stat. Assoc., vol. 113, no. 523, pp. 1094–1111, Jul. 2018.
• [10] LeCun Y, Bottou L, Bengio Y, and Haffner P, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
• [11] Rumelhart DE, Hinton GE, and Williams RJ, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, Oct. 1986.
• [12] Chernozhukov V et al., "Double/debiased machine learning for treatment and structural parameters," Econometrics J., vol. 21, no. 1, pp. C1–C68, Feb. 2018.
• [13] Wasserman L and Roeder K, "High dimensional variable selection," Ann. Statist., vol. 37, no. 5A, p. 2178, 2009.
• [14] Buse A, "The likelihood ratio, Wald, and Lagrange multiplier tests: An expository note," Amer. Statistician, vol. 36, no. 3, pp. 153–157, 1982.
• [15] Dodge Y and Commenges D, The Oxford Dictionary of Statistical Terms. London, U.K.: Oxford Univ. Press, 2006.
• [16] Wasserman L, All of Nonparametric Statistics. New York, NY, USA: Springer, 2006.
• [17] Schmidt-Hieber J, "Nonparametric regression using deep neural networks with ReLU activation function," Ann. Statist., vol. 48, no. 4, pp. 1875–1897, 2020.
• [18] Cappé O, Moulines E, and Rydén T, Inference in Hidden Markov Models. New York, NY, USA: Springer, 2006.
• [19] Romano J and DiCiccio C, "Multiple data splitting for testing," Dept. Statist., Stanford Univ., Stanford, CA, USA, Tech. Rep. 2019-03, 2019.
• [20] Meinshausen N, Meier L, and Bühlmann P, "P-values for high-dimensional regression," J. Amer. Stat. Assoc., vol. 104, no. 488, pp. 1671–1681, Dec. 2009.
• [21] Vovk V and Wang R, "Combining p-values via averaging," Biometrika, vol. 107, no. 4, pp. 791–808, 2020.
• [22] Hardy GH et al., Inequalities. Cambridge, U.K.: Cambridge Univ. Press, 1952.
• [23] Hommel G, "Tests of the overall hypothesis for arbitrary dependence structures," Biometrical J., vol. 25, no. 5, pp. 423–430, Jan. 1983.
• [24] Owen DB, "Tables for computing bivariate normal probabilities," Ann. Math. Statist., vol. 27, no. 4, pp. 1075–1090, Dec. 1956.
• [25] Breiman L, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
• [26] Ribeiro MT, Singh S, and Guestrin C, "'Why should I trust you?' Explaining the predictions of any classifier," in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2016, pp. 1135–1144.
• [27] Corsello SM et al., "Discovering the anticancer potential of non-oncology drugs by systematic viability profiling," Nature Cancer, vol. 1, no. 2, pp. 235–248, Feb. 2020.
• [28] Subramanian A et al., "A next generation connectivity map: L1000 platform and the first 1,000,000 profiles," Cell, vol. 171, no. 6, pp. 1437–1452, 2017.
• [29] Arik SO and Pfister T, "TabNet: Attentive interpretable tabular learning," in Proc. AAAI Conf. Artif. Intell., 2021, pp. 6679–6687.
• [30] Kermany DS et al., "Identifying medical diagnoses and treatable diseases by image-based deep learning," Cell, vol. 172, no. 5, pp. 1122–1131, 2018.
• [31] Khaireddin Y and Chen Z, "Facial emotion recognition: State of the art performance on FER2013," 2021, arXiv:2105.03588.
• [32] Pramerdorfer C and Kampel M, "Facial expression recognition using convolutional neural networks: State of the art," 2016, arXiv:1612.02903.
• [33] Zhou B, Khosla A, Lapedriza A, Oliva A, and Torralba A, "Learning deep features for discriminative localization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2921–2929.
• [34] Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, and Batra D, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 618–626.
• [35] Maas A, Daly RE, Pham PT, Huang D, Ng AY, and Potts C, "Learning word vectors for sentiment analysis," in Proc. 49th Annu. Meeting Assoc. Comput. Linguistics, Hum. Lang. Technol., 2011, pp. 142–150.
• [36] Hu M and Liu B, "Mining and summarizing customer reviews," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2004, pp. 168–177.
