Abstract
An exciting recent development is the uptake of deep neural networks in many scientific fields, where the main objective is black-box outcome prediction. Significance testing is a promising way to address the black-box issue and to explore novel scientific insights and interpretations of the decision-making process of a deep learning model. However, testing for a neural network is challenging because of its black-box nature and the unknown limiting distributions of its parameter estimates, while existing methods require strong assumptions or excessive computation. In this article, we derive one-split and two-split tests that relax the assumptions and computational complexity of existing black-box tests and extend to examining the significance of a collection of features of interest in a dataset of a possibly complex type, such as an image. The one-split test estimates and evaluates a black-box model on estimation and inference subsets obtained through sample splitting and data perturbation. The two-split test further splits the inference subset into two but requires no perturbation. We also develop their combined versions by aggregating the p-values over repeated sample splitting. By deflating the bias-sd-ratio, we establish asymptotic null distributions of the test statistics and their consistency in terms of Type II error. Numerically, we demonstrate the utility of the proposed tests on seven simulated examples and six real datasets. Accompanying this article is our Python library dnn-inference (https://dnninference.readthedocs.io/en/latest/) that implements the proposed tests.
Keywords: Adaptive splitting, black-box tests, combining, computational constraints, feature relevance
I. Introduction
DEEP neural networks [1] are a representative class of black-box models, in which the learning process between features and outcomes is usually difficult to track due to the lack of knowledge about the complex hidden patterns inside. The primary goal of deep learning (DL) is to fit a deep neural network for predicting outcomes with high predictive accuracy. Driven by its superior prediction performance [1], scientists seek accountability and interpretability beyond prediction accuracy. In particular, they demand significance tests based on DL to explore novel discoveries of scientific domain knowledge, for example, whether a specific lung region is significantly associated with COVID-19.
Given a dataset, the goal of statistical significance tests is to examine if a collection of features of interest is associated with the outcome. A problem of this kind frequently occurs in classical parametric models or nonblack-box models, for instance, in statistical genetics, such as Alzheimer’s disease (AD) studies, where a gene is routinely examined and tested for AD association based on a linear model. Yet, significance testing based on a black-box model for a more complicated dataset remains understudied. In Section I–A, we discuss the existing methods and their limitations and issues. In Section I–B, we summarize our contributions to highlight the novelty of the proposed methods in addressing the existing issues.
A. Existing Methods and Their Limitations
In the existing literature, inference methods can be categorized into two groups: nonblack-box tests and black-box tests. Nonblack-box tests, such as the Wald test [2] and the likelihood-ratio test [3], [4], perform hypothesis testing for hypothesized features (HFs) (i.e., features of interest) based on the asymptotic distribution of the estimated parameters in a parametric model, such as a linear model. Black-box tests focus on a model-free hypothesis, such as Model-X knockoffs, conditional randomization tests (CRT) [5], holdout randomization test (HRT) [6], permutation test (PT) [7], conditional PT (CPT) [8], and leave-one-covariate-out test (LOCO) [9]. Specifically, Model-X knockoffs conduct variable selection with false discovery rate (FDR) control based on a specified variable importance measure on each of the individual features. CRT, HRT, and CPT examine the independence between the outcome and each feature conditional on the remaining features (at least for their simulations and implementation). PT examines the marginal independence between the outcome and HFs. LOCO introduces the excess prediction error for each feature to measure its importance for a given dataset.
Limitations of Existing Works:
Despite the merits of the methods developed, they have their limitations. 1) First, for nonblack-box tests, it is difficult to derive the asymptotic distribution of the parameter estimates from black-box models, especially for over-parametrized neural networks. Moreover, the explicit feature-parameter correspondence may be lost for a black-box model, such as a convolutional neural network (CNN) [10] using shared weights for spatial pixels, and recurrent neural networks (RNNs) [11] using shared weights for subsequent states. 2) Second, most existing black-box tests focus on variable importance or inference on a single feature, yet simultaneous testing of a collection of features is more desirable in some applications. For example, in image analysis, it is more interesting to examine patterns captured by multiple pixels in a region, where the impact of every single pixel is negligible. 3) Third, CRT, CPT, and HRT rely on a strong assumption that the conditional feature distribution is known or well-estimated, and thus, a test statistic can be constructed based on the generated samples from the null distribution. However, the complete conditionals may not be known or easy to estimate in practice, especially for complex datasets such as images or texts. 4) Finally, PT, CPT, and CRT require massive computing power to refit a model many times, which is infeasible for complex deep neural networks. More detailed discussion and numerical results about the connections and differences between the existing tests and the proposed tests can be found in Sections II–E and V–A.
B. Our Contributions
This article proposes one-split and two-split tests to address the existing issues 1)–4) in Section I–A. Our main contributions are summarized as follows.
To address issues 1) and 2), we propose a flexible risk invariance null hypothesis for a general loss in (2), which measures the impact of a collection of HFs on prediction. Its relation to conditional independence is given in Lemma 1.
To address issues 3) and 4), the one-split and two-split tests bypass the requirement of estimating the conditional distributions of features by testing based on the differenced empirical loss with sample splitting, subject to computational constraints.
We provide a theoretical guarantee of the proposed tests, c.f., Theorems 2 and 4, and Theorems A.1–A.4 in Appendix A in the Supplementary Material. The theory is illustrated by extensive simulations.
We compare the proposed tests with other existing tests and demonstrate their utility on six benchmark datasets with various deep neural networks in Section VI. We develop a Python library dnn-inference to implement the proposed tests.
Overall, the proposed tests relax the assumptions and reduce the computational cost, providing more practical, feasible, and reliable testing for a black-box model on a complex dataset.
This article is structured as follows. Section II introduces the one-split test as well as its combined test. Section III performs Type II error analysis and establishes the consistency of the proposed tests. Section IV develops sample splitting schemes. Section V is devoted to simulation studies, and Section VI presents applications to six real datasets. The Appendix in the Supplementary Material encompasses the two-split test, additional numerical examples, and technical proofs.
II. Proposed Black-Box Tests
In DL, a deep neural network is fit to predict an outcome based on features , and its prediction performance is evaluated by a loss function . Our objective is to test the significance of a subset of features to the prediction of , where is an index set of HFs and with indicating the complement set of . Note that can be a collection of weak features in that none of these features is individually significant to prediction, but collectively they are. For example, in image analysis, the impact of each pixel is negligible, but a pattern of a collection of pixels (e.g., in a region) may instead become significant.
To formulate the proposed hypothesis, we first generate dual data (, ) by replacing by some irrelevant constants as , such as , that is, for
| (1) |
where is an arbitrary deterministic constant. Note that the dual data satisfies that and , and thus, we aim to use differences between (, ) and (, ) to measure the impact of on the prediction of the outcome . To proceed, we introduce the corresponding risks
Then, the differenced risk is defined as to measure the significance of , that is, it compares the best prediction performance with and without the HFs in the presence of the remaining features. Here, and are the optimal prediction functions at the population level.
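To make the masking in (1) concrete, the following is a minimal sketch of constructing dual data for tabular features stored as a NumPy array; the function name make_dual, the argument hf_idx, and the choice of zero as the masking constant are illustrative assumptions and are not part of the dnn-inference API.

```python
import numpy as np

def make_dual(X, hf_idx, mask_value=0.0):
    """Construct dual features by replacing the hypothesized features
    (columns in hf_idx) with an arbitrary deterministic constant, as in (1).
    The outcome is left unchanged."""
    X_dual = X.copy()
    X_dual[:, hf_idx] = mask_value
    return X_dual

# Toy usage: mask features {1, 3} of a 5-feature design.
X = np.random.randn(100, 5)
X_dual = make_dual(X, hf_idx=[1, 3])
```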
To determine if is significantly relevant to the prediction of, consider null and alternative hypotheses
| (2) |
Rejection of suggests that the feature set is relevant to the prediction of . It is emphasized that in (2), the targets are the two true or population-level functions and , instead of their estimates (based on a given sample) as implemented in some existing tests.
In Section II–A, we demonstrate the relation between the proposed hypothesis and the independence hypothesis. More discussion about the differences between the hypotheses in HRT and LOCO can be found in Sections II–E and V–A.
A. Connection to Independence
This subsection illustrates the relationships among the risk invariance hypothesis in (2), marginal independence, and conditional independence; the latter two are defined as:
Lemma 1:
For any loss function, conditional independence implies the proposed risk invariance, that is,
Moreover, if the negative log-likelihood or the cross-entropy is used in (2) as a loss function, then the null hypothesis in (2) is equivalent to conditional independence almost surely under the marginal distribution of , that is,
As suggested by Lemma 1, conditional independence always implies risk invariance, but they can be almost surely equivalent with some particular loss functions. Hence, at any significance level, a rejection of the null hypothesis of risk invariance implies a rejection of the null hypothesis of conditional independence. Yet, such a relationship does not exist for marginal independence. Next, we present three cases with disparate loss functions to illustrate their relationships.
Case 1: (Constant loss): for a constant .
Case 2: (The -loss in regression): for .
Case 3: (The cross-entropy loss in multiclass classification): .
As shown in Fig. 1, conditional independence implies risk invariance in Cases 1 and 2 while they are equivalent in Case 3, as suggested by Lemma 1. In general, conditional independence or risk invariance does not yield marginal independence and vice versa.
Fig. 1.
Three cases illustrate different relations among marginal independence, conditional independence, and risk invariance.
It is worth mentioning that different loss functions can lead to different conclusions, so we interpret a significance test according to the loss function being used. For example, consider the misclassification error (MCE) and the cross-entropy loss for testing the relevance of the HFs. In the presence of the remaining features, the null hypothesis under the cross-entropy loss indicates that the HFs are irrelevant to the conditional distribution of the outcome, yet the null hypothesis under MCE only suggests that the HFs are irrelevant to classification accuracy.
B. One-Split Test
Given a dataset , we first split it into an estimation subset and an inference subset , where is the number of total samples, and are the sample sizes of estimation and inference subsets, and is a splitting ratio. On this ground, the dual estimation subset and the dual inference subset can be generated based on the masking process in (1). The sample splitting is intended to reduce the potential bias and to prevent overfitting, especially for an over-parametrized black-box model, and has been considered elsewhere for a different purpose in [4], [12], [13].
To assess the null hypothesis in (2), we conduct a two-level estimation of the differenced risk, that is, using the (dual) estimation subset to empirically estimate the predictive models (, ), and the (dual) inference subset to empirically estimate the two risks and . Specifically, given the (dual) estimation subset, we obtain an estimator (, ) to approximate (, ), for example, by minimizing a regularized empirical loss of a deep neural network based on the (dual) estimation subset. Then, the differenced empirical loss is evaluated on the (dual) inference subset based on the estimator (, ):
As a remark, for flexibility, we do not specify the estimation procedure of (, ); the only explicit condition is summarized in Assumption A, which essentially requires that (, ) is a consistent estimator of (, ).
One difficulty in inference is that, under the null hypothesis, the bias of approximating could dominate its standard error; that is, the ratio of the bias to the standard deviation, called the bias-sd-ratio, could be severely inflated, making the asymptotic distribution of invalid for inference. This aspect is explained in detail in Section II–D. To circumvent this difficulty, we present the one-split test with data perturbation to guard against the potentially inflated bias-sd-ratio by adding an independent noise:
| (3) |
where is the sample standard deviation of given and :
and ; are independent noise, and is the perturbation size. Note that our proposed test is in principle similar to classical hypothesis tests using a single test statistic. For example, if we use the negative log-likelihood as the loss function, it can be regarded as an extension of the likelihood ratio test (LRT) [14] to a black-box model.
According to the asymptotic null distribution in Theorem 2, we calculate the p-value by evaluating the cumulative distribution function of the standard normal distribution at the test statistic in (3).
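The following is a minimal sketch of the one-split statistic in (3), assuming per-sample losses evaluated on the inference subset and its dual, Gaussian perturbation noise, and the sign convention that the p-value equals the standard normal CDF evaluated at the statistic; the names one_split_pvalue, loss_full, and loss_dual are illustrative.

```python
import numpy as np
from scipy.stats import norm

def one_split_pvalue(loss_full, loss_dual, eps=0.1, rng=None):
    """Sketch of the one-split test statistic (3): per-sample loss differences
    on the inference subset are perturbed by independent N(0, eps^2) noise,
    studentized, and mapped to a p-value via the standard normal CDF."""
    rng = np.random.default_rng(rng)
    d = loss_full - loss_dual                    # negative values favor rejecting H0
    d = d + eps * rng.standard_normal(d.shape)   # data perturbation
    m = d.shape[0]
    lam = np.sqrt(m) * d.mean() / d.std(ddof=1)
    return norm.cdf(lam)
```

Here loss_full and loss_dual would be the per-observation losses of the two fitted networks on the inference subset and its dual, respectively.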
Note that is a subsequence of , and as . To derive the asymptotic null distribution of , we make the following assumptions.
Assumption A (Estimation Consistency):
For some constant , (, ) satisfies
| (4) |
where denotes stochastic boundedness [15].
Assumption A concerns the rate of convergence in terms of the differenced regret, where , known as the prediction regret with respect to a loss function of . Note that
| (5) |
which says that the rate is no worse than the least favorable one between the regrets of and . In the literature, the convergence rates for the right-hand of (5) have been extensively investigated. For example, the rate is for nonparametric regression [16], and the rate is for a regularized ReLU neural net [17], where is the degree of smoothness of a -dimensional true regression function. Note that an over-parametrized model may slow the convergence rate , yet an under-parametrized model may violate Assumption A, since the approximation error may not vanish. This fact is supported by Example 7 in Simulation (Section V–B).
Assumption B (Lyapounov Condition for ):
Assume that
for some constant , where is defined in (3), and is the conditional expectation of inference samples given the estimation samples.
Assumption C (Variance Condition for ):
Assume that
where denotes the conditional variance of inference samples given the estimation samples.
Assumptions B and C are used in applying the central limit theorem for triangular arrays [18], and they are verifiable under some mild conditions, c.f., Lemma C.1 in Appendix C in the Supplementary Material.
The asymptotic null distribution for is indicated in Theorem 2.
Theorem 2 (Asymptotic Null Distribution of ):
In addition to Assumptions A-C, if , then under ,
| (6) |
where denotes convergence in distribution.
Theorem 2 says that the proposed test is valid under the splitting condition of . As a result, the estimation/inference splitting ratio needs to be suitably controlled. In Section IV, we propose a “log-ratio” splitting scheme, in which the splitting condition is automatically satisfied, c.f., Lemma 6.
As an alternative, we present the two-split test in Appendix A in the Supplementary Material to address the bias-sd-ratio issue, where we divide inference samples further into two equal subsets for inference, in which no data perturbation is needed.
C. Combining p-Values Over Repeated Random Splitting
Combining p-values via repeated random sample splitting can strengthen the one-split test (3). First, it stabilizes the testing result. Second, it can often empirically compensate for the power loss by combining evidence across different split samples, as illustrated in [19] and [20] and our simulations in Section V. Subsequently, we use the order statistics of the p-values to combine the evidence from different splittings, though we could apply other types of combining, such as the corrected arithmetic and geometric means [21], [22].
Given a splitting ratio, we repeat the random splitting scheme times; that is, each time, we randomly split the original dataset into an estimation/inference subset pair. In practice, cannot be large due to computational constraints and is usually 3–10 for large-scale applications. Then we compute the p-value on the th splitting; , and combine them in two ways: the -order and Hommel's weighted average of the order statistics [23]. Specifically,
| (7) |
where , and is the th order statistic of .
The -order combined test (7) is a generalized Bonferroni test with the Bonferroni correction factor . The Hommel combined test renders robust aggregation and yields a better control of Type I error, where is a normalizing constant.
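As a concrete illustration, the sketch below combines the p-values from U repeated splits, assuming the k-order rule applies a Bonferroni-type correction factor U/k to the k-th order statistic and the Hommel rule takes C_U * min_k (U/k) p_(k) with the normalizing constant C_U = sum_{j<=U} 1/j; both forms are assumptions consistent with the description above, not quotations of the paper's exact formulas in (7).

```python
import numpy as np

def combine_pvalues(pvals, k=None):
    """Combine p-values from U repeated random splits (sketch of (7)).
    Returns the k-order (generalized Bonferroni) and Hommel combinations."""
    p = np.sort(np.asarray(pvals, dtype=float))
    U = p.size
    k = U // 2 + 1 if k is None else k                     # illustrative default
    p_korder = min(1.0, (U / k) * p[k - 1])                # correction factor U/k
    C_U = np.sum(1.0 / np.arange(1, U + 1))                # normalizing constant
    p_hommel = min(1.0, C_U * np.min(U * p / np.arange(1, U + 1)))
    return p_korder, p_hommel

# e.g., combine_pvalues([0.03, 0.20, 0.08, 0.01, 0.12])
```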
In Theorems 3 and 5, we further generalize the result of [23] to control Type I and Type II errors of the proposed tests asymptotically. A computational scheme for the combined tests is summarized in Algorithm 1.
Theorem 3 (Type I Error for the Combined One-Split Test):
Under Assumptions A-C, if , then under , for any and any , the combined one-split test for (3) achieves
where is defined in (7).
D. Role of Data Perturbation
This subsection discusses the role of the data perturbation for the one-split test. Now consider the one-split test without perturbation, that is, in (3) with . Then, we decompose into three terms:
Under , , and is the bias-sd-ratio introduced in Section II–B. Specifically, under , as , as opposed to in Assumption C when . As a result, may not satisfy the assumption of the central limit theorem. Furthermore, may not converge to zero. For example, when and the differenced regret are vanishing in the same order. Thus, the asymptotic null distribution in (6) breaks down since is dominated by .
By comparison, with data perturbation, , . By Assumption A,
which implies that under the splitting condition of . Hence, the asymptotic null distribution of in (6) is valid. Moreover, a “log-ratio” sample splitting scheme is proposed in (8), where the splitting condition is automatically satisfied, as indicated in Lemma 6.
In later simulations (cf. Table VII), we will show numerically that, if no data perturbation is applied in the one-split test, it leads to increasingly inflated Type I errors with larger datasets in a neural network model.
TABLE VII.
Type I Errors of the One-Split Tests With/Without Perturbation (PTB) and the Two-Split Test in Section V–C
| Test | |||
|---|---|---|---|
| One-split without PTB | 0.083 | 0.109 | 0.193 |
| One-split with PTB | 0.057 | 0.053 | 0.061 |
| Two-split | 0.048 | 0.051 | 0.047 |
E. Comparison With Existing Black-Box Tests
The one-split test in (3) has some characteristics that distinguish it from other existing black-box tests, including CRT [5], HRT [6], CPT [8] and LOCO tests [9].
CRT, CPT, and HRT test the conditional independence of a single feature individually, and the LOCO test measures the increase in prediction error due to not using a specified feature in a given dataset. The differences between the proposed tests and other existing tests can be summarized as follows. First, for the CRT, CPT, HRT, and LOCO tests, it is unclear how to test a set of multiple features, which is the target of our tests. Second, the significance of relevance is defined in different ways. The LOCO test conducts a significance test for the estimated model based on a given dataset with the mean absolute error, yet CRT, CPT, HRT, and the proposed tests conduct testing at the population level; that is, the former three examine conditional independence, while the proposed tests focus on the risk invariance specified in (2). Third, CRT, CPT, and HRT require well-estimated conditional probabilities of every feature given the rest, which is often difficult in practice. Finally, the proposed tests are advantageous over CRT and CPT with reduced computational cost by avoiding a large number of model refittings.
III. Type II Error Analysis
This section performs Type II error analysis of the one-split test (3) and its combined version (7).
Consider an alternative hypothesis for . The Type II error of the one-split test and its combined test can be written as
where denotes the probability under , and is the nominal level or level of significance.
Theorems 4 and 5 suggest that the one-split test and its combined test are consistent in that their asymptotic Type II error tends to zero as .
Theorem 4 (Limiting Type II Error of the One-Split Test):
Suppose that the one-split test (3) satisfies Assumptions A-C and , then
where is the -multiplier of the standard normal distribution.
Given the results of Theorem A.3 in Appendix A in the Supplementary Material, we note that the one-split test is more powerful than the two-split test in terms of the asymptotic Type II error.
Theorem 5 (Limiting Type II Error of the Combined Tests):
Suppose that the one-split test (3) satisfies Assumptions A-C and , then for defined as the -order combined test in (7), we have
and for defined as the Hommel combined test (7), we have
where
and is Owen’s function [24].
Note that the upper bound in Theorem 5 can be further improved if the explicit dependency structures of the -values from repeated sample splitting are known.
IV. Sample Splitting
The one-split and two-split tests require the sample splitting ratio to satisfy the splitting condition in order to control the Type I error. In this section, we develop two computing schemes, namely the "log-ratio" and "data-adaptive" tuning schemes, to determine the splitting ratio in addition to the perturbation size for the one-split test.
A. Log-Ratio Sample Splitting Scheme
This subsection proposes a log-ratio splitting scheme to ensure automatically the splitting condition . Specifically, given a sample size , where is a minimal sample size required for the hypothesis testing, the estimation and inference sizes and are obtained by:
| (8) |
where is a solution of (cf. Table I).
TABLE I.
Illustration of Split Sample Sizes (, ) Using the Log-Ratio Splitting Scheme (8) as the Total Sample Size Increases From 2000 to 100 000 While the Minimal Sample Size Is Fixed
| Total sample size | 2000 | 5000 | 10000 | 20000 | 50000 | 100000 |
|---|---|---|---|---|---|---|
| Estimation sample size | 1000 | 3807 | 8688 | 18578 | 48439 | 98336 |
| Inference sample size | 1000 | 1193 | 1312 | 1422 | 1561 | 1664 |
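The closed form of (8) is not reproduced here, but the split sizes in Table I are consistent with the inference size solving n_inf = n_0 log(n_est)/log(n_0) with n_est = n − n_inf and a minimal sample size n_0 = 1000; the sketch below solves this fixed-point equation numerically under that assumption (the function name log_ratio_split is illustrative).

```python
import numpy as np
from scipy.optimize import brentq

def log_ratio_split(n, n0=1000):
    """Sketch of the log-ratio splitting scheme (8) under the assumption
    n_inf = n0 * log(n - n_inf) / log(n0); returns (n_est, n_inf)."""
    if n <= 2 * n0:
        return n - n // 2, n // 2                      # minimal split
    f = lambda m: m - n0 * np.log(n - m) / np.log(n0)
    n_inf = int(brentq(f, 1, n - 1))                   # floor of the root
    return n - n_inf, n_inf

# Under this assumption, log_ratio_split(10000) gives (8688, 1312), matching Table I.
```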
Lemma 6:
Suppose that the estimation/inference sample sizes (, ) are determined by the log-ratio sample splitting scheme in (8), then they satisfy the splitting condition for any in Assumption A.
B. Heuristic Data-Adaptive Splitting (Tuning) Scheme
The log-ratio splitting scheme in (8) is relatively conservative in that the inference sample size grows only logarithmically in the estimation sample size. To further increase a test's power, we develop a heuristic data-adaptive tuning scheme as an alternative.
The data-adaptive tuning scheme selects (, ) by controlling the estimated Type I error on permutation datasets. To proceed, we define the permutation on HFs, that is, for :
| (9) |
where is a permutation mapping. Note that the HFs are conditionally independent of the outcome in a permuted sample; in other words, the null hypothesis is true for permutation datasets. On this ground, we use the proportion of rejections over the permutations as an estimate of the Type I error and select (, ) that controls this estimated Type I error. Ideally, refitting and re-evaluation are required for each permutation dataset. To reduce the computational cost, we only fit (, ) based on a permuted estimation subset and estimate the Type I error by re-evaluating the fitted models on repeatedly permuted inference subsets. The detailed procedure is summarized in the following Steps 1–4.
Step 1 (Sample Splitting):
Given a splitting ratio , split the original sample into the estimation and inference samples.
Step 2 (Permutation):
Permute HFs of estimation/inference samples via (9).
Step 3 (Fitting):
Generate the dual estimation subset via (1), and fit (, ) based on (dual) permuted estimation subsets.
Step 4 (Estimate Type I Error):
Permute the inference subset times, and generate the corresponding dual samples via (1). For the fixed estimators (, ), compute the (combined) p-values for each permuted (dual) inference sample under the perturbation size , denoted as .
Then an estimated Type I error is computed as:
| (10) |
The splitting ratio controls the trade-off between Type I and Type II errors. Specifically, a small value yields biased estimators (, ), leading to an inflated Type I error, yet it could reduce the Type II error because of an enlarged inference subset. The perturbation size , as mentioned earlier, controls the bias-sd-ratio to ensure the validity of the asymptotic null distribution.
For the one-split test, the data-adaptive scheme estimates (, ) as the smallest values in some candidate sets that control the estimated Type I error. In the process of searching the candidate sets and , it stops once the termination criterion is met, which is intended to reduce the computational cost. In particular,
| (11) |
where is the estimated Type I error computed via (10), and represent sets of candidate and values.
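A simplified sketch of this permutation-based tuning loop is given below; the callable pvalue_fn, which is assumed to refit the models on the (permuted) estimation subset for a given candidate ratio and return the one-split p-value on a permuted inference subset, as well as the names adaptive_tuning and hf_idx, are hypothetical and only illustrate Steps 1–4 and the early-stopping search in (11).

```python
import numpy as np

def adaptive_tuning(pvalue_fn, X_inf, y_inf, hf_idx, ratios, eps_grid,
                    alpha=0.05, Q=100, seed=0):
    """Sketch of the heuristic data-adaptive tuning: for each candidate
    (ratio, eps), permute the HF columns of the inference subset Q times
    to emulate the null (9), estimate the Type I error as the rejection
    proportion (10), and return the first candidate that controls it."""
    rng = np.random.default_rng(seed)
    for ratio in ratios:
        for eps in eps_grid:
            rejections = 0
            for _ in range(Q):
                Xp = X_inf.copy()
                Xp[:, hf_idx] = rng.permutation(Xp[:, hf_idx], axis=0)
                rejections += pvalue_fn(Xp, y_inf, eps, ratio) < alpha
            if rejections / Q <= alpha:        # estimated Type I error controlled
                return ratio, eps
    return ratios[-1], eps_grid[-1]            # fallback: most conservative candidates
```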
Overall Computational Cost:
Algorithm 1 summarizes the computational scheme of the one-split test. For the noncombining test in Algorithm 1, the data-adaptive scheme usually requires two to three rounds of training and evaluation, since the loop for tuning (, ) usually terminates in one or two iterations. For the combined test, the data-adaptive scheme based on five random splits usually requires about ten rounds of training and evaluation. The running time for the proposed tests is indicated in Tables III and B.1 in Appendix B in the Supplementary Material.
TABLE III.
Empirical Type I/II Errors of the (Combined) One-/Two-Split Tests, and Their Combined Tests in Example 1 at
| Splitting method | Test | Sample size | Type I error | Type II error | Time (Second) |
|---|---|---|---|---|---|
| Log-ratio | One-split | 2000 | 0.004 | (0.78, 0.12, 0.08) | 8.1(0.4) |
| | | 6000 | 0.004 | (0.58, 0.00, 0.00) | 9.6(0.6) |
| | | 10000 | 0.010 | (0.55, 0.00, 0.00) | 11.7(0.4) |
| | Two-split | 2000 | 0.026 | (0.89, 0.66, 0.65) | 8.4(0.4) |
| | | 6000 | 0.036 | (0.91, 0.55, 0.58) | 9.7(0.5) |
| | | 10000 | 0.034 | (0.84, 0.54, 0.57) | 11.4(0.2) |
| | Comb, one-split | 2000 | 0.016 | (0.76, 0.05, 0.05) | 42.1(1.6) |
| | | 6000 | 0.012 | (0.49, 0.00, 0.00) | 45.6(1.3) |
| | | 10000 | 0.010 | (0.33, 0.00, 0.00) | 56.3(0.8) |
| | Comb, two-split | 2000 | 0.018 | (0.90, 0.70, 0.68) | 40.8(1.6) |
| | | 6000 | 0.024 | (0.91, 0.51, 0.53) | 45.5(1.2) |
| | | 10000 | 0.018 | (0.92, 0.59, 0.58) | 56.5(1.1) |
| Data-adaptive | One-split | 2000 | 0.043 | (0.75, 0.21, 0.15) | 15.2(0.1) |
| | | 6000 | 0.050 | (0.39, 0.01, 0.00) | 41.2(0.3) |
| | | 10000 | 0.049 | (0.11, 0.00, 0.00) | 66.0(0.4) |
| | Two-split | 2000 | 0.050 | (0.89, 0.74, 0.69) | 14.0(0.1) |
| | | 6000 | 0.035 | (0.82, 0.49, 0.42) | 37.0(0.2) |
| | | 10000 | 0.040 | (0.81, 0.23, 0.25) | 61.6(0.4) |
| | Comb, one-split | 2000 | 0.034 | (0.74, 0.00, 0.05) | 37.9(0.1) |
| | | 6000 | 0.046 | (0.14, 0.00, 0.00) | 68.3(0.3) |
| | | 10000 | 0.045 | (0.00, 0.00, 0.00) | 107.2(0.7) |
| | Comb, two-split | 2000 | 0.015 | (0.91, 0.74, 0.71) | 38.0(0.1) |
| | | 6000 | 0.030 | (0.90, 0.30, 0.35) | 76.3(0.5) |
| | | 10000 | 0.014 | (0.87, 0.07, 0.08) | 110.3(0.5) |
Algorithm 1.
One-Split Test for Region Significance
V. Numerical Examples
This section examines the proposed tests for their capability of controlling Type I and Type II errors in both simulated and real examples. All tests are implemented in our Python library dnn-inference (https://github.com/statmlben/dnn-inference).
A. Numerical Comparison With Existing Black-Box Tests
This subsection presents a simple example to illustrate the differences between the proposed tests and other existing black-box tests, including the HRT [6], the LOCO [9], the PT [7], [25], and the holdout PT (HPT) [6]. For PT, we use the scheme of [7] to permute multiple HFs , on which we refit the model, and the permutation size is 100. Algorithm 2 in Appendix B in the Supplementary Material summarizes the procedure for the PT. Note that we exclude CRT here due to its enormously expensive computing in refitting a model many times.
To alleviate the high computational cost of refitting, HPT splits the data into a training sample and a test sample. It then fits the model only once on the training data and performs the PT over the test sample with the trained model. In our context, we extend the HPT of [6] by simultaneously permuting multiple HFs .
One issue with the PT and HPT is that permutations of HFs usually alter the dependence structure between and . As a result, the sampling distribution based on permuted samples may differ from the null distribution. For example, the simulated example in Appendix B.2 in the Supplementary Material indicates that both HPT and PT lead to dramatically inflated Type I errors.
In this section, we generate a random sample of size . First, follows a uniform distribution on [−1,1] with a pairwise correlation ; , . Second, the outcome is generated as , where .
A simulation study is performed for the one-split and two-split tests, HRT, LOCO, and HPT. For HRT, we use the code of [6] available on GitHub with a default mixture density network with two components. For the other methods, we fit a linear function based on stochastic gradient descent (SGD) with the same fitting parameters, that is, epochs = 100, batch_size = 32, and early stopping with validation_split = 0.2 and patience = 10, where patience is the number of epochs until termination if no progress is made on the validation set. For HRT, LOCO, HPT, and PT, the sample splitting ratio is fixed as 0.8, and the data-adaptive scheme is used for the proposed tests.
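For concreteness, a plausible Keras setup matching the fitting parameters quoted above (epochs = 100, batch_size = 32, validation_split = 0.2, patience = 10) is sketched below; the mean-squared-error loss, plain SGD optimizer settings, and restore_best_weights flag are assumptions rather than the exact configuration used in the experiments.

```python
from tensorflow import keras

def fit_linear_sgd(X, y):
    """Fit a linear model with SGD and early stopping (sketch)."""
    model = keras.Sequential([
        keras.layers.Input(shape=(X.shape[1],)),
        keras.layers.Dense(1),                      # linear function of the features
    ])
    model.compile(optimizer=keras.optimizers.SGD(), loss="mse")
    es = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
    model.fit(X, y, epochs=100, batch_size=32,
              validation_split=0.2, callbacks=[es], verbose=0)
    return model
```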
The returned values are summarized in Table II: the one-split and two-split tests return valid p-values for the hypothesis in (2), HRT and LOCO return p-values for individual features of conditional independence and error-invariance for a given dataset, respectively, and PT and HPT provide p-values for marginal independence. Therefore, the proposed tests are the only ones targeting the specified null hypothesis in (2).
TABLE II.
Returned Values of the One-Split/Two-Split Tests and Other Existing Black-Box Tests. Here, One-Split, Two-Split, HRT, LOCO, PT, and HPT Denote the Proposed Tests in Algorithm 1 and in Algorithm 1 of Appendix A in the Supplementary Material, the HRT [6], the LOCO [9], the PT, and the HPT [6], Respectively
| Test | Return | Null hypothesis | Result |
|---|---|---|---|
| One-split | p-value | risk-invariance | 0.003 |
| Two-split | p-value | risk-invariance | 0.018 |
| HRT | p-values for all feats | conditional indep | (0.840, 0.045, 0.064, 0.900, 0.158) |
| LOCO | p-values for all feats | equal errors with/without feat for a given dataset | (0.132, 0.791, 0.180, 0.435, 0.342) |
| PT | p-value | marginal indep | 0.010 |
| HPT | p-value | marginal indep | 0.001 |
B. Simulations
Consider a nonparametric regression model
| (12) |
where is an unknown function on . It is known that the regression function depends only on a subset of the features, with the remaining features being nondiscriminative. Given a hypothesized index set , our goal is to test if is relevant to predicting the outcome , as specified in (2).
For illustration, we set the regression function as a neural network , where is the ReLU activation function, is a weight matrix, is the th column of the matrix , is a constant, is the width of the th layer, and is the depth of the network. Clearly, the regression function belongs to a candidate class defined as
We perform simulations under the model in (12), where , , , represents the correlation coefficient of features, controls the magnitude of the features, (, , ) denotes the depth, width, and the -norm of the neural network, is an index set of the true nondiscriminative features, and is an index set of HFs.
For hypotheses in (2), we examine four index sets of HFs .
These four sets are illustrated in Fig. 2. Note that set 1) is for Type I error analysis, while sets 2)–4) are for Type II error analysis. From 2) to 4), the distance (or correlation) between the HFs and the nondiscriminative features is increasing (or decreasing), and thus, the Type II error is expected to go down. Seven examples are considered based on the HF sets 1)–4).
Fig. 2.
Illustration of four index sets of HFs in simulations: (i) type I error analysis, (ii)–(iv): type II error analysis. Note that the impact of the HFs on decreases, while the Type II error is expected to decrease from (ii) to (iv).
Example 1 (Impact of the Sample Size and Splitting Method):
This example (Table III) concerns the performance of the proposed tests in relation to the sample size based on log-ratio and data-adaptive splitting methods, where ranges from 2000 to 10000, , , , , , and .
Example 2 (Impact of the Strength of Features of Interest):
This example (Table IV) concerns the performance of the proposed tests with respect to the magnitude of features B, where , , , , , , and . The data-adaptive tuning scheme is applied for this example.
TABLE IV.
Empirical Type I/II Errors of the (Combined) One-Split and Two-Split Tests in Example 2
| Test | Magnitude | Type I error | Type II error |
|---|---|---|---|
| One-split | 0.2 | 0.057 | (0.76, 0.32, 0.12) |
| | 0.4 | 0.050 | (0.29, 0.01, 0.00) |
| | 0.6 | 0.057 | (0.03, 0.00, 0.00) |
| Two-split | 0.2 | 0.049 | (0.94, 0.88, 0.86) |
| | 0.4 | 0.035 | (0.82, 0.49, 0.42) |
| | 0.6 | 0.041 | (0.63, 0.03, 0.02) |
| Comb, one-split | 0.2 | 0.027 | (0.73, 0.07, 0.07) |
| | 0.4 | 0.046 | (0.14, 0.00, 0.00) |
| | 0.6 | 0.033 | (0.00, 0.00, 0.00) |
| Comb, two-split | 0.2 | 0.019 | (1.00, 1.00, 0.97) |
| | 0.4 | 0.030 | (0.90, 0.30, 0.35) |
| | 0.6 | 0.012 | (0.55, 0.00, 0.00) |
Example 3 (Impact of the Depth and Width of a Neural Network):
This example (Table B.1 in Appendix B in the Supplementary Material) concerns the performance of the proposed tests in terms of the width and depth of a neural network, where , , , , , , and .
Example 4 (Impact of the Number of Hypothesized Features):
This example (Table B.2 in Appendix B in the Supplementary Material) concerns the proposed tests with respect to the number of HFs , where , , , , , and .
Example 5 (Impact of Feature Correlations):
This example (Table B.3 in Appendix B in the Supplementary Material) concerns the proposed tests in terms of the feature correlation , where , , , , , and .
Example 6 (Impact of Different Modes of Combining P-Values):
This example (Table B.4 in Appendix B in the Supplementary Material) concerns the combined tests with different ways of combining -values. Type I/II errors are examined in two simulated examples: (1) , , , , ; (2) , , , , and .
Example 7 (Impact of Over/Under-Parameterized Models):
This example (Table V) concerns the impact of the proposed tests based on different underlying black-box models. Specifically, we set the ground truth function as a neural network with and , and consider both the under-parameterized and over-parameterized models with .
TABLE V.
Empirical Type I/II Errors of the (Combined) One-/Two-Split Tests in Example 7 Based on the Width of the True Model and Different Widths of the Learning Model
| Test | Width | Type I error | Type II error |
|---|---|---|---|
| One-split | 32 | 0.067 | (0.80, 0.36, 0.37) |
| | 64 | 0.025 | (0.81, 0.32, 0.29) |
| | 128 | 0.017 | (0.80, 0.28, 0.26) |
| Two-split | 32 | 0.017 | (0.97, 0.94, 0.93) |
| | 64 | 0.020 | (0.97, 0.94, 0.93) |
| | 128 | 0.033 | (0.96, 0.93, 0.93) |
| Comb, one-split | 32 | 0.140 | (0.58, 0.16, 0.11) |
| | 64 | 0.030 | (0.81, 0.17, 0.14) |
| | 128 | 0.013 | (0.85, 0.18, 0.16) |
| Comb, two-split | 32 | 0.013 | (0.96, 0.93, 0.91) |
| | 64 | 0.027 | (0.96, 0.93, 0.94) |
| | 128 | 0.007 | (0.97, 0.95, 0.97) |
For a test's Type I and II errors, we compute the proportions of its rejections of the null hypothesis out of 1000 simulations under the null hypothesis and out of 100 simulations under the alternative, respectively.
When implementing the log-ratio splitting scheme, (, ) is determined by (8) with , and ; for the data-adaptive scheme, the grids of are set as {0.2,0.4,0.6,0.8}. Moreover, the grids for searching the optimal perturbation size are {0.01,0.05,0.1,0.5,1.0}. For combined tests, the number of repeated random splitting is set as 5. The hyperparameters of fitting a neural network are the same as in Section V–A.
1). Type I/II Errors of the (Combined) One-Split/Two-Split Tests:
As indicated in Tables III and IV and Tables B.1–B.5 in the Supplementary Material, the one-split/two-split tests perform well in all examples with respect to controlling Type I/II errors. In particular, Type I errors are close to the nominal level , whereas Type II errors decrease to 0 as the sample size increases. As expected, the one-split test outperforms the two-split test in terms of Type II error, which agrees with Theorem 4 and Theorem A.3 in Appendix A in the Supplementary Material. The combined tests consistently improve the performance in terms of both Type I/II errors.
2). Runtime:
The combined tests may double the runtime of their noncombined counterparts based on the data-adaptive tuning scheme. This result suggests that the one-split/two-split and their combined tests are practically feasible for black-box testing subject to computational constraints as in the case of applying deep neural networks to large data.
3). Combining P-Values:
As suggested by Table B.4 in the Appendix in the Supplementary Material, the Hommel combining method controls the Type I error while having reasonably good power in reducing the Type II error. The Bonferroni and Cauchy methods have an issue of failing to control Type I error, whereas other combining methods are conservative in the first case of Example 6.
4). Over-/Under-Parameterized Models:
As suggested in Table V, an under-parameterized model () has inflated Type I errors, which agrees with the theoretical analysis in Section II–B; an over-parameterized model () is able to control the Type I error and provides performance in power or Type II error similar to that of the perfectly specified model (with exactly the same network structure as the ground-truth model), partially because early stopping acts as a regularization for over-parameterized models. For the (combined) two-split test, both the over- and under-parameterized models perform similarly to the perfectly specified model. One plausible explanation (for the lack of inflation of Type I errors) is that the two-split test is conservative in the finite-sample setting.
We summarize the advantages of the different tests and the combining/tuning methods in Table VI.
TABLE VI.
Advantage for Different Tests, Combining, and Tuning Methods
| | | Advantage | Evidence |
|---|---|---|---|
| Test | One-split | More powerful | Tables III, IV, and Tables B.1–B.4 in Appendix B |
| | Two-split | No need to perturb data | Appendix A |
| Combine | Comb. | More powerful | Tables III, IV, and Tables B.1–B.4 in Appendix B |
| | Non-comb. | Less computation time | Table III |
| Ratio | Data-adaptive | More powerful | Tables III, IV, and Tables B.1–B.4 in Appendix B |
| | Log-ratio | No need to tune the ratio, and less computation time | Lemma 6, Table III |
C. One-Split Test and Perturbation
Consider a regression model in (12), where , , where ; , and , if , and . In this case, let ; then the null hypothesis is true at the population level. Furthermore, only partial features are observed in a dataset , where and is generated as , is the number of observed features and as , and is a -dimensional dummy variable.
Then, we simulate a dataset , with , , and . For implementation, we set for the one-split and two-split tests and for the one-split test. The fitting parameters are the same as in Section V–A. The Type I errors based on the two-split test and the one-split tests with/without perturbation are reported in Table VII.
As indicated in Table VII, the two-split test and the one-split with perturbation approximately control Type I errors across all situations, whereas the one-split test without perturbation has inflated Type I errors significantly exceeding the nominal level .
VI. Real Application
A. MNIST Handwritten Digits
This subsection applies the proposed tests to the MNIST handwritten digits dataset [10]. The MNIST dataset is a standard benchmark for explainable artificial intelligence (XAI) methods [26], in part because the detection results can be easily evaluated by human visual intuition. In particular, we extract 14251 images with labels '7' and '9' from the dataset to discriminate between these two digits. Our primary goal is to test whether certain image features differentiate digit '7' from digit '9', where a marked region of an image specifies the HFs.
In this application, we consider three different types of masked regions, as displayed in Fig. 3.
Fig. 3.
HRs in Cases 1–3 for differentiating digits 7 and 9 in Section VI–A. Case 1: an HR is (19:28, 13:20), which indicates that is true; Case 2: an HR is (21:28, 4:13), which indicates that is true; Case 3: an HR is (7:16, 9:16), which indicates that is true. Note that the p-values at the top are given by the one-split test.
To proceed, we specify the underlying model as the default convolutional neural network (CNN) provided by Keras for the MNIST dataset. Finally, we apply the one-split test, the two-split test, and their combined tests based on the data-adaptive tuning scheme with a significance level of .
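A minimal sketch of preparing this experiment is shown below: it loads MNIST, keeps digits 7 and 9, and masks a rectangular hypothesized region to form the dual images; the pixel indices are illustrative (the regions above are stated in 1-based notation), and feeding the resulting arrays into a CNN and the one-split test is omitted.

```python
import numpy as np
from tensorflow import keras

# Load MNIST and keep only digits 7 and 9 (binary discrimination task).
(x, y), _ = keras.datasets.mnist.load_data()
keep = np.isin(y, [7, 9])
x, y = x[keep] / 255.0, (y[keep] == 9).astype("float32")

# Hypothesized region (HR) as a pixel rectangle, roughly Case 3 above;
# 0-based slicing here approximates the stated 1-based indices.
x_dual = x.copy()
x_dual[:, 6:16, 8:16] = 0.0   # dual images: mask the HR with a constant

# x and x_dual (split into estimation/inference subsets) would then be fed
# to a CNN classifier, e.g., the default Keras MNIST CNN, for the one-split test.
```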
As suggested by Table VIII, the (combined) one-/two-split tests all fail to reject the null hypothesis when it is true in Cases 1 and 2, but all reject it in Case 3 when it is false. Overall, the test results confirm our intuition that the hypothesized regions (HRs) in Cases 1 and 2 are visually indistinguishable, whereas that in Case 3 is visually discriminative, as illustrated in Fig. 3.
TABLE VIII.
p-Values of the (Combined) One-/Two-Split Tests in the MNIST Dataset. Significant p-Values for Testing Feature Irrelevance Are Underlined at a Nominal Level
| Test | p-values (cases 1–3) |
|---|---|
| One-split | (1.74e-1, 3.29e-1, 1.37e-13) |
| Two-split | (9.59e-1, 5.69e-1, 1.10e-05) |
| Comb, one-split | (3.85e-1, 1.00e-0, 4.43e-18) |
| Comb, two-split | (5.44e-1, 1.92e-1, 2.25e-09) |
B. Mechanisms of Action Prediction for New Drugs
This subsection applies the proposed tests to examine the significance of "treatment," "gene expression," and "cell viability" for mechanisms of action (MoA) prediction of new drugs. The dataset consists of 23814 drug-MoA annotation pairs with three types of features ("treatment," "gene expression," and "cell viability") and 207 binary labels indicating multiple targets of MoA responses, as illustrated in Fig. 4. Specifically, "treatment" includes "treatment duration" (continuous) and "treatment dose" (categorical); "gene expression" and "cell viability" include 773 gene expression features (continuous) and 100 human cell responses to drugs (continuous) [27], [28], respectively.
Fig. 4.
Features (treatment features, gene expression, and cell viability) and targets in MoA dataset. Three cases with three different types of HFs are considered. Case 1 (Treatment): HFs are “treatment duration” and “treatment dose;” Case 2 (Gene): HFs are “g-0”–“g-772;” and Case 3 (Cell): HFs are “c-0”–“c-99.”
In this application, we consider the significance of those three types of feature sets, as displayed in Fig. 4.
For implementation, we use TabNet [29] as the predictive model for our proposed tests. The results are summarized in Table IX, which indicates that all tests fail to reject at for the "gene expression" features. For Cases 1 and 3, all tests consistently reject , identifying "treatment" and "cell viability" as significant features for MoA prediction.
TABLE IX.
p-Values of the (Combined) One-/Two-Split Tests in the MoA Prediction Dataset
| Test | p-values (cases 1–3) (‘treatment’, ‘gene exp’, ‘cell viability’) |
|---|---|
| One-split | (1.42e-2, 1.34e-1, 9.69e-4) |
| Two-split | (2.17e-2, 2.52e-1, 3.19e-4) |
| Comb, one-split | (4.72e-2, 3.81e-1, 1.02e-3) |
| Comb, two-split | (1.13e-3, 1.01e-1, 1.20e-5) |
C. Chest X-Rays for Pneumonia Diagnosis
This subsection illustrates the application of the proposed tests to chest X-ray images in a pneumonia diagnosis dataset [30]. This dataset consists of 5863 X-ray images, each labeled as "Pneumonia" or "Normal." To proceed, we crop each image to focus on the lung fields, based on DeepXR. Then, we use a square cropping region to retain important areas containing the parenchymal and retrocardiac anatomy.
For implementation, we specify the learning model as a CNN and apply the one-split test, two-split test, and their combined tests based on the data-adaptive tuning scheme at a significance level of . Similarly, we also consider three different types of HRs, as displayed in Fig. 5.
Fig. 5.
HRs in Cases 1–3 for discriminating "Normal" (first row) versus "Pneumonia" (second row) X-ray images in Section VI–C. Case 1: an HR is (50:200, 20:110), for which is likely to be false; Case 2: an HR is (50:200, 100:150), for which is likely to be true; Case 3: an HR is (50:200, 150:240), for which is likely to be false. Note that the p-values at the top are given by the one-split test.
As suggested by Table X, all tests fail to reject at in Case 2, where is likely to be true. For Cases 1 and 3, only the (combined) one-split test rejects in both cases, but the other tests fail to do so when is likely to be false. In agreement with the earlier results, the one-split test seems more powerful in detecting a discriminative region.
TABLE X.
p-Values of the (Combined) One-/Two-Split Tests in the Chest X-Ray Dataset
| Test | p-values (cases 1–3) (‘left lung’, ‘null region’, ‘right lung’) |
|---|---|
| One-split | (2.61e-2, 9.95e-1, 2.12e-2) |
| Two-split | (2.12e-1, 5.61e-1, 6.51e-2) |
| Comb, one-split | (4.14e-2, 6.35e-1, 7.52e-2) |
| Comb, two-split | (5.36e-2, 7.54e-1, 8.37e-2) |
D. Significance of Keypoints to Facial Expression Recognition
This section examines the significance of five keypoints (left eye, right eye, eyes, nose, and mouth) for seven facial expressions ("angry," "disgust," "fear," "happy," "sad," "surprise," and "neutral") on the FER2013 dataset, consisting of 48 × 48 pixel grayscale facial images. The facial images have been automatically registered. For each facial image, an emotion label is provided as one of the seven expressions. Given a facial image, we produce the keypoints based on the existing facial landmark detection libraries dlib and open-cv. The primary goal is to assess the significance of the keypoints for facial expression recognition.
After preprocessing, we obtain 11709 triples of images, labels, and keypoints. The scatter plot of the keypoints is provided in Fig. 6, from which we consider five different collections of HRs corresponding to the five keypoints: left eye, right eye, eyes, nose, and mouth, respectively. Illustrative examples of the HRs are displayed in Fig. 7.
Fig. 6.
Scatter plot of the keypoints (left eye, right eye, nose, and mouth) in the FER2013 facial expression recognition dataset, showing that the HRs in Cases 1–4 cover the corresponding keypoints in most faces.
Fig. 7.
HRs in Cases 1–5 for discriminating seven facial expressions in rows (including "angry," "disgust," "fear," "happy," "sad," "surprise," and "neutral"). Case 1 (Left eye): an HR is (14:22, 9:22); Case 2 (Right eye): an HR is (14:22, 28:41); Case 3 (Eyes): an HR is (14:22, 9:22 ∪ 28:41); Case 4 (Nose): an HR is (24:32, 20:29); and Case 5 (Mouth): an HR is (34:45, 18:30). Note that the p-values at the top are given by the combined one-split test.
For implementation, we use the same VGG deep neural network and the same training hyperparameters as in [31]. Note that the adopted VGG network in [31] is one of the state-of-the-art facial expression recognition methods (Rank 4) on the FER2013 papers-with-code leaderboard [32].
As suggested by Table XI, all tests fail to reject in Cases 1, 2, and 4. For Cases 1 and 2, this is partly because the predictive information in the left/right eye is symmetrically present in the other eye. For Case 4, the result confirms the visual intuition that the "nose" is not a discriminative keypoint for facial expression. For Cases 3 and 5, all tests consistently reject , suggesting that "eyes" and "mouth" are discriminative regions, which is visually confirmed by the illustrative samples in Fig. 7. Note that the proposed tests are equally applicable to more substantial computer vision applications, for which the testing results could provide instructive information for visual sensor management and construction.
TABLE XI.
p-Values of the (Combined) One-/Two-Split Tests in the FER2013 Dataset Based on Five Keypoints: Left Eye, Right Eye, Eyes, Nose, and Mouth
| Test | p-values (cases 1–5) (‘left eye’, ‘right eye’, ‘eyes’, ‘nose’, ‘mouth’) |
|---|---|
| One-split | (1.58e-1, 3.55e-1, 6.14e-3, 6.89e-1, 1.27e-3) |
| Two-split | (9.25e-2, 2.87e-1, 1.75e-2, 2.86e-1, 3.46e-2) |
| Comb, one-split | (6.03e-1, 1.43e-1, 7.88e-5, 8.23e-1, 1.09e-7) |
| Comb, two-split | (5.91e-2, 6.81e-2, 4.42e-2, 1.33e-1, 2.29e-2) |
E. Evaluating Significance of Localization in CIFAR100
Note that the proposed methods are equally applicable to significance tests for instance-adaptive HFs. Therefore, they can be used to evaluate the effectiveness of discriminative localization methods, such as class activation maps (CAM) [33] and Grad-CAM [34]. In this section, we demonstrate a significance test on the CIFAR100 dataset based on adaptive HFs localized by Grad-CAM.
Specifically, in the training set, we apply Grad-CAM to a fitted AlexNet to produce importance heatmaps of the features/pixels of all images; see six demonstrative examples in Fig. 8. Then, four cases of hypothesized tests are formed by taking the top-5%, top-10%, top-15%, and top-30% most important features as HFs. Next, the proposed one-/two-split tests are conducted on the testing set with a ResNet50 network, and the resulting p-values are summarized in Table XII.
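A minimal sketch of turning per-image Grad-CAM heatmaps into instance-adaptive dual images is given below, assuming the heatmaps share the spatial shape of the images; the function name mask_top_fraction and the per-image quantile cutoff are illustrative choices, not the exact construction used in the experiments.

```python
import numpy as np

def mask_top_fraction(images, heatmaps, frac=0.05, mask_value=0.0):
    """For each image, take the top `frac` fraction of pixels ranked by its
    Grad-CAM heatmap as the instance-adaptive HFs and replace them with a
    constant to form the dual image."""
    dual = images.copy()
    for i, h in enumerate(heatmaps):
        cutoff = np.quantile(h, 1.0 - frac)   # per-image importance cutoff
        dual[i][h >= cutoff] = mask_value     # masks all channels of selected pixels
    return dual
```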
Fig. 8.
Demonstrative adaptive HRs in CIFAR100 dataset, localized by Grad-CAM. Cases 1–4: the percentages of HFs are 5%, 10%, 15%, and 30% in rows from top to bottom, corresponding to the top important features ranked by Grad-CAM localization heatmaps.
TABLE XII.
p-Values of the (Combined) One-/Two-Split Tests in the CIFAR100 Dataset. The Percentages of HFs Are 5%, 10%, 15%, and 30%, Corresponding to the Top Important Features Ranked by Grad-CAM Localization Heatmaps
| Test | p-values (cases 1–4) (Top-5%, Top-10%, Top-15%, Top-30%) |
|---|---|
| One-split | (3.13e-1, 6.44e-3, 5.33e-3, 4.25e-8) |
| Two-split | (9.04e-1, 2.58e-1, 7.90e-1, 2.59e-4) |
| Comb, one-split | (5.56e-2, 4.08e-3, 1.92e-5, 1.12e-7) |
| Comb, two-split | (5.81e-1, 1.59e-1, 2.20e-2, 9.68e-5) |
Overall, the test results confirm our intuition; the inconsistent results in Cases 2 and 3 between the one-split and two-split tests may be caused by the power loss of the two-split test. It is worth mentioning that the sequence of pairs (top important hypothesized/localized features, p-values) produced by the proposed tests can serve as an evaluation of the effectiveness of the localization method.
F. Significance of Keywords in Sentiment Analysis
This section examines the significance of keywords in sentiment classification based on the IMDB dataset [35]. This dataset provides 50000 highly polar movie reviews for binary sentiment classification. We also obtain lists of positive, negative, and neutral opinion words from [36]. In this application, we apply the proposed tests to examine the significance of positive/negative/neutral words contributing to sentiment analysis. For illustration, we report the results based on the top 350 frequent positive- and negative-sentiment words and 350 randomly selected neutral-sentiment words in the IMDB dataset.
For implementation, we use a bidirectional LSTM model as a prediction model for sentiment classification and apply the one-split/two-split tests and their combined tests based on the log-ratio splitting method at a significance level of .
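A minimal sketch of constructing the dual text inputs is shown below, assuming the reviews are already encoded as lists of token ids; the function name mask_keywords, the keyword_ids argument, and the choice of a single constant masking id are illustrative, and padding/truncation for the bidirectional LSTM is omitted.

```python
def mask_keywords(sequences, keyword_ids, mask_id=0):
    """Replace token ids belonging to the hypothesized keyword list
    (e.g., the top positive-sentiment words) with a constant id, leaving
    the rest of each review unchanged, to form the dual sequences."""
    keyword_ids = set(keyword_ids)
    return [[mask_id if t in keyword_ids else t for t in seq]
            for seq in sequences]
```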
Overall, the test results in Table XIII confirm our intuition that the positive- and negative-sentiment words significantly contribute to sentiment analysis, whereas the neutral-sentiment words do not. The inconsistent results in Cases 1 and 2 between the one-split and two-split tests may be caused by the power loss of the two-split test.
TABLE XIII.
p-Values of the (Combined) One-/Two-Split Tests in the IMDB Dataset With HFs as Follows. Case 1: the Top 350 Frequent Positive-Sentiment Words; Case 2: the Top 350 Negative-Sentiment Words; Case 3: 350 Randomly Selected Neutral-Sentiment Words
| Test | p-values (cases 1–3) (positive, negative, neutral) |
|---|---|
| One-split | (2.92e-2, 1.20e-3, 3.37e-1) |
| Two-split | (9.61e-2, 1.61e-1, 1.14e-1) |
| Comb, one-split | (2.53e-5, 6.87e-3, 1.29e-1) |
| Comb, two-split | (2.98e-1, 2.24e-1, 6.20e-1) |
VII. Conclusion
This article proposes two novel risk-invariance tests, one-split and two-split tests, to assess the impact of a collection of HFs on prediction. Theoretically, we have established asymptotic null distributions of test statistics and their consistency in Type I/II errors. Numerically, we have demonstrated the utility of the proposed tests on simulated and real datasets. Next, we summarize some strengths and limitations of the proposed tests.
Strengths:
1) The proposed tests provide a practical inference tool for black-box models on complex data, which considerably relax assumptions in the existing literature. For example, CRT and HRT require a well-estimated conditional probability for features, which is often impractical. 2) The proposed tests work for general risk-invariance testing on a collection of features of interest, which encompasses the conditional independence test when the log-likelihood loss is used. 3) The proposed tests involve a limited number of model refitting, which is suitable for large-scale problems.
Limitations:
1) The one-split/two-split tests split the original dataset, at the expense of reduced power or increased Type II error. 2) The log-ratio splitting scheme is conservative in that it prefers situations with a large estimation subset and a small inference subset.
Supplementary Material
Acknowledgments
This work was supported in part by NSF under Grant DMS-1712564, Grant DMS-1721216, and Grant DMS-1952539; in part by NIH under Grant R01GM126002, Grant R01AG069895, Grant R01AG065636, Grant R01AG074858, and Grant U01AG073079; and in part by The Chinese University of Hong Kong Faculty of Science Direct Grant.
Biographies

Ben Dai (Member, IEEE) received the B.S. degree in mathematics and applied mathematics from Hangzhou Dianzi University, Hangzhou, China, in 2015, and the Ph.D. degree in data science from The City University of Hong Kong, Hong Kong, in 2019.
He is currently an Assistant Professor with the Department of Statistics, The Chinese University of Hong Kong. His research interests include statistical machine learning, learning theory, statistical XAI, recommender systems, and deep learning.

Xiaotong Shen received the B.S. degree in mathematics from Peking University, Beijing, China, in 1985, and the Ph.D. degree in statistics from the University of Chicago, Chicago, IL, USA, in 1991.
He is currently the John Black Johnston Distinguished Professor with the University of Minnesota, Minneapolis, MN, USA. His research interests include machine learning and data science, high-dimensional inference, nonparametric and semiparametric inference, causal graphical models, recommender systems, and nonconvex minimization.

Dr. Shen is a fellow of the American Association for the Advancement of Science, the American Statistical Association, and the Institute of Mathematical Statistics.
Wei Pan received the B.S. degree in computer engineering and in applied mathematics from Tsinghua University, Beijing, China, in 1989, and the Ph.D. degree in statistics from the University of Wisconsin–Madison, Madison, WI, USA, in 1997. He is currently a Professor with the Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA. His research interests include statistical genetics, bioinformatics, and deep learning.
Dr. Pan is a fellow of the American Statistical Association and the Institute of Mathematical Statistics.
Contributor Information
Ben Dai, Department of Statistics, The Chinese University of Hong Kong, Hong Kong.
Xiaotong Shen, School of Statistics, University of Minnesota, Minneapolis, MN 55455 USA.
Wei Pan, Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455 USA.
References
- [1].Schmidhuber J, “Deep learning in neural networks: An overview,” Neural Netw., vol. 61, pp. 85–117, Jan. 2015. [DOI] [PubMed] [Google Scholar]
- [2].Fahrmeir L, Kneib T, Lang S, and Marx B, Regression. Berlin, Germany: Springer, 2007. [Google Scholar]
- [3].King G, Unifying Political Methodology: The Likelihood Theory of Statistical Inference. Cambridge, U.K.: Cambridge Univ. Press, 1989. [Google Scholar]
- [4].Wasserman L, Ramdas A, and Balakrishnan S, “Universal inference,” Proc. Nat. Acad. Sci. USA, vol. 117, no. 29, pp. 16880–16890, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [5].Candès E, Fan Y, Janson L, and Lv J, “Panning for gold: ‘Model-X’ knockoffs for high dimensional controlled variable selection,” J. Roy. Stat. Soc. B, Stat. Methodol, vol. 80, no. 3, pp. 551–577, 2018. [Google Scholar]
- [6].Tansey W, Veitch V, Zhang H, Rabadan R, and Blei DM, “The holdout randomization test: Principled and easy black box feature selection,” J. Comput. Graph. Statist, vol. 31, no. 1, pp. 151–162, 2021. [Google Scholar]
- [7].Ojala M. and Garriga GC, “Permutation tests for studying classifier performance,” J. Mach. Learn. Res, vol. 11, no. 62, pp. 1833–1863, Jun. 2010. [Google Scholar]
- [8].Berrett TB, Wang Y, Barber RF, and Samworth RJ, “The conditional permutation test for independence while controlling for confounders,” J. Roy. Stat. Soc. B, Stat. Methodol, vol. 82, no. 1, pp. 175–197, Feb. 2020. [Google Scholar]
- [9].Lei J, G’Sell M, Rinaldo A, Tibshirani RJ, and Wasserman L, “Distribution-free predictive inference for regression,” J. Amer. Stat. Assoc, vol. 113, no. 523, pp. 1094–1111, Jul. 2018. [Google Scholar]
- [10].LeCun Y, Bottou L, Bengio Y, and Haffner P, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998. [Google Scholar]
- [11].Rumelhart DE, Hinton GE, and Williams RJ, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, Oct. 1986. [Google Scholar]
- [12].Chernozhukov V. et al. , “Double/debiased machine learning for treatment and structural parameters,” Econometrics J, vol. 21, no. 1, pp. C1–C68, Feb. 2018. [Google Scholar]
- [13].Wasserman L. and Roeder K, “High dimensional variable selection,” Ann. Statist, vol. 37, no. 5A, p. 2178, 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [14].Buse A, “The likelihood ratio, Wald, and Lagrange multiplier tests: An expository note,” Amer. Statistician, vol. 36, no. 3, pp. 153–157, 1982. [Google Scholar]
- [15].Dodge Y. and Commenges D, The Oxford Dictionary of Statistical Terms. London, U.K.: Oxford Univ. Press, 2006. [Google Scholar]
- [16].Wasserman L, All of Nonparametric Statistics. New York, NY, USA: Springer, 2006. [Google Scholar]
- [17].Schmidt-Hieber J, “Nonparametric regression using deep neural networks with ReLU activation function,” Ann. Statist, vol. 48, no. 4, pp. 1875–1897, 2020. [Google Scholar]
- [18].Cappé O, Moulines E, and Rydén T, Inference in Hidden Markov Models. New York, NY, USA: Springer, 2006. [Google Scholar]
- [19].Romano J. and DiCiccio C, “Multiple data splitting for testing,” Dept. Statist., Stanford Univ, Stanford, CA, USA, Tech. Rep. 2019-03, 2019. [Google Scholar]
- [20].Meinshausen N, Meier L, and Bühlmann P, “P-values for high-dimensional regression,” J. Amer. Stat. Assoc, vol. 104, no. 488, pp. 1671–1681, Dec. 2009. [Google Scholar]
- [21].Vovk V. and Wang R, “Combining p-values via averaging,” Biometrika, vol. 107, no. 4, pp. 791–808, 2020. [Google Scholar]
- [22].Hardy GH et al. , Inequalities. Cambridge, U.K.: Cambridge Univ. Press, 1952. [Google Scholar]
- [23].Hommel G, “Tests of the overall hypothesis for arbitrary dependence structures,” Biometrical J, vol. 25, no. 5, pp. 423–430, Jan. 1983. [Google Scholar]
- [24].Owen DB, “Tables for computing bivariate normal probabilities,” Ann. Math. Statist, vol. 27, no. 4, pp. 1075–1090, Dec. 1956. [Google Scholar]
- [25].Breiman L, “Random forests,” Mach. Learn, vol. 45, no. 1, pp. 5–32, 2001. [Google Scholar]
- [26].Ribeiro MT, Singh S, and Guestrin C, “‘Why should i trust you?’ Explaining the predictions of any classifier,” in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2016, pp. 1135–1144. [Google Scholar]
- [27].Corsello SM et al. , “Discovering the anticancer potential of non-oncology drugs by systematic viability profiling,” Nature Cancer, vol. 1, no. 2, pp. 235–248, Feb. 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [28].Subramanian A. et al. , “A next generation connectivity map: L1000 platform and the first 1,000,000 profiles,” Cell, vol. 171, no. 6, pp. 1437–1452, 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [29].Arik SO and Pfister T, “TabNet: Attentive interpretable tabular learning,” in Proc. AAAI Conf. Artif. Intell, 2021, pp. 6679–6687. [Google Scholar]
- [30].Kermany DS et al. , “Identifying medical diagnoses and treatable diseases by image-based deep learning,” Cell, vol. 172, no. 5, pp. 1122–1131, 2018. [DOI] [PubMed] [Google Scholar]
- [31].Khaireddin Y. and Chen Z, “Facial emotion recognition: State of the art performance on FER2013,” 2021, arXiv:2105.03588. [Google Scholar]
- [32].Pramerdorfer C. and Kampel M, “Facial expression recognition using convolutional neural networks: State of the art,” 2016, arXiv:1612.02903. [Google Scholar]
- [33].Zhou B, Khosla A, Lapedriza A, Oliva A, and Torralba A, “Learning deep features for discriminative localization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2921–2929. [Google Scholar]
- [34].Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, and Batra D, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 618–626. [Google Scholar]
- [35].Maas A, Daly RE, Pham PT, Huang D, Ng AY, and Potts C, “Learning word vectors for sentiment analysis,” in Proc. 49th Annu. Meeting Assoc. Comput. Linguistics, Hum. Lang. Technol., 2011, pp. 142–150. [Google Scholar]
- [36].Hu M. and Liu B, “Mining and summarizing customer reviews,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2004, pp. 168–177. [Google Scholar]