Author manuscript; available in PMC: 2022 Nov 15.
Published in final edited form as: Stat Med. 2021 Jul 26;40(25):5673–5689. doi: 10.1002/sim.9147

Bayes estimate of primary threshold in clusterwise functional magnetic resonance imaging inferences

Yunjiang Ge 1, Stephanie Hare 2, Gang Chen 3, James A Waltz 2, Peter Kochunov 2, L Elliot Hong 2, Shuo Chen 2,4
PMCID: PMC8972072  NIHMSID: NIHMS1788916  PMID: 34309050

Abstract

Clusterwise statistical inference is the most widely used technique for functional magnetic resonance imaging (fMRI) data analyses. Clusterwise statistical inference consists of two steps: (i) primary thresholding, which excludes less significant voxels by a prespecified cut-off (eg, p < .001); and (ii) clusterwise thresholding, which controls the familywise error rate caused by clusters consisting of false positive suprathreshold voxels. The selection of the primary threshold is critical because it determines both statistical power and false discovery rate (FDR). However, in most existing statistical packages, the primary threshold is selected based on prior knowledge (eg, p < .001) without taking into account the information in the data. In this article, we propose a data-driven approach to algorithmically select the optimal primary threshold based on an empirical Bayes framework. We evaluate the proposed model using extensive simulation studies and real fMRI data. In the simulation, we show that our method can effectively increase statistical power by 20% to over 100% while effectively controlling the FDR. We then investigate the brain response to the dose-effect of chlorpromazine in patients with schizophrenia by analyzing fMRI scans, generating consistent results.

Keywords: clusterwise inference, empirical Bayes, fMRI, optimal threshold

1 ∣. INTRODUCTION

The functional magnetic resonance imaging (fMRI) technique has been a popular tool for noninvasively studying circuit-level brain activity for more than two decades. Statistical analyses of neuroimaging data remain challenging due to their high dimensionality and spatiotemporal dependence structure.1-4 Advanced statistical methods have been developed to solve the multiple comparison problem and successfully applied to many fMRI studies.5-8 The widely used voxelwise parametric methods (eg, random field theory based methods) have good sensitivity when the spatial parameters are well estimated, though their performance can be affected by features of the true signals and by sample size.1,7,9 The clusterwise inference method remains a popular tool for neuroimaging data analysis due to its relatively high sensitivity and low computational cost compared with voxel-extent based thresholding methods.10,11 The performance and parameters of this procedure have been well discussed and studied.12-15

The clusterwise inference consists of two steps: a primary thresholding step at the voxel level, which applies a cut-off to all voxels and keeps only the suprathreshold voxels; and a cluster-extent based thresholding step at the cluster level, which avoids selecting false positive clusters under the null hypothesis (eg, no activation). In the current study, we focus on the nonparametric inference method at the voxel level due to its robustness, though parametric methods can also be applied under various assumptions.6,16-19

Unlike conventional multiple testing correction,20 the threshold for clusterwise inference does not directly determine the subset of selected variables; rather, it specifies the cluster-forming condition in the three-dimensional (3D) brain space. Each formed cluster is then tested while controlling the familywise error rate (FWER). Therefore, the clusterwise inference results are spatially contiguous brain areas (ie, sets of adjacent voxels) instead of individual variables. The primary threshold determines the test statistic for a cluster, and thus has a major impact on the sensitivity and false positive error rate of clusterwise inference.11 In light of this, our paper focuses on providing a generic primary threshold estimation scheme that applies to most clusterwise fMRI inferences.

A key limitation of clusterwise inference is its vulnerability to poor selection of the voxel-level primary threshold. An overly conservative threshold may lead to trivial clusters of connected true positive voxels and cause low statistical power. On the other hand, a liberal primary threshold (eg, p < .01) can generate massive numbers of false positive points in a smoothed brain space, which can then be connected to form false positive clusters.11 Both false positive clusters and low statistical power are major potential causes of the low reproducibility and replicability of fMRI findings.3

Setting the primary threshold at p < .001 is now standard for most studies, because it generally controls false-positive findings effectively, based on empirical studies.11,21 However, a prespecified voxelwise threshold may be suboptimal because it does not account for several important characteristics of the data, including sample size, effect size (ES), noise level, and the selection of statistical models, among many others.

We illustrate the concept of data-driven optimal threshold selection with a typical example (Figure 1). We consider a simple scenario in which all voxels in a brain image can be divided into two sets: those truly associated with the covariate of interest and the rest. The test statistics of the two sets follow a nonnull distribution and a null distribution, respectively. We argue that the optimal primary threshold should be selected based on the nonnull and null distributions rather than prespecified. For example, if the two distributions are well separated (large sample sizes or strong signals), a more rigorous primary threshold (eg, more stringent than 0.001) should be used to suppress false-positive findings. By contrast, if the two distributions are less separable (but still separable, as with small/moderate sample sizes and moderate/large ESs), a relatively liberal primary threshold (eg, less conservative than 0.001) should be used to keep the false discovery rate (FDR) at a low level while maintaining maximal statistical power. Therefore, the empirical null and nonnull distributions reflect the sample size, ES, and noise level of the data and provide important guidance for primary threshold selection.

FIGURE 1. Nonnull (purple) and null (black) distributions of test statistics from the (non)event spatial points. The red vertical line marks the z-score corresponding to a p-value of .001 in each graph, while the purple vertical line marks the z-score corresponding to the p-value given by empirical Bayes adaptive threshold selection (eBass). Yellow triangles on the x-axis indicate threshold z-values for local fdr fdr̂(z) = 0.2, where such cases exist. The null subdensity π0f0 is marked with a blue dashed line, and the mixture density with a solid green line

In the above example, we show that a data-driven primary threshold can maximize the statistical power to detect true positive clusters while effectively controlling the FDR. However, a data-driven primary threshold selection procedure has not been fully developed. To fill this gap, we propose the new eBass (empirical Bayes adaptive threshold selection) method to objectively select the optimal primary threshold based on information from the data. The eBass objective function aims to achieve maximal statistical power while preventing false positive voxels from being connected into a false positive cluster. We develop new algorithms to implement the objective function and provide the corresponding theoretical properties. In this article, we focus on two-step clusterwise fMRI inference with nonparametric statistical tests by providing a new primary threshold selection strategy. We note that alternative advanced statistical methods, including nonparametric inference methods (eg, threshold-free cluster enhancement, TFCE, and pTFCE7,22) and parametric inference (eg, frequentist and Bayesian models5,16,23), can produce reliable and biologically meaningful results. Thus, eBass can complement these commonly used statistical approaches and enhance the widely used two-step clusterwise inference.
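The trade-off sketched above can be made concrete with a small numerical example. Everything here is hypothetical, a two-component Gaussian mixture with an assumed prior π0 and assumed nonnull means, not an estimate from any real data:

```python
from scipy import stats

# Assumed mixture: 90% null N(0,1), 10% nonnull N(mu1, 1)
pi0, pi1 = 0.9, 0.1

def power_fdr(z_cut, mu1):
    """Voxel-level power and expected FDR at a given z cut-off."""
    power = stats.norm(mu1, 1).sf(z_cut)   # nonnull upper-tail probability
    fp = pi0 * stats.norm.sf(z_cut)        # expected null mass above cut-off
    return power, fp / (fp + pi1 * power)

# Well-separated case (mu1 = 5): a cut-off stricter than z(.001) keeps power
p_strict, fdr_strict = power_fdr(stats.norm.isf(1e-4), mu1=5.0)
# Less-separated case (mu1 = 3): a liberal cut-off recovers power
p_lib, fdr_lib = power_fdr(stats.norm.isf(0.005), mu1=3.0)
```

Under these assumed components, the strict cut-off retains high power at a tiny FDR when separation is large, while the liberal cut-off recovers most of the power at a still-modest FDR when separation is small, which is exactly the adaptivity eBass formalizes.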

The rest of this article is organized as follows: (1) we introduce the eBass method and algorithm in Section 2; (2) we perform extensive simulation analyses to fully assess the properties of eBass in Section 3; (3) we apply eBass to an fMRI data analysis for schizophrenia research and conclude with discussions and future work.

2 ∣. METHOD

We consider the multiple comparison problem for all brain voxels in a spatially smoothed 3D space. Task-induced and resting-state studies are the two major applications of the fMRI technique. The numeric value of a voxel in a task-induced fMRI study expresses the magnitude of the local neural response to the task/stimuli, while the voxel value in a resting-state fMRI study (rs-fMRI; seed-voxel analysis) represents the strength of the local functional connection to the seed voxels. Clusterwise inference applies to both data types. We perform statistical inference on each voxel marginally or conditionally, where "marginally" and "conditionally" refer to two data analysis strategies. In fMRI data analysis, the general linear model (GLM) is the most commonly used statistical tool; it is applied to each voxel independently for group-level regression analysis. We consider GLM inference results as "marginal".6 In comparison, advanced statistical models have been developed to perform regression for each voxel while accounting for the dependence between voxels.4,24,25 Inference results from these analyses are treated as "conditional," and we consider z statistics from such advanced statistical models as uncorrelated.26

Let v = 1, … , V index the brain voxels. Generally, the null hypothesis H0v at each voxel is no activation. Thus, for the whole brain volume, there are V simultaneous hypothesis tests with test statistics z and corresponding p-values P:

Null hypotheses: H01, H02, H03, … , H0V   vs   alternative hypotheses: H11, H12, H13, … , H1V;
test statistics: z1, z2, z3, … , zV;   p-values: p1, p2, p3, … , pV.

The commonly used multiple testing correction methods (eg, Benjamini-Hochberg false discovery rate, or BH-FDR, correction) correct for multiplicity at the voxel level. The two-step clusterwise inference aims to extract cluster-level findings and gain additional power.9 Specifically,

  1. Consider the primary thresholding as a screening step. We first apply a predetermined threshold ηp to binarize all voxels based on their p-values. Denote the indicator variable δv = I(pv < ηp) (eg, ηp = .001), where I is an indicator function, and the set Δ = {v : δv > 0} that consists of voxels passing the threshold ηp. The binarization naturally leads to a voxel-level false positive rate and sensitivity.

  2. Perform permutation tests using the cluster extent as the test statistic to select sets of spatially adjacent suprathreshold voxels in Δ as the resulting clusters while controlling the FWER. This step bears a resemblance to commonly used spatial statistical models (eg, SaTScan27) that can competently handle an inhomogeneous spatial point process with clustered patterns.28 Thus, the final output is cluster-level findings (adjusting FWER).
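The two steps can be sketched in a few lines. This is a toy 2D illustration, not the authors' implementation, and the cluster-size cut-off `k_alpha` is supplied directly rather than derived from the permutation step:

```python
import numpy as np
from scipy import ndimage, stats

rng = np.random.default_rng(0)

def clusterwise_inference(zmap, p_primary=0.001, k_alpha=20):
    """Step 1: binarize at the primary threshold; Step 2: keep only
    connected clusters whose extent exceeds the cluster-size cut-off."""
    z_cut = stats.norm.isf(p_primary)           # one-sided z cut-off
    supra = zmap > z_cut                        # suprathreshold voxels
    labels, n = ndimage.label(supra)            # connected components
    if n == 0:
        return np.zeros_like(supra)
    sizes = np.bincount(labels.ravel())[1:]     # extent of each cluster
    keep = np.where(sizes >= k_alpha)[0] + 1    # clusters passing k_alpha
    return np.isin(labels, keep)

# Toy z-map: smoothed Gaussian noise plus a square of true activation
zmap = ndimage.gaussian_filter(rng.standard_normal((80, 80)), 3)
zmap /= zmap.std()
zmap[30:50, 30:50] += 4.0
detected = clusterwise_inference(zmap)
```

In a real analysis `k_alpha` would be the permutation-derived Kα discussed below, and the map would be a 3D statistical image.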

We refer to the voxel-level false discovery rate as

vFDR = Σ_{v=1}^{V} I(Ĥv = 1, Hv = 0) / Σ_{v=1}^{V} I(Ĥv = 1),

and to the cluster-level familywise error as cFWER = Pr(at least one detected cluster is false positive). The two-step clusterwise inference controls cFWER instead of vFDR. Nevertheless, vFDR can be used to evaluate the primary threshold selected in step one (screening).

It is well known that the choice of primary threshold is critical because (i) an overconservative primary threshold achieves a low vFDR at the cost of low sensitivity (cluster-level sensitivity is also low because true positive voxels are too few to form clusters); and (ii) a liberal primary threshold can lead to detecting both true and false positive clusters (ie, high cFWER). Either low sensitivity or high cFWER leads to less replicable results, because with low sensitivity the probability of observing overlapping true findings across datasets is low, and because the chance of false positive voxels reappearing in different datasets is small. Currently, the primary threshold of p < .001 is well accepted by the research community, while p < .01 is considered overly liberal.13 Here, we argue that a data-driven primary threshold may better balance this trade-off than a predetermined one.

We aim to select an optimal primary threshold to achieve maximal sensitivity (power) with a low FDR at voxel-level. In practice, however, neither voxel-level sensitivity nor FDR is known because the ground-truth is unavailable. We resort to an empirical Bayes framework for calculating the estimated voxel-level sensitivity and FDR.

2.1 ∣. Empirical Bayes estimated voxel-level sensitivity and FDR

The empirical Bayes framework has been developed to estimate the marginal distributions of the null and nonnull test statistics in the multiple testing problem.29,30 In these models, the test statistics of all brain voxels follow a mixture distribution, f(z) = π0f0(z) + π1f1(z), where π0 and π1 (π1 = 1 − π0) are the prior probabilities of a voxel belonging to the null and nonnull components:

π0 = Pr{zv ∈ null},   with null density f0(zv);
π1 = Pr{zv ∈ nonnull},   with nonnull density f1(zv).

The posterior probability that a voxel is from the null set given zv is Pr(zv ∈ null ∣ zv) = π0f0(zv) / [π0f0(zv) + π1f1(zv)] = π0f0(zv)/f(zv). A critical step of the empirical Bayes method is to estimate the mixture density f. Fortunately, numerous efficient and robust numerical algorithms (eg, MLE-based and Poisson regression estimates) have been developed.31 The null density f0 can be estimated by maximum likelihood estimation, taken as N(0, 1), or estimated by the central matching method. The prior probability π0 is estimated based on the estimated null distribution, and accordingly π̂1 = 1 − π̂0. Generally, the empirical Bayes estimation method provides consistent and robust estimates π̂0, π̂1, f̂0, f̂1.29,32 We denote θ = {π0, π1, f0, f1}.
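A minimal sketch of such an estimate, using a kernel density estimate for the mixture f, a theoretical N(0, 1) null, and the "zero assumption" at z = 0 for π0; the cited references use MLE/Poisson-regression estimators, so this is an illustrative stand-in rather than the paper's fitting procedure:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical z statistics: 90% null N(0,1), 10% nonnull N(3,1)
z = np.concatenate([rng.normal(0, 1, 9000), rng.normal(3, 1, 1000)])

f_hat = stats.gaussian_kde(z)                  # mixture density estimate f
f0 = stats.norm(0, 1).pdf                      # theoretical null density f0
pi0_hat = min(1.0, f_hat(0.0)[0] / f0(0.0))    # "zero assumption" pi0 estimate

def local_fdr(t):
    """Estimated Bayes local false discovery rate pi0 * f0(t) / f(t)."""
    return float(np.clip(pi0_hat * f0(t) / f_hat(t)[0], 0.0, 1.0))
```

As expected, the local fdr is near 1 around z = 0 (null-dominated) and small in the far right tail (nonnull-dominated).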

The empirical Bayes estimated prior probabilities and densities provide an effective tool to estimate the unknown joint or marginal distribution of the test statistics. Specifically, we calculate the posterior sensitivity and FDR. Under the empirical Bayes framework, the unknown number of nonnull test statistics is m1 = ∫_{−∞}^{+∞} f̂1(t)dt. Without loss of generality, we consider the right-tail situation. We denote the decision rule by a cut-off zθ̂ based on the empirical Bayes posterior f̂. The number of true nonnulls (true positives) is a function of zθ̂ given by S(zθ̂) = ∫_{zθ̂}^{+∞} f̂1(t)dt, the tail area of f̂1. Then, we define the true positive proportion S/m1 to be the posterior sensitivity at cut-off zθ̂, that is,

Posterior sensitivity: TPR̂(zθ̂) = ∫_{zθ̂}^{+∞} f̂1(t)dt / ∫_{−∞}^{+∞} f̂1(t)dt.

By definition, the estimated Bayes local false discovery rate is fdr̂(zθ̂) = π̂0f̂0(zθ̂)/f̂(zθ̂). We then consider the tail areas of the null and mixture subdensities, which can be obtained from the corresponding CDFs. Thus, the posterior FDR is given by

Posterior FDR: FDR̂(zθ̂) = π̂0 ∫_{zθ̂}^{+∞} f̂0(t)dt / ∫_{zθ̂}^{+∞} f̂(t)dt,

and accordingly the true discovery rate (TDR) is

Posterior TDR: TDR̂(zθ̂) = π̂1 ∫_{zθ̂}^{+∞} f̂1(t)dt / ∫_{zθ̂}^{+∞} f̂(t)dt.

The TPR^(zθ^) and TDR^(zθ^) inherit the consistency property from empirical Bayes estimators. Therefore, the empirical Bayes estimated TPR^(zθ^) and TDR^(zθ^) provide satisfactory surrogates for the true, yet unknown, sensitivity and TDR, which are required to determine the optimal threshold.
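Under an assumed fitted model, the three tail-area quantities above can be computed by numerical integration; the values of π0, f0, and f1 below are hypothetical estimates, not output of any eBass fitting step:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Assumed fitted model: pi0 = 0.9 null N(0,1), pi1 = 0.1 nonnull N(3,1)
pi0, pi1 = 0.9, 0.1
f0 = stats.norm(0, 1).pdf
f1 = stats.norm(3, 1).pdf
f = lambda t: pi0 * f0(t) + pi1 * f1(t)            # mixture density

def posterior_rates(z_cut):
    """Right-tail posterior TPR, FDR, and TDR at a candidate cut-off."""
    tail = lambda g: quad(g, z_cut, np.inf)[0]
    tpr = tail(f1) / quad(f1, -np.inf, np.inf)[0]  # posterior sensitivity
    fdr = pi0 * tail(f0) / tail(f)                 # posterior FDR
    return tpr, fdr, 1.0 - fdr                     # TDR = 1 - FDR
```

For example, at the conventional z cut-off of about 3.09 (p ≈ .001), this assumed model retains roughly half of the nonnull tail mass while the posterior FDR stays well below 5%.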

2.2 ∣. Objective function for the optimal threshold

Building on the posterior sensitivity (recall) TPR̂(zθ̂) and TDR (precision) TDR̂(zθ̂), we propose an objective function for optimal threshold selection. Specifically,

ẑθ̂ = argmax_{zθ̂} [π̂1 ∫_{zθ̂}^{+∞} f̂1(t)dt] / [π̂1 ∫_{−∞}^{+∞} f̂1(t)dt + π̂1 ∫_{zθ̂}^{+∞} f̂1(t)dt + (1 − π̂1) ∫_{zθ̂}^{+∞} f̂0(t)dt]
subject to FWER̂_cluster(zθ̂) ≤ α,   (1)

where the optimal cut-off zθ̂ is the estimand. The constraint FWER̂_cluster(zθ̂) ≤ α suppresses false positive clusters, while the objective function aims at high voxel-level sensitivity and TDR. We give the details of estimating the support Ωα in Section 2.3 and of the optimization procedure in Section 2.4.

Equation (1) is the harmonic mean of precision and recall, which bears a resemblance to the F-measure in information retrieval and machine learning. The F-measure is used when the two classes (ie, null and nonnull voxels) are imbalanced33-35 and interest centers on the nonnull voxels among a large number of unknown null voxels. This is well suited for clusterwise inference in fMRI data analysis because only the small proportion of nonnull voxels can determine the clusters. Maximizing the harmonic mean leads to balanced sensitivity and FDR for the two unbalanced classes.

In particular, the F1 score is the most commonly used F-measure and treats the two rates (ie, precision and recall) with equal importance. The F1 measure is suitable for primary threshold selection because our goal is to maximize sensitivity while controlling the FDR for clusterwise inference. Commonly used thresholding methods often focus only on controlling false-positive findings without considering sensitivity or power. Since clusterwise inference is a two-step procedure, the final inference results are subject to both the voxel-level primary threshold and the cluster-level threshold. Let β1, β2 denote the false negative rates of the primary thresholding and cluster-extent thresholding steps, respectively. Then the overall power is approximately (1 − β1) × (1 − β2). An overly conservative primary threshold (eg, BH-FDR or Bonferroni correction) can lead to relatively low overall power.
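A back-of-the-envelope illustration of this two-step power decomposition; the β values are made up for the comparison:

```python
# Overall power of the two-step procedure as the product of the step-wise
# powers, where beta1 and beta2 are the false negative rates of the primary
# and cluster-extent thresholding steps (illustrative numbers only).
def overall_power(beta1, beta2):
    return (1.0 - beta1) * (1.0 - beta2)

# An overly conservative primary threshold (large beta1) caps overall power
# even when the cluster step itself is powerful (small beta2).
conservative = overall_power(0.6, 0.05)   # 0.4 * 0.95 ≈ 0.38
moderate = overall_power(0.3, 0.10)       # 0.7 * 0.90 ≈ 0.63
```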

In summary, the empirical Bayes based objective function (1) jointly models the voxelwise sensitivity and FDR and the clusterwise FWER. Our model gains more power than conventional multiple testing methods that focus exclusively on false positive errors, and it is thus well suited for the two-step clusterwise inference procedure.

2.3 ∣. Estimating Ωα

In order to control the cluster-level FWER below α, we define the support Ωα for zθ̂ and search for ẑθ̂ on this support. Here, we describe the procedure to identify the set Ωα based on the empirical Bayes estimated sensitivity and FDR.

As stated in Section 2.2, the cluster-level FWER FWER̂_cluster is calculated based on ∫_{ẑθ̂}^{+∞} f̂0(t)dt. If the total number of false positive voxels V∫_{ẑθ̂}^{+∞} f̂0(t)dt is large at the cut-off ẑθ̂, the cluster size formed by these false positive voxels is likely to exceed the permutation-determined cluster-size threshold Kα. Thus, false positive clusters appear among the final suprathreshold clusters. To avoid false positive clusters, the cut-off ẑθ̂ is required to prevent them from forming large clusters.

Specifically, we denote the estimated number of false positive voxels at a cut-off ẑθ̂ by mfp(ẑθ̂) = Vπ̂0 ∫_{ẑθ̂}^{+∞} f̂0(t)dt. We next compute the upper bound of the cluster size based on the combinatorial probability that these mfp(ẑθ̂) false positive suprathreshold voxels form a contiguous nontrivial cluster in the 3D brain space. The cluster-level FWER FWER̂_cluster(zθ̂) is determined by the number of estimated false positive voxels and the cluster-size cut-off; namely, FWER̂_cluster(zθ̂) = Pr(Sup{μ(mfp(zθ̂))} ≥ Kα), where μ is the cardinality measure of any set of contiguous voxels formed by the mfp(zθ̂) voxels in the brain space. We then define the search domain for ẑθ̂ as Ωα = {zθ̂ : Sup{μ(mfp(zθ̂))} < Kα}. In practice, the direct calculation of Sup{μ(mfp(zθ̂))} is intractable, and we resort to permutation-based techniques to approximate it. In each permutation, the random shuffling of subject labels hypothetically produces an f̂0 distribution, and an αp level is chosen to control the permutation test FWER. Since we can theoretically calculate mfp(ẑθ̂) from the empirical Bayes estimates f̂0, f̂1, we consider (i) an α1 for the permutation bound, which controls the FWER among all suprathreshold voxels, and (ii) an α2 for the false positive cluster bound, which controls the FWER for the estimated false positive voxels. Commonly, the widely used 5% αp level is unadjusted, so that αp = α1. The adjustment based on the definitions of α1 and α2 is (1 − α1)(1 − α2) = 1 − αp. The level α2 is calculated by taking the top mfp(ẑθ̂) voxels in both tails of f̂0 and randomly choosing mfp(ẑθ̂) of these extreme observations to form false-discovery clusters. The adjustment indicates that when α2 → 0, 1 − α1 ≈ 1 − αp. In other words, we take the cluster size corresponding to max{α2} (or other levels based on the adjustment) to estimate the cluster size Cα,zθ̂ formed by false positive voxels, and then we can estimate the support Ωα.

Algorithm 1. Estimating Ωα

INPUT:
  b: number of breaks in empirical Bayes estimation
  π̂0: empirical Bayes estimated prior probability
  f̂0: empirical Bayes estimated null density
  J: number of permutation tests
  α2 level: false positive cluster bound (1% or maximum)
  αp level: commonly used 5% clusterwise FWER
  array z of zθ̂(i), i = 1, … , b
OUTPUT:
  array Ωα

Ωα ← [ ]
for all zθ̂(i) ∈ z do
  mfp(zθ̂(i)) ← V π̂0 ∫_{zθ̂(i)}^{+∞} f̂0(t)dt
  PF ← mfp-th smallest p-value in z
  Pθ̂ ← p-value corresponding to zθ̂(i)
  for all j ∈ [1, J] do
    randomly permute subject group labels
    zp(j) ← test statistics after permutation
    C(j) ← size of maximum connected component in zp(j) thresholded by PF
    K(j) ← size of maximum connected component in zp(j) thresholded by Pθ̂
  end for
  Cα,zθ̂ ← C at the α2 level
  Kα ← K at the αp level
  if Cα,zθ̂ < Kα then
    append zθ̂(i) to Ωα
  end if
end for

The algorithm to estimate Ωα is described in Algorithm 1.

In practice, we find that both Cα,zθ̂ and Kα are monotonically decreasing in zθ̂, with Cα,zθ̂ decreasing faster. Therefore, Ωα is often a continuous domain. Next, we optimize ẑθ̂ on the support Ωα.
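The permutation quantities C(j) and K(j) of Algorithm 1 can be sketched as follows. For brevity this toy works in 2D and draws smoothed Gaussian null fields directly instead of permuting subject labels:

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(1)

def max_cluster_size(field, z_cut):
    """Size of the largest connected suprathreshold component."""
    labels, n = ndimage.label(field > z_cut)
    return int(np.bincount(labels.ravel())[1:].max()) if n else 0

def perm_max_sizes(shape, z_cuts, n_perm, sigma=3.0):
    """Null distribution of the max cluster size at several cut-offs,
    a stand-in for the permutation loop of Algorithm 1."""
    sizes = np.zeros((n_perm, len(z_cuts)), dtype=int)
    for j in range(n_perm):
        field = ndimage.gaussian_filter(rng.standard_normal(shape), sigma)
        field /= field.std()                 # renormalize to unit variance
        for k, c in enumerate(z_cuts):
            sizes[j, k] = max_cluster_size(field, c)
    return sizes

sizes = perm_max_sizes((60, 60), [2.33, 3.09], n_perm=50)
K_alpha = np.quantile(sizes[:, 1], 0.95)     # cluster-size cut-off at p < .001
```

Because the suprathreshold set at a stricter cut-off is contained in the set at a more liberal one, the max cluster size is monotone in the threshold within each permutation, mirroring the monotonicity of Cα,zθ̂ and Kα noted above.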

2.4 ∣. Optimizing ẑθ̂

The cluster-level FWER̂_cluster is controlled while estimating the support Ωα. Then we are able to obtain FDR̂(ẑθ̂) and TDR̂(ẑθ̂) at each ẑθ̂ ∈ Ωα and find the optimal ẑθ̂ on the support.

Define the decision rule f_zθ̂: f(zv) ≥ f(zθ̂), where zθ̂ is the primary threshold and f is the mixture density. Let Y+ be the set of voxels from the nonnull component and Y− the set of voxels from the null component. Then let yv ∈ {−1, 1} be the indicator with yv = 1 if zv ∈ Y+ and yv = −1 if zv ∈ Y−. We consider the 0-1 loss functions for true positives and false positives with respect to the empirical Bayes estimates f̂0, f̂1. Inheriting the notation from Section 2.1, we have

TP(f_zθ̂) = ∫_{zθ̂}^{+∞} f̂1(t)dt = Σ_{zv∈Y+} 1[f(zv) ≥ f(zθ̂)] = Σ_{zv∈Y+} (1 − l(f_zθ̂, zv, yv))   (2)
FP(f_zθ̂) = ∫_{zθ̂}^{+∞} f̂0(t)dt = Σ_{zv∈Y−} 1[f(zv) ≥ f(zθ̂)] = Σ_{zv∈Y−} l(f_zθ̂, zv, yv)   (3)

To bound the above two quantities and ensure convexity, we replace the 0-1 loss function by the hinge loss l_h(f_zθ̂, zv, yv) = max(0, 1 − yv(f(zv) − f(zθ̂))). In this way, the lower bound of true positives TP_L and the upper bound of false positives FP_U at zθ̂ satisfy TP_L ≤ TP(f_zθ̂) and FP_U ≥ FP(f_zθ̂), respectively.
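A quick numerical check that the hinge surrogates really do bound the 0-1 counts; the density values and the cut-off below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical mixture-density values f(z_v) at nonnull (Y+) and null (Y-)
# voxels, plus the density value at a candidate cut-off f(z_theta)
f_plus = rng.uniform(0.0, 1.0, 500)
f_minus = rng.uniform(0.0, 1.0, 500)
f_cut = 0.5

def hinge(y, fz):
    """Hinge loss l_h = max(0, 1 - y * (f(z) - f(z_theta)))."""
    return np.maximum(0.0, 1.0 - y * (fz - f_cut))

TP = np.sum(f_plus >= f_cut)            # 0-1 count of true positives
FP = np.sum(f_minus >= f_cut)           # 0-1 count of false positives
TP_L = np.sum(1.0 - hinge(1, f_plus))   # hinge lower bound on TP
FP_U = np.sum(hinge(-1, f_minus))       # hinge upper bound on FP
```

The bounds hold elementwise: 1 − l_h never exceeds the 0-1 indicator on Y+, and l_h never falls below it on Y−.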

Once we have these bounds, our objective function (1), denoted by h(zθ̂, π̂0, π̂1, f̂0, f̂1) = [π̂1 ∫_{zθ̂}^{+∞} f̂1(t)dt] / [π̂1 ∫_{−∞}^{+∞} f̂1(t)dt + π̂1 ∫_{zθ̂}^{+∞} f̂1(t)dt + (1 − π̂1) ∫_{zθ̂}^{+∞} f̂0(t)dt], can be replaced by the surrogate F_eB = TP_L / (∣Y+∣ + TP_L + FP_U). Our goal is to maximize F_eB given that the maximum FDR is controlled at some level 1 − α*, which is equivalent to requiring the minimum precision to exceed α*. The problem can be written as:

max TP_L(f) / (∣Y+∣ + TP_L(f) + FP_U(f))   s.t.   TP_L(f) ≥ α* (TP_L(f) + FP_U(f)).

We also introduce shorthand notation for the hinge-loss sums:

ℓ+(f) = Σ_{z∈Y+} l_h(f_zθ̂, z, y);   ℓ−(f) = Σ_{z∈Y−} l_h(f_zθ̂, z, y).   (4)

Alternatively, since maximizing F_eB is equivalent to minimizing (F_eB)^{−1}, we rewrite the objective function in terms of ℓ+, ℓ−:

min [2∣Y+∣ − ℓ+(f) + ℓ−(f)] / [∣Y+∣ − ℓ+(f)]   s.t.   (1 − α*)(∣Y+∣ − ℓ+(f)) ≥ α* ℓ−(f),   (5)

where α* is obtained from the boundary point on FWER^cluster constraint.

Writing ϕ = ∣Y+∣ − ℓ+(f), the above minimization is equivalent to

min_{f,ϕ} max_{λ≥0} ϕ^{−1}(∣Y+∣ + ℓ−(f)) − λϕ + λ [α*/(1 − α*)] ℓ−(f),

where λ is the Lagrange multiplier. The above weighted loss function can be optimized by cost-sensitive binary classification algorithms (eg, logistic regression).36,37 Since we restrict the search region to the locally convex neighborhood, the solution of this optimization problem is unique, and we have the following theorem on optimality:
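As a hypothetical illustration of such a cost-sensitive classifier (not the authors' implementation), consider a class-weighted logistic regression fit by plain gradient descent; upweighting the nonnull class shifts the decision boundary toward the null class, mimicking a more liberal effective threshold:

```python
import numpy as np

rng = np.random.default_rng(3)

def weighted_logistic(X, y, w_pos=1.0, w_neg=1.0, lr=0.1, n_iter=2000):
    """Class-weighted logistic regression fit by gradient descent,
    a stand-in for a generic cost-sensitive binary classifier."""
    X1 = np.column_stack([np.ones(len(X)), X])   # add intercept column
    beta = np.zeros(X1.shape[1])
    w = np.where(y == 1, w_pos, w_neg)           # per-class costs
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))
        beta -= lr * (X1.T @ (w * (p - y))) / len(y)
    return beta

# Toy 1D scores: null voxels near 0, nonnull voxels near 3
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(3, 1, 100)])[:, None]
y = np.concatenate([np.zeros(300), np.ones(100)])
beta_eq = weighted_logistic(x, y)                # equal costs
beta_w = weighted_logistic(x, y, w_pos=3.0)      # penalize missed nonnulls
```

The decision boundary is −β0/β1; with the heavier positive-class weight it moves left, admitting more candidate nonnulls at the price of more false positives, which is the trade-off the Lagrangian above formalizes.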

Theorem 1 (Optimality). Let ẑθ̂ ∈ Ωα, and suppose TDR̂(ẑθ̂) and FDR̂(ẑθ̂) exist for each ẑθ̂. The eBass primary threshold ẑθ̂ = argmax_{ẑθ̂∈Ωα} h(ẑθ̂, θ̂), that is, the ẑθ̂ satisfying h(ẑθ̂, ⋅) = Sup_{ẑθ̂∈Ωα} h(ẑθ̂, ⋅), is optimal.

Proof. The search region of the objective function h(ẑθ̂, ⋅) is restricted to the support Ωα. Since each TDR̂(ẑθ̂) and FDR̂(ẑθ̂) corresponding to a ẑθ̂ exists, and the hinge loss functions (4) in the constraints (2), (3) are convex, the optimization problem can be rewritten in the form (5). The solution to (5) exists and is unique under the cost-sensitive binary classification algorithm. Thus, the eBass primary threshold is optimal. ■

In addition, since the support Ωα is a finite set under the empirical Bayes estimation, our grid search algorithm can also guarantee detection of the unique ẑθ̂ and achieve optimality.

2.5 ∣. eBass in clusterwise inference

In Sections 2.1 to 2.4, we proposed an objective function (1) expressed through the empirical Bayes estimators π̂0, π̂1, f̂0, f̂1. Since these estimators are consistent and we can obtain the optimal solution ẑθ̂ of h(ẑθ̂, θ̂), we conclude with a theorem on the consistency of ẑθ̂:

Theorem 2 (Consistency). Suppose that for samples x(1), x(2), … , x(n), …, the eBass primary thresholds ẑθ̂(i) = argmax h(zθ̂(i), θ̂), with θ̂ = {π̂0, π̂1, f̂0, f̂1}, exist; denote them by ẑθ̂(1), ẑθ̂(2), … , ẑθ̂(n), …. Under mild regularity conditions, the estimated primary threshold ẑθ̂ from the objective function h(zθ̂, θ̂) is consistent as n → ∞.

Proof. For the ith random sample x(i), the corresponding test statistics are z(i) = {z1(i), z2(i), … , zV(i)}, where V is the number of voxels and i = 1, … , n. Let the test statistics be independently drawn from an unknown density with prior probability π(θ) and follow the marginal distribution zv(i) ~ f(zv(i) ∣ θv), where θv ∈ θ and θ = (θ1, … , θV) is a finite parameter space. The empirical Bayes estimator θ̂ = {π̂0, π̂1, f̂0, f̂1} is consistent.29,31 When the sample size goes to infinity, θ̂ →p θ under mild regularity conditions.38-40

Suppose the eBass primary threshold ẑθ̂(i) exists on the support Ωα. In the empirical Bayes estimated objective function h(ẑθ̂(i), θ̂) = π̂1 g1(ẑθ̂(i)) / [π̂1 ∫_{−∞}^{+∞} f̂1(t)dt + π̂1 g1(ẑθ̂(i)) + (1 − π̂1) g2(ẑθ̂(i))], the probabilities π̂1 and π̂0 are treated as constants c and 1 − c, while the two functions g1(t) = ∫_{t}^{+∞} f̂1(s)ds and g2(t) = ∫_{t}^{+∞} f̂0(s)ds are tail areas of CDFs of either a continuous N(0, 1) random variable or a smoothed differentiable density. Thus g1 and g2 are continuous functions, and the objective function h(g1(t), g2(t)) = c g1(t) / [c + c g1(t) + (1 − c) g2(t)] is continuous at each g1(ẑθ̂(i)), g2(ẑθ̂(i)). Then h ∘ (g1, g2) is a continuous function.

By the continuous mapping theorem, ĥ(ẑθ̂(i), ⋅) →p h(ẑθ̂(i), ⋅).

For the ith sample, the objective function h(ẑθ̂(i), ⋅) can be normalized as h̃(ẑθ̂(i), ⋅) = h(ẑθ̂(i), ⋅) / ∫_{Ωα} h(zθ̂, ⋅) dzθ̂. The maximizer ẑθ̂(i) of h̃(ẑθ̂(i), ⋅) satisfies ẑθ̂(i) = argmax h̃(ẑθ̂(i), ⋅) = argmax h(ẑθ̂(i), ⋅). Consider the likelihood function for ẑθ̂ ∈ Ωα, L(ẑθ̂) = Π_{i=1}^{n} h̃(x(i); ẑθ̂(i)), which is log-concave. Then ẑθ̂ is the maximizer of the scaled log-likelihood (1/n) Σ_{i=1}^{n} log h̃(x(i); ẑθ̂(i)).

Suppose the true parameter is zθ̂*. By the Law of Large Numbers, for any zθ̂, (1/n) Σ_{i=1}^{n} log h̃(x(i); zθ̂) → E_{zθ̂*}[log h̃(x; zθ̂)], where E_{zθ̂*} denotes expectation with respect to the true unknown parameter. Then E_{zθ̂*}[log h̃(x; zθ̂)] − E_{zθ̂*}[log h̃(x; zθ̂*)] = E_{zθ̂*}[log(h̃(x; zθ̂)/h̃(x; zθ̂*))].

Since x ↦ log x is concave, by Jensen's inequality, E_{zθ̂*}[log(h̃(x; zθ̂)/h̃(x; zθ̂*))] ≤ log E_{zθ̂*}[h̃(x; zθ̂)/h̃(x; zθ̂*)] = log ∫ h̃(x; zθ̂) dx = 0. Thus zθ̂ ↦ E_{zθ̂*}[log h̃(x; zθ̂)] is maximized at zθ̂ = zθ̂*. When the number of samples n → ∞, the consistency holds under mild regularity conditions (see Appendix C). ■

The empirical Bayes estimated FDR and sensitivity determine our objective function (1) and bridge the data information and the parameter optimization. The data-driven primary threshold selection method eBass automatically maximizes the empirical Bayes sensitivity while controlling the cluster-level FWER and voxel-level FDR, fully leveraging the information in the empirical data. Therefore, the eBass primary threshold can outperform prespecified primary thresholds in many scenarios (see the simulation and data example results).

We summarize the computational procedure of eBass in three steps:

Step 1. Calculate TPR̂(zθ̂) and FDR̂(zθ̂) with the empirical Bayes estimates π̂0, π̂1, f̂0, f̂1, and form the objective function h(zθ̂, π̂0, π̂1, f̂0, f̂1);

Step 2. Identify the support Ωα, which guarantees FWER̂_cluster ≤ α;

Step 3. Obtain ẑθ̂ by optimizing the objective function, argmax_{zθ̂∈Ωα} h(zθ̂, π̂0, π̂1, f̂0, f̂1), on the support calculated in Step 2, with the updated constraint on FDR̂(zθ̂).
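A minimal end-to-end sketch of Step 3, assuming Gaussian components and taking a precomputed grid as a stand-in for the Step 2 support Ωα (all inputs here are hypothetical, not empirical Bayes output):

```python
import numpy as np
from scipy import stats

def ebass_threshold(pi0, f0_sf, f1_sf, support):
    """Evaluate objective (1) on a candidate support and return the
    maximizing cut-off; pi0/f0_sf/f1_sf stand in for the Step 1
    empirical Bayes estimates, support for the Step 2 Omega_alpha."""
    pi1 = 1.0 - pi0
    tp = pi1 * f1_sf(support)          # true positive mass above cut-off
    fp = pi0 * f0_sf(support)          # false positive mass above cut-off
    h = tp / (pi1 + tp + fp)           # objective function (1)
    return support[np.argmax(h)]

support = np.linspace(2.0, 4.5, 251)   # stand-in for Omega_alpha
z_opt = ebass_threshold(0.95, stats.norm(0, 1).sf,
                        stats.norm(3.5, 1).sf, support)
```

Under these assumed components, the selected cut-off falls between the conventional z values for p < .01 and p < .001, illustrating how the data, rather than convention, determines the threshold.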

In summary, we provide a data-driven optimal primary threshold selection step via an empirical Bayes framework. The selection of the primary threshold becomes more flexible because eBass optimizes it by maximizing sensitivity while rigorously controlling the clusterwise false positive error rate. In the following simulation analysis and real data example, we demonstrate that eBass can improve statistical power without sacrificing the rigor of FWER control.

3 ∣. SIMULATION

3.1 ∣. Data description

In the simulation study, we evaluate the performance of eBass and compare it to existing methods. We first simulate two-dimensional (2D) image data for multiple subjects. The number of voxels in each image is V = 100 × 100 = 10 000, and thus the number of simultaneous tests is 10 000. We assume that most voxels are from the null set, whereas two square areas (N0 = 21 × 21 + 6 × 6 = 477 voxels) are from the nonnull; see Figure 2B. We apply a commonly used two-group (ie, cases vs controls) scenario, which can be easily extended to the regression setting. First, let voxels from the null set follow a normal distribution N(0, 1) for both cases and controls. Within the two square areas, the nonnull voxels of the cases follow a normal distribution N(μk, 1) (k = 1, 2 for the two areas), whereas the voxels of the controls follow N(0, 1). We define the signal-to-noise ratio (SNR) as the reciprocal of the coefficient of variation, SNR = μk/σ; setting σ = 1 makes the difference of group means the true positive ES, equivalent to Cohen's d. A higher SNR leads to higher sensitivity and a lower FDR, and vice versa. We further smooth each image with a Gaussian filter, using a full width at half maximum (FWHM) equivalent to 8 mm, so that the voxels in the smoothed image are correlated like real fMRI data. We let the number of subjects per group be 30, 60, and 100. For each setting, we simulate 100 datasets.
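A sketch of this simulation design; the region locations and the 2 mm voxel size used for the FWHM-to-kernel conversion are assumptions not stated in the text:

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(4)

def simulate_group(n_subj, mu=0.8, fwhm=8.0, voxel=2.0):
    """One group of 100x100 images with two square activation regions
    (region locations and the 2 mm voxel size are assumptions)."""
    truth = np.zeros((100, 100))
    truth[20:41, 20:41] = mu           # 21 x 21 region
    truth[60:66, 60:66] = mu           # 6 x 6 region
    sigma = fwhm / voxel / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # FWHM -> sd
    imgs = truth + rng.standard_normal((n_subj, 100, 100))
    smoothed = np.stack([ndimage.gaussian_filter(im, sigma) for im in imgs])
    return smoothed, truth > 0

cases, active = simulate_group(30, mu=0.8)     # ES = 0.8, 30 per arm
controls, _ = simulate_group(30, mu=0.0)
```

Smoothing both signal and noise, as here, induces the spatial correlation described above while attenuating the effective ES near region edges, which is why the small 6 × 6 region is the harder one to detect.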

FIGURE 2. (A) is the image of the underlying truth; red squares represent the activated regions. (B) shows one original image with a sample size of 30 per arm and an ES of 0.8: nonactivated voxels in the blue region of (A) follow N(0, 1) and activated voxels in the red regions of (A) follow N(0.8, 1). (C) is the original image in (B) smoothed with a Gaussian kernel of FWHM = 8 mm. ES, effect size; FWHM, full width at half maximum

3.2 ∣. Data analysis

For each dataset, we perform the two-step clusterwise inference. We determine the primary threshold using eBass as well as a variety of popular methods, including BH-FDR correction, p < .001, and p < .01. We evaluate the performance of these methods in terms of vFDR, together with cFWER, by comparing the clusters selected by the clusterwise inference to the two true squares. Note that although the voxelwise sensitivity and FDR are not ultimately applicable in clusterwise inference, we provide them in order to evaluate the quality of control in the voxel-level primary thresholding step.

3.3 ∣. Results

We summarize the simulation results in Table 1. We first compare the results when the ES is medium (ES = 0.6). For the sample size of 30 cases vs 30 controls, the study is underpowered and the test statistics from the true positive voxels are mixed with those from false positive voxels.41,42 Therefore, only a few voxels survive the corrected primary threshold and form a cluster larger than the step-two cluster-level threshold. The vFDRs of all methods are well controlled. The sensitivity of eBass is 137% higher than that of the p < .001 threshold and 388% higher than that of the BH-FDR correction. Although the well-controlled vFDR prevents false positive findings, the overall low sensitivity can also lead to low replicability because true positive findings rarely overlap across datasets.

TABLE 1.

Simulation result for original images of ES = 0.6, 0.8, and 1.0, sample size 30, 60, and 100 per arm smoothed with FWHM = 8 mm

eBass BH-FDR p < .001 p < .01
Effect size = 0.6
30 per arm Threshold 0.0027(Q1)/0.0051(Q2)/0.0147(Q3) 0.0002(Q1)/0.0009(Q2)/0.0017(Q3) 0.001 0.01
Sensitivity 0.1932 ± 0.0934 0.0396 ± 0.0325 0.0816 ± 0.0344 0.2263 ± 0.081
vFDR 0.024 ± 0.0653 0 0.0117 ± 0.0469 0
cFWER 5.6% 0 6.3% 0
60 per arm Threshold 0.0012(Q1)/0.0027(Q2)/0.0056(Q3) 0.0008(Q1)/0.0009(Q2)/0.0013(Q3) 0.001 0.01
Sensitivity 0.5039 ± 0.1237 0.3731 ± 0.1481 0.3777 ± 0.1075 0.6763 ± 0.1164
vFDR 0.0067 ± 0.0155 0.0050 ± 0.0154 0.0046 ± 0.0144 0.0127 ± 0.0256
cFWER 5% 10% 10% 10%
100 per arm Threshold 0.0009(Q1)/0.0020(Q2)/0.0032(Q3) 0.0020(Q1)/0.0021(Q2)/0.0022(Q3) 0.001 0.01
Sensitivity 0.8292 ± 0.0749 0.8460 ± 0.0767 0.7850 ± 0.081 0.9378 ± 0.0422
vFDR 0 0.0019 ± 0.0049 0.0001 ± 0.0006 0
cFWER 0 0 0 0
Effect size = 0.8
30 per arm Threshold 0.0073(Q1)/0.0097(Q2)/0.0106(Q3) 0.0006(Q1)/0.0007(Q2)/0.0009(Q3) 0.001 0.01
Sensitivity 0.6012 ± 0.0886 0.2881 ± 0.1141 0.3146 ± 0.0848 0.6109 ± 0.0914
vFDR 0.004 ± 0.01 0.0045 ± 0.0201 0 0.0039 ± 0.0096
cFWER 0 5% 0 0
60 per arm Threshold 0.0034(Q1)/0.0050(Q2)/0.0055(Q3) 0.0021(Q1)/0.0023(Q2)/0.0023(Q3) 0.001 0.01
Sensitivity 0.9438 ± 0.0396 0.9072 ± 0.0552 0.8559 ± 0.0605 0.9624 ± 0.0376
vFDR 0.0109 ± 0.0252 0.0052 ± 0.0156 0.0049 ± 0.0129 0.0119 ± 0.0263
cFWER 15% 10% 15% 10%
100 per arm Threshold 0.0011(Q1)/0.0016(Q2)/0.0020(Q3) 0.0024(Q1)/0.0024(Q2)/0.0025(Q3) 0.001 0.01
Sensitivity 0.9945 ± 0.0095 0.9955 ± 0.0087 0.9897 ± 0.0145 0.9988 ± 0.0030
vFDR 0.0036 ± 0.0095 0.0023 ± 0.0073 0.0025 ± 0.0078 0.0064 ± 0.0147
cFWER 10% 5% 10% 5%
Effect size = 1.0
30 per arm Threshold 0.0093(Q1)/0.0119(Q2)/0.0154(Q3) 0.0014(Q1)/0.0016(Q2)/0.0018(Q3) 0.001 0.01
Sensitivity 0.8539 ± 0.0522 0.6306 ± 0.1122 0.5781 ± 0.0982 0.8432 ± 0.0526
vFDR 0.0093 ± 0.0189 0.0045 ± 0.0138 0.0026 ± 0.0115 0.0057 ± 0.0168
cFWER 5% 10% 5% 5%
60 per arm Threshold 0.0015(Q1)/0.0022(Q2)/0.0031(Q3) 0.0023(Q1)/0.0024(Q2)/0.0025(Q3) 0.001 0.01
Sensitivity 0.9851 ± 0.0154 0.9842 ± 0.0191 0.9727 ± 0.0281 0.9951 ± 0.0087
vFDR 0.0003 ± 0.0014 0.0002 ± 0.0006 0.0027 ± 0.0085 0.0097 ± 0.0177
cFWER 0 0 10% 5%
100 per arm Threshold 0.0004(Q1)/0.0006(Q2)/0.0008(Q3) 0.0024(Q1)/0.0024(Q2)/0.0025(Q3) 0.001 0.01
Sensitivity 1 1 1 1
vFDR 0 0.0015 ± 0.0034 0 0.0067 ± 0.0105
cFWER 0 0 0 0

Abbreviations: BH-FDR, Benjamini-Hochberg false discovery rate; cFWER, cluster-level familywise error; eBass, empirical Bayes adaptive threshold selection; ES, effect size; FWHM, full width at half maximum; vFDR, voxel-level false discovery rate.

When the sample size is increased to 60 subjects per arm, the statistical power at each voxel increases to above 80%. The vFDRs of eBass, the BH-FDR correction, and the p < .001 primary threshold are around .005, while the vFDR of the p < .01 threshold is close to .013. Note that the eBass primary threshold varies across the repeatedly simulated datasets and automatically adapts to the characteristics of the data. As a result, its average vFDR is about the same as those of p < .001 and BH-FDR, but with significantly increased sensitivity. Since a sample size of 60 vs 60 with ES = 0.6 is common in fMRI studies, these results provide practical guidance for optimal primary threshold selection in clusterwise inference.

Last, for the sample size of 100 vs 100, the test statistics of voxels from the null set are clearly separated from those from the nonnull set, which increases the sensitivity of all methods. The eBass method has slightly higher sensitivity than p < .001 with better-controlled vFDR because of its adaptive optimal threshold selection. The cFWER is well controlled for all methods.

The results for larger ESs (ES = 0.8 and 1.0) follow a similar pattern (see Table 1). When the sample size is small (ie, 30 cases vs 30 controls), eBass significantly improves sensitivity while keeping the vFDR at a very low level. At ES = 0.8 with a medium to large sample size (ie, 60 or 100 subjects per arm), the adaptive threshold selection methods (eBass and the BH-FDR correction) perform slightly better than the fixed primary threshold p < .001. When the ES reaches 1.0, there is little difference in sensitivity among the methods, especially when the sample size is very large (100 subjects per arm). All methods control the vFDR well as the ES increases. Since we focus on clusterwise inference and cFWER, we only compare eBass with existing clusterwise inference methods here. In Appendix A, we compare TFCE with the clusterwise inference methods, including eBass, by treating all voxels in significant clusters as positive.

In summary, eBass shows advantageous performance in improving sensitivity while controlling the voxelwise FDR, especially when the sample size and ES are small to medium. With increased sample size and ES, most of the widely used thresholding methods perform well and yield similar primary thresholds.

4 ∣. DATA EXAMPLE

4.1 ∣. Data acquisition

Rs-fMRI data were collected from 92 schizophrenia patients (SZs) at the University of Maryland Center for Brain Imaging Research. The average age of the SZ cohort is 35.5 ± 13.2 years, and 26 of the participants are female. A Siemens 3T TRIO MRI system (Erlangen, Germany) equipped with a 32-channel phased-array head coil was used to collect the resting-state T2*-weighted images with the following parameters: TR = 2 seconds, TE = 30 ms, flip angle = 90°, FOV = 248 mm, 128 × 128 matrix, 1.94 × 1.94 mm in-plane resolution, 4 mm slice thickness, 37 axial slices, and 444 volumes. During the scan, participants were asked to keep their eyes closed and relax.

4.2 ∣. Data preprocessing

Preprocessing of the rs-fMRI data was performed using the Data Processing and Analysis for (Resting-State) Brain Imaging toolbox.43 The first ten time frames were removed to allow for signal stabilization. Raw data underwent motion correction to the first image, slice-timing correction to the middle slice, and normalization to MNI space. To ensure that spurious motion and physiological artifacts did not drive the observed effects in our statistical analyses, the resting data also underwent regression of the 6 motion parameters and their derivatives (12 total motion estimates) and physiological (white matter and cerebrospinal fluid) signals prior to spatial smoothing with an 8 mm FWHM Gaussian kernel. Framewise displacement was calculated for each image; this measure differences the head realignment parameters across frames and generates a six-dimensional time series that represents instantaneous head motion.44 All individuals in the current analysis have mean framewise displacement ≤ 0.25 to better control for potential confounding effects of motion and motion artifacts on the rs-fMRI signal.
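
The framewise displacement computation cited above (Power et al44) can be sketched as follows; the 50 mm head radius used to convert rotations to displacements follows that reference, while the array layout is an assumption for illustration:

```python
# Sketch of Power et al. framewise displacement from 6 realignment parameters
# (3 translations in mm, 3 rotations in radians).
import numpy as np

def framewise_displacement(motion, head_radius=50.0):
    """motion: (T, 6) array, columns = [tx, ty, tz, rx, ry, rz]."""
    d = np.diff(motion, axis=0)
    d[:, 3:] *= head_radius             # radians -> mm of arc on a 50 mm sphere
    fd = np.abs(d).sum(axis=1)          # sum of absolute frame-to-frame changes
    return np.concatenate([[0.0], fd])  # FD is defined as 0 for the first frame

motion = np.zeros((4, 6))
motion[2, 0] = 0.1                      # a 0.1 mm translation at frame 2
fd = framewise_displacement(motion)     # 0.1 mm moving in, 0.1 mm moving back
```

A subject-level summary such as `fd.mean()` then supports exclusion rules like the mean FD ≤ 0.25 criterion used here.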

4.3 ∣. Data analysis

We aim to examine the resting-state functional connectivity (rsFC) pattern (ie, the seed-voxel-based connectivity map) influenced by the chlorpromazine (CPZ) equivalent daily dose among SZs. CPZ is a commonly used medication to treat psychotic disorders, including schizophrenia and bipolar disorder.45 Although previous studies have thoroughly investigated the treatment effect on the symptoms of schizophrenia, the neurobiology of the treatment effect is poorly understood. The rsFC analysis provides a high-resolution assessment of the treatment effect on the central nervous system.

We perform a seed-voxel-based rsFC analysis. A 10 mm spherical seed is centered on the posterior cingulate cortex (PCC) at (−5, −49, 40). The correlations between the seed and all remaining voxels are calculated and then normalized by the Fisher Z transformation. We then conduct the two-step clusterwise inference to identify voxel clusters related to the treatment dose response.
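
A minimal sketch of this seed-based connectivity step, with assumed array shapes and no masking or nuisance handling:

```python
# Correlate the mean seed time series with every voxel time series,
# then apply the Fisher Z-transform (arctanh).
import numpy as np

def seed_connectivity(data, seed_mask):
    """data: (T, V) voxel time series; seed_mask: boolean (V,). Returns z map (V,)."""
    seed_ts = data[:, seed_mask].mean(axis=1)
    x = (data - data.mean(axis=0)) / data.std(axis=0)   # standardize voxels
    s = (seed_ts - seed_ts.mean()) / seed_ts.std()      # standardize seed
    r = (x * s[:, None]).mean(axis=0)                   # Pearson r per voxel
    r = np.clip(r, -0.999999, 0.999999)                 # keep arctanh finite
    return np.arctanh(r)                                # Fisher Z
```

Voxels inside the seed correlate perfectly with the seed mean and receive a large (clipped) z, while unrelated voxels hover near zero.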

In step one, voxelwise regression analyses on 223 553 nonzero voxels are performed in the patient group of 92 participants. CPZ is included as the regressor of interest, and we adjust for age and gender. Specifically, we apply the eBass approach and select the optimal primary threshold p = .00053 by balancing the empirical Bayes TPR and FWER in the objective function (1). In the step-two clusterwise inference, AFNI's 3dttest++ with the Clustsim option is used to perform a permutation test (controlled at FWE < 0.05) on the suprathreshold voxels and yield the cluster-size threshold. In this step, the voxelwise primary thresholds eBass, p < .001, and p < .01 are applied accordingly.
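
The step-one voxelwise regression can be sketched as a massive univariate OLS; the variable names and the closed-form t-test below are illustrative, not the authors' code:

```python
# For each voxel, regress the Fisher-z connectivity value on CPZ dose,
# adjusting for age and gender, and keep the p-value of the CPZ coefficient.
import numpy as np
from scipy import stats

def voxelwise_regression(z_maps, cpz, age, gender):
    """z_maps: (n, V) subject-by-voxel connectivity; returns (V,) CPZ p-values."""
    n, V = z_maps.shape
    X = np.column_stack([np.ones(n), cpz, age, gender])
    beta, *_ = np.linalg.lstsq(X, z_maps, rcond=None)    # (4, V), all voxels at once
    resid = z_maps - X @ beta
    df = n - X.shape[1]
    sigma2 = (resid ** 2).sum(axis=0) / df               # residual variance per voxel
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])  # SE of the CPZ slope
    t = beta[1] / se
    return 2 * stats.t.sf(np.abs(t), df)                 # two-sided p-values
```

The resulting p-value map is what the primary threshold (eBass, p < .001, or p < .01) is applied to before the cluster-size test.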

4.4 ∣. Results

We identify patterns of rsFC correlated with CPZ dose in schizophrenia: an increase in CPZ dose is associated with reduced connectivity between the PCC and three significant clusters located at (1) the right Heschl's gyrus (Brodmann area, ie, BA 41), (2) the left middle orbital gyrus (BA 11), and (3) the left superior temporal gyrus (BA 21). Cluster (1) has size 240 with its peak voxel at (48, −21, 9); it is part of the primary auditory cortex. Cluster (2) has a similar size to cluster (1) but lies in the left hemisphere, with its peak voxel at (−24, 30, −21); this region is involved in processing emotion and value. Cluster (3) has size 79, only slightly larger than the clusterwise threshold; its peak voxel is at (−54, −3, −9), and this area processes auditory information and language. Information on all clusters, including cluster size, peak voxel location, BA number, label, and regional function, is summarized in Table 2, and the regions are demonstrated on a 3D surface model46-49 in Figure 3. All detected regions show reduced connectivity with the seed.

TABLE 2.

Significant cluster information, given (1) the eBass primary threshold p = .00053, q = 0.0273 with the full sample (n = 92); and (2) the eBass primary threshold p = .0015, q = 0.1274 with half of the full sample (n = 46)

Sample size Clusters Size MNI: peak
voxel (x, y, z)
BA Label Function
N = 92 Cluster 1 240 48, −21, 9 R41 Right Heschl’s gyrus Part of primary auditory cortex
Cluster 2 238 −24, 30, −21 L11 Left orbitofrontal cortex Processing emotion and value
Cluster 3 79 −54, −3, −9 L21 Left superior temporal gyrus Auditory processing and language
N = 46 Cluster 1 138 57, −6, 12 R41 Right Heschl’s gyrus Part of primary auditory cortex

Abbreviations: BA, Brodmann area; L, left; MNI, Montreal Neurological Institute; R, right.

FIGURE 3

(A) is the three-dimensional whole brain with the detected clusters marked in different colors under the full sample (n = 92); (1), (2), and (3) are anatomical diagrams focusing on the corresponding clusters listed in Table 2. (B) is the cluster detected under the half sample (n = 46). (C) shows the overlapping findings from full-sample Cluster 1 (red outline underlay) and half-sample Cluster 1 (green solid area overlay). Different colors are used to distinguish the clusters. Note that all regions have reduced connectivity with the PCC seed. PCC, posterior cingulate cortex

These activation regions match relatively well with the regions reported in well-established studies showing significant rsFC differences between SZ patients and healthy controls. Previous studies reported reduced connectivity between the seed and cluster (1) among drug-resistant patients,50,51 abnormal connectivity between the seed and cluster (2) among patients,50,52 and increased connectivity between the seed and cluster (3) among patients without a treatment or drug-resistant group.50,53 In our results, all three regions show reduced connectivity, which indicates that CPZ dose is negatively correlated with the activation of these regions. Compared with the results of previous studies, CPZ effectively reduces the connectivity at the orbitofrontal cortex and superior temporal gyrus, whereas the reduced connectivity between the PCC and Heschl's gyrus is not much affected by the CPZ dose. Based on these findings, we conclude that CPZ can mitigate some abnormal connectivity patterns in SZ patients and thus ultimately relieve their psychotic symptoms, especially auditory hallucinations.

4.5 ∣. Comparisons

We compare the eBass primary threshold with the popular primary threshold values p < .001 and p < .01. In Figure 4, we demonstrate significant clusters with FWER correction based on the different primary thresholds. The cluster-size thresholds and cluster details are listed in Table B1 in Appendix B. The primary threshold of p < .001 yields three similar clusters with slightly increased cluster sizes, whereas the primary threshold of p < .01 detects two oversized clusters spanning multiple anatomical brain gyri/sulci.

FIGURE 4

(A) to (C) show significant clusters using the primary thresholds of eBass, p < .001, and p < .01, respectively, based on all participants in the data example. Significant clusters are marked in color. (D) and (E) demonstrate significant clusters using the primary thresholds of eBass and p < .001 based on data with half of the original sample size; the primary threshold p < .01 yields no significant cluster

In addition, we randomly sample n = 46 subjects from the original population to further evaluate the methods' performance. In the subsample, one cluster at roughly the same location as cluster (1) from the full sample is detected with the eBass primary threshold p = .0015 (see Figure 3). With p < .001, two significant clusters are detected, but they do not overlap with the full-sample results (see Table B1 in Appendix B). The ratio of the Jaccard indices between the full sample and the subsample for eBass versus p < .001 is 1.91. No significant cluster survives the primary threshold of p < .01, which yields a Jaccard index of 0.
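
The Jaccard index used in this comparison is simply the intersection over the union of the binary maps of significant voxels; a minimal sketch:

```python
# Replicability comparison: Jaccard index between the significant-voxel masks
# obtained from the full sample and from a subsample.
import numpy as np

def jaccard(mask_a, mask_b):
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union > 0 else 0.0  # 0 when neither map has voxels

a = np.zeros((10, 10), dtype=bool)
b = np.zeros((10, 10), dtype=bool)
a[:4, :4] = True          # 16 voxels
b[2:6, 2:6] = True        # 16 voxels, 4 overlapping
print(jaccard(a, b))      # 4 / 28
```

A larger index indicates better overlap between findings across (sub)samples, that is, higher replicability.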

The eBass primary threshold identifies significant clusters that are consistent with those in the original dataset and subsample. Therefore, the data-driven eBass primary threshold is more flexible and provides a better balance between the sensitivity and FDR, which should lead to greater replicability.

5 ∣. DISCUSSION

We have developed a data-driven primary threshold selection method for the two-step clusterwise fMRI inference. The multiple comparison problem has been at the heart of neuroimaging data analysis, because it can determine the validity of findings. In practice, true signals in neuroimaging data are often mixed with various sources of noise, and the statistical inference models are sensitive to the noise. Therefore, a small erroneous shift from the optimal decision-making threshold can cause a significant loss of statistical power or uncontrolled false-positive findings. However, the primary threshold has been conventionally selected based on empirical analysis and experience, which may not provide the optimal threshold for the target neuroimaging data. To address this need, we propose an empirical Bayes method to calculate estimated sensitivity and FDR and thus facilitate the optimization of selecting the primary threshold for clusterwise inference.
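
As a conceptual illustration only (the exact eBass objective, referred to as (1) in the data example, is given in the methods section and not reproduced here), threshold selection from an estimated two-component mixture can be sketched as follows: given estimated mixture parameters, scan candidate thresholds and maximize an F1-type balance between the model-based sensitivity and precision (1 − FDR):

```python
# Simplified sketch of data-driven threshold selection from a two-component
# z-value mixture: null ~ N(0, 1) with weight pi0, nonnull ~ N(mu1, sd1^2).
# The mixture parameters are assumed already estimated (e.g., by EM).
import numpy as np
from scipy.stats import norm

def select_threshold(pi0, mu1, sd1, grid=None):
    if grid is None:
        grid = np.linspace(2.0, 5.0, 301)       # candidate z thresholds
    best_t, best_f1 = grid[0], -np.inf
    for t in grid:
        tp = (1 - pi0) * norm.sf(t, loc=mu1, scale=sd1)  # expected true positives
        fp = pi0 * norm.sf(t)                            # expected false positives
        sens = norm.sf(t, loc=mu1, scale=sd1)            # model-based sensitivity
        prec = tp / (tp + fp) if tp + fp > 0 else 0.0    # model-based 1 - FDR
        f1 = 2 * prec * sens / (prec + sens) if prec + sens > 0 else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```

Intuitively, when the null proportion pi0 is larger, precision degrades faster, so the selected threshold becomes more stringent; this is the sense in which the threshold adapts to the data rather than being fixed a priori.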

Built on the successful development of the empirical Bayes approach in high-dimensional statistics, eBass enjoys several advantageous theoretical properties regarding estimation robustness and consistency.31,54 The eBass threshold provides a reliable cut-off to binarize voxels in the 3D brain space into a point process.55 The step-two inference (ie, permutation tests) is also sensitive to the noise level of the point process. When the sensitivity level is low (a stringent threshold), true positive points are unlikely to be spatially adjacent and form a nontrivial cluster, reducing the ability to detect significant clusters. When a large proportion of false positive points is present in the point process, the false positive points tend to be spatially connected due to the spatial smoothness of neuroimaging data, leading to clusterwise false positive findings. For these reasons, it is often challenging to produce replicable findings in neuroimaging studies.3

Our simulation and data example results concur with previous findings that the empirical primary threshold (p < .001) is a good option, especially when no information from the data is available. In general, the primary threshold p < .001 can adequately control false positive findings, analogous to the traditional cut-off of p < .05 in univariate statistical inference.13 In practice, we find that the data-driven eBass threshold often varies around p < .001. Nevertheless, the eBass primary threshold is objectively selected based on the data and can thus improve sensitivity in many scenarios (eg, datasets with smaller sample sizes and small to medium ESs). Therefore, we consider the eBass primary threshold a good complement to the existing methods for clusterwise inference.

We also note that eBass is built on the estimation of a two-component mixture model. When the empirical Bayes approach cannot estimate the two components well, we resort to the p < .001 primary threshold for clusterwise inference or to TFCE as potential solutions.

The eBass method is compatible with all voxel-level statistical inference methods because the marginal distribution of the test statistics is often robust.5 More accurate voxel-level statistical inference leads to more separable null and nonnull distributions and thus more accurate clusterwise inference results via the eBass primary threshold.

In summary, eBass provides a data-driven and automatically optimized primary threshold for the two-step clusterwise fMRI inference. Since the computation is efficient, eBass can be conveniently implemented and is compatible with most existing software platforms.

APPENDIX

APPENDIX A. COMPARISON BETWEEN CLUSTER-EXTENT METHODS AND VOXELWISE INFERENCE

Although we focus on the two-step clusterwise inference in the simulation study, we further compare the cluster-extent methods with TFCE, one of the most popular voxel-extent inference methods. TFCE generates voxel-level corrected p-values instead of considering clusters as a whole, so the clusterwise FWER is not applicable to this method. As in clusterwise inference, we assume all voxels in suprathreshold clusters to be significant, and we compare the voxelwise sensitivity and FDR of the primary thresholding step with those of TFCE.

The 2D image has the same dimension, with V = 100 × 100 = 10 000 voxels. The truth consists of four identical square areas placed in the center of the image with equal spacing (roughly the same as the side of a square). Unlike in the simulation section, we apply the smoothing step after adding the true signals to the original image. In this way, the underlying truth has an irregular shape at the margin, and the signal strength decreases steadily from the center to the margin. The total number of truly significant voxels is 900. We perform the simulation study on the set of images with ES = 0.6, smoothed with FWHM = 8 mm, and 30 subjects per group.

TABLE A1.

Performance comparison between cluster-extent methods and voxelwise thresholding method

eBass BH-FDR p < 0.001 p < 0.01 TFCE
Primary threshold 0.003 ± 0.0029 0.0063 ± 0.0001 0.001 0.01 NA
Sensitivity 0.9996 ± 0.0007 0.9998 ± 0.0004 0.9991 ± 0.001 0.9998 ± 0.0004 1
vFDR 0.2469 ± 0.025 0.2644 ± 0.0129 0.2200 ± 0.0096 0.2767 ± 0.0116 0.3602 ± 0.0142
cFWER 0 0 0 0 NA

Note: 30 subjects per arm with ES = 0.6, smoothed with Gaussian kernel FWHM = 8 mm.

Abbreviations: BH-FDR, Benjamini-Hochberg false discovery rate; cFWER, cluster-level familywise error; eBass, empirical Bayes adaptive threshold selection; TFCE, threshold-free cluster enhancement; vFDR, voxel-level false discovery rate.

We perform the two-step clusterwise inference with eBass, BH-FDR correction, p < .001, and p < .01. We also apply TFCE voxelwise inference on the images under this setting. Similarly, we evaluate the performance of methods by their voxelwise sensitivity, FDR, and clusterwise FWER. We exhibit the simulation results in Table A1.

From the example above, we find that when the significant clusters have blurred edges and are close to each other, the voxel-extent method shows a significant increase (up to 50%) in vFDR compared with the cluster-extent methods. When the ES and sample size are low to medium, the voxel-extent method (eg, TFCE) tends to outperform the cluster-extent methods in sensitivity by 5% to 30% for this type of image, while the vFDR remains about the same.

APPENDIX B. REAL DATA RESULTS

Real data example results across different sample sizes are shown in Table B1.

TABLE B1.

Comparison between eBass and the hard thresholds p < .01 and p < .001

Sample size = 92
Sample size = 46
Primary threshold Number of clusters Primary threshold Number of clusters
eBass 0.00053(68) 3:239/235/78 0.0015(137) 1:138
p < .001 0.001(118) 3:472/459/137 0.001(95) 2:102/101
p < .01 0.01(1097) 2:2518/2281 0.01(981) NA

Note: The subsample (N = 46) is randomly drawn from the original patient group. Numbers in parentheses are the corresponding cluster-size thresholds. The number of clusters is displayed as "# of clusters: cluster 1 size/cluster 2 size/…."

APPENDIX C. MILD REGULARITY CONDITIONS FOR CONSISTENCY

Consistency for empirical Bayes: all $f(z_v^{(i)})$ in the model have the same support; $\theta$ is an interior point of $\Theta$; the log-likelihood $l(\theta)$ is differentiable in $\theta$; $\hat{\theta}$ is the unique value of $\theta \in \Theta$ that solves the equation $l'(\theta) = 0$.

Hellinger consistency for the empirical Bayes estimator as the sample size tends to infinity: $\Pi_n\{f : H(f, f_0) > \varepsilon\} \to 0$ a.s. $[F_0]$ for all $\varepsilon > 0$, where $H(f, f_0) = \{\int (\sqrt{f} - \sqrt{f_0})^2\}^{1/2}$ is the Hellinger distance between $f$ and $f_0$.

Consistency for the objective function: all $h(\hat{z}_{\hat{\theta}}^{(i)})$ in the model have the same support; $z_{\hat{\theta}}$ is an interior point of $\Omega_\alpha$; the log-likelihood $l(\tilde{h}(x^{(i)}, \hat{z}_{\hat{\theta}}^{(i)}))$ is differentiable in $z_{\hat{\theta}}$; $\hat{z}_{\hat{\theta}}$ is the unique value of $z_{\hat{\theta}} \in \Omega_\alpha$ that solves the equation $l'(\tilde{h}(x^{(i)}, \hat{z}_{\hat{\theta}}^{(i)})) = 0$.

Footnotes

SOFTWARE

The implementation of eBass is published on GitHub at https://github.com/yierge/eBass.

DATA AVAILABILITY STATEMENT

Data are available upon request.

REFERENCES

  • 1.Lindquist M, Mejia A. Zen and the art of multiple comparisons. Psychosom Med. 2015;77(2):114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Smith S, Nichols T. Statistical challenges in “big data” human neuroimaging. Neuron. 2018;97(2):263–268. [DOI] [PubMed] [Google Scholar]
  • 3.Lindquist M Neuroimaging results altered by varying analysis pipelines. Nature. 2020;582(7810):36–37. [DOI] [PubMed] [Google Scholar]
  • 4.Derado G, Bowman D, Kilts C. Modeling the spatial and temporal dependence in fMRI data. Biometrics. 2010;66(3):949–957. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Chen G, Xiao Y, Taylor P, et al. Handling multiplicity in neuroimaging through Bayesian lenses with multilevel modeling. Neuroinformatics. 2019;17(4):515–545. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Nichols T Multiple testing corrections, nonparametric methods, and random field theory. NeuroImage. 2012;62(2):811–815. [DOI] [PubMed] [Google Scholar]
  • 7.Smith S, Nichols T. Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. NeuroImage. 2009;44(1):83–98. [DOI] [PubMed] [Google Scholar]
  • 8.Alberton BA, Nichols T, Gamba H, Winkler A. Multiple testing correction overcontrasts for brain imaging. NeuroImage. 2020;216:116760. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Nichols T, Hayasaka S. Controlling the familywise error rate in functional neuroimaging: a comparative review. Stat Methods Med Res. 2003;12(5):419–446. [DOI] [PubMed] [Google Scholar]
  • 10.Friston K, Worsley K, Frackowiak RS, Mazziotta J, Evans A. Assessing the significance of focal activations using their spatial extent. Hum Brain Mapp. 1994;1(3):210–220. [DOI] [PubMed] [Google Scholar]
  • 11.Woo CW, Krishnan A, Wager T. Cluster-extent based thresholding in fMRI analyses: pitfalls and recommendations. NeuroImage. 2014;91:412–419. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Hong YW, Yoo Y, Han J, Wager TD, Woo CW. False-positive neuroimaging: undisclosed flexibility in testing spatial hypotheses allows presenting anything as a replicated finding. NeuroImage. 2019;195:384–395. [DOI] [PubMed] [Google Scholar]
  • 13.Eklund A, Knutsson H, Nichols T. Cluster failure revisited: Impact of first level design and physiological noise on cluster false positive rates. Human Brain Mapping. 2019;40(7):2017–2032. 10.1002/hbm.24350. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Eklund A, Nichols T, Knutsson H. Cluster failure: why fMRI inferences for spatial extent have inflated false-positive rates. Proc Natl Acad Sci. 2016;113(28):7900–7905. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Flandin G, Friston K. Analysis of family-wise error rates in statistical parametric mapping using random field theory. Hum Brain Mapp. 2019;40(7):2052–2054. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Schwartzman A, Telschow F. Peak p-values and false discovery rate inference in neuroimaging. NeuroImage. 2019;197:402–413. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Hayasaka S, Nichols T. Validating cluster size inference: random field and permutation methods. NeuroImage. 2003;20(4):2343–2356. [DOI] [PubMed] [Google Scholar]
  • 18.Bennett C, Wolford G, Miller M. The principled control of false positives in neuroimaging. Soc Cogn Affect Neurosci. 2009;4(4):417–422. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Eklund A, Nichols T, Knutsson H. Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates. Proceedings of the National Academy of Sciences. 2016;113(28):7900–7905. 10.1073/pnas.1602413113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci. 2003;100(16):9440–9445. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Bennett CM, Miller MB, Wolford GL. Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon: an argument for multiple comparisons correction. NeuroImage. 2009;47:S125. 10.1016/s1053-8119(09)71202-9. [DOI] [Google Scholar]
  • 22.Spisák T, Spisák Z, Zunhammer M, et al. Probabilistic TFCE: a generalized combination of cluster size and voxel intensity to increase statistical power. NeuroImage. 2019;185:12–26. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Benjamini Y, Heller R. False discovery rates for spatial signals. J Am Stat Assoc. 2007;102(480):1272–1281. [Google Scholar]
  • 24.Bowman FD. Spatio-temporal modeling of localized brain activity. Biostatistics. 2005;6(4):558–575. [DOI] [PubMed] [Google Scholar]
  • 25.Risk B, Matteson D, Spreng N, Ruppert D. Spatiotemporal mixed modeling of multi-subject task fMRI via method of moments. NeuroImage. 2016;142:280–292. [DOI] [PubMed] [Google Scholar]
  • 26.Leek JT, Storey JD. A general framework for multiple testing dependence. Proc Natl Acad Sci. 2008;105(48):18718–18723. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Kulldorff M SaTScanTM user guide for version 7.0. SaTScanTM; 2006. https://www.satscan.org/cgi-bin/satscan/register.pl/SaTScan_Users_Guide.pdf?todo=process_userguide_download. Accessed August 13, 2007. [Google Scholar]
  • 28.Waller LA, Gotway CA. Applied Spatial Statistics for Public Health Data. Vol 368. New York, NY: John Wiley & Sons; 2004. [Google Scholar]
  • 29.Efron B Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge: Cambridge University Press; 2010. 10.1017/CBO9780511761362. [DOI] [Google Scholar]
  • 30.Fan J, Han X, Gu W. Estimating false discovery proportion under arbitrary covariance dependence. J Am Stat Assoc. 2012;107(499):1019–1035. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Efron B Two modeling strategies for empirical Bayes estimation. Stat Sci Rev J Inst Math Stat. 2014;29(2):285. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Petrone S, Rousseau J, Scricciolo C. Bayes and empirical Bayes: do they merge? Biometrika. 2014;101(2):285–302. [Google Scholar]
  • 33.Hand DJ, Christen P, Kirielle N. F*: an interpretable transformation of the F-measure. Mach Learn. 2021;110(3):451–456. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Soleymani R, Granger E, Fumera G. F-measure curves: A tool to visualize classifier performance under imbalance. Pattern Recognition. 2020;100:107–146. 10.1016/j.patcog.2019.107146. [DOI] [Google Scholar]
  • 35.Fujino A, Isozaki H, Suzuki J. Multi-label text categorization with model combination based on f1-score maximization. Paper presented at: Proceedings of the 3rd International Joint Conference on Natural Language Processing. Hyderabad, India; Vol. II, 2008. [Google Scholar]
  • 36.Eban E, Schain M, Mackey A, Gordon A, Rifkin R, Elidan G. Scalable learning of non-decomposable objectives. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics. 2017; PMLR. [Google Scholar]
  • 37.Parambath SP, Usunier N, Grandvalet Y. Optimizing F-measures by cost-sensitive classification. Adv Neural Inf Process Syst. 2014;27:2123–2131. [Google Scholar]
  • 38.Deely J, Lindley D. Bayes empirical bayes. J Am Stat Assoc. 1981;76(376):833–841. [Google Scholar]
  • 39.Petrone S, Rizzelli S, Rousseau J, Scricciolo C. Empirical Bayes methods in classical and Bayesian inference. Metro. 2014;72(2):201–215. [Google Scholar]
  • 40.Walker S On sufficient conditions for Bayesian consistency. Biometrika. 2003;90(2):482–488. [Google Scholar]
  • 41.Lindquist M The statistical analysis of fMRI data. Stat Sci. 2008;23(4):439–464. [Google Scholar]
  • 42.Cremers H, Wager T, Yarkoni T. The relation between statistical power and inference in fMRI. PLoS One. 2017;12(11):e0184923. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Yan CG, Wang XD, Zuo XN, Zang YF. DPABI: data processing & analysis for (resting-state) brain imaging. Neuroinformatics. 2016;14(3):339–351. [DOI] [PubMed] [Google Scholar]
  • 44.Power JD, Barnes KA, Snyder AZ, Schlaggar BL, Petersen SE. Spurious but systematic correlations in functional connectivity MRI networks arise from subject motion. NeuroImage. 2012;59(3):2142–2154. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Chlorpromazine / Chlorpromazine hydrochloride. The American Society of Health-System Pharmacists. https://www.drugs.com/monograph/chlorpromazine.html. Archived from the original on 8 December 2015. Retrieved 1 December 2015. [Google Scholar]
  • 46.Muschelli J, Sweeney E, Crainiceanu C. brainR: interactive 3 and 4D images of high resolution neuroimage data. R J. 2014;6(1):41. [PMC free article] [PubMed] [Google Scholar]
  • 47.Lancaster JL, Cykowski MD, McKay DR, et al. Anatomical global spatial normalization. Neuroinformatics. 2010;8(3):171–182. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Lancaster JL, Laird AR, Eickhoff SB, Martinez MJ, Fox PM, Fox PT. Automated regional behavioral analysis for human brain images. Front Neuroinform. 2012;6:23. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Kochunov P, Lancaster JL, Thompson P, et al. Regional spatial normalization: toward an optimal target. J Comput Assist Tomogr. 2001;25(5):805–816. [DOI] [PubMed] [Google Scholar]
  • 50.Chan NK, Kim J, Shah P, et al. Resting-state functional connectivity in treatment response and resistance in schizophrenia: a systematic review. Schizophr Res. 2019;211:10–20. [DOI] [PubMed] [Google Scholar]
  • 51.Ganella EP, Seguin C, Bartholomeusz CF, et al. Risk and resilience brain networks in treatment-resistant schizophrenia. Schizophr Res. 2018;193:284–292. [DOI] [PubMed] [Google Scholar]
  • 52.Lui S, Yao L, Xiao Y, et al. Resting-state brain function in schizophrenia and psychotic bipolar probands and their first-degree relatives. Psychol Med. 2015;45(1):97. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Alonso-Solís A, Vives-Gilabert Y, Grasa E, et al. Resting-state functional connectivity alterations in the default network of schizophrenia patients with persistent auditory verbal hallucinations. Schizophr Res. 2015;161(2-3):261–268. [DOI] [PubMed] [Google Scholar]
  • 54.Schwartzman A, Dougherty R, Lee J, Ghahremani D, Taylor JE. Empirical null and false discovery rate analysis in neuroimaging. NeuroImage. 2009;44(1):71–82. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Kang J, Johnson T, Nichols T, Wager T. Meta analysis of functional neuroimaging data via Bayesian spatial point processes. J Am Stat Assoc. 2011;106(493):124–134. [DOI] [PMC free article] [PubMed] [Google Scholar]
