Published in final edited form as: Neuroimage. 2019 Dec 15;209:116468. doi: 10.1016/j.neuroimage.2019.116468

Cluster failure or power failure? Evaluating sensitivity in cluster-level inference

Stephanie Noble a,b,*, Dustin Scheinost a,b,c,d, R Todd Constable a,b,e
PMCID: PMC8061745  NIHMSID: NIHMS1688246  PMID: 31852625

Abstract

Pioneering work in human neuroscience has relied on the ability to map brain function using task-based fMRI, but the empirical validity of these inferential methods is still being characterized. A recent landmark study by Eklund and colleagues showed that popular multiple comparison corrections based on cluster extent suffer from unexpectedly low specificity (i.e., high false positive rate). Yet that study’s focus on specificity, while important, is incomplete. The validity of a method depends also on its sensitivity (i.e., true positive rate or power), yet the sensitivity of cluster correction remains poorly understood. Here, we assessed the sensitivity of gold-standard nonparametric cluster correction by resampling real data from five tasks in the Human Connectome Project and comparing results with those from the full “ground truth” datasets (n=480-493). Critically, we found that sensitivity after correction is lower than may be practical for many fMRI applications. In particular, sensitivity to medium-sized effects (|Cohen’s d|=0.5) was less than 20% across tasks on average, about three times smaller than without any correction. Furthermore, cluster extent correction exhibited a spatial bias in sensitivity that was independent of effect size. In comparison, correction based on the Threshold-Free Cluster Enhancement (TFCE) statistic approximately doubled sensitivity across tasks but increased spatial bias. These results suggest that we have, until now, only measured the tip of the iceberg in the activation-mapping literature due to our goal of limiting the familywise error rate through cluster extent-based inference. There is a need to revise our practices to improve sensitivity; we therefore conclude with a list of modern strategies to boost sensitivity while maintaining respectable specificity in future investigations.

Keywords: fMRI, Activation, Power, Sensitivity, Empirical, Resampling, HCP

1. Introduction

Human neuroscience has been enriched by analytical methods used to map the brain using functional magnetic resonance imaging (fMRI). One of the most common inferential techniques in fMRI involves estimating which areas show activity corresponding with tasks or stimuli and relating this to underlying cognitive or behavioral constructs. Activation-mapping analyses continue to be a burgeoning field of research, highlighted by several large-scale efforts that pool task-based data such as the Human Connectome Project (Van Essen et al., 2013), NeuroVault (Gorgolewski et al., 2015), BrainMap (Laird et al., 2011), and Neurosynth (Yarkoni et al., 2011). Paramount to the success of activation-based human brain mapping is the continued evaluation and improvement of our methods (Logothetis, 2008; Nichols et al., 2017; Poldrack et al., 2017). Specifically, the validity of our methods depends on their sensitivity (i.e., ability to detect true positives, statistical power) and specificity (i.e., ability to avoid false positives)1.

Recent work has raised concerns regarding the specificity of common activation-mapping methods (Eklund et al., 2016). For the past decade, the field has appreciated the need to correct for multiple comparisons across the large space of the image (Bennett et al., 2009), which can span 100,000 voxels. Standard corrections that do not account for the spatial structure or dependence between voxels (e.g., Bonferroni) are overconservative; to improve sensitivity, parametric corrections were developed to take advantage of these features of the data (cf. Friston et al., 1996; Nichols and Hayasaka, 2003; Lindquist and Mejia, 2015). These estimate a corrected significance threshold that should correspond with a desired familywise error rate (FWER; probability of at least one false positive in a family of tests) for clusters of active voxels2. Posing a major challenge to the field, Eklund and colleagues revealed that the empirical FWER of these popular procedures was often far larger than expected. They demonstrated the merit of nonparametric cluster correction as an attractive alternative that more robustly controls the FWER. This work has not only played a major role in improving our understanding of the limitations of common activation-mapping methods, but also, importantly, provided immediately implementable guidelines for remediation. While that groundbreaking study was significant in drawing much-needed attention to the specificity of common methods, the story remains incomplete. It is also necessary to examine the complementary facet of sensitivity.

Concern about the failure to detect effects is on the rise in neuroscience in general (Button et al., 2013) and fMRI activation-mapping in particular (Cremers et al., 2017; Lohmann et al., 2018; Bansal and Peterson, 2018). After all, investigations are frequently designed to detect an effect of interest. Recent work highlighted this issue even before any correction is applied; uncorrected sensitivity was already found to be low in typical empirical data (Cremers et al., 2017) such that no reasonable sample size is expected to recover small effects (Geuter et al., 2018). It is clear that sensitivity after any multiple comparison correction should then be much lower. Indeed, recent studies demonstrated compelling reasons to expect low sensitivity after cluster-based FWER correction: results from typical task-based data exhibit low reproducibility (Lohmann et al., 2018), and simulated power analyses suggest low sensitivity (Lohmann et al., 2018; Bansal and Peterson, 2018). However, neither study estimated sensitivity empirically. Simulations enable understanding of a phenomenon in a controlled environment, but it can be challenging to identify under which conditions simulations generalize to the complexities of real data. Thus, empirically benchmarking sensitivity can be a critical step in completing our understanding of the accuracy of these techniques. Furthermore, it is necessary to understand how sensitivity changes across effect sizes, since the two are linked.

Here, we benchmark the sensitivity of gold-standard multiple comparison correction in fMRI by leveraging one of the largest task-based fMRI datasets ever collected—the Human Connectome Project dataset (Van Essen et al., 2013). Although it has traditionally been challenging to identify realistic ground truth effects, recent studies have proposed a useful approximation using this state-of-the-art dataset (Cremers et al., 2017; Lohmann et al., 2018; Geuter et al., 2018). Extending the procedures of Cremers et al. (2017), we estimated sensitivity of nonparametric cluster correction by comparing results from resampled data to those from the full “ground truth” dataset in five tasks. We additionally investigated how results change with a threshold-free approach expected to improve sensitivity, threshold-free cluster enhancement (TFCE; Smith and Nichols, 2009). Finally, we examined whether there is a spatial bias in sensitivity independent of effect size. Now that we have the tools to validate our methods empirically, it is essential to understand sensitivity in fMRI activation mapping to appreciate what we may be missing.

2. Methods

The present study estimated sensitivity by replicating and extending the methods described in Cremers et al. (2017) and inspired by the investigation of nonparametric corrections in Eklund et al. (2016). In the work of Cremers et al. (2017), a group-level analysis of the TOM-RAND contrast (n=485) was conducted for 16 random samples of 15 subjects from the openly available Human Connectome Project S1200 release (Van Essen et al., 2013). The uncorrected voxel-level statistic maps were then compared to group results from the full n=484 sample as a measure of ground truth. The following sections describe the procedures used here to estimate and interpret sensitivity.

We have made our best effort to ensure that the present manuscript is compliant with the Committee on Best Practice in Data Analysis and Sharing (COBIDAS) guidelines; to the best of our knowledge, the reporting in the present manuscript is consistent with all “mandatory” recommendations (Supplementary Methods: Statement of compliance with COBIDAS guidelines).

2.1. Data description

Data was obtained from the Human Connectome Project S1200 release (Van Essen et al., 2013). Five tasks were specifically selected from Barch et al. (2013) to include a range of effects from local to diffuse and from weak to strong; additionally, this allowed us to examine results across task contexts. These task contrasts included: the Theory of Mind versus Random contrast from the Social task (SOCIAL COPE 6: TOM-RANDOM; N=484), which showed strong and widespread effects; the Face versus Other contrast from the Working Memory task (WM COPE 20: FACE-AVG; N=493), which showed a range of localized effects from strong to weak; the Relational versus Match contrast from the Relational task (RELATIONAL COPE 4: REL-MATCH; N=480), which also showed localized but weak effects; the Reward versus Punishment contrast from the Gambling task (GAMBLING COPE 6: REWARD-PUNISH; N=491), which showed weak and widespread effects, appearing the most unstructured spatially; and the Faces versus Shapes contrast from the Emotion task (EMOTION COPE 3: FACES-SHAPES; N=482), which showed widespread positive effects (cf. https://wiki.humanconnectome.org/display/PublicData/Task+fMRI+Contrasts). For each selected task, all available volume-based data was included (i.e., no exclusion criteria were employed). The present use of publicly available, de-identified data from the Human Connectome Project and sharing of analysis results has been reviewed and designated as exempt (Exemption 4) by the Yale University Institutional Review Board.

2.2. Preprocessing and first-level analysis

Volume-based preprocessing and first-level analyses were previously completed as described in Barch et al. (2013) and Glasser et al. (2013). In brief, this included “gradient unwarping, motion correction, fieldmap-based EPI distortion correction, brain-boundary-based registration of EPI to structural T1-weighted scan, non-linear (FNIRT) registration into MNI152 space, and grand-mean intensity normalization,” spatial smoothing with an “unconstrained 3D Gaussian kernel of FWHM=4mm,” computation of activity estimates from the general linear model (including corrections and confound modeling), then temporal filtration and prewhitening (Barch et al., 2013). First-level Contrast of Parameter Estimate (COPE) results were accessed from the Human Connectome Project bucket for the next steps. COPE images were in MNI152 2mm space.

2.3. Estimation of ground truth effect sizes

To estimate the ground truth effect size at each voxel in the full brain, a group-level contrast was performed for the full sample of each task (N=480-493), as in Cremers et al. (2017). One-sample t-statistics were obtained for each voxel from the raw (unpermuted) test statistic outputs of FSL's Randomise tool (Winkler et al., 2014). T-statistics were then converted to Cohen's d coefficients, the measure of effect size used here. The conversion for a one-sample t-statistic is d_s = t / √N, where d_s is the Cohen's d coefficient for the sample, t is the t-test statistic, and N is the sample size (from eqn. 2.5.9 of Cohen, 1988, pp. 72; see footnote 3). Tools from FSL (v5.0.10; https://fsl.fmrib.ox.ac.uk/fsl/; Jenkinson et al., 2012) and AFNI (v17.3.02; https://afni.nimh.nih.gov/; Cox, 1996) were used for other basic image manipulation operations involved in calculating effect sizes. Histograms of effect sizes across the voxels of the true positive rate map were created with the Matlab histogram function (75 bins between d = −2.5 and 2.5).
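For concreteness, a minimal R sketch of this conversion is given below. The function name and example values are purely illustrative; t_map is assumed to be a numeric vector of voxelwise one-sample t-statistics loaded from a Randomise output image.

    # t-to-d conversion for a one-sample test: d_s = t / sqrt(N)
    t_to_d <- function(t_map, n_subjects) {
      t_map / sqrt(n_subjects)
    }

    t_to_d(3.5, 484)   # a t of 3.5 with N = 484 subjects gives d of roughly 0.16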

Here we note two conceptual points about this measure of ground truth. First, while it is important to remember that the accuracy and generalizability of this estimate is limited (e.g., the extent to which this dataset is representative of the overarching population of interest remains to be determined), this data represents some of the largest samples collected for task-based fMRI and therefore our best empirical estimate of the ground truth. Second, this ground truth procedure results in a non-zero estimated effect due to task at each voxel, yet some effects may be smaller in magnitude than the error of the estimator. Thus, caution is particularly recommended in the interpretation of very small effects. In response to the above points, the reader may find it useful to interpret the calculated effect sizes as known characteristics pertaining only to a fixed empirical distribution (i.e., the data used here) rather than estimated characteristics of an unknown distribution, while bearing in mind that generalizability of these effect sizes beyond this data remains to be determined.

2.4. Resampling procedure

Data was resampled (i.e., repeatedly subsampled) over R=500 repetitions. For each repetition, n=20 subjects were randomly selected without replacement from the full dataset of N subjects for group-level analysis. Next, one of two procedures was used to perform one-sample, one-sided cluster-based inference with nonparametric correction for multiple comparisons, resulting in a map of voxels that survived correction (see 2.4.1. Nonparametric cluster-based inference). All surviving voxels were added to a cumulative summary kept across repetitions. This procedure was then repeated with the opposite tail for the one-sided test, so that surviving voxels from the positive and negative tails were accumulated in two separate cumulative result maps.
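The bookkeeping of this resampling loop can be sketched in R as follows. The helper run_corrected_test() is hypothetical: it stands in for the FSL Randomise call described in section 2.4.1 and is assumed to return, for one subsample and one tail, a 0/1 vector over voxels marking those that survived FWER correction.

    N_total  <- 484      # full sample size (e.g., Social task)
    n_sub    <- 20       # subjects per subsample
    R_reps   <- 500      # repetitions
    n_voxels <- 228483   # non-zero voxels in the 2 mm MNI152 mask

    cum_pos <- numeric(n_voxels)   # positive-tail survival counts per voxel
    cum_neg <- numeric(n_voxels)   # negative-tail survival counts per voxel

    for (r in seq_len(R_reps)) {
      subsample <- sample(N_total, n_sub, replace = FALSE)  # draw subjects without replacement
      # hypothetical wrapper around the nonparametric cluster-corrected inference
      cum_pos <- cum_pos + run_corrected_test(subsample, tail = "positive")
      cum_neg <- cum_neg + run_corrected_test(subsample, tail = "negative")
    }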

2.4.1. Nonparametric cluster-based inference

Inferences were formed on the basis of two different mass univariate "cluster-based" statistics: 1) Cluster Extent, and 2) the Threshold-Free Cluster Enhancement (TFCE) statistic (Smith and Nichols, 2009). Each cluster-based statistic is derived from "voxel-based" statistics (e.g., a z-statistic image based on separate calculations at each voxel). Cluster extent is calculated after a voxel-based image has been thresholded with a "cluster-determining threshold" (here, z>3.1, corresponding with p<0.001), and is defined for each group of contiguous suprathreshold voxels (i.e., cluster) as the number of voxels in that cluster. In contrast, the TFCE statistic is calculated for each voxel of an unthresholded voxel-based statistic image. The TFCE statistic at a voxel x is defined as:

TFCE(x) = \int_{h=h_0}^{h_x} e(h)^E \, h^H \, dh

where h is the magnitude (i.e., height) of the voxel-based statistic (h_0 is typically zero and h_x is the height at voxel x), e(h) is the extent of the cluster of contiguous voxels with height ≥ h that contains x, and E and H are constant weighting parameters for e and h. The values of E and H were previously determined empirically in Smith and Nichols (2009) and form the defaults in FSL; these defaults were used here. As the name implies, no cluster-determining threshold is needed a priori. Simply put, the TFCE statistic is akin to a measure of cluster mass at a voxel, integrating the extent-weighted heights of its supporting cluster from zero up to that voxel's own height and thereby omitting the tops of nearby peaks.
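To make the integral concrete, below is a toy one-dimensional discretization in R, assuming the FSL defaults E = 0.5 and H = 2 and a step size dh. Real TFCE is computed by Randomise on 3D images; this sketch only conveys how a voxel accumulates support from the cluster beneath it.

    tfce_1d <- function(stat, E = 0.5, H = 2, dh = 0.1) {
      out <- numeric(length(stat))
      for (h in seq(dh, max(stat), by = dh)) {
        runs   <- rle(stat >= h)              # contiguous suprathreshold runs = 1D "clusters"
        ends   <- cumsum(runs$lengths)
        starts <- ends - runs$lengths + 1
        for (i in which(runs$values)) {
          idx <- starts[i]:ends[i]
          out[idx] <- out[idx] + length(idx)^E * h^H * dh   # e(h)^E * h^H * dh
        }
      }
      out
    }

    # The centre of a broad bump peaking at 1.7 accrues more TFCE support
    # than an isolated, slightly taller spike of 2.0.
    tfce_1d(c(0, 1.5, 1.6, 1.7, 1.6, 1.5, 0, 2.0, 0))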

The Randomise tool in FSL was used to perform nonparametric correction for each cluster-based statistic (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/Randomise). Randomise uses the Freedman-Lane procedure (Winkler et al., 2014) to estimate a test statistic threshold corresponding with a specified FWER through permutation of the original data (sign-flipping is specifically used in the case of a one-sample test, specified with the -1 argument, but can be placed under the umbrella term 'permutation testing'; e.g., Nichols and Holmes, 2001). In brief, the nonparametric one-sample test for cluster-based inference proceeds as follows. For each of K=1000 permutations:

  1. each subject’s data is randomly “sign-flipped” (multiplied by +1 or −1),

  2. voxel-based statistics are calculated,

  3. cluster-based statistics (i.e., cluster extent or TFCE statistics) are calculated from voxel-based statistics, then

  4. the maximum cluster-based statistic across the image (i.e., maximum cluster extent or maximum TFCE statistic) is recorded.

The K maximum cluster-based statistics (one per permutation) form the null distribution used to estimate the threshold corresponding with a target FWER (here, FWER=α=5%), by finding the cluster-based statistic value delineating the top α percent of values in the distribution. This threshold can be applied to the original cluster-based statistic image, from which a map of surviving voxels (i.e., voxels of any suprathreshold clusters obtained through cluster extent-based inference, or voxels with suprathreshold TFCE statistics) can be obtained. This procedure has been shown to offer weak but exact (on average) control of the desired FWER level; for K=1000 permutations, the per-repetition 95% confidence interval is FWER = 0.0500 ± 0.0138 (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/Randomise/Theory).
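As a rough illustration of steps 1-4 and the max-statistic threshold, the R sketch below implements sign-flipping FWER control at the voxel level; the image-wide maximum t-statistic is used in place of maximum cluster extent or maximum TFCE purely for brevity, and the actual analyses used FSL Randomise rather than this code.

    # 'data' is an n_subjects x n_voxels matrix of first-level contrast estimates.
    max_stat_threshold <- function(data, K = 1000, alpha = 0.05) {
      n <- nrow(data)
      one_sample_t <- function(x) colMeans(x) / (apply(x, 2, sd) / sqrt(n))
      maxima <- replicate(K, {
        signs <- sample(c(-1, 1), n, replace = TRUE)  # step 1: random sign flip per subject
        max(one_sample_t(signs * data))               # steps 2-4: voxel stats, then image-wide maximum
      })
      quantile(maxima, 1 - alpha)                     # threshold controlling FWER at alpha
    }

    # Toy use: 12 subjects, 50 voxels of pure noise.
    set.seed(1)
    thr <- max_stat_threshold(matrix(rnorm(12 * 50), nrow = 12))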

Nonparametric correction based on cluster extent was labelled Standard+NP and correction based on TFCE was labelled TFCE+NP.

2.5. True positive rate calculation

For each voxel x, the true positive rate was calculated in Matlab as follows. The sign of the ground truth effect at x was determined. The corresponding positive- or negative-tail cumulative map was selected (this is equivalent to selecting the tail of the one-sided test a priori based on the ground truth effect sign at x). The value of the selected cumulative map at x, reflecting the number of times x survived thresholding during resampling, was interpreted as the number of "true positives" at x (V_x,1|1). The true positive rate at x (TPR_x) was then calculated as V_x,1|1 divided by the total number of tests (V_x,·|1 = R).
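In code form, this calculation reduces to a small function. A hedged R sketch under the notation above (d_true, cum_pos, and cum_neg are voxelwise vectors produced by the earlier steps) is:

    # For each voxel, count survivals from the tail matching the sign of the ground
    # truth effect, then divide by the number of repetitions (V_x,·|1 = R).
    true_positive_rate <- function(d_true, cum_pos, cum_neg, R_reps = 500) {
      survived <- ifelse(d_true >= 0, cum_pos, cum_neg)
      survived / R_reps
    }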

A couple conceptual points will be discussed here. First, it is important to note that calculation of voxel-level true positive rate does not contradict principles of cluster-based inference. It is true that rejecting the null of no activation for either cluster-based statistic does not actually imply that the null is rejected at any voxel in particular. For inferences based on cluster extent, rejecting the null for a cluster only implies that the null is rejected for an unspecified subset of voxels (one or more) in that cluster. For inferences based on TFCE, rejecting the null at a voxel implies that the null is rejected for some cluster containing that voxel, but then, as for cluster extent, this only implies that the null is rejected at an unspecified subset of voxels in that cluster (Smith and Nichols, 2009). However, this does not preclude the calculation of true positives at the individual voxel level. If the alternative hypothesis is actually true at a voxel x, a true positive occurs when the null is rejected for any cluster containing x (cf. Woo et al., 2014). Others have interpreted true positives similarly (cf. Smith and Nichols, 2009; Woo et al., 2014). Second, as noted above (see 2.3. Estimation of ground truth effect sizes), it is important to remember the limitations in the accuracy and generalizability of the ground truth estimation procedure when interpreting true positive results. Notably, any detected effect sharing the same sign as the ground truth is labelled a true positive, but caution is particularly recommended in interpreting signs of ground truth effects that are close to zero. As also noted above, the reader may find it useful to interpret the results as emerging from a fixed empirical distribution, in which case the alternative hypothesis is expected to be true (i.e., non-zero) at every voxel in the data used here and its sign and magnitude known; this interpretation requires the reader to bear in mind that generalizability of these results beyond this data remains to be determined.

A summary of the study procedures resulting in voxel-wise true positive rate is given in Figure 1.

Figure 1. Summary of methodology.


Sensitivity was estimated using five tasks from the Human Connectome Project dataset (Social, Working Memory, Gambling, Relational, and Emotion; n=480-493). In brief, group activations obtained from resampled data were compared to “ground truth” activations obtained from the full dataset.

2.6. Relationship between effect size and true positive rate

The relationship between true positive rate and effect size was modeled by fitting a cubic smoothing spline (csaps function in MATLAB) to a sliding window average of the data (window size=0.01, 50% overlap). A spline was chosen since the shape of the relationship between effect size and true positive rate for this data was difficult to determine a priori. This model can be thought of as representing a smoothed average true positive rate for each effect size. The residuals of this fit were also examined to explore the spatial distribution of true positive rate while controlling for effect size. Conceptually, the residual for a given voxel represents how much its actual true positive rate deviates from the expected true positive rate for an effect of its size. Residuals for each task were also averaged across tasks to summarize results.
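A rough R analogue of this model is sketched below, using smooth.spline as a stand-in for MATLAB's csaps and simple fixed-width binning in place of the 50%-overlap sliding window; d_true and tpr are the voxelwise effect size and true positive rate vectors from the preceding steps.

    fit_tpr_vs_effect <- function(d_true, tpr, binwidth = 0.01) {
      bins <- round(d_true / binwidth) * binwidth        # bin centres along the effect size axis
      fit  <- smooth.spline(tapply(d_true, bins, mean),  # smoothed average TPR per effect size
                            tapply(tpr,    bins, mean))
      list(fit = fit,
           residuals = tpr - predict(fit, d_true)$y)     # observed minus expected TPR per voxel
    }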

2.7. Comparison with other correction methods

To contextualize results, power analyses were performed to estimate how much the true positive rate (i.e., power) would change in the case of either no correction (more liberal) or Bonferroni correction (more conservative). Comparison with the no correction case illustrates how much sensitivity would be gained by performing a single region-of-interest analysis, as one might in testing a specific hypothesis on a region, or in the case where tests are performed at multiple voxels/regions but no correction is used. Comparison with Bonferroni correction, which also controls FWER, illustrates how much sensitivity would be reduced by not taking advantage of the spatial structure or dependence in the data.

Power analyses were conducted with the pwr.t.test function from the R package pwr (Champely, 2018). The following variables were specified to calculate power: number of subjects (n=20), effect size (d=0.8), t-test design (type="one.sample"), t-test alternative hypothesis (alternative="greater"), and significance level (sig.level, specified as follows). For the uncorrected case, the significance level is set to p=0.05. For the Bonferroni-corrected case, the significance level for each test (P_q) is calculated as P_q = α / n_tests, where α is the desired familywise error rate (α=FWER=0.05) and n_tests is the number of tests. For reference, the maximum number of tests was calculated as the number of non-zero voxels in the template brain. The template used in this study was the skull-stripped 2mm MNI152 brain (MNI152_T1_2mm_brain_mask.nii.gz in FSL; n=228,483 voxels). The differences in true positive rate obtained with cluster correction versus no correction and versus Bonferroni correction were also presented.
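The two reference calculations can be reproduced directly with the pwr package using the parameters listed above; the only additional value is the voxel count taken from the MNI152 mask described in the text.

    library(pwr)

    n_tests <- 228483   # non-zero voxels in the skull-stripped 2 mm MNI152 mask
    # no correction
    pwr.t.test(n = 20, d = 0.8, sig.level = 0.05,
               type = "one.sample", alternative = "greater")
    # Bonferroni correction over all voxels
    pwr.t.test(n = 20, d = 0.8, sig.level = 0.05 / n_tests,
               type = "one.sample", alternative = "greater")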

2.8. Summary of methodology decisions

A summary of the data and processing choices is given in Table 1. These mainly form a subset of the analyses performed in the original study by Cremers et al. (2017), with changes and additions according to the following rationale. Twenty subjects were sampled instead of 15, to facilitate comparisons with Eklund et al. (2016; 10, 20, or 40 subjects used in that study) and to reflect a more moderate group size. 500 repetitions were run instead of 16 in order to provide more stable parameter estimates (note that repetitions in Cremers et al., 2017, were designed for illustration, not estimation). Inference was performed on cluster-based statistics (cluster extent, TFCE) rather than voxel-based statistics, to conform with the central purpose of this study. Finally, five task contrasts were selected instead of one to allow us to examine results across different distributions of effects and task contexts (see 2.1. Data description).

Table 1.

Summary of methodology decisions for comparison with Cremers et al. (2017).

Category | Selection(s)
fMRI data | Human Connectome Project S1200 release (n=1200 subjects)
No. of subjects per sample | n=20 *
Activity paradigm | SOCIAL (COPE 6: TOM-RANDOM; n=484); WORKING MEMORY (COPE 20: FACE-AVG; n=493) +; GAMBLING (COPE 6: REWARD-PUNISH; n=491) +; RELATIONAL (COPE 4: REL-MATCH; n=480) +; EMOTION (COPE 3: FACES-SHAPES; n=482) +
Repetitions | 500 *
Analysis type | One-sample permutation test (group activation)
Inference level | Cluster *
Cluster-based statistic | Cluster extent +; TFCE +
Multiple comparison correction method | FSL Randomise (1000 permutations) +
Cluster-determining threshold | p=0.001 (z=3.1)

Additions to the procedures of Cremers et al. (2017) marked with a plus (+).

Changes marked with an asterisk (*).

2.9. Resource availability and technical details

Scripts for performing the ground truth calculation and resampling are available at https://github.com/SNeuroble/cluster_power_failure. The Randomise-based procedure was built using the openly available scripts of Eklund et al. (2016) as a reference (https://github.com/wanderine/ParametricMultisubjectfMRI), and the nonparametric correction for cluster extent is expected to be equivalent to their “FSL Rand” procedure. Scripts were written to run repetitions in parallel. Analyses were run via Amazon Web Services (https://aws.amazon.com/) on a 16-virtual CPU c5.4xlarge Red Hat Enterprise Linux (RHEL) 7.4 instance (3.10.0-693.5.2.el7.x86_64) which employs “Custom 2nd generation Intel Xeon Scalable Processors (Cascade Lake) with a sustained all core Turbo frequency of 3.6GHz and single core turbo frequency of up to 3.9GHz or 1st generation Intel Xeon Platinum 8000 series (Skylake-SP) processor with a sustained all core Turbo frequency of up to 3.4GHz, and single core turbo frequency of up to 3.5 GHz” (https://aws.amazon.com/ec2/instance-types/). Run time for a single task with 16 simultaneous jobs was approximately 1 hour per R=500 repetitions for Standard+NP and 3 hours per R=500 repetitions for TFCE+NP. For comparison, sequential run time is estimated to be approximately 16 hours per R=500 repetitions for Standard+NP and 48 hours for TFCE+NP.

Interactive spatial plots, or "Live Figures", were created for further exploration using BioImage Suite Web (BISWeb; https://bioimagesuiteweb.github.io/webapp/). Live Figures are available for exploration at https://github.com/SNeuroble/power_cluster_failure/tree/master/hcpTask/liveFigures. BISWeb is a GUI-based application that runs in any modern web browser without any installation necessary. After navigating to the website, a Live Figure can be dragged and dropped into BISWeb for exploration.

3. Results

3.1. Ground truth effects are small-to-moderate and widespread

The distributions of ground truth effect sizes for each task are shown in Fig 2. As expected based on how tasks were chosen (see Methods), each task showed a distinct spatial distribution of effects (Fig. 2a). For example, many of the supplementary visual areas that showed large positive effects during the Social task showed large negative effects during the Working Memory task, and the Working Memory task uniquely resulted in large negative effects in primary visual cortex. Some areas of positive effects were similar during the Emotion and Gambling tasks, but effects were less smooth in the latter task. Notably, ground truth maps frequently formed clusters that spanned the brain and only infrequently formed localized clusters of activity.

Figure 2. Ground truth effect sizes across tasks.


Rows show results from each task: Social, Working Memory, Gambling, Relational, and Emotion. A, Maps showing the spatial distribution of effects, where intensity indicates effect size (d). Made in BioImage Suite Web. B, Histogram depicting the distribution of effect sizes across voxels in the image, reported as fractions of the full image. The yellow interval in the histogram highlights effects below medium (|d|<0.5). The percentages of effects below large, as well as between small and medium (0.2<|d|<0.5), are indicated.

The proportions of effects of different sizes are summarized in Fig. 2b. For each task, the distribution of effect sizes was roughly normal. Overall, the majority of effect sizes (86–99%) were below medium (|d|<0.5). However, effect sizes also varied across tasks. The Social task resulted in the highest-magnitude effects of either sign. The Working Memory task resulted in effects that were uniquely heavy tailed in the negative direction; in contrast, the Gambling task resulted in the fewest negative effects, with effects being, in fact, centered slightly above zero. The Relational task resulted in effects that were slightly negatively skewed; in contrast, the Emotion task resulted in effects that were slightly positively skewed. Preliminary analyses based on cluster extent resulted in clusters that extended across large areas of the brain.

3.2. Low sensitivity of standard nonparametric correction based on cluster extent

The sensitivity of nonparametric correction approaches to task effects is shown in Fig. 3a. As expected, the true positive rate increased with effect size, and the shape of this relationship for both the positive and negative arms was roughly sigmoidal. Yet for medium-sized effects (|d|=0.5), Standard+NP resulted in a small true positive rate across tasks (mean TPR = 18% ± 3% SEM). For comparison, the expected true positive rate for an uncorrected map is more than three times larger (TPR_uncorrected=70%), and that of a Bonferroni-corrected map is several orders of magnitude smaller (TPR_Bonferroni=0.03%).

Figure 3. True positive rates by effect size.


Each graph depicts effect size (x-axis) versus true positive rate (left y-axis, teal). Two grey dotted lines representing true positive rates after no correction or Bonferroni correction are provided in each graph for reference. A histogram trace representing the distribution of effect sizes from Fig. 2 is also provided (right y-axis, gold). The yellow interval highlights effects below medium (|d|<0.5), and true positive rates within an interval surrounding the medium effect size (|d|=0.5) are indicated. Rows depict results from each task. A, Results from Standard+NP. B, Results from TFCE+NP.

3.3. Better sensitivity with TFCE

On the other hand, TFCE+NP nearly doubled sensitivity to medium-sized effects (Fig. 3b). For medium-sized effects, TFCE+NP resulted in a moderate true positive rate (mean TPR=30% ± 18% SEM). Although TFCE improved sensitivity to several orders of magnitude above that of Bonferroni correction (see above), the expected true positive rate for an uncorrected map remains more than twice as large (see above).

In addition to improving sensitivity relative to Standard+NP, using TFCE+NP also changed the shape of the relationship between effect size and sensitivity (Supplementary Fig. 1). Notably, this favored more positive relative to negative effects for the Gambling and Emotion tasks, whereas the opposite was true for the Relational and Working Memory tasks. This seems to correspond with the sign associated with the majority of effect sizes: the Gambling task distribution of effect sizes is centered on the positive side; the Emotion task distribution has a heavier positive tail; the Working Memory task distribution has a heavier negative tail; and the Relational task distribution decays more slowly in the negative direction. Consistent with this, the Social task, which has more balanced effect sizes in both directions (i.e., slightly more centered in the negative direction but also slightly heavier tailed in the positive direction), shows similar increases in sensitivity in both directions.

3.4. Spatial bias in detecting effects

To explore whether there was a spatial bias towards detecting effects in specific areas independent of effect size, we examined the mean residuals across tasks of the fitted curves shown in Fig. 3. For each voxel, the residual represents how much the observed true positive rate deviates from the expected true positive rate for an effect of its size. Exploratory analysis suggests that while the majority of the variance in true positive rate is explained by these models, a very small (|r|<0.08, p<0.05) correlation between residuals and effect size remained. We do not expect this very small association to substantially change the following interpretation of these residuals.

The spatial distribution of these residuals is shown in Fig. 4 (see Supplementary Fig. 2 for residual scatterplots for each task). On average across all tasks and correction techniques, effects were more likely to be detected in cortical areas such as visual and supplementary visual areas extending to posterior cingulate cortex, and less likely to be detected in CSF-adjacent and frontal areas. This bias roughly corresponds with areas exhibiting increased numbers of false positives and increased smoothness in Eklund et al. (2016). Although tasks roughly shared a posterior spatial bias in detecting effects, there were marked differences across tasks. For example, the Social and Emotion tasks showed elevated rates of detection in the orbitofrontal cortex. In addition, the residuals appeared least spatially structured for the Relational task.

Figure 4. Spatial distribution of true positive rate, independent of effect size.


Residuals were calculated from the fit to the relationship between effect size and true positive rate. Thus, the residual for a given voxel represents how much its actual true positive rate deviates from the expected true positive rate for an effect of its size. Graphs depict the voxelwise residuals for each task and, bottom, the mean of all residuals across tasks. Warm colors (red to yellow) indicate positive residuals while cool colors (dark blue to magenta) indicate negative residuals. A, Standard+NP. B, TFCE+NP. Made in BioImage Suite Web.

The spatial bias was greater in magnitude for results from TFCE+NP than for those from Standard+NP. This corresponds with an increased standard deviation across residuals for TFCE+NP compared with Standard+NP (Supplementary Fig. 1), indicating that more voxels deviated from the fitted relationship with effect size. Additionally, voxels that exhibited a small positive bias under Standard+NP exhibited a small negative bias under TFCE+NP. Examining results across correction approaches within each task, few major differences were found in the spatial distribution of residuals besides the shared differences reported above.

4. Discussion

Here, we conducted an empirical investigation into the sensitivity of nonparametric cluster correction, critically extending previous work (Eklund et al., 2016; Cremers et al., 2017; Lohmann et al., 2018). We examined sensitivity in the context of two cluster-based statistics, including a threshold-free statistic, and five tasks in the Human Connectome Project dataset. Overall, nonparametric cluster correction was found to exhibit low sensitivity to medium-sized effects regardless of task and correction approach. In fact, the sensitivity of the standard correction based on cluster-extent was more than three times smaller than that with no correction. Sensitivity improved across tasks when using TFCE relative to cluster extent, consistent with the results of Smith and Nichols (2009). However, TFCE also showed more spatial bias and an unexpected shift in the relationship between effect size and sensitivity. The inability to detect medium-sized effects is meaningful, since most ground truth effect sizes were found to be medium or below. This means that the majority of effects will be detected even less frequently than the medium-sized effects that are mainly reported here. While recent work by Eklund et al. (2016) showed that nonparametric cluster correction more robustly controls the FWER than conventional fMRI cluster correction approaches, the present results suggest that this gold standard approach too has limitations. In fact, converging evidence suggests that we may have missed numerous true effects over the years due to disproportionate focus on controlling the FWER.

4.1. Why is sensitivity so low?

The simplest answer to this question is that any multiple comparison correction necessarily reduces sensitivity, and this scales with the number of tests. Just correcting for four tests with the Bonferroni approach would reduce the expected power to detect medium-sized effects in this study by 10%. The space over which tests are being conducted is large (and may grow larger still, as technological advances allow for more detailed imaging), creating the need for a proportionally large correction. Importantly, cluster correction capitalizes on the significant spatial structure and dependence between voxels; this improves sensitivity relative to approaches that do not take advantage of these characteristics (e.g., Bonferroni correction; Friston et al., 1996). However, the correction still must be stringent enough to limit the detection of a suprathreshold cluster by chance over the large, spatially correlated, and noisy image.

The focus on controlling specificity first and sensitivity second is strategic. For a number of reasons (e.g., incentives, inertia), researchers may be inclined to accept or encourage erroneous results (McElreath and Smaldino, 2015; Ioannidis, 2005; Simmons et al., 2011). That is, they naturally try to optimize sensitivity at the cost of specificity. To combat this, many scientists have advocated for standards that reduce the rate of these false positive results. In neuroimaging, this has led to the frequentist standard of limiting the FWER to 5%. The FWER reflects how often we find even one false positive throughout the whole brain map. If 100 studies are conducted, only about 5 are permitted to yield one or more false positive cluster(s) across the whole brain. Since the space of the brain map is large (228,483 voxels in this study), a very stringent threshold is required to meet this standard. It is no surprise that sensitivity is sacrificed to achieve this goal.

4.2. The tip of the iceberg: problems caused by low sensitivity

The crucial issue arising from widespread use of low-sensitivity methods is that we endemically underestimate the presence of true effects. Although frequentist null findings should not be used to confirm the absence of an effect, in practice they often discourage further investigation. If an effect is actually present, this may result in missed opportunities to generate fruitful hypotheses in exploratory research, or in failures to replicate effects in confirmatory research, closing off otherwise healthy avenues of research. Dovetailing with this, underpowered approaches often result in biased effect size estimates, which, in turn, leads to another type of bias whereby only effects inflated enough to pass the threshold are reported (Ioannidis, 2008; Lindquist and Mejia, 2015; Yarkoni, 2009). While false negatives can waste resources by discouraging further investigation, inflated effect sizes can waste resources by generating misleading inferences and irreproducible results. Similarly, decreasing power increases the probability that detected effects are false positives in a given study for which the null may be true (Ioannidis, 2005).

In neuroimaging specifically, less sensitive techniques may have contributed to an incomplete and potentially misleading picture of what brain activity looks like. While highly localized activations are often reported, the present study suggests that activations are actually quite diffuse and often span the brain. This agrees with other recent work (Cremers et al., 2017; Geuter et al., 2018) and converges with the reproducible identification of long-distance networks via fMRI functional connectivity (De Luca et al., 2006). As such, the small, discrete regions typically reported in fMRI activation-mapping may actually represent only hotspots of a widespread underlying effect. These issues and more (cf. Button et al., 2013; Poldrack et al., 2017) emphasize the need for increased adoption of strategies that seek to improve sensitivity alongside specificity.

4.3. Improving sensitivity in fMRI

While low-sensitivity methods are concerning, this is not a new problem and myriad solutions exist. These are presented below, ordered from easy-to-implement "quick fixes" to more systematic changes.

Amongst the easiest solutions is the simple yet controversial option to loosen the threshold, in stark contrast with the recent call to use even more stringent thresholds in the biomedical sciences (i.e., p<0.005; Benjamin et al., 2018). Some investigators have taken this approach to the extreme; just under half (41%) of empirical fMRI studies recently surveyed did not use any multiple comparison correction at all (Nichols, 2016). Certainly, neglecting to perform any correction defeats the purpose of the frequentist inference. Yet the unprincipled use of a looser threshold also raises similar concerns—how many false positives are actually expected if 90% of studies allow one or more false positive(s)? A principled argument weighing the benefits and weaknesses should precede the decision to select a more lenient threshold.

Similarly, researchers should avoid the temptation to achieve similar ends by using parametric cluster correction if they are motivated by the possibility of a higher than specified FWER (perhaps after observing the results in Eklund et al., 2016). Doing so relies on violating the assumptions of the test, resulting in an FWER that is not just high but unpredictable. Instead, researchers should always ensure that assumptions are justified for any test chosen. Flandin and Friston (2016) have argued that Random Field Theory (RFT) assumptions are reasonably attainable (i.e., stringent cluster-determining threshold, high and uniform smoothness; Friston et al., 1996), despite the fact that typical study characteristics may violate them (cf. Eklund et al., 2016). Used correctly, parametric procedures should ultimately yield the same sensitivity and specificity as their nonparametric counterparts for the same objective, and provide additional benefits (e.g., computational efficiency; Flandin and Friston, 2016). It is better to use a procedure for its intended advantages than to misuse it in order to exploit unpredictable side effects.

An easy yet more principled solution is to restrict analyses to a circumscribed set of hypotheses, if possible (e.g., specific regions, pipelines, analyses, etc.). This can boost power for studies at a confirmatory stage of analysis (Wagenmakers et al., 2012; Lindquist and Mejia, 2015), e.g., a vision study may restrict analyses to visual cortex. If desired, a confirmatory study may be followed by an exploratory study involving more or all brain areas and exploratory statistical methods. Researchers should be clear that any exploratory findings are used for hypothesis generation and not confirmatory inference (Behrens and Yu, 2003).

An important yet more difficult option in neuroimaging is to improve power through study design and/or meta-analysis. Many design strategies may improve power: more subjects (Button et al., 2013), more within-subject data, more effective tasks, etc. An empirical study is needed to explore which strategies lead to desired power levels after cluster correction. A recent investigation provides an important benchmark; using uncorrected inferences, medium-sized effects can generally be detected with more than 80 subjects, and large effects with 40 subjects (Geuter et al., 2018). Sample size can be increased by pooling data from different sources, either across sites—although it is important to assess for possible site effects (cf. Noble et al., 2017)—or across studies in a meta-analysis. Some have argued for more exploratory approaches at the individual study level that can then be gathered together for confirmatory meta-analysis at a later stage (Lieberman and Cunningham, 2009). If following this route, researchers should take care to follow best practices for exploratory analysis as discussed above.

Others have proposed that we reconsider the error rate under control. In most correction procedures in neuroimaging, we have taken it upon ourselves to control the rate of at least one false positive (i.e., FWER) instead of the proportion of false positives (e.g., the false discovery rate introduced by Benjamini and Hochberg, 1995; Type I/Type II error-based ratios introduced by Ioannidis et al., 2011, and Cremers et al., 2017; etc.). Yet many have argued that the latter is a more natural target for much of biomedical research, where the large number of tests inherent in individual studies makes the possibility of at least one false positive nearly unavoidable (Cremers et al., 2017; Genovese et al., 2002; Kessler et al., 2017; Narum, 2006; see footnote 4 for a simple illustration and Lindquist and Mejia, 2015, for a more detailed example). These approaches may be appealing if one is willing to tolerate a greater number of false positives in order to detect a greater proportion of true effects. Many procedures have been developed to control the false discovery rate (Strimmer et al., 2008), some of which have been extended to cluster-based corrections in fMRI (Benjamini and Heller, 2007; Chumbley and Friston, 2009; cf. Perone-Pacifico et al., 2004). For cluster-level inference, parametric (RFT) and nonparametric FDR corrections have been compared with parametric (RFT) FWER correction (Kessler et al., 2017; Lohmann et al., 2018; cluster-determining thresholds p=0.01 and p=0.001). In practice, parametric FDR and FWER corrections can yield nearly equivalent results (Lohmann et al., 2018); nonparametric FDR correction is also able to preserve most (cluster-determining threshold p=0.01) to nearly all (cluster-determining threshold p=0.001) results from parametric FWER correction (Kessler et al., 2017). A systematic, empirical comparison between approaches that control the proportion of false positives and those that control the FWER is needed to better understand their respective advantages and disadvantages.
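As a self-contained toy illustration of the difference between the two error rates (not a model of fMRI data, where tests are spatially dependent and cluster-based), the R sketch below applies Bonferroni (FWER) and Benjamini-Hochberg (FDR) adjustment to simulated independent voxelwise p-values, with 5% of voxels carrying a d = 0.5 effect and n = 20 observations per test.

    set.seed(1)
    n_vox  <- 10000
    signal <- rbinom(n_vox, 1, 0.05)   # 1 = true effect present at this voxel
    x <- matrix(rnorm(20 * n_vox, mean = 0.5 * signal), nrow = 20, byrow = TRUE)
    p <- apply(x, 2, function(v) t.test(v, alternative = "greater")$p.value)

    sum(p.adjust(p, "bonferroni") < 0.05 & signal == 1)  # true positives under FWER control
    sum(p.adjust(p, "BH") < 0.05 & signal == 1)          # true positives under FDR control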

Another promising option is to redefine the target of inference. This may be used to promote detection of a broader class of expected effects, improve the interchangeability of effects, and more. Recent threshold-independent statistics like TFCE have been used to redefine a cluster to essentially incorporate information about the effect size of surrounding voxels. A related approach is available through AFNI: Equitable Thresholding and Clustering (ETAC; Cox, 2018). ETAC combines results over a range of cluster-determining thresholds; although promising, preliminary analysis suggested it would not run to completion in a reasonable amount of time for the present study. A third method has recently been proposed based on Local Indicators of Spatial Association (LISA) and may show better sensitivity than TFCE (Lohmann et al., 2018). A fourth possibility is to infer on peaks of activation, which is expected to be more reliable than cluster-level inference based on RFT while capable of accounting for intervoxel dependence unlike voxel-level inference (Durnez et al., 2016); notably, a web-based power analysis tool is available for the method (www.neuropowertools.org). As demonstrated here, threshold-free approaches are expected to be better suited for fMRI than standard approaches that require a priori selection of a threshold. However, the increased spatial bias in sensitivity warrants further investigation.

A complementary approach is to perform spatially nonuniform correction based on smoothness, e.g., via the SPM Non-Stationary Cluster Extent Correction (http://fmri.wfubmc.edu/cms/software#NS). One could alternatively aim to uniformly smooth the image, which also reduces motion-related confounds (Scheinost et al., 2014). However, this may also introduce undesired effects such as signal mixing, which should be avoided since the present study suggests that certain areas of activation may already go under-detected. Notably, these results suggest that nonuniform smoothness is only part of the issue; TFCE has been shown to be more robust to nonstationary smoothness than cluster extent (Salimi-Khorshidi et al., 2011), but showed greater spatial bias here. A model that explicitly links nonuniform smoothness, sensitivity, and specificity is needed to illuminate whether we need spatially nonuniform inferences to go beyond accounting for nonuniform smoothness.

Finally, non-frequentist approaches can be used for inference—and by non-frequentist, we do mean Bayesian. Although a hypothesis may be difficult to articulate clearly in the form of a prior, the Bayesian framework provides a more accurate way of making inferences when such a prior can be specified (Kruschke and Liddell, 2018). It also provides a natural way to incorporate prior information about the spatial extent of effects—within well-characterized regions or perhaps even spanning known networks—and other information from previous investigations. In fact, Bayesian inference may be especially suited to enable explicit modeling of spatially nonuniform errors and effects. fMRI methods that make use of Bayesian inference are few and forthcoming (cf. Lindquist and Gelman, 2009; Woolrich, 2012), but one has recently been developed in the context of a major fMRI software package (Bayesian Multilevel Modeling in AFNI; Chen et al., 2019). Note also that false discovery rate control has been situated within a Bayesian framework (Efron, 2007).

4.4. Limitations

It is always difficult to establish a realistic measure of ground truth, which is essential for estimating sensitivity. Although we attempted to include a ground truth measure that was as realistic as possible by drawing from one of the largest task-based fMRI datasets available, the real-world accuracy—and thus, generalizability—of this measure has limitations (see notes in 2.3. Estimation of ground truth effect sizes and 2.5. True positive rate calculation). More precise estimates may be obtained as efforts continue to increase the size of datasets such as the one used here. To complement empirical investigations, future studies may also seek to use a realistic simulated measure of ground truth, perhaps using complex data generation approaches even at the level of raw data that more accurately capture characteristics of the underlying phenomena (e.g., Bellec et al., 2009; Cremers et al., 2017). In addition, due to meaningful heterogeneity within the group, some subsamples may show real effects that are not reflected in the larger group analysis; better modeling of this heterogeneity in the ground truth measure may more accurately reflect the sensitivity of these methods. Furthermore, 1,000 permutations were used for nonparametric correction in this study (as in Eklund et al., 2016); however, 5,000-10,000 permutations are recommended to obtain a more reasonably narrow confidence interval for the p value. Finally, although outside the scope of this study, other methods for improving sensitivity, including those discussed above, warrant examination. It is always important to benchmark the performance of the procedures and implementations we rely on, particularly since even conceptually similar cluster-based corrections have been shown to yield diverging results (Bowring et al., 2018).

5. Conclusion

The neuroimaging community has recently witnessed rising concerns about the reproducibility of methods commonly used for multiple comparison correction. Building upon the work of others, we demonstrated that many sizable effects exist in fMRI activation studies that are simply not detected because of the low sensitivity of cluster correction aimed at limiting the familywise error rate. In fact, this work suggests we have only witnessed the tip of the iceberg in the activation-mapping literature. Fortunately, many strategies may be used to improve sensitivity, and many are easy to implement. It is crucial to improve sensitivity because optimizing for specificity alone can degrade the accuracy and reproducibility of neuroimaging research. The present work adds to the mounting evidence that it is time to revise our practices to balance the scale.

Supplementary Material

Supplementary Materials

6. Acknowledgments

This work was supported by the National Institute of Neurological Disorders and Stroke at the National Institutes of Health under award numbers F99NS108557 and K00MH122372 (S.N.), and by the U.S. National Science Foundation under grant number DGE1122492 (S.N.).

In addition, this study would not be possible without openly available preprocessed data from the Human Connectome Project (Van Essen et al., 2013; Barch et al., 2013). We acknowledge all researchers who have been part of those efforts and the initiatives supporting them. We are grateful for these efforts and remain committed to advocating for an open science ecosystem that enables more reproducible practices.

Footnotes

1

The statistical terminology used here follows conventions established by Shapiro (1999) and notation by Nichols et al. (2003). For a statistical test of the null hypothesis, the false positive rate (FPR) measures the expected proportion of "false positives" (V1|0)—tests wherein the null is true but incorrectly rejected—out of all tests where the null is true (V·|0 = V − V·|1): FPR = E(V1|0/V·|0). In contrast, the true positive rate (TPR) measures the expected proportion of "true positives" (V1|1)—tests wherein the null is false and correctly rejected—out of all tests where the null is false (V·|1): TPR = E(V1|1/V·|1). Note that the true positive rate can be referred to as "sensitivity" (Trevethan, 2017; Altman and Bland, 1994; Shapiro, 1999), 1 − "Type II error rate" (Shapiro, 1999), and "power" (see the equivalence between power and 1 − Type II error rate in Lipsey, 1990, pp. 45, and Cohen, 1988, pp. 5). The false positive rate can be referred to as the "Type I error rate" and 1 − "specificity" (Shapiro, 1999).

2

In detail, the familywise error rate (FWER) measures the probability of detecting at least one false positive (V1|0>0) in a family of tests where the null is true: FWER=P(V1|0 > 0). Note that the FWER is sometimes understood to be a type of false positive rate (e.g., Nichols et al., 2003; Eklund et al., 2016), taking false positive rate in a more general sense than the definition in the previous footnote. The more general sense of false positive rate is used only in the abstract for approachability and consistency with the study discussed in the abstract (Eklund et al., 2016).

3

A derivation of the t-to-d conversion for a one-sample test is as follows. The definitions of the one-sample t-statistic (t) and sample d-coefficient (d_s) are t = (x̄ − μ0)/s_x̄ and d_s = (x̄ − μ0)/s, where x̄ is the sample mean, μ0 is the mean under the null hypothesis, s is the sample standard deviation, and s_x̄ is the sample standard error of the mean. s_x̄ is related to s by the sample size n: s_x̄ = s/√n. By substitution, we can write t in terms of d_s: t = (x̄ − μ0)/(s/√n) = [(x̄ − μ0)/s]·√n = d_s·√n, and therefore d_s = t/√n. See Cohen (1988), pp. 71-72, for further reference.

4

Simple illustration comparing correction procedures: imagine a study of 40 variables, where the ground truth is that 2 have no effect (i.e., 5% are actual negatives) and 38 have an effect (i.e., 95% are actual positives). Imagine a test that produces exactly 1 false positive effect for every 19 true positive effects (i.e., 5% of all detected effects are false positives). If 100 replicates of that study are conducted, limiting the rate of at least one false positive per study to 5% (i.e., FWER=5%) means that about 5 of those replicates would yield any false positives, resulting in about 5-10 false positives and 19-38 true positives across replicates. Limiting the proportion of false positives per study to 5% of all detected effects (i.e., FDR=5%) would allow approximately all effects to pass the threshold in each replicate, resulting in about 200 false positives and 3800 true positives across replicates.

Data Availability

All original data has been made openly available by the Human Connectome Project (Van Essen et al., 2013). All code used in this study has also been made openly available at https://github.com/SNeuroble/cluster_power_failure. This repository also contains the spatial maps shown in this manuscript that can be interactively explored as drag-and-drop “live figures” in BioImage Suite Web (https://bioimagesuiteweb.github.io/webapp/). The authors are willing to answer any additional inquiries upon request.

7. References

  1. Altman DG, Bland JM, 1994. Diagnostic tests. 1: Sensitivity and specificity. BMJ: British Medical Journal 308, 1552.
  2. Bansal R, Peterson BS, 2018. Cluster-level statistical inference in fMRI datasets: The unexpected behavior of random fields in high dimensions. Magnetic resonance imaging 49, 101–115.
  3. Barch DM, Burgess GC, Harms MP, Petersen SE, Schlaggar BL, Corbetta M, Glasser MF, Curtiss S, Dixit S, Feldt C, 2013. Function in the human connectome: task-fMRI and individual differences in behavior. Neuroimage 80, 169–189.
  4. Behrens JT, Yu C.-h., 2003. Exploratory data analysis. Handbook of psychology 2, 33–64.
  5. Bellec P, Perlbarg V, Evans AC, 2009. Bootstrap generation and evaluation of an fMRI simulation database. Magnetic resonance imaging 27, 1382–1396.
  6. Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers E-J, Berk R, Bollen KA, Brembs B, Brown L, Camerer C, 2018. Redefine statistical significance. Nature Human Behaviour 2, 6.
  7. Benjamini Y, Hochberg Y, 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57, 289–300.
  8. Benjamini Y, Heller R, 2007. False discovery rates for spatial signals. Journal of the American Statistical Association, 102, 1272–1281.
  9. Bennett CM, Miller MB, Wolford GL, 2009. Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon: an argument for multiple comparisons correction. Neuroimage, 47, S125.
  10. Bowring A, Maumet C, Nichols TE, 2018. Exploring the Impact of Analysis Software on Task fMRI Results. BioRxiv, 285585.
  11. Button KS, Ioannidis JP, Mokrysz C, Nosek BA, Flint J, Robinson ES, Munafò MR, 2013. Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365.
  12. Champely S, 2018. pwr: Basic Functions for Power Analysis. R package version 1.2-2. https://CRAN.R-project.org/package=pwr
  13. Chumbley JR, Friston KJ, 2009. False discovery rate revisited: FDR and topological inference using Gaussian random fields. Neuroimage, 44, 62–70.
  14. Chen G, Xiao Y, Taylor PA, Rajendra JK, Riggins T, Geng F, Redcay E, Cox RW, 2018. Handling Multiplicity in Neuroimaging through Bayesian Lenses with Multilevel Modeling. bioRxiv, 238998.
  15. Cohen J, 1988. Statistical power analysis for the behavioral sciences. Routledge.
  16. Cox RW, 1996. AFNI: software for analysis and visualization of functional magnetic resonance neuroimages. Computers and Biomedical Research, 29, 162–173.
  17. Cox RW, 2018. Equitable thresholding and clustering. bioRxiv, 295931.
  18. Cremers HR, Wager TD, Yarkoni T, 2017. The relation between statistical power and inference in fMRI. PloS one 12, e0184923.
  19. De Luca M, Beckmann CF, De Stefano N, Matthews PM, Smith SM, 2006. fMRI resting state networks define distinct modes of long-distance interactions in the human brain. Neuroimage, 29, 1359–1367.
  20. Durnez J, Degryse J, Moerkerke B, Seurinck R, Sochat V, Poldrack R, Nichols T, 2016. Power and sample size calculations for fMRI studies based on the prevalence of active peaks. BioRxiv, 049429.
  21. Eklund A, Nichols TE, Knutsson H, 2016. Cluster failure: why fMRI inferences for spatial extent have inflated false-positive rates. Proceedings of the National Academy of Sciences, 201602413.
  22. Efron B, 2007. Size, power and false discovery rates. The Annals of Statistics 35, 1351–1377.
  23. Flandin G, Friston KJ, 2016. Analysis of family-wise error rates in statistical parametric mapping using random field theory. Human brain mapping.
  24. Friston KJ, Holmes A, Poline J-B, Price CJ, Frith CD, 1996. Detecting activations in PET and fMRI: levels of inference and power. Neuroimage 4, 223–235.
  25. Genovese CR, Lazar NA, Nichols T, 2002. Thresholding of statistical maps in functional neuroimaging using the false discovery rate. Neuroimage 15, 870–878.
  26. Geuter S, Qi G, Welsh RC, Wager TD, Lindquist MA, 2018. Effect Size and Power in fMRI Group Analysis. bioRxiv, 295048.
  27. Gorgolewski KJ, Varoquaux G, Rivera G, Schwarz Y, Ghosh SS, Maumet C, Sochat VV, Nichols TE, Poldrack RA, Poline JB, Yarkoni T, 2015. NeuroVault.org: a web-based repository for collecting and sharing unthresholded statistical maps of the human brain. Frontiers in neuroinformatics, 9, 8.
  28. Ioannidis JP, 2005. Why most published research findings are false. PLoS medicine, 2, e124.
  29. Ioannidis JP, 2008. Why most discovered true associations are inflated. Epidemiology, 640–648.
  30. Ioannidis JP, Tarone R, McLaughlin JK, 2011. The false-positive to false-negative ratio in epidemiologic studies. Epidemiology, 450–456.
  31. Jenkinson M, Beckmann CF, Behrens TE, Woolrich MW, Smith SM, 2012. FSL. Neuroimage, 62, 782–790.
  32. Kessler D, Angstadt M, Sripada CS, 2017. Reevaluating “cluster failure” in fMRI using nonparametric control of the false discovery rate. Proceedings of the National Academy of Sciences 114, E3372–E3373.
  33. Kruschke JK, Liddell TM, 2018. The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25, 178–206.
  34. Laird AR, Eickhoff SB, Fox PM, Uecker AM, Ray KL, Saenz JJ, McKay DR, Bzdok D, Laird RW, Robinson JL, Turner JA, 2011. The BrainMap strategy for standardization, sharing, and meta-analysis of neuroimaging data. BMC research notes, 4, 349.
  35. Lieberman MD, Cunningham WA, 2009. Type I and Type II error concerns in fMRI research: re-balancing the scale. Social cognitive and affective neuroscience, nsp052.
  36. Lindquist MA, Gelman A, 2009. Correlations and multiple comparisons in functional imaging: a statistical perspective (Commentary on Vul et al., 2009). Perspectives on Psychological Science 4, 310–313.
  37. Lindquist MA, Mejia A, 2015. Zen and the art of multiple comparisons. Psychosomatic medicine 77, 114.
  38. Lipsey MW, 1990. Design sensitivity: Statistical power for experimental research (Vol. 19). Sage.
  39. Logothetis NK, 2008. What we can do and what we cannot do with fMRI. Nature 453, 869–878.
  40. Lohmann G, Stelzer J, Lacosse E, Kumar VJ, Mueller K, Kuehn E, Grodd W, Scheffler K, 2018. LISA improves statistical analysis for fMRI. Nature communications, 9, 4014.
  41. Lohmann G, Stelzer J, Mueller K, Lacosse E, Buschmann T, Kumar VJ, Grodd W, Scheffler K, 2017. Inflated false negative rates undermine reproducibility in task-based fMRI. bioRxiv, 122788.
  42. McElreath R, Smaldino PE, 2015. Replication, communication, and the population dynamics of scientific discovery. PloS one 10, e0136088.
  43. Narum SR, 2006. Beyond Bonferroni: less conservative analyses for conservation genetics. Conservation genetics 7, 783–787.
  44. Nichols T, 2016. Bibliometrics of cluster inference. Available at https://blogs.warwick.ac.uk/nichols/entry/bibliometrics_of_cluster/. Accessed April 10, 2017.
  45. Nichols TE, Das S, Eickhoff SB, Evans AC, Glatard T, Hanke M, Kriegeskorte N, Milham MP, Poldrack RA, Poline JB, Proal E, 2017. Best practices in data analysis and sharing in neuroimaging using MRI. Nature Neuroscience, 20, 299.
  46. Nichols TE, Holmes AP, 2002. Nonparametric permutation tests for functional neuroimaging: a primer with examples. Human brain mapping, 15, 1–25.
  47. Nichols T, Hayasaka S, 2003. Controlling the familywise error rate in functional neuroimaging: a comparative review. Statistical methods in medical research, 12, 419–446.
  48. Noble S, Scheinost D, Finn ES, Shen X, Papademetris X, McEwen SC, Bearden CE, Addington J, Goodyear B, Cadenhead KS, 2017. Multisite reliability of MR-based functional connectivity. Neuroimage 146, 959–970.
  49. Perone Pacifico M, Genovese C, Verdinelli I, Wasserman L, 2004. False discovery control for random fields. Journal of the American Statistical Association, 99, 1002–1014.
  50. Poldrack RA, Baker CI, Durnez J, Gorgolewski KJ, Matthews PM, Munafò MR, Nichols TE, Poline JB, Vul E, Yarkoni T, 2017. Scanning the horizon: towards transparent and reproducible neuroimaging research. Nature Reviews Neuroscience, 18, 115.
  51. Salimi-Khorshidi G, Smith SM, Nichols TE, 2011. Adjusting the effect of nonstationarity in cluster-based and TFCE inference. Neuroimage, 54, 2006–2019.
  52. Scheinost D, Papademetris X, Constable RT, 2014. The impact of image smoothness on intrinsic functional connectivity and head motion confounds. Neuroimage, 95, 13–21.
  53. Shapiro DE, 1999. The interpretation of diagnostic tests. Statistical methods in medical research, 8, 113–134.
  54. Simmons JP, Nelson LD, Simonsohn U, 2011. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological science 22, 1359–1366.
  55. Smith SM, Nichols TE, 2009. Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. Neuroimage 44, 83–98.
  56. Strimmer K, 2008. A unified approach to false discovery rate estimation. BMC bioinformatics 9, 303.
  57. Trevethan R, 2017. Sensitivity, specificity, and predictive values: foundations, pliabilities, and pitfalls in research and practice. Frontiers in public health, 5, 307.
  58. Van Essen DC, Smith SM, Barch DM, Behrens TE, Yacoub E, Ugurbil K, Consortium W-MH, 2013. The WU-Minn human connectome project: an overview. Neuroimage 80, 62–79.
  59. Wagenmakers E-J, Wetzels R, Borsboom D, van der Maas HL, Kievit RA, 2012. An agenda for purely confirmatory research. Perspectives on Psychological Science 7, 632–638.
  60. Winkler AM, Ridgway GR, Webster MA, Smith SM, Nichols TE, 2014. Permutation inference for the general linear model. Neuroimage, 92, 381–397.
  61. Woo CW, Krishnan A, Wager TD, 2014. Cluster-extent based thresholding in fMRI analyses: pitfalls and recommendations. Neuroimage, 91, 412–419.
  62. Woolrich MW, 2012. Bayesian inference in fMRI. Neuroimage 62, 801–810.
  63. Woolrich MW, Behrens TEJ, Beckmann CF, Jenkinson M, Smith SM, 2004. Multilevel linear modelling for FMRI group analysis using Bayesian inference. Neuroimage 21, 1732–1747.
  64. Yarkoni T, 2009. Big correlations in little studies: Inflated fMRI correlations reflect low statistical power—Commentary on Vul et al. (2009). Perspectives on Psychological Science 4, 294–298.
  65. Yarkoni T, Poldrack RA, Nichols TE, Van Essen DC, Wager TD, 2011. Large-scale automated synthesis of human functional neuroimaging data. Nature methods 8, 665.
