Author manuscript; available in PMC: 2017 Aug 7.
Published in final edited form as: Ann N Y Acad Sci. 2017 May 2;1396(1):5–18. doi: 10.1111/nyas.13325

Progress Toward Openness, Transparency, and Reproducibility in Cognitive Neuroscience

Rick O Gilmore a, Michele T Diaz a,b, Brad A Wyble a, Tal Yarkoni c
PMCID: PMC5545750  NIHMSID: NIHMS878076  PMID: 28464561

Abstract

Accumulating evidence suggests that many findings in psychological science and cognitive neuroscience may prove difficult to reproduce; statistical power in brain imaging studies is low, and has not improved recently; software errors in common analysis tools are common, and can go undetected for many years; and, a few large scale studies notwithstanding, open sharing of data, code, and materials remains the rare exception. At the same time, there is a renewed focus on reproducibility, transparency, and openness as essential core values in cognitive neuroscience. The emergence and rapid growth of data archives, meta-analytic tools, software pipelines, and research groups devoted to improved methodology reflects this new sensibility. We review evidence that the field has begun to embrace new open research practices, and illustrate how these can begin to address problems of reproducibility, statistical power, and transparency in ways that will ultimately accelerate discovery.

Keywords: open science, reproducibility, data sharing

Introduction

Most cognitive neuroscientists seek answers to questions about what patterns of neural activity underlie perception, thinking, memory, and action, among other topics. In answering these questions, we marshal evidence from studies of human and animal behavior, nervous system structure and activity, the effects of endogenous and exogenous substances, patterns of disorder and disease, and trajectories of change across the lifespan. Our common aim is to reveal reliable, reproducible, and useful facts about the relationship between mind and brain. These facts depend crucially on the tools we deploy to collect and evaluate data and on how we report what we do or do not find. In this paper, we review the degree to which our field meets the scientific ideals of reproducibility, transparency, and openness.

Rigorous self-reflection and self-criticism about methodology have been core values in cognitive neuroscience for some time13. Efforts to foster widespread data sharing 46 and other open research practices have long histories. What strikes us as new and important enough to merit reviewing them in 2016 are developments that likely cheer the pessimist and the optimist alike. On the one hand, accumulating evidence suggests that many findings in psychological science may be difficult to reproduce7; statistical power in brain imaging studies is low810 and has not improved11 over time; software errors in analysis tools are common and can go undetected for many years12; and a few large-scale studies and databases notwithstanding 4,1314, the open sharing of data, code, and materials is rare. On the other hand, we see a renewed focus on reaffirming reproducibility, transparency, and openness as essential core values in psychological science and related fields7,1519. This reinvigorated focus has begun to force greater clarity about what these values mean in practice20. We find that the emergence and rapid growth of data archives, meta-analytic tools, software pipelines, and research groups devoted to improved methodology are genuine reasons for optimism about the future of an open, transparent, and reproducible cognitive neuroscience.

In the sections that follow, we discuss definitions of open science practices and why they might be important for the field. We then review some of the history of these practices, discuss a range of recent developments, and offer some speculations about what the near future might hold.

History of open science practices in cognitive neuroscience

What are open science practices? What does it mean to reproduce or replicate a study? Most researchers agree that discovering robust and generalizable findings is central to the scientific enterprise12, 16, 19, 21, but what evidence determines success or failure in meeting the ideal? In a previous paper in NYAS, Bennett & Miller1 sought to assess the reliability of fMRI results, and provided data about the diverse measures used to assess reliability in the functional neuroimaging literature of that time. In summarizing the results of a large sample of studies reporting measures of test/retest reliability, Bennett and Miller1 observed that no agreement exists about what constitutes acceptable reliability, nor was there consensus on what measure or measures should be used to evaluate it. Half a decade later, Goodman and colleagues20 argue that uncertainty and disagreement about the meaning of these concepts 22 persists and that misunderstanding impedes progress toward solutions. In response, Goodman and colleagues suggest three new terms which we adopt here: methods reproducibility, results reproducibility, and inferential reproducibility. Methods reproducibility means that a different investigator is able to obtain the same results when applying the same tools and analytical procedures used in a study to the same (i.e., original) dataset. Results reproducibility means that a new study with new data, collected following the original procedures as closely as possible, yields the same outcomes. Inferential reproducibility occurs when independent researchers come to similar conclusions about what patterns of data mean, based on their own replication study or a reanalysis of a prior study20. For example, Goodman et al.20 suggest that competing views about the implications of a recent high-profile study of replicability in psychology7 stem, at least in part, from a disagreement at this level23, 24.

Clearly, to achieve methods reproducibility, research practices that accurately and precisely capture essential details about methods, data, and workflows must be deployed; to achieve results reproducibility, those elements must be made openly and freely available to the scientific community; and achieving inferential reproducibility requires, among other developments, the capacity to accumulate, analyze, and interpret large quantities of data2526 in consistent ways. Thus, openness and transparency relate directly to reproducibility of all three kinds. Reflecting this sensibility, a diverse array of behavioral scientists have begun arguing that achieving the scientific ideals of a free and open exchange of information requires the widespread adoption of open and transparent communication practices 1516, 2730. How well has the field of cognitive neuroscience measured up to these ideals?

Methods reproducibility

Much of cognitive neuroscience research is computationally intensive, so the extent to which the field's methods are reproducible depends on whether complex computational workflows can be reliably regenerated. Whether measuring task- or non-task-related nervous system activity using EEG, fMRI, or brain structure using MRI, CT, or PET, cognitive neuroscience studies regularly generate spatially and temporally dense data streams. Seemingly minor choices made at each step of an analysis pipeline—including experimental design, data acquisition, preprocessing, analysis, and reporting—can ramify and have important implications for reproducibility.

The complexity of the typical neuroimaging pipeline is visible from the earliest stages of data acquisition. While there are only three major manufacturers of MRI scanners, the machines run different pulse sequences, and even scanners from the same manufacturers do not often run the same software. At the preprocessing stage—even prior to statistical analysis—researchers face a bewildering array of options when considering when and how (or even whether) to account for subject movement, signal spikes, differences in brain anatomy, physiological confounds, and any number of standard concerns. Statistical analysis is no less complicated, as researchers must decide what kind of analyses to conduct (mass univariate, multivariate pattern classification, etc.), what search space to use (whole-brain, specific regions-of-interest), what statistical contrasts and multiple comparisons correction procedures to apply, and so on. The sheer magnitude of variation in analytical approaches underscores why computational reproducibility is so critical in cognitive neuroscience and why it has historically seemed so daunting. Put simply, without the ability to understand precisely what steps a research group took, it is doubtful that anyone else could ever reproduce the procedures.

In EEG, the diversity of methods is arguably even larger than in fMRI, with numerous manufacturers and a corresponding variety of technologies using different kinds of electrodes, amplifier settings, cap configurations, and software packages31. There have been some efforts at standardization of analysis methods through the release of software packages, such as the Matlab-based EEGLAB32 and ERPLAB33. These packages have the advantage of allowing researchers to explore the data using a graphical interface while simultaneously generating an executable history script that records most of the analysis decisions. The BigEEG Consortium (www.bigeeg.org), an offshoot of the EEGLAB initiative, seeks to develop and promote data and metadata standards for EEG-based research that may eventually facilitate large-scale analysis and meta-analysis. But, by and large, EEG data collection and analysis involve equipment and workflows that vary considerably from lab to lab.

Of course, the complexity of neuroimaging data analysis is not itself the enemy. It is not the raw number of methodological and analytical choices per se that creates barriers to reproducibility; rather, the challenge lies in encoding those degrees of freedom in a standardized (and ideally, machine-readable) way. Fortunately, over time, the brain imaging community has converged on recommendations about what parameters should be reported and how3435. Moreover, at least in fMRI, imaging data analysis software shows a significant degree of standardization. From the earliest days of human brain imaging, leading research groups in the U.S. and U.K. wrote and freely distributed analysis software. This led to the widespread adoption of common tools with similar, although not identical, algorithms. Concerns about the inferential consequences of using one tool over another have been largely alleviated by findings from Gold36 and Morgan37, and questions about the reliability of workflows using one or more tools have been addressed by Strother and colleagues3839 and others (but see2, 12, 40). All of the major tools in common use (SPM, FSL, AFNI, and BrainVoyager) enable researchers to write scriptable workflows, built either on internal engines (BrainVoyager), widely available commercial (SPM-MATLAB), or free/open source software languages (Linux/unix shell, python, C/C++ for FSL and AFNI). Naturally, there are some important caveats to this seemingly rosy picture. One concern is that while existing software supports relatively standardized and highly automated processing workflows in principle, whether researchers actually take advantage of those features in practice is a separate matter. The number of SPM, AFNI, and BrainVoyager users who rely exclusively on automated scripting in their analysis workflows, as opposed to using more user-friendly, but inherently irreproducible, graphical interfaces, is not known. We speculate it is small. Moreover, even in labs that do conduct fully automated analyses, the sharing or publication of the corresponding scripts or data processing pipelines remains rare40. While the differences between pipelines can be subtle40, the margin for error is also small (many published results only barely survive statistical correction), so a lack of full reporting can severely impair reproducibility.

A second caveat is that perfect reproducibility may be impossible to achieve even when a researcher is armed with all of the original data and scripts used to generate an analysis. Operating system differences, untracked differences in implicit software dependencies, and other factors can sometimes produce numerical discrepancies that, while initially small, may magnify as they cascade through a workflow to the point of introducing qualitative differences in results 41. We discuss potential solutions to this problem (e.g., containerization) later; for present purposes, we note that acknowledging the intrinsic limits of methodological reproducibility does not grant researchers license to ignore best practices in automation and code sharing.
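To make the point about cascading numerical discrepancies concrete, the following Python sketch shows the simplest possible case: the same three numbers summed in two different orders give different floating-point results. It is only an illustration under stated assumptions. The function name `naive_sum` and the input values are hypothetical, and real pipelines differ in far more than summation order (library versions, compiler optimizations, threading).

```python
import math


def naive_sum(values):
    """Accumulate values left to right, as a simple loop would."""
    total = 0.0
    for value in values:
        total += value
    return total


# The same three numbers in two orders (illustrative values only).
order_a = [1e16, 1.0, -1e16]
order_b = [1e16, -1e16, 1.0]

# In order_a the 1.0 is absorbed by rounding when added to 1e16,
# so the two orders disagree even though the inputs are identical.
print("order a:", naive_sum(order_a))
print("order b:", naive_sum(order_b))

# math.fsum tracks partial sums exactly and is order-independent.
print("fsum a: ", math.fsum(order_a))
print("fsum b: ", math.fsum(order_b))
```

A discrepancy of this kind is harmless in isolation, but when such values feed thresholding or iterative optimization steps, small differences can change which voxels survive, which is one reason to record the full software environment alongside the scripts.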

Importantly, brain imaging data are only part of the reproducibility story in cognitive neuroscience. It is also critical to understand how to reproduce the psychological components of cognitive neuroscience studies, most notably the experimental design and its intended relationship to the latent constructs of interest. Here, the prospects for full reproducibility have historically seemed less promising. Most experimental tasks involve the presentation of sequences of visual or auditory events and the collection of participants' behavioral responses (button presses, mouse movements or clicks, vocalizations, or eye movements) using computer programs that instantiate tasks custom-tailored by a research team to address particular questions of interest. There have been researcher-initiated efforts to develop controlled vocabularies that describe the range of cognitive tasks deployed in the literature4243. NIH has spearheaded the creation of a standard toolbox of easily deployable tasks44 and also the development of data repositories designed to capture metadata about behavioral tasks and their variants45 in standardized and searchable forms. However, such efforts notwithstanding, most cognitive neuroscience researchers employ customized tasks built using a variety of software and scripting environments (e.g., E-Prime, the Matlab-based Psychophysics Toolbox, PsychoPy, and DMDX). Tasks use customized image and sound components, and researchers rarely share the code, image, or sound files used in experimental tasks40. These practices limit the reproducibility of behavioral measures used in cognitive neuroscience and psychology as a whole7. Of course, the rigid standardization of tasks and materials has its own significant flaws, including the possibility of stifling innovation and slowing progress. We suggest that more widespread and open sharing of behavioral tasks, code, and materials provides a constructive middle ground.

Results reproducibility

Assuming that independent researchers are able to reproduce the methods of one another's studies, how closely do the findings generated converge? The answer is: It depends. In principle, even differences as basic as the make of MRI scanner could undermine the ability to compare results across studies46, so considerable effort has gone into standardizing techniques that allow multi-site imaging studies to be carried out in rigorous and reproducible ways.

Fortunately, the viability of the basic technology is no longer in any serious doubt; abundant evidence demonstrates that all major brain imaging techniques are at least capable of producing highly convergent results across different sites and experimental procedures. Perhaps the best-known effort to demonstrate the basic robustness of results, not only in neuroimaging but in other biomedical fields, is the Biomedical Informatics Research Network (BIRN; https://www.nitrc.org/projects/birn/). BIRN is a multi-site, collaborative research consortium that strives to advance understanding of brain research and brain disease through the principles of data sharing and collaboration. There were several different BIRN initiatives, including the morphology BIRN, the mouse BIRN, and the function BIRN (fBIRN), among others. Although the fBIRN's disease focus was schizophrenia, considerable effort went into developing generalizable models for multi-site data collection, best practices for research, and methods to facilitate the use of standardized processes across sites. One of fBIRN's biggest contributions was software that enabled the systematic investigation of how fMRI activation signals vary across sites, field strengths, and scanner platforms. The project also developed methods to control for these differences47. fBIRN scientists developed an automated Quality Assurance (QA) procedure based on a standard MRI phantom, and the team released freely available software that could be easily incorporated into any service center's data transfer pipeline48. The fBIRN also provided leadership in modeling inter-site reliability49: The same 18 participants were scanned in four different scanning sites. These analyses revealed that inter-subject variability was 10 times greater than inter-site variability; activation in many brain regions showed fair to good reliability; and measures of reliability increased with more runs of data.

More generally, the ability of techniques such as fMRI to produce robust and replicable findings is demonstrated by the rapid canonization of many initially surprising neuroimaging findings. For example, the tendency of a spatially conserved frontoparietal “task-positive” brain network to increase activity when participants engage in effortful cognitive activity has been replicated so often with fMRI over the past two decades 5052 that the result is now often treated as a de facto manipulation check in new experiments. Large-scale meta-analyses of hundreds or even thousands of neuroimaging studies at a time further demonstrate a marked degree of convergence on stable neural correlates for most major psychological processes, from pain perception to episodic memory to language production26, 5354.

Of course, it is one thing to establish that neuroimaging methods can consistently reveal broad mappings between cognitive processes and distributed brain networks, and quite another to establish that the specific pattern of findings generated by any single study can be reproduced with a high degree of fidelity in another study. Unfortunately, as previous commentators have observed1, 55, it is unclear whether neuroimaging findings meet this criterion. Arguably, the central problem is not that results reproducibility is particularly low, but that it has been difficult to quantify, leaving open the question of how much faith one should put in the results of any given published study. We focus on two critical barriers to the comprehensive assessment of results reproducibility in cognitive neuroscience, emphasizing fMRI (though the same concerns apply to other commonly used methods such as EEG, MEG, and TMS).

A first major challenge is that careful comparison of results across independent sites and studies typically requires that the full results be openly shared between sites, yet initiatives promoting neuroimaging data sharing have historically met with limited success. An early pioneer in data sharing was the fMRI Data Center (fMRIDC) at Dartmouth College, founded in 1999 6, 5657. Around the same time, several journals, most notably the Journal of Cognitive Neuroscience (JOCN), tried to implement mandatory open data sharing, including deposition of raw data files (e.g., BOLD time series, anatomical images), as a requirement for publication. These efforts to foster increased transparency, while laudable, sparked controversy and backlash from the community. Opponents raised concerns about practical issues (technology, data formats, time and money constraints, and privacy) and cultural ones (the possibly negative impact of open sharing on individual scientific careers and advancement, questions about data ownership, and whether data sharing should be mandatory or optional)57, 58. The backlash eventually led JOCN to backtrack on the data sharing requirement, and when funding to maintain the archive ran out, fMRIDC stopped accepting new data. fMRIDC's architects argue that despite the setbacks, fMRIDC should be viewed as a successful pioneer in open fMRI data sharing56, whose experiences shaped the next generation of repositories like the 1000 Functional Connectomes Project (FCP; fcon_1000.projects.nitrc.org/) and its International Neuroimaging Data Initiative (INDI)58, the OpenfMRI project (openfmri.org) and NeuroVault (neurovault.org), the Human Connectome Project (humanconnectomeproject.org), and the NIMH-based National Database for Autism Research (NDAR; ndar.nih.gov). More broadly, fMRIDC helped fuel interest in, recognition of, and support for the essential role information infrastructure (neuroinformatics) plays in making widespread data sharing and reuse possible.

A second barrier to the evaluation of results reproducibility is a lack of consensus about what quantitative measures should be used1. One area of contention is whether the magnitude or spatial extent of task-related activation (or both) should be assessed. Measures of magnitude and spatial extent depend on criteria for determining which voxels are active, of course. Bennett and Miller1 argued that the plurality of measures of reliability reported within individual studies made it challenging to ask about the reliability of findings across studies. They reported that the reliability of group-level results, using individual participants tested at different times, varied depending on the temporal gap between the tests, the specific tasks employed (sensory/motor vs. cognitive), design factors (block vs. event-related), the magnitude of activations, and other inter-individual factors. Importantly, Miller59 and others60 found that, like the difference between within- and between-site variability found by the fBIRN team49, variability within participants across testing sessions was lower than variability between participants. Differences in tasks, the degree of selectivity of active voxels to those tasks, and subject motion appeared to be the biggest contributors to inter-subject variability61. Nevertheless, Bennett and Miller1 noted that all of the studies reporting test/retest reliabilities had small sample sizes, foreshadowing concerns about limited statistical power that others raised in the intervening years8, 11.
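The distinction between magnitude-based and extent-based measures, and the dependence of the latter on the activation criterion, can be sketched in a few lines of Python. This is a hedged illustration, not a method taken from the studies cited above: the Dice overlap coefficient and the Pearson correlation are used here as common stand-ins for spatial-extent and magnitude agreement, and the two six-voxel "maps", the thresholds, and the function names are all invented for the example.

```python
import math


def dice_overlap(map_a, map_b, threshold):
    """Spatial-extent agreement: overlap of suprathreshold voxels."""
    active_a = {i for i, v in enumerate(map_a) if v > threshold}
    active_b = {i for i, v in enumerate(map_b) if v > threshold}
    total_active = len(active_a) + len(active_b)
    if total_active == 0:
        return math.nan  # undefined when neither map has active voxels
    return 2 * len(active_a & active_b) / total_active


def pearson_r(x, y):
    """Magnitude agreement: correlation of voxel values across sessions."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical statistic values for six voxels in two sessions.
session_1 = [3.2, 0.4, 2.9, 1.1, 3.5, 0.2]
session_2 = [2.8, 0.6, 1.9, 2.4, 3.1, 0.1]

print("magnitude agreement (r):", round(pearson_r(session_1, session_2), 3))
for threshold in (1.5, 2.3):
    overlap = dice_overlap(session_1, session_2, threshold)
    print(f"extent agreement (Dice) at threshold {threshold}: {overlap:.3f}")
```

With these made-up values the overlap estimate changes when only the threshold changes, while the magnitude correlation is unaffected, which is the sense in which studies that report different measures, or the same measure under different criteria, are hard to compare.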

In sum, we believe that, perhaps surprisingly, given the size of the primary literature, the jury is still out on the degree to which researchers should expect individual neuroimaging findings to replicate when repeated under similar conditions. There is little doubt that methods like fMRI can produce highly replicable results, and that many canonical findings are indeed highly robust; however, as the low-hanging fruit are plucked and researchers increasingly turn to subtler phenomena, it becomes more important for researchers to share data and results openly. Only in doing so can we progress toward consensus on criteria for evaluating the reproducibility of results.

Inferential Reproducibility

The challenge of generating reproducible inferences—where independent researchers come to similar conclusions about what patterns of data mean—has been a central concern in the field for many years. Numerous published reviews have highlighted conceptual and statistical problems that threaten common neuroimaging inferences25, 6263. One source of concern is that the statistical power of most fMRI studies is well below conventionally adequate levels811; in a recent review based on over 1,100 samples, Poldrack et al.62 found that the median fMRI study in 2015 was underpowered to detect anything but relatively large effects (Cohen's d of ∼0.75) even when using relatively high-powered procedures (i.e., a one-sample t-test). This observation is worrisome not only because low power implies a high false negative rate and inefficient resource expenditure, but because it frequently leads to incorrect interpretations—the notion that effects are stronger and better-localized than they actually are6465. This increases the false-positive rate8 across the literature as a whole.
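The relationship between sample size, effect size, and power referred to above can be illustrated with a rough calculation. The Python sketch below uses a normal approximation to the power of a two-sided one-sample test; this is an assumption made to keep the example dependency-free, it slightly overstates power at small n relative to the exact t-test, and it is not the procedure used in the cited review. The function name, the default alpha of 0.05, and the sample sizes are illustrative only; whole-brain analyses typically apply far stricter thresholds, which lowers power further.

```python
from statistics import NormalDist


def approx_power_one_sample(effect_size_d, n, alpha=0.05):
    """Approximate power of a two-sided one-sample test.

    Uses the normal approximation: under the alternative, the test
    statistic is treated as Normal(d * sqrt(n), 1).
    """
    standard_normal = NormalDist()
    z_crit = standard_normal.inv_cdf(1 - alpha / 2)
    shift = effect_size_d * n ** 0.5
    upper_tail = standard_normal.cdf(shift - z_crit)
    lower_tail = standard_normal.cdf(-shift - z_crit)
    return upper_tail + lower_tail


# Hypothetical sample sizes, evaluated for a large and a moderate effect.
for d in (0.75, 0.3):
    for n in (10, 16, 30):
        power = approx_power_one_sample(d, n)
        print(f"d = {d}, n = {n}: approximate power = {power:.2f}")
```

The contrast between the two effect sizes is the point of the sketch: sample sizes that look adequate for a large effect leave moderate effects mostly undetected.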

A second set of concerns arises at the analysis stage. As Bennett et al.'s well-known “dead salmon” illustration showed66, insufficiently stringent multiple comparisons correction procedures can easily inflate the false positive rate, an observation echoed by numerous studies that have highlighted limitations with common correction methods12, 6769. Moreover, such analyses all assume a best-case scenario under which researchers are not (inadvertently) capitalizing on the many “researcher degrees of freedom” available in a typical fMRI pipeline40. If one could formally account for “p-hacking” (i.e., data-dependent selection of analysis procedures), it is likely that the false positive rate would rise, perhaps substantially11, 70.
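Why uncorrected or weakly corrected testing inflates false positives follows from simple arithmetic, sketched below in Python. The sketch assumes independent tests, which real voxels are not (neighboring voxels are spatially correlated), so the numbers indicate the direction and scale of the problem rather than values for any actual analysis. Bonferroni correction is shown only as the simplest familiar remedy; the function names and test counts are illustrative.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across n_tests
    independent tests, each run at significance level alpha."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


# Illustrative numbers of simultaneous tests.
for n_tests in (1, 10, 100, 10_000):
    uncorrected = familywise_error_rate(0.05, n_tests)
    corrected = familywise_error_rate(bonferroni_alpha(0.05, n_tests), n_tests)
    print(f"{n_tests:>6} tests: uncorrected = {uncorrected:.3f}, "
          f"Bonferroni-corrected = {corrected:.3f}")
```

Even at one hundred independent tests, an uncorrected threshold of 0.05 makes at least one false positive nearly certain, whereas the corrected familywise rate stays below 0.05.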

Lastly, even if one sets aside the statistical issues involved in the generation of cognitive neuroscience findings and assumes for the sake of argument that most published findings are fundamentally sound, it does not follow that researchers will agree about how to interpret such findings. Indeed, trenchant concerns have been raised about some of the most common assumptions researchers make when interpreting neuroimaging results, ranging from basic questions about what the BOLD signal reflects to what kind of information is actually extracted from multivariate pattern analysis 7174. Poldrack75 has flagged the problem of reverse inference as a particularly serious challenge, noting that the widespread approach of inferring mental function based on the pattern of observed brain activity results runs a high risk of failure—unless it is supported by an appropriate Bayesian analysis that directly estimates the probability of a given task or state occurring conditional on an observed pattern of activity that is based on a reasonable prior distribution26.
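The logic of a Bayesian treatment of reverse inference can be written out directly from Bayes' rule. The Python sketch below is a minimal illustration with invented probabilities; the function and argument names are hypothetical, and in practice the required quantities would have to be estimated from large collections of studies rather than assumed.

```python
def posterior_process_given_activation(
    p_activation_given_process,
    p_activation_given_no_process,
    prior_process,
):
    """Bayes' rule: P(process | activation).

    Inputs are P(activation | process), P(activation | no process),
    and the prior probability that the process is engaged.
    """
    evidence = (
        p_activation_given_process * prior_process
        + p_activation_given_no_process * (1 - prior_process)
    )
    if evidence == 0:
        raise ValueError("activation has zero probability under these inputs")
    return p_activation_given_process * prior_process / evidence


# Invented example: a region that responds in 80% of studies engaging
# the process, but also in 40% of studies that do not engage it.
posterior = posterior_process_given_activation(0.8, 0.4, prior_process=0.2)
print(f"P(process | activation) = {posterior:.2f}")
```

In this made-up case, observing activation raises the probability of the process from 0.2 to about one third. A region that responds under many different conditions therefore supports only a weak inference about any one of them, and if it responds equally often with and without the process, the posterior simply equals the prior.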

Of course, science is a difficult enterprise, and it is easy to find serious methodological or statistical problems with virtually any piece of scientific research. The key question is what steps researchers are taking to address inferential concerns and to ensure that research findings continue to improve in reliability over time. To this end, we consider more recent initiatives aimed at improving the reproducibility of cognitive neuroscience research.

Recent Initiatives

The focus on problems of reproducibility in scientific research as a whole has accelerated in the last several years16, 1920, and its scope extends well beyond psychological and neural science. As a result, cognitive neuroscience is both a beneficiary of new tools that promise to improve reproducibility and a contributor to them. We show that, fortunately, our field has already begun to embrace new, open and transparent research practices that promise to mitigate or even eliminate many of the serious problems of methods, results, and inferential reproducibility.

Methods reproducibility

Concern about reproducible workflows and practices across the computational sciences has sharpened in similar ways7677. While the specific practices that make computations reproducible vary from one field to another, Sandve and colleagues78 summarize a set of steps that have broad applicability to cognitive neuroscientists. These include avoiding manual data manipulation steps (using scripts not GUIs); keeping careful track of the provenance (history) of all data, including derived results; tracking versions of all software and data; and providing public access to all code, outputs, and data.
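Two of these steps, tracking provenance and tracking software versions, can be illustrated with a short Python sketch that writes a small machine-readable record alongside an analysis. Everything specific here is an assumption made for the example: the function names, the record fields, the file name, and the parameter are hypothetical, and established provenance tools capture considerably more than this.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone


def sha256_of_bytes(data):
    """Content hash, so a later run can confirm it used identical input."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(input_name, input_bytes, parameters):
    """Collect what was analyzed, with which settings, on which software."""
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "input": {
            "name": input_name,
            "sha256": sha256_of_bytes(input_bytes),
        },
        "parameters": parameters,
    }


# Hypothetical input and parameter; real code would read the file's bytes.
record = provenance_record(
    "sub-01_bold.nii.gz",
    b"placeholder for file contents",
    {"smoothing_fwhm_mm": 6},
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Committing a record like this to version control with the analysis script gives a later reader a way to check that the same input and settings were used, without relying on memory or on a methods section.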

Several data analysis tools have been developed based on free open source languages like R (RStudio; rstudio.com) and Python (Jupyter; jupyter.org). These tools support the creation of interactive electronic notebooks that combine data manipulation and analysis code along with graphic visualizations and text-based commentary. The tools can be used with version control environments like git or mercurial, allowing the history of a project's data analysis to be captured. Version control software can be used to store and share software, analyses, manuscripts, and documents written in virtually any language (both human and computer). Coupled with web-based repositories like GitHub (github.com), BitBucket (bitbucket.org), or the Open Science Framework (OSF; osf.io), version control systems enable researchers to share the histories and current status of all project materials and data. Some researchers concerned about computational reproducibility have gone even further, by creating full software environments that can run a particular analysis and packaging them in a specialized “containerized” environment (e.g., Docker; www.docker.com) that can be distributed for others' use across a wide range of computer platforms. The use of electronic notebooks, version control software, and web-based open data repositories has begun to enable cognitive neuroscience researchers to produce open and transparent workflows that can be readily reproduced. The authors use many of these techniques in their own research workflows.

Other efforts focus on methods reproducibility across study teams. One such initiative is the development of the Brain Imaging Data Structure (BIDS; bids.neuroimaging.io), a new open data format designed to facilitate the storage and sharing of data from brain imaging studies79. BIDS attempts to achieve an easily implementable file directory and data structure that captures critical data and metadata about brain imaging studies and some data about the behavioral tasks performed by participants. BIDS arose out of the work involved in creating the OpenFMRI (openfmri.org) data repository34, designed to allow researchers to openly share raw BOLD imaging data sets with sufficient information to permit re- or meta-analysis. The BigEEG project mentioned earlier represents a similar data format standardization initiative targeted at the EEG community.
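The idea of encoding study metadata in a predictable directory and file-naming scheme can be sketched as follows. The Python helper below builds a BIDS-style path for a functional run from key-value entities such as subject, session, task, and run. It is a simplified illustration covering only a small part of the naming scheme: the function name is hypothetical, the zero-padded run index is a choice made for this example, and the published BIDS specification remains the authority on which entities and suffixes are valid.

```python
from pathlib import PurePosixPath


def bids_style_bold_path(subject, task, run=None, session=None):
    """Build a BIDS-style relative path for one functional (BOLD) run.

    Entities appear as key-value pairs joined by underscores, and the
    same subject/session labels name the enclosing directories.
    """
    entities = [f"sub-{subject}"]
    directories = [f"sub-{subject}"]
    if session is not None:
        entities.append(f"ses-{session}")
        directories.append(f"ses-{session}")
    entities.append(f"task-{task}")
    if run is not None:
        entities.append(f"run-{run:02d}")
    directories.append("func")  # functional images live under func/
    filename = "_".join(entities) + "_bold.nii.gz"
    return PurePosixPath(*directories) / filename


# Hypothetical subject, session, and task labels.
print(bids_style_bold_path("01", "rest"))
print(bids_style_bold_path("01", "nback", run=2, session="pre"))
```

Because the path itself states which participant, session, task, and run a file belongs to, software written by one group can locate and interpret data organized by another without custom configuration.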

On the data sharing side, modern platforms have picked up where pioneers like the fMRIDC left off, making it ever easier for researchers to distribute large neuroimaging datasets in a readily usable form. A major initiative focused on methods and results reproducibility is the Stanford Center for Reproducible Neuroscience (CRN; reproducibility.stanford.edu) formed in 2015 by Russell Poldrack and colleagues. CRN is developing data repositories for both raw neuroimaging datasets (an upcoming successor to the OpenFMRI platform) and whole brain statistical maps (NeuroVault.org)80. A long-term goal of the CRN is not only to facilitate sharing, but also to provide containerized, modular, and fully reproducible cloud-based tools that can be easily executed via a graphical web interface. This will bring reproducible state-of-the-art neuroimaging data analysis within reach of researchers who lack the resources to deploy their own pipelines locally81.

One of us82 has argued that many problems in reproducing the methods of behavioral studies could be ameliorated if video of all experimental procedures was more widely recorded and shared with researchers. Text-based methods sections with restrictive page or word limits simply can't convey sufficiently detailed information about a study's methods so that it can be reproduced by another researcher. Sharing video can pose privacy risks, but the Databrary (databrary.org) digital library, a repository specialized for storing and sharing video, has developed a policy framework to share identifiable data with participant permission. Like OSF, Databrary has begun to serve as a web-based home for researchers to store and share data, metadata, and materials about the non-imaging-related portions of a study, including videos of experimental procedures, images, audio recordings, or displays. Databrary largely focuses on developmental and learning science research now, but may expand in the future.

In sum, the field is making rapid strides to improve the reproducibility of methods with the emergence of new tools, practices, centers, web-based data management systems, and data repositories.

Results reproducibility

Despite the acknowledged lack of consensus about how to measure and thereby evaluate results reproducibility1 and the noted significant problems with statistical power, we find a number of encouraging developments concerning the reproducibility of results. Researchers continue to take seriously the effort to systematically measure the factors that influence test/retest reliability of responses across time and tasks, and these sorts of studies are increasingly common83-88. Other research programs focus on addressing questions about the long-term within-subject stability of responses89-90, and how the accurate assessment of within-participant differences might address questions about individual differences93. There is increasing support for conducting and publishing the results of confirmatory studies94-95, thereby rectifying some existing biases that often favor the publication of novel results over confirmatory ones.

Several large-scale cross-site imaging studies whose results were designed to be widely shared with the research community have been undertaken (e.g., the Human Connectome Project and the U.K. Biobank Project). Findings from these studies are beginning to appear95 with results that both confirm and extend current understanding. Perhaps equally important is the extent to which planning for the large-scale sharing of these sorts of data has led to publication of extensive details about processing pipelines96 and careful planning about how to make shared components useful to other researchers.

Policy makers and publishers have taken a renewed interest in how the context in which scientific research is conducted and results shared can influence reproducibility. The Consortium for Reliability and Reproducibility (CoRR) has developed best practice guidelines for the use of the resting-state fMRI data available through the INDI archive96, and the Organization for Human Brain Mapping has created a Committee on Best Practice in Data Analysis and Sharing (COBIDAS)97. Following on the success of the ArXiv preprint service, increasing numbers of cognitive neuroscientists have begun to deposit article preprints in the BioRxiv preprint service (biorxiv.org/neuroscience). An effort specific to psychological science (PsyArxiv; osf.io/view/psyarxiv/) has begun with support from the Center for Open Science (COS; cos.io) and the newly formed Society for the Improvement of Psychological Science (SIPS; improvingpsych.org). High-profile generalist and topic-specific journals are adopting data sharing requirements reminiscent of those JOCN attempted to implement 15 years ago; there are new journals, such as Nature Publishing's Scientific Data, focused on creating citable, scholarly homes for well-curated datasets; and some journals (e.g., Cortex) have adopted a new publication format, the pre-registered report, in which the methods and analysis plan are reviewed prior to data collection in exchange for a commitment to publish the results regardless of the findings.

There have also been recent developments to improve appropriate usage and reporting of statistical tests by means of automated tools such as statcheck (statcheck.io)98-99, which looks for elementary errors in the reporting of individual statistical tests. A related tool, P-curve (www.p-curve.com)100, uses the complete set of statistical results from a body of work to estimate the evidentiary strength in favor of a hypothesis. While largely focused on the psychological science literature, these initiatives bear close watching by cognitive neuroscientists as they illustrate how the standardization of reporting practices can lead to insights about the quality of research practices and strength or weakness of evidence across a broad published literature100. In the case of statcheck, the system depends on the fact that most experimental psychology papers report statistical analyses in ways that allow pertinent parameters to be automatically extracted from the published texts. Clearly, the diverse efforts focused on bolstering the reproducibility of cognitive neuroscience results have considerable forward momentum.
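The logic of this kind of automated consistency check can be sketched briefly. The Python example below is not statcheck and does not use its interface; it is a simplified illustration restricted to z tests, in which a reported statistic is extracted from text, the two-tailed p value is recomputed, and the result is compared with the reported p value at the reported precision. The pattern `Z_REPORT` and the function names are assumptions made for this sketch.

```python
import math
import re

# Simplified, hypothetical pattern for reports such as "z = 1.96, p = .05".
Z_REPORT = re.compile(r"\bz\s*=\s*(-?\d*\.?\d+)\s*,\s*p\s*=\s*(0?\.\d+)")


def two_tailed_p_from_z(z: float) -> float:
    """Two-tailed p value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def check_z_reports(text: str):
    """Return (matched text, recomputed p, consistent?) for each z report."""
    results = []
    for match in Z_REPORT.finditer(text):
        z = float(match.group(1))
        p_text = match.group(2)
        reported_p = float(p_text)
        # Allow for rounding of p to the number of decimals that were reported.
        decimals = len(p_text.split(".")[1])
        tolerance = 0.5 * 10 ** (-decimals)
        recomputed_p = two_tailed_p_from_z(z)
        consistent = abs(recomputed_p - reported_p) <= tolerance + 1e-12
        results.append((match.group(0), recomputed_p, consistent))
    return results


sample = ("The groups differed, z = 1.96, p = .05; "
          "the control contrast gave z = 1.20, p = .01.")
for reported, p, ok in check_z_reports(sample):
    print(f"{reported!r}: recomputed p = {p:.4f}, consistent = {ok}")
```

A fuller check would also handle t, F, r, and chi-square tests, one-tailed tests, and rounding of the test statistic itself. The sketch shows only why standardized reporting makes such checks possible.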

Inferential reproducibility

Since the 2010 Bennett and Miller review on replicability, new tools and practices that promise to bolster the reproducibility of inferences have been created and are being adopted at an accelerating rate. We highlight three: meta-analysis, improved statistical practices, and machine learning.

For meta-analysis to succeed, the statistical effects from a large number of disparate studies must be collected, normalized, and reported in standardized ways101. The variability in analysis and reporting practices across the cognitive neuroscience literature can make meta-analysis challenging. As a result, the creation and curation of large-scale brain imaging databases has been essential for the growth of meta-analysis as an inferential tool. One of the oldest such systems devoted to supporting meta-analytic datasets and software is the BrainMap project (www.brainmap.org)5. As of late fall 2016, BrainMap consisted of data from more than 100,000 individual participants from nearly 4,000 papers. The BrainMap data and meta-analysis tools have been used and cited more than 600 times since 1992, with more than 125 citations in 2016 alone.

Neurosynth (neurosynth.org)26 takes an alternative approach to meta-analysis in which the raw data are i) activation (x,y,z) coordinates mined from the text of imaging papers published in HTML on the web, combined with ii) word frequencies from the same papers. In this way, Neurosynth aims to automate and thereby standardize and accelerate the process of meta-analysis. By combining information about activation coordinates with term frequencies derived from calculating distributional statistics from the published articles, Neurosynth enables the analyst to interactively determine the extent of evidence for a relationship between a specific term of interest and a set of brain coordinates. For example, the analyst could visualize either the probability of a given voxel's activation given the existence of a specific term in the system's database of papers or the probability of a target term appearing in papers that report a particular voxel as active. The system allows users to view 3D maps of the conditional probabilities online, to download the maps for further analysis, and to create customized sets of searches. As of early 2017, Neurosynth contained data from more than 11,000 imaging studies, and it provides users with downloadable interactive meta-analyses from more than 3,000 terms. The system has been cited almost 600 times, with more than 180 citations in 2016 alone. Of course, Neurosynth only supports meta-analysis on a subset of the published literature—older papers that were not published in easily parsable HTML formats and unpublished findings fall outside its scope.
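The two conditional probabilities described above can be illustrated with a toy calculation. The Python sketch below does not use the Neurosynth software or its data. It assumes a hypothetical table in which each study is reduced to two flags: whether the paper uses a term of interest, and whether it reports activation at a voxel of interest. It ignores the thresholding and smoothing a real system would apply.

```python
from typing import Iterable, Tuple


def conditional_probabilities(studies: Iterable[Tuple[bool, bool]]):
    """Return (P(activation | term), P(term | activation)) from study flags.

    Each study is a pair (mentions_term, reports_activation).
    """
    studies = list(studies)
    n_term = sum(1 for term, _ in studies if term)
    n_active = sum(1 for _, active in studies if active)
    n_both = sum(1 for term, active in studies if term and active)
    p_active_given_term = n_both / n_term if n_term else float("nan")
    p_term_given_active = n_both / n_active if n_active else float("nan")
    return p_active_given_term, p_term_given_active


# Toy data (invented for illustration): 10 studies.
toy_studies = (
    [(True, True)] * 3      # term used, voxel reported active
    + [(True, False)] * 1   # term used, voxel not reported
    + [(False, True)] * 2   # term absent, voxel reported active
    + [(False, False)] * 4  # neither
)

p_a_given_t, p_t_given_a = conditional_probabilities(toy_studies)
print(f"P(activation | term) = {p_a_given_t:.2f}")  # 3 of 4 term studies -> 0.75
print(f"P(term | activation) = {p_t_given_a:.2f}")  # 3 of 5 active studies -> 0.60
```

The two quantities differ because they are normalized by different counts, which is why a map of one cannot be read as a map of the other.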

Beyond meta-analysis, cognitive neuroscience research continues to push for new statistical procedures and the wider adoption of long-standing, but more robust ones. For space reasons we do not elaborate extensively here, but among the issues under active discussion are the appropriate handling of main effects and interaction tests102, the applicability of linear mixed effects modeling techniques103, and the ongoing need to guard against the risks of false positive results3, 104 even when using well-established, vetted, and widely used analysis software12. Still others suggest that the standard practice of treating stimulus effects as fixed, not random, may undermine the generalizability of findings across studies105. An emergent theme is the ongoing need for vigorous and rigorous methodological reevaluation combined with a commitment to more open software publication practices. In the recent case of Eklund and colleagues12, the discovery of an error in the algorithm for controlling for cluster-wise fMRI activation effects quickly led to changes in the widely-used AFNI package106-107. The episode highlights the corrective, collaborative nature of open-source software development, while also underscoring the uncomfortable reality that, at present, very few people who use open-source software packages actually bother to read the underlying code (the AFNI bug had previously gone undetected for many years).

In many other areas of social and computational science, progress has been facilitated by borrowing ideas and techniques from the field of machine learning. Philosophically, machine learning researchers tend to emphasize their ability to quantitatively predict key outcomes and pay less attention to traditional forms of scientific explanation108. This philosophy has led to the rapid proliferation of thousands of predictive modeling techniques, a number of which (e.g., support vector machines) are deployed regularly in cognitive neuroscience. Many fMRI studies are now framed as predictive problems in classification or regression, where the goal is to build a model that successfully discovers a mapping between a set of predictor variables and a set of discrete (in the case of classification) or continuous (in the case of regression) outcomes. For example, the distributed activation pattern of a large number of voxels in an fMRI data set can be used to predict successful vs. unsuccessful attempts to recognize a stimulus. The resultant classifier can then be used to predict outcomes “out-of-sample” (i.e., in new data sets) and potentially also to aid in the interpretation of which voxels were likely to have played a role in processing relevant information. This gives rise to applications such as the ‘mind-reading’ of representations present in visual cortex during movie viewing109-110 or revealing participants' semantic maps activated during narrative comprehension111.

Machine learning and related ‘big data’ techniques have provided entirely new approaches to analysis. They allow researchers to capitalize on neural information patterns that may be too subtle or complex to be easily discovered using more conventional summary statistic approaches (e.g., brain activation maps or ERPs)112. Of course, prediction-oriented approaches are not a panacea for standard concerns about the seeming ease with which researchers can fool themselves and unwittingly generate false or exaggerated findings. In the context of machine learning, the term ‘overfitting’ is used to describe a case in which the predictions of an analysis have been inadvertently contaminated by noise in the data that were used to develop the analysis. As a consequence of overfitting, favorable results obtained when analyzing the same dataset that was used to develop and calibrate the analysis will not be obtained when examining other, equivalent datasets. Some researchers who use machine learning take the notion of ‘overfitting’ more seriously than others113. The cautious deploy methods such as cross-validation (i.e., training and testing a model on independent subsets of the data) that should in principle guard against overfitting. However, machine learning pipelines allow for considerably more analytical flexibility than conventional analyses. Researchers often have a choice between literally hundreds of different estimation approaches, each of which may have its own free parameters that require tuning to perform optimally. Whereas there is widespread awareness of the need to cross-validate results once a model has been selected, there is much less recognition that overfitting can still occur through the optimization of an analysis (for further discussion, see114-115). Thus, as our field increasingly adopts machine learning techniques, it will be important to borrow established best practices from fields that have been using similar big data approaches for longer periods of time112, 115.
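The point that overfitting can arise from optimizing an analysis, even when each candidate analysis is cross-validated, can be illustrated with a small simulation. The Python sketch below is an assumption-laden toy, not a method from the literature cited here: the data are pure noise, each of 200 hypothetical "pipelines" is a nearest-centroid classifier restricted to a random subset of three features, and all function and variable names are invented for the example. The best cross-validated score among the candidates is then compared with the accuracy of that same pipeline on freshly generated data.

```python
import random


def nearest_centroid_predict(train_x, train_y, test_row, features):
    """Predict label 0 or 1 from squared distance to each class centroid."""
    sq_dist = {}
    for label in (0, 1):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        dist = 0.0
        for f in features:
            centroid = sum(r[f] for r in rows) / len(rows)
            dist += (test_row[f] - centroid) ** 2
        sq_dist[label] = dist
    return 0 if sq_dist[0] <= sq_dist[1] else 1


def cv_accuracy(x, y, features, n_folds=5):
    """k-fold cross-validated accuracy; fold k holds out rows with i % n_folds == k."""
    correct = 0
    for fold in range(n_folds):
        train_idx = [i for i in range(len(x)) if i % n_folds != fold]
        test_idx = [i for i in range(len(x)) if i % n_folds == fold]
        train_x = [x[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in test_idx:
            if nearest_centroid_predict(train_x, train_y, x[i], features) == y[i]:
                correct += 1
    return correct / len(x)


def make_noise(n_rows, n_features, rng):
    """Gaussian noise with no relationship to any label."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_rows)]


rng = random.Random(0)
n_rows, n_features = 40, 30
labels = [i % 2 for i in range(n_rows)]  # balanced, arbitrary labels
x = make_noise(n_rows, n_features, rng)

# Each candidate "pipeline" differs only in which three features it uses.
candidates = [tuple(rng.sample(range(n_features), 3)) for _ in range(200)]
cv_scores = [cv_accuracy(x, labels, c) for c in candidates]
best_cv = max(cv_scores)
mean_cv = sum(cv_scores) / len(cv_scores)
best_candidate = candidates[cv_scores.index(best_cv)]

# Evaluate the selected pipeline on new noise it has never seen.
x_new = make_noise(n_rows, n_features, rng)
hits = sum(
    nearest_centroid_predict(x, labels, row, best_candidate) == lab
    for row, lab in zip(x_new, labels)
)
holdout_acc = hits / n_rows

print(f"mean CV accuracy over candidates: {mean_cv:.2f}")
print(f"best CV accuracy (selected):      {best_cv:.2f}")
print(f"accuracy on fresh data:           {holdout_acc:.2f}")
```

Because the labels carry no information, the expected accuracy on fresh data is at chance. The best cross-validated score is, by construction, at least as high as the average candidate score, so reporting it alone would overstate performance. Holding out data that play no role in choosing the analysis, or nesting the selection step inside the cross-validation loop, addresses this.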

The Future

As Van Horn and Gazzaniga56 observed, “the reality remains that very little of the neuroimaging data gathered each day in the field have been made available to those who could help provide much needed understanding.” While we agree that this assessment still holds, we see other evidence that points toward a very different future. There is increasing recognition that greater openness and transparency, reflected in data, materials, and code sharing, offer individual investigators and the field as a whole far more benefits than risks28, 58, 108, 116. While significant challenges remain in developing technology and workflow practices that make open and transparent workflows easy to generate and data readily shareable, progress is being made. It is increasingly clear that there are substantial scientific rewards in analyzing or re-analyzing large-scale shared data sets beyond improving statistical power. Accordingly, while we take seriously the concerns many have raised recently about the methods, results, and inferential reproducibility of our field, we encourage our colleagues to embrace the newly emerging open science practices with an optimistic mindset56, 117, as there is so much more to gain than to lose.

At the same time, it is essential that the field identify barriers that stand in the way of a more open, transparent, and reproducible neuroscience of cognition. One clear gap is the difficulty of capturing and reporting reproducible information about tasks, displays, and analysis procedures, although new data and materials repositories like OSF and Databrary, emerging data standards (BIDS; BigEEG), and pipelines (EEGLAB; nipype) can play constructive roles. Another concerns the need to forge community consensus around a set of principles about the culture in which cognitive neuroscience research is carried out: how to seek permission to share, when data and materials should be shared, how to measure and report individual scholarly contributions to large-scale studies, how to weigh the impact of analyses conducted on secondary data relative to the collection of new data, and how to ensure that the transition to more open science practices doesn't unduly harm the careers of the next generation of researchers. Prescribing answers to these questions goes beyond our scope, but we urge continued dialogue focused on achieving community consensus.

A vital question for which there remains no satisfying answer is what entity will pay for the curation, support, maintenance, and long-term storage of cognitive neuroscience data and materials. Data repositories, both past and present, have been funded either by short-term (3-5 year duration) NIH or NSF research grants or private foundation funders. Thus, despite increasingly strong encouragements from granting agencies to share data and materials or even mandates to do so, there is as yet no long-term commitment from the agencies for funding devoted to long-term data preservation. Data curation, storage, and preservation are not inexpensive, and the problem of how to sustain research infrastructure that benefits the entire research community will neither solve itself nor go away. Nevertheless, based on the success of other fields like astronomy, high energy physics, and the geosciences, we think that a strong case can be made for enduring federal and private donor support for research infrastructure that empowers cognitive neuroscientists to openly share data, materials, and methods.

Fundamentally, we think that investments in the future of cognitive neuroscience infrastructure will generate big payoffs. Fostering the widespread adoption of open, transparent, and reproducible research practices coupled with innovations in technology that enable the large-scale analysis of our particular store of ‘big data’ will accelerate the discovery of generalizable, robust, and meaningful findings about the nature and origins118 of human cognition.

Acknowledgments

ROG acknowledges support from NSF BCS-1147440, NSF BCS-1238599, and NICHD U01-HD-076595. BW acknowledges support from NSF BCS-1331073.

References
