eLife. 2020 Mar 4;9:e53385. doi: 10.7554/eLife.53385

NeuroQuery, comprehensive meta-analysis of human brain mapping

Jérôme Dockès 1,, Russell A Poldrack 2, Romain Primet 3, Hande Gözükan 3, Tal Yarkoni 4, Fabian Suchanek 5, Bertrand Thirion 1, Gaël Varoquaux 1,6,
Editors: Christian Büchel7, Thomas Yeo8
PMCID: PMC7164961  PMID: 32129761

Abstract

Reaching a global view of brain organization requires assembling evidence on widely different mental processes and mechanisms. The variety of human neuroscience concepts and terminology poses a fundamental challenge to relating brain imaging results across the scientific literature. Existing meta-analysis methods perform statistical tests on sets of publications associated with a particular concept. Thus, large-scale meta-analyses only tackle single terms that occur frequently. We propose a new paradigm, focusing on prediction rather than inference. Our multivariate model predicts the spatial distribution of neurological observations, given text describing an experiment, cognitive process, or disease. This approach handles text of arbitrary length and terms that are too rare for standard meta-analysis. We capture the relationships and neural correlates of 7547 neuroscience terms across 13 459 neuroimaging publications. The resulting meta-analytic tool, neuroquery.org, can ground hypothesis generation and data-analysis priors on a comprehensive view of published findings on the brain.

Research organism: Human

Introduction

Pushing the envelope of meta-analyses

Each year, thousands of brain-imaging studies explore the links between brain and behavior: more than 6000 publications a year contain the term ‘neuroimaging’ on PubMed. Finding consistent trends in the knowledge acquired across these studies is crucial, as individual studies by themselves seldom have enough statistical power to establish fully trustworthy results (Button et al., 2013; Poldrack et al., 2017). But compiling an answer to a specific question from this impressive number of results is a daunting task. There are too many studies to manually collect and aggregate their findings. In addition, such a task is fundamentally difficult due to the many different aspects of behavior, as well as the diversity of the protocols used to probe them.

Meta-analyses can give objective views of the field, to ground a review article or a discussion of new results. Coordinate-Based Meta-Analysis (CBMA) methods (Laird et al., 2005; Wager et al., 2007; Eickhoff et al., 2009) assess the consistency of results across studies, comparing the observed spatial density of reported brain stereotactic coordinates to the null hypothesis of a uniform distribution. Automating CBMA methods across the literature, as in NeuroSynth (Yarkoni et al., 2011), enables large-scale analyses of brain-imaging studies, giving excellent statistical power. Existing meta-analysis methods focus on identifying effects reported consistently across the literature, to distinguish true discoveries from noise and artifacts. However, they can only address neuroscience concepts that are easy to define. Choosing which studies to include in a meta-analysis can be challenging. In principle, studies can be manually annotated as carefully as one likes. However, manual meta-analyses are not scalable, and the corresponding degrees of freedom are difficult to control statistically. In what follows, we focus on automated meta-analysis. To automate the selection of studies, the common solution is to rely on terms present in publications. But closely related terms can lead to markedly different meta-analyses (Figure 1). The lack of a universally established vocabulary or ontology to describe mental processes and disorders is a strong impediment to meta-analysis (Poldrack and Yarkoni, 2016). Indeed, only 30% of the terms contained in a neuroscience ontology or meta-analysis tool are common to another (see Table 1). In addition, studies are diverse in many ways: they investigate different mental processes, using different terms to describe them, and different experimental paradigms to probe them (Newell, 1973). Yet, current meta-analysis approaches model all studies as asking the same question. They cannot model nuances across studies because they rely on in-sample statistical inference and are not designed to interpolate between studies that address related but different questions, or make predictions for unseen combinations of mental processes. A consequence is that, as we will show, their results are harder to control outside of well-defined and frequently-studied psychological concepts.

Figure 1. Taming query variability: maps obtained for a few words related to mental arithmetic.


By correctly capturing the fact that these words are related, NeuroQuery can use its map for easier words like ‘calculation’ and ‘arithmetic’ to encode terms like ‘computation’ and ‘addition’ that are difficult for meta-analysis.

Table 1. Diversity of vocabularies: there is no established lexicon of neuroscience, even in hand-curated reference vocabularies, as visible across CognitiveAtlas (Poldrack and Yarkoni, 2016), MeSH (Lipscomb, 2000), NeuroNames (Bowden and Martin, 1995), NIF (Gardner et al., 2008), and NeuroSynth (Yarkoni et al., 2011).

Our dataset, NeuroQuery, contains all the terms from the other vocabularies that occur in more than 5 out of 10 000 articles. ‘MeSH’ corresponds to the branches of PubMed’s MEdical Subject Headings related to neurology, psychology, or neuroanatomy (see Section 'The choice of vocabulary'). Many MeSH terms are hardly or never used in practice – for example, variants of multi-term expressions with permuted word order such as ‘Dementia, Frontotemporal’ – and are therefore not included in NeuroQuery’s vocabulary. Numbers above 25% are shown in bold.

% of ↓ contained in →   Cognitive Atlas (895)   MeSH (21287)   NeuroNames (7146)   NIF (6912)   NeuroSynth (1310)   NeuroQuery (7547)
Cognitive Atlas         100%                    14%            0%                  3%           14%                 68%
MeSH                    1%                      100%           3%                  4%           1%                  9%
NeuroNames              0%                      9%             100%                29%          1%                  10%
NIF                     0%                      12%            30%                 100%         1%                  10%
NeuroSynth              9%                      14%            5%                  5%           100%                98%
NeuroQuery              8%                      25%            9%                  9%           17%                 100%

Currently, an automated meta-analysis cannot cover all studies that report a particular functional contrast (contrasting mental conditions to isolate a mental process, Poldrack et al., 2011). Indeed, we lack the tools to parse the text in articles and reliably identify those that relate to equivalent or very similar contrasts. As an example, consider a study of the neural support of translating orthography to phonology, probed with visual stimuli by Pinho et al. (2018). The results of this study build upon an experimental contrast labeled by the authors as ‘Read pseudo-words vs. consonant strings’, shown in Figure 2. Given this description, what prior hypotheses arise from the literature for this contrast? Conversely, given the statistical map resulting from the experiment, how can one compare it with previous reports on similar tasks? For these questions, meta-analysis seems the tool of choice. Yet, the current meta-analytic paradigm requires the practitioner to select a set of studies that are included in the meta-analysis. In this case, which studies from the literature should be included? Even with a corpus of 14 000 full-text articles, selection based on simple pattern matching –as with NeuroSynth– falls short. Indeed, only 29 studies contain all 5 words from the contrast description, which leads to a noisy and under-powered meta-analytic map (Figure 2). To avoid relying on the contrast name, which can be seen as too short and terse, one could do a meta-analysis based on the page-long task description (that can be found at https://project.inria.fr/IBC/data/ and is reproduced in the supplementary data). However, that would require combining even more terms, which precludes selecting studies that contain all of them. A more manual selection may help to identify relevant studies, but it is far more difficult and time-consuming. Moreover, some concepts of interest may not have been investigated by themselves, or only in very few studies: rare diseases, or tasks involving a combination of mental processes that have not been studied together. For instance, there is evidence of agnosia in Huntington’s disease (Sitek et al., 2014), but it has not been studied with brain imaging. To compile a brain map from the literature for such queries, it is necessary to interpolate between studies only partly related to the query. Standard meta-analytic methods lack an automatic way to measure the relevance of studies to a question, and to interpolate between them. This prevents them from answering new questions, or questions that cannot be formulated simply.

Figure 2. Illustration: studying the contrast ‘Read pseudo words vs. consonant strings’.


Left: Group-level map from the IBC dataset for the contrast ‘Read pseudo-words vs. consonant strings’ and contour of the NeuroQuery map obtained from this query. The NeuroQuery map was obtained directly from the contrast description in the dataset’s documentation, without needing to manually select studies for the meta-analysis nor convert this description to a string pattern usable by existing automatic meta-analysis tools. The map from which the contour is drawn, as well as a NeuroQuery map for the page-long description of the RSVP language task, are shown in Section 'Example Meta-analysis results for the RSVP language task from the IBC dataset' (panels c and d, respectively). Right: ALE map for 29 studies that contain all terms from the IBC contrast description. The map was obtained with the GingerALE tool (Eickhoff et al., 2009). With only 29 matching studies, ALE lacks statistical power for this contrast description.

Many of the constraints of standard meta-analysis arise from the necessity to define an in-sample test on a given set of studies. Here, we propose a new kind of meta-analysis that focuses on out-of-sample prediction rather than hypothesis testing. The focus shifts from establishing consensus for a particular subject of study to building multivariate mappings from mental diseases and psychological concepts to anatomical structures in the brain. This approach is complementary to classic meta-analysis methods such as Activation Likelihood Estimation (ALE) (Laird et al., 2005), Multilevel Kernel Density Analysis (MKDA) (Wager et al., 2007) or NeuroSynth (Yarkoni et al., 2011): these perform statistical tests to evaluate the trustworthiness of results from past studies, while our framework predicts, based on the description of an experiment or subject of study, which brain regions are most likely to be observed in a study. We introduce a new meta-analysis tool, NeuroQuery, that predicts the neural correlates of neuroscience concepts – related to behavior, diseases, or anatomy. To do so, it considers terms not in isolation, but in a dynamic, contextually-informed way that allows for mutual interactions. A predictive framework enables maps to be generated by generalizing from terms that are well studied (‘faces’) to those that are less well studied and inaccessible to traditional meta-analyses (‘prosopagnosia’). As a result, NeuroQuery produces high-quality brain maps for concepts studied infrequently in the literature and for a larger class of queries than existing tools – including, for example, free-text descriptions of a hypothetical experiment. These brain maps predict well the spatial distribution of findings and thus form good grounds to generate regions of interest or interpret results for studies of infrequent terms such as prosopagnosia. Yet, unlike with conventional meta-analysis, they do not control a voxel-level null hypothesis, hence they are less suited to asserting that a particular area is activated in studies, for example, of prosopagnosia.

Our approach, NeuroQuery, assembles results from the literature into a brain map using an arbitrary query with words from our vocabulary of 7547 neuroscience terms. NeuroQuery uses a multivariate model of the statistical link between multiple terms and corresponding brain locations. It is fitted using supervised machine learning on 13 459 full-text publications. Thus, it learns to weight and combine terms to predict the brain locations most likely to be reported in a study. It can predict a brain map given any combination of terms related to neuroscience – not only single words, but also detailed descriptions, abstracts, or full papers. With an extensive comparison to published studies, we show in Section 'Quantitative evaluation: NeuroQuery is an accurate model of the literature' that it indeed approximates well the results of actual experimental data collection. NeuroQuery also models the semantic relations that underlie the vocabulary of neuroscience. Using techniques from natural language processing, NeuroQuery infers semantic similarities across terms used in the literature. Thus, it makes better use of the available information, and can recover biologically plausible brain maps where other automated methods lack statistical power, for example with terms that are used in few studies, as shown in Section 'NeuroQuery can map rare or difficult concepts'. This semantic model also makes NeuroQuery less sensitive to small variations in terminology (Figure 1). Finally, the semantic similarities captured by NeuroQuery can help researchers navigate related neuroscience concepts while exploring their associations with brain activity. NeuroQuery extends the scope of standard meta-analysis, as it extracts from the literature a comprehensive statistical summary of evidence accumulated by neuroimaging research. It can be used to explore the domain knowledge across sub-fields, generate new hypotheses, construct quantitative priors or regions of interest for future studies, or put the results of an experiment in perspective. NeuroQuery is easily usable online, at neuroquery.org, and the data and source code can be freely downloaded. We start by briefly describing the statistical model behind NeuroQuery in Section 'Overview of the NeuroQuery model', then illustrate its usage (Section 'Illustration: using NeuroQuery for post-hoc interpretation') and show that it can map new combinations of concepts in Section 'NeuroQuery can map new combinations of concepts'. In Sections 'NeuroQuery can map rare or difficult concepts' and 'Quantitative evaluation: NeuroQuery is an accurate model of the literature', we conduct a thorough qualitative and quantitative assessment of the new possibilities it offers, before a discussion and conclusion.

Results

The NeuroQuery tool and what it can do

Overview of the NeuroQuery model

NeuroQuery is a statistical model that identifies brain regions related to an arbitrary text query – a single term, a few keywords, or a longer text. It is built on a controlled vocabulary of neuroscience terms and a large corpus containing the full text of neuroimaging publications and the coordinates that they report. The main components of the NeuroQuery model are an estimate of the relatedness of terms in the vocabulary, derived from co-occurrence statistics, and a regression model that links term occurrences to neural activations using supervised machine learning techniques. To generate a brain map, NeuroQuery first uses the estimated semantic associations to map the query onto a set of keywords that can be reliably associated with brain regions. Then, it transforms the resulting representation into a brain map using a linear regression model (Figure 3). This model can thus be understood as a reduced-rank regression, where the low-dimensional representation is a distribution of weights over keywords selected for their strong link with brain activity. We emphasize the fact that NeuroQuery is a predictive model. The maps it outputs are predictions of the likelihood of observing each brain location (rescaled by their standard deviation). They do not have the same meaning as ALE, MKDA or NeuroSynth maps, as they do not show a voxel-level test statistic. In this section we describe our neuroscience corpus and how we use it to estimate semantic relations, select keywords, and map them onto brain activations.

Figure 3. Overview of the NeuroQuery model: two examples of how association maps are constructed for the terms ‘grasping’ and ‘visual’.


The query is expanded by adding weights to related terms. The resulting vector is projected on the subspace spanned by the smaller vocabulary selected during supervised feature selection. Those well-encoded terms are shown in color. Finally, it is mapped onto the brain space through the regression model. When a word (e.g., ‘visual’) has a strong association with brain activity and is selected as a regressor, the smoothing has limited effect. Details: the first bar plot shows the semantic similarities of neighboring terms with the query. It represents the smoothed TFIDF vector. Terms that are not used as features for the supervised regression are shown in gray. The second bar plot shows the similarities of selected terms, rescaled by the norms of the corresponding regression coefficient maps. It represents the relative contribution of each term in the final prediction. The coefficient maps associated with individual terms are shown next to the bar plot. These maps are combined linearly to produce the prediction shown on the right.

NeuroQuery relies on a corpus of 13 459 full-text neuroimaging publications, described in Section 'Building the NeuroQuery training data'. This corpus is by far the largest of its kind; the NeuroSynth corpus contains a similar number of documents, but uses only the article abstracts, and not the full article texts. We represent the text of a document with the (weighted) occurrence frequencies of each phrase from a fixed vocabulary, that is, Term Frequency · Inverse Document Frequency (TFIDF) features (Salton and Buckley, 1988). This vocabulary is built from the union of terms from several ontologies (shown in Table 1) and labels from 12 anatomical atlases (listed in Table 4 in Section 'The choice of vocabulary'). It comprises 7547 terms or phrases related to neuroscience that occur in at least 0.05% of publications. We automatically extract 418 772 peak activation coordinates from publications, and transform them into brain maps with a kernel density estimator. Coordinate extraction is discussed and evaluated in Section 'coordinate extraction'. This preprocessing step thus yields, for each article: its representation in term frequency space (a TFIDF vector), and a brain map representing the estimated density of activations for this study. The corresponding data is also openly available online.
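As a concrete illustration of this preprocessing, the sketch below builds the two per-article representations: a TFIDF vector over a fixed vocabulary and a smoothed density image of the reported peaks. It is a minimal sketch with made-up grid parameters (shape, voxel size, kernel width), not the actual NeuroQuery preprocessing code.

```python
# Minimal sketch of the per-article representations (grid shape, voxel size and
# kernel width below are illustrative, not the values used by NeuroQuery).
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.feature_extraction.text import TfidfVectorizer

def peaks_to_density(peaks_mni, shape=(46, 55, 46), voxel_size=4.0, sigma_mm=9.0):
    """Accumulate reported peaks on a coarse grid, then smooth with a Gaussian kernel."""
    img = np.zeros(shape)
    center = np.array(shape) // 2  # crude approximation: MNI origin at the grid center
    for peak in peaks_mni:
        i, j, k = (np.asarray(peak) / voxel_size + center).round().astype(int)
        if all(0 <= idx < dim for idx, dim in zip((i, j, k), shape)):
            img[i, j, k] += 1
    return gaussian_filter(img, sigma=sigma_mm / voxel_size)

# toy vocabulary standing in for the 7547 neuroscience terms
vectorizer = TfidfVectorizer(vocabulary=["grasping", "visual", "language", "reward"])
tfidf = vectorizer.fit_transform(["a study of visual grasping of objects ..."])
density = peaks_to_density([(42, -58, 52), (-40, -44, 54)])
```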

The first step of the NeuroQuery pipeline is a semantic smoothing of the term-frequency representations. Many expressions are challenging for existing automated meta-analysis frameworks, because they are too rare, polysemic, or have a low correlation with brain activity. Rare words are problematic because peak activation coordinates are a very weak signal: from each article we extract little information about the associated brain activity. Therefore, existing frameworks rely on the occurrence of a term in hundreds of studies in order to detect a pattern in peak activations. Term co-occurrences, on the other hand, are more consistent and reliable, and capture semantic relationships (Turney and Pantel, 2010). The strength of these relationships encodes semantic proximity, from very strong for synonyms that occur in statistically identical contexts, to weaker for different yet related mental processes that are often studied in opposition to one another. Using them helps meta-analysis: it would require hundreds of studies to detect a pattern in locations reported for ‘aphasia’, for example in lesion studies. But with the text of a few publications we notice that it often appears close to ‘language’, which is indeed a related mental process. By leveraging this information, NeuroQuery recovers maps for terms that are too rare to be mapped reliably with standard automated meta-analysis. Using Non-negative Matrix Factorization (NMF), we compute a low-rank approximation of word co-occurrences (the covariance of the TFIDF features), and obtain a denoised semantic relatedness matrix (details are provided in Section 'smoothing: regularization at test time'). These word associations guide the encoding of rare or difficult terms into brain maps. They can also be used to explore related neuroscience concepts when using the NeuroQuery tool.
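The semantic smoothing can be prototyped in a few lines with scikit-learn's NMF; the rank, normalization, and mixing weight below are illustrative assumptions, not the settings of the released model.

```python
# Sketch of the semantic smoothing: low-rank non-negative approximation of word
# co-occurrences, then partial redistribution of a query's weight to related terms.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize

def semantic_similarity(tfidf, rank=50):
    """tfidf: sparse (n_docs, n_terms) matrix; returns an (n_terms, n_terms) similarity.
    The rank must not exceed the vocabulary size."""
    cooc = np.asarray((tfidf.T @ tfidf).todense())      # term-by-term co-occurrence
    nmf = NMF(n_components=rank, init="nndsvd", max_iter=500)
    denoised = nmf.fit_transform(cooc) @ nmf.components_
    return normalize(denoised, norm="l1", axis=1)       # each row sums to one

def smooth_query(query_tfidf, similarity, alpha=0.5):
    """Keep part of the original weights, spread the rest onto related terms."""
    q = np.asarray(query_tfidf, dtype=float).ravel()
    return alpha * q + (1 - alpha) * q @ similarity
```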

The second step from a text query to a brain map is NeuroQuery’s text-to-brain encoding model. When analyzing the literature, we fit a linear regression to reliably map text onto brain activations. The intensity (across the peak density maps) of each voxel in the brain is regressed on the TFIDF descriptors of documents. This model is an additive one across the term occurrences, as opposed to logical operations traditionally used to select studies for meta-analysis. It results in higher predictive power (Section 'Word occurrence frequencies across the corpus').

One challenge is that TFIDF representations are sparse and high-dimensional. We use a reweighted ridge regression and feature selection procedure (described in Section 'reweighted ridge matrix and feature (vocabulary) selection') to prevent uninformative terms such as ‘magnetoencephalography’ from degrading performance. This procedure automatically selects around 200 keywords that display a strong statistical link with brain activity and adapts the regularization applied to each feature. Indeed, mapping too many terms (covariates) without appropriate regularization would degrade the regression performance due to multicollinearity.
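A simplified version of this fitting step, assuming the TFIDF matrix and per-article density maps from the sketches above, could look as follows; the actual model uses an adaptive, reweighted ridge and a more careful selection criterion, so the two-pass scheme below is only a crude stand-in.

```python
# Sketch of the text-to-brain regression with a crude stand-in for the
# feature-selection step (the real pipeline uses a reweighted ridge).
import numpy as np
from sklearn.linear_model import RidgeCV

def fit_encoder(tfidf, brain_maps, n_keep=200):
    """tfidf: (n_docs, n_terms) sparse matrix; brain_maps: (n_docs, n_voxels) array."""
    X = np.asarray(tfidf.todense())
    alphas = np.logspace(-1, 3, 9)
    # first pass: rank terms by the norm of their coefficient map
    screening = RidgeCV(alphas=alphas).fit(X, brain_maps)
    norms = np.linalg.norm(screening.coef_, axis=0)      # coef_: (n_voxels, n_terms)
    selected = np.sort(np.argsort(norms)[-n_keep:])
    # second pass: refit on the reduced vocabulary only
    ridge = RidgeCV(alphas=alphas).fit(X[:, selected], brain_maps)
    return ridge, selected
```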

To make a prediction, NeuroQuery combines semantic smoothing and linear regression of brain activations. To encode a new document or query, the text is expanded, or smoothed, by adding weight to related terms using the semantic similarity matrix. The resulting smoothed representation is projected onto the reduced vocabulary of selected keywords, then mapped onto the brain through the linear regression coefficients (Figure 3). The rank of this linear model is therefore the size of the restricted vocabulary that was found to be reliably mapped to the brain. Compared with other latent factor models, this 2-layer linear model is easily interpretable, as each dimension (both of the input and the latent space) is associated with a term from our vocabulary. In addition, NeuroQuery uses an estimate of the voxel-level variance of association (see methodological details in Section 'Mathematical details of the NeuroQuery statistical model'), and reports a map of Z statistics. Note that this variance represents an uncertainty around a prediction for a TFIDF representation of the concept of interest, which is treated as a fixed quantity. Therefore, the resulting map cannot be thresholded to reject any simple null hypothesis. NeuroQuery maps have a different meaning and different uses than standard meta-analysis maps obtained e.g. with ALE.
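Putting the pieces together, a prediction for a new query is a smoothed TFIDF vector projected on the selected terms and pushed through the regression coefficients. The sketch below reuses the hypothetical helpers defined in the previous sketches and assumes a precomputed per-voxel standard deviation for the final rescaling (the intercept is omitted for brevity).

```python
# Sketch of the prediction step (reuses smooth_query and the fitted encoder from
# the previous sketches; voxel_std is assumed to be estimated on training data).
import numpy as np

def predict_brain_map(query_text, vectorizer, similarity, ridge, selected, voxel_std):
    q = np.asarray(vectorizer.transform([query_text]).todense()).ravel()
    q = smooth_query(q, similarity)                  # semantic expansion of the query
    scores = ridge.coef_ @ q[selected]               # linear combination of term maps
    return scores / voxel_std                        # Z-like rescaling, not a test statistic
```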

Illustration: using NeuroQuery for post-hoc interpretation

After running a functional Magnetic Resonance Imaging (fMRI) experiment, it is common to compare the computed contrasts to what is known from the existing literature, and even use prior knowledge to assess whether some activations are not specific to the targeted mental process, but due to experimental artifacts such as the stimulus modality. It is also possible to introduce prior knowledge earlier in the study and choose a Region of Interest (ROI) before running the experiment. This is usually done based on the expertise of the researcher, which is hard to formalize and reproduce. With NeuroQuery, it is easy to capture the domain knowledge and perform these comparisons or ROI selections in a principled way.

As an example, consider again the contrast from the RSVP language task (Pinho et al., 2018; Humphries et al., 2006) in the Individual Brain Charting (IBC) dataset, shown in Figure 2. It is described as ‘Read pseudo-words vs. consonant strings’. We obtain a brain map from NeuroQuery by simply transforming the contrast description, without any manual intervention, and compare both maps by overlaying a contour of the NeuroQuery map on the actual IBC group contrast map. We can also obtain a meta-analytic map for the whole RSVP language task by analyzing the free-text task description with NeuroQuery (Section 'Example Meta-analysis results for the RSVP language task from the IBC dataset.').
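Programmatically, the same comparison can be reproduced with the open-source neuroquery package. The snippet below follows the interface documented in the package's README at the time of writing; the exact function names and result keys should be treated as assumptions if the package has since changed.

```python
# Querying NeuroQuery from Python with the contrast description, following the
# interface documented in the neuroquery package README.
from neuroquery import fetch_neuroquery_model, NeuroQueryModel
from nilearn.plotting import plot_stat_map

encoder = NeuroQueryModel.from_data_dir(fetch_neuroquery_model())
result = encoder("Read pseudo-words vs. consonant strings")
plot_stat_map(result["brain_map"], threshold=3.0, title="NeuroQuery prediction")
print(result["similar_words"].head())   # terms the model associated with the query
```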

NeuroQuery can map new combinations of concepts

To study the predictions of NeuroQuery, we first demonstrate that it can indeed give good brain maps for combinations of terms that have never been studied together. For this, we leave out from our corpus of studies all the publications that simultaneously mention two given terms, we fit a NeuroQuery model on the resulting reduced corpus, and evaluate its predictions on the left-out publications, which did actually report these terms together. Figure 4 shows an example of such an experiment: excluding publications that mention simultaneously ‘distance’ and ‘color’. The figure compares a simple meta-analysis of the combination of these two terms – contrasting the left-out studies with the remaining ones – with the predictions of the model fitted excluding studies that include the term conjunction. Qualitatively, the predicted maps comprise all the brain structures visible in the simultaneous studies of ‘distance’ and ‘color’: on the one hand, the intra-parietal sulci, the frontal eye fields, and the anterior cingulate/anterior insula network associated with distance perception, and on the other hand, the additional mid-level visual region around the approximate location of V4 associated with color perception. The extrapolation from two terms for which the model has seen studies, ‘distance’ and ‘color’, to their combination, for which the model has no data, is possible thanks to the linear additive model, combining regression maps for ‘distance’ and ‘color’.

Figure 4. Mapping an unseen combination of terms.


Left: The difference in the spatial distribution of findings reported in studies that contain both ‘distance’ and ‘color’ (n=687), and the rest of the studies. – Right: Predictions of a NeuroQuery model fitted on the studies that do not simultaneously contain both terms ‘distance’ and ‘color’.

To verify that the good generalization to unseen pairs of terms is not limited to the above pair, we apply the quantitative experiments of prediction quality (introduced later, in Section 'Quantitative evaluation: NeuroQuery is an accurate model of the literature') to 1 000 randomly-chosen pairs. We find that measures of how well predictions match the literature decrease only slightly for studies with terms never seen jointly compared to studies with terms already seen together (details in Section 'NeuroQuery performance on unseen pairs of terms'). Finally, we gauge the quality of the maps with a quantitative experiment mirroring the qualitative evaluation of Figure 4: for each of the 1 000 pairs of terms, we compute the Pearson correlation of the predicted map for the unseen combination of terms with the meta-analytic map obtained on the left-out studies. We find a median correlation of 0.85, which shows that the excellent performance observed in Figure 4 is not due to a specific choice of terms.
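The per-pair score in this last experiment is simply a Pearson correlation between the two flattened maps, as in the following sketch (the array names are ours, not from the original code).

```python
# Pearson correlation between a predicted map and the meta-analytic map computed
# on the left-out studies (both given as arrays on the same voxel grid).
import numpy as np

def pair_score(predicted_map, left_out_meta_map):
    return np.corrcoef(predicted_map.ravel(), left_out_meta_map.ravel())[0, 1]
```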

NeuroQuery can map rare or difficult concepts

We now compare the NeuroQuery model to existing automated meta-analysis methods, investigate how it handles terms that are challenging for the current state of the art, and quantitatively evaluate its performance. We compare NeuroQuery with NeuroSynth (Yarkoni et al., 2011), the best-known automated meta-analytic tool, and with Generalized Correspondence Latent Dirichlet Allocation (GCLDA) (Rubin et al., 2017). GCLDA is an important baseline because it is the only multivariate meta-analytic model to date. However, it produces maps with a low spatial resolution because it models brain activations as a mixture of Gaussians. Moreover, it takes several days to train and a dozen seconds to produce a map at test time, and is thus unsuitable for building an online, responsive tool like NeuroSynth or NeuroQuery.

By combining term similarities and an additive encoding model, NeuroQuery can accurately map rare or difficult terms for which standard meta-analysis lacks statistical power, as visible on Figure 5.

Figure 5. Examples of maps obtained for a given term, compared across different large-scale meta-analysis frameworks.


‘GCLDA’ has low spatial resolution and produces inaccurate maps for many terms. For relatively straightforward terms like ‘psts’ (posterior superior temporal sulcus), NeuroSynth and NeuroQuery give consistent results. For terms that are rarer or more difficult to map, like ‘dyslexia’, only NeuroQuery generates usable brain maps.

Quantitatively comparing methods on very rare terms is difficult for lack of ground truth. We therefore conduct meta-analyses on subsampled corpora, in which some terms are made artificially rare, and use the maps obtained from the full corpus as a reference. We choose a set of frequent and well-mapped terms, such as ‘language’, for which NeuroQuery and NeuroSynth (trained on a full corpus) give consistent results. For each of those terms, we construct a series of corpora in which the word becomes more and more rare: starting from the full corpus, we randomly erase the word from documents until it occurs in at most 2^13 = 8192 articles, then 2^12 = 4096, and so on. For many terms, NeuroQuery only needs a dozen examples to produce maps that are qualitatively and quantitatively close to the maps it obtains for the full corpus – and to NeuroSynth’s full-corpus maps. NeuroSynth typically needs hundreds of examples to obtain similar results, as seen in Figure 6. Document frequencies roughly follow a power law (Piantadosi, 2014), meaning that most words are very rare – half the terms in our vocabulary occur in fewer than 76 articles (see Section 'Word occurrence frequencies across the corpus'). Reducing the number of studies required to map a term well (a.k.a. the sample complexity of the meta-analysis model) therefore greatly widens the vocabulary that can be studied by meta-analysis.
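The subsampling itself can be emulated by zeroing the term's column in randomly chosen documents, as in this sketch; in the actual experiment the word is erased from the text before the features are recomputed, so this column-zeroing version is only an approximation.

```python
# Sketch of making a term artificially rare: keep it in at most n_keep documents.
import numpy as np

def make_rare(tfidf, term_index, n_keep, seed=0):
    """tfidf: sparse (n_docs, n_terms) matrix; returns a copy with the term erased
    from randomly chosen documents so that at most n_keep documents retain it."""
    rng = np.random.default_rng(seed)
    X = tfidf.tolil(copy=True)
    docs = np.flatnonzero(X[:, term_index].toarray().ravel())
    if len(docs) > n_keep:
        for doc in rng.choice(docs, size=len(docs) - n_keep, replace=False):
            X[doc, term_index] = 0
    return X.tocsr()
```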

Figure 6. Learning good maps from few studies.


Left: maps obtained from subsampled corpora, in which the encoded word appears in 16 and 128 documents, and from the full corpus. NeuroQuery needs fewer examples to learn a sensible brain map. NeuroSynth maps correspond to NeuroSynth’s Z scores for the ‘association test’ from neurosynth.org. NeuroSynth’s ‘posterior probability’ maps for these terms for the full corpus are shown in Figure 19. Each tool is trained on its own dataset, which is why the full-corpus occurrence counts differ. Right: convergence of maps toward their value for the full corpus, as the number of occurrences increases. Averaged over 13 words: ‘language’, ‘auditory’, ‘emotional’, ‘hand’, ‘face’, ‘default mode’, ‘putamen’, ‘hippocampus’, ‘reward’, ‘spatial’, ‘amygdala’, ‘sentence’, ‘memory’. On average, NeuroQuery is closer to the full-corpus map. This confirms quantitatively what we observe for the two examples ‘language’ and ‘reward’ on the left. Note that here convergence is only measured with respect to the model’s own behavior on the full corpus, hence a high value does not necessarily indicate good face validity of the maps with respect to neuroscience knowledge. The solid line represents the mean across the 13 words and the error bands represent a 95% confidence interval based on 1 000 bootstrap repetitions.

Capturing relations between terms is important because the literature does not use a perfectly consistent terminology. The standard solution is to use expert-built ontologies (Poldrack and Yarkoni, 2016), but these tend to have low coverage. For example, the controlled vocabularies that we use display relatively small intersections, as can be seen in Table 1. In addition, ontologies are typically even more incomplete in listing relations across terms. Rather than ontologies, NeuroQuery relies on distributional semantics and co-occurrence statistics across the literature to estimate relatedness between terms. These continuous semantic links provide robustness to inconsistent terminology, yielding consistent meta-analytic maps for similar terms. For instance, ‘calculation’, ‘computation’, ‘arithmetic’, and ‘addition’ are all related terms that are associated with similar maps by NeuroQuery. By contrast, standard automated meta-analysis frameworks map these terms in isolation, and thus suffer from a lack of statistical power and produce empty, or nearly empty, maps for some of these terms (see Figure 1).

NeuroQuery improves mapping not only for rare terms that are variants of widely studied concepts, but also for some rarely studied concepts, such as ‘color’ or ‘Huntington’ (Figure 5). The main reason is the semantic smoothing described in Section 'Overview of the NeuroQuery model'. Another reason is that working with the full text of publications associates many more studies with a query: 2779 for ‘color’, while NeuroSynth matches only 236 abstracts, and 147 for ‘huntington’, a term not known to NeuroSynth. Full-text matching, however, requires weighting studies unequally, to avoid giving too much weight to studies only weakly related to the query. These weights are computed by the supervised-learning ridge regression: in its dual formulation, ridge regression can be seen as giving weights to training samples (Bishop, 2006, sec 6.1).

Quantitative evaluation: NeuroQuery is an accurate model of the literature

Unlike standard meta-analysis methods, which compute in-sample summary statistics, NeuroQuery is a predictive model that can produce brain maps for out-of-sample neuroimaging studies. This enables us to quantitatively assess its generalization performance. Here we check that NeuroQuery captures reliable links from concepts to brain activity – associations that generalize to new, unseen neuroimaging studies. We do this with 16-fold shuffle-split cross-validation. After fitting a NeuroQuery model on 90% of the corpus, for each document in the left-out test set (around 1 300), we encode it, normalize the predicted brain map to coerce it into a probability density, and compute the average log-likelihood of the coordinates reported in the article with respect to this density. The procedure is then repeated 16 times and results are presented in Figure 7. We also perform this procedure with NeuroSynth and GCLDA. NeuroSynth does not perform well on this test. Indeed, the NeuroSynth model is designed for single-phrase meta-analysis, and does not have a mechanism to combine words and encode a full document. Moreover, it is a tool for in-sample statistical inference, which is not well suited for out-of-sample prediction. GCLDA performs significantly better than chance, but still worse than a simple ridge regression baseline. This can be explained by the unrealistic modelling of brain activations as a mixture of a small number of Gaussians, which results in low spatial resolution, and by the difficulty of performing posterior inference for GCLDA. Another metric, introduced in Mitchell et al. (2008) for encoding models, tests the ability of the meta-analytic model to match the text of a left-out study with its brain map. For each article in the test set, we randomly draw another one and check whether the predicted map is closer to the correct map (containing peaks at each reported location) or to the random negative example. More than 72% of the time, NeuroQuery’s output has a higher Pearson correlation with the correct map than with the negative example (see Figure 7, right).
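The two scores can be summarized as follows; this is a schematic re-implementation with our own array conventions, not the exact evaluation code.

```python
# Sketch of the two evaluation metrics used above.
import numpy as np

def coords_log_likelihood(pred_map, peak_voxel_indices):
    """Mean log-likelihood of reported peaks under the prediction, normalized to a density."""
    density = np.clip(pred_map.ravel(), 0, None)
    density = density / density.sum()
    return np.log(density[peak_voxel_indices] + 1e-12).mean()

def matches_own_study(pred_map, true_map, other_map):
    """Mitchell-style pairwise test: is the prediction more correlated with its own study?"""
    corr = lambda a, b: np.corrcoef(a.ravel(), b.ravel())[0, 1]
    return corr(pred_map, true_map) > corr(pred_map, other_map)
```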

Figure 7. Explaining coordinates reported in unseen studies.


Left: log-likelihood for coordinates reported in test articles, relative to the log-likelihood of a naive baseline that predicts the average density of the training set. NeuroQuery outperforms GCLDA, NeuroSynth, and a ridge regression baseline. Note that NeuroSynth is not designed to make encoding predictions for full documents, which is why it does not perform well on this task. – Right: how often the predicted map is closer to the true coordinates than to the coordinates for another article in the test set (Mitchell et al., 2008). The boxes represent the first, second and third quartiles of scores across 16 cross-validation folds. Whiskers represent the rest of the distribution, except for outliers, defined as points beyond 1.5 times the IQR past the low and high quartiles, and represented with diamond fliers.

NeuroQuery maps are close to reference meta-analytic maps and atlases

The above experiments quantify how well NeuroQuery captures the information in the literature, by comparing predictions to reported coordinates. However, the scores are difficult to interpret, as peak coordinates reported in the literature are noisy and incomplete with respect to the full activation maps. We also want to quantify the quality of the brain maps generated by NeuroQuery, extending the visual comparisons of Figure 5. For this purpose, we compare NeuroQuery predictions to a few reliable references.

First, we use a set of diverse and curated Image-Based Meta-Analysis (IBMA) maps available publicly (Varoquaux et al., 2018). This collection contains 19 IBMA brain maps, labelled with cognitive concepts such as ‘visual words’. For each of these labels, we obtain a prediction from NeuroQuery and compare it to the corresponding IBMA map. The IBMA maps are thresholded. We evaluate whether thresholding the NeuroQuery predicted maps can recover the above-threshold voxels in the IBMA, quantifying false detections and misses for all thresholds with the Area Under the Receiver Operating Characteristic (ROC) Curve (Fawcett, 2006). NeuroQuery predictions match the IBMA results well, with a median Area Under the Curve (AUC) of 0.80. Such results cannot be directly obtained with NeuroSynth, as many labels are missing from NeuroSynth’s vocabulary. Manually reformulating the labels to terms from NeuroSynth’s vocabulary gives a median AUC of 0.83 for NeuroSynth, and also raises the AUC to 0.88 for NeuroQuery (details in Section 'Comparison with the BrainPedia IBMA study' and Figure 13).
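The AUC computation reduces to treating the thresholded reference map as binary labels and the NeuroQuery prediction as scores; the same sketch applies to the atlas and NeuroSynth comparisons below (array names are ours).

```python
# Sketch of the AUC comparison between a thresholded reference map and a prediction.
import numpy as np
from sklearn.metrics import roc_auc_score

def map_auc(thresholded_reference, predicted_map):
    labels = (np.abs(thresholded_reference.ravel()) > 0).astype(int)  # above-threshold voxels
    return roc_auc_score(labels, predicted_map.ravel())
```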

We also perform a similar experiment for anatomical terms, relying on the Harvard-Oxford structural atlases (Desikan et al., 2006). Both NeuroSynth and NeuroQuery produce maps that are close to the atlases’ manually segmented regions, with a median AUC of 0.98 for NeuroQuery and 0.95 for NeuroSynth, for the region labels that are present in NeuroSynth’s vocabulary. Details are provided in Section 'Comparison with Harvard-Oxford anatomical atlas' and Figure 14.

For frequent-enough terms, we consider NeuroSynth as a reference. Indeed, while the goal of NeuroSynth is to perform a voxel-level test of independence, and not to predict an activation distribution like NeuroQuery, in most cases NeuroQuery should predict few observations where the test statistic is small. We threshold NeuroSynth maps by controlling the False Discovery Rate (FDR) at 1% and select the 200 maps with the largest number of activations. We compare NeuroQuery predictions to NeuroSynth activations by computing the AUC. NeuroQuery and NeuroSynth maps for these well-captured terms are very similar, with a median AUC of 0.90. Details are provided in Section 'Comparison with NeuroSynth on terms with strong activations' and Figure 15.

NeuroQuery is an openly available resource

NeuroQuery can easily be used online: https://neuroquery.org. Users can enter free text in a search box (rather than select a single term from a list as is the case with existing tools) and discover which terms, neuroimaging publications, and brain regions are related to their query. NeuroQuery is also available as an open-source Python package that can be easily installed on all platforms: https://github.com/neuroquery/neuroquery (copy archived at https://github.com/elifesciences-publications/neuroquery). This will enable advanced users to run extensive meta-analyses with NeuroQuery, integrate it in other applications, and extend it. The package allows training new NeuroQuery models as well as downloading and using a pre-trained model. Finally, all the resources used to build NeuroQuery are freely available at https://github.com/neuroquery/neuroquery_data (copy archived at https://github.com/elifesciences-publications/neuroquery_data). This repository contains (i) the data used to train the model: vocabulary list and document frequencies, word counts (TFIDF features), and peak activation coordinates for our whole corpus of 13 459 publications, and (ii) the semantic-smoothing matrix, which encodes relations across the terminology. The corpus is significantly richer than NeuroSynth’s, the largest comparable corpus to date (see Table 3 for a comparison), and manual quality assurance reveals more accurate extraction of brain coordinates (Table 2).

Discussion

NeuroQuery makes it easy to perform meta-analyses of arbitrary questions on the human neuroscience literature: it works from a full-text description of the question and of the studies, and it provides an online query interface with a rich database of studies. For this, it departs from existing meta-analytic frameworks by treating meta-analysis as a prediction problem. It describes neuroscience concepts of interest by continuous combinations of terms rather than matching publications for exact terms. As it combines multiple terms and interpolates between available studies, it extends the scope of meta-analysis in neuroimaging. In particular, it can capture information for concepts studied much less frequently than those that are covered by current automated meta-analytic approaches.

Related work

A variety of prior works have paved the way for NeuroQuery. BrainMap (Laird et al., 2005) was the first systematic database of brain coordinates. NeuroSynth (Yarkoni et al., 2011) pioneered automated meta-analysis using abstracts from the literature, greatly broadening the set of terms for which the consistency of reported locations can be tested. These works perform classic meta-analysis, which considers terms in isolation, unlike NeuroQuery. Topic models have also been used to find relationships across terms used in meta-analysis. Nielsen et al. (2004) used a non-negative matrix factorization on the matrix of occurrences of terms for each brain location (voxel): their model outputs a set of seven spatial networks associated with cognitive topics, described as weighted combinations of terms. Poldrack et al. (2012) used topic models on the full text of 5800 publications to extract from term co-occurrences 130 topics on mental function and disorders, followed by a classic meta-analysis to map their neural correlates in the literature. These topic-modeling works produce a reduced number of cognitive latent factors – or topics – mapped to the brain, unlike NeuroQuery, which strives to map individual terms and uses their co-occurrences in publications only to infer the semantic links. From a modeling perspective, the important difference of NeuroQuery is supervised learning, used as an encoding model (Naselaris et al., 2011). In this sense, the supervised learning used in NeuroQuery differs from that used in Yarkoni et al. (2011): the latter is a decoding model that, given brain locations in a study, predicts the likelihood of neuroscience terms without using relationships between terms. Unlike prior approaches, the maps of NeuroQuery are predictions of its statistical model, as opposed to model parameters. Finally, other works have modelled co-activations and interactions between brain locations (Kang et al., 2011; Wager et al., 2015; Xue et al., 2014). We do not explore this possibility here, and except for the density estimation NeuroQuery treats voxels independently.

Usage recommendations and limitations

We have thoroughly validated that NeuroQuery gives quantitatively and qualitatively good results that summarize well the literature. Yet, the tool has strengths and weaknesses that should inform its usage. Brain maps produced by NeuroQuery are predictions, and a specific prediction may be wrong although the tool performs well on average. A NeuroQuery prediction by itself therefore does not support definite conclusions as it does not come with a statistical test. Rather, NeuroQuery will be most successfully used to produce hypotheses and as an exploratory tool, to be confronted with other sources of evidence. To prepare a new functional neuroimaging study, NeuroQuery helps to formulate hypotheses, defining ROIs or other formal priors (for Bayesian analyses). To interpret results of a neuroimaging experiment, NeuroQuery can readily use the description of the experiment to assemble maps from the literature, which can be compared against, or updated using, experimental findings. As an exploratory tool, extracting patterns from published neuroimaging findings can help conjecture relationships across mental processes as well as their neural correlates (Yeo et al., 2015). NeuroQuery can also facilitate literature reviews: given a query, it uses its semantic model to list related studies and their reported activations. What NeuroQuery does not do is provide conclusive evidence that a brain region is recruited by a mental process or affected by a pathology. Compared to traditional meta-analysis tools, NeuroQuery is particularly beneficial (i) when the term of interest is rare, (ii) when the concept of interest is best described by a combination of multiple terms, and (iii) when a fully automated method is necessary and queries would otherwise need cumbersome manual curation to be understood by other tools.

Understanding the components of NeuroQuery helps interpreting its results. We now describe in detail potential failures of the tool, and how to detect them. NeuroQuery builds predictions by combining brain maps, each associated with a keyword related to the query. A first step to interpret results is to inspect this list of keywords, displayed by the online tool. These keywords are selected based on their semantic relation to the query, and as such will usually be relevant. However, in rare cases, they may build upon undesirable associations. For example, ‘agnosia’ is linked to ‘visual’, ‘fusiform’, ‘word’ and ‘object’, because visual agnosia is the type of agnosia most studied in the literature, even though ‘agnosia’ is a much more general concept. In this specific case, the indirect association is problematic because ‘agnosia’ is not a selected term that NeuroQuery can map by itself, as it is not well represented in the source data. As a result, the NeuroQuery prediction for ‘agnosia’ is driven by indirect associations, and focuses on the visual system, rather than areas related to, for example, auditory agnosia. By contrast, ‘aphasia’ is an example of a term that is well mapped, building on maps for ‘speech’ and ‘language’, terms that are semantically close to aphasia and well captured in the literature.

A second consideration is that, in some extreme cases, the semantic smoothing fails to produce meaningful results. This happens when a term has no closely related terms that correlate well with brain activity. For instance, ‘ADHD’ is very similar to ‘attention deficit hyperactivity disorder’, ‘hyperactivity’, ‘inattention’, but none of these terms is selected as a feature mapped in itself, because their link with brain activity is relatively loose. Hence, for ‘ADHD’, the model builds its prediction on terms that are distant from the query, and produces a misleading map that highlights mostly the cerebellum (https://neuroquery.org/query?text=adhd). While this result is not satisfying, the failure is detected by the NeuroQuery interface and reported with a warning stating that results may not be reliable. To a user with general knowledge in psychology, the failure can also be seen by inspecting the associated terms, as displayed in the user interface.

A third source of potential failure stems from NeuroQuery’s model of additive combination. This model is not unique to NeuroQuery, and lies at the heart of functional neuroimaging, which builds upon the hypothesis of pure insertion of cognitive processes (Ulrich et al., 1999; Poldrack, 2010). An inevitable consequence is that, in some cases, a group of words will not be well mapped by its constituents. For example, ‘visual sentence comprehension’ is decomposed into two constituents known to NeuroQuery: ‘visual’ and ‘sentence comprehension’. Unfortunately, the map corresponding to the combination is then dominated by the primary visual cortex, given that it leads to very powerful activations in fMRI. Note that ‘visual word comprehension’, a slightly more common subject of interest, is decomposed into ‘visual word’ and ‘comprehension’, which leads to a more plausible map, with strong loadings in the visual word form area.

A careful user can check that each constituent of a query is associated with a plausible map, and that they are well combined. The NeuroQuery interface makes it possible to gauge the quality of the mapping of each individual term by presenting the corresponding brain map as well as the number of associated studies. The final combination can be understood by inspecting the weights of the combination as well as comparing the final combined map with the maps for individual terms. Such an inspection can, for instance, reveal that, as mentioned above, ‘visual’ dominates ‘sentence comprehension’ when mapping ‘visual sentence comprehension’.

We have attempted to provide a comprehensive overview of the main pitfalls users are likely to encounter when using NeuroQuery, but we hasten to emphasize that all of these pitfalls are infrequent. NeuroQuery produces reliable maps for the typical queries, as quantified by our experiments.

General considerations on meta-analyses

When using NeuroQuery to foster scientific progress, it is useful to keep in mind that meta-analyses are not a silver bullet. First, meta-analyses have little or no ability to correct biases present in the primary literature (e.g., perhaps confirmation bias drives researchers to overreport amygdala activation in emotion studies). Beyond increased statistical power, one promise of meta-analysis is to afford a wider perspective on results – in particular, by comparing brain structures detected across many different conditions. However, claims that a structure is selective to a mental condition need an explicit statistical model of reverse inference (Wager et al., 2016). Gathering such evidence is challenging: selectivity means that changes at the given brain location specifically imply a mental condition, while brain imaging experiments most often do not manipulate the brain itself, but rather the experimental conditions it is placed in (Poldrack, 2006). In a meta-analysis, the most important confound for reverse inferences is that some brain locations are reported for many different conditions. NeuroQuery accounts for this varying baseline across the brain by fitting an intercept and reporting only differences from the baseline. While helpful, this is not a formal statistical test of reverse inference. For example, the NeuroQuery map for ‘interoception’ highlights the insula, because studies that mention ‘interoception’ tend to mention and report coordinates in the insula. This, of course, does not mean that interoception is the only function of the insula. Another fundamental challenge of meta-analyses in psychology is the decomposition of tasks into mental processes: the descriptions of the dimensions of the experimental paradigms are likely imperfect and incomplete. Indeed, even for a task as simple as finger tapping, minor variations in task design lead to reproducible variations in neural responses (Witt et al., 2008). However, quantitatively describing all aspects of all tasks and cognitive strategies is presently impossible, as it would require a universally-accepted, all-encompassing psychological ontology. Rather, NeuroQuery grounds meta-analysis in the full-text descriptions of the studies, which in our view provide the best available proxy for such an idealized ontology.

Conclusion

NeuroQuery stems from a desire to compile results across studies and laboratories, an essential endeavor for the progress of human brain mapping (Yarkoni et al., 2010). Mental processes are difficult to isolate and findings of individual studies may not generalize. Thus, tools are needed to denoise and summarize knowledge accumulated across a large number of studies. Such tools must be usable in practice and match the needs of researchers who exploit them to study human brain function and disorders. NeuroSynth took a huge step in this direction by enabling anyone to perform, in a few seconds, a fully automated meta-analysis across thousands of studies, for an important number of isolated terms. Still, users are faced with the difficult task of mapping their question to a single term from the NeuroSynth vocabulary, which cannot always be done in a meaningful way. If the selected term is not popular enough, the resulting map also risks being unusable for lack of statistical power. NeuroQuery provides statistical maps for arbitrary queries – from seldom-studied terms to free-text descriptions of experimental protocols. Thus, it enables applying fully-automated and quantitative meta-analysis in situations where only semi-manual and subjective solutions were available. It therefore brings an important advancement towards grounding neuroscience on quantitative knowledge representations.

Materials and methods

We now present the methodological details: first the constitution of the NeuroQuery data, then the statistical model, the validation experiments in detail, and the word-occurrence statistics in the corpus of studies.

Building the NeuroQuery training data

A new dataset

The dataset collected by NeuroSynth (Yarkoni et al., 2011) is openly available (https://github.com/neurosynth/neurosynth-data; copy archived at https://github.com/elifesciences-publications/neurosynth-data). As of July 2019, NeuroSynth contains 448 255 unique locations for 14 371 studies. It also contains term frequencies, computed from the abstracts of the studies, for 3228 terms (1335 of which are actually used in the NeuroSynth online tool, http://neurosynth.org). However, it only contains term frequencies for the abstracts, and not the articles themselves. This results in a shallow description of the studies, based on a very short text (around 20 times smaller than the full article). As a result, many important terms are very rare: they seldom occur in abstracts, and can be associated with very few studies. For example, in our corpus of 13 459 studies, ‘huntington disease’ occurs in 32 abstracts, and ‘prosopagnosia’ in 25. For such terms, meta-analysis lacks statistical power. When the full text is available, many more term occurrences – associations between a term and a study – are observed (Figure 16). This means that more information is available, terms are better described by their set of associated studies, and meta-analyses have more statistical power. Moreover, as publications cannot always be redistributed for copyright reasons, NeuroSynth (and any dataset of this nature) can only provide term frequencies for a fixed vocabulary, and not the text they were extracted from. We therefore decided to collect a new corpus of neuroimaging studies, which contains the full text. We also created a new peak activation coordinate extraction system, which achieved higher precision and recall than NeuroSynth’s on a small sample of manually annotated studies.

Journal articles in a uniform and validated format

We downloaded around 149 000 full-text journal articles related to neuroimaging from the PubMed Central (https://www.ncbi.nlm.nih.gov/pmc/, https://www.ncbi.nlm.nih.gov/books/NBK25501/) (Sayers, 2009) and Elsevier (https://dev.elsevier.com/api_docs.html) APIs. We focus on these sources of data because they provide many articles in a structured format. It should be noted that this could result in a selection bias, as some scientific journals – mostly paid journals – are not available through these channels. The articles are selected by querying the ESearch Entrez utility (Sayers, 2009) either for specific neuroimaging journals or with query strings such as ‘fMRI’. The resulting studies are mostly based on fMRI experiments, but the dataset also contains Positron Emission Tomography (PET) and structural Magnetic Resonance Imaging (MRI) studies. It contains studies on diverse types of populations: healthy adults, patients, the elderly, and children.

We use eXtensible Stylesheet Language Transformations (XSLT) to convert all articles to the Journal Article Tag Suite (JATS) Archiving and Interchange XML language (https://jats.nlm.nih.gov/archiving/) and validate the result using the W3C XML Schema (XSD) schemas provided on the JATS website. From the resulting XML documents, it is straightforward to extract the title, keywords, abstract, and the relevant parts of the article body, discarding the parts which would add noise to our data (such as the acknowledgements or references).

Coordinate extraction

We extract tables from the downloaded articles and convert them to the XHTML 1.1 table model (the JATS also allows using the OASIS CALS table model). We use stylesheets provided by docbook (https://docbook.org/tools/) to convert from CALS to XHTML. Cells in tables can span several rows and columns. When extracting a table, we normalize it by splitting cells that span several rows or columns and duplicating these cells’ content; the normalized table thus has the shape of a matrix. Finally, all Unicode characters that can be used to represent ‘+’ or ‘-’ signs (such as U+2212 MINUS SIGN) are mapped to their ASCII equivalents, ‘+’ (PLUS SIGN) or ‘-’ (HYPHEN-MINUS). Once tables are isolated, in XHTML format, and their rows and columns are well aligned, the last step is to find and extract peak activation coordinates. Heuristics find columns containing either single coordinates or triplets of coordinates based on their header and the cells’ content. A heuristic detects when the coordinates extracted from a table are probably not stereotactic peak activation coordinates, either because many of them lie outside a standard brain mask, or because the group of coordinates as a whole fits a normal distribution too well. In such cases the whole table is discarded. Out of the 149 000 downloaded and formatted articles, 13 459 contain coordinates that could be extracted by this process, resulting in a total of 418 772 locations.
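
For illustration, the following sketch shows the kind of per-table sanity check described above. The MNI bounding box, the tolerated fraction of out-of-brain rows, and the use of a Shapiro-Wilk test for the ‘fits a normal distribution too well’ criterion are assumptions made for this example, not the exact rules of the NeuroQuery extraction pipeline.

```python
# Sketch of a table-level sanity check for candidate peak coordinates.
# The bounding box, the 30% outlier tolerance, and the Shapiro-Wilk test
# are illustrative assumptions, not NeuroQuery's exact heuristics.
import numpy as np
from scipy import stats

MNI_BOUNDS = np.array([[-90.0, 90.0],    # x
                       [-126.0, 90.0],   # y
                       [-72.0, 108.0]])  # z


def looks_like_peak_coordinates(coords, max_outside=0.3, normality_p=0.2):
    """Return True if an (n, 3) array of triplets plausibly holds MNI peaks."""
    coords = np.asarray(coords, dtype=float)
    if coords.ndim != 2 or coords.shape[1] != 3 or len(coords) == 0:
        return False
    # Reject tables whose values mostly fall outside a standard brain box.
    inside = np.all((coords >= MNI_BOUNDS[:, 0]) & (coords <= MNI_BOUNDS[:, 1]),
                    axis=1)
    if inside.mean() < 1.0 - max_outside:
        return False
    # Reject tables whose values fit a normal distribution "too well"
    # (e.g. statistics or demographics columns rather than peak locations).
    if len(coords) >= 8:
        p_values = [stats.shapiro(coords[:, axis]).pvalue for axis in range(3)]
        if min(p_values) > normality_p:
            return False
    return True
```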

All the extracted coordinates are treated as coordinates in the Montreal Neurological Institute (MNI) space, even though some articles still refer to the Talairach space. The precision of extracted coordinates could be improved by detecting which reference is used and transforming Talairach coordinates to MNI coordinates. However, differences between the two coordinate systems are at most on the order of 1 cm, and much smaller in most of the brain. This is comparable to the size of the Gaussian kernel used to smooth images. Moreover, the alignment of brain images does not only depend on the template used but also on the registration method, and there is no perfect transformation from Talairach to MNI space (Lancaster et al., 2007). Therefore, treating all coordinates uniformly is acceptable as a first approximation, but better handling of Talairach coordinates is a clear direction for improving the NeuroQuery dataset.

Coordinate extraction evaluation

To evaluate the coordinate extraction process, we focused on articles that are present in both NeuroSynth’s dataset and NeuroQuery’s, and for which the two coordinate extraction systems disagree. Out of 8692 articles in the intersection of both corpora, the extracted coordinates differ (for at least one coordinate) in 1961 (i.e. in 23% of articles). We selected the first 40 articles (sorted by PubMed ID) and manually evaluated the extracted coordinates. As shown in Table 2, our method extracted false coordinates from fewer articles: 3/40 articles have at least one false location in our dataset, against 20 for NeuroSynth. While these numbers may seem high, note that errors are far less likely to occur in articles for which both methods extract exactly the same locations.

Table 2. Number of extracted coordinate sets that contain at least one error of each type, out of 40 manually annotated articles.

The articles are chosen from those on which NeuroSynth and NeuroQuery disagree – the ones most likely to contain errors.

             False positives   False negatives
NeuroSynth   20                28
NeuroQuery   3                 8

Density maps

For each article, the coordinates from all tables are pooled, resulting in a set of peak activation coordinates. We then use Gaussian Kernel Density Estimation (KDE) (Silverman, 1986; Scott, 2015) to estimate the density of these activations over the brain. The chosen bandwidth of the Gaussian kernel yields a Full Width at Half Maximum (FWHM) close to 9 mm, which is in the range of smoothing kernels that are typically used for fMRI meta-analysis (Wager et al., 2007; Wager et al., 2004; Turkeltaub et al., 2002). For comparison, NeuroSynth uses a hard ball of 10 mm radius.

One benefit of focusing on the density of peak coordinates (which is normalized to sum to one) is that it does not depend on the number of contrasts presented in an article, nor on other analytic choices that cause the number of reported coordinates to vary widely, ranging from less than a dozen to several hundred.
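
As an illustration, here is a minimal numpy sketch of such a density map, assuming a 4 mm grid over the MNI bounding box and converting the 9 mm FWHM to a kernel standard deviation through the usual relation FWHM = 2·sqrt(2 ln 2)·sigma; the grid step and bounding box are illustrative choices, not NeuroQuery’s exact settings.

```python
# Sketch of a per-article density map: a Gaussian kernel density estimate of
# the pooled peak coordinates, evaluated on a voxel grid and normalized.
import numpy as np

FWHM_MM = 9.0
SIGMA_MM = FWHM_MM / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # ~3.8 mm


def article_density_map(peaks_mm, grid_step=4.0):
    """peaks_mm: (n_peaks, 3) MNI coordinates pooled from one article."""
    peaks_mm = np.asarray(peaks_mm, dtype=float)
    xs = np.arange(-90, 91, grid_step)
    ys = np.arange(-126, 91, grid_step)
    zs = np.arange(-72, 109, grid_step)
    grid = np.stack(np.meshgrid(xs, ys, zs, indexing="ij"), axis=-1)
    voxels = grid.reshape(-1, 3)
    # Sum of isotropic Gaussian kernels centered on the reported peaks.
    # (Simple rather than memory-efficient: fine for a sketch.)
    sq_dist = ((voxels[:, None, :] - peaks_mm[None, :, :]) ** 2).sum(axis=-1)
    density = np.exp(-0.5 * sq_dist / SIGMA_MM ** 2).sum(axis=1)
    density /= density.sum()  # normalize so the map sums to one
    return density.reshape(grid.shape[:3])
```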

Vocabulary and TFIDF features

We represent the text of our articles by TFIDF features (Salton and Buckley, 1988). These simple representations are popular in document retrieval and text classification because they are efficient for many applications. They contain the (reweighted) frequencies of many terms in the text, discarding the order in which words appear. An important choice when building TFIDF vectors is the vocabulary: the words or expressions whose frequencies are measured. It is common to use all words encountered in the training corpus, possibly discarding those that are too frequent or too rare. The vocabulary is often enriched with ‘n-grams’, or collocations: groups of words that often appear in the same sequence, such as ‘European Union’ or ‘default mode network’. Each collocation is assigned a dimension of the TFIDF representation and counted as if it were a single token. There are several strategies to discover such collocations in a training corpus (Mikolov et al., 2013; Bouma, 2009).

We do not extract the vocabulary and collocations from the training corpus, but instead rely on existing, manually-curated vocabularies and ontologies of neuroscience. This ensures that we only consider terms that are relevant to brain function, anatomy or disorders, and that we only use meaningful collocations. Moreover, it helps to reduce the dimensionality of the TFIDF representations. Our vocabulary comprises five important lexicons of neuroscience, based on community efforts: the subset of Medical Subject Headings (MeSH) (https://www.ncbi.nlm.nih.gov/mesh) dedicated to neuroscience and psychology, detailed in Section 'The choice of vocabulary' (MeSH are the terms used by PubMed to index articles), Cognitive Atlas (http://www.cognitiveatlas.org/), NeuroNames (http://braininfo.rprc.washington.edu/NeuroNames.xml) and NIF (https://neuinfo.org/). We also include all the terms and bigrams used by NeuroSynth (http://neurosynth.org). We discard all the terms and expressions that occur in less than 5/10 000 articles. The resulting vocabulary contains 7547 terms and expressions related to neuroscience.

Summary of collected data

The data collection described in this section provides us with important resources: (i) Over 149K full-text journal articles related to neuroscience – 13.5K of which contain peak activation coordinates – all translated into the same structured format and validated. (ii) Over 418K peak activation coordinates for more than 13.5K articles. (iii) A vocabulary of 7547 terms related to neuroscience, each occurring in at least six articles from which we extracted coordinates. This dataset is the largest of its kind. In what follows we focus on the set of 13.5K articles from which we extracted peak locations.

Some quantitative aspects of the NeuroQuery and NeuroSynth datasets are summarized in Table 3.

Table 3. Comparison with NeuroSynth.

‘voc intersection’ is the set of terms present in both NeuroSynth’s and NeuroQuery’s vocabularies. The ‘conflicting articles’ are papers present in both datasets, for which the coordinate extraction tools disagree, 40 of which were manually annotated.

NeuroSynth NeuroQuery
Dataset size
articles 14 371 13 459
terms 3 228 (1 335 online) 7 547
journals 60 458
raw text length (words) ≈4 M ≈75 M
unique term occurrences 1 063 670 5 855 483
unique term occurrences in voc intersection 677 345 3 089 040
coordinates 448 255 418 772
Coordinate extraction errors on conflicting articles
articles with false positives / 40 20 3
articles with false negatives / 40 28 8
Text

In terms of the raw amount of text, this corpus is 20 times larger than NeuroSynth’s. Combined with our vocabulary, it yields over 5.5M occurrences of a unique term in an article. This is over five times more than the word occurrence counts distributed by NeuroSynth (https://github.com/neurosynth/neurosynth-data). When considering only terms in NeuroSynth’s vocabulary, the corpus still contains over 3M term-study associations, 4.6 times more than NeuroSynth. Using this larger corpus results in denser representations, higher statistical power, and coverage of a wider vocabulary. There is a substantial overlap between the selected studies: 8 692 studies are present in both datasets – the Intersection over Union is 0.45.

Coordinates

The set of extracted coordinates is almost the size of NeuroSynth’s (which is 7% larger with 448255 coordinates after removing duplicates), and is less noisy. To compare coordinate extractions, we manually annotated a small set of articles for which NeuroSynth’s coordinates differ from NeuroQuery’s. Compared with NeuroSynth, NeuroQuery’s extraction method reduced the number of articles with incorrect coordinates (false positives) by a factor of 7, and the number of articles with missing coordinates (false negatives) by a factor of 3 (Table 2). Less noisy brain activation data is useful for training encoding models.

Sharing data

We do not have the right to share the full text of the articles, but the vocabulary, extracted coordinates, and term occurrence counts for the whole corpus are freely available online (https://github.com/neuroquery/neuroquery_data).

Mathematical details of the NeuroQuery statistical model

Notation

We denote scalars, vectors and matrices with lower-case, bold lower-case, and bold upper-case letters respectively: $x$, $\mathbf{x}$, $\mathbf{X}$. We denote the elements of $\mathbf{X}$ by $x_{i,j}$, its rows by $\mathbf{x}_i$, and its columns by $\mathbf{x}_{\cdot,i}$. We denote $p$ the number of voxels in the brain, $v$ the size of the vocabulary, and $n$ the number of studies in the dataset. We use the indices $i$, $j$, $k$ to index samples (studies), features (terms), and outputs (voxels) respectively. We use a hat to denote estimated values, for example $\hat{\mathbf{B}}$. $\langle \mathbf{x}, \mathbf{y} \rangle$ is the vector scalar product.

TFIDF feature extraction

We represent a document by its TFIDF features (Salton and Buckley, 1988), which are reweighted Bag-Of-Words features. A TFIDF representation is a vector in which each entry corresponds to the (reweighted) frequency of occurrence of a particular term. The term frequency, tf, of a word in a document is the number of times the word occurs, divided by the total number of words in the document. The document frequency, df, of a word in a corpus is the proportion of documents in which it appears. The inverse document frequency, idf, is defined as:

$\mathrm{idf}(w) = -\log(\mathrm{df}) + 1 = -\log\left(\frac{|\{i \mid w \text{ occurs in document } i\}|}{n}\right) + 1, \quad (1)$

where $n$ is the number of documents in the corpus and $|\cdot|$ denotes set cardinality. Term frequencies are reweighted by their idf, so that frequent words, which occur in many documents (such as ‘results’ or ‘brain’), are given less importance. Indeed, such words are usually not very informative.

Our TFIDF representation for a study is the uniform average of the normalized TFIDF vectors for its title, abstract, full text, and keywords. Therefore, all parts of the article are taken into account, but a word that occurs in the title is more important than a word in the article body (since the title is shorter).

TFIDF features exploit a fixed vocabulary – each dimension is associated with a particular word. The vocabulary we consider comprises 7547 terms or phrases related to neuroscience that occur in at least 0.05% of publications. These terms are extracted from manually curated sources shown in Table 1 and Table 4.
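
A minimal sketch of this feature extraction is shown below, assuming scikit-learn’s TfidfVectorizer with the curated vocabulary passed explicitly. Note that scikit-learn’s default idf weighting differs slightly from Equation 1, and the L2 normalization of each article part is an assumption; the toy vocabulary and function names are illustrative only.

```python
# Sketch of the document representation: TFIDF over a fixed neuroscience
# vocabulary, computed separately for each part of an article and averaged.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# Toy vocabulary standing in for the 7547 curated terms and phrases.
VOCABULARY = ["default mode network", "huntington disease", "prosopagnosia",
              "reward", "auditory"]


def fit_vectorizer(corpus_texts):
    """Learn idf weights from the training corpus (full-text articles)."""
    vectorizer = TfidfVectorizer(vocabulary=VOCABULARY, ngram_range=(1, 3),
                                 norm=None)
    return vectorizer.fit(corpus_texts)


def article_tfidf(vectorizer, article_parts):
    """article_parts: dict with 'title', 'keywords', 'abstract', 'body'."""
    parts = [article_parts[key]
             for key in ("title", "keywords", "abstract", "body")]
    tfidf = normalize(vectorizer.transform(parts), norm="l2")  # per-part norm
    return np.asarray(tfidf.mean(axis=0)).ravel()  # uniform average of parts
```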

Table 4. Atlases included in NeuroQuery’s vocabulary.

Reweighted ridge matrix and feature (vocabulary) selection

Here we give some details about the feature selection and adaptive ridge regularization. After extracting TFIDF features and computing density estimation maps, we fit a linear model by regressing the activity of each voxel on the TFIDF descriptors (Section 'Overview of the NeuroQuery model'). We denote $p$ the number of voxels, $v$ the size of the vocabulary, and $n$ the number of documents in the corpus. We construct a design matrix $\mathbf{X} \in \mathbb{R}^{n \times v}$ containing the TFIDF features of each study, and the dependent variables $\mathbf{Y} \in \mathbb{R}^{n \times p}$ representing the activation density at each voxel for each study. The linear model is thus:

$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}, \quad (2)$

where $\mathbf{E}$ is Gaussian noise and $\mathbf{B} \in \mathbb{R}^{v \times p}$ are the unknown model coefficients. We use ridge regression (least-squares regression with a penalty on the $\ell_2$ norm of the model coefficients). Some words are much more informative than others, or have a much stronger correlation with brain activity. For example, ‘auditory’ is well correlated with activations in the auditory areas, whereas ‘attention’ has a lower signal-to-noise ratio, as it is polysemic and, even when used as a psychological concept, has a weaker link to reported neural activations. Therefore it is beneficial to adapt the amount of regularization for each word, to strongly penalize (or even discard) the noisiest features.

Many existing methods for feature selection are not adapted to our case, because: (i) the design matrix X is very sparse, and more importantly (ii) we want to select the same features for ≈ 28 000 outputs (each voxel in the brain is a dependent variable). We therefore introduce a new reweighted ridge regression and feature selection procedure.

Our approach is based on the observation that, when fitting a ridge regression with uniform regularization, the most informative words are associated with large coefficients for many voxels. We start by fitting a ridge regression with uniform regularization. We obtain one statistical map of the brain for every feature (every term in the vocabulary). The maps are rescaled to reduce the importance of coefficients with a high variance. We then compute the squared $\ell_2$ norms of these brain maps across voxels. These norms are a good proxy for the importance of each feature. Terms associated with large norms explain the activity of many voxels well and tend to be helpful features. We rely on these brain map norms to determine which features are selected and what regularization is applied. The feature selection and adaptive regularization are described in detail in the rest of this section.

Z scores for ridge regression coefficients

Our design matrix $\mathbf{X} \in \mathbb{R}^{n \times v}$ holds TFIDF features for $v$ terms in $n$ studies. There are $p$ dependent variables, one for each voxel in the brain, which form $\mathbf{Y} \in \mathbb{R}^{n \times p}$. The first ridge regression fit yields coefficients $\hat{\mathbf{B}}^{(0)} \in \mathbb{R}^{v \times p}$:

$\hat{\mathbf{B}}^{(0)} = \underset{\mathbf{B} \in \mathbb{R}^{v \times p}}{\operatorname{argmin}} \; \|\mathbf{Y} - \mathbf{X}\mathbf{B}\|_F^2 + \lambda \|\mathbf{B}\|_F^2, \quad (3)$

where $\lambda > 0$ is a hyperparameter set with Generalized Cross-Validation (GCV) (Rifkin and Lippert, 2007). We then compute an estimate of the variance of these coefficients. The approach is similar to the one presented in Gaonkar and Davatzikos (2012) for the case of SVMs. A simple estimator can be obtained by noting that the coefficients of a ridge regression are a linear function of the dependent variables. Indeed, solving Equation 3 yields:

$\hat{\mathbf{B}}^{(0)} = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{Y}. \quad (4)$

Defining

$\mathbf{M} = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \in \mathbb{R}^{v \times n}, \quad (5)$

for a voxel $k \in \{1, \dots, p\}$ and a feature $j \in \{1, \dots, v\}$,

$\hat{b}^{(0)}_{j,k} = \langle \mathbf{m}_j, \mathbf{y}_{\cdot,k} \rangle, \quad (6)$

where $\mathbf{m}_j \in \mathbb{R}^n$ is the $j$th row of $\mathbf{M}$ and $\mathbf{y}_{\cdot,k} \in \mathbb{R}^n$ is the $k$th column of $\mathbf{Y}$. The activations of voxel $k$ across studies are considered to be independent and identically distributed (i.i.d.), so

$\operatorname{Var}(\mathbf{y}_{\cdot,k}) = \operatorname{Var}(y_{1,k})\, \mathbf{I}_n \triangleq s_k^2\, \mathbf{I}_n. \quad (7)$

An estimate of this variance can be obtained from the residuals:

$\hat{s}_k^2 \triangleq \frac{1}{n} \sum_{i=1}^n \left( \hat{y}^{(0)}_{i,k} - y_{i,k} \right)^2 = \frac{1}{n} \sum_{i=1}^n \left( (\mathbf{X}\hat{\mathbf{B}}^{(0)})_{i,k} - y_{i,k} \right)^2. \quad (8)$

A simple estimate of the coefficients’ variance is then:

$\hat{\sigma}_{j,k}^2 \triangleq \widehat{\operatorname{Var}}(\hat{b}^{(0)}_{j,k}) = \hat{s}_k^2\, \langle \mathbf{m}_j, \mathbf{m}_j \rangle = \hat{s}_k^2 \sum_{i=1}^n m_{j,i}^2 \quad (9)$

We can thus estimate the standard deviation of each entry of $\hat{\mathbf{B}}^{(0)}$. We obtain a brain map of Z scores for each term in the vocabulary: for term $j \in \{1, \dots, v\}$ and voxel $k \in \{1, \dots, p\}$,

$\hat{z}_{j,k} \triangleq \frac{\hat{b}^{(0)}_{j,k}}{\hat{\sigma}_{j,k}}. \quad (10)$

We denote $\hat{\boldsymbol{\sigma}}_j = (\hat{\sigma}_{j,1}, \dots, \hat{\sigma}_{j,p}) \in \mathbb{R}^p$, and $\hat{\mathbf{z}}_j = (\hat{z}_{j,1}, \dots, \hat{z}_{j,p}) \in \mathbb{R}^p$ the Z-map for term $j$.
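
The following numpy sketch illustrates Equations 3 to 10 with dense arrays; the GCV selection of lambda is omitted, and the function and variable names are ours for illustration, not part of the released NeuroQuery code.

```python
# Minimal sketch of a uniform-penalty ridge fit with variance and Z-score
# estimates for its coefficients (Equations 3-10), using dense arrays.
import numpy as np


def ridge_with_z_scores(X, Y, lam):
    """X: (n, v) TFIDF features, Y: (n, p) activation densities."""
    n, v = X.shape
    M = np.linalg.solve(X.T @ X + lam * np.eye(v), X.T)  # (v, n), Equation 5
    B0 = M @ Y                                           # (v, p), Equation 4
    residuals = X @ B0 - Y
    s2 = (residuals ** 2).mean(axis=0)                   # (p,), Equation 8
    m_norms = (M ** 2).sum(axis=1)                       # (v,), sum_i m_ji^2
    var_B0 = np.outer(m_norms, s2)                       # (v, p), Equation 9
    Z = B0 / np.sqrt(var_B0)                             # (v, p), Equation 10
    return B0, var_B0, Z
```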

Reweighted ridge matrix

Once we have a Z-map for each term, we summarize these maps by computing their squared Euclidean norm. In practice, we smooth the Z scores: $\hat{z}_{j,k}$ in Equation 10 is replaced by

$\hat{\zeta}_{j,k} = \frac{\hat{b}^{(0)}_{j,k}}{\hat{\sigma}_{j,k} + \delta}, \quad (11)$

where δ is a constant offset. The offset δ allows us to interpolate between basing the regularization on the Z scores, or on the raw coefficients, that is the β-maps. We obtain better results with a large value for δ, such as the mean variance of all the regression coefficients. This prevents selecting terms only because they have a very small estimated variance in some voxels. Note that this offset δ is only used to compute the regularization, and not to compute the rescaled predictions produced by NeuroQuery as in Equation 17.

We denote $\hat{\boldsymbol{\zeta}}_j = (\hat{\zeta}_{j,1}, \dots, \hat{\zeta}_{j,p}) \in \mathbb{R}^p$ for $j \in \{1, \dots, v\}$. Next, we compute the mean $\mu$ and standard deviation $e$ of $\{\|\hat{\boldsymbol{\zeta}}_j\|_2^2,\; j = 1 \dots v\}$, and set an arbitrary cutoff

$c = \mu + 2e. \quad (12)$

All features $j$ such that $\|\hat{\boldsymbol{\zeta}}_j\|_2^2 \leq c + \epsilon$, where $\epsilon$ is a small margin to avoid division by zero in Equation 14, are discarded. In practice we set $\epsilon$ to 0.001. The value of $\epsilon$ is not important, because features that are not discarded but have their $\zeta$ norm close to $c$ get very heavily penalized in Equation 14 and have coefficients very close to 0.

We denote $u < v$ the number of features that remain in the selected vocabulary. We denote $\phi: \{1 \dots u\} \to \{1 \dots v\}$ the strictly increasing mapping that reindexes the features by keeping only the $u$ selected terms: $\phi(\{1 \dots u\})$ is the set of selected features. We denote $\mathbf{P} \in \mathbb{R}^{u \times v}$ the corresponding projection matrix:

$\mathbf{p}_{j,\cdot}^T = \mathbf{e}_{\phi(j)}, \quad j \in \{1 \dots u\}, \quad (13)$

where $\{\mathbf{e}_j,\; j = 1 \dots v\}$ is the natural basis of $\mathbb{R}^v$. The regularization for the selected features is then set to

$w_j = \frac{1}{\|\hat{\boldsymbol{\zeta}}_{\phi(j)}\|_2^2 - c}. \quad (14)$

Finally, we define the diagonal matrix $\mathbf{W} \in \mathbb{R}^{u \times u}$ such that the $j$th element of its diagonal is $w_j$ and fit a new set of coefficients $\hat{\mathbf{B}} \in \mathbb{R}^{u \times p}$ with this new ridge matrix.

Fitting the reweighted ridge regression

The reweighted ridge regression problem is:

$\hat{\mathbf{B}} = \underset{\mathbf{B} \in \mathbb{R}^{u \times p}}{\operatorname{argmin}} \; \|\mathbf{Y} - \mathbf{X}\mathbf{P}^T\mathbf{B}\|_F^2 + \gamma \operatorname{Tr}(\mathbf{B}^T \mathbf{W} \mathbf{B}), \quad (15)$

where $\gamma > 0$ is a new hyperparameter, again set by Generalized Cross-Validation (GCV). With a change of variables this becomes equivalent to solving the usual ridge regression problem:

$\hat{\boldsymbol{\Gamma}} = \underset{\boldsymbol{\Gamma}}{\operatorname{argmin}} \; \|\mathbf{Y} - \tilde{\mathbf{X}}\boldsymbol{\Gamma}\|_F^2 + \gamma \|\boldsymbol{\Gamma}\|_F^2, \quad (16)$

where $\tilde{\mathbf{X}} = \mathbf{X}\mathbf{P}^T\mathbf{W}^{-\frac{1}{2}}$ and we recover $\hat{\mathbf{B}}$ as $\hat{\mathbf{B}} = \mathbf{W}^{-\frac{1}{2}} \hat{\boldsymbol{\Gamma}}$.

The variance of the parameters $\hat{\mathbf{B}}$ can be estimated as in Equation 9 – without applying the smoothing of Equation 11. NeuroQuery can thus report rescaled predictions:

$\hat{\mathbf{z}} = \mathbf{x}^T\hat{\mathbf{B}} \,\big/\, \bigl( \widehat{\operatorname{Var}}(\mathbf{x}^T\hat{\mathbf{B}}) \bigr)^{\frac{1}{2}} \quad (17)$

One benefit of this rescaling is to provide the user with a natural value at which to threshold the maps. As visible in Figures 4, 5 and 6, thresholding for example at $\hat{z} \geq 3$ selects regions typical of the query, which can be used for instance in a region-of-interest analysis.

Summary of the regression with adaptive regularization

The whole procedure for feature selection and adaptive regularization is summarized in Algorithm 1.

Algorithm 1: Reweighted Ridge Regression
Input: TFIDF features $\mathbf{X}$, brain activation densities $\mathbf{Y}$, regularization hyperparameter grid $\Lambda$, variance smoothing parameter $\delta$
Use GCV to compute the best hyperparameter $\lambda \in \Lambda$ and $\hat{\mathbf{B}}^{(0)} = \operatorname{argmin}_{\mathbf{B}} \|\mathbf{Y} - \mathbf{X}\mathbf{B}\|_F^2 + \lambda\|\mathbf{B}\|_F^2$;
compute the variance estimates $\hat{\sigma}_j^2$ as in Equation 9;
$\hat{\boldsymbol{\zeta}}_j \leftarrow \hat{\mathbf{b}}_j^{(0)} / (\hat{\boldsymbol{\sigma}}_j + \delta)$ for $j \in \{1 \dots v\}$ (element-wise);
compute $c$ according to Equation 12;
define $\phi$, the reindexing that selects features $j$ such that $\|\hat{\boldsymbol{\zeta}}_j\|_2^2 > c + \epsilon$;
define $\mathbf{P} \in \mathbb{R}^{u \times v}$, the projection matrix for $\phi$, as in Equation 13;
$w_j \leftarrow 1 / (\|\hat{\boldsymbol{\zeta}}_{\phi(j)}\|_2^2 - c)$ for $j \in \{1 \dots u\}$;
$\mathbf{W} \leftarrow \operatorname{diag}(w_j,\; j = 1 \dots u)$;
use GCV to compute the best hyperparameter $\gamma \in \Lambda$ and $\hat{\mathbf{B}} = \operatorname{argmin}_{\mathbf{B}} \|\mathbf{Y} - \mathbf{X}\mathbf{P}^T\mathbf{B}\|_F^2 + \gamma\,\operatorname{Tr}(\mathbf{B}^T\mathbf{W}\mathbf{B})$;
Return: $\hat{\mathbf{B}}$, $\widehat{\operatorname{Var}}(\hat{\mathbf{B}})$, $\gamma$, $\mathbf{P}$, $\mathbf{W}$

In practice, the feature selection keeps u ≈ 200 features. It has a very low computational cost compared to other feature selection schemes. The computational cost is that of fitting two ridge regressions (and the second one is fitted with a much smaller number of features). Moreover, the feature selection also reduces computation at prediction time, which is useful because we deploy an online tool based on the NeuroQuery model (https://neuroquery.org).
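
As an illustration, here is a compact numpy sketch of Algorithm 1, reusing the ridge_with_z_scores helper from the previous sketch; GCV is again left out, and lam, gamma and the epsilon margin are assumed to be given rather than selected as in the actual model.

```python
# Sketch of Algorithm 1: smoothed Z norms, cutoff-based feature selection,
# and a reweighted ridge fit solved via the change of variables of Eq. 16.
import numpy as np


def reweighted_ridge(X, Y, lam, gamma, eps=1e-3):
    B0, var_B0, _ = ridge_with_z_scores(X, Y, lam)   # helper sketched above
    delta = var_B0.mean()                     # large offset, as in the text
    zeta = B0 / (np.sqrt(var_B0) + delta)     # Equation 11
    norms = (zeta ** 2).sum(axis=1)           # squared l2 norm per term
    cutoff = norms.mean() + 2 * norms.std()   # Equation 12
    selected = np.where(norms > cutoff + eps)[0]   # kept features (phi)
    weights = 1.0 / (norms[selected] - cutoff)     # Equation 14
    # Change of variables: plain ridge on rescaled, selected columns.
    X_tilde = X[:, selected] / np.sqrt(weights)    # X P^T W^{-1/2}
    u = len(selected)
    Gamma = np.linalg.solve(X_tilde.T @ X_tilde + gamma * np.eye(u),
                            X_tilde.T @ Y)         # Equation 16
    B = Gamma / np.sqrt(weights)[:, None]          # B = W^{-1/2} Gamma
    return B, selected, weights
```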

Smoothing: regularization at test time

In order to smooth the sparse input features, we exploit the covariance of our training corpus. We rely on Non-negative Matrix Factorization (NMF) (Lee and Seung, 1999). We use an NMF of $\mathbf{X} \in \mathbb{R}^{n \times v}$ to compute a low-rank approximation of the covariance $\mathbf{X}^T\mathbf{X} \in \mathbb{R}^{v \times v}$. Thus, we obtain a denoised term co-occurrence matrix, which measures the strength of association between pairs of terms. We start by computing an approximate factorization of the corpus TFIDF matrix $\mathbf{X}$:

$\mathbf{U}, \mathbf{V} = \underset{\mathbf{U} \in \mathbb{R}_{\geq 0}^{n \times d},\; \mathbf{V} \in \mathbb{R}_{\geq 0}^{d \times v}}{\operatorname{argmin}} \; \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_F^2 + \lambda\left(\|\mathbf{U}\|_F^2 + \|\mathbf{V}\|_F^2\right) + \gamma\left(\|\mathbf{U}\|_{1,1} + \|\mathbf{V}\|_{1,1}\right), \quad (18)$

where $d < v$ is a hyperparameter and $\|\cdot\|_{1,1}$ designates the sum of the absolute values of all entries of a matrix. Computing this factorization amounts to describing each document in the corpus as a linear mixture of $d$ latent factors, or topics. In natural language processing, similar decomposition methods are referred to as topic modelling (Blei et al., 2003; Deerwester et al., 1990).

The latent factors, or topics, are the rows of $\mathbf{V} \in \mathbb{R}^{d \times v}$: each topic is characterized by a vector of positive weights over the terms in the vocabulary. $\mathbf{U} \in \mathbb{R}^{n \times d}$ contains the weight that each document gives to each topic. For each term in the vocabulary, the corresponding column of $\mathbf{V}$ is a $d$-dimensional embedding in the low-dimensional latent space: this embedding contains the strength of association of the term with each topic. These embeddings capture semantic relationships: related terms tend to be associated with embeddings that have large inner products.

The hyperparameters d=300, λ=0.1 and γ=0.01 are set by evaluating the reconstruction error, sparsity of the similarity matrix, and extracted topics (rows of V) on an unlabelled (separate) corpus. We find that the NeuroQuery model as a whole is not very sensitive to these hyperparameters and we obtain similar results for a range of different values.

Equation 18 is a well-known problem. We solve it with a coordinate-descent algorithm described in Cichocki and Phan (2009) and implemented in scikit-learn (Pedregosa et al., 2011). Then, let $\mathbf{N} \in \mathbb{R}^{d \times d}$ be the diagonal matrix containing the Euclidean norms of the columns of $\mathbf{U}$, that is such that $n_{ii} = \|\mathbf{u}_{\cdot,i}\|_2$, and let $\tilde{\mathbf{V}} = \mathbf{N}\mathbf{V}$. We define the word similarity matrix $\mathbf{A} = \tilde{\mathbf{V}}^T\tilde{\mathbf{V}} \in \mathbb{R}^{v \times v}$. This matrix is a denoised, low-rank approximation of the corpus covariance. Indeed,

$\mathbf{X}^T\mathbf{X} \approx (\mathbf{U}\mathbf{V})^T \mathbf{U}\mathbf{V} \quad (19)$
$= \mathbf{V}^T \mathbf{N}^T (\mathbf{U}\mathbf{N}^{-1})^T \mathbf{U}\mathbf{N}^{-1} \mathbf{N}\mathbf{V} \quad (20)$
$\approx \tilde{\mathbf{V}}^T \tilde{\mathbf{V}}. \quad (21)$

The last approximation is justified by the fact that the columns of $\mathbf{U} \in \mathbb{R}^{n \times d}$ are almost orthogonal, and $\mathbf{U}^T\mathbf{U}$ is almost a diagonal matrix. This is what we observe in practice, and is due to the fact that $n \approx 13\,000$ is much larger than $d = 300$, and that to minimize the reconstruction error in Equation 18 the columns of $\mathbf{U}$ have an incentive to span a large subspace of $\mathbb{R}^n$.

The similarity matrix $\mathbf{A}$ contains the inner products of the low-dimensional embeddings of the terms in our vocabulary. We form the matrix $\mathbf{T}$ by dividing the rows of $\mathbf{A}$ by their $\ell_1$ norm:

$t_{i,j} = \frac{a_{i,j}}{\|\mathbf{a}_i\|_1}, \quad i = 1 \dots v,\; j = 1 \dots v. \quad (22)$

This normalization ensures that terms that have many neighbors are not given more importance in the smoothed representation. The smoothing matrix that we use is then defined as:

$\mathbf{S} = (1 - \alpha)\,\mathbf{I} + \alpha\,\mathbf{T}, \quad (23)$

with $0 < \alpha < 1$ (in our experiments $\alpha$ is set to 0.1). This smoothing matrix is a mixture of the identity matrix and the term associations $\mathbf{T}$. The model is not very sensitive to the parameter $\alpha$ as long as it is chosen small enough for terms actually present in the query to have a higher weight than terms introduced by the query expansion. This prevents degrading performance for documents which contain well-encoded terms, which obtain good predictions even without smoothing. This explains why in Figure 3, the prediction for ‘visual’ relies mostly on the regression coefficient for this exact term, whereas the prediction for ‘agnosia’ relies on coefficients of terms that are related to ‘agnosia’ – ‘agnosia’ itself is not kept by the feature selection procedure.
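
For illustration, here is a sketch of how the smoothing matrix of Equation 23 can be assembled with scikit-learn's NMF. The regularization terms of Equation 18 are omitted for simplicity (recent scikit-learn versions expose them through alpha_W, alpha_H and l1_ratio), and a dense similarity matrix is only practical for a moderately sized vocabulary; names and defaults are illustrative.

```python
# Sketch of the query-expansion smoothing: an NMF of the TFIDF matrix gives
# term embeddings, from which the similarity matrix T (Eq. 22) and the
# mixing matrix S (Eq. 23) are built.
import numpy as np
from sklearn.decomposition import NMF


def smoothing_matrix(X_tfidf, n_topics=300, alpha=0.1):
    nmf = NMF(n_components=n_topics, init="nndsvd", max_iter=500)
    U = nmf.fit_transform(X_tfidf)           # (n, d) document loadings
    V = nmf.components_                      # (d, v) topic-term weights
    col_norms = np.linalg.norm(U, axis=0)    # diagonal of N
    V_tilde = col_norms[:, None] * V         # N V
    A = V_tilde.T @ V_tilde                  # (v, v) term similarity
    T = A / np.abs(A).sum(axis=1, keepdims=True)  # l1-normalized rows
    return (1 - alpha) * np.eye(A.shape[0]) + alpha * T
```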

The smoothed representation for a query $\mathbf{q}$ becomes:

$\mathbf{x} = \mathbf{S}^T \mathbf{q} \in \mathbb{R}^v, \quad (24)$

where $\mathbf{q} \in \mathbb{R}^v$ is the TFIDF representation of the query in the large vocabulary space, and $\mathbf{S} \in \mathbb{R}^{v \times v}$ is the smoothing matrix. And the prediction for $\mathbf{q}$ is:

$\hat{\mathbf{y}} = \hat{\mathbf{B}}\,\mathbf{P}\,\mathbf{S}^T \mathbf{q}, \quad (25)$

where $\mathbf{P} \in \mathbb{R}^{u \times v}$ is the projection onto the useful vocabulary (selected features), $\hat{\mathbf{B}} \in \mathbb{R}^{p \times u}$ are the estimated linear regression coefficients, and $\hat{\mathbf{y}} \in \mathbb{R}^p$ is the predicted map.
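
A small sketch of Equations 24 and 25 follows, reusing the vectorizer, smoothing matrix, selected features and coefficients from the sketches above; the coefficients are stored as a u-by-p array (the transpose of the $\hat{\mathbf{B}}$ of Equation 25), so this illustrates the pipeline rather than the released implementation.

```python
# Sketch of the prediction step: expand a raw query with the smoothing
# matrix, project onto the selected vocabulary, apply the coefficients.
import numpy as np


def predict_map(query_text, vectorizer, S, selected, B):
    """S: (v, v) smoothing matrix, selected: kept feature indices,
    B: (u, p) reweighted-ridge coefficients."""
    q = vectorizer.transform([query_text]).toarray().ravel()  # (v,) TFIDF
    x = S.T @ q                      # Equation 24: smoothed representation
    x_selected = x[selected]         # projection P onto the kept features
    return x_selected @ B            # (p,) predicted density map
```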

Validation experiments: additional details

Example meta-analysis results for the RSVP language task from the IBC dataset

Here we provide more details on the meta-analyses for ‘Read pseudo-words vs consonant strings’ shown in Figure 2. The PMIDs of the studies included in the GingerALE meta-analysis are: 15961322, 16574082, 16968771, 17189619, 17884585, 17933023, 18272399, 18423780, 18476755, 18778780, 19396362, 19591947, 20035884, 20600985, 20650450, 20961169, 21767584, 22285025, 22659111, 23117157, 23270676, 24321558, 24508158, 24667455, 25566039, 26017384, 26188258, 26235228, 28780219. These represent a total of 29 studies and 2025 peak activation coordinates. They are the studies from our corpus (the largest existing corpus of text and peak activation coordinates, with ≈ 14 000 studies) which contain the terms ‘reading’, ‘pseudo’, ‘word’, ‘consonant’ and ‘string’. The map shown on the right of Figure 2 was obtained with GingerALE, using 5000 permutations and the default settings otherwise. Note that an unrealistically low threshold is used for the display because the map would be empty otherwise. Figure 8 displays more maps obtained with different analysis strategies: the details of the original contrasts and the difference between running NeuroQuery on the contrast definition or on the task definition. The task definition leads to predicted activations in the early visual cortex, as in the actual group-level maps from the experiment but unlike the predictions from the contrast definition, as the latter contains no information on the stimulus modality.

Figure 8. Using meta-analysis to interpret fMRI maps.

Figure 8.

Example of the ‘Read pseudo-words vs. consonant strings’ contrast, derived from the RSVP language task in the IBC dataset. (a): the group-level map obtained from the actual fMRI data from IBC. (b): ALE map using the 29 studies in our corpus that contain all five terms from the contrast name. (c): NeuroQuery map obtained from the contrast name. (d): NeuroQuery map obtained from the page-long RSVP task description in the IBC dataset documentation: https://project.inria.fr/IBC/files/2019/03/documentation.pdf.

NeuroQuery performance on unseen pairs of terms

Figure 4 shows in a qualitative way that NeuroQuery can produce useful brain maps for a combination of terms that have not been studied together. To give a quantitative evaluation that is not limited to a specific pair of terms, we perform a systematic experiment, studying prediction on many unseen pairs of terms. For this purpose, we choose pairs of terms in our full corpus and leave out all the studies in which both of these terms appear. We train a NeuroQuery model on the reduced corpus obtained by excluding studies with both terms, and evaluate its predictions on the left-out studies.

We choose pairs of terms that appear together frequently (in more than 500 studies), to ensure a good estimation of the combined locations for these terms in the test set, but not too frequently (in fewer than 1000 studies), to avoid depleting the training set too much. Indeed, removing the studies containing both terms from the corpus not only decreases the statistical power to map these terms but also, more importantly, creates a negative correlation between these terms. Out of these pairs, we select 1000 at random to leave out and run the experiment 1000 times.

To evaluate NeuroQuery’s predictions on these unseen pairs of terms, we first use the same metrics as in Section 'Quantitative evaluation: NeuroQuery is an accurate model of the literature'. Figure 9, left, shows the log-likelihood of coordinates reported in a publication, evaluated on left-out studies that contain the combination of terms excluded from the training set. Compared to testing on a random subset of studies, identically distributed to the training set, there is a slight decrease in likelihood, but it is small compared to the variance between cross-validation runs. Figure 9, right, shows results for our other validation metric adapted from Mitchell et al. (2008): matching one publication out of two to its observed locations. The decrease in performance is more marked. However, it should be noted that the task is more difficult when the test set is made only of publications that all contain two given terms, as these publications are all more similar to each other than random publications from the general corpus.

Figure 9. Quantitative evaluations on unseen pairs.

Figure 9.

A quantitative comparison of prediction on random unseen studies (i.i.d. cross-validation) with prediction on studies containing pairs of terms never seen before, using the two measures of prediction performance (as shown in Figure 7 for standard cross-validation).

To gauge the quality of the maps on unseen pairs, and not only how well the corresponding publications are captured, Figure 10 shows the Pearson correlation between the predicted brain map and the average density of the reported locations in the left-out studies. The excellent median Pearson correlation of .85 shows that the predicted brain map is indeed true to what a meta-analysis of these studies would reveal.
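
A minimal sketch of this consistency measure follows, assuming the predicted map and the held-out studies' density maps are available as flattened voxel arrays; the function name is illustrative.

```python
# Sketch of the evaluation: Pearson correlation between the map predicted
# for a left-out pair of terms and the average peak-density map of the
# held-out studies containing that pair.
import numpy as np
from scipy.stats import pearsonr


def pair_consistency(predicted_map, heldout_densities):
    """heldout_densities: (n_heldout_studies, n_voxels) density maps."""
    mean_density = np.asarray(heldout_densities).mean(axis=0)
    r, _ = pearsonr(np.ravel(predicted_map), mean_density)
    return r
```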

Figure 10. Consistency between prediction of unseen pairs and meta-analysis.

Figure 10.

The Pearson correlation between the map predicted by NeuroQuery on a pair of unseen terms and the average density of locations reported on the studies containing this pair of terms (hence excluded from the training set of NeuroQuery).

NeuroQuery prediction performance without anatomical terms

In Figure 11, we present an additional quantitative measure of prediction performance. We delete all terms that are related to anatomy in test articles, to see how NeuroQuery performs without these highly predictive features, which may be missing from queries related to brain function. As the GCLDA and NeuroSynth tools are designed to work with NeuroSynth data, they are only tested on NeuroSynth’s TFIDF features, which represent the articles’ abstracts.

Figure 11. Explaining coordinates reported in unseen studies.

Figure 11.

Left: log-likelihood of reported coordinates in test articles. Right: how often the predicted map is closer to the true coordinates than to the coordinates of another article in the test set (Mitchell et al., 2008). The boxes represent the first, second and third quartiles of scores across 16 cross-validation folds. Whiskers represent the rest of the distribution, except for outliers, defined as points beyond 1.5 times the IQR past the low and high quartiles, and represented with diamond fliers.

Variable terminology

In Figure 12, we show predictions for some terms related to mental arithmetic. NeuroQuery’s semantic smoothing produces consistent results for related terms.

Figure 12. Taming arbitrary query variability. Maps obtained for a few words related to mental arithmetic.

Figure 12.

By correctly capturing the fact that these words are related, NeuroQuery can use its map for easier words like ‘calculation’ and ‘arithmetic’ to encode terms like ‘computation’ and ‘addition’ that are difficult for meta-analysis.

Comparison with the BrainPedia IBMA study

To compare maps produced by NeuroQuery with a reliable ground truth, we use the BrainPedia study (Varoquaux et al., 2018), which exploits IBMA to produce maps for 19 cognitive concepts. Indeed, when it is feasible, IBMA of manually selected studies produces high-quality brain maps and has been used as a reference for CBMA methods (Salimi-Khorshidi et al., 2009). We download the BrainPedia maps and their cognitive labels from the NeuroVault platform (https://neurovault.org/collections/4563/). BrainPedia maps combine forward and reverse inference, and are thresholded to identify regions that are both recruited by and predictive of each cognitive process. We treat these maps as a binary ground truth: above-threshold voxels are relevant to the map’s label. For each label, we obtain a brain map from NeuroQuery, NeuroSynth and GCLDA. We compare these results to the BrainPedia thresholded maps and measure the Area Under the ROC Curve (AUC). This standard classification metric measures the probability that a voxel that is active in the BrainPedia reference map will be given a higher intensity in the NeuroQuery prediction than a voxel that is inactive in the BrainPedia map.
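
A minimal sketch of this comparison, assuming the thresholded BrainPedia map and the meta-analytic prediction are given as voxel arrays; the zero threshold used here to binarize the reference is an assumption for the example.

```python
# Sketch of the comparison with the BrainPedia reference: the thresholded
# IBMA map gives binary voxel labels, the meta-analytic prediction gives
# continuous scores, and the agreement is summarized by the ROC AUC.
import numpy as np
from sklearn.metrics import roc_auc_score


def map_auc(reference_map, predicted_map, threshold=0.0):
    labels = (np.ravel(reference_map) > threshold).astype(int)
    return roc_auc_score(labels, np.ravel(predicted_map))
```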

We consider two settings. First, we use the original labels provided in the NeuroVault metadata. However, some of these labels are missing from the NeuroSynth vocabulary. In a second experiment, we therefore replace these labels with the most similar term we can find in the NeuroSynth vocabulary. These replacements are shown in Figure 13.

Figure 13. Comparison of CBMA maps with IBMA maps from the BrainPedia study.

Figure 13.

We use labelled and thresholded maps resulting from a manual IBMA. The labels are fed to NeuroQuery, NeuroSynth and GCLDA and their results are compared to the reference by measuring the Area under the ROC Curve. The black vertical bars show the median. When using the original BrainPedia labels, NeuroQuery performs relatively well but NeuroSynth fails to recognize most labels. When reformulating the labels, that is replacing them with similar terms from NeuroSynth’s vocabulary, both NeuroSynth and NeuroQuery match the manual IBMA reference for most terms. On the top, we show the BrainPedia map (first row) and NeuroQuery prediction (second row) for the quartiles of the AUC obtained by NeuroQuery on the original labels. A lower AUC for some concepts can sometimes be explained by a more noisy BrainPedia reference map.

When replacing the original labels with less specific terms understood by NeuroSynth, both NeuroQuery and NeuroSynth perform well: NeuroQuery’s median AUC is 0.9 and NeuroSynth’s is 0.8. When using the original labels, NeuroSynth fails to produce results for many labels as they are missing from its vocabulary. NeuroQuery still performs well on these uncurated labels with a median AUC of 0.8. Finally, we can note that although the BrainPedia maps come from IBMA conducted on carefully selected fMRI studies, they also contain some noise. As can be seen in Figure 13, BrainPedia maps that qualitatively match the domain knowledge also tend to be close to the CBMA results produced by NeuroQuery and NeuroSynth.

Comparison with Harvard-Oxford anatomical atlas

Here, we compare CBMA maps to manually segmented regions of the Harvard-Oxford anatomical atlas (Desikan et al., 2006). We feed the labels from this atlas to NeuroQuery, NeuroSynth and GCLDA and compare the resulting maps to the atlas regions. This experiment provides a sanity check that relies on an excellent ground truth, as the atlas regions are labelled and segmented by experts. For simplicity, atlas labels absent from NeuroSynth’s vocabulary are discarded. For the remaining 18 labels, we compute the Area Under the ROC Curve of the maps produced by each meta-analytic tool. This experiment is therefore identical to the one presented in Section 'Comparison with the BrainPedia IBMA study', except that the reference ground truth is a manually segmented anatomical atlas, and that we do not consider reformulating the labels. GCLDA is not used in this experiment as the trained model distributed by the authors does not recognize anatomical terms. We observe that both NeuroSynth and NeuroQuery match closely the reference atlas, with a median AUC above 0.9, as seen in Figure 14.

Figure 14. Comparison of predictions with regions of the Harvard-Oxford anatomical atlas.

Figure 14.

Labels of the Harvard-Oxford anatomical atlas present in NeuroSynth’s vocabulary are fed to NeuroSynth and NeuroQuery. The meta-analytic maps are compared to the manually segmented reference by measuring the Area Under the ROC Curve. The black vertical bars show the median. Both NeuroSynth and NeuroQuery achieve a median AUC above 0.9. On the top, we show the atlas region (first row) and NeuroQuery prediction (second row) for the quartiles of the NeuroQuery AUC scores.

Comparison with NeuroSynth on terms with strong activations

As NeuroSynth performs a statistical test, when a term has a strong link with brain activity and is popular enough for NeuroSynth to detect many activations, the resulting map is trustworthy and can be used as a reference. Moreover, it is a well-established tool that has been adopted by the neuroimaging community. Here, we verify that when a term is well captured by NeuroSynth, NeuroQuery predicts a similar brain map. To identify terms that NeuroSynth captures well, we compute the NeuroSynth maps for all the terms in NeuroSynth’s vocabulary. We use the Benjamini-Hochberg procedure to threshold the maps, controlling the FDR at 1%. We then select the 200 maps with the largest number of active (above-threshold) voxels. We use these activation maps as a reference to which we compare the NeuroQuery prediction. For each term, we compute the Area Under the ROC Curve: the probability that a voxel that is active in the NeuroSynth map will have a higher value in the NeuroQuery prediction than an inactive voxel. We find that NeuroQuery and NeuroSynth’s maps coincide well, with a median AUC of 0.90. The distribution of the AUC and the brain map corresponding to each quartile are shown in Figure 15.
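
For illustration, here is a sketch of this selection step, assuming voxelwise p-value maps are available for each term; the Benjamini-Hochberg procedure is implemented directly here rather than through the NeuroSynth codebase, and the function names are ours.

```python
# Sketch of the selection of well-captured NeuroSynth terms: Benjamini-
# Hochberg thresholding of each term's p-value map at 1% FDR, then keeping
# the 200 terms with the most supra-threshold voxels.
import numpy as np


def benjamini_hochberg_mask(p_values, fdr=0.01):
    """Return a boolean mask of voxels rejected at the given FDR level."""
    p = np.ravel(p_values)
    order = np.argsort(p)
    ranks = np.arange(1, len(p) + 1)
    below = p[order] <= ranks * fdr / len(p)
    mask = np.zeros(len(p), dtype=bool)
    if below.any():
        mask[order[: np.max(np.where(below)[0]) + 1]] = True
    return mask.reshape(np.shape(p_values))


def best_captured_terms(p_maps, n_terms=200):
    """p_maps: (n_terms, n_voxels) voxelwise p-values, one row per term."""
    n_active = np.array([benjamini_hochberg_mask(pm).sum() for pm in p_maps])
    return np.argsort(n_active)[::-1][:n_terms]
```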

Figure 15. Comparison with NeuroSynth.

Figure 15.

NeuroSynth maps are thresholded controlling the FDR at 1%. The 200 words with the largest number of active voxels are selected and NeuroQuery predictions are compared to the NeuroSynth activations by computing the Area Under the ROC Curve. The distribution of the AUC is shown on the top. The vertical black line shows the median (0.90). On the bottom, we show the NeuroQuery maps (first row) and NeuroSynth activations (second row) for the quartiles of the NeuroQuery AUC scores.

The NeuroQuery publication corpus and associated vocabulary

Word occurrence frequencies across the corpus

The challenge: most words are rare

As shown in Figure 16, left, most words occur in very few documents, which is why correctly mapping rare words is important. The problem of rare words is more severe in the NeuroSynth corpus, which contains only the abstracts. As the NeuroQuery corpus contains the full text of the articles (around 20 times more text), more occurrences of a unique term in a document are observed, as shown in Figure 16, right, and in Figure 17 for a few example terms.

Figure 16. Right: benefit of using full-text articles.

Figure 16.

Document frequencies (number of documents in which a word appears) for terms from the NeuroSynth vocabulary, in the NeuroSynth corpus (x axis) and the NeuroQuery corpus (y axis). Words appear in far fewer documents in the NeuroSynth corpus because it only contains abstracts. Even when considering only terms present in the NeuroSynth vocabulary, the NeuroQuery corpus contains over 3M term-study associations – 4.6 times more than NeuroSynth. Left: most terms occur in few documents. Plot of the document frequencies in the NeuroQuery corpus, for terms in the vocabulary, sorted in decreasing order. While some terms are very frequent, occurring in over 12 000 articles, most are very rare: half occur in fewer than 76 (out of 14 000) articles.

Figure 17. Document frequencies for some example words, in NeuroQuery’s and NeuroSynth’s corpora.

Figure 17.

Document set intersections lack statistical power. For example, ‘face perception’ occurs in 413 articles, and ‘dementia’ in 1312. 1703 articles contain at least one of these words and could be used for a multivariate regression’s prediction for the query ‘face perception and dementia’. Indeed, denoting $\mathbf{c}$ the dual coefficients of the ridge regression and $\mathbf{X}$ the training design matrix, the prediction for a query $\mathbf{q}$ is $\mathbf{q}^T\mathbf{X}^T\mathbf{c}$, and any document that has a nonzero dot product with the query can participate in the prediction. However, only 22 documents contain both terms and would be used with the classical meta-analysis selection, which would lack statistical power and fail to produce meaningful results. Exact matches of multi-word expressions such as ‘creative problem solving’, ‘facial trustworthiness recognition’, ‘positive feedback processing’, ‘potential monetary reward’, ‘visual word recognition’ (all Cognitive Atlas concepts, all occurring in less than 5/10 000 full-text articles) are very rare – and classical meta-analysis thus cannot produce results for such expressions. In Figure 18, we compare the frequency of multi-word expressions from our vocabulary (such as ‘face recognition’) with the frequency of their constituent words. Being able to combine words in an additive fashion is crucial to encode such expressions into brain space.

Figure 18. Occurrences of phrases versus their constituents. How often a phrase from the vocabulary (e.g.

Figure 18.

‘face recognition’) occurs, versus at least one of its constituent words (e.g. ‘face’). Expressions involving several words are typically very rare.

The choice of vocabulary

Details on the Medical Subject Headings

The Medical Subject Headings (MeSH) are concerned with all of medicine. We only included in NeuroQuery’s vocabulary the parts of this graph that are relevant for neuroscience and psychology. Here we list the branches of Medical Subject Headings (MeSH) that we included in our vocabulary:

  • Neuroanatomy: ’A08.186.211’

  • Neurological disorders: ’C10.114’, ’C10.177’, ’C10.228’, ’C10.281’, ’C10.292’, ’C10.314’, ’C10.500’, ’C10.551’, ’C10.562’, ’C10.574’, ’C10.597’, ’C10.668’, ’C10.720’, ’C10.803’, ’C10.886’, ’C10.900’

  • Psychology: ’F02.463’, ’F02.830’, ’F03’, ’F01.058’, ’F01.100’, ’F01.145’, ’F01.318’, ’F01.393’, ’F01.470’, ’F01.510’, ’F01.525’, ’F01.590’, ’F01.658’, ’F01.700’, ’F01.752’, ’F01.829’, ’F01.914’

Many MeSH terms are too rare to be part of NeuroQuery’s vocabulary. Some are too specific, e.g. ‘Diffuse Neurofibrillary Tangles with Calcification’. More importantly, many terms are absent because for each heading, MeSH provides many Entry Terms – various ways to refer to a concept, some of which are almost never used in practice in the text of publications. For example NeuroQuery recognizes the MeSH Preferred Term ‘Frontotemporal Dementia’ but not some of its variations (https://meshb.nlm.nih.gov/record/ui?ui=D057180) such as ‘Dementia, Frontotemporal’, ‘Disinhibition-Dementia-Parkinsonism-Amyotrophy Complex’, or ‘HDDD1’. Note that even when absent from the vocabulary as single phrases, many of these variations can be parsed as a combination of several terms, resulting in a similar brain map as the one obtained for the preferred term.

Atlas labels included in the vocabulary

The labels from the 12 atlases shown in Table 4 were included in the NeuroQuery vocabulary.

NeuroSynth posterior probability maps

The NeuroSynth maps shown in Figure 6 are the NeuroSynth ‘association test’ maps. For completeness, in Figure 19 we show the other kind of map that NeuroSynth can produce, called ‘posterior probability’ maps.

Figure 19. NeuroSynth posterior probability maps for ‘language’ (top) and ‘reward’ (bottom), using the full corpus.

Figure 19.

Acknowledgements

JD acknowledges funding from Digiteo under project Metacog (2016-1270D). RP received funding from the US National Science Foundation (Award # OAC-1649658). TY acknowledges funding from NIH under grant number R01MH096906. BT received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No. 785907 (HBP SGA2) and No 826421 (VirtualbrainCloud). FS acknowledges funding from ANR via grant ANR-16- CE23-0007-01 (‘DICOS’). GV was partially funded by the Canada First Research Excellence Fund, awarded to McGill University for the Healthy Brains for Healthy Lives initiative. We also thank the reviewers, including Tor D Wager, for their suggestions that improved the manuscript.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Jérôme Dockès, Email: jerome@dockes.org.

Gaël Varoquaux, Email: gael.varoquaux@inria.fr.

Christian Büchel, University Medical Center Hamburg-Eppendorf, Germany.

Thomas Yeo, National University of Singapore, Singapore.

Funding Information

This paper was supported by the following grants:

  • Digiteo 2016-1270D - Projet MetaCog to Jérôme Dockès.

  • National Institutes of Health R01MH096906 to Tal Yarkoni.

  • Agence Nationale de la Recherche ANR-16- CE23-0007-01 to Fabian Suchanek.

  • H2020 European Research Council 785907 (HBP SGA2) to Bertrand Thirion.

  • H2020 European Research Council 826421 (VirtualbrainCloud) to Bertrand Thirion.

  • Canada First Research Excellence Fund Healthy Brains for Healthy Lives initiative to Gael Varoquaux.

  • National Science Foundation OAC-1649658 to Russell A Poldrack.

Additional information

Competing interests

Reviewing editor, eLife.

No competing interests declared.

Author contributions

Conceptualization, Resources, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology.

Conceptualization, Supervision, Methodology.

Software, Visualization.

Software, Visualization.

Conceptualization, Validation.

Conceptualization, Supervision, Validation, Visualization, Project administration.

Conceptualization, Supervision, Funding acquisition, Project administration.

Conceptualization, Software, Supervision, Funding acquisition, Validation, Visualization, Methodology, Project administration.

Additional files

Source data 1. Source code for figures and tables.
elife-53385-data1.zip (102.3MB, zip)
Transparent reporting form

Data availability

All the data that we can share without violating copyright (including word counts of publications) have been shared on https://github.com/neuroquery/ (copy archived at https://github.com/elifesciences-publications/neuroquery) alongside with the analysis scripts. Everything is readily downloadable without any authorization or login required. For each figure and table, the data directly used to generate it is made available in a separate zip file.

The following dataset was generated:

Dockès J, Varoquaux G. 2019. NeuroQuery. GitHub. neuroquery_data

References

  1. Bishop CM. Pattern Recognition and Machine Learning. Springer; 2006. [Google Scholar]
  2. Blei DM, Ng AY, Jordan MI. Latent dirichlet allocation. Journal of Machine Learning Research. 2003;3:993–1022. [Google Scholar]
  3. Bouma G. Normalized (pointwise) mutual information in collocation extraction. Proceedings of GSCL; 2009. pp. 31–40. [Google Scholar]
  4. Bowden DM, Martin RF. NeuroNames brain hierarchy. NeuroImage. 1995;2:63–83. doi: 10.1006/nimg.1995.1009. [DOI] [PubMed] [Google Scholar]
  5. Button KS, Ioannidis JPA, Mokrysz C, Nosek BA, Flint J, Robinson ESJ, Munafò MR. Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience. 2013;14:365–376. doi: 10.1038/nrn3475. [DOI] [PubMed] [Google Scholar]
  6. Cichocki A, Phan A-H. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences. 2009;E92-A:708–721. doi: 10.1587/transfun.E92.A.708. [DOI] [Google Scholar]
  7. Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R. Indexing by latent semantic analysis. Journal of the American Society for Information Science. 1990;41:391–407. doi: 10.1002/(SICI)1097-4571(199009)41:6&#x0003c;391::AID-ASI1&#x0003e;3.0.CO;2-9. [DOI] [Google Scholar]
  8. Desikan RS, Ségonne F, Fischl B, Quinn BT, Dickerson BC, Blacker D, Buckner RL, Dale AM, Maguire RP, Hyman BT, Albert MS, Killiany RJ. An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage. 2006;31:968–980. doi: 10.1016/j.neuroimage.2006.01.021. [DOI] [PubMed] [Google Scholar]
  9. Eickhoff SB, Laird AR, Grefkes C, Wang LE, Zilles K, Fox PT. Coordinate-based activation likelihood estimation meta-analysis of neuroimaging data: a random-effects approach based on empirical estimates of spatial uncertainty. Human Brain Mapping. 2009;30:2907–2926. doi: 10.1002/hbm.20718. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Fawcett T. An introduction to ROC analysis. Pattern Recognition Letters. 2006;27:861–874. doi: 10.1016/j.patrec.2005.10.010. [DOI] [Google Scholar]
  11. Gaonkar B, Davatzikos C. Deriving statistical significance maps for svm based image classification and group comparisons. International Conference on Medical Image Computing and Computer-Assisted Intervention; 2012. pp. 723–730. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Gardner D, Akil H, Ascoli GA, Bowden DM, Bug W, Donohue DE, Goldberg DH, Grafstein B, Grethe JS, Gupta A, Halavi M, Kennedy DN, Marenco L, Martone ME, Miller PL, Müller HM, Robert A, Shepherd GM, Sternberg PW, Van Essen DC, Williams RW. The neuroscience information framework: a data and knowledge environment for neuroscience. Neuroinformatics. 2008;6:149–160. doi: 10.1007/s12021-008-9024-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Humphries C, Binder JR, Medler DA, Liebenthal E. Syntactic and semantic modulation of neural activity during auditory sentence comprehension. Journal of Cognitive Neuroscience. 2006;18:665–679. doi: 10.1162/jocn.2006.18.4.665. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Kang J, Johnson TD, Nichols TE, Wager TD. Meta analysis of functional neuroimaging data via bayesian spatial point processes. Journal of the American Statistical Association. 2011;106:124–134. doi: 10.1198/jasa.2011.ap09735. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Laird AR, Lancaster JJ, Fox PT. Brainmap. Neuroinformatics. 2005;3:65–77. doi: 10.1385/NI:3:1:065. [DOI] [PubMed] [Google Scholar]
  16. Lancaster JL, Tordesillas-Gutiérrez D, Martinez M, Salinas F, Evans A, Zilles K, Mazziotta JC, Fox PT. Bias between MNI and talairach coordinates analyzed using the ICBM-152 brain template. Human Brain Mapping. 2007;28:1194–1205. doi: 10.1002/hbm.20345. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature. 1999;401:788–791. doi: 10.1038/44565. [DOI] [PubMed] [Google Scholar]
  18. Lipscomb CE. Medical subject headings (mesh) Bulletin of the Medical Library Association. 2000;88:265. [PMC free article] [PubMed] [Google Scholar]
  19. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems; 2013. pp. 3111–3119. [Google Scholar]
  20. Mitchell TM, Shinkareva SV, Carlson A, Chang KM, Malave VL, Mason RA, Just MA. Predicting human brain activity associated with the meanings of nouns. Science. 2008;320:1191–1195. doi: 10.1126/science.1152876. [DOI] [PubMed] [Google Scholar]
  21. Naselaris T, Kay KN, Nishimoto S, Gallant JL. Encoding and decoding in fMRI. NeuroImage. 2011;56:400–410. doi: 10.1016/j.neuroimage.2010.07.073. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Newell A. You can’t play 20 questions with nature and win: Projective comments on the papers of this symposium. Visual Information Processing: Proceedings of the Eighth Annual Carnegie Symposium on Cognition.1973. [Google Scholar]
  23. Nielsen FA, Hansen LK, Balslev D. Mining for associations between text and brain activation in a functional neuroimaging database. Neuroinformatics. 2004;2:369–380. doi: 10.1385/NI:2:4:369. [DOI] [PubMed] [Google Scholar]
  24. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V. Scikit-learn: machine learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830. [Google Scholar]
  25. Piantadosi ST. Zipf's word frequency law in natural language: a critical review and future directions. Psychonomic Bulletin & Review. 2014;21:1112–1130. doi: 10.3758/s13423-014-0585-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Pinho AL, Amadon A, Ruest T, Fabre M, Dohmatob E, Denghien I, Ginisty C, Becuwe-Desmidt S, Roger S, Laurier L, Joly-Testault V, Médiouni-Cloarec G, Doublé C, Martins B, Pinel P, Eger E, Varoquaux G, Pallier C, Dehaene S, Hertz-Pannier L, Thirion B. Individual brain charting, a high-resolution fMRI dataset for cognitive mapping. Scientific Data. 2018;5:180105. doi: 10.1038/sdata.2018.105. [DOI] [PMC free article] [PubMed] [Google Scholar]
27. Poldrack RA. Can cognitive processes be inferred from neuroimaging data? Trends in Cognitive Sciences. 2006;10:59–63. doi: 10.1016/j.tics.2005.12.004.
28. Poldrack RA. Subtraction and beyond: the logic of experimental designs for neuroimaging. In: Foundational Issues in Human Brain Mapping. 2010;147:14. doi: 10.7551/mitpress/9780262014021.003.0014.
29. Poldrack RA, Mumford JA, Nichols TE. Handbook of Functional MRI Data Analysis. Cambridge University Press; 2011.
30. Poldrack RA, Mumford JA, Schonberg T, Kalar D, Barman B, Yarkoni T. Discovering relations between mind, brain, and mental disorders using topic mapping. PLOS Computational Biology. 2012;8:e1002707. doi: 10.1371/journal.pcbi.1002707.
31. Poldrack RA, Baker CI, Durnez J, Gorgolewski KJ, Matthews PM, Munafò MR, Nichols TE, Poline JB, Vul E, Yarkoni T. Scanning the horizon: towards transparent and reproducible neuroimaging research. Nature Reviews Neuroscience. 2017;18:115–126. doi: 10.1038/nrn.2016.167.
32. Poldrack RA, Yarkoni T. From brain maps to cognitive ontologies: informatics and the search for mental structure. Annual Review of Psychology. 2016;67:587–612. doi: 10.1146/annurev-psych-122414-033729.
33. Rifkin RM, Lippert RA. Notes on Regularized Least Squares. MIT Press; 2007.
34. Rubin TN, Koyejo O, Gorgolewski KJ, Jones MN, Poldrack RA, Yarkoni T. Decoding brain activity using a large-scale probabilistic functional-anatomical atlas of human cognition. PLOS Computational Biology. 2017;13:e1005649. doi: 10.1371/journal.pcbi.1005649.
35. Salimi-Khorshidi G, Smith SM, Keltner JR, Wager TD, Nichols TE. Meta-analysis of neuroimaging data: a comparison of image-based and coordinate-based pooling of studies. NeuroImage. 2009;45:810–823. doi: 10.1016/j.neuroimage.2008.12.039.
36. Salton G, Buckley C. Term-weighting approaches in automatic text retrieval. Information Processing & Management. 1988;24:513–523. doi: 10.1016/0306-4573(88)90021-0.
37. Sayers E. Entrez Programming Utilities Help [Internet]. Bethesda, MD: National Center for Biotechnology Information (US); 2009. The e-utilities in-depth: parameters, syntax and more.
38. Scott DW. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons; 2015.
39. Silverman BW. Density Estimation for Statistics and Data Analysis. CRC Press; 1986.
40. Sitek EJ, Thompson JC, Craufurd D, Snowden JS. Unawareness of deficits in Huntington's disease. Journal of Huntington's Disease. 2014;3:125–135. doi: 10.3233/JHD-140109.
41. Turkeltaub PE, Eden GF, Jones KM, Zeffiro TA. Meta-analysis of the functional neuroanatomy of single-word reading: method and validation. NeuroImage. 2002;16:765–780. doi: 10.1006/nimg.2002.1131.
42. Turney PD, Pantel P. From frequency to meaning: vector space models of semantics. Journal of Artificial Intelligence Research. 2010;37:141–188. doi: 10.1613/jair.2934.
43. Ulrich R, Mattes S, Miller J. Donders's assumption of pure insertion: an evaluation on the basis of response dynamics. Acta Psychologica. 1999;102:43–76. doi: 10.1016/S0001-6918(99)00019-0.
44. Varoquaux G, Schwartz Y, Poldrack RA, Gauthier B, Bzdok D, Poline JB, Thirion B. Atlases of cognition with large-scale human brain mapping. PLOS Computational Biology. 2018;14:e1006565. doi: 10.1371/journal.pcbi.1006565.
45. Wager TD, Jonides J, Reading S. Neuroimaging studies of shifting attention: a meta-analysis. NeuroImage. 2004;22:1679–1693. doi: 10.1016/j.neuroimage.2004.03.052.
46. Wager TD, Lindquist M, Kaplan L. Meta-analysis of functional neuroimaging data: current and future directions. Social Cognitive and Affective Neuroscience. 2007;2:150–158. doi: 10.1093/scan/nsm015.
47. Wager TD, Kang J, Johnson TD, Nichols TE, Satpute AB, Barrett LF. A Bayesian model of category-specific emotional brain responses. PLOS Computational Biology. 2015;11:e1004066. doi: 10.1371/journal.pcbi.1004066.
48. Wager TD, Atlas LY, Botvinick MM, Chang LJ, Coghill RC, Davis KD, Iannetti GD, Poldrack RA, Shackman AJ, Yarkoni T. Pain in the ACC? PNAS. 2016;113:E2474–E2475. doi: 10.1073/pnas.1600282113.
49. Witt ST, Laird AR, Meyerand ME. Functional neuroimaging correlates of finger-tapping task variations: an ALE meta-analysis. NeuroImage. 2008;42:343–356. doi: 10.1016/j.neuroimage.2008.04.025.
50. Xue W, Kang J, Bowman FD, Wager TD, Guo J. Identifying functional co-activation patterns in neuroimaging studies via Poisson graphical models. Biometrics. 2014;70:812–822. doi: 10.1111/biom.12216.
51. Yarkoni T, Poldrack RA, Van Essen DC, Wager TD. Cognitive neuroscience 2.0: building a cumulative science of human brain function. Trends in Cognitive Sciences. 2010;14:489–496. doi: 10.1016/j.tics.2010.08.004.
52. Yarkoni T, Poldrack RA, Nichols TE, Van Essen DC, Wager TD. Large-scale automated synthesis of human functional neuroimaging data. Nature Methods. 2011;8:665–670. doi: 10.1038/nmeth.1635.
53. Yeo BT, Krienen FM, Eickhoff SB, Yaakub SN, Fox PT, Buckner RL, Asplund CL, Chee MW. Functional specialization and flexibility in human association cortex. Cerebral Cortex. 2015;25:3654–3672. doi: 10.1093/cercor/bhu217.

Decision letter

Editor: Thomas Yeo1
Reviewed by: Tor D Wager2

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

This paper describes NeuroQuery, a new approach to automated meta-analysis of the neuroimaging literature. It is demonstrably superior for many purposes, particularly as a starting point for constraining predictive models and for defining regions or patterns of interest in new studies. We believe that this will be a very widely used tool. It's also a tremendous amount of work to build and validate, and very few groups have both the skills and motivation to build this and make it accessible to the broad neuroscience/neuroimaging community.

Decision letter after peer review:

Thank you for submitting your article "NeuroQuery: comprehensive meta-analysis of human brain mapping" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Christian Büchel as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Tor D. Wager (Reviewer #1).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

This paper describes NeuroQuery, a new approach to automated meta-analysis of the neuroimaging literature. It is demonstrably superior for many purposes – particularly as a starting point for constraining predictive models and for defining regions or patterns of interest in new studies. We believe that this will be a very widely used, and cited, tool. It's also a tremendous amount of work to build and validate, and very few groups have both the skills and motivation to build this and make it accessible to the broad neuroscience/neuroimaging community.

There are a number of exemplary features of this work, including:

– A full implementation that is openly shared on Github, including source code, models, and data

– A fully functional web interface that runs simply and beautifully

– Several types of validation of the model's performance, including (a) stability with rare terms listed in few studies, (b) reproducibility of maps with limited data, (c) better encoding/prediction of brain maps associated with terms.

This work dramatically expands the vocabulary of search terms that can be used in neuroimaging meta-analyses. We believe this is going to be really useful. The model framework is also interesting and motivated by some careful considerations in terms of how the model should be affected by certain classes of rare terms, features included in the model, etc. The bottom line for users is a better set of predictive maps. While there are many potential varieties of such models and one could second-guess some particular choices, the beauty of the authors' approach is that the code is available for others to try out variations and build/validate a different model based on their sensibilities and preferences.

Essential revisions:

1) We believe that the validations are insufficient for some of the claims. The authors should either expand their experiments to validate these claims or tone down their claims.

A) "NeuroQuery, can assemble results from the literature into a brain map based on an arbitrary query." This statement can be mis-interpreted to mean that users can enter any terms they want. My understanding is that the query has to comprise words from the 7547 dictionary words. Words outside the dictionary are ignored.

B) Introduction: "For example, some rare diseases, or a task involving a combination of mental processes that have been studied separately, but never – or rarely – together." This suggests NeuroQuery can do this with precision, but the authors have not experimentally demonstrated that activation maps involving "combination of mental processes that have been studied separately, but never together" can be accurately predicted. To validate this, the authors should consider cases involving compound mental processes (e.g., "auditory working memory") and remove all experiments involving both "auditory" + "working memory". Care has to be taken to also remove experiments using words similar to "auditory" + "working memory", so that experiments such as "auditory n-back" are also removed, even though the experiments did not explicitly use the term "working memory". Note that the entire experiments should be removed, rather than just the specific words. The authors can then re-train their model and show whether the query "auditory working memory" yields activation similar to the removed experiments involving auditory working memory.

C) For subsection “NeuroQuery can map rare or difficult concepts”, it's important to differentiate between a concept that is rarely studied and a rare word that seldom appears in the literature even though variations of the word might be widely studied. The whole section seems to imply that NeuroQuery is able to accomplish the former very well, but in the experiment, they chose frequent terms (e.g., language) and progressively deleted them from their dataset, thus they are really testing the latter. To really test the former, rather than just deleting the word "language" from the documents, they should delete entire documents containing "language" and/or other terms similar to language (e.g., semantic).

2) Figure 8 (right panel) should be in the main text in addition to (or replacing) Figure 6. While the log likelihood measure is valid, the "absolute" measure is much more helpful to the users of knowing how much to trust the results. Along this note, it is somewhat concerning that the overall accuracy is only 70% – how much should a user trust a tool with a 70% accuracy? However, this is perhaps underselling NeuroQuery because coordinates tend to be sparse, so just based on the reported coordinates, this classification task might simply be very hard. What might be more useful would be for the user to get a sense of how much the NeuroQuery maps actually match real activation maps. Can the authors perform the same experiment, but using real activation maps from NeuroVault or their own Individual Brain Charting dataset? This is just a suggestion. The authors can choose to just discuss this issue.

3) Results section: some pieces of argumentation presented here are not fully convincing. The authors propose that: "Term co-occurrences, on the other hand, are more consistent and reliable, and capture semantic relationships [Turney and Pantel, 2010]". Most of the brain mapping literature is made of attempts to differentiate cognitive processes inferred from the study of human behaviour, such as "proactive control VS retroactive control", "recollection VS familiarity", "face perception VS place perception", "positive emotion VS negative emotion". In that context, it is unclear how term co-occurrence, for example between "face" and "place", would be more consistent and reliable than "face" alone. Term co-occurrence mainly reflects what types of concepts tend to be studied together.

4) The authors suggest that "It would require hundreds of studies to detect a peak activation pattern for "aphasia". It is not clear what the authors mean here. Aphasia is a clinical construct referring to a behavioural disturbance. It is defined as "an inability (or impaired ability) to produce or understand speech". Accordingly, we don't see how one could search for a peak of activation related to aphasia. The line of argumentation could be clarified here by starting with concrete examples. Perhaps an example the authors may refer to is that, starting from a clinical point of view, researchers may want to investigate brain activation related to the processes of language production in order to better understand dysfunctions such as those seen in aphasia. Accordingly, the related publication will probably mention "aphasia", "language production", "phonological output lexicon", etc., and this pattern of terms could in turn be linked to similar language related terms in other publications. An alternative guess is that maybe the authors actually refer here to lesion studies that have, for example, searched for correlation between grey matter volume atrophy and aphasia.

5) We have some suggestions for improving the Abstract/Introduction

A) Explaining how a predictive framework allows maps to be generated by generalizing from terms that are well studied ("faces") to those that are less well studied and inaccessible to traditional meta-analyses ("prosopagnosia"). And explaining that this is good for some use cases (generating sensible ROIs/making guesses about where future studies of prosopagnosia might activate), but not good for others (e.g., knowing that a particular area is (de)activated in studies of prosopagnosia).

B) Can you provide a bit more context on previous topic models (e.g., Nielsen/Hansen NNMF "bag of words" work from way back, Poldrack/Yarkoni topic models, GCLDA) and how this approach is different. After reading the Introduction, readers have a sense of what Neuroquery does (its goal), but not how it actually does it (no algorithmic insight). For example, Neuroquery "infers semantic similarities across terms used in the literature", but so does a topic model (which is not mentioned/cited). We recommend including more description early on of what the algorithmic differences are relative to previous work that confer advantages. This is explained later (with some redundancy across sections), but more up front would be helpful.

C) In the Introduction, the authors argue that meta-analyses are limited primarily by performance of in-sample statistical inference (they "model all studies asking the same question") and lack of use of a predictive framework. While there is great value in the current work, we don't think that is strictly true. Other meta-analyses have taken a predictive approach, in a limited way, and also modeled differences across study categories. The Naive Bayes classifier analyses in the Neurosynth paper (Yarkoni, 2011) are an example. And Neurosynth considers topics across tasks and fields of study. Analyzing studies by single search terms is an important limitation (topic models by Poldrack and Yarkoni help; perhaps acknowledge their 2014 paper). Viewing the paper as a whole, we understand why the authors focus on prediction vs. inference in the Introduction, but it's hard to understand how this other work fits in without having gone through the whole paper.

D) Likewise, the emphasis on the ability to interpolate struck us as odd. We understand this as generalizing based on semantic smoothing. This is useful, but we don't think what you meant was really clearly explained in the Introduction.

E) The authors make a compelling case that current meta-analyses are limited. The Pinho 2018 example is very helpful: An automated meta-analysis of all the studies performing the same or a similar contrast is not currently possible, primarily because we have lacked the tools to parse the natural language text in articles well enough to identify an "equivalence class" of functional contrasts.

F) "For example, some rare diseases, or a task involving a combination of mental processes that have been studied separately, but never – or rarely – together." Could the authors provide a concrete example?

6) We have some suggestions for improving the clarity of the Results section:

A) “NeuroQuery relies on a corpus of 13459 full-text neuroimaging publications”

What type of neuroimaging publications? Task-based activation experiments? If so, of which type: PET and fMRI, or only fMRI-based? Does it also include structural MRI studies? Does it include all types of populations (healthy adults, elderly, patient studies)?

B) It is not clear how to read Figure 1. What do the length and width of the colored and grey lines represent? Could this figure be elaborated to integrate additional aspects of the procedure? For example, each of the three brain slices has a color that seems to reflect its respective association with the term; how is all that information combined into a single brain pattern?

C) Terminological precision: Results paragraph one: the term "area" is usually reserved for brain territories defined based on microstructure features (e.g. Area 4). Here the authors refer to a spatial location in the brain (or maybe a zone of homogeneity with regards to a specific feature), so they should prefer the term "brain region" for that purpose.

D) Finally, the supplementary material provides several pieces of quantitative evidence of significant improvement compared to NeuroSynth, such as:

– "our method extracted false coordinates from fewer articles: 3 / 40 articles have at least one false location in our dataset, against 20 for NeuroSynth"

– "In terms of raw amount of text, this corpus is 20 times larger than NeuroSynth's"

– "Compared with NeuroSynth, NeuroQuery's extraction method reduced the number of articles with incorrect coordinates (false positives) by a factor of 7, and the number of articles with missing coordinates (false negatives) by a factor of 3 (Table 2)."

All these aspects could be summarized together with the more qualitative aspects in a table to emphasize the significant improvements over Neurosynth.

7) In the Discussion, it would be very helpful to give readers some intuition for when NeuroQuery will not yield sensible results and when/how exactly it should be used (e.g., even a table of use cases that would be appropriate and inappropriate) – and how to interpret carefully (e.g., look at the semantic loadings, and if there is one anatomical term that dominates, realize that you're essentially getting a map for that brain region). The ADHD example is useful but doesn't really cover the space of principles/use cases. Here are some possible examples we have thought of:

A) Some particular limitations may arise from the predictive nature of NeuroQuery, which may be less intuitive to many readers. For example, if I put in "aphasia", I will get a map for "language", because aphasia is semantically close to language. This is very sensible, but users should not, of course, take this as a map of "aphasia" to be related to other terms and used in inference. Users might "discover" that aphasia patterns are very closely related to "language" patterns and make an inference about co-localization of healthy and abnormal function. Of course, it's not your responsibility to control all kinds of potential misuse. But pointers would be helpful to avoid another, e.g., "#cingulategate" (Lieberman et al., 2016 PNAS).

B) For example, let's consider again the case of "combination of mental processes that have been studied separately, but never together". From my understanding of the algorithm, suppose users query "auditory" + "working memory"; the prediction will basically be a linear combination of activation maps from "auditory" and "working memory" (+ similar terms due to the smoothing/query expansion). As such, this assumes that compound mental processes yield activations that are linear combinations of the activation maps of individual mental processes. This should be made clear.

C) Playing around with NeuroQuery, there are some queries that generate obviously wrong results. For example, "autobiographical memory" should probably yield the default network, but we get hippocampal/retrosplenial activation instead. This presumably happens because NeuroQuery "expanded" the query to become "memory", because "autobiographical memory" is not one of the 200 keywords? It is interesting that NeuroSynth does get it correct (https://neurosynth.org/analyses/terms/autobiographical%20memory/).

D) Perhaps a brief discussion of other limitations would be helpful. We submit that some of the fundamental problems are those not easily solved – that we usually perform meta-analyses based on studies of the same nominal task type (e.g., N-back), and sometimes minor variations in task structure can yield divergent findings. We don't know what all the dimensions are yet. This problem goes far beyond the challenge of establishing a set of consensus labels for task types and relevant cognitive processes. In short, we don't really even know what task features to label yet in many cases, and they don't combine additively. A stop-signal task with one adaptive random walk may be different than one with four, as it allows a different type of cognitive strategy.

E) When to use it: For common terms, meta-analysis (e.g., Neurosynth) does very well (e.g., Figure 4). When would the authors recommend using NeuroQuery over another meta-analytic tool? Maybe they could provide a summary of use cases and conditions (e.g., when few studies of a term/topic are available). Also see point (C).

8) Discussion of other approaches: We submit that the field has become tracked into a relatively narrow space of the possible options and techniques for meta-analysis, based on local analysis of coordinates in MKDA/ALE. Alternatives could be mentioned as potential future directions. For example, early work explored clustering of spatial locations and spatial discriminant tests (e.g., Wager et al., 2002, 2004, 2005), and later work has explored spatial models (e.g., Kang, 2011, Kang, 2014, Wager, 2015) and more advanced co-activation models (Xue, 2014). While this is obviously beyond the scope of the present paper, future work might consider models of spatial co-activation when generating predictions and inferences about meta-analytic maps.

9) The methods are a bit unclear:

A) Significant details about methodologies are missing. How are term frequency and inverse document frequency computed?

B) Equation 5: How is σ^ij computed? Square root of the entries of equation 8?

C) Equation 5: What is the difference between σ^i and σ^ij? σ^i is a column/row of σ^ij?

D) Yj is the j-th column not i-th column of Y?

E) M is a v x n matrix, so equation 8 is a v x v matrix? We are confused how this maps to σ^ij.

F) In Equation 10, is ||U||_1 just the sum of absolute values of all entries in U?

G) What is k set to be in equation 10?

H) "More than 72% of the time, NeuroQuery's output has a higher Pearson correlation with the correct map than with the negative example" – "correct map" refers to the KDE density maps?

10) Subsection “Smoothing: regularization at test time” is hard to read. It would be helpful if the authors explain the intuition behind the different steps and what the different matrices represent. For example, it might be helpful to explain that each row of V can be thought of as the association of words with a topic, so a higher value in row k, column j suggests the j-th dictionary word is more strongly related to topic k. As another example, the authors should also unpack subsection “Smoothing: regularization at test time”: why do we take V, scale it with n_{i,i}, then compute C, then l1-normalize the rows of C to produce T, and finally S? What does each step try to do? We guess that, roughly speaking, VV^T is like a co-occurrence matrix (how likely are two words to appear together?), but we are not sure why we have to do the extra normalization with n_{i,i}, the l1-normalization, etc.

eLife. 2020 Mar 4;9:e53385. doi: 10.7554/eLife.53385.sa2

Author response


Essential revisions:

1) We believe that the validations are insufficient for some of the claims. The authors should either expand their experiments to validate these claims or tone down their claims.

We have both added evidence, and toned down claims where they were not justified.

A) "NeuroQuery, can assemble results from the literature into a brain map based on an arbitrary query." This statement can be mis-interpreted to mean that users can enter any terms they want. My understanding is that the query has to comprise words from the 7547 dictionary words. Words outside the dictionary are ignored.

Indeed, we made this statement more precise.

B) Introduction: "For example, some rare diseases, or a task involving a combination of mental processes that have been studied separately, but never – or rarely – together." This suggests NeuroQuery can do this with precision, but the authors have not experimentally demonstrated that activation maps involving "combination of mental processes that have been studied separately, but never together" can be accurately predicted. To validate this, the authors should consider cases involving compound mental processes (e.g., "auditory working memory") and remove all experiments involving both "auditory" + "working memory". Care has to be taken to also remove experiments using words similar to "auditory" + "working memory", so that experiments such as "auditory n-back" are also removed, even though the experiments did not explicitly use the term "working memory". Note that the entire experiments should be removed, rather than just the specific words. The authors can then re-train their model and show whether the query "auditory working memory" yields activation similar to the removed experiments involving auditory working memory.

We thank the reviewers for outlining this aspect of the contribution that was not sufficiently validated. We have added a section (“NeuroQuery can map new combinations of concepts”) with new experiments to showcase that NeuroQuery indeed can map unseen combinations of terms, following the reviewers’ suggestion to fit NeuroQuery on a corpus excluding the studies of the combination of two given terms. We have first shown the maps for qualitative assessment on the combination of "color" and "distance" (Figure 3). We have also performed a quantitative assessment leaving random pairs of terms unseen (also section “NeuroQuery can map new combinations of concepts”). When choosing the random pairs of terms, we did not remove related words, as this is very difficult to do in an automated way; in particular, it runs the risk of depleting the corpus by removing too many studies. However, for the specific case of "color" and "distance", we do not see any clear synonyms or alternative way of referring to the same exact mental process in our vocabulary. Hence we feel that the corresponding figure addresses the reviewers’ suggestion.
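
As a rough sketch of the hold-out procedure described above (not the actual NeuroQuery code; the corpus format and variable names are hypothetical), the exclusion step can be thought of as follows:

    import re

    def split_corpus(documents, term_a, term_b):
        """Separate studies mentioning both terms from the remaining corpus.

        `documents` is assumed to be a list of dicts with a "text" field;
        this is an illustrative format, not NeuroQuery's internal one.
        """
        pattern_a = re.compile(r"\b" + re.escape(term_a) + r"\b", re.IGNORECASE)
        pattern_b = re.compile(r"\b" + re.escape(term_b) + r"\b", re.IGNORECASE)
        held_out, training = [], []
        for doc in documents:
            if pattern_a.search(doc["text"]) and pattern_b.search(doc["text"]):
                held_out.append(doc)
            else:
                training.append(doc)
        return training, held_out

    # Example: hold out every study mentioning both "color" and "distance",
    # re-fit the model on `training`, then query "color distance".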

C) For subsection “NeuroQuery can map rare or difficult concepts”, it's important to differentiate between a concept that is rarely studied and a rare word that seldom appears in the literature even though variations of the word might be widely studied. The whole section seems to imply that NeuroQuery is able to accomplish the former very well, but in the experiment, they chose frequent terms (e.g., language) and progressively deleted them from their dataset, thus they are really testing the latter. To really test the former, rather than just deleting the word "language" from the documents, they should delete entire documents containing "language" and/or other terms similar to language (e.g., semantic).

We have added at the end of subsection “NeuroQuery can map rare or difficult concepts” a paragraph making explicit the difference between these two aspects. We explain how NeuroQuery also improves for seldom-studied concepts. However, we can only present face validity as evidence for this claim, for lack of solid ground truth on these terms. We have added maps for "color" and "Huntington" in Figure 4: these terms are rare (the number of occurrences is shown in Figure 4), and they do not have synonyms in our vocabulary; the quality of the maps that they produce is therefore evidence that NeuroQuery maps rare concepts.

2) Figure 8 (right panel) should be in the main text in addition to (or replacing) Figure 6. While the log likelihood measure is valid, the "absolute" measure is much more helpful to the users of knowing how much to trust the results. Along this note, it is somewhat concerning that the overall accuracy is only 70% – how much should a user trust a tool with a 70% accuracy? However, this is perhaps underselling NeuroQuery because coordinates tend to be sparse, so just based on the reported coordinates, this classification task might simply be very hard. What might be more useful would be for the user to get a sense of how much the NeuroQuery maps actually match real activation maps. Can the authors perform the same experiment, but using real activation maps from NeuroVault or their own Individual Brain Charting dataset? This is just a suggestion. The authors can choose to just discuss this issue.

We have moved the "mix and match" results to the main manuscript. While the reviewer finds that 72% accuracy in distinguishing studies is a disappointing result, we fear that it might not be a limitation of the statistical modeling, but rather a measure of noise in reported results. Indeed, the amount of variability in reported results can be gauged visually with the NeuroQuery online interface: for a given query, the coordinates reported in the most relevant publications can be seen by hovering over the list, below the predicted map. Such an experiment shows great variability across publications on similar topics.

We thank the reviewers for suggesting that we compare NeuroQuery predictions to real activation maps. We have addressed it in subsection “NeuroQuery maps reflect well other meta-analytic maps”. We considered using the IBC dataset as suggested. IBC activation maps do not suffer from the sparsity of peak coordinates. However, like many individual fMRI maps, they are affected by a substantial amount of noise.

To confront NeuroQuery to a more reliable ground truth, we therefore preferred to rely on an Image-Based Meta-Analysis study. We compared NeuroQuery predictions to maps of the BrainPedia study, published last year by some authors of the current submission. This dataset is interesting because it covers a wide variety of cognitive concepts, the studies included in the meta-analysis were chosen manually, and the meta-analytic maps were carefully labelled. The maps and labels from this publication were uploaded to NeuroVault in December, 2018.

When reformulating the BrainPedia labels to match NeuroSynth’s vocabulary, both NeuroSynth and NeuroQuery matched the IBMA results well, with median AUCs of 0.8 and 0.9 respectively. Importantly, while reformulating the labels is necessary to obtain results from NeuroSynth, NeuroQuery coped well with the original, uncurated labels, obtaining a median AUC of 0.8. This illustrates NeuroQuery’s capacity to handle less restricted queries, which enables it to be integrated into more automated pipelines.
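
For readers who want to reproduce this kind of comparison, a minimal sketch of the AUC computation is given below (variable names are illustrative; the reference map could be a binarized IBMA map or an atlas region, resampled to the same voxel grid as the prediction):

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def map_auc(predicted_map, reference_map, threshold=0.0):
        """AUC of predicted voxel intensities against a binarized reference map.

        Both inputs are 1D arrays of voxel values on a common grid/mask.
        """
        labels = (np.asarray(reference_map) > threshold).astype(int)
        return roc_auc_score(labels, np.asarray(predicted_map))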

Although they come from an IBMA of manually selected studies, BrainPedia maps still contain some noise. To perform a sanity check relying on an excellent ground truth, we obtained maps from NeuroSynth and NeuroQuery for labels of the Harvard-Oxford structural atlases. The CBMA maps matched well the regions manually segmented by experts, with a median AUC above 0.9.

Finally, we compared NeuroQuery predictions to NeuroSynth activations for terms that NeuroSynth captures well (the 200 words that produce the largest number of activations with NeuroSynth). We found that the results of NeuroQuery and NeuroSynth for these terms were close, with a median AUC of 0.9. Results from these experiments are presented in subsection “NeuroQuery maps reflect well other meta-analytic maps”.

3) Results section: some pieces of argumentation presented here are not fully convincing. The authors propose that: "Term co-occurrences, on the other hand, are more consistent and reliable, and capture semantic relationships [Turney and Pantel, 2010]". Most of the brain mapping literature is made of attempts to differentiate cognitive processes inferred from the study of human behaviour, such as "proactive control VS retroactive control", "recollection VS familiarity", "face perception VS place perception", "positive emotion VS negative emotion". In that context, it is unclear how term co-occurrence, for example between "face" and "place", would be more consistent and reliable than "face" alone. Term co-occurrence mainly reflects what types of concepts tend to be studied together.

An important aspect of the semantic techniques that we use is that they give a continuous measure of relatedness. For this reason, we think that the use of oppositions in neuroimaging is not a fundamental roadblock: terms that are often studied in opposition are different, but they are related in the bigger picture of cognition. All the examples listed by the reviewer are pairs drawn from the same subfield of cognition, which would be encoded as a specific branch in a cognitive ontology. The co-occurrences thus help meta-analysis: "recollection" is closer to "familiarity" than to "positive emotion". Yet, research in distributional semantics (which led, for instance, to the famous "word2vec" model) has shown that they will have a less strong co-occurrence than exact synonyms, such as "acc" and "anterior cingulate cortex". We have added these considerations in the middle of subsection “Overview of the NeuroQuery model”: "The strength of these relationships encode semantic proximity, from very strong for synonyms that occur in statistically identical contexts, to weaker for different yet related mental processes that are often studied one opposed to the other."
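
To make the notion of graded relatedness concrete, here is a toy illustration (with entirely made-up counts) of how co-occurrence profiles yield a continuous similarity that is higher for closely related terms than for terms from a different subfield:

    import numpy as np

    # Rows: terms; columns: co-occurrence counts with four context words.
    # The counts are invented purely for illustration.
    terms = ["recollection", "familiarity", "positive emotion"]
    cooc = np.array([[12., 8., 1., 0.],
                     [10., 9., 0., 1.],
                     [0., 1., 14., 11.]])

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(cooc[0], cooc[1]))  # recollection vs familiarity: high
    print(cosine(cooc[0], cooc[2]))  # recollection vs positive emotion: low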

4) The authors suggest that "It would require hundreds of studies to detect a peak activation pattern for "aphasia". It is not clear what the authors mean here. Aphasia is a clinical construct referring to a behavioural disturbance. It is defined as "an inability (or impaired ability) to produce or understand speech". Accordingly, we don't see how one could search for a peak of activation related to aphasia. The line of argumentation could be clarified here by starting with concrete examples. Perhaps an example the authors may refer to is that, starting from a clinical point of view, researchers may want to investigate brain activation related to the processes of language production in order to better understand dysfunctions such as those seen in aphasia. Accordingly, the related publication will probably mention "aphasia", "language production", "phonological output lexicon", etc., and this pattern of terms could in turn be linked to similar language related terms in other publications. An alternative guess is that maybe the authors actually refer here to lesion studies that have, for example, searched for correlation between grey matter volume atrophy and aphasia.

Indeed, the formulation here was not suitable, as "aphasia" is not a mental process but a pathology, and thus not mapped by brain activations. We changed the sentence to "It would require hundreds of studies to detect a pattern in localizations reported for 'aphasia', as one would appear in lesion studies. But with the text of a few publications we notice that it often appears close to 'language', which is indeed a related mental process". We chose to do a light rewording rather than discuss in depth how pathologies are captured in the brain-imaging literature: via lesions, but also by studying related mental processes, or simulated with transcranial magnetic stimulation. Indeed, we feel that this is not central to the topic of the manuscript.

5) We have some suggestions for improving the Abstract/Introduction

A) Explaining how a predictive framework allows maps to be generated by generalizing from terms that are well studied ("faces") to those that are less well studied and inaccessible to traditional meta-analyses ("prosopagnosia"). And explaining that this is good for some use cases (generating sensible ROIs/making guesses about where future studies of prosopagnosia might activate), but not good for others (e.g., knowing that a particular area is (de)activated in studies of prosopagnosia).

Indeed, these are good suggestions, that we added to the Introduction.

B) Can you provide a bit more context on previous topic models (e.g., Nielsen/Hansen NNMF "bag of words" work from way back, Poldrack/Yarkoni topic models, GCLDA) and how this approach is different. After reading the Introduction, readers have a sense of what Neuroquery does (its goal), but not how it actually does it (no algorithmic insight). For example, Neuroquery "infers semantic similarities across terms used in the literature", but so does a topic model (which is not mentioned/cited). We recommend including more description early on of what the algorithmic differences are relative to previous work that confer advantages. This is explained later (with some redundancy across sections), but more up front would be helpful.

The goal of a topic model is not to infer semantic similarity but to extract latent factors. This is a subtle difference, as many of the techniques are the same (though recent work in distributional semantics departs more and more from topic modeling); however, this is the reason why we did not mention topic modeling. We have added a section on related work in the Discussion that stresses the links and the differences to prior art. We would like our manuscript to be appealing to users of NeuroQuery, and not only to experts of meta-analysis or text mining, hence we chose to keep technicalities outside of the Introduction. Nevertheless, we added a mention of supervised learning in the Introduction to give the important keywords to the expert reader.

C) In the Introduction, the authors argue that meta-analyses are limited primarily by performance of in-sample statistical inference (they "model all studies asking the same question") and lack of use of a predictive framework. While there is great value in the current work, we don't think that is strictly true. Other meta-analyses have taken a predictive approach, in a limited way, and also modeled differences across study categories. The Naive Bayes classifier analyses in the Neurosynth paper (Yarkoni, 2011) are an example. And Neurosynth considers topics across tasks and fields of study. Analyzing studies by single search terms is an important limitation (topic models by Poldrack and Yarkoni help; perhaps acknowledge their 2014 paper). Viewing the paper as a whole, we understand why the authors focus on prediction vs. inference in the Introduction, but it's hard to understand how this other work fits in without having gone through the whole paper.

To better discuss the links with prior art, we have added a “Related work” section. Indeed, the reviewer is right that ingredients of our approach have been used for meta-analysis, and yet not in the way our work combines them. The predictive model of NeuroSynth is a decoding model: it predicts (mutually exclusive) terms from brain activations, while we are doing the converse. As a result, it cannot combine information across the terms, which is the crucial aspect of our predictive framework. The NeuroSynth maps are not based on the predictions of a model, but on the ingredients of a model. Topic models have also been used (we believe the reviewer refers to the 2012 Poldrack paper), but again the prior art uses the model parameters to derive conclusions. Using predictions enables extrapolations, as demonstrated in our manuscript. In addition, the quality of these predictions can be directly assessed, as in the experiments that we perform.

D) Likewise, the emphasis on the ability to interpolate struck us as odd. We understand this as generalizing based on semantic smoothing. This is useful, but we don't think what you meant was really clearly explained in the Introduction.

We have reworked the part of the Introduction that mentioned interpolation to be more clearly related to the aspects that are important for our work: "They cannot model nuances across studies because they rely on in-sample statistical inference and are not designed to interpolate between studies that address related but different questions, or make predictions for unseen combinations of mental processes."

E) The authors make a compelling case that current meta-analyses are limited. The Pinho 2018 example is very helpful: An automated meta-analysis of all the studies performing the same or a similar contrast is not currently possible, primarily because we have lacked the tools to parse the natural language text in articles well enough to identify an "equivalence class" of functional contrasts.

We thank the reviewers for this elegant phrasing. We have taken the liberty to add an adapted version in the Introduction, at the beginning of the paragraph on the Pinho, 2018 example.

F) "For example, some rare diseases, or a task involving a combination of mental processes that have been studied separately, but never – or rarely – together." Could the authors provide a concrete example?

We have added the specific case of "Huntington" and "agnosia": "For instance, there is evidence of agnosia in Huntington’s disease [sitek2014], but no neuroimaging study". Indeed, no publications in the dataset contain both terms, i.e., no existing imaging study, but it’s not meaningless as a subject of interest (see https://www.ncbi.nlm.nih.gov/pubmed/25062855).

6) We have some suggestions for improving the clarity of the Results section:

A) “NeuroQuery relies on a corpus of 13459 full-text neuroimaging publications”

What type of neuroimaging publications? Task-based activation experiments? If so, of which type: PET and fMRI, or only fMRI-based? Does it also include structural MRI studies? Does it include all types of populations (healthy adults, elderly, patient studies)?

We added a short paragraph to clarify this point: "The articles are selected by querying the ESearch Entrez utility either for specific neuroimaging journals or with query strings such as 'fMRI'. The resulting studies are mostly based on fMRI experiments, but the dataset also contains PET or structural MRI studies. It contains studies about diverse types of populations: healthy adults, patients, elderly, children."
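
As an illustration of this kind of corpus selection (a hedged sketch against the public E-utilities endpoint, not the exact queries used to build the NeuroQuery corpus):

    import requests

    ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {"db": "pubmed", "term": "fMRI", "retmax": 100, "retmode": "json"}
    response = requests.get(ESEARCH, params=params)
    response.raise_for_status()
    pmids = response.json()["esearchresult"]["idlist"]  # PubMed IDs matching the query
    print(len(pmids), "PubMed IDs retrieved")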

B) It is not clear how to read Figure 1. What do the length and width of the colored and grey lines represent? Could this figure be elaborated to integrate additional aspects of the procedure? For example, each of the three brain slices has a color that seems to reflect its respective association with the term; how is all that information combined into a single brain pattern?

We have clarified and expanded the caption of Figure 1, after "Details: …". As is now explained in the caption, the length of the lines in the first graph shows the semantic similarity. The length of the lines in the second graph shows the weight of each term in the final prediction (the coefficient of its normalized brain map in the linear combination that produces the final prediction). The width is the same for all bars within each graph and carries no special meaning. The brain maps associated with each colored term are combined linearly, with the weights shown on the second graph, to form the final prediction shown on the far right. As this figure is meant to serve as a high-level, schematic summary of the main steps followed by NeuroQuery, we prefer not to complicate it further by integrating additional aspects of the procedure. Indeed, the exact details of the procedure are described at length in subsection “Overview of the NeuroQuery model”.
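
The combination step can be summarized numerically as follows (shapes and weights are placeholders for illustration; in the actual model the weights are derived from the smoothed representation of the query):

    import numpy as np

    n_terms, n_voxels = 3, 28542                    # illustrative sizes only
    term_maps = np.random.rand(n_terms, n_voxels)   # normalized map for each selected term
    weights = np.array([0.6, 0.3, 0.1])             # weights shown in the second graph
    predicted_map = weights @ term_maps             # final map shown on the far right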

C) Terminological precision: Results paragraph one: the term "area" is usually reserved for brain territories defined based on microstructure features (e.g. Area 4). Here the authors refer to a spatial location in the brain (or maybe a zone of homogeneity with regards to a specific feature), so they should prefer the term "brain region" for that purpose.

We thank the reviewers for this correction and replaced "area" with "region" in two phrases: "which brain regions are most likely to be observed in a study."; "NeuroQuery is a statistical model that identifies brain regions".

D) Finally, the supplementary material provides several pieces of quantitative evidence of significant improvement compared to NeuroSynth, such as:

– "our method extracted false coordinates from fewer articles: 3 / 40 articles have at least one false location in our dataset, against 20 for NeuroSynth"

– "In terms of raw amount of text, this corpus is 20 times larger than NeuroSynth's"

– "Compared with NeuroSynth, NeuroQuery's extraction517 method reduced the number of articles with incorrect coordinates (false positives) by a factor of 7, and the number of articles with missing coordinates (false negatives) by a factor of 3 (Table 2)."

All these aspects could be summarized together with the more qualitative aspects in a table to emphasize the significant improvements over Neurosynth.

We have added Table 3 and referenced it in subsection “NeuroQuery maps reflect well other meta-analytic maps” on NeuroQuery as an openly available resource.

7) In the Discussion, it would be very helpful to give readers some intuition for when NeuroQuery will not yield sensible results and when/how exactly it should be used (e.g., even a table of use cases that would be appropriate and inappropriate) – and how to interpret carefully (e.g., look at the semantic loadings, and if there is one anatomical term that dominates, realize that you're essentially getting a map for that brain region). The ADHD example is useful but doesn't really cover the space of principles/use cases. Here are some possible examples we have thought of:

A) Some particular limitations may arise from the predictive nature of NeuroQuery, which may be less intuitive to many readers. For example, if I put in "aphasia", I will get a map for "language", because aphasia is semantically close to language. This is very sensible, but users should not, of course, take this as a map of "aphasia" to be related to other terms and used in inference. Users might "discover" that aphasia patterns are very closely related to "language" patterns and make an inference about co-localization of healthy and abnormal function. Of course, it's not your responsibility to control all kinds of potential misuse. But pointers would be helpful to avoid another, e.g., "#cingulategate" (Lieberman et al., 2016 PNAS).

We have added, in the limitation paragraph, a discussion of the possible failures of NeuroQuery, as well as indications on how these failures can be detected from the user interface.

We have added a lengthy paragraph in the Discussion (beginning of subsection “Usage recommendations and limitations”) on limitations, going into the various reasons why a brain map predicted by NeuroQuery may not match the expected map. This paragraph basically explains how to craft queries to break the tool. We do hope that, given this power, the reviewers will still appreciate that the average prediction for a typical query is good.

With regard to the bigger picture mentioned by the reviewers, we feel that misuse of meta-analysis, as in the "cingulategate", arises from a lack of understanding of the limitations of meta-analysis. Hence, we added a broad picture section (Conclusion), discussing these limitations and explicitly positioning NeuroQuery with regard to them. This paragraph gives a broader picture than the rest of the manuscript, and we hope that it does not feel off topic.

To avoid another "cingulategate", we also chose to be very explicit in the Discussion: "A NeuroQuery prediction by itself therefore does not support definite conclusions", and later: "What NeuroQuery does not do is provide conclusive evidence that a brain region is recruited by a mental process, or affected by a pathology".

B) For example, let's consider again the case of "combination of mental processes that have been studied separately, but never together". From my understanding of the algorithm, suppose users query "auditory" + "working memory"; the prediction will basically be a linear combination of activation maps from "auditory" and "working memory" (+ similar terms due to the smoothing/query expansion). As such, this assumes that compound mental processes yield activations that are linear combinations of the activation maps of individual mental processes. This should be made clear.

Indeed, this is our assumption, though it is a common one in functional neuroimaging (often referred to as the "pure insertion" hypothesis). We have made it very explicit in our own limitations paragraph and have exhibited a situation where it leads to an undesirable result.

C) Playing around with NeuroQuery, there are some queries that generate obviously wrong results. For example, "autobiographical memory" should probably yield the default network, but we get hippocampal/retrosplenial activation instead. This presumably happens because NeuroQuery "expanded" the query to become "memory", because "autobiographical memory" is not one of the 200 keywords? It is interesting that NeuroSynth does get it correct (https://neurosynth.org/analyses/terms/autobiographical%20memory/).

NeuroQuery’s map is actually quite close to NeuroSynth’s: both segment the hippocampus, the retrosplenial cortex and the precuneus. The difference lies mostly in the ACC/parietal parts of the default mode network. However, we are aware that the query expansion is sometimes too sharp and can deteriorate results. In future work, we hope that ensemble models can help mitigate this issue. A first approach using random subspaces is already demonstrated in the gallery of NeuroQuery examples, and in an interactive demo, and seems to improve results for the case of "autobiographical memory".

The queries that give maps that do not resemble what we would expect from an experiment targeted to answer the corresponding question are often queries that combine a term that is well mapped in brain-imaging experiments with a term that is harder to capture, such as "autobiographical memory of faces", as the published results are then dominated by "faces".

D) Perhaps a brief discussion of other limitations would be helpful. We submit that some of the fundamental problems are those not easily solved – that we usually perform meta-analyses based on studies of the same nominal task type (e.g., N-back), and sometimes minor variations in task structure can yield divergent findings. We don't know what all the dimensions are yet. This problem goes far beyond the challenge of establishing a set of consensus labels for task types and relevant cognitive processes. In short, we don't really even know what task features to label yet in many cases, and they don't combine additively. A stop-signal task with one adaptive random walk may be different than one with four, as it allows a different type of cognitive strategy.

We have included a discussion of the issue mentioned by the reviewers in the Discussion, starting with "Another fundamental challenge of meta-analyses in psychology is the decomposition of the tasks in mental processes […]"

E) When to use it: For common terms, meta-analysis (e.g., Neurosynth) does very well (e.g., Figure 4). When would the authors recommend using NeuroQuery over another meta-analytic tool? Maybe they could provide a summary of use cases and conditions (e.g., when few studies of a term/topic are available). Also see point (C).

We have added recommendations on when NeuroQuery is beneficial, in the paragraph starting with "NeuroQuery will be most successfully used to produce hypotheses […]"; this paragraph finishes with a list of situations where using NeuroQuery is particularly indicated.

8) Discussion of other approaches: We submit that the field has become tracked into a relatively narrow space of the possible options and techniques for meta-analysis, based on local analysis of coordinates in MKDA/ALE. Alternatives could be mentioned as potential future directions. For example, early work explored clustering of spatial locations and spatial discriminant tests (e.g., Wager et al., 2002, 2004, 2005), and later work has explored spatial models (e.g., Kang, 2011, Kang, 2014, Wager, 2015) and more advanced co-activation models (Xue, 2014). While this is obviously beyond the scope of the present paper, future work might consider models of spatial co-activation when generating predictions and inferences about meta-analytic maps.

We now mention this line of work at the end of the paragraph on prior work, starting with "other works have modelled co-activations and interactions between brain locations"

One way to adapt the NeuroQuery model to better leverage the spatial structure of activations could be to use loadings on dictionary or ICA components as dependent variables rather than estimated density at each voxel.

9) The methods are a bit unclear:

We have made important changes to the organization and notations of the methods. Below we provide answers to the issues raised by the reviewers.

A) Significant details about methodologies are missing. How are term frequency and inverse document frequency computed?

We have added a section B.1 on the computation of TFIDF features.
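
For intuition, a standard TFIDF weighting can be computed with scikit-learn as below; this is only an illustration with toy documents, and scikit-learn's default formulation (term frequency times smoothed inverse document frequency, l2-normalized per document) may differ in details from the exact variant described in section B.1:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["auditory working memory task",
            "visual working memory task",
            "resting state functional connectivity"]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)      # documents x vocabulary sparse matrix
    print(vectorizer.get_feature_names_out())
    print(X.toarray().round(2))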

B) Equation 5: How is σ^ij computed? Square root of the entries of equation 8?

σ^2jk is now explicitly defined in Equation 9.

C) Equation 5: What is the difference between σ^i and σ^ij? σ^i is a column/row of σ^ij?

Indeed, σ^j is the vector containing the entries (σ^j,1, ..., σ^j,p), as made explicit at the end of section B.2.1. We have also added a short section on notations.

D) Y:,j is the j-th column not i-th column of Y?

Indeed this was a typo, which has now been corrected (below Equation 6.)

E) M is a v x n matrix, so equation 8 is a v x v matrix? We are confused how this maps to σ^ij.

We have now defined σ^ij in Equation 9 using the sum notation, we hope that this definition is now clear.

F) In Equation 10, is ||U||_1 just the sum of absolute values of all entries in U?

This is what we intended. This choice of notation was bad, as it is usually employed for the operator norm induced by the l1 vector norm. We replaced this notation with that of the L_{1,1} norm, i.e. ||·||_{1,1}.
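
Concretely, the entry-wise L_{1,1} norm is simply the sum of the absolute values of all matrix entries, as in this small numerical check:

    import numpy as np

    U = np.array([[1., -2.],
                  [3., -4.]])
    l11_norm = np.abs(U).sum()   # |1| + |-2| + |3| + |-4| = 10
    print(l11_norm)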

G) What is k set to be in equation 10?

k has now been renamed d to avoid confusion with the index k used to index voxels. It is set to 300, as is now specified in the paragraph after Equation 18: "the hyperparameters d = 300, …".

H) "More than 72% of the time, NeuroQuery's output has a higher Pearson correlation with the correct map than with the negative example" – "correct map" refers to the KDE density maps?

"correct map" refers here to an unsmoothed maps where peaks have been placed in the voxels that contain the reported peak coordinates. We have now clarified this in the text "whether the predicted map is closer to the correct map (containing peaks at each reported location) or to the random negative example"

10) Subsection “Smoothing: regularization at test time” is hard to read. It would be helpful if the authors explain the intuition behind the different steps and what the different matrices represent. For example, it might be helpful to explain that each row of V can be thought of as the association of words with a topic, so a higher value in row k, column j suggests the j-th dictionary word is more strongly related to topic k. As another example, the authors should also unpack subsection “Smoothing: regularization at test time”: why do we take V, scale it with n_{i,i}, then compute C, then l1-normalize the rows of C to produce T, and finally S? What does each step try to do? We guess that, roughly speaking, VV^T is like a co-occurrence matrix (how likely are two words to appear together?), but we are not sure why we have to do the extra normalization with n_{i,i}, the l1-normalization, etc.

We have reworked this section in depth. More details on the interpretation of the rows of V are provided in the second paragraph after Equation 18, starting with "The latent factors, or topics, are the rows of …". We provide some justification for scaling V with n_{i,i} in equations 19-21: as the reviewers suspected, Ṽ^T Ṽ can be seen as a regularized (low-rank) co-occurrence matrix. Our motivation for the subsequent l1-normalization is stated after Equation 22: "This normalization ensures that terms that have many neighbors are not given more importance in the smoothed representation". Overall, we hope that this section is now clearer and provides better intuitions and justifications for the choices we made in the design of the NeuroQuery model.
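
A rough numerical sketch of these normalization steps (names and shapes are illustrative, not the paper's exact notation or values) may help build intuition about how the smoothing spreads query weight to related terms:

    import numpy as np

    rng = np.random.default_rng(0)
    V = rng.random((5, 20))                  # topic loadings: topics x vocabulary
    C = V.T @ V                              # low-rank, co-occurrence-like matrix
    T = C / C.sum(axis=1, keepdims=True)     # l1-normalize rows: each term's neighbors
                                             # sum to 1, so well-connected terms are
                                             # not over-weighted
    query = np.zeros(20)
    query[3] = 1.0                           # one-hot toy query for the 4th term
    smoothed_query = query @ T               # weight spreads to semantically related terms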

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Data Citations

    1. Dockès J, Varoquaux G. 2019. NeuroQuery. GitHub. neuroquery_data

    Supplementary Materials

    Source data 1. Source code for figures and tables.
    elife-53385-data1.zip (102.3MB, zip)
    Transparent reporting form

    Data Availability Statement

All the data that we can share without violating copyright (including word counts of publications) have been shared on https://github.com/neuroquery/ (copy archived at https://github.com/elifesciences-publications/neuroquery) alongside the analysis scripts. Everything is readily downloadable without any authorization or login required. For each figure and table, the data directly used to generate it is made available in a separate zip file.

    The following dataset was generated:

    Dockès J, Varoquaux G. 2019. NeuroQuery. GitHub. neuroquery_data

