Author manuscript; available in PMC: 2016 May 5.
Published in final edited form as: Neuron. 2014 Sep 17;83(6):1246–1248. doi: 10.1016/j.neuron.2014.09.008

The Big Data Problem: Turning Maps into Knowledge

Florian Engert 1,*
PMCID: PMC4857857  NIHMSID: NIHMS781334  PMID: 25233305

Abstract

In this NeuroView, Engert discusses the challenges the connectomics field faces in deriving insights about brain function from big data.


There has been a great deal of focus in recent years on efforts to map the brain. The ability to record from every neuron in the brain of an awake, and ideally behaving, animal is unquestionably immensely useful. In addition, having a wiring diagram at hand that can be overlaid on such activity maps is probably a dream come true for most systems neuroscientists. Given the vast number of neurons in the brain, however, such systematic analysis could yield enormous reams of data. The same could be said for efforts in the connectomics field to reconstruct structural connections throughout the brain via EM. Here, I argue that “big data” and the oft-discussed challenges inherent to it (e.g., mining, storing, and distributing it) are not the key obstacle we face in transitioning from making neural maps to gaining useful insights into brain function. I would suggest that the essential ingredient that turns a useless map into an invaluable resource is the experimental design employed to gather and analyze the underlying data, and ultimately the thought process, creativity, and ingenuity that went into this design. This is where the hard work is—in formulating precisely the question of what we actually want to know, what an answer would look like, and what kind of insight we can take away from the experiment.

In this essay I will focus on two endeavors that are presently underway in the neurosciences that aim to collect rather large amounts of data: the Open Connectome Project (Burns et al., 2013; Kandel et al., 2013) and the BRAIN initiative (Devor et al., 2013; Kandel et al., 2013; Striedter et al., 2014). While it has been suggested that a critical challenge to be addressed with these initiatives is the issue of “big data” (Brinkmann et al., 2009; Choudhury et al., 2014; Swain et al., 2014), I will make the argument that it will be comparatively small data sets (on the order of a few terabytes at most) that will contain the relevant information and that will need to be distributed and made available as resources to the community. These small and information-rich data sets will include a description of all the neurons in the brain, their activity, and, ideally, their wiring diagram. The development of the methodologies necessary to generate these data sets is essential—and it is very difficult to do. But the difficulty lies primarily in developing the right technology. Overcoming these problems is essentially the goal of the BRAIN initiative and, in my opinion, a good place for investing money, energy, and time.

Big Data in Neuroscience?

Big data is a hot topic these days, and it’s not surprising that there is discussion in the community about what to do with the data generated by these endeavors. Big data can be defined in many ways, and the continuous increase in computational power makes it a somewhat amorphous concept. For the purposes of this commentary, I will define as big data anything that exceeds the capacity of a standard laptop hard drive.

It is useful and important, however, to make a clear distinction between big data and complex data, two concepts that frequently get conflated. The former is just that: big. The latter is complicated, hard to interpret, and—usually—very hard to compress. It also requires the application of mathematical tools and quantitative methods to analyze. Complex data sets, quite often, are not big in the sense of “big data,” but they are ubiquitous in modern science.

How Big Is a Connectome?

Let’s consider the respective challenges of converting data into information within the connectome project and the BRAIN initiative. Connectomics relies on recovering a circuit diagram by imaging the whole region of interest at the resolution of an electron microscope (EM) (Briggman and Bock, 2012; Kleinfeld et al., 2011; Lichtman and Denk, 2011; Randel et al., 2014). These EM data sets then need to be analyzed by segmentation and reconstruction of the individual neurons, which ultimately allows the identification of all the synaptic connections. The final product is the circuit diagram of the complete network in the volume under scrutiny. The size of the raw data collected in such an enterprise is truly daunting.

Let us look at a few numbers: a mouse brain imaged at 5 nm × 5 nm × 40 nm resolution over a volume of approximately 500 mm^3 would generate a raw data volume of 500 petabytes. Big data, indeed. However, what we want to get out of this volume is the connectivity matrix among the 100 million neurons that a mouse brain contains. If we assume ~1,000 connections for each neuron, the resulting connection matrix contains ~10^11 entries. Assuming a few bytes per entry, these 10^11 entries result in a data set of a few hundred gigabytes, which will fit comfortably on an ordinary laptop hard drive. Complex data, but not big. It is true that we haven’t yet developed fast, reliable, and efficient algorithms to actually do the segmentation and tracing—and as such this particular problem of data compression is far from being solved. However, the solution to this problem will most likely come out of machine vision research and doesn’t quite have the flavor of “big data mining.” The task of segmentation and tracing itself is actually quite straightforward; it is easy to formulate and can be accomplished by a trained middle school student (see, for example, Eyewire.org); it’s just very hard to implement in computer algorithms at the moment (Jain et al., 2010; Turaga et al., 2010). However, once these algorithms have been developed, whole-brain EM volume data can be reduced and compressed by six orders of magnitude. Not so big data anymore. It is unquestionably important to allocate resources to solve this problem, but it is most likely going to be solved—in the end—by a handful of smart mathematicians and might not really require a national (or international) effort and billions of dollars. Once compressed in this manner—and converted into information—the data sets to be analyzed in the context of systems neuroscience questions will comfortably fit on a flash drive that you can carry in your pocket.
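To make the arithmetic explicit, here is a minimal back-of-envelope sketch of these estimates; every number in it is an assumption taken from the paragraph above, not a measurement:

```python
# Back-of-envelope check of the figures above. Every value is an assumption
# carried over from the text (voxel size, brain volume, neuron count,
# connections per neuron), not a measurement.

VOXEL_NM3 = 5 * 5 * 40            # EM voxel: 5 nm x 5 nm x 40 nm
BRAIN_MM3 = 500                   # approximate mouse brain volume
BRAIN_NM3 = BRAIN_MM3 * (1e6)**3  # 1 mm = 1e6 nm

raw_voxels = BRAIN_NM3 / VOXEL_NM3   # ~5e17 voxels
raw_bytes = raw_voxels * 1           # assume 1 byte per voxel
print(f"raw EM volume: {raw_bytes / 1e15:.0f} PB")           # ~500 PB

NEURONS = 1e8                 # ~100 million neurons
CONNECTIONS_PER_NEURON = 1e3  # ~1,000 connections each
BYTES_PER_ENTRY = 4           # "a few bytes" per entry

# Only the actual connections are stored (the nonzero entries of a sparse
# adjacency matrix), not the full 1e8 x 1e8 matrix.
entries = NEURONS * CONNECTIONS_PER_NEURON                    # ~1e11 entries
matrix_bytes = entries * BYTES_PER_ENTRY
print(f"connectivity matrix: {matrix_bytes / 1e9:.0f} GB")    # ~400 GB
print(f"compression factor: {raw_bytes / matrix_bytes:.1e}")  # ~1e6
```

The reduction by six orders of magnitude comes from storing only the connections that actually exist rather than every voxel of the imaged volume.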

How Big Is an Activitome?

If we consider recording all the spikes in all the neurons of the brain, we can envision a similar compression. If we achieve such large-scale recording through some technology based on volume imaging (point- or sheet-scanning, spatial light modulation, etc.) coupled with genetically encoded activity indicators (GCaMPxx or voltage-sensitive proteins), we are initially faced with similarly big data volumes: a mouse brain contains 500 × 10^9 cubic-micron voxels (filling a volume of ~500 mm^3), and if we want to record all of them for ~17 min (1,000 s) at 1,000 Hz, we again have 500 petabytes of raw data. Here, however, the initial compression is much more straightforward: you isolate all the cell bodies (100 million) and find the timestamps of all the fluorescence intensity spikes. With the assumption that all the neurons fire at an average rate of 5 Hz throughout the recording period (probably an upper estimate, since many neurons might be silent), we again end up with a data volume of 500 gigabytes. Quite manageable. Here, the mathematical tools to do this compression are more or less already in place. Segmentation of neuronal cell bodies and isolation of spikes from fluorescence traces are presently made difficult only by signal-to-noise problems. If the signals are large, this is easily done with the help of standard and ubiquitously available software.
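The same kind of sketch, under the assumptions stated above, recovers the activity-imaging figures; the single byte per spike event is simply what the 500-gigabyte estimate implies:

```python
# Estimate for whole-brain activity imaging; all quantities are the
# assumptions stated in the text. One byte per spike event is what the
# 500-gigabyte figure implies; a realistic timestamp would take a few
# bytes and scale the result accordingly.

VOXELS = 500e9        # ~500 x 10^9 cubic-micron voxels (~500 mm^3 brain)
FRAME_RATE_HZ = 1000  # assumed volumetric sampling rate
DURATION_S = 1000     # ~17 min of recording

raw_bytes = VOXELS * FRAME_RATE_HZ * DURATION_S * 1     # 1 byte per sample
print(f"raw imaging data: {raw_bytes / 1e15:.0f} PB")   # ~500 PB

NEURONS = 1e8          # segmented cell bodies
MEAN_RATE_HZ = 5       # assumed upper-bound average firing rate
BYTES_PER_SPIKE = 1

spike_events = NEURONS * MEAN_RATE_HZ * DURATION_S      # ~5e11 events
print(f"spike timestamps: {spike_events * BYTES_PER_SPIKE / 1e9:.0f} GB")  # ~500 GB
```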

Thus, in both cases, the size of the relevant data volumes can be reduced from hundreds of petabytes to a few hundred gigabytes, and this can be done by analysis pipelines that are—at least conceptually—quite straightforward. Furthermore, this data reduction will eventually be done on the fly, i.e., during the acquisition of the raw data, and will probably be achieved with dedicated hardware in the form of custom-designed coprocessors. Raw data sets might be very large, but once converted into information, the volumes aren’t big data anymore.
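To illustrate what such on-the-fly reduction could look like in software (the dedicated coprocessors imagined above would do this in hardware during acquisition), here is a minimal, hypothetical sketch; the function names, array shapes, and the mean-per-cell-mask reduction are illustrative assumptions, not an existing pipeline:

```python
# Hypothetical sketch of on-the-fly reduction: given a precomputed
# segmentation that assigns each voxel to a cell body, every incoming
# volumetric frame is collapsed to one mean fluorescence value per neuron.
import numpy as np

def make_reducer(cell_labels: np.ndarray, n_cells: int):
    """Precompute voxel-to-cell indices so each frame collapses to n_cells values."""
    flat_labels = cell_labels.ravel()                    # voxel -> cell id (0 = background)
    counts = np.bincount(flat_labels, minlength=n_cells + 1)

    def reduce_frame(frame: np.ndarray) -> np.ndarray:
        # One number per neuron per frame instead of one number per voxel.
        sums = np.bincount(flat_labels, weights=frame.ravel(), minlength=n_cells + 1)
        return sums[1:] / np.maximum(counts[1:], 1)      # mean fluorescence per cell

    return reduce_frame

# Toy usage: a 100 x 100 x 100 voxel block with 50 labeled cells.
rng = np.random.default_rng(0)
labels = rng.integers(0, 51, size=(100, 100, 100))      # 0 = background, 1..50 = cells
reduce_frame = make_reducer(labels, n_cells=50)
traces = reduce_frame(rng.random((100, 100, 100)))      # shape (50,)
```

The point of the design is simply that the per-frame output scales with the number of cell bodies rather than the number of voxels, which is where the reduction from petabytes to gigabytes comes from.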

Large-Scale, Small-Scale: A Question of Style

I’ve argued that the big data in question could, with appropriate analysis and technological developments, be relatively easily compressed into information, albeit complex information. But the big data still must be gathered in the first place. So what’s the best approach to collecting the data that will give us an unprecedented view into brain function? One could envision either large-scale, industrial data collection or the traditional small-scale, individual-lab approach. Here, I will briefly discuss the potential contributions of both.

Whole brain imaging will greatly facilitate the identification and localization of essential neural subnetworks related to the behavioral context under scrutiny. The product or “deliverable” of whole brain imaging will then be a small and spatially identified subset of neurons whose activity is correlated with any—or all—aspects of the behavioral context. This is probably more useful than any other way of labeling subsets of cells if the goal is to decipher the roles of circuits in generating behavior, and it offers an attractive and complementary approach to labeling neurons with genetic methods such as enhancer trapping. The catch is that whole brain imaging has to be integrated into the experimental context and has to be designed and optimized for the specific project. As such, it needs to be turned into a readily available technology for all laboratories and be accessible on the small scale.

The issues are slightly different for connectomics, which has the goal of generating complete wiring diagrams that—ideally—can and should be overlaid onto previously acquired functional maps. Such an enterprise will require concerted, large-scale efforts and might best be accomplished by industrially organized science at a more corporate level. Indeed, in recent years several voices have argued—occasionally quite convincingly—for neuroscience to move from tinkering in individual laboratories to industrial-scale research that allows the many challenges to be tackled systematically and in a properly organized fashion.

I propose that there is equal space and opportunity for both: corporate-style, industrial-size science as well as the individual, small-scale, cottage-industry style. Connectomics is clearly an example that is begging to be turfed out to a contract research organization (CRO) equipped with a fleet of electron microscopes, where fixed brains can be automatically sectioned, mounted, imaged, and even segmented. Several successful service industries come to mind that all started out as relatively small-scale operations in individual laboratories and that are now used routinely by almost every laboratory in the world.

Sequencing services are used ubiquitously around the world, yet the technology certainly started as a form of cottage industry in the hands of Sanger and colleagues. Oligonucleotide synthesis and protein sequencing are other powerful technologies that quickly made their way into service industries. The generation of transgenic mice—a job that used to soak up a large part of a PhD thesis—is now in most cases outsourced to CROs. Even the supervision of graduate students is frequently outsourced, in this case to thesis advisory committees and/or postdoctoral fellows.

Whole brain imaging, on the other hand, is difficult to envision as an industrial-scale, massively parallel high-throughput operation. The main reason for this is that such an operation usually requires a clear final product, a deliverable that can be quantitatively described, priced, benchmarked, and specified by intermediate milestones. These features seem quite feasible in the context of generating connectomes but appear to be ludicrous in the context of whole brain imaging. What would such a product look like? Here, clearly the deliverable is the technology and not the final data set, and as such the aims of the BRAIN initiative are perfectly aligned with these objectives.

Looking to the Future

Once the data are collected and compressed into information, the question becomes how best to turn this information into knowledge. The challenge in the neurosciences will be to come up with good questions and intelligent experimental assays—assays that ultimately will have to be anchored in behavior and that will have to answer questions of how specific behaviors are generated by the nervous system. For excellent specific examples, it is useful to go further back in the history of neuroscience and consider stories like the jamming avoidance response (JAR) of the weakly electric fish (Heiligenberg, 1991) and the generation of rhythmic activity in the stomatogastric ganglion of the lobster (Marder et al., 2014; O’Leary and Marder, 2014).

New technologies that allow us to identify and isolate the neuronal subtypes that are actually involved in a specific task will of course be an important boon to this enterprise, and they will undoubtedly speed up the collection of necessary data. However, I doubt that these new technologies will lead to a paradigm shift or a fundamentally new way of doing neuroscience. The name of the game will always be to think carefully and deeply about how behavioral features can emerge out of neuronally implemented algorithms, and ideally these ideas ought to germinate and take shape well before we actually start generating data, be it big or small.

References

1. Briggman KL, Bock DD. Curr Opin Neurobiol. 2012;22:154–161. doi: 10.1016/j.conb.2011.10.022.
2. Brinkmann BH, Bower MR, Stengel KA, Worrell GA, Stead M. J Neurosci Methods. 2009;180:185–192. doi: 10.1016/j.jneumeth.2009.03.022.
3. Burns R, Roncal WG, Kleissas D, Lillaney K, Manavalan P, Perlman E, Berger DR, Bock DD, Chung K, Grosenick L, et al. The Open Connectome Project Data Cluster: Scalable Analysis and Vision for High-Throughput Neuroscience. Proceedings of the 25th International Conference on Scientific and Statistical Database Management (SSDBM). 2013:27. doi: 10.1145/2484838.2484870. http://arxiv.org/abs/1306.3543.
4. Choudhury S, Fishman JR, McGowan ML, Juengst ET. Front Hum Neurosci. 2014;8:239. doi: 10.3389/fnhum.2014.00239.
5. Devor A, Bandettini PA, Boas DA, Bower JM, Buxton RB, Cohen LB, Dale AM, Einevoll GT, Fox PT, Franceschini MA, et al. Neuron. 2013;80:270–274. doi: 10.1016/j.neuron.2013.09.008.
6. Heiligenberg W. Neural Nets in Electric Fish. Cambridge, MA: MIT Press; 1991.
7. Jain V, Seung HS, Turaga SC. Curr Opin Neurobiol. 2010;20:653–666. doi: 10.1016/j.conb.2010.07.004.
8. Kandel ER, Markram H, Matthews PM, Yuste R, Koch C. Nat Rev Neurosci. 2013;14:659–664. doi: 10.1038/nrn3578.
9. Kleinfeld D, Bharioke A, Blinder P, Bock DD, Briggman KL, Chklovskii DB, Denk W, Helmstaedter M, Kaufhold JP, Lee WC, et al. J Neurosci. 2011;31:16125–16138. doi: 10.1523/JNEUROSCI.4077-11.2011.
10. Lichtman JW, Denk W. Science. 2011;334:618–623. doi: 10.1126/science.1209168.
11. Marder E, O’Leary T, Shruti S. Annu Rev Neurosci. 2014;37:329–346. doi: 10.1146/annurev-neuro-071013-013958.
12. O’Leary T, Marder E. Science. 2014;344:372–373. doi: 10.1126/science.1253853.
13. Randel N, Asadulina A, Bezares-Calderón LA, Verasztó C, Williams EA, Conzelmann M, Shahidi R, Jékely G. Elife. 2014:e02730. Published online May 27, 2014. doi: 10.7554/eLife.02730. http://dx.doi.org/10.7554/eLife.02730.
14. Striedter GF, Belgard TG, Chen CC, Davis FP, Finlay BL, Güntürkün O, Hale ME, Harris JA, Hecht EE, Hof PR, et al. Brain Behav Evol. 2014;83:1–8. doi: 10.1159/000360152.
15. Swain JE, Sripada C, Swain JD. Behav Brain Sci. 2014;37:101–102. doi: 10.1017/S0140525X13001908.
16. Turaga SC, Murray JF, Jain V, Roth F, Helmstaedter M, Briggman K, Denk W, Seung HS. Neural Comput. 2010;22:511–538. doi: 10.1162/neco.2009.10-08-881.
