Abstract
Recent advances in neuroscience have enabled the exploration of brain structure at the level of individual synaptic connections. These connectomics datasets continue to grow in size and complexity; methods to search for and identify interesting graph patterns offer a promising approach to quickly reduce data dimensionality and enable discovery. These graphs are often too large to be analyzed manually, presenting significant barriers to searching for structure and testing hypotheses. We combine graph database and analysis libraries with an easy-to-use neuroscience grammar suitable for rapidly constructing queries and searching for subgraphs and patterns of interest. Our approach abstracts many of the computer science and graph theory challenges associated with nanoscale brain network analysis and allows scientists to quickly conduct research at scale. We demonstrate the utility of these tools by searching for motifs on simulated data and real public connectomics datasets, and we share simple and complex structures relevant to the neuroscience community. We contextualize our findings and provide case studies and software to motivate future neuroscience exploration.
Subject terms: Computational neuroscience, Network topology, Programming language, Software, Neural circuits, Computational neuroscience, Network models
Introduction
Modern nanoscale connectomics research commonly involves the conversion of microscopy imagery data into a graph representation of connectivity, where nodes represent neurons, and directed edges represent the synapses between them1. This process enables researchers to convert terabytes or even petabytes of imagery into megabytes or gigabytes of graph data. Conversion to a network format reduces the cost and complexity of interrogating the data, at the expense of losing information about cellular morphology2,3. Though this graph representation uses substantially less storage-space on disk, answering even seemingly simple network questions (e.g., identifying local graph structure around a particular neuron, or comparing the downstream targets of a certain cell type) may still exceed the computational power, timelines, and budgets available to many research teams, due to the exponential nature of many graph algorithms4,5. This issue is often ameliorated by including node or edge attribute constraints in the search, though this places an additional limitation on the types of questions that a researcher can expect to ask of a connectome dataset.
Connectomics researchers have begun to address these challenges of large-scale graph analysis by adopting existing large-scale graph management software from other domains, such as graph databases, and by enforcing consistent, well-architected data schemas6,7. These systems provide performant and cost-effective ways to manipulate larger-than-memory graphs, but tend to require familiarity with complex and nuanced graph query programming languages such as Gremlin or Cypher. Though graph databases continue to grow in popularity, the expertise to administer or use these technologies is still not common in the biological sciences.
In order to make the study of connectomes accessible and computationally efficient, we developed DotMotif, an intuitive but powerful graph tool designed to reduce the expertise and time required to begin interrogating biological graphs of any size. DotMotif acts as an interface to common graph management systems such as the NetworkX Python library or the Neo4j graph database, abstracting the intricacies of subgraph-query design and enabling researchers to focus on field-specific scientific inquiry. DotMotif assists researchers in query design by exposing an intuitive, simple syntax that automatically validates and optimizes user queries, and adapts to different graph tool backends without the need for additional user input. We present the DotMotif Python package and software architecture, which enables researchers to write queries while remaining agnostic to underlying technologies. We demonstrate DotMotif’s utility by assessing its advantages over manual query design, and we share several use-cases inspired by ongoing research in the connectomics community. We also share examples of how these findings might inform future work, including efforts to design better local connectivity rule sets, establish expected motif prevalence, and produce better generative models of a connectome for evaluation, null model design, or for comparison across species, brain region, and modality.
Background
Many of the fundamental questions of modern neuroscience rely on the study of simple circuits of neural connectivity in the brain. These simple circuits are hypothesized to be repeated and reused many times to perform a similar role. Though they might comprise only a small number of neurons, as shown in Fig. 1, these simple circuits or graph motifs may be critical to understanding the functional role of larger structures in the brain, such as cortical columns or other modules of computational importance8–10. Due to the nature of available high-resolution imaging modalities such as electron microscopy, it is not always feasible to collect functional as well as anatomical data at synaptic resolution. In this work, we define motifs as repeating subgraphs, but do not make assumptions about their computational significance. That is, we consider a motif to be any “commonly used network architecture”11 or “recurring, significant [pattern] of inter-connections”12, even if that subgraph structure has unknown computational properties.
Identifying these motifs is one motivation for the field of connectomics, the study of the brain through the lens of its connectivity. Many related research efforts seek to construct graph representations of the brain, commonly represented by the notation , where a graph G is made up of nodes V, representing neurons; (optionally directed) edges E, representing synapses between neurons; and arbitrary attributes A, which may be associated with vertices, edges, or the graph as a whole. For example, node attributes might include qualities such as cell type or functional information. Edge attributes might include synapse weight or neurotransmitter type. Graph attributes might include the species or individual from which the connectome was generated. Neuroscience questions may be reformulated as analyses on a graph, and the neuroscientist may add graph theory to the toolbox of strategies with which to understand the brain8,13. Such questions include searching for specific subgraph structures, investigating the connectivity of specific neuron cell types or categories, proofreading connectomes for accuracy, and generating summary statistics on the graph as a whole3,10,14–18.
Many graphs generated by the connectomics community in recent years have spanned multiple gigabytes of hard drive space2,10, rendering conventional graph toolkits, such as the common NetworkX library19 (or its counterparts in other programming languages) under-powered to address the needs of the scientific community. These tools, which often require all graph data to be stored in RAM, would require impractically expensive compute hardware in order to run fully in-memory, and would require impractically long timelines in order to run while swapping data from memory to disk. Instead, some teams6 have opted to leverage “out-of-memory” tools, such as Neo4j7, Cayley20, or other graph databases7,21, which operate on graph data much like conventional relational databases operate on tabular data. Despite their power, such tools require expertise in specialized query languages such as Cypher or Gremlin, much like relational databases require knowledge of languages such as SQL. Developing or hiring for this sort of domain expertise may be impractical for many neuroscience research laboratories and is independent from the core expertise needed to formulate and test neuroscience hypotheses. The need for computationally efficient, accessible, and intuitive neural graph analysis tools motivates this work.
Motif-based connectomics research is an emerging neuroscience sub-discipline. This community currently depends upon closed-form matrix algebra subgraph-counting techniques, which are only established for very specific types of motifs (e.g. fully connected motifs or star graphs). More complicated motifs, such as those which include information about cell or synapse types, or those with more topologically interesting graph connectivity, are currently understudied largely due to this technical gap. Existing tools, such as FANMOD, MAVisto, and others offer helpful interfaces for biological motif exploration with support for graph attributes, but encounter limitations when searching for arbitrarily large motifs, or when searching in many-thousand-node graphs11,12,22,23. In order to facilitate directed hypothesis testing in large-scale connectomics, the community will need motif search tools that are able to incorporate information about cell types, local morphology, functional and structural data, and other graph, node, and edge attributes, and which scale to arbitrarily large connectome network sizes.
Results
For DotMotif to be useful to the neuroscience community, it is important for potential end users to see its practical utility. We share illustrative DotMotif experiments run on a set of publicly-available, seminal connectome datasets. We first validate our results by comparing against existing online data, and then share a network-analysis result on both partial as well as complete connectomes. Lastly, we share performance benchmarks for graphs of various sizes. Collectively, these results serve to illustrate how our tools enrich the connectomics toolkit.
Datasets
We demonstrate the use of our tool on both near-complete (most or all of the organism’s neurons are included in the graph) connectomes as well as on partial connectomes. Here, we compare connectomes of the invertebrate nematode C. elegans24,27, the partial connectome of the invertebrate fruit fly from the Howard Hughes Medical Institute Janelia Hemibrain project2, and the partial vertebrate mouse visual cortex connectome from the Intelligence Advanced Research Project Activity (IARPA) Machine Intelligence from Cortical Networks (MICrONS) project10,28,29. Further dataset details are available in Table 1. These published graphs do not adhere to a particular database schema or storage system, and each has different node and edge attributes. Despite these differences, all three can be analyzed using the techniques we detail here.
Table 1.
Dataset | Nodes | Edges | Density | Species/area |
---|---|---|---|---|
C. elegans chemical synapses | 296 | 3427 | 0.064 | Caenorhabditis elegans |
MICrONS v185 | 334 | 1736 | 0.031 | Mus musculus visual cortex |
Hemibrain v1.2 | 21,740 | 3,550,404 | 0.011 | Drosophila melanogaster |
The C. elegans connectome was derived from the July 2020 C. elegans adjacency matrix at WormWiring.org, and synapses were filtered to only include chemical synapses for this analysis24,25. The MICrONS v185 graph was derived from the proofread soma subgraph at microns-explorer.org10,26. The Hemibrain v1.2 dataset was accessed through the neuPrint system6.
Tool validation: declarative queries
As a simple first step towards tool validation, we begin by demonstrating the utility of DotMotif in performing declarative queries on a large connectome graph. In order to illustrate compatibility with data stored in the neuPrint data format6, we ran a DotMotif search query on the partial Drosophila melanogaster connectome (dubbed “Hemibrain”)2. We then validated these results with the neuPrint API at neuprint.janelia.org6. In this example motif query (Fig. 2a), we wanted to find all Antennal Lobe inputs to Kenyon Cells with weights within a certain range. We identified seven instances of this motif (centered around antennal lobe neurons with IDs 1917188956, 1825085656, 5813063239, 1887163927, 1917188956, 1825085656 and 1037293275). With these references to neurons in the host graph, we were equipped to look for further patterns among these neurons. In order to confirm that the query above returned results that were identical to those obtained from the canonical neuPrint server, we validated these results with the equivalent Cypher command (Fig. 2b) in the neuPrint web application. This exercise demonstrates DotMotif can be used to easily convert declarative neuroscience questions into succinct, efficient graph queries.
Comparing undirected subgraph searches across connectomes
As an illustration of the minimal configuration requirements of the DotMotif package, we searched the human-proofread subgraph from the IARPA MICrONS project10,28,29 for all undirected graphs of size , with the ordering taken from the Atlas of Graphs30[p. 8–30]. We then counted the number of times each subgraph appeared. For example, we counted all undirected triangles (Atlas of Graphs ID #7) and discovered that there were 6894 unique triangle monomorphisms in the MICrONS graph when ignoring edge direction. There were 123,264 unique undirected rectangle (Atlas of Graphs ID #16) monomorphisms. The only undirected graph in this set with no occurrences in the MICrONS graph was Motif ID #208, which was the complete graph of 6 nodes. We publish this complete dataset of all motifs and their respective counts for community use (see Supplemental Material 1).
To contextualize these results, we calibrated the parameters of four simple random graph models (Erdős–Rényi31, Undirected Geometric32, Watts–Strogatz33, and Barbarási–Albert34) to match the graph density (D = ) of each estimated connectome, and ran the same count of subgraph motifs on samples taken from each random graph distribution. These models were chosen for their simplicity and small parameter spaces. We similarly performed the same subgraph motif counts on the configuration model (X-swap)35—a degree-preserving graph randomizer used in previous neuroscience efforts10. Unlike the other models, this model begins with an existing connectome graph and performs arbitrary swaps of edges such that the overall in- and out-degrees of each node remains the same, but the precise pattern of connectivity changes36. Because the X-swap procedure preserves the degree sequence of its source network, random permutations generated in this manner closely mirror the motif counts of the original source connectomes. More details about the random graphs used here are available in “Methods”. The graph densities were set to 0.03 for the MICRONS graph and 0.06 for the C. elegans graph (Table 1). In order to compare these results with a complete connectome example dataset, we ran this count-scan on all chemical synapses of the hermaphrodite C. elegans connectome24,25,27,37 available from WormWiring.org24. All motifs tested occurred at least once in the C. elegans connectome. We discovered that the motifs encountered most frequently in one connectome tended to be the most frequent in other connectomes and their random-graph counterparts, but the absolute counts of each motif differed dramatically based upon the connectome and motif in question (Fig. 3).
We also discovered that the distribution of undirected motifs (black datapoints, Fig. 3) followed a similar curve trajectory for many brain graph datasets including the two shown here, though the parameters of that curve varied across connectomes. Queries from this set of experiments were run using Neo4j, NetworkX, and GrandIso executors, as convenient, leveraging our ability to seamlessly switch between executors when using DotMotif. Random graph parameters selected for these experiments are explained in greater depth in “Methods”. All data and results from this study are available as described in Supplemental Material 1. These experiments may aid in the design or selection of random graph models to approximate a connectome graph, and in the design or selection of query graphs when studying reconstructed connectomes.
Directed three-node motif searches in MICrONS and C. elegans
Recent work has investigated the relative prevalence of directed three-node motifs in brain graphs and the importance of such motifs for local computation10,38–40. Here we count all unique, directed, three-node motifs in the MICrONS Phase 1 and C. elegans connectomes. Lower-density three-node motifs appeared with greater frequency (due to the larger number of embeddable permutations. Despite similar neuron counts, the connectomes differed dramatically in both absolute count and distribution shape, with many directed motifs occurring frequently in the C. elegans graph but few to no times in the MICrONS graph (Fig. 4).
These simple motifs serve as a useful frame for further investigations. Though there are only thirteen three-node directed graphs, exhaustively searching for all larger directed motifs quickly becomes computationally infeasible (there are 199 connected four-node directed graphs, 9364 connected five-node directed graphs, and over 1.5 million connected six-node directed graphs). It is possible now to search specifically for larger supergraphs of only motifs which occur at least once in this study, greatly reducing the search space of future exhaustive motif searches in these connectomes. Unlike the distributions of undirected motifs, the relative distribution of these directed motifs does not appear to follow a consistent shape between connectomes. This suggests that even these small motifs may hold useful knowledge about the nature of connectome graphs, such as distinguishing between feed-forward-dominated or feedback-dominated tissue.
Benchmarks
Users may find that different DotMotif executors suit the needs of different research questions. In order to illustrate these differences, we compared the performance of three DotMotif executors available for use in our downloadable Python module. The GrandIsoExecutor and NetworkXExecutor are both pure-Python, and are therefore desirable for use in constrained compute environments such as shared servers. The Neo4jExecutor requires a Neo4j database, which runs as a standalone executable. Figure 5 illustrates that the novel GrandIso-based executor developed under this effort always outperforms the NetworkX executor, regardless of host-graph size and motif. (Due to the additional overhead-time spent provisioning Neo4j databases, we do not compare these much longer-running results here.) Though we compare searches for small undirected motifs in Erdős–Rényi graphs in Fig. 5, the GrandIso search outperformed NetworkX VF2 on the exhaustive connectome motif searches in Figs. 3 and 4. All graphs were generated using the NetworkX fast_gnp_random_graph algorithm with graph sizes uniformly sampled between 100 and 300, with densities uniformly sampled between 0.05 and 0.3. Complete results are available in Supplemental Material 1.
When choosing between Python-based and graph-database executors, there is a tradeoff between data-ingest rate and search speed. The NetworkXExecutor and GrandIsoExecutor, which convert DotMotif DSL syntax into a series of Python commands, are preferable for running graph queries on a small host graph, or for running the same queries on many host graphs, due to low startup-time overhead. The Neo4jExecutor, which converts DotMotif DSL syntax into a Cypher query for execution with the Neo4j graph database, has significant data-ingest overhead for the first query on a graph, and so it is not ideal for running individual motif queries. However, users may prefer a Neo4j executor when running multiple queries on the same host graph, when many users are running motif queries simultaneously on the same host, or if motif searches are likely to be repeated. For queries on host-graphs of sufficiently large size, the Neo4jExecutor startup overhead becomes trivial compared to the total runtime of the query, and for larger-than-memory host-graphs, the Neo4jExecutor and NeuPrintExecutors may be the only feasible options. For most in-memory exploratory data analysis, we recommend the use of the GrandIsoExecutor. DotMotif users may trivially switch between Executors without modifying their queries, as the DotMotif engine stores queries in an intermediate format after parsing and validation steps (Fig. 6).
Methods
In developing DotMotif, we aimed to ensure that our software was both accessible—intuitive for new users but powerful enough for power-users—as well as sufficiently computationally efficient to execute most common queries on commodity hardware. In the interest of research flexibility, we also required that the software adapt to best utilize available hardware resources. For this reason, DotMotif includes several executors, each of which leverage a different graph analysis technology. Executors are discussed in greater depth in the Benchmarks section.
Formal definition of the subgraph search task
Here we consider the ideas of subgraph isomorphism and subgraph monomorphism, which have slightly varying definitions in the graph theory literature. When we refer to a subgraph, we always refer to a node-induced subgraph, which is inline with the usage in popular subgraph isomorphism research, including research on the commonly used VF2 algorithm41–43. Given graphs and , we say G is isomorphic to H if there exists a bijection such that if and only if 44. An extension of this idea is that a subgraph isomorphism exists between G and H if and only if there exists a subgraph of G such that is isomorphic to H. A monomorphism loosens the bijection requirement to simply injection, meaning that the match in the host graph may contain extra edges.
DotMotif finds all subgraphs in a search graph that match a query graph, where the term match used here means that one of the above described mappings exists. By default, DotMotif will perform a search for all subgraphs that are monomorphic to the query subgraph. A user may choose to search for exact matches only, in which case DotMotif identifies subgraph isomorphisms, rather than monomorphisms. This behavior is controlled by a user toggle during the construction of a query.
Optimizations to the subgraph isomorphism task in Python
For the purposes of quick iteration upon neuroscience hypotheses, our team required a high-speed subgraph isomorphism algorithm that was both pure Python—in order to reduce the barrier to entry for a fresh installation—as well as optimized to minimize CPU and memory usage. To satisfy this need, we developed a subgraph monomorphism search implementation that performs favorably against the standard NetworkX implementation (Fig. 5). Our pure-Python implementation matches the results of the NetworkX GraphMatcher monomorphism detection module, with significantly lower CPU and memory overhead. Additionally, while many modern subgraph iso- and monomorphism implementations rely on a directed, acyclic graph state-space representation, our implementation stores its state space in a one-dimensional queue (Algorithm 1). In future work, we expect that due to this property, our algorithm may be trivially parallelized to operate on multiple cores at once.
Optimized interfaces for common graph libraries
Many existing research questions have been explored with RAM-sized graphs in mind, and have utilized popular Python graph libraries such as NetworkX or IGraph19,45, or emerging libraries such as Networkit46. In order to easily transition these algorithms and analyses to graphs of larger size, we developed an open-source graph library that enables a user to write code using familiar graph APIs (e.g. NetworkX’s nx.DiGraph), but which automatically converts these commands to run instead on any of a number of optimized, scalable backends, such as a graph database implemented in SQL or Amazon Web Services DynamoDB. This tool, alongside the subgraph iso- and monomorphism improvements detailed above, enabled us to achieve substantial performance speed-ups in performing the analyses in this paper. Links to documentation and the source code for this library are available in Supplemental Material 1. All benchmarks listed here were performed on consumer laptop hardware with 16 gigabytes of RAM and a 3.1 GHz Intel Core i7 processor.
Comparison to random graph models
One way to determine if a motif is significant within a real biological dataset is to compare its observed frequency to its expected frequency within a random graph model. There are many different random graph models, each with different properties and capturing a different aspect of real world data47. Here, we perform subgraph search on five different random graph models.
The Erdős–Rényi random model has parameters n, or the number of nodes, and p, or the probability that any of the potential edges will independently exist within the graph31. The Erdős–Rényi model is easy to analyze and understand, and thus is included in this analysis. However, the edge independence assumption often fails for real datasets. Thus, other models are required to provide further context.
Geometric random graphs are desirable because they naturally capture spatial relationships that occur within the brain48. Nodes are uniformly distributed within a region and two nodes are adjacent if they fall within a radius r, which is an adjustable parameter32. This model, however, does not capture long-range neural connections, resulting in some spatial relationships that are not captured by the point cloud approach of the geometric model.
The Watts–Strogatz model addresses the edge dependence issue more directly, recognizing that many real world networks have high clustering coefficients, which is a measure of the extent to which two adjacent nodes have similar neighbors33. The Barabási–Albert model is a preferential-attachment model that ensures a power-law degree distribution, making it a scale free graph model34,47.
The degree-preserving edge-randomization graph model implemented here, based upon efforts such as those in Refs.36,49, accepts a graph as input-parameter and performs edge-swaps between two randomly-selected edges and so that the resulting edges are and . This model preserves the in- and out-degree distribution, as well as the in-degree sequence. Here, this model serves as a control in order to determine the role of degree distribution when determining the prevalence of a motif.
Each of the described random models has a set of parameters resulting in different graph properties. These parameters may be selected to generate graphs “similar” to the real dataset in question. Similarity can mean different things and we try to match the models to a characteristic that is closely tied to motif search. One obvious choice is to match the graph density (D=) of each estimated connectome because density is a metric that can be more easily tweaked across models. Furthermore, a density that is too high or too low can result in inflated or underestimated subgraph counts, respectively50. Other metrics besides graph density may be considered, such as the degree distribution. Regardless of the metric, the parameters selected for a random graph model will differ for each connectome being analyzed. More specifically, differences in density for C. elegans and MICrONS datasets will result in differently parametrized random graph models. By performing subgraph search across these models tuned or calibrated to some characteristic of the real dataset, we narrow the problem space of what drives a given motif to occur at a given frequency. By understanding which models closely approximate certain phenomena such as motif count, we may be able to better identify biological mechanisms that induce these patterns of connectivity in the connectome.
Architecture
DotMotif is comprised of three submodules: a parser module, an optimizer module, and finally an executor module. In order of use, the parser module is responsible for converting the DotMotif domain-specific language into an in-memory representation. The optimizer module is then responsible for converting and simplifying the in-memory motif into its simplest possible representation. The optimizer module may also optionally check for violations of biological priors, in a validation step. Finally, the executor module converts the optimized motif into a query that can be submitted to a graph analysis tool. Each executor is responsible for generating its own target-specific queries. (For example, the Neo4jExecutor generates Cypher queries, and the NetworkXExecutor converts queries to a sequence of Python commands.) Through the coordination of these submodules, DotMotif provides a framework for posing and answering complex graph queries.
DotMotif domain-specific language parser
We identified an impedance mismatch between the flexibility of common query languages (such as Cypher7) and the needs of the research community, where many research questions require overwhelmingly complex or verbose queries. We opted to develop a domain-specific query language to aid in the construction of queries. This enables research-driven query design agnostic to the underlying frameworks.
The DotMotif domain-specific language (DSL) borrows from DOT-like syntax51 as well as from SQL-like syntax in order to expose a succinct and user-friendly query language. For example, the simple query A -> B will return a list of all edges in the complete graph (in other words, the list of all subgraphs of that comprise one edge from a node to node ). In order to make queries understandable and maintainable, the DotMotif DSL supports “macros”, or composable building-blocks that can be combined to generate more complicated queries (Fig. 2c). These macros minimize “boilerplate” syntax without leading to duplicate notation (Fig. 2d). Further illustrations of the DotMotif DSL syntax are available in Fig. 7a,b.
By default, DotMotif returns a single representative element of each automorphism group in the result set. An automorphism is an isomorphism from a graph onto itself44. Returning a single representative of the automorphism group avoids over-counting motifs in the search graph due to motif symmetries. The ability to count only one representative of this group is useful to many common neuroscience questions, but it is not easily accomplished with conventional graph tools. A DotMotif user may also explicitly specify that two nodes in a motif are interchangeable with the automorphism operator (Fig. 7a). Whether the user uses automatic query automorphism detection or notates it manually, the DotMotif optimizer will enrich the query motif to lower the space of possible matches, and thus return a result more rapidly. DotMotif considers both semantic as well as syntactic feasibility when automatically determining automorphism mappings42.
This language is formally defined in Extended Backus–Naur Form52,53 by the grammar file referenced in Supplemental Material 1. This syntax is visualized using the railroad-diagram convention in Fig. 8.
Query optimization and validation
After a query has been parsed and ingested, it is passed to an optimization stage in which the query is reduced to a simplified form in order to improve the speed of query execution. Optionally, a user may also enable the automatic detection of impossible structures—or even biologically implausible structures—in a Validation step (Fig. 7c).
These verbose error messages are intended to warn a user quickly and clearly if a graph query is unlikely or impossible, rather than allowing the user to proceed with a long-running query that is destined to fail. In contrast, the equivalent Cypher command running in Neo4j will still go through the process of execution on the full host graph, but will return no results. DotMotif includes the option to add validators for biological plausibility. These options allow the user to prohibit motifs that—for example—require inhibition and excitation to be performed by the same neuron. These validators, like all validators in the DotMotif package, are both optional as well as modifiable.
Query execution
A parsed, optimized, and validated query can now be evaluated against a large graph. Where necessary, it is possible to run these queries fully in-memory on consumer hardware. DotMotif includes both pure Python as well as graph-database query executors. There is no difference between the results returned from each executor, and so they may be used interchangeably, depending upon the needs and parameters of the experiments and environments.
In order to take advantage of graph database technology, we have implemented a Neo4jExecutor, which leverages the Neo4j7 database and its built-in subgraph match detection algorithm, an implementation of the VF2 algorithm41,42. In order to run queries in this environment, our software converts the optimized query to Cypher, the database query language used natively by Neo4j. If a user does not have a running Neo4j database, our library also includes routines to provision a Docker container54 and quickly ingest the data in an appropriate format for Neo4j to use. A user may also direct queries to a running neuPrint database6. A list of all available executors is available in Table 2.
Table 2.
Executor | NetworkX | GrandIso | Neo4j | NeuPrint |
---|---|---|---|---|
Pure-Python? | Yes | Yes | No | No |
Larger-than-RAM graphs? | No | Yes | Yes | Yes |
Simultaneous searches? | No | No | Yes | Yes |
Export Cypher queries? | No | No | Yes | Yes |
All executors are capable of running any of the queries in this paper. Pure Python executors are desirable due to their low configuration and installation overhead, though they tend to be slower than graph databases. Searches in larger-than-RAM graphs are partially supported in the GrandIso executor with the use of the Grand library discussed in “Optimized interfaces for common graph libraries”. Graph databases support this by default. Both neuPrint and Neo4j run as a separate service, and both accept simultaneous client connections, whereas the Python executors cannot. The user can use these executors interchangeably, and may find that graph-database executors work better for specific types of motif searches.
Discussion
Though modern connectomics research commonly interprets neural connectivity as a network, the field currently suffers from a lack of accessible, performant graph analysis and subgraph-search tools. In this work, we present DotMotif, a combined domain-specific language and interface to powerful graph-search tools. Our hope is that this tool and others like it will reduce the barrier-to-entry for researchers unacquainted with graph theory or graph databases, and will enable researchers to interrogate increasingly common connectome datasets with ease. Though the undirected graphs from the graph atlas study may not directly provide answers to open questions in neuroscience, this broad search can narrow down where it may be most impactful to dig deeper, perhaps leading to more interesting directed graph searches or neural simulations over smaller directed versions55. Such exhaustive motif searches may also enable better characterization of local network properties and dynamics39. Future work may include random assignment of edge-direction in order to perform an unbiased search across the space of all possible directed graphs, akin to other random-graph searches8. We intend to further explore other metrics beyond density for random graph model calibration in the future.
Through these directed and undirected motif studies, it may also be possible to build a heuristic to predict motif-counts in a connectome using a small number of simpler (i.e. lower computational cost) motif searches, and avoid explicit counting altogether. For example, if the goal of a research study were to determine if the actual count of a motif in question aligns with predictions for a connectome of a certain modality, size, species, and brain region, such a model would circumvent the need to search in random graph models, and would instead enable a simpler, analytic approach.
Recent neuroscience research10,12,38,56,57 has provided many valuable explanations for neuron-to-neuron connectivity rules. We intend for these tools to offer a quantitative way to expand these rules to larger numbers of neurons, and to free the neuroscience community to explore new questions in the connectomics domain. Similarly, several efforts have begun to map subneuron connectivity patterns, identifying recurring motifs in the relationships of multiple synapses between the same neurons58–60. We believe that the systems presented here will provide a substrate upon which to build not only nanoscale connectivity research but also subcellular motif work and meso- to macro-scale connectomics, through modalities such as X-ray microtomography or MRI18,61,62, by identifying motifs in network connectivity across brain regions or in large-gauge neurites63–65.
Despite many of the simplifications and optimizations to the subgraph monomorphism task mentioned here, this task in general is NP-complete43,66–68. Even in a relatively small connectome such as that of C. elegans, certain sufficiently common or poorly-optimized motif searches may still remain infeasible to run on consumer hardware. In our experience, the addition of further constraints, such as specifying edge direction, specifying node- and edge-attributes, or searching for query graphs with nodes of high degree, may reduce the execution time of complex queries; but other seemingly helpful modifications, such as increasing the diameter of a query graph, may counter-intuitively increase runtime quite substantially. As a result of such longer query runtimes or more compute-hungry queries, researchers will be able to run fewer queries per study.
In our ongoing work, we hope to reduce this barrier by developing more rigorous automated query optimizations and by publishing connectome query results for reuse. We also hope to use tools from probabilistic and Bayesian network analysis to better understand how global network properties (such as degree sequence or density) impact motif incidence. We will introduce and refine new validators that aid DotMotif query designers in identifying qualities of a query graph that may lead to long runtimes, and we hope that the atlas results published above may shed light on how to select query graphs for further study in a dataset. We intend to disseminate an open-data motif encyclopedia so that long-running or complex queries may be run on a dataset once and then shared with the community.
We also note that faster implementations of the motif-search problem—relevant to connectomics, pharmaceutical research, and other domains—may pave the way to faster and more efficient scientific discovery.
The field of connectomics is at an inflection point as datasets continue to grow in size, as technology and neuroscience provide avenues to create and study large datasets at unprecedented scales beyond the analysis capabilities of a single lab. Open, free, and public datasets, as well as accessible and affordable tools to understand those public data, are of paramount importance.
We are releasing the DotMotif codebase, as well as all demonstration code and data in this manuscript, as open-source and open-data tools to support community discovery. We invite collaborators to share questions that allow us extend DotMotif to test scientific hypotheses.
Citation diversity statement
Recent work in several fields of science has identified a bias in citation practices such that papers from women and other minority scholars are under-cited relative to the number of such papers in the field69–73. Here we sought to proactively consider choosing references that reflect the diversity of the field in thought, form of contribution, gender, race, ethnicity, and other factors. First, we obtained the predicted gender of the first and last author of each reference by using databases that store the probability of a first name being carried by a woman73,74. By this measure (and excluding self-citations to the first and last authors of our current paper), our references contain 7.94% woman(first)/woman(last), 14.29% man/woman, 22.22% woman/man, and 55.56% man/man. This method is limited in that a) names, pronouns, and social media profiles used to construct the databases may not, in every case, be indicative of gender identity and b) it cannot account for intersex, non-binary, or transgender people. Second, we obtained predicted racial/ethnic category of the first and last author of each reference by databases that store the probability of a first and last name being carried by an author of color75,76. By this measure (and excluding self-citations), our references contain 6.33% author of color (first)/author of color(last), 31.65% white author/author of color, 13.92% author of color/white author, and 48.1% white author/white author. This method is limited in that (a) names and Florida Voter Data to make the predictions may not be indicative of racial/ethnic identity, and (b) it cannot account for Indigenous and mixed-race authors, or those who may face differential biases due to the ambiguous racialization or ethnicization of their names. We look forward to future work that could help us to better understand how to support equitable practices in science.
Supplementary Information
Acknowledgements
The authors would like to thank Konrad Kording for advice and review of this manuscript. We would like to thank the teams that generated the connectome data used in this analysis. Research reported in this publication was supported by the National Institute of Mental Health of the National Institutes of Health under Award Numbers R24MH114799 and R24MH114785. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This work was also completed with the support of JHU/APL Internal Research Funding.
Author contributions
J.M. and J.S. conducted the experiments and designed the software. E.R. and E.J. analyzed the results. B.W., D.B. and W.G.R. conceived and designed the experiments, analyzed the results, and secured funding for this work. All authors wrote and reviewed the manuscript.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Jordan K. Matelsky, Email: jordan.matelsky@jhuapl.edu
William Gray-Roncal, Email: william.gray.roncal@jhuapl.edu.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-021-91025-5.
References
- 1.Bassett DS, Zurn P, Gold JI. On the nature and use of models in network neuroscience. Nat. Rev. Neurosci. 2018;19:566–578. doi: 10.1038/s41583-018-0038-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Xu CS, et al. A connectome of the adult drosophila central brain. biorXiv. 2020 doi: 10.1101/2020.01.21.911859. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Vogelstein, J. T. et al. A community-developed open-source computational ecosystem for big neuro data. Nat. Methods.15, 846–847. 10.1038/s41592-018-0181-1 (2018). [DOI] [PMC free article] [PubMed]
- 4.Hočevar T, Demšar J. Combinatorial algorithm for counting small induced graphs and orbits. PLoS One. 2017;12:e0171428. doi: 10.1371/journal.pone.0171428. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Scheffer LK. Graph properties of the adult drosophila central brain. biorXiv. 2020 doi: 10.1101/2020.05.18.102061. [DOI] [Google Scholar]
- 6.Clements J, et al. neuPrint: Analysis tools for EM connectomics. biorXiv. 2020 doi: 10.1101/2020.01.16.909465. [DOI] [Google Scholar]
- 7.Fernandes D, Bernardino J. Graph databases comparison: AllegroGraph, ArangoDB, InfiniteGraph, Neo4J, and OrientDB. Data. 2018;20:373–380. [Google Scholar]
- 8.Sporns O, Kötter R. Motifs in brain networks. PLoS Biol. 2004;369:20. doi: 10.1371/journal.pbio.0020369. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Mountcastle VB. Modality and topographic properties of single neurons of Cat’s somatic sensory cortex. J. Neurophysiol. 1957;20:408–434. doi: 10.1152/jn.1957.20.4.408. [DOI] [PubMed] [Google Scholar]
- 10.Turner NL, et al. Multiscale and multimodal reconstruction of cortical structure and function. biorXiv. 2020 doi: 10.1101/2020.10.14.338681. [DOI] [Google Scholar]
- 11.Schreiber F, Schwobbermeyer H. MAVisto: A tool for the exploration of network motifs. Bioinformatics. 2005;21:3572–3574. doi: 10.1093/bioinformatics/bti556. [DOI] [PubMed] [Google Scholar]
- 12.Milo R, et al. Network motifs: Simple building blocks of complex networks. Science. 2002;298:824–827. doi: 10.1126/science.298.5594.824. [DOI] [PubMed] [Google Scholar]
- 13.Pospelov N, et al. Spectral peculiarity and criticality of a human connectome. Phys. Life Rev. 2019;31:240–256. doi: 10.1016/j.plrev.2019.07.003. [DOI] [PubMed] [Google Scholar]
- 14.Reilly EP, et al. Neural reconstruction integrity: A metric for assessing the connectivity accuracy of reconstructed neural networks. Front. Neuroinform. 2018;12:74. doi: 10.3389/fninf.2018.00074. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Reilly EP, et al. Complex Networks XI. ***: Springer; 2020. Connecting neural reconstruction integrity (NRI) to graph metrics and biological priors; pp. 182–193. [Google Scholar]
- 16.Sporns O. Structure and function of complex brain networks. Dialogues Clin. Neurosci. 2013;15:247–262. doi: 10.31887/DCNS.2013.15.3/osporns. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Swanson LW, Lichtman JW. From Cajal to connectome and beyond. Annu. Rev. Neurosci. 2016;39:197–216. doi: 10.1146/annurev-neuro-071714-033954. [DOI] [PubMed] [Google Scholar]
- 18.Bassett DS, Sporns O. Network neuroscience. Nat. Neurosci. 2017;20:353–364. doi: 10.1038/nn.4502. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Hagberg, A. A., Schult, D. A. & Swart, P. J. Exploring Network Structure, Dynamics, and Function using NetworkX. In Proceedings of the 7th Python in Science Conference, Pasadena, CA, USA (eds Varoquaux, G. et al.) 11–15 (2008).
- 20.Google. Cayley: An open-source graph database. https://github.com/cayleygraph/cayley (2020).
- 21.Wang R, Yang Z, Zhang W, Lin X. An empirical study on recent graph database systems. Lecture Notes Comput. Sci. 2020 doi: 10.1007/978-3-030-55130-8_29. [DOI] [Google Scholar]
- 22.Wernicke S, Rasche F. FANMOD: A tool for fast network motif detection. Bioinformatics. 2006;22:1152–1153. doi: 10.1093/bioinformatics/btl038. [DOI] [PubMed] [Google Scholar]
- 23.Ribeiro P, Silva F. G-Tries: A data structure for storing and finding subgraphs. Data Min. Knowl. Discov. 2013;28:337–377. doi: 10.1007/s10618-013-0303-4. [DOI] [Google Scholar]
- 24.Cook SJ, et al. Whole-animal connectomes of both Caenorhabditis elegans sexes. Nature. 2019;571:63–71. doi: 10.1038/s41586-019-1352-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.White JG, Southgate E, Thomson JN, Brenner S. The structure of the nervous system of the nematode C. elegans. Philos. Trans. R. Soc. Lond. 1986;314:1–340. doi: 10.1098/rstb.1986.0056. [DOI] [PubMed] [Google Scholar]
- 26.Becker, L. et al. MICrONS Layer 2/3 Data Tables. 10.5281/zenodo.3710459 (2020).
- 27.Jarrell TA, et al. The connectome of a decision-making neural network. Science. 2012;337:437–444. doi: 10.1126/science.1221762. [DOI] [PubMed] [Google Scholar]
- 28.Dorkenwald S, et al. Binary and analog variation of synapses between cortical pyramidal neurons. biorXiv. 2019 doi: 10.1101/2019.12.29.890319. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Schneider-Mizell CM, et al. Chandelier cell anatomy and function reveal a variably distributed but common signal. biorXiv. 2020 doi: 10.1101/2020.03.31.018952. [DOI] [Google Scholar]
- 30.Read RC, Wilson RJ. An Atlas of Graphs. ***: Oxford University Press Inc; 2005. [Google Scholar]
- 31.Erdös P, Rényi A. On random graphs I. Publ. Math. Debrecen. 1959;6:290. [Google Scholar]
- 32.Penrose D, Penrose M, Press OU. Random Geometric Graphs. Oxford Studies in Probability. ***: Oxford University Press; 2003. [Google Scholar]
- 33.Watts DJ, Strogatz SH. Collective dynamics of “small-world” networks. Nature. 1998;393:440–442. doi: 10.1038/30918. [DOI] [PubMed] [Google Scholar]
- 34.Albert R, Barabási A-L. Statistical mechanics of complex networks. Rev. Mod. Phys. 2002;74:47–97. doi: 10.1103/revmodphys.74.47. [DOI] [Google Scholar]
- 35.Newman MEJ. The structure and function of complex networks. SIAM Rev. 2003;45:200–202. doi: 10.1137/s003614450342480. [DOI] [Google Scholar]
- 36.Roberts, E. S. & Coolen, A. C. C. Unbiased degree-preserving randomisation of directed binary networks. Phys. Rev. E85, 046103 (2011). [DOI] [PubMed]
- 37.Brittin CA, Cook SJ, Hall DH, Emmons SW, Cohen N. Volumetric reconstruction of main Caenorhabditis elegans neuropil at two different time points. biorXiv. 2018 doi: 10.1101/485771. [DOI] [Google Scholar]
- 38.Gal E, Perin R, Markram H, London M, Segev I. Neuron geometry underlies universal network features in cortical microcircuits. biorXiv. 2019 doi: 10.1101/656058. [DOI] [Google Scholar]
- 39.Curto, C. et al. Robust motifs of threshold-linear networks. arXiv:1902.10270 (2019).
- 40.Whalen AJ, Brennan SN, Sauer TD, Schiff SJ. Observability and controllability of nonlinear networks: The role of symmetry. Phys. Rev. X. 2015;5:011005. doi: 10.1103/PhysRevX.5.011005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Cordella, L. P., Foggia, P., Sansone, C. & Vento, M. An improved algorithm for matching large graphs. In In: 3rd IAPR-TC15 Workshop on Graph-based Representations in Pattern Recognition, Cuen, 149–159 (2001).
- 42.Cordella L, Foggia P, Sansone C, Vento M. A (sub)graph isomorphism algorithm for matching large graphs. Pattern Anal. Mach. Intell. IEEE Trans. 2004;26:1367–1372. doi: 10.1109/TPAMI.2004.75. [DOI] [PubMed] [Google Scholar]
- 43.Ullmann JR. An algorithm for subgraph isomorphism. J. ACM (JACM) 1976;23:31–42. doi: 10.1145/321921.321925. [DOI] [Google Scholar]
- 44.West DB. Introduction to Graph Theory. ***: Prentice Hall; 2001. [Google Scholar]
- 45.Csardi G, Nepusz T. The igraph software package for complex network research. Int. J. Complex Syst. 2006;1695:20. [Google Scholar]
- 46.Staudt, C. L. et al. NetworKit: A tool suite for large-scale complex network analysis. Netw. Sci. 4(4), 508–530. 10.1017/nws.2016.20 (2015).
- 47.Chung J, et al. Statistical connectomics. Center Open Sci. 2020 doi: 10.31219/osf.io/ek4n3. [DOI] [Google Scholar]
- 48.Stiso J, Bassett DS. Spatial embedding imposes constraints on neuronal network architectures. Trends Cogn. Sci. 2018;22:1127–1142. doi: 10.1016/j.tics.2018.09.007. [DOI] [PubMed] [Google Scholar]
- 49.Maslov S, Sneppen K. Specificity and stability in topology of protein networks. Science. 2002;296:910–913. doi: 10.1126/science.1065103. [DOI] [PubMed] [Google Scholar]
- 50.van Wijk BCM, Stam CJ, Daffertshofer A. Comparing brain networks of different size and connectivity density using graph theory. PLoS ONE. 2010;5:e13701. doi: 10.1371/journal.pone.0013701. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Gansner ER, Koutsofios E, North SC, Vo K phong. A technique for drawing directed graphs. IEEE Trans. Softw. Eng. 1993;19:214–230. doi: 10.1109/32.221135. [DOI] [Google Scholar]
- 52.Backus, J. W. The syntax and semantics of the proposed international algebraic language of the Zurich ACM-GAMM Conference. In IFIP Congress, 125–131 (Butterworths, London, 1959).
- 53.Knuth DE. Backus normal form vs. backus naur form. Commun. ACM. 1964;7:735–736. doi: 10.1145/355588.365140. [DOI] [Google Scholar]
- 54.Merkel D. Docker: Lightweight linux containers for consistent development and deployment. Linux J. 2014;2014:2. [Google Scholar]
- 55.Aqil M, Atasoy S, Kringelbach ML, Hindriks R. Graph neural fields: A framework for spatiotemporal dynamical models on the human connectome. PLoS Comput. Biol. 2021;17:e1008310. doi: 10.1371/journal.pcbi.1008310. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Betzel RF, et al. Generative models of the human connectome. Neuroimage. 2016;124:1054–1064. doi: 10.1016/j.neuroimage.2015.09.041. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Schröter M, Paulsen O, Bullmore ET. Micro-connectomics: Probing the organization of neuronal networks at the cellular scale. Nat. Rev. Neurosci. 2017;18:131–146. doi: 10.1038/nrn.2016.182. [DOI] [PubMed] [Google Scholar]
- 58.Morgan JL, Berger DR, Wetzel AW, Lichtman JW. The fuzzy logic of network connectivity in mouse visual thalamus. Cell. 2016;165:192–206. doi: 10.1016/j.cell.2016.02.033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Morgan JL, Lichtman JW. An individual interneuron participates in many kinds of inhibition and innervates much of the mouse visual thalamus. Neuron. 2020;106:468–481.e2. doi: 10.1016/j.neuron.2020.02.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Wilson AM, et al. Developmental rewiring between cerebellar climbing fibers and purkinje cells begins with positive feedback synapse addition. Cell Rep. 2019;29:2849–2861.e6. doi: 10.1016/j.celrep.2019.10.081. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Yu Y-C, Bultje RS, Wang X, Shi S-H. Specific synapses develop preferentially among sister excitatory neurons in the neocortex. Nature. 2009;458:501–504. doi: 10.1038/nature07722. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Prasad JA, et al. A three-dimensional thalamocortical dataset for characterizing brain heterogeneity. biorXiv. 2020 doi: 10.1101/2020.05.22.111617. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Fornito A, Zalesky A, Breakspear M. Graph analysis of the human connectome: Promise, progress, and pitfalls. Neuroimage. 2013;80:426–444. doi: 10.1016/j.neuroimage.2013.04.087. [DOI] [PubMed] [Google Scholar]
- 64.Huang W, et al. A graph signal processing perspective on functional brain imaging. Proc. IEEE. 2018;106:868–885. doi: 10.1109/JPROC.2018.2798928. [DOI] [Google Scholar]
- 65.Sizemore AE, et al. Cliques and cavities in the human connectome. J. Comput. Neurosci. 2017;44:115–145. doi: 10.1007/s10827-017-0672-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Han, M., Kim, H., Gu, G., Park, K. & Han, W.-S. Efficient subgraph matching. Proceedings of the 2019 International Conference on Management of Data - SIGMOD ’19. 10.1145/3299869.3319880 (2019).
- 67.Vassilevska, V. & Williams, R. Finding, minimizing, and counting weighted subgraphs. Proceedings of the 41st Annual ACM Symposium on Symposium on Theory of Computing-STOC ’09. 10.1145/1536414.1536477 (2009).
- 68.Hell P, Nešetřil J. Colouring, constraint satisfaction, and complexity. Comput. Sci. Rev. 2008;2:143–163. doi: 10.1016/j.cosrev.2008.10.003. [DOI] [Google Scholar]
- 69.Mitchell SM, Lange S, Brus H. Gendered citation patterns in international relations journals. Int. Stud. Perspect. 2013;14:485–492. doi: 10.1111/insp.12026. [DOI] [Google Scholar]
- 70.Dion ML, Sumner JL, Mitchell SM. Gendered citation patterns across political science and social science methodology fields. Polit. Anal. 2018;26:312–327. doi: 10.1017/pan.2018.12. [DOI] [Google Scholar]
- 71.Caplar N, Tacchella S, Birrer S. Quantitative evaluation of gender bias in astronomical publications from citation counts. Nat. Astron. 2017;1:0141. doi: 10.1038/s41550-017-0141. [DOI] [Google Scholar]
- 72.Maliniak D, Powers R, Walter BF. The gender citation gap in international relations. Int. Organ. 2013;67:889–922. doi: 10.1017/S0020818313000209. [DOI] [Google Scholar]
- 73.Dworkin, J. D. et al. The extent and drivers of gender imbalance in neuroscience reference lists. bioRxiv. 10.1101/2020.01.03.894378 (2020). https://www.biorxiv.org/content/early/2020/01/11/2020.01.03.894378.full.pdf. [DOI] [PubMed]
- 74.Zhou, D. et al. Gender diversity statement and code notebook v1.0. 10.5281/zenodo.3672110 (2020).
- 75.Ambekar, A., Ward, C., Mohammed, J., Male, S. & Skiena, S. Name-ethnicity classification from open sources. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 49–58 (2009).
- 76.Sood, G. & Laohaprapanon, S. Predicting race and ethnicity from the sequence of characters in a name. arXiv:1805.02109 (arXiv preprint) (2018).
- 77.Atkins-Bittner, T. Railroad-diagram generators. https://github.com/tabatkins/railroad-diagrams (2020).
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.