Skip to main content
AMIA Summits on Translational Science Proceedings logoLink to AMIA Summits on Translational Science Proceedings
. 2020 May 30;2020:646–653.

Can Neo4j Replace PostgreSQL in Healthcare?

Jessica A M Stothers 1, Andrew Nguyen 1
PMCID: PMC7233060  PMID: 32477687

Abstract

Our current big data landscape prompts us to develop new analytical skills in order to make the best use of the abundance of datasets at hand. Traditionally, SQL databases such as PostgreSQL have been the databases of choice, and newer graph databases such as Neo4j have been relegated to the analysis of social network and transportation datasets. In this paper, we conduct a side by side comparison of PostgreSQL (using SQL) and Neo4j (using Cypher) using the MIMIC-III patient database as a case study. We found that, while Neo4j is more time intensive to implement, its queries are less complex and have a faster runtime than comparable queries performed in PostgreSQL. This leads to the conclusion that while PostgreSQL is adequate as a database, Neo4j should be considered a viable contender for health data storage and analysis.

Introduction

We live in a world of big data, where data scientists work with access to more information than ever before. While this big data provides us with an unprecedented ability to investigate questions and find answers, it also requires us to develop new analytical skills in order to make the best use of the datasets. These big datasets require databases to contain and interpret them, and until recently relational databases such as PostgreSQL1 were the default choice. Relational databases store data in a tabular format, in rows and columns, and interact with the data through structured queries. Data values are stored within a predefined schema, which makes little allowance for the ability to adapt the database in response to new data2 . Relational databases certainly have their value and offer data safety (through ACID compliance) and query flexibility. They are also supported by many pre-built tools to enable more effective data migration, interaction and analytics3. However, due to their reliance on static schemas, relational databases can be inflexible and unsuited to a dynamic data environment. Relational databases can be difficult to scale horizontally, and may show poor performance with large datasets as they have difficulty with high-volume read and write requests4.

NoSQL databases were introduced to solve many of these issues, and because of this have had increasing popularity. NoSQL databases follow CAP (Consistency, Availability, and Partition Tolerance) and are well suited to network partitioning, enabling horizontal scalability5. This makes graph databases a useful tool for storing highly variable data that is difficult to contain in a pre-defined schema. Graph databases such as AllegroGraph, Datastacks and Neo4j are a particularly interesting form of NoSQL database, which store data in graph format as nodes and relationships, focusing on the relationships between nodes as a key feature of the database (Figure 1). In property graph databases such as Neo4j, every node and relationship can have properties6. For example, in Kan et al.’s 2017 paper “Topology Modeling and Analysis of a Power Grid Network Using a Graph Database”, one way that Kan et al. use Neo4j is to determine which equipment has power if power is provided to certain nodes7. With buses represented by nodes, each bus node has a voltage property and a transformer or line relationship to other bus nodes. The transformer and line relationships have properties, such as whether the incoming switch is on or off.

Figure 1.

Figure 1.

Relational vs Graph Databases

Graph databases are also built to easily run queries involving multiple levels of relationships between instances8. Because graph databases focus on the relationships between instances, they are a natural choice to store relationally focused data and have been widely implemented to store social network, transportation and biology datasets9. Many large companies have also started to make use of graph databases, with companies such as Walmart and eBay using Neo4j to generate recommendations for customers10. While these datasets seem to be natural fits for graph storage, graph databases can also be applied to traditionally tabular datasets and can help users to gain new insights on data clusters and relations.

In this project, we conducted a case study to investigate whether patient data could be viably stored and analyzed within a graph database. This project is a feasibility study with a public dataset to ascertain whether clinical researchers without previous database building experience would benefit from examining clinical data within a graph database. While relational databases such as PostgreSQL do provide an acceptable level of query speed, they are not well suited for tasks such as linking disparate data sets or gaining an intuitive understanding of the data at hand. With this project we hope to identify whether the graph database Neo4j can compete with PostgreSQL in ease of use and query speed, while offering advantages in database flexibility and data exploration. We chose Neo4j with Cypher as the graph database and PostgreSQL with SQL as the tabular baseline for comparison (Figure 2). We compared the databases on command complexity and efficiency, as well as ease of data importation and analysis. Our objective in doing so was to gain some insight on which database was better suited for clinical researchers investigating patient data.

Figure 2.

Figure 2.

PostgreSQL and Neo4j

Methods

Data

For this project, we worked with the MIMIC-III Critical Care data, an open database containing deidentified data from over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 201211. For the purposes of this project we used a subset containing the data of 100 patients. We selected this dataset as this sample data is freely available to the public, making our project reproducible by any interested party seeking to explore graph databases in a healthcare setting or build upon our work.

Computational Environment

To ensure a similar computational environment all databases were created on the same computer, a Macbook Pro with 8GB of RAM and a 2.7 GHz Intel Core i5 processor.

Queries

We compared operations between the three databases for twenty-two queries ranging in complexity from two to thirty-five clauses. With the exception of the two queries used to compare operations with and without the use of join tables, all SQL queries were sourced from the documentation page for MIMIC-III12. These queries were chosen as representative queries as they are freely accessible and available for any interested party to reproduce. We created Cypher queries to best approximate the available SQL queries while attempting to minimize runtime. The Cypher queries attempted to replicate the same level of explicitness and readability as the SQL queries, so as to be useable by a broad audience.

PostgreSQL

As the MIMIC-III database is designed to be interfaced through PostgreSQL, we started by building a local PostgreSQL database using MIMIC-III’s provided instructions12. To do this, we used pgAdmin 4 version 4.5 with PostgreSQL. We then performed the example queries provided by MIMIC-III and recorded their time to completion and complexity. These queries were used as a baseline to compare operations between PostgreSQL and Neo4j.

Neo4j - Unmodified

We then imported the patient data into Neo4j and replicated the PostgreSQL schema as closely as possible within Neo4j, where data is stored as nodes with properties representing column values. We installed the Neo4j APOC library extension to enable more efficient Cypher queries,13 in particular the ability to iterate over batches was of especial use. Each table within PostgreSQL became a node label within Neo4j, and each row a node with properties linked to the PostgreSQL column values for that entry. We replicated the PostgreSQL indexes and created relationships between node labels wherever relationships were indicated within MIMIC-III’s schema14. We then replicated the example queries within Neo4j using Cypher and recorded their time to completion and complexity, with an example query presented in Table 1. It should be noted that no explicit relationships exist within the PostgreSQL database and these relationships must be replicated by joining tables while performing SQL queries, however the relationships were clearly intended to exist between the tables and so we included them within the unmodified Neo4j database.

Table 1.

Example comparison of command complexity and execution time in PostgreSQL (SQL) and Neo4j (Cypher)

Show all of the patients diagnosed with Hypertension NOS
Language Query Number of Clauses Runtime Runtime Max DB Memory Usage
SQL SELECT *
FROM diagnoses_icd dia
INNER JOIN admissions adm
ON dia.hadm_id = adm.hadm_id
AND dia.icd9_code = ′4019′
INNER JOIN d_icd_diagnoses dicd
ON dia.icd9_code = dicd.icd9_code;
7 158 ms 152 MB 4 MBa
Cypher (unmodified db) MATCH
(adm:Admissions)-[r:ShareHADMID]-(dia:Di
agnoses_ICD {icd9_code:
′4019′})-[r2:ShareICD9Code]-(dicd:D_ICD_
Diagnoses)
RETURN adm, r, dia, r2, dicd;
2 38 ms 682 MB 1 GBb
Cypher (modified db) MATCH
(adm:Admissions)-[r:DiagnosedWith]-(dicd:D
_ICD_Diagnoses {icd9_code: ′4019′})
RETURN adm, r, dicd;
2 21 ms 349 MB 1 GBb
a

work_mem: determines how much memory can be used by sort operations and hash tables before writing to temporary disk files

b

dbms.memory.heap.max_size: determines how much memory can be used for query execution, transaction state and graph management

Neo4j - Modified

Next we created a modified Neo4j database to take advantage of Neo4j’s graph structure while performing queries on the patient dataset. We began with a replica of the unmodified Neo4j MIMIC-III database and then replaced all join tables with relationships containing the properties inherent in the join table. This reduced the number of pass-through relationships. Following this, we eliminated redundant relationships where both the Admissions and Patient node label were linked to the same node. In doing so we worked under the assumption that a researcher querying the database would be primarily interested in the relationships between Admissions and other node labels, as Admissions contains most of the relevant information on individual patients. We left a relationship between Patients and Admissions as well as ICU_Stays so that a researcher could still easily access the necessary patient information if needed. If this assumption of primary interest in Admissions was violated, runtime would increase and queries would increase in complexity for any query involving Patients and another node label. This is because either Admissions or ICU_Stays would essentially function as a join table between Patients and the relevant node label. Lastly, we replicated the queries performed in the PostgreSQL and unmodified Neo4j databases and performed some additional queries that took advantage of the simpler database structure and reduction of pass-through relationships. We then replicated these additional queries in the PostgreSQL and unmodified Neo4j databases. All queries were compared for complexity and completion time.

Results

Database Implementation

While importing data into Neo4j did not take noticeably longer than importing data into PostgreSQL, creating relationships between the data in Neo4j did take a significant amount of time. This time increased sharply with operational complexity and the number of nodes involved, with the creation of 753 relationships in an operation involving 13,240 nodes taking 91 ms while the creation of 3432 relationships in an operation involving 16,457 nodes took 8, 948, 044 ms (2.5 hours). Though both Neo4j databases took up more disk space than the PostgreSQL database, modifying the Neo4j database by replacing join tables with relationships and removing redundant relationships caused the modified Neo4j database to occupy around half of the disk space of the unmodified Neo4j database.

Command Complexity

With the exception of database metadata commands, for which no direct comparison exists between PostgreSQL and Neo4j, commands were almost universally shorter with fewer clauses in Neo4j’s Cypher than in PostgreSQL’s SQL (Table 2). There was no difference between the command clause number in the two Neo4j databases trialed. The metadata commands have been excluded from comparison in the tables below.

Table 2.

Clause Complexity in PostgreSQL (SQL) and Neo4j (Cypher)

Language Average Number of Clauses per Operation
SQL 10.33
Cyper (unmodified database) 7.86
Cypher (modified database) 7.86

Operation Runtime

Once the database was implemented, operations were substantially faster in Neo4j than in PostgreSQL (Table 3). While there was very little difference in operation runtime between the modified and unmodified Neo4j databases for most commands, for operations which involved the deleted join tables the modified Neo4j database was faster (avg 47.5 ms for unmodified vs avg 30 ms for modified).

Table 3.

Operation Runtime in PostgreSQL (SQL) and Neo4j (Cypher)

Language Average Operation Runtime
SQL 154.48 ms
Cyper (unmodified database) 48.86 ms
Cypher (modified database) 45.26 ms

Repository

A repository of the Cypher queries used to create the modified and unmodified Neo4j databases, as well as the SQL and Cypher queries used for comparison along with their clause count and runtime can be found at https://github.com/JStothers/MIMIC-III-PostgreSQL-vs-Neo4j.

Discussion

After creating patient databases in PostgreSQL and Neo4j, we observed that both databases performed well, but there were some notable differences between them. The PostgreSQL database was significantly faster to implement with no commands taking more than a few minutes to complete. In contrast, while importing the data did not take notably longer in Neo4j, creating complex relationships between node labels required a significant amount of time and resources, even when using a reduced dataset of 100 patients. This discrepancy was most clear when implementing relationships to replace join tables, as these relationships required matching between more than two node labels and took ten thousand times longer than the creation of simpler relationships. We were able to reduce the operation time from days to hours by utilizing a periodic iterate command, but even so this operation took exponentially longer than any of the PostgreSQL implementation operations. It should be noted that PostgreSQL does not include relationships in its structure so there is no comparable PostgreSQL operation other than the creation of join tables, and so there was not a direct comparison for this implementation operation. It should also be noted that, by including this relationship creation operation in implementation Neo4j, we were able to reduce the operation cost to perform matching queries described by the relationship. So, while Neo4j required more time to set up the database, by doing so it laid the groundwork for more efficient queries. This is an observation that has been experimentally confirmed by other researchers4, 15, 16.

Overall Neo4j Cypher queries involved fewer clauses and were computationally more efficient than PostgreSQL SQL queries, with the exception of metadata queries, which required use of the APOC library to provide similar results to PostgreSQL’s built in metadata query. Neo4j also lacked a specific groupby function, which made groupby style operations marginally more difficult than in PostgreSQL. The most notable downside to Neo4j was the difficulty in creating relationships between node labels involving over ten thousand nodes. As the dataset grew larger, relationship creation operations quickly became impossible to implement without resorting to batch implementation methods from the APOC library. This problem can be mitigated in Neo4j by modeling data in a way that makes queries more efficient with shorter, simpler query paths, a necessity that is not encountered to such a large degree in relational databases17. While this is not an insurmountable barrier, it is a problem that will be common to many seeking to translate relational datasets into graph databases.

Implementation aside, Neo4j provided many advantages relating to operational simplicity, especially in the use of relationships to avoid joining tables. For this particular project, Neo4j’s builtin date-time calculations were also helpful in reducing code complexity and operation time when performing date-time operations such as age at admission. The ability to include properties in a relationship as opposed to using a join table is a clear advantage of Neo4j. While we did not fully take advantage of Neo4j’s flexibility in this project, it would be comparatively simple to do so by integrating this dataset with another. For example, we could link a Medicare/Medicaid claims dataset with the MIMIC-III dataset by forming relationships on shared ICD-9 codes without the need for new cumbersome join tables. Neo4j’s lack of a rigid schema gives it an ability to link disparate datasets quickly and easily, without compromising the existing network.

Neo4j is also well suited to storing information structures that are not well suited to relational databases, such as ontologies or networks. The Unified Medical Language System (UMLS) semantic network was created to link together different health and biomedical vocabularies and standards in the interest of promoting interoperability. This network would be an ideal candidate for representation in a graph database. Relationships could be created based on existing concept unique identifiers (CUIs) to easily link separate ontologies. The UMLS could also be linked to existing patient datasets such as the MIMIC-III data by linking ICD to SNOMED codes, allowing patient databases access to a wealth of relevant information. The potential uses for graph databases are endless, and many studies have already made use of graph databases to understand connections between heterogenous biological data such as disease networks or gene-disease associations18,19,20. These studies have found that Neo4j is well suited to storing complex biological relationships, and that the graph database structure enables researchers to gain insight into novel relationships between biological entities. Neo4j’s simple language and meaningful, intuitive query results make Neo4j an excellent database choice for domain experts without previous database experience, seeking to explore their own data.

Conclusion

While PostgreSQL is adequate as a database and has the advantage of familiarity as well as a wider resource library, Neo4j provides significant operational advantages that should not be discounted. Cypher has high readability as a query language, and Neo4j’s ability to present query results in multiple formats allows for greater flexibility than PostgreSQL and gives creative insights in data interpretation. The ability to immediately see data clusters is helpful in gaining high level understanding of a dataset, as well as for identifying areas for further data exploration. Neo4j is not necessarily a replacement for PostgreSQL, but it should be seen as a viable alternative in areas other than social network or transportation mapping, such as healthcare. While this project did not explore the comparative performance of Neo4j and PostgreSQL with large datasets, this investigation has shown that Neo4j is a feasible database choice for clinical researchers and has laid the groundwork for future tests. In all, Neo4j’s efficient queries and attractive interface lead to ease of use and an intuitive user experience, especially for clinical user without previous database experience. Neo4j has value as a first choice for clinical database implementation and should be trialed in many more dataset types in the future.

Figures & Table

References

  • 1.About PostgreSQL [Internet] The PostgreSQL Global Development Group. c2019. [cited 2019 Aug 13]. Available from: https://www.postgresql.org/about/
  • 2.Kaur K, Rani R. Modeling and querying data in noSQL databases; 2013 IEEE International Conference on Big Data; 2013. pp. 1–7. [Google Scholar]
  • 3.Redmond E, Wilson JR. Seven databases in seven weeks: a guide to modern databases and the noSQL movement. Raleigh: Pragmatic Bookshelf. c2012 [Google Scholar]
  • 4.van der Veen JS, van der Waaij B, Meijer RJ. Sensor data storage performance: SQL or noSQL, physical or virtual; 2012 IEEE Fifth International Conference on Cloud Computing; 2012. pp. 431–438. [Google Scholar]
  • 5.Mondal AS, Sanyal M, Chattapadhyay S, Mondal KC. Performance analysis of structured, un-structured, and cloud storage systems. International Journal of Ambient Computing and Intelligence. 2019;10:1–29. [Google Scholar]
  • 6.Robinson I, Webber J, Eifrem E. Graph Databases. 2nd Ed. Sebastopol: O’Reilly Media; 2015. [Google Scholar]
  • 7.Kan B, Zhu W, Liu G, Chen X, Shi D, Yu W. Topology Modeling and Analysis of a Power Grid Network Using a Graph Database. International Journal of Computational Intelligence Systems. 2017;10:1355–1363. [Google Scholar]
  • 8.Messaoud C, Fissoune R, Badir H. A performance evaluation of noSQL databases to manage proteomics data. Int. J. Data Mining and Bioinformatics. 2018;21:70–89. [Google Scholar]
  • 9.Hölsch J, Schmidt T, Grossniklaus M. On the performance of analytical and pattern matching graph queries in Neo4j and a relational database. EDBT/ICDT Workshops. 2017 [Google Scholar]
  • 10.The World of Graphs — Powered by Neo4j [Internet] Neo4j, Inc. c2019 [cited 2019 Nov 25]. Available from: https://neo4j.com/customers/ [Google Scholar]
  • 11.Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, Mark RG. MIMIC-III, a freely accessible critical care database. Scientific Data. 2016:3. doi: 10.1038/sdata.2016.35. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Johnson Alistair EW, David Stone J., Celi Leo A., Pollard Tom J. The MIMIC code repository: enabling reproducibility in critical care research. Journal of the American Medical Informatics Association. 2017:ocx084. doi: 10.1093/jamia/ocx084. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.APOC User Guide 3.5 [Internet] Neo4j, Inc. 2019 May 25 [cited 2019 Aug 9]. Available from: https://neo4j-contrib.github.io/neo4j-apoc-procedures/#introduction.
  • 14.SchemaSpy Analysis of mimic.mimiciii - All Relationships [Internet] The MIMIC Code Repository. 2017 Oct 12 [cited 2019 Aug 9]. Available from: https://mit-lcp.github.io/mimic-schema-spy/relationships.html. [Google Scholar]
  • 15.Mondal AS, Sanyal M, Chattopadhyay S, Mondal KC. Comparative analysis of structured and un-structured databases. In: Mandal J, Dutta P, Mukhopadhyay S, editors. Computational Intelligence, Communications, and Business Analytics. CICBA 2017. Communications in Computer and Information Science. vol 776. Singapore: Springer; c2017. pp. 226–241. [Google Scholar]
  • 16.Holzschuher F, Peinl R. Querying a graph database– language selection and performance considerations. Journal of Computer and System Sciences. 2016;82:45–68. [Google Scholar]
  • 17.Armbruster M. Welcome to the dark side: Neo4j worst practices (& how to avoid them) Neo4j, Inc. [Internet] 2016 Feb 24 [cited 2019 Aug 13]. Available from: https://neo4j.com/blog/dark-side-neo4j-worst-practices/
  • 18.Yoon BH, Kim SK, Kim SY. Use of Graph Database for the Integration of Heterogeneous Biological Data. Genomics & Informatics. 2017;15:19–27. doi: 10.5808/GI.2017.15.1.19. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Lysenko A, Roznovăţ IA, Saqi M, Mazein A, Rawlings CJ, Auffray C. Representing and querying disease networks using graph databases. BioData Min. 2016;9:23. doi: 10.1186/s13040-016-0102-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Mullen J, Cockell SJ, Woollard P, Wipat A. An Integrated Data Driven Approach to Drug Repositioning Using Gene-Disease Associations. PLoS One. 2016;11:e0155811. doi: 10.1371/journal.pone.0155811. [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from AMIA Summits on Translational Science Proceedings are provided here courtesy of American Medical Informatics Association

RESOURCES