AMIA Annual Symposium Proceedings. 2015 Nov 5;2015:306–313.

An Associative Memory Model for Integration of Fragmented Research Data and Identification of Treatment Correlations in Breast Cancer Care

Ashis Gopal Banerjee 1, Mridul Khan 2, John Higgins 3, Annarita Giani 1, Amar K Das 3
PMCID: PMC4765707  PMID: 26958161

Abstract

A major challenge in advancing scientific discoveries using data-driven clinical research is the fragmentation of relevant data among multiple information systems. This fragmentation requires significant data-engineering work before correlations can be found among data attributes in multiple systems. In this paper, we focus on integrating information on breast cancer care, and present a novel computational approach to identify correlations between administered drugs captured in an electronic medical record and biological factors obtained from a tumor registry through rapid data aggregation and analysis. We use an associative memory (AM) model to encode all existing associations among the data attributes from both systems in a high-dimensional vector space. The AM model stores highly associated data items in neighboring memory locations to enable efficient querying operations. Applying AM to a set of integrated data on tumor markers and drug administrations revealed anomalies between clinical recommendations and derived associations.

Keywords: Associative memory, breast cancer treatment, electronic medical record, tumor registry, data integration, correlation

Introduction

While critical data on tumor pathology, genomic biomarkers, patient demographics, and clinical treatments are often gathered in different resources, investigators need all of these results to be brought together for comparative effectiveness, population research and translational science. For example, data linkage between the SEER cancer registry and the Medicare claims database has been valuable in discovering population-based patterns in cancer screening and treatment outcomes that would not be possible otherwise1. As part of the Oncoshare project24, a collaborative multi-institutional study of patterns, predictors, and outcomes of breast cancer care, integration of data from local and state cancer registries with electronic health records revealed the limitations of relying on any single source for research. Before the data were linked into Oncoshare, the separate data sources showed varying rates of cancer-specific diagnostic tests and treatments. For example, the rate of mastectomy in one hospital’s registry was 41%, but the facility’s billing records indicated that only about 22% of patients received such surgical intervention2. After manual verification of the missing data, the rate of mastectomy across data sources was found to be 43%. In addition, linking revealed data patterns that were implausible, such as patients whose treatment dates in the EMR fell a year before the cancer diagnosis date in the registry.

Many academic medical centers are undertaking this data integration effort by creating data warehouses that incorporate research-related data from a variety of systems. Such an approach requires considerable planning and programming efforts that may require a year or more before the data warehouse is usable for research3. While data warehousing permits robust and rapid querying over uniformly represented data, this solution has a number of drawbacks for researchers as end users. Data warehouses typically do not incorporate all of the data in source systems, do not allow for multiple relationships to be maintained among source data, do not manage provenance of the data, and do not monitor changes in the schema or data quality in source data.

Several informatics groups have created new methods to tackle various aspects of this problem. For example, the Electronic Medical Records and Genomics (eMERGE) Network method aims to link samples and results collected from genetic studies to electronic medical records data to allow for high throughput biomarker studies5. The widely used Informatics for Integrating Biology and the Bedside (i2b2) data repository method allows for the merger of multiple sources of genetic, phenotypic, and other types of data into a single integrated schema that supports queries across data sources6. This prior work on data linkage and data warehousing methods provides data integration at the data or schema level, but not at the systems level. That is, these methods do not model or automatically adapt to changes in the information systems that provide data. Consumers of integrated data may, as a result, not be aware of which systems provide data and when new systems have emerged or existing ones are altered.

In this paper, we explore the use of a novel data modeling approach, called associative memory (AM), which permits the rapid integration and correlation of data from multiple data sources7,8. Our analysis technique comprises an AM model7 that allows for a form of cognitive computing to mimic the human capabilities of processing, encoding, consolidating, and retrieving information from a constant influx of data streams captured via the sensory organs. We choose this model as it has the potential to address the volume, velocity, veracity, and variety (the four V’s of big data) challenges of data-source-agnostic aggregation and analysis. We apply the AM approach to rapidly integrate patient data collected from a tumor registry and electronic health records, and to identify the correlations between tumor markers, patient factors, and selection of administered drugs. While some of the identified correlations follow expected clinical patterns, others point to anomalies between clinical guidelines and drug administrations, indicating the need for further studies on the quality of drug administration recording as well as patient and physician adherence to the recommended guidelines. We also compare the simplicity of this approach against the more traditional effort needed to generate reports using SQL.

Methods

Design and experimental setting

Our breast cancer data sets come from two sources at Dartmouth–Hitchcock Medical Center (DHMC). The first source is a patient tumor registry and the second source is an electronic medical record (EMR) database, which is from the Epic vendor that was installed at DHMC on April 2, 2011. The current tumor registry has existed in its current form for the past decade. To evaluate our proposed AM approach, we obtained IRB approval to extract patient data from both the systems on November 1, 2014. The EMR data consisted of encounter data that recorded diagnostic and treatment information, including medication administration records. The tumor registry data consisted of stage (pathological, clinical, and combined) and grade information, along with tumor markers, specifically progesterone receptor (PR), estrogen receptor (ER), and HER2/neu.

Using the diagnostic date stored in the institutional tumor registry, we identified female breast cancer patients who had invasive breast cancer (Stage I–IV) after April 2, 2011, and we cross-linked these cases to their data within the EMR. Since we required treatment data and treatment is normally completed within a year, we limited the date of diagnosis to at least one year before November 1, 2014. The data sets consist of 928 patients and 50,490 encounter records collected over the specified period. For automated de-identification of patient data, we identify each patient solely by a digital fingerprint that is created by a one-way hash algorithm4. Using the digital fingerprint, we define a fixed, random temporal offset between plus and minus 30 days that is added to all the time stamps for a patient, altering the true time of occurrence but preserving the relative temporal distance between the events.
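The de-identification step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual procedure: the choice of SHA-256, the `SALT` value, the identifier format, and the way the offset is derived from the fingerprint are all hypothetical, since the text only states that a hash-based fingerprint and a fixed per-patient offset within ±30 days are used.

```python
import hashlib
from datetime import datetime, timedelta

SALT = "site-secret"  # hypothetical secret kept separate from the data


def fingerprint(patient_id, salt=SALT):
    """One-way hash of a patient identifier (SHA-256 is an assumed choice)."""
    return hashlib.sha256((salt + patient_id).encode("utf-8")).hexdigest()


def offset_days(fp, max_days=30):
    """Map a fingerprint to a fixed offset in [-max_days, +max_days].

    Assumption: the offset is derived deterministically from the hash,
    so the same patient always receives the same shift.
    """
    return int(fp, 16) % (2 * max_days + 1) - max_days


def shift_timestamps(timestamps, fp):
    """Add the same per-patient offset to every time stamp."""
    delta = timedelta(days=offset_days(fp))
    return [t + delta for t in timestamps]


# Hypothetical encounter dates for one patient.
events = [datetime(2012, 3, 1), datetime(2012, 3, 15)]
fp = fingerprint("MRN-0001")
shifted = shift_timestamps(events, fp)

# The gap between events is unchanged even though absolute dates move.
print(shifted[1] - shifted[0] == events[1] - events[0])
```

Because every time stamp for a patient is shifted by the same amount, intervals such as time from diagnosis to first treatment are unaffected.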

While various AM models have been proposed, we adapt a specific implementation described in a 2011 patent8 that is efficient, easy-to-use, intuitive, and scalable. By replacing the data items that are conventionally stored in tables or tuples with elementary “atoms” of information residing in a common n-dimensional (n ~ 1 billion) vector space of contextual associations among the data attributes, this AM model (Figure 1) provides a unified and compact representation for any data type, dynamic or stationary, structured or unstructured, of arbitrary size and granularity.

Figure 1: Conceptual illustration of associative memory system showing data aggregation and analysis techniques

The atomic pieces of information are represented as byte arrays of arbitrary size, with the maximum size limited only by operating system constraints. The associations among the data items are formed naturally from all of the items' attributes, such as names, counts, hierarchical relationships, and qualitative and quantitative properties. Quantitative properties are categorical strings, discrete integers, or continuous-valued floating point numbers; names and qualitative properties are represented as strings; and hierarchical relations are denoted as binary integers linking pairs of items. Each attribute type, other than hierarchical relations, then forms a dimension of the associative vector space (AVS) in which all the data items are organized.

Naturally, the data attributes vary widely among items representing fundamentally different entities such as chemotherapy protocols and surgical procedures, but are identical, albeit with different values, for the same entity. Various instances of identical entities lie in the same sub-spaces of the AVS, whereas instances of different entities occupy different sub-spaces of the common AVS. The sub-spaces for the different entities may overlap, indicating the presence of common attributes among them. Multiple occurrences of the same entity instance are represented as the same atomic piece of information with the provision to add more attributes, and, thereby, increase the dimensionality of the occupying sub-space. All the instances of a particular entity or similar entities are strongly connected to each other, and a simple heuristic k-means method groups them into natural clusters. This method employs a hybrid Euclidean-Hamming distance function (Euclidean for real-valued attributes, Hamming for string attributes) as the similarity metric, that is, as the connection weights between the data items. Instances belonging to different clusters may also have some connections, but those are much sparser with lower weights. These connections are bi-directional and dynamic, thereby enabling the addition of new associations as more data is ingested.
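A hybrid Euclidean-Hamming distance of the kind mentioned above might look like the following sketch. The text does not say how the two components are combined or how strings of unequal length are handled; summing the two terms, comparing only shared attributes, and counting missing characters as mismatches are assumptions made here for illustration, and the attribute names are hypothetical.

```python
import math
from itertools import zip_longest


def hamming(s, t):
    """Number of differing character positions; a missing character
    (unequal lengths) counts as a mismatch. This padding rule is an assumption."""
    return sum(c1 != c2 for c1, c2 in zip_longest(s, t))


def hybrid_distance(a, b):
    """Distance between two items given as attribute dictionaries.

    Real-valued attributes contribute a Euclidean term, string attributes
    a Hamming term; the two are simply added (an assumed combination rule).
    Only attributes present in both items are compared.
    """
    squared = 0.0
    mismatches = 0
    for key in a.keys() & b.keys():
        x, y = a[key], b[key]
        if isinstance(x, str) or isinstance(y, str):
            mismatches += hamming(str(x), str(y))
        else:
            squared += (float(x) - float(y)) ** 2
    return math.sqrt(squared) + mismatches


# Hypothetical patient items with two numeric and one string attribute.
p1 = {"age": 52.0, "tumor_size_mm": 18.0, "er_status": "ER+"}
p2 = {"age": 55.0, "tumor_size_mm": 22.0, "er_status": "ER-"}
print(hybrid_distance(p1, p2))  # sqrt(3^2 + 4^2) + 1 = 6.0
```

In practice the numeric attributes would likely need rescaling so that no single attribute dominates the Euclidean term; that step is omitted here.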

While commonly contextualized (clustered) data entities are co-located in the organizational space, a virtual pointer-like token, called the relationship construct, provides the means to connect anything in the AVS to anything else. Each token uniquely identifies a particular atom of information, is the virtual location of the item in the AVS, and is the logical address of where the item exists on the physical storage medium. This capability ameliorates the requirement for physical co-location to articulate sub-spaces and instead uses associative nearness (shortest connecting path length) or dimensional proximity (number of overlapping sub-space dimensions) to enable endless clustering possibilities of data entities in an unlimited number of sub-spaces. This multi-faceted holographic-like framework allows for viewing of data from virtually any perspective without the need for additional processing. Furthermore, one no longer needs to search for identifying any form of associations among the data items as all such associations are maintained as tokens, co-incident with the related data items. Thus, the associations are obtained by merely referencing the items of interest and using the co-incident tokens to directly index the referenced items in the storage medium. This novelty allows for real-time correlation analysis as long as the data attributes are defined by the user. Figure 2 shows the user interface for interacting with and querying the data.
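The token idea can be illustrated with a toy in-memory structure. This is only a conceptual sketch and not the patented implementation or the AtomicDB API: the class `AssociativeStore`, its methods, and the use of integer tokens are hypothetical. It shows two points from the paragraph above: associations are read off directly from the tokens stored with an item, and "associative nearness" can be measured as the shortest connecting path length.

```python
from collections import deque


class AssociativeStore:
    """Toy store: each atom gets a token, and links are kept with the atom."""

    def __init__(self):
        self.atoms = {}   # token -> stored value
        self.links = {}   # token -> set of associated tokens
        self._next_token = 0

    def add(self, value):
        token = self._next_token
        self._next_token += 1
        self.atoms[token] = value
        self.links[token] = set()
        return token

    def associate(self, a, b):
        # Connections are bi-directional.
        self.links[a].add(b)
        self.links[b].add(a)

    def associations(self, token):
        # No search is needed: follow the tokens stored with the item.
        return sorted(self.atoms[t] for t in self.links[token])

    def nearness(self, start, goal):
        """Shortest connecting path length (breadth-first search)."""
        if start == goal:
            return 0
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            token, dist = queue.popleft()
            for nxt in self.links[token]:
                if nxt == goal:
                    return dist + 1
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
        return None  # not connected


store = AssociativeStore()
patient = store.add("patient:a1f3")
drug = store.add("drug:paclitaxel")
status = store.add("status:ER+/PR+")
store.associate(patient, drug)
store.associate(patient, status)

print(store.associations(patient))
print(store.nearness(drug, status))  # connected through the patient
```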

Figure 2: Simple user interface to define and run queries on breast cancer patient associative memory model

We use this AM system for aggregating and analyzing the breast cancer tumor registry and EMR data sets. Cancer patient factors, namely, comorbidities and hormone-receptor status, together with diagnosis and treatment information like cancer stage, chemotherapy protocol, and secondary therapeutic drugs constitute the set of attributes. Once the AM model is generated, automated queries are run using a C# API provided by the implementation system of our choice9 to retrieve the associations between the drugs of interest (both chemotherapy and secondary treatment drugs) and patient factors.

Outcome measurement

In this paper, we investigate a specific clinical question: identifying the correlations between treatment drugs and the stage of breast cancer and hormone-receptor status of patients. Our output metric is always the amount of supporting evidence, i.e., the count of patients with identical factors who are administered a particular drug. Simple graphical displays such as scatter plots are generated using Python version 2.7 to visualize the presence and strength of the correlations. Comparisons are made with SQL to highlight the usefulness of the AM system in identifying the correlations with less implementation effort and with an identical level of accuracy.
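As a rough illustration of the output metric, the sketch below counts distinct patients per combination of patient factor and drug. The record layout and the sample values are hypothetical; the only thing taken from the text is that the metric is a patient count for identical factors and a given drug. Counting each patient once per combination, even when several encounters record the same drug, is an assumption.

```python
from collections import Counter


def supporting_evidence(records):
    """Count distinct patients for each (factor, drug) pair.

    `records` is an iterable of (patient_id, factor, drug) tuples, one per
    administration record; repeated administrations are counted once.
    """
    seen = set()
    counts = Counter()
    for patient_id, factor, drug in records:
        key = (patient_id, factor, drug)
        if key in seen:
            continue
        seen.add(key)
        counts[(factor, drug)] += 1
    return counts


# Hypothetical administration records.
records = [
    ("p1", "ER+/PR+", "doxorubicin"),
    ("p1", "ER+/PR+", "doxorubicin"),  # second cycle, same patient
    ("p2", "ER+/PR+", "doxorubicin"),
    ("p3", "triple negative", "paclitaxel"),
]
counts = supporting_evidence(records)
print(counts[("ER+/PR+", "doxorubicin")])  # 2
```

These counts are the quantities that the circle sizes in the scatter plots represent.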

Two-sample t-tests with unequal sample variances (Welch’s test) are run using R version 3.1.3 to test the statistical significance of the correlations. The first sample consists of the proportion of patients of a particular type (e.g., a specific hormone-receptor status) who are treated with a fixed set of chemotherapy drugs, or of both chemotherapy and secondary drugs; the second sample comprises the proportion of patients who receive the same set of drugs but are not of the first type. Patients of unknown type are excluded when computing the proportions. The null hypothesis that the two sample means are equal is rejected if the corresponding p value is less than 0.05, thereby establishing a correlation between the patient type and administered drugs at the 5% significance level (95% confidence level).
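For readers unfamiliar with Welch’s test, the sketch below computes its test statistic and the Welch-Satterthwaite degrees of freedom using only the Python standard library. It is an illustration of the standard formulas, not the analysis code used in the study (which was run in R, where `t.test` applies the Welch form by default). The sample values are made up, and the p value, which requires the t distribution, is not computed here.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Return (t statistic, degrees of freedom) for Welch's two-sample test."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x), variance(y)   # unbiased sample variances
    se2 = vx / nx + vy / ny             # squared standard error of the difference
    t = (mean(x) - mean(y)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df


# Made-up samples with clearly unequal variances.
first = [1.0, 2.0, 3.0, 4.0, 5.0]
second = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(first, second)
print(round(t_stat, 3), round(dof, 2))
```

A one-sided p value would then be obtained from the t distribution with `dof` degrees of freedom and compared against 0.05.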

Results

We obtained the results on an Intel Core i7-4500U processor with 8 GB RAM and 1.8 GHz processor speed in a 64-bit Windows 8 operating environment. Using the AM system, it took 6.065 minutes to generate all the patient counts as functions of breast cancer stage and hormone-receptor status, and 0.781 minutes to obtain all the patient counts as functions of hormone-receptor status and specific chemotherapy and hormone-targeted drugs. While SQL queries took about the same time to execute, they required considerably more programming effort, as shown by the snippet of the required SQL query in Figure 3. No differences in the retrieved patient counts were observed between the two approaches.
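To give a sense of the SQL side of this comparison, the following sketch runs a join-and-aggregate query of the general kind needed to count patients by hormone-receptor status and drug. It does not reproduce the query in Figure 3: the table names, column names, and rows are hypothetical, and an in-memory SQLite database stands in for the actual source systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE registry (patient_id INTEGER PRIMARY KEY, receptor_status TEXT);
    CREATE TABLE administrations (patient_id INTEGER, drug TEXT);
    """
)
# Hypothetical rows: one registry entry per patient, many administrations.
conn.executemany(
    "INSERT INTO registry VALUES (?, ?)",
    [(1, "ER+/PR+"), (2, "ER+/PR+"), (3, "Triple negative")],
)
conn.executemany(
    "INSERT INTO administrations VALUES (?, ?)",
    [(1, "doxorubicin"), (1, "doxorubicin"), (2, "doxorubicin"),
     (2, "paclitaxel"), (3, "paclitaxel")],
)

# Distinct patients per (receptor status, drug) across the two sources.
query = """
    SELECT r.receptor_status, a.drug, COUNT(DISTINCT r.patient_id) AS patients
    FROM registry AS r
    JOIN administrations AS a ON a.patient_id = r.patient_id
    GROUP BY r.receptor_status, a.drug
    ORDER BY r.receptor_status, a.drug
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each additional factor (stage, comorbidity, secondary drug) would require another join or grouping column, which is the kind of per-question query writing the comparison refers to.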

Figure 3: Screenshots of code using (a) associative memory system to query any combination of patient, tumor, and drug factors, and (b) SQL to specifically query patient counts for drugs based on hormone-receptor status

Using our AM model, we first examined the relationship between breast cancer stage and patient hormone receptor status. Figure 4 shows the results of variations in the number of patients who have different stages of invasive breast cancer and their hormone-receptor status using a scatterplot with circle size proportional to the patient count. The results indicate a predominance of patients with ER+/PR+ and triple positive (ER+/PR+/HER2neu+) status along with early stage patients (1A, 2B, and 2C) within the extracted cohort. Similar predominance has been observed in other breast cancer patient cohorts.

Figure 4: Variations in the number of breast cancer patients who have various stages of invasive breast cancer and hormone-receptor status, displayed using a scatterplot with circle size proportional to the patient count.

We next examined the correlations derived from the AM model between hormone-receptor status and choice of chemotherapy agent that was administered, which is shown in Figure 5 as a scatter plot. The figure indicates a predominance of the use of cyclophosphamide, doxorubicin, and paclitaxel, which are agents associated with recently recommended chemotherapy protocols for patients who have stages I–III breast cancer.

Figure 5: Variations in the number of breast cancer patients receiving different chemotherapy drugs based on hormone-receptor status displayed using a scatterplot with circle size proportional to the patient count.

We then examined the relationship between hormone-receptor status and hormone-targeted treatments in the form of trastuzumab, pertuzumab, and methotrexate. Clinically, according to national guidelines such as those published by the National Comprehensive Cancer Network, we expect that the hormone-targeted treatments would be used only in patients who are HER2/neu positive10. In our analyses using the AM model, however, we found that a small subset of patients who were HER2/neu negative (categorized as triple negative or ER, PR positive) had also received hormone-targeted treatments, as shown in Figure 6.

Figure 6: Variations in the number of breast cancer patients receiving different hormone agents based on hormone-receptor status displayed using a scatterplot

We also computed two-sample t tests to validate the correlations between chemotherapy and hormone-targeted agents and hormone-receptor status in breast cancer patients. The results are shown in Table 1. Statistically significant differences between patients who were ER+, PR+, and ER-PR+ versus patients who had other hormone statuses in the receipt of both chemotherapy and a hormone-targeted agent were observed, which is clinically expected. Interestingly though, patients who were HER2/neu positive and ER, PR negative were not different in their receipt of chemotherapy and hormone treatment. Furthermore, HER2/neu positive patients with some combination of ER or PR positive status actually received significantly less chemotherapy and hormone-targeted agents than the others. Both of these findings contradict clinically expected patterns.

Table 1: p values corresponding to two-sample t tests to validate the correlations between drug treatment and hormone-receptor status in breast cancer patients with statistically significant correlations (p < 0.05) highlighted in bold; all the tests are one-sided with * and ** denoting greater and lower means of the first sample, respectively

Hormone-receptor status | Drugs: chemotherapy | Drugs: both chemotherapy and hormone treatment
Triple negative vs. other* | 0.19 | 0.25
ER+, PR+, and ER-PR+ vs. other* | 0.02 | 0.02
HER2/neu+ vs. other* | 0.29 | 0.10
ER-HER2/neu+, PR-HER2/neu+, triple positive vs. other** | 0.02 | 0.01

Discussion

The integration of heterogeneous data from multiple source systems is a time-consuming process and a rate-limiting step to rapid exploration of data, whether such efforts occur through standard data warehousing approaches or through data integration platforms, such as i2b2. In our work, we use a novel method for ingesting data from multiple sources into an associative memory (AM) model that is designed to mimic the ways humans store information through content-addressable memory and make rapid associations between concepts. We have taken a commercially available database method for associative memory and implemented a programmatic interface that allows us to query the naturally occurring clusters generated by the model. In this paper, we have shown the feasibility of the AM approach to organize heterogeneous data efficiently based on attributes, and rapidly generate associations among the data attributes. Although the run time of the corresponding SQL queries is similar to that of the AM model, the SQL queries require more technical expertise to formulate and thus more data engineering effort.

Using the approach, we interrogated the AM system to find associations between breast cancer stage, hormone-receptor status, and drug administration. The results of these analyses show expected clinical relationships between tumor factors and therapy choice. However, we also found a few anomalous patterns between hormone-receptor status and the use of hormone-targeted therapy. These patterns may be due to several factors including incorrect pathological identification or administration records, and thus require us to validate the information using other sources of data. Of note, however, is that our statistical validation step did not provide us the expected pattern of distinguishing hormone-receptor status between those who received chemotherapy alone and those who received those agents in conjunction with hormone-targeted therapy.

The AM approach we have chosen allows us to make rapid exploration of other patterns within the data and to easily scale the method to much larger data sets. For our future work, we will import data into the AM model from other types of cancers that are within the tumor registry, and we will begin exploring expected and unexpected clinical patterns with clinical feedback. We recognize that reviewing such highly multi-dimensional relationships in the data will require visualization approaches, and we are developing 3D methods using the OpenGL standard to enable such exploration directly by clinicians.

Acknowledgments

We thank Drs. Tracy Onega and Judy Rees, Directors of the Norris Cotton Cancer Center Registry Resource, for their support of this research.

References

  • 1. Warren JL, Klabunde CN, Schrag D, Bach PB, Riley GF. Overview of the SEER-Medicare data: content, research applications, and generalizability to the United States elderly population. Medical Care. 2002;40(8):IV-3. doi: 10.1097/01.MLR.0000020942.47004.03.
  • 2. Kurian AW, Mitani A, Desai M, Yu PP, Seto T, Weber SC, Olson C, Kenkare P, Gomez SL, de Bruin MA, Horst K, Belkora J, May SG, Frosch DL, Blayney DW, Luft HS, Das AK. Breast cancer treatment across health care systems: linking electronic medical records and state registry data to enable outcomes research. Cancer. 2014;120:103–11. doi: 10.1002/cncr.28395.
  • 3. Weber SC, Seto T, Olson C, Kenkare P, Kurian AW, Das AK. Oncoshare: Lessons learned from building an integrated multi-institutional database for comparative effectiveness research. In AMIA Annu Symp Proc. 2012;2012:970–8.
  • 4. Weber SC, Lowe H, Das A, Ferris T. A simple heuristic for blindfolded record linkage. J Am Med Inform Assoc. 2012;19:e157–e161. doi: 10.1136/amiajnl-2011-000329.
  • 5. McCarty CA, Chisholm RL, Chute CG, Kullo IJ, Jarvik GP, Larson EB, Li R, Ritchie MD, Roden DM, Struewing JP, Wolf WA. The eMERGE Network: A consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med Genomics. 2011;26:4–13. doi: 10.1186/1755-8794-4-13.
  • 6. Murphy SN, Weber G, Mendis M, Gainer V, Chueh HC, Churchill S, Kohane I. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J Am Med Inform Assoc. 2010;17:124–30. doi: 10.1136/jamia.2009.000893.
  • 7. Lansner A. Associative memory models: from the cell-assembly theory to biophysically detailed cortex simulations. Trends Neurosci. 2009;32:178–86. doi: 10.1016/j.tins.2008.12.002.
  • 8. Everett R. Data Base and Knowledge Operating System. 8,051,102 B2. U S Patent. 2011.
  • 9. AtomicDB. [accessed March 6, 2015]. http://www.atomicdb.net/atomicdb.html.
  • 10. National Comprehensive Cancer Network Guideline on Breast Cancer Care. [accessed March 6, 2015]. http://www.nccn.org/professionals/physician_gls/PDF/breast.pdf.
