Abstract
The Shared Pathology Informatics Network (SPIN), a research initiative of the National Cancer Institute, will allow for the retrieval of more than 4 million pathology reports and specimens. In this paper, we describe the special query tool developed for the Indianapolis/Regenstrief SPIN node, integrated into the ever-expanding Indiana Network for Patient Care (INPC). This query tool allows for the retrieval of de-identified data sets using complex logic and auto-coded final diagnoses, and it intrinsically supports multiple types of statistical analyses. The new SPIN/INPC database represents a new generation of the Regenstrief Medical Record System – a centralized, but federated, system of repositories.
Pathology reports are the central focus of the Shared Pathology Informatics Network (SPIN)1 because they provide detailed reports about pathologic specimens, as well as pathways for retrieving those specimens, which are stored as paraffin blocks. All participating SPIN institutions store large numbers of pathology reports, providing search access to auto-coded final diagnoses, as well as global text searching. The SPIN institutions are connected in a peer-to-peer network2 that enables researchers to query de-identified pathology reports and, increasingly, other cancer-related information. SPIN institutions now include the University of Pittsburgh, Harvard, UCLA, and Indiana University/Regenstrief.
Using SPIN, researchers can query the contents of more than 4 million pathology reports (and associated clinical records) to locate specimens with particular pathologic/clinical/biologic characteristics, and then obtain de-identified tissue and associated clinical data from the specimen holders to answer biologic questions about the causes and potential cures of cancer. Over time, these numbers should increase substantially. The procedures, policies and machinery needed for engaging additional organizations in the SPIN consortium, authenticating users, and delivering de-identified specimens are under development and will be the subject of a separate report.
This report focuses on the special query tool developed for the Indianapolis/Regenstrief SPIN peer-to-peer node as part of the Indiana Network for Patient Care (INPC), a Local Health Information Infrastructure3 that serves clinical care, as well as public health and research purposes4 within central Indiana.5 Because of its complex internal organizational structure and broader charter, the SPIN/INPC database and the Indiana SPIN query tool have features that are not yet available to all of the SPIN nodes, but they illustrate the research potential of the large, clinically rich databases envisioned by the National Cancer Institute (NCI). SPIN peer-to-peer technology developed by our Harvard collaborators will increase the SPIN/INPC database's availability to cancer researchers. In this report, we describe the content and capabilities of the SPIN/INPC database and its query tools. IRB approval was obtained for this study.
SPIN/INPC Database – data sources and technology
The SPIN/INPC database is a new generation of the Regenstrief Medical Record System (RMRS), moved from our home-grown database and HP's VMS operating system to an Oracle 10g database and Red Hat's Linux operating system. It carries all data collected over 33 years from Wishard Hospital, 15 years from IU and Methodist hospitals (now merged as Clarian Health Systems), 15 years of tumor registry data from the state health department, and 2 to 7 years from the other INPC institutions – Community, St. Vincent's and St. Francis hospitals – in a federated database. Together these five independent health-providing institutions account for 165,878 admissions, 450,000 ED visits and 2.7 million clinic visits per year. The INPC database now carries more than 800 million rows of discrete observations, and 14 million narrative text reports containing 364 million words.
At a minimum, each of the participating institutions provides the INPC with laboratory, radiology, and EKG reports, hospital dictation (e.g. discharge summaries, operative notes) and admission/discharge/transfer information. Clarian and Wishard, the two institutions that use the RMRS as their native medical record system, include much more. RxHub provides the prescription records available from four pharmacy benefits managers for Indianapolis Emergency Department visits. And more data are on the way – Indiana Medicaid and a major insurance carrier have agreed to provide us with administrative data, including filled prescriptions, for both clinical and limited research purposes, and active negotiations are underway with other Indiana sources.
Indiana researchers have long used the RMRS for research purposes. Indeed, the Associate Dean of Research at the IU School of Medicine has estimated that more than 50% of the 3000 active IU human research protocols use the RMRS for some aspect of their research. RMRS was the source of data used to prove the association between erythromycin and pyloric stenosis among newborns6 and the lack of association between elevated liver enzymes pretreatment and risk of statin hepatotoxicity.7
Data flows, linkages and mappings
The INPC is a centralized system. Because the system serves clinical care, most of the data are fed in real time from participating organizations – mostly as HL7 messages8 – a total of 84 million such messages per year from 8 different institutions and more than 92 individual source systems within those institutions.
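To make the message flow concrete, the sketch below shows, in R (the same language our statistical layer uses), how a single HL7 v2 OBX segment resolves into the discrete fields we store. It is purely illustrative – the field positions follow HL7 v2 conventions, the sample segment is made up, and our production interfaces use a full interface engine rather than code like this.

```r
# Illustrative parse of one HL7 v2 OBX segment (one discrete observation).
# Sample segment; 2345-7 is the LOINC code for serum glucose.
obx <- "OBX|1|NM|2345-7^GLUCOSE^LN||95|mg/dL|70-105|N"

parse_obx <- function(segment) {
  f <- strsplit(segment, "|", fixed = TRUE)[[1]]
  id <- strsplit(f[4], "^", fixed = TRUE)[[1]]   # code^name^coding system
  list(code = id[1], name = id[2],
       value = f[6], units = f[7], range = f[8], flag = f[9])
}

parse_obx(obx)   # $code "2345-7", $value "95", $units "mg/dL", ...
```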
Each hospital assigns its own patient identifier, so the system uses a deterministic linking algorithm similar to that of Grannis et al9 to aggregate patients reported under different chart numbers under one global identifier. The linking algorithm boils down the 4.7 million distinct patient registry records in the SPIN/INPC database to 3 million unique persons.
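A minimal sketch of the deterministic idea follows, with hypothetical registry rows: two institutions' chart numbers are linked to one global person only when a set of identifiers agrees exactly. The actual algorithm of Grannis et al weighs several identifier combinations; this is only the core join.

```r
# Two hypothetical registries; chart numbers (mrn) differ per hospital.
a <- data.frame(mrn = c("W1", "W2"),
                name = c("SMITH,JOHN", "DOE,JANE"),
                dob = c("1950-01-02", "1962-07-04"),
                sex = c("M", "F"), stringsAsFactors = FALSE)
b <- data.frame(mrn = c("C9", "C8"),
                name = c("DOE,JANE", "ROE,RICHARD"),
                dob = c("1962-07-04", "1971-03-15"),
                sex = c("F", "M"), stringsAsFactors = FALSE)

# Deterministic link: exact agreement on name, birth date, and sex.
links <- merge(a, b, by = c("name", "dob", "sex"),
               suffixes = c(".wishard", ".clarian"))
links   # one row: Jane Doe's two chart numbers map to one global person
```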
The observation and report identifying codes come in the messages as local idiosyncratic codes. We map them to a common standard, LOINC®10, so that when the same test or measurement is reported under different local codes we can equate them for clinical and research purposes. Mapping the local observation codes from each new data source requires manual effort, ranging from a few person-days for some EKG systems to six to twelve person-months for a laboratory with 2000–4000 distinct test observations. We use the freely available RELMA mapping tool10 to develop these mappings, and to create "synthetic" master files by distilling hundreds of thousands of HL7 messages into records that include the test name, reporting units, normal ranges and sample values for each unique test code found in this set of messages.
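The following sketch shows the distillation idea with toy data: a stream of parsed observation records collapses to one exemplar row per local code, which a human mapper can then align to LOINC.

```r
# Toy stream of parsed observations (in practice, hundreds of thousands
# of HL7 messages feed this step).
obs <- data.frame(code  = c("GLU", "GLU", "K"),
                  name  = c("GLUCOSE", "GLUCOSE", "POTASSIUM"),
                  units = c("mg/dL", "mg/dL", "mmol/L"),
                  range = c("70-105", "70-105", "3.5-5.1"),
                  value = c("95", "210", "4.2"),
                  stringsAsFactors = FALSE)

# "Synthetic" master file: first exemplar per unique local code.
master <- obs[!duplicated(obs$code), ]
master   # hand this to a mapper (e.g. in RELMA) to assign LOINC codes
```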
When the INPC receives universal codes as values (e.g. ICD-9, CPT, ICDO-3), we maintain them as delivered. We have tools for mapping local result codes, but we do not get local codes for most of the observation values for which we would have expected them. For example, culture isolates are usually reported as text such as "Staphylococcus aureus" or "gram positive rods," rather than as codes. These text strings tend to be consistent within a given source system, so it is possible to extract this information via text-parsing tools for public health reporting and research projects.
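Because the strings are consistent, extraction can be as simple as normalizing the text and looking it up in a lexicon, as this sketch shows. The two-entry table and its codes are hypothetical, not our production lexicon.

```r
# Hypothetical lexicon mapping isolate text to made-up organism codes.
isolate_map <- c("staphylococcus aureus" = "ORG-001",
                 "gram positive rods"    = "ORG-002")

code_isolate <- function(text) {
  key <- tolower(trimws(text))       # normalize case and whitespace
  unname(isolate_map[key])           # NA when the string is unknown
}

code_isolate("Staphylococcus Aureus")   # "ORG-001"
```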
INPC/SPIN is a centralized, but federated, system of repositories. Within the federation, each institutional repository uses the same concept dictionary, the same data structure, and the same software, but it stores its data in separate physical files. The database schema contains 30 objects, the most important of which are the institution, patient (registration record), encounter, order, discrete observation, text report and multimedia report (fax, images, etc.). The fields in each of these tables are roughly those contained in the HL7 segments that correspond to these objects.
A major goal of the SPIN project and all of the SPIN participants is to facilitate research using de-identified records. To produce de-identified data from the Indiana SPIN node, we remove all HIPAA-forbidden identifier fields, translate dates into integer years for patients under 90 and into decades for patients 90 and older, and replace zip codes with their three most significant digits, or fewer, to comply with the "minimum of 20,000 inhabitants" HIPAA requirement. Finally, we scrub all of the narrative text to eliminate identifying numbers, addresses, proper names and so on. The system can provide two kinds of de-identified access: one that provides only statistical summary data within categories limited to a minimum bin size, and another that provides patient-level, but de-identified, data. General researchers have no direct access to identified data.
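A minimal sketch of the date and zip code transforms follows, with assumed helper names; the real scrubber also handles proper names, addresses and free text, which no few-line example captures.

```r
# Date -> year for patients under 90; date -> decade for 90 and older.
deid_date <- function(d, age) {
  yr <- as.integer(format(as.Date(d), "%Y"))
  if (age >= 90) paste0(yr - yr %% 10, "s") else as.character(yr)
}

# Zip -> three most significant digits (fewer for sparse areas).
deid_zip <- function(zip) substr(zip, 1, 3)

deid_date("2003-06-15", age = 54)   # "2003"
deid_date("2003-06-15", age = 93)   # "2000s"
deid_zip("46202")                   # "462"
```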
Posing Queries to the database
A single observation is composed of many components. At present the value, date (truncated to comply with HIPAA requirements), abnormal flag, and three distinct specimen components are all query accessible. Users can search numeric content for values greater than, less than, or equal to a threshold value. They can search text content for combinations of words and phrases using Google-like text syntax11, and they can search coded content for the presence of individual codes or sets of codes. They can also search for words within the display names for observations with coded values. As do all of the SPIN nodes, we apply an autocoder12 to the final diagnosis section of narrative pathology reports, so pathology reports can be searched by the codes the autocoder generates or by the raw text. We will be including an alternative and much faster autocoder in the next revision13.
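Conceptually, the three kinds of criteria reduce to simple filters over observation components, as in the sketch below. The column names and values are hypothetical stand-ins, not the actual schema.

```r
# Three hypothetical observations with numeric, text, and coded content.
obs <- data.frame(value_num  = c(6.2, 11.4, NA),
                  value_text = c("", "", "adenocarcinoma, prostate"),
                  value_code = c("", "", "C61.9"),
                  stringsAsFactors = FALSE)

obs$value_num > 10                        # numeric: e.g. PSA > 10
grepl("adenocarcinoma", obs$value_text)   # text: word match
obs$value_code %in% c("C61.9")            # coded: code-set membership
```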
Steps of an IU SPIN query
IU SPIN queries are divided into three related steps: the definition of the cohort, the data set and the statistical analysis plan.
The first step in a query is to create a cohort. A cohort is just a list of subjects. Users create cohorts by specifying a set of criteria that patient records must satisfy in order to be included in the cohort.
The second step is to create a data set. A data set is a table whose rows represent the subjects in a cohort and whose columns represent different characteristics of each subject. Users create data sets by specifying the data elements they want to retrieve for a particular cohort and then running that data set query. De-identified data sets can be delivered to users who have appropriate privileges. In principle, the creation of a cohort and the creation of a data set could be combined in one step. However, because the same data set query could be run for many different cohorts and because cohorts can come from other systems, decoupling the specification and production of cohorts from the specification and production of data sets offers many advantages.
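Conceptually, then, a cohort is nothing more than a list of subject identifiers, and a data set query is an aggregation run against whatever cohort is supplied, including one imported from another system. A toy sketch:

```r
# Cohort: output of step 1, or a list supplied from elsewhere.
cohort <- c(101, 102, 103)

# Toy observation store; subject 999 is outside the cohort.
all_obs <- data.frame(id  = c(101, 101, 102, 103, 999),
                      psa = c(4.1, 9.8, 2.2, 15.0, 1.0))

# Data set query: one row per cohort subject, max PSA as the column.
dataset <- aggregate(psa ~ id, data = all_obs[all_obs$id %in% cohort, ],
                     FUN = max)
names(dataset)[2] <- "max psa value"
dataset
```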
The third step (if needed) is the specification of a statistical analysis. This specification defines the kind of statistical analyses to be run, and the variables to be included in the analysis. For now, researchers can choose from five kinds of analyses: 1) a statistical summary (mean, standard deviation, min, max and distribution of each variable), 2) dynamic cross tabulation, 3) logistic regression, 4) simple regression, and 5) survival analysis. For the dynamic cross tabulation, users need to specify the breakdown variables and their cut points. For the other three analyses, users need to enter the algebraic equation that defines the relationship between the independent and dependent variables. Users can define more than one version of each kind of analysis in the analysis plan.
The statistical power in the SPIN database comes from an open source statistical analysis language called "R"14, as well as predefined R analysis tools (Hmisc and Design) provided by Frank Harrell and colleagues.15 The statistical tools give researchers a chance to get a sense of the statistical importance of data relationships as they ask ad hoc questions and develop hypotheses. However, these tools are not a substitute for biostatistician consultation or full-blown hypothesis testing. The inclusion of the statistical tools is particularly advantageous for de-identified research because it enables statistical analysis without exposing any data to human eyes.
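For illustration, here is a sketch of two of the five analyses invoked through this machinery, run over simulated stand-in data: Hmisc supplies describe() for the statistical summary, and Design supplies lrm() for the logistic regression. Both packages must be installed; the data frame and variable names are invented.

```r
library(Hmisc)    # describe(): per-variable statistical summaries
library(Design)   # lrm(): logistic regression

set.seed(1)
d <- data.frame(dead = rbinom(200, 1, 0.3),   # simulated stand-in data
                age  = rnorm(200, 65, 10),
                psa  = rexp(200, 1/8))

describe(d)                       # analysis 1: statistical summary
lrm(dead ~ age + psa, data = d)   # analysis 3: logistic regression
```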
Typically, the three steps are run in sequence, and the output of one serves as input to the next: the output of the cohort query defines the patients for whom the data set is constructed, and the output of a data set query is the grist for the statistical analysis step. However, each of these operations can be run independently.
The Query program in operation
Figure 1 shows the screen for the cohort query (first step). On the left half of the screen is the hierarchical table of variables available for building the query. The cancer research hierarchy shown includes 415 variables – including tumor registry variables, laboratory tests of relevance to cancer research, and pathology reports.
Figure 1.
Step 1 – Form (for defining cohorts) with two menus open for specifying search criteria on a selected variable.
Notice that the second column (A in Figure 1) shows the number of patients with data for each variable, which tells the user where the data "money" resides.
The LOINC code is given in the third column. Users can choose a variable by entering any part of the variable name and picking from the resulting menu or by opening up the hierarchy until they see the variable they want, and then picking it.
For each variable that the user picks, a window opens to let the user specify further criteria about that variable. Clinical observations repeat over time. The right side of the pop-up (B in Figure 1) asks the user to narrow the query by choosing one observation from the possible repeats (e.g., the earliest, the most recent, the maximum or the minimum) or by combining these observations into one value (e.g. average, count). The allowed options, of course, vary with the data type of the variable. In Figure 1, we chose the first ("earliest") observation.
Each observation has multiple components, e.g. a value and a date of observation. Each component that is available for a variable is presented as a separate tab. Figure 1 shows the value tab opened and the date tab closed. For some variables – e.g. pathology reports – a specimen component will also appear. The controls that appear under the value tab vary with the observation data type. "Cancer primary site," which we used in the example, is a coded result, so the computer opens a second window to provide access to the variable's associated code set – in this case ICDO-3.
Notice that each row in the menu of options includes the ICDO-3 code (C in Figure 1) and its text description, as well as the number of cases (D in Figure 1) in the database for that code – again to help the user pick the most appropriate codes.
The user can sort the table by any column and can find codes whose descriptions contain a given word by typing that word into the field above the table. The user can pick multiple codes at once – but there is only one prostate code in ICDO-3, and we picked that code in Figure 1. If we were looking at the "surgical pathology final diagnosis coded" variable, we could have picked from the auto-coded UMLS diagnosis codes. And if we had picked the surgical pathology report, we could have searched for text patterns within that report.
The user can choose as many variables as needed to define the cohort. We think of a variable plus its criteria as a clause, whose subject is the variable and whose predicate is the criteria. As users complete the clause for each selected variable, the clauses are labeled with alphabetic characters, e.g. A, B, and so on. These labels do not show in the example given in Figure 1.
The screen for the second step, i.e., the definition of the data set, is very similar to the screen for the first step. It includes the same hierarchical table of variable choices on the left, and when the user selects a variable, it opens a secondary window. In this case, however, the secondary window asks the user to specify a) the instance, e.g. the first, last, or maximum, and b) the component, e.g. the value, the date, etc., of the variable that the user wants included in a data set column. Accordingly, the data set columns are labeled with the instance specifier, variable name, and component, e.g., "last glucose date" or "max glucose value". When the user runs the data set query, the computer returns the specified data set for the current cohort.
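As a sketch of how one such column is materialized from its three-part specification (instance + variable + component), consider the toy data below; the real engine, of course, computes this per subject across the whole cohort.

```r
# Toy repeating observations for one subject's glucose results.
glucose <- data.frame(value = c(95, 210, 88),
                      date  = as.Date(c("2001-02-03", "2002-05-06",
                                        "2003-08-09")))

last_glucose_date <- max(glucose$date)    # column "last glucose date"
max_glucose_value <- max(glucose$value)   # column "max glucose value"
```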
In the third step, the user specifies the statistical analyses to apply to the data set (A in Figure 2). Depending upon the analysis, the user may be asked to specify more. For example, the dynamic cross tabulation requires the specification of a set of breakdown variables and, for each breakdown variable, cut points. The regression and survival analyses require the specification of the left and right sides of a predictive algebraic equation, e.g. B = A + C + E + G + I (see area C in Figure 2). The meaning of each variable in the algebraic expression is spelled out in the table below the equation (see area B in Figure 2).
Figure 2.
Screen used to specify the statistical analysis plan.
With these tools, a user can define a cohort of interest such as "patients with tumor registry-coded prostate cancer," then generate a data set with each patient's HIPAA-adjusted age at tumor diagnosis (from the tumor registry), his first PSA (from hospital data), his race and stage (also from tumor registry data) and his survival status (from the overall database), and then run a logistic regression to predict the effect of these variables on current survival.
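The closing regression of that example reduces, conceptually, to something like the following sketch over simulated stand-in data; none of these values come from the INPC. Base R's glm() is shown, and the system's Design lrm() plays the same role.

```r
set.seed(2)
ds <- data.frame(alive     = rbinom(300, 1, 0.6),          # survival status
                 age_at_dx = sample(45:90, 300, TRUE),     # age at diagnosis
                 first_psa = rexp(300, 1/8),               # first PSA
                 stage     = factor(sample(1:4, 300, TRUE)))

# Logistic regression of current survival on the retrieved columns.
summary(glm(alive ~ age_at_dx + first_psa + stage,
            family = binomial, data = ds))
```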
We continue to enhance this query tool and database. All of the text reports (not just the pathology reports) are now scrubbed and indexed so they can be searched for words and combinations of words. Users with the right privileges can obtain copies of the de-identified data set. Columns will contain numbers, codes, or text depending upon the data type of the retrieved value. Depending upon their access privileges, users can also retrieve the body of scrubbed text reports.
With this very large database, the query response time varies from seconds to tens of minutes and depends mostly on the number of records found by the query. Queries that return tens or hundreds of cases tend to be fast – returning their results in a few seconds. Those that return thousands or hundreds of thousands tend to be slow – minutes to tens of minutes. The statistical analyses tend to be very fast (at most a few seconds). The statistical analyses do fail to complete at times – and we get overflow and/or divide-by-zero errors – because we have not yet screened out columns that create mathematical problems in the analysis software (a sketch of such screening appears below). We expect to have implemented even more features and to have feedback from a large cohort of users by the time of the Fall AMIA meeting, and we will present this as an update at that time.
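A hedged sketch of what that column screening could look like; this is not yet implemented, and the threshold below is arbitrary.

```r
# Drop columns likely to break the analysis step: nearly empty columns
# and constant columns (which yield degenerate or divide-by-zero fits).
screen_columns <- function(df, min_nonmissing = 10) {
  keep <- vapply(df, function(col) {
    enough   <- sum(!is.na(col)) >= min_nonmissing   # enough values
    varies   <- length(unique(na.omit(col))) > 1     # not constant
    enough && varies
  }, logical(1))
  df[, keep, drop = FALSE]
}

screen_columns(data.frame(x = 1:20, const = 1))   # keeps only "x"
```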
Acknowledgments
This work was performed at the Regenstrief Institute, Indianapolis, IN, and was supported in part by the National Cancer Institute grant U01 CA91343, a Cooperative Agreement for The Shared Pathology Informatics Network, the Indiana Genomics Initiative (INGEN) of Indiana University, which is supported in part by Lilly Endowment Inc., a grant from the Indiana Twenty-First Century Research and Technology Fund for proposal ID 510040784, and contracts N01-LM-6-3546 and N01-LM-3-3501 from the National Library of Medicine.
References
- 1. National Institutes of Health. Url: http://www.nci.nih.gov/
- 2. Consented High-performance Indexing and Retrieval of Pathology Specimens (CHIRPS): Namini AH, Berkowicz DA, Kohane IS, Chueh HC. A Submission Model for Use in the Indexing, Searching, and Retrieval of Distributed Pathology Case and Tissue Specimens. Medinfo. 2004:1264–67.
- 3. Local Health Information Infrastructure: Information for Health – A Strategy for Building the National Health Information Infrastructure. Report and Recommendations from the National Committee on Vital and Health Statistics, Washington, D.C., Nov 15, 2001.
- 4. Overhage JM, Suico J, McDonald CJ. Electronic Laboratory Reporting: Barriers, Solutions and Findings. J Public Health Manag Pract. 2001 Nov;7(6):60–6. doi: 10.1097/00124784-200107060-00007.
- 5. McDonald CJ, Overhage JM, Barnes M, et al. The Indiana Network for Patient Care: An Evolving Local Health Information Infrastructure. Health Affairs (in press).
- 6. Mahon BE, Rosenman MB, Kleiman MB. Maternal and infant use of erythromycin and other macrolide antibiotics as risk factors for infantile hypertrophic pyloric stenosis. J Pediatr. 2001 Sep;139(3):380–4. doi: 10.1067/mpd.2001.117577.
- 7. Chalasani N, Aljadhey H, Kesterson J, Murray MD, Hall SD. Patients with Elevated Liver Enzymes are Not at Higher Risk for Statin Hepatotoxicity. Gastroenterology. 2004 May;126(5):1287–92. doi: 10.1053/j.gastro.2004.02.015.
- 8. Health Level Seven. An application protocol for electronic data exchange in healthcare environments, version 2.4. Ann Arbor, MI: Health Level Seven; 2002.
- 9. Grannis SJ, Overhage JM, McDonald CJ. Analysis of Identifier Performance using a Deterministic Linkage Algorithm. Proceedings of the AMIA Annual Symposium, Fall 2002:305–309.
- 10. McDonald CJ, Huff SM, Suico JG, Hill G, Leavelle D, Aller R, et al. LOINC, A Universal Standard for Identifying Laboratory Observations: A 5-Year Update. Clin Chem. 2003;49(4):624–633. RELMA Url: http://www.loinc.org/relma/download/relma
- 11. The Essentials of Google Search. Url: http://www.google.com/help/basics.html
- 12. Schadow G, McDonald CJ. Extracting structured information from free text pathology reports. AMIA Annu Symp Proc. 2003:584–8.
- 13. Berman JJ. Doublet method for very fast autocoding. BMC Med Inform Decis Mak. 2004 Sep 15;4(1):16. doi: 10.1186/1472-6947-4-16.
- 14. Dalgaard P. Introductory Statistics with R. Springer; 2002.
- 15. Alzola C, Harrell F. An Introduction to S and The Hmisc and Design Libraries. Nov 16, 2004. Url: http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RS/sintro.PDF


