Abstract
Translational studies for therapeutic development require cohort identification to select appropriate patient-derived biological materials for testing a specific hypothesis. Robust health information systems exist, but there are numerous challenges in accessing the information needed to select appropriate biological specimens for translational experiments. This methods chapter describes the current standard process for cohort identification used by the Cutaneous Oncology Program and the Collaborative Data Services Core (CDSC) at Moffitt Cancer Center. The methods include the use of graphical user interfaces coupled with database querying. As such, this chapter outlines the regulatory and procedural processes needed to utilize a health information management system to filter patients for cohort identification.
Keywords: Bioinformatics, Cohort identification, Translational science, Informatics, Resource paper
1. Introduction
Technological advances have vastly affected the way healthcare uses data, particularly in cancer research. Interfaces between electronic medical record (EMR) systems and databases have grown over the years and have greatly improved health services and cohort identification for research [1, 2]. These systems are composed of health information systems and health information management systems.
Health information systems (HIS) comprise both the IT infrastructure (hardware and software) and the workflow processes required to facilitate the appropriate data capture, storage, and utilization of patient information [3].
A health information management system (HIMS), including the electronic medical record, collects, analyzes, reports, and protects medical information that is vital to the quality of patient care [4, 5]. Although a HIMS does not need to be electronic, an increasing number of institutions use electronic medical records as their primary source of patient medical information [6]. HIMS draws on three fields: computer science, information technology, and business [7]. Its scope runs from the point at which patient health information is entered to the release of that information. The use of these systems must not compromise Health Insurance Portability and Accountability Act (HIPAA) privacy and security laws. To comply, research is conducted under a protocol approved by an institutional and/or central Institutional Review Board (IRB) [8, 9]. These protocols list the key personnel who have access to the data, the levels (roles) assigned to those personnel to determine which resources they may access, and the experimental procedures that may be conducted with patient-derived biological materials [10]. The conduct of research on clinical trials is outside the scope of this article.
There are institution-wide protocols that allow research to be performed in a prespecified manner. For example, at Moffitt Cancer Center there is the Total Cancer Care (TCC) protocol. TCC is an observational, prospective study that recruits patients to participate [11, 12]. The TCC study evaluates the long-term effects of different cancers, treatments, and lifestyle. In addition, the TCC study is supported by the Health and Research Informatics (HRI) platform. HRI is the enterprise-wide patient data warehouse; it contains information from several Moffitt data source systems and from patients consented to the TCC protocol at the different consortium sites. A molecular data warehouse, containing clinically generated and research-generated molecular data, is also available to cancer investigators for future research. Clinical and molecular data are linked to tissue samples through data elements such as anatomic site of origin, tissue type, and histology. Patients who consent to this study are followed throughout their lifetime and can withdraw consent later if desired. The patients sign consent to allow release of their medical health information and the use of their tissue samples for cancer research. In addition, the protocol allows the patient to be contacted in the future if new findings could influence their care, including clinical trials. TCC is currently composed of ten Florida hospitals, including Moffitt Cancer Center, and eight national sites. This network of institutions recruits patients to the study, ships tissue to Moffitt Cancer Center's biobank for storage and molecular analysis, and provides medical data of TCC-consented patients to be integrated into a data warehouse [11].
Multiple institutions can collaborate and form networks within the scope of their individual IRB-approved institutional protocols. The Oncology Research Information Exchange Network (ORIEN), founded by Moffitt Cancer Center and the Ohio State University Comprehensive Cancer Center—Arthur G. James Cancer Hospital and Richard J. Solove Research Institute, is one such collaboration and now includes many more members [13, 14]. Datasets within each individual institution reside within that institution's data warehouse, with strict rules for access based upon its institutionally approved protocols. The network allows this data to be utilized as defined by the protocols. These data, among other forms of data, are stored in a system called a data warehouse.
A data warehouse (DW) is a system that collects, integrates, and aggregates various sources of data via Extract, Transform, and Load (ETL) processes within the organization for query, analyses, and reports; it is separate from the Electronic Health Record (EHR) but derives data from it. External datasets can also be integrated into the DW. Results from a healthcare DW must be accessible to differing stakeholders, which can include healthcare regulators, physicians, hospital administrators, healthcare decision makers, and researchers. Data are abstracted from data warehouses using queries [15]. DWs are set up as either multidimensional hierarchical or relational databases [16]. Both differ from a "flat database," an Excel-like structure of only columns, rows, and tables that makes accessing relevant information difficult. The main difference between relational and hierarchical databases is that relational databases join tables using primary keys and foreign keys, common identifiers in each table that allow the corresponding information to be joined [17]. A hierarchical database organizes information based on its relationship to a parent table; an example is a company organization chart in which the CEO has the VPs under their umbrella, these VPs have managers/directors under theirs, and so on. Derivatives of hierarchical database languages such as M (MUMPS) have been useful in health care [18, 19], and many commercial EHR companies still utilize MUMPS. However, traditional hierarchical databases are difficult to utilize for cohort identification because the investigator would need to look up the parameter of interest for every single patient in the database. For cohort identification research, it is best to use a relational database, as the data we deal with are spread across multiple tables within the database rather than based on relationships within a single table. One can easily imagine how difficult it becomes to analyze Excel tables of discrete values obtained from the medical record.
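As a minimal sketch of how such a join works, consider two hypothetical tables, Patient and Diagnosis, where Patient_ID is the primary key of one and a foreign key of the other (the names are illustrative, not the actual warehouse schema):

```sql
-- Illustrative schema only; table and column names are hypothetical.
CREATE TABLE Patient (
    Patient_ID INT PRIMARY KEY,       -- primary key
    MRN        VARCHAR(20),
    Gender     VARCHAR(10)
);

CREATE TABLE Diagnosis (
    Diagnosis_ID INT PRIMARY KEY,
    Patient_ID   INT REFERENCES Patient(Patient_ID),  -- foreign key
    ICD_Code     VARCHAR(10),
    Primary_Site VARCHAR(50)
);

-- The join matches rows across the two tables on the shared key,
-- so diagnosis information can be retrieved per patient.
SELECT p.MRN, d.ICD_Code, d.Primary_Site
FROM Patient p
JOIN Diagnosis d ON d.Patient_ID = p.Patient_ID;
```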
Optimizing the medical record so that discrete values are visible for different parameters is the purview of the HIMS. Different institutions have their own HIMS with varying levels of electronic integration. After data are extracted from the HIMS, part of the research data life cycle consists of create, read/retrieve, update, and delete (CRUD) operations; the life cycle of data within the electronic medical record itself is beyond the scope of this article. Different users apply these basic CRUD operations in different ways. In healthcare research, CRUD translates to creating a new record in a table, retrieving data points/health information, updating health information, and deleting variables or rows that are no longer in use [20, 21]. Typically, CRUD operations are performed in Structured Query Language (SQL), a programming language used for creating and querying databases. When identifying a cohort of patients, this health information can be accessed electronically in two ways: front-end and back-end. A front-end system is a user-friendly graphical user interface, most often a web-based application, whereas the back-end usually involves coding in a specific database management language such as SQL [22].
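For orientation, the four CRUD operations map onto SQL statements roughly as follows; this is a minimal sketch against a hypothetical ResearchCohort table, not our production schema:

```sql
-- Hypothetical table used only to illustrate the four CRUD operations.
CREATE TABLE ResearchCohort (
    Patient_ID     INT PRIMARY KEY,
    Stage          VARCHAR(5),
    Consent_Status VARCHAR(20)
);

-- Create: add a new record to a table.
INSERT INTO ResearchCohort (Patient_ID, Stage, Consent_Status)
VALUES (1001, 'III', 'Consented');

-- Read/retrieve: pull data points/health information.
SELECT Patient_ID, Stage
FROM ResearchCohort
WHERE Consent_Status = 'Consented';

-- Update: correct or refresh health information.
UPDATE ResearchCohort SET Stage = 'IV' WHERE Patient_ID = 1001;

-- Delete: remove rows that are no longer in use.
DELETE FROM ResearchCohort WHERE Consent_Status = 'Withdrawn';
```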
Multiple front-end systems with easy-to-use graphical interfaces exist. One such system used at Moffitt Cancer Center to ascertain the data is TransMed® Systems [23]. TransMed® is used to query the Health and Research Informatics (HRI) platform. TransMed® reports cover information related to patient demographics, clinic services, diagnosis, treatment, comorbidities, patient-reported outcomes, and biospecimen availability for research purposes. This tool allows users to filter the data in accordance with the IRB/protocol. For those with limited coding experience, this tool provides an efficient and quick way to obtain data [24]. However, it is limited to simple searches; if complex queries are needed, additional tools are required to manipulate the data. Complications include potentially missing cohort subjects (dropped data points) because the interface does not support a particular type of query. As a result, separate reports must be run on the existing cohort in order to collect the needed data points.
As such, a programming language is needed to access the system at the back-end for greater user control of searches. SQL, a high-level programming language, is used to access the information within the database, providing great flexibility and broad utility. SQL also allows users to build their databases vertically, where the layout is column to column rather than row to row. In addition, the language can handle large numbers of transactions in a single query, it is optimized for large numbers of table rows, and it is fast for searching, querying, and retrieving data from multiple tables. That said, there are some drawbacks to using SQL to query data and build databases: interfacing with a SQL database involves considerable complexity that requires more than inserting additional lines of code. Here is an example of a basic SQL query that relates two tables (Patient and Sample) through a common field (MRN in this case): select Patient_ID, MRN, Gender from Patient where MRN in (select MRN from Sample where Sample_Type = 'FFPE' and Collection_Site_Category = 'Thoracic' and Current_Quantity > 0). This query pulls specific medical record numbers, along with the unique patient identifier, from a table named "Patient" for patients with at least one FFPE block collected from the thorax available in the biobank. Join operations give relational databases their flexibility, but they are computationally costly, because every join creates temporary in-memory data structures containing the joined tables. There are also variations of SQL, each with subtle coding differences.
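For comparison, the same cohort can be pulled with an explicit join rather than a nested subquery. This sketch assumes the same hypothetical Patient and Sample tables; either formulation returns the same patients:

```sql
-- Equivalent cohort pull using an explicit join on the shared MRN field.
-- DISTINCT guards against duplicate patients when several samples match.
SELECT DISTINCT p.Patient_ID, p.MRN, p.Gender
FROM Patient p
JOIN Sample s ON s.MRN = p.MRN
WHERE s.Sample_Type = 'FFPE'
  AND s.Collection_Site_Category = 'Thoracic'
  AND s.Current_Quantity > 0;
```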
The main issue in selecting subjects for cohorts in translational studies is that the data in the data warehouse (stored as a relational database in the current example) are not in a format that a translational researcher can easily read. The data must be migrated to a format that is useful for the end user and abides by protocol requirements. There are two types of migration scripts: automated and manual. An automated script uses a synchronization tool and is checked/altered by a developer. A manual script is written by a database programmer (e.g., a SQL script or SQL stored procedure). There are some setbacks involved in creating and using such scripts. It may be impossible to preserve data in the correct place, and although the script knows the "before" and "after" versions of the database, the transitional path is unclear to the user. An example would be migrating table A to table B: it is the same table under a different name, yet while in transition the tool may drop all the contents of table A and simply create table B [22]. Manual scripts are therefore powerful tools, but safety precautions should be in place so that no data are lost.
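As a cautionary illustration of the safety precautions mentioned above, a manual migration script might copy and verify the data before dropping the original table. This is a sketch against hypothetical tables, not a production procedure:

```sql
-- Manual migration sketch: move Table_A to Table_B without losing rows.
-- Step 1: create the new table and copy every row.
--         (MySQL/PostgreSQL/Oracle syntax; SQL Server uses SELECT ... INTO.)
CREATE TABLE Table_B AS SELECT * FROM Table_A;

-- Step 2: verify the row counts match before destroying anything.
SELECT COUNT(*) FROM Table_A;
SELECT COUNT(*) FROM Table_B;

-- Step 3: drop the original table only after the counts are verified.
DROP TABLE Table_A;
```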
SQL itself comes in various Relational Database Management Systems (RDBMS) such as MySQL, Oracle, and Microsoft SQL Server. MySQL is an open-source RDBMS. An open-source relational database is software whose source code is open and available for editing, studying, and redistributing; closed-source software keeps its source code private to prevent copying [25]. The choice of database matters as new data emerge, often unstructured, in various forms from various sources. These new data need to be stored in a relational database at relatively low cost, which is where open-source relational databases such as MySQL become useful [26]. Newer database platforms such as NoSQL offer increased scalability, a flexible database environment in which many different document types can be embedded in the database instead of symbolic links, and increased efficiency [27–29]. However, work in this area has not yet reached mainstream use for clinical management systems.
To summarize, several components are needed to perform searches specific to translational research projects. The health information management system (HIMS) of an organization uses medical information to support the quality of patient care. As medical information involves patients' privacy, HIPAA and IRB/protocol provisions are put in place. Using IRB-approved protocols, researchers are able to utilize data for their research [30]. At Moffitt, the TCC study consents patients for access to their medical information and tissue samples and for future contact if new developments occur that will affect care, such as eligibility for a clinical trial [11]. Institutions can collaborate within their individual IRB protocols and contribute data to be aggregated and stored in data warehouses (e.g., ORIEN) [13]. Data from multiple sources are stored in a data warehouse, and access to this warehouse involves complex queries [31]. Because multiple sources and tables are involved, a relational database makes the data accessible through querying [32]. The research-only data are separated from the EHR of the HIMS in data warehouses so that there are no inadvertent changes to the medical record. Simple deidentified queries are performed with the user-friendly graphical tool, TransMed®. Once a cohort is identified, the honest broker, in the context of an IRB-approved protocol, can provide the subject information to the investigator. However, searches typically require a deeper dive than the graphical tool allows, even though this tool usually provides good insight into where to begin the search. A database programmer who is separate from the main research team (to maintain adherence to protocols) is needed to obtain the information. In today's environment, this usually means using some variant of the relational SQL database programming language, although this may change in the future with improving technologies [22]. The key is that data within the data warehouse must be migrated to an environment suitable for analysis by the translational investigator, typically an Excel worksheet. Whether front-end or back-end methods are used to obtain the data, information retrieval systems are needed to conduct the searches. What follows are the software requirements for the system we use in-house and a case report.
2. Software Dependencies
TransMed®/HRI—a commercial system used in our studies. The name of the company changed recently, but we use TransMed® in the text above as this is how the literature references the system. The new website is https://inteliquet.com/.
SQL—a query language with variations used in different products, including MySQL, PostgreSQL, Microsoft SQL Server, and Oracle. It is used to house and manage several databases. The majority of databases use SQL to manage and warehouse data; however, other databases, such as Microsoft Access, do not rely on SQL coding. Microsoft Access is a good database system for those whose data are not large: a nontechnical person can create a database, the cost is low as it comes with the Microsoft suite, and the learning barrier is small. However, if your data are big (larger than 2 gigabytes) or you have a large number of users and objects, a more robust RDBMS is the ideal data warehouse/management system. Microsoft Access is intended to bridge the gap between Excel and SQL Server.
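As one concrete illustration of these product-level variations, even limiting a result set to ten rows is written differently across dialects (assuming a hypothetical Patient table):

```sql
-- MySQL and PostgreSQL
SELECT MRN FROM Patient LIMIT 10;

-- Microsoft SQL Server
SELECT TOP 10 MRN FROM Patient;

-- Oracle (12c and later; also standard SQL)
SELECT MRN FROM Patient FETCH FIRST 10 ROWS ONLY;
```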
R Statistical Software and Programming/RStudio [33, 34]—an object-oriented, fourth-generation programming language used by our group for data cleanup [35, 36]. Cleaning data involves detecting inaccurate or extraneous data points and correcting them or removing them from the final data sets.
Microsoft Excel—the deliverables are often formatted in Excel, as it is one of the visualization and analysis tools preferred by researchers and is utilized at Moffitt Cancer Center.
PowerChart® (Cerner)—Cerner is the company that produces the electronic medical record system used by Moffitt Cancer Center, and PowerChart® is that electronic medical record system, which at its core is a database. Note that the data for research are stored in a data warehouse separate from the clinical database.
3. Method
Patients are consented to the TCC protocol. Patients are approached for TCC consent, under which collected biospecimens may be placed in biobanking storage and used for research.
IRB/protocol approval is obtained to allow the researchers access to the data.
A deidentified search is conducted using TransMed® (Inteliquet®) for a general query, such as age, tumor, stage, and primary site.
For a more specific query, such as tissue and blood availability, SQL is used to obtain this information from the data warehouse through the honest broker. The information is collated using SQL and then extracted into an Excel sheet.
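As an illustration of such a biospecimen availability pull, the following sketch finds patients with both tissue and blood remaining in the biobank; the table and column names are hypothetical, not the actual warehouse schema:

```sql
-- Patients with at least one tissue sample AND at least one blood sample
-- still available; the result set is then exported to an Excel sheet.
SELECT p.Patient_ID, p.MRN
FROM Patient p
WHERE EXISTS (SELECT 1 FROM Sample s
              WHERE s.MRN = p.MRN
                AND s.Sample_Type = 'Tissue'
                AND s.Current_Quantity > 0)
  AND EXISTS (SELECT 1 FROM Sample s
              WHERE s.MRN = p.MRN
                AND s.Sample_Type = 'Blood'
                AND s.Current_Quantity > 0);
```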
The query is received and its feasibility assessed against the current databases. This involves reading the protocol and checking the databases to see which fields are available for query. In some instances, a query has no direct linkage within the database, and other means must be assessed to define the criteria. An example is assessing metastatic status by body location (site). This is not directly measured/recorded in the databases, so other means of information extraction are required, including pulling the metastatic site at diagnosis and tissue deemed metastatic through pathology, as sketched below.
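A rough sketch of how such a derived variable might be assembled from two indirect sources (hypothetical Diagnosis and Sample tables; the real extraction logic is protocol-specific):

```sql
-- Combine two indirect sources of metastatic status into one list:
-- metastatic site recorded at diagnosis, plus tissue that pathology
-- review has flagged as metastatic. UNION removes exact duplicates.
SELECT MRN, Met_Site AS Site, 'Diagnosis record' AS Source
FROM Diagnosis
WHERE Met_Site IS NOT NULL
UNION
SELECT MRN, Collection_Site_Category AS Site, 'Pathology review' AS Source
FROM Sample
WHERE Pathology_Metastatic_Flag = 'Y';
```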
- Select patients using the specified inclusion and exclusion criteria.
- Through TransMed® (Inteliquet®), a front-end system, we can use an interface that allows us to choose criteria. A potential drawback is that the system provides only an overview: if a researcher is assessing a specific demographic of patients, we may need to broaden the criteria to get a general subset of patients and then manually narrow down the cohort through chart review.
- In SQL-based systems, we program the query based on the given criteria. However, as with HRI, if a query is too specific and the data are not recorded in the database, a general query is made and a chart review takes place to finalize the cohort.
Once the data are abstracted, they are cleaned according to the researchers' request. This may involve R statistical software and programming, filtering in Excel, and additional data quality control measures applied to the data pull. Quality control involves a randomization system and a manual chart review, per the requirements of the IRB-approved protocol, to ensure that the final cohort of patients reflects the eligibility criteria and appropriate data points. The data are then released to the researchers.
In summary, patients consented to the TCC protocol are entered into a data warehouse that allows cancer investigators to query personal health information relevant to research. A deidentified search is performed to identify a cohort of patients using the researcher's criteria for a general query. If more specific information is needed, such as biospecimens with particular characteristics, SQL is used to pull these results. The protocol written by the researcher is assessed against the databases and feasibility is determined. If a requested variable is unavailable in the database, other means of pulling it are discussed. Once the details have been finalized, the actual query takes place using the front-end system and/or SQL, and the data are abstracted. The data are then cleaned using R software and sent to the researcher in Excel format once completed.
References
1. Lau EC, Mowat FS, Kelsh MA et al (2011) Use of electronic medical records (EMR) for oncology outcomes research: assessing the comparability of EMR information to patient registry and health claims data. Clin Epidemiol 3:259–272
2. Embi PJ, Payne PR (2009) Clinical research informatics: challenges, opportunities and definition for an emerging domain. J Am Med Inform Assoc 16(3):316–327
3. Berg M (2006) Health information management: integrating information technology in health care work. Routledge, New York
4. Hammond WE, Jaffe C, Cimino JJ, Huff SM (2014) Standards in biomedical informatics. In: Shortliffe EH, Cimino JJ (eds) Biomedical informatics: computer applications in health care and biomedicine. Springer, London, pp 211–253
5. Vogel LH (2014) Management of information in health care organizations. In: Shortliffe EH, Cimino JJ (eds) Biomedical informatics: computer applications in health care and biomedicine. Springer, London, pp 443–474
6. Yasnoff WA (2014) Health information infrastructure. In: Shortliffe EH, Cimino JJ (eds) Biomedical informatics: computer applications in health care and biomedicine. Springer, London, pp 423–441
7. Chen ET (2013) An observation of healthcare knowledge management. Commun IIMA 13(3):7
8. Payne TH, Graham G (2006) Managing the life cycle of electronic clinical documents. J Am Med Inform Assoc 13(4):438–445
9. Bankert EA, Amdur RJ (2006) Institutional review board: management and function. Jones and Bartlett, Sudbury, MA
10. Parker GE (2016) A framework for navigating Institutional Review Board (IRB) oversight in the complicated zone of research. Cureus 8(10):e844
11. Li B, Eschrich SA, Berglund A et al (2017) Use of the total cancer care system to enrich screening for CD30-positive solid tumors for patient enrollment into a brentuximab vedotin clinical trial: a pilot study to evaluate feasibility. JMIR Res Protoc 6(3):e45
12. Fenstermacher DA, Wenham RM, Rollison DE, Dalton WS (2011) Implementing personalized medicine in a cancer center. Cancer J 17(6):528–536
13. Lindsey H (2017) ORIEN uses big data to improve care for high-risk patients. Oncol Times 39(11):31–40
14. Dalton WS, Sullivan D, Ecsedy J, Caligiuri MA (2018) Patient enrichment for precision-based cancer clinical trials: using prospective cohort surveillance as an approach to improve clinical trials. Clin Pharmacol Ther 104(1):23–26
15. Silver M, Sakata T, Su HC, Herman C, Dolins SB, O'Shea MJ (2001) Case study: how to apply data mining techniques in a healthcare data warehouse. J Healthc Inf Manag 15(2):155–164
16. Coronel C, Morris S (2019) Database systems: design, implementation, and management, 13th edn. Cengage Learning, Boston, MA
17. Murphy SN, Morgan MM, Barnett GO, Chueh HC (1999) Optimizing healthcare research data warehouse design through past COSTAR query analysis. Proc AMIA Symp:892–896
18. Levy C, Beauchamp C, Hammond JE (1995) A managed care workstation for support of ambulatory care in Veterans Health Administration medical centers. J Med Syst 19(5):387–396
19. O'Kane KC, McColligan EE (1997) A case study of a MUMPS intranet patient record. Healthc Inf Manage 11(3):81–95
20. Garcia-Molina H, Ullman JD, Widom J (2014) Database systems: the complete book, 2nd edn. Pearson Prentice Hall, Upper Saddle River, NJ
21. Lianas L, Frexia F, Delussu G, Anedda P, Zanetti G (2014) pyEHR: a scalable clinical data management toolkit for biomedical research projects. Paper presented at the 2014 IEEE 16th international conference on e-health networking, applications and services (Healthcom), 15–18 Oct 2014
22. Kline K (2017) SQL in a nutshell: a desktop quick reference guide. O'Reilly Media
23. Miller ML, Ruprecht J, Wang D et al (2011) Physician assessment of disease activity in JIA subtypes: analysis of data extracted from electronic medical records. Pediatr Rheumatol Online J 9(1):9
24. Dougoud-Chauvin V, Lee JJ, Santos E et al (2018) Using big data in oncology to prospectively impact clinical patient care: a proof of concept study. J Geriatr Oncol 9(6):665–672
25. Stallings W (1987) Handbook of computer-communications standards: the open systems interconnection (OSI) model and OSI-related standards. Macmillan
26. Marsan J, Pare G (2013) Antecedents of open source software adoption in health care organizations: a qualitative survey of experts in Canada. Int J Med Inform 82(8):731–741
27. Gopinath MP, Tamilzharasi GS, Aarth SL, Mohanasundram (2017) An analysis and performance evaluation of NoSQL databases for efficient data management in e-health clouds. Int J Pure Appl Math 117(21):177–197
28. Ercan MZ, Lane M (2014) Evaluation of NoSQL databases for EHR systems. Paper presented at the 25th Australasian conference on information systems, Auckland, New Zealand
29. Wang X, Williams C, Liu ZH, Croghan J (2019) Big data management challenges in health research: a literature review. Brief Bioinform 20(1):156–167
30. Dyrbye LN, Thomas MR, Mechaber AJ et al (2007) Medical education research and IRB review: an analysis and comparison of the IRB review process at six institutions. Acad Med 82(7):654–660
31. Lyman JA, Scully K, Harrison JH (2008) The development of health care data warehouses to support data mining. Clin Lab Med 28(1):55–71
32. Farooqui NA, Mehra R (2018) Design of a data warehouse for medical information system using data mining techniques. Paper presented at the 2018 fifth international conference on parallel, distributed and grid computing (PDGC), 20–22 Dec 2018
33. RStudio web page. https://rstudio.com. Accessed 9 Jan 2020
34. The R project for statistical computing. https://www.r-project.org/. Accessed 9 Jan 2020
35. Chan BKC (2018) Data analysis using R programming. Adv Exp Med Biol 1082:47–122
36. Gabbrielli M, Martini S (2010) Programming languages: principles and paradigms. Springer, London/New York
