Journal of the American Medical Informatics Association (JAMIA). 2013 Aug 2;21(1):185–189. doi: 10.1136/amiajnl-2013-001769

Development of a HIPAA-compliant environment for translational research data and analytics

Wayne Bradford 1, John F Hurdle 2, Bernie LaSalle 2, Julio C Facelli 1,2
PMCID: PMC3912719  PMID: 23911553

Abstract

High-performance computing centers (HPC) traditionally have far less restrictive privacy management policies than those encountered in healthcare. We show how an HPC can be re-engineered to accommodate clinical data while retaining its utility in computationally intensive tasks such as data mining, machine learning, and statistics. We also discuss deploying protected virtual machines. A critical planning step was to engage the university's information security operations and the information security and privacy office. Access to the environment requires a double authentication mechanism. The first level of authentication requires access to the university's virtual private network and the second requires that the users be listed in the HPC network information service directory. The physical hardware resides in a data center with controlled room access. All employees of the HPC and its users take the university's local Health Insurance Portability and Accountability Act training series. In the first 3 years, researcher count has increased from 6 to 58.

Keywords: High-performance Computing, Translational Medical Research, Clinical Research Informatics, HIPAA

Introduction

Translational research increasingly depends on reusing clinical data for research purposes.1 In the USA and elsewhere, stringent regulations require maintaining the strict confidentiality of data collected during clinical encounters.2–4 These regulations can be a significant deterrent to translational investigators who are familiar with the much more fluid environments found in high-performance computing centers (HPC). Informatics solutions are needed to bridge this gap: solutions that give researchers the advantages of an HPC without undue barriers while preserving the confidentiality of health data.

Several approaches to this problem have been reported in the literature. The University of California, San Diego's iDASH system (integrating data for analysis, anonymization, and sharing), funded by the National Institutes of Health as a national center for biomedical computing, supports large-scale data analytics and storage for multi-institutional research projects.5 This platform and its tools are well thought out and lend themselves to large, multisite trials. As a result of its scale and demand on resources, however, the iDASH architecture is not easily replicated by most medical centers. The University of California, San Francisco implemented a toolset called the integrated data repository, which brings together a variety of phenotypic data, such as laboratory results, diagnoses, demographics, procedures, pathology and radiology reports, and vital signs, collected at the University of California San Francisco Medical Center.6 As with the system we describe here, the integrated data repository maintains scrupulous logs of user activity. Its primary objective is to support cohort-finding and hypothesis testing. Cimino et al7 describe a clinical repository architecture, the biomedical translational research information system, in use at the National Institutes of Health. While it is not designed specifically for high-performance computing, the biomedical translational research information system is unique in its principled application of controlled terminologies that facilitate the sharing of research data across diverse projects. On the other hand, the informatics core of the Indiana University Clinical and Translational Sciences Award, called the advanced biomedical IT core, offers truly exceptional high-performance computing services and massive data storage capacity (on the scale of petabytes).8 Its focus on big data skirts the issue of confidentiality under the Health Insurance Portability and Accountability Act (HIPAA), a drawback for many clinical applications. These examples are by no means exhaustive, but they illustrate that the translational research community has yet to settle on best practice.

Note that we focus here on computing environments where research analytics actually take place, distinguishing these from research enterprise data warehouses that may serve as a source of clinical data. We present the architecture adopted by the University of Utah. Our aim is to provide an example of a successful solution, based on commodity hardware and software, that others may wish to adopt.

The problem space

Until the deployment of the protected environment (PE), researchers at the university operated stand-alone computers/servers. This is the typical scenario at most research medical centers. This ‘island’ approach has several limitations. It discourages cross-group collaboration and leads to duplication of software/hardware purchasing. At Utah, we found that security and risk management were not well understood, controlled, or documented by the researcher community. Our research groups rarely had the time and expertise to put the necessary management controls in place. Finally, as datasets grow in this era of big data, the need for computational power grows apace. Upgrading hardware under the ‘island’ model is far less cost effective than upgrading a central HPC.

The Utah HPC setting

Our Center for High Performance Computing (CHPC) supports a wide variety of research users on our main campus (eg, chemistry, biology, natural resources, etc.) as well as on our health sciences campus. One of us (JFH) built a small compute cluster within the CHPC for processing clinical notes; that cluster was certified as HIPAA compliant by the university's information security and privacy office (ISPO), establishing the feasibility of safely housing clinical data in an HPC. The university administration then tasked the CHPC to address the general needs of researchers working with sensitive clinical data. It was clear from the outset that institutional financial support would be limited. Existing infrastructure would have to be used to the maximum possible extent.

Processing data containing protected health information (PHI) was not something that easily fitted into our CHPC infrastructure. The ethos of traditional scientific computing favors computational power, flexibility, and reliability over privacy. Meeting the regulatory compliance stipulations that safeguard PHI without degrading the experience of our non-medical research groups prompted us to design and deploy the PE.

Methods

Our needs assessment of the health sciences research community, undertaken to define a computationally rich, HIPAA-compliant environment, was based on a series of discussions with researchers at the University of Utah. We addressed the needs and data management work styles of researchers in obstetrics, geriatrics, nephrology, internal medicine, epidemiology, family and consumer studies, biomedical informatics, and pediatrics. These meetings were conducted over several years and were general in nature, as we were responding to needs as they manifested. Our approach was evolutionary: researchers approached the CHPC to use its services, and over time CHPC staff developed a good understanding of research needs. That understanding highlighted two distinct but complementary components:

  • The need to provide very large storage capacity and diverse analytical software that are well integrated into a high-performance computing setting, and

  • The ability to provide virtual machines (VM) to deploy applications containing PHI; for example, clinical trials database management tools or specialized systems to provide personalized health data accessible to patients.

In retrospect, a more systematic needs assessment would probably have produced a superior design, especially from the perspective of the researchers using the system. The computer science literature describes several effective and agile approaches to systematic needs assessment, for example, the work described in Erickson et al.9 Our needs-assessment approach did have the advantage of facilitating a fluid dialog between CHPC staff and researchers, building both trust and an understanding of the compromises that are inevitable in an iterative design cycle. Once the general framework of needs was defined, we worked closely with the university's information security operations (ISO) and the ISPO to define and test a security framework that was deemed at least as secure as that for clinical information systems.

In designing the PE, we wanted to isolate it as much as possible while still utilizing parts of the core infrastructure. One simple step we took was to assign the network IP address space for the PE to logical subnets and virtual local area networks separated from existing services. Subnet addressing supports a ‘virtual’ network system made up of multiple networks sharing the same institutional internet address space. With this approach we were able to utilize most of our existing core services such as virtual private network (VPN), domain name system, network time protocol, Kerberos (an authentication protocol), and our applications tree (a structured collection of software applications that supports access to current and past versions of software).
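As an illustration only (the addresses and names below are hypothetical, not the university's actual allocation), the following sketch uses Python's standard ipaddress module to show how dedicated PE subnets can be carved out of a larger campus block and kept logically separate from general-purpose services:

```python
# Minimal sketch of PE subnet separation; all addresses and VLAN names are illustrative.
import ipaddress

campus_block = ipaddress.ip_network("10.0.0.0/16")        # hypothetical campus address space
subnets = list(campus_block.subnets(new_prefix=24))       # candidate /24 subnets

# Hypothetical assignments: the PE gets its own subnets/VLANs, separate from CHPC services.
vlan_map = {
    "chpc_general": subnets[0],    # e.g. 10.0.0.0/24
    "pe_login":     subnets[10],   # e.g. 10.0.10.0/24
    "pe_compute":   subnets[11],   # e.g. 10.0.11.0/24
}

def logical_network_for(address: str) -> str:
    """Return which logical network an address belongs to, or 'unassigned'."""
    ip = ipaddress.ip_address(address)
    for name, net in vlan_map.items():
        if ip in net:
            return name
    return "unassigned"

if __name__ == "__main__":
    print(logical_network_for("10.0.10.42"))   # -> pe_login
    print(logical_network_for("10.0.99.7"))    # -> unassigned
```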

In the early planning stages we engaged the university's ISO and the ISPO to gain a better understanding of the security requirements associated with PHI and HIPAA-regulated data. Working with the ISO/ISPO, especially using them for external ‘unauthorized access’ testing, was an important step in designing a system with appropriate security controls and safeguards.

Physical infrastructure

The physical hardware resides in a data center with controlled room access. The hosts are racked in a locked cabinet and fitted with locked server bezels. Physical access to the data center is reviewed biannually and documented on an access-controlled departmental wiki. Back-ups are restricted to one specific back-up server on one particular port. Back-up data traffic is automatically encrypted (Blowfish) at the client side before traversing the network. Back-up media are stored in locked cabinets in the access-restricted data center. All CHPC staff who interact with the PE take the university's HIPAA training courses, and many have completed the well-known collaborative institutional training initiative human subjects research training.10
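We do not describe the back-up software itself here; purely as a minimal sketch of the client-side encryption step, assuming the third-party pycryptodome package (pip install pycryptodome), one could encrypt each back-up chunk with Blowfish before it leaves the client:

```python
# Illustrative sketch only: client-side Blowfish encryption of back-up data.
# The actual back-up tooling and key management are not specified in this report.
from Crypto.Cipher import Blowfish
from Crypto.Random import get_random_bytes
from Crypto.Util.Padding import pad

def encrypt_backup_chunk(data: bytes, key: bytes) -> bytes:
    """Encrypt one chunk of back-up data; the random IV is prepended to the ciphertext."""
    cipher = Blowfish.new(key, Blowfish.MODE_CBC)
    return cipher.iv + cipher.encrypt(pad(data, Blowfish.block_size))

if __name__ == "__main__":
    key = get_random_bytes(16)   # in practice, a centrally managed key, not a per-run value
    ciphertext = encrypt_backup_chunk(b"research file contents", key)
    print(len(ciphertext), "bytes ready to send to the designated back-up server")
```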

HPC analytical environment

This space is for computationally intensive needs such as data mining, machine learning, natural language processing (NLP), statistics, and other operations across large datasets. This requires large (and secure) storage, high network-bandwidth capacity, and high-performance compute clusters. As noted above, we are able to share our complete applications tree with the PE in read-only mode. The most popular applications used in the HIPAA environment are MySQL,11 a collection of NLP tools including MetaMap12 and CLUTO13 (a clustering tool), the popular WEKA data mining package,14 and the R statistical package.15 The two predominant pipeline strategies our researchers employ are UIMA16 for NLP work and PHP-wrapped R for bioinformatics projects.
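Our bioinformatics pipelines wrap R with PHP; as a language-neutral illustration of the same wrapping pattern, the following minimal sketch drives R from Python's subprocess module. It assumes Rscript is installed and on the PATH, and the R expression is purely illustrative:

```python
# Minimal sketch of wrapping R from a scripting language (our production pipelines use PHP).
import subprocess

def run_r(expression: str) -> str:
    """Run a single R expression through Rscript and return its printed output."""
    result = subprocess.run(
        ["Rscript", "-e", expression],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    # A real pipeline would read a dataset from PE storage rather than an inline vector.
    print(run_r("summary(c(2.1, 3.4, 5.9, 8.0))"))
```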

For access to the environment, we chose a double authentication mechanism. The first level of authentication uses the campus VPN, and only pre-approved users may log into the VPN pool of IP addresses for the PE. The second level requires users to be listed in the CHPC network information service (NIS) directory server in order to interact with the PE. Both authentication mechanisms use the same user ID and password from the University of Utah's central Kerberos authentication system.
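The sketch below illustrates the second layer only: an account must exist in the directory and belong to the group named in a security access configuration file before a login is honored. The user and group names are hypothetical; on a host configured for NIS, Python's standard pwd and grp modules resolve accounts and groups through the name service switch:

```python
# Minimal sketch of the directory/group check behind the second authentication level.
import grp
import pwd

def pe_login_allowed(username: str, required_group: str = "pe_users") -> bool:
    """Allow a PE login only if the account exists and is in the configured NIS group."""
    try:
        pwd.getpwnam(username)                        # account exists in the directory
        members = grp.getgrnam(required_group).gr_mem # explicit members of the access group
    except KeyError:
        return False                                  # unknown user or group
    return username in members

if __name__ == "__main__":
    print(pe_login_allowed("jdoe"))   # hypothetical account name
```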

Access to the compute clusters is provided via front-end login servers. The login servers are restricted by router access control lists to remote desktop protocol for Windows hosts and secure shell for the Linux hosts. Public key, host, or RHOSTS-based authentication are not allowed. The login servers also employ firewall services to limit access to VPN addresses.

All interactions between the login servers and the cluster file server, batch controller, and computing nodes take place on an isolated back-end, high-speed InfiniBand network.

VM implementation

The VM environment is for scientists who need lightweight computing services that do not justify the expense or capabilities of a dedicated server. The VM cluster consists of four servers (two VMware ESX, one Windows, one Red Hat Linux) and a disk tray. One Windows server runs VMware vCenter Server, which coordinates the load-balancing and failover (ie, auto-recovery) functions of the two ESX servers. This server does not process any protected data. One Red Hat Linux server acts as an administrative access point and it too does not process any protected data. The two VMware servers host the actual guest VM. These servers do process protected data but do not store it internally (ie, all transactions are RAM based). The disk tray provides shared storage to the two VMware servers. These disks store the actual VM, and thus all sensitive data in those VM. We require all VM to encrypt their disks. The VM and applications are regularly scanned by the university ISO for compliance with encryption and HIPAA standards.

Administrative procedures

In order for a user to access the PE, they must meet all of the following requirements:

  • Have an active account in the University of Utah's Kerberos authentication system, using the university's standard procedures. This can also be extended to external collaborators.

  • Have an active CHPC account. Using CHPC accounts requires approval of a faculty principal investigator.

  • Have an active CHPC account created in the protected environment's NIS and be a member of the ‘NIS group’ that is listed in a security access configuration file. A CHPC PE account requires verified completion of the university's HIPAA privacy and security training courses.

  • Be added to the HIPAA VPN pool, and use this VPN encrypted tunnel to access designated login nodes.

Permission to use a given dataset is governed by the approval of the university's institutional review board (IRB). If the IRB approves a project that uses a PHI dataset, the researcher is given an IRB number, which is then shared with the CHPC. The researcher lists the users who will be permitted to access the data in the PE. That list is independently verified with the IRB and it forms the basis of the UNIX group defined for the project. At this point, the data may be transferred to CHPC and only the NIS group will have access to it.
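As an illustration of this access model (the path and group names are hypothetical, and the operations require appropriate privileges on the file server), the following sketch restricts a project's dataset tree to its IRB-derived UNIX group:

```python
# Minimal sketch: give the IRB project's UNIX group (and no one else) access to a dataset tree.
import os
import shutil
import stat

def restrict_to_project_group(project_dir: str, irb_group: str) -> None:
    """Set group ownership and remove world access for every file in the project tree."""
    for root, dirs, files in os.walk(project_dir):
        for name in dirs + files:
            path = os.path.join(root, name)
            shutil.chown(path, group=irb_group)
            # owner/group read-write (directories also need execute); no access for others
            mode = (stat.S_IRWXU | stat.S_IRWXG) if os.path.isdir(path) else 0o660
            os.chmod(path, mode)
    shutil.chown(project_dir, group=irb_group)
    os.chmod(project_dir, stat.S_IRWXU | stat.S_IRWXG)

if __name__ == "__main__":
    # Hypothetical project directory and group named after the IRB number.
    restrict_to_project_group("/pe/projects/irb_00012345", "irb_00012345")
```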

Logical access is monitored by SYSLOG, a standard Linux logging solution, and PSACCT, a process accounting utility. Logs are kept both locally and on a remote SYSLOG server, and they are routinely reviewed. Logs are currently kept indefinitely. Log ‘watch reports’ are emailed daily to designated administrator accounts. These daily log watch reports show the last 24 h of accounts and IP addresses logging in and, importantly, those that fail to log in. While this method may not scale to a very large installation, it has proved effective and reliable for our current user base. Firewall configuration prevents ‘brute force’ login attempts. Access to view or manipulate other users' or groups' data is controlled using UNIX file and directory permissions.
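The production reports are generated by the SYSLOG/PSACCT tooling itself; purely as an illustration of what such a ‘log watch’ summarizes, the sketch below tallies successful and failed SSH logins from a system authentication log. The log path and message format are assumptions and vary by distribution:

```python
# Illustrative sketch of a log-watch summary; not the production reporting tool.
import re
from pathlib import Path

AUTH_LOG = Path("/var/log/auth.log")   # hypothetical path; differs across Linux distributions

def summarize_logins(log_text: str) -> dict:
    """Collect (user, source IP) pairs for accepted and failed SSH logins."""
    report = {"accepted": [], "failed": []}
    for line in log_text.splitlines():
        m = re.search(r"(Accepted|Failed) \S+ for (?:invalid user )?(\S+) from (\S+)", line)
        if m:
            status, user, ip = m.groups()
            report["accepted" if status == "Accepted" else "failed"].append((user, ip))
    return report

if __name__ == "__main__":
    if AUTH_LOG.exists():
        summary = summarize_logins(AUTH_LOG.read_text(errors="replace"))
        print(f"{len(summary['accepted'])} successful and {len(summary['failed'])} failed logins")
```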

When an account is locked or disabled at the level of the campus, the VPN, or the local department, login to the PE is prevented. Account IRB authorization is reviewed biannually. IRB project personnel lists are the authoritative source for who has access to PE data. If a person is not listed on an approved IRB project, then they are not added to the UNIX group that grants access to that project's data. We are working on an automated web service between the university's electronic IRB system and the CHPC with the aim of real-time authorization. We recognize that internal threats (ie, a CHPC employee attempting unauthorized data access) could be a serious issue, but at this point we have not implemented any logical or physical means to prevent them. However, we do employ the same educational and disciplinary measures used by our clinical IT departments.
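Because that IRB authorization web service is still under development, the sketch below only illustrates the intended real-time check; the endpoint, parameters, and response format are all hypothetical:

```python
# Hypothetical sketch of a real-time IRB authorization lookup; no such endpoint exists yet.
import json
import urllib.parse
import urllib.request

IRB_API = "https://irb.example.edu/api/projects"   # placeholder endpoint

def user_authorized(irb_number: str, username: str) -> bool:
    """Ask the (hypothetical) electronic IRB system whether a user is on the project."""
    url = f"{IRB_API}/{urllib.parse.quote(irb_number)}/personnel"
    with urllib.request.urlopen(url, timeout=10) as resp:
        personnel = json.load(resp)                # assumed response: a JSON list of usernames
    return username in personnel

if __name__ == "__main__":
    print(user_authorized("IRB_00012345", "jdoe"))   # hypothetical IRB number and user
```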

Results

Figure 1 provides an overview of the architecture of the system described in this case report. In table 1 we show that, in its first 3 years of operation, the number of researchers using the PE system grew from 6 to 58. It has proved to be a popular resource.

Figure 1. A simplified ‘cartoon’ diagram of the architecture described in the Methods section.

Table 1.

Growth of the high-performance computing center protected space

Date           | No of hosts* | Total disk capacity (TB) | No of researchers
February 2009  | 9            | 5.6                      | 6
October 2010   | 16           | 27.7                     | 26
April 2011     | 19           | 33.7                     | 37
March 2012     | 20           | 33.7                     | 58

*A host is a login server or a multi-core compute node. Core count varies from 8 to 16 per node. These snapshot counts were taken when user demand required adding more hardware. No special outreach to attract users was attempted between count epochs.

The VM environment in the PE has been ‘live’ for a little less than 2 years. Currently, there are three VM user groups as shown in table 2.

Table 2.

Resources allocated in the protected virtual space

Application   | No of VM | RAM (GB) | Disk (TB)
REDCap        | 8        | 26       | 8
AsthmaTracker | 4        | 8        | 4
caTissue      | 4        | 8        | 4

VM, virtual machines.

Users report that the dual-authentication process is more cumbersome than they would like; however, they note that this is a small price to pay for being freed of the worries of updating software, making secure and consistent back-ups, enforcing access control, and paying for new hardware.

Conclusion

As described in the Methods section, we acknowledge that our needs-assessment approach could have been improved with more systematic techniques. However, our evolutionary design approach did result in a usable and popular resource. We have been able to reuse a substantial part of the CHPC infrastructure to develop a new PE that increases user productivity while maintaining regulatory compliance. Our hardware is commodity technology, for example, Dell, which is in widespread use at multiple medical centers. Most of our software is open source (eg, MySQL, MetaMap, CLUTO, R, etc.), and to the extent that we use commercial software, such as SAS and MATLAB, we rely only on software that is also widely available. The access technologies described above (eg, VPN or Kerberos) are industry standards and thus are easily generalized. Users from various units on campus are utilizing this infrastructure, and it is stimulating new collaborations between, among others, the Departments of Biomedical Informatics, Pediatrics (Primary Children's Medical Center), Radiology, the College of Nursing, and the university's research infrastructure being built under our Clinical and Translational Science Award. The tangible benefits these researchers realize include:

  • Access to HPC power;

  • Freedom from systems management issues (eg, rapid response to electrical power issues, provision of reliable cooling and heating, etc.);

  • VPN support for a ‘work-anywhere’ computing experience;

  • Automatic software updating;

  • A hardened, secure environment far superior to office computers or departmental servers.

For the university, this PE resource allows much better overall risk compliance and reduced exposure to the inadvertent disclosure of confidential PHI.

Acknowledgments

The authors would like to thank the staff of the University of Utah Center for High Performance Computing for their support in this project. We gratefully acknowledge the insightful comments of the reviewers; they led to a much improved manuscript.

Footnotes

Contributors: All of the authors contributed to the conception and design of one or more parts of the PE system described in the manuscript. The manuscript was initially drafted by WB, BL, and JCF, with substantial input and final editing by JFH. All the authors gave final publishing approval.

Funding: This work has been supported in part by the National Center for Research Resources award UL1RR025764; National Library of Medicine awards 5RC2LM010798, 5R01LM010981, and 5R21LM009967; and DHHS Health Resources and Services award 1D1BRH20425-01-00.

Competing interests: None.

Provenance and peer review: Not commissioned; externally peer reviewed.

References


