Journal of the American Medical Informatics Association: JAMIA. 2011 Oct 7;18(Suppl 1):i103–i108. doi: 10.1136/amiajnl-2011-000316

Strategies for maintaining patient privacy in i2b2

Shawn N Murphy 1,2,3, Vivian Gainer 3, Michael Mendis 3, Susanne Churchill 3, Isaac Kohane 2,4,5
PMCID: PMC3241166  PMID: 21984588

Abstract

Background

The re-use of patient data from electronic healthcare record systems can provide tremendous benefits for clinical research, but measures to protect patient privacy while utilizing these records face many challenges. Some of these challenges arise from a misperception that the problem should be solved technically, when it actually needs a holistic solution.

Objective

The authors' experience with informatics for integrating biology and the bedside (i2b2) use cases indicates that the privacy of the patient should be considered on three fronts: technical de-identification of the data, trust in the researcher and the research, and the security of the underlying technical platforms.

Methods

The security structure of i2b2 is implemented based on consideration of all three fronts. It has been informed by several use cases from across the USA, resulting in five privacy categories of users that serve to protect the data while supporting those use cases.

Results

The i2b2 architecture is designed to provide consistency and faithfully implement these user privacy categories. These privacy categories help reflect the policy of both the Health Insurance Portability and Accountability Act and the provisions of the National Research Act of 1974, as embodied by current institutional review boards.

Conclusion

By implementing a holistic approach to patient privacy solutions, i2b2 is able to help close the gap between principle and practice.

Keywords: automated learning; bioinformatics; clinical informatics; clinical research; clinical research informatics; common rule; data exchange; data models; discovery, text and data mining methods; genomics; HIPAA; image representation; information storage and retrieval; knowledge representations; linking the genotype and phenotype; medical informatics; medical records; natural-language processing; patient privacy; processing and display analysis; software architecture; visualization of data and knowledge


Surveys have found that patients' opinions of how their data should be protected fall along a continuum, and although most can be classified as cautious regarding the use of their electronic healthcare record (EHR) data for research, the majority of patients are not averse to the idea.1 The most prevalent reason for keeping EHR data private is the perceived risk to patients' personal lives: stigmatizing health conditions appearing in the EHR can threaten social relationships and status.2 It should not be taken for granted that EHR data may be used for anything other than caring for the specific patient from whom they were collected, but when patients' legitimate concerns are dealt with in a sensitive manner, it is possible to work with EHR data in ethical ways while promoting clinical research. We have identified three areas of importance in maintaining patient privacy: de-identification of the data, the patients' trust in the researcher and in the research, and the technical data security of the computer system.

Is there a form of de-identification appropriate for all-comers? Algorithms have been developed for structured data to prevent the disclosure of sensitive information even when distributed at a detailed, non-aggregated, line-item patient level using a method known as ‘k-level anonymity’,3 4 where k represents the number of people's records that must be indistinguishable from one another if the set is to pass scrutiny. When a patient record exceeds this level of uniqueness, data values are removed until the record is no longer unique. Although superficially such methods seem like an adequate solution to the de-identification problem, they have been shown to be subject to ‘reverse engineering’, an undoing of the obfuscation.5 Furthermore, they often remove critical attributes from the data.6
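To make the suppression idea concrete, here is a deliberately naive sketch of k-anonymity enforcement over a list of record dictionaries. It illustrates the principle only, not the published algorithms;3 4 the field names, the suppression order and the value of k are assumptions chosen for the example.

```python
from collections import Counter

def enforce_k_anonymity(records, quasi_identifiers, k=5):
    """Naive suppression sketch: while any combination of quasi-identifier
    values is shared by fewer than k records, blank the last remaining
    quasi-identifier in the offending records and re-check."""
    records = [dict(r) for r in records]          # work on copies
    fields = list(quasi_identifiers)
    while fields:
        combos = Counter(tuple(r.get(f) for f in fields) for r in records)
        rare = {c for c, n in combos.items() if n < k}
        if not rare:
            break                                 # dataset now passes scrutiny
        dropped = fields.pop()                    # coarsen by suppression
        for r in records:
            if tuple(r.get(f) for f in fields + [dropped]) in rare:
                r[dropped] = None
    return records

# Six indistinguishable records pass; the unique seventh is suppressed away.
rows = [{"zip3": "021", "birth_year": 1960}] * 6 + [{"zip3": "945", "birth_year": 1931}]
print(enforce_k_anonymity(rows, ["zip3", "birth_year"]))
```

Note how the unique record ends up stripped of exactly the attributes that made it analytically useful, which is the trade-off described above.6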

Another popular form of de-identification used for medical records is the ‘scrubbing’ of textual medical reports.7–9 Computer programs search the text and attempt to remove patient names, dates, locations and other potentially identifying information. These programs perform to various levels of accuracy, and involve trade-offs similar to those described above for structured data. To ensure that the data are de-identified and ‘unmatchable’ to the original record, sentence structure and other important attributes of the data must often be removed.10
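A toy version of such a scrubber is sketched below, assuming simple regular-expression patterns and a list of names already known from the structured record. Production systems7–9 combine dictionaries, contextual features and machine learning, and even then leave the residual identifiers discussed later.

```python
import re

# Illustrative patterns only; a real scrubber needs far broader coverage.
PATTERNS = {
    "[DATE]":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "[PHONE]": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "[MRN]":   re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE),
}

def scrub_note(text, known_names):
    """Replace identifying strings in a clinical note with placeholders."""
    for placeholder, pattern in PATTERNS.items():
        text = pattern.sub(placeholder, text)
    for name in known_names:   # names taken from the patient's own record
        text = re.sub(rf"\b{re.escape(name)}\b", "[NAME]", text,
                      flags=re.IGNORECASE)
    return text

print(scrub_note("Mr Brown seen 03/14/2011; call 617-555-0123.", ["Brown"]))
# -> Mr [NAME] seen [DATE]; call [PHONE].
```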

The failure of technology alone to offer a foolproof de-identification solution is not surprising. People are extremely resourceful at solving challenging puzzles, such as the re-identification of de-identified data. However, the true risk may be greatly overemphasized by these demonstrations,11 which leave two not entirely satisfactory approaches to de-identification: one that produces de-identified output stripped of meaningful data, and another that maintains germane information using methods that can be breached if they fall into the wrong hands.

In attempts to resolve this paradox, illogical decisions can be made about patient privacy solutions. For example, Marshfield Clinic has made an enormous investment in a bank of over 20 000 consented patients who are genotyped using donated blood and tissue. These genotypes are combined with de-identified phenotypic data from the Marshfield Clinic electronic medical record.12 A well-intentioned policy was put into place to keep the people who view identified phenotypic data from having access to the associated de-identified genomic data, with the reasoning that a person who could see both datasets might find a way to tie them together. As all the physicians at Marshfield Clinic must have access to the EHR, the outcome is that many Marshfield physicians who are investigators cannot look at the data from their own studies.

One approach to resolving such privacy management discordance is to match the level of data de-identification with the trustworthiness of the data recipients: the more identified the data, the more ‘trustworthy’ the recipients are required to be, and vice versa. This solution requires that trustworthiness be quantified and governed by established, socially acceptable processes, such as criminal history checks, letters of reference and credentialing systems, which society has long used to perform objective trust assessments. Specific methods used at Partners Healthcare and Harvard University are described later in the paper. The level of trust in a data recipient thus becomes a critical factor in determining what data may be seen by that person.

We also need to consider the technical protection of the patient data itself, for which the Health Information Technology for Economic and Clinical Health (HITECH) Act requires covered entities to conduct a risk analysis and implement the physical, administrative and technical safeguards that each covered entity determines are reasonable and appropriate.13 Technical safeguards to consider include user access and authentication controls, assignment of privileges, maintenance of file and system integrity, back-ups, monitoring processes, log-keeping, auditing and physical securing of the data. A range of possible solutions exists for managing the technical protection of the data, representing different trade-offs among risk, cost and flexibility. A solution at the University of California at San Francisco (UCSF) was to create an exclusive, protected area for data and analysis inside a specially firewalled area for the research community. The incentive to use the protected area is that legal coverage is provided should a data breach occur within it. This solution guarantees that the technical safeguards implemented by the institution within the protected area, such as firewalls and network intrusion detection, virtual private networks and disk encryption, are followed by the researchers. However, it requires a high resource commitment from the institution to maintain the protected area, and the use of specialized software on privately funded platforms is not supported.

With more responsibility and trust given to the researchers, institutions such as Partners Healthcare have policies similar to those of UCSF; however, researchers are free to use most areas behind the institutional firewalls. Researchers must prove their knowledge of the security policies by taking a certified course on human subject research protection and by specifying the technical protections of the patient data in their institutional review board (IRB) applications. The researchers have more freedom to use their local computational platforms and software, but the institution loses the ‘guarantee’ of a flawless implementation of its technical security policies that the UCSF solution provides. The more liberal solution at institutions such as Partners Healthcare therefore demands more attention to data de-identification or encryption, and a better determination of data recipients' trustworthiness and ability to set up a technically safe environment.

Our objective was to create the i2b2 software platform so that it complied with real-world use cases for how patient privacy solutions were implemented but, given that no solution would be perfect, represented a balance between the data de-identification technology, the safety of the technical platform, and the requirement of various levels of trust in the researchers. The use cases were simplified to five patient privacy levels with clear requirements for each of these three components, not because the situation was simple, but because of the complexity of keeping the platform consistent across the data protection levels. Of course, as i2b2 is open source it can be adapted to satisfy the patient privacy requirements of a local site; however, careful attention must be paid to a consistent data protection formulation throughout the platform.

Materials and methods

Case studies were performed to study implementations of patient privacy systems and to understand the logic and consequences of different approaches. All of these case studies represented an interpretation of the legal structures embodied in the Health Insurance Portability and Accountability Act (HIPAA) and the US Department of Health and Human Services common rule. As a result, different categories of data protection and patient privacy levels were implemented in the i2b2 platform.14 Explicit regulations concerning healthcare privacy are established in HIPAA and more recently in the HITECH Act, which defines the term ‘protected health information’ (PHI) as any information about patient health that can be linked to an individual patient or their relatives.13 15 These regulations state that patients must be notified in a ‘HIPAA notification’ that their records will be used for medical research. Furthermore, if PHI is to be used, either the consent of the patient must be obtained, or the consent must be waived by the IRB as per the common rule, which governs the use of research subjects as a whole.16 All medical records containing PHI used for research must be tracked as they leave a covered entity. To forgo PHI regulations, medical records can undergo one of two forms of HIPAA-defined de-identification before research. The ‘safe-harbor’ method removes 18 identifiers enumerated in section 164.514 (b) (2) of the regulations and footnoted below.i Alternatively, a determination may be written by a qualified statistician that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by the anticipated recipient to identify the subject of the information. HIPAA also defines a special ‘limited dataset’ (LDS) for certain types of research that, unlike a de-identified dataset, may include the specific PHI items of dates, cities, states, zip codes, and other uniquely identifying characteristics not listed as direct identifiers as defined in 164.514 (e) (2) of the regulations. The advantage of the LDS is that even though it is not de-identified, it can still be used or disclosed for research purposes without an authorization or a waiver of the authorization requirement. A covered entity must, however, enter into a data use agreement with the recipient of the LDS before using or disclosing it. Tracking requirements are also relaxed.
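As a sketch of what the safe-harbor method implies for structured data, the function below drops enumerated identifiers, truncates dates to the year and zip codes to three digits, and buckets ages over 89. The column names are hypothetical, and a real implementation would have to cover all 18 identifier classes and check the zip-code population threshold.

```python
from datetime import date

SAFE_HARBOR_DROP = {  # hypothetical column names for the enumerated identifiers
    "name", "street_address", "phone", "fax", "email", "ssn", "mrn",
    "health_plan_id", "account_no", "license_no", "vehicle_id",
    "device_id", "url", "ip_address", "biometric_id", "photo",
}

def safe_harbor(record):
    """Apply the safe-harbor transform of 164.514 (b) (2) to one record."""
    r = {k: v for k, v in record.items() if k not in SAFE_HARBOR_DROP}
    if "birth_date" in r:
        r["birth_year"] = r.pop("birth_date").year  # dates reduced to year
    if "zip" in r:
        r["zip3"] = r.pop("zip")[:3]  # only where the zip3 area exceeds 20 000 people
    if "age" in r and r["age"] > 89:
        r["age"] = 90                 # report as '90 or older'
    return r

print(safe_harbor({"name": "Doe", "birth_date": date(1960, 3, 14),
                   "zip": "02114", "age": 91, "dx": "I21.9"}))
# -> {'age': 90, 'dx': 'I21.9', 'birth_year': 1960, 'zip3': '021'}
```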

Five implementations of data protection and patient privacy in clinical research were considered. The first case was an implementation at the Partners Healthcare System. For the Partners Healthcare System IRB to approve the concept of an openly queryable research database with general access by Partners investigators, the IRB needed to be convinced that patient populations could be defined in a software tool without allowing specific patients to be identified. A software tool was developed,17 and an obfuscation method applied to patient counts.18 The obfuscation method performs Gaussian function-based blurring of patient counts, combined with monitoring of the number of query repetitions with similar results, to ensure a statistical de-identification process. Researchers are restricted to obtaining aggregate counts of patients that result from custom queries; they cannot access the underlying data directly until specific studies are approved by the IRB, at which point they can obtain access to PHI for the patients defined in the custom queries through database administrators.

The second use case was an implementation from the University of Massachusetts at Worcester. In this implementation two research databases exist in parallel, one with de-identified data per HIPAA safe harbor, and one with fully identified data. The fully identified database is kept private, and the de-identified database can be queried for aggregate results by University of Massachusetts investigators without specific IRB approvals, similar to the Partners Healthcare System model. The queries can be run in the identified database by administrators to obtain PHI for specific IRB-approved studies.

The third use case was an implementation based upon the HIPAA-defined LDS. With access to the LDS, patient-level data can be made available to the researcher without the application of data-altering algorithms such as k-level anonymity. Individual records can be viewed with true dates and locations as specific as zip codes, and unique features of each patient can be retained in the dataset.

The fourth use case arises from implementations at Vanderbilt and the i2b2 Driving Biology Projects.19 The IRB will approve the use of blood samples for research on patients who are presented with a simple ‘opt-out’ option, on the condition that only an LDS of the medical record data is matched with the blood samples. Because these studies use natural language processing (NLP) of text data to provide detail on patient phenotypes that is not available in the coded data, it is important to provide text notes with the medical record data. These notes can be de-identified, but the IRB must determine whether the notes, which contain a small percentage of remaining identifiers (typically 2–5%), are sufficiently scrubbed to be included with the LDS. This variation on the use of the LDS is an important, distinct use case in conducting medical research9 and was important for the i2b2 platform to recognize.

The fifth use case is seen in implementations focused upon recruiting patients for clinical trials based upon the analysis of EHR data. Initial queries may be done in an obfuscated or de-identified form, but the ultimate value of those queries is to produce, once IRB approval has been obtained, an identified list of individuals who may be contacted to join a clinical trial. The patients' PHI will always be needed for that contact to take place and is therefore an important part of the database.

Results

The following five privacy levels were created in i2b2 to match the use cases described above:

The ‘obfuscated data user’ is a researcher whose trustworthiness has not been investigated, with a client machine of possibly low technical security. For this user, data are obfuscated as they are served to the client machine, and access to the client is through an institutional intranet with a proven institutional login process. Obfuscation is applied to aggregate patient counts that are reported as the result of ad-hoc queries on the client machine through a previously reported query tool.17 Obfuscation methods include the addition of a random number to the patient counts, with a distribution defined by a Gaussian function.18 Numerous repetitions of a query by a single user must be detected and interrupted because they will converge on the true patient count, making reliable user identification essential for the method to function. The obfuscated aggregate results of queries are the only items available to the user. Line item patient data, in which results for individual patients are reported, are not available to this user; neither are narrative text reports on individual patients, nor any PHI. The advantage of the obfuscated user type is that it protects individual patient data while the underlying database can contain limited and coded datasets. Unlike the LDS, the data can be served to this user without a data use agreement. However, the blocking of replicate queries required to implement obfuscation can be restrictive and intrusive to users, and without stringent user management an individual could create multiple user accounts and overcome the obfuscation method.
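A minimal sketch of this obfuscation pattern follows, combining Gaussian blurring of counts with a per-user repetition lockout. The noise width and the repetition limit are illustrative assumptions, not i2b2's configured values.18

```python
import random
from collections import defaultdict

class ObfuscatedCounter:
    """Blur each aggregate count with Gaussian noise and lock out a user
    who repeats a query enough times to average the noise away."""

    def __init__(self, sigma=1.7, max_repeats=7):  # illustrative parameters
        self.sigma = sigma
        self.max_repeats = max_repeats
        self.history = defaultdict(int)            # (user, query) -> times asked

    def count(self, user, query, true_count):
        self.history[(user, query)] += 1
        if self.history[(user, query)] > self.max_repeats:
            raise PermissionError("repeated query interrupted; see administrator")
        noisy = true_count + random.gauss(0, self.sigma)
        return max(0, round(noisy))                # never report a negative count

counter = ObfuscatedCounter()
print(counter.count("user1", "diabetes AND age>65", true_count=142))  # e.g. 141
```

The lockout is exactly why the text stresses reliable user identification: issued from several accounts, the same query would let the noise be averaged away.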

The ‘aggregated data user’ is one who queries against a fully HIPAA de-identified dataset and, like the obfuscated data user, is a researcher with uninvestigated trustworthiness and a client machine with low technical security. Notably, though, in this use case the dataset on the server is fundamentally different: it must be completely de-identified and contain no PHI. Exact numbers from aggregate query results are therefore permissible, because the risk of identifying a patient is low even if the entire dataset were to be reconstructed from the aggregate queries. The dataset that queries are performed against has all HIPAA PHI removed, so dates are accurate only to the year and zip codes only to the first three digits. Other ‘overly identifying’ data must also be removed, such as highly unique diagnoses or treatments. Narrative text is not available, as overly unique features could not be adequately controlled using existing de-identification text-processing software. The advantage of the aggregated data user is that careful user management is not required, unlike the obfuscated user case. The disadvantage is that the burden of a completely successful HIPAA de-identification falls upon the data manager and covered entity.

The next category is the ‘LDS data user’, who is allowed access to the HIPAA-defined LDS and line item patient data. Because it could be fairly easy to re-identify an individual from this dataset, even from public data sources,13 the researcher in this case is expected to have credentials that indicate the proper level of trustworthiness. On the i2b2 platform, this user may have direct access to the set of raw patient data in the database. When PHI must be placed in the i2b2 patient database, it is encrypted to prevent an LDS data user (who will not be granted the decryption key) from viewing it. The advantage of an LDS data user is that, with full access to the LDS, any kind of low-level patient-oriented analysis can be carried out without restriction. The disadvantage is that the data manager must implement a robust, technically secure platform to control access to trustworthy users only. A variation often used by the i2b2 platform is to maintain what is technically a ‘coded dataset’, in which the HIPAA LDS definition is not formally met because the data remain linked to patients through an encrypted identifier, thus allowing updates to be performed on the data.

The fourth category of user is the ‘notes-enabled LDS data user’. This user is given the added privilege of viewing text notes that have been scrubbed (PHI has been removed), where the narrative text conforms as closely as possible to the HIPAA definition of an LDS, with the margin of error previously described. This special type of user has been separated from the LDS data user to recognize the known imperfections of text de-identification. The IRB must be willing to accept some chance that not all PHI was removed from the records, and this user category will be able to see the PHI of those patients who were unsuccessfully scrubbed. Examples of common failures are references to kin such as ‘the sister of the major’ or names that are common words such as ‘Wolf’ or ‘Brown’. Technical security is handled in a manner similar to that for the LDS data user, in that users have direct access to the i2b2 patient data. The advantage of the notes-enabled LDS data user is access to the rich data in the scrubbed text and the ability to use NLP on those narratives that are part of the medical record. The disadvantages are that a technology investment must be made to achieve adequate scrubbing of the text, and a human investment must be made to validate that the text scrubber algorithm is adequate.

The final category of user is the ‘PHI-viewable data user’, who has access to the dataset with any contained PHI fully available. These users must clearly be scrutinized very carefully by the governance bodies granting data access. The hospitals and accompanying IRB will typically have many requirements for the researcher in this scenario to ensure trustworthiness. At Partners Healthcare, there are special tests that a researcher must complete to be named on an IRB proposal (known as the ‘Collaborative IRB Training Initiative’), and there are background checks (Massachusetts Criminal Offender Record Information) that are carried out as conditions of employment. PHI data requests can only be made by full faculty members at Harvard or Tufts University. On the i2b2 platform, the PHI is encrypted with the advanced encryption standard,20 and only the PHI-viewable data users for that project are provided the key for the decryption of the data. Because this PHI is rendered unusable, unreadable, or indecipherable to unauthorized individuals by one or more of the methods specified in the HITECH guidance, such information is ‘secured’ PHI, and breaches are not subject to reporting or penalties.13 Narrative text reports are encrypted, but fully available to the authorized user with the decryption key. Of note, data that comply with the LDS definition always remain completely unencrypted in the i2b2 patient database so that users with LDS access and PHI access can work smoothly and consistently in the same database. The advantage of the PHI-viewable data user is that the full power of the EHR data can be realized, including the potential for recruiting targeted patient populations for clinical trials. The disadvantages are that these users are at maximal risk of potential breaches of patient privacy, and all data must be tracked indefinitely should they move out of the enterprise.
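The sketch below illustrates this encrypt-at-rest pattern with Python's cryptography package. AES-GCM is assumed here for concreteness, since the paper specifies AES20 but not a mode of operation, and the helper names and key handling are hypothetical.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_phi(key: bytes, plaintext: str) -> bytes:
    nonce = os.urandom(12)                           # unique nonce per value
    return nonce + AESGCM(key).encrypt(nonce, plaintext.encode(), None)

def decrypt_phi(key: bytes, blob: bytes) -> str:
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, ciphertext, None).decode()

# The project key is held only by the project's PHI-viewable data users.
project_key = AESGCM.generate_key(bit_length=256)
stored = encrypt_phi(project_key, "DOE, JOHN; MRN 1234567")
# An LDS data user sees only the opaque `stored` bytes; a key holder recovers:
assert decrypt_phi(project_key, stored) == "DOE, JOHN; MRN 1234567"
```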

A summary of these results is presented in table 1. Note that the underlying data in the clinical research chart are assumed to remain the same except in the case of aggregated data users; all the other user categories use the same underlying data, just with different restrictions implemented by the i2b2 software platform.

Table 1.

Categories of privacy levels at which patient data may be viewed by i2b2 users

Privacy level of i2b2 user | Trustworthiness required of data recipient | Technical security | Data available to user | Underlying data in i2b2 project
Obfuscated data user | Low—users can only see approximate patient counts | Low—only the client-side application is exposed to users; must be able to recognize individual users | Access to data through client-side application only | All patient data exist in database (with PHI encrypted)
Aggregated data user | Low—users only see HIPAA de-identified data | Low—but data manager assumes burden of de-identifying data | HIPAA de-identified data | HIPAA de-identified data only exist in database
LDS data user | Medium—users can see LDS as defined by HIPAA; some risk of re-identification present | Medium—requires user-facing direct access to the database | HIPAA LDS | All patient data exist in database (with PHI encrypted)
Notes-enabled LDS data user | Medium—users see LDS and narrative text that is mostly de-identified | Medium—requires user-facing direct access to the database | HIPAA LDS and de-identified narrative text | All patient data exist in database as well as scrubbed narrative text (with PHI encrypted)
PHI-viewable data user | High—users can see all protected health information on patients | High—requires management of encryption keys | All patient data may be accessed | All patient data exist in database (with PHI encrypted)

HIPAA, Health Insurance Portability and Accountability Act; LDS, limited dataset; PHI, protected health information.

Discussion

Approaches to patient privacy can be considered as a balance of technical restrictions and trust in people. The i2b2 platform implements five categories of users (table 1), based upon the legal structures of HIPAA and the HHS common rule, each with a guiding use case from real-world experience, and each put into practice so that the i2b2 software remains consistent despite the complexities of the patient data. In doing so, the enterprise is able to comply with HIPAA statutes and with the provisions of the IRB. However, this compliance is strongly founded in personal trust, as the IRB must often be willing to grant a specific waiver of consent to investigators so that they may view PHI, possibly as the notes-enabled LDS data user and certainly as the PHI-viewable data user. This determination can be clearly made by the IRB through understanding the five privacy levels in the i2b2 platform, levels that are themselves a synthesis of previous IRB decisions.

Patient data in i2b2 are organized into projects. A common way to implement i2b2 is for a hospital/entity to have an enterprise-wide dataset that contains all the patient data from the hospital/entity, and to make subsets of the patient data that are copied into physically separate, smaller project datasets.14 The existence of physically separate databases helps guard against software errors that might mistakenly expose rows of data from the wrong set of patients, and provides a way to regulate access to particular rows of data when researchers have direct access to the database (although column access can be restricted with standard database authorization software). The enterprise dataset usually has a large number of obfuscated data users who can use a query tool to find patient cohorts for specific studies. If a sufficient number of patients are found, an IRB protocol is written by the investigator and, when approved, the data required for the study are copied into a separate project dataset by administrators who are members of the PHI-viewable category. The subprojects have specific privacy levels for users depending on their roles in the project. In a clinical trial, most users would be expected to be in the PHI-viewable category. In a data mining project, most users would be expected to be in the LDS data category. If the data mining project included NLP, most users would probably be in the notes-enabled LDS category, with a few PHI-viewable users available to validate the NLP process. Of course, when text scrubbing software is not available, most of the users in the project would need to be in the PHI-viewable category. This flexibility in assigning users to privacy categories allows an enterprise to comply with the ‘minimum necessary’ data requirements of the HIPAA statute, and with the IRB mission to offer the lowest-risk research environment for the patients and the hospital. Table 2 describes some sample projects and their possible privacy level user distributions.

Table 2.

Expected user distributions in i2b2 privacy categories

Example projects hosted in i2b2
Example project #1—enterprise cohort discovery
Example number of patients = 3 million
Purpose = find the number of patients that match specific characteristics to determine whether a study is feasible
Total number of users = many
Proportions of users
 Obfuscated data users 99%
 LDS data users None
 Notes-enabled LDS data users None
 PHI-viewable data users 1% (administrators/data quality review)
Example project #2—clinical trial (possibly derived from #1)
Example number of patients = 7000
Purpose = contact patients with specific characteristics to enroll in a clinical trial
Total number of users = highly restricted
Proportions of users
 Obfuscated data users None
 LDS data users 20% (statisticians)
 Notes-enabled LDS data users None
 PHI-viewable data users 80% (study coordinators/investigators)
Example project #3—patient safety surveillance (possibly derived from #1)
Example number of patients = 2.5 million
Purpose = determine whether a disproportionate number of adverse events occur with specific medications
Total number of users = somewhat restricted
Proportions of users
 Obfuscated data users None
 LDS data users 80% (statisticians and investigators)
 Notes-enabled LDS data users 10% (natural language coders)
 PHI-viewable data users 10% (quality review and data administrators)

Project database #1 (for high-level cohort discovery) contains a complete set of patient data, and the cohort patient subsets found in project #1 are copied to distinct project databases #2 and #3, which contain data relevant to their own project scope. Typically one would expect the enterprise data project to have mostly obfuscated data users, whereas the clinical trial database would have mostly PHI-viewable data users and the patient safety surveillance (data mining) database mostly LDS data users.

The i2b2 architecture is designed from the ground up to provide consistency in implementing the user privacy categories. For example, the i2b2 platform returns detailed patient data through a software data object known as the ‘patient data object’. No matter what part of the system requests this data object, only the LDS data users of a project (and those with higher privileges, such as notes-enabled LDS data users and PHI-viewable data users) are able to obtain it. Another example is the strategy of i2b2 to use physical, rather than virtual, separation of data into project pools. This provides a way for each project to constrain its users to the patient and data subsets approved for use by the project. Finally, it should be noted that one of the merits of open source software platforms such as i2b2 is that they are subjected to constant scrutiny from the open community of software developers, so software architecture flaws are quickly highlighted.
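A minimal sketch of such a privilege gate is shown below. The ordering of levels and the project API are assumptions made for illustration; in particular, the aggregated data user actually works against a physically separate de-identified database, as noted below.

```python
from enum import IntEnum

class PrivacyLevel(IntEnum):
    """The five user categories, ordered by privilege within a project."""
    OBFUSCATED = 1      # blurred aggregate counts only
    AGGREGATED = 2      # exact counts, served from a separate de-identified database
    LDS = 3             # line item limited dataset
    NOTES_LDS = 4       # LDS plus scrubbed narrative text
    PHI_VIEWABLE = 5    # holds the project's decryption key

def get_patient_data_object(user_level: PrivacyLevel, project):
    """Serve line item patient data only to LDS-level users and above,
    mirroring the gate on the 'patient data object' described above."""
    if user_level < PrivacyLevel.LDS:
        raise PermissionError("this privacy level receives aggregate counts only")
    return project.fetch_patient_rows()   # hypothetical project API
```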

Implementing methods for maintaining patient privacy is notably complex, but in i2b2 the methods boil down to two: the i2b2 project is created and managed so that only the patient data necessary for the project physically reside in the project's clinical research database, and the five categories of user privacy levels manage the ability of i2b2 project participants to view the data at four privacy levels. (Four rather than five are specified because the aggregated data user requires the data to be built into a special database containing only de-identified data, and so is a separate implementation.)

What is not addressed in the i2b2 core implementation is the complex organizational policies that are part of every hospital's mode of operation. For example, in Boston, the Harvard University-affiliated hospitals are intensely competitive with each other. Therefore, when a distributed query system was developed across the hospitals, special restrictions enforced by a ‘data steward’ needed to be put into place to make sure hospital supporters were not asking the types of questions that could be used to compete with one another.21 It would be impossible to foresee all of the variations needed in i2b2 implementations across the country, but this is why the open source nature of i2b2 is so valuable: i2b2 provides a foundation for managing patient privacy, but is adaptable to local policy when needed through local industriousness and creativity.

Acknowledgments

The contributions of Rajesh Kuttan, Lori Phillips, Wensong Pan and Janice Donahue have been invaluable for the construction and distribution of this software. We also thank Diane Keogh for her support.

Footnotes

Funding: This work was funded by the National Institutes of Health through the NIH Roadmap for Medical Research, grant U54LM008748.

Competing interests: None.

Provenance and peer review: Not commissioned; externally peer reviewed.

i

The following identifiers of the individual or of relatives, employers, or household members of the individual must be removed: (1) names; (2) all geographical subdivisions smaller than a state, except for the initial three digits of the zip code if the geographical unit formed by combining all zip codes with the same three initial digits contains more than 20 000 people; (3) all elements of dates except year, and all ages over 89 years or elements indicative of such age; (4) telephone numbers; (5) fax numbers; (6) email addresses; (7) social security numbers; (8) medical record numbers; (9) health plan beneficiary numbers; (10) account numbers; (11) certificate or licence numbers; (12) vehicle identifiers and licence plate numbers; (13) device identifiers and serial numbers; (14) URL; (15) IP addresses; (16) biometric identifiers; (17) full-face photographs and any comparable images; (18) any other unique, identifying characteristic or code, except as permitted for re-identification in the privacy rule.

References

1. Willison DJ, Schwartz L, Abelson J, et al. Alternatives to project-specific consent for access to personal information for health research: what is the opinion of the Canadian public? J Am Med Inform Assoc 2007;14:706–12.
2. Damschroder LJ, Pritts JL, Neblo MA, et al. Patients, privacy and trust: patients' willingness to allow researchers to access their medical records. Soc Sci Med 2007;64:223–35.
3. Sweeney L. k-Anonymity: a model for protecting privacy. Int J Uncertain Fuzziness Knowl-based Syst 2002;10:557–70.
4. Fischetti M, Salazar JJ. Models and algorithms for the 2-dimensional cell suppression problem in statistical disclosure control. Math Program 1999;84:283–312.
5. Dreiseitl S, Vinterbo S, Ohno-Machado L. Disambiguation data: extracting information from anonymized sources. JAMIA 2001;Suppl 8:144–8.
6. Ohno-Machado L, Vinterbo SA, Dreiseitl S. Effects of data anonymization by cell suppression on descriptive statistics and predictive modeling performance. JAMIA 2001;Suppl 8:503–7.
7. Uzuner O, Luo Y, Szolovits P. Evaluating the state-of-the-art in automatic de-identification. J Am Med Inform Assoc 2007;14:550–63.
8. Friedlin FJ, McDonald CJ. A software tool for removing patient identifying information from clinical documents. J Am Med Inform Assoc 2008;15:601–10.
9. Meystre SM, Friedlin FJ, South BR, et al. Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC Med Res Methodol 2010;10:70.
10. Berman JJ. Concept-match medical data scrubbing. How pathology text can be used in research. Arch Pathol Lab Med 2003;127:680–6.
11. Yakowitz J, Barth-Jones D. The Illusory Privacy Problem in Sorrell v. IMS Health. 2011. http://www.techpolicyinstitute.org/files/the illusory privacy problem sorrell1.pdf (accessed 22 May 2011).
12. McCarty CA, Wilke RA. Biobanking and pharmacogenomics. Pharmacogenomics 2010;11:637–41.
13. US Department of Health and Human Services. Health Information Technology for Economic and Clinical Health (HITECH) Act, Title XIII of Division A and Title IV of Division B of the American Recovery and Reinvestment Act of 2009 (ARRA) (Pub. L. 111–5). Washington, DC: US Department of Health & Human Services, 2009. http://www.HHS.gov
14. Murphy SN, Weber G, Mendis M, et al. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J Am Med Inform Assoc 2010;17:124–30.
15. US Department of Health and Human Services. Standards for Privacy of Individually Identifiable Health Information; Final Rule: 45 CFR Parts 160 and 164. Washington, DC: US Department of Health & Human Services, 2002. http://www.HHS.gov
16. US Department of Health and Human Services. 45 CFR (Code of Federal Regulations), 46. Protection of Human Subjects (Common Rule). Federal Register, vol. 56, 18 June 1991. Washington, DC: US Department of Health & Human Services, 1991:28003. http://www.HHS.gov
17. Murphy SN, Gainer V, Chueh HC. A visual interface designed for novice users to find research patient cohorts in a large biomedical database. JAMIA 2003;Suppl 10:489–93.
18. Murphy SN, Chueh HC. A security architecture for query tools used to access large biomedical databases. JAMIA 2002;Suppl 9:552–6.
19. Murphy S, Churchill S, Bry L, et al. Instrumenting the health care enterprise for discovery research in the genomic era. Genome Res 2009;19:1675–81.
20. Daemen J, Govaerts R. The block cipher Rijndael. In: Quisquater J, Schneier B, eds. Smart Card Research and Applications. LNCS 1820. Berlin Heidelberg: Springer-Verlag, 2000:277–84.
21. Weber GM, Murphy SN, McMurry AJ, et al. The Shared Health Research Information Network (SHRINE): a prototype federated query tool for clinical data repositories. J Am Med Inform Assoc 2009;16:624–30.
