Author manuscript; available in PMC: 2018 Feb 1.
Published in final edited form as: J Biomed Inform. 2016 Dec 19;66:42–51. doi: 10.1016/j.jbi.2016.12.008

Towards a Privacy Preserving Cohort Discovery Framework for Clinical Research Networks

Jiawei Yuan a, Bradley Malin b,c, François Modave d, Yi Guo d, William R Hogan d, Elizabeth Shenkman d, Jiang Bian d,*
PMCID: PMC5316314  NIHMSID: NIHMS840169  PMID: 28007583

Abstract

Background

The last few years have witnessed an increasing number of clinical research networks (CRNs) focused on building large collections of data from electronic health records (EHRs), claims, and patient-reported outcomes (PROs). Many of these CRNs provide a service for the discovery of research cohorts with various health conditions, which is especially useful for rare diseases.

Supporting patient privacy can enhance the scalability and efficiency of such processes; however, current practice mainly relies on policy, such as guidelines defined in the Health Insurance Portability and Accountability Act (HIPAA), which are insufficient for CRNs (e.g., HIPAA does not require encryption of data – which can mitigate insider threats). By combining policy with privacy enhancing technologies we can enhance the trustworthiness of CRNs. The goal of this research is to determine if searchable encryption can instill privacy in CRNs without sacrificing their usability.

Methods

We developed a technique, implemented in working software, to enable privacy-preserving cohort discovery (PPCD) services in large distributed CRNs based on elliptic curve cryptography (ECC). This technique also incorporates a block indexing strategy to improve the performance (in terms of computational running time) of PPCD. We evaluated the PPCD service with three real cohort definitions of varied query complexity: 1) elderly cervical cancer patients who underwent radical hysterectomy, 2) oropharyngeal and tongue cancer patients who underwent robotic transoral surgery, and 3) female breast cancer patients who underwent mastectomy. These definitions were tested in an encrypted database of 7.1 million records derived from the publicly available Healthcare Cost and Utilization Project (HCUP) Nationwide Inpatient Sample (NIS). We assessed the performance of the PPCD service in terms of 1) accuracy in cohort discovery, 2) computational running time, and 3) privacy afforded to the underlying records during PPCD.

Results

The empirical results indicate that the proposed PPCD can execute cohort discovery queries in a reasonable amount of time, with query runtime in the range of 165 to 262 seconds for the 3 use cases, with zero compromise in accuracy. We further show that the search performance is practical because it supports a highly parallelized design for secure evaluation over encrypted records. Additionally, our security analysis shows that the proposed construction is resilient to standard adversaries.

Conclusions

PPCD services can be designed for clinical research networks. The security construction presented in this work specifically achieves high privacy guarantees by preventing threats originating both from within and from beyond the network.

Keywords: privacy-preserving cohort discovery, clinical research network (CRN), data privacy, searchable encryption, Patient-Centered Clinical Research Network (PCORnet), OneFlorida Clinical Data Research Network (CDRN)


1. Introduction

Clinical research networks (CRNs) are receiving an increasing amount of attention due, in part, to their ability to offer a collaborative environment for researchers across disparate organizations.31 Moreover, CRNs are designed to leverage various types of data collected by both the healthcare systems (e.g., electronic health records, or EHRs, and claims) and directly from patients themselves (e.g., patient-reported outcomes, or PROs). It is anticipated that the analysis of such data will lead to advances in medical knowledge, progress in healthcare delivery, and improvements in population health. For example, the national Patient-Centered Clinical Research Network (PCORnet)37, funded by the Patient-Centered Outcomes Research Institute (PCORI)30,36, is an expansive network of networks of organizations who are partnered to conduct research. These PCORnet sites collect data from multiple sources and make them available for research. In particular, they provide an invaluable cohort discovery service that proves particularly useful for identifying cohorts of a variety of health conditions, and especially for rare diseases.

However, there is a potential for significant data privacy threats to be realized in CRNs. Data are shared by multiple participating health care organizations (HCOs), across different technical infrastructures, and thus are more susceptible to breaches. Existing CRNs have invested substantial effort towards protecting the patients' privacy as well as other sensitive information (e.g., organizations' billing information) involved in their networks. However, existing efforts do not sufficiently address all important adversary models. In particular, we think a CRN should be protected against not only outside attackers but also malicious insiders (e.g., employees of the participating HCOs). Further, the current practice of privacy protection in health care often relies heavily on policies and guidelines such as the de-identification process defined by the Privacy Rule of the Health Insurance Portability and Accountability Act of 1996 (HIPAA), which are inadequate to cover all scenarios and use cases in these emerging research networks. For example, the HIPAA Security Rule only considers the use of encryption for data at rest and in transit, but not for data in use, even though protecting data in use has the potential to significantly reduce the risk of insider attacks. The Health Information Technology for Economic and Clinical Health Act (HITECH) of 2009 [29] imposed additional security measures, such as the data breach notification requirements for unsecured Protected Health Information (PHI), but such measures are also insufficient in terms of privacy protection through technology.

Furthermore, from a technological perspective, the majority of efforts dedicated to protecting health data in the wild (beyond automating the de-identification process) are focused on authentication, authorization, and data encryption in transit and at rest. Health IT professionals and health practitioners often assume the data are sufficiently secure when they live in a data center that meets compliance responsibilities (e.g., the HIPAA Security Rule [22] and FISMA [32]). Many of these compliance requirements are based on security standards and cybersecurity frameworks established by the National Institute of Standards and Technology (NIST), such as NIST 800-53, Security and Privacy Controls for Federal Information Systems and Organizations [33]. The security controls in the NIST frameworks are important to deploy, but they are insufficient to ensure all possible security and privacy guarantees, especially in the CRN environment. Consider several pressing concerns:

  • How can we protect health data against insider attacks due to malicious system administrators or negligence?

  • How can we resolve trust issues where data contributors want to have control over their own datasets?

  • What security controls should be instituted to limit the damage of large-scale data breaches, such as the recent Anthem breach where the hackers gained access to data on over 80 million health care consumers?19

Due in part to such privacy and trust concerns, CRNs often restrict data sharing by providing access to de-identified data only. These practices limit the utility of the data, especially for cohort discovery queries. For example, a resistant hypertension phenotype specification is dependent on the dates of patients' medication refills.25 However, while there is a potential for retaining dates under the Expert Determination implementation of HIPAA de-identification, it is explicitly forbidden as one of eighteen types of identifiers under the Safe Harbor implementation, which is often invoked in practice.

1.1. Background

To address the aforementioned data privacy and trust issues in CRNs, one solution is to permit participating organizations to share data in an encrypted format and, at a later point in time, directly perform processing tasks for cohort discovery without decrypting patient-specific records. In doing so, the participating organizations can reap the benefits of a CRN while ensuring that both patient, as well as proprietary business, data are obscured from other organizations in the CRN. To achieve such a solution, privacy-preserving techniques including symmetric searchable encryption (SSE)13 and public searchable encryption (PSE)1 schemes are promising candidates. Both SSE and PSE aim to query data directly over encrypted data without decryption.

In particular, SSE enables data owners to outsource their data to an untrusted server (e.g., a public cloud) in an encrypted format and, subsequently, search the server using predefined keywords without decryption [34,35]. Since SSE schemes require that the data be preprocessed and encrypted under the same key, these methods are only appropriate for scenarios in which either 1) there is a single data source or 2) a centralized, fully trusted entity collects and encrypts all data from the different data sources. However, the former scenario contradicts the use cases of a CRN, which involves multiple data sources (organizations) contributing data to form a collaboration. The latter is also unrealistic, because different organizations in a CRN do not necessarily trust each other (or each other's ability to maintain data securely), such that they may only want to share their patients' sensitive data in a privacy-preserving manner. Thus, SSE techniques cannot be directly applied in a CRN.

To overcome the aforementioned limitations of SSE, the notion of PSE1 was introduced. PSE methods enable multiple data sources to encrypt their data with a shared public key, and send their encrypted data to a third party, where the encrypted data can be searched by the organization holding the private key at a later point in time. However, these PSE methods are vulnerable with regard to both inside6 and outside7 keyword guessing attacks, because their encryptions for search requests are deterministic (i.e., encryption results of the same request are always the same). Thus, existing PSE methods should not be directly applied in a CRN.
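
The keyword guessing weakness described above can be illustrated with a deliberately simplified sketch. It is not a real PSE scheme: `deterministic_token` is a hypothetical stand-in (a plain SHA-256 hash) for any search-request encryption that always maps the same keyword to the same output and that an attacker can recompute, and the candidate dictionary is an assumption chosen for illustration.

```python
import hashlib
from typing import Iterable, Optional


def deterministic_token(keyword: str) -> str:
    """Hypothetical stand-in for a deterministic search-request encryption.

    The same keyword always yields the same token, which is the property
    the attack below relies on. This is NOT an actual PSE construction.
    """
    return hashlib.sha256(keyword.encode("utf-8")).hexdigest()


def guess_keyword(observed_token: str, dictionary: Iterable[str]) -> Optional[str]:
    """Enumerate a small keyword dictionary until one reproduces the token."""
    for candidate in dictionary:
        if deterministic_token(candidate) == observed_token:
            return candidate
    return None


# An observer who sees a token for "180.1" can recover it from a code list.
icd9_candidates = ["174.0", "180.0", "180.1", "233.0"]
recovered = guess_keyword(deterministic_token("180.1"), icd9_candidates)
```

Because clinical vocabularies such as ICD-9 are small and public, this kind of enumeration is cheap, which is why randomizing the encryption of each request matters.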

1.2. Contribution

In this paper, we introduce a novel privacy-preserving cohort discovery (PPCD) technique for CRNs. Specifically, the contributions of this work are:

  • The technique supports flexible privacy-preserving frequency counts over encrypted patient data shared by multiple health data sources [i].

  • By designing our algorithmic construction with an underlying elliptic curve cryptography (ECC) system28, we provide strong data privacy guarantees for our construction.

  • Our approach also prevents the disclosure of each data source's privacy to any other data source in the network. Notably, our construction allows different data sources to encrypt their data with random secrets known only to themselves.

  • The search performance of our construction is reasonable for practical use because of our highly parallelized design of secure evaluation over encrypted data records with simple time-efficient computations. Our prototype implementation incorporates big data processing techniques (i.e., MapReduce) to enhance the performance.

The remainder of the paper is organized as follows. We first describe the system model, privacy requirements and assumptions of a clinical research network. We then introduce the proposed security construction for implementing PPCD services. We also present real-world use cases based on published comparative effectiveness research studies that rely upon the Healthcare Cost and Utilization Project (HCUP) Nationwide Inpatient Sample (NIS) data set. Further, we discuss how we invoke block indexing to improve search performance. In Section 3, we detail our evaluation and analysis results of the proposed security construction in terms of both performance and security. In particular, we evaluate the performance of the proposed PPCD framework with a large publicly available database of inpatient claims records [20]. In Section 4, we further discuss how to improve search performance of the proposed PPCD framework with big data processing techniques, and consider a hybrid privacy protection policy with varied levels of privacy requirements. Finally, we discuss the limitations and practical implementation considerations, especially how the proposed PPCD framework can accommodate different clinical research network architectures.

2. Material and Methods

2.1. The system model of a clinical research network

Our system model of a CRN is based on our experience in building the OneFlorida Data Trust - a secure centralized data repository hosted by the University of Florida (UF), which integrates EHR, claims, and PRO data from contributing organizations in the OneFlorida Clinical Research Consortium (CRC). The OneFlorida CRC is one of the 13 Clinical Data Research Networks (CDRNs) funded by PCORI that contributes to the national PCORnet.26

PCORI follows a two-stage process to develop PCORnet, which consists of a series of CDRNs and Patient-Powered Research Networks (PPRNs), as well as a Coordinating Center. In this paper, we focus on regional research networks like OneFlorida rather than the overall PCORnet. Specifically, we focus on the network architecture of the OneFlorida Data Trust, which has a centralized network hub and a centralized data repository. We acknowledge that other research networks may use different network architectures. For instance, some are fully distributed data networks such as SAFTInet21, where there is not a centralized network hub. However, as we review in the Discussion, the principles and methods described in this paper can be adapted to fit these alternative network designs.

In this paper, we assume that a CRN consists of a network data hub (e.g., UF in the OneFlorida CRC) with multiple organizations contributing data to the hub. Figure 1 illustrates an overview of such a CRN. This model is consistent with the OneFlorida Data Trust and many other CDRNs in PCORnet. Each organization is considered to be a data source that periodically shares its data to the network in an encrypted format. To achieve strong privacy protection, each data source in our construction encrypts using random secrets known only to the data source. The hub collects the encrypted datasets and offers a PPCD service over the aggregated encrypted data. Investigators submit cohort discovery queries to the hub who, subsequently, performs the request in a privacy-preserving manner (i.e., without decrypting the aggregated encrypted datasets).

Figure 1. Overview of the proposed privacy-preserving cohort discovery framework based on the OneFlorida clinical data research network architecture.

2.2. Privacy requirements and assumptions

We consider all the data shared by an organization to be private with respect to its source organization. Searching over the shared encrypted data collection is permitted; however, only aggregated counts can be reported [ii]. The actual demographic and clinical values (e.g., age, gender, diagnosis code, or lab results) in a patient record cannot be disclosed to any other organization in the network. We assume outside adversaries can compromise some of the organizations in a CRN by breaking their network security framework or compromising their employees.

Adversaries are expected to utilize these compromised organizations to attempt to reveal the private information contributed by other uncompromised organizations [iii]. To achieve such a strong level of privacy, we require the hub to neither store nor log any cohort discovery queries or results in plaintext format. In other words, neither shared data nor cohort discovery results appear in the clear.

We further assume that the hub is honest-but-curious, a standard assumption in secure computation environments. It honestly follows the security protocols, but it may attempt to learn the content of the encrypted data (thus violating the intent of the encrypting organization). For example, the hub may perform brute force attacks to learn other organizations' private data (i.e., submit all possible cohort discovery requests over the shared data). Nevertheless, such brute force attacks 1) are computationally expensive and 2) can be detected with other security controls such as proper auditing procedures.

2.3. A security construction for privacy-preserving cohort discovery

Our construction consists of three main phases: a) Framework Setup (Figure 2: Setup Algorithm), b) Data Encryption (Figure 3: Encryption Algorithm), and c) Privacy-Preserving Cohort Discovery (Figure 4: PPCD Algorithm).

Figure 2. Setup Algorithm. PPCD framework setup and key generation.

Figure 3. Encryption Algorithm. Each data source will encrypt its data with a combination of the public key of the data hub and a secret key only known to the source organization itself.

Figure 4. PPCD Algorithm. Privacy-preserving cohort discovery query generation and polynomial evaluation.

To set up the framework in a CRN, the data hub first executes the Setup algorithm to generate a public/private key pair as shown in Figure 2, following the BGN cryptosystem [8]. When a data source shares its patient records, it will encrypt all of its data by running the Encryption algorithm (Figure 3) with a combination of the hub's public key and a random secret key only known to the data source itself. In doing so, the hub is able to partially decrypt the data source's encrypted records for cohort discovery purposes only, but cannot remove all of the randomness introduced by the encryption to learn any private information. Later, when an investigator submits a cohort discovery query, the hub can process the request in a privacy-preserving manner by invoking the PPCD algorithm. As shown in Figure 4, the hub can directly evaluate whether or not an encrypted record satisfies the query conditions without decrypting either the encrypted data records or the encrypted cohort discovery query requests.

2.4. Use cases of privacy-preserving cohort discovery

The OneFlorida Data Trust did not contain a sufficient amount of data at the time of this investigation to perform scalability experiments. Thus, and to promote transparency and reproducibility, we evaluated our framework with a publicly accessible dataset, the Healthcare Cost and Utilization Project (HCUP) Nationwide Inpatient Sample (NIS) [20].

In preparation for the system design and subsequent evaluation, we reviewed the literature and formalized several cohort discovery use cases of the HCUP NIS datasets. In particular, three cohorts were identified: 1) elderly cervical cancer patients who underwent radical hysterectomy [16], 2) oropharyngeal and tongue cancer patients who underwent robotic transoral surgery [18], and 3) female breast cancer patients who underwent mastectomy [17]. The details of the cohort definitions can be found in Table 1.

Table 1. Cohort discovery use cases for performance evaluation of the proposed security construction.

| Use Case | Publication | Phenotype Summary | Cohort Description |
|---|---|---|---|
| 1 | George et al. [16] | Elderly cervical cancer patients who underwent radical hysterectomy | Female cohorts with a diagnosis of invasive cervical cancer (ICD-9 180.x) who underwent radical hysterectomy (ICD-9 68.6, 68.61, or 68.69). Patients are further stratified by age into the following groups: <50 years, 50–59 years, 60–69 years, and ≥70 years. |
| 2 | Chung et al. [18] | Oropharyngeal and tongue cancer patients who underwent robot-assisted transoral surgery | Patients were identified by ICD-9-CM procedure codes for partial pharyngectomy (29.33) and partial glossectomy (25.1, 25.2) and were restricted to patients with a diagnosis code specifying malignancy of the oropharynx (146.0, 146.1, 146.2, 146.3, 146.4, 146.5, 146.6, 146.7, 146.8, or 146.9), base of tongue (141.0), or anterior tongue (141.1, 141.2, 141.3, 141.4, 141.5, 141.8, and 141.9), respectively. |
| 3 | Habermann et al. [17] | Female breast cancer patients who underwent mastectomy | All female inpatients undergoing mastectomy for invasive or in situ breast cancer (ICD-9 diagnosis codes 174, 174.0, 174.1, 174.2, 174.3, 174.4, 174.5, 174.6, 174.8, 174.9, 233, or 233.0). These patients are further partitioned into two groups: 1) undergoing unilateral mastectomy (UM) (ICD-9 procedure codes 85.33, 85.34, 85.41, 85.43, 85.45, or 85.47); and 2) undergoing bilateral mastectomy (BM) (ICD-9 procedure codes 85.35, 85.36, 85.42, 85.44, 85.46, or 85.48). |

To orient the reader, let us begin with a sample cohort discovery query derived from a use case developed by George et al. [16] using the HCUP NIS data. The goal of this cohort discovery query is to identify females with a diagnosis of invasive cervical cancer (ICD-9: 180.x) who underwent a radical hysterectomy (ICD-9: 68.6, 68.61, or 68.69). Patients are further stratified by age into the following groups: <50 years, 50–59 years, 60–69 years, and ≥70 years. When an investigator submits the request, the hub will generate the following SQL-like request:

  • SELECT COUNT DISTINCT Individuals

  • WHERE Sex = ‘F’

  • AND Diagnosis Code in (180.0, 180.1, 180.8, 180.9)

  • AND Procedure Code in (68.6, 68.61, 68.69)

  • AND Age ≥70 (*or 60 ≤Age <70, 50 ≤Age <60, and Age < 50, for other age groups)

To execute this query, the hub first runs the PPCD request generation process in the PPCD algorithm as shown in Figure 4. Specifically, for each search condition, the hub generates encrypted search values (e.g., ‘F’, 180.0, 180.1, 180.8, 180.9, 68.6, 68.61, 68.69, 50∼59, 60∼69, 70∼79, 80∼89, and ≥90). Note that for range conditions such as “Age ≥70”, we derive auxiliary range information (i.e., “70∼79, 80∼89, and ≥90”) to facilitate range searches rather than checking every single age value of 70 or above. In general, we preprocess the data to generate and store auxiliary range information for fields that are frequently involved in range searches. For example, a new field “age-range” with value “70∼79” will be added for a patient record with age = 74. It is worth noting that all auxiliary fields in our construction are also encrypted and can be queried in a privacy-preserving manner just as regular fields.
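
The auxiliary range idea can be sketched in plaintext as follows. This is only an illustration under stated assumptions: the function names `age_range` and `expand_min_age` are hypothetical, the decade buckets below age 50 are assumed (the text only lists buckets from 50 upward), and no encryption is applied here, whereas in the construction the resulting labels would be encrypted like any other field.

```python
from typing import List


def age_range(age: int) -> str:
    """Map an exact age to a decade bucket label such as '70~79'.

    Ages of 90 and above share the single bucket '>=90'.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    if age >= 90:
        return ">=90"
    low = (age // 10) * 10
    return f"{low}~{low + 9}"


def expand_min_age(min_age: int) -> List[str]:
    """Expand the condition 'Age >= min_age' into the bucket labels to search.

    Assumes min_age falls on a bucket boundary (a multiple of 10 up to 90).
    """
    if min_age % 10 != 0 or not 0 <= min_age <= 90:
        raise ValueError("min_age must be a multiple of 10 between 0 and 90")
    labels = [f"{low}~{low + 9}" for low in range(min_age, 90, 10)]
    labels.append(">=90")
    return labels


# A record with age 74 gets the auxiliary field value '70~79', and the
# query 'Age >= 70' is rewritten as an equality search over three labels.
record_label = age_range(74)
query_labels = expand_min_age(70)
```

Rewriting a range as a short list of equality tests keeps the number of encrypted comparisons per record small, at the cost of fixing the bucket boundaries in advance.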

To improve search performance over categorical fields, like diagnosis codes and procedure codes, with a large number of permissible values (i.e., 18,000 ICD-9 codes and 68,000 ICD-10 codes), the hub also extracts blocking keys (e.g., 18** for 180.0, 180.1, 180.8, and 180.9; and 6*** for 68.6, 68.61, and 68.69) and generates their corresponding hash values (detailed in Section 2.5 below). When the hub receives an encrypted query request, it first identifies potential candidates by matching blocking keys. Then, the hub evaluates each candidate record using the privacy-preserving polynomial evaluation (PPPE) [iv] process as described in the PPCD algorithm (Figure 4). Specifically, the PPCD algorithm checks whether an encrypted record contains the target value by evaluating, in a privacy-preserving manner, whether or not the value is a root of the record's corresponding polynomial.
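
The root test at the heart of PPPE can be illustrated on plaintext values. The sketch below is not the encrypted protocol of Figure 4: it only shows why "the target is a root of the record's polynomial" is equivalent to "the record contains the target". The prime modulus, the byte-based `encode` function, and all names are assumptions introduced for illustration.

```python
from typing import List, Sequence

PRIME = 2**61 - 1  # a Mersenne prime; large enough for short code strings


def encode(value: str) -> int:
    """Encode a short ASCII string as an integer below PRIME (assumed encoding)."""
    number = int.from_bytes(value.encode("ascii"), "big")
    if number >= PRIME:
        raise ValueError("value too long for this toy encoding")
    return number


def build_polynomial(values: Sequence[str]) -> List[int]:
    """Return coefficients (lowest degree first) of prod (x - v) mod PRIME."""
    coeffs = [1]
    for value in values:
        root = encode(value)
        nxt = [0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            nxt[i] = (nxt[i] - root * c) % PRIME      # contribution of -root * c * x^i
            nxt[i + 1] = (nxt[i + 1] + c) % PRIME     # contribution of c * x^(i+1)
        coeffs = nxt
    return coeffs


def contains(coeffs: Sequence[int], target: str) -> bool:
    """Evaluate the polynomial at the target (Horner's rule) and test for zero."""
    x = encode(target)
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % PRIME
    return acc == 0


# Plaintext polynomial for one record's diagnosis codes.
record_poly = build_polynomial(["180.1", "401.9"])
has_cervical_code = contains(record_poly, "180.1")
```

In the actual construction the coefficients are stored encrypted and the evaluation is carried out over ciphertexts, so the hub learns only whether the result is zero.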

For the AND logic relationship in a query request, the evaluation of an encrypted record halts once a field involved in the query fails the equivalence test. Thus, our implementation initiates the evaluation with fields that are less likely to be satisfied (i.e., fields that will yield a smaller number of candidates). For example, the Procedure Code is checked before the Sex field because fewer patients receive a specific set of procedures (e.g., a radical hysterectomy) than there are female patients in the dataset. The number of potential candidates for each field value can either i) be computed by the hub when data records are deposited or ii) be submitted as auxiliary information by each data source and then aggregated by the hub.
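
A minimal plaintext sketch of this ordering heuristic follows. The candidate counts are hypothetical numbers chosen for illustration, the equality check stands in for the encrypted equivalence test, and the optional `checked` list exists only to make the short-circuit visible.

```python
from typing import Dict, List, Optional, Set, Tuple

Condition = Tuple[str, Set[str]]  # (field name, permitted values)


def order_by_selectivity(conditions: List[Condition],
                         candidate_counts: Dict[str, int]) -> List[Condition]:
    """Sort conditions so the field with the fewest expected candidates comes first."""
    return sorted(conditions, key=lambda cond: candidate_counts[cond[0]])


def matches(record: Dict[str, str],
            conditions: List[Condition],
            candidate_counts: Dict[str, int],
            checked: Optional[List[str]] = None) -> bool:
    """AND-evaluate the conditions, stopping at the first field that fails."""
    for field, allowed in order_by_selectivity(conditions, candidate_counts):
        if checked is not None:
            checked.append(field)  # record which fields were actually tested
        if record.get(field) not in allowed:
            return False
    return True


# Hypothetical counts: far fewer records match the procedure codes than 'F'.
counts = {"Sex": 4_000_000, "Procedure": 8_000}
query = [("Sex", {"F"}), ("Procedure", {"68.6", "68.61", "68.69"})]
tested: List[str] = []
result = matches({"Sex": "F", "Procedure": "68.4"}, query, counts, tested)
# Only the Procedure field is tested before the record is rejected.
```

Since each skipped field saves one expensive encrypted evaluation, putting the most selective condition first reduces the average work per rejected record.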

The latter approach is preferable as it requires less computation to generate summary statistics on plaintext patient records prior to the data submission. In addition, it is worth noting that the polynomial evaluation of each encrypted record is independent from the other records, and therefore, is a parallelizable workload. Thus, our implementation efficiently distributes the evaluation processes to multiple CPU cores and multiple machines. We discuss the parallelization with big data processing frameworks in Section 4.1.
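
Because each record is evaluated independently, the workload maps naturally onto a worker pool. The sketch below shows only the structure, under several assumptions: `evaluate_record` is a hypothetical plaintext stand-in for the per-record secure evaluation, and a Python thread pool is used for simplicity, whereas the prototype described in this paper is written in Java and spreads work across CPU cores and machines.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Set


def evaluate_record(record: Dict[str, str], target_codes: Set[str]) -> bool:
    """Stand-in for the per-record secure evaluation (plaintext check here)."""
    return record["procedure"] in target_codes


def count_matches(records: List[Dict[str, str]],
                  target_codes: Set[str],
                  workers: int = 4) -> int:
    """Evaluate every candidate independently and sum the positive results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda rec: evaluate_record(rec, target_codes), records)
        return sum(results)


candidates = [
    {"procedure": "68.6"},
    {"procedure": "85.41"},
    {"procedure": "68.69"},
]
cohort_size = count_matches(candidates, {"68.6", "68.61", "68.69"})
```

No record's result depends on another's, so the only shared step is the final sum, which is what makes the aggregate-count output a good fit for parallel execution.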

2.5. Construction of Block Indexing

As mentioned, we invoke Block Indexing to improve performance. In block indexing, an attribute, commonly called the blocking key, is extracted from each record for each field (e.g., age, gender, and race), and is used to group records into blocks14. Patient records that have the same blocking key will be grouped into the same block. Note that a blocking key is associated with a specific data field. Thus, a patient record can be in different blocks depending on the data field of interest.

Later on, when a cohort discovery request is submitted with blocking keys, the hub only needs to find candidate records matching the blocking keys. The blocking process allows the system to avoid evaluating large numbers of irrelevant records, each of which would otherwise require a computationally expensive PPPE. In the example above, we extracted the first two digits of the diagnosis codes and the first digit of the procedure codes as the blocking keys (i.e., 18** for diagnosis codes 180.0, 180.1, 180.8, 180.9 and 6*** for procedure codes 68.6, 68.61, 68.69).

The blocking keys can be tuned according to the privacy requirement of the data fields involved. For example, the 5-digit ZIP code is a potential identifier under the Safe Harbor de-identification model [22]. Yet the initial three digits of a ZIP code are permissible under Safe Harbor (provided the resulting geographic unit contains more than 20,000 inhabitants) and, thus, can be extracted as blocking keys for the “ZIP code” field. At the same time, we could invoke a stronger privacy guarantee if we use only the first two digits of a ZIP code as the blocking key. However, such a setting could significantly impact the search performance, as it will yield more candidate records in each block. Thus, the privacy settings of these blocking keys should be based on users' experience balancing privacy requirements and computational efficiency.

To further enhance privacy, instead of storing raw blocking keys, we use their corresponding one-way hash values (e.g., SHA-256). Since one-way hash functions are deterministic, the hashed blocking keys can be used for block indexing without extra transformation.
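
As a minimal sketch of hashed blocking keys, assuming a simple "drop the decimal point and keep a digit prefix" rule: the function names and the `DX`, `PR`, and `ZIP` field labels are illustrative choices, with the field-name prefix mirroring the DX_18 convention described in Section 3.1.

```python
import hashlib


def blocking_key(field: str, value: str, prefix_len: int) -> str:
    """Build a raw blocking key from the leading digits of a code.

    Example: field 'DX', value '180.1', prefix_len 2 gives 'DX_18'.
    """
    digits = value.replace(".", "")
    return f"{field}_{digits[:prefix_len]}"


def hashed_blocking_key(field: str, value: str, prefix_len: int) -> str:
    """Return the SHA-256 hex digest of the raw blocking key."""
    raw = blocking_key(field, value, prefix_len)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# Codes in the same group share a hash, so the hash can index the block
# directly without the raw key ever being stored.
same_block = hashed_blocking_key("DX", "180.0", 2) == hashed_blocking_key("DX", "180.9", 2)
zip_key = blocking_key("ZIP", "32611", 3)
```

Shortening `prefix_len` coarsens the block (more privacy, more candidates per block), which is the tuning trade-off discussed above.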

3. Results

3.1. Performance evaluation and analysis

We evaluated the proposed PPCD framework on the three cohort discovery queries (Table 1) using the 2013 HCUP NIS database, which contains 7.1 million patient records defined over 141 fields. We verified that all results were correct in that the query results over the encrypted and plaintext settings were identical. We implemented our prototype in Java (v1.7.0) with Apache HBase, a distributed, scalable, big data store (v0.98.16 with Hadoop v2.6.3). All experiments were performed on the Microsoft Azure cloud, with four Ubuntu 14.04 instances, each of which has 16 CPU cores (2 GHz) and 28 GB of memory. We use the Java Pairing-Based Cryptography Library (JPBC) 2.0.0 [15] to support the implementation of elliptic curve cryptography (ECC) [v]. The order of the group in our framework is set to 1024 bits, with two primes (i.e., p and q) of 512 bits each, according to the BGN cryptosystem [vi].

Given that NIS has a flat data structure, we used a single HBase table to host the entire dataset. HBase uses the concept of a rowkey to uniquely identify each record. In our experiments, we derived artificial rowkeys for the primary keys (PKs) of the database via the indexing key (i.e., ‘KEY_NIS’ field) defined by the NIS dataset. To support block indexing, we created a Blocking Key Mapping table (see Figure 5) with two columns that maps the rowkeys of individual records to their associated blocking keys. The first column (the rowkey of the blocking key mapping table) is constructed by concatenating the hashed blocking key and the rowkey of the corresponding record. Note that blocking keys are prefixed with the names of the fields (e.g., DX_18 represents diagnosis code group 18*) to avoid collision across different fields [vii]. As shown in Figure 5, to execute a privacy-preserving query, we perform an HBase scan query on the Blocking Key Mapping table to find potential candidate groups. Then, all rowkeys identified by the blocking key query are considered as potential candidates for the secure polynomial evaluation.
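
The rowkey layout and prefix scan can be emulated without HBase. In the sketch below a sorted Python list stands in for the lexicographically ordered rowkeys of the Blocking Key Mapping table, and `bisect` plays the role of a scan starting at a prefix. The `|` separator, the function names, and the sample rowkeys are assumptions for illustration; no HBase API is used.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def hash_key(blocking_key: str) -> str:
    """SHA-256 hex digest of a field-prefixed blocking key such as 'DX_18'."""
    return hashlib.sha256(blocking_key.encode("utf-8")).hexdigest()


def mapping_rowkey(blocking_key: str, record_rowkey: str) -> str:
    """Concatenate the hashed blocking key and the record rowkey."""
    return f"{hash_key(blocking_key)}|{record_rowkey}"


def build_mapping(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Emulate the sorted rowkey order of the mapping table."""
    return sorted(mapping_rowkey(bk, rk) for bk, rk in pairs)


def scan_block(mapping: List[str], blocking_key: str) -> List[str]:
    """Return record rowkeys whose mapping rowkey starts with the hashed key."""
    prefix = hash_key(blocking_key) + "|"
    start = bisect.bisect_left(mapping, prefix)
    found = []
    for rowkey in mapping[start:]:
        if not rowkey.startswith(prefix):
            break  # sorted order: once the prefix stops matching, we are done
        found.append(rowkey[len(prefix):])
    return found


table = build_mapping([("DX_18", "r1"), ("DX_17", "r2"), ("DX_18", "r3"), ("PR_6", "r1")])
candidates = scan_block(table, "DX_18")
```

Placing the hashed blocking key first in the rowkey means all records of a block are stored contiguously, so a single range scan retrieves the candidate set.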

Figure 5. The privacy-preserving query process with block indexing strategy.

3.1.1. Block Indexing Strategy

We used the first digit as the blocking key for ICD-9 procedure codes (e.g., 6* for 68.91), and the first two digits as the blocking key for ICD-9 diagnosis code (e.g., 14* for 146.0). For the Age field, we use “18-”, “18-30”, “30-50”, “50-70” and “70+” as the blocking keys corresponding to the specified age groups. No blocking key was created for the Gender field.

As shown in Table 2, block indexing filters out a substantial number of records quickly. Specifically, it narrows down the potential candidates from 7.1 million to ∼8,000, 3,700, and 15,000 for use cases #1 to #3, respectively. Then, by invoking PPPE on these candidates, we find all satisfied patient records within 178, 165, and 262 seconds for use cases #1 to #3, respectively. These results indicate that the query cost is not completely determined by the number of potential candidates. The number of fields and the number of values to be checked for each field may also increase the evaluation cost for each candidate. For example, use case #2 has 18 target values for the diagnosis code field, while use case #1 has only 7 target values.

Table 2. Results for the PPCD system over the three use cases as tested in the HCUP NIS.

| Use Case # | Total Records | Potential Candidates by Blocking Keys | Satisfied Patient Records | Query Cost (Parallelized) | Query Cost (Not Parallelized) | Query Cost without Encryption |
|---|---|---|---|---|---|---|
| 1 | 7.1 million | 8,046 | Age ≥ 70: 18; 60 ≤ Age < 70: 54; 50 ≤ Age < 60: 83; Age < 50: 242 | 178 seconds | 3,523 seconds | 2.76 seconds |
| 2 | 7.1 million | 3,709 | 1,097 | 165 seconds | 3,337 seconds | 1.53 seconds |
| 3 | 7.1 million | 15,385 | UM: 7,027; BM: 3,891 | 262 seconds | 11,550 seconds | 2.98 seconds |

In addition, the number of final qualified records will also affect the query performance. As discussed above, for conditions joined by the AND logic, we can skip a candidate once a single condition fails. However, to find a record that satisfies the criteria for the cohort, the query engine needs to evaluate all of the possible search conditions. For example, only 155 records satisfied all of the cohort selection criteria among the 8,046 encrypted candidates for use case #1, while use case #2 had 1,097 satisfied records with only 3,709 candidates. However, as shown in Table 2, use cases #1 and #2 (i.e., 178 vs. 165 seconds) have a very similar query cost.

3.1.2. Parallelization of privacy-preserving polynomial evaluation

To demonstrate the advantages of a parallelized design, we compare the performance of parallelized queries in our design with the non-parallel version. Table 2 shows that the non-parallel implementation takes roughly 20 times (use cases #1 and #2) to 44 times (use case #3) longer than the parallelized version, which uses 64 CPU cores. For further optimization, we address how to securely incorporate MapReduce into the framework in the Discussion below.
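
The per-use-case speedup can be recomputed directly from the query costs in Table 2. The short sketch below does only that arithmetic; the dictionary layout and function name are illustrative.

```python
# Query costs in seconds from Table 2: (parallelized, not parallelized).
QUERY_COST_SECONDS = {
    1: (178, 3523),
    2: (165, 3337),
    3: (262, 11550),
}


def speedup(parallel_seconds: float, sequential_seconds: float) -> float:
    """Ratio of non-parallel to parallel running time."""
    return sequential_seconds / parallel_seconds


speedups = {
    use_case: round(speedup(parallel, sequential), 1)
    for use_case, (parallel, sequential) in QUERY_COST_SECONDS.items()
}
for use_case, ratio in sorted(speedups.items()):
    print(f"use case #{use_case}: {ratio}x")
```

The ratio is largest for use case #3, which also has the most candidate records to distribute across the 64 cores.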

3.2. Security analysis

To provide strong privacy and security, we rigorously followed the BGN cryptosystem [8] to perform the system setup process as shown in the Setup Algorithm. In particular, we use exactly the same parameter selection and key generation processes as BGN. With regard to the data encryption process in the Encryption Algorithm, our construction encrypts each polynomial coefficient in the same way as the BGN cryptosystem with one difference: it adds one more random value to further blind the data. Specifically, given a message $m$, the BGN cryptosystem encrypts it as $E(m) = g^{m} \cdot h^{r}$, where $g$ is a generator of a multiplicative cyclic group $G$ and $r$ is a random number, whereas our construction encrypts it as $E(m) = g^{\beta m} \cdot h^{r}$, where $\beta$ is a random number selected for each encryption. Thus, our encryptions achieve at least the same security guarantee as the BGN cryptosystem. Since the BGN cryptosystem is secure against a polynomial time-bounded adversary (i.e., the computation ability of the adversary is limited to a polynomial number of operations) based on the hardness of the subgroup decision problem [8], our protocol is also secure against such adversaries.

In the PPCD algorithm (Figure 4), our construction converts the plaintext query into an encrypted format and evaluates the encrypted query without decrypting it. In particular, each target query value involved in our query is encrypted with two random numbers $r_{j1}$ and $r_{j2}$. Since $\theta_k r_{j1} + \nu_{jk} r_{j2}$ has the same distribution as $r_{j1} + r_{j2}$ [9], computing $\theta_k r_{j1} + \nu_{jk} r_{j2}$ on group $G$ from $Q_{j,k} = g^{q(\theta_k r_{j1} + \nu_{jk} r_{j2})}$ becomes the discrete logarithm problem [10]. Since the discrete logarithm problem is widely believed to be computationally infeasible in appropriately chosen groups, our encryption for a query request is secure against a polynomial-time adversary.

Finally, the polynomial evaluation for an encrypted record using $\mathrm{Eval}_j$ only outputs two possible values: 1) if the encrypted record satisfies the query, $\mathrm{Eval}_j$ outputs $e(g,g)^{0}$, which only contains the public key element $g$ without any information about either the data record or the query request; 2) if the encrypted record does not satisfy the search conditions, $\mathrm{Eval}_j$ only generates a random output blinded by the random numbers introduced in the encryption of the record and the query request. Thus, the polynomial evaluation process also achieves the required privacy protection for the data involved.
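The two-outcome behaviour described above rests on encoding values as polynomial roots. A minimal plaintext analogue is sketched below, assuming a toy prime modulus and hypothetical function names; it performs no encryption and no pairing, so it only illustrates why a match yields exactly zero while a mismatch yields a value blinded by a random factor.

```python
import secrets

P = 2**61 - 1  # a Mersenne prime used as a toy modulus (assumption)

def encode_roots(values):
    """Coefficients (low to high degree) of beta * prod(x - v) mod P."""
    coeffs = [1]
    for v in values:
        # Multiply the current polynomial by (x - v).
        nxt = [0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            nxt[i] = (nxt[i] - c * v) % P
            nxt[i + 1] = (nxt[i + 1] + c) % P
        coeffs = nxt
    beta = secrets.randbelow(P - 1) + 1  # random nonzero blinding factor
    return [(beta * c) % P for c in coeffs]

def evaluate(coeffs, x):
    """Horner evaluation of the polynomial at x, mod P."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc

# Hypothetical integer-coded values used as roots.
coeffs = encode_roots([1809, 1800, 1801])
hit = evaluate(coeffs, 1809)    # 1809 is a root
miss = evaluate(coeffs, 1740)   # 1740 is not a root
print(hit == 0, miss != 0)
```

Because the modulus is prime and the blinding factor is nonzero, the evaluation is zero only at a root; in the actual construction the analogous test is whether the pairing result equals $e(g,g)^{0}$.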

We also note that our construction considers both external attackers and insiders. For example, compromised data sources have a negligible probability of learning honest data sources' data even if they collude with each other. This is because each data source chooses a secret key known only to itself. In addition, as data stored in the databases in our construction are encrypted, insider attackers (e.g., IT employees of a healthcare organization, such as system and database administrators) cannot access the data in plaintext, which reduces the likelihood of insider attacks caused by these employees' misbehavior or negligence.

4. Discussion

4.1. Performance improvement with big data technology

As noted earlier, each encrypted record can be evaluated against a search condition independently, such that the process can be parallelized. Notably, we can distribute the workload of the evaluation across all records by using standard distributed computing frameworks like MapReduce. In doing so, we can construct Mappers to conduct the PPPE (i.e., compute $\mathrm{Eval}_j$ and check whether $\mathrm{Eval}_j = e(g,g)^{0}$) in parallel.

Each Mapper will output a key-value pair, with the rowkey of each encrypted record as the key and the checking result (true or false) of a field j as the value (e.g., j-true or j-false). All (rowkey, j-true/false) key-value pairs produced by the Mappers will be sent to the Reducers. The Reducers will aggregate the results by key and output whether or not each encrypted data record satisfies the query condition. By integrating MapReduce into our design, we could fully distribute the privacy-preserving query process to further improve performance. In our current implementation, we used a NoSQL database, HBase, which was developed based on Hadoop and supports MapReduce natively.
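The Mapper and Reducer roles described above can be sketched as a single-process simulation. This is not Hadoop or HBase code: `pppe_check` is a hypothetical placeholder for the encrypted test, and the table, query, and function names are assumptions made for illustration.

```python
from collections import defaultdict

def pppe_check(stored_value, condition):
    """Hypothetical stand-in for the encrypted PPPE test on one field."""
    return stored_value in condition

def mapper(rowkey, record, query):
    """Emit (rowkey, (field, matched)) for every queried field j."""
    for field, condition in query.items():
        yield rowkey, (field, pppe_check(record[field], condition))

def reducer(rowkey, values):
    """An AND query is satisfied only if every field check is true."""
    return rowkey, all(matched for _field, matched in values)

def run_job(table, query):
    grouped = defaultdict(list)  # plays the role of the shuffle phase
    for rowkey, record in table.items():
        for key, value in mapper(rowkey, record, query):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())

table = {
    "r1": {"dx": "A", "proc": "X"},
    "r2": {"dx": "A", "proc": "Y"},
}
query = {"dx": {"A"}, "proc": {"X"}}
result = run_job(table, query)
print(result)
```

Since each mapper call touches a single record, the calls are independent and could be spread across workers; only the per-rowkey aggregation needs the grouped results.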

4.2. Performance improvement with hybrid privacy protection

It is worth noting that not all patient information may require the same level of privacy protection. For instance, in practice, certain types of PHI, such as names, Social Security numbers (SSNs), and email addresses, will likely require a higher level of protection because disclosure of such information can lead to patient identification and/or subsequent misuse of their information. Nevertheless, some other patient records, such as de-identified lab results, may not be as sensitive, such that it may be permissible to give the data hub access to the plaintext data [28]. Such methods have been proposed in the existing literature, especially in the area of outsourcing private data for secure classification. Thus, we can adopt a hybrid privacy protection approach and treat data with different sensitivity levels differently to improve search efficiency. In particular, the proposed security construction is suitable for protecting highly sensitive information, such as PHI, to achieve a high level of privacy guarantee.

For information that is not deemed to be sensitive, data sources can encrypt it using a standard public-key cryptosystem such as the RSA algorithm [11] with the hub's public key. Then, the hub can decrypt it and re-encrypt it with an order-preserving encryption scheme [12]. Specifically, order-preserving encryptions have the property that if A > B (e.g., for serum creatinine levels, (A = 1.1 mg/dl) > (B = 0.8 mg/dl)), then the encrypted values satisfy E(A) > E(B) (i.e., E(1.1) > E(0.8)). With this property, the hub can perform privacy-preserving range search with performance similar to searching on unencrypted data.
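The order-preserving property can be illustrated with a toy monotone mapping. The sketch below is an assumption-laden illustration, not the scheme of [12] and not secure: it uses a seeded, non-cryptographic random generator as a stand-in for a secret key, and it encodes lab values as integer tenths (e.g., 1.1 mg/dl becomes 11).

```python
import random

def make_toy_ope(domain_size, seed):
    """Build a random strictly increasing table over 0..domain_size-1."""
    rng = random.Random(seed)  # seed stands in for a secret key (toy only)
    table, current = [], 0
    for _ in range(domain_size):
        current += rng.randint(1, 1000)  # positive gaps preserve order
        table.append(current)
    return table

def encrypt(table, value):
    """Look up the ciphertext for an integer-coded plaintext value."""
    return table[value]

table = make_toy_ope(domain_size=200, seed=2016)

# Serum creatinine in tenths of mg/dl: 1.1 -> 11, 0.8 -> 8.
a, b = encrypt(table, 11), encrypt(table, 8)

# Range search directly on ciphertexts: 0.8 <= value <= 1.2 mg/dl.
stored = [encrypt(table, v) for v in (6, 8, 11, 15, 23)]
low, high = encrypt(table, 8), encrypt(table, 12)
in_range = [c for c in stored if low <= c <= high]
print(a > b, len(in_range))
```

Because comparisons work on ciphertexts, ordinary sorted indexes remain usable for these less sensitive fields, which is the source of the performance benefit described above.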

4.3. Implementation considerations and limitations

Though we focused our evaluation on the HCUP NIS dataset and the HBase database system, the proposed framework can be deployed in more traditional relational database infrastructures (e.g., Microsoft SQL Server) and with existing clinical data semantics (e.g., the PCORnet Common Data Model (CDM) [27] or the Observational Medical Outcomes Partnership (OMOP) CDM [23]). To do so, we need to encrypt all elements in the existing database with the proposed construction and remove all plaintext data from the database. After encryption, we will store the encrypted value of each element back into the tables (and replace the corresponding plaintext record).

When a query request is submitted, we can use SQL queries to identify the fields of interest and then retrieve relevant encrypted data. Then, the search is performed in the application layer with our PPCD Algorithm. Due to the use of encrypted records in the database, indexes that are commonly used to improve query performance in relational databases are inevitably broken (e.g., a B-tree index that relies on knowing the content of a data element to assign it to the correct branch of the tree), which will lead to slower searches than the plaintext setting.

4.4. Generalizability to other research network architectures

In this work, we focused our discussion on the OneFlorida network architecture, where there is a hub that collects and hosts all shared data in a centralized database (i.e., the OneFlorida Data Trust). The proposed PPCD framework, however, can be easily generalized to other research network architectures. In cases where the network hub is not a centralized database but only an interface for investigators to submit PPCD queries, the privacy-preserving queries can, under the same security constructions, be distributed to and executed by each individual data source. The query results are then aggregated on the network hub and presented to investigators.

However, additional security precautions need to be taken to protect the communication between the network hub and the data sources. Similarly, given proper protection of communication, in the case of a two-stage (or multi-stage) network like PCORnet that consists of many smaller CDRNs, the search queries can be propagated to and materialized at each data holder, and the results aggregated at the root node of the overall network (e.g., the Coordinating Center in PCORnet) in a privacy-preserving way. Last, some CRNs adopt a fully distributed network model, where no hub exists in the network. With such a network architecture, the node where the query is submitted can act temporarily as the network hub for that specific query.
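The hub-as-interface arrangement described above can be sketched as follows. The class, function names, and sample records are hypothetical; the local evaluation is done in plaintext here, whereas in the proposed framework each source would run the PPCD query over encrypted records, and the optional `min_count` reflects the small-count thresholds mentioned in footnote ii.

```python
class DataSource:
    """A data holder that evaluates queries locally and returns only a count."""

    def __init__(self, name, records):
        self.name = name
        self._records = list(records)

    def local_count(self, predicate):
        return sum(1 for record in self._records if predicate(record))

def hub_count(sources, predicate, min_count=0):
    """Aggregate per-source counts; suppress totals below min_count."""
    total = sum(source.local_count(predicate) for source in sources)
    return total if total >= min_count else None

sources = [
    DataSource("site_a", [{"age": 72}, {"age": 45}]),
    DataSource("site_b", [{"age": 81}, {"age": 70}, {"age": 30}]),
]

def elderly(record):
    return record["age"] >= 70

total = hub_count(sources, elderly)
suppressed = hub_count(sources, elderly, min_count=10)
print(total, suppressed)
```

In a fully distributed network, any node could call `hub_count` over its peers for the duration of one query, which corresponds to the temporary hub role described above.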

4.5. Limitations

The findings of this investigation illustrate the potential for distributed privacy-preserving cohort discovery with a cryptographically strong basis. At the same time, it should be acknowledged that many cohort definitions, such as those associated with clinical trials, are complex and cannot be readily handled under our current framework. This issue is an artifact of multiple confounding problems, which are worth noting for future extensions of secure computation approaches to cohort discovery.

First, clinical trial eligibility criteria are rarely designed with computation in mind, which limits their translation into predicates that are easy to express in secure functions. This is not to say that such investigations lack a logical framework, but that the definitions of trial eligibility often rely upon criteria with temporal indications, such as life expectancy (e.g., greater than two months) or the amount of time an individual has been free of a tumor. Second, the data available in existing data sources are often limited to those that can be expressed in a discrete format. For example, a criterion such as “able to tolerate chemotherapy” cannot be asserted in a search query unless such a variable exists directly in the dataset as a discrete True or False field. In practice, these characteristics of patients either do not exist or exist only as free-text data in existing data sources. Thus, participants in the research network may need to derive and append numerous variables to records in anticipation of the wide variety of inquiries that investigators wish to pursue. However, it should be recognized that the design and application of common data models has the potential to constrain the expressiveness of such search criteria. Nevertheless, how to handle these complex search criteria warrants future investigation to improve the utility of the proposed framework. We suspect that such criteria may need custom solutions, developed on a case-by-case basis, that depend on both the availability of data and the characteristics of the search criteria.

5. Conclusion

In this paper, we introduced a software solution for privacy-preserving cohort discovery (PPCD) services over clinical research networks. The proposed security construction is designed to achieve high privacy guarantees (because of its foundation in elliptic curve cryptography) against attackers originating from outside, as well as within, the network. We implemented the software using modern big data processing techniques (e.g., Apache HBase and Hadoop) and integrated various optimization strategies (e.g., block indexing) to improve the query-response performance. We evaluated the software using the publicly available HCUP NIS dataset and queries based on the patient cohort discovery definitions from three real use cases. The analysis indicates that PPCD can be accomplished in a reasonable amount of time without diminishing the correctness of the aggregate query results. In future work, we plan on complementing the secure evaluation model with additional techniques designed for protecting the semantics of privacy in databases. In particular, we are working to integrate a policy manager that mitigates inferential disclosure (e.g., via differential privacy [24]) into the PPCD framework.

Highlights.

  • We design a privacy-preserving cohort discovery framework for distributed networks.

  • We show how real-world cohort specifications can be translated within the framework.

  • Cohort queries can be executed in a timely manner on a database of 7 million records.

  • A parallelized design with efficient block indexing improves the query-response time.

Acknowledgments

This work was supported in part by NIH grants UL1TR001427, UL1TR001445, and R01LM009989, the OneFlorida Cancer Control Alliance (funded by James and Esther King Biomedical Research Program, Florida Department of Health Grant Number 4KB16), the OneFlorida Clinical Research Consortium and the Mid-South Clinical Data Research Network funded by the Patient Centered Outcomes Research Institute (PCORI). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or PCORI.

Footnotes

i

We use the concepts of cohort discovery, search, and count interchangeably.

ii

Thresholds can be integrated into this framework to prevent the disclosure of counts that are deemed to be too small.

iii

For example, organizations A, B, and C can be compromised and then utilized to disclose organization D's private dataset.

iv

In essence, it outputs an equivalence testing result, which indicates whether a particular data record matches a search condition. Specifically, our construction uses field values as roots to build polynomials in the data encryption process in the Encryption Algorithm.

v

Considering the required security guarantees, we use Type A1 pairing to support fields of composite order under ECC.

vi

This level of protection ensures sufficient resilience against the subexponential-time factoring attack, which further achieves semantic security under the subgroup decision assumption, as in the BGN cryptosystem.

vii

For example, the same blocking key 20 can be derived from both ZIP codes (i.e., ZIP_20: codes corresponding to 20*) and ages (i.e., AGE_20: 20 < age < 29).

Conflict of Interest: The authors declare that they have no conflict of interest.


References

  • 1. Boneh D, Crescenzo GD, Ostrovsky R, Persiano G. Public key encryption with keyword search. Proceedings of Cryptology – EUROCRYPT. 2004:506–22.
  • 2. Waters B, Balfanz D, Durfee G, Smetters DK. Building an encrypted and searchable audit log. Proceedings of the 11th Annual Network and Distributed System Security Symposium. 2004.
  • 3. Baek J, Safavi-Naini R, Susilo W. Public key encryption with keyword search revisited. Proceedings of International Conference on Computational Science and its Applications. 2008:1249–59.
  • 4. Khader D. Public key encryption with keyword search based on k-resilient IBE. Proceedings of the International Conference on Computational Science and its Applications. 2006:298–308.
  • 5. Crescenzo GD, Saraswat V. Public key encryption with searchable keywords based on Jacobi symbols. Proceedings of the Cryptology 8th International Conference on Progress in Cryptology. 2007:282–96.
  • 6. Jeong IR, Kwon JO, Hong D, Lee DH. Constructing PEKS schemes secure against keyword guessing attacks is possible? Computer Communications. 2009;32:394–6.
  • 7. Yau WC, Heng SH, Goi BM. Off-line keyword guessing attacks on recent public key encryption with keyword search schemes. Proceedings of the 5th International Conference on Autonomic and Trusted Computing. 2008:100–5.
  • 8. Boneh D, Goh EJ, Nissim K. Evaluating 2-DNF formulas on ciphertexts. Proceedings of the 2nd International Conference on Theory of Cryptography. 2005:325–341.
  • 9. Katz J, Lindell Y. Chapter 11, Introduction to Modern Cryptography. Chapman & Hall/CRC; 2007.
  • 10. McCurley KS. The discrete logarithm problem. Cryptology and Computational Number Theory, Applied Mathematics. 1990;42:49–74.
  • 11. RSA Laboratories. PKCS #1: RSA Cryptography Standard Version 2.2. [Accessed on March 10, 2016]; Available online: ftp://ftp.rsasecurity.com/pub/pkcs/pkcs-1/pkcs-1v2-1.pdf.
  • 12. Boldyreva A, Chenette N, O'Neill A. Order-preserving encryption revisited: Improved security analysis and alternative solutions. Proceedings of the 31st Annual Conference on Advances in Cryptology. 2011:578–95.
  • 13. Song D, Wagner D, Perrig A. Practical techniques for searches on encrypted data. Proceedings of the IEEE Symposium on Security and Privacy. 2000:44–55.
  • 14. Baxter R, Christen P, Churches T. A comparison of fast blocking methods for record linkage. Proceedings of ACM Workshop on Data Cleaning, Record Linkage, and Object Consolidation. 2003:25–7.
  • 15. De Caro A, Iovino V. The Java Pairing-Based Cryptography Library (JPBC). [Accessed on March 10, 2016]; Available online: http://gas.dia.unisa.it/projects/jpbc/-.VtxuCKgrJ95.
  • 16. George EM, Tergas AI, Ananth CV, et al. Safety and tolerance of radical hysterectomy for cervical cancer in the elderly. Gynecol Oncol. 2014;134(1):36–41. doi: 10.1016/j.ygyno.2014.04.010.
  • 17. Habermann EB, Thomsen KM, Hieken TJ, Boughey JC. Impact of availability of immediate breast reconstruction on bilateral mastectomy rates for breast cancer across the United States: data from the nationwide inpatient sample. Ann Surg Oncol. 2014;21(10):3290–6. doi: 10.1245/s10434-014-3924-y.
  • 18. Chung TK, Rosenthal EL, Magnuson JS, Carroll WR. Transoral robotic surgery for oropharyngeal and tongue cancer in the United States. Laryngoscope. 2015;125(1):140–5. doi: 10.1002/lary.24870.
  • 19. Munro D. Why Anthem was wrong not to encrypt. [Accessed on March 7, 2016]; Available online: http://thehealthcareblog.com/blog/2015/02/22/why-anthem-was-wrong-not-to-encrypt/
  • 20. Healthcare Cost and Utilization Project (HCUP) Nationwide Inpatient Sample (NIS) databases. [Accessed on March 7, 2016]; Available online: https://www.hcup-us.ahrq.gov/nisoverview.jsp.
  • 21. Schilling LM, Kwan BM, Drolshagen CT, et al. Scalable Architecture for Federated Translational Inquiries Network (SAFTINet) Technology Infrastructure for a Distributed Data Network. EGEMS. 2013;1(1):1027. doi: 10.13063/2327-9214.1027.
  • 22. U.S. Dept. of Health & Human Services. Standards for privacy of individually identifiable health information, final rule, 45 CFR, pt 160–164. 2002.
  • 23. Voss EA, Makadia R, Matcho A, et al. Feasibility and utility of applications of the common data model to multiple, disparate observational health databases. J Am Med Inform Assoc. 2015;22(3):553–64. doi: 10.1093/jamia/ocu023.
  • 24. Dwork C. Differential privacy. Encyclopedia of Cryptography and Security. 2011:338–40.
  • 25. dbGap. eMERGE Network Study of the Genetic Determinants of Resistant Hypertension. [Accessed on March 4, 2016]; Available online: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000297.v1.p1.
  • 26. PCORI. OneFlorida Clinical Research Consortium. [Accessed on March 4, 2016]; Available online: http://www.pcori.org/research-results/2015/oneflorida-clinical-research-consortium.
  • 27. PCORnet. PCORnet Common Data Model. [Accessed on March 7, 2016]; Available online: http://www.pcornet.org/pcornet-common-data-model/
  • 28. Pattuk E, Kantarcioglu M, Ulusoy, Malin B. Optimizing secure classification performance with privacy-aware feature selection. Proceedings of the 36th IEEE International Conference on Data Engineering. 2016 in press.
  • 29. Paar C, Pelzl J. Elliptic Curve Cryptosystems: Chapter 9 of Understanding Cryptography: A Textbook for Students and Practitioners. Springer; 2009.
  • 30. Selby J, Slutsky J. How a unique provision in the American Recovery and Reinvestment Act set a foundation for the Patient-Centered Outcomes Research Institute. J Comp Eff Res. 2014 Nov;3(6):565–6. doi: 10.2217/cer.14.59.
  • 31. Fleurence RL, Curtis LH, Califf RM, Platt R, Selby JV, Brown JS. Launching PCORnet, a national patient-centered clinical research network. J Am Med Inform Assoc. 2014;21(4):578–82. doi: 10.1136/amiajnl-2014-002747.
  • 32. Federal Information Security Modernization Act (FISMA). U.S. Department of Homeland Security. [Accessed on March 7, 2016]; Available online: https://www.dhs.gov/fisma.
  • 33. Security and Privacy Controls for Federal Information Systems and Organizations. National Institute of Standards and Technology. [Accessed on March 7, 2016]; Available online: http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-53r4.pdf.
  • 34. Wang C, Cao N, Li J, Ren K, Lou W. Secure ranked keyword search over encrypted cloud data. Proceedings of the 30th IEEE International Conference on Distributed Computing Systems (ICDCS '10). 2010:253–62.
  • 35. Cash D, Jaeger J, Jarecki S, Jutla CS, Krawczyk H, Rosu MC, Steiner M. Dynamic searchable encryption in very-large databases: Data structures and implementation. Proceedings of the 21st Annual Network and Distributed System Security Symposium (NDSS). 2014:1–16.
  • 36. Washington AE, Lipstein SH. The Patient-Centered Outcomes Research Institute--promoting better information, decisions, and health. N Engl J Med. 2011 Oct 13;365(15):e31. doi: 10.1056/NEJMp1109407.
  • 37. Collins FS, Hudson KL, Briggs JP, Lauer MS. PCORnet: turning a dream into reality. J Am Med Inform Assoc. 2014 Jul-Aug;21(4):576–7. doi: 10.1136/amiajnl-2014-002864.
