Abstract
Reliable cohort discovery is an essential early part of clinical study design. Indeed, it is the defining feature of many clinical research networks, including the recently launched Accrual to Clinical Trials (ACT) network. As currently deployed, however, the ACT network only allows cohort queries in isolated silos, rendering cohort discovery across sites unreliable. Here we demonstrate a novel protocol that provides network participants with access to more accurate combined cohort estimates (union cardinality) with other sites. A two-party Elgamal protocol is implemented to satisfy privacy and security imperatives, and a special attribute of Bloom filters is exploited for accurate and fast cardinality estimates. To emulate mandatory privacy-protecting obfuscation factors (like those applied to the counts reported for individual sites by ACT), we configure the Bloom filter based on the individual site cohort sizes, striking an appropriate balance between accuracy and privacy. Finally, we discuss additional approval and data governance steps required to incorporate our protocol in the current ACT infrastructure.
Introduction
Cohort discovery is the search for counts of patients (subjects) who share some set of attributes. The purpose of cohort discovery may be either retrospective or prospective. A retrospective study may look at incidence of a particular disease within a certain population over time. It may look at the effectiveness of past treatments for different populations or analyze a wide range of aggregate data to evaluate a hypothesis. It may even involve mining the data for hidden insights. Prospective studies typically have in mind the eventual recruitment of subjects for a clinical trial. But investigators must first discover if sufficient numbers of subjects exist to allow conclusions realistically to be drawn with statistical significance or with some level of confidence.
Clearly, working across multiple sites will render more studies feasible than confining research to a single site. This is particularly important for rare diseases or for populations with lower incidence of particular diseases. However, the naïve summing of cohort counts from different sites is very likely to be overly optimistic. Kho and associates,1 in a city-wide linkage study in Chicago, report a cohort overlap of 28% for a common disease. Overcounting is a particular problem in the context of public health. But undercounting is also a potential problem. Given the siloed nature of medical records, it is possible for the attributes of interest for individual subjects to be scattered across multiple sites. To include such individuals, it is necessary to have a unified picture of patient attributes.
Attaining such a unified picture is the goal of various privacy preserving record linkage (PPRL) systems, an active area of research.1–3 Some clinical research networks adopt PPRL to achieve optimal cohort discovery and flexibility in data interrogation. But PPRL approaches are by their very nature centralized and, despite efforts2 to mitigate security exposure, suffer from the need for all participating institutions to trust a common honest broker. This creates the need for complex institutional oversight arrangements and introduces a single point of failure. But fundamentally, PPRL is not necessary for the common task of cohort discovery.
Cohort discovery is perhaps the core use case for clinical data research networks like Accrual to Clinical Trials (ACT)4 and its predecessor Shared Health Research Information Network (SHRINE).5 As a network linking multiple sites, ACT should provide investigators with an easy and accurate way to estimate the number of cohort members that exist across different sites. Unfortunately, at present the ACT interface only reports counts for individual sites and overlooks the fact that individual patients may have been treated at more than one site. This could lead naïve investigators to overestimate cross-site cohort sizes by simply summing the counts from the individual sites.
Recent work by Raisaro and associates6 suggests a straightforward method for securely aggregating cohort counts across multiple sites using additively homomorphic Elgamal encryption. However, their method does not address the potential for overestimating cohort size. Indeed, their approach performs the naïve addition on behalf of the investigators and might serve to obscure the fact that overestimation is occurring. Furthermore, their approach includes a third party or cloud provider to carry out the computation. We here demonstrate a system with neither of these disadvantages.
Like this prior work, our solution satisfies the security requirement against an honest-but-curious (semi-honest) adversary and leverages an elliptic curve Elgamal cryptosystem. But while they frame their work in terms of centrally linked Informatics for Integrating Biology to the Bedside (i2b2)7 sites, we here focus on an application to the more distributed ACT network. We also take advantage of a unique characteristic of Bloom filters,8 first observed by Michael and associates,9,10 whereby the cardinality of set membership may be inferred from the “density” of the Bloom filter. This technique allows us to derive accurate estimates of cohort sizes across two sites and also to tune the probabilistic accuracy of the results with parameters.
By controlling the size of our Bloom filter, our system offers a tunable range of differential privacy through the expected accuracy of its cardinality estimate. We find our obfuscation method to be more principled than the one currently observed in ACT and in i2b2 itself, which simply by convention add a random number between -10 and +10 to all reported results.11,12
Our work here may be seen as a new protocol for private set intersection cardinality, very similar to the approach taken by Egert et al.,13 but differing in some implementation details.
Method
As detailed in previous studies,10,14 given a set of n elements and a Bloom filter BF with m bits and k hash functions, the expected total number of zero bits z within the Bloom filter can be approximated by

(1)   E[z] = m(1 - 1/m)^{kn}
Let patient cohorts from two different healthcare providers be represented by two private sets S1 and S2, with their respective cohort sizes |S1| and |S2|. Because the common elements between the two sets will be mapped to the same Bloom filter bits, the following holds true (where the union of two Bloom filters denotes their bitwise OR):

(2)   BF(S1) ∪ BF(S2) = BF(S1 ∪ S2)
As first observed by Michael et al.,9,10 this equality indicates that the number of zero bits in BF(S1) ∪ BF(S2) can be used to estimate the union cardinality |S1 ∪ S2|. Using Equations (1) and (2), we obtain:

(3)   E[z] = m(1 - 1/m)^{k|S1 ∪ S2|}
Therefore, the union cardinality |S1 ∪ S2| can be approximated as follows:

(4)   |S1 ∪ S2| ≈ ln(z/m) / (k · ln(1 - 1/m))
Equation (4) indicates that the union cardinality for two private sets can be approximated using the number of zero bits in the union of the two Bloom filters. However, obtaining BF(S1) ∪ BF(S2) securely for two private sets is not straightforward. Although the Bloom filter bits themselves do not immediately reveal private information, directly comparing the two discloses the private sets' membership information. In our context, this means allowing one site to learn which of its own patients also have had encounters at another healthcare provider. Such information is likely to be considered sensitive health information and thus must be protected. Therefore, a proper protocol is required to guarantee that at the end of the computations, the sites involved learn the union cardinality with the other site, but nothing else about the other's private set contents, beyond perhaps the size of the intersection of patients shared by both sites.
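As a concrete check on Equations (1)–(4), the following Python sketch builds Bloom filters for two overlapping synthetic cohorts, ORs them implicitly by counting common zero bits, and estimates the union cardinality. The cohort names, sizes, and hashing scheme are illustrative only; the parameters m = 2^17 and k = 30 are one plausible configuration.

```python
import hashlib
import math

def bf_indices(token, m, k, salt=""):
    # Derive k bit positions by slicing a salted SHA-512 digest;
    # m must be a power of two so each slice maps to a valid index.
    digest = hashlib.sha512((salt + token).encode()).digest()
    bits = int.from_bytes(digest, "big")
    w = m.bit_length() - 1
    return [(bits >> (i * w)) & (m - 1) for i in range(k)]

def build_bf(items, m, k):
    bf = [0] * m
    for item in items:
        for idx in bf_indices(item, m, k):
            bf[idx] = 1
    return bf

def union_cardinality(bf1, bf2, k):
    # Equation (4): estimate |S1 ∪ S2| from the zero bits common to both filters.
    m = len(bf1)
    z = sum(1 for a, b in zip(bf1, bf2) if a == 0 and b == 0)
    return math.log(z / m) / (k * math.log(1 - 1 / m))

# Two illustrative overlapping cohorts: |S1| = 1800, |S2| = 1500, overlap 600,
# so the true union cardinality is 2700.
m, k = 2**17, 30
s1 = {f"patient{i}" for i in range(1800)}
s2 = {f"patient{i}" for i in range(1200, 2700)}
estimate = union_cardinality(build_bf(s1, m, k), build_bf(s2, m, k), k)
print(round(estimate))  # close to 2700
```

Note that this plaintext comparison is exactly what the secure protocol below avoids; the sketch only validates the cardinality arithmetic.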
We now demonstrate a secure protocol to address a common use-case scenario for the ACT network: after conducting an initial preliminary cohort query, an investigator from one site desires more accurate details on a combined cohort with another site. This would be a natural follow-up step for the current ACT workflow. In our notation, the site that initiates the joint cohort size calculation is site 1, and the other is site 2.
Note that because the respective isolated cohort counts |S1| and |S2| have been received (in the output from the regular ACT query), obtaining the union cardinality |S1 ∪ S2| automatically unveils the intersection cardinality |S1 ∩ S2|. Determining the specifics of such an intersection has received considerable attention from biomedical informatics researchers1,2,15 because these subjects' clinical data resides in disparate silos, and it is often necessary to carry out privacy preserving record linkage to assemble a more complete healthcare picture for those subjects. We focus on cardinality, which is a much more common task and one which is often carried out with a non-human subject determination and much less institutional review board (IRB) oversight.
The specific design goal for our two-party protocol is securely computing the number of common zero bits of two Bloom filters BF(S1) and BF(S2), in order to estimate the union cardinality |S1 ∪ S2|. In the following pseudocode, we implement a secure protocol based on the Elgamal cryptosystem. The standard Elgamal algorithm is known to be multiplicatively homomorphic: Enc(X)Enc(Y) = Enc(XY). That is, the decryption step recovers the product of the original values. A modification known as exponential Elgamal makes the process additively homomorphic by encrypting values in the exponent: Enc(g^X)Enc(g^Y) = Enc(g^{X+Y}), where g is the generator of a cyclic group. Such techniques have been used in secure applications such as electronic voting.16 Our main intuition here is to leverage the additively homomorphic feature of Elgamal encryption to count the common zero bits securely. We define this secure protocol as follows:
Algorithm 1. Protocol for secure computation of private set union cardinality.
The above algorithm describes a secure two-party protocol to compute the union cardinality for BF(S1) and BF(S2). Site 1 first generates Elgamal keys, where g is the generator of a cyclic group of order P, α is the private key, and β = g^α is the public key. To leverage the additively homomorphic property, the zero bits and one bits of site 1's Bloom filter are replaced by g^1 and g^0, respectively. The exponent of g then essentially functions as a counter when site 2 performs multiplications on the ciphertext it receives. Because site 2 performs multiplications only at its own zero-bit positions, the exponent of g records the number of zero bits in the union of the two Bloom filters, i.e., z. This ensures the decryption recovers g^z. To determine the exponent value, one could employ a brute-force approach, comparing the result with g^i for every i between 1 and m; more optimized techniques are also available.17 In the end, the union cardinality (the estimated number of distinct patients across the two cohorts) can be computed using Equation (4).
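The counting trick in Algorithm 1 can be sketched with a toy exponential Elgamal implementation. The parameters below (a small Mersenne prime, tiny 8-bit Bloom filters) are deliberately insecure and chosen purely for illustration; a production deployment would use the elliptic curve variant discussed below.

```python
import random

# Toy modular-arithmetic exponential Elgamal (illustration only).
P = 2**61 - 1   # a Mersenne prime, standing in for a proper large group
G = 3           # group element used as generator

def keygen():
    alpha = random.randrange(2, P - 1)     # private key
    return alpha, pow(G, alpha, P)         # public key beta = g^alpha

def enc(m_val, beta):
    r = random.randrange(2, P - 1)
    return pow(G, r, P), (m_val * pow(beta, r, P)) % P

def dec(ct, alpha):
    c1, c2 = ct
    return (c2 * pow(c1, P - 1 - alpha, P)) % P   # c2 / c1^alpha

def ct_mul(a, b):
    # Homomorphic property: Enc(x) * Enc(y) = Enc(x * y)
    return (a[0] * b[0] % P, a[1] * b[1] % P)

# Step 1 (site 1): encrypt g^1 for each zero bit, g^0 for each one bit.
bf1 = [0, 1, 0, 0, 1, 1, 0, 1]
bf2 = [0, 0, 1, 0, 1, 0, 1, 1]
alpha, beta = keygen()
cts = [enc(pow(G, 1 - b, P), beta) for b in bf1]

# Step 2 (site 2): multiply ciphertexts at its own zero-bit positions;
# the exponent of g accumulates the count of common zero bits.
acc = enc(1, beta)                 # Enc(g^0)
for ct, b in zip(cts, bf2):
    if b == 0:
        acc = ct_mul(acc, ct)

# Step 3 (site 1): decrypt g^z and recover z by brute-force comparison.
gz = dec(acc, alpha)
z = next(i for i in range(len(bf1) + 1) if pow(G, i, P) == gz)
print(z)  # common zero bits at positions 0 and 3 -> 2
```

Site 2 never sees site 1's bits (each is individually encrypted), and site 1 learns only the aggregate count z, mirroring the security argument made in the Discussion.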
The above security protocol is described using Elgamal encryption in modular arithmetic mode. A major drawback for such an implementation is that the key size must be large enough (2048 bits per the current NIST recommendation18) to satisfy security requirements. This leads to runtime performance that is too slow for practical use. To address this, we choose an elliptic curve Elgamal implementation, which provides the required security, but with shorter keys and much faster execution, as shown below. When the elliptic curve is applied, homomorphic addition is performed directly, with no need to perform exponentiation. Other authors19,20 describe the properties of elliptic curve Elgamal systems.
Note that Egert and associates13 present a very similar solution. Our approach differs primarily in step 2, where we omit the use of a one-time ephemeral key as unnecessary in this application, as discussed below. We assert that the correctness and security proofs provided in this prior work also apply to the approach presented here. Crucially, our implementation incorporates the elliptic curve optimization to make its use practical.
Results
In this section, we conduct a systematic analysis to evaluate the overall accuracy and performance of our method. Regarding accuracy, we conduct a systematic survey to assess the impact of various Bloom filter parameters: cohort size n, filter size m, and the number of hash functions k. As shown in Equation (1), given the observed number of zero bits in the Bloom filter, the original cohort size can be approximated. The accuracy can be quantified as the difference between the approximated size n̂ and the actual size n. Regarding performance, we measure the running time for the proposed security protocol, decomposed into the three steps described in Algorithm 1.
In our analysis, we define the ranges for the surveyed parameters n, m and k by leveraging domain characteristics of healthcare data. The objective here is to confine the parameters to plausible ranges and then fine-tune them to obtain the desired accuracy. Medium-to-large healthcare organizations typically have patient counts measured on the order of millions. In our simulated ACT workflow, however, during the initial site-specific cohort query, the sizes of the input cohorts have been reduced by filtering on specified attributes such as age, visit time, disease conditions, etc. We therefore survey the parameter landscape using realistically sized cohorts: “100K” represents a large cohort, “3K” a medium-small cohort, “300” a small cohort, and “10” a rare disease cohort. In addition, we include a synthesized medium cohort called “27K,” which we use to conduct performance evaluation. The “27K” refers to the joint cohort from two 15K cohorts, with 3K overlapping. Note the different sizes used here refer to the projected joint cohort size across two institutions, not the cohort size within one institution (or the naïve sum of the original site-specific sizes). For example, “3K” can be conceived as the joint cohort of two 1.8K cohorts, with 0.6K overlapping. Also, as long as the parties involved adopt the same Bloom filter parameters, such an accuracy analysis can be carried out directly using the Bloom filter computed from the projected joint cohort, as indicated by Equation (2).
We generate synthetic patient identifiers using a Python synthetic data module for the five cohort types described above. The input token to the Bloom filter is constructed by concatenating full name, birthday, gender, and race. In our experiments, we first choose Bloom filters of three different sizes (m): 2^20, 2^21, and 2^22 bits. As described below, Bloom filters of such sizes produce reasonably small error for the joint cohort size approximations. An additional 2^23-bit Bloom filter is introduced only for the 27K cohort and performance testing. In addition, we vary the number of hash functions (k) used from 5 to 70 at an interval of 5. Thus, we have 56 different sets of Bloom filter parameters (based on 4 different m’s and 14 different k’s) to measure the accuracies for joint cohort size approximation.
For each of these Bloom filters, a set of hash functions must be chosen. For example, when m = 2^20 and k = 5, five hash functions whose output domain is [0, 2^20) are required. Here we construct the hash functions needed by taking five groups of 20 bits directly from SHA512. Our rationale is that, assuming SHA512 is a perfectly random hash function, adjacent bits are independent. Therefore, we can use, for example, bits 1–20, 21–40, 41–60, 61–80, and 81–100 of the SHA512 digest as hash1 through hash5. In our experiments, we run 100 trials for each set of Bloom filter parameters. To use different hash functions in each trial, a random salt is applied each time.
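The digest-slicing construction above can be sketched as follows. The token format (name|birthday|gender|race) and the salt handling are illustrative; the paper's actual tokenization may differ in detail.

```python
import hashlib

def sha512_hashes(token, k, w, salt=""):
    # Slice a salted SHA-512 digest into k independent w-bit hash values.
    # Assumes k * w <= 512, as in the construction described above.
    digest = int.from_bytes(hashlib.sha512((salt + token).encode()).digest(), "big")
    mask = (1 << w) - 1
    return [(digest >> (i * w)) & mask for i in range(k)]

# Five 20-bit hash functions for an m = 2^20 Bloom filter.
values = sha512_hashes("JANE DOE|1980-01-01|F|WHITE", k=5, w=20)
print(len(values), all(0 <= v < 2**20 for v in values))  # 5 True
```

Changing the salt yields a fresh family of hash functions for each trial, which is how the 100 trials per parameter set are made independent.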
Table 1 records the average absolute error from each of the 100 trials over different sets of Bloom filter parameters. At the end of each trial, the approximated joint cohort size (denoted n̂) is computed using the number of zero bits, k, and m, as indicated by Equation (4). For the two small cohorts (300 and 10), the approximation produces essentially no error at all. For the 3K cohort, the approximation error is bounded under 2. With carefully chosen values for k and m, the two larger cohorts can be approximated with a target accuracy. We observe the general trend that larger Bloom filters (m) produce more accurate approximations. Additional hash functions (k), however, have more variable effects: at times only causing small fluctuations and at other times producing large swings in the approximation error.
Table 1.
Effect of varying Bloom filter parameters on the joint cohort size approximation error (absolute value).
| Cohort | BF size (m) | k = 5 | k = 10 | k = 15 | k = 20 | k = 25 | k = 30 | k = 35 |
|---|---|---|---|---|---|---|---|---|
| 100K | 2^20 | 59.1150 | 72.7650 | 79.3392 | 86.8683 | 106.0820 | 103.4911 | 112.5107 |
| | 2^21 | 42.5947 | 42.4416 | 45.2651 | 59.3722 | 48.2498 | 47.1774 | 56.3000 |
| | 2^22 | 28.4333 | 31.8758 | 31.0468 | 29.1293 | 33.9877 | 32.5676 | 28.2345 |
| 27K | 2^20 | 17.1687 | 14.3976 | 15.9641 | 18.1680 | 18.3876 | 16.6319 | 16.6352 |
| | 2^21 | 11.7782 | 11.8477 | 11.7399 | 9.8007 | 13.4917 | 11.1053 | 11.9843 |
| | 2^22 | 7.7362 | 7.8110 | 7.4603 | 7.9397 | 8.4320 | 7.8703 | 8.1500 |
| | 2^23 | 5.0591 | 4.8840 | 5.3341 | 5.6396 | 4.9046 | 5.9379 | 5.8194 |
| 3K | 2^20 | 1.8284 | 1.8424 | 1.8066 | 1.8051 | 1.8041 | 1.6581 | 1.9110 |
| | 2^21 | 1.3190 | 1.2602 | 1.4094 | 1.1475 | 1.2309 | 1.1529 | 1.1945 |
| | 2^22 | 0.9129 | 0.8501 | 1.0611 | 0.7915 | 0.8765 | 0.9656 | 0.9088 |
| 300 | 2^20 | 0.1718 | 0.1866 | 0.1495 | 0.1695 | 0.1662 | 0.1624 | 0.1799 |
| | 2^21 | 0.1217 | 0.1274 | 0.1239 | 0.1350 | 0.1015 | 0.1168 | 0.1049 |
| | 2^22 | 0.0809 | 0.0820 | 0.0907 | 0.0781 | 0.0881 | 0.0878 | 0.0796 |
| 10 | 2^20 | 0.0002 | 0.0005 | 0.0014 | 0.0025 | 0.0027 | 0.0020 | 0.0035 |
| | 2^21 | 0.0001 | 0.0012 | 0.0004 | 0.0010 | 0.0018 | 0.0017 | 0.0019 |
| | 2^22 | 0.0001 | 0.0001 | 0.0002 | 0.0002 | 0.0011 | 0.0007 | 0.0007 |

| Cohort | BF size (m) | k = 40 | k = 45 | k = 50 | k = 55 | k = 60 | k = 65 | k = 70 |
|---|---|---|---|---|---|---|---|---|
| 100K | 2^20 | 133.6641 | 140.8472 | 157.6722 | 222.0954 | 240.5228 | 259.3404 | 323.5785 |
| | 2^21 | 65.7969 | 58.5237 | 58.2900 | 67.6420 | 73.1976 | 72.0229 | 79.0250 |
| | 2^22 | 32.5583 | 33.8773 | 37.4493 | 35.3310 | 34.3872 | 34.5136 | 37.3122 |
| 27K | 2^20 | 20.8807 | 18.8007 | 18.9996 | 17.2436 | 20.9258 | 19.3951 | 21.6140 |
| | 2^21 | 12.0104 | 12.1935 | 12.9939 | 11.6454 | 12.0297 | 12.8091 | 12.1870 |
| | 2^22 | 8.8822 | 8.9735 | 7.3361 | 7.7145 | 6.9361 | 7.8177 | 7.6313 |
| | 2^23 | 5.5293 | 4.9521 | 5.4996 | 5.4670 | 5.8028 | 5.7409 | 5.5582 |
| 3K | 2^20 | 1.7643 | 1.8360 | 1.7693 | 1.8842 | 1.5426 | 1.9212 | 1.8256 |
| | 2^21 | 1.1981 | 1.2312 | 1.0481 | 1.4357 | 1.0894 | 1.2089 | 1.2983 |
| | 2^22 | 0.8335 | 0.8151 | 0.9473 | 0.7925 | 0.8839 | 0.9478 | 0.7537 |
| 300 | 2^20 | 0.1664 | 0.1834 | 0.1633 | 0.1855 | 0.1434 | 0.1730 | 0.1781 |
| | 2^21 | 0.1087 | 0.1182 | 0.1042 | 0.1264 | 0.1217 | 0.1138 | 0.1161 |
| | 2^22 | 0.0877 | 0.0870 | 0.0839 | 0.0743 | 0.0770 | 0.0796 | 0.0954 |
| 10 | 2^20 | 0.0045 | 0.0040 | 0.0043 | 0.0043 | 0.0050 | 0.0051 | 0.0061 |
| | 2^21 | 0.0015 | 0.0017 | 0.0024 | 0.0025 | 0.0041 | 0.0030 | 0.0028 |
| | 2^22 | 0.0010 | 0.0007 | 0.0010 | 0.0017 | 0.0015 | 0.0025 | 0.0018 |
To study these effects further, we carry out an experiment using the 3K cohort. We extend the range of Bloom filter sizes to span from 2^12 bits to 2^22 bits, and we choose three different values for k. As shown in Figure 1, when m is fixed, using more hash functions eventually sharply increases the error: when the Bloom filter becomes saturated (i.e., has very few zero bits), the approximation is no longer reliable. The approximation method fails completely when k is so large that no zero bits are left. Therefore, it is crucial to find a combination of m and k that avoids Bloom filter saturation.
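The saturation effect can be seen directly from Equation (1): for a fixed m, the expected number of zero bits decays exponentially in k. A minimal sketch, using the 3K cohort and an assumed small 2^12-bit filter for emphasis:

```python
import math

def expected_zero_bits(n, m, k):
    # Equation (1): expected number of zero bits after inserting n elements
    # into an m-bit Bloom filter with k hash functions.
    return m * (1 - 1 / m) ** (k * n)

# A 3K cohort in a small 2^12-bit filter: zero bits vanish as k grows,
# and Equation (4) becomes unusable once essentially no zero bits remain.
n, m = 3000, 2**12
for k in (1, 5, 10, 20):
    print(k, round(expected_zero_bits(n, m, k), 2))
```

By k = 20 the expected zero-bit count falls below one, so the filter is saturated and the cardinality estimate collapses, which matches the sharp error increase seen in Figure 1.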
Figure 1.
How Bloom filter parameters m and k can be chosen to emulate obfuscation factor used in ACT.
In Figure 1 we note one configuration of particular interest. A Bloom filter with m = 2^17 and k = 30 produces an average absolute error of 5. In the current nationwide ACT network, a default obfuscation factor is chosen from the uniform distribution over [-10, 10] and applied to all reported site counts,11,12 which corresponds to an expected absolute error of 5. Similarly, an average absolute error of 10 emulates an obfuscation factor from the equivalent range [-20, 20]. Therefore, using these Bloom filter settings for the “3K” cohort provides a joint cohort size estimate n̂ obfuscated in a manner similar to what is currently seen in ACT. Other obfuscation factors may also be targeted.
For our performance test, we observe the median cohort sizes returned for all production queries received by our own ACT node to be approximately 15,000. We therefore synthesize two sets of patients of this size with 20% (3000) overlapping records. As this results in a total joint cohort size of 27,000, we call this cohort “27K” in Table 1.
The 20% overlap was chosen to reflect the average percentage reduction in three cohorts studied by Kho and associates.1 Using the heuristic described above, we conducted a parameter search to obtain the ideal m and k values for this cohort size. Setting m = 2^23 and k = 45 produces an average absolute error of 4.95, which is closest to 5 among all the parameter sets and corresponds to an obfuscation factor in the range [-10, 10]. Our goal here is to produce results that are indistinguishable from a real-world ACT cohort query with appropriate obfuscation.
Using these parameters with our 27K synthesized data, we measure the performance of our proposed two-party secure protocol. The Bloom filters for both sites are trivially computed in a few seconds. The cost of the Elgamal encryption, however, depends on how it is implemented. For our initial proof of concept, we naïvely computed modular arithmetic operations per the original Algorithm 1. Using this approach, step 1, the most time-consuming step, takes more than one day. With an optimized elliptic curve solution (using a 256-bit elliptic curve key to provide the same level of security as a 3072-bit RSA key), steps 1–3 take 49.4 minutes, 26.5 minutes, and 6.2 milliseconds, respectively. So overall, the protocol can be completed for two input sets of size 15,000 with 20% overlap in 76 minutes. General running time can be extrapolated linearly. For example, the step-1 encryption time using a 2^22-bit Bloom filter should be half the current running time because the number of encryptions is reduced by half.
Our experiments were conducted using an open-source elliptic curve Elgamal Java library,21 operating over OpenSSL 1.1.1a, on the prime256v1 elliptic curve. The experiments were performed on a system with an Intel i7-8750H 2.2 GHz CPU and 16 GB of memory.
Discussion
In this section, we discuss how our approach can enable more accurate joint cohort discovery as an extension to the existing ACT workflow. We also contemplate additional data governance and approval steps that might be required for this extension. The complete extended workflow is illustrated in Figure 2. In the current ACT network, participating sites connect to a central hub, through which they submit and respond to cohort queries. The workflow ends when the isolated cohort counts from the participating sites are returned to the querying site. Through our protocol, we enable a “phase II” extension of the existing workflow for cross-site joint cohort discovery.
Figure 2.
Extending current ACT workflow into joint discovery phase.
Armed with the compiled list of cohort counts, the requester may identify sites as targets for potential joint cohort discovery. They might select institutions located nearby or those known to specialize in the diseases identified in phase I. Such institutions would be candidates for more accurate joint cohort discovery because patients are likely to visit multiple healthcare providers that are geographically close to one another, or to travel to visit specialist physicians no matter where those specialists are located. Both situations could lead to significant cohort overlap. In our illustration, such a site is institution X, and the requester is from institution A.
Consider the 27K cohort example. At this stage we assume Nlocal = NA = 15000 and NX = 15000. That is, institutions A and X both have 15000 patients matching the query A issues. The joint cohort discovery phase begins when the requester submits a request to the ACT hub to pursue joint cohort discovery based on the results from phase I. The hub then checks whether such a request conforms to its own policy and forwards it to institution X once approved. Upon receiving this request, X may need to check with its local IRB for review. Although the subsequent workflow only calculates union cardinality, and no patient-level linkage information between A and X is exposed, since A knows NA and learns an obfuscated NX during phase I, it can estimate the proportion of its own cohort who also visit X as |SA ∩ SX|/|SA|, where |SA ∩ SX| = NA + NX − n̂ and n̂ approximates the set union cardinality |SA ∪ SX|. Revealing such information (while minimal) might be concerning to the IRB of X because it reveals provider information about its patients at a probabilistic level. In our 27K example, A will learn that the patients in its own cohort have approximately a 20% chance of also using X as their healthcare provider. Thus, only if the IRB of X approves such a level of disclosure should phase II continue.
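The inclusion-exclusion arithmetic behind this disclosure estimate is simple enough to work through directly, using the 27K walk-through figures:

```python
# Phase-I counts for the 27K walk-through: both sites report 15,000
# matching patients, and phase II yields an estimated 27,000-patient union.
N_A, N_X = 15000, 15000
n_hat = 27000                       # estimated |S_A ∪ S_X|

intersection = N_A + N_X - n_hat    # inclusion-exclusion gives |S_A ∩ S_X|
overlap_fraction = intersection / N_A
print(intersection, overlap_fraction)  # 3000 0.2
```

This is exactly the 20% figure that site X's IRB must weigh when deciding whether to approve phase II.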
Once the ACT hub receives approval from X, it proceeds to determine the appropriate Bloom filter parameters. As demonstrated previously, the choice of parameters depends on the desired approximation accuracy. Notice that in the performance test using 27K, we use the exact |SA ∪ SX|, but the ACT hub must estimate this value within [max(|SA|, |SX|), |SA| + |SX|], the theoretically possible range for |SA ∪ SX|. Note that the values for |SA| and |SX| were already obfuscated in phase I. The ACT hub therefore only provides an estimate for |SA ∪ SX|. Using Table 1, the ACT hub then extrapolates the needed Bloom filter parameters for the desired error level. In the 27K example, we represent the approximation error as the obfuscation factor, thereby demonstrating that our algorithm can provide almost identical behavior to the current ACT network. (We note in passing that fixing the so-called obfuscation factor to a value in the range [-10, 10] is rather more arbitrary than scientific. More rigorous and quantifiable privacy guarantees could be enforced at this point in the protocol.)
Klann and associates22 have described an ad hoc principle for dynamically determining the appropriate amount of random noise. The data we present in Table 1 can also be used to interpolate Bloom filter parameters to meet a specific level of accuracy, which is functionally equivalent to an obfuscation factor. Such a strategy of interpolating Bloom filter parameters could be deployed at the ACT hub to determine suitable sets of parameters programmatically. A possible enhancement might be to allow the desired accuracy to be negotiated by the two parties involved. As a general guideline, higher accuracy always requires a larger Bloom filter.
Papapetrou and associates10 provide a closed form for the probability that set cardinality, as reflected by the number of one bits in the set's Bloom filter representation, falls within a specified range. Following their notation, let n_t be the expected number of elements in a Bloom filter given that t bits are set to 1. They establish

(5)   n_t = ln(1 − t/m) / (k · ln(1 − 1/m))

and derive a closed-form lower bound, their Equation (6), on the probability that the set cardinality lies between n_l = n_{t_l} and n_r = n_{t_r} for chosen thresholds t_l < t < t_r; we refer readers to their work for the bound itself.
Thus, after site A calculates the set union cardinality, it can reason probabilistically about the amount of obfuscation actually afforded by the Bloom filter. Despite our best estimates, the parameters m and k selected by the hub may produce a Bloom filter that is too accurate. If so, site A may infer the exact cardinality to a high degree of certainty simply by evaluating this probability bound. Clearly, the desire for accurate counts and the need to obscure counts are at cross purposes. The inherent ambiguity in Bloom filters guarantees some level of obfuscation. In some cases, however, this obfuscation factor is vanishingly small, and transparently so. This fact should be part of site X's IRB calculus, and mitigating this issue is a topic for future work.
At the end of the joint cohort discovery phase, optionally (and perhaps in the interests of fairness), A may share the estimate n̂ with X through the ACT hub. However, doing so may raise additional data governance issues. Also, while the main objective of phase II is to solve the overcounting problem, by applying some modified strategies in phase I it is also possible to mitigate undercounting. For example, if A specializes in disease C and X specializes in disease D, then using a C ∧ ¬D query at institution A and a D ∧ ¬C query at institution X, the two cohorts obtained could be processed in phase II to estimate the cohort size for patients who have both C and D; such patients would likely be undercounted during phase I using a generic C ∧ D query. However, as this strategy also raises governance concerns, undercounting remains a topic for future work.
With regard to the security aspects of the phase II extension, we use the secure cryptographic algorithms Elgamal and elliptic curve cryptography. The sensitive information (highlighted in red in Figure 2) includes subject PHI and the unencrypted Bloom filters. However, these sensitive data never leave the local institutions. The only information crossing the boundary of institution A is the encryption of A's Bloom filter. Since each bit in A's Bloom filter has been individually encrypted, institution X cannot tell whether it is a zero bit or a one bit. As the workflow continues, institution X performs the homomorphic additions and obtains one final ciphertext that no one but institution A can decrypt, because only A holds the required Elgamal private key. Equally important, institution X's Bloom filter is also sufficiently concealed after the additively homomorphic operations are executed. Even if provided with the complete input and output of the homomorphic computation, it is computationally infeasible to recover the zero-bit positions in institution X's Bloom filter. Assuming the number of these zero bits is z, it takes on the order of C(m, z) trials to determine the indices of the zero bits by brute force. Given the relatively large Bloom filter created from any realistically sized cohort, as well as the fact that only institution X knows the exact value of z, such an attack is beyond the reach of any computationally bounded adversary.
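The size of this brute-force search space is easy to quantify. A minimal sketch, using an assumed 2^23-bit filter in which roughly half the bits are zero (the exact z is known only to X), computes log2 of C(m, z) via the log-gamma function to avoid materializing an enormous integer:

```python
import math

def log2_comb(m, z):
    # log2 of the binomial coefficient C(m, z), computed via lgamma
    # so we never build the (astronomically large) integer itself.
    return (math.lgamma(m + 1) - math.lgamma(z + 1)
            - math.lgamma(m - z + 1)) / math.log(2)

# Assumed figures: a 2^23-bit filter with about 4 million zero bits.
m, z = 2**23, 4_000_000
print(log2_comb(m, z) > 128)  # True: the search space dwarfs a 128-bit keyspace
```

Even this single configuration yields a search space millions of bits wide, which supports the claim that enumerating zero-bit placements is infeasible for any computationally bounded adversary.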
Conclusion
Given the above discussion, we assert our proposed approach is sufficiently secure to be used in practice, such as in the actual ACT deployment. In the 27K example, the total encryption and decryption time for institution A is about 49.4 minutes, and the total homomorphic computation time for institution X is 26.5 minutes. This indicates that the likely bottleneck is the additional local approval and ACT hub approval at the beginning of phase II. Given these facts, we also assert our proposed approach is sufficiently efficient to be used in practice. A software demo for our presented workflow, including the dataset and programs to reproduce our results, can be found on GitHub.1 Our implementation uses only standard Python, Java, and open-source libraries. No special vendor software is required. Therefore, it is easy for participating ACT institutions to adopt this approach locally.
Acknowledgement
This research was supported by NCATS CTSA grant UL1TR002003 and NCATS ACT grant UL1TR000005.
Footnotes
See https://github.com/dongxiao/UnionCardinality.
References
- 1. Kho AN, Cashy JP, Jackson KL. Design and implementation of a privacy preserving electronic health record linkage tool in Chicago. J Am Med Inform Assoc. 2015;22(5):1072–1080. doi:10.1093/jamia/ocv038.
- 2. Dong X, Randolph DA, Kolar Rajanna SK. Enabling privacy preserving record linkage systems using asymmetric key cryptography. AMIA Annu Symp Proc. 2019. To appear.
- 3. Boyd AD, Saxman PR, Hunscher DA. The University of Michigan honest broker: a web-based service for clinical and translational research and practice. J Am Med Inform Assoc. 2009;16(6):784–791. doi:10.1197/jamia.M2985.
- 4. Visweswaran S, Becich MJ, D’Itri VS. Accrual to Clinical Trials (ACT): a clinical and translational science award consortium network. JAMIA Open. 2018;1(2):147–152. doi:10.1093/jamiaopen/ooy033.
- 5. Weber GM, Murphy SN, McMurry AJ. The Shared Health Research Information Network (SHRINE): a prototype federated query tool for clinical data repositories. J Am Med Inform Assoc. 2009;16(5):624–630. doi:10.1197/jamia.M3191.
- 6. Raisaro JL, Klann JG, Wagholikar KB. Feasibility of homomorphic encryption for sharing i2b2 aggregate-level data in the cloud. AMIA Jt Summits Transl Sci Proc. 2018:176–185.
- 7. Murphy SN, Weber G, Mendis M. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J Am Med Inform Assoc. 2010;17(2):124–130. doi:10.1136/jamia.2009.000893.
- 8. Bloom BH. Space/time trade-offs in hash coding with allowable errors. Commun ACM. 1970;13(7):422–426.
- 9. Michael L, Nejdl W, Papapetrou O. Improving distributed join efficiency with extended Bloom filter operations. In: International Conference on Advanced Information Networking and Applications; 2007; Niagara Falls, Ontario. p. 187–194. doi:10.1109/AINA.2007.80.
- 10. Papapetrou O, Siberski W, Nejdl W. Cardinality estimation and dynamic length adaptation for Bloom filters. Distrib Parallel Databases. 2010;28(2):119–156. doi:10.1007/s10619-010-7067-2.
- 11.Murphy SN, Chueh HC. Proc AMIA Symp. 2002. A security architecture for query tools used to access large biomedical databases; pp. 552–556. [PMC free article] [PubMed] [Google Scholar]
- 12.University of Colorado. ACT network FAQ’s. http://www.actnetwork.us/national/faqs-46EU-1377S2.html. Published 2019. [Google Scholar]
- 13.Egert R, Fischlin M, Gens D. Foo E, Stebila D, eds. Australasian Conference on Information Security and Privacy. Brisbane, Australia: 2015. Privately computing set-union and set-intersection cardinality via Bloom filters; pp. 413–430. doi:10.1007/978-3-319-19962-7. [Google Scholar]
- 14.Broder A, Mitzenmacher M. Network applications of Bloom filters: a survey. Internet Math. 2004;1(4):485–509. doi:10.1080/15427951.2004.10129096. [Google Scholar]
- 15.Miyaji A, Nakasho K, Nishida S. Privacy-preserving integration of medical data: a practical multiparty private set intersection. J Med Syst. 2017;41(3):1–10. doi: 10.1007/s10916-016-0657-4. doi:10.1007/s10916-016-0657-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Bernhard D, Warinschi B. Lect Notes Comput Sci (including Subser Lect Notes Artif Intell Lect Notes Bioinformatics) 2014. Cryptographic voting — a gentle introduction. doi:10.1007/978-3-319-10082-1_7. [Google Scholar]
- 17.Shafagh H, Hithnawi A, Burkhalter L. SenSys 2017 - Proceedings of the 15th ACM Conference on Embedded Networked Sensor Systems. 2017. Secure sharing of partially homomorphic encrypted IoT data. doi:10.1145/3131672.3131697. [Google Scholar]
- 18.Barker EB, Dang QH. 2015. Recommendation for Key Management Part 3: Application-Specific Key Management Guidance. [Google Scholar]
- 19.Rabah K. Inf Technol J. 2005. Elliptic curve ElGamal encryption and signature schemes. doi:10.3923/itj.2005.299.306. [Google Scholar]
- 20.Cerveró MÀ, Mateu V, Miret JM. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 2014. An efficient homomorphic e-voting system over elliptic curves. doi:10.1007/978-3-319-10178-1_4 6. [Google Scholar]
- 21.Burkhalter L. Additive homomorphic EC-ElGamal. [Internet]. Available from: https://github.com/lubux/ecelgamal. [Google Scholar]
- 22.Klann JG, Joss M, Shirali R. AMIA Jnt Summits Transl Sci Proc. >San Francisco: 2018. The ad-hoc uncertainty principle of patient privacy. [PMC free article] [PubMed] [Google Scholar]