Enabling Privacy Preserving Record Linkage Systems Using Asymmetric Key Cryptography

Xiao Dong; David A Randolph; Subhash Kolar Rajanna

. 2020 Mar 4;2019:380–388.

PMCID: PMC7153159 PMID: 32308831

Abstract

We present a systematic approach to devise and deploy Privacy Preserving Record Linkage (PPRL) systems using asymmetric key cryptography and illustrate the strengths of such an approach. With our approach, the security implications of sharing a common secret salt across the network may be avoided, allowing the local participating sites to use private keys along with the current cryptographic hashes to maximally secure their own data. In addition, the final ciphertext tokens are compatible with those used by existing record linkage modules, allowing seamless integration with the existing PPRL infrastructures for downstream analysis. Finally, study-specific hash production requires action only by the central party. The main intuition for this work is derived from how asymmetric key approaches have enabled internet-scale applications. We demonstrate that such a design, where the local sites no longer need special-purpose software, affords greater flexibility and scalability for large-scale multi-site linkage studies.

Introduction

Privacy preserving data sharing for collaborative research across different organizations has recently become a significant challenge in healthcare research¹. Data in electronic health records for a single patient may be spread across a number of different healthcare providers. After the patient visits different hospitals and clinics, their electronic health records are registered at the respective data repositories. Harnessing such disjoint information is crucial for establishing a complete longitudinal picture of the patient’s treatment history, assessing the patient’s comprehensive health condition, and improving overall healthcare quality. Complete datasets could be assembled by linking these disparate records; however, doing so directly requires matching certain patient records and may put patients’ privacy at significant risk. Relevant laws and regulations also restrict the sharing of protected health information (PHI) without prior consent from patients. Thus, secure and efficient technologies for privacy preserving record linkage (PPRL)²,³,⁴ are needed to balance accurate data linkage against privacy protection.

To date, the predominant approach for PPRL is to use cryptographic hashing, such as the SHA-2 family of algorithms⁵,⁶, to conceal PHI. Such a cryptographic hash method is suitable here because it performs a cryptographic transformation, after which the sensitive patient information is converted into hashes, a kind of ciphertext virtually indiscernible from random text strings. Another key property of such hashing algorithms is that they work effectively as one-way functions, where the ciphertext (hashes) can be efficiently generated from plaintext but reverting back to plaintext is practically impossible. These two key properties ensure that sensitive identifying information is protected after the hash function is applied. Because the hash function is also deterministic, the same identifying data elements are always transformed into the same hashes. As a result, the original patient data is de-identified into anonymized tokens and may then be transmitted outside the institution boundary for linkage analysis studies. Such techniques have been deployed in several multi-site networks as the principal PPRL approach. Several clinical data research networks (CDRNs)⁷,⁸,⁹ funded by the national Patient-Centered Outcomes Research Institute (PCORI)¹⁰,¹¹ serve as prominent examples. Within these research networks, the cryptographic hashes are first generated at different local institutions to de-identify the records, and then submitted to a trusted central party where a designated honest broker performs the tasks of record linkage and deduplication.

Because protecting patient privacy is paramount, it is important to ensure that PPRL systems achieve high security standards. In principle, properly implemented hash functions (such as SHA-256 and SHA-512) are noninvertible. That is, one cannot reverse engineer the hashing transformation to reveal the identifying data elements. However, a well-known attack against them is the so-called “dictionary attack”, where an attacker uses precomputed tables to recover the plaintext. For example, if some possible contents are known to be contained in the inputs to the hashing function (such as name, date of birth, etc.), an attacker can construct rainbow lookup tables by producing hashes from the complete set of potentially valid input values. To reduce such security risks, a common approach is to use a “salt,” a random text string added to the original input text. Using a salt renders building the lookup table practically impossible because the attacker does not know the salt. For PPRL systems relying on cryptographic hash functions, one common salt needs to be used and shared with all participating sites¹². This raises security concerns because, if an attacker obtains the salt, a dictionary attack is again possible. Because the salt is common, this security risk applies to all the participants. Therefore, protecting the common salt is essential to the security of hash-based PPRL systems. It is also worth noting that such security concerns apply whenever a common secret is used, including the “passwords” in PPRL systems using Bloom filters¹³.
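The dictionary attack and the effect of a salt can be illustrated with a minimal sketch. The record format, the candidate list, and the salt value below are hypothetical and chosen only for illustration; a plain lookup table stands in for a rainbow table, and the salt is simply prepended to the input, which is one of several possible conventions.

```python
import hashlib

def token(identifier: str, salt: str = "") -> str:
    """SHA-256 hex digest of the (optional) salt prepended to the identifier."""
    return hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()

# Hypothetical identifiers an attacker could enumerate (name | date of birth).
candidates = ["JANE DOE|1980-01-31", "JOHN SMITH|1975-07-04", "MARY JONES|1990-12-25"]
NETWORK_SALT = "s3cret-network-salt"  # hypothetical common salt

# The attacker's precomputed table over unsalted hashes (the "dictionary").
lookup = {token(c): c for c in candidates}

unsalted = token("JOHN SMITH|1975-07-04")
salted = token("JOHN SMITH|1975-07-04", salt=NETWORK_SALT)

recovered_unsalted = lookup.get(unsalted)  # hit: the plaintext is recovered
recovered_salted = lookup.get(salted)      # miss: the table was built without the salt

# Once the common salt leaks, the attacker rebuilds the table with it.
leaked_lookup = {token(c, salt=NETWORK_SALT): c for c in candidates}
recovered_after_leak = leaked_lookup.get(salted)
```

The last two lines show why a shared salt is a single point of failure: one leak re-enables the attack against every site that used it.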

To meet such challenges, proper security measures for safeguarding the secret salt have been developed. Typically, the secret salt needs to be securely transmitted to the local hosts where the hashing will take place. These hosts have specifications that meet HIPAA security standards, and only authorized employees in the covered entities have access to them. Such an approach becomes unwieldy when the network gets large, however, because all participating sites need first to obtain institutional approval and then securely obtain the hashing application. It becomes especially problematic when large-scale projects are conducted across networks that adopt different PPRL software products, since the different vendors then need to agree on which common secret salt to use, to disseminate it, and to ensure all parties involved are in compliance with the security protocol. This inevitably leads to technical and policy challenges related to salt distribution and management. Such challenges may vary for different research networks, but they will likely always exist and present significant hurdles for scaling up the PPRL solution across a growing number of networks. As PCORI plans to carry out large-scale, cross-network linkage, more efficient solutions are needed to mitigate such challenges.

The question then arises: “Can we design an approach to completely eliminate the security implications of sharing a common secret?” In this paper, we describe a novel approach to conduct privacy preserving record linkage using asymmetric key cryptography. With asymmetric keys, the problem of deploying a common secret salt is eliminated because keys are unique for each local institution. Instead of receiving a common salt, local sites use their own private keys for encryption and securely transfer their individual public keys to the central party. Critically, the final ciphertext produced remains resistant to cryptanalysis and known attacks. An important additional benefit of the approach is that it allows the PPRL system to scale up more easily, as will be demonstrated below.

Here we highlight the key requirements for such a system. In the method section, we describe how our system, dubbed “PPRL-Plus,” fulfills each of these requirements.

A. Enable asymmetric key encryption such that local sites use their own secret keys to encrypt their own data, thus eliminating the need for a common secret salt.
B. Apply strong cryptographic primitives that are sufficiently secure, such as the existing SHA-2 hashing methods and asymmetric key encryption primitives.
C. Allow easy integration into existing PPRL infrastructures by ensuring outputs remain compatible with existing record linkage tools.
D. Produce record identifiers in ciphertext that are still resistant to cryptanalysis and known (e.g., dictionary or rainbow table) attacks.

Method

In cryptography there are essentially two prevailing paradigms: symmetric key cryptography and asymmetric key cryptography¹⁴,¹⁵. In symmetric key systems, the same key is used both for encryption and decryption, whereas in asymmetric key systems different keys are applied. One significant advantage of asymmetric key systems is that only the public key needs to be shared, unlike in symmetric cryptography, in which the entire secret key needs to be shared. The common salt approach currently used in PPRL systems is analogous to symmetric key encryption because the same secret (salt) is shared among the local sites in advance, leading to the security concerns and management burdens discussed above. The main focus of our work is to enhance the existing PPRL systems with asymmetric key encryption in order to eliminate the need for a network-wide common secret, while ensuring encryption is sufficiently secure.

The most important applications of asymmetric cryptography include key exchange and digital signatures. Our main intuition derives from digital signatures; given the scope of this paper, we do not describe key exchange. In digital signatures, the sender first signs the plaintext message using its private key and forwards the encrypted message to the receiver. The receiver then decrypts the message using the sender’s public key in order to verify that it actually originated from the sender. One common approach is to first perform a cryptographic transformation by hashing the plaintext m and then to sign the hash h(m) using the sender’s private key. The receiver then recovers the hashed message h(m) using the sender’s public key for verification purposes. In the context of PPRL, m corresponds to the patient identifying data and h(m) to the hashes produced for record linkage.

Here we briefly go over the well-known RSA algorithm¹⁶ to illustrate the digital signature process. The first step in the RSA algorithm is key generation, where a {private, public} key pair is generated as follows:

1. Choose two large primes p and q of similar magnitude.
2. Calculate $N = p \times q$. (N will be the modulus.)
3. Calculate $\varphi(N) = (p - 1) \times (q - 1)$. (This is the Euler totient function.)
4. Find e, such that $\gcd(e, \varphi(N)) = 1$. (e is the public key.)
5. Find d, such that $d \times e \equiv 1 \pmod{\varphi(N)}$. (d is the private key.)
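The five key generation steps above can be sketched in a few lines of Python. This is a textbook illustration with toy primes, not a secure implementation: the function name `rsa_keygen` and the starting value e = 17 are choices made for this sketch, and it uses the standard condition that e be coprime to φ(N). The modular inverse via `pow(e, -1, phi)` requires Python 3.8 or later.

```python
from math import gcd

def rsa_keygen(p: int, q: int, e: int = 17):
    """Textbook RSA key generation from two given primes p and q (step 1)."""
    n = p * q                   # step 2: the modulus N
    phi = (p - 1) * (q - 1)     # step 3: Euler totient of N
    while gcd(e, phi) != 1:     # step 4: e must be coprime to phi(N)
        e += 2
    d = pow(e, -1, phi)         # step 5: d * e ≡ 1 (mod phi(N))
    return n, phi, e, d

# Toy primes, far too small for real use.
n, phi, e, d = rsa_keygen(61, 53)
print(n, phi, e, d)  # 3233 3120 17 2753
```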

During the signing process, a hashing function is first applied to obtain the hash h(m) for the plaintext m. The digital signature is then calculated using the private key d as signature = h(m)^d mod N. The signature is sent to the receiver along with h(m), which is also typically encrypted. The receiver may then verify the signature by comparing the value of signature^e mod N to h(m). (In practice, the hash value h(m) is subsequently processed according to padding and encoding schemes as specified in standards like PKCS¹⁷ or OAEP¹⁸.)

Note that a given signature also constitutes an encrypted (and hashed) message, provided the h(m) is sent for comparison purposes. But the message is encrypted with a private key and decrypted with a public key, the opposite of what is done in the RSA key exchange algorithm proper. The message is hashed for performance reasons and also to prevent “existential forgeries,” a known security problem with RSA signatures. The receiver is able to verify the signature because the following holds true for properly generated key pairs {e, d} and any non-negative integer k:

$$\left(h(m)^{d}\right)^{e} = h(m)^{de} = h(m)^{k\varphi(N) + 1} = \left(h(m)^{\varphi(N)}\right)^{k} \times h(m) = (1)^{k} \times h(m) = h(m) \pmod{N} \tag{1}$$
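The round trip in equation (1) can be checked numerically. The following is a minimal sketch without padding; the two Mersenne primes are used only to make the example deterministic and are an assumption of this sketch (real keys must use randomly generated primes), as are the helper names `h`, `sign`, and `verify`.

```python
import hashlib
from math import gcd

# Demo-only primes (Mersenne primes 2^521 - 1 and 2^607 - 1); NOT secure.
p, q = 2**521 - 1, 2**607 - 1
n, phi = p * q, (p - 1) * (q - 1)

e = 65537
while gcd(e, phi) != 1:          # make sure e is coprime to phi(N)
    e += 2
d = pow(e, -1, phi)              # d * e ≡ 1 (mod phi(N)); Python 3.8+

def h(message: bytes) -> int:
    """SHA-256 of the message, interpreted as a 256-bit integer."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big")

def sign(message: bytes) -> int:
    """signature = h(m)^d mod N (no padding, illustration only)."""
    return pow(h(message), d, n)

def verify(message: bytes, signature: int) -> bool:
    """Check that signature^e mod N recovers h(m)."""
    return pow(signature, e, n) == h(message)

signature = sign(b"abc")
print(verify(b"abc", signature), verify(b"abd", signature))
```

Because the public exponent recovers exactly h(m), anyone holding e and the signature obtains the unsalted hash.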

Note the digital signature not only uses asymmetric key cryptography but also makes use of a hash function, so it already contains those key components required in our proposed PPRL-Plus system. However, closer scrutiny reveals the last requirement D is not satisfied. The reason is that in the end the hash h(m) is recovered, unlike in the existing PPRL solutions, where the salt has been mixed in to produce h(salt||m). Since unsalted hashes are susceptible to dictionary attack, the requirement is not met yet.

The fundamental transformation scheme we want to perform may be illustrated using the following simple example:

$$\text{Alice: “abc”} \overset{\text{private key}^{a}}{\longrightarrow} \text{“foo”} \overset{\text{public key}^{a}}{\longrightarrow} \text{“XYZ”}$$

$$\text{Bob: “abc”} \overset{\text{private key}^{b}}{\longrightarrow} \text{“bar”} \overset{\text{public key}^{b}}{\longrightarrow} \text{“XYZ”}$$

where $\text{“XYZ”} \nRightarrow \text{“abc”}$ in both cases.

Here Alice and Bob both hold “abc,” which they each transform into different ciphertext tokens (denoted as “foo” and “bar” respectively) at their local sites using private key encryption. These different tokens are transferred to the central party, where they are subsequently transformed into the same final ciphertext token (denoted as “XYZ”) using the appropriate public keys. The crucial requirement for this transformation scheme is that reversing the final ciphertext back to the original text must be infeasible (denoted as $\text{“XYZ”} \nRightarrow \text{“abc”}$). If we apply standard RSA digital signatures, the final tokens will be equal to the original h(m) value “abc,” which is not what is needed. Security concerns arise (in the context of PPRL) if the final ciphertext can be easily reversed back to h(m), because when h(m) does not utilize the secret salt, it is susceptible to attack.

To address this, we present a novel method for key generation, based on RSA, to produce asymmetric key pairs that satisfy special mathematical properties. We couple this with a second component: a compression algorithm to produce de-identified tokens suitable for the existing record linkage software modules. More specifically, these ciphertext tokens have not only the same format (such as the commonly used hexdigest format), but also the same cryptographic properties as regular hashes. We call a ciphertext token with such properties a pseudo-hash. The first component (Algorithm 1) is executed at the local sites, while the second (Algorithm 2) is executed by the central trusted party to generate tokens for linkage analysis.

The PPRL-Plus key generation algorithm follows:

1. Let $|N_{2}|$ be the size of the modulus N in binary. (Default value: 2048.)
2. Let $|h(m)_{2}|$ be the size of the hash in binary. (Default value: 256, when SHA-256 is in use.)
3. Choose $1 < r < \frac{|N_{2}|}{|h(m)_{2}|}$. (Default value: 7.)
4. Execute steps 1 through 3 of RSA Key Generation.
5. Find the key pair {d’, e’}, where $d' \times e' \equiv r \pmod{\varphi(N)}$.
6. d’ is the private key, and e’ is the “protected” key.

Algorithm 1. Modified key generation for PPRL-Plus. The subscripts “2” here denote the binary format.

In this algorithm, we introduce a new integer parameter r, chosen such that the binary width of h(m)^r is guaranteed to be less than that of N (the RSA key size). This holds when the size of h(m)^r in binary is at least $|h(m)_{2}|$ bits less than that of N in binary. For example, with the defaults above, SHA-256 produces hashes h(m) that are exactly 256 bits wide; with r = 7, h(m)^r will be at most 256 × 7 = 1792 bits wide, which is guaranteed to be smaller than the modulus N, a 2048-bit integer. Hence, the following holds true.

$$\left(h(m)^{d'}\right)^{e'} = h(m)^{d' \times e'} = h(m)^{k\varphi(N) + r} = \left(h(m)^{\varphi(N)}\right)^{k} \times h(m)^{r} = h(m)^{r} \bmod N = h(m)^{r}, \quad \forall r : 1 < r < \frac{|N_{2}|}{|h(m)_{2}|} \tag{2}$$
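The modified key generation and the identity in equation (2) can be sketched as follows. The text does not specify how the pair {d’, e’} is found, so this sketch assumes one simple construction: pick e’ coprime to φ(N) and set d’ = r × e’⁻¹ mod φ(N). The function name `pprl_plus_keygen` is hypothetical, and the Mersenne primes are demo-only values (the resulting modulus is 3482 bits rather than the 2048-bit default, and is not secure).

```python
import hashlib
from math import gcd

def pprl_plus_keygen(p: int, q: int, r: int = 7, hash_bits: int = 256):
    """Return (N, phi, d', e') with d' * e' ≡ r (mod phi(N))."""
    n, phi = p * q, (p - 1) * (q - 1)
    # The bound on r keeps h(m)^r strictly below N.
    if not 1 < r < n.bit_length() / hash_bits:
        raise ValueError("r must satisfy 1 < r < |N_2| / |h(m)_2|")
    e = 65537
    while gcd(e, phi) != 1:              # e' must be invertible modulo phi(N)
        e += 2
    d = (r * pow(e, -1, phi)) % phi      # assumed construction; Python 3.8+
    return n, phi, d, e

# Demo-only primes (Mersenne primes 2^1279 - 1 and 2^2203 - 1); NOT secure.
n, phi, d, e = pprl_plus_keygen(2**1279 - 1, 2**2203 - 1, r=7)

hm = int.from_bytes(hashlib.sha256(b"abc").digest(), "big")  # h(m), 256 bits
signature = pow(hm, d, n)        # private hash signature h(m)^d' mod N
recovered = pow(signature, e, n) # protected key yields h(m)^r, not h(m)

print(recovered == hm**7, recovered == hm)
```

Since h(m)^r is smaller than N, the modular result equals the integer power exactly, which is what allows different sites with different moduli to arrive at the same value.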

The above transformation corresponds to the original RSA verification algorithm. The main difference is that, with the key pair {d’, e’}, the original hash h(m) is transformed into h(m)^r. The signing produces hash signatures h(m)^d’ mod N, and it takes place at the sender just as in the original digital signature procedure. Using a properly constructed one-way compression function $\gamma(h)$, h(m)^r can subsequently be converted into a pseudo-hash h(m)’ that satisfies requirement D, as follows:

$$\gamma\left(h(m)^{r}\right) = h(m)', \quad \text{where } h(m)' \nRightarrow h(m) \text{ and } |h(m)'_{2}| = |h(m)_{2}| \tag{3}$$

Here, the following key property must hold for $\gamma(h)$: it takes an input whose binary size is approximately r times the binary size of the original hash h(m), and outputs a pseudo-hash whose binary size is exactly the same as h(m). For example, in our default setting where the original hash is produced by SHA-256 and r = 7, $\gamma(h)$ takes input that is approximately 1792 bits long and outputs a pseudo-hash that is exactly 256 bits long, which is conveniently the same width as a SHA-256 hash value to be consumed by the existing hash matching modules. In most PPRL applications, the hashes are processed into the corresponding hexadecimal format using a digest function. For example, if SHA-256 is used, the outputs will be strings containing 64 hexadecimal characters. In our implementation, we take a number of randomly selected 256-bit blocks from h(m)^r and iteratively apply them as XOR stream ciphers. Due to the pseudo-randomness of the hash function in h(m), each 256-bit block randomly selected from the binary form of h(m)^r can be treated as a pseudo-random number, hence as a stream cipher. More formally, the compression algorithm can be described as follows:

Input
    $h(m)_{2}^{r}$ : $h(m)^{r}$ in binary format
    s : seed for the random number generator (RNG)
    k : number of times the XOR stream cipher is applied

Output
    $h(m)_{16}'$ : pseudo-hash in hexadecimal format

RNG ← s
index ← RNG($|h(m)_{2}^{r}|$)
size ← $|h(m)_{2}|$
block ← $h(m)_{2}^{r}$[index : index + size] (circular)
repeat k times:
    index ← RNG($|h(m)_{2}^{r}|$)
    stream ← $h(m)_{2}^{r}$[index : index + size] (circular)
    block ← block $\oplus$ stream
$h(m)_{16}'$ ← hexdigest(block)

Algorithm 2. Pseudo-hash compression algorithm. The subscripts denote binary or hexadecimal formats.
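Algorithm 2 can be sketched in Python as below. Several details are assumptions of this sketch rather than the authors' implementation: h(m)^r is zero-padded to a fixed width of r × 256 bits, Python's `random.Random` (not a cryptographic generator) stands in for the seeded RNG, and the function name `pseudo_hash` is hypothetical.

```python
import hashlib
import random

def pseudo_hash(h_r: int, seed: int, k: int = 50, r: int = 7, hash_bits: int = 256) -> str:
    """Compress h(m)^r into a hash-sized hexadecimal token via seeded XOR rounds."""
    width = r * hash_bits                     # assumed fixed binary width of h(m)^r
    bits = format(h_r, "b").zfill(width)      # h(m)^r as a bit string
    doubled = bits + bits                     # lets slices wrap around (circular array)
    rng = random.Random(seed)                 # fixed seed gives a reproducible index sequence

    def block_at(index: int) -> int:
        """Read hash_bits bits starting at index, wrapping past the end."""
        return int(doubled[index:index + hash_bits], 2)

    block = block_at(rng.randrange(width))    # initial block
    for _ in range(k):                        # k XOR stream cipher rounds
        block ^= block_at(rng.randrange(width))
    return format(block, "0{}x".format(hash_bits // 4))  # hexdigest format

hm = int.from_bytes(hashlib.sha256(b"abc").digest(), "big")
token_seed_a = pseudo_hash(hm**7, seed=2019)
token_seed_b = pseudo_hash(hm**7, seed=2020)
print(token_seed_a)
print(token_seed_b)
```

The same input and seed always give the same 64-character token, while a different seed selects different blocks and yields a different token.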

In this algorithm, the block size is defined as the standard hash size in binary. For example, if SHA-256 is used, the block size will be 256. The starting block index is randomly selected by a random number generator (RNG) from a range of integers up to the binary width $|h(m)_{2}^{r}|$, and indices are implemented as a circular array, such that the entire range of h(m)^r may be utilized. Using a fixed value for the seed, the sequence of integers generated for the block indices is guaranteed to be the same every time the above algorithm runs. The seed is an important security parameter and must be kept secure. After the initial block is selected, the XOR stream cipher is repeatedly applied k times. Note that k must be chosen carefully. To satisfy requirement D, it must be infeasible to recover h(m) from the pseudo-hash h(m)’. It is therefore crucial to ensure that cryptanalysis of the pseudo-hash output h(m)’ generated via Algorithm 2 cannot reveal h(m)^r. Fortunately, without prior knowledge of how the random indices are chosen, the probability of finding all the correct stream ciphers can be estimated as follows (assuming r = 7 and $|h(m)_{2}| = 256$):

$$\prod_{i=1}^{k} \frac{1}{|h(m)_{2}^{r}|} < \left(\frac{1}{256 \times (r - 1)}\right)^{k} = \frac{1}{2^{8k}} \times \frac{1}{6^{k}} = \left(\frac{1}{2^{8 + \log_{2} 6}}\right)^{k} < \left(\frac{1}{2^{10.5}}\right)^{k} \ll \frac{1}{2^{512}} \tag{4}$$

when k = 50.

Therefore, when we choose k appropriately, such probability is even smaller than finding a collision for a SHA-512 hash. In the above equation, where k is set to 50, an intruder has no reasonable way to invert the pseudo-hash output h(m)’ back to h(m)^r, and then proceed to recover h(m). As a general guideline, k should be determined by the parameter r and the block size $|h(m)_{2}|$. Using such k, enough random blocks are sampled from $h(m)_{2}^{r}$ to ensure that the chance that any part of it was never passed through the XOR stream cipher is very low. Specifically, using our default values (r = 7, k = 50 and SHA-256), the probability that a given bit is never chosen is $\prod_{i=1}^{k} P_{i} \approx \left(\frac{6}{7}\right)^{50} < 0.05\%$. Note that approximately 6 out of 7 bits within $h(m)_{2}^{r}$ won’t be chosen as a part of the XOR stream cipher for each round.
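The two numerical claims for the default parameters can be checked with a short calculation. The variable names are chosen for this sketch; the quantities are the bound from equation (4) expressed in bits and the per-bit coverage estimate (6/7)^50.

```python
import math

r, k, hash_bits = 7, 50, 256

# Equation (4): log2 of the bound (1 / (256 * (r - 1)))^k on guessing all indices.
log2_guess_bound = -k * math.log2(hash_bits * (r - 1))

# Probability that a given bit of h(m)^r is missed by all k sampled blocks,
# using the approximation that each block covers about 1/r of the bits.
p_never_chosen = ((r - 1) / r) ** k

print(f"guess bound: 2^{log2_guess_bound:.1f}")   # about 2^-529, below 2^-512
print(f"never chosen: {p_never_chosen:.4%}")      # about 0.045%, below 0.05%
```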

This completes the discussion of the main algorithmic methods in the PPRL-Plus system. The new key generation algorithm is a slight modification of the widely used open standard for RSA key generation; it offers the local sites their own private keys for encryption and does not require the use of a common secret salt. Compared to existing PPRL systems, in which all the participating parties need special software to retrieve the salt and to produce the hashes, only the central trusted party needs one piece of additional software to run the pseudo-hash compression algorithm. Because it does not require effort to coordinate and disseminate the salt, this promises to be a more scalable solution for incorporating a large number of participating sites to perform intra-network record linkage studies. In the next section, we discuss how we have operationalized the different algorithmic components in a research data network setting, where we simulate a multi-site study. Note that this mode of operation also applies to multi-site studies that span several different research networks.

Results

In this section, we first explain how to deploy the key generation algorithm (Algorithm 1). In the original RSA key generation algorithm, the product d × e always leaves a remainder of 1 modulo φ(N); the PPRL-Plus key generation algorithm is obtained by replacing this remainder with a new value r, so that $d' \times e' \equiv r \pmod{\varphi(N)}$ (as shown in step 5 of Algorithm 1). Starting from a good-quality open source RSA implementation, such an algorithm can be implemented quite easily. In our experiment, we set the default value of r to 7 as demonstrated above. New key pairs were generated in around one second on desktop hardware.

Notice the key pairs in our discussion here are only meant to be used in the PPRL applications to encrypt the hashes. They are not the same as the regular RSA keys of the host server. To differentiate these keys, we call them PPRL-Plus keys. Once the key pair is generated, the private key d is used to encrypt the hashes created from the cryptographic hash algorithms (such as SHA-512 or SHA-256). This step corresponds to generating h(m)^d mod N. Here we leverage the security primitive that finding the modular root of an arbitrary number (hash signature) is infeasible when the key size is large enough. In our implementation the key size of 2048 bits is used according to the NIST recommendation¹⁹. Notice the secret salt is no longer required when creating the hashes because the security for the hashes is now protected by the private key to which only the local site has access. The resulting hash signatures not only apply a cryptographic hash algorithm to de-identify the PHI, but also are sufficiently protected by the site’s own private key. Hence, they can be transmitted outside the institutional boundary to a file repository hosted by a central party for downstream analysis.

One major issue that needs to be addressed here is that we need to enhance the security measures with respect to the supposedly public key component e. In regular RSA key generation, the public key e may be shared with impunity. This is expressly not the case for the method we describe here. If an intruder somehow obtains the private hash signatures h(m)^d mod N, the known value of e can be used to derive h(m)^r, which can be mapped to h(m) for dictionary attacks. We thus call the e component a protected key in the PPRL-Plus system, as compared to a public key in the regular RSA application. The resulting key e is then encoded in PEM format and securely forwarded to a central party when requested, as indicated in Figure 1.

Figure 1. Key generation and hash signature generation workflow.

The pseudo-hash compression and the subsequent data distribution workflow are illustrated in Figure 2. In order to produce hashes suitable for linkage, the following actions are performed. A signature repository is populated with encrypted data from the various local institutions. The trusted party (also referred to as the honest broker) uses the pseudo-hash compression engine to process the data received from the various local institutions. As needed, protected keys are retrieved from the local sites and the hash signature files from the signature repositories. The honest broker plays a coordinating role throughout this execution phase. During execution, the engine picks a fixed seed to ensure the XOR stream ciphers are obtained from the same set of blocks within h(m)^r. Therefore, the same h(m)^r values from different local institutions are compressed to the same pseudo-hashes h(m)’ for the downstream linkage. During execution, details that have significant security implications (namely, the seed and the protected keys) reside only in the host machine’s memory. These data are never written to disk. Furthermore, they are properly scrubbed from memory at the earliest opportunity. Note that the private hash signatures h(m)^d mod N by themselves are sufficiently secure as long as the private key d meets the standard size requirement.

Since all the submitted data needs to be processed by the compression engine, it is necessary that the compression algorithm offers acceptable speed performance. In our experiment, we implement the pseudo-hash compression algorithm using Python version 3.7.3, and we observe a simulated average speed of 0.4 millisecond per record using an Intel i7 8th generation processor. This translates to approximately 7 minutes to generate one million pseudo-hashes using a single-threaded application. Given this performance, we are convinced the compression algorithm is efficient enough to generate study-specific pseudo-hashes for multiple studies. On the other hand, we observe the average simulated speed to generate the private hash signatures is about 21 milliseconds, which translates to approximately 6 hours for 1 million private hash signatures. This is primarily due to the security requirement to use a sufficiently large private key. While such performance speed appears slow, it might actually be considered acceptable in practice, because the private hash signatures need only be generated once at each local site. In addition, such a performance issue may be addressed easily using parallel processing or hardware acceleration.
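The throughput figures quoted above follow from simple arithmetic on the per-record timings; the sketch below only restates that calculation (variable names are chosen for this sketch, and single-threaded processing is assumed as in the text).

```python
records = 1_000_000

compression_ms_per_record = 0.4   # pseudo-hash compression, per record
signing_ms_per_record = 21.0      # private hash signature, per record

compression_minutes = records * compression_ms_per_record / 1000 / 60
signing_hours = records * signing_ms_per_record / 1000 / 3600

print(f"compression: {compression_minutes:.1f} minutes")  # about 6.7 minutes
print(f"signing: {signing_hours:.1f} hours")              # about 5.8 hours
```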

In large-scale data research networks where multiple studies are carried out, it is common to create hashes that may only be used for individual projects. This is necessary to comply with various data governance policies. In current PPRL systems, hashes created for one particular study may not match hashes created for other protocols. Therefore, study-specific hashes must be generated using different secret salts, and all local sites need to run the hashing application as many times as there are studies requiring their site-specific hashes. This becomes unwieldy as the scale of the data research network grows larger and more studies are in demand. Note that in PPRL-Plus, if a different seed is used in the compression algorithm, a different set of XOR stream ciphers will be obtained, and different pseudo-hashes will be generated. However, pseudo-hashes created with the same seed will still match if they originate from the same m and h(m). Such characteristics ensure pseudo-hashes are only capable of linking records under one study and cannot be used for linkage to other studies, assuming a different seed is used for each study. Moreover, since pseudo-hash generation is performed at an outside entity, the local institutions no longer need to re-generate linkage hash tokens for different studies. Thus, PPRL-Plus provides better scalability for conducting multiple concurrent projects. The resulting study-specific pseudo-hash tokens are further transferred to different study teams for downstream analysis, as illustrated in Figure 2.
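The cross-site and study-specific behaviour described above can be illustrated end to end with a compact sketch. Everything named here is hypothetical (the record string, the seeds, the helper names), the key construction d’ = r × e’⁻¹ mod φ(N) is an assumption, the Mersenne primes are demo-only and not secure, and Python's `random.Random` stands in for the seeded RNG.

```python
import hashlib
import random
from math import gcd

R, HASH_BITS = 7, 256

def keygen(p: int, q: int):
    """Site key pair with d' * e' ≡ R (mod phi(N)); returns (N, d', e')."""
    n, phi = p * q, (p - 1) * (q - 1)
    e = 65537
    while gcd(e, phi) != 1:
        e += 2
    return n, (R * pow(e, -1, phi)) % phi, e    # Python 3.8+

def sign(record: str, n: int, d: int) -> int:
    """Local site: unsalted SHA-256 hash raised to the private key."""
    hm = int.from_bytes(hashlib.sha256(record.encode("utf-8")).digest(), "big")
    return pow(hm, d, n)

def pseudo_hash(signature: int, n: int, e: int, seed: int, k: int = 50) -> str:
    """Central party: recover h(m)^R with the protected key, then compress."""
    width = R * HASH_BITS
    bits = format(pow(signature, e, n), "b").zfill(width) * 2  # doubled for wrap-around
    rng = random.Random(seed)
    block = 0
    for _ in range(k + 1):                      # initial block plus k XOR rounds
        i = rng.randrange(width)
        block ^= int(bits[i:i + HASH_BITS], 2)
    return format(block, "064x")

# Two sites with unrelated (demo-only) keys holding the same patient record.
n_a, d_a, e_a = keygen(2**607 - 1, 2**2203 - 1)
n_b, d_b, e_b = keygen(2**521 - 1, 2**2281 - 1)
record = "JANE DOE|1980-01-31"
sig_a, sig_b = sign(record, n_a, d_a), sign(record, n_b, d_b)

study1_a = pseudo_hash(sig_a, n_a, e_a, seed=101)
study1_b = pseudo_hash(sig_b, n_b, e_b, seed=101)
study2_a = pseudo_hash(sig_a, n_a, e_a, seed=202)
print(sig_a != sig_b, study1_a == study1_b, study1_a != study2_a)
```

The site signatures differ, the tokens for one study agree across sites, and a second study seed produces tokens that do not link to the first, all without the sites re-running anything.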

Discussion

In the following table we compare our proposed PPRL-Plus system to the prevailing PPRL security measures. As we have explained previously, the main advantage of PPRL-Plus is that it uses site-specific private keys for encryption. In order to produce the site-specific private hash signatures h(m)^d mod N, first, hashes are calculated using the standard SHA-2 family functions. The required hashing functions, such as SHA-256 or SHA-512, come with modern programming languages and database applications and can be easily applied. Second, only a slight modification to an open source RSA key generation library is required to change the remainder value from the default 1 to r. Third, the private hash signatures h(m)^d mod N may be calculated efficiently. Fast implementations of the modular exponentiation function required in the third step are also available, such as Python’s built-in pow() function. Therefore, it is straightforward for participating sites to adopt this approach locally to create private hash signatures without using special-purpose vendor software.

Note that in regular digital signatures, padding and encoding schemes such as PKCS or OAEP are applied to prevent malleability attacks. However, in the context of PPRL, the main security threat is not about altering ciphertext. Therefore, we may skip such padding schemes. If such padding is required, it is necessary to use a smaller value for the remainder r, such that the conditions in equation (2) in the method section still hold. The resulting private hash signature is secure in its own right and may be transmitted securely outside the institutional boundary, as long as the key size is sufficiently large. Currently the NIST recommendation for the key size is 2048 bits. Guaranteed by the demonstrated security of the RSA formulation, the hash signatures h(m)^d mod N saved in the signature repository are by themselves resistant to cryptanalysis. Armed only with these signatures, intruders have no way to recover the original h(m). If a private hash signature were exposed, the relevant local site need only produce a new key pair, regenerate the private hash signatures, and permanently delete the old key pairs. Following such a recovery procedure, normal operations may be resumed. Because different local sites use different key pairs, the impact of any individual security incident is confined to a particular local institution. By contrast, in common salt-based PPRL systems, the compromise of any individual institution may have broader impact.

Also note that while executing the pseudo-hash compression algorithm, the value of h(m)^r for the unsalted hash h(m) is present in the memory of the host machine. Since r is employed as a common public parameter, one can easily replace h(m) with h(m)^r in a dictionary table. Therefore, carrying out a cryptanalysis attack using h(m)^r is just as easy as using h(m). The seed for the random number generator also persists in memory, albeit for a short period of time, posing another potential risk by possibly revealing how the XOR stream ciphers are applied. It is therefore very important to ensure the security of the host machine where the pseudo-hash compression engine runs, and to flush the memory completely as soon as these sensitive data are no longer required.

To eliminate such vulnerability completely, we seek a method to achieve the fundamental transformation scheme presented on page 3 without resorting to intermediary processing that may reveal the original hash h(m) or any of its derivable forms. In future work, we will seek a function with suitable mathematical properties that can directly transform the private hash signatures into pseudo-hash tokens that are not practically reversible to the original hash values.

Conclusion

In this paper, we have demonstrated a novel framework to carry out privacy preserving record linkage for multi-site studies using asymmetric key cryptography. The main advantage of our approach is greater scalability, achieved by eliminating the need to pre-disseminate a common secret to participating sites. Our inspiration is derived from how asymmetric keys have helped scale up secure Internet applications. We have provided an overarching demonstration of how to design, implement, and deploy a PPRL system based on asymmetric keys. This work is still in its early stages, and we are working to identify functions with better mathematical properties to meet the fundamental transformation requirements.

Figures & Table

Table 1.

Comparison of PPRL-Plus and current PPRL systems.

	PPRL-Plus	PPRL
Require distribution of network-wide common salt/secret	No	Yes
Need special-purpose software at local sites	No	Yes
Need special-purpose software at the central party	Yes	No
Encryption speed at local sites	Slow	Fast
Encryption speed at the central party	Fast	N/A
Generate study-specific hashes at local sites	One time	Many times


References

1. Jiang X, Sarwate AD, Ohno-Machado L. Privacy Technology to Support Data Sharing for Comparative Effectiveness Research: A Systematic Review. Med Care. 2013;51(8 SUPPL.3). doi: 10.1097/MLR.0b013e31829b1d10.
2. Schnell R, Bachteler T, Reiher J. Privacy-preserving Record Linkage Using Bloom Filters. BMC Med Inform Decis Mak. 2009;9(1). doi: 10.1186/1472-6947-9-41.
3. Chen F, Jiang X, Wang S, et al. Perfectly Secure and Efficient Two-party Electronic-health-record Linkage. IEEE Internet Comput. 2018;22(2):32–41. doi: 10.1109/MIC.2018.112102542.
4. Ong T, Lazrig I, Ray I, Ray I, Kahn M. Scalable Secure Privacy-preserving Record Linkage (PPRL) Methods Using Cloud-based Infrastructure. Int J Popul Data Sci. 2018;3(4).
5. Dang QH. Secure Hash Standard. Fed Inf Process Stand Publ. 2015:180–4.
6. Kho AN, Cashy JP, Jackson KL, et al. Design and Implementation of a Privacy Preserving Electronic Health Record Linkage Tool in Chicago. J Am Med Informatics Assoc. 2015;22(5):1072–1080. doi: 10.1093/jamia/ocv038.
7. Kho AN, Hynes DM, Goel S, et al. CAPriCORN: Chicago Area Patient-centered Outcomes Research Network. J Am Med Informatics Assoc. 2014;21(4):607–611. doi: 10.1136/amiajnl-2014-002827.
8. Pasquali SK, Jacobs JP, Farber GK, et al. Report of the National Heart, Lung, and Blood Institute Working Group: An Integrated Network for Congenital Heart Disease Research. Circulation. 2016;133(14):1410–1418. doi: 10.1161/CIRCULATIONAHA.115.019506.
9. Yuan J, Malin B, Modave F, et al. Towards a Privacy Preserving Cohort Discovery Framework for Clinical Research Networks. J Biomed Inform. 2017;66:42–51. doi: 10.1016/j.jbi.2016.12.008.
10. Fleurence RL, Curtis LH, Califf RM, Platt R, Selby JV, Brown JS. Launching PCORnet, a National Patient-centered Clinical Research Network. J Am Med Informatics Assoc. 2014;21(4):578–582. doi: 10.1136/amiajnl-2014-002747.
11. Collins FS, Hudson KL, Briggs JP, Lauer MS. PCORnet: Turning a Dream Into Reality. J Am Med Informatics Assoc. 2014;21(4):576–577. doi: 10.1136/amiajnl-2014-002864.
12. Marsolo K, Carton TW. Data Linkage Within, Across, and Beyond PCORnet. [Internet] Available from: https://dcricollab.dcri.duke.edu/sites/NIHKR/KR/GR-Slides-11-09-18.pdf.
13. Randall SM, Ferrante AM, Boyd JH, Bauer JK, Semmens JB. Privacy-preserving Record Linkage on Large Real World Datasets. J Biomed Inform. 2014;50:205–212. doi: 10.1016/j.jbi.2013.12.003.
14. Delfs H, Knebl H. Symmetric-key Cryptography. In: Information Security and Cryptography. 2015;Vol 2.:11–48.
15. Simmons GJ. Symmetric and Asymmetric Encryption. ACM Comput Surv. 1979;11(4):305–330.
16. Rivest RL, Shamir A, Adleman L. A Method for Obtaining Digital Signatures and Public-key Cryptosystems. Commun ACM. 1978;21(2):120–126.
17. Kaliski B, Jonsson J, Rusch A. PKCS #1: RSA Cryptography Specifications Version 2.2. 2016.
18. Fujisaki E, Okamoto T, Pointcheval D, Stern J. RSA-OAEP is Secure Under the RSA Assumption. J Cryptol. 2004.
19. Barker EB, Dang QH. Recommendation for Key Management Part 3: Application-Specific Key Management Guidance. 2015.

