Abstract
To mitigate bias in multi-institutional research studies, healthcare organizations need to integrate patient records. However, this process must be accomplished without disclosing the identities of the corresponding patients. Various private record linkage (PRL) techniques have been proposed, but there is a lack of translation into practice because no software suite supports the entire PRL lifecycle. This paper addresses this issue with the introduction of the Secure Open Enterprise Master Patient Index (SOEMPI). We show how SOEMPI covers the PRL lifecycle, illustrate the implementation of several PRL protocols, and provide a runtime analysis for the integration of two datasets of 10,000 records each. While the PRL process is slower than linkage in a non-secure setting, our analysis shows that the majority of processes in a PRL protocol require several seconds or less and that SOEMPI completes the process in roughly one and a half minutes, which is a practical amount of time for integration.
Introduction
Healthcare organizations (HCOs) are encouraged, or are required, to share data to support various endeavors, such as post-market surveillance and biomedical research. To mitigate bias in investigations, it is important to resolve when a patient’s data resides in multiple resources. This process, called record linkage, is non-trivial because a patient’s record often contains typographical and semantic errors.1 Sophisticated record linkage strategies have been proposed to resolve these problems2, but their application is hampered by policies or regulations, such as the HIPAA Privacy Rule3, which limit the sharing of identifiers that facilitate the linkage process (e.g., personal name and Social Security number). To overcome this barrier, a growing list of techniques has been proposed to support private record linkage (PRL).4
From a high level, the PRL process has a lifecycle that entails (but is not necessarily limited to) the following steps:
1. Generation and storage of keys for cryptosystems, or salt values for hash functions, invoked in a PRL protocol;
2. Communication of keys and salt to the entities encoding the records upon request;
3. Transformation of identifiers into their protected form as specified by the protocol;
4. Execution of the record linkage framework (e.g., feature weighting, blocking, and comparison of record pairs to predict which correspond to the same individual); and
5. Transfer of records and parameters related to the linkage protocol (i.e., all communication between parties).
Unfortunately, there has been a lack of transition of sophisticated PRL techniques into practice. We suspect this is due, in part, to factors associated with: i) complexity in the description and design of protocols, ii) availability of working software, and iii) coverage of the data lifecycle.
Regarding the first factor, a wide variety of methods exist to support PRL. For instance, let us focus on one aspect of step four of the lifecycle for a moment. There are many techniques that have been proposed for computing the similarity of strings in a privacy preserving manner. Some of these are based on secure multiparty computation (SMC) protocols (e.g.,5). Others are based on less cryptographically intense approaches, such as mapping the identifiers to Bloom filter encodings (BFEs)6–8. Hybrid strategies selectively reveal information (e.g., patient’s age as a 5-year range) to speed up the SMC process9. These methods vary considerably in their effectiveness (e.g., precision and recall), efficiency (e.g., computational runtime, memory required, and bandwidth), and security11.
Second, even when PRL techniques appear to balance these factors, they may not be adopted because it is argued that they require more engineering than a simple protocol12. In many respects, this is a valid argument because the majority of published PRL methods are often limited to software implementations developed for experimental analysis. HCOs that wish to use such strategies would be hindered from doing so without significant additional engineering. This is problematic because most HCOs have neither the time nor resources to interpret and implement cryptographic procedures into an easy to use software product. To the best of our knowledge, there are only two tools13,14 (as reviewed below) which explicitly provide readily available software that supports PRL methodologies.
Third, to the best of our knowledge, published PRL methods, as well as the software disseminated in their support, tend to focus on the linkage stage of the lifecycle only. In other words, they assume that HCOs have the ability to manage patients’ records, establish communication links, and execute linkage procedures. Consider, for instance, that a certain subset of PRL methods requires HCOs to work with a third party to more appropriately balance efficiency and security.4 Yet the assumption that HCOs can handle these tasks on their own often does not hold and, thus, the PRL methods in the literature, as well as the aforementioned software tools, fail to cover the lifecycle. This, again, requires HCOs (or research data managers) to piece together or augment multiple software tools.
Given the aforementioned challenges, we developed an open source software toolkit to support the PRL lifecycle. Thus, there are several contributions of this paper:
Open source software framework for PRL: We extend an existing open source master patient indexing tool15 to handle the record linkage process in a manner consistent with accepted frameworks. The resulting software, called the Secure Open Enterprise Master Patient Index (SOEMPI), entails an innovative architecture to tailor the specification of PRL protocols to an HCO’s needs. In doing so, SOEMPI manages the communication between disparate data providers, as well as the third parties, involved in the PRL process. Furthermore, the architecture of SOEMPI is implemented in a component-based manner and is readily extensible to PRL approaches that follow the aforementioned lifecycle.
Case Studies in PRL Protocol Implementation: We implement several PRL protocols to illustrate the capability of SOEMPI. All protocols are based on a Bloom filter-based record transformation method proposed in the literature. In these protocols, HCOs hash the patient identifiers in their respective medical record systems into Bloom filter encodings (BFEs). Based on a recently published protocol, each variable is encoded in its own filter, and the resulting filters are subsequently combined into a single composite filter, with bit positions weighted to optimize linkage performance.16 These encodings are subsequently sent to a third party for linkage.
Runtime Analysis: To demonstrate the feasibility of SOEMPI, we perform an analysis over the PRL lifecycle. We compare the running time between SOEMPI and OpenEMPI and show the communication and transformation required for PRL is relatively short, such that it can be completed in a practical amount of time. Given that the accuracy of such strategies has been addressed6, this paper focuses on an evaluation of runtime.
PRL Participants and Protocols
This section begins with a review of record linkage and PRL. We then proceed into describing common participants of the linkage schemes and finally describe two actual protocols.
Record linkage is a data management process. At its core, it consists of two or more data managing HCOs who wish to integrate disparate collections of records. The process of record linkage requires the transfer of data from one HCO to another, who performs a data comparison and integration procedure. This procedure may be deterministic and rule-driven17 or probabilistic and supported by robust statistical inference methods (e.g., the expectation-maximization implementation of the Fellegi-Sunter (FS) algorithm18). In PRL, the set of entities may be expanded to include one or more third parties (i.e., non-contributors of data). As we explain in further depth below, these parties can take on a variety of responsibilities, ranging from the generation and communication of cryptographic keys for the HCOs to the execution of the integration process on behalf of the HCOs in a manner that prevents the inference of patients’ identifiers. Thus, before proceeding into the details of existing systems, we take a moment to review the classes of participating entities described in various PRL systems and their responsibilities.
Data Providers (DPs) are the HCOs who manage the identified patient records that will be linked. Without loss of generality, let us focus on two HCOs that hold disparate patient data, Alice and Bob. They engage in record linkage sessions with other HCOs through the third parties.
Key Server (KS) provides “salt” and/or cryptographic key values needed by the DPs, so that they can perform data transformations (e.g., generation of BFEs) of their data fields in a consistent manner.
Data INtegrator (DAN) performs the integration component of the PRL process for the HCOs. DAN accepts encoded values (e.g., BFEs) from the DPs and performs record integration in a privacy-preserving manner.
PArameter Manager (PAM) receives sample data from the DPs to determine the parameters of the linkage protocol (e.g., number of bits in a Bloom filter). It coordinates the communication of such parameters to the DPs.
A. Three-Party PRL Protocol
As an example, let us consider a PRL protocol with one third party as shown in the blue-shaded section of Figure 1. In this protocol, DAN requires Alice and Bob to encode their patients’ identifiers using a consistent set of salted (i.e., keyed) HMAC (Hash-based Message Authentication Code) functions. As such, it is often asked, “Where do these keys come from?” In the first variation of the protocol illustrated, we integrate an independent, semi-trusted authority in the form of key server KS to generate the salts for Alice and Bob. In Figure 1, the sections shaded in blue depict a sequence diagram that summarizes the series of steps and service calls. After receiving the salts, Alice and Bob encode their records into BFEs based on a salted hash function. The resulting encodings are then submitted to DAN, who waits until both datasets and match request tickets (explained below) have been received. Upon reception, DAN runs the requested matching procedure.
Figure 1.
An illustration of the flow for a private record linkage protocol that incorporates multiple third parties. The blue sections of the flow correspond to the Three-Party protocol, while the incorporation of the red-shaded sections converts it into the Four-Party protocol. In this diagram, the variables a and b denote the steps associated with Alice’s and Bob’s datasets, respectively.
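To make the encoding step of the Three-Party protocol concrete, the following is a minimal Python sketch of how a Data Provider might map an identifier to a BFE with salted HMAC functions. It is an illustration under stated assumptions and not the SOEMPI implementation: the function names, the use of padded 2-grams, the SHA-256 digest, and the way the hash-function index is appended to the salt are all hypothetical choices.

```python
import hashlib
import hmac


def bigrams(value: str) -> list[str]:
    """Split a padded, lower-cased identifier into overlapping 2-grams."""
    padded = f"_{value.lower()}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def bloom_encode(value: str, salt: bytes, num_bits: int = 64,
                 num_hashes: int = 4) -> list[int]:
    """Encode an identifier as a Bloom filter using salted HMAC functions."""
    bits = [0] * num_bits
    for gram in bigrams(value):
        for k in range(num_hashes):
            # Assumption: the k-th hash function is an HMAC keyed with the
            # shared salt followed by the function index k.
            key = salt + bytes([k])
            digest = hmac.new(key, gram.encode("utf-8"), hashlib.sha256).digest()
            # Map the digest to one bit position of the filter.
            bits[int.from_bytes(digest, "big") % num_bits] = 1
    return bits
```

Because Alice and Bob receive the same salt from KS, the same identifier yields the same filter at both sites, while a party without the salt cannot recompute the encoding directly.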
B. Four-Party PRL Protocol
Recent research has shown that if the BFEs submitted to DAN are not tuned properly, they are vulnerable to cryptanalysis and leakage (e.g., mapping of an encoded value to a patient’s real name).7 To mitigate such an attack, Alice and Bob should parameterize their BFE strategy in a manner that minimizes vulnerability to cryptanalysis, but maximizes accuracy.16 Due to privacy concerns, Alice and Bob cannot exchange patient identifiers directly, but may wish to employ the assistance of an additional third party to provide feedback on how best to set up the system (e.g., determine weights for each field, such as forename, surname, and Social Security number). This is where a parameter manager, PAM, can be of assistance.
In Figure 1, the sections shaded in red provide additional support for such a process. In this case, Alice and Bob request and receive salts from KS as before. However, instead of sending data to DAN, they provide a small sample of their HMAC-encoded records to PAM. This entity compares the datasets (as described elsewhere16) and responds to Alice and Bob with the appropriate set of parameter values (e.g., size of Bloom filter, number of hash functions, size of the n-grams into which patient identifiers should be split, and random order of bits into which the hash functions will be mapped). The protocol then continues in the same manner as described in the Three-Party protocol.
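The sample exchange with PAM can be sketched as follows. This is a hedged illustration only: we assume that each field is digested separately with the shared salt and that PAM compares digests for exact equality to obtain a field-level agreement pattern; the field names and function names are hypothetical, and the parameter estimation itself (described elsewhere16) is not shown.

```python
import hashlib
import hmac

# Hypothetical field names used only for this illustration.
FIELDS = ("forename", "surname", "city")


def hmac_fields(record: dict[str, str], salt: bytes) -> dict[str, str]:
    """Replace every field value with its salted HMAC digest (hex string)."""
    return {
        field: hmac.new(salt, record[field].strip().lower().encode("utf-8"),
                        hashlib.sha256).hexdigest()
        for field in FIELDS
    }


def agreement_vector(a: dict[str, str], b: dict[str, str]) -> tuple[int, ...]:
    """Exact comparison of two encoded records: 1 if a field agrees, else 0."""
    return tuple(int(hmac.compare_digest(a[f], b[f])) for f in FIELDS)
```

Note that a digest comparison can only detect exact agreement, so a single typographical difference in a field yields disagreement for that field.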
Background: the State of PRL Software
This section provides a systematic analysis of existing free software solutions for master patient indexes (MPI), record linkage, and PRL. Table 1 summarizes various aspects of software tools for record linkage and PRL that are readily available. The majority of the software suites in the table are limited in that they were developed to facilitate the integration of specific datasets in a non-coordinated manner. Moreover, they are limited to the user interface of a local machine and, thus, their main deployment is for desktop-level usage. While it is easier to install desktop software than to deploy it as a server application, doing so significantly limits scalability. If such software were installed on a server, for instance, interaction would only be possible through a remote desktop protocol. Because such software functions in a stand-alone manner, it is beyond its scope to automatically support any protocol that involves multiple participants.
Table 1.
A systematic comparison of existing generic and privacy preserving record linkage software tools.
| Tool | PRL | Free | Open Source | Extensible | Communication<sup>a</sup> | GUI<sup>k</sup> |
|---|---|---|---|---|---|---|
| Link King21 | No | No<sup>c</sup> | Yes | Limited<sup>e</sup> | Manual | Desktop<sup>b</sup> |
| Link Plus20 | No | Yes | No | No | Manual | Desktop<sup>b</sup> |
| FEBRL26 | No<sup>d</sup> | Yes | Yes | Yes | Manual | Desktop<sup>b</sup> |
| FRIL19 + LinkIt29 | Yes | Yes | Yes | Yes | Manual | Desktop<sup>b</sup> |
| MTB28 | Yes | Yes | No<sup>f</sup> | Yes | Manual | Desktop<sup>b</sup> |
| OpenEMPI15 | No | Yes | Yes | Yes<sup>g</sup> | Yes<sup>h</sup> | Web<sup>j</sup> |
| OpenMRS23 | No | Yes | Yes | Yes<sup>g</sup> | Yes<sup>h</sup> | Web<sup>j</sup> |
| RECLINK22 | No | No<sup>c</sup> | Yes | Limited<sup>e</sup> | Manual | Desktop<sup>b</sup> |
| SOEMPI | Yes | Yes | Yes | Yes<sup>g</sup> | Yes<sup>i</sup> | Web<sup>j</sup> |

<sup>a</sup> Communication with other entities.
<sup>b</sup> Requires a graphical desktop sharing solution on the server and client side to view the graphical user interface of the server locally.
<sup>c</sup> The script itself is free, but requires an additional SAS or Stata license to run.
<sup>d</sup> Proposed, but not implemented.
<sup>e</sup> The software is not free, and requires specific programmer knowledge.
<sup>f</sup> BloomEncode and SafeLink source code is available only for research projects.
<sup>g</sup> SOA software is by nature designed for extensibility, but also requires programmer knowledge.
<sup>h</sup> With standard HCO actors, but not in a complex record linkage protocol.
<sup>i</sup> With other SOEMPI instances for record linkage.
<sup>j</sup> Users can easily view the graphical interface of web applications remotely with a browser.
<sup>k</sup> Graphical User Interface.
Most of these software tools enable a data cleaning process and support the tuning of record linkage parameters. They are all capable of performing FS-style probabilistic record linkage and most have rich and detailed user interfaces. The Fine-grained Record Integration and Linkage (FRIL) tool even aids in the execution of several consecutive match runs, where the outcome of each run iteratively helps refine parameters.19 Link Plus is free; however, it is closed source and only available for Windows.20 Link King21 and RECLINK22 are free and open source additions to statistical software suites (SAS and Stata, respectively), but those suites themselves require licenses. This also means that extending them requires scripting knowledge of the specific statistical package.
A. Open Source Master Patient Indexing
Distributed healthcare systems require systematic approaches for coordination, integration, and management of linked records. One way this has been accomplished is through master patient indexing (MPI) initiatives, which have been integrated into health information exchange systems. Many of the resulting technologies have been implemented as open source software tools. OpenEMPI15 is an open source MPI project with ongoing development. It is capable of performing deterministic and probabilistic matching and supports MPI. Every aspect of the record linkage process is configurable. It can interface with various health information systems in a standardized manner. The software is based on a service-oriented architecture (SOA) design and a component framework, which makes it extensible. OpenMRS23 was designed to support the delivery of healthcare in developing countries. It has a matching module that is configurable and capable of performing probabilistic matching. Other open source MPI systems, such as OpenEMed24 and OpenHRE25, have not been supported for years and, thus, were not considered further. It is worth noting that these solutions only address record linkage and patient indexing and do not support PRL.
B. Open Source Private Record Linkage
In comparison to the various options for record linkage tools, the PRL landscape is more barren. The developers of the Freely Extensible Biomedical Record Linkage (FEBRL)26 tool, for instance, planned to implement PRL methods based on n-gram hashes27, but this has yet to be realized. To the best of our knowledge, there exist only two actual PRL implementations. The first corresponds to the BloomEncode and SafeLink companion tools for the Merge ToolBox (MTB) software.28 To facilitate PRL, the BloomEncode tool is invoked to transform patient identifiers into BFEs. These are then manually passed to a third party who runs the MTB system and performs record linkage. The other is the LinkIt companion tool of FRIL.29 We wish to highlight, however, that to realize a fully automated PRL lifecycle, MTB or FRIL/LinkIt would need communication protocols to interact with the companion tools at disparate organizations securely.
Methods
In this section, we describe the PRL lifecycle and how it is supported by the SOEMPI architecture.
A. Requirements
In preparation for this project, we performed a requirements analysis to identify the factors in an existing record linkage software that support the PRL lifecycle. From the outset, it was determined that such a system should have the following properties:
Coverage of the entire PRL process;
Configurable to support various parameterizations of data schemas, record linkage methods, as well as encoding functions, blocking strategies, and record comparison / matching algorithms;
Flexible to facilitate various data schemas at the data providers and the design of different PRL protocols;
Enterprise environment capable of enabling the scalable deployment of the software on a server or in a data warehousing environment;
Open source technology to allow for extension and revision by a community; and
Use of industry-wide accepted technologies for easier communications with possible non-PRL HCO software.
OpenEMPI was selected as a code base because it satisfies most of these properties and is currently maintained.
B. Software and Communications Architecture
This section begins with a high-level overview of the SOEMPI design, composition, and main building blocks. Next, we provide an overview of the pluggable architecture associated with SOEMPI, which allows for flexible management of the lifecycle. Then, we take a closer look at certain key functionalities in the software design. Finally, we discuss how the various participating entities and data schemas are supported.
1. Architecture of a SOEMPI Instance
Figure 2 provides a high-level view of a SOEMPI instance. This view is based on an n-tier design, which facilitates the isolation of responsibilities and the decoupling of processes. Specifically, the user interface layer (both client- and server-side) at the top interacts with the middle services tier below, which connects to the data access layer, which in turn connects to the underlying database. Under the hood, an SOA is applied to achieve a flexible and pluggable architecture.
Figure 2.
A high-level architecture of the SOEMPI software
While many record linkage protocols involve several parties, SOEMPI serves as a universal actor because it can be instantiated according to each of the possible roles described earlier. The actual role played can be selected through either a configuration file or during the login process. According to the selected role, SOEMPI displays only the relevant features on the user interface of the particular role and is restricted to perform only the appropriate operations (e.g., when acting as a Key Server, it can only generate and distribute salts). If additional SOEMPI instances are involved in the protocol (e.g., a third party), the programs communicate with the help of Java EJB remote calls, a method inherited from OpenEMPI.
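The role-based restriction described above can be sketched as follows. This is a simplified Python illustration, not SOEMPI code (which is Java-based): the class name, the operation names, and the permission table are hypothetical and only mirror the four roles introduced earlier.

```python
import secrets
from enum import Enum


class Role(Enum):
    DATA_PROVIDER = "data_provider"
    KEY_SERVER = "key_server"
    DATA_INTEGRATOR = "data_integrator"
    PARAMETER_MANAGER = "parameter_manager"


# Hypothetical mapping from each role to the operations it may perform.
ALLOWED = {
    Role.DATA_PROVIDER: {"import_dataset", "encode_records", "send_dataset"},
    Role.KEY_SERVER: {"generate_salt"},
    Role.DATA_INTEGRATOR: {"block_and_match"},
    Role.PARAMETER_MANAGER: {"recommend_parameters"},
}


class Instance:
    """A single instance that behaves according to its configured role."""

    def __init__(self, role: Role) -> None:
        self.role = role

    def _require(self, operation: str) -> None:
        if operation not in ALLOWED[self.role]:
            raise PermissionError(f"{self.role.value} may not {operation}")

    def generate_salt(self, num_bytes: int = 32) -> bytes:
        """Only a Key Server may generate salts."""
        self._require("generate_salt")
        return secrets.token_bytes(num_bytes)
```

In this sketch, an instance configured as a Data Provider that attempts to generate a salt raises a `PermissionError`, whereas a Key Server instance returns random bytes.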
2. Pluggable Services
While the complete details of SOEMPI are beyond the scope of this paper, we wish to highlight the pluggable aspect of the components in the PRL process. This is one of the key contributions because it enables a modular approach to PRL protocol design and deployment. Figure 3 depicts the steps of the PRL process and how each step offers various encapsulated method options. For instance, SOEMPI offers a variety of record comparison functions, blocking algorithms, and probabilistic matching methods. The composition of a method-chain is possible through specification in an XML configuration file or runtime user interface interaction.
Figure 3.
A depiction of the pluggable service architecture for the record encoding and matching functionality of the SOEMPI system.
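The idea of composing a method-chain from named, interchangeable components can be sketched as follows. This is an illustrative Python sketch rather than SOEMPI's design: the registry, the comparator names, and the use of a plain dictionary in place of the XML configuration file are all assumptions.

```python
from typing import Callable

Comparator = Callable[[str, str], float]

# Registry of pluggable comparison functions, keyed by a configuration name.
COMPARATORS: dict[str, Comparator] = {}


def register(name: str) -> Callable[[Comparator], Comparator]:
    """Decorator that makes a comparison function selectable by name."""
    def decorator(fn: Comparator) -> Comparator:
        COMPARATORS[name] = fn
        return fn
    return decorator


@register("exact")
def exact(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0


@register("dice_bigram")
def dice_bigram(a: str, b: str) -> float:
    """Dice coefficient over the sets of character 2-grams."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    total = len(grams_a) + len(grams_b)
    return 2.0 * len(grams_a & grams_b) / total if total else 0.0


def compare(config: dict[str, str], rec_a: dict[str, str],
            rec_b: dict[str, str]) -> dict[str, float]:
    """Apply the comparator configured for each field to a record pair."""
    return {field: COMPARATORS[name](rec_a[field], rec_b[field])
            for field, name in config.items()}
```

A new comparison method is added by registering another function; the configuration then refers to it by name without changes to the calling code.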
3. Dynamic Data Schemas
Given that we cannot predict which patient-specific attributes will be utilized for record linkage purposes, SOEMPI should be able to store any type of data and database tables with diverse schemas. We designed SOEMPI to store multiple instances of flexibly typed datasets (any number and type of fields). This requirement derives from real-world scenarios where, for example, mother’s weight and newborn’s weight are required (both are floating point data types). The software was designed such that each imported dataset has its own table in the underlying database, as well as an entry in a registry table which keeps track of the uploaded datasets. Likewise, when SOEMPI links records, it creates a new table for the resulting join and updates a registry documenting which tables were joined. SOEMPI still leverages a conventional persistence layer (i.e., object relational mapping tools) whenever possible (e.g., user management and sessions), but uses a custom flexible persistence layer when necessary.
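The one-table-per-dataset design with a registry can be sketched as follows. This is a minimal illustration that uses Python's built-in `sqlite3` module in place of the DBMS used by SOEMPI; the table names, column names, and function name are hypothetical.

```python
import sqlite3

ALLOWED_TYPES = {"TEXT", "INTEGER", "REAL"}


def import_dataset(conn: sqlite3.Connection, name: str,
                   columns: dict[str, str], rows: list[tuple]) -> str:
    """Create a table for one dataset and record it in a registry table."""
    # Identifiers cannot be bound as SQL parameters, so validate them first.
    if not name.isidentifier() or not all(c.isidentifier() for c in columns):
        raise ValueError("dataset and column names must be plain identifiers")
    if not set(columns.values()) <= ALLOWED_TYPES:
        raise ValueError("unsupported column type")

    conn.execute("CREATE TABLE IF NOT EXISTS dataset_registry "
                 "(name TEXT PRIMARY KEY, table_name TEXT, num_columns INTEGER)")
    table = f"dataset_{name}"
    col_defs = ", ".join(f'"{col}" {typ}' for col, typ in columns.items())
    conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.execute("INSERT INTO dataset_registry VALUES (?, ?, ?)",
                 (name, table, len(columns)))
    conn.commit()
    return table
```

For example, a dataset with two floating point fields, such as mother's weight and newborn's weight, would be imported with `columns={"mother_weight": "REAL", "newborn_weight": "REAL"}`.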
Performance Analysis
In this section, we compare the running time of OpenEMPI to SOEMPI with respect to a certain PRL protocol to illustrate how time and computation are influenced by the privacy preserving procedures.
A. Experimental Design
To allow for reproducibility, we performed our experiments with records from the publicly available North Carolina Voter Registration (NCVR) database, which, at the time of this study, consisted of 6,190,504 records. We generated 10 datasets, each of which consists of 10,000 randomly selected records over the following fields: Forename, Surname, City (of Residence), Street Name (of Residence), Gender, and Ethnicity. From each dataset, we created a corresponding “corrupted” dataset for matching purposes as described by Durham et al.16 In doing so, every field of each record is subject to a corruption procedure, a part of the SOEMPI toolkit, that extends the strategy implemented in FEBRL26 (which is based on research by Christen and Pudjijono31). This procedure introduced character-level errors at rates consistent with those reported in practice. These included optical character recognition (OCR) errors (e.g., S swapped for 8), phonetic errors (e.g., ph swapped for f), and typographic errors (e.g., insertions and transpositions). In our experiments, these corruptions occurred with the following probabilities: insertions and deletions: 0.15, phonetic errors: 0.03, OCR errors: 0.01, substitutions: 0.35, and transpositions: 0.05. To further simulate record linkage challenges faced by HCOs, our extensions introduced errors at the value-level, such as the use of a nickname or a change in residential address. First, nicknames based on the Massmind database32 were substituted for Forenames with probability 0.15. Second, surnames were substituted, using the U.S. Census names data33, for females (e.g., changes due to marriage) with probability 0.1, for males with probability 0.01, and were hyphenated with probability 0.01. Third, street names were changed with probability 0.1 using a random selection from the entire NCVR dataset.
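A simplified view of the character-level part of such a corruption procedure is sketched below. It is not the SOEMPI or FEBRL corruptor: we assume, for illustration only, that each error type is applied independently at most once per field value, that insertions and deletions each use the 0.15 rate, and that replacement characters are drawn uniformly from lowercase letters. Phonetic, OCR, and value-level errors are omitted.

```python
import random
import string


def corrupt(value: str, rng: random.Random, p_insert: float = 0.15,
            p_delete: float = 0.15, p_substitute: float = 0.35,
            p_transpose: float = 0.05) -> str:
    """Apply typographic errors to one field value with given probabilities."""
    chars = list(value)
    if chars and rng.random() < p_insert:
        # Insert a random letter at a random position.
        chars.insert(rng.randrange(len(chars) + 1),
                     rng.choice(string.ascii_lowercase))
    if len(chars) > 1 and rng.random() < p_delete:
        # Delete one character.
        del chars[rng.randrange(len(chars))]
    if chars and rng.random() < p_substitute:
        # Overwrite one character with a random letter.
        chars[rng.randrange(len(chars))] = rng.choice(string.ascii_lowercase)
    if len(chars) > 1 and rng.random() < p_transpose:
        # Swap two adjacent characters.
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)
```

Passing a seeded `random.Random` instance makes a corrupted dataset reproducible across runs.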
For OpenEMPI, we considered a record linkage framework that involved three parties: Data Providers Alice and Bob and a Data Integrator DAN. The Data Providers transformed the Forename and Surname fields using double metaphone (a phonetic encoding algorithm designed to mitigate noise) during the import of the datasets. These are appended as additional fields to the datasets, which are then passed onto DAN, where a blocking procedure in the form of the sorted neighborhood algorithm is performed in two rounds over these fields. The matching protocol consisted of an Expectation Maximization-based (EM-based) FS algorithm.
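The sorted neighborhood idea can be sketched as follows. This is a generic Python illustration and not OpenEMPI's implementation: the blocking key is passed in as a plain function (standing in for the double metaphone encoding, which is not part of the Python standard library), and the window size is a hypothetical default.

```python
from typing import Callable, Hashable


def sorted_neighborhood(records_a: list[Hashable], records_b: list[Hashable],
                        key: Callable[[Hashable], str],
                        window: int = 3) -> set[tuple]:
    """Return candidate (record_a, record_b) pairs within a sliding window."""
    tagged = [("A", r) for r in records_a] + [("B", r) for r in records_b]
    # Sort both datasets together by the blocking key.
    tagged.sort(key=lambda item: key(item[1]))
    pairs = set()
    for i, (src_i, rec_i) in enumerate(tagged):
        # Compare each record with the next (window - 1) records in sort order.
        for src_j, rec_j in tagged[i + 1:i + window]:
            if src_i != src_j:
                pair = (rec_i, rec_j) if src_i == "A" else (rec_j, rec_i)
                pairs.add(pair)
    return pairs
```

Running the procedure in two rounds, as in our setup, corresponds to calling it once per blocking key and taking the union of the candidate pairs.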
For PRL, we used the Four-Party protocol described above. During the Parameter Manager phase, Alice and Bob each transferred a random sample of 1,000 records to PAM in an HMAC-encoded form. PAM then performed an EM-based FS algorithm to determine the parameters of the Bloom filter representations of the records as described elsewhere.16 After Alice and Bob transferred their BFEs to DAN, blocking was performed through three rounds of a clustering process based on locality sensitive hashing (LSH), where each seed of a cluster corresponds to 10 bit positions randomly sampled from the Bloom filter schema. Finally, DAN measured the similarity of the record pairs (one BFE contributed by Alice and one contributed by Bob) that fell in the same cluster, using a Dice similarity function, to link records to their best match.
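The blocking and matching step at DAN can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the SOEMPI implementation: filters are plain lists of 0/1 values, a block is defined by the values at the sampled bit positions, and the function names and the seed parameter are hypothetical.

```python
import random
from collections import defaultdict


def dice(a: list[int], b: list[int]) -> float:
    """Dice coefficient of two Bloom filters: 2 * |a AND b| / (|a| + |b|)."""
    common = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2.0 * common / total if total else 0.0


def lsh_blocks(encodings: dict[str, list[int]], positions: list[int]) -> dict:
    """Group record ids by the bit values at the sampled positions."""
    blocks = defaultdict(list)
    for rec_id, bits in encodings.items():
        blocks[tuple(bits[p] for p in positions)].append(rec_id)
    return blocks


def link(alice: dict[str, list[int]], bob: dict[str, list[int]], num_bits: int,
         rounds: int = 3, sample_size: int = 10, seed: int = 0) -> dict:
    """Link each of Alice's records to its best Dice match found in any block."""
    rng = random.Random(seed)
    best: dict[str, tuple[float, str]] = {}
    for _ in range(rounds):
        positions = rng.sample(range(num_bits), sample_size)
        blocks_a = lsh_blocks(alice, positions)
        blocks_b = lsh_blocks(bob, positions)
        for block_key, ids_a in blocks_a.items():
            for a_id in ids_a:
                for b_id in blocks_b.get(block_key, []):
                    score = dice(alice[a_id], bob[b_id])
                    if score > best.get(a_id, (0.0, ""))[0]:
                        best[a_id] = (score, b_id)
    return best
```

Only pairs that agree on all sampled bit positions in at least one round are compared, which is why the number of sampled bits and the number of rounds trade runtime against the chance of missing a true match.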
For each experiment, we matched the clean version of a given dataset to its own corrupted counterpart and reported on the average (and standard deviation) time for the 10 runs. We measured the time required to 1) import the data, 2) exchange data, and 3) execute various aspects of the linkage algorithm. All experiments were run on a single Quad-core Intel i7-2670QM processor @ 2.2GHz with 12 GB system memory. The majority of the computations were measured natively on the operating system. The data transfer measurements were performed between one SOEMPI instance running on the native OS and one running in a virtual box. Figure 1 depicts the specific steps in the process that were measured.
B. Findings
Our results focus on i) the bandwidth required to transfer and manage datasets for record linkage and ii) the time required to complete the record linkage protocols.
Bandwidth Required
In the non-privacy preserving environment, datasets transferred for record linkage required 550 KB. The size of the privacy-preserving BFE datasets depends on the length of the Bloom filters, which is itself dependent on the records involved in the linkage process. The length of the BFE is determined by PAM during the Parameter Manager phase. Over the 10 runs of our protocol, the average recommended size of the Bloom filter was 9,217 bits, with a standard deviation of 2,197 bits. The size ranged from a minimum of 7,133 bits to a maximum of 12,904 bits, such that the size of the datasets transferred for private record linkage to DAN for integration ranged from approximately 8.9 MB to 16.1 MB. It should also be noted that the runtime of the PRL matching process is greatly influenced by the number of bits used in the LSH-based blocking process, as well as the number of rounds of blocking. Since the goal of this paper is to report on how the communication protocol influences the time necessary to perform PRL, we note that such a sensitivity analysis is beyond the scope of this paper and refer the reader elsewhere for a discussion on Bloom filter record linkage8 and LSH blocking accuracy34.
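The reported dataset sizes follow directly from the filter lengths. The short calculation below reproduces them, assuming that each BFE is transferred as a packed bit string with no per-record overhead and that 1 MB denotes 1,000,000 bytes; the function name is hypothetical.

```python
def bfe_dataset_megabytes(num_records: int, bits_per_filter: int) -> float:
    """Size of a dataset of Bloom filter encodings in megabytes (10^6 bytes)."""
    return num_records * bits_per_filter / 8 / 1_000_000


smallest = bfe_dataset_megabytes(10_000, 7_133)   # about 8.9 MB
largest = bfe_dataset_megabytes(10_000, 12_904)   # about 16.1 MB
average = bfe_dataset_megabytes(10_000, 9_217)    # about 11.5 MB
```

Under the same assumptions, the average filter length of 9,217 bits corresponds to roughly 11.5 MB per dataset, compared with 550 KB in the non-private setting.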
Time to Complete Protocol
Before comparing the record linkage protocols, it is important to note that for each step in SOEMPI in which a participant initiates communication with another participant for the first time, a one-time authentication procedure must take place. This authentication is necessary to establish a secure connection and incurs a fixed cost of 8–9 seconds.
Turning our attention to a comparison of the protocols, Table 2 provides a summary of the results and a breakdown by process. There are several notable findings to highlight. The Four-Party protocol incurs three additional categories of cost in comparison to the conventional record linkage method. The first corresponds to the request and dissemination of salt values from the Key Server, which required approximately 9.5 seconds. The majority of this time, however, is spent in authentication and is a fixed cost, which will decrease in its relative contribution to the overall runtime as datasets grow in size. The second corresponds to another fixed cost associated with the FS-based matching step performed at PAM. It can be seen that the Parameter Manager phase is quick. This is due, in part, to the fact that PAM performs exact matching (as opposed to a similarity comparison) because each record is represented by a single HMAC only. The cost is also kept relatively low because PAM performs this step over a subsample (1,000 records) of the datasets. The third additional cost corresponds to Bloom filter generation. Beyond these three categories, the largest cost comes from the PRL match step (step 8), which is also the most variable in terms of running time.
Table 2.
Average time (+/− standard deviation) in seconds for 10 runs across each step of the linkage protocols. Private record linkage (PRL) corresponds to the Four-Party protocol in Figure 1 via SOEMPI, while non-private record linkage (non-PRL) is the standard record linkage protocol in OpenEMPI. The steps seen here correspond to the protocol steps depicted in Figure 1. (a+b) refers to operations done both on Alice’s and Bob’s dataset (Figure 1).
| Linkage Step | PRL | Non-PRL |
|---|---|---|
| 0: Import datasets | 6.70 (0.59) | 5.58 (0.73) |
| 1 (a+b): Obtain salts | 0.09 (0.02) | – |
| 2 (a+b): Message digest fields | 0.04 (0.01) | – |
| 3 (a+b): Send encoded datasets | 2.31 (0.08) | – |
| 4: Compute new parameters | 5.32 (0.77) | – |
| 5 (a+b): Obtain parameter advice | <0.01 (<0.01) | – |
| 6 and 7: Create BFEs and send data | 20.11 (3.36) | 5.95 (0.38) |
| 8: Block and match record pairs | 58.74 (42.70) | 47.86 (3.61) |
| Total time required | 86.13 (43.52) | 59.78 (4.12) |
Overall, the non-secure process requires around one minute, while the secure process requires about one and a half minutes. Though the secure process is roughly 1.5X slower, it should be recognized that this is a relatively fixed cost. The majority of this increase is due to parameterization of the record encoding process. The record linkage process (step 8) itself is only 1.2X slower, which derives from the fact that it takes more time to compare two Bloom filters of several thousand bits (see bandwidth findings) than to compare each person-level field used in this study.
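The slowdown factors quoted above can be recomputed from the mean times in Table 2. The snippet below is simple arithmetic over the reported totals and step 8 times; the variable names are ours.

```python
# Mean times in seconds, taken from Table 2.
prl_total, non_prl_total = 86.13, 59.78
prl_match, non_prl_match = 58.74, 47.86

total_slowdown = prl_total / non_prl_total   # about 1.44, i.e. roughly 1.5X
match_slowdown = prl_match / non_prl_match   # about 1.23, i.e. roughly 1.2X
extra_seconds = prl_total - non_prl_total    # about 26 seconds of added time
```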
Discussion
The findings from our experimental investigation illustrate that the lifecycle of private record linkage (using BFEs and a Four-Party protocol) is slower and requires more bandwidth than record linkage over identifiable patient records. However, our results show that the costs do not incur a drastic loss in speed or increase in memory footprint and that the PRL process lifecycle can be completed in a practical amount of time. Though SOEMPI was designed to be flexible, configurable, and enterprise capable, there are certain limitations to our current implementation that we wish to highlight, which can be addressed in the future.
First, it should be recognized that the evaluation was a pilot study only. As such, the empirical analysis was performed over datasets of 10,000 records with a specific blocking and record linkage algorithm. Many of the authentication steps and the key / salt transmission processes will change negligibly as the various parameters of the PRL process are changed. However, the speed of the system will vary with the size of the dataset, the length of the Bloom filter, and the number of bits sampled for LSH, all of which affect the scalability of the system. It is thus recommended that a more comprehensive scalability assessment be performed before applying SOEMPI in larger record linkage frameworks.
At the same time, we wish to point out that the majority of SOEMPI operations are engineered to run in single-threaded processes. However, many of the procedures can be translated into multi-threaded versions, particularly LSH-based blocking35 and record matching36 to take advantage of modern parallel computing frameworks, such as MapReduce.
Second, from a PRL perspective, though SOEMPI can handle communication between the various participating entities, it does not implement any of the cryptographic primitives or protocols that have been proposed for PRL based on secure multiparty computation (SMC). However, SOEMPI can readily incorporate various crypto-toolkits that have been implemented in Java (e.g., the UTD Paillier toolkit37). We believe SOEMPI can be an environment for integrating and managing various PRL protocols in a plug-and-play manner. Other Bloom filter-based solutions could be implemented in SOEMPI with little effort.
Third, from a technical stance, there are several aspects of the system that can be improved. Notably, EJB may not be the best technology for remote communication in the case of record linkage. While it is ideal for serving concurrent and independent web queries, as well as exchanging individual patient records, PRL has different needs. In particular, certain PRL protocols may require longer runtimes, as well as concurrent computations where threads should exchange information. To address these needs, special care will be required to increase the timeout values of the system and achieve synchronization. Moreover, our customized persistence technology supports only the PostgreSQL database management system (DBMS). However, SOEMPI uses standard database connectivity and thus is readily extensible to other DBMS technologies.
Finally, the virtual machine’s JBoss application server is not fully secured and an SSL communication layer must be configured. As such, SOEMPI will need to undergo additional security hardening before it is applied in real-world settings.
Conclusions
This paper introduced an open source software suite to support the private record linkage (PRL) lifecycle. SOEMPI’s roots go back to the extensively tested and proven Open Master Patient Index (OpenEMPI) reference implementation; it has been modified and enhanced in several key areas to manage, transfer, and perform record linkage over encoded patient identifiers. We performed a meta-analysis to compare and contrast the proposed software toolkit with various freely available record linkage and PRL software toolkits that have been available to the biomedical research community. In addition to describing the software architecture and details of the specific technologies leveraged to realize SOEMPI in working code, we provided a high-level depiction of how to build several PRL protocols. These protocols demonstrate how multiple third parties, as well as data providers, can be integrated through a common communication and software system to cover the lifecycle. We also provided a runtime analysis of a Bloom filter-based PRL protocol that incorporates the lifecycle of encoding, blocking, and matching, and showed that such a procedure can complete in a practical amount of time.
Acknowledgments
This research was supported by grants CCF-0424422, CNS-1016343, and CNS-0964350 from the U.S. National Science Foundation and grants R01-LM009989 and UL1-TR000135 from the U.S. National Institutes of Health. The authors would like to thank Steve Nyemba, MS from Vanderbilt University, Doug Bell, MD, PhD from the University of California at Los Angeles, Abel Kho, MD from Northwestern University, and Peter Christen, PhD from Australian National University for helpful discussions during the development of this software and for evaluating prototypes during their development.
References
- 1. Hernandez M, Stolfo S. Real-world data is dirty: data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery. 1998;2:9–37.
- 2. Christen P. Data matching: concepts and techniques for record linkage and duplicate detection. Springer. 2012.
- 3. U.S. Dept. of Health and Human Services. Standards for privacy of individually identifiable health information, final rule. Federal Register. 2003 Feb 20; 45 CFR: Pt 164.
- 4. Vatsalan D, Christen P, Verykios V. A taxonomy of privacy-preserving record linkage techniques. Information Systems. 2013;38:946–69.
- 5. Atallah M, Kerschbaum F, Du W. Secure and private sequence comparisons. Proc ACM Workshop on Privacy in the Electronic Society. 2003:39–44.
- 6. Schnell R, Bachteler T, Reiher J. Privacy-preserving record linkage using Bloom filters. BMC Med Inform Decis Mak. 2009;9:41. doi: 10.1186/1472-6947-9-41.
- 7. Kuzu M, Kantarcioglu M, Durham E, Toth C, Malin B. A practical approach to achieve private medical record linkage in light of public resources. J Am Med Inform Assoc. 2013;20:285–92. doi: 10.1136/amiajnl-2012-000917.
- 8. Randall SE, Ferrante A, Boyd JH, Bauer J, Semmens JB. Privacy-preserving record linkage on large real world data sets. J Biomed Inform. 2014. doi: 10.1016/j.jbi.2013.12.003. In press.
- 9. Inan A, Kantarcioglu M, Bertino E, Scannapieco M. A hybrid approach to private record linkage. Proc IEEE International Conference on Data Engineering. 2008:496–505.
- 10. Kum HC, Krishnamurthy A, Machanavajjhala A, Reiter MK, Ahalt S. Privacy preserving interactive record linkage (PPIRL). J Am Med Inform Assoc. 2014;21:212–20. doi: 10.1136/amiajnl-2013-002165.
- 11. Durham E, Kantarcioglu M, Xue Y, Malin B. Quantifying the correctness, computational complexity, and security of privacy-preserving string comparators for record linkage. Inf Fusion. 2012;13:245–59. doi: 10.1016/j.inffus.2011.04.004.
- 12. Weber SC, Lowe H, Das A, Ferris T. A simple heuristic for blindfolded record linkage. J Am Med Inform Assoc. 2012;19:e157–61. doi: 10.1136/amiajnl-2011-000329.
- 13. Schnell R, Bachteler T, Bender S. A toolbox for record linkage. Austrian J Statistics. 2004;33:125–33.
- 14. Bonomi L, Xiong L, Lu J. LinkIT: privacy preserving record linkage and integration via transformations. Proc ACM International Conference on Management of Data. 2013:1029–32.
- 15. Pentakalos O, Xie Y. An extensible open source enterprise master patient index. Poster Presentation - AMIA Annu Symp Proc. 2009. Software available at: http://www.openempi.org/
- 16. Durham E, Kantarcioglu M, Xue Y, Toth C, Kuzu M, Malin B. Composite Bloom filters for secure record linkage. IEEE Transactions on Knowledge and Data Engineering. doi: 10.1109/TKDE.2013.91. In press.
- 17. Grannis SJ, Overhage JM, McDonald CJ. Analysis of identifier performance using a deterministic linkage algorithm. Proc AMIA Symp. 2002:305–9.
- 18. Grannis SJ, Overhage JM, Hui S, McDonald CJ. Analysis of a probabilistic record linkage technique without human review. AMIA Annu Symp Proc. 2003:259–63.
- 19. Jurczyk P, Lu JJ, Xiong L, Cragan JD, Correa A. FRIL: a tool for comparative record linkage. AMIA Annu Symp Proc. 2008. pp. 440–4. Software online at: http://fril.sourceforge.net/
- 20. Thoburn KK, Gu D, Rawson T. Link Plus: Probabilistic record linkage software. Probabilistic Record Linkage Conference Call. 2007.
- 21. Campbell KM. Rule your data with the Link King (a SAS/AF application for record linkage and unduplication). 30th SAS User Group International Meeting. 2005.
- 22. Blasnik M. RECLINK: Stata module to probabilistically match records. Boston College Dept of Economics. 2010. Available online at http://ideas.repec.org/c/boc/bocode/s456876.html.
- 23. Wolfe BA, Mamlin BW, Biondich PG, et al. The OpenMRS system: collaborating toward an open source EMR for developing countries. AMIA Annu Symp Proc. 2006:1146.
- 24. Available online at: http://openemed.org/
- 25. Available online at: http://www.openhre.org/
- 26. Christen P. FEBRL – a freely available record linkage system with a graphical user interface. Proc Australasian Workshop on Health Data and Knowledge Management. 2008. pp. 17–25. Software available online at: http://datamining.anu.edu.au/projects/linkage.html#prototype_software.
- 27. Churches T, Christen P. Some methods for blindfolded record linkage. BMC Med Inform Decis Mak. 2004;4:9. doi: 10.1186/1472-6947-4-9.
- 28. Available online at: http://www.record-linkage.de/
- 29. Bonomi L, Xiong L, Lu JJ. LinkIT: privacy preserving record linkage and integration via transformations. SIGMOD Conference. 2013:1029–32.
- 30. Available online at: ftp://www.app.sboe.state.nc.us/data.
- 31. Christen P, Pudjijono A. Accurate synthetic generation of realistic personal information. Proc Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining. 2009:507–14.
- 32. Massmind Nicknames Database. Online at: http://techref.massmind.org/techref/ecommerce/nicknames.htm.
- 33. U.S. Census Bureau, Population Division. Online at: http://www.census.gov/genealogy/names/namesfiles.html.
- 34. Kim H, Lee D. HARRA: a faster iterative hashed record linkage for large-scale data collections. Proc International Conference on Extending Database Technology. 2010:525–36.
- 35. Karapiperis D, Verykios V. A distributed framework for scaling up LSH-based computations in privacy preserving record linkage. Proc Balkan Conference on Informatics. 2013:102–9.
- 36. Yan W, Xue Y, Malin B. Scalable load balancing for MapReduce-based record linkage. Proc IEEE International Performance Computing and Communications Conference. 2013:1–10.
- 37. Available online at http://www.utdallas.edu/~mxk093120/cgi-bin/paillier/index.php.


