Abstract
Linking patient records across disparate healthcare systems is essential to create comprehensive views of patient health, yet this task is complicated by inconsistent identifiers and data quality issues. Although traditional deterministic and probabilistic record linkage (RL) methods have long been used for this purpose, deterministic approaches are brittle in the presence of noisy personally identifiable information (PII), while probabilistic approaches are often difficult to scale. As a result, large-scale linkage commonly relies on restrictive matching strategies that limit recall. This work presents a hybrid RL approach that integrates a deep embedding model with deterministic rules, leveraging both the flexibility and noise robustness of soft embeddings and the reliability and predictable baseline performance of deterministic rules. Using a large-scale real-world dataset, a BERT-based embedding model is fine-tuned in a Siamese network with contrastive loss to encode PII fields as numeric vectors. De-duplicated identifiers (Fuzzy IDs) are then obtained through a blocking-and-clustering step using the embedding vectors. The approach is evaluated using multiple signals (social security number, phone, and email) and is shown to outperform baseline methods. A postprocessing step based on deterministic rules allows embedding-based linkage to be overridden in a subset of cases where high-confidence rules apply, such as when a high-quality identifier is available. The system is deployed on a commercial database consisting of more than 200 million PII records, demonstrating scalability in a real-world healthcare setting.
Keywords: contrastive learning, probabilistic record linkage, deterministic record linkage, entity resolution, patient matching
Introduction
As healthcare delivery becomes distributed across a multitude of systems and providers, the need to integrate data from disparate sources has become critical. Many health systems in the USA independently collect and store clinical data about patients, often using locally defined identifiers. As patients interact with multiple healthcare institutions, their information becomes fragmented across these silos, resulting in incomplete views of health records and posing challenges for data-driven healthcare delivery and research [1].
Integrating electronic health records (EHRs) across systems can provide a more comprehensive understanding of individual health and improve clinical decision-making, research, and population health initiatives. Consequently, numerous initiatives have emerged to link health records across systems while preserving data quality and patient privacy [2–7].
Record linkage (RL), also known as entity resolution, seeks to identify and merge records that refer to the same entity. For matching patients, this process typically relies on personally identifiable information (PII), such as name, date of birth (DOB), and ZIP codes. However, PII is prone to quality issues, including typographical errors and inconsistent or incomplete entries, which complicates direct linkage [8].
Traditional RL methods are broadly classified as deterministic, based on exact matching rules, and probabilistic, which use similarity scores to estimate the likelihood of a match [9, 10]. Hybrid methods have also been developed to combine the strengths of both approaches [11]. Recent work often uses Bloom filters [12–16] to strike a better balance between privacy protection and linkage accuracy, yet challenges persist when these methods are applied to noisy real-world data.
Parallel to these developments, machine learning (ML) techniques have been applied to RL tasks, including early work using neural networks [12, 17, 18]. Most ML-based RL models to date, however, rely on relatively simple architectures and small-scale datasets. Transformer-based architectures such as BERT [19] have shown strong performance across various domains by capturing complex relationships in input data through attention mechanisms. Despite this, their application in RL remains limited, likely due to the scarcity of large datasets in this domain.
The present study uses a large-scale real-world patient dataset to fine-tune a BERT-based Siamese network [20] using contrastive loss. Patient records are encoded as continuous vector representations (embeddings) in which the similarity of the records is reflected by geometric proximity. This embedding-based approach improves robustness to common variations—such as misspellings, inconsistent formatting, name changes, and missing values. These embeddings are used to obtain de-duplicated identifiers (Fuzzy IDs) in a blocking-and-clustering process. The matching performance of this process is compared with several baselines, from simple token-based matching to neural network-based approaches. Moreover, to harness the strengths of both learned embeddings and deterministic rules, the proposed framework adds a postprocess that enables overriding the embedding-based linkage by deterministic rules, for cases where high-quality identifiers are available. The resulting hybrid framework was deployed and used as a scalable solution for integration on more than 200 million records in a real-world commercial EHR database.
Materials and methods
Dataset
We use a commercial database composed of records aggregated from more than 30 healthcare systems across the USA. The full database at data sampling time contained more than 125 million PII records, capturing substantial diversity in demographic and data-entry patterns. For the experiments in this study, we use a 20% random sample of these records, totaling 25 million records.
Figure 1 summarizes the 13 available PII fields and their fill rates after data cleaning. Among these, five standard PII fields (first name, last name, DOB, gender, and ZIP code) had the highest completeness and were selected as the primary features for modeling and linkage. For the rest of the paper, unless otherwise specified, these fields are used after basic normalization and cleaning: first and last names are converted to uppercase and obvious placeholder values are removed, ZIP codes are converted to five-digit strings, and DOB is converted to a string in YYYY-MM-DD format.
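The normalization steps can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used in the study: the function names, the placeholder list, and the accepted input date formats are all assumptions.

```python
import re
from datetime import datetime

# Hypothetical placeholder values; the study does not list the ones it removes.
PLACEHOLDER_NAMES = {"UNKNOWN", "TEST", "NONE", "N/A"}
# Assumed input date formats; real data feeds may need more.
DOB_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")


def normalize_name(value):
    """Uppercase a name and drop empty or placeholder values."""
    if value is None:
        return None
    name = value.strip().upper()
    if not name or name in PLACEHOLDER_NAMES:
        return None
    return name


def normalize_zip(value):
    """Keep the first five digits of a ZIP code, or None if fewer are present."""
    if value is None:
        return None
    digits = re.sub(r"\D", "", str(value))
    return digits[:5] if len(digits) >= 5 else None


def normalize_dob(value):
    """Return the DOB as YYYY-MM-DD, or None if no assumed format parses."""
    if not value:
        return None
    for fmt in DOB_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None
```

Returning `None` for unparseable values keeps missingness explicit, so downstream steps can decide how to treat incomplete records.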
Figure 1.
Number of filled records and fill rates (after data cleaning) for PII fields. Fill rates are shown in brackets next to the field names
Problem formulation and measuring linkage performance
We formulate RL as a binary classification problem over pairs of records. Let $D$ denote a dataset of patient records. The objective is to identify pairs $(r_i, r_j)$, with $r_i, r_j \in D$ and $i \neq j$, that refer to the same underlying individual. Ideally, ground truth match labels are obtained through human annotation. Here, in the absence of a large human-annotated dataset, we construct high-confidence proxy labels using social security numbers (SSNs) as the primary labeling source, and validated Phone and Email values as auxiliary sources for cohorts where SSN is unavailable (the following two sections describe how these labels are prepared).
Let $M$ denote the set of true match pairs (based on the proxy labeling procedure) and $\hat{M}$ the set of predicted match pairs produced by a linkage method. We evaluate linkage performance using standard metrics:
Precision: $|M \cap \hat{M}| \,/\, |\hat{M}|$
Recall: $|M \cap \hat{M}| \,/\, |M|$
F1 score: the harmonic mean of precision and recall.
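As a concrete illustration, the three metrics can be computed directly from the set of true match pairs and the set of predicted match pairs. The sketch below is an assumption about how such an evaluation might be coded; the function names are hypothetical, and pairs are treated as unordered.

```python
def canonical_pairs(pairs):
    """Represent each unordered pair as a sorted tuple so (a, b) equals (b, a)."""
    return {tuple(sorted(pair)) for pair in pairs}


def pair_metrics(true_pairs, predicted_pairs):
    """Return (precision, recall, f1) for predicted pairs against true pairs."""
    true_set = canonical_pairs(true_pairs)
    pred_set = canonical_pairs(predicted_pairs)
    true_positives = len(true_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(true_set) if true_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, with three true pairs and four predicted pairs of which two are correct, precision is 2/4, recall is 2/3, and F1 is 4/7.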
Primary labeling source—SSN
Although SSNs are not universally available and may be affected by data quality issues, valid SSNs—when present—provide a strong ground-truth anchor. In the absence of manual annotations, we use SSNs as a proxy to construct labeled record pairs. After excluding incorrectly formatted values, an initial inspection showed that records sharing the same SSN typically correspond to the same individual. However, we observed two recurring exceptions:
Correctly formatted values that resemble placeholder SSNs (e.g. 111-22-3333) and are shared by a large number of records (e.g. more than 50).
Correctly formatted, valid-looking SSNs that are shared by records belonging to different individuals (e.g. records with different names or DOB). Qualitative inspection suggested these cases frequently reflect SSN reuse among family members.
To ensure high confidence in the resulting ground-truth labels, we designed a procedure to identify SSN values affected by these issues. Formatting errors are detected using regular expressions (regex), while the above two exception scenarios are identified by examining the consistency of names and DOB associated with each SSN. After the format check, we mark an SSN as invalid if the group of records associated with this SSN contains more than one unique name–DOB combination. To tolerate imperfections in name and DOB fields, we introduce two additional steps:
To account for variations such as first–last name switching or cases where a middle name is concatenated with either the first or last name (often due to differences across EHR data sources), we construct a combined full name field. This field is created by alphabetically sorting and concatenating name components, thereby removing ordering effects.
To accommodate typographical errors, full names and DOB are clustered using the sequence match ratio computed by the Python standard library module `difflib`. Two full name–DOB combinations are considered different only if their match ratio is lower than 0.8.
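The consistency check described by these two steps can be sketched as follows. This is an illustrative reading, not the study's code: the greedy grouping against one representative per cluster, the record field names, and the helper names are assumptions; only the sorted-name key, `difflib`, and the 0.8 ratio come from the text.

```python
from difflib import SequenceMatcher

MIN_RATIO = 0.8  # combinations below this ratio are treated as different


def identity_key(first_name, last_name, dob):
    """Sort name components alphabetically so ordering effects disappear."""
    parts = sorted(f"{first_name} {last_name}".upper().split())
    return " ".join(parts) + " " + dob


def count_identities(records, min_ratio=MIN_RATIO):
    """Greedily group name-DOB keys; each group keeps one representative."""
    representatives = []
    for rec in records:
        key = identity_key(rec["first_name"], rec["last_name"], rec["dob"])
        if not any(
            SequenceMatcher(None, key, rep).ratio() >= min_ratio
            for rep in representatives
        ):
            representatives.append(key)
    return len(representatives)


def ssn_is_consistent(records):
    """An SSN group is kept only if it maps to a single identity."""
    return count_identities(records) <= 1
```

With this key, "JOHN SMITH" and "SMITH JOHN" collapse to the same string, and a one-letter typo such as "JON SMITH" stays well above the 0.8 ratio.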
We inspected 300 SSNs (groups) to validate our assumptions and found that this process successfully captured all SSN-related issues detectable through manual review. Out of 25 million records in the dataset, 46% contained a well-formed SSN (as indicated by SSN in Fig. 1), and among these, 75% (8.6 million records) were retained as having valid SSNs, yielding 121 509 positive (SSN-matching) pairs.
Auxiliary labeling sources—phone and email
Valid SSNs provide a high-quality labeling signal, but they are missing for a substantial fraction of records. To evaluate linkage quality on the cohort of records without SSN, we relied on additional identifiers that (i) have sufficient availability and (ii) are typically associated with a unique individual. Therefore, we use phone number and email address as auxiliary labeling sources. These fields are not as stable or universally unique as SSNs, yet they are used for identification in many practical applications. Due to the possibility of individuals having multiple emails or a phone number being used by more than one person, we treat phone and email as imperfect but informative evaluation signals rather than an absolute ground truth.
As with SSNs, we apply regex-based format validation followed by consistency checks over name–DOB values within phone/email groups, using the same name-normalization and fuzzy-matching steps. After excluding records with valid SSNs, the phone cohort contains 3.7 million records with valid phone numbers, yielding 23 468 positive (phone-matching) pairs, and the email cohort contains 3.0 million records with valid email addresses, yielding 14 487 positive (email-matching) pairs.
Test sets and splitting
For the primary SSN-labeled cohort (described in “Primary labeling source—SSN” section), we split records into 20% testing, 70% training, and 10% validation. To prevent overlap and leakage, partitioning is performed at the block level: all records sharing the same blocking token value (“Blocking” section) are assigned to the same split. This ensures that test-time records are disjoint from—and not trivially similar to—training records under the chosen blocking scheme. For each of the Email and Phone cohorts (described in “Auxiliary labeling sources—phone and email” section), all of the records are used for testing, and these two auxiliary test sets, by definition (lacking SSN), are disjoint from the primary SSN-based train/validation/test sets.
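One simple way to realize a block-level split is to derive the split deterministically from the blocking token, so that every record in a block lands in the same partition. The sketch below assumes a hash-based assignment, which the text does not specify; the function name is hypothetical, and the 70/10/20 proportions then hold approximately over blocks rather than exactly over records.

```python
import hashlib


def split_for_block(token, train=0.7, validation=0.1):
    """Map a blocking token to 'train', 'validation', or 'test'.

    The token is hashed to a number in [0, 1); the same token always
    yields the same split, so a block is never divided across splits.
    """
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < train:
        return "train"
    if u < train + validation:
        return "validation"
    return "test"
```

Because the assignment depends only on the token, it can be recomputed independently on each worker without sharing state.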
Baselines
Simple token-based matching
A common operational linkage approach is to construct a token by concatenating selected PII fields (or their substrings/derived strings) and to match only on exact token equality. This approach is computationally efficient (tokens can be computed and matched via group-by operations) but is inflexible under noise, especially if tokens are hashed and only equality comparisons are possible.
We evaluate multiple token definitions that capture different precision–recall trade-offs. We consider the following tokens constructed from combinations of PII fields:
T1: first_name[0] + last_name + DOB + gender
T2: first_name + last_name + DOB + ZIP[: 3]
T3: first_name + last_name + gender + DOB
T4: first_name[: 3] + last_name + gender
These represent different balances between noise-robustness and discriminative power. For example, T1 uses only the first letter rather than the full first name, improving recall under name noise at the cost of precision, while T2 adds ZIP code, which can change over time and reduce recall. We assume tokens will be hashed; therefore, matching relies strictly on exact equality. The fields are normalized, prior to use, as described in “Dataset” section.
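The four token definitions translate directly into string operations. The sketch below follows the formulas as written (plain concatenation without a delimiter); the record field names and the choice of SHA-256 for hashing are assumptions made for illustration.

```python
import hashlib


def build_tokens(rec):
    """Build tokens T1 to T4 from a normalized record (all fields assumed present)."""
    first, last = rec["first_name"], rec["last_name"]
    dob, gender, zip5 = rec["dob"], rec["gender"], rec["zip"]
    return {
        "T1": first[:1] + last + dob + gender,
        "T2": first + last + dob + zip5[:3],
        "T3": first + last + gender + dob,
        "T4": first[:3] + last + gender,
    }


def hashed(token):
    """Hash a token; after hashing only exact-equality comparison is possible."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()
```

For a record pair that differs only by a first-name variant such as "JONATHAN" versus "JON", T1 and T4 still agree while T2 and T3 do not, which is the recall versus precision trade-off described above.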
XGBoost classifier
As a stronger baseline, we trained an XGBoost classifier [21] that predicts whether a record pair is a match. Each pair is represented by five numerical features indicating the similarity of each of the five primary PII fields (first name, last name, DOB, gender, ZIP). The classifier outputs a match probability, which is thresholded to obtain predicted links. Hyperparameters and the decision threshold are selected to maximize F1 on the validation set.
Neural network (BERT-based) classifier
We also evaluate a neural network-based pair-classification baseline using a BERT encoder. This model receives each record pair as a structured textual input (JSON encoding of PII fields). A classification head is applied to the output embedding of the [CLS] token, and it is trained end-to-end with binary cross-entropy loss. As with XGBoost, hyperparameters and classification threshold are selected based on validation-set F1.
Proposed method using embeddings—fuzzy ID
Our primary linkage method, Fuzzy ID, relies on learned vector representations (embeddings) of patient records to group records of the same individual together and assign the same Fuzzy ID to the group. The embedding model transforms a given record into a vector such that records from the same individual are close in Euclidean distance while records from different individuals are separated by a margin. This representation-learning formulation is scalable and defines a criterion for a desirable embedding model. The embedding model can then be continuously evaluated and improved with respect to this criterion. Figure 2 shows the schematic of the model architecture and how the embedding vector is derived for each record.
Figure 2.
Embedding model architecture
At linkage time, first, records are divided into blocks based on a high-recall token and then records in each block are separately clustered according to their embedding distance. After clustering, each record is assigned a provisional Fuzzy identifier derived by concatenating the block identifier (i.e. the blocking token value) with the cluster index. This process is described in Algorithm 1, and is shown on the left side of Fig. 4, highlighted in the light blue box. Blocking and Clustering steps are explained in more detail in the following sections (“Blocking” and “Clustering”). “Training embedding model” section describes how the model is trained using contrastive learning.
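The block-then-cluster flow can be sketched as below. This is not a reproduction of Algorithm 1: the function names, the `|` and `#` separators in the identifier, and the pluggable `cluster_fn` (any function returning one cluster index per record in a block) are assumptions made for illustration.

```python
from collections import defaultdict


def t1_token(rec):
    """Blocking token T1: first initial, last name, DOB, gender."""
    return "|".join(
        [rec["first_name"][:1], rec["last_name"], rec["dob"], rec["gender"]]
    )


def assign_fuzzy_ids(records, cluster_fn, token_fn=t1_token):
    """Return one provisional Fuzzy ID per record.

    cluster_fn receives the records of a single block and returns a list
    of cluster indices, one per record, in the same order.
    """
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[token_fn(rec)].append(idx)

    fuzzy_ids = [None] * len(records)
    for token, members in blocks.items():
        labels = cluster_fn([records[i] for i in members])
        for i, label in zip(members, labels):
            # Block identifier concatenated with the within-block cluster index.
            fuzzy_ids[i] = f"{token}#{label}"
    return fuzzy_ids
```

Because each block is processed independently, the loop over blocks is the natural unit of parallelism.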
Figure 4.
Schematic illustration of the hybrid linkage
Blocking
We employ blocking, a standard strategy that partitions the dataset into disjoint subsets (blocks) based on a blocking token constructed from PII fields. Candidate comparisons are restricted to records within the same block, under the assumption that, having a blocking token with sufficiently high recall, true matches almost always share the blocking token. This avoids a quadratic explosion of the number of comparisons in large datasets and yields independent subproblems that can be processed in parallel.
We evaluated multiple blocking tokens constructed from different combinations of the five primary PII fields. Token T1 yielded the most favorable trade-off between high recall and manageable block sizes. By using only the first initial rather than the full first name, T1 is more robust to common first-name noise (misspellings, nicknames, truncation) than T3, while retaining discriminative power from DOB, last name, and gender. Token T2 achieved higher precision but lower recall due to its reliance on ZIP code, which can change over time. The T4 token omits DOB and exhibited poor precision. These trends are consistent with the match performance of the tokens reported in Table 1.
Table 1.
Matching performance based on SSN.
| | FuzzyID | NN | XGB | T1 | T2 | T3 | T4 | T1–T4 | Exact |
|---|---|---|---|---|---|---|---|---|---|
| Precision | **0.94** | 0.87 | 0.87 | 0.81 | 0.97 | 0.95 | 0.004 | 0.97 | *0.98* |
| Recall | **0.93** | 0.90 | *0.98* | 0.95 | 0.69 | 0.82 | 0.85 | 0.69 | 0.65 |
| F1 | ***0.935*** | 0.88 | 0.92 | 0.874 | 0.86 | 0.880 | 0.008 | 0.806 | 0.782 |
Note. NN: neural network classifier. XGB: XGBoost Classifier. Exact: Matching based on exact string match of PII features. T1–T4: Matching that requires all four tokens to match. Bold: Proposed method. Italics: Highest for the given metric.
Clustering
Within each block, we cluster record embeddings using a stringent distance criterion. The goal is to identify groups of records whose embeddings lie within a specified distance of one another in the embedding space. We select this distance threshold by examining the distribution of embedding distances for positive and negative pairs in the validation set, and choosing the value that maximizes the F1 score when embedding distance is used to predict the pair label. Figure 3 highlights the selected operating point on the distance distribution (a), the ROC curve (b), and the threshold–F1 curve (c).
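Threshold selection of this kind amounts to a one-dimensional sweep. The following sketch assumes that candidate thresholds are the observed distances and that a pair is predicted as a match when its distance is at most the threshold; the function name is hypothetical, and the quadratic scan is kept for clarity rather than speed.

```python
def best_f1_threshold(distances, labels):
    """Return (threshold, f1) maximizing F1 for the rule: distance <= threshold.

    distances: embedding distances for validation pairs.
    labels: 1 for a true match, 0 otherwise, aligned with distances.
    """
    total_positive = sum(labels)
    best_thr, best_f1 = None, -1.0
    for thr in sorted(set(distances)):
        predicted = sum(1 for d in distances if d <= thr)
        true_pos = sum(1 for d, y in zip(distances, labels) if d <= thr and y == 1)
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / total_positive if total_positive else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        if f1 > best_f1:  # keep the smallest threshold on ties
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1
```

For large validation sets, sorting the pairs once and updating running counts would give the same result in near-linear time.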
Figure 3.
(a) Distribution of embedding distance in positive and negative (within-block) pairs in the validation set. (b) ROC curve of the embedding distance as a predictor of label in validation set. (c) The F1 score for the embedding distance as a predictor of label is plotted for different threshold values on the validation set
Let G be the neighborhood graph whose vertices are records in the block and where an undirected edge connects two records if their Euclidean embedding distance is at most a threshold thr. Rather than taking connected components of G (single-linkage clustering), which can propagate errors transitively, we form clusters as fully connected subgraphs (cliques) of G. This guarantees that every pair of records assigned to the same cluster is within thr, improving cluster precision.
Computing an optimal clique cover is NP-hard in general. In our setting, blocks are small (at most 150 records in the evaluated dataset), and a greedy heuristic based on coloring the complement graph was sufficiently fast and produced stable results (see “Computational infrastructure” section for the overall runtime of the linkage process on the full data). Algorithm 2 describes the heuristic used to partition G into cliques. For substantially larger blocks, a more scalable but less strict alternative would be to use connected components (single-linkage) or other approximate clustering methods, at the cost of potentially introducing false positives through transitive closure.
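A greedy clique cover through complement-graph coloring can be sketched as follows. This is one plausible instance of the idea, not a reproduction of Algorithm 2: the vertex ordering (most conflicts first) and the function name are assumptions. Two records conflict when their distance exceeds the threshold, and records sharing a color have no conflicts among them, so each color class is a clique of G.

```python
import math


def greedy_clique_cover(vectors, thr):
    """Partition records into cliques of the thr-neighborhood graph.

    Returns one cluster index per record. Every pair of records with the
    same index is within thr of each other.
    """
    n = len(vectors)
    # Edges of the complement graph: pairs that must NOT share a cluster.
    conflicts = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(vectors[i], vectors[j]) > thr:
                conflicts[i].add(j)
                conflicts[j].add(i)

    # Greedy coloring, visiting the most constrained vertices first.
    order = sorted(range(n), key=lambda v: len(conflicts[v]), reverse=True)
    color = [None] * n
    for v in order:
        used = {color[u] for u in conflicts[v] if color[u] is not None}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color
```

On a chain of three points where only neighbors are within the threshold, connected components would return one cluster, whereas this procedure returns two, since the end points are too far apart to share a clique.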
Training embedding model
The goal of the embedding model is to map each patient record to a vector in a space in which Euclidean distance reflects match likelihood: embeddings of records from the same individual are close, while those from different individuals are separated by a margin. Each record is represented using five key PII attributes—first name, last name, DOB, gender, and ZIP code—serialized into a normalized JSON string. This string is encoded by a transformer encoder initialized from an uncased BERT checkpoint, and the final [CLS] representation is used as the record embedding. Two identical encoders with shared weights form a Siamese network (Fig. 2).
We optimize the Siamese network using the standard contrastive loss with margin (m), set to 0.8. Although hyper-parameter (m) is not usually optimized, it directly shapes the distance scale in the embedding space and therefore affects the threshold that best separates positive and negative pairs—an operating point (thr) that we use for clustering (“Clustering” section).
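For an input pair $(x_1, x_2)$ with label $y \in \{0, 1\}$ (where $y = 1$ indicates a match), the loss is:
$$L(x_1, x_2, y) = \tfrac{1}{2}\left[\, y\, d^2 + (1 - y)\, \max(0,\; m - d)^2 \,\right] \qquad (1)$$
where $f$ denotes the embedding function and $d = \lVert f(x_1) - f(x_2) \rVert_2$ is the Euclidean distance between embeddings.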
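For a single pair, a common form of the margin-based contrastive loss can be written in a few lines. The sketch below uses the conventional 1/2 scaling and squared terms; whether the study used exactly this scaling is an assumption, and the function name is hypothetical. The default margin of 0.8 is the value reported above.

```python
import math


def contrastive_loss(emb_a, emb_b, label, margin=0.8):
    """Contrastive loss for one pair of embeddings.

    label == 1 (match): penalize the squared distance, pulling the pair together.
    label == 0 (non-match): penalize only if the pair is closer than the margin.
    """
    d = math.dist(emb_a, emb_b)
    if label == 1:
        return 0.5 * d**2
    return 0.5 * max(0.0, margin - d) ** 2
```

Non-matching pairs already farther apart than the margin contribute zero loss, which is why the margin sets the distance scale on which the clustering threshold is later chosen.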
Training pairs are constructed from the SSN-labeled subset described in “Primary labeling source—SSN” section. To align training with inference-time behavior and to avoid trivial examples, we sample pairs only within blocks, exclude pairs with all five fields identical, and remove redundant symmetric permutations (pair order does not affect the distance). Training is performed using Adam optimizer with learning rate of 5e-5 and weight decay of 0.005. We used mixed precision (16-bit) and a maximum sequence length of 1024 tokens.
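The pair-construction rules can be sketched as below. The record layout (`id`, `block`, `ssn`, and the five PII fields) and the function name are assumptions; `itertools.combinations` emits each unordered pair once, which removes symmetric duplicates.

```python
from collections import defaultdict
from itertools import combinations

PII_FIELDS = ("first_name", "last_name", "dob", "gender", "zip")


def training_pairs(records):
    """Return (id_a, id_b, label) for within-block, nonidentical pairs.

    label is 1 when both records carry the same (validated) SSN, else 0.
    """
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["block"]].append(rec)

    pairs = []
    for members in blocks.values():
        for a, b in combinations(members, 2):
            # Skip trivial examples where all five fields are identical.
            if all(a[f] == b[f] for f in PII_FIELDS):
                continue
            pairs.append((a["id"], b["id"], int(a["ssn"] == b["ssn"])))
    return pairs
```

Restricting negatives to the same block means the model trains on the same kind of hard, near-duplicate comparisons it faces at linkage time.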
We dedicate 10% of the data to validation for tuning hyperparameters and use early stopping based on the AUC obtained from embedding distances. We select the checkpoint from the last epoch that improves the validation set AUC (epoch 7). Each epoch consists of all within-block nonidentical pairs in the training split, and each batch contains 32 pairs (set to the maximum that could fit in our GPU memory).
Results
Evaluation of fuzzy ID matching performance
Table 1 reports Precision, Recall, and F1 for Fuzzy ID and all baselines under SSN-based evaluation.
Overall, the results show that Fuzzy ID matching provides a favorable balance: it achieves higher recall than more restrictive tokens (e.g. T3) while improving precision over more permissive tokens (e.g. T1), consistent with the intended role of embeddings as a robust soft-matching layer.
To assess performance on records without SSNs, we also evaluate on two independent auxiliary cohorts labeled by validated phone and email, as described in “Auxiliary labeling sources—phone and email” section. Tables 2 and 3 show the result of evaluation based on email and phone respectively. As expected, across all linkage methods, Precision and Recall against email and phone labels are lower than against SSN labels because email and phone are inherently noisier identifiers: individuals may report different values over time, and phone numbers can be reassigned.
Table 2.
Matching performance based on email.
| | FuzzyID | T1 | T2 | T3 | T4 |
|---|---|---|---|---|---|
| Precision | **0.70** | 0.60 | *0.76* | 0.71 | 0.001 |
| Recall | **0.74** | *0.78* | 0.52 | 0.69 | 0.75 |
| F1 | ***0.72*** | 0.68 | 0.62 | 0.70 | 0.003 |
Note. Bold: Proposed method. Italics: Highest for the given metric.
Table 3.
Matching performance based on phone.
| | FuzzyID | T1 | T2 | T3 | T4 |
|---|---|---|---|---|---|
| Precision | **0.63** | 0.53 | *0.74* | 0.64 | 0.001 |
| Recall | **0.71** | *0.75* | 0.56 | 0.68 | 0.72 |
| F1 | ***0.67*** | 0.62 | 0.64 | 0.66 | 0.001 |
Note. Bold: Proposed method. Italics: Highest for the given metric.
The email and phone cohorts are not used for training or linkage decisions and are disjoint from the SSN cohort. They therefore provide an additional view of performance that mitigates any bias introduced by SSN availability, and they show that Fuzzy ID retains the highest F1 among the compared methods when assessed with non-SSN labels.
Analysis of the embeddings
The success of embedding-based linkage with Fuzzy IDs depends critically on the quality of the learned embeddings, specifically on how well positive and negative pairs are separated according to their embedding distances. Figure 3a shows the distribution of Euclidean distances for positive and negative (within-block) pairs in the validation set and reveals a clear separation, indicating that the embedding space generalizes and provides a meaningful measure of match likelihood on held-out data.
Figure 3b provides a complementary view of this discriminative ability via the receiver operating characteristic (ROC) curve. The resulting area under the curve (AUC) indicates strong predictive performance in distinguishing matched from unmatched record pairs using distance alone. Finally, Figure 3c illustrates how we select the clustering threshold (thr) by maximizing the F1 score on the validation set; this threshold is then used in the subsequent clustering and linkage procedure (“Clustering” section).
It is important to note that the negative pairs shown in Fig. 3 belong to the same block and therefore already share substantial PII-based similarity, despite corresponding to distinct individuals. As a result, these negatives are more challenging to predict than randomly sampled global negatives, which explains the lower F1 score observed in panel (c) compared to the results reported in Table 1.
Error analysis
To better understand failure modes, we manually inspected and annotated a random sample of 100 false positives and 100 false negatives from Fuzzy ID predictions and grouped them into categories summarized in Tables 4 and 5.
Table 4.
False-positive categories (different SSNs but same fuzzy ID).
| Category | Count | Description |
|---|---|---|
| CN | 73 | Same name and different ZIP code (usually common names, e.g. John Smith) |
| SF | 13 | Similar but different first names (e.g. Audrey and Avery) and different ZIP code |
| CN-SZ3 | 9 | Same names, and same ZIP3 (usually common names) |
| CN-SZ5 | 4 | Same names, and same ZIP5 (usually common names) |
| SF-SZ5 | 1 | Similar but different first names and same ZIP5 |
Table 5.
False-negative categories (same SSN but different fuzzy ID).
| Category | Count | Description |
|---|---|---|
| DL-MAR | 20 | Different last name (probably because of last name change after marriage) |
| DZ | 13 | ZIP is different—All other fields are identical |
| NO-GEN | 11 | Gender is missing (null) |
| DF-MID | 11 | Middle name is attached to first name in one of the records |
| NO-ZIP | 9 | ZIP is missing (null) |
| DL-MID | 9 | Middle name is attached to last name in one of the records |
| DF-TYPO | 8 | Typo in first name in one of the records |
| DL-TYPO | 6 | Typo in last name in one of the records |
| DF-FORM | 4 | Different form of the first name used in one record (Abigail vs. Abby) |
| DG | 3 | Different gender without name change (probably data-entry error) |
| DB-TYPO | 3 | One digit difference in DOB (Probably due to typo) |
| DL-ERR | 2 | Placeholder or wrong value in last name in one of the records |
| DF-NC | 1 | Different first name (due to name change or middle name entered instead of first) |
False positives were dominated by cases where the five main PII fields matched except for ZIP code. These errors frequently occurred among individuals with very common names (e.g. “John Smith”), where even exact agreement on DOB can occur across different people. In such cases the model appears to infer that the records belong to the same individual who has moved, producing a spurious link. We also observed variants where ZIP codes were similar (e.g. same first three digits) or identical, suggesting the model may place nontrivial weight on ZIP for the embedding even when the remaining evidence is ambiguous.
Another false-positive category involved names that are character-wise similar but represent different individuals (e.g. “Avery” versus “Audrey”). These are plausibly attributable to the embedding model’s learned similarity structure and may be mitigated through additional training data, improved sampling of hard negatives, or modifications to the loss/objective.
False negatives are cases with the same SSN that resulted in different Fuzzy IDs. A prominent category involved last-name changes consistent with marriage (‘DL-MAR’). Many of these failures are expected under our blocking setup because the blocking token (T1) includes last name and DOB; records that differ on these attributes will fall into different blocks and will not be compared during embedding-based clustering. While SSN can confirm such matches during evaluation, it may be difficult for any PII-based system to confidently infer a match when names differ substantially. Importantly, the deterministic combine postprocess can recover many such links when a high-confidence identifier like SSN is available.
We also observed false negatives arising from missing values (e.g. missing ZIP or gender). These suggest an opportunity to improve robustness to missingness so that the absence of a single field does not prevent clustering when other evidence strongly supports linkage. Other categories (e.g. typographical errors in first/last name, single-digit differences in DOB) represent cases that an improved matching system should capture; these categories directly inform future model and rule improvements.
Postprocess—overriding linkage using deterministic rules
Embedding-based clustering provides a strong general-purpose linkage mechanism. However, in many operational settings, high-confidence identifiers are available for subsets of records. When such fields are present, deterministic rules can enforce constraints that embeddings alone may occasionally violate. Here we used a valid SSN, but the same process can be used to enforce rules based on other identifiers.
We define two forms of deterministic overrides that aim to preserve the existing clusters as much as possible, intervening only when there is a clear conflict:
Split phase (enforce separation)
If a cluster contains conflicting values of a high-confidence split key (e.g. different valid SSNs), those records are required to reside in separate clusters. The presence of conflicting split-key values is treated as definitive evidence that the records represent different individuals. Algorithm 3 describes this procedure.
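A split step along these lines might look as follows. This is a hedged sketch, not Algorithm 3: the function name and the `/s<n>` suffix are hypothetical, and two policy choices are assumptions made to keep existing clusters as intact as possible, namely that the most frequent key value keeps the original Fuzzy ID and that records without a key stay where they are.

```python
from collections import Counter, defaultdict


def split_clusters(fuzzy_ids, split_keys):
    """Separate records whose split key (e.g. valid SSN) conflicts within a cluster.

    fuzzy_ids: current Fuzzy ID per record.
    split_keys: validated key per record, or None when unavailable.
    """
    members = defaultdict(list)
    for idx, fid in enumerate(fuzzy_ids):
        members[fid].append(idx)

    result = list(fuzzy_ids)
    for fid, idxs in members.items():
        counts = Counter(split_keys[i] for i in idxs if split_keys[i] is not None)
        if len(counts) <= 1:
            continue  # no conflict, leave the cluster untouched
        # Rank key values by frequency; rank 0 keeps the original ID.
        rank = {key: n for n, (key, _) in enumerate(counts.most_common())}
        for i in idxs:
            key = split_keys[i]
            if key is not None and rank[key] > 0:
                result[i] = f"{fid}/s{rank[key]}"
    return result
```

After this step, every cluster contains at most one distinct key value, which makes a later merge on the same key safe.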
Combine phase (enforce merging)
If records share a high-confidence combine key (e.g. the same SSN) and meet additional consistency criteria (e.g. compatible DOB and first name), they are merged into the same cluster. This treats the combine key, together with the consistency criteria, as a sufficient condition for linkage. Algorithm 4 describes the combine operation.
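A combine step can be sketched with a union-find structure over Fuzzy IDs. This is an illustration rather than Algorithm 4: the function name is hypothetical, the consistency criteria are passed in as a caller-supplied `compatible(i, j)` predicate, and comparing each record only with the first record seen for its key is a simplifying assumption.

```python
def combine_clusters(fuzzy_ids, combine_keys, compatible):
    """Merge clusters whose records share a combine key and pass the check.

    fuzzy_ids: current Fuzzy ID per record.
    combine_keys: validated key per record (e.g. SSN), or None.
    compatible: function (i, j) -> bool on record indices.
    """
    parent = {fid: fid for fid in fuzzy_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}
    for idx, key in enumerate(combine_keys):
        if key is None:
            continue
        if key not in first_seen:
            first_seen[key] = idx
            continue
        anchor = first_seen[key]
        if compatible(anchor, idx):
            root_a, root_b = find(fuzzy_ids[anchor]), find(fuzzy_ids[idx])
            if root_a != root_b:
                # Keep the lexicographically smaller ID as the merged ID.
                parent[max(root_a, root_b)] = min(root_a, root_b)
    return [find(fid) for fid in fuzzy_ids]
```

Merging whole clusters, rather than moving single records, carries along records that lack the key but were already linked by the embedding stage.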
Figure 4 provides an end-to-end illustration of the full hybrid linkage pipeline. In this illustration, five records are grouped into two token blocks. Within Block 1, there is a single embedding cluster, which is then split into two clusters due to an SSN conflict. Within Block 2, three clusters are formed, with no SSN conflicts. Finally, Records 2 and 3 are merged across blocks and clusters to share the same Fuzzy ID, based on the same SSN.
Computational infrastructure
We deployed the hybrid linkage algorithm on 200 million records from the full database. Embedding inference was performed on a 10-node cluster, each node equipped with an NVIDIA T4 GPU, and completed in approximately 300 h. The subsequent clustering, splitting, and combining steps were executed on 10–20 Azure Standard_D32ds_v5 nodes (32 cores, 75 GB RAM each) and completed in approximately 2 h. Model training was conducted on a single node with 800 GB of RAM and four NVIDIA A100 80 GB GPUs, using PyTorch’s Distributed Data Parallel (DDP). Training took approximately 3 h (about 20 min per epoch).
Discussion
Benefits of the hybrid approach
The proposed framework combines two complementary components: embedding-based Fuzzy ID generation and deterministic postprocessing. Both stages are highly parallelizable, enabling a scalable end-to-end solution for generating de-duplicated identifiers at a very large scale. The components are also relatively modular: the embedding model can be improved independently (e.g. via better contrastive training, additional fields, or improved handling of missingness), while deterministic rules can be expanded or refined to enforce domain-specific constraints and recover high-confidence linkages when strong identifiers exist.
The embedding stage performs the “heavy lifting” for difficult cases where exact equality is too brittle (misspellings, formatting variation, and partial missingness). The error analysis highlights specific weaknesses, such as over-linking among common-name collisions and sensitivity to ZIP similarity, that can guide future model refinement. Meanwhile, deterministic rules provide reliable safeguards and overrides when high-confidence identifiers are present, and can serve as practical stopgaps during iterative development.
Limitations and future work
Several limitations suggest clear future directions. First, additional tuning of the embedding model (e.g. hard-negative mining, calibrated distance thresholds, or alternative contrastive objectives) may improve discrimination, especially among common-name collisions. Second, quantifying the incremental benefit of deterministic rules requires evaluation beyond SSN-based ground truth; additional annotation strategies or alternative validation designs would strengthen this component’s measured impact. Third, incorporating additional fields and improving robustness to missing values are likely to improve recall without sacrificing precision.
We also note that privacy preservation was not the primary focus of this work. The hybrid algorithm was used for internal deduplication within the Truveta environment, with all experiments conducted within a single secure zone and without transfer of PII/PHI outside controlled boundaries. Embeddings are not necessarily equivalent to hashed tokens in privacy properties; investigating techniques that enable privacy-preserving linkage compatible with learned representations is an important future direction and will be addressed in forthcoming work.
Appendix A. Algorithms
Algorithm 1: Generate Fuzzy ID
Algorithm 2: Cluster Neighborhood Graph Using a Greedy Clique Cover Algorithm
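The listing for this algorithm is not reproduced in the text here, so the following is only a generic sketch of one common greedy clique-cover heuristic, not necessarily the variant used in this work. It assumes the neighborhood graph is given as a symmetric adjacency dictionary (an edge meaning two records are within the embedding distance threshold); the function name and the degree-first visiting order are assumptions.

```python
def greedy_clique_cover(adjacency):
    """Partition nodes into cliques with a simple greedy heuristic.

    adjacency: dict node -> set of neighbor nodes (symmetric, no self-loops).
    Returns a list of cliques, each a list of nodes; every node appears once.
    """
    unassigned = set(adjacency)
    cliques = []
    # Assumed ordering: visit high-degree nodes first, ties broken by node id.
    for seed in sorted(adjacency, key=lambda n: (-len(adjacency[n]), n)):
        if seed not in unassigned:
            continue
        clique = [seed]
        unassigned.remove(seed)
        # Add a neighbor only if it is adjacent to every current clique member.
        for cand in sorted(adjacency[seed]):
            if cand in unassigned and all(cand in adjacency[m] for m in clique):
                clique.append(cand)
                unassigned.remove(cand)
        cliques.append(clique)
    return cliques


# Toy graph: a triangle (1, 2, 3), a pendant node 4 attached to 3, and an isolated node 5.
adjacency = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}, 5: set()}
cliques = greedy_clique_cover(adjacency)
```

Requiring full pairwise adjacency means a record joins a cluster only if it is close to every member, which avoids the chaining effect of connected-components clustering. In the toy graph, node 4 is linked to node 3 but not to nodes 1 or 2, so it ends up in its own cluster.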
Algorithm 3: Postprocess-Split
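The listing for this algorithm is likewise not reproduced in the text here. As a rough sketch of the idea of splitting an embedding cluster on an SSN conflict, the example below partitions any cluster that contains two or more distinct non-missing SSNs. The function name, the label format, and the handling of records with a missing SSN (placed in a residual sub-cluster) are assumptions made for illustration and may differ from the actual rule.

```python
from collections import defaultdict


def split_on_ssn_conflict(cluster_of, ssn_of):
    """Split clusters that contain conflicting (distinct, non-missing) SSNs.

    cluster_of: dict record_id -> cluster label (string)
    ssn_of:     dict record_id -> SSN string, or None when missing
    Returns a dict record_id -> new cluster label.
    """
    members = defaultdict(list)
    for rid, label in cluster_of.items():
        members[label].append(rid)

    result = {}
    for label, rids in members.items():
        distinct = sorted({ssn_of[r] for r in rids if ssn_of.get(r)})
        if len(distinct) < 2:
            # No conflict: keep the embedding cluster as it is.
            for r in rids:
                result[r] = label
            continue
        index = {ssn: i for i, ssn in enumerate(distinct)}
        for r in rids:
            ssn = ssn_of.get(r)
            # Assumption: records without an SSN go to a residual "na" sub-cluster.
            suffix = index[ssn] if ssn else "na"
            result[r] = f"{label}/{suffix}"
    return result


cluster_of = {1: "c1", 2: "c1", 3: "c2", 4: "c2", 5: "c2"}
ssn_of = {1: "000-00-0001", 2: "000-00-0002", 3: "000-00-0003", 4: None, 5: "000-00-0003"}
split = split_on_ssn_conflict(cluster_of, ssn_of)
```

Here cluster `c1` holds two different SSNs and is split in two, while `c2` holds a single distinct SSN plus one missing value and is left unchanged.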
Algorithm 4: Postprocess-Combine
Contributor Information
Cheng Cao, Truveta Inc., 1745 114th Ave SE, Bellevue, WA 98004, United States.
Jay Pillai, Truveta Inc., 1745 114th Ave SE, Bellevue, WA 98004, United States.
Sara Daraei, Truveta Inc., 1745 114th Ave SE, Bellevue, WA 98004, United States.
Sina Ghadermarzi, Truveta Inc., 1745 114th Ave SE, Bellevue, WA 98004, United States.
Author contributions
Cheng Cao (Conceptualization, Formal Analysis, Investigation, Methodology, Software, Writing—Review & Editing), Jay Pillai (Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Project Administration, Supervision, Writing—Review & Editing), Sara Daraei (Data Curation, Investigation, Methodology, Software, Writing—Review & Editing), and Sina Ghadermarzi (Investigation, Methodology, Software, Writing—Original Draft Preparation, Writing—Review & Editing, Data Curation, and Visualization)
Conflicts of interest
None declared.
Funding
None declared.
Data availability
The data underlying this article cannot be shared publicly due to privacy reasons.
References
- 1. Shah K, Patt D, Mullangi S. Use of tokens to unlock greater data sharing in health care. JAMA 2023;330:2333–4.
- 2. Afshar M, Oguss M, Callaci TA et al. Creation of a data commons for substance misuse related health research through privacy-preserving patient record linkage between hospitals and state agencies. JAMIA Open 2023;6:ooad092.
- 3. Bromley J, Guyon I, LeCun Y et al. Signature verification using a “siamese” time delay neural network. Adv Neural Inform Process Syst 1993;6:737–44.
- 4. Castro VM, Gainer V, Wattanasin N et al. The Mass General Brigham Biobank Portal: an i2b2-based data repository linking disparate and high-dimensional patient data to support multimodal analytics. J Am Med Inform Assoc 2022;29:643–51.
- 5. Kho AN, Cashy JP, Jackson KL et al. Design and implementation of a privacy preserving electronic health record linkage tool in Chicago. J Am Med Inform Assoc 2015;22:1072–80.
- 6. Kiernan D, Carton T, Toh S et al. Establishing a framework for privacy-preserving record linkage among electronic health record and administrative claims databases within PCORnet®, the National Patient-Centered Clinical Research Network. BMC Res Notes 2022;15:337.
- 7. Marsolo K, Kiernan D, Toh S et al. Assessing the impact of privacy-preserving record linkage on record overlap and patient demographic and clinical characteristics in PCORnet®, the National Patient-Centered Clinical Research Network. J Am Med Inform Assoc 2023;30:447–55.
- 8. Grannis SJ, Xu H, Vest JR et al. Evaluating the effect of data standardization and validation on patient matching accuracy. J Am Med Inform Assoc 2019;26:447–56.
- 9. Joffe E, Byrne MJ, Reeder P et al. A benchmark comparison of deterministic and probabilistic methods for defining manual review datasets in duplicate records reconciliation. J Am Med Inform Assoc 2014;21:97–104.
- 10. Tromp M, Ravelli AC, Bonsel GJ et al. Results from simulated data sets: probabilistic record linkage outperforms deterministic record linkage. J Clin Epidemiol 2011;64:565–72.
- 11. Ong TC, Duca LM, Kahn MG, Crume TL. A hybrid approach to record linkage using a combination of deterministic and probabilistic methodology. J Am Med Inform Assoc 2020;27:505–13.
- 12. Christen V, Häntschel T, Christen P, Rahm E. Privacy-preserving record linkage using autoencoders. Int J Data Sci Anal 2023;15:347–57.
- 13. Gkoulalas-Divanis A, Vatsalan D, Karapiperis D, Kantarcioglu M. Modern privacy-preserving record linkage techniques: an overview. IEEE Trans Inf Forensics Secur 2021;16:4966–87.
- 14. Lazrig I, Ong TC, Ray I et al. Privacy preserving probabilistic record linkage without trusted third party. In 2018 16th Annual Conference on Privacy, Security and Trust (PST), pp. 1–10, Belfast: IEEE, 2018.
- 15. Karapiperis D, Gkoulalas-Divanis A, Verykios VS. FEDERAL: a framework for distance-aware privacy-preserving record linkage. IEEE Trans Knowl Data Eng 2018;30:292–304.
- 16. Schmidlin K, Clough-Gorr KM, Spoerri A; for the SNC study group. Privacy preserving probabilistic record linkage (P3RL): a novel method for linking existing health-related data and maintaining participant confidentiality. BMC Med Res Methodol 2015;15:46.
- 17. Ranbaduge T, Vatsalan D, Ding M. Privacy-preserving deep learning based record linkage. IEEE Trans Knowl Data Eng 2024;36:6839–50.
- 18. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–94, New York, NY, USA: ACM, 2016.
- 19. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–86, Minneapolis, Minnesota: Association for Computational Linguistics, 2019. doi: 10.18653/v1/N19-1423.
- 20. Bian J, Loiacono A, Sura A et al. Implementing a hash-based privacy-preserving record linkage tool in the OneFlorida clinical research network. JAMIA Open 2019;2:562–9.
- 21. Wilson DR. Beyond probabilistic record linkage: using neural networks and complex features to improve genealogical record linkage. In The 2011 International Joint Conference on Neural Networks, pp. 9–14, 2011. doi: 10.1109/IJCNN.2011.6033192.