Biology Methods & Protocols. 2026 Feb 9;11(1):bpag009. doi: 10.1093/biomethods/bpag009

Linking patient records at scale with a hybrid approach combining contrastive learning and deterministic rules

Cheng Cao 1, Jay Pillai 2, Sara Daraei 3, Sina Ghadermarzi 4,
PMCID: PMC12952525  PMID: 41777588

Abstract

Linking patient records across disparate healthcare systems is essential to create comprehensive views of patient health, yet this task is complicated by inconsistent identifiers and data quality issues. Although traditional deterministic and probabilistic record linkage (RL) methods have long been used for this purpose, deterministic approaches are brittle in the presence of noisy personally identifiable information (PII), while probabilistic approaches are often difficult to scale. As a result, large-scale linkage commonly relies on restrictive matching strategies that limit recall. This work presents a hybrid RL approach that integrates a deep embedding model with deterministic rules, leveraging both the flexibility and noise robustness of soft embeddings and the reliability and predictable baseline performance of deterministic rules. Using a large-scale real-world dataset, a BERT-based embedding model is fine-tuned in a Siamese network with contrastive loss to encode PII fields as numeric vectors. De-duplicated identifiers (Fuzzy IDs) are then obtained through a blocking-and-clustering step using the embedding vectors. The approach is evaluated using multiple signals (social security number, phone, and email) and is shown to outperform baseline methods. A postprocessing step based on deterministic rules allows embedding-based linkage to be overridden in a subset of cases where high-confidence rules apply, such as when a high-quality identifier is available. The system is deployed on a commercial database consisting of more than 200 million PII records, demonstrating scalability in a real-world healthcare setting.

Keywords: contrastive learning, probabilistic record linkage, deterministic record linkage, entity resolution, patient matching

Introduction

As healthcare delivery becomes distributed across a multitude of systems and providers, the need to integrate data from disparate sources has become critical. Many health systems in the USA independently collect and store clinical data about patients, often using locally defined identifiers. As patients interact with multiple healthcare institutions, their information becomes fragmented across these silos, resulting in incomplete views of health records and posing challenges for data-driven healthcare delivery and research [1].

Integrating electronic health records (EHRs) across systems can provide a more comprehensive understanding of individual health and improve clinical decision-making, research, and population health initiatives. Consequently, numerous initiatives have emerged to link health records across systems while preserving data quality and patient privacy [2–7].

Record linkage (RL), also known as entity resolution, seeks to identify and merge records that refer to the same entity. For matching patients, this process typically relies on personally identifiable information (PII), such as name, date of birth (DOB), and ZIP codes. However, PII is prone to quality issues, including typographical errors and inconsistent or incomplete entries, which complicates direct linkage [8].

Traditional RL methods are broadly classified as deterministic, based on exact matching rules, and probabilistic, which use similarity scores to estimate the likelihood of a match [9, 10]. Hybrid methods have also been developed to combine the strengths of both approaches [11]. Recent works often use Bloom filters [12–16] to strike a better balance between privacy protection and linkage accuracy, yet challenges persist when these methods are applied to noisy real-world data.

Parallel to these developments, machine learning (ML) techniques have been applied to RL tasks, including early work using neural networks [12, 17, 18]. Most ML-based RL models to date, however, rely on relatively simple architectures and small-scale datasets. Transformer-based architectures such as BERT [19] have shown strong performance across various domains by capturing complex relationships in input data through attention mechanisms. Despite this, their application in RL remains limited, likely due to the scarcity of large datasets in this domain.

The present study uses a large-scale real-world patient dataset to fine-tune a BERT-based Siamese network [20] using contrastive loss. Patient records are encoded as continuous vector representations (embeddings) in which the similarity of the records is reflected by geometric proximity. This embedding-based approach improves robustness to common variations—such as misspellings, inconsistent formatting, name changes, and missing values. These embeddings are used to obtain de-duplicated identifiers (Fuzzy IDs) in a blocking-and-clustering process. The matching performance of this process is compared with several baselines, from simple token-based matching to neural network-based approaches. Moreover, to harness the strengths of both learned embeddings and deterministic rules, the proposed framework adds a postprocess that enables overriding the embedding-based linkage by deterministic rules, for cases where high-quality identifiers are available. The resulting hybrid framework was deployed and used as a scalable solution for integration on more than 200 million records in a real-world commercial EHR database.

Materials and methods

Dataset

We use a commercial database composed of records aggregated from more than 30 healthcare systems across the USA. The full database at data sampling time contained more than 125 million PII records, capturing substantial diversity in demographic and data-entry patterns. For the experiments in this study, we use a 20% random sample of these records, totaling 25 million records.

Figure 1 summarizes the 13 available PII fields and their fill rates after data cleaning. Among these, five standard PII fields (first name, last name, DOB, gender, and ZIP code) had the highest completeness and were selected as the primary features for modeling and linkage. For the rest of the paper, unless otherwise specified, these fields are used after basic normalization and cleaning: first and last names are converted to uppercase and obvious placeholder values are removed, ZIP codes are converted to five-digit strings, and DOBs are converted to strings in YYYY-MM-DD format.
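The normalization described above can be illustrated with a minimal sketch. The placeholder list, the input date format, and all function names below are assumptions made for illustration; the paper does not specify them.

```python
import re
from datetime import datetime

# Hypothetical placeholder values; the actual list used in the study is not given.
PLACEHOLDER_NAMES = {"UNKNOWN", "TEST", "NONE", "N/A"}


def normalize_name(value):
    """Uppercase a name and blank out obvious placeholder values."""
    cleaned = (value or "").strip().upper()
    return "" if cleaned in PLACEHOLDER_NAMES else cleaned


def normalize_zip(value):
    """Keep the first five digits of a ZIP code, or return '' if fewer exist."""
    digits = re.sub(r"\D", "", value or "")
    return digits[:5] if len(digits) >= 5 else ""


def normalize_dob(value, input_format="%m/%d/%Y"):
    """Convert a date string to YYYY-MM-DD; the input format is an assumption."""
    try:
        return datetime.strptime(value, input_format).strftime("%Y-%m-%d")
    except (TypeError, ValueError):
        return ""
```

Returning an empty string for unusable values keeps missing data explicit, so downstream steps can decide how to treat it.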

Figure 1. Number of filled records and fill rates (after data cleaning) for PII fields. Fill rates are shown in brackets next to the field names.

Problem formulation and measuring linkage performance

We formulate RL as a binary classification problem over pairs of records. Let $D = \{r_1, r_2, \ldots, r_n\}$ denote a dataset of patient records. The objective is to identify pairs $(r_i, r_j)$ that refer to the same underlying individual. Ideally, ground truth match labels are obtained through human annotation. Here, in the absence of a large human-annotated dataset, we construct high-confidence proxy labels using social security numbers (SSNs) as the primary labeling source, and validated Phone and Email values as auxiliary sources for cohorts where SSN is unavailable (“Primary labeling source—SSN” and “Auxiliary labeling sources—phone and email” sections describe the procedure for preparing these labels).

Let $M$ denote the set of true match pairs (based on the proxy labeling procedure) and $\hat{M}$ the set of predicted match pairs produced by a linkage method. We evaluate linkage performance using standard metrics:

  • Precision: $|M \cap \hat{M}| / |\hat{M}|$

  • Recall: $|M \cap \hat{M}| / |M|$

  • F1 score: the harmonic mean of precision and recall.
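As a minimal sketch of these pairwise metrics, the function below treats record pairs as unordered and compares the set of true pairs with the set of predicted pairs. The function names and the tuple representation of pairs are illustrative assumptions.

```python
def pair_key(a, b):
    """Order-independent key for an unordered record pair."""
    return (a, b) if a <= b else (b, a)


def linkage_metrics(true_pairs, predicted_pairs):
    """Return (precision, recall, F1) for two collections of record pairs."""
    true_set = {pair_key(a, b) for a, b in true_pairs}
    pred_set = {pair_key(a, b) for a, b in predicted_pairs}
    hits = len(true_set & pred_set)  # pairs that are both true and predicted
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(true_set) if true_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```

For example, with three true pairs and four predicted pairs of which two are correct, precision is 2/4, recall is 2/3, and F1 is 4/7.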

Primary labeling source—SSN

Although SSNs are not universally available and may be affected by data quality issues, valid SSNs—when present—provide a strong ground-truth anchor. In the absence of manual annotations, we use SSNs as a proxy to construct labeled record pairs. After excluding incorrectly formatted values, an initial inspection showed that records sharing the same SSN typically correspond to the same individual. However, we observed two recurring exceptions:

  • Correctly formatted values that resemble placeholder SSNs (e.g. 111-22-3333) and are shared by a large number of records (e.g. more than 50).

  • Correctly formatted, valid-looking SSNs that are shared by records belonging to different individuals (e.g. records with different names or DOB). Qualitative inspection suggested these cases frequently reflect SSN reuse among family members.

To ensure high confidence in the resulting ground-truth labels, we designed a procedure to identify SSN values affected by these issues. Formatting errors are detected using regular expressions (regex), while the above two exception scenarios are identified by examining the consistency of names and DOB associated with each SSN. After the format check, we mark an SSN as invalid if the group of records associated with this SSN contains more than one unique name–DOB combination. To tolerate imperfections in name and DOB fields, we introduce two additional steps:

  • To account for variations such as first–last name switching or cases where a middle name is concatenated with either the first or last name (often due to differences across EHR data sources), we construct a combined full name field. This field is created by alphabetically sorting and concatenating name components, thereby removing ordering effects.

  • To accommodate typographical errors, full names and DOB are clustered using the sequence match ratio computed by the Python standard-library module difflib. Two full name–DOB combinations are considered different only if their match ratio is lower than 0.8.
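The two tolerance steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the field names, the way name and DOB are joined into one string, and the simplification of comparing every record with the first one (instead of a full clustering) are all hypothetical. Only `difflib.SequenceMatcher` and the 0.8 cutoff come from the text.

```python
from difflib import SequenceMatcher


def full_name_key(first_name, last_name):
    """Sort name components alphabetically to remove ordering effects."""
    parts = (first_name + " " + last_name).upper().split()
    return " ".join(sorted(parts))


def same_identity(name_dob_a, name_dob_b, min_ratio=0.8):
    """Treat two name-DOB strings as the same person if their ratio is >= 0.8."""
    return SequenceMatcher(None, name_dob_a, name_dob_b).ratio() >= min_ratio


def identifier_is_consistent(records, min_ratio=0.8):
    """Simplified check: every record must resemble the first record."""
    keys = [
        full_name_key(r["first_name"], r["last_name"]) + "|" + r["dob"]
        for r in records
    ]
    return all(same_identity(keys[0], key, min_ratio) for key in keys[1:])
```

An identifier value (SSN, phone, or email) whose record group fails this check would be marked invalid and excluded from labeling.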

We inspected 300 SSNs (groups) to validate our assumptions and found that this process successfully captured all SSN-related issues detectable through manual review. Out of 25 million records in the dataset, 46% contained a well-formed SSN (as indicated by SSN in Fig. 1), and among these, 75% (8.6 million records) were retained as having valid SSNs, yielding 121 509 positive (SSN-matching) pairs.

Auxiliary labeling sources—phone and email

Valid SSNs provide a high-quality labeling signal, but they are missing for a substantial fraction of records. To evaluate linkage quality on the cohort of records without SSN, we relied on additional identifiers that (i) have sufficient availability and (ii) are typically associated with a unique individual. Therefore, we use phone number and email address as auxiliary labeling sources. These fields are not as stable or universally unique as SSNs, yet they are used for identification in many practical applications. Due to the possibility of individuals having multiple emails or a phone number being used by more than one person, we treat phone and email as imperfect but informative evaluation signals rather than an absolute ground truth.

As with SSNs, we apply regex-based format validation followed by consistency checks over name–DOB values within phone/email groups, using the same tightening logic. After excluding records with valid SSNs, the phone cohort contains 3.7 million records with valid phone numbers, yielding 23 468 positive (phone-matching) pairs, and the email cohort contains 3.0 million records with valid email addresses, yielding 14 487 positive (email-matching) pairs.

Test sets and splitting

For the primary SSN-labeled cohort (described in “Primary labeling source—SSN” section), we split records into 20% testing, 70% training, and 10% validation. To prevent overlap and leakage, partitioning is performed at the block level: all records sharing the same blocking token value (“Blocking” section) are assigned to the same split. This ensures that test-time records are disjoint from—and not trivially similar to—training records under the chosen blocking scheme. For each of the Email and Phone cohorts (described in “Auxiliary labeling sources—phone and email” section), all of the records are used for testing, and these two auxiliary test sets, by definition (lacking SSN), are disjoint from the primary SSN-based train/validation/test sets.
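One way to realize a block-level split is to derive the split from a hash of the blocking token, so that all records in a block land in the same partition. The sketch below is an assumption about how this could be done; the paper states only the 70/10/20 proportions and the block-level constraint, and the function name and use of SHA-256 are illustrative.

```python
import hashlib


def assign_split(block_token, test=0.2, validation=0.1):
    """Map a blocking token deterministically to 'test', 'validation', or 'train'."""
    digest = hashlib.sha256(block_token.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < test:
        return "test"
    if u < test + validation:
        return "validation"
    return "train"
```

Because the assignment depends only on the token, two records with the same blocking token can never end up on opposite sides of the train/test boundary.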

Baselines

Simple token-based matching

A common operational linkage approach is to construct a token by concatenating selected PII fields (or their substrings/derived strings) and to match only on exact token equality. This approach is computationally efficient (tokens can be computed and matched via group-by operations) but is inflexible under noise, especially if tokens are hashed and only equality comparisons are possible.

We evaluate multiple token definitions that capture different precision–recall trade-offs. We consider the following tokens constructed from combinations of PII fields:

  • T1: first_name[0] + last_name + DOB + gender

  • T2: first_name + last_name + DOB + ZIP[:3]

  • T3: first_name + last_name + gender + DOB

  • T4: first_name[:3] + last_name + gender

These represent different balances between noise-robustness and discriminative power. For example, T1 uses only the first letter rather than the full first name, improving recall under name noise at the cost of precision, while T2 adds ZIP code, which can change over time and reduce recall. We assume tokens will be hashed; therefore, matching relies strictly on exact equality. The fields are normalized, prior to use, as described in “Dataset” section.
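The four token definitions can be sketched as below. The record layout (a dict of already normalized strings) and the choice of SHA-256 as the hash are assumptions for illustration; the text says only that tokens are assumed to be hashed.

```python
import hashlib


def build_tokens(rec):
    """Build hashed tokens T1-T4 from a normalized record (dict of strings)."""
    first, last = rec["first_name"], rec["last_name"]
    dob, gender, zip5 = rec["dob"], rec["gender"], rec["zip"]
    raw = {
        "T1": first[:1] + last + dob + gender,   # first initial only
        "T2": first + last + dob + zip5[:3],     # includes 3-digit ZIP prefix
        "T3": first + last + gender + dob,       # full first name
        "T4": first[:3] + last + gender,         # no DOB
    }
    # Hashing means only exact equality of tokens can be tested afterwards.
    return {
        name: hashlib.sha256(value.encode("utf-8")).hexdigest()
        for name, value in raw.items()
    }
```

Two records for "JONATHAN SMITH" and "JON SMITH" with the same DOB and gender but different ZIP codes would agree on T1 and T4 and disagree on T2 and T3, which illustrates the recall and precision trade-offs discussed above.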

XGBoost classifier

As a stronger baseline, we trained an XGBoost classifier [21] that predicts whether a record pair is a match. Each pair is represented by five numerical features indicating the similarity of each of the five primary PII fields (first name, last name, DOB, gender, ZIP). The classifier outputs a match probability, which is thresholded to obtain predicted links. Hyperparameters and the decision threshold are selected to maximize F1 on the validation set.

Neural network (BERT-based) classifier

We also evaluate a neural network-based pair-classification baseline using a BERT encoder. This model receives each record pair as a structured textual input (JSON encoding of PII fields). A classification head is applied to the output embedding of the [CLS] token, and it is trained end-to-end with binary cross-entropy loss. As with XGBoost, hyperparameters and classification threshold are selected based on validation-set F1.

Proposed method using embeddings—fuzzy ID

Our primary linkage method, Fuzzy ID, relies on learned vector representations (embeddings) of patient records to group records of the same individual together and assign the same Fuzzy ID to the group. The embedding model transforms a given record into a vector such that records from the same individual are close in Euclidean distance while records from different individuals are separated by a margin. This representation-learning formulation is scalable and defines a criterion for a desirable embedding model. The embedding model can then be continuously evaluated and improved with respect to this criterion. Figure 2 shows the schematic of the model architecture and how the embedding vector is derived for each record.

Figure 2. Embedding model architecture

At linkage time, first, records are divided into blocks based on a high-recall token and then records in each block are separately clustered according to their embedding distance. After clustering, each record is assigned a provisional Fuzzy identifier derived by concatenating the block identifier (i.e. the blocking token value) with the cluster index. This process is described in Algorithm 1, and is shown on the left side of Fig. 4, highlighted in the light blue box. Blocking and Clustering steps are explained in more detail in the following sections (“Blocking” and “Clustering”). “Training embedding model” section describes how the model is trained using contrastive learning.
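The block-then-cluster flow described above can be sketched as follows. This is not a transcription of Algorithm 1: the function names, the callback interfaces, and the underscore separator in the identifier are assumptions. The blocking token and the within-block clustering are passed in as functions so the sketch stays independent of any particular model.

```python
from collections import defaultdict


def generate_fuzzy_ids(records, block_token, cluster_block):
    """Assign a provisional Fuzzy ID (block token + cluster index) to each record.

    records: dict mapping record ID -> record
    block_token: function(record) -> blocking token string
    cluster_block: function(list of record IDs) -> list of clusters (lists of IDs)
    """
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[block_token(rec)].append(rec_id)

    fuzzy_ids = {}
    for token, rec_ids in blocks.items():
        # Each block is an independent subproblem and could run in parallel.
        for index, cluster in enumerate(cluster_block(rec_ids)):
            for rec_id in cluster:
                fuzzy_ids[rec_id] = f"{token}_{index}"
    return fuzzy_ids
```

In practice `cluster_block` would compute embeddings for the block and group records by embedding distance.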

Figure 4. Schematic illustration of the hybrid linkage

Blocking

We employ blocking, a standard strategy that partitions the dataset into disjoint subsets $B_1, B_2, \ldots, B_k$ (blocks) based on a blocking token constructed from PII fields. Candidate comparisons are restricted to records within the same block, under the assumption that, given a blocking token with sufficiently high recall, true matches almost always share the blocking token. This avoids a quadratic explosion of the number of comparisons in large datasets and yields independent subproblems that can be processed in parallel.
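To make the saving concrete, the following worked arithmetic compares the number of candidate pairs with and without blocking. It uses two figures reported elsewhere in the paper (the 25 million record sample and the maximum block size of 150); the per-block bound is a worst case, not a measured total.

```latex
\[
\underbrace{\binom{n}{2} = \frac{n(n-1)}{2} \approx 3.1 \times 10^{14}}_{\text{all pairs, } n = 2.5 \times 10^{7}}
\qquad \text{versus} \qquad
\underbrace{\sum_{i=1}^{k} \binom{|B_i|}{2}}_{\text{within-block pairs}},
\quad \text{where } \binom{|B_i|}{2} \le \binom{150}{2} = 11\,175 .
\]
```

Since most blocks are far smaller than the maximum, the within-block total grows roughly linearly with the number of blocks instead of quadratically with the number of records.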

We evaluated multiple blocking tokens constructed from different combinations of the five primary PII fields. Token T1 yielded the most favorable trade-off between high recall and manageable block sizes. By using only the first initial rather than the full first name, T1 is more robust to common first-name noise (misspellings, nicknames, truncation) than T3, while retaining discriminative power from DOB, last name, and gender. Token T2 achieved higher precision but lower recall due to its reliance on ZIP code, which can change over time. The T4 token omits DOB and exhibited poor precision. These trends are consistent with the match performance of the tokens reported in Table 1.

Table 1. Matching performance based on SSN.

Metric     FuzzyID  NN    XGB   T1     T2     T3     T4     T1–T4  Exact
Precision  0.94     0.87  0.87  0.81   0.97   0.95   0.004  0.97   0.98
Recall     0.93     0.90  0.98  0.95   0.69   0.82   0.85   0.69   0.65
F1         0.935    0.88  0.92  0.874  0.806  0.880  0.008  0.806  0.782

Note. NN: neural network classifier. XGB: XGBoost Classifier. Exact: Matching based on exact string match of PII features. T1–T4: Matching that requires all four tokens to match. Bold: Proposed method. Italics: Highest for the given metric.

Clustering

Within each block, we cluster record embeddings using a stringent distance criterion. The goal is to identify groups of records whose embeddings lie within a specified distance of one another in the embedding space. We select this distance threshold by examining the distribution of embedding distances for positive and negative pairs in the validation set, and choosing the value that maximizes the F1 score when embedding distance is used to predict the pair label. Figure 3 highlights the selected operating point on the distance distribution (a), the ROC curve (b), and the threshold–F1 curve (c).
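The threshold selection can be sketched as a simple sweep over candidate distances. This is a minimal, unoptimized illustration; the function name and the choice of the observed distances as candidate thresholds are assumptions.

```python
def best_f1_threshold(distances, labels):
    """Return (threshold, F1) maximizing F1 when 'distance <= threshold' predicts a match.

    distances: embedding distances for validation pairs
    labels: 1 for a true match, 0 otherwise (same order as distances)
    """
    total_pos = sum(labels)
    best_thr, best_f1 = None, -1.0
    for thr in sorted(set(distances)):
        tp = sum(1 for d, y in zip(distances, labels) if d <= thr and y == 1)
        fp = sum(1 for d, y in zip(distances, labels) if d <= thr and y == 0)
        if tp == 0:
            continue  # F1 is zero or undefined without true positives
        precision = tp / (tp + fp)
        recall = tp / total_pos
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1
```

The sweep is quadratic in the number of pairs; sorting the pairs once and updating counts incrementally would be preferable for large validation sets.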

Figure 3. (a) Distribution of embedding distance in positive and negative (within-block) pairs in the validation set. (b) ROC curve of the embedding distance as a predictor of label in the validation set. (c) The F1 score for the embedding distance as a predictor of label, plotted for different threshold values on the validation set.

Let G be the neighborhood graph whose vertices are records in the block and where an undirected edge connects two records if their Euclidean embedding distance is at most a threshold thr. Rather than taking connected components of G (single-linkage clustering), which can propagate errors transitively, we form clusters as fully connected subgraphs (cliques) of G. This guarantees that every pair of records assigned to the same cluster is within thr, improving cluster precision.

Computing an optimal clique cover is NP-hard in general. In our setting, blocks are small (at most 150 records in the evaluated dataset), and a greedy heuristic based on coloring the complement graph is sufficiently fast and produced stable results (see “Computational infrastructure” section for the overall runtime of the linkage process on the full data). Algorithm 2 describes the heuristic used to partition G into cliques. For substantially larger blocks, a more scalable but less strict alternative would be to use connected components (single-linkage) or other approximate clustering methods, at the cost of potentially introducing false positives through transitive closure.
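A greedy clique cover obtained by coloring the complement graph can be sketched as below. Algorithm 2 is not reproduced in this text, so the vertex ordering (most conflicts first) and the function name are assumptions. The key property is that two records receive the same color only if they are not adjacent in the complement graph, that is, only if they are within thr of each other, so every color class is a clique of G.

```python
import math


def greedy_clique_cover(embeddings, thr):
    """Partition record indices into cliques of the thr-neighborhood graph."""
    n = len(embeddings)
    # Complement-graph neighbors: records farther apart than thr (conflicts).
    conflicts = [
        {j for j in range(n)
         if j != i and math.dist(embeddings[i], embeddings[j]) > thr}
        for i in range(n)
    ]
    # Greedy coloring, visiting the most-conflicted records first (assumption).
    order = sorted(range(n), key=lambda i: len(conflicts[i]), reverse=True)
    color = {}
    for i in order:
        used = {color[j] for j in conflicts[i] if j in color}
        c = 0
        while c in used:
            c += 1
        color[i] = c
    clusters = {}
    for i, c in color.items():
        clusters.setdefault(c, []).append(i)
    return [sorted(members) for _, members in sorted(clusters.items())]
```

For points at 0.0, 0.5, 1.0, and 5.0 with thr = 0.6, connected components would merge the first three records through the chain 0.0, 0.5, 1.0, whereas the clique cover keeps 0.0 and 1.0 apart because their distance exceeds thr.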

Training embedding model

The goal of the embedding model is to map each patient record to a vector in a space in which Euclidean distance reflects match likelihood: embeddings of records from the same individual are close, while those from different individuals are separated by a margin. Each record is represented using five key PII attributes—first name, last name, DOB, gender, and ZIP code—serialized into a normalized JSON string. This string is encoded by a transformer encoder initialized from an uncased BERT checkpoint, and the final [CLS] representation is used as the record embedding. Two identical encoders with shared weights form a Siamese network (Fig. 2).
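The serialization step might look like the following sketch. The field names, key order, and compact separators are assumptions; the paper states only that the five fields are serialized into a normalized JSON string.

```python
import json

# Hypothetical field names and ordering for the five primary PII attributes.
FIELDS = ("first_name", "last_name", "dob", "gender", "zip")


def serialize_record(rec):
    """Serialize the five primary PII fields into a compact JSON string."""
    # Missing fields become JSON null so the string layout stays fixed.
    return json.dumps({f: rec.get(f) for f in FIELDS}, separators=(",", ":"))
```

A fixed key order keeps the input layout identical across records, so differences in the resulting string reflect differences in field values only.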

We optimize the Siamese network using the standard contrastive loss with margin (m), set to 0.8. Although hyper-parameter (m) is not usually optimized, it directly shapes the distance scale in the embedding space and therefore affects the threshold that best separates positive and negative pairs—an operating point (thr) that we use for clustering (“Clustering” section).

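For an input pair $(x_1, x_2)$ with label $y \in \{0, 1\}$ (where $y = 1$ indicates a match), the loss is:

$$L = y \cdot \lVert f(x_1) - f(x_2) \rVert_2^2 + (1 - y) \cdot \max\left(0,\; m - \lVert f(x_1) - f(x_2) \rVert_2^2\right), \tag{1}$$

where $f(\cdot)$ denotes the embedding function and $\lVert f(x_1) - f(x_2) \rVert_2$ is the Euclidean distance between embeddings.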
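As a numeric illustration of Eq. (1), the sketch below evaluates the loss for a single pair of embeddings in plain Python. It assumes the hinge is applied to the squared distance, which is how Eq. (1) is read here; a common alternative formulation squares the hinge on the unsquared distance instead. The function name is illustrative.

```python
def contrastive_loss(emb1, emb2, y, margin=0.8):
    """Contrastive loss for one pair; y = 1 for a match, 0 otherwise."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(emb1, emb2))
    pull = y * sq_dist                            # matches: penalize distance
    push = (1 - y) * max(0.0, margin - sq_dist)   # non-matches: penalize closeness
    return pull + push
```

For embeddings (0, 0) and (0.3, 0.4) the squared distance is 0.25, so the loss is 0.25 for a matching pair and 0.8 - 0.25 = 0.55 for a non-matching pair. A non-matching pair whose squared distance already exceeds the margin contributes zero loss.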

Training pairs are constructed from the SSN-labeled subset described in “Primary labeling source—SSN” section. To align training with inference-time behavior and to avoid trivial examples, we sample pairs only within blocks, exclude pairs with all five fields identical, and remove redundant symmetric permutations (pair order does not affect the distance). Training is performed using Adam optimizer with learning rate of 5e-5 and weight decay of 0.005. We used mixed precision (16-bit) and a maximum sequence length of 1024 tokens.

We dedicate 10% of the data to validation for tuning hyperparameters and use early stopping based on the AUC obtained from embedding distances. We select the checkpoint from the last epoch that improves the validation set AUC (epoch 7). Each epoch consists of all within-block nonidentical pairs in the training split, and each batch contains 32 pairs (set to the maximum that could fit in our GPU memory).

Results

Evaluation of fuzzy ID matching performance

Table 1 reports Precision, Recall, and F1 for Fuzzy ID and all baselines by SSN-based evaluation.

Overall, the results show that Fuzzy ID matching provides a favorable balance: it achieves higher recall than more restrictive tokens (e.g. T3) while improving precision over more permissive tokens (e.g. T1), consistent with the intended role of embeddings as a robust soft-matching layer.

To assess performance on records without SSNs, we also evaluate on two independent auxiliary cohorts labeled by validated phone and email, as described in “Auxiliary labeling sources—phone and email” section. Tables 2 and 3 show the result of evaluation based on email and phone respectively. As expected, across all linkage methods, Precision and Recall against email and phone labels are lower than against SSN labels because email and phone are inherently noisier identifiers: individuals may report different values over time, and phone numbers can be reassigned.

Table 2. Matching performance based on email.

Metric     FuzzyID  T1    T2    T3    T4
Precision  0.70     0.60  0.76  0.71  0.001
Recall     0.74     0.78  0.52  0.69  0.75
F1         0.72     0.68  0.62  0.70  0.003

Note. Bold: Proposed method. Italics: Highest for the given metric.

Table 3. Matching performance based on phone.

Metric     FuzzyID  T1    T2    T3    T4
Precision  0.63     0.53  0.74  0.64  0.001
Recall     0.71     0.75  0.56  0.68  0.72
F1         0.67     0.62  0.64  0.66  0.001

Note. Bold: Proposed method. Italics: Highest for the given metric.

The email and phone cohorts are not used for training or linkage decisions and are disjoint from the SSN cohort. They therefore provide an additional view of performance that mitigates bias introduced by SSN availability, and they show that Fuzzy ID retains its F1 advantage over the token baselines when assessed with non-SSN labels.

Analysis of the embeddings

The success of embedding-based linkage with Fuzzy IDs depends critically on the quality of the learned embeddings, specifically on how well positive and negative pairs are separated according to their embedding distances. Figure 3a shows the distribution of Euclidean distances for positive and negative (within-block) pairs in the validation set and reveals a clear separation, indicating that the embedding space generalizes and provides a meaningful measure of match likelihood on held-out data.

Figure 3b provides a complementary view of this discriminative ability via the receiver operating characteristic (ROC) curve. The resulting area under the curve (AUC) indicates strong predictive performance in distinguishing matched from unmatched record pairs using distance alone. Finally, Figure 3c illustrates how we select the clustering threshold (thr) by maximizing the F1 score on the validation set; this threshold is then used in the subsequent clustering and linkage procedure (“Clustering” section).

It is important to note that the negative pairs shown in Fig. 3 belong to the same block and therefore already share substantial PII-based similarity, despite corresponding to distinct individuals. As a result, these negatives are more challenging to predict than randomly sampled global negatives, which explains the lower F1 score observed in panel (c) compared to the results reported in Table 1.

Error analysis

To better understand failure modes, we manually inspected and annotated a random sample of 100 false positives and 100 false negatives from Fuzzy ID predictions and grouped them into categories summarized in Tables 4 and 5.

Table 4. False-positive categories (different SSNs but same fuzzy ID).

Category  Count  Description
CN        73     Same name and different ZIP code (usually common names, e.g. John Smith)
SF        13     Similar but different first names (e.g. Audrey and Avery) and different ZIP code
CN-SZ3    9      Same names, and same ZIP3 (usually common names)
CN-SZ5    4      Same names, and same ZIP5 (usually common names)
SF-SZ5    1      Similar but different first names and same ZIP5

Table 5. False-negative categories (same SSN but different fuzzy ID).

Category  Count  Description
DL-MAR    20     Different last name (probably because of last name change after marriage)
DZ        13     ZIP is different; all other fields are identical
NO-GEN    11     Gender is missing (null)
DF-MID    11     Middle name is attached to first name in one of the records
NO-ZIP    9      ZIP is missing (null)
DL-MID    9      Middle name is attached to last name in one of the records
DF-TYPO   8      Typo in first name in one of the records
DL-TYPO   6      Typo in last name in one of the records
DF-FORM   4      Different form of the first name used in one record (Abigail vs. Abby)
DG        3      Different gender without name change (probably data-entry error)
DB-TYPO   3      One digit difference in DOB (probably due to typo)
DL-ERR    2      Placeholder or wrong value in last name in one of the records
DF-NC     1      Different first name (due to name change or middle name entered instead of first)

False positives were dominated by cases where the five main PII fields matched except for ZIP code. These errors frequently occurred among individuals with very common names (e.g. “John Smith”), where even exact agreement on DOB can occur across different people. In such cases the model appears to infer that the records belong to the same individual who has moved, producing a spurious link. We also observed variants where ZIP codes were similar (e.g. same first three digits) or identical, suggesting the model may place nontrivial weight on ZIP for the embedding even when the remaining evidence is ambiguous.

Another false-positive category involved names that are character-wise similar but represent different individuals (e.g. “Avery” versus “Audrey”). These are plausibly attributable to the embedding model’s learned similarity structure and may be mitigated through additional training data, improved sampling of hard negatives, or modifications to the loss/objective.

False negatives are cases with the same SSN that resulted in different Fuzzy IDs. A prominent category involved last-name changes consistent with marriage (‘DL-MAR’). Many of these failures are expected under our blocking setup because the blocking token (T1) includes last name and DOB; records that differ on these attributes will fall into different blocks and will not be compared during embedding-based clustering. While SSN can confirm such matches during evaluation, it may be difficult for any PII-based system to confidently infer a match when names differ substantially. Importantly, the deterministic combine postprocess can recover many such links when a high-confidence identifier like SSN is available.

We also observed false negatives arising from missing values (e.g. missing ZIP or gender). These suggest an opportunity to improve robustness to missingness so that the absence of a single field does not prevent clustering when other evidence strongly supports linkage. Other categories (e.g. typographical errors in first/last name, single-digit differences in DOB) represent cases that an improved matching system should capture; these categories directly inform future model and rule improvements.

Postprocess—overriding linkage using deterministic rules

Embedding-based clustering provides a strong general-purpose linkage mechanism. However, in many operational settings, high-confidence identifiers are available for subsets of records. When such fields are present, deterministic rules can enforce constraints that embeddings alone may occasionally violate. Here we used a valid SSN, but the same process can be used to enforce rules based on other identifiers.

We define two forms of deterministic overrides that aim to preserve the existing clusters as much as possible, intervening only when there is a clear conflict:

Split phase (enforce separation)

If a cluster contains conflicting values of a high-confidence split key (e.g. different valid SSNs), those records are required to reside in separate clusters. The presence of conflicting split-key values is treated as definitive evidence that the records represent different individuals. Algorithm 3 describes this procedure.
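Algorithm 3 is not reproduced in this text, so the following is only a minimal sketch of the split idea under stated assumptions: the function name, the dot-suffixed sub-cluster identifiers, and the handling of records without a split key (kept with the first key group) are all hypothetical choices, not the authors' procedure.

```python
from collections import defaultdict


def split_clusters(fuzzy_ids, split_keys):
    """Separate records in the same cluster that carry conflicting split keys.

    fuzzy_ids: dict record ID -> provisional Fuzzy ID
    split_keys: dict record ID -> validated key (e.g. SSN) or None
    """
    members = defaultdict(list)
    for rec_id, fid in fuzzy_ids.items():
        members[fid].append(rec_id)

    result = {}
    for fid, rec_ids in members.items():
        distinct = sorted({split_keys[r] for r in rec_ids if split_keys.get(r)})
        if len(distinct) <= 1:
            # No conflict: leave the cluster untouched.
            for r in rec_ids:
                result[r] = fid
            continue
        index = {key: i for i, key in enumerate(distinct)}
        for r in rec_ids:
            key = split_keys.get(r)
            # Assumption: records without a key stay with sub-cluster 0.
            result[r] = f"{fid}.{index[key] if key else 0}"
    return result
```

Clusters without conflicting keys pass through unchanged, which matches the stated aim of intervening only when there is a clear conflict.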

Combine phase (enforce merging)

If records share a high-confidence combine key (e.g. the same SSN) and meet additional consistency criteria (e.g. compatible DOB and first name), they are merged into the same cluster. This treats the combine key, together with the consistency checks, as a sufficient condition for linkage. Algorithm 4 describes the combine operation.
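Algorithm 4 is likewise not reproduced in this text. The sketch below illustrates one possible combine step using a union-find structure over cluster identifiers. The function name, the `compatible` callback standing in for the consistency criteria, and the choice to merge whole clusters (not individual records) are assumptions.

```python
def combine_clusters(fuzzy_ids, combine_keys, compatible):
    """Merge clusters whose records share a combine key and pass a consistency check.

    fuzzy_ids: dict record ID -> Fuzzy ID (after the split phase)
    combine_keys: dict record ID -> validated key (e.g. SSN); missing or None is ignored
    compatible: function(record ID, record ID) -> bool (e.g. DOB and first-name check)
    """
    parent = {fid: fid for fid in fuzzy_ids.values()}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}
    for rec_id in sorted(fuzzy_ids):
        key = combine_keys.get(rec_id)
        if not key:
            continue
        anchor = first_seen.setdefault(key, rec_id)
        if anchor != rec_id and compatible(anchor, rec_id):
            root_a, root_b = find(fuzzy_ids[anchor]), find(fuzzy_ids[rec_id])
            if root_a != root_b:
                parent[max(root_a, root_b)] = min(root_a, root_b)
    return {rec_id: find(fid) for rec_id, fid in fuzzy_ids.items()}
```

In a configuration like the one illustrated in Fig. 4, two records in different blocks that share an SSN end up with the same Fuzzy ID, while clusters without a shared key are left as they were.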

Figure 4 provides an end-to-end illustration of the full hybrid linkage pipeline. In this illustration, five records are grouped into two token blocks. Within Block 1, there is a single embedding cluster, which is then split into two clusters due to an SSN conflict. Within Block 2, three clusters are formed, with no SSN conflicts. Finally, Records 2 and 3 are merged across blocks and clusters to share the same Fuzzy ID, based on the same SSN.

Computational infrastructure

We deployed the hybrid linkage algorithm on 200 million records from the full database. Embedding inference was performed on a 10-node cluster, each node equipped with an NVIDIA T4 GPU, and completed in approximately 300 h. The subsequent clustering, splitting, and combining steps were executed on 10–20 Azure Standard_D32ds_v5 nodes (32 cores, 75 GB RAM each) and completed in approximately 2 h. Model training was conducted on a single node with 800 GB of RAM and four NVIDIA A100 80 GB GPUs, using PyTorch’s Distributed Data Parallel (DDP). Training took approximately 3 h (about 20 min per epoch).

Discussion

Benefits of the hybrid approach

The proposed framework combines two complementary components: embedding-based Fuzzy ID generation and deterministic postprocessing. Both stages are highly parallelizable, enabling a scalable end-to-end solution for generating de-duplicated identifiers at a very large scale. The components are also relatively modular: the embedding model can be improved independently (e.g. via better contrastive training, additional fields, or improved handling of missingness), while deterministic rules can be expanded or refined to enforce domain-specific constraints and recover high-confidence linkages when strong identifiers exist.

The embedding stage performs the “heavy lifting” for difficult cases where exact equality is too brittle (misspellings, formatting variation, and partial missingness). The error analysis highlights specific weaknesses—such as over-linking among common-name collisions and sensitivity to ZIP similarity—that can guide future model refinement. Meanwhile, deterministic rules provide reliable safeguards and overrides when high-confidence identifiers are present, and can serve as practical stopgaps during iterative development.

Limitations and future work

Several limitations suggest clear future directions. First, additional tuning of the embedding model (e.g. hard-negative mining, calibrated distance thresholds, or alternative contrastive objectives) may improve discrimination, especially among common-name collisions. Second, quantifying the incremental benefit of deterministic rules requires evaluation beyond SSN-based ground truth; additional annotation strategies or alternative validation designs would strengthen this component’s measured impact. Third, incorporating additional fields and improving robustness to missing values are likely to improve recall without sacrificing precision.

We also note that privacy preservation was not the primary focus of this work. The hybrid algorithm was used for internal deduplication within the Truveta environment, with all experiments conducted within a single secure zone and without transfer of PII/PHI outside controlled boundaries. Embeddings are not necessarily equivalent to hashed tokens in privacy properties; investigating techniques that enable privacy-preserving linkage compatible with learned representations is an important future direction and will be addressed in forthcoming work.

Appendix A. Algorithms

Algorithm 1: Generate Fuzzy ID

Algorithm 2: Cluster Neighborhood Graph Using a Greedy Clique Cover Algorithm

Algorithm 3: Postprocess-Split

Algorithm 4: Postprocess-Combine

Contributor Information

Cheng Cao, Truveta Inc., 1745 114th Ave SE, Bellevue, WA 98004, United States.

Jay Pillai, Truveta Inc., 1745 114th Ave SE, Bellevue, WA 98004, United States.

Sara Daraei, Truveta Inc., 1745 114th Ave SE, Bellevue, WA 98004, United States.

Sina Ghadermarzi, Truveta Inc., 1745 114th Ave SE, Bellevue, WA 98004, United States.

Author contributions

Cheng Cao (Conceptualization, Formal Analysis, Investigation, Methodology, Software, Writing – Review & Editing), Jay Pillai (Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Project Administration, Supervision, Writing – Review & Editing), Sara Daraei (Data Curation, Investigation, Methodology, Software, Writing – Review & Editing), and Sina Ghadermarzi (Data Curation, Investigation, Methodology, Software, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing).

Conflicts of interest

None declared.

Funding

None declared.

Data availability

The data underlying this article cannot be shared publicly due to privacy reasons.

References

  • 1. Shah K, Patt D, Mullangi S.  Use of tokens to unlock greater data sharing in health care. JAMA  2023;330:2333–4. [DOI] [PubMed] [Google Scholar]
  • 2. Afshar M, Oguss M, Callaci TA  et al.  Creation of a data commons for substance misuse related health research through privacy-preserving patient record linkage between hospitals and state agencies. JAMIA Open  2023;6:ooad092. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Bromley J, Guyon I, LeCun Y  et al.  Signature verification using a “siamese” time delay neural network. Adv Neural Inform Process Syst  1993;6:737–44. [Google Scholar]
  • 4. Castro VM, Gainer V, Wattanasin N  et al.  The Mass General Brigham Biobank Portal: an i2b2-based data repository linking disparate and high-dimensional patient data to support multimodal analytics. J Am Med Inform Assoc  2022;29:643–51. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Kho AN, Cashy JP, Jackson KL  et al.  Design and implementation of a privacy preserving electronic health record linkage tool in Chicago. J Am Med Inform Assoc  2015;22:1072–80. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Kiernan D, Carton T, Toh S  et al.  Establishing a framework for privacy-preserving record linkage among electronic health record and administrative claims databases within PCORnet®, the National Patient-Centered Clinical Research Network. BMC Res Notes  2022;15:337. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Marsolo K, Kiernan D, Toh S  et al.  Assessing the impact of privacy-preserving record linkage on record overlap and patient demographic and clinical characteristics in PCORnet®, the National Patient-Centered Clinical Research Network. J Am Med Inform Assoc  2023;30:447–55. March [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Grannis SJ, Xu H, Vest JR  et al.  Evaluating the effect of data standardization and validation on patient matching accuracy. J Am Med Inform Assoc  2019;26:447–56. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Joffe E, Byrne MJ, Reeder P  et al.  A benchmark comparison of deterministic and probabilistic methods for defining manual review datasets in duplicate records reconciliation. J Am Med Inform Assoc  2014;21:97–104. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Tromp M, Ravelli AC, Bonsel GJ  et al.  Results from simulated data sets: probabilistic record linkage outperforms deterministic record linkage. J Clin Epidemiol  2011;64:565–72. [DOI] [PubMed] [Google Scholar]
  • 11. Ong TC, Duca LM, Kahn MG, Crume TL.  A hybrid approach to record linkage using a combination of deterministic and probabilistic methodology. J Am Med Inform Assoc  2020;27:505–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Christen V, Häntschel T, Christen P, Rahm E.  Privacy-preserving record linkage using autoencoders. Int J Data Sci Anal  2023;15:347–57. [Google Scholar]
  • 13. Gkoulalas-Divanis A, Vatsalan D, Karapiperis D, Kantarcioglu M.  Modern privacy-preserving record linkage techniques: an overview. IEEE Transinformforensic Secur  2021;16:4966–87. [Google Scholar]
  • 14. Lazrig I, Ong TC, Ray I  et al.  Privacy preserving probabilistic record linkage without trusted third party. In 2018 16th Annual Conference on Privacy, Security and Trust (PST), pp. 1–10, Belfast: IEEE, 2018. [Google Scholar]
  • 15. Karapiperis D, Gkoulalas-Divanis A, Verykios VS.  FEDERAL: a framework for distance-aware privacy-preserving record linkage. IEEE Trans Knowl Data Eng  2018;30:292–304. [Google Scholar]
  • 16. Schmidlin K, Clough-Gorr KM, Spoerri A.; for the SNC study group. Privacy preserving probabilistic record linkage (P3RL): a novel method for linking existing health-related data and maintaining participant confidentiality. BMC Med Res Methodol  2015;15:46. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Ranbaduge T, Vatsalan D, Ding M.  Privacy-preserving deep learning based record linkage. IEEE Trans Knowl Data Eng  2024;36:6839–50. [Google Scholar]
  • 18. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–94, New York, NY, USA: ACM, 2016.
  • 19.Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–86, Minneapolis, Minnesota: Association for Computational Linguistics, 2019. doi: 10.18653/v1/N19-1423. [DOI] [Google Scholar]
  • 20. Bian J, Loiacono A, Sura A  et al.  Implementing a hash-based privacy-preserving record linkage tool in the OneFlorida clinical research network. JAMIA Open  2019;2:562–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Wilson DR. Beyond probabilistic record linkage: Using neural networks and complex features to improve genealogical record linkage. In The 2011 International Joint Conference on Neural Networks, pp. 9–14, 2011. doi:10.1109/IJCNN.2011.6033192.


Articles from Biology Methods & Protocols are provided here courtesy of Oxford University Press