The development of a mobile app‐focused deduplication strategy for the Apple Heart Study that informs recommendations for future digital trials

Ariadna Garcia; Justin Lee; Vidhya Balasubramanian; Rebecca Gardner; Santosh E Gummidipundi; Grace Hung; Todd Ferris; Lauren Cheung; Sumbul Desai; Christopher B Granger; Mellanie True Hills; Peter Kowey; Divya Nag; John S Rumsfeld; Andrea M Russo; Jeffrey W Stein; Nisha Talati; David Tsay; Kenneth W Mahaffey; Marco V Perez; Mintu P Turakhia; Haley Hedlin; Manisha Desai; the Apple Heart Study Investigators

doi:10.1002/sta4.470

2022 Nov 18;11(1):e470. doi: 10.1002/sta4.470

PMCID: PMC9787886 PMID: 36589778

Abstract

An app‐based clinical trial enrolment process can contribute to duplicated records, carrying data management implications. Our objective was to identify duplicated records in real time in the Apple Heart Study (AHS). We leveraged personally identifiable information (PII) to develop a dissimilarity score (DS) using the Damerau–Levenshtein distance. For computational efficiency, we focused on four types of records at the highest risk of duplication. We used the receiver operating characteristic (ROC) curve and resampling methods to derive and validate a decision rule to classify duplicated records. We identified 16,398 (4%) duplicated participants, resulting in 419,297 unique participants out of a total of 438,435 possible. Our decision rule yielded a high positive predictive value (96%) with negligible impact on the trial's original findings. Our findings provide principled solutions for future digital trials. When establishing deduplication procedures for digital trials, we recommend collecting device identifiers in addition to participant identifiers; collecting and ensuring secure access to PII; conducting a pilot study to identify reasons for duplicated records; establishing an initial deduplication algorithm that can be refined; creating a data quality plan that informs refinement; and embedding the initial deduplication algorithm in the enrolment platform to ensure unique enrolment and linkage to previous records.

Keywords: Damerau‐Levenshtein distance, deduplication, digital trial, entity resolution, pragmatic trial, resampling methods

1. INTRODUCTION

Over the past few years, there has been increased interest in conducting pragmatic clinical trials (PCTs). PCTs are cost effective; they can reduce the burden on participants and the study team by providing greater flexibility on where, when and how the data are collected. The digital clinical trial (DCT) is a special case of the PCT that further allows investigators to incorporate mobile devices or digital tools to facilitate the conduct of a trial and/or to evaluate the use of digital platforms in an intervention in a real‐life setting (Inan et al., 2020). Mobile apps and wearable devices enable DCTs to provide opportunities to engage a larger number of participants with many more data elements collected per individual. Although it may be advantageous to have many more measurements of a given type and to have a diversity of types of measurements to address the study objectives, management of the data may become challenging and may pose threats to the integrity of the trial.

For example, studies that rely on apps for enrolment are at an increased risk of duplicated data, which can have serious implications for downstream analyses and for study conduct. In a systematic review conducted by Zhang et al. (2018), the authors state that ‘scholars have only started to employ apps in field experiments in the last 4 years’ and ‘most studies only used apps as an experimental treatment instead of an experimental platform’. Though duplication has been studied extensively and solutions have been proposed in other contexts (Chaudhuri et al., 2003; Gravano et al., 2001), the challenges faced when conducting PCTs and mitigation solutions have yet to be fully characterized. Further, the specific issue of duplication has not been well described in the DCT literature. For example, the Clinical Trials Transformation Initiative (2022) provides an excellent set of guidelines to address various issues in the conduct, design and analysis of clinical trials that rely on digital tools. This particular issue, however, is not featured as one for investigators to consider in the design and analysis. Thus, currently, there is a gap in knowledge of how to plan for and mitigate duplication issues when conducting DCTs.

The Apple Heart Study (AHS) was a prospective, single‐arm, site‐less, pragmatic study (Turakhia et al., 2019) conducted between 11/29/2017 and 02/21/2019, which faced and overcame some of these data management challenges. More specifically, the goal of the AHS was to evaluate the ability of an irregular pulse notification on the Apple Watch to identify signals consistent with atrial fibrillation. The study relied on an app on the participant's phone to enroll participants, collect data and monitor the heart rhythm for abnormalities using pulse data recorded on their Apple Watch.

The AHS was a collaborative project between Apple Inc. and Stanford University. Our study team consisted of experts from both organizations that covered diverse disciplines including cardiovascular medicine, digital health, software engineering, biostatistics, epidemiology, clinical trials, data management, clinical informatics and clinical operations. Clinical operations were led by the Stanford Center for Clinical Research. Design, data management and data analysis were led by the Stanford Quantitative Sciences Unit. Data capture, security and housing were led by the Stanford Technology and Digital Solutions Team, and the Stanford Center for Digital Health provided expertise in the area of digital health. Key external collaborators included AmericanWell, a TeleHealth provider, and BioTelemetry, an ambulatory electrocardiogram provider.

Governance of the AHS included an Executive Committee of strategic investigators that drove day‐to‐day decisions. The Executive Committee was advised by a larger Steering Committee of representatives from Stanford University, Apple Inc. and five external members including a patient advocate with atrial fibrillation. Additionally, a Data and Safety Monitoring Board was established to advise the team on trial integrity, conduct, safety and dissemination. Using team science principles, the study team met regularly (weekly meetings with the larger study team and biweekly meetings with smaller sub‐teams) to troubleshoot issues on study design, launch, enrolment, data management, analysis and interpretation of findings.

Through data monitoring procedures, the study team observed early in the recruitment phase that some participants were represented under multiple participant identifiers or IDs—an occurrence that we refer to as duplication of IDs—making it challenging to identify the unique number of participants enrolled and to link longitudinal data within an individual. Both tasks were critical to accomplish study goals. Behaviours of the participants and/or the devices may have contributed to duplication of records. Some examples include participants deleting, reinstalling and re‐enrolling through the app; app crashes or software updates that could cause loss of data; switching mobile devices; or sharing iCloud accounts with other individuals.

Entity resolution (ER) is the task of identifying and grouping various manifestations of the same real‐world object within one data source or between different sources (Benjelloun et al., 2009; Köpcke et al., 2010). This area of research is also referred to as deduplication, record linkage, object matching or linkage discovery. ER has several applications particularly in web searches, health data, financial transactions, law enforcement and more.

In this paper, we describe an issue with ER specifically as it arose in the AHS. We describe our approach to mitigating this problem so that we could achieve study goals. Lessons learned are translated into recommendations for future DCTs.

2. METHODS

2.1. Background

When designing the AHS (Turakhia et al., 2019), we established a system to verify unique identifiers that addressed anticipated scenarios that could lead to duplication issues. Our system involved having an ID for the participant (participant ID or PID) and an additional ID for the device (device ID or DID) so that data generated from a device would have both a corresponding PID and DID linked to a single participant. Data not generated from the device (e.g. study visit) would only have the PID. In the following, we describe in greater detail these two types of identifiers.

2.2. DID and PID

When a participant downloaded and opened the app, encrypted data were securely sent to Apple servers through an application program interface (API). The data included a system identifier referred to as the DID that allowed Apple servers and participant devices to exchange data. In addition, when a participant enrolled in the study through the app, they were assigned a PID generated by the app. Upon assignment of a PID to a participant, a mapping was established between the DID and PID.

We define duplication as the phenomenon of having multiple de‐linked records that appeared to belong to the same unique individual. This can occur in multiple ways. A new DID could be assigned to the same device during the study for a variety of reasons including software updates, device reboot and device updates. Similarly, a new PID may be assigned to a participant who already has a PID for a variety of reasons (Figure 1). Duplication (multiple PIDs corresponding to the same individual) occurs when the PID, the DID or both are reassigned to the same person or device. The data then become orphaned, or no longer linked, so that pieces of data from the same participant are mistakenly thought to belong to separate individuals. To solve these issues, we used ER techniques and developed an algorithm to establish which PIDs were associated with the same individual. The algorithm was designed to link data from multiple sources back together and create a true unique identifier per person, enabling longitudinal views of data within a participant over time. The algorithm consists of multiple steps, described in the sections that follow and depicted in Figure 2.

FIGURE 1.

Common scenarios that potentially led to orphaned data

FIGURE 2.

Step‐by‐step diagram of deduplication algorithm. DID, Device ID; DS, Dissimilarity score; PID, Participant's ID; PPV, positive predictive value

2.3. Key inputs to the algorithm

The algorithm leveraged data that contained participant identifiable information (PII), obtained by asking participants questions related to their identity and demographic information after providing consent. These data were encrypted, not accessible to the sponsor (Apple Inc.), and only available to an unblinded sub‐team of the larger AHS data science team at Stanford University.

The algorithm specifically used the combination of seven participant‐level identifiers: email, date of birth, first name, last name, phone number, state and consent date in order to assess the similarity of two records with different PIDs. All identifiers provided important information to our algorithm with varying levels of contribution; email, date of birth, first name, last name and phone allowed us to identify the individual and were referred to as strong identifiers, whereas the others were considered auxiliary to the strong identifiers. For example, the participant's state was useful in scenarios where individuals shared their first and last names and birthdays but differed by state, casting doubt on the records being linked. Consent date was also helpful. For example, suppose an individual enrolled and then subsequently updated the app and re‐enrolled. In this case two different PIDs would be attached to the respective data collected at the separate enrolments. Common PIIs with distinct consent dates close in time may reflect this sequence of events and increase the probability that the records belong to the same individual. Alternatively, common PIIs with the same consent date may indicate different family members joining the study simultaneously.

2.4. Dissimilarity score (DS)

There are numerous algorithms that have been developed to calculate the distance between two string values (Cohen et al., 2003). We used the optimal string‐alignment distance (osa) that relies on the Damerau–Levenshtein distance and returns the string distance taking into account deletion, insertion, substitution and transposition under the condition that no substring is edited more than once (Levenshtein, 1966). We implemented this using the stringdist R package (Van der Loo et al., 2014).

To calculate the dissimilarity score (DS) between two records, we first pre‐processed the data by standardizing strings, removing special characters including empty spaces and transforming all text to lower case. We then calculated the approximate string distance for each of the participant‐level identifiers and summed the individual distances to generate a single composite score that represented the dissimilarity of the two records. For example, the email from one record was compared with the email from a second record, and the string distance was quantified and recorded. The same process was repeated for each of the seven participant‐level identifiers. A DS of zero indicated that the two records being compared were identical on all seven identifiers, whereas nonzero DSs reflected the degree of dissimilarity, that is, higher scores indicated greater dissimilarity.
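The scoring step can be illustrated with a minimal Python sketch. The authors used the stringdist R package, so this is not their code. It is a plain re‐implementation of the optimal string alignment distance under stated assumptions: the field names in `FIELDS` are hypothetical, and "special characters" is taken to mean anything other than ASCII letters and digits, which the text does not specify.

```python
import re

# Hypothetical field names for the seven identifiers listed in Section 2.3.
FIELDS = ("email", "dob", "first_name", "last_name",
          "phone", "state", "consent_date")


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def normalize(value: str) -> str:
    """Lower-case and drop every character that is not a-z or 0-9 (assumption)."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def dissimilarity_score(rec1: dict, rec2: dict) -> int:
    """Sum of per-field OSA distances over the seven identifiers."""
    return sum(osa_distance(normalize(rec1[f]), normalize(rec2[f]))
               for f in FIELDS)
```

Under these assumptions, two records that differ only by "Williams" versus "William" and by one digit in the year of birth receive a score of 2. Each field contributes equally to the sum, which matches the unweighted composite described above.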

2.5. Algorithm execution and data sub‐setting

One downside of performing pairwise string comparison for a high volume of records is that it is computationally expensive. Assessing distance among all pairs in the full data would have required over 96 billion pairwise comparisons. To mitigate this challenge, we identified four subsets of data where true matches were most likely to occur. The first subset was composed of record pairs with multiple PIDs associated with the same DID (Subset 1: multiple PIDs mapped to the same DID). The second and third subsets were record pairs with multiple PIDs associated with the same first and last name, respectively (Subset 2: multiple PIDs mapped to the same first name; Subset 3: multiple PIDs mapped to the same last name), and the fourth was formed by participants where multiple PIDs were associated with the same date of birth (Subset 4: multiple PIDs mapped to the same date of birth). The DS was calculated for each pairing in each of these four subsets.
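The saving from sub‐setting can be sketched as follows. The snippet first reproduces the all‐pairs count for the 438,435 PIDs reported in the study, then generates candidate pairs only among records that share a blocking value. The record layout (`pid`, `did`, `first_name`, `last_name`, `dob`) and the function name are assumptions made for illustration. Pooling the four subsets into one set of pairs is a simplification, since the study scored each subset separately.

```python
from collections import defaultdict
from itertools import combinations
from math import comb

# Number of unordered pairs among all 438,435 PIDs (roughly 96.1 billion).
ALL_PAIRS = comb(438_435, 2)

# Hypothetical column names for the four blocking keys of Subsets 1 to 4.
BLOCKING_KEYS = ("did", "first_name", "last_name", "dob")


def candidate_pairs(records, keys=BLOCKING_KEYS):
    """Return PID pairs that share a value on at least one blocking key."""
    pairs = set()
    for key in keys:
        blocks = defaultdict(set)
        for rec in records:
            blocks[rec[key]].add(rec["pid"])
        for pids in blocks.values():
            # Only blocks with two or more PIDs yield comparisons.
            pairs.update(combinations(sorted(pids), 2))
    return pairs
```

Only the pairs returned by `candidate_pairs` would then be passed to the DS calculation. Records that share none of the four values are never compared.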

2.6. Threshold identification

Empirical assessments of comparisons demonstrated that DSs ≥ 25 clearly corresponded to records from distinct participants, whereas comparisons with DSs < 25 more likely reflected a mixture of the same and different individuals. Thus, we restricted our focus to deriving a rule for records with DSs < 25.

For this purpose, we created an Annotated Data Set by randomly sampling 2% of pairwise comparisons with DSs < 25 and annotating each pair using a single reviewer to indicate whether the match was false (denoted as 0) or true (denoted as 1).

To identify the optimal threshold, we drew 10,000 bootstrap samples from the Annotated Data Set. Within each bootstrap sample, a random 90% of the sample was allocated to a training data set, in which the optimal cut‐point was calculated using the Youden index (Youden, 1950), which minimizes misclassification. This cut‐point was then applied to the remaining 10% of the bootstrap sample to estimate the accuracy for this specific cut‐point. These results were summarized by evaluating the mean, median, mode and spread, and we chose the final cut‐point as the mean over the 10,000 samples. These analyses were carried out with the R package ‘OptimalCutpoints’ (Lopez‐Raton & Rodriguez‐Alvarez, 2021), which defines the optimal cut‐point empirically as the point that maximizes the Youden function (the difference between the true and false positive rates over all possible cut‐point values).
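The resampling procedure can be sketched in Python using only the standard library. This is not the OptimalCutpoints implementation. It assumes that a pair is classified as a match when its DS is at or below the cut‐point, that annotations are coded 1 for a true match and 0 otherwise, and that the 90/10 split is taken within each bootstrap sample. All function names are hypothetical.

```python
import random
from statistics import mean


def youden_cutpoint(scores, labels):
    """Cut-point c maximizing J = TPR - FPR, where 'DS <= c' predicts a match."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_c, best_j = None, float("-inf")
    for c in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s <= c and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s <= c and y == 0)
        j = tp / n_pos - fp / n_neg
        if j > best_j:  # ties keep the smallest cut-point
            best_c, best_j = c, j
    return best_c


def accuracy(scores, labels, c):
    """Fraction of pairs whose predicted class (DS <= c) equals the annotation."""
    return mean(1.0 if (s <= c) == (y == 1) else 0.0
                for s, y in zip(scores, labels))


def resampled_cutpoint(scores, labels, n_rep=1000, train_frac=0.9, seed=0):
    """Mean cut-point and mean held-out accuracy over bootstrap replicates."""
    rng = random.Random(seed)
    n = len(scores)
    k = int(train_frac * n)
    cuts, accs = [], []
    for _ in range(n_rep):
        # Draw n indices with replacement; the draws are already in random
        # order, so the first k form a random 90% training split.
        idx = [rng.randrange(n) for _ in range(n)]
        train, test = idx[:k], idx[k:]
        train_y = [labels[i] for i in train]
        if not test or len(set(train_y)) < 2:
            continue  # skip replicates without both classes in training
        c = youden_cutpoint([scores[i] for i in train], train_y)
        cuts.append(c)
        accs.append(accuracy([scores[i] for i in test],
                             [labels[i] for i in test], c))
    return mean(cuts), mean(accs)
```

Calling `resampled_cutpoint(scores, labels, n_rep=10_000)` would mirror the number of replicates reported above. The default is kept smaller here because the empirical search is quadratic in the number of annotated pairs.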

2.7. Threshold validation

To assess the positive predictive value of the threshold in identifying true matches, we randomly resampled 5% of the original data set (excluding the 2% sample previously used) from records that had DSs under the identified threshold for annotation performed by four reviewers and called this new data set the Annotated Validation Data Set. There was no formal training for the reviewers or protocol for annotation, and subjective judgement was used. The final annotated value was taken to be the one defined by the majority of the four reviewers. In cases where a majority could not be achieved, there was collaborative discussion to arrive at consensus. The positive predictive value was estimated as the proportion of the records in the Annotated Validation Data Set that were determined to be true duplicates, as defined by the final annotated value.
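As a small illustration of this validation step, the sketch below derives a final label from reviewer votes and then estimates the positive predictive value. The function names are hypothetical, and the consensus discussion for ties is represented only by returning `None`, which a caller would have to resolve by hand.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by most reviewers, or None when there is a tie."""
    (top, n_top), *rest = Counter(votes).most_common()
    if rest and rest[0][1] == n_top:
        return None  # e.g. a 2-2 split, to be settled by discussion
    return top


def positive_predictive_value(final_labels):
    """Share of flagged pairs (all below the threshold) confirmed as duplicates.

    final_labels holds one value per pair: 1 for a true match, 0 otherwise.
    """
    return sum(final_labels) / len(final_labels)
```

Because every pair in the validation sample was flagged by the decision rule, the denominator is simply the number of annotated pairs.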

2.8. Manual refinement

Minimizing match errors was critical to our study, so after completing the algorithm execution, we performed a manual verification process where we manually examined pairs that were slightly above or below the threshold. Erroneous instances were corrected manually. Once we completed this manual verification and correction process, the pairs identified as true matches comprised our final matched data set.

2.9. Disjoint sets implementation

The last step was to find the intersections among all the matched pairs (referred to as the Final Matched Data Set). For example, suppose that PID 1 matched to PID 2 and PID 2 matched to PID 3. We then established that PID 1, PID 2 and PID 3 all corresponded to the same individual.

To understand the connection of all PIDs based on the pairwise matches, we first identified disjoint sets (pairs of matched PIDs with no PID in common). Each of these disjoint pairs resulted in one unique ID per pair. PIDs that were not included in the disjoint data set were (1) those not included in one of the original four subsets with higher probability of being duplicates and (2) those PIDs excluded during the final matched data set creation (either by being above the threshold or by being deemed ‘not a match’ during the manual refinement step).

After executing the disjoint sets exercise described previously, we were able to create a data set of disjoint sets (Disjoint Data Set) where each row contained all PIDs that belonged together. Using the example above, suppose that the following matched pairs existed on the Final Matched Data Set: PID 1 matched to PID 2, PID 2 matched to PID 3, and PID 4 matched to PID 5. In this scenario, PID 1, PID 2 and PID 3 corresponded to the same individual, and PID 4 and PID 5 belonged to a separate individual because there were no PIDs in common. From this data set, we can arrive at unique IDs for those with multiple PIDs (Final Disjoint Data Set). A new unique ID was then added to each row allowing us to link PIDs that mapped to the same unique identifier and establish the longitudinal trajectory of each participant.
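The grouping of matched pairs into one identifier per individual can be sketched with a union‐find (disjoint‐set) structure. The text does not state which implementation the authors used, so this is one common way to do it. The function name and the numbering of the unique IDs are assumptions.

```python
def group_matched_pids(pairs):
    """Merge matched PID pairs into groups and number the groups from 1."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root to the smaller one (arbitrary but stable).
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for pid in list(parent):
        groups.setdefault(find(pid), []).append(pid)

    ordered = sorted((sorted(members) for members in groups.values()),
                     key=lambda members: members[0])
    return {uid: members for uid, members in enumerate(ordered, start=1)}
```

For the example in the text, `group_matched_pids([(1, 2), (2, 3), (4, 5)])` returns `{1: [1, 2, 3], 2: [4, 5]}`: PIDs 1, 2 and 3 share one unique ID, and PIDs 4 and 5 share another.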

2.9.1. Assessing inter‐annotation reliability

To assess inter‐rater reliability among the four reviewers of the Annotated Validation Data Set used to describe the properties of the estimated cut‐point, each pair was recoded by the other reviewers, and Fleiss's kappa was calculated (Fleiss et al., 2003).
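For reference, Fleiss's kappa for a fixed number of raters per pair can be computed as in the sketch below. It follows the standard definition: observed agreement averaged over pairs, compared with the agreement expected from the overall category proportions. The input layout (one list of labels per annotated pair) is an assumption.

```python
def fleiss_kappa(ratings):
    """Fleiss's kappa for ratings[i] = labels given to pair i by each rater.

    Every pair must be rated by the same number of raters (at least two).
    The statistic is undefined, and this sketch raises ZeroDivisionError,
    when all ratings fall in a single category.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for row in ratings for label in row})
    # counts[i][j]: number of raters who put pair i in category j
    counts = [[row.count(c) for c in categories] for row in ratings]

    # Observed agreement per pair, then averaged.
    p_i = [(sum(x * x for x in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement from the overall category proportions.
    p_j = [sum(row[j] for row in counts) / (n_subjects * n_raters)
           for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)
```

With four reviewers and binary labels, each inner list would hold four values of 0 or 1. A value of 1 indicates perfect agreement, and values near 0 indicate agreement no better than chance.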

3. RESULTS

3.1. Algorithm execution and data sub‐setting

AHS had a total of 438,435 PIDs assigned in the study cohort by the end of the enrolment period (Figure 2). Recall from above that, to make the algorithm computationally efficient, we created four subsets where two or more PIDs were associated with the same piece of data. Subset 1 (multiple PIDs mapped to the same DID) was composed of 6818 unique PIDs. Subsets 2 and 3 (multiple PIDs mapped to the same first and last name, respectively), which contained the most common records at risk of duplication, were composed of 57,505 and 38,717 PIDs, respectively. The last subset, Subset 4 (multiple PIDs mapped to the same date of birth), was composed of 20,250 PIDs.

3.2. Threshold identification

DS was computed for each pair of PIDs as the sum of the string distances for the seven individual identifiers. A description of the DS distribution and the number of pairs can be found in Table 1. The mean DS in the subsets where PIDs were associated with the same DID (mean = 8) or the same date of birth (mean = 9) was lower than in the subsets where PIDs were associated with the same first name (mean = 16) or last name (mean = 14). This was expected, as first and last names alone are weaker identifiers than the DID or the date of birth.

TABLE 1.

Dissimilarity score (DS) characteristics by subset

Subset	N (PIDs)	N (pairs with DS ≤ 25)	DS min	DS median	DS mean	DS max
Subset 1: Multiple PIDs mapped to same DID	6818	2715	0	7	8	25
Subset 2: Multiple PIDs mapped to same first name	57,505	36,884	0	18	16	25
Subset 3: Multiple PIDs mapped to same last name	38,717	30,395	0	9	14	25
Subset 4: Multiple PIDs mapped to same date of birth	20,250	20,063	0	8	9	25

PID	DID	First_Name	Last_Name	DOB	Email	State	Phone	Consent_Date
1	1	Linda	Smith	10/29/1964	lsmith22@email.com	MA	453‐245‐0712	5/3/2017
2	2	Jennifer	Williams	8/18/1965	jenniferwilliams@email.com	PA	462‐946‐0095	5/10/2017
3	3	Susan	Brown	8/29/1972	susanbrowng@email.com	TN	630‐512‐5824	6/2/2017
4	4	Michael	Jones	10/24/1972	mjones@email.com	CA	258‐652‐4875	6/29/2017
5	5	James	Davis	7/20/1988	jamesdavis44@email.com	TX	880‐391‐9208	7/19/2017
6	6	Lidia	Smith	11/3/1993	lsmith12@email.com	FL	828‐304‐4350	9/4/2017
7	3	Sue	Brown	8/29/1972	susanbrowng@email.com	TN	630‐512‐5824	9/14/2017
8	4	Maria	Jones	11/22/1972	mjones@email.com	CA	258‐652‐4875	10/20/2017
9	2	Jennifer	William	8/18/1966	jenniferwilliams@email.com	PA	462‐946‐0095	12/5/2017
10	3	Sue	Brown	8/29/1972	susanbrowng@email.com	TN	630‐512‐5825	10/24/17

PID.1	PID.2	ds	First_Name.1	First_Name.2	Last_Name.1	Last_Name.2	DOB.1	DOB.2	Email.1	Email.2	State.1	State.2	Phone.1	Phone.2	Consent_Date.1	Consent_Date.2
2	9	6	jennifer	jennifer	williams	william	8/18/1965	8/18/1966	jenniferwilliams@email.com	jenniferwilliams@email.com	PA	PA	462‐946‐0095	462‐946‐0095	5/10/2017	12/5/2017
7	10	6	sue	sue	brown	brown	8/29/1972	8/29/1972	susanbrowng@email.com	susanbrowng@email.com	TN	TN	630‐512‐5824	630‐512‐5825	9/14/2017	10/24/17

PID.1	PID.2	ds	First_Name.1	First_Name.2	Last_Name.1	Last_Name.2	DOB.1	DOB.2	Email.1	Email.2	State.1	State.2	Phone.1	Phone.2	Consent_Date.1	Consent_Date.2
3	7	6	susan	sue	brown	brown	8/29/1972	8/29/1972	susanbrowng@email.com	susanbrowng@email.com	TN	TN	630‐512‐5824	630‐512‐5824	6/2/2017	9/14/2017
7	10	6	sue	sue	brown	brown	8/29/1972	8/29/1972	susanbrowng@email.com	susanbrowng@email.com	TN	TN	630‐512‐5824	630‐512‐5825	9/14/2017	10/24/17
3	10	9	susan	sue	brown	brown	8/29/1972	8/29/1972	susanbrowng@email.com	susanbrowng@email.com	TN	TN	630‐512‐5824	630‐512‐5825	6/2/2017	10/24/17

PID.1	PID.2	ds	First_Name.1	First_Name.2	Last_Name.1	Last_Name.2	DOB.1	DOB.2	Email.1	Email.2	State.1	State.2	Phone.1	Phone.2	Consent_Date.1	Consent_Date.2
2	9	6	jennifer	jennifer	williams	william	8/18/1965	8/18/1966	jenniferwilliams@email.com	jenniferwilliams@email.com	PA	PA	462‐946‐0095	462‐946‐0095	5/10/2017	12/5/2017
3	7	6	susan	sue	brown	brown	8/29/1972	8/29/1972	susanbrowng@email.com	susanbrowng@email.com	TN	TN	630‐512‐5824	630‐512‐5824	6/2/2017	9/14/2017
7	10	6	sue	sue	brown	brown	8/29/1972	8/29/1972	susanbrowng@email.com	susanbrowng@email.com	TN	TN	630‐512‐5824	630‐512‐5825	9/14/2017	10/24/17
3	10	9	susan	sue	brown	brown	8/29/1972	8/29/1972	susanbrowng@email.com	susanbrowng@email.com	TN	TN	630‐512‐5824	630‐512‐5825	6/2/2017	10/24/17

unique_id	pid.1	pid.2	pid.3
1	3	7	10
2	9	2	NULL
