Science Advances. 2022 Aug 19;8(33):eabl6464. doi: 10.1126/sciadv.abl6464

Expanding the attack surface: Robust profiling attacks threaten the privacy of sparse behavioral data

Arnaud J Tournier 1,2,*, Yves-Alexandre de Montjoye 1,2,*

Abstract

Behavioral data, collected from our daily interactions with technology, have driven scientific advances. Yet the collection and sharing of these data raise legitimate privacy concerns, as individuals can often be reidentified. Current identification attacks, however, require auxiliary information to roughly match the information available in the dataset, limiting their applicability. We here propose an entropy-based profiling model to learn time-persistent profiles. Using auxiliary information about a single target collected over a nonoverlapping time period, we show that individuals are correctly identified 79% of the time in a large location dataset of 0.5 million individuals and 65.2% of the time in a grocery shopping dataset of 85,000 individuals. We further show that accuracy only slowly decreases over time and that the model is robust to state-of-the-art noise addition. Our results show that much more auxiliary information than previously believed can be used to identify individuals, challenging deidentification practices and what currently constitutes legally anonymous data.


Individuals can be accurately identified in seemingly anonymous behavioral datasets using data collected at a different time.

INTRODUCTION

Over 22 billion connected devices, from smartphones and wearables to Internet of Things devices, passively collect fine-grained behavioral data about our lives (1). The location of a mobile phone is, for instance, collected up to 14,000 times a day (2), while a car generates up to 25 gigabytes of data every hour (3). These data are widely used. Location data, for example, are used by banks to detect fraudulent behavior (4) and predict the likelihood of loan repayment (5). They are also used by governments to monitor employment (6), quickly respond to natural disasters (7), and recently to respond to the coronavirus disease 2019 (COVID-19) pandemic (8). Last, researchers have used location data to better understand the spread of infectious diseases (9–11) or segregation in cities (12). While extremely useful, behavioral data are also extremely personal and sensitive (13), as shown by the Cambridge Analytica affair (14) and Edward Snowden’s revelations (15). In recent surveys, over 80% of Americans (16) and 80% of Britons (17) have expressed concerns over how their data are used and shared.

Finding a balance between using behavioral data for good and protecting people’s privacy often relies on anonymizing the data. Once anonymized, behavioral data fall outside the scope of data protection laws and can be freely used and shared. In the European Union’s General Data Protection Regulation (18) (GDPR, recital 26), data are considered anonymized when “rendered anonymous in such a manner that the data subject is not or no longer identifiable. [ … ] To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.” Similar definitions are found in privacy laws around the world, e.g., in the California Consumer Privacy Act section 1798.140 (h) (19) and in new bills currently under examination across the United States [e.g., Washington Senate Bill 5062 section 101 (20), Massachusetts Bill HD.3847 (21), and Virginia House Bill 2307 section 59.1-571 (22)].

Data matching has long been used to reidentify individuals in deemed-anonymous datasets using, e.g., record linkage algorithms (23–25). The seminal reidentification of Massachusetts Governor William Weld’s health records (26) and the recent reidentification of President Trump’s tax records (27) are examples of high-profile data matching attacks (28–32). More recently, data matching attacks have been developed and deployed against high-dimensional datasets, such as public transport smart cards (33), credit cards (34), cryptocurrency transactions (35), personal vehicle GPS traces (36), mobile phone mobility (37), smartphone application usage (38), web histories (39–41), smart meter data (42), and social network graphs (43, 44). New techniques have also been developed to identify individuals from imperfect matches (41, 45), evaluate the correctness of matches (46), and assess the robustness of matching in large datasets (47).

Matching attacks, however, require the attacker to have access to auxiliary information about the target that is also available in the dataset. For example, Gov. Weld was identified by his date of birth, gender, and zip code (26). These pieces of information were both public and available in the anonymized medical dataset. Assuming that the same information is both available in the dataset and as auxiliary information is, in general, a reasonable requirement for traditional tabular data. For behavioral data, however, it means that auxiliary information about the target and data points in the dataset have to be collected not only over the same period of time but also roughly at the same times. This is a strong requirement that can substantially limit the availability of matching auxiliary information, in particular when the data are sparse (45, 48). This has led some to question the practical risks posed by data matching for behavioral data and, ultimately, whether data protection laws should apply to pseudonymized behavioral datasets (48–50).

Here we present a profiling attack against sparse behavioral data capable of leveraging fully nonmatching auxiliary information, enabling the attacker to use a wide range of auxiliary information, including publicly available information. Once trained, our entropy-based model correctly identifies 79% of individuals in a location dataset of 0.5 million people and 93% within a set of 10 candidates. Similarly, on a grocery shopping dataset of 85,000 individuals (51), our model correctly identifies 65% of individuals and 74% within a set of 10 candidates. Using a meta-classifier, our model reaches an area under the receiver operating characteristic curve (AUROC) of 0.91 and is well calibrated. Our results hold even when (i) the time gap between the dataset and the auxiliary information increases, (ii) state-of-the-art noise is added to the dataset, and (iii) the dataset is large. Together, our results relax a strong requirement of current matching attacks and show that much more auxiliary information than previously thought might be available to reidentify individuals in behavioral datasets. This has broad implications for what constitutes anonymous data in today’s world, challenges current deidentification practices, and emphasizes the need to develop and deploy modern privacy engineering solutions.

RESULTS

We consider a population $I_{data}$ of $N$ individuals interacting with a service over a time period $\mathcal{T}_{data} = [t_{data}, t'_{data})$. The service collects a behavioral dataset $\Theta_{data} = \{y_i \mid i \in I_{data}\}$ where, for each individual $i \in I_{data}$, $y_i = ((t_{i,1}, x_{i,1}), \dots, (t_{i,n_i}, x_{i,n_i}))$ is a trace of data points. Each $x \in \mathcal{X}$ (e.g., a physical location), and points are time-ordered ($t_{i,1} \leq \dots \leq t_{i,n_i}$, with $t_{data} \leq t_{i,1}$ and $t_{i,n_i} < t'_{data}$). An attacker holds auxiliary information $\tau = ((t_{\tau,1}, x_{\tau,1}), \dots, (t_{\tau,n_\tau}, x_{\tau,n_\tau}))$ about a target individual $j$, which they hope to use to identify $j$ in $\Theta_{data}$ (Fig. 1A). The auxiliary information is recorded over a time interval $\mathcal{T}_{aux} = [t_{aux}, t'_{aux})$ (i.e., $t_{aux} \leq t_{\tau,1} \leq \dots \leq t_{\tau,n_\tau} < t'_{aux}$) disjoint from $\mathcal{T}_{data}$ (i.e., $t'_{data} \leq t_{aux}$ or $t'_{aux} \leq t_{data}$). We here assume that $j \in I_{data}$ and consider the general case in Discussion.
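To make this setup concrete, a minimal Python sketch of the data structures involved is given below. All names are illustrative and are not taken from the paper's code release.

```python
from typing import List, Tuple

DataPoint = Tuple[float, int]   # (timestamp in hours, antenna/cell or item id)
Trace = List[DataPoint]         # time-ordered data points for one individual

def in_window(trace: Trace, t_start: float, t_end: float) -> Trace:
    """Restrict a trace to the half-open interval [t_start, t_end)."""
    return [(t, x) for (t, x) in trace if t_start <= t < t_end]

# Toy example: one individual's trace, split into the dataset window
# T_data = [0, 10) weeks and the disjoint auxiliary window T_aux = [11, 16).
WEEK = 7 * 24.0
trace = [(5.0, 3), (30.0, 7), (11.5 * WEEK, 3), (12.0 * WEEK, 9)]
y_i = in_window(trace, 0.0, 10.0 * WEEK)          # goes into Theta_data
tau = in_window(trace, 11.0 * WEEK, 16.0 * WEEK)  # attacker's auxiliary info
```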

Fig. 1. Representation of the attack and effect of training.


(A) Using auxiliary information τ about the target (top: recorded over 𝒯aux), the attacker attempts to identify the target in the dataset Θdata (bottom: recorded over 𝒯data). Traces are processed by the model in three steps: (i) time-persistent profiles are computed using Ψ, (ii) the dissimilarities between the auxiliary information and profiles are computed with the divergence d, and (iii) potential candidates are ranked and the meta-classifier estimates the likelihood κ of the best candidate ρ to be the target. (B and C) Representation of profiles built by the model before (B) and after (C) training using t-distributed stochastic neighbor embedding (t-SNE) on d (97). Each point is a profile computed from 1 week of location data for a person. The training procedure here improves the ability of our model to distinguish the profiles of a single individual from the profiles of other individuals.

Profiling model for sparse data

We consider a space of profiles $\mathcal{S}$ and a map $\Psi$ from raw traces to profiles. Profiles aim to capture information about an individual that is both specific to that individual and stable over time. Formally, with few assumptions about the behavior of individuals, $\Psi$ performs nonparametric density estimation of $q$ random variables extracted from each trace (e.g., location or time of the day). The space of profiles $\mathcal{S} = \prod_{k=1}^{q} \mathcal{S}_k$ is a product of subsets of $\mathbb{R}^{a_k}$, where $a_k$ is the dimension corresponding to the $k$th variable (see Materials and Methods).

We propose an asymmetric dissimilarity function d on 𝒮 to compare the profiles of individuals in the dataset with the auxiliary information available to the attacker

$$\forall\, X, Y \in \mathcal{S}, \quad d_{\Omega,\Lambda}(X \,\|\, Y) = \sum_{k=1}^{q} \Omega_k\, d_{\Lambda_k}(X_k \,\|\, Y_k) \tag{1}$$

where

$$d_{\Lambda_k}(X_k \,\|\, Y_k) = H\big(M_{\Lambda_k}(X_k, Y_k)\big) - h_{\Lambda_k}(X_k, Y_k) \tag{2}$$

The model parameters $\Omega \in \mathbb{R}_+^q$ and $\Lambda \in (0,1)^q$ are shared across individuals. $H$ is the information entropy function, and $M_{\Lambda_k}(X_k, Y_k) = \Lambda_k X_k + (1 - \Lambda_k) Y_k$ and $h_{\Lambda_k}(X_k, Y_k) = \Lambda_k H(X_k) + (1 - \Lambda_k) H(Y_k)$ are convex combinations (see Materials and Methods). Each term $d_{\Lambda_k}$ in Eq. 2 is thus a nonnegative gap of concavity of $H$. The gaps are then combined linearly in Eq. 1, with $\Omega$ controlling their respective weights. Gaps capture the amount of statistical uncertainty that would be introduced by mixing profiles $X$ and $Y$. Mixing profiles from a single individual is expected to introduce less uncertainty than mixing profiles from distinct individuals, leading to smaller values for $d$ (see the Supplementary Materials).
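To see why each gap is nonnegative, note that Jensen's inequality applied to the concave entropy function gives, for any $\Lambda_k \in (0,1)$,

$$H\big(\Lambda_k X_k + (1 - \Lambda_k) Y_k\big) \;\geq\; \Lambda_k H(X_k) + (1 - \Lambda_k) H(Y_k)$$

so $d_{\Lambda_k}(X_k \,\|\, Y_k) \geq 0$, with equality when $X_k = Y_k$: mixing identical profiles introduces no additional uncertainty.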

Using the divergence $d$, traces in the dataset $\Theta_{data}$ are ranked according to their similarity with the auxiliary information $\tau$. In particular, our model finds the most similar trace $\rho(\tau) = \arg\min_{y \in \Theta_{data}} d(\Psi(\tau) \,\|\, \Psi(y))$. Once $\rho(\tau)$ is found, a meta-classifier using the “second-over-first” score (41, 45) estimates the likelihood $\hat{\kappa}_\tau$ of $\rho(\tau)$ to be correct (see Materials and Methods).

$$s(\tau) = \frac{\delta(\tau, \rho_2(\tau))}{\delta(\tau, \rho_1(\tau))} \tag{3}$$

for $\delta(\tau, \rho_k(\tau)) = d_{\Omega,\Lambda}(\Psi(\tau) \,\|\, \Psi(\rho_k(\tau)))$, with $k \in \{1, 2\}$, $\rho_1(\tau) = \rho(\tau)$, and $\rho_2(\tau)$ the second most similar trace to $\tau$.
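As an illustration, the ranking and scoring steps can be sketched in a few lines of Python, with `divergence` standing in for the learned $d_{\Omega,\Lambda}$ (names are hypothetical):

```python
import numpy as np

def rank_candidates(aux_profile, dataset_profiles, divergence):
    """Rank candidate profiles by divergence from the auxiliary profile
    (smaller = more similar) and compute the second-over-first score."""
    divs = np.array([divergence(aux_profile, p) for p in dataset_profiles])
    order = np.argsort(divs)
    best, runner_up = order[0], order[1]
    # A large ratio means the best candidate clearly stands out from the
    # rest, so the meta-classifier assigns it a higher likelihood.
    score = divs[runner_up] / divs[best]
    return best, score
```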

We train our model using a contrastive loss function, a well-known approach in image representation learning (52–54). Training traces are split over two disjoint subintervals of $\mathcal{T}_{data}$ into $\Theta_1$ and $\Theta_2$ (see Materials and Methods). Traces in $\Theta_1$ can be viewed as anchors, compared to positive and negative examples in $\Theta_2$. Formally, for $\mathcal{A} = \{(\Psi(y_1), \Psi(y_2)) \mid y_1 \in \Theta_1, y_2 \in \Theta_2, y_1 \equiv y_2\}$, a set of training profiles computed from $\Theta_1$ and $\Theta_2$, where $y_1 \equiv y_2$ indicates two traces originating from the same individual

$$\mathcal{L}(\Omega, \Lambda) = \mathcal{D}_-(\mathcal{A}) - \alpha\, \mathcal{D}_+(\mathcal{A}) \tag{4}$$

where

$$\mathcal{D}_-(\mathcal{A}) = \sum_{(X_1, X_2) \in \mathcal{A}} \big[ d(X_1 \,\|\, X_2) - d(X_1 \,\|\, \sigma(X_1)) \big]_+ \tag{5}$$

and

$$\mathcal{D}_+(\mathcal{A}) = \sum_{(X_1, X_2) \in \mathcal{A}} \big[ d(X_1 \,\|\, \sigma(X_1)) - d(X_1 \,\|\, X_2) \big]_+ \tag{6}$$

with

$$\sigma(X_1) = \arg\min_{Y_2 \in \Psi(\Theta_2) \setminus \{X_2\}} d(X_1 \,\|\, Y_2) \tag{7}$$

The terms of $\mathcal{D}_+$ are nonzero for couples $(X_1, X_2) \in \mathcal{A}$ where the model correctly finds $X_2$ to be the most similar profile for $X_1$ (positive examples). Conversely, the terms of $\mathcal{D}_-$ are nonzero on couples where $X_1$ is incorrectly identified (negative examples). Training minimizes $\mathcal{D}_-$ while maximizing $\mathcal{D}_+$ according to a balancing meta-parameter $\alpha > 0$ (see Materials and Methods and Fig. 1, B and C).
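A direct, unoptimized sketch of this loss is given below, assuming one anchor per individual in $\Theta_1$ and its counterpart at the same index in $\Theta_2$ (an illustrative simplification):

```python
def contrastive_loss(anchors, positives, divergence, alpha):
    """Sketch of Eqs. 4 to 7: anchors[i] and positives[i] are the two
    profiles of the same individual, computed over the two disjoint
    training subintervals Theta_1 and Theta_2."""
    d_neg, d_pos = 0.0, 0.0
    for i, x1 in enumerate(anchors):
        d_true = divergence(x1, positives[i])
        # sigma(x1): the best impostor, i.e., the closest profile in
        # Theta_2 excluding the true counterpart (Eq. 7).
        d_impostor = min(
            divergence(x1, y2) for j, y2 in enumerate(positives) if j != i
        )
        d_neg += max(d_true - d_impostor, 0.0)  # Eq. 5: misidentified anchors
        d_pos += max(d_impostor - d_true, 0.0)  # Eq. 6: margin on correct ones
    return d_neg - alpha * d_pos                # Eq. 4
```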

Empirical evaluation

We use a large-scale location dataset collected over 24 consecutive weeks for 0.5 million people through Call Detail Records (CDRs). For every interaction (call or text), CDRs typically contain the pseudonyms of the sender and the recipient, an hourly timestamp, the type of interaction (call or text), the duration of the call, as well as the approximate location of both the sender and the recipient. More specifically, for each party, approximate location refers to the antenna the party was connected to when the interaction occurred. To keep our model general, we here only use the location and hourly timestamp information. On average, traces contain 50.70 data points per week. The dataset $\Theta_{data}$ and the auxiliary information $\tau$ are fully disjoint and recorded, respectively, over the weeks $\mathcal{T}_{data} = [1, 11)$ and $\mathcal{T}_{aux} = [11, 16)$. The remaining weeks (i.e., [16,24]) are used in the next section to study the impact of the time gap $g = t_{aux} - t'_{data}$ between $\mathcal{T}_{data}$ and $\mathcal{T}_{aux}$ on the model accuracy.

We also validate the generality of our model by applying it to a grocery shopping dataset provided by Instacart (51). The dataset contains the ordered list of shopping baskets purchased online by each customer during a year. Each data point is a transaction corresponding to the basket of items purchased by a customer at once. The dataset contains approximately 100,000 individuals with at least 10 recorded transactions throughout the year. Recorded transactions include the purchased quantities of each product, the aisle where each product is stored within the shop (such as vegetables and meat), as well as the day of the week and hour of the day when the transaction happened. As this shopping dataset does not contain location information, we apply our model to profile individuals according to what they typically buy (see the Supplementary Materials). Similarly to the setup we use for the location dataset, the auxiliary information used by the model here is also fully nonoverlapping (see the Supplementary Materials).

Figure 2A shows that our model has ζ1 = 79% chance of correctly finding a target out of N = 0.5 million individuals in the location dataset. The attacker would furthermore have ζ10 = 93% chance of correctly finding the target in a set of 10 candidates and ζ50 = 97.5% chance in a set of 50 candidates, both out of 0.5 million individuals.

Fig. 2. The model identifies the correct individual with high probability.


(A) Likelihood ζm to find a target in the location dataset within the top m candidates selected by our model out of N = 0.5 million. An attacker has ζ1 = 79% chance of correctly identifying the target in the location dataset, with ζm increasing rapidly with m. (B) Likelihood ζm to find a target in the grocery shopping dataset within the top m candidates selected by our model out of N = 85,000 individuals. An attacker has ζ1 = 65% chance of correctly identifying the target in the grocery shopping dataset, with ζm increasing rapidly with m. (C) The meta-classifier accurately evaluates whether the individual found by our model in the location dataset is correct (ROC curve, AUC = 0.91). Inset: False discovery rate (FDR) for traces with $\hat{\kappa}_\tau > \kappa$. For individuals predicted to be correctly found in the location dataset with $\hat{\kappa}_\tau > 0.95$, 4.81% are actually incorrect, showing the method to be well calibrated (see table S4). (D) The meta-classifier accurately evaluates whether the individual found by our model in the grocery shopping dataset is correct (ROC curve, AUC = 0.94). Inset: FDR for traces with $\hat{\kappa}_\tau > \kappa$. For individuals predicted to be correctly found in the grocery shopping dataset with $\hat{\kappa}_\tau > 0.95$, 4.22% are actually incorrect (see table S5). TPR, true positive rate; FPR, false positive rate.

Figure 2B shows that our model has ζ1 = 65% chance of correctly finding a target out of N = 85,000 individuals in the grocery shopping dataset. The attacker would furthermore have ζ10 = 74% chance of correctly finding the target in a set of 10 candidates and ζ50 = 80% chance in a set of 50 candidates, both out of 85,000 individuals. While a complete analysis of why individuals might be less identifiable in shopping data than location data is beyond the scope of this work, we offer some hypotheses in Discussion.

Figure 2 (C and D) shows that the meta-classifier accurately predicts whether the right individual has been found by our model. It achieves a high AUC [area under the receiver operating characteristic (ROC) curve] of 0.91 for the location dataset (0.94 for the grocery shopping dataset), and the estimated likelihood $\hat{\kappa}_\tau$ of the right individual to be found is well calibrated. This ensures that an individual found by our model and given a high probability by the meta-classifier is likely to be the right person. For instance, for $\hat{\kappa}_\tau > 0.95$ (respectively 0.9 and 0.99), the empirical likelihood for ρ(τ) to be incorrect (false discovery rate) is 4.85% for the location dataset (respectively 10.4 and 1.0%; see inset in Fig. 2C).

Time persistence of location profiles

Our behavior is likely to change over time, as we change jobs or partners, move houses, or favor new shops (fig. S4). To evaluate the robustness of profiles against the natural drift of human behavior over time (55), we compare the performance of our model when the auxiliary information is collected after a time gap $g = t_{aux} - t'_{data}$. Starting from g = 0, i.e., the previous configuration where the auxiliary information and the dataset $\Theta_{data}$ are collected over the disjoint consecutive periods of time $\mathcal{T}_{data} = [1, 11)$ and $\mathcal{T}_{aux} = [11, 16)$, we increase the time gap g and consider auxiliary information collected over $\mathcal{T}_{aux,g} = [11 + g, 16 + g)$.

This experiment shows (fig. S5) that the location profiles built by our model are time persistent, with accuracy ζ1,g decreasing slowly over time. For each added week in the time gap, we estimate that ζ1,g decreases by 0.93 percentage point on average [±0.29, standard error of the difference (SED), linear fit R2 = 0.996]. The AUC of the meta-classifier similarly decreases slowly over time (AUCg=0 = 0.91 down only to AUCg=9 = 0.90).

To understand why some individuals are more identifiable than others in the location dataset, we compute a handful of summary statistics for each individual and use them in a post hoc analysis, with individuals split into two groups according to their respective identification rates as g increases (see Materials and Methods). We found (fig. S5) that individuals who are more identifiable visit more unique locations (medians of 30 versus 21, P < 10−15), that their traces contain more geographical information (geographical entropy of traces: 2.8 versus 2.2 bits of information, P < 10−15), that they spend most of their time within a small geographical region (radius of gyration (56): 19.8 versus 21.8 km, P < 10−15), and that they live in less densely populated areas (area of the primary Voronoi cell: 7.2 versus 9.6 km2, P < 10−15). These differences suggest that the lifestyle of an individual affects their identifiability in profiling attacks.

Robustness to noise addition

Noise addition has long been used as a mechanism to prevent identification. Geo-indistinguishability, a technique inspired by Differential Privacy (57), has become a popular noise addition mechanism for high-dimensional location data. It has, for instance, been implemented by browser apps such as Location Guard (58) and Geoprivacy (59). Geo-indistinguishability is achieved by adding, to each data point, independent spatial noise sampled from a bidimensional Laplace distribution with mean radius $\bar{r} = 2/\epsilon$ for a given parameter ε. Typical values of ε used in the literature range from 0.023 m−1 ($\bar{r} = 100$ m) to 0.0034 m−1 ($\bar{r} = 600$ m) (60–65). Knowing the noisy location of a data point thus only reveals a 95% confidence region about its real location, with radius ranging from r95 = 237 m ($\bar{r} = 100$ m) to r95 = 1432 m ($\bar{r} = 600$ m).
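For illustration, the planar Laplace mechanism is straightforward to sample: the noise radius follows a Gamma(2, 1/ε) distribution and the angle is uniform. The sketch below (NumPy, illustrative) also numerically recovers the r95 ≈ 237 m quoted above for $\bar{r} = 100$ m:

```python
import numpy as np

rng = np.random.default_rng(0)

def planar_laplace_noise(eps, size=1):
    """Sample 2D noise vectors whose radius has density eps^2 * r * exp(-eps*r),
    i.e., a Gamma(2, 1/eps) distribution with mean radius 2/eps."""
    r = rng.gamma(shape=2.0, scale=1.0 / eps, size=size)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=size)
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=-1)

# Sanity check of the 95% confidence radius: the radius CDF is
# F(r) = 1 - exp(-eps*r) * (1 + eps*r), so r95 solves F(r95) = 0.95.
eps = 2.0 / 100.0                        # mean radius 100 m
r = np.linspace(0.0, 2000.0, 200001)
cdf = 1.0 - np.exp(-eps * r) * (1.0 + eps * r)
print(r[np.searchsorted(cdf, 0.95)])     # ~237 m, as stated in the text
```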

Figure 3 shows that the accuracy of our model on the location dataset only decreases to 78% when small amounts of noise are added ($\bar{r} = 100$ m, r95 = 237 m) and to 71% for large amounts of noise ($\bar{r} = 600$ m, r95 = 1432 m). This shows that our model is robust to even the large amounts of spatial Laplace noise used in the literature and industry. Even the addition of very large amounts of noise ($\bar{r} > 2000$ m) only decreases the accuracy of the model to slightly below 60%. This decrease is, however, also likely to strongly affect the utility of the data: $\bar{r} = 2000$ m indeed means r95 = 4744 m (see the probability density function in the inset of Fig. 3A). For comparison, the average area of a zip code in New York City corresponds to a circular region of radius 1300 m.

Fig. 3. The model is robust to noise addition.


(A) ζ1 when locations in Θdata are perturbed with Laplacian noise. Standard amounts of noise (average radius $\bar{r} < 600$ m, r95 < 1432 m) only decrease ζ1 by 7 percentage points. For large amounts of noise, ζ1 only slowly decreases further, e.g., ζ1 = 59% for $\bar{r} = 2000$ m (r95 = 4744 m). Inset: Probability density function $D_\epsilon(r) = \epsilon^2 r e^{-\epsilon r} \mathbb{1}_{r > 0}$ of the noise radius r. The mean radius is $\bar{r} = 2/\epsilon$. (B) The predictive power of the meta-classifier, captured by the ROC curves, only decreases slowly as the amount of added noise increases. The AUC decreases from 0.91 without noise down to, at most, 0.83 for $\bar{r} = 4000$ m.

In this work, we show for the first time how profiling attacks are possible at large scale against sparse behavioral datasets. Profiling attacks relax a strong requirement of matching attacks, especially against behavioral data: the need for auxiliary information to be recorded not only over the same time period but also roughly at the same times. Our attack significantly expands the attack surface by making a much wider range of auxiliary information usable for reidentification, even against noisy datasets. Further research in profiling attacks is likely to lead to even more powerful models. Our results emphasize the need to account for profiling attacks when evaluating what constitutes anonymous data, for instance, under the European Union Article 29 Working Party's linkability criterion (66) in its interpretation of the GDPR. Technically, our results emphasize the need for formal privacy guarantees and privacy engineering solutions enabling the truly anonymous use of behavioral data.

DISCUSSION

Scalability of the attack to larger datasets

The size of a dataset is likely to affect the likelihood of a person to be identified in it, with accuracy likely to be lower in larger datasets on average (46, 47). We here study the accuracy of our attack as a function of N, the number of individuals in the dataset Θdata, assuming everything else is kept equal (see the Supplementary Materials).

Figure 4 (A and B) shows that the accuracy ζ1 of our attack only decreases slowly with N for both the location and grocery shopping datasets. The first derivative $\partial \zeta_1 / \partial N$ is strictly increasing, showing that ζ1 is strictly convex (see inset). Moreover, $\partial \zeta_1 / \partial N$ converges to 0 rapidly as N increases. Last, a simple logarithmic fit (R2 = 0.999) shows the decrease of ζ1 to behave as a third-order polynomial in log(N). Together, these findings strongly suggest that the accuracy of our attack would remain high in most practical settings.

Fig. 4. Robustness of the attack to larger datasets and relaxation of the membership assumption.


(A) Model accuracy ζ1 averaged over 10 runs for the location dataset (see table S8). ζ1 decreases slowly with population size (log fit, R2 = 0.999). Inset: The first derivative $\partial \zeta_1 / \partial N$ converges to zero as N increases, confirming that the decrease of ζ1 is convex and slow. (B) Model accuracy ζ1 averaged over 10 runs for the grocery shopping dataset (see table S9). ζ1 decreases slowly with population size (log fit, R2 = 0.99). (C) The attack is robust to a relaxation of the membership assumption (location dataset). The prior P is the Jaccard index between Iaux and Idata. Inset: FDR for predictions ρ(τ) estimated to be correct with $\hat{\kappa}_\tau > 0.95$. The likelihood for ρ(τ) to be incorrect is 4.6% for P = 0.9 (resp. 4.3% for P = 0.75 and 3.3% for P = 0.5), showing that the method is well calibrated even when the membership assumption is relaxed (see table S6). (D) The attack is robust to a relaxation of the membership assumption (grocery shopping dataset). The ROC curves for various priors P are indistinguishable. Inset: FDR for predictions ρ(τ) estimated to be correct with $\hat{\kappa}_\tau > 0.95$. The likelihood for ρ(τ) to be incorrect is 4.22% for P = 0.9 (resp. 4.06% for P = 0.75 and 4.27% for P = 0.5), showing that the method is well calibrated even when the membership assumption is relaxed (see table S7).

Removing the membership assumption

We have, throughout the article, worked under the assumption that the attacker knew the target to be in the dataset Θdata. While we believe that this assumption is reasonable in many practical settings, there are situations where it might not be the case. We thus extend the meta-classifier to the case where the attacker is unsure whether the target is in the dataset.

The meta-classifier is adapted using a prior P on the probability of the target to be in the dataset. In the main text, scores were calibrated using two sets of traces obtained over two disjoint periods of time from the same individuals. Here, the prior P is used during the calibration phase as a leave-one-out parameter: each calibration trace in the first set has its counterpart in the second set virtually removed with probability 1 − P (see Materials and Methods). P is chosen by the attacker on a case-by-case basis depending on the information available to them. For instance, P could be the sampling rate for sampled data or the market share of a company in the country of interest.

Figure 4 (C and D) shows that our classifier accurately predicts whether the right individual has been found for various levels of the prior P. For each prior, after calibration, we perform the attack on targets from a new set of individuals Iaux such that P = J(Iaux, Idata), the Jaccard index between the two sets (67). The performance of our meta-classifier only decreases slightly as new targets, not contained in the dataset, are introduced. In particular, for the location dataset, the AUC decreases from AUCP=1 = 0.91 (main text) to AUCP=0.9 = 0.89, AUCP=0.75 = 0.89, and AUCP=0.5 = 0.88. For predictions ρ(τ) estimated to be correct with $\hat{\kappa}_\tau > 0.95$, the likelihood for ρ(τ) to be incorrect is 4.6% for P = 0.9 (resp. 4.3% for P = 0.75 and 3.3% for P = 0.5; see inset in Fig. 4C).

Comparison with previous works on location data

While most previous works on location data have investigated attacks where matching auxiliary information is available to an attacker (34, 37, 45, 68–70), a few attacks using nonoverlapping auxiliary information have been proposed and evaluated on small-scale datasets (from 100 to 50,000 individuals) (71–73). These attacks are based either on Markov chains (71, 72) or on histograms (73). We reimplement and compare our work to six of these methods: four based on histograms, using the Jensen-Shannon (JS) divergence (73), the Bhattacharyya (Bhat) distance (74), the L1 distance (75), and the cosine distance (76), and two based on Markov chains (71, 72). We compare these methods to our approach in three scenarios: (i) no noise added to the dataset, (ii) small amounts of noise added to the dataset ($\bar{r} = 200$ m), and (iii) very large amounts of noise added to the dataset ($\bar{r} = 2000$ m).
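For reference, the four histogram baselines are standard distances between probability vectors; a sketch is given below (our reimplementations may differ in details such as smoothing or binning):

```python
import numpy as np
from scipy.spatial.distance import cosine, jensenshannon

def baseline_distances(p, q):
    """The four histogram baselines for probability vectors p and q
    defined over the same set of locations."""
    return {
        "JS": jensenshannon(p, q) ** 2,  # scipy returns the JS distance (sqrt)
        "Bhattacharyya": -np.log(np.sum(np.sqrt(p * q)) + 1e-12),
        "L1": np.abs(p - q).sum(),
        "cosine": cosine(p, q),
    }
```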

Our method outperforms all six previous methods by at least 13 percentage points in each scenario (see table S3). More specifically, we outperform the state of the art by 13.6 percentage points in scenario 1, by 16.3 percentage points in scenario 2, and by a striking 26.9 percentage points in scenario 3. Figure 5 shows how our method outperforms the state of the art (Bhat or JS) across various scales and amounts of added noise. Across the board, histogram-based methods perform better than Markov-based methods (see table S3).

Fig. 5. The model outperforms previous work on location data.


(A) Our model outperforms the baselines at all scales on the location dataset. Accuracies ζ1 are averaged over 10 runs (see table S8). (B) Our model outperforms the baselines for any amount of added noise, e.g., by 27.6 percentage points for $\bar{r} = 3000$ m. ζ1 also decreases visibly more rapidly for the baselines than for our model.

Deep Learning methods have been developed recently to extract representations from raw sequential data, e.g., Contrastive Predictive Coding (77), Recurrent Attention Models (78), and Autoencoder architectures with Recurrent Neural Networks (79) and Autoregressive models (80). These representations might, in future work, replace the simple profiles we used here (see Materials and Methods), although questions remain on the applicability of these methods to sparse data. Similarly, future specialized models for behavioral datasets are likely to be developed. Both will ultimately further increase the scope and accuracy of profiling attacks and the risk they pose to our privacy.

Limitation—auxiliary information

Throughout this work, we evaluate the potential of a person to be identified in a location dataset using fully nonoverlapping auxiliary information coming from the same modality. In some cases, an attacker might try to identify someone using auxiliary information coming from other modalities, including publicly available information, such as social media posts, or privately collected information, such as the WiFi connection data used in the recent reidentification and subsequent outing of a U.S. priest (81). For ethical, legal (82), and contractual reasons, we did not attempt to identify individuals in our dataset using auxiliary information coming from other modalities. Unless the auxiliary information comes from a modality independent of the one used to collect the dataset (e.g., a credit card used only for expenses abroad combined with a mobile phone dataset recorded only in the home country), we expect our model to perform well and our results to qualitatively hold.

Limitation—noise addition mechanism

We here consider Geo-indistinguishability, a local noise addition mechanism that has traditionally been used for location data (58, 59). A range of other mechanisms could be considered. For instance, one could decide to report the same obfuscated location every time an individual is in a given real location. One could also consider global mechanisms such as k-anonymity (83). While some of these mechanisms might prove more effective against our attack, something we leave for future work, their impact on the downstream utility of the dataset has to be carefully considered. In particular, the biases introduced by nontruthful methods are generally considered problematic, and global methods such as k-anonymity have been shown to strongly affect utility (83). We are skeptical that behavioral data can be anonymized at the individual level while retaining general utility. Instead, we believe modern privacy engineering methods, such as query-based systems (84, 85), and formal guarantees, such as Differential Privacy, to be the way forward when it comes to safely releasing behavioral data.

Discrepancies between location and grocery shopping datasets

While a complete analysis of why individuals might be less identifiable in shopping data than location data is beyond the scope of this work, we offer some hypotheses below. First, the grocery shopping dataset used here is much sparser than the location dataset (0.52 data points per week on average for the grocery shopping dataset versus 50.70 data points per week on average for the location dataset). This is likely to affect the computation of profiles by density estimation, making them less accurate. Second, grocery shopping data points might be less identifiable than location data points. For instance, groceries online are mostly purchased from the first category page displayed by retailers (86), which could reduce the diversity of shopping baskets across individuals. Third, shopping patterns might be less stable over time than location patterns. Previous works have shown human mobility to be fairly predictable, especially with regard to home and work locations (55, 87, 88). On the other hand, customers seem to shop for groceries online only as a complement to traditional stores (89), with many situational factors influencing the loyalty of customers and their purchasing habits over time (90).

Last, shopping patterns have been used to train recommender systems. These systems learn to predict future purchases through collaborative filtering, deducing future purchases from what other individuals have purchased in the past. However, while recommender systems learn that customers who bought X are also likely to be interested in Y, our model learns what is singular about an individual's shopping habits in order to identify the specific list of their past purchases.

MATERIALS AND METHODS

Our model is an open-set inductive classifier based on nearest neighbor classification (1-NN) (91). Open-set means that classes, i.e., the identities of the individuals in the dataset, are disjoint between training, validation, and testing, thus providing the attacker with a model readily applicable to new individuals. Previous studies have shown that 1-NN performance can be improved by learning the model's distance (92, 93).

Our methodological contribution can be summarized in three points: (i) We propose an abstract space as input to the model, the space of profiles, and a method to map raw data into that space as collections of histograms. (ii) Within the space of profiles, we propose a supervised learning method similar to recent works in distance learning (92, 93) to learn a new divergence to compare profiles. (iii) From the divergence values, we propose a method to estimate the likelihood of auxiliary information to be correctly classified. Our framework, which we will now describe, makes no parametric assumption about the distribution of the data.

Formalism

Behavioral data are individual-level temporal data containing discrete events characterizing the behavior of each individual. We model the generation of these discrete events as point processes, a general nonparametric model for point pattern analysis (94). Formally, for each individual, we consider a trace $y$ as a realization of a point process $Y$ on $\mathcal{T} \times \mathcal{X}$, where $\mathcal{T} \subset \mathbb{R}_+$ is the time interval over which the data are recorded and $\mathcal{X} = \prod_k \mathcal{X}_k$ is a multidimensional space with each $\mathcal{X}_k$ either discrete or an interval on the real line. For instance, in the main text, $\mathcal{X}$ is the (single-dimensional) finite set of the indexed geographical regions around each antenna. We further assume that these point processes are invariant under week-long time translations, as random elements whose values are point patterns. This is a modeling assumption that works well enough to model the weekly patterns followed by individuals' behavior, aside from holidays and life-changing events (55, 87, 88). Under this assumption, $Y$ has similar distributions over all $\mathcal{T}' \times \mathcal{X}$, where $\mathcal{T}' \subset \mathcal{T}$ is a 1-week time interval.

Mapping traces to profiles

Using a map $\Psi : \Theta \to \mathcal{S}$ from raw traces to the space of profiles, we compute profiles aiming to capture the recurrent patterns of an individual while reducing the microvariations observed in the data. Profiles are collections of density estimates corresponding to variables obtained from each point process $Y_i$ (see the Supplementary Materials). Formally, for each individual $i$, the profile $\Psi(Y_i) = Z_i = (Z_{i,k})_{k=1,\dots,q}$ is a collection of random variables $Z_{i,k}$ taking values on their respective probability simplexes $\mathcal{S}_k$ (with $\mathcal{S} = \prod_{k=1}^{q} \mathcal{S}_k$). Here, we choose $Z_{i,k}$ to be a histogram obtained using the random counting measure $N_i$ associated with $Y_i$ on a collection $\mathcal{B}_k$ of Borel sets of $\mathcal{T} \times \mathcal{X}$

$$Z_{i,k} = \left( \frac{N_i(B)}{\sum_{B' \in \mathcal{B}_k} N_i(B')} \right)_{B \in \mathcal{B}_k} \tag{8}$$

with $N_i(B) = \#(B \cap Y_i)$ the random variable counting the number of events of $Y_i$ in $B$, for any $B \in \mathcal{B}_k$. Aiming for profiles to be time persistent and robust to added noise, we consider collections of Borel sets $\mathcal{B}_k$ corresponding to aggregating events time-wise (over $\mathcal{T}$) and value-wise (over $\mathcal{X}$) (see the Supplementary Materials). Although beyond the scope of this work, other density estimation methods, e.g., kernel density estimators, could be used for the variables $Z_{i,k}$.
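As an illustration, a minimal version of $\Psi$ for a location trace, with an assumed simplified binning (one histogram over antenna cells and one over hours of the day, rather than the exact Borel sets used in the paper), could be:

```python
import numpy as np

def histogram_profile(trace, n_locations, n_hours=24):
    """Sketch of Psi: two normalized histograms (Eq. 8) computed from a
    trace of (timestamp_in_hours, cell_id) pairs."""
    loc_counts = np.zeros(n_locations)
    hour_counts = np.zeros(n_hours)
    for t, x in trace:
        loc_counts[x] += 1
        hour_counts[int(t) % n_hours] += 1
    return loc_counts / loc_counts.sum(), hour_counts / hour_counts.sum()
```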

Divergence

Our model learns how important each random variable $Z_{\cdot,k}$ is for profiles to be identifiable and weights these variables accordingly. Each variable is valued on a probability simplex of up to a few thousand dimensions. The weights $\Omega \in \mathbb{R}_+^q$ and $\Lambda \in [0,1]^q$ are, by design, shared across individuals for the model to be inductive. This allows the model to be applied to individuals that are not seen during training and validation.

We define the model divergence $d_{\Omega,\Lambda} = \sum_{k=1}^{q} \Omega_k\, d_{\Lambda_k}$ as a linear combination, weighted by $\Omega$, of pairwise subdivergences of the variables $Z_{i,k}$ on their respective simplexes. More specifically, the subdivergence

$$d_{\Lambda_k}(Z_{i,k} \,\|\, Z_{i',k}) = H(M_{i,i',k}) - h_{i,i',k} \tag{9}$$

compares the convex combination $M_{i,i',k} = \Lambda_k Z_{i,k} + (1 - \Lambda_k) Z_{i',k}$ of the histograms $Z_{i,k}$ and $Z_{i',k}$, via the entropy $H$ on $\mathcal{S}_k$ ($M_{i,i',k} \in \mathcal{S}_k$, as probability simplexes are stable under convex combination), to the convex combination $h_{i,i',k} = \Lambda_k H(Z_{i,k}) + (1 - \Lambda_k) H(Z_{i',k})$ of the entropies taken separately. Because of the concavity of $H$, for all $k$, the subdivergence $d_{\Lambda_k}$ is valued in $\mathbb{R}_+$.
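A direct translation of Eqs. 2 and 9 into code, assuming histograms represented as NumPy probability vectors (a sketch):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, with 0 * log(0) taken as 0."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def sub_divergence(x, y, lam):
    """Entropy-gap subdivergence of Eq. 9: the entropy of the mixed
    histogram minus the mixture of the entropies. Nonnegative by
    concavity of the entropy; zero iff x == y for lam in (0, 1)."""
    m = lam * x + (1.0 - lam) * y
    return entropy(m) - (lam * entropy(x) + (1.0 - lam) * entropy(y))
```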

Empirical setup

Figure S1 shows how we split the location dataset to train, validate, test, and calibrate our model. In particular, the dataset is split into training sets Θ1 and Θ2, validation sets Θ3 and Θ4, testing sets Θdata and Θaux, and score calibration sets ΘA and ΘB.

For testing, traces are collected over $\mathcal{T}_{data} = [t_{data}, t'_{data})$ from individuals in $I_{data}$. Auxiliary information is fully nonoverlapping, collected over $\mathcal{T}_{aux} = [t_{aux}, t'_{aux})$ with $t'_{data} \leq t_{aux}$, from targets in $I_{aux}$ such that the Jaccard index between $I_{data}$ and $I_{aux}$ is equal to the attacker's prior $P$. For training, traces are collected over a split of $\mathcal{T}_{data}$ into two consecutive time intervals $\mathcal{T}_1 = [t_0, t_1)$ (with $t_0 = t_{data}$) and $\mathcal{T}_2 = [t_1, t_2)$ (with $t_2 - t_1 = t_1 - t_0$) from individuals in $I_{train}$. Training individuals $I_{train}$ are disjoint from $I_{data}$ and $I_{aux}$. For validation, traces are collected over a split of $\mathcal{T}_{data}$ into two other consecutive time intervals $\mathcal{T}_3 = [t_2, t_3)$ and $\mathcal{T}_4 = [t_3, t_4)$ (with $t_4 = t'_{data}$ and $t_4 - t_3 = t_3 - t_2$) disjoint from $\mathcal{T}_1$ and $\mathcal{T}_2$. Individuals $I_{valid}$ used for validation are disjoint from $I_{data}$, $I_{aux}$, and $I_{train}$. Last, to calibrate the scores, traces are recorded over another split of $\mathcal{T}_{data}$ into $\mathcal{T}_A = [t_A, t_B)$ and $\mathcal{T}_B = [t_B, t_4)$ (with $t_4 - t_B = t_B - t_A = t'_{aux} - t_{aux}$, i.e., 5 weeks here) from individuals in $I_{data}$.

We select an 80/20 split for the time, with the 10 weeks of $\mathcal{T}_{data}$ split into 4 and 4 weeks for $\mathcal{T}_1$ and $\mathcal{T}_2$, and 1 and 1 week for $\mathcal{T}_3$ and $\mathcal{T}_4$, all disjoint. We used a small number of individuals for training (10,000) and validation (1000) to illustrate the strength of our model even when tested orders of magnitude above its training size (N = 0.5 million). Individuals were kept strictly separate between training, validation, and testing to prevent overfitting and to show the inductive strength of our model.

Training

Parameters $\Omega$ and $\Lambda$ are obtained by minimizing the average training error using a standard mini-batch Adam stochastic gradient descent (95). In particular, a mini-batch $\theta_t$ is drawn at each step $t$ from traces in $\Theta_1$ such that $\#\theta_{t,+} = \#\theta_{t,-} = 100$ individuals, where $\theta_{t,+} = \{y_1 \in \theta_t \mid \rho_{t,\Theta_2}(y_1) \equiv y_1\}$ and $\theta_{t,-} = \{y_1 \in \theta_t \mid \rho_{t,\Theta_2}(y_1) \not\equiv y_1\}$ for $\rho_{t,\Theta}(y_1) = \arg\min_{y \in \Theta}\, d_{\Omega_t,\Lambda_t}(\Psi(y_1) \,\|\, \Psi(y))$. This allows balancing positive and negative examples at each step during training, e.g., when the majority of traces in $\Theta_1$ are positive examples.

$$\mathcal{L}(\Omega, \Lambda) = \mathcal{D}(\theta_-) + \alpha\, \mathcal{D}(\theta_+) \tag{10}$$

where

$$\mathcal{D}(\theta) = \frac{1}{\#\theta} \sum_{y_1 \in \theta} \big[ d_{\Omega,\Lambda}(\Psi(y_1) \,\|\, \Psi(y_2)) - d_{\Omega,\Lambda}(\Psi(y_1) \,\|\, \Psi(\sigma(y_1))) \big] \tag{11}$$

where $\sigma(y_1) = \rho_{\Theta_2 \setminus \{y_2\}}(y_1)$. At every iteration, $\Lambda_t$ is clipped within $(0,1)^q$ and $\Omega_t$ above 0. Although we balanced the contribution of positive and negative examples, the values of $\mathcal{D}$ in Eq. 11 could still differ between these two groups, particularly when noise is added to the dataset. The balance between $\mathcal{D}(\theta_+)$ and $\mathcal{D}(\theta_-)$ is thus controlled by the meta-parameter $\alpha > 0$ (see figs. S3 and S4). To select $\alpha$, the validation accuracy after training, $\zeta_{val}(\alpha) = \frac{1}{\#I_{valid}} \sum_{y_3 \in \Theta_3} \mathbb{1}_{\rho_{\Theta_4}(y_3) \equiv y_3}$, is maximized by grid search (see the Supplementary Materials).

Calibrating scores

We compute the likelihood of the right individual to have been found, i.e., of ρ(τ) to be correct, using a standard second-over-first score from the literature (41, 45):

$$s(\tau) = \frac{\delta(\tau, \rho_2(\tau))}{\delta(\tau, \rho_1(\tau))} \tag{12}$$

where $\delta(x, y) = d(\Psi(x) \,\|\, \Psi(y))$, $\rho_1(\tau) = \rho(\tau)$, and $\rho_2(\tau)$ is the second best candidate for $\tau$. For any desired minimum likelihood $\kappa$ of $\rho(\tau)$ to be correct, $\rho(\tau)$ is rejected if $s(\tau) < s_\kappa$, where $s_\kappa$ is a score threshold corresponding to $\kappa$, calibrated using $\Theta_A$ and $\Theta_B$. The likelihood of $\rho(\tau)$ to be correct is then estimated by $\hat{\kappa}_\tau$ such that $s_{\hat{\kappa}_\tau} = s(\tau)$.

In practice, an attacker will often have a prior about the likelihood of a target to be in the dataset, e.g., because of the market share of a service in a country. We denote $P$ the attacker's prior on the probability of an auxiliary information to correspond to an individual in $I_{data}$, matching in our experiments the Jaccard index (67) between $I_{aux}$ and $I_{data}$ (in particular, in the main text, $P = 1$). The distribution of scores is then obtained by comparing each trace $y^A$ within $\Theta_A$ with all the traces in $\Theta_B$, after the trace corresponding to $y^A$ has been removed from $\Theta_B$ with probability $1 - P$. Formally, for $N$ independent Bernoulli variables $(B_i(p))_{i \in I_{data}}$ of parameter $p$, thresholds $s_\kappa$ are obtained by solving

$$\frac{1}{N} \sum_{i \in I_{data}} \left( B_i \times \mathbb{1}\!\left\{ \rho_{\Theta_B}(y_i^A) \equiv y_i^B \mid s(y_i^A) > s_\kappa \right\} + (1 - B_i) \times \mathbb{1}\!\left\{ \rho_{\Theta_B \setminus \{y_i^B\}}(y_i^A) \equiv y_i^B \mid s(y_i^A) > s_\kappa \right\} \right) = \kappa$$
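A sketch of this calibration step, assuming the scores and correctness indicators have already been computed on $\Theta_A$ and $\Theta_B$ after the Bernoulli($P$) virtual removal:

```python
import numpy as np

def calibrate_thresholds(scores, correct, kappas):
    """For each target precision kappa, find the threshold s_kappa such
    that, among calibration traces with score above s_kappa, a fraction
    kappa are correctly matched. `correct[i]` is whether rho found the
    right counterpart in Theta_B (after the virtual removal)."""
    scores = np.asarray(scores)
    order = np.argsort(scores)[::-1]               # decreasing score
    hits = np.asarray(correct, dtype=float)[order]
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    thresholds = {}
    for kappa in kappas:
        ok = np.nonzero(precision >= kappa)[0]
        # keep the largest acceptance set that still meets the target
        thresholds[kappa] = scores[order[ok[-1]]] if len(ok) else np.inf
    return thresholds
```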

Determinants of identifiability

To better understand what might make an individual identifiable, we perform a post hoc study of a handful of summary statistics of an individual's behavior: the radius of gyration of an individual, the area of the main visited antenna cell, the number of unique antenna cells visited, and the entropy of the location distribution over antenna cells. We consider people who are easier to identify to be target individuals that our model manages to correctly identify in at least 50% of our time gap experiments with $\mathcal{T}_{aux,g}$ (group 1). All remaining targets are considered harder to identify, including the ones that are never correctly identified (group 2). Of the 420,787 targets considered in the time gap experiment, 312,985 are in group 1, 100,425 are in group 2, and the remaining 7377 were discarded for having their main cell on the edge of the map, where the area could not be computed (see the Supplementary Materials). To study discrepancies between these two groups, we perform a Kruskal-Wallis test (96), a nonparametric test (see fig. S5). Note that while using these summary statistics as additional features for our model might further improve its performance, we do not use them as they might not be transferable to behavioral datasets beyond location data.

Noise addition

Given a spatiotemporal point $(t, x) \in \mathcal{T} \times \mathbb{R}^2$ such that $x$ are GPS coordinates, Geo-indistinguishability (60) translates $(t, x)$ spatially to $(t, x + r(\epsilon))$ with a random noise vector of norm $r(\epsilon)$ drawn according to the density $D_\epsilon(r) = \epsilon^2 r e^{-\epsilon r} \mathbb{1}_{r > 0}$.

For CDRs, adding noise to the GPS coordinates of the antenna routing the call would unfairly dampen the effect of the added noise. The coverage of each antenna indeed forms a Euclidean Voronoi cell around its location, often effectively canceling out the added noise. Instead, for each data point $(t, x) \in \mathcal{T} \times \mathcal{X}$, with $\mathcal{X}$ the set of these cells, we uniformly sample a GPS point $(t, (\ell, L)) \in \mathcal{T} \times \mathbb{R}^2$ within the cell $x$. The noisy GPS point $(t, (\ell, L) + r(\epsilon))$ is then projected to $(t, x_\epsilon) \in \mathcal{T} \times \mathcal{X}$, with $x_\epsilon$ the cell containing the noisy GPS coordinates $(\ell, L) + r(\epsilon)$. The sets $\Theta_{data}$, $\Theta_1$, $\Theta_2$, $\Theta_3$, $\Theta_4$, $\Theta_A$, and $\Theta_B$ are thus perturbed into $\Theta_{data}^\epsilon$, $\Theta_1^\epsilon$, $\Theta_2^\epsilon$, $\Theta_3^\epsilon$, $\Theta_4^\epsilon$, $\Theta_A^\epsilon$, and $\Theta_B^\epsilon$. Noises are drawn independently for each data point before the model is trained, validated, tested, and calibrated.
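For illustration, the projection back to cells can be implemented with a nearest-neighbor query, since the Euclidean Voronoi cell containing a point is exactly the cell of the nearest antenna. The sketch below reuses the `planar_laplace_noise` sampler sketched earlier and, as a simplification, uses the antenna location itself instead of a uniform draw within the cell:

```python
import numpy as np
from scipy.spatial import cKDTree

def perturb_cdr_points(points, antenna_xy, eps):
    """Perturb (t, cell_id) data points: add planar Laplace noise to a
    GPS stand-in for each point, then snap back to the nearest antenna
    (equivalently, the Voronoi cell containing the noisy point)."""
    tree = cKDTree(antenna_xy)        # antenna_xy: (n_antennas, 2) array
    noisy = []
    for t, cell in points:
        gps = antenna_xy[cell] + planar_laplace_noise(eps, size=1)[0]
        _, new_cell = tree.query(gps)
        noisy.append((t, int(new_cell)))
    return noisy
```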

Acknowledgments

We thank all members of the Computational Privacy Group for discussions and suggestions. We especially thank B. Felbo and T. Lienart for early contributions to this project. We also thank L. Rocher, A. Oehmichen, and A. Farzanehfar for comments on earlier versions of the manuscript.

Funding: The authors acknowledge that they received no funding in support for this research.

Author contributions: A.J.T. designed and trained the model, designed and performed experiments, and wrote the article. Y.-A.d.M. designed the model and experiments, and wrote the article.

Competing interests: The authors declare that they have no competing interests.

Data and materials availability: The source code to reproduce the results of this article and for the model is available at https://doi.org/10.14469/hpc/10632. The Instacart Online Shopping Dataset 2017 used in this article is available at https://kaggle.com/c/instacart-market-basket-analysis/data. For contractual and privacy reasons, we cannot make the location dataset available. We, however, provide the complete outputs of our model and code to replicate Figs. 2 to 4.

Supplementary Materials

This PDF file includes:

Sections S1 to S4

Figs. S1 to S6

Tables S1 to S9

References


REFERENCES AND NOTES

  • 1.Strategy Analytics, “Global connected and IoT device forecast update” (Strategy Analytics, 2019).
  • 2.J. Valentino-Devries, N. Singer, M. H. Keller, A. Krolik, “Your apps know where you were last night, and they’re not keeping it secret,” New York Times, 10 December 2018; www.nytimes.com/interactive/2018/12/10/bsiness/location-data-privacy-apps.html [accessed 4 January 2021].
  • 3.Communication: A European strategy for data; https://ec.europa.eu/info/strategy/priorities-2019-2024/europe-fit-digital-age/european-data-strategy [accessed 11 December 2020].
  • 4.Burge P., Shawe-Taylor J., An unsupervised neural network approach to profiling the behavior of mobile phone users for use in fraud detection. J. Parallel Distrib. Comput. 61, 915–925 (2001). [Google Scholar]
  • 5.Björkegren D., Grissen D., Behavior revealed in mobile phone usage predicts credit repayment. World Bank Econ. Rev. 34, 618–634 (2020). [Google Scholar]
  • 6.Toole J. L., Lin Y.-R., Muehlegger E., Shoag D., González M. C., Lazer D., Tracking employment shocks using mobile phone data. J. R. Soc. Interface 12, 20150185 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Bengtsson L., Lu X., Thorson A., Garfield R., Schreeb J. V., Improved response to disasters and outbreaks by tracking population movements with mobile phone network data: A post-earthquake geospatial study in Haiti. PLOS Med. 8, e1001083 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.T. Breton, Commission recommendation (EU) 2020/518 of 8 April 2020 on a common union toolbox for the use of technology and data to combat and exit from the COVID-19 crisis, in particular concerning mobile applications and the use of anonymised mobility data (2020); https://eur-lex.europa.eu/eli/reco/2020/518/oj [accessed 23 February 2021].
  • 9.Grantz K. H., Meredith H. R., Cummings D. A. T., Metcalf C. J. E., Grenfell B. T., Giles J. R., Mehta S., Solomon S., Labrique A., Kishore N., Buckee C. O., Wesolowski A., The use of mobile phone data to inform analysis of COVID-19 pandemic epidemiology. Nat. Commun. 11, 4961 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Oliver N., Lepri B., Sterly H., Lambiotte R., Deletaille S., De Nadai M., Letouzé E., Salah A. A., Benjamins R., Cattuto C., Colizza V., de Cordes N., Fraiberger S. P., Koebe T., Lehmann S., Murillo J., Pentland A., Pham P. N., Pivetta F., Saramäki J., Scarpino S. V., Tizzoni M., Verhulst S., Vinck P., Mobile phone data for informing public health actions across the COVID-19 pandemic life cycle. Sci. Adv. 6, eabc0764 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Wesolowski A., Eagle N., Tatem A. J., Smith D. L., Noor A. M., Snow R. W., Buckee C. O., Quantifying the impact of human mobility on malaria. Science 338, 267–270 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Dong X., Morales A. J., Jahani E., Moro E., Lepri B., Bozkaya B., Sarraute C., Bar-Yam Y., Pentland A., Segregated interactions in urban and online space. EPJ Data Sci. 9, 20 (2020). [Google Scholar]
  • 13.Stone E. F., Stone D. L., Privacy in organizations: Theoretical issues, research findings, and protection mechanisms. Res. Pers. Hum. Resour. Manag. 8, 349–411 (1990). [Google Scholar]
  • 14.K. Granville, “Facebook and Cambridge Analytica: What you need to know as fallout widens,” New York Times, 19 March 2018; www.nytimes.com/2018/03/19/technology/facebook-cambridge-analytica-explained.html.
  • 15.Lyon D., Surveillance, Snowden, and big data: Capacities, consequences, critique. Big Data Soc. 1, 2053951714541861 (2014). [Google Scholar]
  • 16.Morning Consult, National tracking poll #210496 (2021); https://assets.morningconsult.com/wp-uploads/2021/04/26163900/210496_crosstabs_MC_TECH_RVs_v1_LM.pdf.
  • 17.Harris Interactive, Information rights strategic plan: Trust and confidence (2019); https://ico.org.uk/media/about-the-ico/documents/2615515/ico-trust-and-confidence-report-20190626.pdf.
  • 18.Recital 26: Not applicable to anonymous data (2018); https://gdpr.eu/recital-26-not-applicable-to-anonymous-data/ [accessed 6 December 2020].
  • 19.California State Legislature, California consumer privacy act of 2018 (2018); https://www.consumerprivacyact.com/section-1798-140-definitions/.
  • 20.Concerning the management, oversight, and use of data; https://app.leg.wa.gov/billsummary?BillNumber=5062&Year=2021&Initiative=false [accessed 24 February 2021].
  • 21.An act relative to data privacy; https://malegislature.gov/Bills/192/HD3847 [accessed 24 February 2021].
  • 22.Consumer data protection act; https://lis.virginia.gov/cgi-bin/legp604.exe?211+ful+HB2307H1 [accessed 24 February 2021].
  • 23.Dunn H. L., Record linkage. Am. J. Public Health 36, 1412–1416 (1946). [PubMed] [Google Scholar]
  • 24.L. Sweeney, Computational disclosure control for medical microdata: The Datafly system, in Record Linkage Techniques 1997: Proceedings of an International Workshop and Exposition (National Academy Press, 1997), pp. 442–453. [Google Scholar]
  • 25.B. Malin, L. Sweeney, Re-identification of DNA through an automated linkage process, in Proceedings of the AMIA Symposium (American Medical Informatics Association, 2001), p. 423. [PMC free article] [PubMed] [Google Scholar]
  • 26.Ohm P., Broken promises of privacy: Responding to the surprising failure of anonymization. UCLA Law Rev. 57, 1701 (2009). [Google Scholar]
  • 27.R. Buettner, S. Craig, “Decade in the red: Trump tax figures show over $1 billion in business losses,” New York Times, 8 May 2019, p. 7.
  • 28.Sweeney L., k-anonymity: A model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10, 557–570 (2002). [Google Scholar]
  • 29.Matthews G. J., Harel O., Data confidentiality: A review of methods for statistical disclosure limitation and methods for assessing privacy. Statist. Surv. 5, 1–29 (2011). [Google Scholar]
  • 30.C. Skinner, Statistical disclosure control for survey data, in Handbook of Statistics (Elsevier, 2009), vol. 29, pp. 381–396. [Google Scholar]
  • 31.D. Kifer, Attacks on privacy and Definetti’s theorem, in Proceedings of the 2009 ACM SIGMOD International Conference on Management of data (Association for Computing Machinery, 2009), pp. 127–138. [Google Scholar]
  • 32.R. Kumar, J. Novak, B. Pang, A. Tomkins, On anonymizing query logs via token-based hashing, in Proceedings of the 16th International Conference on World Wide Web (Association for Computing Machinery, 2007), pp. 629–638. [Google Scholar]
  • 33.A. Lavrenovs, K. Podins, Privacy violations in Riga open data public transport system, in 2016 IEEE 4th Workshop on Advances in Information, Electronic and Electrical Engineering (AIEEE) (IEEE, 2016), pp. 1–6. [Google Scholar]
  • 34.de Montjoye Y.-A., Radaelli L., Singh V. K., Pentland A. S., Unique in the shopping mall: On the reidentifiability of credit card metadata. Science 347, 536–539 (2015). [DOI] [PubMed] [Google Scholar]
  • 35.A. D. Luzio, A. Mei, J. Stefa, Consensus robustness and transaction de-anonymization in the ripple currency exchange system, in 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS) (IEEE, 2017), pp. 140–150. [Google Scholar]
  • 36.Pellungrini R., Pappalardo L., Pratesi F., Monreale A., A data mining approach to assess privacy risk in human mobility data. ACM Trans. Intell. Syst. Technol. 9, 1–27 (2017). [Google Scholar]
  • 37.de Montjoye Y.-A., Hidalgo C. A., Verleysen M., Blondel V. D., Unique in the crowd: The privacy bounds of human mobility. Sci. Rep. 3, 1376 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.V. Sekara, E. Mones, H. Jonsson, Temporal limits of privacy in human behavior. arXiv:1806.03615 [cs.CY] (10 June 2018).
  • 39.J. Su, A. Shukla, S. Goel, A. Narayanan, De-anonymizing web browsing data with social networks, in Proceedings of the 26th International Conference on World Wide Web (International World Wide Web Conferences Steering Committee, 2017), pp. 1261–1269. [Google Scholar]
  • 40.C. Deußer, S. Passmann, T. Strufe, Browsing unicity: On the limits of anonymizing web tracking data, in 2020 IEEE Symposium on Security and Privacy (SP) (IEEE, 2020), pp. 777–790. [Google Scholar]
  • 41.A. Narayanan, V. Shmatikov, Robust de-anonymization of large sparse datasets, in Proceedings of the 2008 IEEE Symposium on Security and Privacy (IEEE, 2008), pp. 111–125. [Google Scholar]
  • 42.S. Cleemput, M. A. Mustafa, E. Marin, B. Preneel, De-pseudonymization of smart metering data: Analysis and countermeasures, in 2018 Global Internet of Things Summit (GIoTS) (IEEE, 2018), pp. 1–6. [Google Scholar]
  • 43.M. Hay, G. Miklau, D. Jensen, P. Weis, S. Srivastava, Anonymizing social networks, in Computer Science Department Faculty Publication Series (2007), p. 180.
  • 44.A. Narayanan, V. Shmatikov, De-anonymizing social networks, in 2009 30th IEEE Symposium on Security and Privacy (IEEE, 2009), pp. 173–187. [Google Scholar]
  • 45.C. Riederer, Y. Kim, A. Chaintreau, N. Korula, S. Lattanzi, Linking users across domains with location data: Theory and validation, in Proceedings of the 25th International Conference on World Wide Web (International World Wide Web Conferences Steering Committee, 2016), pp. 707–719. [Google Scholar]
  • 46.Rocher L., Hendrickx J. M., de Montjoye Y.-A., Estimating the success of re-identifications in incomplete datasets using generative models. Nat. Commun. 10, 3069 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Farzanehfar A., Houssiau F., de Montjoye Y.-A., The risk of re-identification remains high even in country-scale location datasets. Patterns 2, 100204 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.A. Cavoukian, D. Castro, in Big Data and Innovation, Setting The Record Straight: De-Identification Does Work (Information and Privacy Commissioner, 2014). [Google Scholar]
  • 49.M. Elliot, E. Mackey, K. O’Hara, The Anonymisation Decision Making Framework: European Practitioners’ Guide (UK Anonymisation Network, 2020), pp. 78–79. [Google Scholar]
  • 50.C. Mitchell, J. Ordish, E. Johnson, T. Brigden, A. Hall, The GDPR and Genomic Data—The Impact of the GDPR and DPA 2018 on Genomic Healthcare and Research (PHG Foundation, 2020), pp. 51–53. [Google Scholar]
  • 51.The Instacart online grocery shopping dataset 2017; www.instacart.com/datasets/grocery-shopping-2017 [accessed 16 April 2019.
  • 52. J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Pires, Z. Guo, M. Azar, B. Piot, K. Kavukcuoglu, R. Munos, M. Valko, Bootstrap your own latent: A new approach to self-supervised learning, in Neural Information Processing Systems (Curran Associates Inc., 2020).
  • 53. T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A simple framework for contrastive learning of visual representations, in International Conference on Machine Learning (PMLR, 2020), pp. 1597–1607.
  • 54. F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: A unified embedding for face recognition and clustering, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2015), pp. 815–823.
  • 55. L. Alessandretti, P. Sapiezynski, V. Sekara, S. Lehmann, A. Baronchelli, Evidence for a conserved quantity in human mobility. Nat. Hum. Behav. 2, 485–491 (2018).
  • 56. M. C. Gonzalez, C. A. Hidalgo, A.-L. Barabasi, Understanding individual human mobility patterns. Nature 453, 779–782 (2008).
  • 57. C. Dwork, Differential privacy: A survey of results, in International Conference on Theory and Applications of Models of Computation (Springer, 2008), pp. 1–19.
  • 58. Location Guard; https://github.com/chatziko/location-guard/ [accessed 21 July 2020].
  • 59. Geoprivacy Plugin: A set of location privacy tools for geographic data; https://diuke.github.io/GeoPrivPlugin/ [accessed 28 March 2021].
  • 60. M. E. Andrés, N. E. Bordenabe, K. Chatzikokolakis, C. Palamidessi, Geo-indistinguishability: Differential privacy for location-based systems, in Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security (Association for Computing Machinery, 2013), pp. 901–914.
  • 61. K. Chatzikokolakis, C. Palamidessi, M. Stronati, A predictive differentially-private mechanism for mobility traces, in International Symposium on Privacy Enhancing Technologies Symposium (Springer, 2014), pp. 21–41.
  • 62. K. Chatzikokolakis, E. Elsalamouny, C. Palamidessi, Efficient utility improvement for location privacy. Proc. Priv. Enh. Technol. 2017, 308–328 (2017).
  • 63. S. Oya, C. Troncoso, F. Pérez-González, Is geo-indistinguishability what you are looking for? in Proceedings of the 2017 Workshop on Privacy in the Electronic Society (Association for Computing Machinery, 2017), pp. 137–140.
  • 64. M. Alvim, K. Chatzikokolakis, C. Palamidessi, A. Pazii, Metric-based local differential privacy for statistical applications, in 31st Computer Security Foundations Symposium (CSF 2018) (IEEE Computer Society, 2018), pp. 262–267.
  • 65. M. Cunha, R. Mendes, J. P. Vilela, Clustering Geo-indistinguishability for privacy of continuous location traces, in 2019 4th International Conference on Computing, Communications and Security (ICCCS) (IEEE, 2019), pp. 1–8.
  • 66. Article 29 Data Protection Working Party, Opinion 05/2014 on Anonymisation Techniques; https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf [accessed 28 March 2021].
  • 67. T. T. Tanimoto, Elementary Mathematical Theory of Classification and Prediction (International Business Machines Corporation, 1958).
  • 68. A. Cecaj, M. Mamei, F. Zambonelli, Re-identification and information fusion between anonymized CDR and social network data. J. Ambient Intell. Humaniz. Comput. 7, 83–96 (2016).
  • 69. L. Rossi, M. Musolesi, It’s the way you check-in: Identifying users in location-based social networks, in Proceedings of the Second ACM Conference on Online Social Networks (Association for Computing Machinery, 2014), pp. 215–226.
  • 70. C. Y. T. Ma, D. K. Y. Yau, N. K. Yip, N. S. V. Rao, Privacy vulnerability of published anonymous mobility traces, in Proceedings of the 16th Annual International Conference on Mobile Computing and Networking (IEEE, 2010), pp. 185–196.
  • 71. S. Gambs, M.-O. Killijian, M. Núñez del Prado Cortez, De-anonymization attack on geolocated data. J. Comput. Syst. Sci. 80, 1597–1614 (2014).
  • 72. Y. De Mulder, G. Danezis, L. Batina, B. Preneel, Identification via location-profiling in GSM networks, in Proceedings of the 7th ACM Workshop on Privacy in the Electronic Society (Association for Computing Machinery, 2008), pp. 23–32.
  • 73. F. M. Naini, J. Unnikrishnan, P. Thiran, M. Vetterli, Where you are is who you are: User identification by matching statistics. IEEE Trans. Inf. Forensics Secur. 11, 358–372 (2015).
  • 74. A. Bhattacharyya, On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 35, 99–109 (1943).
  • 75. D. L. Donoho, For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution. Commun. Pure Appl. Math. 59, 797–829 (2006).
  • 76. A. Singhal, Modern information retrieval: A brief overview. IEEE Data Eng. Bull. 24, 35–43 (2001).
  • 77. A. van den Oord, Y. Li, O. Vinyals, Representation learning with contrastive predictive coding. arXiv:1807.03748 [cs.LG] (10 July 2018).
  • 78. V. Mnih, N. Heess, A. Graves, K. Kavukcuoglu, Recurrent models of visual attention, in Proceedings of the 27th International Conference on Neural Information Processing Systems-Volume 2 (MIT Press, 2014), pp. 2204–2212.
  • 79. D. Bahdanau, K. H. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in 3rd International Conference on Learning Representations (ICLR, 2015).
  • 80. J. Chorowski, R. J. Weiss, S. Bengio, A. van den Oord, Unsupervised speech representation learning using WaveNet autoencoders. IEEE/ACM Trans. Audio, Speech, Language Process. 27, 2041–2053 (2019).
  • 81. M. Boorstein, M. Iati, A. Shin, “Top US Catholic Church official resigns after cell phone data used to track him on Grindr and to gay bars,” Washington Post, 21 July 2021; www.washingtonpost.com/religion/2021/07/20/bishop-misconduct-resign-burrill/ [accessed 28 July 2021].
  • 82. Data Protection Act 2018, section 171 on the re-identification of de-identified personal data; www.legislation.gov.uk/ukpga/2018/12/section/171 [accessed 28 July 2021].
  • 83. M. Gramaglia, M. Fiore, On the anonymizability of mobile traffic datasets. arXiv:1501.00100 [cs.CY] (31 December 2014).
  • 84. A. Oehmichen, S. Jain, A. Gadotti, Y.-A. de Montjoye, OPAL: High performance platform for large-scale privacy-preserving location data analytics, in 2019 IEEE International Conference on Big Data (Big Data) (IEEE, 2019), pp. 1332–1342.
  • 85. E. J. Williamson, A. J. Walker, K. Bhaskaran, S. Bacon, C. Bates, C. E. Morton, H. J. Curtis, A. Mehrkar, D. Evans, P. Inglesby, J. Cockburn, H. I. McDonald, B. MacKenna, L. Tomlinson, I. J. Douglas, C. T. Rentsch, R. Mathur, A. Y. S. Wong, R. Grieve, D. Harrison, H. Forbes, A. Schultze, R. Croker, J. Parry, F. Hester, S. Harper, R. Perera, S. J. W. Evans, L. Smeeth, B. Goldacre, Factors associated with COVID-19-related death using OpenSAFELY. Nature 584, 430–436 (2020).
  • 86. Z. Anesbury, M. Nenycz-Thiel, J. Dawes, R. Kennedy, How do shoppers behave online? An observational study of online grocery shopping. J. Consum. Behav. 15, 261–270 (2016).
  • 87. D. Wang, D. Pedreschi, C. Song, F. Giannotti, A.-L. Barabasi, Human mobility, social ties, and link prediction, in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Association for Computing Machinery, 2011), pp. 1100–1108.
  • 88. C. Song, Z. Qu, N. Blumm, A.-L. Barabási, Limits of predictability in human mobility. Science 327, 1018–1021 (2010).
  • 89. H. Robinson, F. Dall’Olmo Riley, R. Rettie, G. Rolls-Willson, The role of situational variables in online grocery shopping in the UK. Mark. Rev. 7, 89–106 (2007).
  • 90. C. Hand, F. Dall’Olmo Riley, P. Harris, J. Singh, R. Rettie, Online grocery shopping: The influence of situational factors. Eur. J. Mark. 43, 1205–1219 (2009).
  • 91. G. Biau, L. Devroye, Lectures on the Nearest Neighbor Method (Springer, 2015), vol. 246.
  • 92. K. Q. Weinberger, L. K. Saul, Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res. 10, 207–244 (2009).
  • 93. J. Goldberger, G. E. Hinton, S. Roweis, R. R. Salakhutdinov, Neighbourhood components analysis. Adv. Neural Inf. Process. Syst. 17, 513–520 (2004).
  • 94. D. R. Cox, V. Isham, Point Processes (CRC Press, 1980), vol. 12.
  • 95. D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in ICLR (Poster) (2015).
  • 96. P. E. McKight, J. Najab, Kruskal-Wallis test, in The Corsini Encyclopedia of Psychology (John Wiley & Sons, 2010), p. 1.
  • 97. L. van der Maaten, G. Hinton, Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
  • 98. A.-L. Barabasi, The origin of bursts and heavy tails in human dynamics. Nature 435, 207–211 (2005).
  • 99. H. W. Kuhn, The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 2, 83–97 (1955).
  • 100. T. Murakami, Expectation-maximization tensor factorization for practical location privacy attacks. Proc. Priv. Enh. Technol. 2017, 138–155 (2017).
  • 101. F. J. Massey Jr., The Kolmogorov-Smirnov test for goodness of fit. J. Am. Stat. Assoc. 46, 68–78 (1951).
  • 102. B. Braden, The surveyor’s area formula. Coll. Math. J. 17, 326–337 (1986).
  • 103. G. Voronoi, Nouvelles applications des paramètres continus à la théorie des formes quadratiques. Premier mémoire. Sur quelques propriétés des formes quadratiques positives parfaites. J. Reine Angew. Math. 1908, 97–102 (1908).

Associated Data


Supplementary Materials

Sections S1 to S4

Figs. S1 to S6

Tables S1 to S9

References

sciadv.abl6464_sm.pdf (1.4 MB, PDF)
