Patterns. 2021 Mar 12;2(3):100204. doi: 10.1016/j.patter.2021.100204

The risk of re-identification remains high even in country-scale location datasets

Ali Farzanehfar 1, Florimond Houssiau 1, Yves-Alexandre de Montjoye 1,2
PMCID: PMC7961185  PMID: 33748793

Summary

Although anonymous data are not considered personal data, recent research has shown how individuals can often be re-identified. Scholars have argued that previous findings apply only to small-scale datasets and that privacy is preserved in large-scale datasets. Using 3 months of location data, we (1) show the risk of re-identification to decrease slowly with dataset size, (2) approximate this decrease with a simple model taking into account three population-wide marginal distributions, and (3) prove that unicity is convex and obtain a linear lower bound. Our estimates show that 93% of people would be uniquely identified in a dataset of 60M people using four points of auxiliary information, with a lower bound at 22%. This lower bound increases to 87% when five points are available. Taken together, our results show how the privacy of individuals is very unlikely to be preserved even in country-scale location datasets.

Keywords: privacy, anonymisation, de-identification, unicity, re-identification, call detail records, data protection, human mobility, location data, applied mathematics

Highlights

  • Re-identification risk is statistically modeled and shown to decrease slowly with dataset size

  • With increasing dataset size, the unicity decrease is lower-bounded and convex

  • Previous estimates of unicity unrealistically underestimated the risks

  • Individuals are likely re-identifiable in country-size location data and other high-dimensional datasets

The bigger picture

Data about us are being collected in many different ways, when we use our bank cards, use our phones, browse the web, or even drive our cars. These datasets contain detailed information about our lives. For each person, a dataset might contain thousands to tens of thousands of records. Previous research has shown that knowing just a few points about a target can single out the vast majority of people in location datasets. However, some had argued the risk of re-identification becomes negligible if we look at large-scale datasets containing tens of millions of people.

Here, we empirically measure, mathematically model, and provide a lower bound on the relationship between the size of a dataset and the risk of re-identification. Our results all show that re-identification risk decreases very slowly with increasing dataset size. Contrary to previous claims, people are thus very likely to be re-identifiable even in country-scale datasets.


Researchers have claimed that individuals could not be re-identified in large-scale location datasets, making them safe. We here empirically measure and mathematically model the relationship between the size of a dataset and the risk of re-identification. Our results show that the risk decreases slowly with dataset size, making even large country-scale datasets very likely to be re-identifiable.

Introduction

Throughout our day, we interact with many digital services when using our phone, paying with our credit card, or using public transport with a smart card. This results in our location data being collected broadly, sometimes on the scale of countries. For instance, Vodafone UK collects location trajectories of 20M citizens1—a third of the population—while up to 5 million people use London's subway daily.2

Location data have been used extensively in research. In urban planning, mobility data can be used to monitor urban activity3 and help design better cities.4 In epidemiology, it has been used to monitor and mitigate the spread of infectious diseases such as Ebola and COVID-19.5, 6, 7, 8, 9, 10 In computational social science, it has allowed us to gain unprecedented insights into the spatial distribution of poverty,11 and even to study the impact of mass employment layoffs on society.12 Further, the use of location data has withstood scrutiny into potential biases in their collection mechanisms.13

Despite this, the large-scale collection and use of location data has raised serious privacy concerns. It consists of fine-grained records of where we are and how we move around, and was considered sensitive by 82% of Americans in a recent survey.14 Location data can furthermore be used to predict individuals' income,11,15 their home and work locations,16, 17, 18, 19, 20, 21 when they sleep and wake up,22, 23, 24, 25, 26 their gender and age,27 their personality,28 who their friends are,29,30 and where they tend to socialize.31

Unicity has been proposed as a measure for the risk of re-identification in anonymous datasets and was used to show how four points of auxiliary information (places and times where someone was) are enough to uniquely identify 95% of people in a large-scale location dataset.32 These four points of auxiliary information could be in the form of geo-tagged “tweets,” online check-ins, or information obtained by more traditional means, such as observing someone making a call. Unicity (ϵp) is defined as the fraction of trajectories that are unique based on knowledge of p randomly chosen points in a given trajectory. Unicity has since been used to quantify re-identification risk across a number of domains, including the mobility of vehicles,33 apps downloaded by smartphones over time,34,35 smart cards used in public transport,24 credit card transaction histories,36 and location data from mobile phones in a number of countries.32,37,38 A range of studies have furthermore exploited the unicity of datasets to re-identify people. Narayanan and Shmatikov famously showed that close to 90% of people could be re-identified in the Netflix dataset,39 while Riederer and colleagues used the unicity of traces to match the same individual across multiple datasets.40

Researchers and industry practitioners have, however, argued that these high unicity numbers are an artifact of the small size of the datasets considered, and are overestimating the risk of re-identification.41, 42, 43 For instance, Riederer et al.40 relied on a location dataset of 1.7k people, while other case studies report unicity on dataset sizes ranging from several thousands (respectively 12k and 55k)33,34 to over 1 million people (1.5M).32 Examining a published study,36 El Emam et al. estimated that the unicity of a dataset of 20M trajectories will be as low as 1% given four points of auxiliary information, the conclusion being that privacy was preserved in such large datasets.42

We here (1) study 3 months of location data and show empirically that unicity decreases slowly with the size of the dataset, (2) approximate this decrease with a simple statistical model taking into account three population-wide marginal distributions along with the underlying geography, and (3) prove that the decrease in unicity is a convex function of the dataset size and obtain a linear lower bound on unicity. We finally perform a sensitivity analysis suggesting that the decrease in unicity is agnostic to broad perturbations in the input distributions. These results disprove previous claims, instead showing that unicity is likely to remain high even in country-scale datasets.

Results

Our experiments are performed on a dataset of call detail records containing the location of 1M individuals over 3 months. Each record contains a unique user ID, an hourly time stamp, and an antenna ID, which relates to a location (see Supplemental information for more details). We formally model this dataset as a sequence, $D = (D_1, \ldots, D_N)$, populated with user time/location traces of the type $D_i = (X_i, C_i)$. $X_i$ and $C_i$ are lists of positions (antennas) and times (hours) representing the spatial and temporal components of a user's location trace.

Using this dataset, we empirically study the decrease in unicity with the dataset size by randomly sampling individuals from our original dataset and measuring the unicity of the sample as we increase its size (see Experimental procedures for details). We use the formal definition of unicity and the estimation algorithm S2 from de Montjoye et al.36 In line with previous work, we use the subscript p in ϵp(N) to indicate the number of points of auxiliary information used in the computation of unicity.
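The definition of unicity can be illustrated with a brute-force sketch. This is not the estimation algorithm referenced above; it is a minimal illustration that assumes each trace is stored as a set of (location, hour) pairs and that traces with fewer than p records are skipped. The function names `estimate_unicity` and `unicity_curve`, and the toy traces, are hypothetical.

```python
import random


def estimate_unicity(traces, p, rng, n_targets=None):
    """Estimate unicity as the share of targets singled out by p of their own points.

    traces: list of sets of (location, hour) records, one set per user.
    Traces with fewer than p records are skipped (an assumption of this sketch).
    """
    targets = [i for i, trace in enumerate(traces) if len(trace) >= p]
    if n_targets is not None and n_targets < len(targets):
        targets = rng.sample(targets, n_targets)
    unique = 0
    for i in targets:
        # p randomly chosen points of auxiliary information about the target
        known = rng.sample(sorted(traces[i]), p)
        # count every trace in the dataset that is compatible with those points
        matches = sum(1 for trace in traces if all(point in trace for point in known))
        if matches == 1:  # only the target's own trace matches
            unique += 1
    return unique / len(targets) if targets else 0.0


def unicity_curve(traces, p, sizes, seed=0):
    """Measure unicity on random sub-samples of increasing size."""
    rng = random.Random(seed)
    curve = {}
    for n in sizes:
        sample = rng.sample(traces, n)
        curve[n] = estimate_unicity(sample, p, rng)
    return curve


# Hypothetical toy data: three users, three records each.
toy_traces = [
    {("a", 8), ("b", 9), ("c", 18)},
    {("a", 8), ("b", 9), ("d", 18)},
    {("e", 7), ("f", 12), ("g", 20)},
]
print(estimate_unicity(toy_traces, 3, random.Random(0)))
```

The scan over all traces makes this quadratic in the number of users, so it is only practical for small samples; a real experiment at the scale described here would need an indexed lookup.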

Figure 1A shows that unicity empirically decreases slowly with the size of the dataset. With three points of auxiliary information, unicity (solid orange line) goes down from ϵ3(100K)=0.98 in a dataset of 100,000 people to ϵ3(1M)=0.93 in a dataset of a million people. With two points (solid blue line) this decreases slightly faster, reaching ϵ2(1M)=0.69, while unicity with four points or more (solid red and brown lines) decreases very slowly with ϵ4(1M)=0.98. These results show that, while the size of the dataset has an impact on unicity, the decrease in unicity is slow.

Figure 1.

The relationship between unicity and dataset size

(A) Empirical (solid lines) and estimated (dashed lines) unicity decreases slowly with the size of the dataset. Inset: close up of the region $\epsilon \geq 0.7$.

(B) The estimated unicity remains high even in large datasets. This is confirmed by the lower bound results (dotted lines). Taken together, these results strongly suggest that unicity remains high even in country-scale datasets.

To further study how unicity decreases with dataset size and whether it decreases sufficiently in population-scale datasets, we propose a simple statistical model taking into account three population-wide marginal distributions—circadian (PC), frequency (PF), and activity (PA)—along with the network of mobile phone antennas in a country. Using solely these quantities, the model is able to replicate the observed decrease in unicity with dataset size.

Figure 2 displays the information extracted from the dataset: three distributions and the antenna network. $P_C$ characterizes the circadian cycle, the overall likelihood of a record to occur at a given time in a week. The existence of circadian cycles is well documented in the computational social science literature,22,23,25,26 and we use their empirical form in the model. The frequency distribution, $P_F$, is the relative overall likelihood of a location to be visited. This distribution too has been studied before and has been widely shown to be well approximated by a power-law distribution,44, 45, 46, 47, 48 as is also the case here (Figure 2B, $R^2 = 0.99$). The activity distribution, $P_A$, captures the number of records $\ell(i) = |D_i|$ that appear in each user trace. We approximate it here with a β distribution ($\alpha = 1.72$, $\beta = 14.7$, $R^2 = 0.98$). Finally, $S_i$ is the set of locations visited by person $i$. It is a sub-graph sampled from the Delaunay tessellation of the antenna coordinates ($L$) in the dataset (see Supplemental information for the detailed algorithm).

Figure 2.

Inputs to the unicity model

(A) The circadian distribution, PC.

(B) The frequency distribution, $P_F$, along with a power law fit (solid line, $R^2 = 0.99$). The inset displays the cumulative distribution with 85% of activity captured by the top 10 locations.

(C) The activity distribution, $P_A$, indicating the distribution of the number of records per trajectory along with a β distribution fit (solid line, $R^2 = 0.98$).

(D) Illustration of the sub-graph sampling method used to generate an antenna set $S_i$ where $S_i^{(k)} \in S_i$. The underlying antenna network is represented by dotted lines. The filled nodes (circles) correspond to locations already selected, while the hollow nodes are potential locations that could be selected next ($S_i^{(k+1)}$ candidates) (see Supplemental information for detailed algorithm). Remaining locations are represented by filled diamonds.

In short, for each user, our model samples a list of 10 connected antennas $(S_1, \ldots, S_{10})$ on the network and an activity (number of records in the user's trace), $A \sim P_A$. Each record's timestamp $C$ and position $X$ are then sampled according to the circadian distribution, $C \sim P_C$, and the frequency distribution, $X = S_K$ with $K \sim P_F$. This model is formally defined in the Experimental procedures.

Figure 1A shows that our simple statistical model closely follows the empirical measure of unicity from 1 to 1M people (dashed and solid lines). Using the model, we then study how unicity is likely to evolve as the size of the dataset increases to 20M people (Figure 1B). For $N = 20\text{M}$, our model estimates unicity with three points to be close to $\hat{\epsilon}_3(20\text{M}) = 0.93$, while knowing one more point would increase this to the region of $\hat{\epsilon}_4(20\text{M}) = 0.99$. This is a stark difference with the linear extrapolation made by El Emam,42 who reports a unicity of 0.01 with four points (we replicate El Emam's method in the discussion and display our results for up to 60M people in the Supplemental information).

The model provides good evidence that unicity is likely to remain high even in datasets as large as 20M people. For further evidence, we prove that the decrease in unicity with increasing dataset size follows a convex form, and use this result to provide a lower bound on unicity in large datasets. We show in the Supplemental information that the unicity of a dataset of size N can be expressed as a sum of convex functions of N, and is thus convex.

This builds on two assumptions: (1) there exists an underlying trajectory distribution $T_X$ from which all trajectories $D_i \in D$ are sampled and (2) all trajectories are independent of one another, $D_i \perp D_j$. The first assumption states that an underlying distribution for trajectories exists. Such a distribution would also capture correlations between individuals on a large scale (e.g., commuting patterns, cities, weekends). The second assumption presumes that the correlation between specific individuals is negligible when estimating unicity of large datasets.

A direct consequence of unicity being a strictly decreasing convex function is that it will be lower bounded by its linear tangent (treating unicity as a function of a real-valued N):

$$\epsilon(D(N)) \geq \epsilon(D(N')) + (N - N') \left.\frac{d\epsilon}{dN}\right|_{N = N'}.$$ (Equation 1)

Re-arranged and expressed for discrete values, this gives a lower bound for unicity:

$$\epsilon(D(N)) \geq \epsilon(D(N')) - (N - N') \left( \epsilon(D(N' - 1)) - \epsilon(D(N')) \right).$$ (Equation 2)

Using the tangent to the empirical unicity curves estimated by discrete difference over the range of $N \in [0.9\text{M}, 1\text{M}]$, we obtain a lower bound of 0.73 for $\epsilon_4(20\text{M})$ and 0.9 for $\epsilon_5(20\text{M})$ (Figure 1B, dotted lines).
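The extrapolation step can be sketched in a few lines. The sketch below assumes the slope is estimated as a finite difference between two measured points and extends that line to a larger dataset size; for a decreasing convex curve this finite-difference slope is no larger than the true tangent slope at the right-hand point, so the line stays below the curve. The function name `linear_lower_bound` is hypothetical, and the input values are placeholders, not measurements from this study.

```python
def linear_lower_bound(n_lo, eps_lo, n_hi, eps_hi, n_target):
    """Extend the line through (n_lo, eps_lo) and (n_hi, eps_hi) out to n_target.

    For a decreasing convex unicity curve, the result is a lower bound on
    unicity at n_target, provided n_lo < n_hi <= n_target.
    """
    if not n_lo < n_hi <= n_target:
        raise ValueError("expected n_lo < n_hi <= n_target")
    # finite-difference estimate of d(epsilon)/dN near n_hi
    slope = (eps_hi - eps_lo) / (n_hi - n_lo)
    # unicity is a fraction, so a negative bound carries no information
    return max(0.0, eps_hi + (n_target - n_hi) * slope)


# Placeholder inputs (not values reported in the paper).
bound = linear_lower_bound(900_000, 0.981, 1_000_000, 0.980, 20_000_000)
print(round(bound, 3))
```

Because the bound is linear in the target size, it eventually reaches zero; it is informative only when the measured slope is shallow relative to the extrapolation distance.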

Our results show that unicity decreases slowly with the size of the dataset and that it, very likely, remains high even in population-scale datasets. This refutes previous claims that privacy is preserved in population-scale datasets, instead showing the risk of re-identification to be high. Modern location datasets have a great potential to improve our society, for example, by training AI algorithms, but robust privacy engineering solutions are needed to use them safely.

Discussion

Taken together, these results show that the scale of a dataset does not prevent re-identification. Human mobility, much like a physical fingerprint, is highly unique and can be used to find a person across mobility datasets.

Legally, the European Union (EU) General Data Protection Regulation sets a high threshold for what constitutes anonymous data, namely that the individual should not be identifiable taking into account both the “available technology at the time of the processing” and future “technological developments” (Recital 26). The Article 29 Working Party, the predecessor to the European Data Protection Board, in its guidance sets out three criteria to assess whether a dataset is anonymous: singling out, linkability, and inference,49 with the first two being directly applicable here. As an example, the Centre for Humanitarian Data of the United Nations (UN OCHA) adopted 5% as a threshold for what constitutes an acceptable re-identification risk.50 Even our lower bound of 22% far exceeds this liberal threshold.

Finally, here we study the unicity of location datasets with a spatial resolution of 1 km² and a temporal resolution of an hour. Fine-grained GPS data are likely to lead to even higher values of unicity, and previous research has shown that, in general, de-identification methods do not meaningfully reduce the risk of re-identification. For instance, research32,34 has shown that reducing the spatial and temporal resolution of the data further only slowly decreases the risk, while another study51 concluded that location data “show poor anonymizability [as measured by k-anonymity], i.e., require important spatial and temporal generalization in order to slightly improve user privacy.”

Ensuring that these data can be accessed and used broadly is of paramount importance, but this should not come at the expense of people's privacy. A range of privacy engineering techniques allowing data to be used while giving individuals strong privacy guarantees have been developed and are starting to be used.52, 53, 54 As standards for anonymization are being redefined, in the EU and around the world, it is essential for them to emphasize the strong limits of de-identification, possibly banning the uncontrolled release of individual-level de-identified data, and to give guidance on the use of modern privacy-engineering solutions.

In the next three sections we discuss the underlying assumptions of the unicity model and some considerations regarding the sensitivity of our results and, finally, include a discussion on previous estimates of unicity.

Assumptions underpinning the simple unicity model

We here evaluate the four assumptions underpinning the simple unicity model we present.

First, the model treats each of the four inputs in Figure 2 as independent of one another. Considering them, or some of them, jointly might further improve the model. This would, however, also increase its complexity and, therefore, its sensitivity to small changes in the data. Although further exploration would be interesting, we consider that the simple model approximates the decrease in unicity with increasing dataset size well enough to support our conclusion that unicity is unlikely to be low even in population-scale datasets.

Second, our model uses input distributions extracted from a dataset of 1M people to study the unicity of datasets with up to 60M people (see Supplemental information). This assumes that these distributions estimated from a smaller sample are representative of the larger sample (i.e., the estimation of the distributions has converged). We show that this is a reasonable assumption by instantiating our model M with distributions extracted from samples of sizes significantly smaller than 1M, and showing that the unicity results remain largely unchanged (Figure S5 in Supplemental information). We also perform a sensitivity analysis to evaluate the impact of broad variations in these input distributions on our results (see next subsection).

Third, the model assumes each trajectory to contain at most 10 unique locations. This allows for the mean frequency distribution (PF) to be used in the modeling process (Figure 2B). As seen in the inset of Figure 2B, more than 85% of the activity in the average trajectory is captured by the top 10 locations visited. Furthermore, we find that PF changes only slightly when the number of unique locations is altered, and that our conclusions are not influenced by this choice.

Finally, our model assumes that the set of locations appearing in each trajectory can be described by a connected planar sub-graph of the underlying antenna network. We believe this to be a reasonable assumption, as previous work suggests that sub-graphs spanned by each trajectory in human mobility are highly localized, with the distribution P(rg) of the radius of gyration—a metric for how far people tend to travel on average—following a power law with increasing radius.55

Sensitivity analysis

Our simple statistical model for unicity takes as input three distributions. However, these distributions may vary depending on specifics of the dataset, such as the country where it was collected or the sources of location information. Here we perform a sensitivity analysis to ensure the robustness of our model to even broad changes to the distributions.

We first perturb the PA and PF distributions (Figure 3) around their empirical forms using a scaled earth mover’s distance as the guiding metric (see Supplemental information for details). The PC distribution, on the other hand, has been shown to be very stable across datasets22, 23, 24, 25, 26 and we thus keep it constant throughout our analysis.

Figure 3.

Range of distributions studied for the sensitivity analysis

The ranges of perturbed activity $\bar{P}_A$ (A) and frequency $\bar{P}_F$ (B) distributions are displayed (dotted lines) along with their empirical forms (solid lines).

These distributions are combined to produce 63 different instantiations of the unicity model (Figure S2). Table 1 summarizes the unicity values for models using the broad range of distributions in Figure 3, at a dataset size of 20M trajectories (see Supplemental information for 60M results). Note that the lowest unicity values across all instantiations of the model are still high, with Min(ϵ4(20M))=43.1% and would still be considered as putting people's privacy at risk.

Table 1.

Summary of unicity results at N=20M as per the sensitivity analysis

Statistic            ϵ2      ϵ3      ϵ4      ϵ5
Mean                 0.307   0.735   0.876   0.935
Standard deviation   0.175   0.216   0.159   0.113
Minimum              0.071   0.260   0.431   0.544
Maximum              0.704   0.997   1       1

Further, we study how certain aspects of human mobility contribute to unicity. Starting from empirical user location traces $D_i = (X_i, C_i)$, first, we find that removing the association between times ($C_i$) and locations ($X_i$), by shuffling the vectors and recombining them, only slightly affects unicity values (Figure S4A). Specifically, consider a dataset $D'$ composed of trajectories $D'_i = (X'_i, C'_i)$ such that:

$$X'_i = \sigma_i(X_i),$$
$$C'_i = \pi_i(C_i),$$

where σi and πi refer to random permutations of the spatial and temporal components of Di. This only marginally affects unicity, showing that unicity does not depend on the specific places being visited at specific times, as long as those times and places appear in the trace with their respective frequencies independently.
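The shuffling step can be sketched as follows, assuming each trace is held as two parallel lists (locations and hours). The function name `decouple_trace` and the example values are hypothetical; the two permutations are drawn independently, which breaks the pairing between a place and the time it was visited while keeping how often each place and each hour appears.

```python
import random


def decouple_trace(locations, hours, rng):
    """Return independently permuted copies of a trace's locations and hours."""
    shuffled_locations = rng.sample(locations, len(locations))  # plays the role of sigma_i
    shuffled_hours = rng.sample(hours, len(hours))              # plays the role of pi_i
    return shuffled_locations, shuffled_hours


# Hypothetical trace: four records with parallel location and hour lists.
locations = ["home", "home", "work", "gym"]
hours = [8, 9, 14, 19]
new_locations, new_hours = decouple_trace(locations, hours, random.Random(1))
print(list(zip(new_locations, new_hours)))
```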

Second, we replace the set of locations in each trajectory with uniformly picked locations. Instead of using the sub-graph sampling method displayed in Figure 2D, we populate each Si with antennas picked from the entire set of locations L uniformly at random. We find that this leads to unicity being overestimated (Figure S4D).

Third, replacing PC or PF with uniform distributions (Figures S4B and S4C), or modeling unicity with a simple combinatorial model (Figure S3), also causes unicity to be overestimated. These results demonstrate the importance of all three distributions and the underlying geography to correctly capture the unicity of mobility datasets.

This analysis, combined with the relative simplicity and generality of the unicity model, strongly suggests that our results would generalize to any location dataset. Likewise, the strong underlying combinatorial effect that underpins unicity combined with previous research34, 35, 36 suggests that unicity will similarly decrease slowly in other types of high-dimensional data.

El Emam's method

El Emam42 proposed a method (hereafter the EE method) to estimate the uniqueness of a population-size ($N$) dataset given the unicity $\epsilon(m)$ of a smaller sample dataset of size $m$. Using this method, he estimates that the uniqueness for a population of size $N = 22 \times 10^6$ is about 1%, given a uniqueness of 90% in a sample of size $m$ of the same dataset. This estimate forms the basis for his claim that uniqueness is low in large-scale datasets.

We here show that the EE method (1) is unrealistic and (2) provably gives the lowest possible estimate for the risk in the larger dataset, and (3) that, on our dataset, the real empirical unicity is significantly higher than the estimate given by the EE method.

First, the method is unrealistic, as it effectively generates a dataset D of size N where a fraction α of records are unique, while all the other records are identical to exactly one and only one other record. The parameter α is selected such that the expected estimated uniqueness on a sample of size m, which we denote by νD(m), is equal to the empirical unicity. This assumes that users in the real mobility dataset are either unique or exact duplicates of another user.

Second, we prove in the Supplemental information that the risk estimated by the EE method will be lower than or equal to the risk of any other dataset of size N, as this estimate is an affine function of m. In other words, this method will always return the absolute lowest possible estimate of the risk.

Third, we apply the EE method to our dataset and show that its estimate of the risk is significantly lower than the real empirical value, leading to the risk of re-identification being strongly underestimated. For a dataset of 200,000 people, we empirically observe an ϵ2(200K)=0.86. Using this number, El Emam's method would estimate the risk of a larger 1M person dataset to be ϵ2(1M)=0.3, while the correct empirical value is 0.7.
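The arithmetic behind this comparison can be reproduced with a short sketch. It is derived from the description of the EE construction given above rather than taken from El Emam's paper, and the function name `ee_unique_fraction` is hypothetical. In a dataset of N records where a fraction α is unique and every other record has exactly one twin, a record drawn into a uniform sample of size m is unique in that sample if it is unique in the full dataset or if its twin was left out, which happens with probability (N − m)/(N − 1). Setting the expected sample uniqueness equal to the measured value and solving for α gives the estimate for the full dataset.

```python
def ee_unique_fraction(sample_unicity, m, n):
    """Solve alpha + (1 - alpha) * r = sample_unicity for alpha.

    r = (n - m) / (n - 1) is the chance that the twin of a paired record
    is absent from a uniform sample of m out of n records.
    """
    if not 1 < m <= n:
        raise ValueError("expected 1 < m <= n")
    r = (n - m) / (n - 1)
    return (sample_unicity - r) / (1 - r)


# Figures quoted in the text: unicity 0.86 with two points at 200,000 people,
# extrapolated to a dataset of 1M people.
alpha = ee_unique_fraction(0.86, 200_000, 1_000_000)
print(round(alpha, 2))
```

Under these assumptions the result is close to 0.30, in line with the estimate quoted above. The expression is affine in the sample unicity and can turn negative when the sample unicity falls below r, which is one way to see how restrictive the pairs-only construction is.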

Taken together, our results cast serious doubt on the validity of the EE method to carry out risk assessments.

Experimental procedures

Resource availability

Lead contact

Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Yves-Alexandre de Montjoye (demontjoye@imperial.ac.uk).

Materials availability

There are no physical materials associated with this study.

Data and code availability

Due to reasons of confidentiality and user privacy, we cannot share the raw data. However, we can make available all the input distributions and raw empirical results upon request for purposes of reproducibility.

The code used for all experiments is available at: github.com/computationalprivacy/scaling-unicity.

The unicity model in detail

We propose a simple statistical model M taking into account three population-wide distributions: activity (PA), circadian (PC), and frequency (PF). This model samples location traces for each user independent of other users to estimate unicity of a dataset of size N. These location traces are then grouped together to compute unicity.

Formally, the model M can be written as:

$$M(P_A, P_C, P_F, L, N) = D = (D_1, \ldots, D_N).$$ (Equation 3)

Each $D_i \in D$ is a location trace for a unique user, represented as a list of $L_i$ records $(X_i^{(j)}, C_i^{(j)})_{j=1}^{L_i}$. The length $L_i$ of trace $D_i$ is sampled from the empirical activity distribution $P_A$:

$$P[L_i = \ell] = P_A(\ell).$$ (Equation 4)

The timestamps of each record in a trace, $(C_i^{(j)})_{j=1}^{L_i}$, are sampled independently from the empirical circadian distribution $P_C$:

$$P[C_i^{(j)} = c] = P_C(c) \quad \forall j \in \{1, \ldots, L_i\}.$$ (Equation 5)

For the spatial component, for each user, a connected sub-graph $S_i$ of size 10 is first sampled from the Delaunay tessellation of the antenna coordinates $L$. This sub-graph is then randomly ordered as a list, which we denote by $S_i = (S_i^{(k)})_{k=1}^{10}$ with a slight abuse of notation. Finally, the locations of the records $X_i^{(j)} \in X_i$ are sampled independently from $S_i$ according to the empirical frequency distribution $P_F$:

$$P[X_i^{(j)} = S_i^{(k)}] = P_F(k) \quad \forall j \in \{1, \ldots, L_i\}.$$ (Equation 6)

Note that when the size of the dataset N sampled by our model M increases, this corresponds to sampling more individuals from the same underlying geography. This is what we mean throughout this work when we increase the size of the dataset, e.g., in unicity curves (Figure 1): we consider the dataset to be a growing sample from the same underlying population.
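The sampling procedure in Equations 3 to 6 can be sketched with the standard library alone. This is an illustration under stated assumptions, not the released code: a 4-neighbour grid stands in for the Delaunay tessellation of antennas, the sub-graph is grown by repeatedly adding a random neighbour of the nodes already selected (the detailed algorithm is in the Supplemental information and may differ), and the three input distributions are placeholder values. All function and variable names are hypothetical.

```python
import random


def grid_adjacency(width, height):
    """Toy antenna network: a 4-neighbour grid standing in for a Delaunay tessellation."""
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {
        (x, y): [(x + dx, y + dy) for dx, dy in steps
                 if 0 <= x + dx < width and 0 <= y + dy < height]
        for x in range(width) for y in range(height)
    }


def sample_connected_subgraph(adjacency, size, rng):
    """Grow a connected node set by repeatedly adding a random neighbour of it."""
    selected = [rng.choice(sorted(adjacency))]
    while len(selected) < size:
        frontier = sorted({n for s in selected for n in adjacency[s]} - set(selected))
        if not frontier:
            raise ValueError("network component is smaller than the requested size")
        selected.append(rng.choice(frontier))
    return selected


def sample_trace(adjacency, p_activity, p_circadian, p_frequency, rng):
    """Sample one synthetic trace as a list of (antenna, hour_of_week) records."""
    lengths, weights = zip(*p_activity.items())
    n_records = rng.choices(lengths, weights=weights)[0]              # L_i ~ P_A
    antennas = sample_connected_subgraph(adjacency, len(p_frequency), rng)
    rng.shuffle(antennas)                                              # random rank order of S_i
    hours = rng.choices(range(len(p_circadian)), weights=p_circadian, k=n_records)  # C ~ P_C
    places = rng.choices(antennas, weights=p_frequency, k=n_records)   # X = S_K, K ~ P_F
    return list(zip(places, hours))


def sample_dataset(n_users, adjacency, p_activity, p_circadian, p_frequency, seed=0):
    """Sample n_users traces independently of one another."""
    rng = random.Random(seed)
    return [sample_trace(adjacency, p_activity, p_circadian, p_frequency, rng)
            for _ in range(n_users)]


# Placeholder inputs, not the empirical distributions from the study.
toy_network = grid_adjacency(8, 8)
toy_activity = {5: 0.5, 20: 0.3, 60: 0.2}          # trace length -> probability
toy_circadian = [1.0] * 168                        # one weight per hour of the week
toy_frequency = [1 / k for k in range(1, 11)]      # weight of the k-th ranked location
toy_dataset = sample_dataset(50, toy_network, toy_activity, toy_circadian, toy_frequency)
print(len(toy_dataset), len(toy_dataset[0]))
```

Growing the dataset then amounts to calling `sample_dataset` with a larger `n_users` on the same network, which mirrors the reading of dataset size given above: more individuals drawn from the same underlying geography.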

Acknowledgments

The authors would like to thank Shubham Jain for their comments on the codebase, and Ana-Maria Cretu, Andrea Gadotti, Shubham Jain, Thibaut Lienart, Axel Oehmichen, Luc Rocher, and Arnaud Tournier for their invaluable comments on the manuscript. We acknowledge support from the Agence Française de Développement as part of its financial assistance to the OPAL project.

Author contributions

A.F. designed and performed the experiments, built the models, helped with the mathematical results, and drafted the manuscript. F.H. derived the mathematical results, advised on model construction, and revised the manuscript. Y-A.d.M. designed the experiments and revised the manuscript.

Declaration of interests

The authors declare no competing financial interests.

Published: February 12, 2021

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.patter.2021.100204.

Supplemental information

Document S1. Supplemental experimental procedures, Figures S1–S5, and Tables S1–S3
mmc1.pdf (582.2KB, pdf)
Document S2. Article plus Supplemental information
mmc2.pdf (1.4MB, pdf)

References

  • 1. Vodafone. Vodafone UK’s company history and achievements. 2018. https://www.vodafone.co.uk/about-us/company-history/
  • 2. Lomas N. TechCrunch; 2017. How “anonymous” wifi data can still be a privacy risk. http://tcrn.ch/2ywXGdy
  • 3. Deville P., Linard C., Martin S., Gilbert M., Stevens F.R., Gaughan A.E., Blondel V.D., Tatem A.J. Dynamic population mapping using mobile phone data. Proc. Natl. Acad. Sci. U S A. 2014;111:15888–15893. doi: 10.1073/pnas.1408439111.
  • 4. Ratti C., Frenchman D., Pulselli R.M., Williams S. Mobile landscapes: using location data from cell phones for urban analysis. Environ. Plann. B Plann. Des. 2006;33:727–748.
  • 5. Wesolowski A., Eagle N., Tatem A.J., Smith D.L., Noor A.M., Snow R.W., Buckee C.O. Quantifying the impact of human mobility on malaria. Science. 2012;338:267–270. doi: 10.1126/science.1223467.
  • 6. Gomes M.F., Pastore Y., Piontti A., Rossi L., Chao D., Longini I., Halloran M.E., Vespignani A. Assessing the International spreading risk associated with the 2014 west African Ebola outbreak. PLoS Currents. 2014;6 doi: 10.1371/currents.outbreaks.cd818f63d40e24aef769dda7df9e0da5. ecurrents.outbreaks.cd818f63d40e24aef769dda7df9e0da5.
  • 7. Mari L., Bertuzzo E., Righetto L., Casagrandi R., Gatto M., Rodriguez-Iturbe I., Rinaldo A. Modelling cholera epidemics: the role of waterways, human mobility and sanitation. J. R. Soc. Interface. 2012;9:376–388. doi: 10.1098/rsif.2011.0304.
  • 8. Bajardi P., Poletto C., Ramasco J.J., Tizzoni M., Colizza V., Vespignani A. Human mobility networks, travel restrictions, and the global spread of 2009 H1N1 pandemic. PLoS One. 2011;6:e16591. doi: 10.1371/journal.pone.0016591.
  • 9. Merler S., Ajelli M. The role of population heterogeneity and human mobility in the spread of pandemic influenza. Proc. Biol. Sci. 2009;277:557–565. doi: 10.1098/rspb.2009.1605.
  • 10. Aktay A., Bavadekar S., Cossoul G., Davis J., Desfontaines D., Fabrikant A., Gabrilovich E., Gadepalli K., Gipson B., Guevara M. Google COVID-19 community mobility reports: anonymization process description (version 1.0) arXiv. 2020 preprint arXiv:2004.04145.
  • 11. Steele J.E., Sundsøy P.R., Pezzulo C., Alegana V.A., Bird T.J., Blumenstock J., Bjelland J., Engø-Monsen K., de Montjoye Y.A., Iqbal A.M. Mapping poverty using mobile phone and satellite data. J. R. Soc. Interface. 2017;14:20160690. doi: 10.1098/rsif.2016.0690.
  • 12. Toole J.L., Lin Y.-R., Muehlegger E., Shoag D., González M.C., Lazer D. Tracking employment shocks using mobile phone data. J. R. Soc. Interface. 2015;12:20150185. doi: 10.1098/rsif.2015.0185.
  • 13. Wesolowski A., Eagle N., Noor A.M., Snow R.W., Buckee C.O. The impact of biases in mobile phone ownership on estimates of human mobility. J. R. Soc. Interface. 2013;10:20120986. doi: 10.1098/rsif.2012.0986.
  • 14. Madden M., Rainie L., Zickuhr K., Duggan M., Smith A. Vol. 12. Pew Research Center; 2014. (Public Perceptions of Privacy and Security in the Post-Snowden Era).
  • 15. Blumenstock J., Cadamuro G., On R. Predicting poverty and wealth from mobile phone metadata. Science. 2015;350:1073–1076. doi: 10.1126/science.aac4420.
  • 16. Li G., Yu L., Ng W.S., Wu W., and Goh S.T. Predicting Home and Work Locations Using Public Transport Smart Card Data by Spectral Analysis. In 2015 IEEE 18th International Conference on Intelligent Transportation Systems, pages 2788–2793, Gran Canaria, Spain, September 2015. IEEE.
  • 17. Ashbrook D. and Starner T. Learning significant locations and predicting user movement with GPS. In Proceedings. Sixth International Symposium on Wearable Computers, pages 101–108, Seattle, WA, USA, 2002. IEEE.
  • 18. Isaacman S., Becker R., Cáceres R., Kobourov S., Martonosi M., Rowland J., Varshavsky A. Identifying important places in people’s lives from cellular network data. In: Lyons K., Hightower J., Huang E.M., editors. Pervasive Computing, Volume 6696 of Lecture Notes in Computer Science. Springer Berlin Heidelberg; 2011. pp. 133–151.
  • 19. Mahmud J., Nichols J., Drews C. Home location identification of Twitter users. ACM Trans. Intell. Syst. Technol. 2014;5:47.
  • 20. Li R., Wang S., Deng H., Wang R., and Chen-Chuan Chang K. Towards Social User Profiling: Unified and Discriminative Influence Model for Inferring Home Locations. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1023–1031, New York, NY, USA, 2012. ACM.
  • 21. Cho E., Myers S.A., and Leskovec J. Friendship and Mobility: User Movement in Location-based Social Networks. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1082–1090, New York, NY, USA, 2011. ACM.
  • 22. Monsivais D., Ghosh A., Bhattacharya K., Dunbar R.I.M., Kaski K. Tracking urban human activity from mobile phone calling patterns. PLoS Comput. Biol. 2017;13:e1005824. doi: 10.1371/journal.pcbi.1005824.
  • 23.Monsivais D., Bhattacharya K., Ghosh A., Dunbar R.I.M., Kaski K. Seasonal and geographical impact on human resting periods. Sci. Rep. 2017;7:10717. doi: 10.1038/s41598-017-11125-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Kondor D., Hashemian B., de Montjoye Y.-A., Ratti C. Towards matching user mobility traces in large-scale datasets. IEEE Trans. Big Data. 2020;6:714–726. [Google Scholar]
  • 25.Hasan S., Zhan X., and Ukkusuri S.V. Understanding Urban Human Activity and Mobility Patterns Using Large-scale Location-based Data from Online Social Media. In Proceedings of the 2nd ACM SIGKDD International Workshop on Urban Computing, UrbComp ’13, pages 6:1–6:8, New York, NY, USA, 2013. ACM.
  • 26.Ahas R., Aasa A., Silm S., Tiru M. Daily rhythms of suburban commuters’ movements in the Tallinn metropolitan area: case study with mobile positioning data. Transport. Res. C Emerg. Tech. 2010;18:45–54. [Google Scholar]
  • 27.Felbo B., Sundsøy P., Pentland A., Lehmann S., de Montjoye Y.-A. Machine learning and knowledge discovery in databases, volume 10536 of lecture notes in computer science. Springer; 2017. Modeling the temporal nature of human behavior for demographics prediction; pp. 140–152. [Google Scholar]
  • 28.de Montjoye Y.-A., Quoidbach J., Robic F., Pentland A.S. International conference on social computing, behavioral-cultural modeling, and prediction. Springer; 2013. Predicting personality using novel mobile phone-based metrics; pp. 48–55. [Google Scholar]
  • 29.Onnela J.-P., Saramäki J., Hyvönen J., Szabó G., Lazer D., Kaski K., Kertész J., Barabási A.-L. Structure and tie strengths in mobile communication networks. Proc. Natl. Acad. Sci. U S A. 2007;104:7332–7336. doi: 10.1073/pnas.0610245104. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Onnela J.-P., Saramäki J., Hyvönen J., Szabó G., De Menezes M.A., Kaski K., Barabási A.-L., Kertész J. Analysis of a large-scale weighted network of one-to-one human communication. New J. Phys. 2007;9:179. [Google Scholar]
  • 31.Krumme C., Llorente A., Cebrian M., Pentland A.S., Moro E. The predictability of consumer visitation patterns. Sci. Rep. 2013;3:1645. doi: 10.1038/srep01645. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.de Montjoye Y.-A., Hidalgo C.A., Verleysen M., Blondel V.D. Unique in the crowd: the privacy bounds of human mobility. Sci. Rep. 2013;3:1376. doi: 10.1038/srep01376. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Pellungrini R., Pappalardo L., Pratesi F., Monreale A. A data mining approach to assess privacy risk in human mobility data. ACM Trans. Intell. Syst. Technol. 2017;9:31. [Google Scholar]
  • 34.Achara J.P., Acs G., Castelluccia C. ACM Press; 2015. On the Unicity of Smartphone Applications; pp. 27–36. [Google Scholar]
  • 35.Sekara V., Mones E., Jonsson H. Temporal limits of privacy in human behavior. arXiv. 2018 doi: 10.1038/s41598-021-82294-1. preprint arXiv:1806.03615. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.de Montjoye Y.-A., Radaelli L., Singh V.K., Pentland A.P. Unique in the shopping mall: on the reidentifiability of credit card metadata. Science. 2015;347:536–539. doi: 10.1126/science.1256297. [DOI] [PubMed] [Google Scholar]
  • 37.Xu Y., Belyi A., Bojic I., Ratti C. Human mobility and socioeconomic status: analysis of Singapore and Boston. Comput. Environ. Urban Syst. 2018;72:51–67. [Google Scholar]
  • 38.Deußer C., Passmann S., and Strufe T. Browsing unicity: On the limits of anonymizing web tracking data. In 2020 IEEE Symposium on Security and Privacy (SP), pages 777–790. IEEE, 2020.
  • 39.Narayanan A., Shmatikov V. IEEE; 2008. Robust De-anonymization of Large Sparse Datasets; pp. 111–125. [Google Scholar]
  • 40.Riederer C., Kim Y., Chaintreau A., Korula N., Lattanzi S. Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee; 2016. Linking users across domains with location data: Theory and validation; pp. 707–719. [Google Scholar]
  • 41.Sánchez D., Martínez S., Domingo-Ferrer J. Comment on “Unique in the shopping mall: on the reidentifiability of credit card metadata”. Science. 2016;351:1274. doi: 10.1126/science.aad9295. [DOI] [PubMed] [Google Scholar]
  • 42.El Emam K. 2015. On Re-identification: Not Really Unique in the Shopping Mall. [Google Scholar]
  • 43.Barth-Jones D., El Emam K., Bambauer J., Cavoukian A., Malin B. Assessing data intrusion threats. Science. 2015;348:194–195. doi: 10.1126/science.348.6231.194-b. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Pappalardo L., Simini F. Data-driven generation of spatio-temporal routines in human mobility. Data Min. Knowl. Discov. 2018;32:787–829. doi: 10.1007/s10618-017-0548-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.González M.C., Hidalgo C.A., Barabási A.-L. Understanding individual human mobility patterns. Nature. 2008;453:779. doi: 10.1038/nature06958. [DOI] [PubMed] [Google Scholar]
  • 46.Alessandretti L., Sapiezynski P., Sekara V., Lehmann S., Baronchelli A. Evidence for a conserved quantity in human mobility. Nat. Hum. Behav. 2018;2:1. doi: 10.1038/s41562-018-0364-x. [DOI] [PubMed] [Google Scholar]
  • 47.Song C., Koren T., Wang P., Barabási A.-L. Modelling the scaling properties of human mobility. Nat. Phys. 2010;6:818–823. [Google Scholar]
  • 48.Hasan S., Schneider C.M., Ukkusuri S.V., González M.C. Spatiotemporal patterns of urban human mobility. J. Stat. Phys. April 2013;151:304–318. [Google Scholar]
  • 49.Article 29 Data Protection Working Party . European Commission; 2014. Opinion 05/2014 on Anonymisation Techniques. [Google Scholar]
  • 50.Centre for Humanitarian Data of the United Nations Office for the Coordination of Humanitarian Affairs . United Nations; 2019. Guidance Note Series on Data Responsibility on Humanitarian Action. Note 1: Statistical Disclosure Control. [Google Scholar]
  • 51.Gramaglia M., Fiore M. On the anonymizability of mobile traffic datasets. arXiv. 2014 preprint arXiv:1501.00100. [Google Scholar]
  • 52.Oehmichen A., Jain S., Gadotti A., de Montjoye Y.-A. 2019 IEEE International Conference on Big Data (Big Data) IEEE; 2019. Opal: high performance platform for large-scale privacy-preserving location data analytics; pp. 1332–1342. [Google Scholar]
  • 53.Mir D.J., Isaacman S., Cáceres R., Martonosi M., Wright R.N. 2013 IEEE international conference on big data. IEEE; 2013. Dp-where: Differentially private modeling of human mobility; pp. 580–588. [Google Scholar]
  • 54.Francis P., Probst Eide S., Munz R. Annual Privacy Forum. Springer; 2017. Diffix: high-utility database anonymization; pp. 141–158. [Google Scholar]
  • 55.González M.C., Hidalgo C.A., Barabási A.-L. Understanding individual human mobility patterns. Nature. 2008;453:779. doi: 10.1038/nature06958. [DOI] [PubMed] [Google Scholar]

Associated Data


Supplementary Materials

Document S1. Supplemental experimental procedures, Figures S1–S5, and Tables S1–S3
mmc1.pdf (582.2KB, pdf)
Document S2. Article plus Supplemental information
mmc2.pdf (1.4MB, pdf)

Data Availability Statement

For reasons of confidentiality and user privacy, we cannot share the raw data. However, we can make all the input distributions and raw empirical results available upon request for purposes of reproducibility.

The code used for all experiments is available at: github.com/computationalprivacy/scaling-unicity.


Articles from Patterns are provided here courtesy of Elsevier
