Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2018 Feb 1.
Published in final edited form as: Transp Res Part C Emerg Technol. 2018 Feb;87:58–74. doi: 10.1016/j.trc.2017.12.003

On data processing required to derive mobility patterns from passively-generated mobile phone data

Feilong Wang 1, Cynthia Chen 1,*
PMCID: PMC5789780  NIHMSID: NIHMS933857  PMID: 29398790

Abstract

Passively-generated mobile phone data is emerging as a potential data source for transportation research and applications. Despite the large amount of studies based on the mobile phone data, only a few have reported the properties of such data, and documented how they have processed the data. In this paper, we describe two types of common mobile phone data: Call Details Record (CDR) data and sightings data, and propose a data processing framework and the associated algorithms to address two key issues associated with the sightings data: locational uncertainty and oscillation. We show the effectiveness of our proposed methods in addressing these two issues compared to the state of art algorithms in the field. We also demonstrate that without proper processing applied to the data, the statistical regularity of human mobility patterns—a key, significant trait identified for human mobility—is over-estimated. We hope this study will stimulate more studies in examining the properties of such data and developing methods to address them. Though not as glamorous as those directly deriving insights on mobility patterns (such as statistical regularity), understanding properties of such data and developing methods to address them is a fundamental research topic on which important insights are derived on mobility patterns.

Keywords: Human mobility trajectory, Locational uncertainty, Oscillation problem, Representativeness issue, Incremental clustering method, Time-window-based method, Statistical regularity

1. Introduction

In the past decade, passively-generated mobile phone data (hereafter called “mobile phone data”) has risen as a viable data source for human mobility analysis and transportation applications. It is the by-product of the primary, often non-transportation related purposes such as billing and operations (Alexander et al., 2015a; Chen et al., 2014, 2016; Iqbal et al., 2014; Toole et al., 2015). Often location- and time-stamped, mobile phone data offer a tremendous opportunity to revolutionize the transportation field that goes beyond behavior analysis, traffic operations and safety analysis (Vlahogianni et al., 2015). Indeed, a search in the literature using keyword combinations of mobile phone data, mobility, and travel behavior resulted in more than 1000 articles published in journals across different disciplines (Ahas et al., 2010a; Becker et al., 2013; Calabrese et al., 2013; Candia et al., 2008; Chen et al., 2014, 2016; Gao et al., 2013; Song et al., 2010b; Wang et al., 2014). These articles cover a wide range of topics including, for example, estimating mobility patterns (Csáji et al., 2013; González et al., 2008; Song et al., 2010a), inferring OD matrix (Calabrese et al., 2011b; Iqbal et al., 2014), finding anchor locations (Dong et al., 2015; Isaacman et al., 2011), inferring activity types (Jiang et al., 2017; Widhalm et al., 2015) and travel modes (Qu et al., 2015; Wang et al., 2010).

Unlike household travel survey data that is generated by asking subjects record an entire day’s activities and travels, thus resulting in complete information on one’s spatio-temporal movement, mobile phone data generates one or more records when a user connects to a network (Chen et al., 2016; Wang et al., 2014). Each record typically includes the device ID, the timestamp and the location information. Two kinds of mobile phone data have been frequently used today: CDR (Call Detail Record) data and sightings data (Calabrese et al., 2013; Chen et al., 2014), where the former (CDR data) comprises a set of phone call records along with the time information and the location information of the cell tower that channels the call, and the sightings data contains sightings that are results of the triangulation of cell towers. Hereby we call either a record or a sighting as a trace. Neither data covers the full spectrum of a user’s movement in a day. Since both rely on the phone usage, both are temporally sparse: for sightings data, the average time between two consecutive network-connection activities could be longer than one hour and for CDR data the interval could be even longer (Calabrese et al., 2013). This issue is in addition to their much coarser spatial resolution in location estimation compared with data from household travel surveys and GPS (Global Positioning System)-based surveys. The precision of location estimation is reported to be approximately 300 m on average in urban area for sightings data (Jiang et al., 2013) and between several hundred meters to several kilometers for CDR data, depending on the density of cellular towers (Chen et al., 2016; Järv et al., 2014). Another issue is purely signaling related: some traces are resulted from signaling activities, as opposed to users’ movements. That is, when a user stays at one location, his/her phone may be handed over to different cell towers due to load-balancing or other operational purposes. This results in traces suggesting that the user bounces between multiple stations. This is known as oscillation (Lee and Hou, 2006; Qi et al., 2016).

Additionally, because a trace in mobile phone data represents either a phone call (for CDR data) or a sighting when a phone is seen on the network (for sightings data), a trace in the mobile phone data cannot be viewed directly as an activity location where a person spends time for an activity. Because of this and the previously noted issues such as uncertainty and oscillation, the mobile phone data must be processed first before it can be used for mobility analysis. The word “processed” here specificially refers to the process that must be applied to mobile phone data to address issues such as uncertainty and oscillation in order to infer activity and trip related information. This process is uniquely different from the cleaning process typically applied to household travel survey data whose records directly represent activities and trips conducted by the subjects. Clearly we do not need to emphasize the importance of this processing task: the results of processing directly affect the results of mobility analysis using such data (Horn et al., 2014). We demonstrate this in Section 5.3 of the paper.

Surprisingly, among the many studies that have used mobile phone data, only a few have reported the properties and issues associated with such data, and documented how they have processed the data and how the results compare to external data sources that may provide some validations (Palchykov et al., 2014; Serok and Blumenfeld-Lieberthal, 2015; Yin et al., 2017). This paper focuses on mobile phone sightings data, in particular, seeking to illustrate and address two prominent issues of the dataset: (1) locational uncertainty and (2) oscillation. The sightings data has now generated many papers in the field (Calabrese et al., 2011a, 2013; Jiang et al., 2013), is already being used by several planning agencies and departments of transportation in the US (Milone, 2015; Stabler and Sikder, 2014) and the data provider is now widely advertising its products (e.g., OD matrices) derived from such data (Milone, 2015). Thus, having a clear understanding of such data and the techniques that can be applied to process such data is not only important for researchers but also for practitioners who may be interested in using such data or their derived products. The first issue (locational uncertainty) is unique to the sightings data, as every location estimate contained in the data is unique, as the results of triangulation. Methods are designed to deal with this issue so that activity locations (where people spend time to conduct activities) can be identified from the sightings data. A common solution is to define a radius and a minimum duration corresponding to the positioning errors and the time spent on activities respectively (Jiang et al., 2013). Then an activity location is identified if a sequence of traces meets both the spatial and temporal constraints. The second issue—oscillation—exists in both sightings and CDR data. The phenomenon cannot be ignored as traces resulted from oscillation take up a substantial fraction of the total number of records: approximately 30% (Lee and Hou, 2006). Solutions, however, especially for sightings data, are lacking and are mostly discussed in computer science communities (Bayir et al., 2010; Lee and Hou, 2006; Shad et al., 2012). Two approaches have been commonly used to detect oscillation sequences in the traces: pattern-based and hybrid methods. The former examines trace sequences and the one that exhibits a specific switching pattern will be identified as the oscillation case (Lee and Hou, 2006). The latter extends pattern-based methods by utilizing temporal and/or spatial information to consider velocity or other measurements (Iovan et al., 2013). Our paper reviews these existing methods and proposes new solutions. We show that our proposed solutions exceed the performance of the existing methods in addressing these issues.

The rest of the paper is organized as follows. After reviewing related works in Section 2, Section 3 introduces the sightings data used in our study. In Section 4, we construct a data-processing framework that provides solutions for the two problems. The framework consists of two stages: the first is to address the issue of locational uncertainty by proposing a revised incremental clustering algorithm; and the second is to handle the oscillation problem with a time-window-based method in addition to a modified pattern-based method. Through analyzing the processed data, we show advantages and effectiveness of our framework. Section 5 covers this part of work. Lastly in Section 6, we provide a discussion on related questions that are not addressed within the scope of this paper. The representativeness issue, which is related to the issue of temporal sparsity, is discussed in this section. We also discuss how our developed framework and solutions can be applied to the CDR data. We hope our work will stimulate many more studies addressing this critical topic that will determine not only the accuracy of the results from using such data but also its wide applications in the future.

2. Related works

2.1. Understanding passively generated mobile phone data

As noted earlier, two types of mobile phone data have been commonly studied: Call Details Record (CDR) data (Ahas et al., 2010a; Järv et al., 2014; Xu et al., 2015) and the sightings data(Calabrese et al., 2011b; Chen et al., 2016; Wang, 2014). In the former, each record represents a device-driven activity (e.g., phone calls, messaging services and Internet service) and contains information such as the initial time of the activity, the duration, and the ID and location information of the cell tower that channels the activity.

The sightings data differs with the CDR data in two aspects: (1) more records: in addition to just recording device-driven activities, data is also generated from other network-driven interactions, such as handovers when a device switches between two neighboring cell areas (Demissie et al., 2013; Hard et al., 2016; Iovan et al., 2013); and (2) higher spatial accuracy: instead of cell-level location information as in the CDR data, the location information contained in the sightings data is an estimation of the device’s location through triangulation of multiple cells (Calabrese et al., 2011b; Chen et al., 2016).

Passively generated, both types are characterized with spatio-temporal uncertainties. From the temporal perspective, depending on device usage pattern, the temporal distribution of records can be irregular and heterogeneous (Iovan et al., 2013). Consequently, the time interval between two consecutive records is uncertain and could be too long to capture users' movements that took place during the long time gap. The interval for a sample CDR data is reported with a mean of as long as eight hours (González et al., 2008). And a median of slightly longer than one hour is reported for a sample sightings data (Calabrese et al., 2011b). One way to address the temporal issues is to apply a data-filtering process, such as only selecting highly active users (Calabrese et al., 2013; González et al., 2008; Song et al., 2010b; Zhao et al., 2016). However, results from those studies may be affected by the selected filtering process due to potential correlations between individual mobility and device usage, e.g., higher communication frequency is likely to lead to more trips derived (Iovan et al., 2013; Yuan et al., 2012). Though not directly addressed in the current study, we discuss it in Section 6.

From the spatial perspective, several uncertainty issues exist in the mobile phone data, which affect individual mobility derived from data. The first issue existing in both types of data is the relatively low “spatial accuracy”. Depending on the density of cellular towers, the accuracy of location estimates varies. The accuracy is reported to be about 300 m on average in urban area for sightings data (Calabrese et al., 2011b; Hard et al., 2016; Jiang et al., 2013) and between several hundred meters to several kilometers for CDR data corresponding to the coverage of a tower/antenna (Järv et al., 2014). There are efforts to improve spatial accuracy, for example, via interpolation methods (Järv et al., 2017; Louail et al., 2014). The second uncertainty related issue that exists in both types of data is the oscillation phenomenon: due to dynamic factors including varying transmission conditions that influence received signal strength and load balancing policies, instead of connecting to the nearest cell tower, a mobile phone switches its connection between different towers (typically within a short period of time) even though the device itself is not moving (Calabrese et al., 2011b; Wu et al., 2014). The oscillation phenomenon generates a considerable number of records that do not reflect devices' actual movements (Lee and Hou, 2006). The third issue is specifically related to sightings data: location estimates for a stay at the same location change over time. This is related to the method of triangulation, as measurements on factors used for triangulation (e.g., number of surrounding cell towers, received signal strength) are subject to fluctuations, leading to distinct estimations for the same location (Calabrese et al., 2011a). Aggregation on these varying location estimates is required so that we can identify activity locations and calculate the time spent at those locations (Hard et al., 2016).

2.2. Addressing locational uncertainty by extracting activity locations

For CDR data with location information at the cell level, frequently visited locations (e.g. home, work places) by an individual are usually indicated by locations of network cells that were most frequently visited (Ahas et al., 2010b; Isaacman et al., 2011; Järv et al., 2014; Wang et al., 2012; Xu et al., 2016). By clustering cell towers that are spatially close, these locations could also be represented by clusters containing towers that frequently channel call activities (Jiang et al., 2017; Xu et al., 2016).

However, for sightings data, as the location estimation is the result of triangulation among multiple towers (Chen et al., 2016; Widhalm et al., 2015), each of them is unique. In other words, even when a user stays at the same location, locations recorded in the data differ from time to time (though they may scatter in close proximity) (Fig. 1). This means the methods for identifying activity locations from the CDR data, such as those relying on counting call activity frequency with cell towers, are not applicable for the sightings data. The common way to address this locational uncertainty and to reveal activity locations from the sightings data is to aggregate these distinct location estimates by applying a clustering algorithm (Calabrese et al., 2011b; Chen et al., 2016; Jiang et al., 2013). Typically, the centroid of the outputting clusters is used as representing the activity location.

Fig. 1.

Fig. 1

Locational uncertainty. (The inset figure is the zoom-in of closely distributed location estimates of one activity location).

Conventional clustering methods are usually not applicable mainly because parameters in these methods need to be pre-determined, which is infeasible in mobile phone data. For example, popular methods such as k-means and DBSCAN (Ester et al., 1996; Kanungo et al., 2002) require the number of clusters or density-related parameters be known, which cannot be specifically identified because of the variety of mobile phone usage and travel behaviors that exist across individuals (Chen et al., 2014).

Some new methods aggregate traces by segmenting one trajectory into several sequences of traces. Here, one trajectory of a user refers to the user’s available traces of one day. In the trajectory-segmentation methods, an activity location is defined as a sequence of consecutive traces bounded by both temporal and spatial constraints (Calabrese et al., 2011a, 2013; Jiang et al., 2013). Given the trajectory {d0, d1, d2, …, dk}, one cluster of traces {dm, dm+1, …, dn} (0 ≤ m, n ≤ k) will be defined as an activity location, if two conditions are met: (1) the maximal geographical distance between any two location estimates Ldi and Ldj (m ≤ i, j ≤ n) should not exceed Rc (here Ldi, Ldj are location records of traces di and dj respectively and Rc is the spatial constraint (e.g., 300 m); and (2) the time difference between the last and the first trace of this sequence (i.e. tdn–tdm) should be at least Tc (here tdn and tdm are timestamps of traces dn and dm respectively and Tc is the temporal constraint (e.g., 10 min)). Thus, in the trajectory-segmentation method, activity locations can be identified by applying these two conditions and segmenting the trajectories into sequences. This method is intuitive and easy to apply. But, the first condition, requiring any two traces in the sequence to satisfy the distance constraint, can pose challenges if outliers exist. For example, given a sequence of traces {dm, dm+1, …, dn} that are spatially close, the insertion of an outlier doutlier may result into three clusters: {dm, dm+1, …, dm+k}, {doutlier} and {dm+k+1, …, dn}, consequently leading to biased estimation on the number and the types of activity places one has visited.

Chen et al. (2014) suggested a model-based clustering method. Rather than relying on a pre-determined threshold value for all devices, this method determines the optimal number of clusters for each device through a statistical model, more specifically, by searching through a finite set of possible numbers of clusters and finding the one that generates the highest Bayesian Information Criterion (BIC) (Fraley, 1998; Fraley and Raftery, 2002). The centroid of an identified activity cluster (defined as a sequence of traces that are considered to be activity places) is used as activity locations for subsequent modeling processes. The advantage of applying the model-based clustering is obvious: there is no requirement for pre-determined clustering parameters. However, there are limitations. This method is sensitive to the spatial density of the traces, which are often of low density. Additionally, the size of outputting clusters cannot be controlled, meaning that several far-away outliers may be clustered together, causing the resulting cluster to stray away (Fig. 2): in other words, a cluster without spatial constraint makes the clustering process less meaningful.

Fig. 2.

Fig. 2

A problematic classification example of model-based clustering. (Different clusters are shown in different symbols of different colors. The size of the ellipse overlaid on a cluster reflects the size of that cluster. Here locations as far as nine kilometers are aggregated into one cluster, making the clustering less meaningful.)

Another method to aggregate traces is the incremental clustering method (Alexander et al., 2015a; Hariharan and Toyama, 2004; Wang et al., 2015; Widhalm et al., 2015). Given traces {d0, d1, d2, …, dk} in one trajectory, the clustering is performed as follows: (1) starting from trace d0, one new cluster C0 is created and d0 is the center; (2) each trace that is not clustered will be checked and the trace within a distance Rc to the center of C0 is aggregated to the cluster; (3) every time the cluster grows, its center is updated; (4) if no trace could be aggregated in the current cluster, one new cluster will be created containing the non-clustered trace. This procedure repeats itself until all traces are clustered. After the clustering process, clusters with a time duration of no less than Tc will be considered as activity locations. The cluster duration is calculated as the largest time difference between any two traces in the cluster. This clustering method is relatively robust to outliers present among traces, as the mechanism of updating centers (step 3) improves its tolerance to outliers.

Most of above mentioned clustering methods are applied to individual days of traces. In data of multiple days, however, the same activity location may appear distinct in terms of their coordinates. This creates challenges for the identification of common activity locations, such as home and work locations (Xu et al., 2015), and observations on mobility patterns such as the regularity property (González et al., 2008). Therefore, it would be necessary to develop methods to identify common activity locations where users conduct activities on multiple days. By dividing the study area into small grids, activity locations on multiple days are regarded as a common location if they fall into the same grid (Andrienko et al., 2011; Zheng et al., 2010). Jiang et al. identified the common locations by aggregating activity locations of multiple days via the agglomerative clustering algorithm (Jiang et al., 2013).

2.3. Removing oscillation sequences

By applying the clustering algorithms, the issue of locational uncertainty is dealt with so that activity locations could be identified. However, the results can still be biased if traces generated from oscillations are not removed (Qi et al., 2016; Wu et al., 2014). When oscillation occurs, some devices may be observed switching between two (or more) far-away locations with high frequencies. In most cases, an oscillation pattern forms, such as L0-L1-L0-L1-L0, where L0 and L1 are distinct location estimates. Table 1 provides an example oscillation case. The Euclidean distance between locations L0 and L1 is 2.7 km, but the user’s location switches between them three times within four minutes. If the switching speeds are calculated, they are incredibly high, which should not reflect the actual movement of the user. Since oscillation is a common phenomenon, a data processing step must be applied.

Table 1.

An oscillation case.

Trace Location Time Distance (km) Switching speed (km/h)
d0 L0 12:21:48 \ \
d1 L1 12:22:01 2.7 748
d2 L0 12:25:20 2.7 49
d3 L1 12:25:39 2.7 512

(The switching speed of two consecutive traces is calculated as the division of the distance between two location estimates by the time interval of these two traces. A space-time visualization of this oscillation case is given in Fig. 8.)

A few heuristic rules were proposed for CDR data to detect oscillations among cellular towers. In general, they are based on the observation that oscillation generates trace sequences showing unique patterns (e.g., L0–L1–L0–L1) and usually lead to an illusion that users travel with an incredibly high speed (e.g. 1000 km/h) (Bayir et al., 2010; Lee and Hou, 2006; Shad et al., 2012; Yin et al., 2017).

Lee and Hou (2006) defined a sequence of consecutive location estimates with pattern L0–L1–L0–L1 or L0–L1–L2–L0 as an oscillation sequence. A similar method was introduced by Bayir et al. (2010). According to their definition, an oscillating sequence is detected if at least triple switches are found in one trip (One switch refers to a one-time switch between two locations). For example, the sequence of location estimates {L0–L1–L0–L2–L3–L2–L1} is regarded as an oscillation sequence because of the triple switches between L0 and L1. This method was designed for a unique dataset collected by using an application installed on subjects’ mobile phones and requesting subjects to put semantic tags to locations they visited. Shad et al. (2012) investigated the solution to oscillation using the same data. The similar pattern-based method they proposed relies on cellular towers’ Location Area Code and radius information. For datasets where semantic tags and radius information are missing, it is hard to implement these methods. Meanwhile, relying on the unique patterns detected in the sequence of traces risks mistaking actual travels for occurrences of oscillation. In other words, an individual may indeed travel between two places multiple times in a typical day and traces generated from this actual travel behavior could be mistaken as oscillation traces.

A few methods have been designed to improve the detection precision by utilizing the temporal information available in the data (Iovan et al., 2013; Qi et al., 2016; Widhalm et al., 2015; Wu et al., 2014). In an oscillation sequence, the frequent switches between distinct locations give the illusion that people travel at an abnormally high speed. This problem is considered in the speed-based methods, where the switching speed is calculated. An oscillation sequence is identified if the switching speed exceeds a given threshold Vc. Iovan et al. (2013) set Vc as 200 km/h. In their work, they extended the definition with one more condition: for an oscillation case, the difference between the heading directions of two consecutive displacements should be 180 degrees, which corresponds to two switches (i.e. L0–L1–L0) in the pattern-based methods. Thus, their work can be regarded as a hybrid method – a combination of pattern-based and speed-based method. But the criterion of two switches is too simple to detect complex patterns that involve more than two locations. Meanwhile, given the low accuracy of location estimation, the method involving the speed measurement could be misleading, especially if the time intervals of switches are small (Demissie et al., 2013).

Wu et al. elaborated several heuristic rules to detect oscillation traces (Wu et al., 2014). Working on CDR data, they begin with finding suspicious sequences that contain at least three traces from multiple cellular towers within a short time period (e.g., one minute). Then a suspicious sequence is confirmed as an oscillation sequence if it contains a circular event (i.e. a tour L0–L1–…–L0). In this rule, the limit of a short time window is to account for the observation that oscillation is a short time event and the circular-event criterion is to rule out suspicious sequences resulting from fast movements.

Following the detection of oscillation sequences, a meaningful location needs to be identified to represent the device’s location when oscillation occurs. This meaningful location could be the one, among all locations in the sequence, which is visited most frequently (Iovan et al., 2013). Wu et al. (2014) introduced a score for each location in the oscillation sequence by considering both the visiting frequency and the average distance to other locations. The one that receives the highest score is selected as the meaningful location.

Oscillation solutions for sightings data are limited (Alexander et al., 2015a; Dong et al., 2015; Palchykov et al., 2014; Yin et al., 2017). Calabrese et al. (2011b) and Widhalm et al. (2015) addressed two problems (both locational uncertainty and oscillation) simultaneously using clustering methods. Their clustering methods could eliminate some of oscillation traces, since oscillation traces falling into a same cluster are aggregated and replaced by the centroid of the cluster. The problem is: oscillation traces could be too faraway (e.g., several kilometers away) to be compressed into a constraint-specific cluster (see our results in Section 5.2). Additionally, Yin et al. (2017) mentioned the use of the pattern-based method to detect oscillation traces in their applied sightings data, but details (e.g., what parameters to use) are not present.

3. The sightings data

The mobile phone sightings data used in this study consist of sightings from about one million users in the Buffalo metropolitan area during the month of April 2014. Sightings are derived by an intermediate company from records generated by phone operators for operation and billing purposes so that phone users maintain connections to the network when needed (for calling, texting and activities that require accessing the internet) and are correctly billed. Each sighting reveals the trace of a user, containing an encrypted mobile identification number, a time stamp and a location estimate expressed in latitude and longitude (Table 2). The data include devices that were observed in the Buffalo area and served by the cellular network carriers contracted with the data provider in April 2014. According to the data provider, users in the data could be resident workers, home workers, inbound commuters, outbound commuters, short-term visitors, or long-term visitors, etc.

Table 2.

A sample of mobile phone sightings data.

ID Time* Location estimate**
36cc5**77ca 1,396,381,131 42.951554|−78.665607
36cc5**77ca 1,396,385,839 42.929892|−78.624486
36cc5**77ca 1,396,386,054 42.928625|−78.607994
36cc5**77ca 1,396,386,319 42.929892|−78.624436
36cc5**77ca 1,396,386,567 42.951554|−78.665607
36cc5**77ca 1,396,386,653 42.929892|−78.624436
36cc5**77ca 1,396,386,876 42.955625|−78.606594
*

In Unix time – defined as the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time, 1/1/1970.

**

Location estimates in the table are modified to protect privacy. The same is applied as for Table 3.

The location information of traces is estimated through the data provider’s proprietary technology and its accuracy is reported to be about 300 m on average in the urban area. For each sighting, a variable called certainty radius is calculated, which is a measurement in meters representing the radius of a circular area where the actual location of the device can be found with a 90% probability. The center of the area gives the location estimate of each trace in the data in the format of latitude and longitude. Fig. 3 shows the spatial distribution of certainty radius. Area with smaller certainty radius is shown in lighter shade, indicating relatively higher spatial accuracy. We note that generally downtown area shows higher accuracy than the rest of the study area.

Fig. 3.

Fig. 3

Spatial distribution of certainty radius.

Fig. 4a shows the distribution of the number of days observed in the data. Half of the users are observed for fewer than seven days, and only 6797 users (less than 1% of all users) have at least one trace everyday in the entire month. As another test on the temporal sparsity of the data, we divide a day into 24 hourly slots and, for each trajectory of a device, we count, among 24 slots, the number of slots in which the device is sighted at least once (Widhalm et al., 2015) (Fig. 4b). The median is six slots, indicating that half of trajectories have no more than six hourly slots with locations revealed by the mobile phone network.

Fig. 4.

Fig. 4

(a) Distribution of the number of days observed. (b) Distribution of the number of hourly slots with at least one trace.

4. Methodology

4.1. An overview

In this section, we provide a framework to process sightings data. As noted earlier, the framework consists of two stages, corresponding to two issues of the data. In the first stage, the issue of locational uncertainty is addressed by aggregating locations into clusters using a revised incremental clustering algorithm. We show that the revised incremental clustering algorithm can maintain the spatial precision of the original data and is adaptive to the spatial density of traces in rural and urban area (see details in Section 5). In the second stage, oscillation sequences are detected by a time-window-based detection method and a modified pattern-based method, which are able to identify complex oscillation patterns and to reduce the risk of mistaking real trips. Here, the order of these two stages matters, which will be discussed at the end of this section. The framework is summarized in Fig. 5 and described in subsequent sections.

Fig. 5.

Fig. 5

Framework of processing sightings data.

4.2. Addressing locational uncertainty toward extracting activity locations

Our method to deal with the locational uncertainty in the sightings data contains three steps: (1) applying the revised incremental clustering algorithm to aggregate traces; (2) finding a suitable parameter for the incremental clustering algorithm; and (3) applying the k-means clustering algorithm to deal with an order problem that is not addressed by the incremental clustering algorithm (see details below).

4.2.1. Incremental clustering algorithm

We first identify activity locations. As noted earlier, in sightings data, location estimates for a single activity location could be unique because of the locational uncertainty. Thus, duration at a location cannot be directly inferred. In principle, these location estimates shall be closely distributed in space. Therefore, we could aggregate these closely distributed locations and represent them with one cluster so that the duration associated with each potential activity location could be calculated.

The other issue caused by the locational uncertainty is that the common activity location that a user visits on different days may appear distinct in their expressed format of longitude and latitude. If we put together all the available location estimates of a potential activity location, we will find that they scatter closely in a cluster as well. Thus, common activity locations could be revealed by putting traces of multiple days together and clustering them without regarding their time ordering.

In this study, we develop a revised incremental clustering algorithm to aggregate traces of multiple days for each user without considering their time ordering. This method enables us to identify common activity locations by aggregating traces that are close in space but may be far away in time (e.g., several days). Details of this algorithm are provided in Fig. 6.

Fig. 6.

Fig. 6

Revised incremental clustering algorithm.

4.2.2. Determining a suitable Rc via trial and error

The incremental clustering method requires a spatial constraint Rc as an input. Without information to choose a suitable Rc to cluster our data, we find it via trial and error. More specifically, we try different values of Rc and find the suitable one by counting the total number of activity locations identified in each trial and analyzing its relationship with Rc. To do this, we first need to identify activity locations, which is described bellow.

(1) Duration and activity locations

Given a setting of Rc (e.g., 0.5 km), traces are aggregated into clusters without considering the temporal information. To determine the duration, we use individual trajectories. Through scanning every trajectory, we find consecutive traces that belong to the same cluster and aggregate them into a newly created trace, which contains several fields (Table 3): the original user ID; the initial time given by the timestamp of the first trace of the cluster; the duration that is calculated as the difference of the first and last timestamp in the cluster; and the location information that is given by the centroid of the cluster. Then, an activity location is identified if its duration exceeds Tc. We set Tc as five minutes. This five-minute threshold also follows the rule used in many household travel surveys to define what counts an activity (Transportation Research Board, 2005). Additionally, it is an appropriate threshold for an activity location in the activity based modeling context (Yin et al., 2017).

Table 3.

A sample of data after clustering.

ID Initial time Duration (second) Location estimate
36cc5**77ca 1,396,377,202 10,246 42.949429|−78.670337
36cc5**77ca 1,396,388,619 26 42.972285|−78.776146
36cc5**77ca 1,396,388,890 134 42.998380|−78.791654
36cc5**77ca 1,396,389,805 73 43.039335|−78.860423
36cc5**77ca 1,396,392,466 1310 42.949429|−78.670337
(2) Determining a suitable Rc

In Fig. 7, we show the average number of distinct activity locations visited per day per person nAL as a function of Rc. If Rc is small, one actual activity location may be split into several clusters. With increasing Rc, small clusters start to merge into meaningful representations of actual activity locations. Fig. 7 suggests the suitable Rc appears at 1 km. Recall that the identification of activity locations is subject to both spatial and temporal constraints. Above 1 km, it is possible that some clusters may absorb far-away outliers (passing-by points), therefore lengthen their duration and meet the temporal constraint (i.e. 5 min), consequently leading to the growth of the number of activity locations. This is witnessed by the slight increase of nAL and by the steep rise of cluster duration (per cluster) when Rc is above 1 km (Fig. 7).

Fig. 7.

Fig. 7

Number of distinct activity locations per day and associated duration as a function of Rc.

4.2.3. K-means clustering algorithm

We notice an order problem that could not be handled by the incremental clustering algorithm alone—clustering results are subject to the order that how traces are clustered, which can result in unreasonable clusters. Fig. 8a gives one example, where fourteen traces are to be clustered. Following the incremental clustering algorithm, a new cluster is created at L0, and consequently we obtain two strange clusters: one consists of locations {L0, L1, L2, L3, L4} and the other one contains other locations. For this case, however, a more reasonable output might be: one cluster contains L0 alone and the other one compasses the remaining.

Fig. 8.

Fig. 8

Illustration of the order problem and the solution. (a) Classification without k-means algorithm. (b) Resulting classification after implementing k-means algorithm. (Following the incremental clustering algorithm, we obtain two strange clusters: one in triangle and the other one in circle. After applying k-means clustering, we have a more reasonable output.)

We apply the k-means algorithm to address this problem. This algorithm enables us to aggregate traces into k clusters in which each trace belongs to the cluster with the nearest distance to the centroid. We address the above order problem by initializing the k-means clustering using the results yielded from the incremental clustering algorithm. More specifically, the number of resulting clusters and their centroids are required for the initialization (Kanungo et al., 2002). In other words, the above order problem is resolved by applying the k-means clustering algorithm to revise the clustering results from the incremental clustering algorithm. We notice that the clustering error in the example is corrected (Fig. 8b). Following this revision, the duration and centroid of clusters are updated.

4.3. Removing oscillation sequences

In the second stage of data processing, we remove oscillation sequences. Our solution consists of four steps: (1) detecting oscillation sequences using a time-window-based method; (2) determining the suitable time window (required input parameter of previous step); (3) applying a modified pattern-based method to detect oscillation sequences with low switching frequency; and (4) removing oscillation sequences detected.

4.3.1. Time-window-based method

Inspired by the method proposed by Wu et al. (2014), we develop a time-window-based method to detect oscillation sequences. It scans trajectories with a short time window Tw (defined in the following section), which always starts at the ending time of a trace d0. Here, the ending time of a trace is the time given by the sum of the initial time and duration of the trace (Table 3). A sequence of traces returns after the time window is applied (note that this sequence may contain a single trace). Among the sequences returned, we find the suspicious sequence which includes more than one distinct location estimates. And then oscillation sequences are con-firmed as those containing at least one circular event. A circular event refers to a tour that one device is initially found at location L0, later goes somewhere else (e.g., L1, L2), and at last returns to location L0. Since the time window is defined quite short (e.g., several minutes), it is less likely for anyone to make a tour within such a short time period. Therefore, we consider that the oscillation sequences detected contain traces resulted from oscillation phenomenon. Fig. 9 illustrates how this time-window-based method works. The oscillation case shown in Table 1 is detected, because a circular event L0-L1- L0 is found in the sequence {d0, d1, d2, d3}, which is returned after the time-window is applied.

Fig. 9.

Fig. 9

Illustration of time-window-based detection method. (This figure corresponds to the oscillation case in Table 1. The implementation of the time-window returns a sequence of traces {d0, d1, d2, d3}, containing the circular event (L0–L1–L0)).

4.3.2. Determining a suitable Tw via trial and error

To determine a suitable Tw, we test a set of time-windows of different lengths. Then we calculate the average oscillation ratio as a function of Tw (Fig. 10), where the oscillation ratio for a user is defined as the ratio of the number of detected oscillation traces over the total number of traces associated with that user. Fig. 10 shows that when Tw is small, the average oscillation ratio increases sharply; but beyond a certain value, the tangent of the curve remains stable, indicating five minutes as a reasonable choice to separate oscillation cases from real trips. More specifically, five-minute is chosen as a longer Tw does not yield much better performance on detecting oscillation traces. Instead, a longer Tw may be at the risk of mistaking more actual trips for oscillations. When Tw is five minutes, the estimated average oscillation ratio is 0.17, suggesting that on average approximately 17% of traces are generated due to pure signaling activities.

Fig. 10.

Fig. 10

Average oscillation ratio as a function of Tw.

4.3.3. Modified pattern-based method

The time-window-based method is effective in identifying most oscillation sequences. But, we found that due to temporal sparsity, the time window may be too short to capture the circular event even though the oscillation occurs. Thus, we further process the result using a modified pattern-based method. We first detect sequences with a twice switches pattern, i.e. L0–L1–L0–L1. As the pattern-based method alone may mistake actual trips for oscillation, for each sequence detected, we look for additional evidence: only if a time interval, among all time intervals between any two consecutive traces in the sequence, is found shorter than Tw, we identify it as an oscillation sequence. In Section 5, we show that the simple two-switch criterion is effective, though more complex ones could be designed.

As the oscillation detection methods rely on identifying circular events, it is necessary to address the issue of locational uncertainty first. Without addressing the problem of locational uncertainty, location estimates in sightings data may be unique from time to time, preventing us from identifying circular events and thus the detection of oscillation.

4.3.4. Removing oscillation sequences detected

For each oscillation sequence detected, we remove oscillation traces by finding and replacing them with a meaningful location. An oscillation sequence may reoccur at different times of a day and/or on different days. If the meaningful location is determined from each oscillation sequence independently (Bayir et al., 2010; Wu et al., 2014), we probably yield different results for the same oscillation sequence that appears at different time. For example, we may obtain two different meaningful locations L0 and L1 for one oscillation sequence {L0–L1–L0…} that occurs twice in one day (e.g., in the morning and evening). To address this issue, we determine the meaningful location by examining the traces of the entire month. Specifically, the meaningful location is identified as the one where the user spends most of time during the entire month. The time spent at one location is computed by summing up durations that is associated with the location (Table 3).

5. Results

5.1. Locational uncertainty

We first evaluate the effectiveness of our framework in terms of addressing the issue of locational uncertainty by comparing resulting clusters with those using commonly used agglomerative clustering method (Hariharan and Toyama, 2004; Jiang et al., 2013). Specifically, the radius of outputting clusters is computed and compared. Since location estimates belonging to one cluster are replaced by the cluster’s centroid, the radius of the cluster could be a measure of the precision of new location estimates: the new location records generated from large clusters have lower precision. Fig. 11 shows the distribution of the radius of clusters produced by different methods using the same spatial constraint (Rc = 1 km). We notice that comparing with the uniform distribution of the radius of clusters obtained by agglomerative clustering, the radius of clusters resulting from our method is more heterogeneously distributed. In addition, approximately 80% of these clusters have a radius no longer than 500 m (compared with 50% in the output of agglomerative clustering method), meaning that our method performs better in terms of preserving location resolution while conducting the clustering to extract activity locations.

Fig. 11.

Fig. 11

Distribution of the radius of clusters.

Interestingly, compared with the agglomerative clustering that outputs clusters with a clear cutoff on their size, our method produces some clusters whose size exceeds Rc (Fig. 11). This suggests that the clustering method proposed here is also adaptive to address cases where traces belonging to the same cluster can scatter far from each other, exceeding the given spatial constraint Rc. Note that this adaptive feature could be useful if traces of one activity location scatter broadly (i.e. the associated locational uncertainty is large) because of the low density of cellular towers that are involved in estimating locations.

5.2. Oscillation

We then evaluate our solution to addressing the oscillation phenomenon. Since the oscillation sequences give us the illusion that a user travels back and forth within a short time, we may overestimate the trip rates of users whose records include oscillation traces. Here, one trip is defined as a sequence of traces starting and ending at two consecutive activity locations. Without removing oscillation sequences, we notice a large number of trajectories with very high trip rates. The Buffalo travel survey suggests that about 95% of population conduct no more than eight trips a day (Greater Buffalo-Niagara Transportation Survey, 2002). We thus assume if a trajectory includes more than eight trips, it is identified as a potentially abnormal trajectory because of possible presence of oscillation traces. We found that before removing oscillation sequences, the number of abnormal trajectories is high – about 12% of the total number of trajectories. After oscillation traces are removed, this value drops to about 0.7%. This resulting value is much smaller than 5% from the travel survey data, which could be attributed to the missing short-distance trips that are shorter than Rc and the related representativeness issues that are not dealt with in this paper but discussed in Section 6.

In Fig. 12, we plot the percentage of abnormal trajectories as a function of Tw. Before and after the implementing the modified pattern-based method, the substantial drop in the number of abnormal trajectories testifies the effectiveness of the method.

Fig. 12.

Fig. 12

Percentage of abnormal trajectories as a function of Tw.

The time-window-based method is capable of identifying oscillation sequences with complex patterns that involve more than two locations. For example, it can detect oscillation sequences with pattern L0–L1–L2–L3–L0–L1–L2–L1–L0, meaning a device is observed switching among four locations. Fig. 13a gives the distribution of different oscillation patterns this method detected in the data. Here, n locations involved in oscillation means devices are observed oscillating among n locations. We see that though in most of cases oscillation only involves two locations, mobile phone data contains complex oscillation patterns and they could be detected with our framework. Specifically, we notice that approximately 3% of oscillation sequences involve more than four locations. Meanwhile, Fig. 13b shows the distribution of distance between oscillation traces (detected using our method), indicating that some oscillation traces are farther apart than 1 km. This suggests that, as we noted earlier in Section 2.3, a clustering method with a specific spatial constraint may fail to detect faraway oscillation traces.

Fig. 13.

Fig. 13

(a) Distribution of different oscillation patterns detected. (b) The distribution of distances between oscillation traces detected (Tw = 5 min).

5.3. Trajectory

As a part of our evaluation, in Fig. 14 we visualize two trajectories of one user before and after the data is processed. The trajectories on the left are drawn from the original data and the right ones are from the processed data. We note that common places visited in multiple days are revealed and the trajectories become smoother after we remove oscillation sequences.

Fig. 14.

Fig. 14

Trajectories of one sample user. (a) before the data processing and (b) after the data processing. (We show two trajectories in different colors. For a clear observation, only traces with a duration (i.e. duration > 0) are shown here.)

Previous studies show that individuals’ trajectories are characterized by statistical regularity: preferable returns to a few locations and occasional explorations to other faraway locations (González et al., 2008; Pappalardo et al., 2015; Song et al., 2010a; Susilo and Axhausen, 2014). As the measurement of mobility regularity, the visiting frequency f of the k-th most visited locations is observed following a scaling law: fkkξ. Here, as the indicator of mobility regularity, a larger exponential factor ξ suggests a higher level of regularity: less exploration into new places and more returns to familiar locations. The mobility analysis in the previous study, which is based on sightings data as well, shows ξ = 1.04 (Jiang et al., 2013). In Fig. 15, we test this scaling property with the datasets before and after removing oscillation sequences. We notice that though the regularity of mobility pattern could be revealed from both datasets, the exponential factors are different: the data processed by the first stage of our framework only (i.e. clustering algorithms) yields ξ = 1.02, which is consistent with Jiang et al.’s study (2013); while after removing the oscillation sequences, ξ decreases significantly to 0.88. Therefore, the oscillation phenomenon leads to the non-negligible overestimation of the regularity of individuals’ mobility.

Fig. 15.

Fig. 15

The location visiting frequency (Tw = 5 min).

6. Discussions

As noted earlier, despite the rapid surge of studies using mobile phone data, our understanding on the nature of the data is very limited—few describes its issues and documents the procedures applied to such data to derive mobility trajectories. This paper is to fill this gap. Our study proposes a data-processing framework to specifically address two problems in the mobile phone sightings data: locational uncertainty and oscillation. For each problem, we extend existing works and propose our algorithms. These algorithms are simple to implement. We illustrate the effectiveness of the proposed framework by comparing to the results using the state of the art algorithms. As shown in Fig. 15, without applying those data processing steps, a fundamental characteristic of human mobility patterns—regularity (González et al., 2008; Jiang et al., 2013; Song et al., 2010a)—is overestimated. Chen et al. (2016) reviewed two lines of research for human mobility research—traditional travel behavior analysis using household travel surveys and human mobility analysis using big data such as mobile phone data. They noted that though regularity and variety are found in both lines of works, variety is much more pronounced in those studies using household travel surveys. Fig. 15 suggests one potential underlying cause of this discrepancy between the two lines of research may be due to our lack of understanding of the nature of mobile phone data and insufficient processing on them.

One known issue of mobile phone data that is not addressed in the paper relates to representativeness. Put it more plainly, since users in a mobile phone data are not the result of a probabilistic sample (as for household travel surveys) and all users voluntarily select into being the subscribers to certain service networks, there is an issue of selection bias (Arai et al., 2015; Wesolowski et al., 2013). We do not address it here as it itself is a complicated issue and addressing it at requires a complete separate study. We briefly discuss a number of complications here. First, the underlying population of mobile phone data is often unknown, as the data provider is often unwilling to release such information. Second, there is lack of ground truth information, an issue related to the unknown underlying population. Because of these two reasons, the usual means of validating against external, ground truth data is infeasible (Chen et al., 2014) and new methods and metrics must be developed for validation purposes. Third, there is a need to uncouple the strong correlation between trip rates and communication frequencies: more connections to communication networks, more sightings resulted and thus more trips derived (Iovan et al., 2013; Yuan et al., 2012). This correlation also suggests that for those studies relying only on active users, their results are likely affected by this issue (Alexander et al., 2015b; Wang et al., 2010). Lastly, because of the temporal sparsity and spatial uncertainty naturally associated with mobile phone data, there is a need for new thinking and new interpretations on the scale and meanings of representativeness (Lu et al., 2017; Zhao et al., 2016). In other words, a day of trajectory from mobile phone data may not be comparable to a trajectory derived from a household travel survey; rather, to be comparable to the latter, it may take trajectories of multiple days. Taking it further, since a key advantage of mobile phone data is its longitudinal nature, new metrics may be developed to capture how representative mobility patterns evolve over time (Chikaraishi et al., 2009, 2011).

Though the reported data-processing framework is developed for the sightings data, some of the steps within the framework can also be applied to the CDR data. As noted earlier and described in more detail in Chen et al. (2016), the location estimates in the CDR data are those of cell towers, rather than distinct coordinates for the devices in sightings data. This suggests that the clustering step as described in our paper (Section 4.3.2) is not needed. However, retaining locations at the cell tower level may miss trips within a single cell area. In this study, after clustering the sightings data, the duration of clusters is then computed. This duration represents the time a user spends at the location. For CDR data, one would find the estimation of the activity duration is similar to the one in sightings data. The second stage of our framework (detecting and removing oscillation traces) can be applied to CDR data directly, as CDR data also has the oscillation issue. Given that the CDR data is temporally sparser than the sightings data (Chen et al., 2016), there may be more difficulty in differentiating oscillations from actual movements, as the detection of oscillations relies on the temporal distribution of traces. Therefore, a careful selection of time-window Tw is required.

It is our hope that the current study will spark many more studies discussing the nature of the data and its issues, documenting and reporting the various data processing procedures that are used, and addressing the representativeness issue. It is undoubted that such studies are of great importance—they are the foundations of those resulting trajectories on which new discoveries are made. This is particularly true if one is looking to use such data to inform our transportation policies.

Acknowledgments

The study is supported by the funding provided by the U.S. National Institute of Health (NIH) (1R01GM108731-01A1). The authors are grateful to Dr. Jae Hyun Lee’s help in generating Fig. 3 as well as his participation in related discussions. Comments from three anonymous reviewers also greatly strengthen our paper. Views expressed in the paper do not represent those of NIH and the authors are responsible for all errors that may remain.

References

  1. Ahas R, Aasa A, Silm S, Tiru M. Daily rhythms of suburban commuters’ movements in the Tallinn metropolitan area: Case study with mobile positioning data. Transp Res Part C Emerg Technol. 2010a;18:45–54. http://dx.doi.org/10.1016/j.trc.2009.04.011. [Google Scholar]
  2. Ahas R, Silm S, Järv O, Saluveer E, Tiru M. Using mobile positioning data to model locations meaningful to users of mobile phones. J Urban Technol. 2010b;17:3–27. http://dx.doi.org/10.1080/10630731003597306. [Google Scholar]
  3. Alexander L, Jiang S, Murga M, González MC. Origin–destination trips by purpose and time of day inferred from mobile phone data. Transp Res Part C Emerg Technol. 2015a;58:240–250. http://dx.doi.org/10.1016/j.trc.2015.02.018. [Google Scholar]
  4. Alexander L, Jiang S, Murga M, González MC. Origin–destination trips by purpose and time of day inferred from mobile phone data. Transp Res Part C Emerg Technol, Big Data Transport Traffic Eng. 2015b;58 (Part B):240–250. http://dx.doi.org/10.1016/j.trc.2015.02.018. [Google Scholar]
  5. Andrienko G, Andrienko N, Hurter C, Rinzivillo S, Wrobel S. Visual Analytics Science and Technology (VAST), 2011 IEEE Conference on. IEEE; 2011. From movement tracks through events to places: extracting and characterizing significant places from mobility data; pp. 161–170. [Google Scholar]
  6. Arai A, Witayangkurn A, Horanont T, Shao X, Shibasaki R. Understanding the unobservable population in call detail records through analysis of mobile phone user calling behavior: a case study of greater Dhaka in Bangladesh. 2015 IEEE International Conference on Pervasive Computing and Communications (PerCom) 2015:207–214. http://dx.doi.org/10.1109/PERCOM.2015.7146530.
  7. Bayir MA, Demirbas M, Eagle N. Mobility profiler: a framework for discovering mobility profiles of cell phone users. Pervasive Mob Comput. 2010;6:435–454. http://dx.doi.org/10.1016/j.pmcj.2010.01.003. [Google Scholar]
  8. Becker R, Cáceres R, Hanson K, Isaacman S, Loh JM, Martonosi M, Rowland J, Urbanek S, Varshavsky A, Volinsky C. Human mobility char-acterization from cellular network data. Commun ACM. 2013;56:74–82. http://dx.doi.org/10.1145/2398356.2398375. [Google Scholar]
  9. Calabrese F, Colonna M, Lovisolo P, Parata D, Ratti C. Real-time urban monitoring using cell phones: a case study in Rome. IEEE Trans Intell Transp Syst. 2011a;12:141–151. http://dx.doi.org/10.1109/TITS.2010.2074196. [Google Scholar]
  10. Calabrese F, Diao M, Di Lorenzo G, Ferreira J, Ratti C. Understanding individual mobility patterns from urban sensing data: a mobile phone trace example. Transp Res Part C Emerg Technol. 2013;26:301–313. http://dx.doi.org/10.1016/j.trc.2012.09.009. [Google Scholar]
  11. Calabrese F, Lorenzo GD, Liu L, Ratti C. Estimating origin-destination flows using mobile phone location data. IEEE Pervasive Comput. 2011b;10:36–44. http://dx.doi.org/10.1109/MPRV.2011.41. [Google Scholar]
  12. Candia J, González MC, Wang P, Schoenharl T, Madey G, Barabási AL. Uncovering individual and collective human dynamics from mobile phone records. J Phys Math Theor. 2008;41:224015. http://dx.doi.org/10.1088/1751-8113/41/22/224015. [Google Scholar]
  13. Chen C, Bian L, Ma J. From traces to trajectories: how well can we guess activity locations from mobile phone traces? Transp Res Part C Emerg Technol. 2014;46:326–337. [Google Scholar]
  14. Chen C, Ma J, Susilo Y, Liu Y, Wang M. The promises of big data and small data for travel behavior (aka human mobility) analysis. Transp Res Part C Emerg Technol. 2016;68:285–299. doi: 10.1016/j.trc.2016.04.005. http://dx.doi.org/10.1016/j.trc.2016.04.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Chikaraishi M, Fujiwara A, Zhang J, Axhausen K. Exploring variation properties of departure time choice behavior by using multilevel analysis approach. Transp Res Rec J Transp Res Board. 2009;2134:10–20. http://dx.doi.org/10.3141/2134-02. [Google Scholar]
  16. Chikaraishi M, Fujiwara A, Zhang J, Axhausen K, Zumkeller D. Changes in variations of travel time expenditure. Transp Res Rec J Transp Res Board. 2011;2230:121–131. http://dx.doi.org/10.3141/2230-14. [Google Scholar]
  17. Csáji BC, Browet A, Traag VA, Delvenne JC, Huens E, Van Dooren P, Smoreda Z, Blondel VD. Exploring the mobility of mobile phone users. Phys Stat Mech Appl. 2013;392:1459–1473. http://dx.doi.org/10.1016/j.physa.2012.11.040. [Google Scholar]
  18. Demissie MG, de Almeida Correia GH, Bento C. Intelligent road traffic status detection system through cellular networks handover information: an exploratory study. Transp Res Part C Emerg Technol. 2013;32:76–88. http://dx.doi.org/10.1016/j.trc.2013.03.010. [Google Scholar]
  19. Dong H, Wu M, Ding X, Chu L, Jia L, Qin Y, Zhou X. Traffic zone division based on big data from mobile phone base stations. Transp Res Part C Emerg Technol Big Data Transport Traffic Eng. 2015;58(Part B):278–291. http://dx.doi.org/10.1016/j.trc.2015.06.007. [Google Scholar]
  20. Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. 1996:226–231. [Google Scholar]
  21. Fraley C. Algorithms for model-based Gaussian hierarchical clustering. SIAM J Sci Comput. 1998;20:270–281. [Google Scholar]
  22. Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc. 2002;97:611–631. http://dx.doi.org/10.1198/016214502760047131. [Google Scholar]
  23. Gao S, Wang Y, Gao Y, Liu Y. Understanding urban traffic-flow characteristics: a rethinking of betweenness centrality. Environ Plan B Plan Des. 2013;40:135–153. http://dx.doi.org/10.1068/b38141. [Google Scholar]
  24. González MC, Hidalgo CA, Barabási AL. Understanding individual human mobility patterns. Nature. 2008;453:779–782. doi: 10.1038/nature06958. http://dx.doi.org/10.1038/nature06958. [DOI] [PubMed] [Google Scholar]
  25. Greater Buffalo-Niagara Transportation Survey. 2002, 2003 < https://gbnrtc-mpo-osl2.squarespace.com/s/GBNRTC_Onboard_Transit_Study.pdf>.
  26. Hard E, Byron C, Praprut S, Steve F, Darrell B, Lisa G. (Federal Highway Administration report FHWA-OR-15-01).Synopsis of New Methods and Technologies to Collect Origin-Destination (O-D) Data (No FHWA-HEP-16-083) 2016 < https://www.fhwa.dot.gov/planning/tmip/publications/other_reports/>.
  27. Hariharan R, Toyama K. Project Lachesis: Parsing and Modeling Location Histories. In: Egenhofer MJ, Freksa C, Miller HJ, editors. Geographic Information Science, Lecture Notes in Computer Science; Presented at the International Conference on Geographic Information Science; Berlin, Heidelberg: Springer; 2004. pp. 106–124. http://dx.doi.org/10.1007/978-3-540-30231-5_8. [Google Scholar]
  28. Horn C, Klampfl S, Cik M, Reiter T. Detecting outliers in cell phone data. Transp Res Rec J Transp Res Board. 2014;2405:49–56. http://dx.doi.org/10.3141/2405-07. [Google Scholar]
  29. Iovan C, Olteanu-Raimond AM, Couronné T, Smoreda Z. Moving and calling: mobile phone data quality measurements and spatiotemporal uncertainty in human mobility studies. In: Vandenbroucke D, Bucher B, Crompvoets J, editors. Geographic Information Science at the Heart of Europe. Springer; International Publishing, Cham: 2013. pp. 247–265. [Google Scholar]
  30. Iqbal MS, Choudhury CF, Wang P, González MC. Development of origin–destination matrices using mobile phone call data. Transp Res Part C Emerg Technol. 2014;40:63–74. http://dx.doi.org/10.1016/j.trc.2014.01.002. [Google Scholar]
  31. Isaacman S, Becker R, Cáceres R, Kobourov S, Martonosi M, Rowland J, Varshavsky A. International Conference on Pervasive Computing. Springer; 2011. Identifying important places in people’s lives from cellular network data; pp. 133–151. [Google Scholar]
  32. Järv O, Ahas R, Witlox F. Understanding monthly variability in human activity spaces: a twelve-month study using mobile phone call detail records. Transp Res Part C Emerg Technol. 2014;38:122–135. http://dx.doi.org/10.1016/j.trc.2013.11.003. [Google Scholar]
  33. Järv O, Tenkanen H, Toivonen T. Enhancing spatial accuracy of mobile phone data using multi-temporal dasymetric interpolation. Int J Geogr Inf Sci. 2017;31:1630–1651. http://dx.doi.org/10.1080/13658816.2017.1287369. [Google Scholar]
  34. Jiang S, Ferreira J, Gonzalez MC. Activity-based human mobility patterns inferred from mobile phone data: a case study of Singapore. IEEE Trans Big Data. 2017:1–1. http://dx.doi.org/10.1109/TBDATA.2016.2631141.
  35. Jiang S, Fiore GA, Yang Y, Ferreira J, Jr, Frazzoli E, González MC. Proceedings of the 2nd ACM SIGKDD International Workshop on Urban Computing, UrbComp ’13. ACM; New York, NY, USA: 2013. A review of urban computing for mobile phone traces: current methods, challenges and opportunities; pp. 2:1–2:9. http://dx.doi.org/10.1145/2505821.2505828. [Google Scholar]
  36. Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY. An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans Pattern Anal Mach Intell. 2002;24:881–892. http://dx.doi.org/10.1109/TPAMI.2002.1017616. [Google Scholar]
  37. Lee JK, Hou JC. Proceedings of the 7th ACM International Symposium on Mobile Ad Hoc Networking and Computing, MobiHoc ’06. ACM; New York, NY, USA: 2006. Modeling steady-state and transient behaviors of user mobility: formulation, analysis, and application; pp. 85–96. http://dx.doi.org/10.1145/1132905.1132915. [Google Scholar]
  38. Louail T, Lenormand M, Ros OGC, Picornell M, Herranz R, Frias-Martinez E, Ramasco JJ, Barthelemy M. From mobile phone data to the spatial structure of cities. Sci Rep. 2014;4:5276. doi: 10.1038/srep05276. http://dx.doi.org/10.1038/srep05276. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Lu S, Fang Z, Zhang X, Shaw SL, Yin L, Zhao Z, Yang X. Understanding the representativeness of mobile phone location data in characterizing human mobility indicators. ISPRS Int J Geo-Inf. 2017;6:7. http://dx.doi.org/10.3390/ijgi6010007. [Google Scholar]
  40. Milone R. Presented at 15th TRB National Transportation Planning Applications Conferrence. Metropolitan Washington Council of Governments; 2015. Preliminary Evaluation of Cellular Origin-Destination Data as a Basis for Forecasting Non-Resident Travel. < https://www.trbappcon.org/2015conf/Program.aspx>. [Google Scholar]
  41. Palchykov V, Mitrović M, Jo HH, Saramäki J, Pan RK. Inferring human mobility using communication patterns. Sci Rep. 2014;4:6174. doi: 10.1038/srep06174. http://dx.doi.org/10.1038/srep06174. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Pappalardo L, Simini F, Rinzivillo S, Pedreschi D, Giannotti F, Barabási AL. Returners and explorers dichotomy in human mobility. Nat Commun. 2015;6:8166. doi: 10.1038/ncomms9166. http://dx.doi.org/10.1038/ncomms9166. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Qi L, Qiao Y, Abdesslem FB, Ma Z, Yang J. Proceedings of the First Workshop on Mobile Data, MobiData ’16. ACM; New York, NY, USA: 2016. Oscillation resolution for massive cell phone traffic data; pp. 25–30. http://dx.doi.org/10.1145/2935755.2935759. [Google Scholar]
  44. Qu Y, Gong H, Wang P. Transportation mode split with mobile phone data. 2015 IEEE 18th International Conference on Intelligent Transportation Systems. 2015:285–289. http://dx.doi.org/10.1109/ITSC.2015.56.
  45. Serok N, Blumenfeld-Lieberthal E. A Simulation model for intra-urban movements. Plos One. 2015;10:e0132576. doi: 10.1371/journal.pone.0132576. http://dx.doi.org/10.1371/journal.pone.0132576. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Shad SA, Chen E, Bao T. Cell oscillation resolution in mobility profile building. 2012 ArXiv Prepr. ArXiv12065795. [Google Scholar]
  47. Song C, Koren T, Wang P, Barabási AL. Modelling the scaling properties of human mobility. Nat Phys. 2010a;6:818–823. http://dx.doi.org/10.1038/nphys1760. [Google Scholar]
  48. Song C, Qu Z, Blumm N, Barabási AL. Limits of predictability in human mobility. Science. 2010b;327:1018–1021. doi: 10.1126/science.1177170. http://dx.doi.org/10.1126/science.1177170. [DOI] [PubMed] [Google Scholar]
  49. Stabler B, Sikder S. (TREC Friday Seminar Series).Development of the Idaho Statewide Travel Demand Model Trip Matrices Using Cell Phone OD Data and Origin Destination Matrix Estimation. 2014 < https://pdxscholar.library.pdx.edu/trec_seminar/30/>.
  50. Susilo YO, Axhausen KW. Repetitions in individual daily activity–travel–location patterns: a study using the Herfindahl-Hirschman Index. Transportation. 2014;41:995–1011. http://dx.doi.org/10.1007/s11116-014-9519-4. [Google Scholar]
  51. Toole JL, Colak S, Sturt B, Alexander LP, Evsukoff A, González MC. The path most traveled: Travel demand estimation using big data resources. Transp Res Part C Emerg Technol Big Data Transport Traffic Eng. 2015;58(Part B):162–177. http://dx.doi.org/10.1016/j.trc.2015.04.022. [Google Scholar]
  52. Transportation Research Board. Does the Built Environment Influence Physical Activity?: Examining the Evidence – Special Report 282. Institute of Medicine of the National Academies; 2005. [Google Scholar]
  53. Vlahogianni EI, Park BB, van Lint JWC. Big data in transportation and traffic engineering. Transp Res Part C Emerg Technol, Big Data Transport Traffic Eng. 2015;58 (Part B):161. http://dx.doi.org/10.1016/j.trc.2015.08.006. [Google Scholar]
  54. Wang H, Calabrese F, Lorenzo GD, Ratti C. Transportation mode inference from anonymized and aggregated mobile phone call detail records. 2010 13th International IEEE Conference on Intelligent Transportation Systems (ITSC) 2010:318–323. http://dx.doi.org/10.1109/ITSC.2010.5625188.
  55. Wang M. Understanding Activity Location Choice with Mobile Phone Data (Thesis) 2014 < https://digital.lib.washington.edu/researchworks/handle/1773/26523>.
  56. Wang M, Chen C, Ma J. Time-of-day dependence of location variability: application of passively-generated mobile phone dataset. Presented at the Transportation Research Board 94th Annual Meeting 2015 [Google Scholar]
  57. Wang P, Hunter T, Bayen AM, Schechtner K, González MC. Understanding road usage patterns in urban areas. Sci Rep. 2012;2 doi: 10.1038/srep01001. http://dx.doi.org/10.1038/srep01001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Wang T, Chen C, Ma J. Mobile phone data as an alternative data source for travel behavior studies. Presented at the Transportation Research Board 93rd Annual MeetingTransportation Research Board 2014 [Google Scholar]
  59. Wesolowski A, Eagle N, Noor AM, Snow RW, Buckee CO. The impact of biases in mobile phone ownership on estimates of human mobility. J R Soc Interface. 2013;10:20120986. doi: 10.1098/rsif.2012.0986. http://dx.doi.org/10.1098/rsif.2012.0986. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Widhalm P, Yang Y, Ulm M, Athavale S, González MC. Discovering urban activity patterns in cell phone data. Transportation. 2015;42:597–623. http://dx.doi.org/10.1007/s11116-015-9598-x. [Google Scholar]
  61. Wu W, Wang Y, Gomes JB, Anh DT, Antonatos S, Xue M, Yang P, Yap GE, Li X, Krishnaswamy S, Decraene J, Nash AS. Oscillation resolution for mobile phone cellular tower data to enable mobility modelling. 2014 IEEE 15th International Conference on Mobile Data Management. 2014:321–328. http://dx.doi.org/10.1109/MDM.2014.46.
  62. Xu Y, Shaw SL, Zhao Z, Yin L, Fang Z, Li Q. Understanding aggregate human mobility patterns using passive mobile phone location data: a home-based approach. Transportation. 2015;42:625–646. http://dx.doi.org/10.1007/s11116-015-9597-y. [Google Scholar]
  63. Xu Y, Shaw SL, Zhao Z, Yin L, Lu F, Chen J, Fang Z, Li Q. Another tale of two cities: understanding human activity space using actively tracked cellphone location data. Ann Am Assoc Geogr. 2016;106:489–502. http://dx.doi.org/10.1080/00045608.2015.1120147. [Google Scholar]
  64. Yin M, Qiao SM, Feygin S, Paiement J-F, Pozdnoukhov A. IEEE Transactions in ITS, MobiData ’16. ACM; New York, NY, USA: 2017. A generative model of urban activities from cellular data; pp. 25–30. http://dx.doi.org/10.1145/2935755.2935759. [Google Scholar]
  65. Yuan Y, Raubal M, Liu Y. Correlating mobile phone usage and travel behavior – a case study of Harbin, China. Comput Environ Urban Syst, Special Issue: Geoinform. 2012;36:118–130. http://dx.doi.org/10.1016/j.compenvurbsys.2011.07.003. [Google Scholar]
  66. Zhao Z, Shaw SL, Xu Y, Lu F, Chen J, Yin L. Understanding the bias of call detail records in human mobility research. Int J Geogr Inf Sci. 2016;30:1738–1762. http://dx.doi.org/10.1080/13658816.2015.1137298. [Google Scholar]
  67. Zheng VW, Zheng Y, Xie X, Yang Q. Proceedings of the 19th International Conference on World Wide Web, WWW ’10. ACM; New York, NY, USA: 2010. Collaborative location and activity recommendations with GPS history data; pp. 1029–1038. http://dx.doi.org/10.1145/1772690.1772795. [Google Scholar]

RESOURCES