Summary
With increasing availability of smartphones with Global Positioning System (GPS) capabilities, large-scale studies relating individual-level mobility patterns to a wide variety of patient-centered outcomes, from mood disorders to surgical recovery, are becoming a reality. Similar past studies have been small in scale and have provided wearable GPS devices to subjects. These devices typically collect mobility traces continuously without significant gaps in the data, and consequently the problem of data missingness has been safely ignored. Leveraging subjects’ own smartphones makes it possible to scale up and extend the duration of these types of studies, but at the same time introduces a substantial challenge: to preserve a smartphone’s battery, GPS can be active only for a small portion of the time, frequently less than , leading to a tremendous missing data problem. We introduce a principled statistical approach, based on weighted resampling of the observed data, to impute the missing mobility traces, which we then summarize using different mobility measures. We compare the strengths of our approach to linear interpolation (LI), a popular approach for dealing with missing data, both analytically and through simulation of missingness for empirical data. We conclude that our imputation approach better mirrors human mobility both theoretically and over a sample of GPS mobility traces from 182 individuals in the Geolife data set, where, relative to LI, imputation resulted in a 10-fold reduction in the error averaged across all mobility features.
Keywords: GPS, Imputation, mHealth, Missing data, Mobility, Precision medicine
1. Introduction
The Global Positioning System (GPS) is a navigation system that uses a device’s distance from a number of satellites in orbit to determine the location of the device. GPS has a wide range of applications. For example, in transportation GPS has been used to complement self-report surveys on travel activity (Chapman and Frank, 2007; Stopher and others, 2007; Zhou and Golledge, 2003; Shen and Stopher, 2014). One study instrumented participants with both a GPS receiver and an accelerometer and showed that combining both data sources improved the prediction of a person’s mode of activity (Troped and others, 2008). Another study outfitted children with GPS devices and showed that their travel behavior was different when they were accompanied by an adult as opposed to when they were on their own (Mackett and others, 2007).
Another burgeoning area of application is mobile health and the application of GPS to social and behavioral research in a wide variety of contexts (Wolf and Jacobs, 2010). For example, GPS devices on older (63 years of age) care-recipients were used to show that caregiver burden was negatively correlated with the amount of time a care-recipient spent walking per day (Werner and others, 2012). Two studies found that mobility measures extracted from GPS data were correlated with depressive symptom severity (Saeb and others, 2015; Canzian and Musolesi, 2015), and another study was able to predict changes of state for bipolar patients with accuracy (Gruenerbl and others, 2014). Amongst pregnant women at risk for perinatal depression, women with a larger radius of travel were found to have milder depressive symptoms than the women with more severe symptoms (Faherty and others, 2017). Following spine surgery, GPS tracking was used to monitor patient recovery and found that increased mobility corresponded with a successful recovery (Yair and others, 2011). GPS has also shown promise at keeping track of wandering dementia patients (Miskelly, 2005) as well as at monitoring the mobility of patients with Alzheimer’s disease (Shoval and others, 2008). By combining pollution measurements from air samples and a person’s GPS trace, exposure levels at the individual-level were calculated (Phillips and others, 2001). Several behavioral traits measured passively through the smartphone use of schizophrenia patients were found to be associated with self-reported measures of mental health (Wang and others, 2016).
In most of the above studies, GPS data were collected from study participants either through a study-provided smartphone or wearable GPS receiver. While data collected in this way will have minimal missingness, there are three distinct disadvantages to supplying the GPS device. Firstly, this study design is not scalable because it is expensive to provide GPS devices to a large number of participants. Secondly, adherence to wearable devices typically declines sharply after a few months (AsPC., 2015), making long-term longitudinal studies less feasible. Thirdly, the data may be biased in unpredictable ways due to the interference that can arise from introducing a new device into a participant’s life (Ainsworth and others, 2013). All of these shortcomings can be avoided by taking advantage of built-in GPS devices in the smartphones of participants. In 2017, of US adults owned smartphones, up from in 2011 (Smith, 2015; Cassagnol, 2017), and this number is expected to continue to increase. Anonymized call detail records (CDRs), resulting from mobile phone communication events, have been used to study both social networks (Onnela and others, 2007) and mobility patterns (Gonzalez and others, 2008) at scale, and the analysis and modeling of CDRs for different purposes has since become an active field of its own (Blondel and others, 2015). However, for the purposes of inferring mobility metrics, these data are quite limited: the location of the person is only available at the level of the cell tower used to transmit the event, and even this is only available at the time of communication (either via calls or text messages).
Smartphone-based mobility traces from GPS are therefore more precise than CDRs both spatially and temporally. Importantly, they also make it possible to link the smartphone data with individual-level covariates, which could range from simple demographic variables to fMRI imaging or genome sequencing data. This linkage is central to digital phenotyping, which we have previously defined as the “moment-by-moment quantification of the individual-level human phenotype in situ using data from personal digital devices” (Onnela and Rauch, 2016; Torous and others, 2016).
To make use of the ubiquity of smartphones in biomedical research settings, we have developed an open source research platform for digital phenotyping called Beiwe that includes customizable iOS and Android smartphone apps that, among other features, can record a phone’s GPS trace using a user-specified sampling scheme (Torous and others, 2016). For a smartphone app to be scalable and enable long-term data collection, it must not impose too much of a drain on the phone’s battery. Study adherence, which in this context means not uninstalling the app during the study, could be in jeopardy if the participant notices a significant drop in battery life. Of all the current smartphone sensors, GPS is the most expensive with regard to battery usage (Miller, 2012). To lower the strain on the battery, the Beiwe platform records GPS over short intervals, called on-cycles, separated by periods of inactivity, called off-cycles. The platform enables users to specify the length of on-cycles and off-cycles at will; for example, the on-period might be 2 min and the off-period 10 min. However, these off-periods lead to a large portion of the GPS trace being missing. Competing smartphone platforms for digital phenotyping face similar missing data problems.
To the best of our knowledge, there is currently no principled method of handling continuous missing GPS data at the individual level (Krenn and others, 2011; Jankowska and others, 2015). There have been methods developed that avoid using the coordinate pairs of GPS data by representing significant locations as nodes in a network (Liao and others, 2007), and there are methods that require the trajectories of large samples of individuals to construct road networks (Li and others, 2016). Studies that model trajectories at the individual level so far have either ignored missing data (Canzian and Musolesi, 2015), or have used linear interpolation (LI) assuming travel at constant m/s over the missing interval (Rhee and others, 2007, 2011). With the rise of research in digital phenotyping, large-scale research studies aiming to measure patient-centered outcomes and behavioral phenotypes in naturalistic settings over long periods of time are becoming more prevalent, necessitating the development of statistical methods that properly account for missingness. Here, we introduce a statistical approach for imputing the missing trajectories present in an individual’s mobility trace that attempts to simulate human mobility patterns at the individual level, i.e. using only the observed data from that same individual without reliance on any external data. The properties of this approach are compared analytically to LI, as well as across high-frequency complete-data mobility traces from 182 individuals in the Geolife data set (Zheng and others, 2009, 2008, 2010). Compared to LI, our imputation approach offers significant reductions in error relative to the ground truth in the estimation of a wide variety of mobility measures, with a 10-fold improvement in the average error.
2. Methods
2.1. Mapping longitude and latitude to a 2D plane
While raw GPS data consists of a sequence of longitude and latitude coordinates that trace a person’s location on the surface of the Earth, most mobility metrics are computed for data in a 2D Euclidean plane, requiring a transformation of coordinates as the first step. Because of the differences in geometry, there will always be distortion when mapping the surface of a 3D sphere to a 2D plane, although the distortion is smaller the smaller the region of the sphere’s surface being mapped. People typically do not travel far enough on a daily basis for this projection to greatly distort their mobility traces. For the purposes of extracting mobility measures, a person’s mobility trace can be mapped to a 2D plane on an individual basis, as opposed to using a universal projection such as the Mercator projection (Maling, 2013). By allowing each person their own projection, distortion can be minimized by selecting the projection best suited for each individual. We detail these individualized projections here.
Consider a person’s mobility trace where we let , , , and be the minimum and maximum latitude and longitude attained over the study period, respectively. Note that these parameters are specific to each individual. By projecting the region bounded by the points (,), (,), (,), and (,) onto an isosceles trapezoid, the distortion of the projection is greatly reduced (see Figure S1 of supplementary material available at Biostatistics online). To map a specific coordinate to the X–Y plane, let , , , , and , where m represents the Earth’s radius in meters. Then the corresponding pair is:
As a reference point, we assign the origin to (, ).
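To make the idea of a per-individual projection concrete, the sketch below maps latitude and longitude to planar metres using a local equirectangular approximation scaled at the mid-latitude of the individual's own bounding box. This is a simpler stand-in and not the isosceles trapezoid projection described above. The function name `project_local`, the choice of the south-west corner of the bounding box as the origin, and the Earth radius value are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6.371e6  # assumed mean Earth radius in metres


def project_local(points):
    """Map (lat, lon) pairs in degrees to (x, y) metres on a local plane.

    Assumption: equirectangular approximation scaled at the mid-latitude of
    this individual's bounding box, with the south-west corner as the origin.
    This is NOT the trapezoid projection of the paper.
    """
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    lat_min, lat_max = min(lats), max(lats)
    lon_min = min(lons)
    # East-west distances shrink with the cosine of latitude.
    cos_mid = math.cos(math.radians((lat_min + lat_max) / 2.0))
    projected = []
    for lat, lon in points:
        x = EARTH_RADIUS_M * math.radians(lon - lon_min) * cos_mid
        y = EARTH_RADIUS_M * math.radians(lat - lat_min)
        projected.append((x, y))
    return projected
```

Because the bounding box is computed per person, the scaling latitude adapts to where that person actually travels, which is the property the individualized projection is meant to exploit.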
2.2. Notation and model
First, a person’s GPS latitude and longitude coordinates are transformed to 2D plane coordinates using the projection detailed in 2.1. According to the rectangular method (Rhee and others, 2007), the data are next converted into a mobility trace defined by a sequence of flights and pauses. Flights are defined to be segments of linear movement and pauses are defined to be periods of time where a person does not move. Curved movement is approximated by multiple sequential flights. Also, if a missing interval is flanked by two pauses at the same location (situated within 50 m of one another), the missing interval is assumed to be a longer pause at the same location.
Suppose a person’s mobility trace begins at time at projected coordinates . The mobility trace is modeled as a sequence of events, where an event is either a flight of straight-line movement, or a pause. Let be the horizontal displacement of the th event, be the vertical displacement of the th event, and be the duration of the th event. The time of the th event is while the location at the start of the th event is . Letting ,
The indicator for missingness at time is . Due to the battery strain that GPS imposes, on smartphones GPS can only be activated for regularly scheduled short intervals (e.g., 2 min) of time, with large gaps (e.g., 10 min) between collection periods. This scheduled missingness occurs independent of a person’s mobility and therefore can be classified as missing completely at random (MCAR) (Little and Rubin, 2002), meaning that , where represents the random variables for location and time. While this scheduled missingness will undoubtedly account for the largest percentage of missing data in a person’s GPS trace, some GPS data may be missing not at random (MNAR) as related to a person’s mobility, such as powering off the phone, being inside a tall building, or geographical features inhibiting satellite connection, and this type of missing data is not accounted for here.
We model the event displacements with the joint density of where and are density functions for flights and pauses conditional on the time and location at the start of the th event, respectively, when the th event is a flight and when the th event is a pause, with . The probability of the th event being a flight instead of a pause is where is the probability of observing a flight conditional on the previous event being a flight. The reason that is dependent on is because two consecutive pauses are impossible by definition, as they would simply combine to count as one longer pause at the same location. This forces conditional on . Note that and are distribution functions conditional on while is not. This conditioning allows the full distribution functions of flights and pauses to change with .
Continuity assumptions on , , and enable local weighted resampling. We consider three different continuity assumptions that correspond to upweighting events close in time, or temporally local (TL), upweighting events locally geographically (GL), or upweighting events at similar times of day at similar locations, or geographically local and circadian (GLC). Details regarding these continuity assumptions are left to the Supplementary Materials available at Biostatistics online.
2.3. Imputing missing trajectories
Our approach to dealing with missing data is to impute missing flight and pause events by resampling from observed events over each missing interval or sequence of events. A person’s true Cartesian location after projection at time is . Consider a period of missing data in a mobility trace that starts at time and ends at time . is unobserved over the time interval with and both being known. We aim to closely approximate over the missing interval by borrowing information from the observed mobility trace outside this interval. Previous approaches either ignore missing data altogether (Canzian and Musolesi, 2015) or have used LI between and , which essentially amounts to connecting the dots at and with a straight line. More precisely, LI estimates with on the interval , and this same linear trajectory is assumed at some pre-specified constant velocity, such as 1 m/s, from to (Rhee and others, 2007, 2011). While such a simple model of human mobility may be suitable for near-complete data with scarce missingness, smartphone GPS data with a substantial degree of missingness require a more careful treatment.
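For reference, the "connect the dots" form of LI can be sketched in a few lines. The version below spreads the straight-line movement evenly over the whole gap; the fixed-speed variant mentioned above is not shown. The function name and argument layout are hypothetical.

```python
def linear_interpolate(start, end, t_start, t_end, t):
    """Position at time t on the straight line joining the two known
    end points of a missing interval, travelled at constant speed.

    start, end: (x, y) projected coordinates at t_start and t_end.
    """
    if not t_start <= t <= t_end:
        raise ValueError("t must lie inside the missing interval")
    if t_end == t_start:
        return start
    frac = (t - t_start) / (t_end - t_start)
    return (start[0] + frac * (end[0] - start[0]),
            start[1] + frac * (end[1] - start[1]))
```

Every call with the same inputs returns the same point, so LI yields a single deterministic path and carries no information about the uncertainty introduced by the gap.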
To simulate a trajectory at any given time point , we first simulate if a flight occurs (as opposed to a pause) with probability . If a flight is determined to occur, then a flight is sampled from the empirical distribution function . If a pause is determined to occur, then a pause is sampled from . Letting the displacement for the th simulated consecutive event be denoted as , events are simulated until number of steps is reached such that is true, at which point the process is terminated and is declared as the displacement of the final simulated event of the imputed trajectory over the missing interval.
Let be the counting process for the number of simulated events elapsed by time and let be the time of the th event in the simulated trajectory. The simulated trajectory, used as an estimator for , bridged so it starts at and ends at is:
where
This bridging ensures that the flights retain the property of being straight lines, whereas bridging directly would lead to curvature.
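The resample-then-bridge idea can be illustrated with a minimal sketch under simplifying assumptions. Events are drawn from a single pooled list of observed flights and pauses, with optional weights standing in for the TL, GL, or GLC kernels; the separate flight/pause probability and the rule against consecutive pauses are not modeled. The bridging step used here, which spreads the end-point discrepancy evenly over the simulated events so that each flight remains a straight segment, is an assumption and may differ from the estimator defined above; in particular, it gives resampled pauses a small drift. All names are hypothetical.

```python
import random


def simulate_bridged_gap(start, end, gap_duration, events, weights=None, rng=None):
    """Impute one gap by resampling observed events, then bridging.

    events: list of (dx, dy, dt) tuples from observed flights and pauses
            (pauses have dx == dy == 0); every dt must be > 0.
    weights: optional resampling weights (uniform when None).
    Returns vertices [(t, x, y), ...] with t measured from the gap start.
    """
    rng = rng or random.Random()
    sampled = []
    remaining = gap_duration
    while remaining > 0:
        dx, dy, dt = rng.choices(events, weights=weights, k=1)[0]
        if dt >= remaining:
            # Truncate the final event so the path ends with the gap.
            frac = remaining / dt
            dx, dy, dt = dx * frac, dy * frac, remaining
            remaining = 0.0
        else:
            remaining -= dt
        sampled.append((dx, dy, dt))
    # Discrepancy between the required and the simulated end point.
    err_x = end[0] - (start[0] + sum(e[0] for e in sampled))
    err_y = end[1] - (start[1] + sum(e[1] for e in sampled))
    n_events = len(sampled)
    # Assumed bridging: add an equal share of the discrepancy to each event,
    # which shifts vertices only and keeps every segment straight.
    t, x, y = 0.0, start[0], start[1]
    vertices = [(t, x, y)]
    for dx, dy, dt in sampled:
        t += dt
        x += dx + err_x / n_events
        y += dy + err_y / n_events
        vertices.append((t, x, y))
    return vertices
```

Calling the function repeatedly with different random states produces different plausible paths that all start and end at the observed end points of the gap.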
Above we described our process for simulating a person’s trajectory over an interval of missing data. A person’s full mobility trace is likely to have multiple missing intervals, and this same approach can be applied equally to each missing interval. Imputing over the gaps in a person’s mobility trace in this fashion introduces variability, so repeated imputations over the same missing intervals are likely to produce variable trajectories. Mobility metrics, such as radius of gyration and distance traveled, will also vary with each imputed mobility trace. As a result, repeated imputations can be used to provide confidence bounds for any mobility metric of interest to account for the uncertainty that results from data missingness and subsequent imputation. After repeatedly simulating the same trajectories times and calculating the desired mobility metrics each time, the and the ordered values form the lower and upper confidence bounds of an -level confidence interval, respectively.
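The percentile construction of confidence bounds from repeated imputations can be sketched as follows. The exact order-statistic indices are an assumption (a common ceiling-based convention), and the function name and the default `alpha` are hypothetical.

```python
import math


def percentile_interval(values, alpha=0.05):
    """Empirical (1 - alpha) interval from repeated-imputation estimates.

    values: one mobility-metric estimate per imputed trace.
    Assumption: bounds are the ceil(B * alpha / 2)-th and
    ceil(B * (1 - alpha / 2))-th ordered values, counting from 1.
    """
    ordered = sorted(values)
    n = len(ordered)
    lo_idx = max(0, math.ceil(n * alpha / 2) - 1)
    hi_idx = min(n - 1, math.ceil(n * (1 - alpha / 2)) - 1)
    return ordered[lo_idx], ordered[hi_idx]
```

In use, `values` would hold, for example, the distance traveled computed on each of the imputed versions of the same day.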
3. Results
3.1. Analytical treatment of the expected gap between a mobility trace and its surrogates
Here we consider a model for a person’s mobility trace and then compare analytically the performance of our approach to LI in the ability to approximate the true trajectory. Consider a mobility trace with no pauses where each flight has the same arbitrary duration of one unit of time and where the displacement is independently distributed from the displacement of a flight. For the and flight displacements, let the expectations be functions of , and , and let the variances be constant, and , respectively. We assume each flight to be independent. Though such stringent independence assumptions would lead to unrealistic mobility traces, the analytic results that can be derived based on this model provide some insight into how the extent of missingness is related to the accuracy of the trajectories we simulate over the missing periods.
Let the length of a missing data interval be , and for let the mobility trace be . Without loss of generality we assume . If instead of continuous time we consider discrete time , this simplifies to . Though represents the actual trajectory, assume that the time period represents a period of missingness, and thus a period where is unobserved. We use a simplified version of our simulated trajectory estimate , which has been bridged so that and , where we only consider integer-valued : . Here and represent the and displacements of the th resampled flight, and are assumed to be independent from and distributed the same as and , respectively. Compare this to LI, which in this context simplifies to . Ideally, the simulated trajectory and the linearly interpolated trajectory should be ‘close’ to the true but unobserved trajectory . To measure closeness, we examine the average squared distance between the and across , or . We then do the same to compare the closeness of and . We seek to answer the question of how the length of the period of missingness, in this case , relates to the accuracy of the surrogate trajectories and used to replace the unobserved true trajectory .
We consider a family of trajectories to allow for varying degrees of curvature in . For a fixed we consider the mean displacements of the flight at time to be and , where is the expected distance of a flight. Under this model, corresponds to a straight trajectory whereas corresponds to a semicircular trajectory (see Figure 1). We investigate how close the simulated trajectory is to the actual trajectory : . By averaging this quantity across all in the missing interval we arrive at:
(3.1) |
An analogous calculation can be performed to see how close the linearly interpolated trajectory is to : , where
is a function of the . The derivation is left for the Supplementary Materials available at Biostatistics online. Again we average across all time points in the missing interval to arrive at:
(3.2) |
When comparing the expected gap between and in (3.1) and the expected gap between and in (3.2), only (3.2) has both a component as well as a component comprised of . The term in (3.2) is exactly of that in (3.1), and while the second component involving disappears in the case where , , or when is constant. In all other cases it can add considerably to the expected gap. Only when is always constant. In this case, we would expect the average squared distance between and to be twice as large as the average squared distance between and . In other words, when the expected trajectory has no curvature (i.e., it is a straight line), then LI is the best approximation of the true trajectory. As the true trajectory gains curvature, becomes a closer approximation to the true trajectory and LI becomes increasingly inaccurate (see Figure 2).
This result tells us that using simulated trajectories from the distribution of unobserved flights leads to a better accuracy, on average, than using LI to fill in the missing data; the only exception to this is if the true unobserved trajectory happens to be a straight line. While this result is demonstrated on a model that assumes we are able to simulate from the distribution of unobserved flights, which is generally not possible since normally only the distribution of observed flights is available, our goal is to come as close as possible to this scenario by borrowing information from the ‘closest’ observed flights. ‘Close’ could mean temporally close (using TL kernel), spatially close (using GL kernel), or close in the sense of leveraging the periodicity of human behavior due to the circadian rhythm along with spatial closeness (using GLC kernel).
3.2. Variability in the biases of mobility measure estimation
Regardless of the approach used for imputing over the missing intervals in a person’s mobility trace, there can still be substantial bias in the mobility estimates that are calculated from the imputed data. After all, each approach assumes a different model; LI assumes constant linear movement over a missing interval, TL assumes that flights and pauses that occur nearby in time come from the same distribution, GL assumes that flights and pauses that occur nearby in space come from the same distribution, and GLC assumes that flights and pauses that occur at the same time of day and at the same place come from the same distribution. These models each try to approximate the true nature of human mobility, but seldom will any of these models precisely hold true.
In addition, in most cases it is difficult to predict the direction of bias. We demonstrate this through an example by looking at an easily interpretable mobility measure: distance traveled. Consider a person who follows a semicircular trajectory with some added jitter to their movement (see Figure S2 of supplementary material available at Biostatistics online). Evenly spaced intervals of different sizes are removed to show how the bias changes as the extent of missingness increases/decreases. As expected, in each case as missingness decreases, the bias in the estimates of distance traveled decreases. The bias is predictably negative for LI because LI takes the shortest possible path over missing intervals, so it attains the lower bound for the distance traveled metric. In contrast, the TL model is less predictable in the direction of its bias. For the smoother trajectories the TL approach overestimates distance traveled, but for a large enough jitter the bias switches direction and becomes negative. In this small toy example, GL will mirror TL in how the data is weighted, and there is no routine for GLC to take advantage of, so both GL and GLC are omitted here.
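The claim that LI can only underestimate distance traveled follows from the triangle inequality, and a small noise-free illustration makes it visible. The radius, the sampling density, and the gap pattern below are arbitrary choices for the example.

```python
import math


def path_length(points):
    """Total length of the polyline through the given (x, y) points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


# A semicircle of radius 1000 m sampled every degree (the "complete" trace).
full_trace = [(1000.0 * math.cos(math.radians(k)),
               1000.0 * math.sin(math.radians(k))) for k in range(181)]

# Keep every 30th point; LI joins the retained points with straight chords.
observed = full_trace[::30]

full_distance = path_length(full_trace)
li_distance = path_length(observed)
```

Here `li_distance` is strictly smaller than `full_distance`, and thinning the observed points further can only shorten it, which is why the LI bias in distance traveled is never positive.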
Overall, the confidence band of TL accurately reflects the amount of missingness in the data, visible as the narrowing of the confidence band as more and more data are observed, whereas the LI approach as a point estimator shows equally misplaced certainty regardless of the amount of missing data. The fact that the confidence bands do not in general attain the nominal coverage of the true distance traveled demonstrates the flaws, created by design in this example, of the TL assumption that nearby flights come from the same distribution. Unfortunately, when the majority of data is missing, it will be nearly impossible to avoid all bias when estimating various measures of mobility. Instead, one must choose a modeling assumption, guided by domain-specific knowledge, that is as close to the truth as possible. To this end, in the next section, we compare various missing data imputation approaches across many measures of mobility in the context of empirical GPS data over a large sample of individuals.
3.3. Mobility measure estimation on a week-long empirical mobility trace
To generate a high-frequency GPS mobility trace, we had a test subject install an Android version of the Beiwe application on their phone for 1 week (Torous and others, 2016). The application was set to sample the smartphone GPS essentially in continuous time: the on-cycle was specified to be 119 min and the off-cycle just 1 min. Ultimately, due to the occasional loss of power or GPS signal to their phone, an average of 92 min of GPS trajectories per day were missing as opposed to the expected 12 min per day, but the high quality of this data set allows us to establish a de facto ground truth. The goal of this analysis is to take a subset of this data set and to simulate a higher rate of missingness, one that is likely observed in practice. We superimposed on top of the observed data a simulated 2-min on-cycle and 10-min off-cycle, and we calculated a variety of mobility measures on the data produced by multiple missing data imputation approaches (LI, TL, GL, and GLC). Here, we report the error for each approach on the estimated mobility measures as compared to the ground truth.
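Superimposing a simulated on/off schedule on a near-complete trace amounts to masking timestamps by their phase within the cycle. A minimal sketch, assuming timestamps in seconds and a schedule that starts at the first sample (both assumptions, as are the names):

```python
def on_cycle_mask(timestamps, on_s=120, off_s=600, t0=None):
    """Return a list of booleans, True where a sample falls in an on-cycle.

    timestamps: sample times in seconds.
    on_s, off_s: on-cycle and off-cycle lengths in seconds
                 (defaults give 2 min on and 10 min off).
    t0: start of the first on-cycle; defaults to the first timestamp.
    """
    if not timestamps:
        return []
    if t0 is None:
        t0 = timestamps[0]
    period = on_s + off_s
    return [((t - t0) % period) < on_s for t in timestamps]
```

With the default schedule, 2 of every 12 minutes are retained, so roughly five-sixths of the original samples are treated as missing.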
The person’s daily mobility trace for the full week is displayed in Figure 3. The mobility trace based on the simulated 2-min on-cycle and 10-min off-cycle (top row) is shown alongside the mobility trace based on the complete data (bottom row). The general movements, locations, and daily routines are accurately captured by the subset with missingness, but some of the details are of course lost. For each day, different mobility measures were calculated (detailed in the Supplementary Materials available at Biostatistics online), once for each missing data imputation approach and once for the ground truth. The estimates of the mobility measures for one example day are given in Table S1 of supplementary material available at Biostatistics online.
We also investigated sensitivity to changes in the cycle of data collection and planned missingness. Both 1-min on-cycles and 2-min on-cycles were paired with 10-min off-cycles, 20-min off-cycles, and 30-min off-cycles. Absolute relative errors to the ground truth of our test subject were calculated for each mobility measure in each of the six planned data collection settings (Table 1). While some measures, like radius of gyration and the probability of a pause, showed a clear decrease in accuracy as the missing interval was lengthened, some other measures, like the maximum distance from home and the number of significant locations visited, did not lose accuracy with increased missingness. In particular, the 1-min on/10-min off-cycle showed a general improvement in accuracy relative to the 2-min on/20-min off-cycle, despite the two different cycles having identical amounts of missingness.
Table 1.
Measures | 1on/10off | 1on/20off | 1on/30off | 2on/10off | 2on/20off | 2on/30off |
---|---|---|---|---|---|---|
Hometime | 2.75 | 6.29 | 7.53 | 2.18 | 5.58 | 5.83 |
DistTraveled | 15.02 | 16.06 | 24.59 | 8.59 | 22.44 | 21.75 |
RoG | 0.30 | 0.85 | 1.79 | 0.29 | 0.90 | 1.53 |
MaxDiam | 0.41 | 7.90 | 7.70 | 6.03 | 1.24 | 3.64 |
MaxHomeDist | 3.12 | 3.17 | 1.43 | 4.14 | 0.47 | 0.85 |
SigLocsVisited | 32.14 | 11.11 | 21.43 | 23.21 | 23.21 | 0.00 |
AvgFlightLen | 63.92 | 63.54 | 56.28 | 30.86 | 51.96 | 60.51 |
StdFlightLen | 61.41 | 78.42 | 65.73 | 25.17 | 50.40 | 64.48 |
AvgFlightDur | 37.74 | 53.97 | 50.63 | 15.57 | 32.34 | 36.29 |
StdFlightDur | 55.69 | 228.24 | 562.80 | 112.36 | 40.70 | 530.51 |
ProbPause | 6.00 | 9.58 | 12.72 | 4.06 | 9.68 | 13.13 |
SigLocEntropy | 1.73 | 7.71 | 4.38 | 3.06 | 2.94 | 5.50 |
CircdnRtn | 1.27 | 6.86 | 5.41 | 0.97 | 6.85 | 6.91 |
WkEndDayRtn | 1.32 | 6.52 | 3.51 | 1.72 | 6.38 | 6.41 |
3.4. Analysis of the Geolife data set
A larger sample of individuals is required for more generalized comparisons of the competing imputation methods, so we also used the complete data GPS trajectories from the Geolife data set for 855 outings across an additional 182 individuals (Zheng and others, 2009, 2008, 2010). Because only the mobility traces from user-specified outings are available, estimates of home and other significant locations are unreliable. As a result, we refrained from calculating those mobility measures that relate to significant locations for these data. Missingness was simulated according to the same 2-min on-cycle and 10-min off-cycle for each trajectory. Each competing imputation approach was applied to the data set with simulated missingness and evaluated against the ground truth. For each of the imputation approaches, TL, GL, and GLC, three different kernel parameter settings were considered. In each case , while the scale parameter was varied (increased by a factor of , , and ). Increasing the scale parameter gave greater weight to nearby observations in resampling. The error was calculated by subtracting the estimated measure under each missing data imputation approach from that same measure calculated on the full data (with near-continuously gathered GPS). For the simulation-based imputation approaches, we used the mean value of the estimated measure from 100 simulated samples in the error calculations.
A small error over most of the mobility measures would indicate that the resampling missing data imputation approaches (TL, GL, and GLC) do a good job of mimicking real human mobility patterns. To quantify this performance, the absolute values of the errors were averaged across all 855 outings for each mobility measure and for each imputation method (Table 2). Based on this metric, the worst performing missing data imputation approach was LI, with errors relative to the ground truth that were larger than those of the resampling-based approaches for the majority of the mobility measures. The best performing missing data imputation approach for these data was TL with the largest scale setting (TL.20 in Table 2), with nearly a 10-fold improvement in accuracy over LI.
Table 2.
Measures | LI | TL.1 | TL.10 | TL.20 | GL.1 | GL.10 | GL.20 | GLC.1 | GLC.10 | GLC.20 |
---|---|---|---|---|---|---|---|---|---|---|
DistTraveled | –1.44 | –0.15 | –0.26 | –0.58 | 1.70 | 0.20 | 0.08 | –0.67 | –0.83 | –0.71 |
RoG | –0.51 | 0.94 | 1.45 | 0.13 | 1.23 | 0.46 | 0.20 | 0.03 | –0.12 | 0.01 |
MaxDiam | –0.41 | 0.25 | 0.22 | –0.19 | 1.16 | 0.36 | 0.27 | –0.11 | –0.28 | –0.09 |
AvgFlightLen | 11.72 | –0.09 | –0.14 | –0.35 | 0.11 | 0.25 | 0.33 | 0.56 | 0.73 | 0.83 |
StdFlightLen | 10.62 | –0.11 | –0.03 | –0.65 | 0.56 | 0.65 | 0.85 | 1.07 | 1.03 | 1.29 |
AvgFlightDur | 22.55 | 0.55 | 0.40 | 0.47 | 0.12 | 0.62 | 0.69 | 1.51 | 1.64 | 1.73 |
StdFlightDur | 29.56 | 2.72 | 2.10 | 2.29 | 1.50 | 2.33 | 2.25 | 3.57 | 3.18 | 3.58 |
ProbPause | –10.01 | 5.22 | 5.36 | 3.80 | 10.26 | 7.88 | 7.05 | 5.35 | 4.66 | 4.38 |
Avg. Error | 10.85 | 1.26 | 1.24 | 1.06 | 2.08 | 1.60 | 1.46 | 1.61 | 1.56 | 1.58 |
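As a quick arithmetic check, the "Avg. Error" row appears to be the mean of the absolute values of the eight entries in each column. The sketch below recomputes it for the LI and TL.20 columns, using the values transcribed from Table 2; the variable names are illustrative, and because the table entries are rounded, recomputed averages for other columns may differ in the last digit.

```python
# Column entries copied from Table 2, in the row order
# DistTraveled, RoG, MaxDiam, AvgFlightLen, StdFlightLen,
# AvgFlightDur, StdFlightDur, ProbPause.
table2 = {
    "LI": [-1.44, -0.51, -0.41, 11.72, 10.62, 22.55, 29.56, -10.01],
    "TL.20": [-0.58, 0.13, -0.19, -0.35, -0.65, 0.47, 2.29, 3.80],
}

# Mean absolute error per method across the eight mobility measures.
avg_abs = {
    method: sum(abs(v) for v in values) / len(values)
    for method, values in table2.items()
}

# Ratio of LI error to TL.20 error, i.e. the fold improvement.
fold = avg_abs["LI"] / avg_abs["TL.20"]

for method, value in avg_abs.items():
    print(f"{method}: {value:.4f}")
print(f"fold improvement of TL.20 over LI: {fold:.2f}")
```

The recomputed averages agree with the tabulated 10.85 and 1.06 to within rounding, and their ratio is slightly above 10.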
4. Discussion
Past studies with small subject pools have not adequately accounted for missingness, likely because missing data are less of a problem for studies that provide their subjects with dedicated instrumentation capable of recording continuous or near-continuous GPS trajectories. However, instrumenting each subject with a dedicated GPS device is expensive, and scaling up to larger sample sizes or longer follow-up times therefore becomes infeasible. In the near future, studies will likely increasingly leverage the high ownership rates of smartphones so that subjects need only download an app onto their personal devices. For example, the smartphone research platform Beiwe is currently used to study patient-centered outcomes across different disorders, from depression to surgical recovery, by collecting sensor data, survey data, and phone usage patterns from diverse patient cohorts. In these studies, which generally have long follow-up times, battery life is preserved by recording GPS less frequently, often leaving most of each trajectory unobserved. This means that missingness can no longer be ignored and will need to be properly adjusted for. In this article, we introduced a hot-deck imputation approach to address missingness, and we found that, even with large percentages of missingness, mobility measure estimation from the proposed data imputation approach is accurate compared with the current standard of using LI.
Our approach is designed to account for planned periods of missingness where the missing intervals may be frequent, but are each individually not too long. This type of missingness is benign because it operates independently of a person’s location, and so can be treated as MCAR, a setting in which imputation approaches like the one we propose are viable. While this planned missingness will undoubtedly account for the largest percentage of missing data in a person’s GPS trace, other sources of missingness are left unaccounted for by our approach. If a person’s position is obstructed from a satellite’s view, it is possible either that a connection to the satellite cannot be established or that the recorded location is distorted. This type of missingness is difficult to account for and cannot be ignored, as its mechanism qualifies as MNAR because the missingness depends on location. Similarly, MNAR gaps in a person’s mobility trace can be created by the person intentionally turning off their phone or disabling GPS. Frequent missingness of this type could lead to large biases in the estimation of an individual’s mobility measures. The amount of GPS data that is MNAR can be estimated by the extent of missingness outside of the scheduled intervals of missingness; if this amount is large for an individual, the proposed imputation approach may not be appropriate.
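One way to gauge the extent of missingness outside the scheduled off-periods is to count how many scheduled on-periods contain no GPS fix at all. The sketch below is a hypothetical diagnostic, not part of the proposed method: the function name, the assumption that the schedule is aligned to the start of the recording window, and the example timestamps are all made up for illustration.

```python
def unplanned_missing_fraction(obs_times_s, start_s, end_s, on_s=120, off_s=600):
    """Fraction of scheduled on-periods that contain no GPS observation.

    Assumes the on/off schedule starts at start_s and repeats every
    on_s + off_s seconds; only complete cycles before end_s are counted.
    """
    period = on_s + off_s
    n_periods = int((end_s - start_s) // period)
    if n_periods == 0:
        return 0.0
    hit = set()
    for t in obs_times_s:
        k, phase = divmod(t - start_s, period)
        # Count the cycle as observed if the fix lies in its on-period.
        if 0 <= k < n_periods and phase < on_s:
            hit.add(int(k))
    return 1 - len(hit) / n_periods


# Hypothetical two-hour window (10 cycles) with fixes in the first 7 on-periods.
obs = [k * 720 + 10 for k in range(7)]
frac = unplanned_missing_fraction(obs, start_s=0, end_s=7200)
print(f"fraction of on-periods with no data: {frac:.2f}")
```

A large value for an individual would suggest substantial missingness beyond the planned duty cycle, in which case the caution expressed above applies.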
With the prospect of scalable studies on the horizon, additional statistical challenges will likely emerge in the analysis of mobility measures from patient cohorts. When mobility measures are paired with daily smartphone surveys, the longitudinal nature of the data can be leveraged with generalized linear mixed models (GLMM; Breslow and Clayton, 1993) or generalized estimating equations (GEE; Liang and Zeger, 1986) to estimate the effects of mobility measures on various outcomes obtained through the surveys. Also, while here we considered only mobility measures extracted from GPS traces, these regression frameworks can be readily adapted to include information from other smartphone sensors by adding covariates, such as those obtained from the phone’s built-in accelerometer, to the regression model.
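As an illustration of the kind of model meant here, one possible specification is a random-intercept GLMM and its marginal GEE counterpart. The notation is an assumption introduced for this sketch: $Y_{ij}$ is the survey outcome for subject $i$ on day $j$, $\mathbf{x}_{ij}$ is the vector of mobility measures (and any additional sensor covariates) for that day, $g$ is a link function, and $b_i$ is a subject-specific random intercept.

```latex
% Random-intercept GLMM (subject-specific effects)
g\{ E(Y_{ij} \mid b_i) \} = \beta_0 + \mathbf{x}_{ij}^{\top} \boldsymbol{\beta} + b_i,
\qquad b_i \sim N(0, \sigma_b^2)

% GEE (population-averaged effects), fitted with a working correlation
% structure for the repeated measurements within each subject
g\{ E(Y_{ij}) \} = \beta_0^{*} + \mathbf{x}_{ij}^{\top} \boldsymbol{\beta}^{*}
```

The coefficients $\boldsymbol{\beta}$ and $\boldsymbol{\beta}^{*}$ have subject-specific and population-averaged interpretations, respectively, and coincide only in special cases such as the identity link.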
Finally, the method introduced in this paper has been implemented as a package in the statistical computing software, R, and is freely available (see Supplementary Materials available at Biostatistics online). To conduct digital phenotyping studies, the Beiwe research platform can be used through its open source software.
Supplementary Material
Acknowledgments
Conflict of Interest: None declared.
Funding
IB and JPO are supported by NIH/NIMH 1DP2MH103909 (PI: J-PO) and the Harvard McLennan Dean’s Challenge Program (PI: J-PO).
References
- Ainsworth J., Palmier-Claus J. E., Machin M., Barrowclough C., Dunn G., Rogers A., Buchan I., Barkus E., Kapur S., Wykes T. and others (2013). A comparison of two delivery modalities of a mobile phone-based assessment for serious mental illness: native smartphone application vs text-messaging only implementations. Journal of Medical Internet Research 15, e60.
- AARP. AsPC., Initiative. (2015). Building a better tracker: older consumers weigh in on activity and sleep monitoring devices. https://www.aarp.org/content/dam/aarp/home-and-family/personal-technology/2015-07/innovation-50-project-catalyst-tracker-study-AARP.pdf.
- Blondel V. D., Decuyper A. and Krings G. (2015). A survey of results on mobile phone datasets analysis. EPJ Data Science 4, 10.
- Breslow N. E. and Clayton D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88, 9–25.
- Canzian L. and Musolesi M. (2015). Trajectories of depression: unobtrusive monitoring of depressive states by means of smartphone mobility traces analysis. In: Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. Osaka, Japan: ACM, pp. 1293–1304.
- Cassagnol D. (2017). A Smartphone Surprise: U.S. Ownership Hits Record Levels, Says CTA Research. Arlington, VA, USA: Consumer Technology Association.
- Chapman J. and Frank L. (2007). Integrating Travel Behavior and Urban Form Data to Address Transportation and Air Quality Problems in Atlanta. Atlanta, GA, USA: Georgia Tech Research Institute.
- Faherty L. J., Hantsoo L., Appleby D., Sammel M. D., Bennett I. M. and Wiebe D. J. (2017). Movement patterns in women at risk for perinatal depression: use of a mood-monitoring mobile application in pregnancy. Journal of the American Medical Informatics Association 24, 746–753.
- Gonzalez M. C., Hidalgo C. A. and Barabasi A.-L. (2008). Understanding individual human mobility patterns. Nature 453, 779–782.
- Gruenerbl A., Osmani V., Bahle G., Carrasco J. C., Oehler S., Mayora O., Haring C. and Lukowicz P. (2014). Using smart phone mobility traces for the diagnosis of depressive and manic episodes in bipolar patients. In: Proceedings of the 5th Augmented Human International Conference. Osaka, Japan: ACM, p. 38.
- Jankowska M. M., Schipperijn J. and Kerr J. (2015). A framework for using GPS data in physical activity and sedentary behavior studies. Exercise and Sport Sciences Reviews 43, 48–56.
- Krenn P. J., Titze S., Oja P., Jones A. and Ogilvie D. (2011). Use of global positioning systems to study physical activity and the environment: a systematic review. American Journal of Preventive Medicine 41, 508–515.
- Li Y., Li Y., Gunopulos D. and Guibas L. (2016). Knowledge-based trajectory completion from sparse GPS samples. In: Proceedings of 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. San Francisco, CA, USA: ACM, p. 33.
- Liang K.-Y. and Zeger S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13–22.
- Liao L., Patterson D. J., Fox D. and Kautz H. (2007). Learning and inferring transportation routines. Artificial Intelligence 171, 311–331.
- Little R. J. A. and Rubin D. B. (2002). Statistical Analysis with Missing Data. Hoboken, NJ, USA: John Wiley & Sons.
- Mackett R. L., Brown B., Gong Y., Kitazawa K. and Paskins J. (2007). Setting Children Free: Children’s Independent Movement in the Local Environment. London, UK: University College London.
- Maling D. H. (2013). Coordinate Systems and Map Projections. Bethesda, MD, USA: Elsevier.
- Miller G. (2012). The smartphone psychology manifesto. Perspectives on Psychological Science 7, 221–237.
- Miskelly F. (2005). Electronic tracking of patients with dementia and wandering using mobile phone technology. Age and Ageing 34, 497–498.
- Onnela J.-P. and Rauch S. L. (2016). Harnessing smartphone-based digital phenotyping to enhance behavioral and mental health. Neuropsychopharmacology 41, 1961.
- Onnela J.-P., Saramäki J., Hyvönen J., Szabó G., Lazer D., Kaski K., Kertész J. and Barabási A.-L. (2007). Structure and tie strengths in mobile communication networks. Proceedings of the National Academy of Sciences of the United States of America 104, 7332–7336.
- Phillips M. L., Hall T. A., Esmen N. A., Lynch R. and Johnson D. L. (2001). Use of global positioning system technology to track subject’s location during environmental exposure sampling. Journal of Exposure Analysis and Environmental Epidemiology 11, 207–215.
- Rhee I., Shin M., Hong S., Lee K. and Chong S. (2007). Human mobility patterns and their impact on routing in human-driven mobile networks. In: Proceedings of Hotnets-VI. Atlanta, GA, USA: ACM.
- Rhee I., Shin M., Hong S., Lee K., Kim S. J. and Chong S. (2011). On the levy-walk nature of human mobility. IEEE/ACM Transactions on Networking (TON) 19, 630–643.
- Saeb S., Zhang M., Karr C. J., Schueller S. M., Corden M. E., Kording K. P. and Mohr D. C. (2015). Mobile phone sensor correlates of depressive symptom severity in daily-life behavior: an exploratory study. Journal of Medical Internet Research 17, e175.
- Shen L. and Stopher P. R. (2014). Review of GPS travel survey and GPS data-processing methods. Transport Reviews 34, 316–334.
- Shoval N., Auslander G. K., Freytag T., Landau R., Oswald F., Seidl U., Wahl H.-W., Werner S. and Heinik J. (2008). The use of advanced tracking technologies for the analysis of mobility in Alzheimer’s disease and related cognitive diseases. BMC Geriatrics 8, 1.
- Smith A. (2015). US Smartphone Use in 2015. Washington D.C., USA: Pew Research Center, pp. 18–29.
- Stopher P., FitzGerald C. and Xu M. (2007). Assessing the accuracy of the Sydney household travel survey with GPS. Transportation 34, 723–741.
- Torous J., Kiang M., Lorme J. and Onnela J.-P. (2016). New tools for new research in psychiatry: a scalable and customizable platform to empower data driven smartphone research. JMIR Mental Health 3, e16.
- Troped P. J., Oliveira M. S., Matthews C. E., Cromley E. K., Melly S. J. and Craig B. A. (2008). Prediction of activity mode with global positioning system and accelerometer data. Medicine and Science in Sports and Exercise 40, 972–978.
- Wang R., Aung M. S. H., Abdullah S., Brian R., Campbell A. T., Choudhury T., Hauser M., Kane J., Merrill M., Scherer E. A. and others (2016). Crosscheck: toward passive sensing and detection of mental health changes in people with schizophrenia. In: Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing. Heidelberg, Germany: ACM, pp. 886–897.
- Werner S., Auslander G. K., Shoval N., Gitlitz T., Landau R. and Heinik J. (2012). Caregiving burden and out-of-home mobility of cognitively impaired care-recipients based on GPS tracking. International Psychogeriatrics 24, 1836–1845.
- Wolf P. S. A. and Jacobs W. J. (2010). GPS technology and human psychological research: a methodological proposal. Journal of Methods and Measurement in the Social Sciences 1, 1–7.
- Yair B., Noam S., Meir L., Gail A., Amit B., Michal I., Vaccaro A. R. and Leon K. (2011). Assessing the outcomes of spine surgery using global positioning systems. Spine 36, E263–E267.
- Zheng Y., Li Q., Chen Y., Xie X. and Ma W.-Y. (2008). Understanding mobility based on GPS data. In: Proceedings of the 10th International Conference on Ubiquitous Computing. Seoul, South Korea: ACM, pp. 312–321.
- Zheng Y., Xie X. and Ma W.-Y. (2010). Geolife: a collaborative social networking service among user, location and trajectory. IEEE Data(base) Engineering Bulletin 33, 32–39.
- Zheng Y., Zhang L., Xie X. and Ma W.-Y. (2009). Mining interesting locations and travel sequences from GPS trajectories. In: Proceedings of the 18th International Conference on World Wide Web. Madrid, Spain: ACM, pp. 791–800.
- Zhou J. J. and Golledge R. G. (2003). An analysis of variability of travel behavior within one-week period based on GPS. Davis, CA, USA: University of California Transportation Center.