Abstract
Household travel survey data is a critical input to travel behavior modeling and can also be used to generate trip schedules for activity-based traffic simulation. With emerging information and communication technology (ICT) tools such as smartphones, the passive collection of travelers’ real-time information has become possible. Smartphone GPS survey apps have emerged as a popular tool for conducting household travel surveys. Most existing studies employ high-frequency smartphone GPS data and collect accurate activity information. However, their study periods are still rather short, ranging from a few days to a few weeks. For a long-term GPS survey, missing activity information and sparse GPS data are inevitable and must be addressed carefully.
This paper uses 7 months of low-frequency smartphone GPS data collected from over 2000 participants, who report up to 5 of their most frequently visited locations each week. The essential goal is to develop a synthetic model of daily activity-location scheduling that captures data with both known and unknown activities. To handle missing activity data, this research develops a new probabilistic approach, which measures the probability of visiting a place by three scores: global visit score (GVS), temporal visit score (TVS), and periodical visit score (PVS). Three levels of activity-location schedule are modeled. The first level handles only data with known activities and disregards data with unknown activities. The second takes unknown activities into account but combines all of them into a single category. The third models each location with unknown activities separately. These models can generate activity-location schedules at different levels of detail for an activity-based traffic simulator. After the activity-location schedule models are developed, both individual and aggregated validation are performed with simulation. The validation results show that the simulated proportions of activity types and the activity durations are close to the survey data, indicating the effectiveness of the proposed approaches. This research sheds light on building sustainable, long-term travel surveys using GPS data with missing activity information. In addition, this study will be valuable for modeling infectious disease transmission (e.g., COVID-19) and assessing health risk in urban areas.
Keywords: Smartphone GPS data, Travel survey, Activity-location schedule, Activity-based simulator
1. Introduction
An actively pursued research topic in recent years is understanding travel behavior and estimating travel demand in the development of activity-based models. When microsimulation requires daily travel patterns of individuals as input data, synthetic daily travel chains or empirical travel diaries are needed. However, a traditional household travel survey usually collects travel information over a very short period (e.g., 1 day). Such a short survey period neglects the variation in individuals’ travel behaviors and hardly captures any longitudinal behavior. A growing body of literature suggests that longer data collection periods are warranted to provide improved data for modeling purposes and for understanding travel variations (Liao et al., 2017). Therefore, there is a pressing need to explore a data type that captures long-period behavior yet avoids the respondent burden and survey costs that such surveys would otherwise impose.
With emerging information and communication technology (ICT) tools, the passive collection of travelers’ real-time trajectories has become possible. Trajectory data can be divided into two categories: explicit trajectory data and implicit trajectory data (Kong et al., 2018). Explicit trajectory data is recorded in succession at constant intervals, such as data from GPS devices and smartphone GPS collection applications. In contrast, implicit trajectory data is recorded at random and relatively large time intervals. Sources of implicit trajectory data include sensor-based data (monitor), network-based data (social media check-in data), and signal-based data (Wi-Fi, Bluetooth, RFID, and mobile data).
The high penetration of smartphone usage makes collecting GPS trajectories straightforward, since most smartphones have both GPS and accelerometer sensors (Shen and Stopher, 2014). Smartphone GPS survey apps have emerged as a popular tool for conducting household travel surveys (Bierlaire et al., 2013; Hudson et al., 2012; Reddy et al., 2010; Xiao et al., 2012). Compared to GPS devices, using a smartphone to record GPS information for the survey has several advantages. First, smartphones reduce study cost since no additional GPS device is needed. Second, they decrease the chance that participants forget to bring or charge wearable GPS devices. Further, smartphones record trajectories continuously, especially between vehicle parking places and activity places, while in-vehicle GPS devices fail to do so, leaving a data gap. Even though the sampling frequency and accuracy of smartphone GPS points are lower than those of dedicated GPS devices, the data quality and information have been shown to be comparable with GPS devices and sufficient for research (Montini et al., 2015).
However, the smartphone-based travel survey also has several intrinsic limitations and challenges. If the travel app samples GPS points at a high frequency, as dedicated GPS devices do, the smartphone battery drains quickly and may be damaged as well (Patterson and Fitzsimmons, 2016). Users are so sensitive to battery consumption that they are often unwilling to participate in experiments. The trade-off between GPS sampling frequency and battery consumption needs to be accounted for during the survey design.
Participant intervention is another concern when travel mode and purpose cannot be recorded automatically by GPS devices or smartphone apps (Zhou et al., 2017). In GPS-device-assisted travel surveys, participants themselves must fill in one- or two-day travel diaries. Smartphone-based travel surveys also need participant intervention to validate or input data. These interventions increase the burden on participants, especially as the survey period lengthens. The trade-off between the amount of labeled activity information and the amount of required participant intervention also needs to be addressed when designing the smartphone app (Liao et al., 2017).
To address these two concerns, this paper explores data collected with a low-frequency GPS sampling method and few activity labels to derive activity-location schedules. This data collection mechanism naturally resolves both problems, battery drain and user intervention burden, and eases recruitment. Further, this smartphone-based approach enables a long-term survey period. The resultant data can capture more travel information, thus reducing the unknown information caused by low reporting and validation frequency. Moreover, long-term data also enables analysis of longitudinal travel behavior, which is difficult to derive from traditional household travel surveys. In addition, such data can account for the variation of individual travel patterns and provide better results in simulation and traffic demand forecasting.
The data to be explored also poses challenges. Our GPS data was collected by a smartphone app running in the background. The time intervals between consecutive records are not uniform, and the data collection frequency is low. The traditional methods for processing either explicit or implicit trajectory data are no longer appropriate for this data. Therefore, the first methodological challenge is how to identify trip ends from this irregular-interval, low-frequency data. The other challenge is how to handle missing activity information, because the data only contains raw GPS points without activity labels. This paper introduces a 3-level modeling scheme to address these issues. Under each level, three different visit scores (namely global visit, temporal visit, and periodical visit) are developed to capture different characteristics of trips.
The rest of the paper is structured as follows: Section 2 summarizes the existing studies. Section 3 describes the dataset of this paper and the challenges when utilizing this dataset. Section 4 introduces the proposed trip generation method. Section 5 shows the numerical example of our trip generation model. Section 6 discusses the findings. Finally, Section 7 concludes the results and lists the potential applications of the proposed method.
2. Literature review
2.1. Smartphone-based travel survey
Smartphone-based travel surveys have rapidly captured the enthusiasm of research communities, resulting in many recent studies. An early development in smartphone-based travel surveys was an off-line device named Personal Activity Monitor (PEAMON), which detects acceleration at a rather high frequency (every 15 s) but may not be ideal with respect to battery drain (Asakura and Hato, 2004). A widely known recent smartphone-based activity-travel survey is the Future Mobility Survey (FMS) (Cottrill et al., 2013). A large number of participants (793) were involved in a two-week smartphone survey (with 14-day GPS data and more than 5 days of validation data). Several subsequent studies used the data collected by the FMS system to identify activity types and behavior patterns (Kim et al., 2015; Zhao et al., 2015). Another smartphone-based system for personal travel survey collection (SITSS) was developed and implemented in New Zealand. This application optimized battery consumption and the data collection procedure, and 94% of participants completed the experiment and interacted with their phones during the data collection period without interruption (Safi et al., 2015). However, only 73 participants uploaded at least one travel day. More recently, Wang et al. (2017) conducted a controlled experiment with a smartphone app on a limited scale, with trajectory data collected over two weeks from only 16 students.
Other smartphone-based travel survey systems have also been developed for various purposes, including comparison with dedicated GPS devices (Montini et al., 2015), origin-destination (O-D) studies (Patterson and Fitzsimmons, 2016), and trip end identification (Zhou et al., 2017). More smartphone applications developed to collect travel information are summarized in a recent review paper (Liao et al., 2017). The review also raised concerns about using smartphone applications, including the aforementioned battery consumption, the accuracy of data, and the frequency of collecting GPS points. It found that even though the frequency of smartphone data was low, the data provided more diverse information than GPS devices. Also, the quality of smartphone data has been shown to be sufficient for research while supporting a large-scale and reasonably long survey period (Montini et al., 2015).
Paper-based surveys are usually shorter than three days, and web-based surveys have more participants or longer collection periods than paper-based surveys. Smartphone-based surveys have the longest collection periods. However, they have fewer participants than the other two types because the collection period is longer. From these observations, we can conclude that travel surveys with longer time periods have fewer participants. This conclusion is intuitive because a long-period survey with a large number of participants is very costly. It is also related to the survey method. Filling in paper-based travel surveys or trip diaries for a month is a huge burden for participants, and even entering information on a web-based platform is very difficult for participants. The greater the burden of the survey, the shorter the period and the fewer the participants will be (Shen and Stopher, 2014).
2.2. Activity scheduling in travel behavior analysis
Activity scheduling is an important component of trip generation in travel behavior analysis. This line of research focuses on activities, while locations are not considered. There are three types of activity-focused scheduling: activity-agenda generation scheduling, trip chain generation scheduling, and typical day generation scheduling. From the first to the third, the three types demand an increasing amount of detail and are thus increasingly difficult to produce. The first type, activity-agenda generation scheduling, only produces a collection of activities to be completed within a period of time, with no sequence or time schedule for these activities (Litwin, 2005). Trip chain generation scheduling contains the sequence information of activities but still does not include their time schedules (Allahviranloo and Recker, 2013; Baral et al., 2018). The last type, typical day generation scheduling, contains both sequence and schedule for activities (Kitamura et al., 1997), and the combination of these activities, once sequentially scheduled, makes up a typical day’s activity chain for an individual. Of the three types, the third is the most closely related to this study.
In the literature on typical day generation, the traditional methods are discrete choice models. Kitamura et al. (1997) built 13 multinomial logit models to estimate choices of activity type, activity duration, activity location, and travel mode for each activity. Bhat and Singh (2000) divided a day into different time zones, and then employed four types of discrete choice models to estimate the number of tours within each time zone, the number of stops within each tour, and the characteristics (location, start time, and duration) of each tour. Structural equation modeling (SEM) was also applied to trip-chain generation, producing four types of trip chains (Golob, 2000). Cheng et al. (2017) also employed SEM to analyze activity participation, trip generation, and mode choice for low-income commuters. Recent prevailing methods use machine learning to generate trips, in a manner similar to a recommendation system for trips. Unger et al. (2016) proposed a latent context-aware recommendation system for activity-agenda generation. A Context-Aware Personalized POI Sequence (CAPS) recommendation system was developed for trip chain generation. Activities can also be scheduled in a fully data-driven way (Drchal et al., 2019), but this method is limited by the unavailability of detailed data, small datasets, and reliance on external software (such as path planners).
For typical day trip chain generation, most discrete choice models are developed from household travel survey data. The purpose of these models is to analyze an individual’s travel behavior through the estimated parameters, not to generate trip chains. Moreover, some discrete choice models make assumptions about variables and impose rules on trip generation, introducing errors and bias. Probabilistic methods and Markov chains are appropriate for trip chain generation. However, the complexity of these probabilistic models, especially when many parameters are involved, makes them difficult to solve. In another line of research, on activity recommendation systems, studies usually take trips or activities into account, but rarely the sequence or timing of activities.
Other data sources have also been used in activity-location schedule modeling, including social media check-in data (Hasan and Ukkusuri, 2014; Cui et al., 2018b; Hasan and Ukkusuri, 2018), CDR data (Di Donna et al., 2015), and vehicle plate scanning data (Siripirote et al., 2014). However, research with these data sources usually lacks ground truth and validation.
3. Data description and preprocessing
The dataset used in this paper was collected in an influenza surveillance survey supported by the National Institutes of Health (NIH). More than 2,200 participants were recruited from the urbanized areas of the Western New York region in the United States. The survey was conducted over 7 months, from October 2016 to May 2017, within the influenza season defined by the Centers for Disease Control and Prevention (CDC). The original purpose of this survey was to discover the relationship between individuals’ travel patterns and influenza dispersion. However, the dataset is also well suited to studying the travel behavior of a large number of participants over a relatively long period.
Participants provided three kinds of inputs. The first is socio-demographic data, including home and workplace locations, gender, age, race and ethnicity, and the number of people in the household. The second is up to five places most frequently visited by a participant in a week (chosen by exact street address on Google Maps), not including home and workplaces. The third is GPS trajectory data recorded by a smartphone app. Since participants are very sensitive to battery drain, this smartphone app was designed to run in the background and sample GPS at a very low frequency. The research team developed two versions of the smartphone app, one for Android and one for iOS. The Android app recorded a point every two hours. The iOS app recorded a point when it detected a significant location change, i.e., when the user’s position changed by 500 m or more (Apple, 2020). Since the GPS trajectory points recorded every two hours on Android phones are too sparse, we only use GPS trajectory points collected by iOS phones (1445 participants).
This data source is similar to household travel surveys with GPS data, but obvious differences also exist. First, participants in traditional household travel surveys are selected according to the composition of the entire population. However, participants in our dataset, referred to as the NIH dataset in this study, are mostly from the area between downtown and the suburbs. Table 1 compares the socio-demographics of the NIH dataset and the 2017 National Household Travel Survey. The table demonstrates the discrepancy between our dataset and the real population. However, because we focus on generating synthetic probabilistic daily activities for individuals, the disparity of samples will not affect our study. Second, instead of one-day or multi-day travel diaries, this dataset covers 7 months but only contains up to 5 reported frequent places every week. The dataset therefore contains many unlabeled activities. According to Gonzalez et al. (2008), people devote most of their time to only a few locations, while spending their remaining time in 5 to 50 places, visited with diminished regularity. Hence, reporting 5 frequently visited places per week is sufficient. Third, our GPS trajectory points are not as dense and uniform as explicit GPS trajectory data. Only when the app detected significant movement by a participant, such as 500 m or more (Apple, 2020), were GPS points recorded in a uniform way at an interval of approximately 5 min. This is why our dataset is unique, falling between explicit and implicit trajectory data.
Table 1.
Socio-demographics Comparison of NIH Dataset and 2017 National Household Travel Survey.
| Socio-Demographic | NIH Dataset: Category | NIH Dataset: Share | 2017 National Household Travel Survey*: Category | 2017 National Household Travel Survey*: Share |
|---|---|---|---|---|
| Gender | Male | 31% | Male | 49% |
| Gender | Female | 69% | Female | 51% |
| Age** | 13–17 | 4% | 5–19 | 21% |
| Age** | 18–35 | 33% | 20–40 | 29% |
| Age** | 36–65 | 55% | 40–65 | 34% |
| Age** | 66 or older | 8% | 66 or older | 16% |
| Race | American Indian or Alaska Native | 1% | American Indian or Alaska Native | 1% |
| Race | Asian | 3% | Asian | 5% |
| Race | Black or African American | 2% | Black or African American | 7% |
| Race | White | 94% | White | 81% |
| Race | Other Race | 1% | Other Race | 2% |
* The summations of each socio-demographic percentage are not 100% since there are “Don’t Know” and “Refused” responses in the 2017 National Household Travel Survey.
** The difference in age bin separation is due to the designs of these two surveys.
Fig. 1 is the flow chart of data preprocessing. The first challenge of this research is to identify the trip ends of each trip from GPS trajectory data. Given the unique characteristics of our dataset, namely a low sampling frequency and missing information, the existing density-based methods (Gong et al., 2012; Hariharan and Toyama, 2004; Ye et al., 2009) are not applicable, and the rule-based methods (Palma et al., 2008; Tang and Meng, 2006; Thierry et al., 2013) are not appropriate either. Therefore, we propose a “Haccuracy”-based method to detect trip ends. “Haccuracy” is the horizontal accuracy of GPS data, i.e., the estimated deviation between the real location and the recorded coordinates; a small value indicates good quality of the recorded coordinates. If the iPhone is connected to Wi-Fi, “Haccuracy” ranges from 65 to 165 m. A new trip typically starts with a very poor accuracy (1000 m), since the hardware takes some time to receive the satellite signal. The accuracy then improves within a few seconds or more, and can finally become as good as 5 m. If the location is determined by cell tower triangulation, “Haccuracy” is recorded as 1414 m according to our empirical observations. Fig. 2, which is derived from our dataset, shows the density plot of “Haccuracy”. As one can see, “Haccuracy” has two density peaks, at 65 m and 1414 m. There is also a small bump around 1000 m, which starts from 900 m. Therefore, we set the threshold to 900 m to identify trip ends. The algorithm for trip end identification is presented in Fig. 3. If “Haccuracy” is larger than 900 m, we consider the point to be the start of a new trip. Also, if the time difference between two consecutive GPS points is larger than 10 min, we consider these two points to belong to different trips.
Fig. 1.

Flow Chart of Data Preprocessing.
Fig. 2.

Density Plot of “Haccuracy”
Fig. 3.

Algorithm of Trip Ends Identification.
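The two rules stated above (a “Haccuracy” value above 900 m, or a gap of more than 10 min between consecutive points) can be sketched as a simple segmentation pass. This is only an illustrative sketch, not the algorithm of Fig. 3 itself: the `GpsPoint` record, the function names, and the choice of taking the last point of each segment as the trip end are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

HACC_THRESHOLD_M = 900.0          # threshold stated in the text
MAX_GAP = timedelta(minutes=10)   # time-gap rule stated in the text


@dataclass
class GpsPoint:
    """Hypothetical record layout for one GPS sample."""
    time: datetime
    lat: float
    lon: float
    haccuracy: float  # horizontal accuracy in meters


def split_into_trips(points):
    """Split time-ordered GPS points into trips using the two rules."""
    trips, current = [], []
    for pt in points:
        if current:
            gap = pt.time - current[-1].time
            # A poor-accuracy point or a long gap starts a new trip.
            if pt.haccuracy > HACC_THRESHOLD_M or gap > MAX_GAP:
                trips.append(current)
                current = []
        current.append(pt)
    if current:
        trips.append(current)
    return trips


def trip_ends(trips):
    """Assumption: the last point of each trip is taken as its trip end."""
    return [trip[-1] for trip in trips]
```

Under these assumptions, a point with “Haccuracy” of 1414 m (cell tower triangulation) always opens a new segment, which mirrors the behavior described for the start of a new trip.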
After obtaining all trip ends, we match them with the weekly reported frequent places, where activity information is available. A trip end is assigned to the nearest frequent place when the distance is smaller than (0.5 × Haccuracy + 50) meters. Out of a total of 3,139,453 detected trip ends, 1,733,538 (55.22%) are matched with frequent places, including home and workplaces. On weekdays, 51.31% of matched trip ends are ‘Home’ and 11.54% are ‘Workplace’, while on weekends 40.78% of matched trip ends are ‘Home’ and 1.5% are ‘Workplace’. This indicates that more work-related trips are observed on weekdays, which agrees with common observation.
We define trip ends that can be matched with user-reported weekly places as frequently visited places (or frequent places), and trip ends that do not match any reported place as ‘unknown places’ (including unreported places if deemed obviously missing from the data). Fig. 4 shows the distribution and basic statistics of the number of unique frequently visited places and unknown places, respectively. The red histogram shows that the number of frequently visited places per individual does not vary much, and most people have around 20 frequently visited places within 20 weeks. However, people visit a higher number of unknown places (mean = 34.3) than frequently visited places. The maximum number of frequently visited places that can be reported in the survey is five per week, while there is no limit on the number of unknown places, since unknown places are detected by the algorithm in Fig. 3. This explains the higher number of unknown places shown in Fig. 4.
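The matching rule can be illustrated with a short sketch. The distance tolerance (0.5 × Haccuracy + 50 meters) is taken from the text; the use of the haversine great-circle distance, the function names, and the `(label, lat, lon)` tuple layout for places are assumptions made for this example.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, used by the haversine formula


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) pairs in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def match_trip_end(lat, lon, haccuracy, places):
    """Return the label of the nearest place within the tolerance, else None.

    `places` is a list of (label, lat, lon) tuples (hypothetical layout).
    """
    tolerance = 0.5 * haccuracy + 50.0  # rule stated in the text
    best_label, best_dist = None, float("inf")
    for label, plat, plon in places:
        d = haversine_m(lat, lon, plat, plon)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist < tolerance else None
```

A trip end for which this function returns `None` would correspond to an ‘unknown place’ in the terminology of this section.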
Fig. 4.

Distributions of Number of Unique Frequently Visited Places and Unique Unknown Places within 20 Weeks.
Fig. 5 shows the distributions of the number of weekly visits for frequent places and unknown places, respectively. If a place is visited once every week, its weekly visit frequency is 1. We employ density-based spatial clustering of applications with noise (DBSCAN) to group locations that are close to each other. From the plot, most unknown places are visited less than once a week (with a mean of 0.35). However, some unknown places are still visited more than once a week. Some frequently visited places are not reported by participants for various reasons, for example, carelessness, privacy concerns, or other reluctance. Most reported frequent places are visited 1 to 3 times a week (with a mean of 2.11), and many participants even visit some frequent places more than 3 times per week. These places could be schools where parents pick up and drop off their children, or restaurants where participants regularly buy breakfast or lunch, among others.
Fig. 5.

Distributions of Number of Weekly Visit Times for Frequent Places and Unknown Places.
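To make the grouping step concrete, the following is a minimal, pure-Python sketch of DBSCAN applied to trip-end coordinates, followed by the weekly visit frequency of each resulting place. The neighborhood radius `eps_m` and the minimum cluster size `min_pts` are hypothetical parameters, since their values are not reported in this section, and the quadratic neighbor search is chosen for readability rather than efficiency.

```python
import math
from collections import Counter

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) tuples in degrees."""
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    dphi = phi2 - phi1
    dlmb = math.radians(b[1] - a[1])
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def dbscan(points, eps_m, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)

    def neighbors(i):
        # Includes the point itself, as in the standard DBSCAN definition.
        return [j for j in range(len(points)) if haversine_m(points[i], points[j]) <= eps_m]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # noise reachable from a core point becomes a border point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # only core points expand the cluster
                queue.extend(nbrs)
    return labels


def weekly_visit_frequency(labels, n_weeks):
    """Visits per week for each cluster (noise points are ignored)."""
    counts = Counter(lab for lab in labels if lab != -1)
    return {lab: n / n_weeks for lab, n in counts.items()}
```

With this sketch, a place whose cluster contains 20 trip ends over 20 weeks has a weekly visit frequency of 1, matching the definition given above.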
Given that the activity type for home and workplaces is straightforward, the distribution of activity types of frequent places is further analyzed, given the greater number of unknown places visited (see Fig. 5). We employ the Google Places API to retrieve categories of frequent places (Cui et al., 2018b) and group the detailed types into six major activity types: “School”, “Shopping”, “Recreation”, “Personal Business”, “Transportation”, and “Other”. The detailed grouping rules are described in the Appendix. Tables 2, 3, and 4 present the distribution of all activity types and their proportions (including Home and Work) by gender, age, and race, respectively. The number in parentheses indicates the percentage of the corresponding activity type. The statistics for each socio-demographic category appear reasonable. For example, participants younger than 18 years old conduct more “School” activities and fewer “Work” and “Personal Business” activities. Collectively, the number of unique places visited in 20 weeks, the frequency of weekly visits to places, and the activity type by demographic category are used to derive activity information for trips. However, the percentage of work trips is lower than that of the 2017 National Household Travel Survey (NHTS). According to the 2017 NHTS, 31% of vehicle trips on weekdays and 11% on weekends are work trips (McGuckin and Fucci, 2018). In our dataset, work trips account for 13% and 4% of trips on weekdays and weekends, respectively. Several factors explain the difference. Our survey does not distinguish travel modes, whereas the household travel survey figures only account for vehicle trips. Moreover, the 2017 NHTS’s work trips also include work-related business trips, whose trip ends are not typical workplaces, while our data only consider trips to typical workplaces as work trips. Finally, participants in our survey are mostly located between downtown and the suburbs, whereas the 2017 NHTS sample includes more urban households. Hence, the percentages of work trips are lower in our dataset.
Table 2.
The Distribution of Activity Type by Gender.
| Activity Type Distribution | Male (451) | Female (992) |
|---|---|---|
| Gender (Total) | | |
| Home | 117,593 (51%) | 252,094 (52%) |
| Work | 30,637 (13%) | 52,532 (11%) |
| School | 5216 (2%) | 13,958 (3%) |
| Shopping | 12,943 (6%) | 39,210 (8%) |
| Recreation | 47,925 (21%) | 97,996 (20%) |
| Personal Business | 11,133 (5%) | 25,821 (5%) |
| Transportation | 385 (0.2%) | 988 (0.2%) |
| Other | 5208 (2%) | 6817 (1%) |
| Gender (per Person) | | |
| Home | 260.74 | 254.13 |
| Work | 67.93 | 52.96 |
| School | 11.57 | 14.07 |
| Shopping | 28.70 | 39.53 |
| Recreation | 106.26 | 98.79 |
| Personal Business | 24.69 | 26.03 |
| Transportation | 0.85 | 1.00 |
| Other | 11.55 | 6.87 |
Table 3.
The Distribution of Activity Type by Age.
| Activity Type Distribution | 13–17 (53) | 18–35 (476) | 36–65 (789) | 66 or older (110) |
|---|---|---|---|---|
| Age (Total) | | | | |
| Home | 12,927 (52%) | 118,405 (51%) | 208,260 (51%) | 30,095 (55%) |
| Work | 609 (2%) | 29,496 (13%) | 51,047 (13%) | 2017 (4%) |
| School | 2855 (12%) | 6993 (3%) | 8601 (2%) | 725 (1%) |
| Shopping | 1128 (5%) | 15,316 (7%) | 30,719 (8%) | 4990 (9%) |
| Recreation | 6111 (25%) | 49,050 (21%) | 78,356 (19%) | 12,404 (23%) |
| Personal Business | 388 (2%) | 10,917 (5%) | 22,194 (5%) | 3455 (6%) |
| Transportation | 39 (0.2%) | 465 (0.2%) | 781 (0.2%) | 88 (0.2%) |
| Other | 570 (2%) | 2480 (1%) | 8370 (2%) | 605 (1%) |
| Age (per Person) | | | | |
| Home | 243.91 | 248.75 | 263.95 | 273.59 |
| Work | 11.49 | 61.97 | 64.70 | 18.34 |
| School | 53.87 | 14.69 | 10.90 | 6.59 |
| Shopping | 21.28 | 32.18 | 38.93 | 45.36 |
| Recreation | 115.30 | 103.05 | 99.31 | 112.76 |
| Personal Business | 7.32 | 22.93 | 28.13 | 31.41 |
| Transportation | 0.74 | 0.98 | 0.99 | 0.80 |
| Other | 10.75 | 5.21 | 10.61 | 5.50 |
Table 4.
The Distribution of Activity Type by Race.
| Activity Type Distribution | American Indian or Alaska Native (9) | Asian (38) | Black or African American (24) | White (1341) | Other Race (14) |
|---|---|---|---|---|---|
| Race (Total) | | | | | |
| Home | 1,971 (59%) | 7,941 (55%) | 5,434 (50%) | 351,011 (51%) | 3,330 (54%) |
| Work | 390 (12%) | 2,328 (16%) | 742 (7%) | 79,178 (12%) | 531 (9%) |
| School | 114 (3%) | 452 (3%) | 217 (3%) | 18,024 (3%) | 367 (6%) |
| Shopping | 211 (6%) | 917 (6%) | 906 (8%) | 49,596 (7%) | 523 (8%) |
| Recreation | 521 (16%) | 1,821 (13%) | 2,852 (26%) | 139,586 (20%) | 1,141 (18%) |
| Personal Business | 114 (3%) | 740 (5%) | 296 (3%) | 35,543 (5%) | 261 (4%) |
| Transportation | 1 (0.03%) | 25 (0.1%) | 176 (2%) | 1,166 (0.2%) | 5 (0.08%) |
| Other | 1 (0.03%) | 264 (2%) | 180 (2%) | 11,551 (2%) | 29 (0.5%) |
| Race (per Person) | | | | | |
| Home | 219.00 | 208.97 | 226.42 | 261.75 | 237.86 |
| Work | 43.33 | 61.26 | 30.92 | 59.04 | 37.93 |
| School | 12.67 | 11.89 | 9.04 | 13.44 | 26.21 |
| Shopping | 23.44 | 24.13 | 37.75 | 36.98 | 37.36 |
| Recreation | 57.89 | 47.92 | 118.83 | 104.09 | 81.50 |
| Personal Business | 12.67 | 19.47 | 12.33 | 26.50 | 18.64 |
| Transportation | 0.11 | 0.66 | 7.33 | 0.87 | 0.36 |
| Other | 0.11 | 6.95 | 7.50 | 8.61 | 2.07 |
An example of the activity-location patterns of a participant is illustrated in Fig. 6. The shade of red represents the frequency of visiting a place, with darker shades indicating higher frequencies. Each label on the y-axis represents a distinct location. Even though there are multiple “Recreation” and “Recreation (Resident)” labels, these are different locations. A “Recreation” activity purpose means that Google Places categorizes the destination as a “Recreation” point of interest (POI), while a “Recreation (Resident)” purpose denotes a place that is classified as residential use but is not the user’s home. With high probability, this participant stayed at home before 10:00 and then was at work until 18:30. After work, this participant tended to visit two other residential locations, both with noticeable probabilities. In addition, this participant also engaged in other recreational, personal, and shopping activities at different locations and with different durations. This example reveals the activity type, location, duration, and especially the probability of activities.
Fig. 6.

Heat Map of an Individual’s Reported Frequent Places (including home and workplace) over Time-of-Day, Calculated by Visiting Frequency Within 20 Weeks.
On a typical day, we cannot expect an individual to visit all the high-frequency places. Instead, we can generate an activity-location schedule through a probabilistic approach, since individuals may choose different location chains and activities may vary by time of day, day of week, and over time (Habib and Miller, 2008).
In this paper, the following attributes are used in the models: socio-demographic information (household size, gender, age, race) and activity/trip information (trip start/end time, activity duration, activity/trip type). These characteristics are not unique to our data; many previous travel survey studies have similar characteristics (Crane and Crepeau, 1998; Young and Farber, 2019). The proposed methodology also relies on home and work addresses; however, this information can be inferred from GPS data if the applied dataset does not include it (Zola et al., 2020; Ye et al., 2009).
4. Methodology
To support the objective of this study of deriving activity-location schedules, it is essential to predict the next activity location and its duration given current and past activity locations and duration for each individual participant. We introduce three scores to assess the probability of visiting the next place, along with a minimum entropy selection method (MESM) to select the appropriate score. Subsequently, we employ a hazard-based model to estimate the duration of stay at the place. Fig. 7 shows the flow chart of the proposed methodology.
Fig. 7.

Flow Chart of Methodology.
4.1. Three probability scores to assess place visit
We introduce three different scores to assess the probability of visiting the next place, which are global visit score (GVS), temporal visit score (TVS), and periodical visit score (PVS), each focusing on a different aspect of travel behavior to infer next visits. A minimum entropy selection method (MESM) is used to determine which probability score is most appropriate, among three sets of probabilities.
Global Visit Score
GVS is defined as PGlobal(pi+1|pi), the probability of visiting the (i + 1)th place pi+1 right after the ith place pi in a global view, regardless of time. GVS takes advantage of commonly chained activities, where the probability of visiting the next place is influenced by the currently visited place. However, if we consider the dependence of pi+1 on all previously visited places, we may not have enough data to derive the score. Therefore, this study assumes a Markov property: the model considers only the currently visited place, not all previously visited places.
For the first place p1 in the trip chain, GVS is defined as:
$$P_{Global}(p_1) = \frac{v(p_1)}{\sum_{p_j \in P} v(p_j)} \tag{1}$$
where v(p) represents the number of visits to place p in the entire training set.
For non-first places, given current place pi, the GVS of visiting the next place pi+1 is:
$$P_{Global}(p_{i+1} \mid p_i) = \frac{v(p_{i+1} \mid p_i)}{\sum_{p_j \in P} v(p_j \mid p_i)} \tag{2}$$
where P is the list of potentially visited places, pj ∈ P is the jth place in that list, and v(pi+1|pi) is the number of visits to place pi+1 given the current place pi, regardless of time.
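The count-based estimates behind GVS can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `global_visit_scores` and the toy chains are hypothetical, and each chain is assumed to be one day's ordered list of visited places.

```python
from collections import Counter, defaultdict


def global_visit_scores(chains):
    """Return (first_place_probs, transition_probs) from daily place chains."""
    visits = Counter()                  # v(p): visits to p over the whole training set
    transitions = defaultdict(Counter)  # v(p_next | p_cur): consecutive visit pairs
    for chain in chains:
        visits.update(chain)
        for cur, nxt in zip(chain, chain[1:]):
            transitions[cur][nxt] += 1

    total = sum(visits.values())
    first = {p: n / total for p, n in visits.items()}
    trans = {
        cur: {nxt: n / sum(counts.values()) for nxt, n in counts.items()}
        for cur, counts in transitions.items()
    }
    return first, trans


# Hypothetical training data: three days of visited places.
chains = [
    ["Home", "Work", "Home"],
    ["Home", "Work", "Shopping", "Home"],
    ["Home", "Recreation", "Home"],
]
first, trans = global_visit_scores(chains)
print(first["Home"])          # share of all visits that are to Home
print(trans["Home"]["Work"])  # share of departures from Home that go to Work
```

Because only the current place conditions the next one, the table of counts stays small even for long chains, which is the practical benefit of the Markov assumption.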
Temporal Visit Score
TVS is defined as the probability of visiting the (i + 1)th place pi+1 right after the ith place pi at time period t. TVS is employed to predict places that are regularly visited at a specific time. First, we divide a day into 24 one-hour time periods (hours 0 to 23). Then we calculate the probability of visiting a place in each time period.
For the first place p1, the TVS at time period t is defined as:
| (3) |
where vt(p) represents the number of visits to place p within time period t, M is the total number of days in the collected dataset, and vt,m(p) stands for the number of visits to place p within time period t on day m, with m ∈ M.
For non-first places, given current place pi, TVS of visiting place pi+1 at time period t is:
| (4) |
where vt(pi+1|pi) represents the number of visits to place pi+1 at time period t, given the current place pi.
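A time-conditioned version of the same counting idea is sketched below. It assumes that visits are available as hypothetical `(hour, current_place, next_place)` tuples pooled over all days, and that the score is normalized over the candidate next places within each hour; both are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def temporal_visit_scores(visits):
    """visits: iterable of (hour, current_place, next_place), hour in 0..23.

    Returns {(hour, current_place): {next_place: probability}}.
    """
    counts = defaultdict(Counter)
    for hour, cur, nxt in visits:
        counts[(hour, cur)][nxt] += 1   # v_t(p_next | p_cur), pooled over days
    return {
        key: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for key, c in counts.items()
    }


# Hypothetical observations for one participant.
observed = [
    (8, "Home", "Work"), (8, "Home", "Work"), (8, "Home", "School"),
    (18, "Work", "Home"), (18, "Work", "Shopping"),
]
tvs = temporal_visit_scores(observed)
print(tvs[(8, "Home")])
```

With 24 periods per day the counts are spread thinner than for GVS, so hours with few observations give noisier estimates.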
Periodical Visit Score
PVS is defined as the probability of visiting place p on the qth day of the period Q. PVS measures probabilities for places that are visited periodically. For example, employees visit their workplaces every weekday. Unlike GVS and TVS, PVS is assumed not to be influenced by the previously visited place.
| (5) |
where vq(p) represents the number of visits to place p on the qth day of the visiting period, obtained by summing vq,m(p) over the collected days. M is the number of days in the collected dataset, and vq,m(p) represents the number of visits to place p on the qth day of the visiting period Q of that place. We calculate Q with a frequency-domain method applied to the number of visits per time unit. We employ the power spectral density (PSD) to detect the period: PSD gives the power at each frequency after the time-domain data are converted into the frequency domain. We take the frequency with the highest power, fH, as the most appropriate visiting frequency for the place and let Q = 1/fH. For more information about PSD, please refer to Vlachos et al. (2005).
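The period detection step can be illustrated with a plain periodogram. The sketch below uses a naive discrete Fourier transform from the standard library instead of a signal-processing package; the function name `dominant_period`, the mean-centering step, and the toy series are assumptions for illustration only.

```python
import cmath


def dominant_period(daily_visits):
    """Return the period (in days) of the strongest non-zero frequency."""
    n = len(daily_visits)
    mean = sum(daily_visits) / n
    centered = [x - mean for x in daily_visits]   # drop the zero-frequency term
    best_k, best_power = None, 0.0
    for k in range(1, n // 2 + 1):                # frequencies k/n cycles per day
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2 / n               # periodogram estimate of the PSD
        if power > best_power:
            best_k, best_power = k, power
    if best_k is None:
        return None                               # constant series: no periodicity
    return n / best_k                             # Q = 1 / f_H with f_H = k / n


# Four weeks of daily visit counts for a place visited on weekdays only.
workplace = [1, 1, 1, 1, 1, 0, 0] * 4
period = dominant_period(workplace)
print(period)
```

For the weekday-only series, the strongest component is the weekly one, so the detected period is seven days.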
Minimum Entropy Selection Method
As the three proposed probability scores utilize different aspects of travel behavior, they may be used simultaneously to predict the next place to be visited. This generates three sets of probabilities, one for each score. In order to select the most appropriate score, we introduce a minimum entropy selection method (MESM) to evaluate the three sets of probabilities. Since entropy measures the uncertainty of a random variable, the smaller the entropy value is, the less uncertain, or the more certain, the random variable is. Out of the three sets of probabilities, we first select the set with the lowest entropy value. Then the place that has the highest probability in the set is selected as the predicted place. Equation (6) shows the entropy formula and Fig. 8 presents the MESM algorithm for choosing the right score, respectively.
$$H = -\sum_{i} P_i \log P_i \tag{6}$$
where Pi is the probability of each variable being chosen, which represents the probability of visiting each place under each score.
Fig. 8.

Algorithm of MESM.
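The selection rule described above can be sketched as follows. The function names and the three toy distributions are hypothetical, and the natural logarithm is assumed; the choice of base does not change which set has the lowest entropy.

```python
import math


def entropy(probs):
    """Shannon entropy of a discrete distribution (0 * log 0 treated as 0)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mesm_select(score_sets):
    """score_sets: {score_name: {place: probability}}.

    Pick the score whose distribution has the lowest entropy, then the
    most probable place under that score.
    """
    best_score = min(score_sets, key=lambda s: entropy(score_sets[s].values()))
    dist = score_sets[best_score]
    return best_score, max(dist, key=dist.get)


# Hypothetical probabilities for the next place under each score.
candidates = {
    "GVS": {"Home": 0.4, "Work": 0.3, "Shopping": 0.3},
    "TVS": {"Home": 0.1, "Work": 0.85, "Shopping": 0.05},
    "PVS": {"Home": 0.5, "Work": 0.5},
}
choice = mesm_select(candidates)
print(choice)
```

In this toy case TVS is the most concentrated distribution, so it is selected and its most probable place, "Work", becomes the prediction.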
4.2. Three levels of activity-location schedule modeling
As mentioned above, participants report only 5 activities per week during the survey period. Other activities remain unknown, although their GPS locations are recorded. To tackle the issue of missing activity information, this paper develops a new probabilistic model inspired by efforts to predict mobile app usage behavior (Liao et al., 2012; Liao et al., 2013). The daily activity-location schedule is modeled at three levels. The first level considers only reported frequent places, the second level consolidates all unknown activities into a single category, and the third level treats each unknown location as a separate activity location. For each level, the three probability scores and the hazard-based model are applied to predict the activity location and duration. Fig. 9 illustrates the concept of the three-level modeling approach.
Fig. 9.

Three-level model for activity-location schedule: (a) Level 1: consider frequently visited places only, (b) Level 2: consider unknown places as one category, and (c) Level 3: model each unknown place as a separate activity location.
Level 1 and Level 2 Activity-location Schedule Model
This subsection introduces the activity-location schedule models for both Levels 1 and 2 due to the similarity between them. According to several studies and household travel surveys (Cui et al., 2018a; Kitamura et al., 2000), we set the simulated day to start at 3:00 am and end at 2:59 am on the next day. Fig. 10 presents the algorithm to predict activity location and duration scheduling at the two levels.
Fig. 10.

Algorithm of Level 1 and Level 2 Activity-location Schedule Model.
Level 3 Activity-location Schedule Model
The Level 3 schedule model handles unknown activities separately. We first employ DBSCAN, a density-based clustering algorithm (Ester et al., 1996), to group unknown places that are close to each other, then label the groups “Unknown 1”, “Unknown 2”, etc. An unknown place that is not grouped with other unknown places forms a group by itself. The activity type of each unknown group is obtained using the Google Places API and the rules described in Appendix A. The labeled place list is updated accordingly. Fig. 11 presents the algorithm to predict activity location and duration scheduling for the third level.
Fig. 11.

Algorithm of Level 3 Activity-location Schedule Model.
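The grouping of unknown places can be illustrated with a small, self-contained DBSCAN sketch. The neighbourhood radius `eps_m`, the `min_pts` value, and the coordinates are assumptions chosen for illustration; the parameters used in the study are not stated here, and a library implementation would normally be used in practice.

```python
import math


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(min(1.0, math.sqrt(h)))


def group_unknown_places(points, eps_m=100.0, min_pts=2):
    """Label points with DBSCAN; ungrouped (noise) points become their own group."""
    n = len(points)
    neighbours = [
        [j for j in range(n) if haversine_m(points[i], points[j]) <= eps_m]
        for i in range(n)
    ]
    labels = [None] * n
    next_id = 0
    for i in range(n):
        if labels[i] is not None or len(neighbours[i]) < min_pts:
            continue                        # already clustered, or not a core point
        next_id += 1
        labels[i] = next_id
        frontier = list(neighbours[i])
        while frontier:                     # expand the cluster through core points
            j = frontier.pop()
            if labels[j] is None:
                labels[j] = next_id
                if len(neighbours[j]) >= min_pts:
                    frontier.extend(neighbours[j])
    for i in range(n):                      # leftover points form singleton groups
        if labels[i] is None:
            next_id += 1
            labels[i] = next_id
    return [f"Unknown {k}" for k in labels]


# Hypothetical unknown places: two close together, one several kilometres away.
places = [(43.0000, -78.7900), (43.0003, -78.7901), (43.0500, -78.8000)]
groups = group_unknown_places(places)
print(groups)
```

Turning noise points into singleton groups mirrors the rule that an unknown place not grouped with others becomes a group by itself.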
4.3. Estimating the duration of stay using hazard model
Hazard-based modeling, or survival analysis, models the time until one or more events occur. This method has been widely applied in transportation research, for example to the time until emergency response reaches the scene of a vehicle crash (Chung, 2010) and the time devoted to an activity (shopping, recreation, etc.) (van den Berg et al., 2012). In this paper, we employ the hazard-based model to estimate the duration of stay at a place.
Let hi(t) = λ(t|Xi) represent the hazard function for the ith person at time t, where Xi = {Xi1, Xi2, ⋯, XiK} are K regressors. The baseline hazard function h0(t) is the hazard function at time t when Xi1 = Xi2 = ⋯ = XiK = 0; it plays a role similar to the intercept term in a regression model. The Cox proportional hazards (PH) model leaves the baseline hazard unspecified and writes it as α(t) = log h0(t), so the log hazard function is:
$$\log h_i(t) = \alpha(t) + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_K X_{iK} \tag{7}$$
In this paper, we employ the Cox PH model to estimate the duration of stay at a place. The h term in Equation (7) represents the time duration, and the independent variables are listed in Table 5, along with the coefficients and statistics of the Cox hazard model. Time variables are treated as continuous variables, and the other variables are treated as dummy variables, which take the value 0 or 1 to indicate absence or presence. Analyses of travel behavior with respect to socioeconomic factors indicate that gender, race, and household structure significantly influence travel behavior (Mauch and Taylor, 1997; Dogan et al., 2019). Race does not directly explain travel behavior; rather, its influence arises because race in the US is highly related to economic well-being (Deka and Lubin, 2012; Wong et al., 2020). Therefore, household size, gender, age, and race are included in addition to the travel-related attributes.
Table 5.
Coefficients and Statistics of Hazard Models.
| Variable group | Variable | Level 1 coef | Level 1 z | Level 1 Pr(>\|z\|) | Level 1 Sig | Level 2&3 coef | Level 2&3 z | Level 2&3 Pr(>\|z\|) | Level 2&3 Sig |
|---|---|---|---|---|---|---|---|---|---|
| Time | Month | 0.001 | 4.712 | 2.45E-06 | *** | 0.002 | 6.142 | 8.14E-10 | *** |
| Time | Activity Start Time | 0.024 | 82.665 | <2.00E-16 | *** | 0.023 | 79.211 | <2.00E-16 | *** |
| Household Size | 1 | −0.545 | −11.762 | <2.00E-16 | *** | −0.329 | −7.423 | 1.15E-13 | *** |
| Household Size | 2 | −0.472 | −10.207 | <2.00E-16 | *** | −0.260 | −5.856 | 4.73E-09 | *** |
| Household Size | 3 | −0.456 | −9.847 | <2.00E-16 | *** | −0.240 | −5.403 | 6.54E-08 | *** |
| Household Size | 4 | −0.467 | −10.1 | <2.00E-16 | *** | −0.236 | −5.318 | 1.05E-07 | *** |
| Household Size | 5 | −0.407 | −8.764 | <2.00E-16 | *** | −0.210 | −4.708 | 2.50E-06 | *** |
| Household Size | 6 | −0.330 | −7.274 | 3.48E-13 | *** | −0.161 | −3.705 | 0.000211 | *** |
| Household Size | 7 or more | −0.506 | −10.507 | <2.00E-16 | *** | −0.199 | −4.339 | 1.43E-05 | *** |
| Gender | Female | 0.366 | 9.17 | <2.00E-16 | *** | 0.251 | 6.819 | 9.14E-12 | *** |
| Gender | Male | 0.404 | 10.121 | <2.00E-16 | *** | 0.281 | 7.635 | 2.26E-14 | *** |
| Age | 18–64 | 0.437 | 9.522 | <2.00E-16 | *** | 0.269 | 6.125 | 9.04E-10 | *** |
| Age | ≥ 65 | 0.538 | 11.667 | <2.00E-16 | *** | 0.319 | 7.223 | 5.08E-13 | *** |
| Race | American Indian or Alaska Native, White | −0.031 | −0.892 | 0.000781 | *** | 0.079 | 3.661 | 0.000251 | *** |
| Race | Asian | 0.066 | 4.3 | 1.71E-05 | *** | 0.024 | 1.73 | 0.083645 | . |
| Race | Asian, Native Hawaiian or other Pacific Islander, White, other race | 0.148 | 3.432 | 0.000599 | *** | 0.244 | 5.997 | 2.00E-09 | *** |
| Race | Asian, White | 0.096 | 3.234 | 0.001219 | ** | 0.127 | 4.744 | 2.10E-06 | *** |
| Race | Black or African American | 0.054 | 3.281 | 0.001033 | ** | 0.062 | 3.995 | 6.46E-05 | *** |
| Race | Black or African American, White | 0.183 | 7.089 | 1.35E-12 | *** | 0.073 | 2.827 | 0.004692 | ** |
| Race | Other race | 0.069 | 3.784 | 0.000154 | *** | 0.077 | 4.65 | 3.32E-06 | *** |
| Race | White | 0.192 | 15.663 | <2.00E-16 | *** | 0.158 | 13.973 | <2.00E-16 | *** |
| Race | White, Other race | 0.274 | 10.85 | <2.00E-16 | *** | 0.219 | 8.629 | <2.00E-16 | *** |
| Activity Type | Other | 1.327 | 140.829 | <2.00E-16 | *** | 1.099 | 101.407 | <2.00E-16 | *** |
| Activity Type | Personal | 1.450 | 245.023 | <2.00E-16 | *** | 1.198 | 178.59 | <2.00E-16 | *** |
| Activity Type | Recreation | 1.416 | 347.812 | <2.00E-16 | *** | 1.182 | 276.083 | <2.00E-16 | *** |
| Activity Type | School | 1.390 | 174.041 | <2.00E-16 | *** | 1.086 | 115.894 | <2.00E-16 | *** |
| Activity Type | Shopping | 1.926 | 392.471 | <2.00E-16 | *** | 1.836 | 334.923 | <2.00E-16 | *** |
| Activity Type | Trans | 1.588 | 65.295 | <2.00E-16 | *** | 1.393 | 48.462 | <2.00E-16 | *** |
| Activity Type | Unknown | – | – | – | – | 0.843 | 206.129 | <2.00E-16 | *** |
| Activity Type | Work | 0.931 | 179.016 | <2.00E-16 | *** | 0.826 | 157.117 | <2.00E-16 | *** |
| If conducted following activities previously | Home | 0.143 | 40.873 | <2.00E-16 | *** | 0.258 | 75.451 | <2.00E-16 | *** |
| If conducted following activities previously | Work | 0.065 | 17.027 | <2.00E-16 | *** | 0.108 | 28.609 | <2.00E-16 | *** |
| If conducted following activities previously | School | 0.052 | 9.122 | <2.00E-16 | *** | 0.069 | 10.088 | 2.92E-10 | *** |
| If conducted following activities previously | Personal | 0.039 | 9.691 | <2.00E-16 | *** | 0.032 | 6.928 | 4.25E-12 | *** |
| If conducted following activities previously | Recreation | 0.084 | 30.969 | <2.00E-16 | *** | 0.076 | 25.695 | <2.00E-16 | *** |
| If conducted following activities previously | Shopping | 0.013 | 4.054 | <2.00E-16 | *** | 0.007 | 1.753 | 0.079558 | . |
| If conducted following activities previously | Trans | 0.084 | 4.239 | 5.04E-05 | *** | 0.115 | 4.691 | 2.72E-06 | *** |
| If conducted following activities previously | Other | 0.065 | 9.224 | 2.24E-05 | *** | 0.048 | 5.625 | 1.85E-08 | *** |
| If conducted following activities previously | Unknown | – | – | – | – | 0.047 | 17.321 | <2.00E-16 | *** |
Significance codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
A misspecification test was conducted prior to building the hazard model (Ziliak and McCloskey, 2008; Gillborn et al., 2018). The test result shows that using the hazard model with the selected variables to predict the staying duration is appropriate. Nearly all coefficients are significantly different from zero at the 0.01 level; the exceptions in Table 5 are “Asian” and previous “Shopping” activity in the Level 2&3 model, which are significant only at the 0.1 level. The selected variables therefore have power in explaining the response variable, the staying duration. In Table 5, the coefficient of “Activity Start Time” is positive, which means that the later an activity starts, the sooner it is likely to end. Coefficients of “Household Size” are negative, meaning that duration decreases when the size of households increases. The coefficient of Male is larger than that of Female, which indicates that males complete activities more slowly than females on average. Males are prone to spend more time on out-of-home activities (social, recreation, out-of-town, etc.), whereas females contribute more to domestic work (shopping, eating, religious, etc.) (Zhong et al., 2012). In general, males spend more time on activities. Most coefficients of “Race”, “Activity Type”, and “If conducted following activities previously” are positive. The negative sign of “American Indian or Alaska Native, White” under “Race” indicates that this group has shorter activity durations than other races (Helms et al., 2005). The variables under “If conducted following activities previously” are dummy variables. Since the covariates enter the log hazard additively in Equation (7), the coefficients accumulate when multiple activities were conducted before. Therefore, the more previous activities have been conducted, the sooner the next activity ends.
After the activity-location schedule is generated and the duration of stay is estimated for each activity, an example of the final activity-location-duration patterns for the three levels is shown in Fig. 12: Level 1 considers frequently visited places only, Level 2 treats unknown places as one category, and Level 3 models each unknown place as a separate activity location.
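Under the proportional hazards form, coefficients are often easier to read as hazard ratios, obtained by exponentiating a coefficient or a difference of coefficients. The sketch below does this for a few Level 1 values copied from Table 5; the helper name `hazard_ratio` and the choice of comparisons are illustrative assumptions.

```python
import math

# Level 1 coefficients copied from Table 5.
activity_coef = {"Work": 0.931, "Shopping": 1.926}
previous_coef = {"Home": 0.143, "Work": 0.065}


def hazard_ratio(beta_a, beta_b=0.0):
    """Relative hazard of the activity ending for profile a versus profile b."""
    return math.exp(beta_a - beta_b)


# Shopping versus Work, all other covariates equal.
shopping_vs_work = hazard_ratio(activity_coef["Shopping"], activity_coef["Work"])

# Having previously been at Home and at Work: the two dummy coefficients add up
# in the log hazard, so their effects multiply in the hazard.
after_home_and_work = hazard_ratio(previous_coef["Home"] + previous_coef["Work"])

print(round(shopping_vs_work, 2))     # 2.7
print(round(after_home_and_work, 2))  # 1.23
```

A ratio above 1 means the activity is more likely to end at any given moment, so the stay tends to be shorter; the second figure illustrates how the coefficients of several previously conducted activities accumulate.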
Fig. 12.

An Example of Final Activity-Location Duration Patterns Result (Female, Age: 18–64, White, Household Size: 1).
5. Validation
We conduct validation at two levels: the individual level and the aggregated level. Given that individuals’ activity-location-duration patterns evolve over time, we devise a multi-period validation approach to accommodate this dynamic. The 20-week dataset is divided into 5 periods of four weeks each. In each period, the first three weeks are used as training data and the last week as validation data.
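The multi-period split can be written down directly. The sketch below is illustrative only; the function name and the use of 1-based week numbers are assumptions.

```python
def multi_period_split(n_weeks=20, period_len=4, n_train=3):
    """Return a list of (train_weeks, validation_weeks), one pair per period."""
    splits = []
    for start in range(1, n_weeks + 1, period_len):
        weeks = list(range(start, min(start + period_len, n_weeks + 1)))
        splits.append((weeks[:n_train], weeks[n_train:]))
    return splits


splits = multi_period_split()
for train, valid in splits:
    print(train, valid)
```

Each period is trained and validated on its own weeks, so a change in a participant's routine partway through the survey affects only the periods in which it occurs.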
5.1. Individual-level validation
The predicted location and duration for a given time are compared with the ground truth for each individual and for all three levels of models (i.e., reported frequent places only, all unknown activities as a single category, and each unknown location as a separate activity location). A correct location prediction is labeled 1 and an incorrect one 0. The accuracy of the predicted location is measured as the percentage of correct predictions. The accuracy of the predicted activity duration is assessed by the mean absolute percentage error (MAPE).
Fig. 13 presents the accuracies averaged over all individuals. The accuracies of the Level 1, Level 2, and Level 3 models are 70.83%, 64.18%, and 63.19%, respectively. Although individual travel behavior can be random and vary with time, the models capture most activities, particularly routine ones such as “Home” and “Work”. The accuracy of Level 3 is not much lower than that of Level 1, even though prediction at Level 3 is more challenging. The reason is that unknown places are less likely to be visited than frequently visited places, so visits to unknown places are not numerous. Fig. 13 also indicates that as the accuracy of activity-location choice decreases from Level 1 to Level 3, the MAPE of duration increases due to the increasing uncertainty.
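The two individual-level metrics can be sketched as follows. The function names and the toy predictions are hypothetical; durations are assumed to be in hours and strictly positive.

```python
def location_accuracy(predicted, actual):
    """Share of time slots whose predicted place matches the observed place."""
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)


def mape(predicted, actual):
    """Mean absolute percentage error; observed values must be non-zero."""
    return sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual)) / len(actual)


pred_places = ["Home", "Work", "Work", "Shopping", "Home"]
true_places = ["Home", "Work", "Work", "Recreation", "Home"]
pred_hours = [8.0, 9.0, 1.0]
true_hours = [10.0, 8.0, 1.25]

acc = location_accuracy(pred_places, true_places)
err = mape(pred_hours, true_hours)
print(f"accuracy = {acc:.0%}, MAPE = {err:.1%}")
```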
Fig. 13.

Overall Accuracy of Activity-location Prediction and MAPE of Duration.
Fig. 14 shows the accuracies of location prediction by hour of day for all three modeling levels. As anticipated, prediction accuracy is high during the night and relatively low during the daytime. This result aligns with the common understanding that activities do not vary much at night but may vary considerably during the daytime. All three levels show similar accuracy trends in terms of major falls and rises, except that the accuracies of the Level 2 and Level 3 models rise considerably higher than that of Level 1 during the hours of 6–8 pm. As exemplified in Fig. 6, individuals tend to visit a large number of infrequently visited places after work. The Level 1 model, on the other hand, considers only frequently visited places and thus cannot capture infrequently visited places. Table 6 shows the MAPE of activity duration for different activity types. Activity types with a wider range of staying duration (Max − Min) tend to have a higher MAPE, e.g., Home. Shopping has the most predictable stay duration across the three levels.
Fig. 14.

Hourly Accuracy of Activity-location Prediction.
Table 6.
MAPE of Staying Duration for Activity Types.
| Activity Type | Min (Hours) | Max (Hours) | Level 1 MAPE | Level 2 MAPE | Level 3 MAPE |
|---|---|---|---|---|---|
| Home | 0.2819049 | 20.16674 | 17.73% | 22.44% | 24.58% |
| Work | 0.2685546 | 9.739176 | 14.18% | 14.85% | 16.26% |
| School | 0.2540277 | 8.349861 | 11.33% | 16.17% | 17.71% |
| Shopping | 0.2530551 | 5.400831 | 7.82% | 11.09% | 12.14% |
| Recreation | 0.255833 | 8.850557 | 11.13% | 15.95% | 17.47% |
| Personal | 0.2538899 | 8.985486 | 9.44% | 13.70% | 15.00% |
| Trans | 0.2563583 | 7.830212 | 11.12% | 13.32% | 14.58% |
| Other | 0.2546174 | 9.602191 | 14.63% | 12.79% | 14.01% |
Fig. 15 reports the effectiveness of the three scores utilized to predict the activity location: global visit score (GVS), temporal visit score (TVS), and periodical visit score (PVS). All test trips are arranged along the x-axis, against the three model levels along the y-axis. The colored vertical lines indicate the scores that are effective in predicting activity locations for the given trip and model level. The Level 1 prediction tends to utilize a mixture of TVS and PVS. Since this level considers only frequent places, for which the timing or periodicity of trips is explicitly available, the activity locations can be readily captured by either score. In contrast, both Level 2 and Level 3 must consider unknown places in the activity generation process. Level 2 treats unknown places as one category. Once the times at which people visit frequently visited places are taken out, the remaining time blocks should also show patterns for the unknown places. In addition, the number of unknown places is higher than the number of frequently visited places. Level 2 therefore tends to employ TVS, which is time sensitive. For Level 3, there is little or no timing and periodicity information for visits to individual unknown places, and people do not visit unknown places frequently, so GVS is more effective in determining the visited places.
Fig. 15.

Visualization of Which Score is Utilized When Predicting the Activity Location.
5.2. Aggregated-level validation
We also validate the prediction results at an aggregated level through simulation. A Monte Carlo simulation is run 300 times for each participant. For each simulated day of a participant, the activity-location chain, i.e., the next place to be visited, is drawn from the probabilities of the activity-location schedule model (Section 4). Table 7 compares the distribution of activity types between the simulation result and the survey data, and the differences are small. The largest differences are for “Shopping”, overestimated by 2.95 percentage points, and “Work”, underestimated by 2.80 percentage points.
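The sampling step of such a simulation can be sketched as below. This is a simplified illustration that draws only the place chain from fixed first-place and transition probabilities; it omits the score selection and the duration model, and all names and probabilities are hypothetical.

```python
import random
from collections import Counter


def simulate_day(first_probs, trans_probs, n_stops, rng):
    """Draw one place chain: first place, then Markov transitions."""
    places, weights = zip(*first_probs.items())
    chain = [rng.choices(places, weights=weights)[0]]
    while len(chain) < n_stops:
        options = trans_probs.get(chain[-1])
        if not options:
            break                                   # no observed onward move
        places, weights = zip(*options.items())
        chain.append(rng.choices(places, weights=weights)[0])
    return chain


# Hypothetical probabilities for one participant.
first_probs = {"Home": 1.0}
trans_probs = {
    "Home": {"Work": 0.6, "Shopping": 0.4},
    "Work": {"Home": 0.8, "Shopping": 0.2},
    "Shopping": {"Home": 1.0},
}

rng = random.Random(0)                              # fixed seed for repeatability
runs = [simulate_day(first_probs, trans_probs, 4, rng) for _ in range(300)]

counts = Counter(place for chain in runs for place in chain)
total = sum(counts.values())
shares = {place: n / total for place, n in counts.items()}
print(shares)
```

Aggregating the simulated chains into activity shares gives the kind of distribution that can be compared with the survey shares.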
Table 7.
Distributions of Activity Type for Survey Data and Simulation Results.
| Activity Type | Survey Frequency | Survey Percentage | Simulation Frequency | Simulation Percentage | Error (Simulation − Survey) |
|---|---|---|---|---|---|
| Home | 369,687 | 51.31% | 1,698,507 | 52.59% | 1.28% |
| Work | 83,169 | 11.54% | 282,297 | 8.74% | −2.80% |
| School | 19,174 | 2.66% | 71,566 | 2.22% | −0.44% |
| Shopping | 52,153 | 7.24% | 329,176 | 10.19% | 2.95% |
| Recreation | 145,921 | 20.25% | 639,505 | 19.80% | −0.45% |
| Personal Business | 36,954 | 5.13% | 149,646 | 4.63% | −0.50% |
| Transportation | 1,373 | 0.19% | 8,830 | 0.27% | 0.08% |
| Other | 12,025 | 1.67% | 50,299 | 1.56% | −0.11% |
| Total | 720,456 | 100% | 3,229,826 | 100% | – |
Moreover, Fig. 16 compares the distributions of activity duration between the simulation result and the survey. The two distribution lines are close to each other, validating the effectiveness of the proposed prediction model. There are more short-duration activities than long ones: the percentage of activities decreases as activity duration increases. However, the percentage of activities longer than 4 h is relatively high. These activities may include “Home”, “Work”, or visits to relatives’ and friends’ houses, which belong to “Recreation”.
Fig. 16.

Distributions of Activity Duration for Survey Data and Simulation Result (labeled percentage in the figure indicates the difference between survey and simulation [% Survey - % Simulation]).
Among past studies, Kitamura et al. (2000), for example, reported that the total number of trips can be overestimated by 20% for workers. Our model seems to perform well, as illustrated by Table 7 and Fig. 16. The good performance may be partially attributed to the extended length of the survey data (20 weeks), which might have compensated for imperfections in the data. This is perhaps also the reason that our model outperforms those that use short-term one- or two-day survey data.
Comparing the individual-level and aggregate-level validation results, the aggregate level is more effective for simulating activities and travel sequences for analysis. In addition, aggregate-level simulation is closer to reality when activities are simulated over a long time frame. Hence, aggregate-level simulation over a long time frame provides better results.
6. Discussions
The survey data used in this study has both similarities to and differences from traditional survey data. The shared information, such as socio-demographic, individual, and household information, provides the basis of the analysis, while the differences present unique challenges and opportunities for this study. The first difference is participant sampling. Participants of a traditional survey are selected according to the composition of the actual population, whereas in the survey used for this study, recruitment targeted suburban residents but the actual participants came mostly from the area between downtown and the suburbs. The second difference is that instead of one- or multi-day travel diaries, our data consist of five frequently visited places reported weekly, and places never reported remain unknown in our dataset. However, this kind of missing information can be complemented by the long data collection period. The third difference is that traditional travel surveys assisted by GPS devices contain high-frequency GPS points, whereas GPS points in our data are sampled at low frequency to preserve battery life. Therefore, the traditional travel surveys contain valuable information that is irreplaceable. Our dataset augments traditional travel surveys by extending the survey period. Long-term data can capture more information and variations in travel behaviors and patterns.
There are two travel behavior characteristics that are not included in our dataset and were not a concern for this particular study. One is the travel mode for all activities, and the other is the trip or activity purpose for unreported trips. Travel mode can hardly be obtained from low-frequency, low-precision GPS points. Travel mode is mostly detected by speed-based methods, which determine travel mode according to speed and time (Bohte and Maat, 2009; Yang et al., 2016), and from high-frequency GPS data (Shafique and Hato, 2016; Widhalm et al., 2012). These methods depend mainly on the frequency of GPS points. There is an emerging trend to detect travel modes with advanced smartphone sensors, such as accelerometers, gravity sensors, gyroscopes, and magnetometers (Eftekhari and Ghatee, 2016; Su et al., 2016; Zhou et al., 2016). This technique achieves high detection accuracy and avoids draining the smartphone battery.
Trip purpose is more difficult to analyze than travel mode, because trip purpose identification needs to combine GPS trajectory data with other sources of data, including land use information, temporal information, and socio-demographics. Methodologies for trip purpose prediction can be divided mainly into two categories: rule-based methods (Chen et al., 2010; Wolf et al., 2004) and probabilistic methods. Recently, social media data have also been used in trip purpose studies (Cui et al., 2018b; Meng et al., 2017). These methods rely on the accuracy of GPS points. If the precision of GPS is too low, the real activity locations or trip ends are far from the recorded GPS points. Thus, one coordinate could correspond to a large number of uncertain activity locations and many nearby POIs, which makes the inference process difficult. In this study, iOS devices record a new GPS point only when the position changes by 500 m or more. However, there are plenty of POIs within a circle of 500-meter radius. To obtain a reliable trip purpose, it is better to utilize other data sources, such as social media data.
Our model can generate activity-location schedules for participants who have reported their historical travel data. For unknown travelers, there are still two ways to simulate their activity-location schedules. The first is to generate travel information from other datasets collected within the same area (e.g., travel survey data, smart card data, check-in data, ridesharing data). The second is to associate socio-demographic data with travel behavior data and generate activity-location schedules with discrete choice models or machine learning algorithms. Further, joint activity data play an important role in future travel surveys and travel behavior analysis. For example, one can explore how two household heads plan their trips jointly and analyze how travelers show up at the same locations within similar time spans. Given survey data, one can easily identify members of the same household, co-workers, or schoolmates. Moreover, social media data can be collected for this research. Friendships on social media are easy to detect, as are the locations that friends visit together. Combining these two pieces of information, one can construct a detailed social network with information about household members, co-workers, and friends.
7. Conclusions
This paper develops a three-level probabilistic activity-location schedule model with a 7-month GPS-based survey, which contains limited activity information. The model handles unknown information at different levels: known places only (Level 1), all unknown places as one new category (Level 2), and individual unknown places (Level 3). Further, a hazard model is employed to estimate activity duration and complete the activity-location schedule.
The model in this paper can capture major activities in a synthetic day. In addition, running this probabilistic model multiple times can capture a variety of activity-travel patterns. The accuracy of validation results on the individual level is high during night (83%) and relatively low (67%) during daytime due to the activity variability at different times. The average individual-level validation accuracies are 70.83%, 64.18%, and 63.19% for Level 1, Level 2, and Level 3, respectively. From the aggregated validation, the simulation results are quite close (within 3% errors) to the ground-truth survey data in terms of activity type distribution and activity duration. Therefore, the simulation results are close to the actual activity-location schedule. Results suggest that our method effectively models the activity-location schedule and can handle data that is missing partial trip purpose information.
In the future, this model can be used in activity-based traffic simulations. To input the generated activity-location schedule into a traffic simulation, one needs to recognize travel modes from advanced smartphone sensors. Moreover, the missing labels of trip ends can be identified with other data sources such as social media data. Even though the proposed hazard model performs well with the attributes already included, several attributes, such as income and employment information, are still missing; this is an inherent limitation of our dataset. Therefore, the hazard model can be further improved with more information. In addition, as individual mobility facilitates the dispersion of infectious diseases, the activity-location schedule is valuable in assessing the health risk of populations through their mobility and interaction behaviors (Bian et al., 2012). The proposed method could therefore also support modeling the spread of COVID-19. However, the dataset collected in this study reflects only pre-COVID travel behavior. This survey was originally conducted for tracing the spread of influenza. By linking social media data with travel survey data, one can examine how disease propagates through social activities. Therefore, the methodology proposed in this paper can also assist in modeling healthcare applications. Although the original purpose of this survey was to track disease, the information it collected is what a travel survey dataset requires. This study broadens the data pool for future travel behavior research, indicating that one can use healthcare field data for transportation-related research. The code of the proposed model is available on GitHub (https://github.com/ycui4/Daily-Activity-Location-Schedule-Model.git).
Acknowledgment
This research was supported by TransInfo University Research Center. The dataset was supported by NIH grant 5R01GM108731.
Appendix A
Table. Rules for Grouping Place Types Obtained from the Google Places API into Activity Types
| Activity Type | Place Types Obtained from Google Places API |
|---|---|
| School | university, school, library |
| Shopping | clothing_store, store, liquor_store, supermarket, department_store, grocery_or_supermarket, bakery, shoe_store, pet_store, shopping_mall, convenience_store, home_goods_store, car_dealer, book_store, furniture_store, hardware_store, pharmacy, meal_delivery, electronics_store, meal_takeaway, veterinary_care, florist, brewery, brewing, bicycle_store, jewelry_store |
| Recreation | lodging, park, food, restaurant, cafe, bar, club, historical_landmark, residence, museum, stadium, gym, natural_feature, bowling_alley, zoo, movie_theater, orchestra, night_club, casino, farm, movie_rental, art_gallery, amusement_park, recreation, neighborhood, theater, theatre, campground, rv_park, staduim |
| Personal Business | wedding_hall, convention_center, banquet_hall, funeral_home, office, laundry, city_hall, post_office, church, car_repair, doctor, lawyer, real_estate_agency, gas_station, bank, plumber, local_government_office, health, animal_shelter, oragnization, beauty_salon, travel_agency, car_wash, wedding_venue, hair_care, car, skin_care, logistics, finance, physiotherapist, insurance_agency, hospital, dentist, spa, moving_company, general_contractor, police, courthouse, offic, office, cemetery, accounting, storage, agency, place_of_worship, electrician, atm, car_rental, hindu_temple, fincance, roofing_contractor, Dig Coworking Space, fire_station, hospical, army, wedding_vener, organization |
| Transportation | airport, parking, train_station, bus_station, trainsit_station, transit_staion |
| Other | point_of_interest, street_address, route, intersection, locality, administrative_area_level_3, political, postal_code |
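The grouping rules in the table can be applied as a simple lookup. The sketch below is a minimal illustration that includes only a handful of the place types listed above. The names `PLACE_TYPE_TO_ACTIVITY` and `activity_for_place` are hypothetical. The tie-breaking rule, under which the first place type that maps to a specific (non-Other) activity wins, is also an assumption, since the table does not say how a place with several types is resolved.

```python
# Partial mapping taken from the table above; extend it with the remaining rows as needed.
PLACE_TYPE_TO_ACTIVITY = {
    "university": "School",
    "school": "School",
    "library": "School",
    "supermarket": "Shopping",
    "pharmacy": "Shopping",
    "restaurant": "Recreation",
    "park": "Recreation",
    "bank": "Personal Business",
    "hospital": "Personal Business",
    "airport": "Transportation",
    "bus_station": "Transportation",
    "point_of_interest": "Other",
    "political": "Other",
}


def activity_for_place(place_types, default="Other"):
    """Return the activity of the first place type that maps to a specific activity.

    Assumption: generic types mapped to "Other" are skipped so that a more
    specific type later in the list can decide the activity.
    """
    for place_type in place_types:
        activity = PLACE_TYPE_TO_ACTIVITY.get(place_type)
        if activity is not None and activity != "Other":
            return activity
    return default


print(activity_for_place(["point_of_interest", "bank"]))
print(activity_for_place(["route"]))
```

Skipping generic types first reflects that a place often carries a broad tag such as `point_of_interest` alongside a more informative one.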
Footnotes
CRediT authorship contribution statement
Yu Cui: Data curation, Writing – original draft. Qing He: Conceptualization, Methodology, Supervision, Writing – review & editing. Ling Bian: Writing – review & editing, Investigation, Validation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
- Allahviranloo M, Recker W, 2013. Daily activity pattern recognition by using support vector machines with multiple classes. Transp. Res. Part B: Methodol 58, 16–43.
- APPLE, 2020. Using the Significant-Change Location Service [Online]. Available: https://developer.apple.com/documentation/corelocation/getting_the_user_s_location/using_the_significant-change_location_service [accessed].
- Asakura Y, Hato E, 2004. Tracking survey for individual travel behaviour using mobile communication instruments. Transp. Res. Part C: Emerg. Technol 12, 273–291.
- Baral R, Li T, Zhu X, 2018. CAPS: Context Aware Personalized POI Sequence Recommender System. arXiv preprint arXiv:1803.01245.
- Bhat CR, Singh SK, 2000. A comprehensive daily activity-travel generation model system for workers. Transp. Res. Part A: Policy Pract 34, 1–22.
- Bian L, Huang Y, Mao L, Lim E, Lee G, Yang Y, Cohen M, Wilson D, 2012. Modeling individual vulnerability to communicable diseases: A framework and design. Ann. Assoc. Am. Geogr 102, 1016–1025.
- Bierlaire M, Chen J, Newman J, 2013. A probabilistic map matching method for smartphone GPS data. Transp. Res. Part C: Emerg. Technol 26, 78–98.
- Bohte W, Maat K, 2009. Deriving and validating trip purposes and travel modes for multi-day GPS-based travel surveys: A large-scale application in the Netherlands. Transp. Res. Part C: Emerg. Technol 17, 285–297.
- Chen C, Gong H, Lawson C, Bialostozky E, 2010. Evaluating the feasibility of a passive travel survey collection in a complex urban environment: Lessons learned from the New York City case study. Transp. Res. Part A: Policy Pract 44, 830–840.
- Cheng L, Chen X, Yang S, Wu J, Yang M, 2017. Structural equation models to analyze activity participation, trip generation, and mode choice of low-income commuters. Transp. Lett 1–9.
- Chung Y, 2010. Development of an accident duration prediction model on the Korean Freeway Systems. Accid. Anal. Prevent 42, 282–289.
- Cottrill CD, Pereira FC, Zhao F, Dias IF, Lim HB, Ben-Akiva ME, Zegras PC, 2013. Future mobility survey: Experience in developing a smartphone-based travel survey in Singapore. Transp. Res. Rec 2354, 59–67.
- Crane R, Crepeau R, 1998. Does neighborhood design influence travel?: A behavioral analysis of travel diary and GIS data. Transp. Res. Part D: Transp. Enviro 3, 225–238.
- Cui Y, He Q, Khani A, 2018a. Travel Behavior Classification: An Approach with Social Network and Deep Learning. Transp. Res. Rec 0361198118772723.
- Cui Y, Meng C, He Q, Gao J, 2018b. Forecasting current and next trip purpose with social media data and Google Places. Transp. Res. Part C: Emerg. Technol 97, 159–174.
- Deka D, Lubin A, 2012. Exploration of poverty, employment, earnings, job search, and commuting behavior of persons with disabilities and African-Americans in New Jersey. Transp. Res. Rec 2320, 37–45.
- Di Donna SA, Cantelmo G, Viti F, 2015. A Markov chain dynamic model for trip generation and distribution based on CDR. In: Models and Technologies for Intelligent Transportation Systems (MT-ITS), 2015 International Conference on, 2015. IEEE, pp. 243–250.
- Dogan O, Bayo-Monton J-L, Fernandez-Llatas C, Oztaysi B, 2019. Analyzing of gender behaviors from paths using process mining: A shopping mall application. Sensors 19 (3), 557. doi:10.3390/s19030557.
- Drchal J, Čertický M, Jakob M, 2019. Data-driven activity scheduler for agent-based mobility models. Transp. Res. Part C: Emerg. Technol 98, 370–390.
- Eftekhari HR, Ghatee M, 2016. An inference engine for smartphones to preprocess data and detect stationary and transportation modes. Transp. Res. Part C: Emerg. Technol 69, 313–327.
- Ester M, Kriegel H-P, Sander J, Xu X, 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. KDD 226–231.
- Gillborn D, Warmington P, Demack S, 2018. QuantCrit: education, policy, ‘Big Data’ and principles for a critical race theory of statistics. Race Ethnicity Educ. 21, 158–179.
- Golob TF, 2000. A simultaneous model of household activity participation and trip chain generation. Transp. Res. Part B: Methodol 34, 355–376.
- Gong H, Chen C, Bialostozky E, Lawson CT, 2012. A GPS/GIS method for travel mode detection in New York City. Comput. Environ. Urban Syst 36, 131–139.
- Gonzalez MC, Hidalgo CA, Barabasi A-L, 2008. Understanding individual human mobility patterns. Nature 453, 779.
- Habib KM, Miller EJ, 2008. Modelling daily activity program generation considering within-day and day-to-day dynamics in activity-travel behaviour. Transportation 35, 467.
- Hariharan R, Toyama K, 2004. Project Lachesis: parsing and modeling location histories. In: International Conference on Geographic Information Science, 2004. Springer, pp. 106–124.
- Hasan S, Ukkusuri SV, 2014. Urban activity pattern classification using topic models from online geo-location data. Transp. Res. Part C: Emerg. Technol 44, 363–381.
- Hasan S, Ukkusuri SV, 2018. Reconstructing Activity Location Sequences From Incomplete Check-In Data: A Semi-Markov Continuous-Time Bayesian Network Model. IEEE Trans. Intell. Transp. Syst 19, 687–698.
- Helms JE, Jernigan M, Mascher J, 2005. The meaning of race in psychology and how to change it: A methodological perspective. Am. Psychol 60, 27.
- Hudson JG, Duthie JC, Rathod YK, Larsen KA, Meyer JL, 2012. Using smartphones to collect bicycle travel data in Texas. Texas Transportation Institute. University Transportation Center for Mobility.
- Kim Y, Pereira FC, Zhao F, Ghorpade A, Zegras PC, Ben-Akiva M, 2015. Activity recognition for a smartphone and web based travel survey. arXiv preprint arXiv:1502.03634.
- Kitamura R, Chen C, Pendyala R, 1997. Generation of synthetic daily activity-travel patterns. Transp. Res. Rec. J. Transp. Res. Board 154–162.
- Kitamura R, Chen C, Pendyala RM, Narayanan R, 2000. Micro-simulation of daily activity-travel patterns for travel demand forecasting. Transportation 27, 25–51.
- Kong X, Li M, Ma K, Tian K, Wang M, Ning Z, Xia F, 2018. Big Trajectory Data: A Survey of Applications and Services. IEEE Access 6, 58295–58306.
- Liao C-F, Chen C, Fan Y, 2017. A review on the state-of-the-art smartphone apps for travel data collection and energy efficient strategies. Transportation Research Board 2017 Annual Meeting Compendium of Papers, 17–00436.
- Liao Z-X, Lei P-R, Shen T-J, Li S-C, Peng W-C, 2012. Mining temporal profiles of mobile applications for usage prediction. In: Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on, 2012. IEEE, pp. 890–893.
- Liao Z-X, Pan Y-C, Peng W-C, Lei P-R, 2013. On mining mobile apps usage behavior for predicting apps usage in smartphones. In: Proceedings of the 22nd ACM international conference on Information & Knowledge Management, 2013. ACM, pp. 609–618.
- Litwin MS, 2005. Dynamic household activity scheduling processes.
- Mauch M, Taylor BD, 1997. Gender, race, and travel behavior: Analysis of household-serving travel and commuting in San Francisco bay area. Transp. Res. Rec 1607, 147–153.
- McGuckin N, Fucci A, 2018. Summary of travel trends: 2017 national household travel survey. Federal Highway Administration, Washington, DC.
- Meng C, Cui Y, He Q, Su L, Gao J, 2017. Travel purpose inference with GPS trajectories, POIs, and geo-tagged social media data. In: Big Data (Big Data), 2017 IEEE International Conference on, 2017. IEEE, pp. 1319–1324.
- Montini L, Prost S, Schrammel J, Rieser-Schüssler N, Axhausen KW, 2015. Comparison of travel diaries generated from smartphone data and dedicated GPS devices. Transp. Res. Procedia 11, 227–241.
- Palma AT, Bogorny V, Kuijpers B, Alvares LO, 2008. A clustering-based approach for discovering interesting places in trajectories. In: Proceedings of the 2008 ACM symposium on Applied computing, 2008. ACM, pp. 863–868.
- Patterson Z, Fitzsimmons K, 2016. DataMobile: Smartphone travel survey experiment. Transp. Res. Rec. J. Transp. Res. Board 35–43.
- Reddy S, Mun M, Burke J, Estrin D, Hansen M, Srivastava M, 2010. Using mobile phones to determine transportation modes. ACM Trans. Sens. Netw. (TOSN) 6, 13.
- Safi H, Assemi B, Mesbah M, Ferreira L, Hickman M, 2015. Design and implementation of a smartphone-based travel survey. Transp. Res. Rec. J. Transp. Res. Board 99–107.
- Shafique M, Hato E, 2016. Travel mode detection with varying smartphone data collection frequencies. Sensors 16, 716.
- Shen L, Stopher PR, 2014. Review of GPS travel survey and GPS data-processing methods. Transp. Rev 34, 316–334.
- Siripirote T, Sumalee A, Watling DP, Shao H, 2014. Updating of travel behavior model parameters and estimation of vehicle trip chain based on plate scanning. J. Intell. Transp. Syst 18, 393–409.
- Su X, Caceres H, Tong H, He Q, 2016. Online travel mode identification using smartphones with battery saving considerations. IEEE Trans. Intell. Transp. Syst 17, 2921–2934.
- Tang J, Meng L, 2006. Learning significant locations from GPS data with time window. In: Geoinformatics 2006: GNSS and Integrated Geospatial Applications, 2006. International Society for Optics and Photonics, 64180J.
- Thierry B, Chaix B, Kestens Y, 2013. Detecting activity locations from raw GPS data: a novel kernel-based algorithm. Int. J. Health Geographics 12, 14.
- Unger M, Bar A, Shapira B, Rokach L, 2016. Towards latent context-aware recommendation systems. Knowl.-Based Syst 104, 165–187.
- van den Berg P, Arentze T, Timmermans H, 2012. A latent class accelerated hazard model of social activity duration. Transp. Res. Part A: Policy Pract 46, 12–21.
- Vlachos M, Yu P, Castelli V, 2005. On periodicity detection and structural periodic similarity. In: Proceedings of the 2005 SIAM international conference on data mining, 2005. SIAM, pp. 449–460.
- Wang L, Ma W, Fan Y, Zuo Z, 2017. Trip chain extraction using smartphone-collected trajectory data. Transportmetrica B: Transp. Dyn 1–20.
- Widhalm P, Nitsche P, Brändle N, 2012. Transport mode detection with realistic smartphone sensor data. In: Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), 2012. IEEE, pp. 573–576.
- Wolf J, Bricka S, Ashby T, Gorugantua C, 2004. Advances in the application of GPS to household travel surveys. In: National Household Travel Survey Conference, Washington DC.
- Wong S, McLafferty SL, Planey AM, Preston VA, 2020. Disability, wages, and commuting in New York. J. Transp. Geogr 87, 102818.
- Xiao Y, Low D, Bandara T, Pathak P, Lim HB, Goyal D, Santos J, Cottrill C, Pereira F, Zegras C, 2012. Transportation activity analysis using smartphones. In: Consumer Communications and Networking Conference (CCNC), 2012 IEEE, 2012. IEEE, pp. 60–61.
- Yang F, Yao Z, Cheng Y, Ran B, Yang D, 2016. Multimode trip information detection using personal trajectory data. J. Intell. Transp. Syst 20, 449–460.
- Ye Y, Zheng Y, Chen Y, Feng J, Xie X, 2009. Mining individual life pattern based on location history. In: Mobile Data Management: Systems, Services and Middleware, 2009. MDM’09. Tenth International Conference on, 2009. IEEE, pp. 1–10.
- Young M, Farber S, 2019. The who, why, and when of Uber and other ride-hailing trips: An examination of a large sample household travel survey. Transp. Res. Part A: Policy Pract 119, 383–392.
- Zhao F, Pereira FC, Ball R, Kim Y, Han Y, Zegras C, Ben-Akiva M, 2015. Exploratory analysis of a smartphone-based travel survey in Singapore. Transp. Res. Rec. J. Transp. Res. Board 2, 45–56.
- Zhong M, Wu C, Hunt JD, 2012. Gender differences in activity participation, time-of-day and duration choices: new evidence from Calgary. Transp. Plan. Technol 35, 175–190.
- Zhou C, Jia H, Juan Z, Fu X, Xiao G, 2017. A data-driven method for trip ends identification using large-scale smartphone-based GPS tracking data. IEEE Trans. Intell. Transp. Syst 18, 2096–2110.
- Zhou X, Yu W, Sullivan WC, 2016. Making pervasive sensing possible: Effective travel mode sensing based on smartphones. Comput. Environ. Urban Syst 58, 52–59.
- Ziliak ST, McCloskey DN, 2008. Science is judgment, not only calculation: A reply to Aris Spanos’s review of The cult of statistical significance. Erasmus J. Philos. Econ 1, 165–170.
- Zola P, Cortez P, Tesconi M, 2020. Using Google Trends, Gaussian Mixture Models and DBSCAN for the Estimation of Twitter User Home Location. In: International Conference on Computational Science and Its Applications, 2020. Springer, pp. 526–534.
