Abstract
Public transport operation network data serves as the foundation for urban transportation research and sustainable development planning. China operates the world’s largest public transport system, where buses and metros constitute primary urban mobility modes. Despite their critical importance, comprehensive integrated bus-metro operation datasets remain lacking at the national scale. This study presents the China Public Transport Operation Network Dataset (CPTOND-2025), systematically integrating bus networks from 350 cities and metro systems from 46 cities across mainland China, Hong Kong, Macao, and Taiwan regions. Based on data collected in June 2025 using a methodology that integrates professional platforms and commercial APIs, the dataset encompasses approximately 3,408,000 kilometers of operational routes (bus: ~3,375,000 km; metro: ~33,000 km). Key attributes including operating hours, fares, and operating companies are recorded with bilingual support. All data use the standardized Shapefile format in the WGS-84 coordinate system with 5.08-meter average spatial accuracy. This standardized, comprehensive, open-access dataset supports diverse applications including operation efficiency assessment, network analysis, accessibility evaluation, and Transit-Oriented Development (TOD) studies, facilitating transport management decisions and international research.
Subject terms: Civil engineering, Sustainability
Background & Summary
Public transport’s critical role in China’s urban mobility
Public transport systems, as essential infrastructure for the sustainable development of modern cities, play a crucial role in alleviating traffic congestion, reducing environmental pollution, and promoting social equity1,2. As the world’s largest public transport service provider, China’s public transport system ranks among the top globally in terms of scale and service capacity3. According to statistics from the China Association of Metros, as of May 2025, a total of 54 cities in mainland China operated 326 urban rail transit lines (including 268 metro and light rail lines in 43 cities; 25 monorail, maglev, and regional rapid transit lines in 16 cities; and 33 tram and automated guideway lines in 18 cities), with an operation length of 10,978.3 kilometers and an annual passenger flow of 28.616 billion trips in 20244. Meanwhile, the bus system carries the vast majority of urban public transport passenger flow, with urban buses and trolleybuses nationwide handling over 40 billion passenger trips in 20235. Together, bus and metro systems account for more than 75% of China’s total urban passenger flow, serving as the backbone of urban transportation operations6.
With the acceleration of urbanization and the advancement of smart city construction, big data-based public transport operation network analysis has emerged as a prominent research topic in the fields of urban transportation planning, network optimization, and operation management7,8. High-quality, large-scale public transport operation datasets not only provide data support for in-depth understanding of urban transportation systems’ operation patterns, but also establish a solid foundation for addressing key issues such as operation optimization, passenger flow prediction, multimodal coordination, and efficiency assessment9,10.
Existing datasets and their limitations
Currently, several international public transport datasets have provided important support for related research. Kujala et al. constructed a public transport network dataset covering 25 international cities, based on the General Transit Feed Specification (GTFS) format, providing a standardized data foundation for public transport network structure analysis and accessibility research11. GTFS is an internationally recognized standard that defines a common format for public transportation schedules and geographic information. It includes standardized files such as routes.txt (route information), stops.txt (stop locations), stop_times.txt (arrival/departure times), and calendar.txt (service schedules). While GTFS provides a comprehensive framework for transit data, its implementation requires complete timetable information that may not be publicly available in all Chinese cities. Gallotti and Barthelemy developed a multilayer temporal public transport network dataset for the United Kingdom12. However, these datasets still have clear limitations. Although datasets such as Kujala et al.11 (25 international cities) and Wang et al.13 (299 Chinese cities) provide valuable coverage, they either include few cities per country or focus on a single transport mode, without metro integration or operational attributes such as operating hours and fares.
Recently, Wang et al. released a Chinese bus network dataset13, providing important data support for domestic public transport research. However, this dataset has three limitations. First, it covers only bus systems and lacks metro network information, so it cannot support integrated bus-metro analysis. Second, its operation information is relatively simple, lacking detailed operational attributes such as operating hours, fare structures, and operating companies. Third, it does not include data from Taiwan Province, which affects the completeness of nationwide comparative studies.
To systematically compare the advantages of CPTOND-2025 with existing datasets, Table 1 provides a detailed comparison of this dataset with representative public transport datasets in terms of geographic coverage, transport modes, methodology, and data quality.
Table 1.
Comprehensive Comparison with Existing Public Transport Datasets.
| Category | Feature | CPTOND-2025 | Wang et al.22 | Kujala et al.11 |
|---|---|---|---|---|
| Geographic Coverage | Total Cities | 350 (bus) + 46 (metro) | 299 (bus only) | 25 (international) |
| | Geographic Scope | Mainland China + HK + Macau + Taiwan | China (Taiwan excluded) | Selected international |
| | Administrative Completeness | 100% provincial coverage | 97% coverage | Selected cities only |
| Transport Mode Coverage | Bus Networks | 350 cities | 299 cities | 25 cities |
| | Metro/Rail Systems | 46 cities | Not included | 25 cities |
| | Multi-modal Integration | Comprehensive | Bus only | Multi-modal (bus, tram, subway, rail, ferry) |
| Methodology and Sustainability | Route Discovery | Systematic enumeration | Random ID generation | GTFS-based |
| | API Dependencies | 8684 + Amap (both stable) | Amap + Baidu (Baidu discontinued 2025) | Standard feeds |
| | Reproducibility | Fully reproducible | Partially non-reproducible | Standard-based extraction |
| | Service Continuity Risk | Low (8684: 20-year track record; Amap: Alibaba-backed) | High (Baidu API discontinued, partial source loss) | Low (official government feeds) |
| Data Quality and Richness | Total Routes | 140,371 | 147,418 | ~3,000 |
| | Total Stops | 3,571,284 | 1,935,695 | ~45,000 |
| | Operational Fields | Operating hours (48%), Fares (87%), Operators (75%) | Route names and basic geometry | Complete GTFS fields (stops, routes, stop_times, calendar) |
| | Spatial Accuracy | 5.08 m average error | Reported good | Variable (authority-dependent) |
| | Validation Coverage | 97.2% verified | Estimated coverage | Authority-dependent |
Notes:
1. GTFS (General Transit Feed Specification): An international standard format for public transit schedules and geographic information, including standardized files such as routes.txt, stops.txt, stop_times.txt, and calendar.txt. Full GTFS implementation requires complete timetable data, which may have limited public availability in some Chinese cities.
2. Operational Fields Coverage: Percentages indicate the proportion of routes/lines with complete information for each field. CPTOND-2025 provides partial operational data based on publicly available information, while Kujala et al. provides complete GTFS-compliant scheduling data where official feeds exist.
3. Service Continuity Risk Assessment: Based on data provider institutional stability, historical service continuity, and current operational status. All data sources face theoretical discontinuation risk; ratings reflect relative risk levels.
4. Route Counting Convention: CPTOND-2025 and Wang et al. count upbound and downbound directions separately, while Kujala et al. follows GTFS convention. Official Chinese statistics typically count physical routes only (single count).
Existing public transport datasets primarily suffer from four issues. First, geographical coverage is limited: most studies focus on specific cities or regions, and comprehensive operation datasets at the national scale are lacking14,15. Second, operation information is incomplete: many datasets contain only basic route and stop information and lack key operational attributes such as operating hours, fares, and operating companies16. Although operational attributes such as fares may change over time due to policy adjustments, CPTOND-2025 provides a snapshot of operational status as of June 2025 that can serve as a baseline for longitudinal studies and policy analysis. Third, data standardization is low: data from different sources vary in format, coordinate system, and field definitions17. Fourth, compatibility with the international GTFS standard needs improvement, which limits international comparative studies.
CPTOND-2025 Dataset advantages and contributions
The “China Public Transport Operation Network Dataset (CPTOND-2025)” released in this study further enhances existing datasets in the following aspects:
Comprehensive operation information integration
The dataset provides detailed records of operational attributes including operating hours, basic fares, full journey fares, and operating companies. These attributes support in-depth operation analysis and efficiency assessment research.
Complete bus-metro integrated coverage
The dataset simultaneously integrates bus operation networks from 350 cities and metro operation systems from 46 cities, with a total route length of approximately 3,408,000 kilometers, supporting comprehensive integrated bus-metro operation analysis.
Standardized data format
The dataset is stored in the internationally standard Shapefile format. Field names are provided in both Chinese and English, and the field structure is designed to be compatible with the GTFS format, thereby facilitating international academic exchange.
High-precision spatial data
Spatial coordinates are uniformly expressed in the WGS-84 coordinate system with an average positioning accuracy of 5.08 meters, supporting precise spatial analysis and operational planning applications.
National coverage
The dataset covers mainland China as well as Hong Kong, Macao, and Taiwan regions, achieving comprehensive coverage of cities at different levels according to the urban classification system, supporting hierarchical comparative studies.
GTFS compatibility and CTFS development potential
The General Transit Feed Specification (GTFS), as an internationally recognized public transport data standard, provides a unified framework for global public transport data exchange and applications18. The CPTOND-2025 dataset has been designed with GTFS compatibility in mind, incorporating core data elements such as agency, routes, and stops.
Based on the data structure and operation information completeness of CPTOND-2025, this dataset provides an important foundation for developing the China Transit Feed Specification (CTFS). CTFS builds upon the GTFS framework and incorporates the distinctive requirements of China’s public transport operations, such as multi-operator coordination, segmented fare systems, and differentiated peak-hour and off-peak operations, to form a standardized data format suited to China’s national conditions. This will provide crucial support for promoting the standardization and internationalization of China’s public transport data11,19.
Methods
Data collection strategy and source selection
This study employs a multi-source data fusion methodology to acquire publicly available public transportation information through standardized Web API services. All data acquisition processes strictly adhere to relevant terms of service and usage specifications, ensuring the legality and compliance of the research. The data collection process utilizes standardized REST API interfaces to systematically obtain publicly released basic geographic information including bus routes and stops.
The 8684 platform was selected as the foundational index data source based on its authoritative position and comprehensive coverage capabilities in China’s public transportation domain. The 8684 public transportation query platform (https://8684.cn) is China’s leading comprehensive public transit information service provider. Since its founding in 2005, 8684 has been a specialized platform dedicated to transportation applications, covering multiple transportation modes including buses, metros, and railways in major cities throughout mainland China20. As a public transport query network with fast data update capabilities, the platform covers more than 400 cities and provides bus route queries, real-time bus queries, and other information services21. This coverage and service history support its market authority and data reliability for large-scale transportation research. Furthermore, the systematic architecture of the 8684 application suite (including CityBus, Metro, Train, and other applications) demonstrates consistent design patterns and shared functional components, particularly in core modules such as timetable management and location-based services, providing a suitable foundation for integrated multimodal transportation analysis.
This research establishes a complementary dual-source data acquisition framework that combines foundational index data with detailed operation information. The foundational index data source utilizes the 8684 Public Transportation Platform (8684.cn for bus route information, dt.8684.cn for metro line information) as an authoritative index repository for city directories and route nomenclature, providing comprehensive inventories of major cities and public transportation routes nationwide. The detailed operation information source builds upon the index information through Amap API services (restapi.amap.com/v3/bus/linename) to provide precise spatial geometric data and operational attribute information.
The data acquisition workflow implements a hierarchical data collection strategy comprising two primary phases. The first phase involves systematic extraction of national city listings and corresponding bus and metro route name directories from the 8684 platform. The second phase utilizes the acquired route names and their semantic variants as query parameters to batch-retrieve core operation data through Amap API services, including route geometric paths, stop spatial coordinates, operation time windows, and fare structures.
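The second phase can be pictured with a minimal sketch that builds one request URL per route name. Only the endpoint path restapi.amap.com/v3/bus/linename comes from the text above; the https scheme, the query parameter names (keywords, city, extensions, output, key), and the helper name build_linename_query are assumptions for illustration and should be checked against the provider's current documentation.

```python
from urllib.parse import urlencode

# Endpoint path as named in the text; the https scheme is an assumption.
AMAP_BUS_LINENAME_URL = "https://restapi.amap.com/v3/bus/linename"


def build_linename_query(route_name, city, api_key):
    """Return the request URL for one route-name lookup (hypothetical helper)."""
    params = {
        "keywords": route_name,  # route name taken from the 8684 index
        "city": city,            # city name or administrative code
        "extensions": "all",     # assumed flag requesting geometry and attributes
        "output": "json",
        "key": api_key,          # placeholder; a real key is required
    }
    # urlencode percent-encodes non-ASCII route names as UTF-8.
    return f"{AMAP_BUS_LINENAME_URL}?{urlencode(params)}"


url = build_linename_query("1路", "北京", "YOUR_KEY")
```

In a batch run, the same helper would be called once for each route name and each of its variants from the first phase.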
Geographic coverage scope is partially affected by Amap API service limitations. It is impossible to obtain public transit bus operation information for Taiwan Province through this API, while metro operation network data for the Taiwan region can be acquired through the 8684 platform. Therefore, CPTOND-2025’s spatial coverage scope comprises: Complete coverage areas including mainland China, Hong Kong Special Administrative Region, and Macao Special Administrative Region (bus + metro); Partial coverage area of Taiwan Province (metro data only). Cities included in the dataset are categorized into three types based on public transportation mode availability: cities with bus systems only, cities with metro systems only, and comprehensive cities with both bus and metro systems. Figure 1 illustrates the spatial distribution of cities included in CPTOND-2025 by public transport mode types across China, clearly showing the comprehensive geographic coverage and the distribution patterns of different transport mode combinations. Due to data structure limitations, Taiwan Province is treated as a single city unit for statistical purposes in the dataset.
Fig. 1.
Spatial distribution of cities included in CPTOND-2025 by public transport mode types across China.
The data was collected from June 1 to June 30, 2025. This time window was selected for the following reasons: (1) avoiding operational adjustment periods during major holidays such as Spring Festival; (2) relatively stable summer operating schedules; (3) ensuring consistency and comparability of data collection.
To ensure the long-term feasibility of data collection methods, this study conducted a stability assessment of core data sources. The 8684 platform, as a professional transportation information service provider operating for nearly 20 years, has maintained a solid market position and continuous service capability in China’s public transportation sector; Amap API, as a core service of Alibaba Group, provides enterprise-level stable technical support with historical service records demonstrating high reliability. To mitigate data source dependency risks, this study established a modular data collection framework supporting multi-source data validation and rapid data source switching. Additionally, risk mitigation strategies were developed, including alternative data source preparation, regular service monitoring, and data backup mechanisms. These strategies support the sustainability and reproducibility of the dataset construction methodology.
Integrated data collection and processing workflow
The data acquisition workflow employs a systematic four-phase processing strategy, integrating key technical components including multi-source data collection, data processing and standardization, quality control and cross-validation, and data organization with final product generation. Figure 2 illustrates the complete workflow for CPTOND-2025 dataset construction, ensuring data completeness, accuracy, and usability through a task-oriented hierarchical processing architecture.
Fig. 2.
The workflow of integrated database construction for CPTOND-2025.
Task A: multi-source data collection and preparation
Task A focuses on establishing a robust data collection infrastructure, covering bus networks in 350 cities and metro networks in 46 cities. Core technologies include multi-source integrated route discovery mechanisms, achieving systematic identification of public transportation routes nationwide through collaborative work between the 8684 platform and Amap API.
Data source integration strategy
A multi-source data acquisition framework is employed, with the 8684 platform providing authoritative city directories and route indexes, while Amap API provides detailed spatial geometric data and operation attribute information.
Collection strategy optimization
Systematic enumeration is implemented instead of random sampling, ensuring comprehensive coverage of operation routes through automated web crawling. Concurrent processing technology (≤2 threads) is adopted to improve data collection efficiency while avoiding API service overload.
Data reliability mechanisms
Error handling and retry mechanisms are established, using exponential backoff strategies to handle temporary service interruptions, and implementing access frequency limit compliance to ensure stable long-term data collection.
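The retry and concurrency behavior described above can be sketched as follows. This is an illustrative pattern rather than the authors' collector: the function names, the retry count, the base delay, and the jitter range are all assumptions; only the exponential backoff idea and the two-thread ceiling come from the text.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_with_backoff(fetch, arg, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(arg), retrying failures with exponentially growing delays."""
    for attempt in range(max_retries):
        try:
            return fetch(arg)
        except Exception:  # a real collector would catch only transient errors
            if attempt == max_retries - 1:
                raise
            # Delay doubles on each attempt; small jitter spreads out retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)


def collect_all(fetch, args, max_workers=2, **kwargs):
    """Run fetches with at most max_workers threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda a: fetch_with_backoff(fetch, a, **kwargs), args))
```

The sleep function is injectable so that the backoff schedule can be checked without actually waiting.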
Task B: data processing and standardization
Task B implements comprehensive data processing and standardization procedures to ensure spatial accuracy and format consistency.
Coordinate transform
All spatial coordinates are converted from China’s mapping coordinate system (GCJ-02) to the World Geodetic System (WGS-84) using an iterative correction algorithm. The converted coordinates show an average accuracy of 5.08 meters, ensuring compatibility with international coordinate systems.
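One common way to realize such an iterative correction is fixed-point iteration on the forward WGS-84 to GCJ-02 offset. The sketch below uses the widely circulated open-source approximation of that offset, which is not an official specification and is not necessarily the implementation used for CPTOND-2025; the function names, tolerance, and iteration cap are assumptions, and the usual check for points outside China is omitted for brevity.

```python
import math

A = 6378245.0                 # semi-major axis used by the approximation
EE = 0.00669342162296594323   # squared eccentricity used by the approximation


def _dlat(x, y):
    r = -100.0 + 2.0 * x + 3.0 * y + 0.2 * y * y + 0.1 * x * y + 0.2 * math.sqrt(abs(x))
    r += (20.0 * math.sin(6.0 * x * math.pi) + 20.0 * math.sin(2.0 * x * math.pi)) * 2.0 / 3.0
    r += (20.0 * math.sin(y * math.pi) + 40.0 * math.sin(y / 3.0 * math.pi)) * 2.0 / 3.0
    r += (160.0 * math.sin(y / 12.0 * math.pi) + 320.0 * math.sin(y * math.pi / 30.0)) * 2.0 / 3.0
    return r


def _dlng(x, y):
    r = 300.0 + x + 2.0 * y + 0.1 * x * x + 0.1 * x * y + 0.1 * math.sqrt(abs(x))
    r += (20.0 * math.sin(6.0 * x * math.pi) + 20.0 * math.sin(2.0 * x * math.pi)) * 2.0 / 3.0
    r += (20.0 * math.sin(x * math.pi) + 40.0 * math.sin(x / 3.0 * math.pi)) * 2.0 / 3.0
    r += (150.0 * math.sin(x / 12.0 * math.pi) + 300.0 * math.sin(x / 30.0 * math.pi)) * 2.0 / 3.0
    return r


def wgs84_to_gcj02(lng, lat):
    """Forward offset (approximation): WGS-84 degrees to GCJ-02 degrees."""
    dlat, dlng = _dlat(lng - 105.0, lat - 35.0), _dlng(lng - 105.0, lat - 35.0)
    radlat = math.radians(lat)
    magic = 1 - EE * math.sin(radlat) ** 2
    sqrtmagic = math.sqrt(magic)
    dlat = (dlat * 180.0) / ((A * (1 - EE)) / (magic * sqrtmagic) * math.pi)
    dlng = (dlng * 180.0) / (A / sqrtmagic * math.cos(radlat) * math.pi)
    return lng + dlng, lat + dlat


def gcj02_to_wgs84(lng, lat, tol=1e-9, max_iter=30):
    """Invert the forward offset by repeatedly correcting the current guess."""
    wlng, wlat = lng, lat  # initial guess: treat the GCJ-02 point as WGS-84
    for _ in range(max_iter):
        glng, glat = wgs84_to_gcj02(wlng, wlat)
        err_lng, err_lat = glng - lng, glat - lat
        wlng, wlat = wlng - err_lng, wlat - err_lat
        if abs(err_lng) < tol and abs(err_lat) < tol:
            break
    return wlng, wlat
```

Because the offset varies slowly with position, each pass shrinks the residual sharply, so a handful of iterations is normally enough. The accuracy figure reported for the dataset comes from comparison against an independent source, not from this round-trip residual.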
Geometric processing
A LineString generation algorithm converts raw route data into standard spatial geometric objects, and a stop projection algorithm ensures precise correspondence between stops and route geometry.
Data standardization
A bilingual field naming system is established using UTF-8 encoding to ensure multilingual compatibility. All attribute fields follow GTFS-compatible data structure design.
Operation data enhancement
Key operational attributes such as operating hours and fare information are systematically collected and standardized, providing data support for in-depth operation analysis.
Network segmentation
Each complete bus route is decomposed into basic segment units between consecutive stops, providing fundamental data support for cross-sectional passenger flow calculation and fine-grained operation analysis. Segment data contains origin-destination station information, segment distance, number of passing routes and other attributes. These elements facilitate network topology analysis and operation efficiency assessment research.
Task C: quality control and cross-validation
To ensure data quality through multi-dimensional validation, Task C establishes a rigorous quality control system.
Multi-source cross-validation
Duplicate detection, boundary checking, and API schema validation are implemented to prevent data redundancy and errors. Cross-comparison among multiple data sources improves data reliability and completeness.
Statistical quality metrics
A quantified quality assessment indicator system has been developed, achieving a 97.2% validity verification rate. Data quality is continuously monitored through statistical indicators such as coordinate validity checks and data matching rate assessments.
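As a rough illustration of how duplicate detection and a coordinate validity check can feed a validity rate, consider the sketch below. The record layout (city, route, direction, coords), the bounding box, and the function names are hypothetical, and the rule set is far simpler than whatever produced the reported 97.2% figure.

```python
def coordinate_is_valid(lng, lat, bounds=(73.0, 3.0, 136.0, 54.0)):
    """Boundary check against an illustrative bounding box (assumed values)."""
    min_lng, min_lat, max_lng, max_lat = bounds
    return min_lng <= lng <= max_lng and min_lat <= lat <= max_lat


def quality_report(records):
    """Drop duplicates and invalid geometries; return kept records and their share."""
    if not records:
        return [], 0.0
    seen, kept = set(), []
    for rec in records:
        key = (rec["city"], rec["route"], rec["direction"])
        if key in seen:
            continue  # duplicate route record
        seen.add(key)
        coords = rec["coords"]
        # A usable route needs at least two vertices, all inside the bounds.
        if len(coords) >= 2 and all(coordinate_is_valid(x, y) for x, y in coords):
            kept.append(rec)
    return kept, len(kept) / len(records)
```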
Geographic coverage validation
Administrative completeness assessment and network topology analysis are conducted to ensure the completeness of the dataset in terms of geographic coverage and network connectivity. Coverage rates of different-level cities are evaluated to verify the representativeness of the dataset.
Task D: dataset organization and final products
Task D completes the final organization and productization of the dataset, and generates standard data products.
File structure organization
The hierarchical file structure is organized with classification management by mode, supporting independent access and analysis of bus and metro data.
Product Specification:
Bus Dataset: Complete operation network covering 140,371 routes in 350 cities.
Metro Dataset: Metro operation systems containing 992 lines in 46 cities.
Technical Specifications: WGS-84 coordinate system.
Standardized output
Multiple data products are generated, including national and city-level Shapefiles, to meet the needs of different application scenarios. All data products include complete metadata descriptions and usage instructions.
Network segment analysis and route decomposition
Network segment analysis method
To support fine-grained network topology analysis and operational efficiency assessment, CPTOND-2025 introduces a network segment analysis method. This method decomposes complete bus and metro routes into network segments between consecutive stops, where each network segment represents a direct connection between two adjacent stops.
Route segmentation technology
Based on stop projection positions along routes, the route segmentation algorithm decomposes original route geometries into precise network segments. Key technical steps are as follows:
1. Stop Projection: Project all stops onto the route’s geometric path (polyline) to determine their positions along the route. For each stop, calculate the distance from the route’s starting point to the stop’s projection point, establishing a projection distance value that represents the stop’s sequential position along the route.
2. Sorting: Sort all stops in ascending order of the projection distance obtained in step 1. This ensures stops are arranged in the correct sequence from the route’s origin to destination, following the actual route alignment.
3. Segment Division: Based on the sorted stop sequence, use the Shapely library’s substring functionality to extract precise route segments between each pair of consecutive stops.
4. Distance Calculation: Calculate the geodetic distance along the route geometry for each segment between consecutive stops using coordinate projection methods, measuring the actual travel distance (in kilometers) rather than straight-line distance.
5. Segment Aggregation: Aggregate network segments with identical start and end points, calculating average distances and route frequencies.
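The five steps can be sketched in plain Python as follows. To stay dependency-free, the sketch replaces Shapely's projection and substring operations with simple hand-written equivalents and measures length with the haversine formula, so it illustrates the procedure rather than reproducing the authors' code; all function names are hypothetical.

```python
import math
from collections import defaultdict


def haversine_km(p, q, radius_km=6371.0088):
    """Great-circle distance between two (lng, lat) points in kilometers."""
    lng1, lat1, lng2, lat2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


def cumulative_km(line):
    """Along-route distance from the first vertex to every vertex."""
    cum = [0.0]
    for a, b in zip(line, line[1:]):
        cum.append(cum[-1] + haversine_km(a, b))
    return cum


def project_km(line, cum, stop):
    """Step 1: along-route distance of the polyline point nearest to the stop."""
    best_d2, best_pos = float("inf"), 0.0
    k = math.cos(math.radians(stop[1]))  # scale longitude so degrees are comparable
    px, py = stop[0] * k, stop[1]
    for i, (a, b) in enumerate(zip(line, line[1:])):
        ax, ay, bx, by = a[0] * k, a[1], b[0] * k, b[1]
        dx, dy = bx - ax, by - ay
        seg2 = dx * dx + dy * dy
        t = 0.0 if seg2 == 0 else max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg2))
        d2 = (ax + t * dx - px) ** 2 + (ay + t * dy - py) ** 2
        if d2 < best_d2:
            best_d2, best_pos = d2, cum[i] + t * (cum[i + 1] - cum[i])
    return best_pos


def point_at(line, cum, dist):
    """Interpolate the (lng, lat) point located dist km along the route."""
    for i in range(len(line) - 1):
        if dist <= cum[i + 1] or i == len(line) - 2:
            span = cum[i + 1] - cum[i]
            t = 0.0 if span == 0 else (dist - cum[i]) / span
            (ax, ay), (bx, by) = line[i], line[i + 1]
            return (ax + t * (bx - ax), ay + t * (by - ay))


def segment_route(line, stops):
    """Steps 1-4: project, sort, cut, and measure segments between stops."""
    cum = cumulative_km(line)
    ordered = sorted((project_km(line, cum, xy), sid) for sid, xy in stops)
    segments = []
    for (d0, s), (d1, e) in zip(ordered, ordered[1:]):
        inner = [pt for pt, c in zip(line, cum) if d0 < c < d1]
        coords = [point_at(line, cum, d0), *inner, point_at(line, cum, d1)]
        segments.append({"s_stopid": s, "e_stopid": e, "distance": d1 - d0, "coords": coords})
    return segments


def aggregate(segments):
    """Step 5: merge segments with the same endpoints; average distance, count routes."""
    groups = defaultdict(list)
    for seg in segments:
        groups[(seg["s_stopid"], seg["e_stopid"])].append(seg["distance"])
    return {key: {"distance": sum(d) / len(d), "num": len(d)} for key, d in groups.items()}
```

Nearest-point projection of this kind is ambiguous on loop routes or routes that double back on themselves, so a production version would need extra constraints such as the known stop order.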
Segment data enhancement
Network segment data contains rich attribute information, including origin and destination stop information (stop names in Chinese and English, stop IDs), segment distance, number of routes, and city information. This fine-grained data structure provides robust data support for network topology analysis, accessibility assessment, and operational efficiency calculation.
Operation data enhancement
Operation time information
The system collects first bus hour (start_time) and final bus hour (end_time) for each route, supporting operation duration analysis and service level assessment.
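A minimal sketch of operation duration analysis, assuming the start_time and end_time values are four-digit HHMM strings as listed for these fields in Table 2. The function name and the treatment of an end time earlier than the start time as a past-midnight last departure are assumptions.

```python
def service_minutes(start_time, end_time):
    """Minutes between first and last departure, given 'HHMM' strings."""
    def to_minutes(hhmm):
        if len(hhmm) != 4 or not hhmm.isdigit():
            raise ValueError(f"expected 'HHMM', got {hhmm!r}")
        hours, minutes = int(hhmm[:2]), int(hhmm[2:])
        if hours > 23 or minutes > 59:
            raise ValueError(f"invalid time of day: {hhmm!r}")
        return hours * 60 + minutes

    start, end = to_minutes(start_time), to_minutes(end_time)
    # Assumption: an end time earlier than the start time means the last
    # departure falls after midnight, so the difference wraps around 24 h.
    return (end - start) % (24 * 60)
```

With the example values from Table 2, service_minutes("0500", "2300") gives 1080 minutes, that is, 18 hours of service. Identical start and end times return 0, so round-the-clock routes would need separate handling.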
Fare structure data
Basic fare (basic_price) and full journey fare (total_price) are collected, reflecting different cities’ fare policies and segmented pricing systems.
Operating entity information
The Chinese name (company_cn) and English name (company_en) of operating companies are recorded to support operating entity analysis and market structure research.
Mileage information calculation
Based on route geometric data, the length of each route is calculated, providing fundamental data for operation cost analysis and network scale assessment.
Data Records
Comprehensive operational field specifications
The dataset is available at Figshare (10.6084/m9.figshare.29377427)22. The dataset contains standardized fields including operating time, fare information, operating entities, and route attributes, comprehensively covering key attributes of public transport operation. Table 2 provides detailed specifications of the core operational data fields included in CPTOND-2025, demonstrating the comprehensive nature of the operational information captured in the dataset.
Table 2.
Core Operation Data Fields.
| Field Category | Field Name | Data Type | Description | Example Value |
|---|---|---|---|---|
| Operating Time | start_time | String | First bus hour (HHMM) | “0500” |
| | end_time | String | Last bus hour (HHMM) | “2300” |
| Fare Information | basic_price | String | Basic fare (CNY) | “2” |
| | total_price | String | Full journey fare (CNY) | “5” |
| Operating Company | company_cn | String | Operating company Chinese name | “北京公交集团” |
| | company_en | String | Operating company English name | “Beijing Public Transport Group” |
| Route Attributes | route_type | String | Route type Chinese name | “普通公交” |
| | type_en | String | Route type English name | “Regular buses” |
| | loop | String | Loop route indicator | “0”/“1” |
| | length | Float | Route mileage (km) | 12.5 |
Spatial data outputs and GIS integration
The dataset provides standardized Shapefile formats for GIS integration:
bus_routes.shp: National bus route geometries (LineString, 140,371 records)
bus_stops.shp: National bus stop locations (Point, 3,571,284 records)
metro_routes.shp: National metro line geometries (LineString, 992 records)
metro_stops.shp: National metro station locations (Point, 17,731 records)
bus_segments.shp: Bus network segment data (LineString, inter-station connection segments)
metro_segments.shp: Metro network segment data (LineString, inter-station connection segments)
bus_stops_unique.shp: Deduplicated bus stops (Point, organized by city)
metro_stops_unique.shp: Deduplicated metro stations (Point, organized by city)
Network segment data structure
Network segment shapefiles contain the following key fields:
s_stop_cn/s_stop_en: Origin stop names in Chinese and English
s_stopid: Origin stop ID
e_stop_cn/e_stop_en: Destination stop names in Chinese and English
e_stopid: Destination stop ID
distance: Network segment distance (kilometers)
num: Number of routes using this network segment
city_cn/city_en: City names in Chinese and English
All shapefiles contain optimized attribute tables with field names adhering to the 10-character limit while maintaining information integrity. Network segment data provides a fine-grained data foundation for network topology analysis, accessibility research, and operational efficiency assessment.
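To illustrate how the segment fields support network topology and accessibility analysis, the sketch below builds a directed graph from attribute rows and finds the shortest along-network distance between two stops. It assumes the rows have already been read from a segment shapefile into dictionaries keyed by the field names above; the reading step and the function names are not part of the dataset.

```python
import heapq
from collections import defaultdict


def build_graph(rows):
    """Directed adjacency map: origin stop ID -> {destination stop ID: km}."""
    graph = defaultdict(dict)
    for row in rows:
        graph[row["s_stopid"]][row["e_stopid"]] = float(row["distance"])
    return graph


def shortest_km(graph, source, target):
    """Dijkstra's algorithm over segment distances; None if unreachable."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, km in graph.get(node, {}).items():
            cand = dist + km
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                heapq.heappush(queue, (cand, nxt))
    return None
```

The num field could serve as an edge weight in the same structure when the analysis concerns service intensity rather than distance.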
Data Overview
The CPTOND-2025 dataset comprehensively documents the operational status of China’s public transport networks as of June 2025, utilizing a structured hierarchical directory storage architecture that enables independent processing and analysis of bus and metro operational networks. This dataset provides complete coverage of bus operation networks across 350 cities and metro operational systems in 46 cities. Notably, 45 cities possess dual-mode transportation systems incorporating both bus and metro services, resulting in a total coverage of 351 individual urban units. The dataset encompasses a total route length of approximately 3.408 million kilometers, comprising approximately 3.375 million kilometers of bus network operations and 33,000 kilometers of metro network operations. Table 3 provides detailed statistics on the route length distribution of the dataset. Figures 3 and 4 illustrate the nationwide spatial distribution of the bus and metro networks, respectively, within the CPTOND-2025 dataset.
Table 3.
Dataset Route Length Statistics.
| Category | Number of Cities | Number of Routes | Number of Stops | Total Route Length (km) | Average Route Length (km) |
|---|---|---|---|---|---|
| Bus Network | 350 | 140,371 | 3,571,284 | ~3,375,000 | ~24.05 |
| Metro Network | 46 | 992 | 17,731 | ~33,000 | ~32.93 |
| Total | 351* | 141,363 | 3,589,015 | ~3,408,000 | ~24.11 |
*45 cities have both bus and metro systems; Taiwan Province is counted as a single city unit due to API data structure limitations.
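The totals row of Table 3 can be reproduced from the two mode rows with a few lines of arithmetic. The sketch below uses only the figures printed in the table; the variable names are illustrative.

```python
bus = {"cities": 350, "routes": 140_371, "stops": 3_571_284, "km": 3_375_000}
metro = {"cities": 46, "routes": 992, "stops": 17_731, "km": 33_000}
both_modes = 45  # cities that appear in both the bus and metro rows

# Cities with both modes are counted once in the total.
total_cities = bus["cities"] + metro["cities"] - both_modes
total_routes = bus["routes"] + metro["routes"]
total_stops = bus["stops"] + metro["stops"]
total_km = bus["km"] + metro["km"]
average_km = total_km / total_routes
```

This yields 351 cities, 141,363 routes, 3,589,015 stops, and an average of about 24.11 km per route. Per-mode averages recomputed from the rounded lengths (about 24.04 km and 33.27 km) differ slightly from the tabulated values, presumably because the table was computed from unrounded totals.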
Fig. 3.
Spatial Distribution of Bus Networks in CPTOND-2025.
Fig. 4.
Spatial Distribution of Metro Networks in CPTOND-2025.
Data Volume Specification: The bus network data contains 140,371 route records, with the upbound and downbound directions of each route counted separately. Stop data includes 3,571,284 records covering all routes, so stops shared by different routes appear more than once. To support network topology analysis, the dataset also provides deduplicated stop files (bus_stops_unique.shp), in which duplicate stops are removed and records are organized by city.
Technical Validation
Spatial accuracy validation
To ensure the spatial accuracy of the CPTOND-2025 dataset for urban public transport network analysis, this study establishes a spatial accuracy assessment system based on multi-source data cross-validation. The validation sampling strategy selects representative samples from different-level cities and transport modes, and conducts accuracy comparison analysis through third-party data sources.
Validation sample selection and data source configuration
Beijing, Shanghai, and Guangzhou were selected as validation regions due to their comprehensive public transport networks and high-quality reference data availability. Using a stratified random sampling strategy, 3 metro lines and 100 bus routes were selected from each city for validation, yielding a total of 310 metro stations and 6,675 bus stops as validation samples. Baidu Maps API was used as an independent reference data source to obtain the corresponding station coordinates in the GCJ-02 coordinate system. These reference coordinates were then compared with the WGS-84 converted coordinates in the CPTOND-2025 dataset for accuracy analysis.
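The comparison itself reduces to a distance between matched coordinate pairs followed by summary statistics. The sketch below assumes both coordinate sets have already been brought into the same system (WGS-84), uses the haversine formula, and reports the sample standard deviation; the source does not state which distance formula or which standard deviation variant was used, and the function names are hypothetical.

```python
import math
import statistics


def haversine_m(p, q, radius_m=6371008.8):
    """Great-circle distance between two (lng, lat) points in meters."""
    lng1, lat1, lng2, lat2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(h))


def deviation_stats(pairs):
    """Mean and sample standard deviation of pairwise deviations in meters."""
    deviations = [haversine_m(dataset_xy, reference_xy) for dataset_xy, reference_xy in pairs]
    return statistics.mean(deviations), statistics.stdev(deviations)
```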
Metro station accuracy validation
The accuracy of the GCJ-02 to WGS-84 coordinate transformation was analyzed for a total of 310 metro stations across Beijing, Shanghai, and Guangzhou. After conversion, the average positional deviation between the two data sources is 34.79 m, with a standard deviation of 39.43 m. The position differences for metro stations are primarily attributed to: (1) Spatial complexity: metro stations typically have multiple entrances/exits and occupy larger areas, so different data sources may select different representative locations; (2) Differences in positioning strategy: map service providers use different reference points for metro stations (such as the platform center, main entrance, or station hall center); (3) Inconsistent measurement standards: differences in measurement standards and accuracy requirements for original coordinate collection affect final accuracy.
Bus stop accuracy validation
The same validation was performed on 6,675 randomly selected bus stops in the same three cities. The average positional deviation for bus stops is 3.69 m with a standard deviation of 3.13 m, considerably smaller than for metro stations. The higher accuracy of bus stops is mainly due to: (1) Defined spatial extent: bus stops have relatively small and fixed spatial extents, which reduces positioning ambiguity; (2) Unified positioning standards: geographic positioning standards for bus stops are relatively uniform, typically using the stop sign or shelter location as the reference point.
Comprehensive accuracy assessment
This study evaluates the overall spatial accuracy of the CPTOND-2025 dataset using a weighted average. Considering that the number of bus stops (3,571,284) significantly exceeds the number of metro stations (17,731), weighting by station count yields an overall average accuracy of 5.08 meters for the dataset.
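A count-weighted average of this kind can be sketched as follows. The choice of weights is an assumption here: the example uses the numbers of validated stations (6,675 bus stops and 310 metro stations) together with the rounded mean deviations reported above, which gives roughly 5.07 m, close to the reported 5.08 m.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(v * w) / sum(w)."""
    if len(values) != len(weights) or not weights:
        raise ValueError("values and weights must be non-empty and of equal length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Mean deviations in metres for bus stops and metro stations, as reported above.
deviations_m = [3.69, 34.79]
# Assumed weights: the number of validated bus stops and metro stations.
sample_sizes = [6675, 310]

overall_m = weighted_mean(deviations_m, sample_sizes)
print(f"weighted mean deviation: {overall_m:.2f} m")
```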
This precision level meets the accuracy requirements for urban public transport network analysis, accessibility assessment, and operational planning applications. It is consistent with international quality standards for public transport datasets and can support a range of urban transportation research applications.
Spatial accuracy validation scope and limitations
Detailed spatial accuracy validation was conducted for bus stops and metro stations in three first-tier cities (Beijing, Shanghai, Guangzhou), where both high-quality official reference data and comprehensive OSM data are available. For smaller cities, particularly those in western regions, limited availability of official reference data and incomplete OSM coverage constrained our ability to conduct similar spatial accuracy assessments. However, the consistency of data collection methodology and source platforms (8684 and Amap) across all cities suggests that spatial accuracy patterns observed in major cities are likely representative of the broader dataset. Users working with data from smaller cities should be aware of this validation limitation and consider conducting local validation for applications requiring high spatial precision.
Operation data completeness assessment
Operating time completeness
48.4% of bus routes and 26.2% of metro lines contain complete operating time information (start_time and end_time).
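A completeness rate of this type is the share of records in which every required field is filled. The following sketch shows one way to compute it; the `start_time` and `end_time` field names come from the dataset description, while the sample records and the rule that empty strings and missing values count as incomplete are assumptions.

```python
def completeness_rate(records, fields):
    """Percentage of records in which all listed fields are present and non-empty."""
    if not records:
        return 0.0

    def filled(value):
        # None and blank strings are treated as missing (an assumption).
        return value is not None and str(value).strip() != ""

    complete = sum(all(filled(rec.get(f)) for f in fields) for rec in records)
    return 100.0 * complete / len(records)


# Made-up route records for illustration.
routes = [
    {"route": "A", "start_time": "06:00", "end_time": "22:00"},
    {"route": "B", "start_time": "05:30", "end_time": ""},
    {"route": "C", "start_time": None, "end_time": None},
    {"route": "D", "start_time": "06:30", "end_time": "21:00"},
]
rate = completeness_rate(routes, ("start_time", "end_time"))
print(f"operating time completeness: {rate:.1f}%")
```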
Fare information coverage
86.6% of bus routes and 68.7% of metro lines contain complete fare information. The bus route data reveals that 85.4% of routes contain basic fare information with a full journey fare coverage rate of 81.5%. A similar examination of metro line data reveals that 68.7% of lines include both basic fare and full journey fare coverage information.
Operating company information
74.5% of routes contain operating company information, supporting operating entity analysis and market structure research.
Network data validation and urban hierarchy coverage
Metro network data validation
To ensure the accuracy and reliability of metro network data in CPTOND-2025, this study constructed a comprehensive validation system based on authoritative official data sources. Using the latest route maps and operation information published on official websites of metro operating companies as benchmarks, systematic cross-validation was conducted on metro networks across all 46 cities/regions covered in the dataset. The validation framework encompasses three core dimensions: (1) consistency verification of operation route quantities, (2) accuracy comparison of station node numbers, and (3) standardized validation of route and station nomenclature.
The validation procedure was conducted as follows
For each of the 46 cities, we systematically compared our dataset against official metro operator websites and publicly available route maps. Specifically, we verified: (a) the total number of operational metro lines in each city matched official records; (b) the number of stations on each line corresponded exactly to official data; (c) route names (both Chinese and English) were consistent with official nomenclature; and (d) station names matched official designations. Any discrepancies identified during this process were cross-checked with multiple authoritative sources (including official metro company announcements, government transportation department reports, and verified news releases) and corrected accordingly. This systematic, line-by-line and station-by-station validation process was documented in a verification matrix (available upon request).
This item-by-item comparison shows that the route counts, station counts, and names of all 46 metro systems match official sources, that is, 100% agreement in route coverage and station information. The result supports the reliability of the CPTOND-2025 metro network data.
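The line-count and station-count checks described above can be expressed as a comparison of two nested mappings. This is a hypothetical sketch, not the authors' verification matrix: the `{city: {line: station_count}}` structure, the function name, and the placeholder cities and counts are assumptions, and name checks are reduced to matching line keys.

```python
def compare_with_official(dataset, official):
    """List discrepancies between two {city: {line_name: station_count}} mappings.

    Each discrepancy is (city, line or None, check, dataset_value, official_value).
    """
    issues = []
    for city in sorted(set(dataset) | set(official)):
        ds_lines = dataset.get(city, {})
        of_lines = official.get(city, {})
        # Check (a): number of operational lines per city.
        if len(ds_lines) != len(of_lines):
            issues.append((city, None, "line_count", len(ds_lines), len(of_lines)))
        # Check (b): station count per line; a line missing on one side yields None.
        for line in sorted(set(ds_lines) | set(of_lines)):
            if ds_lines.get(line) != of_lines.get(line):
                issues.append((city, line, "station_count", ds_lines.get(line), of_lines.get(line)))
    return issues


# Placeholder values for illustration only.
dataset_counts = {"City A": {"Line 1": 23, "Line 2": 18}}
official_counts = {"City A": {"Line 1": 23, "Line 2": 19}}
issues = compare_with_official(dataset_counts, official_counts)
print(issues)
```

An empty result means the two sources agree on every line and station count.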
Urban hierarchy coverage validation
Based on the urban commercial attractiveness ranking published by the Yicai·New First-Tier Cities Research Institute in 2024 (ref. 23), the CPTOND-2025 dataset covers 346 of 356 cities nationwide, an overall coverage rate of 97.2%. All four first-tier cities (Shanghai, Beijing, Shenzhen, and Guangzhou) are covered, each with both bus and metro network data. New first-tier, second-tier, and fourth-tier cities also reach 100% coverage: 15 new first-tier cities including Chengdu, Hangzhou, and Chongqing; 30 second-tier cities including Foshan, Shenyang, and Jinan; and 95 fourth-tier cities. Third-tier cities reach a coverage rate of 97.2% (69/71) and fifth-tier cities 94.3% (133/141). The 8 uncovered fifth-tier cities are mainly located in remote parts of Qinghai, Inner Mongolia, Tibet, and Heilongjiang, where population density is low and public transport development is relatively limited, so their absence has minimal impact on the representativeness of national-scale analysis. The dataset also integrates public transport networks from the Hong Kong and Macao Special Administrative Regions, supporting cross-regional comparative research. Overall, CPTOND-2025 achieves complete or near-complete coverage in high-tier cities while remaining representative of medium- and low-tier cities, meeting the data quality requirements of national-scale public transport network analysis.
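The tier-level and overall coverage rates follow directly from the covered and total city counts given above, as the short calculation below shows. The counts are taken from the text; the dictionary layout and function name are illustrative.

```python
# (covered cities, total cities) per tier, as stated in the text.
tiers = {
    "first-tier": (4, 4),
    "new first-tier": (15, 15),
    "second-tier": (30, 30),
    "third-tier": (69, 71),
    "fourth-tier": (95, 95),
    "fifth-tier": (133, 141),
}


def coverage_rate(covered, total):
    """Coverage as a percentage rounded to one decimal place."""
    return round(100 * covered / total, 1)


rates = {tier: coverage_rate(c, t) for tier, (c, t) in tiers.items()}
covered_total = sum(c for c, _ in tiers.values())
city_total = sum(t for _, t in tiers.values())
overall = coverage_rate(covered_total, city_total)

for tier, rate in rates.items():
    print(f"{tier}: {rate}%")
print(f"overall: {covered_total}/{city_total} = {overall}%")
```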
Comparison with OpenStreetMap and international standards
Bus network data validation methodology
Because bus network data are complex and diverse, this study validates them through a comparative analysis based on multi-source data cross-validation. Following a stratified sampling design, the top 15 cities in China by GDP were selected as validation samples, representing typical characteristics of different development levels and geographical regions. The validation process uses three data sources for systematic comparison: (1) official websites of city bus companies, official WeChat public accounts, and statistical data from transportation bureaus as authoritative benchmarks; (2) OpenStreetMap (OSM) open geographic data, with bus route and stop data extracted through the Overpass API; (3) the CPTOND-2025 dataset itself.
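For the OSM source, an Overpass QL query for bus route relations and bus stop nodes within a city might be built as in the sketch below. The query actually used in this study is not given, so the area selection by `name:en`, the tag filters, and the function name are assumptions; matching an area by name can return more than one administrative unit and should be checked per city. Sending the query to an Overpass endpoint is left out.

```python
def build_bus_query(city_name_en, timeout_s=180):
    """Build an Overpass QL query for bus route relations and bus stop nodes in a city.

    The area filter and tag choices are illustrative assumptions.
    """
    return (
        f"[out:json][timeout:{timeout_s}];\n"
        f'area["boundary"="administrative"]["name:en"="{city_name_en}"]->.city;\n'
        "(\n"
        '  relation["type"="route"]["route"="bus"](area.city);\n'
        '  node["highway"="bus_stop"](area.city);\n'
        ");\n"
        "out tags;\n"
    )


query = build_bus_query("Beijing")
print(query)
```

Counting the returned relations and nodes gives route and stop totals that can be set against the other two sources, subject to the same direction and deduplication conventions.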
The validation indicator system includes multiple dimensions such as route coverage completeness, station data accuracy, and spatial distribution rationality. To ensure scientific comparison, all data sources adopt unified statistical standards: route data is calculated separately for upbound and downbound directions, and station data is deduplicated based on actual physical locations. Table 4 presents the comprehensive multi-source data comparative analysis results for the top 15 cities by GDP, providing quantitative evidence of CPTOND-2025’s data quality advantages.
Table 4.
Multi-source Data Comparative Analysis for Major Cities.
| City Name | Official Data Bus Routes | CPTOND-2025 (Bus Routes/Stops) | OSM Data (Bus Routes/Stops) |
|---|---|---|---|
| Shanghai | 1859 | 3225/68240 | 3388/54344 |
| Beijing | 2257 | 4137/109964 | 3913/33496 |
| Shenzhen | 896 | 1976/48585 | 3679/16417 |
| Chongqing | 976 | 3240/65657 | 1362/5092 |
| Guangzhou | 1661 | 2612/62089 | 2070/15414 |
| Suzhou | 846 | 2421/63956 | 1840/3880 |
| Chengdu | 1704 | 3048/60422 | 960/9626 |
| Hangzhou | 1520 | 3246/72337 | 975/17486 |
| Wuhan | 1167 | 1658/38982 | 1175/9800 |
| Nanjing | 854 | 1792/42341 | 692/3778 |
| Tianjin | 815 | 1831/49546 | 768/3262 |
| Ningbo | 842 | 2446/66955 | 2098/1200 |
| Qingdao | 932 | 1605/44246 | 432/3118 |
| Wuxi | 364 | 1353/34317 | 1510/16910 |
| Changsha | 507 | 1161/36634 | 478/1685 |
Notes: Official data counts only physical route numbers; CPTOND-2025 and OSM data count upbound and downbound directions separately, resulting in approximately double the route numbers compared to official data.
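The relationship between directional and official route counts described in the notes can be checked per city by dividing the CPTOND-2025 route count by the official count. The sketch below does this for three rows of Table 4; the variable and function names are illustrative.

```python
# (official bus routes, CPTOND-2025 directional bus routes) from Table 4.
table4 = {
    "Shanghai": (1859, 3225),
    "Beijing": (2257, 4137),
    "Guangzhou": (1661, 2612),
}


def directional_ratio(official_routes, directional_routes):
    """Ratio of direction-separated route records to official route numbers."""
    return directional_routes / official_routes


ratios = {city: round(directional_ratio(o, d), 2) for city, (o, d) in table4.items()}
for city, ratio in ratios.items():
    print(f"{city}: {ratio}")
```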
The comparative analysis indicates that the CPTOND-2025 dataset offers clear data quality advantages. In terms of route coverage, CPTOND-2025 is consistent with official data: because routes are counted separately for upbound and downbound directions, its route numbers are approximately double the official statistics. Compared with OpenStreetMap, CPTOND-2025 is more complete and consistent, and it maintains this completeness across first-tier, new first-tier, and second-tier cities. In contrast, OpenStreetMap data exhibit notable gaps in some cities, reflecting the limitations of crowdsourced data in terms of systematic coverage and completeness; the difference is particularly pronounced in large cities with high station density. This comparison supports the effectiveness of CPTOND-2025's professional data source integration strategy and systematic collection methodology, and provides a solid data quality foundation for subsequent national-scale public transport network analysis.
Through multi-source data cross-validation, CPTOND-2025 reaches an internationally advanced level in core indicators including completeness, accuracy, consistency, and timeliness, providing a high-quality data foundation for research on China's public transport networks.
Usage notes and applications
Temporal validity
The data reflects the network status in June 2025, with specific collection dates from June 1–30, 2025. This time window was selected for the following reasons: (1) avoiding operational adjustment periods during major holidays such as Spring Festival; (2) relatively stable summer operating schedules; (3) ensuring consistency and comparability of data collection. Users should consider the dynamic characteristics of transport network changes, particularly regarding new route openings and route adjustments that may occur subsequent to data collection. For time-sensitive applications, users are advised to cross-reference with current official sources or contact the corresponding author regarding potential dataset updates.
Fare information update frequency
The fare information in CPTOND-2025 reflects the pricing policies as of June 2025. Public transport fares in China typically remain stable for extended periods (often 1–3 years or longer), with changes usually occurring due to government policy adjustments rather than frequent market fluctuations. However, users should note that fare structures may change over time, and the dataset represents a snapshot rather than real-time pricing. For applications requiring current fare information, users are advised to verify with official sources or consider the dataset as a baseline for comparative analysis. Future versions of this dataset may incorporate updated fare information as resources permit.
Taiwan province data limitations
Due to Amap API service restrictions, the update frequency and field consistency of Taiwan Province data are relatively limited.
Operating company information
Operating company information for some smaller cities may lack detail; these gaps are mainly concentrated at the county level.
Dynamic operation data
The dataset provides static operating time and fare information. It does not include real-time service intervals, dynamic scheduling, or other dynamic operation information.
Inter-city metro line considerations
Some metro lines operate across multiple cities (e.g., Beijing-Tianjin intercity lines). These lines appear in the datasets of both cities they serve. Users conducting multi-city analyses should be aware of this data structure and implement deduplication strategies based on line IDs when necessary. The ‘city_cn’ and ‘city_en’ fields indicate the administrative city of registration, but the route geometry may extend beyond city boundaries. For network connectivity analysis spanning multiple cities, users should check for duplicate line names and identical station sequences to identify inter-city lines.
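The check for duplicate line names and identical station sequences can be sketched as a grouping step. Only `city_en` is a documented field; `line_name` and `stations` are hypothetical field names, and the records are placeholders.

```python
from collections import defaultdict


def line_key(line):
    """Identify a line by its name and ordered station sequence (an assumed rule)."""
    return (line["line_name"], tuple(line["stations"]))


def find_intercity_lines(lines):
    """Return {key: sorted cities} for lines registered under more than one city."""
    groups = defaultdict(set)
    for line in lines:
        groups[line_key(line)].add(line["city_en"])
    return {key: sorted(cities) for key, cities in groups.items() if len(cities) > 1}


def deduplicate_lines(lines):
    """Keep the first record of each distinct line, preserving input order."""
    seen, kept = set(), []
    for line in lines:
        key = line_key(line)
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


# Placeholder records for illustration only.
lines = [
    {"city_en": "City A", "line_name": "Line X", "stations": ["S1", "S2", "S3"]},
    {"city_en": "City B", "line_name": "Line X", "stations": ["S1", "S2", "S3"]},
    {"city_en": "City A", "line_name": "Line Y", "stations": ["S1", "S4"]},
]
intercity = find_intercity_lines(lines)
unique_lines = deduplicate_lines(lines)
print(intercity)
print(len(unique_lines), "unique lines")
```

Where a stable line ID is available, it can replace the name-and-sequence key.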
Validation coverage considerations
While network completeness has been validated across all city tiers, detailed spatial accuracy validation focused on major cities where reference data is readily available. Users requiring high-precision spatial analysis for smaller cities should consider: (1) cross-referencing with local official sources where available; (2) conducting field verification for critical applications; (3) being aware that spatial accuracy in smaller cities may differ from the average deviations measured in major cities (3.69 m for bus stops and 34.79 m for metro stations).
Acknowledgements
We sincerely acknowledge the significant data support provided by various data providers, including the 8684 Public Transit Platform and Amap, for the construction of this dataset. Without the open data services of these platforms, this research would not have been possible. We extend special thanks to Gengchen Li for his valuable suggestions and technical guidance during the dataset construction process and data validation phase, whose professional insights played a crucial role in enhancing the dataset quality. This research was supported by the National Key Research and Development Program of China (Grant Numbers: 2021YFA1000300, 2021YFA1000304).
Author contributions
L.W. conceived the study, developed data collection methodology, performed data processing, and wrote the manuscript. H.W. conducted data validation and quality control. Y.G., L.O., D.X., X.H., M.Z., M.C. and D.S. participated in data validation procedures. D.G. and Z.Z. contributed to manuscript writing supervision. X.H.Z. and X.D.Z. provided technical supervision and contributed to manuscript revision.
Data availability
The CPTOND-2025 dataset is stored on Figshare at 10.6084/m9.figshare.29377427 (ref. 22).
Code availability
The complete data collection, processing, and validation code has been uploaded to Figshare (ref. 22). The specific contents comprise:
- Bus_Operation_Data_Crawler.py: A Python script for systematic acquisition of national bus operation network data, including route information and station locations.
- Metro_Operation_Data_Crawler.py: A Python script for systematic acquisition of national metro operation system data, including line information and station coordinates.
- Bus_Data_Processor.py: A bus data processing and quality control script that handles coordinate system transformation and generates standardized Shapefile formats for bus networks.
- Metro_Data_Processor.py: A metro data processing and quality control script that handles coordinate system transformation and generates standardized Shapefile formats for metro networks.
- Bus_Segment_Processor.py: A bus network segment analysis script that decomposes bus routes into inter-station network segments, calculates segment distances based on coordinate projection, and aggregates statistics for duplicate segments.
- Metro_Segment_Processor.py: A metro network segment analysis script that decomposes metro lines into inter-station network segments, calculates segment distances based on coordinate projection, and aggregates statistics for duplicate segments.
The code requires Python 3.8+ as the core development environment, with key dependencies including GeoPandas 0.10+ for geospatial data processing, Shapely 1.8+ for geometric computations, Requests 2.25+ for HTTP request handling, and BeautifulSoup4 4.9+ for web data parsing and crawling. Additionally, coordinate transformation libraries provide coordinate system conversion, while NetworkX supports network topology analysis and data validation procedures.
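The idea behind the segment processing scripts (splitting a route into inter-station segments, measuring each segment, and counting segments shared by several routes) can be illustrated with a minimal sketch. It does not reproduce the uploaded scripts: the input structure, the function names, the sample coordinates, and the use of a local equirectangular approximation as the projection are all assumptions.

```python
import math
from collections import Counter

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in metres


def segment_length_m(p, q):
    """Approximate length in metres between two (lon, lat) points in degrees,
    using a local equirectangular projection (suitable for short segments)."""
    lon1, lat1 = p
    lon2, lat2 = q
    mean_lat = math.radians((lat1 + lat2) / 2)
    x = math.radians(lon2 - lon1) * math.cos(mean_lat)
    y = math.radians(lat2 - lat1)
    return EARTH_RADIUS_M * math.hypot(x, y)


def route_segments(stops):
    """Split an ordered list of (stop_id, (lon, lat)) into directed segments.

    Returns a list of ((from_id, to_id), length_m) tuples.
    """
    return [
        ((a_id, b_id), segment_length_m(a_xy, b_xy))
        for (a_id, a_xy), (b_id, b_xy) in zip(stops, stops[1:])
    ]


def count_shared_segments(routes):
    """Count how many routes use each directed segment."""
    counts = Counter()
    for stops in routes.values():
        counts.update(seg for seg, _ in route_segments(stops))
    return counts


# Invented routes for illustration; s1 -> s2 is served by both.
routes = {
    "R1": [("s1", (116.30, 39.90)), ("s2", (116.31, 39.90)), ("s3", (116.32, 39.91))],
    "R2": [("s0", (116.29, 39.90)), ("s1", (116.30, 39.90)), ("s2", (116.31, 39.90))],
}
shared = count_shared_segments(routes)
print(shared.most_common(1))
```

Segments are treated as directed here, so a segment travelled in opposite directions is counted as two different segments.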
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Xinghua Zhang, Email: zhangxh@bjtu.edu.cn.
Xiaodong Zhang, Email: zhang-xd23@mails.tsinghua.edu.cn.
References
- 1. Gallotti, R. & Barthelemy, M. Anatomy and efficiency of urban multimodal mobility. Scientific Reports 4, 6911, 10.1038/srep06911 (2014).
- 2. Derrible, S. & Kennedy, C. The complexity and robustness of metro networks. Physica A: Statistical Mechanics and its Applications 389, 3678–3691, 10.1016/j.physa.2010.04.008 (2010).
- 3. Han, B. et al. Statistical Analysis and Review of Global Urban Rail Transit Operations in 2024. Urban Rapid Rail Transit 38, 1–12 (2025).
- 4. China Association of Metros. Essential Interpretation of the Urban Rail Transit 2024 Annual Statistical and Analysis Report. China Metros, 22–25, 10.14052/j.cnki.china.metros.2025.04.019 (2025).
- 5. Ministry of Transport of China. Statistical Bulletin on the Development of the Transport Industry 2023. (China Communications Press, Beijing, 2024).
- 6. Lu, K. et al. Urban Rail Transit in China: Progress Report and Analysis (2015–2023). Urban Rail Transit 11, 1–27, 10.1007/s40864-024-00231-7 (2025).
- 7. Salonen, M. & Toivonen, T. Modelling travel time in urban networks: comparable measures for private car and public transport. Journal of Transport Geography 31, 143–153, 10.1016/j.jtrangeo.2013.06.011 (2013).
- 8. Farber, S. & Fu, L. Dynamic public transit accessibility using travel time cubes: Comparing the effects of infrastructure (dis)investments over time. Computers, Environment and Urban Systems 62, 30–40, 10.1016/j.compenvurbsys.2016.10.005 (2017).
- 9. Zhong, C., Arisona, S. M., Huang, X., Batty, M. & Schmitt, G. Detecting the dynamics of urban structure through spatial network analysis. International Journal of Geographical Information Science 28, 2178–2199 (2014).
- 10. Liu, X., Gong, L., Gong, Y. & Liu, Y. Revealing travel patterns and city structure with taxi trip data. Journal of Transport Geography 43, 78–90 (2015).
- 11. Kujala, R., Weckström, C., Darst, R. K., Mladenović, M. N. & Saramäki, J. A collection of public transport network data sets for 25 cities. Scientific Data 5, 180089, 10.1038/sdata.2018.89 (2018).
- 12. Gallotti, R. & Barthelemy, M. The multilayer temporal network of public transport in Great Britain. Scientific Data 2, 140056, 10.1038/sdata.2014.56 (2015).
- 13. Wang, S., He, J., Ma, R., Cheng, Z. & Ding, H. A Comprehensive Vector Dataset of Bus Networks Across China for the Year 2024. Scientific Data 12, 524, 10.1038/s41597-025-04894-0 (2025).
- 14. von Ferber, C., Holovatch, T., Holovatch, Y. & Palchykov, V. Public transport networks: empirical analysis and modeling. The European Physical Journal B 68, 261–275, 10.1140/epjb/e2009-00090-x (2009).
- 15. Sen, P. et al. Small-world properties of the Indian railway network. Physical Review E 67, 036106, 10.1103/PhysRevE.67.036106 (2003).
- 16. Sienkiewicz, J. & Hołyst, J. A. Statistical analysis of 22 public transport networks in Poland. Physical Review E 72, 046127, 10.1103/PhysRevE.72.046127 (2005).
- 17. Fayyaz, S., Liu, S. K. & Zhang, X. C. G. An efficient General Transit Feed Specification (GTFS) enabled algorithm for dynamic transit accessibility analysis. PLOS ONE 12, e0185333, 10.1371/journal.pone.0185333 (2017).
- 18. Google Inc. General Transit Feed Specification Reference. https://developers.google.com/transit/gtfs/reference (2023).
- 19. Chen, Y. & Acheampong, R. A. Mobility-as-a-service transitions in China: Emerging policies, initiatives, platforms and MaaS implementation models. Case Studies on Transport Policy 13, 101054, 10.1016/j.cstp.2023.101054 (2023).
- 20. Li, L. et al. in Proceedings of the 20th International Systems and Software Product Line Conference. 271–275.
- 21. Luo, R. & Liu, Y. Bus Stop Information Data Collection and Analysis Technology Based on Web Crawler Technology. 2022 3rd International Conference on Computer Science and Management Technology (ICCSMT), 197–200 (2022).
- 22. Wang, L. et al. CPTOND-2025. figshare 10.6084/m9.figshare.29377427 (2025).
- 23. Yicai·New First-Tier Cities Research Institute. 2024 New First-Tier Cities Attractiveness Ranking. (Yicai·New First-Tier Cities Research Institute, Shanghai, 2024).