Abstract
Socio-economic data with fine-grained spatial resolution forms the basis of socio-spatial analysis and policymaking. In response to the limited availability of such data in China, this study provides an open-access, community-level dataset on education percentile rank — a more accurate indicator of social status than years of education. Our dataset comprises 122,126 communities, covering 97.9% of prefecture-level administrative units and 81.8% of county-level administrative units. The data is estimated using an XGBoost machine learning model based on the relationship between mean education percentile rank and the characteristics of the built environment, including functions and facilities, street scene elements, vitality, human perception, physical disorder, and topography at the community level. Multi-source data, including the Chinese General Social Survey, points of interest, road networks, night-time lighting, and street view images processed using computer vision techniques such as semantic segmentation, object detection, and image regression, are used for model training and inference. Our final education predictions are highly accurate at prefecture, county, and community levels. This dataset enables fine-grained socio-spatial analyses across disciplines.
Subject terms: Geography, Education, Socioeconomic scenarios
Background & Summary
Educational attainment is a fundamental indicator of individual socio-economic status, and together with income, constitutes the widely used International Socio-Economic Index (ISEI)1. Unlike income, which fluctuates over the life course and is often underreported due to its sensitive nature, educational attainment tends to be more stable and measurable. It also serves as a robust predictor of various socio-economic outcomes. In contemporary China, for example, higher levels of education are crucial for securing high incomes2, gaining social prestige3, entering elite administrative and professional positions4, achieving intergenerational upward mobility5,6, obtaining an urban hukou7, and attaining middle-class status8,9. While educational attainment is a powerful determinant of individual socio-economic outcomes, its significance extends beyond the individual level. When aggregated at the community level, it serves as a key indicator of a community’s socio-economic composition, resources, social cohesion, and developmental potential10. Prior studies have considered community-level educational attainment to analyze citywide socio-spatial issues, such as the distribution of deprivation, regional inequality, and segregation in the residential and activity spaces11,12.
In the U.S., the American Community Survey (ACS) collects annual data on educational attainment, alongside other socio-demographic characteristics across census units. By supplementing the decennial census, the ACS enables a wide range of social research and supports foundational understandings of American society. However, China has long lacked detailed educational data with fine spatial resolution. Annual statistical reports only publish aggregated population figures by educational attainment at the prefecture level, while decennial censuses release such data at the county/district level. Both of these have coarse geographic resolution. In response, the rise of big data and computational methods has enabled some internet companies to estimate users’ educational attainment based on consumption habits and spatio-temporal behaviors12,13. However, these alternative data sources face three major limitations. First, these estimation processes are a black box, making it difficult to assess the validity of their predictions. Second, internet product users are typically not representative of the broader population; for example, older adults and disadvantaged groups are less likely to engage with digital platforms. Third, location-based services (LBS) data are usually proprietary, fragmented across cities, and inaccessible to the public, which greatly limits the potential for large-scale, cross-city, and fine-grained socio-spatial research in China.
To overcome these limitations, the emerging urban visual intelligence analytical framework leverages geo-located big data and artificial intelligence techniques to identify characteristics of the built environment and their associations with social profiles and demographic statistics. This approach enables the prediction of socio-economic indicators at high spatial resolution and extensive geographic coverage with regular update frequencies, lower costs, and greater accessibility14. Pioneering research has demonstrated that street view elements extracted via semantic segmentation and object detection (e.g., vehicle characteristics), land use characteristics identified from remote sensing, and functional composition derived from points of interest (POIs) can accurately predict community-level socio-economic indicators15–17. Although visual cues from street view images are highly predictive, combining multi-source big data further improves prediction accuracy18. Other studies employ street scenes and computer vision techniques to estimate human perception such as perceived wealth and safety in street scenes, finding that these perceptions are strongly correlated with local educational attainment at the community level19–21.
Building on the urban visual intelligence analytical framework and using the XGBoost machine learning model, our study establishes the community-level relationship between various built environmental factors and education percentile rank — a cohort-specific, degree-inflation-adjusted indicator of relative social status22. The community education data for training are derived from six waves of the Chinese General Social Survey (CGSS). Then, based on our trained model, we predict the average education percentile rank of residents in communities across China for 2020. This analysis incorporates a diverse set of built environmental factors, including those quantified through deep learning methods such as image segmentation, object detection, and image regression. The robustness of our prediction results is evidenced by a high coefficient of determination (R2 > 0.9) on the test set, and strong correlations (r ≈ 0.85) between our estimates and the official census data or LBS-derived estimates at the prefecture, county, and community levels.
Our dataset provides mean education percentile ranks of residents in 122,126 Chinese communities in 2020. It covers a total area of 492,200 square kilometers and encompasses all 31 province-level administrative units, 326 prefecture-level administrative units (including prefecture-level cities, prefectures, autonomous prefectures, leagues, and municipalities), and 2,337 county-level administrative units (including districts, counties, and county-level cities) in mainland China. In terms of population coverage, our dataset covers 85.5% of the total population in these county-level units. Population data are sourced from WorldPop’s population count dataset at a resolution of 3 arc-seconds (~100 m at the equator) (https://hub.worldpop.org/geodata/listing?id=69). In terms of land coverage, our dataset encompasses 82.7% of the total urban impervious surface area (i.e., urban construction land) within these county-level administrative units in 2018. Land use data are provided by Li et al.23 (https://data-starcloud.pcl.ac.cn/iearthdata/14).
Designed to support high-resolution social research, our dataset of community-level education percentile rank has a wide range of potential applications. It enables detailed analysis of urban social structures including class, social inequality, and residential segregation, and serves as a critical input for examining other socio-spatial phenomena such as patterns of gentrification, crime, and housing market dynamics.
Methods
Overview
Our study takes communities as the unit of analysis. Community boundaries are defined by community or village committee areas, which are the smallest administrative units in China. The samples used to train the machine learning model are drawn from six waves of the CGSS, conducted in 2010, 2013, 2015, 2017, 2018, and 2021 (https://www.cnsda.org/index.php). The CGSS is widely used in social science research and is representative at the national level24. Communities that appeared in multiple waves are treated as distinct samples to account for changes in the built environment and demographic composition. A total of 2,730 communities make up the sample. Educational information in the sample data comes from the CGSS. We derived built environment variables for training from multiple big data sources, including street view images, POIs, road networks, and night-time light remote sensing imagery, for the corresponding years.
When applying the trained machine learning model to predict educational attainment in communities across the nation, communities with missing built environment data are excluded. There are two types of communities missing in our dataset. The first type is non-urbanized land, consisting of sparsely populated agricultural zones, ecological areas, and rural settlements. The second type comprises newly expanded urban construction land located in outer suburban areas. These newly built areas are characterized by low population density and limited socio-economic activities. In other words, our dataset covers almost all urban construction land in city centers and inner suburbs, as well as well-established urban construction land in outer suburbs, across 2,337 county-level administrative units. Given the nationwide availability of built environment data, we utilized 2020 data for prediction. Figure 1 presents the workflow diagram illustrating the entire process of training, prediction, and validation in this study.
Fig. 1.
Workflow of estimation and validation of the community-level education percentile rank.
Response: Education percentile rank
The education percentile rank measures an individual’s position relative to others in the same birth cohort in terms of educational attainment22. It ranges from 0 to 100. Unlike absolute years of schooling, the percentile rank addresses the issue of incomparable absolute values across generations, which results from the continuous expansion of education25,26.
Following the methodology of Xie and Zhang22, we estimated the education percentile rank for each birth cohort at different qualification levels based on the educational attainment composition within each cohort as recorded in the census. To mitigate the effects of mortality, we obtained the educational attainment composition of the population born in 1949 or earlier, 1950 to 1969, 1970 to 1984, and 1985 onwards from the 1982, 2000, 2010, and 2020 censuses, respectively. The population born between 1922 and 1997 was divided into 76 distinct birth cohorts, each corresponding to a specific year of birth. Data conversion is performed separately for every birth cohort. For each cohort, we calculated the percentage of the population at each level of education. We then ranked the population from lowest to highest according to their educational attainment and calculated the cumulative percentage for education level i. Finally, the percentile rank of education level i is determined by adding the cumulative percentage of the population with an education level lower than i and half the percentage of the population with an education level of i.
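The conversion described above can be illustrated with a minimal sketch. The function name `percentile_ranks` and the cohort composition below are hypothetical and chosen only for illustration; they are not census values.

```python
def percentile_ranks(shares):
    """Map ordered education levels to percentile ranks for one birth cohort.

    `shares` is a list of (level, percentage) pairs ordered from the lowest
    to the highest attainment; the percentages are expected to sum to 100.
    """
    ranks = {}
    cumulative_below = 0.0
    for level, pct in shares:
        # Rank = share strictly below this level + half the share at this level.
        ranks[level] = cumulative_below + pct / 2.0
        cumulative_below += pct
    return ranks


# Hypothetical cohort composition (illustrative numbers only).
cohort = [
    ("primary or below", 40.0),
    ("junior high", 30.0),
    ("senior high", 20.0),
    ("college or above", 10.0),
]
ranks = percentile_ranks(cohort)
for level, rank in ranks.items():
    print(f"{level}: {rank:.1f}")
```

In this toy cohort, senior high school graduates receive a rank of 80.0: 70% of the cohort sits below them, and half of their own 20% share is added.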
We use the mean education percentile rank among residents of each community as the response in our models, for three reasons. First, the education percentile ranks of successive birth cohorts reveal a continuous trend of degree inflation in China, except during the Cultural Revolution (Fig. 2). For example, the education percentile rank of individuals born in 1935 who completed senior high school is 96.6, indicating that only about 3.4% of their cohort ranks above them. However, for those born in 1995 with the same level of education, the rank drops to 45.0. Clearly, it is unreasonable to assume that people from different cohorts have the same social status simply because they have achieved the same level of education. Second, competition for social resources typically occurs between individuals of the same generation. For example, people born in the 2000s compete with each other for their first jobs, while those born in the 1970s or 1980s compete with their peers for promotions. Consequently, even when individuals from different generations have different levels of education, their capacity to access social resources is similar if they are in the same percentile rank within their cohort22,25. Third, distinct spatial distribution patterns emerge among different age groups in Chinese cities. The elderly tend to live in city centers, whereas most younger people opt for more affordable housing in the suburbs27,28. As a result, the spatial distribution of absolute educational attainment is closely linked to the spatial patterns of age groups. Using the education percentile rank largely overcomes the confounding effects of age-related factors.
Fig. 2.
Percentile rank of education level for each birth cohort.
Features: Built environment
We systematically measured the characteristics of the built environment from six perspectives: functions and facilities, street scene elements, vitality, physical disorder, human perception, and topography. We compared the predictive accuracy of models using different spatial units to measure the built environment, comprising (1) standalone community areas, and (2) community areas integrated with concentric buffer zones ranging from 0.5 to 3 km at 0.5 km intervals. Among these configurations, the machine learning model that employed the 1 km buffer integration performed best in terms of test-set predictive accuracy. Therefore, we measured the characteristics of the built environment within each community and its 1 km buffer zone.
Functions and facilities
We measured functions and facilities based on the classic ‘3D’ (Density, Diversity, Design) framework29. We introduced 16 features to reflect density: the POI densities of retail stores, cultural facilities, schools, sports facilities, hospitals, administrative institutions, business offices, factories, warehouses, car parks, parks and squares; the shortest road network distances to subway stations and bus stops; building density; average building height; and vegetation coverage. The entropy index of the aforementioned eleven types of POI facilities was used to measure functional diversity30. We employed road network density and the proportion of arterial and branch roads relative to the total road length to reflect design. We obtained POI, road network, building and vegetation data from the Amap Open Platform (https://lbs.amap.com/), OpenStreetMap (https://download.geofabrik.de/), the Building Footprint Dataset (10.5281/zenodo.11397015/)31 and the Normalized Difference Vegetation Index (NDVI) from the Moderate Resolution Imaging Spectroradiometer (MODIS) (https://modis.gsfc.nasa.gov/data/dataprod/mod13.php/)32, respectively.
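As an illustration of the diversity measure, the sketch below computes an entropy index from POI counts. The text names the entropy index but does not state its exact form, so the Shannon entropy normalized by the logarithm of the number of categories is an assumption here, and the counts are hypothetical.

```python
import math


def entropy_index(counts):
    """Normalized Shannon entropy of category counts (0 = one type, 1 = even mix).

    Assumption: normalization by log(k), where k is the number of categories.
    """
    total = sum(counts)
    k = len(counts)
    if total == 0 or k < 2:
        return 0.0
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    return h / math.log(k)


# Hypothetical POI counts for the eleven facility types in one spatial unit.
poi_counts = [120, 8, 15, 6, 4, 10, 60, 3, 2, 25, 5]
diversity = entropy_index(poi_counts)
print(round(diversity, 3))
```

A unit with a single facility type scores 0, and a unit with equal counts across all eleven types scores 1.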
Street scene elements
We extracted pixel-level environmental elements from street view images through semantic segmentation with the Swin Transformer model pre-trained on the ADE20K dataset33,34. We then calculated the pixel proportion of each of the 17 element types in every image, including sky, wall, pedestrian, plant and tree, fence, water, window, door, signboard, building, road, sidewalk, car, bus, truck, streetlight and pole, and bench and chair. For each element type, the mean pixel proportion across all street view images within the community and its 1 km buffer zone was used as a predictor. In addition, we calculated the Simpson index of the pixel proportions of the 17 element types to measure street scene diversity35. To collect street view images, we obtained road network data for mainland China from OpenStreetMap in 2021 and set image sampling points at 100-meter intervals along the centerline of all roads. Using the Baidu Panorama Static Image API (https://lbsyun.baidu.com/faq/api?title=viewstatic-base/), we retrieved all street view images taken at each sampling point between 2013 and 2021 at four horizontal angles (0°, 90°, 180°, and 270°) and one vertical angle (20°). A total of 20,814,846 valid street view images were acquired.
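The aggregation of pixel proportions and the diversity measure can be sketched as follows. The exact variant of the Simpson index is not specified in the text; the Gini-Simpson form (one minus the sum of squared proportions, after renormalizing over the listed element types) is assumed here, and the per-image values are hypothetical.

```python
def mean_proportions(images):
    """Average each element's pixel proportion across a list of images."""
    n = len(images)
    return {key: sum(img[key] for img in images) / n for key in images[0]}


def simpson_diversity(proportions):
    """Gini-Simpson index: 1 - sum(p_i^2), with p_i renormalized to sum to 1."""
    total = sum(proportions)
    if total == 0:
        return 0.0
    return 1.0 - sum((p / total) ** 2 for p in proportions)


# Hypothetical segmentation output for two images (two element types only).
images = [
    {"sky": 0.50, "road": 0.50},
    {"sky": 0.25, "road": 0.75},
]
unit_means = mean_proportions(images)
unit_diversity = simpson_diversity(list(unit_means.values()))
print(unit_means, unit_diversity)
```

With 17 element types the index ranges from 0 (a single element fills every image) to 16/17 (all elements equally present).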
Vitality
We considered two commonly used measures: night-time lighting and ambient population density36. We used open-access night-time light remote sensing imagery at a resolution of 15 arc-seconds (~500 m at the equator) from the Visible Infrared Imaging Radiometer Suite (VIIRS) sensor on the Suomi National Polar-orbiting Partnership (NPP) satellite (https://eogdata.mines.edu/products/vnl/)37,38. Ambient population data comes from the LandScan dataset (https://landscan.ornl.gov/). This provides the daily average ambient population count for each year, with a resolution of 30 arc-seconds (~1 km at the equator)39.
Physical disorder
Visual signs of physical disorder, such as public incivility and deterioration, are a barometer of neighborhoods’ socio-economic conditions, including concentrated disadvantage and social disorganization40–43. We employed object detection to identify physical disorder by marking locations in each image with bounding boxes for the three most prevalent types: litter, graffiti (including illegal advertising), and street encroachment (including abandoned vehicles, piles of debris, and street vendors)44. We manually annotated 23,353 Chinese street view images to create datasets for the three types of physical disorder (Fig. 3), and divided them into training and test sets at an 8:2 ratio. We then trained the disorder detection model with YOLOv7 for 50 epochs45. The best-performing model was used for inference on all street view images across China (Fig. 4). On the test set of the physical disorder datasets, this model achieves a precision of 94.1%, a recall of 85.2%, an F1 score of 89.4%, and a mean average precision at an intersection over union (IoU) threshold of 0.5 (i.e., mAP50) of 0.899. Lastly, for each type of physical disorder, we calculated the average area ratio of its bounding boxes across all images within each community and its 1 km buffer zone. These three area ratios were taken as features.
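The area-ratio feature can be sketched as follows for one disorder type. The box format `(x_min, y_min, x_max, y_max)` in pixels and the image size are assumptions, and overlapping boxes are simply summed, which may differ from how overlaps were handled in practice.

```python
def box_area_ratio(boxes, width, height):
    """Ratio of the summed bounding-box area to the total image area.

    Simplification: overlapping boxes are counted twice.
    """
    area = sum(max(0, x2 - x1) * max(0, y2 - y1) for x1, y1, x2, y2 in boxes)
    return area / (width * height)


def community_mean_ratio(per_image_ratios):
    """Average the per-image ratios over all images in a spatial unit."""
    if not per_image_ratios:
        return 0.0
    return sum(per_image_ratios) / len(per_image_ratios)


# Hypothetical litter detections in three 1000 x 500 pixel images.
detections = [
    [(0, 0, 100, 50), (200, 100, 300, 200)],  # two boxes
    [],                                        # no litter detected
    [],
]
ratios = [box_area_ratio(boxes, 1000, 500) for boxes in detections]
litter_feature = community_mean_ratio(ratios)
print(ratios, litter_feature)
```

Images without detections contribute a ratio of zero, so the feature reflects both how often and how extensively the disorder type appears.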
Fig. 3.
Example of annotated physical disorder elements using LabelImg. We selected 23,353 street view images containing litter, graffiti, and street encroachment. We then used the LabelImg annotation tool to draw a bounding box around each target element in each image.
Fig. 4.
Examples of automatic inference for physical disorder. For each image, we computed the ratio of the pixel area of the bounding boxes for litter, graffiti, and street encroachment to the total pixel area of the image.
Human perception
Human perception refers to individuals’ feelings about a place, reflecting their overall evaluation of environmental quality46. These subjective perceptions are closely linked to a community’s objective socio-economic status, including its education level19. We employed image regression to measure two perceptions: perceived wealth and perceived safety. Following the design of the MIT Place Pulse dataset47, we created a dataset consisting of 32,196 Chinese street view images with annotated perception scores. Because urban planners are experienced in evaluating environmental quality, we invited 40 of them to serve as volunteer annotators. The volunteers performed a total of 160,970 pairwise comparisons of perceived wealth and safety between images in this dataset (Fig. 5). These comparison results were converted into wealth and safety perception scores for each image in the interval [0,10] using the Microsoft TrueSkill algorithm19,48. Higher scores represent greater perceived wealth or safety. Our perception datasets were also divided into training and test sets at a ratio of 8:2. We trained the image regression models on these datasets using the deep convolutional neural network EfficientNetV2 with modified fully connected layers49. The best-performing models for wealth and safety perception were obtained after 40 and 43 epochs, respectively. The model that best predicts wealth perception has a coefficient of determination (R2) of 0.883, a mean absolute error (MAE) of 0.065 and a root mean square error (RMSE) of 0.088 on the test set. The counterpart model for safety perception has respective values of 0.875, 0.074 and 0.095. After inference (Fig. 6), the average wealth perception score and average safety perception score of all the street view images in each community and its 1 km buffer zone were used as features.
Fig. 5.
Qt Graphical user interface for pairwise comparison of street view images. We developed a pairwise image comparison program using the PyQt5 Python package. In the Qt graphical user interface, the questions used for comparison are “Which scene looks wealthier?” and “Which scene looks safer?”. Volunteers need to choose “left”, “right”, or “same”.
Fig. 6.
Examples of automatic inferences for wealth and safety perception scores. Panels a to c show three scenarios of wealth perception, ranging from low to high. Panels d to f show the corresponding scenarios for safety perception.
Topography
Since the topography directly affects the level of socio-economic development in an area, we introduced the average altitude and slope of each spatial unit as features. This data is sourced from the NASA/JAXA ASTER GDEM V3 digital elevation model (DEM), with a resolution of 1 arc-second (~30 meters) (https://search.earthdata.nasa.gov/).
Model training and prediction
We employed an XGBoost regressor combined with missing value imputation and Bayesian hyperparameter optimization to train our model and then predict each community’s mean education percentile rank. The model was trained in two stages: imputation and hyperparameter optimization.
When measuring the features of the built environment, some features were found to have missing values in 896 sample communities. Simply excluding these samples could lead to model underfitting and information loss, resulting in biased parameter estimation and reduced predictive efficacy50. Therefore, we employed the multivariate imputation by chained equations (MICE) method51 with Bayesian ridge regression as the imputation model. This method performs well at filling in missing values52. The imputation proceeds as follows. First, the mean value of each feature is used to fill in its missing values. Second, the relationships between the features are modeled. We treat a feature with missing values as y and the remaining features as x, and fit a Bayesian ridge regression model to the observed cases. The values predicted by this model for the missing entries of y replace the initial filled values. This step is carried out sequentially for each feature, starting with the feature that has the fewest missing values and ending with the feature that has the most. Once all features with missing values have been imputed by regression, one iteration is complete. Third, the procedure is repeated for ten iterations, and the imputed values from the last iteration are taken as final. If the imputed values for a feature do not differ significantly between two consecutive iterations, the imputation for that feature can be stopped early.
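A minimal sketch of such a chained-equation imputation is given below, assuming scikit-learn's `IterativeImputer` as a stand-in for the procedure described above; the text does not state which implementation was used. The feature matrix is synthetic. Note that the `tol` stopping rule in scikit-learn is applied to the whole matrix rather than feature by feature, which differs slightly from the per-feature early stopping described in the text.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix (rows = communities, columns = features).
X = rng.normal(size=(200, 5))
X[:, 1] += 0.8 * X[:, 0]  # make two features correlated

# Remove roughly 10% of the entries at random.
mask = rng.random(X.shape) < 0.1
X_missing = X.copy()
X_missing[mask] = np.nan

imputer = IterativeImputer(
    estimator=BayesianRidge(),     # regression model for each feature
    initial_strategy="mean",       # step 1: fill with feature means
    imputation_order="ascending",  # fewest missing values first
    max_iter=10,                   # ten rounds at most
    tol=1e-3,                      # global early-stopping tolerance
    random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)
print(np.isnan(X_imputed).sum())
```

Observed entries are left unchanged; only the entries that were missing receive model-based estimates.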
Then, we performed Bayesian hyperparameter optimization to train the XGBoost machine learning model. The sample community data was divided into training and test sets at a ratio of 8:2. The test set contained only samples without missing values. Subsequently, Bayesian optimization based on a tree-structured Parzen estimator (TPE) was used to tune the hyperparameters within the defined hyperparameter space53. The objective function is the average R2 value obtained through 10-fold cross-validation. Unlike traditional methods such as grid search and random search, this approach uses a TPE to construct a probabilistic model. It leverages historical evaluation results to estimate the probability distribution of hyperparameter configurations relative to the observed performance metric, and then selects configurations with a higher expected improvement for subsequent evaluation. This enables the efficient identification of a near-optimal configuration within limited computing resources54,55.
After 500 trials, the optimal hyperparameter configuration was found to be as follows: n_estimators = 345, max_depth = 4, learning_rate = 0.193, subsample = 0.665, reg_alpha = 0.005, and reg_lambda = 10. Compared to the model trained without missing value imputation, the final model has an R² that is 0.182 higher on the test set. The mean absolute error (MAE) and root mean squared error (RMSE) of the final model are also 2.435 and 3.341 lower, respectively (see more details in Technical Validation). These gains indicate that the imputation step is both necessary and effective.
Finally, we used the above trained XGBoost model to predict the mean education percentile rank of each community based on built environment characteristics across the country. To ensure accuracy, areas with missing environmental big data (mostly located in agricultural and ecological zones) were excluded from the prediction.
Our dataset can be updated every three to five years, provided that two conditions are met. First, the XGBoost model used to predict community education percentile rank should be retrained once the CGSS has released new survey data for sample communities, which occurs approximately every two years. Second, the nationwide community-level built environment features should be updated regularly to supply the predictors required for inference.
Data Records
The predicted community-level education percentile rank datasets for 2020 are stored as GeoTIFF (.tif) files in the WGS84 geographic coordinate system. They are all available in the public repository Figshare (10.6084/m9.figshare.29654591)56. The first part of the dataset, “National Dataset on Community-Level Education Percentile Rank Estimation in China.zip”, is a ZIP archive of GeoTIFF raster files (.tif). The archive contains the following files.
*.tif: The primary GeoTIFF raster file containing community-level education percentile rank. This file can be processed directly using common GIS desktop software such as ArcGIS and QGIS, as well as programming language packages such as Rasterio in Python.
*.tfw: A plain-text world file that records the affine transformation parameters (e.g., pixel size, rotation terms, and the coordinates of the raster origin).
*.tif.aux.xml: An auxiliary XML file that stores derived metadata such as per-band statistics, histograms, color tables, and processing history.
*.tif.ovr: An external overview (pyramid) file containing lower-resolution representations of the raster at multiple zoom levels. This speeds up the display and navigation of the raster in ArcGIS software.
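For users who work without GIS software, the affine transformation stored in a world file can be applied directly. The sketch below follows the standard six-line world file layout (A, D, B, E, C, F); the parameter values are hypothetical and do not come from the published files.

```python
def parse_world_file(text):
    """Read the six world-file parameters in their on-disk order: A, D, B, E, C, F."""
    a, d, b, e, c, f = (float(token) for token in text.split())
    return a, d, b, e, c, f


def pixel_to_map(col, row, params):
    """Map a pixel (col, row) to the coordinates of that pixel's center."""
    a, d, b, e, c, f = params
    x = a * col + b * row + c  # A: pixel width, B: rotation term, C: x of upper-left center
    y = d * col + e * row + f  # D: rotation term, E: negative pixel height, F: y of upper-left center
    return x, y


# Hypothetical .tfw content: 0.001-degree pixels, no rotation.
tfw_text = "0.001\n0.0\n0.0\n-0.001\n116.0005\n40.9995\n"
params = parse_world_file(tfw_text)
origin = pixel_to_map(0, 0, params)
lon, lat = pixel_to_map(10, 20, params)
print(origin, (lon, lat))
```

Because the fourth parameter is negative, the row index increases southward, so latitude decreases as the row number grows.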
For the convenience of users, the second part of the dataset provides separate GeoTIFF datasets for each of China’s 31 provinces. Each provincial dataset is a ZIP archive named “Provincial Dataset on Community-Level Education Percentile Rank Estimation (Province Name)”. The content and file organization of each provincial archive follow the structure described above.
As the third part of the dataset, we publish a simplified table version showing the mean education percentile rank of residents in each community. This table includes the community’s name, the latitude and longitude of its centroid, and the names of the county-, prefecture- and province-level administrative units in which it is located.
Figure 7 illustrates the spatial distribution of the predicted education percentile rank across six Chinese megacities. Overall, the highest levels of educational attainment are found in city centers, followed by suburban subcenters and then more distant suburbs. However, some socio-spatial distribution characteristics vary across cities. For example, residents of historical districts in the city center of Beijing are not the most highly educated. In Shenzhen, the distribution of highly educated communities is polycentric, while in Guangzhou it is monocentric.
Fig. 7.
The spatial distribution of education percentile rank across six selected megacities in China. Panels a to f are Beijing, Shanghai, Guangzhou, Shenzhen, Chengdu, and Wuhan, respectively.
Technical Validation
We conducted three sets of technical validations. First, we examined the performance of the XGBoost model using the test set. Second, we aggregated the predicted educational attainment of each community at prefecture and county levels and compared these values with the published census results, which we used as the presumed ground truth. Third, for selected cities where community-level educational attainment estimates are available from census data or LBS data, we assessed the similarity between our model’s predictions and these census or LBS benchmarks at the community level.
Predictive accuracy in the test set
On the test set, the mean absolute error (MAE) and the root mean squared error (RMSE) of our estimation are 3.808 and 5.203, respectively. The coefficient of determination (R2) is 0.918, indicating that the feature vector of built environmental factors explains 91.8% of the variance in community education percentile rank in our test set. Based on these indicators, our prediction is slightly more accurate than that of a similar study conducted in the United States18.
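For reference, the three accuracy indicators can be computed as in the following sketch. The observed and predicted ranks are hypothetical and serve only to show the calculation.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R2) for paired observed and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)       # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Hypothetical community ranks: observed versus predicted.
observed = [40.0, 50.0, 60.0, 70.0]
predicted = [42.0, 48.0, 63.0, 69.0]
mae, rmse, r2 = regression_metrics(observed, predicted)
print(mae, round(rmse, 3), round(r2, 3))
```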
Prefecture-/county-level correlation analysis
China’s population census publishes the proportion of people with different levels of education in each prefecture- and county-level administrative unit. We calculated the mean years of education in each prefecture- and county-level administrative unit using the following conversion: no schooling or preschool = 0 years; elementary school = 6 years; middle school = 9 years; high school = 12 years; college = 15 years; undergraduate college = 16 years; master’s degree = 19 years; doctoral degree = 22 years. We also aggregated the estimates for each community from this study into prefecture- and county-level administrative units. For each of these units, we calculated the mean education percentile rank across all communities, weighted by the population size of each community according to the WorldPop dataset (Fig. 8).
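Both aggregation steps can be sketched briefly. The conversion table follows the values listed above, while the census shares, community ranks, and population counts are hypothetical.

```python
# Years of schooling assigned to each census education level (from the text).
YEARS = {
    "no schooling or preschool": 0,
    "elementary school": 6,
    "middle school": 9,
    "high school": 12,
    "college": 15,
    "undergraduate college": 16,
    "master's degree": 19,
    "doctoral degree": 22,
}


def mean_years_of_education(shares):
    """Share-weighted mean years of schooling for one administrative unit."""
    total = sum(shares.values())
    return sum(YEARS[level] * share for level, share in shares.items()) / total


def weighted_mean_rank(communities):
    """Population-weighted mean of community ranks; items are (rank, population)."""
    total_pop = sum(pop for _, pop in communities)
    return sum(rank * pop for rank, pop in communities) / total_pop


# Hypothetical census shares (percent) for one county.
county_shares = {
    "elementary school": 20,
    "middle school": 40,
    "high school": 25,
    "undergraduate college": 15,
}
county_years = mean_years_of_education(county_shares)

# Hypothetical communities in that county: (estimated mean rank, population).
county_rank = weighted_mean_rank([(60.0, 1000), (40.0, 3000)])
print(county_years, county_rank)
```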
Fig. 8.
Spatial distribution of education percentile rank. Panels a and b represent the distribution of mean education percentile rank at the prefecture and county level, respectively.
Next, we conducted a correlation analysis between the 2020 census results and the aggregated values of our estimates. To prevent areas with low built environment big data coverage, and therefore few community estimates, from distorting the correlation analysis, we excluded county-level administrative units where less than 10% of the urban land area was covered by street view images. We observed a strong positive correlation between the census mean years of education and our aggregated mean education percentile rank. The Pearson correlation coefficients for prefecture- and county-level administrative units are 0.87 and 0.84, respectively (Fig. 9a,b). To ensure robustness, we experimented with different minimum coverage thresholds, such as 20%, 30%, and 40%. The Pearson correlation coefficient remained consistently above 0.8. Given the aforementioned differences between years of education and education percentile rank, this degree of correlation is very high at the macro level.
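The coverage filter and the robustness check can be sketched as follows. The records, field names, and values are hypothetical; only the thresholds mirror those mentioned above.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def filter_units(units, min_coverage):
    """Keep units whose street view coverage reaches the threshold."""
    return [u for u in units if u["coverage"] >= min_coverage]


# Hypothetical county records: coverage share, census years, estimated rank.
units = [
    {"coverage": 0.05, "years": 8.0, "rank": 70.0},  # sparse coverage, unreliable
    {"coverage": 0.15, "years": 9.0, "rank": 40.0},
    {"coverage": 0.35, "years": 10.0, "rank": 50.0},
    {"coverage": 0.50, "years": 11.0, "rank": 55.0},
    {"coverage": 0.80, "years": 12.0, "rank": 65.0},
]

r_by_threshold = {}
for threshold in (0.10, 0.20, 0.30, 0.40):
    kept = filter_units(units, threshold)
    r_by_threshold[threshold] = pearson_r(
        [u["years"] for u in kept], [u["rank"] for u in kept]
    )
print(r_by_threshold)
```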
Fig. 9.
Correlation between census/LBS education indicators and our estimates. Panels a and b show the relationship between census mean years of schooling and our estimated mean education percentile rank at prefecture and county level in China, respectively. Panel c shows the relationship between the percentage of the low-educated derived from LBS data and our estimated mean education percentile rank at community level in Beijing. Panel d shows the relationship between census mean years of schooling and our estimated mean education percentile rank at community level in Guangzhou.
Community-level correlation analysis
As previously mentioned, socio-economic data with high spatial resolution is extremely scarce in China. Therefore, we chose two cases for community-level validation. First, we used a location-based services (LBS) big dataset covering the area within Beijing’s 5th Ring Road (i.e., the urban center), provided by one of China’s largest internet companies (http://huiyan.baidu.com). This dataset predicts users’ education levels based on their spatio-temporal trajectories, consumption information, and other factors when using online maps and apps12. The Pearson correlation coefficient between the proportion of the LBS population with low educational attainment (junior high school level or below) and our estimated mean education percentile rank across communities in Beijing is −0.873, indicating a strong negative correlation (Fig. 9c).
Second, we compared the mean years of education for each community across all Guangzhou districts, derived from the restricted-access 2020 census, with our estimate of the mean education percentile rank. The Pearson correlation coefficient between the two is 0.836, further demonstrating the validity of our predicted community education levels (Fig. 9d).
While the prediction accuracy of our dataset is high, there is still room for improvement. One potential constraint is the accuracy with which built environment features can be measured. For example, measures of street scene elements, physical disorder, and human perception rely on street view images, which are only available for urban public streets. The environment within gated communities remains inaccessible for observation and measurement. Future research could integrate images from channels such as social media to increase the spatial coverage of visual data.
Usage Notes
The community-level education percentile rank dataset is in GeoTIFF format. The dataset is publicly available under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0), which permits access to and sharing of the unmodified dataset for non-commercial purposes with appropriate attribution. It can be processed using GIS software such as ArcGIS and QGIS, as well as programming language packages such as Rasterio in Python.
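As a starting point for the Python route, the following sketch reads one band of a GeoTIFF with Rasterio and summarises the valid pixels. It is a hedged example rather than an official loader: the file name `education_rank.tif` is hypothetical, and the assumptions that the estimates sit in band 1 and that missing areas are flagged through the file's nodata value or mask should be checked against the downloaded files.

```python
def read_valid_values(path, band=1):
    """Return the non-masked pixel values of one GeoTIFF band as a list.

    Rasterio is imported here so the summary helper below can be used
    without it. The band index is an assumption about the file layout.
    """
    import rasterio  # third-party: pip install rasterio

    with rasterio.open(path) as src:
        # masked=True hides pixels flagged by the nodata value or mask.
        data = src.read(band, masked=True)
    return data.compressed().tolist()


def summarize(values):
    """Basic descriptive statistics for a sequence of pixel values."""
    values = list(values)
    if not values:
        raise ValueError("no valid pixels to summarise")
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


# Example usage (hypothetical file name):
# stats = summarize(read_valid_values("education_rank.tif"))
# print(stats)
```

Reading with `masked=True` keeps nodata pixels out of the statistics. If the files do not declare a nodata value, the mask will be empty and placeholder values would need to be filtered explicitly.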
Acknowledgements
The authors would like to thank Guangwen Song of Guangzhou University for providing some of the street view images and for their technical support. We would like to express our gratitude to the editor and the anonymous reviewers for their valuable comments and suggestions.
Author contributions
Y.J. Zhang conceived the original idea and supervised the research. Z.Y. Pan and B. Qin collected the data and performed data cleaning. Z.Y. Pan and Y.Y. You developed the methodological framework, produced this dataset, and analyzed the results. Y.J. Zhang and L. Cai wrote the manuscript.
Data availability
The datasets of community-level education percentile rank estimation in China are openly available on Figshare (10.6084/m9.figshare.29654591)56.
Code availability
The community-level education percentile rank dataset was created using Python 3.9.7 and the ArcGIS 10.6 software platform. The code for our extreme gradient boosting (XGBoost) machine learning model, which is used to predict community education percentile ranks, is available in the public repository Figshare (10.6084/m9.figshare.29648798)57.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Ganzeboom, H. B. G., Graaf, P. M. D. & Treiman, D. J. A standard international socio-economic index of occupational status. Social Science Research. 21, 1–56 (1992).
- 2. Xiao, Y. & Bian, Y. The influence of hukou and college education in China’s labour market. Urban Studies. 55, 1504–1524 (2018).
- 3. Xie, Y., Dong, H., Zhou, X. & Song, X. Trends in social mobility in postrevolution China. Proceedings of the National Academy of Sciences of the United States of America. 119, e2117471119 (2022).
- 4. Walder, A. G., Li, B. & Treiman, D. J. Politics and life chances in a state socialist regime: Dual career paths into the urban Chinese elite, 1949 to 1996. American Sociological Review. 65, 191–209 (2000).
- 5. Nee, V. A theory of market transition: From redistribution to markets in state socialism. American Sociological Review. 54, 663–681 (1989).
- 6. Yan, W. & Deng, X. Intergenerational income mobility and transmission channels in a transition economy: Evidence from China. Economics of Transition and Institutional Change. 30, 183–207 (2022).
- 7. Wu, X. & Treiman, D. J. Inequality and equality under Chinese socialism: The Hukou system and intergenerational occupational mobility. American Journal of Sociology. 113, 415–445 (2007).
- 8. Goodman, D. S. G. Middle class China: Dreams and aspirations. Journal of Chinese Political Science. 19, 49–67 (2014).
- 9. Ponzini, A. Educating the new Chinese middle-class youth: The role of quality education on ideas of class and status. The Journal of Chinese Sociology. 7, 1–18 (2020).
- 10. Sampson, R. J. Great American city: Chicago and the enduring neighborhood effect. (University of Chicago Press, 2013).
- 11. He, Q., Musterd, S. & Boterman, W. Understanding different levels of segregation in urban China: A comparative study among 21 cities in Guangdong province. Urban Geography. 43, 1036–1061 (2022).
- 12. Zhang, Y., Wang, J. & Kan, C. Temporal variation in activity-space-based segregation: A case study of Beijing using location-based service data. Journal of Transport Geography. 98, 103239 (2022).
- 13. Chen, Y., He, J., Wei, W., Zhu, N. & Yu, C. A multi-model approach for user portrait. Future Internet. 13, 147 (2021).
- 14. Zhang, F. et al. Urban visual intelligence: Studying cities with artificial intelligence and street-level imagery. Annals of the American Association of Geographers. 114, 876–897 (2024).
- 15. Gebru, T. et al. Using deep learning and Google street view to estimate the demographic makeup of neighborhoods across the United States. Proceedings of the National Academy of Sciences of the United States of America. 114, 13108–13113 (2017).
- 16. Suel, E., Polak, J. W., Bennett, J. E. & Ezzati, M. Measuring social, environmental and health inequalities using deep learning and street imagery. Scientific Reports. 9, 6229 (2019).
- 17. Suel, E., Bhatt, S., Brauer, M., Flaxman, S. & Ezzati, M. Multimodal deep learning from satellite and street-level imagery for measuring income, overcrowding, and environmental deprivation in urban areas. Remote Sensing of Environment. 257, 112339 (2021).
- 18. Fan, Z., Zhang, F., Loo, B. P. Y. & Ratti, C. Urban visual intelligence: Uncovering hidden city profiles with street view images. Proceedings of the National Academy of Sciences of the United States of America. 120, e2220417120 (2023).
- 19. Naik, N. et al. Computer vision uncovers predictors of urban change. Proceedings of the National Academy of Sciences of the United States of America. 114, 7571–7576 (2017).
- 20. Rossetti, T., Lobel, H., Rocco, V. & Hurtubia, R. Explaining subjective perceptions of public spaces as a function of the built environment: A massive data approach. Landscape and Urban Planning. 181, 169–178 (2019).
- 21. Zhang, Y. et al. Quantifying physical and psychological perceptions of urban scenes using deep learning. Land Use Policy. 111, 105762 (2021).
- 22. Xie, Y. & Zhang, C. The long-term impact of the Communist Revolution on social stratification in contemporary China. Proceedings of the National Academy of Sciences of the United States of America. 116, 19392–19397 (2019).
- 23. Li, X. et al. Mapping global urban boundaries from the global artificial impervious area (GAIA) data. Environmental Research Letters. 15, 094044 (2020).
- 24. Bian, Y. & Li, L. The Chinese General Social Survey (2003–2008): Sample designs and data evaluation. Chinese Sociological Review. 45, 70–97 (2012).
- 25. Chen, Y., Naidu, S., Yu, T. & Yuchtman, N. Intergenerational mobility and institutional change in 20th century China. Explorations in Economic History. 58, 44–73 (2015).
- 26. Li, M. & Cao, J. Multi-generational educational mobility in China in the twentieth century. China Economic Review. 80, 101990 (2023).
- 27. Shi, Z. et al. A data-driven framework for analyzing spatial distribution of the elderly cardholders by using smart card data. ISPRS International Journal of Geo-Information. 10, 728 (2021).
- 28. Wang, D. & Li, S. Socio-economic differentials and stated housing preferences in Guangzhou, China. Habitat International. 30, 305–326 (2006).
- 29. Cervero, R. & Kockelman, K. Travel demand and the 3Ds: Density, diversity and design. Transportation Research Part D-Transport and Environment. 2, 199–219 (1997).
- 30. Sung, H., Lee, S. & Cheon, S. Operationalizing Jane Jacobs’s urban design theory: Empirical verification from the great city of Seoul, Korea. Journal of Planning Education and Research. 35, 117–130 (2015).
- 31. Che, Y. et al. 3D-GloBFP: The first global three-dimensional building footprint dataset. Earth System Science Data. 16, 5357–5374 (2024).
- 32. Huang, S., Tang, L., Hupy, J. P. & Shao, G. A commentary review on the use of normalized difference vegetation index (NDVI) in the era of popular remote sensing. Journal of Forestry Research. 32, 1–6 (2021).
- 33. Liu, Z. et al. Swin Transformer: Hierarchical vision transformer using shifted windows. (IEEE/CVF International Conference on Computer Vision, 2021).
- 34. Zhou, B. et al. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision. 127, 302–321 (2019).
- 35. Paszkowski, W. & Sobiech, M. The modeling of the acoustic condition of urban environment using noise annoyance assessment. Environmental Modeling & Assessment. 24, 319–330 (2019).
- 36. Chen, L., Zhao, L., Xiao, Y. & Lu, Y. Investigating the spatiotemporal pattern between the built environment and urban vibrancy using big data in Shenzhen, China. Computers, Environment and Urban Systems. 95, 101827 (2022).
- 37. Lan, F., Gong, X., Da, H. & Wen, H. How do population inflow and social infrastructure affect urban vitality? Evidence from 35 large- and medium-sized cities in China. Cities. 100, 102454 (2020).
- 38. Xia, C., Yeh, A. G. & Zhang, A. Analyzing spatial relationships between urban land use intensity and urban vitality at street block level: A case study of five Chinese megacities. Landscape and Urban Planning. 193, 103669 (2020).
- 39. Lebakula, V. et al. LandScan global 30 arcsecond annual global gridded population datasets from 2000 to 2022. Scientific Data. 12, 495 (2025).
- 40. Wilson, J. Q. & Kelling, G. L. Broken windows: The police and neighborhood safety. Atlantic Monthly. 249, 29–38 (1982).
- 41. Sampson, R. J. & Raudenbush, S. W. Systematic social observation of public spaces: A new look at disorder in urban neighborhoods. American Journal of Sociology. 105, 603–651 (1999).
- 42. Bader, M. D., Mooney, S. J., Bennett, B. & Rundle, G. A. The promise, practicalities, and perils of virtually auditing neighborhoods using Google Street view. The ANNALS of the American Academy of Political and Social Science. 669, 18–40 (2017).
- 43. Hwang, J. & Naik, N. Systematic social observation at scale: Using crowdsourcing and computer vision to measure visible neighborhood conditions. Sociological Methodology. 53, 183–216 (2023).
- 44. Hoeben, E. M., Steenbeek, W. & Pauwels, L. J. R. Measuring disorder: Observer bias in systematic social observations at streets and neighborhoods. Journal of Quantitative Criminology. 34, 221–249 (2018).
- 45. Wang, C. Y., Bochkovskiy, A. & Liao, H. Y. M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. (IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023).
- 46. Yao, Y. et al. Discovering the homogeneous geographic domain of human perceptions from street view images. Landscape and Urban Planning. 212, 104125 (2021).
- 47. Salesses, P., Schechtner, K. & Hidalgo, C. A. The collaborative image of the city: Mapping the inequality of urban perception. PLOS ONE. 8, e68400 (2013).
- 48. Zhang, F. et al. Measuring human perceptions of a large-scale urban region using machine learning. Landscape and Urban Planning. 180, 148–160 (2018).
- 49. Tan, M. & Le, Q. EfficientNetV2: Smaller models and faster training. International Conference on Machine Learning. 139, 7102–7110 (2021).
- 50. Rubin, D. B. Inference and missing data. Biometrika. 63, 581–590 (1976).
- 51. van Buuren, S. & Groothuis-Oudshoorn, K. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software. 45, 1–67 (2011).
- 52. Rácz, A. & Gere, A. Comparison of missing value imputation tools for machine learning models based on product development cases studies. LWT-Food Science and Technology. 221, 117585 (2025).
- 53. Bergstra, J., Bardenet, R., Bengio, Y. & Kegl, B. Algorithms for hyper-parameter optimization. Advances in Neural Information Processing Systems. 24, 2546–2554 (2011).
- 54. Nishio, M. et al. Computer-aided diagnosis of lung nodule using gradient tree boosting and Bayesian optimization. PLOS ONE. 13, e0195875 (2018).
- 55. Echabarri, S., Do, P., Vu, H. C. & Bornand, B. Machine learning and Bayesian optimization for performance prediction of proton-exchange membrane fuel cells. Energy and AI. 17, 100380 (2024).
- 56. Zhang, Y., Pan, Z., You, Y., Cai, L. & Qin, B. Datasets of community-level education percentile rank estimation in China. figshare. Dataset. 10.6084/m9.figshare.29654591 (2025).
- 57. Zhang, Y., Pan, Z., You, Y., Cai, L. & Qin, B. XGBoost regressor for estimating community-level education percentile rank in China. figshare. Dataset. 10.6084/m9.figshare.29648798 (2025).