PLOS One. 2022 Dec 7;17(12):e0278120. doi: 10.1371/journal.pone.0278120

Occupations on the map: Using a super learner algorithm to downscale labor statistics

Michiel van Dijk 1,2,*, Thijs de Lange 1, Paul van Leeuwen 1, Philippe Debie 3
Editor: Sotirios Koukoulas 4
PMCID: PMC9728836  PMID: 36476753

Abstract

Detailed and accurate labor statistics are fundamental to support social policies that aim to improve the match between labor supply and demand, and support the creation of jobs. Despite overwhelming evidence that labor activities are distributed unevenly across space, detailed statistics on the geographical distribution of labor and work are not readily available. To fill this gap, we demonstrated an approach to create fine-scale gridded occupation maps by means of downscaling district-level labor statistics, informed by remote sensing and other spatial information. We applied a super-learner algorithm that combined the results of different machine learning models to predict the shares of six major occupation categories and the labor force participation rate at a resolution of 30 arc seconds (~1x1 km) in Vietnam. The results were subsequently combined with gridded information on the working-age population to produce maps of the number of workers per occupation. The super learners outperformed (n = 6) or had similar (n = 1) accuracy in comparison to the best-performing single machine learning algorithms. A comparison with an independent high-resolution wealth index showed that the shares of the four low-skilled occupation categories (91% of the labor force) were able to explain between 28% and 43% of the spatial variation in wealth in Vietnam, pointing to a strong spatial relationship between work, income and wealth. The proposed approach can also be applied to produce maps of other (labor) statistics, which are only available at aggregated levels.

Introduction

Labor is recognized as one of the three primary factors of production in economics and accounts for around 50% of total global income [1]. Work forms a central part of most people’s lives. Globally more than 3.1 billion people are working actively or are looking for work [2] and those that are employed spend around 40 hours per week on the job [3]. There is overwhelming evidence that labor activities are distributed unevenly across space because of a combination of differences in costs, economies of scale and spillover effects [4–8]. However, detailed statistics that illustrate the geographical distribution of labor and work are not readily available. ILOSTAT (ilostat.ilo.org), the most comprehensive global labor statistics database maintained by the International Labor Office (ILO), only provides country-level data. Summary reports of population censuses and national labor-force surveys sometimes present subnational labor information, but mostly at the level of coarse first-level administrative units. The availability of spatially explicit labor statistics will support the formulation of targeted social policies that aim to improve the match between local labor supply and demand, and support local employment, contributing to economic growth and welfare.

The contribution of this paper is to demonstrate an approach to downscale subnational labor statistics to a fine-scale spatial grid using machine learning approaches. As such, this paper contributes to a rapidly expanding literature, which has used machine learning and advanced statistical models to create fine-scale gridded maps of socio-economic indicators. Key examples include the mapping of population [9], educational attainment [10], child growth [11], poverty [12] and wealth [13]. To the best of our knowledge, this is the first study to apply these techniques to downscale labor information.

Our approach resembles that of [14] and [15], who created gridded population and livestock maps, respectively. In contrast to these papers, which applied a single machine learning model (i.e. random forest), we used an ensemble approach, a so-called super learner, in combination with high-resolution remote sensing data and other spatial predictors to predict the shares of six major occupation categories and the labor force participation rate at a resolution of 30 arc seconds (~1x1 km) in Vietnam. A super learner combines the results of different machine learning models to generate predictions with the same or higher accuracy than those of single models [16]. Apart from better performance, combining the outcomes of multiple machine learning models also results in more robust outcomes [17], which is particularly important in case machine learning techniques are used to extrapolate results [18], such as in our application.

The results of our analysis provide insights into the geographical distribution of workers within a country. The occupation maps make it possible to identify regions that are characterized by low-skilled employment, which might be vulnerable to rising imports from low-cost countries [19]. Similarly, they can be combined with climate change projections to show the locations where workers will be most exposed to extreme temperatures, which is particularly relevant for active workers in the agriculture, construction, and manufacturing sectors [20]. This type of information can be used to better assess the impact of climate change on labor productivity and associated losses in national income [21, 22]. The occupation maps can also be regarded as proxies for the geographical distribution of income in a country as there is a strong correlation between occupational attainment and wages [23]. Finally, the maps also provide broad information on the spatial distribution of industrial activity as workers tend to live close to the workplace [24, 25], and the occupation categories are closely related to the main economic sectors: agriculture, manufacturing and services [26].

We selected Vietnam as a case study to demonstrate our approach because it is a lower middle-income country that has experienced a phase of rapid economic growth and structural transformation [27]. As a consequence, the occupational distribution is rather diverse, including roughly equal shares of agricultural and non-agricultural workers as well as a fair share of high-skilled jobs (S1 Fig). Another advantage was the availability of relatively detailed subnational occupation statistics, which were an essential input for the analysis.

Materials and methods

Fig 1 summarizes our analytical approach. The next sections provide additional information on the target variables and their data sources, the super learner algorithm that was used to downscale the subnational labor statistics and the geospatial predictors that informed the model.

Fig 1. Overview of the analytical approach.

(a) Subnational statistics on the labor force participation rate and occupation shares for (b) six occupation classes and (c) a large number of geospatial predictors were combined to train (d) super learner models. (e) Spatial information on the working-age population was combined with (f) resulting gridded predictions to produce (g) maps of the number of workers per occupation category. This figure has been designed using resources from Flaticon.com. Maps produced using data from [30] and calculations by authors, see text.

Target variables

Our primary target variable was the total number of workers within an occupation category. We distinguished between six broad occupation categories based on the most recent version of the International Standard Classification of Occupations (ISCO) (S1 Table). The ISCO occupation classes are designed to capture two interrelated concepts: (a) the type of job since occupation is defined as a “set of jobs whose main task and duties are characterized by a high degree of similarity” and (b) skill level, which is defined as “function of complexity and range of tasks and duties to be performed in an occupation” [28]. To estimate the number of workers Oi per occupation category i = 1,…, 6, we broke it down into three components, in line with standard ILO definitions (ilostat.ilo.org):

$$O_i = WP \times LFPR \times OS_i$$

where WP is the working-age population, which is defined as all persons aged 15 and older. LFPR is the labor force participation rate, defined as the number of persons in the labor force as a percentage of the working-age population. The labor force is the sum of the number of persons employed and unemployed, aged 15 and older. OSi is the share of the labor force with occupation i.
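As a worked illustration of this decomposition, the following Python sketch computes the number of workers per occupation for a single grid cell. The function name and all input values are hypothetical and chosen only for illustration; LFPR is expressed here as a fraction rather than a percentage. The analysis in the paper itself was implemented in R, so this is not the authors' code.

```python
def workers_per_occupation(wp, lfpr, shares):
    """Return O_i = WP * LFPR * OS_i for every occupation share OS_i.

    wp     : working-age population in the grid cell (persons aged 15+)
    lfpr   : labor force participation rate as a fraction (0..1)
    shares : occupation shares of the labor force, expected to sum to one
    """
    return [wp * lfpr * share for share in shares]


# Hypothetical grid cell: 1,000 working-age persons, 80% participation,
# and six made-up occupation shares.
example_shares = [0.05, 0.05, 0.15, 0.50, 0.15, 0.10]
example_workers = workers_per_occupation(1000, 0.8, example_shares)
```

Because the shares sum to one, the six worker counts add up to the size of the labor force in the cell (WP × LFPR).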

The three occupation components have a distinct spatial distribution (S2 Fig). Labor force participation rates are higher in remote areas and lower in the major cities, such as Hanoi and Ho Chi Minh City, and the more densely populated coastal areas. This is consistent with the patterns in other countries, which show that labor force participation, in particular for women, is higher in rural areas because of limited possibilities to participate in educational activities and more possibilities to combine child-rearing and farm work in the absence of off-farm employment [29]. Similarly, the share of non-agricultural occupations and the size of the working-age population tend to be higher in densely populated urban areas.

Our main source of information for the labor statistics was the 2009 Population and Housing Census organized by the Vietnam General Statistics Office that is available from IPUMS [30]. Representative labor statistics were available for 674 districts, which are the second-level administrative units in Vietnam. The census included questions for each person in the labor force on his/her type of occupation using the detailed ISCO 08 classification. We used this information to calculate the share of each major occupation group at the district level. Similarly, we also used the census to estimate the district labor force participation rate by dividing all persons that are categorized as being in the labor force by the total population aged 15 and older.

To calculate the working age population at each grid cell, we used population maps with 5-year age group compositions from [31]. The main data source for these products was the 2009 Vietnam population census, which we also used as the main input.

As a final step, we used a logit transformation to change the scale of our dependent variables, which are both proportions that are bounded between zero and one. As the model predictions are not guaranteed to be within this range, it is recommended to change the scale to between negative and positive infinity [32].
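The transformation and its inverse can be sketched in a few lines of Python. This is a minimal illustration, not the authors' R code; in particular, the small clipping constant `eps` used to keep proportions of exactly zero or one finite is an assumption, as the text does not state how such boundary values were handled.

```python
import math


def logit(p, eps=1e-6):
    """Map a proportion in [0, 1] to the real line.

    eps is a hypothetical clipping constant that avoids log(0) for
    proportions of exactly 0 or 1.
    """
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Map a real number back to (0, 1), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)
```

Any prediction made on the logit scale, however extreme, is mapped back into the unit interval by `inv_logit`, which is why models are trained on the transformed values.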

Geospatial predictors

Initially, we selected 32 predictors, which we expected to explain the observed spatial distribution of the labor force participation rate and occupation categories (S2 Table). The majority of predictors were taken from the WorldPop (worldpop.org) and WorldClim (worldclim.org) open access archives, which contain a diverse range of remote sensing and other geospatial data at a resolution of 3–30 arc seconds that cover multiple periods, including land cover, night lights, transport networks, topography and climate indicators. In addition, we added geospatial layers with information on distance to large industrial facilities (energy, iron, steel and cement plants, and mines) and major transport hubs (airports and ports), which are often part of industrial zones, where a large number of people are employed. As workers tend to live in the vicinity of their work, we assumed that these layers have predictive power and support the training of the machine learning models. Information on the location of power, iron and steel, and cement plants was taken from the Global Infrastructure Emissions (GID) database (www.gidmodel.org.cn). The GID contains information on emissions from energy-intensive industries, which can be regarded as a proxy for their location. We also collected information on the location of mining areas, airports and ports. These datasets were further processed to create geospatial layers with distance information.

Where possible, we selected layers with information for the year 2009, consistent with the population census data. However, in a few cases, in particular for the predictors related to the location of industrial facilities, only recent data was available. To harmonize the data, all raster layers were resampled to a resolution of 30 arc seconds in WGS 84 projection. As the occupation data was available at the second administrative unit level, all predictors were aggregated to the same level using the median values for each administrative area.
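The aggregation of grid-cell predictor values to administrative units by their median can be illustrated with the short Python sketch below. It assumes that each grid cell has already been assigned a district identifier; the function name and the toy inputs are hypothetical.

```python
from collections import defaultdict
from statistics import median


def district_medians(cell_values, cell_districts):
    """Aggregate grid-cell values to one median value per district.

    cell_values    : predictor value of each grid cell
    cell_districts : district identifier of each grid cell (same order)
    """
    groups = defaultdict(list)
    for value, district in zip(cell_values, cell_districts):
        groups[district].append(value)
    return {district: median(values) for district, values in groups.items()}


# Toy example: five grid cells in two hypothetical districts "A" and "B".
example = district_medians([1, 2, 9, 4, 6], ["A", "A", "A", "B", "B"])
```

The median is less sensitive than the mean to a few extreme cells, such as the value 9 in district "A" above.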

We applied three feature engineering steps that result in better performance of machine learning models [32]: (a) we applied a Yeo-Johnson power transformation [33] to make the variable distributions more symmetric, (b) we normalized all data to have a standard deviation of one and a mean of zero, where mean values and standard deviations were estimated from the training dataset and (c) we removed all variables that had a correlation of 0.7 or larger with other variables. As a consequence of the last step 14 predictors were removed from the analysis (S3 Fig).
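Steps (b) and (c) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the Yeo-Johnson step is omitted, and the correlation filter is a simple greedy rule (keep a predictor only if its absolute correlation with every predictor kept so far is below the threshold), which is not necessarily the selection rule of the R function used in the paper. All function names are hypothetical.

```python
import math


def mean_sd(values):
    """Sample mean and standard deviation (n - 1 denominator)."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / (len(values) - 1)
    return m, math.sqrt(var)


def standardize(train_values, new_values):
    """Scale new_values with the mean and sd estimated on the training data."""
    m, s = mean_sd(train_values)
    return [(v - m) / s for v in new_values]


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(columns, threshold=0.7):
    """Greedy filter (an assumption): returns the names of kept predictors."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy predictors: "b" is almost a multiple of "a" and is therefore dropped.
toy = {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10.5], "c": [5, 1, 4, 2, 3]}
kept_names = drop_correlated(toy)
```

Estimating the mean and standard deviation on the training data only, and reusing them for new data, avoids leaking information from the hold-out set into the model.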

Super learner

We used an ensemble approach, referred to as a super learner, to predict labor statistics at the grid level. A super learner is an algorithm that combines the results of multiple machine learning models. Predictions are then generated by weighting the outcomes of the individual member models. It has been demonstrated that predictions of a super learner have the same or higher accuracy in comparison to those generated by means of single machine learning models [16].
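The weighting idea can be illustrated with a minimal Python sketch that estimates non-negative weights for a set of member predictions by least squares, using cyclic coordinate descent. This is a deliberate simplification: it has no intercept and no regularization penalty, it assumes that no member predicts all zeros, and in practice the member predictions would be out-of-fold predictions from cross-validation. The function names and toy numbers are hypothetical and do not reproduce the authors' R implementation.

```python
def ensemble(weights, member_preds):
    """Weighted sum of member predictions, one value per observation."""
    n = len(member_preds[0])
    return [sum(w * m[i] for w, m in zip(weights, member_preds)) for i in range(n)]


def mse(pred, obs):
    """Mean squared error."""
    return sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)


def fit_nonneg_weights(member_preds, y, sweeps=200):
    """Non-negative least-squares weights via cyclic coordinate descent."""
    k, n = len(member_preds), len(y)
    w = [1.0 / k] * k  # start from the simple average
    for _ in range(sweeps):
        for j in range(k):
            # Part of y not explained by the other members.
            partial = [
                y[i] - sum(w[l] * member_preds[l][i] for l in range(k) if l != j)
                for i in range(n)
            ]
            num = sum(member_preds[j][i] * partial[i] for i in range(n))
            den = sum(p * p for p in member_preds[j])
            # Exact one-dimensional minimizer, clipped at zero.
            w[j] = max(0.0, num / den)
    return w


# Toy data: three hypothetical member models and five observations.
toy_y = [0.2, -1.0, 0.5, 1.3, -0.4]
toy_members = [
    [0.1, -0.8, 0.7, 1.0, -0.5],
    [0.4, -1.3, 0.2, 1.6, 0.0],
    [1.0, 1.0, 1.0, 1.0, 1.0],
]
toy_weights = fit_nonneg_weights(toy_members, toy_y)
```

Each coordinate update can only lower or preserve the squared error, so the fitted ensemble is never worse on the fitting data than the simple average it starts from. Members that add nothing receive a weight of zero and drop out of the ensemble.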

The process to train the super learner followed the steps as described in [16]. In total, we trained seven separate super learners: one model to spatially predict the labor force participation rate (LFPR) and six models to predict the shares for each occupation class (OSi). Each super learner combined the results of six different machine learning algorithms that have been used extensively in (spatial) prediction exercises: random forest (random_forest), extreme gradient boosting (xgboost), neural network (neural_network), polynomial (svm_poly) and radial (svm_radial) basis support vector machines, and generalized linear model via penalized maximum likelihood (glmnet). All six machine learning algorithms and the super learner models used data at the subnational level for training and testing (i.e. 674 data points in total).

All models (i.e. super learner and member models) were implemented using the tidymodels framework [34] in the R software environment [35]. We started by applying the tidymodels tune and dials packages to tune the hyperparameters of all super learner member models. For each model, we conducted a grid search with 30 model-specific parameter combinations that were selected using a Latin hypercube design. Each parameter combination was evaluated and fitted by means of 5 repeats of 10-fold cross-validation, using 80% of the dataset (n = 538) for training. Inspection revealed that a few models had very poor performance, predicting (near) constant values. Therefore, we removed all models for which the predictions had a standard deviation lower than 0.001. We preferred to exclude these poor models as this results in more parsimonious super learner models, without loss in performance. In the next step, we tuned a regularized generalized linear model to optimally combine the predictions of the member models and determine the relative weights for the super learner. We used the default settings that constrain the coefficients of the super learner to be non-negative. We analyzed the accuracy of our modeling framework by fitting all selected member models (i.e. those with a weight larger than zero) and the super learner on hold-out data that contains 20% of the main dataset (n = 136). For each model, we derived the RMSE and the R2. The super learners were combined with the 30 arc seconds predictor maps to predict the labor force participation rates and occupation shares for all grid cells in Vietnam. We applied an inverse logit transformation to obtain values between zero and one. For the predictions of the six occupation shares we used a standardization function to ensure that the total sum of shares equals one in each grid cell:

$$\mathrm{occupation}_{ij}^{c} = \frac{\mathrm{occupation}_{ij}}{\sum_{i=1}^{6} \mathrm{occupation}_{ij}}$$

where occupation_ij^c is the corrected share for occupation i in grid cell j. Finally, the occupation maps were produced by multiplying grid-level information on labor force participation rate, occupation share and working-age population. We followed the approach used by [18] to calculate 67% (1 standard deviation) grid-cell prediction errors for the super learner models. This involved two steps: (a) fitting a quantile regression random forest model using the training dataset predictions of the super learner member models as input and (b) estimating error statistics per grid cell using the forestError package that implements the method proposed by [36].
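The share correction defined above amounts to dividing each predicted share by the sum of the six predicted shares in the same grid cell. A minimal Python sketch, with a hypothetical function name and made-up input values, is:

```python
def normalize_shares(shares):
    """Rescale predicted occupation shares so that they sum to one."""
    total = sum(shares)
    return [share / total for share in shares]


# Hypothetical back-transformed predictions for one grid cell; because the
# six shares come from six separate models, they need not sum to one.
raw_shares = [0.04, 0.06, 0.20, 0.50, 0.20, 0.10]
corrected_shares = normalize_shares(raw_shares)
```

The correction preserves the ratios between the predicted shares and only changes their overall level.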

For each member model we created variable importance measures that demonstrated by how much a model’s performance (measured by the RMSE) changes if an explanatory variable is removed, as well as accumulated local effects plots, which showed how features relate to the machine learning predictions on average. The methods to create both types of plots, implemented with the R DALEX package [37], are model-agnostic and did not assume anything about the structure of the super learner member models [38]. All maps were produced using the R software environment [35]. Scripts to reproduce the analysis are available at: https://github.com/michielvandijk/occupations_on_the_map.
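A common model-agnostic way to approximate the effect of removing a variable is to permute its values and measure the resulting increase in RMSE. The Python sketch below illustrates this permutation-based variant on a toy model. Whether it matches the exact settings used in the paper is an assumption, and the function names, the toy model and the data are hypothetical.

```python
import math
import random


def rmse(pred, obs):
    """Root mean squared error."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def permutation_importance(predict, rows, y, col, repeats=20, seed=1):
    """Mean increase in RMSE after shuffling the values of column `col`.

    predict : any function mapping one row (a list of values) to a prediction
    rows    : list of rows, each a list of predictor values
    """
    rng = random.Random(seed)
    baseline = rmse([predict(r) for r in rows], y)
    increases = []
    for _ in range(repeats):
        values = [r[col] for r in rows]
        rng.shuffle(values)
        permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, values)]
        increases.append(rmse([predict(r) for r in permuted], y) - baseline)
    return sum(increases) / repeats


# Toy model that uses only the first predictor and ignores the second.
toy_rows = [[float(i), float((7 * i) % 5)] for i in range(10)]
toy_y = [2.0 * r[0] for r in toy_rows]


def toy_model(row):
    return 2.0 * row[0]


imp_used = permutation_importance(toy_model, toy_rows, toy_y, col=0)
imp_unused = permutation_importance(toy_model, toy_rows, toy_y, col=1)
```

The function only needs a prediction function, not the model internals, which is what makes the measure applicable to every member model.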

Results

Number of workers per occupation

Fig 2 presents maps of the number of workers for each of the six occupation categories for the entire country, and around Hanoi and Ho Chi Minh City, the two regions with the highest (working age) population density (S2h Fig). The maps clearly show different spatial patterns between the occupations. A relatively large number of agricultural workers are located in the Red River (South-East of Hanoi) and Mekong (South-West of Ho Chi Minh City) River Deltas as well as in the rural areas throughout the country. The high absolute number of agricultural workers observed very close to Hanoi can be explained by the combination of a large working-age population, widespread rice cultivation and well-developed urban agriculture [39] in this area. As expected, the share of agricultural workers is very small (0–2%) in urban areas and large in rural areas (50–100%) (S4 Fig). The spatial distribution of the other five occupation groups is mainly concentrated in urban and semi-urban areas. Large numbers of clerks and service workers, and craft workers and operators are depicted in the centers of the large cities but also in the industrial and populated areas that are spread around Hanoi and Ho Chi Minh City, as well as the urban coastal regions. This finding is to be expected as these two occupation categories cover the majority of jobs in the manufacturing and service sectors that employ most workers in industrial and populated areas in Vietnam. In contrast, the distribution of managers and professionals, and technicians and associate professionals is much more concentrated, with a very large presence in the centers of Hanoi and Ho Chi Minh City. These findings can be explained by the fact that high-skilled jobs are mainly concentrated in the head offices of large companies and ministries that are located in the center of major cities. 
In relative terms, the two high-skilled occupation categories are observed in all (semi)urban areas but only make up very small shares of the total labor force (S1 and S4 Figs). Elementary occupations contain a diverse group of jobs, including both unskilled agricultural and non-agricultural workers and therefore might be allocated in rural and urban areas. The maps show that, at least in Vietnam, elementary occupations mostly seem to be of the urban type (e.g. cleaners, hand packers and street vendors).

Fig 2.

Spatial predictions for the number of (a) Managers and professionals, (b) Technicians and associate professionals, (c) Clerks and service workers, (d) Agricultural workers, (e) Craft workers and operators and (f) Elementary occupations per grid cell. Bottom panels depict the spatial distribution of the number of workers around Hanoi and Ho Chi Minh City, respectively. An interactive version can be accessed at https://shiny.wur.nl/occupation-map-vnm.

Model evaluation

Fig 3 summarizes the performance of the seven super learner models that were run for each variable. The amount of variation (R2) explained by the super learner model ranged from 0.63 for elementary occupations to 0.94 for agricultural workers. A comparison between the super learners and selected member models showed that the ensemble approach outperformed (n = 6), or was comparable (n = 1) to, the predictions of the best-performing individual models, that is, those with the lowest RMSE. In all seven super learner models that were fitted, xgboost was the best-performing machine learning algorithm, followed by random forest (n = 4), neural network (n = 2) and polynomial support vector machines (n = 1) (Table S3-S9 in S1 File). Predictions were located around the 1:1 line and the mean error was close to zero, indicating that the models are not biased.
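The evaluation statistics mentioned above can be computed as in the following Python sketch. It assumes that R2 is the squared Pearson correlation between predictions and observations, since the text does not state which definition was used; the function names and toy values are hypothetical.

```python
import math


def rmse(pred, obs):
    """Root mean squared error."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def mean_error(pred, obs):
    """Average signed error; values near zero suggest little bias."""
    return sum(p - o for p, o in zip(pred, obs)) / len(obs)


def r_squared(pred, obs):
    """Squared Pearson correlation (an assumed definition of R2)."""
    mp, mo = sum(pred) / len(pred), sum(obs) / len(obs)
    spo = sum((p - mp) * (o - mo) for p, o in zip(pred, obs))
    spp = sum((p - mp) ** 2 for p in pred)
    soo = sum((o - mo) ** 2 for o in obs)
    return spo ** 2 / (spp * soo)


# Toy hold-out data: predictions are perfectly correlated with the
# observations but systematically 0.5 too low.
toy_pred = [1.0, 2.0, 3.0, 4.0]
toy_obs = [1.5, 2.5, 3.5, 4.5]
```

The toy data show why the mean error is reported alongside R2: a model can be perfectly correlated with the observations and still be biased.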

Fig 3.

Performance of the super learners for the seven target variables: (a-c) Managers and professionals, (d-f) Technicians and associate professionals, (g-i) Clerks and service workers, (j-l) Agricultural workers, (m-o) Craft workers and operators, (p-r) Elementary occupations and (s-u) Labor force participation rate. The large panels compare the district-level observations with the super learner predictions for the hold-out dataset (n = 136). The dashed blue line represents the 1:1 line. Small panels show the RMSE and the R2 for the super learners (red lines) and member models (gray lines) for the hold-out dataset. Predictions and observations are logit transformed values.

We also investigated if the models were able to adequately reproduce the number of workers at higher subnational aggregations. To do this, we aggregated the predictions for the number of workers at grid cell level and compared them with the district-level number of workers that can be derived from the model input data for each occupation category (S2 Fig). A strong relationship between the two values indicated that the models are producing realistic values on average. Grid cell predictions were based on all predictor values, not only the district-level median values that were used to train the model, and therefore are likely to include observations at the tails of the distribution (e.g. remote areas). If the models were to structurally under- or overpredict for such observations, this would result in poor aggregated predictions and imply that the models might be biased. The high R2 of 0.85–0.95 (S6 Fig) between model outcomes and observed subnational statistics suggested that this is not the case.

To provide an indication of where the super learners were accurate and where they were not, we calculated the prediction error at grid cell level (S5 Fig). The maps showed that the errors were the most prominent in the regions where the occupation shares were very small, or even zero (i.e. agricultural workers in urban areas and non-agricultural workers in rural areas) and where the models were used to extrapolate on the edges of the training space.

To assess which predictors contributed the most to the model outcomes, we conducted a variable importance analysis (Fig S7-S13 in S2 File), and we generated accumulated local effects plots to gain insight into how predictors and predictions were related (Fig S14-S20 in S3 File). Urbanization was a key predictor of occupation share as the related predictor (urbpx_prp_5) was consistently ranked among the most important variables that explained the model results. Several predictors related to land use and economic activity frequently featured as explanatory variables in some of the occupation models. Not surprisingly, for agricultural workers, the distance to cropland (esaccilc_dst011) and climate (bio_6) were also important determinants. In the models for managers and professionals, distances to ports and airports were frequently included, which might indicate that high-skilled workers tend to be located in places that are internationally connected. Predictors related to the location of industrial activity (iron_steel, power, mining and cement) showed up as important for several member models but, perhaps in contradiction to expectations, and with the exception of mining, were not uniquely related to the location of craft workers and operators and technicians and associate professionals, of which a large share were expected to live close to these facilities.

Relationship with spatial wealth

To evaluate the accuracy of our model estimates at grid level, we also compared our results with an exogenous dataset that presents comparable or related information [13, see 40 for a comparable approach to evaluate crop distribution maps]. Various studies showed that wealth is strongly correlated with income [e.g. 41], which, in turn, is determined for a large part by labor income, and, hence, type of occupation. We therefore also expected to find a correlation between our gridded occupation maps and a global wealth index [13], which shows the spatial distribution of wealth at a resolution of 2.4 x 2.4 km. The wealth index is based on a machine learning model that used satellite imagery, mobile phone networks and connectivity data from Facebook as predictors and asset information from the Demographic and Health Surveys (DHS) as target variables, covering the period 2010–2018. This period does not overlap with our base year of 2009, but as the reallocation of labor across sectors is a long-run and gradual process [42], we expected that our occupation maps would also be representative for the period covered by the wealth map. The information in the DHS was collected by the U.S. Agency for International Development and is therefore independent of the population census that we used to train the super learners.

The relationship between wealth and occupation was confirmed by Fig 4, which shows that the proportion of low-skilled occupations, composed of agricultural workers (R2 = 0.39), craft workers and operators (R2 = 0.43), clerks and service workers (R2 = 0.29) and elementary occupations (R2 = 0.28), which make up 91% of the labor force (S1 Fig), explained a moderate part of relative wealth at grid-cell level. Wealth tends to be higher in (semi-)urban areas and therefore was negatively correlated with higher shares of agricultural workers, while a positive correlation with wealth was found for the other low-skilled occupations. There was a limited correlation between the wealth index and the share of managers and professionals, and technicians and associate professionals. A possible reason for this might be the limited coverage of (wealthy) high-skilled workers in the DHS. Another explanation could be the uneven distribution of wealth between high- and low-skilled workers within (urban) grid cells, which is not picked up by an average measure such as the wealth index.

Fig 4.

Scatterplot of the relative wealth index from [13] against the predicted occupation shares for (a) Managers and professionals, (b) Technicians and associate professionals, (c) Clerks and service workers, (d) Agricultural workers (e) Craft workers and operators and (f) Elementary occupations.

Discussion

This paper presented an approach to downscale subnational labor statistics into high-resolution gridded maps using a super learner machine learning algorithm. The results showed that the predictions of the super learners were more accurate or at least comparable with those of the best-performing member models. Potential disadvantages of using ensemble approaches are the higher computational costs and complexity vis-à-vis fitting only a single type of machine learning model. Nonetheless, with the growing availability of high computing power, we expect this approach to be increasingly adopted by researchers with interest in the spatial prediction of social and biophysical indicators [e.g. 43].

This research can be extended, refined and improved in several directions. First, the predictive power of the super learners might be increased by collecting and incorporating additional spatial predictors. For example, the model for managers and professionals, which had relatively poor validation metrics, might benefit from spatial layers that show the distance or travel time to schools, hospitals, city halls and other government buildings, which are typical work locations for this type of occupation [see 14 for an example involving the distance to health facilities]. Similarly, the accuracy of the craft workers and operators model might be improved by adding data on the location of industrial zones and factories. Unfortunately, digital maps with this type of information are not available for Vietnam. Overall, the predictive power of spatial indicators that measure the distance to work will probably be lower for high-skilled workers as they tend to commute over longer distances in comparison to low-skilled workers [24, 25].

Second, in the future we aim to apply our approach to larger regions, such as South-East Asia and the world, resulting in a product that complements existing global maps with socio-economic indicators, such as population (www.worldpop.org), education [10] and wealth [13]. For a substantial number of countries, the required subnational labor statistics can be extracted from the population census, which is available by means of the IPUMS database [30]. Another source of information is summary reports of national labor force surveys that are regularly published by national statistical agencies. The level of detail, however, can differ considerably between data sources and countries. In the case of Vietnam, we were able to obtain information for second-level administrative units. For many other countries, only a few data points, representing first-level administrative units or broad regions, might be available. Combining the data of a large number of countries will make it possible to train machine learning models that achieve higher accuracy and can be used to generate plausible out-of-sample predictions for countries for which no data is available [13, 15].

Finally, an interesting avenue of future research would be to investigate the possibility of implementing our approach to derive high-resolution maps for other (labor) indicators, such as gender-specific occupation maps. As mentioned above, differences in labor force participation between men and women have a strong spatial dimension, which can be disentangled by our analytical approach. Such an indicator would also inform spatial predictions of the proportion of women in managerial positions, which is one of the SDG5: Gender equality indicators. Another relevant indicator is the unemployment rate, listed under SDG8: Decent work and economic growth. However, mapping unemployment rates is complex because it is the sum of frictional, cyclical, and structural components [44]. In advance, it is not clear which of these components will be captured by the various spatial predictors. For example, night lights are a proxy for economic activity [45] and therefore might pick up the sum of all three components, while the distance to cropland, airports and industrial facilities might be correlated with structural unemployment rates. Although our analysis focused on labor data, it can equally be applied to other statistics that are only available at the subnational level, including sectoral economic output, health and education information [46, 47].

Supporting information

S1 Fig. Total labor force and occupation shares in Vietnam for the year 2009.

Source: Minnesota Population Center. Integrated Public Use Microdata Series, International: Version 7.2 [dataset]. Minneapolis, MN: IPUMS; 2019. https://doi.org/10.18128/D020.V7.2 for labor force participation rate and occupation shares, and Pezzulo et al. (2017) for working age population.

(PNG)

S2 Fig

District-level information: occupation shares for (a) Managers and professionals, (b) Technicians and associate professionals, (c) Clerks and service workers, (d) Agricultural workers, (e) Craft workers and operators, (f) Elementary occupations, (g) Labor force participation rate and (h) Working age population in Vietnam for the year 2009. Source: Minnesota Population Center. Integrated Public Use Microdata Series, International: Version 7.2 [dataset]. Minneapolis, MN: IPUMS; 2019. https://doi.org/10.18128/D020.V7.2 for labor force participation rate and occupation shares, and Pezzulo et al. (2017) for working age population.

(PNG)

S3 Fig. Correlation matrix for predictors before normalization and Yeo-Johnson power transformation.

After normalization and Yeo-Johnson power transformation, we used step_corr(., threshold = .7) from the R recipes package to remove all predictors with an absolute correlation equal to or larger than 0.7. Consequently, 14 predictors were removed from the analysis (bio_1, bio_5, bio_6, dmsp, dst_bsgmi, dst_ghslesaccilcguf, esaccilc_dst040, int_airports, osm_dst_road, osm_dst_roadintersec, srtm_slope, srtm_topo, travel_time, viirs), leaving 18 predictors that were used as final input.

(PNG)
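
The correlation filter described for S3 Fig can be illustrated with a minimal sketch. This is a simplified Python analogue written for illustration, not the authors' R code and not the exact algorithm used by step_corr: it assumes a greedy rule that repeatedly takes the most strongly correlated pair and drops the member with the higher mean absolute correlation to all other predictors. The toy values and the name toy_predictor are hypothetical; viirs and dmsp are used only as familiar labels.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)  # assumes neither column is constant


def drop_correlated(columns, threshold=0.7):
    """Greedily drop predictors until no pair has |r| >= threshold.

    `columns` maps a predictor name to its list of values. The returned
    list holds the names of the predictors that are kept, sorted by name.
    """
    kept = dict(columns)
    while len(kept) > 1:
        # Absolute correlation for every remaining pair of predictors.
        corr = {pair: abs(pearson(kept[pair[0]], kept[pair[1]]))
                for pair in combinations(kept, 2)}
        (a, b), worst = max(corr.items(), key=lambda item: item[1])
        if worst < threshold:
            break

        def mean_abs(name):
            values = [r for pair, r in corr.items() if name in pair]
            return sum(values) / len(values)

        # Drop the member of the worst pair that is, on average,
        # more strongly correlated with the other predictors.
        del kept[a if mean_abs(a) >= mean_abs(b) else b]
    return sorted(kept)


# Toy data (hypothetical values): two nearly collinear night-light
# variables and one unrelated predictor.
toy = {
    "viirs": [1.0, 2.0, 3.0, 4.0, 5.0],
    "dmsp": [1.1, 2.0, 2.9, 4.2, 5.1],
    "toy_predictor": [3.0, 1.0, 4.0, 1.0, 5.0],
}
remaining = drop_correlated(toy, threshold=0.7)
print(remaining)
```

In this toy case only one of the two night-light variables survives the filter, while the unrelated predictor is kept.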

S4 Fig

Super learner results for (a) Managers and professionals, (b) Technicians and associate professionals, (c) Clerks and service workers, (d) Agricultural workers, (e) Craft workers and operators, (f) Elementary occupations and (g) Labor force participation rate.

(PNG)

S5 Fig

Prediction errors for (a) Managers and professionals, (b) Technicians and associate professionals, (c) Clerks and service workers, (d) Agricultural workers, (e) Craft workers and operators, (f) Elementary occupations and (g) Labor force participation rate. Prediction errors are expressed on the logit scale and correspond to the upper and lower bounds of the one-standard-deviation prediction interval (a probability of 67%).

(PNG)

S6 Fig

District-level comparison between observations and super learner predictions for (a) Managers and professionals, (b) Technicians and associate professionals, (c) Clerks and service workers, (d) Agricultural workers, (e) Craft workers and operators and (f) Elementary occupations. Dashed blue line represents the 1:1 line. Solid blue line indicates the regression line, with 95% confidence intervals in grey. District-level observations on the number of workers are calculated by multiplying district-level data on occupation share, labor force participation and working age population (aggregated from grid-level values), depicted in S2 Fig.

(PNG)
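
The district-level calculation described for S6 Fig (occupation share multiplied by labor force participation and working age population) can be written out as a short sketch. The function name and all numbers below are hypothetical and chosen for illustration only; the sketch also assumes that the occupation shares are fractions of the labor force and sum to one.

```python
from math import isclose


def workers_by_occupation(shares, participation_rate, working_age_population):
    """Number of workers per occupation for one district.

    workers = occupation share x participation rate x working-age population
    """
    if not isclose(sum(shares.values()), 1.0, abs_tol=1e-9):
        raise ValueError("occupation shares should sum to 1")
    labor_force = participation_rate * working_age_population
    return {occupation: share * labor_force
            for occupation, share in shares.items()}


# Illustrative values only, not taken from the study.
district_shares = {
    "Managers and professionals": 0.05,
    "Technicians and associate professionals": 0.04,
    "Clerks and service workers": 0.16,
    "Agricultural workers": 0.40,
    "Craft workers and operators": 0.20,
    "Elementary occupations": 0.15,
}
workers = workers_by_occupation(
    district_shares, participation_rate=0.8, working_age_population=50_000
)
for occupation, count in workers.items():
    print(f"{occupation}: {count:.0f}")
```

Because the shares sum to one, the occupation totals add up to the district labor force, which gives a simple consistency check on the inputs.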

S1 Table. Main occupation categories based on the International Standard Classification of Occupations 08 (ILO 2012).

(PDF)

S2 Table. List of selected predictors for the machine learning models.

(PDF)

S1 File. Hyperparameters for super learner model members, sorted by RMSE.

(PDF)

S2 File. Variable importance plots.

Results are presented for the five super learner model members with the highest weight and for the top 10 predictors. Starting values for the horizontal bars indicate the RMSE for the full model. Predictors with the largest bars are the most important because permuting them results in higher RMSE. Error bars indicate results for 10 different permutations.

(PDF)
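
The permutation-based importance measure described for S2 File can be sketched as follows. This is a generic Python illustration and not the code used to produce the plots: the function names, the toy model and the toy data are all hypothetical. It follows the logic of the caption: compute the RMSE of the full model, permute one predictor at a time, and record how much the RMSE increases over 10 permutations.

```python
import random
from math import sqrt


def rmse(observed, predicted):
    """Root mean squared error between two equally long sequences."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def permutation_importance(predict, rows, y, feature, n_permutations=10, seed=0):
    """Return the baseline RMSE and the RMSE increase for each permutation.

    `predict` maps one row (a dict of predictor values) to a prediction.
    """
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in rows])
    increases = []
    for _ in range(n_permutations):
        # Shuffle one predictor while leaving all others untouched.
        shuffled = [row[feature] for row in rows]
        rng.shuffle(shuffled)
        permuted = [{**row, feature: value} for row, value in zip(rows, shuffled)]
        increases.append(rmse(y, [predict(row) for row in permuted]) - baseline)
    return baseline, increases


# Toy setup (hypothetical): the model uses x1 only, so permuting x2
# should not change the error at all.
rows = [{"x1": float(i), "x2": float(v)}
        for i, v in zip(range(1, 9), [5, 3, 8, 1, 9, 2, 7, 4])]
y = [2.0 * row["x1"] for row in rows]


def toy_model(row):
    return 2.0 * row["x1"]


baseline_x1, increase_x1 = permutation_importance(toy_model, rows, y, "x1")
baseline_x2, increase_x2 = permutation_importance(toy_model, rows, y, "x2")
print(baseline_x1, max(increase_x1), max(increase_x2))
```

The spread of the recorded increases across permutations plays the role of the error bars mentioned in the caption.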

S3 File. Accumulated local effects.

Results are presented for the five super learner model members with the highest weight.

(PDF)

Acknowledgments

We would like to thank Tom Hengl for advice on how to calculate the prediction errors.

Data Availability

The main results and input datasets of this study are presented in the Supporting information files. All input and output maps produced for this study can also be accessed by means of an interactive web application (https://shiny.wur.nl/occupation-map-vnm). Input and output data are available on Zenodo (DOI: 10.5281/zenodo.6419272) and scripts to reproduce the analysis are available on GitHub (https://github.com/michielvandijk/occupations_on_the_map).

Funding Statement

This research was funded by a grant from the Wageningen University & Research Programme on "Food Security and Valuing Water" (project code KB-35-005-001) that is supported by the Dutch Ministry of Agriculture, Nature and Food Quality, and a contribution from the Wageningen University and Research investment fund. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

References

  • 1. Karabarbounis L, Neiman B. The Global Decline of the Labor Share. The Quarterly Journal of Economics. 2014;129: 61–103. doi: 10.1093/QJE/QJT032
  • 2. ILO. World Employment and Social Outlook–Trends 2021. Geneva, Switzerland: International Labour Office; 2021.
  • 3. Giattino C, Ortiz-Ospina E, Roser M. Working Hours. Our World in Data. 2013. https://ourworldindata.org/working-hours
  • 4. Krugman P. Geography and Trade. Cambridge, MA: MIT Press; 1991.
  • 5. Ellison G, Glaeser EL. The Geographic Concentration of Industry: Does Natural Advantage Explain Agglomeration? American Economic Review. 1999;89: 311–316. doi: 10.1257/AER.89.2.311
  • 6. Grekousis G, Gialis S. More Flexible Yet Less Developed? Spatio-Temporal Analysis of Labor Flexibilization and Gross Domestic Product in Crisis-Hit European Union Regions. Social Indicators Research. 2019;143: 505–524. doi: 10.1007/S11205-018-1994-0
  • 7. Strumsky D, Lobo J, Mellander C. As different as night and day: Scaling analysis of Swedish urban areas and regional labor markets. Environment and Planning B: Urban Analytics and City Science. 2019;48: 231–247. doi: 10.1177/2399808319861974
  • 8. Rickard SJ. Economic Geography, Politics, and Policy. Annual Review of Political Science. 2020;23: 187–202. doi: 10.1146/annurev-polisci-050718-033649
  • 9. Leyk S, Gaughan AE, Adamo SB, De Sherbinin A, Balk D, Freire S, et al. The spatial allocation of population: a review of large-scale gridded population data products and their fitness for use. Earth System Science Data. 2019;11: 1385–1409. doi: 10.5194/ESSD-11-1385-2019
  • 10. Graetz N, Woyczynski L, Wilson KF, Hall JB, Abate KH, Abd-Allah F, et al. Mapping disparities in education across low- and middle-income countries. Nature. 2019. doi: 10.1038/s41586-019-1872-1
  • 11. Osgood-Zimmerman A, Millear AI, Stubbs RW, Shields C, Pickering BV, Earl L, et al. Mapping child growth failure in Africa between 2000 and 2015. Nature. 2018;555: 41–47. doi: 10.1038/nature25760
  • 12. Pokhriyal N, Jacques DC. Combining disparate data sources for improved poverty prediction and mapping. Proceedings of the National Academy of Sciences of the United States of America. 2017;114: E9783–E9792. doi: 10.1073/pnas.1700319114
  • 13. Chi G, Fang H, Chatterjee S, Blumenstock JE. Microestimates of wealth for all low- and middle-income countries. Proceedings of the National Academy of Sciences. 2022;119: e2113658119. doi: 10.1073/pnas.2113658119
  • 14. Stevens FR, Gaughan AE, Linard C, Tatem AJ, Jarvis A, Hashimoto H. Disaggregating Census Data for Population Mapping Using Random Forests with Remotely-Sensed and Ancillary Data. Amaral LAN, editor. PLOS ONE. 2015;10: e0107042. doi: 10.1371/journal.pone.0107042
  • 15. Nicolas G, Robinson TP, Wint GRW, Conchedda G, Cinardi G, Gilbert M. Using Random Forest to Improve the Downscaling of Global Livestock Census Data. Bond-Lamberty B, editor. PLOS ONE. 2016;11: e0150424. doi: 10.1371/journal.pone.0150424
  • 16. van der Laan MJ, Polley EC, Hubbard AE. Super Learner. Statistical Applications in Genetics and Molecular Biology. 2007;6. doi: 10.2202/1544-6115.1309
  • 17. Polley EC, Laan MJ van der. Super Learner In Prediction. 2010.
  • 18. Hengl T. Extrapolation is tough for trees (tree-based learners), combining learners of different type makes it less tough. 2021. https://medium.com/nerd-for-tech/extrapolation-is-tough-for-trees-tree-based-learners-combining-learners-of-different-type-makes-659187a6f58d
  • 19. Autor DH, Dorn D, Hanson GH. The China Syndrome: Local Labor Market Effects of Import Competition in the United States. American Economic Review. 2013;103: 2121–2168. doi: 10.1257/aer.103.6.2121
  • 20. Graff Zivin J, Neidell M. Temperature and the Allocation of Time: Implications for Climate Change. Journal of Labor Economics. 2014;32: 1–26. doi: 10.1086/671766
  • 21. Orlov A, Sillmann J, Aunan K, Kjellstrom T, Aaheim A. Economic costs of heat-induced reductions in worker productivity due to global warming. Global Environmental Change. 2020;63: 102087. doi: 10.1016/j.gloenvcha.2020.102087
  • 22. de Lima CZ, Buzan JR, Moore FC, Baldos ULC, Huber M, Hertel TW. Heat stress on agricultural workers exacerbates crop impacts of climate change. Environmental Research Letters. 2021;16: 44020. doi: 10.1088/1748-9326/abeb9f
  • 23. Gibbons R, Katz LF, Lemieux T, Parent D. Comparative Advantage, Learning, and Sectoral Wage Determination. Journal of Labor Economics. 2005;23: 681–724. doi: 10.1086/491606
  • 24. O’Kelly ME, Lee W. Disaggregate Journey-to-Work Data: Implications for Excess Commuting and Jobs–Housing Balance. Environment and Planning A: Economy and Space. 2005;37: 2233–2252. doi: 10.1068/a37312
  • 25. Sang S, O’Kelly M, Kwan M-P. Examining Commuting Patterns. Urban Studies. 2011;48: 891–909. doi: 10.1177/0042098010368576
  • 26. Duernecker G, Herrendorf B. Structural Transformation of Occupation Employment. 2021. https://ssrn.com/abstract=3932029
  • 27. Tarp F, editor. Growth, Structural Transformation, and Rural Change in Viet Nam. Oxford University Press; 2017. doi: 10.1093/acprof:oso/9780198796961.001.0001
  • 28. ILO. International Standard Classification of Occupations. Volume 1: Structure, group definitions and correspondence tables. Geneva, Switzerland: International Labour Office; 2012.
  • 29. Mammen K, Paxson C. Women’s Work and Economic Development. Journal of Economic Perspectives. 2000;14: 141–164. doi: 10.1257/jep.14.4.141
  • 30. Minnesota Population Center. Integrated Public Use Microdata Series, International: Version 7.2 [dataset]. Minneapolis, MN: IPUMS; 2019. doi: 10.18128/D020.V7.2
  • 31. Pezzulo C, Hornby GM, Sorichetta A, Gaughan AE, Linard C, Bird TJ, et al. Sub-national mapping of population pyramids and dependency ratios in Africa and Asia. Scientific Data. 2017;4: 170089. doi: 10.1038/sdata.2017.89
  • 32. Kuhn M, Johnson K. Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press; 2020.
  • 33. Yeo I-K, Johnson RA. A New Family of Power Transformations to Improve Normality or Symmetry. Biometrika. 2000;87: 954–959.
  • 34. Kuhn M, Wickham H. Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. 2020. https://www.tidymodels.org
  • 35. R Core Team. R: A Language and Environment for Statistical Computing. Vienna; 2021. https://www.r-project.org/
  • 36. Lu B, Hardin J. A Unified Framework for Random Forest Prediction Error Estimation. Journal of Machine Learning Research. 2021;22: 1–41. Available: http://jmlr.org/papers/v22/18-558.html
  • 37. Biecek P. DALEX: Explainers for Complex Predictive Models in R. Journal of Machine Learning Research. 2018;19: 1–5. Available: http://jmlr.org/papers/v19/18-416.html
  • 38. Biecek P, Burzykowski T. Explanatory Model Analysis. New York: Chapman and Hall/CRC; 2021.
  • 39. Lee B, Binns T, Dixon A. The Dynamics of Urban Agriculture in Hanoi, Vietnam. Field Actions Science Reports The journal of field actions. 2010; 0–8.
  • 40. Yu Q, You L, Wood-Sichra U, Ru Y, Joglekar AKB, Fritz S, et al. A cultivated planet in 2010 – Part 2: The global gridded agricultural-production maps. Earth System Science Data. 2020;12: 3545–3572. doi: 10.5194/essd-12-3545-2020
  • 41. Sahn DE, Stifel D. Exploring Alternative Measures of Welfare in the Absence of Expenditure Data. Review of Income and Wealth. 2003;49: 463–489. doi: 10.1111/J.0034-6586.2003.00100.X
  • 42. Timmer MP, de Vries GJ, de Vries K. Patterns of Structural Change in Developing Countries. In: Weiss J, Tribe M, editors. Routledge handbook of industry and development. Routledge; 2015.
  • 43. Hengl T, Miller MAE, Križan J, Shepherd KD, Sila A, Kilibarda M, et al. African soil properties and nutrients mapped at 30 m spatial resolution using two-scale ensemble machine learning. Scientific Reports. 2021;11: 6130. doi: 10.1038/s41598-021-85639-y
  • 44. Rissman ER. What is the natural rate of unemployment? Federal Reserve Bank of Chicago Economic Perspectives. 1986;10: 3–17.
  • 45. Gibson J, Olivia S, Boe-Gibson G, Li C. Which night lights data should we use in economics, and where? Journal of Development Economics. 2021;149: 102602. doi: 10.1016/J.JDEVECO.2020.102602
  • 46. Smits J, Permanyer I. The Subnational Human Development Database. Scientific Data. 2019;6: 1–15. doi: 10.1038/sdata.2019.38
  • 47. Kalkuhl M, Kotz M, Wenz L. DOSE—The MCC-PIK Database Of Subnational Economic output. 2021. doi: 10.5281/ZENODO.4681306

Decision Letter 0

Sotirios Koukoulas

26 May 2022

PONE-D-22-10590

Occupations on the map: Using a super learner algorithm to downscale labor statistics

PLOS ONE

Dear Dr. van Dijk,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jul 10 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Sotirios Koukoulas, Ph.D

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf.

2. We note that Figures 1, 2, S2-S11 in your submission contain map/satellite images which may be copyrighted. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For these reasons, we cannot publish previously copyrighted maps or satellite images created using proprietary data, such as Google software (Google Maps, Street View, and Earth). For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

a) You may seek permission from the original copyright holder of Figures 1, 2, S2-S11 to publish the content specifically under the CC BY 4.0 license.  

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

b) If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

The following resources for replacing copyrighted map figures may be helpful:

USGS National Map Viewer (public domain): http://viewer.nationalmap.gov/viewer/

The Gateway to Astronaut Photography of Earth (public domain): http://eol.jsc.nasa.gov/sseop/clickmap/

Maps at the CIA (public domain): https://www.cia.gov/library/publications/the-world-factbook/index.html and https://www.cia.gov/library/publications/cia-maps-publications/index.html

NASA Earth Observatory (public domain): http://earthobservatory.nasa.gov/

Landsat: http://landsat.visibleearth.nasa.gov/

USGS EROS (Earth Resources Observatory and Science (EROS) Center) (public domain): http://eros.usgs.gov/#

Natural Earth (public domain): http://www.naturalearthdata.com/.

3. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information. 

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: My opinion is that the paper could be accepted with major revisions by authors considering the following comments.

Major comments. The methodology is not clearly presented and as such, is not convincing. Equally important, how the output gridded map is compared with the actual occupational status is vague and probably incorrect. For this reason, it has to be refined.

Here are my comments in detail:

1. Introduction. The paper lacks a paragraph explaining which is the originality of this paper in terms of the methodological approach. Is it just the use of a super learner, or something more, like designing the gridded map? This should be stated more clearly.

2. Line 31-33. I suggest you include more recent papers regarding the spatial uneven distribution of labor and expand a little more on the subject. For example, include the following recent works. Strumsky, et al., 2021. “As different as night and day: Scaling analysis of Swedish urban areas and regional labor markets." Environment and Planning B: Urban Analytics and City Science. Grekousis et al., 2018 “More Flexible Yet Less Developed? Spatio‑Temporal Analysis of Labor Flexibilization and Gross Domestic Product in Crisis‑Hit European Union Regions”. Social Indicators Research.

3. Page 6, Line 89. The labor force includes the unemployed. How is it equal to the sum of all workers? Rewrite and add citation for the labor force definition.

4. Line 105 … of each grid cell. Of which grid? Authors should mention clearly how they selected the grid (e.g., based on other grid data readily available, or just designed by themselves?) How many cells has this grid?

5. Line 138. Even 0.7 is highly likely to bring multicollinearity issues. I think authors should remove variables correlated at 0.7 level or higher. For example, I would expect viirs with dmsp to be highly correlated. What’s their correlation?

6. Line 138. Which variables were removed, and how many did you finally keep? Table S2 just mentions the 32 predictors. It should be clear which variables were used for training.

7. Line 162. From my understanding authors trained a set of 538 objects, tested over 136 objects on an unknown (not specified in their text) number of variables to make a prediction for some thousands (I believe) cells. Authors should clearly explain their methodology process. For example, which are the specific variables used as predictors for each different target variable? However, I think that if they used the above approach, the results may not be accurate. The reason is that occupational data were downscaled from only 674 districts to some thousands (I guess) and then used again for prediction at this lower scale. Actually, neural networks are kind of prone to errors when the dataset is not large.

8. Line 210. Tables S3-S9. Authors should use a standard way of presenting results in Tables S3-S9, thus making it easier for other researchers to delve deeper into the details of the adopted architecture and the performance of the ANNs and the other machine learning methods used. Authors are strongly recommended to use and cite a reporting scheme proposed by Grekousis 2019 “Artificial neural networks and deep learning in urban geography: A systematic review and meta-analysis.” Computers Environment and Urban Systems. Authors should use Table 4 of the above paper at least for the neural networks they already report and adapt it according to their needs (for example, as this is an ensemble approach, they don’t have to provide any graphs).

9. Line 212. This is like a loop. Authors begin from generalized data (small dataset), downscale them, and then generalize again. It is expected that prediction error is likely to decrease when you get back to the near original scale. This approach is not convincing unless they can provide more evidence. Authors should better explain or follow another path. For example, it would be more accurate if authors could find occupational data for a higher (more detailed) administrative level than the second level they used, for let’s say 100 randomly selected cells (each cell could enclose more than one of the more detailed administrative level). As the occupational data come from the national census, it’s expected that data should exist. Then, authors could just compare the predicted values at each cell, with the actual values at the overlapping administrative units.

10. Line 218. How exactly is this error calculated? What is the unit of the error mapped at fig S11 (e.g., std dev?)

11. Line 226. Esaccilc_dist190 does not exist in Table S2.

12. Section 3.3. I think this section is not well presented and not convincing. Authors try to use a different dataset to compare their output. First, the wealth index spans a long period (2010-2018) outside the reference study time (2009). Second, the R2 is low (below 0.50) in all cases, something also clearly seen in Figure 4, so a conclusion of a good fit is an overstatement. I would suggest dropping this section as it creates more confusion than convincing evidence for the model’s accuracy.

Reviewer #2: Dear Editor,

I have carefully read the manuscript with title "Occupations on the map: Using a super learner algorithm to downscale labor statistics" which concerns the implementation of a generic machine learning approach (the super learner algorithm) in constructing a map illustrating occupation in the desired (depending also on the available data) spatial detail. The machine learning tools employed for the statistical analysis are well established (relying mostly on already developed packages in R) and the results are both interesting and interpretable. The novelty of this paper is considerable from the econometrics/geographical perspective and it mainly concerns the application of a flexible ML procedure in analysing sampled data in various spatial scales in order to construct/recover the occupation map in a region. The manuscript is well written and the interested reader can easily follow it. Therefore, I recommend the current paper for publication.

Some (very) minor concerns/suggestions:

- p.4, l.85: i should be a subscript

- p.7, l.138: "we removed all variables that had a correlation of 0.9 or larger" You could try also a Ridge-type estimation scheme to keep all the available information. You may at least mention this approach as a possible alternative in order to keep the same set of predictors for all of your modelling operations to avoid inconsistencies in the interpretation.

- p.7, l.139: It would be nice to provide a small description on how the Super Learner procedure works. It would be excellent if you could provide a graphical representation of this procedure to your problem illustrating the various ML methods that you are using (Random Forests, Logistic model, etc).

- You might also provide some rough estimates for the training costs in CPU time if it is possible.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Dec 7;17(12):e0278120. doi: 10.1371/journal.pone.0278120.r002

Author response to Decision Letter 0


27 Oct 2022

Dear Reviewers,

Thank you very much for the useful comments. We believe these comments are relevant and therefore we have made an effort to respond to them in the best possible way. As a consequence the paper has improved a lot. We sincerely hope that our changes are in line with the expectations of the reviewers and the paper can be accepted for publication.

In the remainder of this letter, we provide a point-by-point response to the various comments. We do this separately for the comments made by each reviewer. Original comments are in black and our response is in blue. Where relevant, line numbers are added to better trace the changes made to the document. We also made some other changes following the comments of the PLOS ONE editor related to copyright of figures, which are placed at the end of the document.

Best wishes,

Michiel van Dijk (on behalf of all authors).

Reviewer #1

Major comments. The methodology is not clearly presented and as such, is not convincing. Equally important, how the output gridded map is compared with the actual occupational status is vague and probably incorrect. For this reason, it has to be refined.

Here are my comments in detail:

1. Introduction. The paper lacks a paragraph explaining which is the originality of this paper in terms of the methodological approach. Is it just the use of a super learner, or something more, like designing the gridded map? This should be stated more clearly.

Thanks for the comment. Our paper makes two contributions to the literature: (1) We present an approach to create gridded maps for labor statistics. Labor is one of the key factors of production and therefore detailed spatially explicit statistics are key information for (local) policy makers. Labor statistics are only available at the national level and sporadically at the subnational level. We provide an approach to create fine-scale gridded maps of labor statistics to better inform policies. (2) The existing literature on downscaling of subnational indicators, as well as the related literature on spatial extrapolation of geo-coded information (e.g. for soils), has mainly relied on the use of a single ML approach (mostly random forest), which might lead to biased or sub-optimal results. Our paper uses an ensemble approach (i.e. a super learner) to address this issue.

To emphasize this better, we rearranged the introduction and clearly stated our contribution in the beginning of the introduction L48-56: “The contribution of this paper is to demonstrate an approach to downscale subnational labor statistics to a fine-scale spatial grid using machine learning approaches. As such this paper contributes to a rapidly expanding literature, which uses machine learning and advanced statistical models to create fine-scale gridded maps of socio-economic indicators. Key examples include the mapping of population (Leyk et al. 2019), educational attainment (Graetz et al. 2019), child growth (Osgood-Zimmerman et al. 2018), poverty (Pokhriyal and Jacques 2017) and wealth (Chi et al. 2022). To the best of our knowledge, this is the first application to apply these techniques to downscale labor data.”

2. Line 31-33. I suggest you include more recent papers regarding the spatial uneven distribution of labor and expand a little more on the subject. For example, include the following recent works. Strumsky, et al., 2021. “As different as night and day: Scaling analysis of Swedish urban areas and regional labor markets." Environment and Planning B: Urban Analytics and City Science. Grekousis et al., 2018 “More Flexible Yet Less Developed? Spatio Temporal Analysis of Labor Flexibilization and Gross Domestic Product in Crisis Hit European Union Regions”. Social Indicators Research.

Thanks for the suggestions. We have added the references in L38-39.

3. Page 6, Line 89. The labor force includes the unemployed. How is equal to the sum of all workers? Rewrite and add citation for the labor force definition.

The labor force definition as well as all other definitions were taken from the ILO and we added the reference on L122-123. The confusion is caused by the term ‘workers’, which is not clearly defined and might suggest that it only includes the number of employed people. Census and labour force surveys ask questions on type of occupation to all people that are in the labour force, either employed or unemployed. We removed the confusing sentence and added a clarification in L142-143. The census included questions for each person in the labor force on his/her type of occupation using the detailed ISCO 08 classification.

4. Line 105 … of each grid cell. Of which grid? Authors should mention clearly how they selected the grid (e.g., based on other grid data readily available, or just designed by themselves?) How many cells has this grid?

Indeed, this is not clear. In this case we simply used an existing product of Pezzulo et al (2017), which presents a gridded map with population data broken down by age classes. We also explained how this map was calculated by the authors, which might have created confusion as it seemed that we made the calculations. We added the following lines (L148-152) to make this clear: “To calculate the working age population at each grid cell, we used population maps with 5-year age group compositions from Pezzulo et al. (2017). The main data source for this product was the 2009 Vietnam population census, which we also used as the main input.”

5. Line 138. Even 0.7 is highly likely to bring multicollinearity issues. I think authors should remove variables correlated at 0.7 level or higher. For example, I would expect viirs with dmsp to be highly correlated. What’s their correlation?

We implemented the proposal of the reviewer and reran our analysis with a maximum correlation of 0.7. As a result, 14 out of the 32 predictors were removed from the analysis (previously only 4 were removed, but this was not well described). We added a correlation matrix in the SI and describe clearly which variables were kept for the analysis. We also mention that 14 variables are removed in the main text L185-186. As a consequence, the outcomes have changed somewhat but are overall very similar to the ones presented in our first submission.
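
To illustrate the idea behind such a correlation filter, here is a minimal sketch in Python. It is not the step_corr routine from the R recipes package that was used for the paper, which selects the variables to drop differently. It is a simplified greedy version: a predictor is kept only if its absolute Pearson correlation with every predictor already kept is below the threshold. The predictor names and values are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def drop_correlated(columns, threshold=0.7):
    """Greedy filter (assumption: simpler than step_corr).

    Walks through the predictors in order and keeps one only if its
    absolute correlation with all previously kept predictors is
    below the threshold.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical predictors: two near-identical night-light layers and elevation.
predictors = {
    "nightlights_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "nightlights_b": [1.1, 2.0, 2.9, 4.2, 5.1, 5.9],
    "elevation": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
}

kept = drop_correlated(predictors, threshold=0.7)
print(kept)
```

In this toy example the second night-light layer is dropped because it is almost perfectly correlated with the first, while elevation is retained. Note that the outcome of a greedy filter depends on the order of the predictors.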

6. Line 138. Which variables were removed, and how many did you finally keep? Table S2 just mentions the 32 predictors. It should be clear which variables were used for training.

See response to previous comment.

7. Line 162. From my understanding authors trained a set of 538 objects, tested over 136 objects on an unknown (not specified in their text) number of variables to make a prediction for some thousands (I believe) cells. Authors should clearly explain their methodology process. For example, which are the specific variables used as predictors for each different target variable? However, I think that if they used the above approach, the results may not be accurate. The reason is that occupational data were downscaled from only 674 districts to some thousands (I guess) and then used again for prediction at this lower scale. Actually, neural networks are kind of prone to errors when the dataset is not large.

This comment is not clear to us. The reviewer seems to think that we ‘somehow’ downscaled the subnational data to the grid cell and used this for prediction. This is not the case. We ran all machine learning models, including the super learner, which simply combines the individual models, at the level of the subnational units (538 objects for training and 136 objects for testing) by combining data of the target variable at the subnational level with the median of the grid cell data located in each subnational area (L177-179). As explained on L57-58, this approach is identical to that of Stevens et al. (2015) and Nicolas et al. (2016), who created gridded population and livestock maps, respectively. To make this clearer we added the following sentence in L201-203: “All six machine learning algorithms and the super learner models used data at the subnational level for training and testing (i.e. 674 data points in total).”
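
The workflow described above can be sketched as follows. This is a toy Python illustration under strong assumptions, not the code used for the paper: there are only three hypothetical districts, one predictor, and an ordinary least squares line in place of the machine learning models. Clipping the predictions to the 0-1 range is a shortcut chosen for this sketch.

```python
from statistics import median

# Hypothetical grid cells: (district id, predictor value, e.g. night-light).
grid_cells = [
    ("A", 0.1), ("A", 0.2), ("A", 0.9),
    ("B", 0.4), ("B", 0.5), ("B", 0.6),
    ("C", 0.7), ("C", 0.8), ("C", 1.5),
]
# Hypothetical district-level target, e.g. share of agricultural workers.
district_share = {"A": 0.8, "B": 0.5, "C": 0.2}

# Step 1: summarise the grid-cell predictor by its median per district.
values_by_district = {}
for district, value in grid_cells:
    values_by_district.setdefault(district, []).append(value)

districts = sorted(district_share)
x_train = [median(values_by_district[d]) for d in districts]
y_train = [district_share[d] for d in districts]

# Step 2: train at district level (here: a one-predictor least squares fit).
n = len(x_train)
mean_x, mean_y = sum(x_train) / n, sum(y_train) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(x_train, y_train))
    / sum((x - mean_x) ** 2 for x in x_train)
)
intercept = mean_y - slope * mean_x

# Step 3: apply the district-level model to every grid cell.
def predict_share(value):
    """Predicted share for one grid cell, clipped to the valid 0-1 range."""
    return min(1.0, max(0.0, intercept + slope * value))

cell_predictions = [(d, predict_share(v)) for d, v in grid_cells]
for district, share in cell_predictions:
    print(district, round(share, 2))
```

The last cell of district C has a predictor value far from any district median, which shows why grid-cell predictions can involve values at the tails of the distribution that were not seen during training.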

8. Line 210. TablesS3-S9. Authors should use a standard way of presenting results in TablesS3-S9 making thus easier for other researchers to delve deeper into the details of the adopted architecture and the performance of the ANNs and the other machine learning methods used. Authors are strongly recommended to use and cite a reporting scheme proposed by Grekousis 2019 “Artificial neural networks and deep learning in urban geography: A systematic review and meta-analysis.” Computers Environment and Urban Systems. Authors should use Table 4 of the above paper at least for the neural networks they already report and adapt it according to their needs (for example, as this is an ensemble approach, they don’t have to provide any graphs).

We agree it is important to add detail on the machine learning approaches used, assumptions and hyperparameters selected. For this reason, we added Tables S3-S9, which provide the main hyperparameters of the most important models in our super learner. With respect to the neural network, it provides additional detail on the three parameters that can be tuned with the R neural network package we used for the analysis (nnet, Venables and Ripley, 2002). The elaborate reporting scheme proposed by Grekousis also lists several other parameters, which are, however, not relevant (i.e. tunable) for the nnet package, e.g. regularization coefficient and dropout. Moreover, as we ran a very large number of models in an ensemble framework, it is not practically feasible to add all the details in the form of tables in an annex. Instead, we prefer to make our analysis fully reproducible by adding code and data in open repositories (see code and data statement in the papers). In this way the interested reader can extract all information possible, even going beyond the reporting scheme of Grekousis (2019). In addition, we added an extra paragraph in the SI, that lists the R packages used for each of the machine learning models, which makes it easy for interested readers to look for additional information on hyperparameters etc. and stresses the need to provide clarity about packages used and hyperparameters selected in the spirit of Grekousis (2019).

9. Line 212. This is like a loop. Authors begin from generalized data (small dataset), downscale them, and then generalize again. It is expected that prediction error is likely to decrease when you get back to the near original scale. This approach is not convincing unless they can provide more evidence. Authors should better explain or follow another path. For example, it would be more accurate if authors could find occupational data for a higher (more detailed) administrative level than the second level they used, for let’s say 100 randomly selected cells (each cell could enclose more than one of the more detailed administrative level). As the occupational data come from the national census, it’s expected that data should exist. Then, authors could just compare the predicted values at each cell, with the actual values at the overlapping administrative units.

We agree with the reviewer that using more detailed subnational information would be best for validation. Unfortunately, the micro data from the population census provided by IPUMS is only representative at the 2nd level, so we cannot calculate labor force statistics at finer subnational levels. Despite this, we do think that our current approach is useful for validation because it says something about the bias of the model(s). Grid cell predictions are based on all predictor values (i.e. the full distribution of observations) within a subnational unit, not only the median values that were used to train the model. As such, they will also include values around the median, and even observations at the tails of the distribution (e.g. remote areas with near-zero nightlight), which were most likely not observed in the training and testing datasets. If the model were to structurally under- or overperform for such values, this would result in a strong bias. On the other hand, if we find a strong relationship between aggregated predictions and subnational census values, this is evidence that the model is producing realistic values on average (i.e. has a low bias).

We added several sentences to make this point in 283-294. “We also investigated if the models were able to adequately reproduce the number of workers at higher subnational aggregations. To do this, we aggregated the predictions for the number of workers at grid cell level and compared this with the district-level number of workers that can be derived from the model input data for each occupation category (Figure S2). A strong relationship between the two values indicates that the models are producing realistic values on average. Grid cell predictions are based on all predictor values, not only the median values that were used to train the model, and therefore are likely to include observations at the tails of the distribution (e.g. remote areas). If the models would structurally under or over perform for such values, this would result in poor aggregated predictions and implies the models might be biased. The high R2 of 0.85-0.95 (Figure S6) between model outcomes and observed subnational statistics suggest this is not the case.”
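
A minimal sketch of this aggregation check, in Python and with made-up numbers, is given below. It assumes that the number of workers in a grid cell is the product of the predicted occupation share, the predicted labor force participation rate and the working-age population, and it uses the squared Pearson correlation as the R2 of a simple regression. All values and names are hypothetical.

```python
import math

# Hypothetical grid cells:
# (district, predicted occupation share, predicted participation rate,
#  working-age population)
cells = [
    ("A", 0.60, 0.80, 1000),
    ("A", 0.40, 0.75, 2000),
    ("B", 0.20, 0.70, 5000),
    ("B", 0.10, 0.65, 4000),
    ("C", 0.70, 0.85, 500),
]
# Hypothetical district-level numbers of workers from the census.
observed = {"A": 1100.0, "B": 900.0, "C": 320.0}

# Workers per cell = share x participation rate x working-age population,
# summed over all cells in a district.
predicted = {}
for district, share, participation, population in cells:
    workers = share * participation * population
    predicted[district] = predicted.get(district, 0.0) + workers

def r_squared(x, y):
    """Squared Pearson correlation, i.e. the R2 of a simple regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy / math.sqrt(sxx * syy)) ** 2

districts = sorted(observed)
r2 = r_squared([predicted[d] for d in districts], [observed[d] for d in districts])
print({d: round(predicted[d], 1) for d in districts}, round(r2, 3))
```

A high R2 in such a comparison only indicates that the aggregated predictions track the district totals on average. It does not validate individual grid-cell values.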

10. Line 218. How exactly is this error calculated? What is the unit of the error mapped at fig S11 (e.g., std dev?)

The calculation of the error is explained in L227-231 but we now also added the unit (and in the SI as well): “We followed the approach used by Hengl et al. (2021) to calculate 67% (1 standard deviation) grid-cell prediction errors for the super learner models.”

11. Line 226. Esaccilc_dist190 does not exist in Table S2

This was a typo. As a consequence of the updated correlation coefficient restriction (see above), the list of included predictors has changed so we have updated this section accordingly.

12. Section3.3. I think this section is not well presented and not convincing. Authors try to use a different dataset to compare their output. First, the wealth index spans in a long period (2010-2018) outside the reference study time (2009). Second, the R2 is low (below 0.50) in all cases, something also clearly seen in Figure 4, so a conclusion of a good fit is an overstatement. I would suggest dropping this section as it creates more confusion than convincing evidence for the model’s accuracy.

We do not agree with the remark of the reviewer. As explained in the text, there is a clear theoretical link between the share of an occupation in a region/grid cell and average wealth. One would expect to pick this up by means of a simple regression analysis. There are no standard definitions for a ‘low’ and ‘high’ R2, but in our opinion an R2 between 0.2 and 0.5 is often considered moderate and possibly even high in the social sciences (note that an R2 of 0.5 is equal to a correlation of ~0.7!), and certainly not low. In fact, we think the finding of a simple R2 of around 0.4 for the two main occupation classes (agricultural and craft workers), which make up 73% of the labor force, is remarkable and confirms the validity of our approach. Also note that we run the regression on a large number of grid cells (~33,000) and therefore a lower R2 is to be expected because of random errors. We suspect that the correlation would be even higher if we could add various control variables such as local taxes, education level, etc., but such an analysis falls outside the scope of this paper and the data to do this are not available. Moreover, it is very common in the literature on downscaling to use external datasets for validation. Other examples are Chi et al. (2022) and Yu et al. (2020). We follow this tradition. Finally, it is correct that the wealth map and our product cover different periods. Change in occupational structure, however, is a long-run process related to structural change in the economy. It is therefore plausible to assume that the occupation map for 2009 is also representative for the period 2010-2018, which is covered by the wealth map, and therefore both can be compared. We modified this paragraph to explain these points (L341-344):
“This period does not overlap with our base year of 2009, but as the reallocation of labor across sectors is a long-run and gradual process (Timmer, Vries, and Vries 2015), we expect that our occupation maps will also be representative for the period covered by the wealth map.”
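
The remark above that an R2 of 0.5 corresponds to a correlation of about 0.7 follows from the fact that, for a regression with a single predictor, R2 equals the squared Pearson correlation. The short Python check below illustrates this with a small made-up data set. The numbers are not taken from the paper.

```python
import math

# Made-up data, purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)

# Pearson correlation between x and y.
r = sxy / math.sqrt(sxx * syy)

# R2 of the least squares regression of y on x: 1 - SS_res / SS_tot.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
r_squared = 1.0 - ss_res / syy

print(round(r, 3), round(r_squared, 3))  # r squared equals R2

# Converting the R2 values discussed in the response into correlations.
for value in (0.2, 0.4, 0.5):
    print(value, round(math.sqrt(value), 2))
```

This identity only holds for a regression with one predictor and an intercept, which is the case for the simple regressions discussed here.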

Reviewer #2

Some (very) minor concerns/suggestions:

- p.4, l.85: i should be a subscript

Thanks for spotting this – Changed.

- p.7, l.138: "we removed all variables that had a correlation of 0.9 or larger" You could try also a Ridge-type estimation scheme to keep all the available information. You may at least mention this approach as a possible alternative in order to keep the same set of predictors for all of your modelling operations to avoid inconsistencies in the interpretation.

Reviewer #1 also had a remark about this assumption. We decided to follow his advice and use a correlation of 0.7 instead (see above).

- p.7, l.139: It would be nice to provide a small description on how the Super Learner procedure works. It would be excellent if you could provide a graphical representation of this procedure to your problem illustrating the various ML methods that you are using (Random Forests, Logistic model, etc).

We added the following paragraph (L188-193) to explain the concept of a super learner algorithm. “We used an ensemble approach, referred to as a super learner, to predict labor statistics at the grid level. A super learner is an algorithm that combines the results of multiple machine learning models or the same model with different parameters and settings. Predictions are then generated by weighting the outcomes of the individual member models. It has been demonstrated that predictions of a super learner have the same or higher accuracy in comparison to those generated by means of single machine learning models (Laan, Polley, and Hubbard 2007).”
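
To make the concept more concrete, the sketch below shows a toy super learner in Python. It is an illustration under stated assumptions and not the implementation used for the paper, which was written in R: the three base learners (mean, linear regression, nearest neighbour) are simple stand-ins for the actual machine learning models, the data are simulated, and the weights are found by a coarse grid search over non-negative weights that sum to one.

```python
import random

def fit_mean(xs, ys):
    """Baseline learner: always predicts the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Ordinary least squares with a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def fit_nearest(xs, ys):
    """1-nearest-neighbour learner."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

LEARNERS = [fit_mean, fit_linear, fit_nearest]

def mse(pred, ys):
    return sum((p - y) ** 2 for p, y in zip(pred, ys)) / len(ys)

def out_of_fold_predictions(xs, ys, k=5):
    """Cross-validated predictions of each learner for every observation."""
    n = len(xs)
    preds = [[0.0] * n for _ in LEARNERS]
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        for j, fit in enumerate(LEARNERS):
            model = fit(tx, ty)
            for i in test:
                preds[j][i] = model(xs[i])
    return preds

def combine(weights, preds):
    """Weighted sum of the learners' predictions."""
    return [sum(w * p[i] for w, p in zip(weights, preds))
            for i in range(len(preds[0]))]

def fit_weights(preds, ys, steps=20):
    """Grid search over convex weights minimising cross-validated MSE."""
    best_err, best_w = None, None
    for a in range(steps + 1):
        for b in range(steps + 1 - a):
            w = (a / steps, b / steps, (steps - a - b) / steps)
            err = mse(combine(w, preds), ys)
            if best_err is None or err < best_err:
                best_err, best_w = err, w
    return best_w, best_err

# Simulated data (assumption): a noisy linear relationship.
random.seed(1)
xs = [random.uniform(0, 1) for _ in range(60)]
ys = [0.2 + 0.5 * x + random.gauss(0, 0.05) for x in xs]

oof = out_of_fold_predictions(xs, ys)
weights, ensemble_cv_mse = fit_weights(oof, ys)
single_cv_mse = [mse(p, ys) for p in oof]

# Final super learner: refit every member on all data, then weight them.
models = [fit(xs, ys) for fit in LEARNERS]

def super_learner(x):
    return sum(w * m(x) for w, m in zip(weights, models))

print(weights, ensemble_cv_mse, single_cv_mse)
```

Because the weight grid includes the cases in which a single learner receives all the weight, the cross-validated error of the combination in this sketch can never be worse than that of the best single member. This mirrors, in a simplified way, the property referred to in the paragraph above.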

We have thought about adding some sort of graphical representation as suggested by the reviewer but are not fully convinced about the need to add this (nor do we have clear ideas on what to add).

- You might also provide some rough estimates for the training costs in CPU time if it is possible.

This is an interesting idea, but we used different machines to run our models and are currently in the process of running the scripts on a high-performance computing cluster. We therefore did not collect consistent data on CPU time and are unable to add this.

PLOS ONE editor

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf.

We had a look at the templates and made changes accordingly.

2. We note that Figures 1, 2, S2-S11 in your submission contain map/satellite images which may be copyrighted. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For these reasons, we cannot publish previously copyrighted maps or satellite images created using proprietary data, such as Google software (Google Maps, Street View, and Earth). For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

Figure 1 and S2 depict data from IPUMS. This data can be used for analysis and visualization as long as an appropriate reference is added, which we have done in the text. Figure 2, S10 and S11 present the results of our analysis so this can be presented without any problems. S3-S9 present maps of our predictors. Although all datasets are open access, we are not 100% sure if we can share all maps in this format. For this reason, we have removed these maps from the SI. As readers can easily recreate this data

Attachment

Submitted filename: response_to_reviewers.docx

Decision Letter 1

Sotirios Koukoulas

10 Nov 2022

Occupations on the map: Using a super learner algorithm to downscale labor statistics

PONE-D-22-10590R1

Dear Dr. van Dijk,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Sotirios Koukoulas, Ph.D

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: All comments were addressed. The statistical analysis has been performed appropriately and rigorously. The paper is accepted with no other comments.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

**********

Acceptance letter

Sotirios Koukoulas

29 Nov 2022

PONE-D-22-10590R1

Occupations on the map: Using a super learner algorithm to downscale labor statistics

Dear Dr. van Dijk:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Sotirios Koukoulas

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Total labor force and occupation shares in Vietnam for the year 2009.

    Source: Minnesota Population Center. Integrated Public Use Microdata Series, International: Version 7.2 [dataset]. Minneapolis, MN: IPUMS; 2019. https://doi.org/10.18128/D020.V7.2 for labor force participation rate and occupation shares, and Pezzulo et al. (2017) for working age population.

    (PNG)

    S2 Fig

    District-level information: occupation shares for (a) Managers and professionals, (b) Technicians and associate professionals, (c) Clerks and service workers, (d) Agricultural workers, (e) Craft workers and operators, (f) Elementary occupations, (g) Labor force participation rate and (h) Working age population in Vietnam for the year 2009. Source: Minnesota Population Center. Integrated Public Use Microdata Series, International: Version 7.2 [dataset]. Minneapolis, MN: IPUMS; 2019. https://doi.org/10.18128/D020.V7.2 for labor force participation rate and occupation shares, and Pezzulo et al. (2017) for working age population.

    (PNG)

    S3 Fig. Correlation matrix for predictors before normalization and Yeo-Johnson power transformation.

    After normalization and Yeo-Johnson power transformation, we used step_corr(., threshold = .7) from the R recipes package to remove all predictors with an absolute correlation equal to or larger than 0.7. Consequently, 14 predictors were removed from the analysis (bio_1, bio_5, bio_6, dmsp, dst_bsgmi, dst_ghslesaccilcguf, esaccilc_dst040, int_airports, osm_dst_road, osm_dst_roadintersec, srtm_slope, srtm_topo, travel_time, viirs), leaving 18 predictors that were used as final input.

    (PNG)

    S4 Fig

    Super learner results for (a) Managers and professionals, (b) Technicians and associate professionals, (c) Clerks and service workers, (d) Agricultural workers, (e) Craft workers and operators, (f) Elementary occupations and (g) Labor force participation rate.

    (PNG)

    S5 Fig

    Prediction errors for (a) Managers and professionals, (b) Technicians and associate professionals, (c) Clerks and service workers, (d) Agricultural workers, (e) Craft workers and operators, (f) Elementary occupations and (g) Labor force participation rate. Prediction errors are logit-transformed values reported at a probability of 67%, which corresponds to the 1 standard deviation upper and lower prediction interval.

    (PNG)

    S6 Fig

    District-level comparison between observations and super learner predictions for (a) Managers and professionals, (b) Technicians and associate professionals, (c) Clerks and service workers, (d) Agricultural workers, (e) Craft workers and operators and (f) Elementary occupations. Dashed blue line represents the 1:1 line. Solid blue line indicates the regression line, with 95% confidence intervals in grey. District-level observations on the number of workers are calculated by multiplying district-level data on occupation share, labor force participation and working age population (aggregated from grid-level values), depicted in S2 Fig.

    (PNG)

    S1 Table. Main occupation categories based on the International Standard Classification of Occupations 08 (ILO 2012).

    (PDF)

    S2 Table. List of selected predictors for the machine learning models.

    (PDF)

    S1 File. Hyperparameters for super learner model members, sorted by RMSE.

    (PDF)

    S2 File. Variable importance plots.

    Results are presented for the five super learner model members with the highest weight and for the top 10 predictors. Starting values for the horizontal bars indicate the RMSE for the full model. Predictors with the largest bars are the most important because permuting them results in higher RMSE. Error bars indicate results for 10 different permutations.

    (PDF)

    S3 File. Accumulated local effects.

    Results are presented for the five super learner model members with the highest weight.

    (PDF)

    Data Availability Statement

    The main results and input datasets of this study are presented in the Supporting information files. All input and output maps produced for this study can also be accessed by means of an interactive web application (https://shiny.wur.nl/occupation-map-vnm). Input and output data are available on Zenodo (DOI: 10.5281/zenodo.6419272) and scripts to reproduce the analysis are available on GitHub (https://github.com/michielvandijk/occupations_on_the_map).


    Articles from PLOS ONE are provided here courtesy of PLOS
