Abstract
Linguistic diversity is a key aspect of human population diversity and shapes much of our social and cognitive lives. To a considerable extent, the distribution of this diversity is driven by environmental factors such as climate or coast access. An unresolved question is whether the relevant factors have remained constant over time. Here, we address this question at a global scale. We approximate the difference between pre- and post-Neolithic populations by the difference between modern hunter–gatherer and food-producing populations. Using a novel geostatistical approach of estimating language and language family densities, we show that environmental factors, chiefly climate, have driven the language density of food-producing populations considerably more strongly than the language density of hunter–gatherer populations. Current evidence suggests that the population dynamics of modern hunter–gatherers are very similar to what can be reconstructed from the Palaeolithic record. Based on this, we cautiously infer that the impact of environmental factors on language densities underwent a substantial change with the transition to agriculture. After this transition, the environmental impact on language diversity in food-producing populations has remained relatively stable, since it can also be detected (albeit in slightly weaker form) in models that capture the reduced linguistic diversity during large-scale language spreads in the Mid-Holocene.
Keywords: hunter–gatherers, linguistic diversity, ecological risk, language evolution
1. Introduction
One of the most striking aspects of population diversity in humans is linguistic. Languages compartmentalize us into distinct groups, with wide-reaching consequences. Language boundaries tend to establish and signal group identities, constrain and regulate exchange, and demand lifelong language learning when groups are in contact. The extent of this linguistically driven compartmentalization is staggering. Current estimates go well beyond 7 000 languages [1]. The source is gradual splitting processes of cultural evolution that operate, as already noted by Darwin, with mechanisms similar to biological speciation [2,3]. Many of the most recent such processes—up to about 8 000 years ago—have been reconstructed, and languages are currently estimated to fall into over 400 distinct families of shared descent [1].
Remarkably, however, linguistic diversity is extremely uneven across the world [4–7]. Figure 1 displays the geographical densities of languages that are spoken or are known to have been spoken in the past, before recent globalization events [1]. Maps and analyses of nearest neighbour distances (figure 1) suggest considerable spatial clustering, both at the level of individual languages and at the level of language families. In extreme cases, such as deserts or polar regions, differences in language density simply reflect differences in human presence. But beyond this, regions with similar population sizes can be fragmented into many small-range languages or contain only fewer but larger-range languages [8]. For example, South and Southeast Asia have similar population densities, but South Asia shows a lower language density than Southeast Asia (electronic supplementary material, S2 and figure S1). In general, language and population density are only very weakly correlated (electronic supplementary material, figure S2).
Figure 1.
Distribution of languages (a) and language families (b) in the Glottolog database (see Material and methods). Both distributions show substantial spatial clustering, as revealed by the maps (a and b) and through assessing normalized nearest-neighbour distances (c,d and electronic supplementary material, S1), where 1 (horizontal line) indicates a fully random distribution and 0 full co-location. Strong spatial clustering characterizes food-producing (FP) and hunter–gatherer (HG) populations alike (c), and holds for language families of various sizes, including the largest families (d, colour-matched to the map in b). (Online version in colour.)
Most explanations of geographical differences in language density link the effect to biodiversity and therefore locate the causes in the same environmental factors that drive species richness [9–11]. Biologically richer and more complex environments support higher numbers of small-range linguistic groups, while less productive environments require larger ranges and wider exchange networks to sustain groups [4,8]. As a result, tropical and coastal regions tend to harbour large numbers of short-range languages in close vicinity, while higher latitudes tend to accommodate only fewer and wider-range languages.
An unresolved question is whether the impact of environmental factors has remained stable over time, in particular whether it was the same before and after the advent of agriculture, pastoralism, and the many demographic and cultural developments that these innovations brought about. Some studies find less impact of environmental factors on language density in food-producing (agricultural or pastoralist) than in hunter–gatherer populations because post-Neolithic technology, such as irrigation systems, can overcome environmental challenges [12,13]. By contrast, the Ecological Risk Hypothesis [6,14] predicts a higher impact of the environment on language densities among food-producing than among hunter–gatherer populations because food production is subject to substantial risks in local climate conditions and water access while hunter–gatherer populations have more mobile and adaptive lifestyles. A third possibility, which has not been discussed much, is based on the observation that on a global scale, food-producing and hunter–gatherer populations seem to cluster in different regions, with only limited overlap (figure 1a). This distribution could point to scenarios of niche differentiation or displacements. Such scenarios predict that the impact of the environment has remained stable overall, but that it is associated with different directions of the effect (i.e. different signs of the relevant coefficients). For example, grassland access might positively attract food-producing populations, concurrently pushing away hunter–gatherer populations from the same areas (and yielding a negative coefficient when predicting language density); conversely, ocean access might attract hunter–gatherer populations, concurrently pushing away agricultural populations from the same areas. The three hypotheses are summarized in table 1, together with a null model of no change in the environmental impact on language densities.
Table 1.
Hypotheses and predictions for the impact of the environment on language densities in food-producing (FP) and hunter–gatherer (HG) populations.
| hypothesis | predicted environmental impact |
|---|---|
| ecological risk | less impact on HG than on FP |
| post-Neolithic technology | less impact on FP than on HG |
| differentiation | same; but what increases HG density decreases FP density, and vice versa |
| null model | same in both direction and magnitude |
Here, we test the predictions of the hypotheses in table 1 against each other. An ideal test would explicitly model the evolutionary processes during the transition to agriculture in a large array of environmental conditions. However, such models are not in sight yet because we lack sufficiently rich and continuous palaeo-climatic estimates and global language phylogeographies. Existing language phylogenies are limited to a dozen families worldwide, and their time depths stay largely within the Neolithic [15]. In response, we approach the question via an approximation by current subsistence type. Specifically, we approximate the difference between pre- and post-agricultural settings by comparing language density drivers in modern hunter–gatherer (HG) and food producer (FP) populations (figure 1a,c). This approximation is justified by recent genomic evidence that the social dynamics and population structure of modern hunter–gatherers are similar to those of pre-Neolithic populations [16].
The approximation is challenged, however, by the fact that language densities were severely reduced during the spread of several large families over the past 8 000 years, such as the spread of Indo-European in Eurasia or of Pama–Nyungan in Australia (figure 1b,d). These spreads homogenized large areas for some period of time, before the spreading languages (e.g. early stages of Indo-European and Pama–Nyungan) diversified and pushed up densities again. In order to control for these events, we follow earlier suggestions [4] and estimate densities not only at the level of languages, but also at the level of families. Family-level estimates capture the density reductions induced by the spreads. In addition, they introduce a control of phylogenetic auto-correlation [17], which places languages of the same family in closer geographical vicinity to each other than unrelated languages (figure 1d). This auto-correlation potentially masks the drivers of language distributions that we aim to test.
2. Material and methods
(a). Language data
We use the Glottolog database [1], which aims to exhaustively list all languages that serve or have served as the regular and dominant means of communication for a human population. Importantly, the database not only lists languages that are currently used, but also languages known to have existed until recently (or even less recently in the few cases where there is a written record). Therefore, the data approximate linguistic diversity before the massive language extinction in the past 100 years and in the present [9,18]. Glottolog also classifies the languages into families. Both the classification and the decision of what counts as a distinct language are based on careful and systematic screening of the published evidence.
Languages are represented as point coordinates in Glottolog. Higher-resolution, areal representations are often problematic and global linguistic databases generally avoid them. In many cases, the relevant spatial shapes are simply unknown. Where they are known, languages often have complex discontinuous distributions in space (e.g. [19,20]), which challenge polygon-based analyses. Point coordinates are problematic, however, for the few large languages with very large distributions, such as English, Russian, or Mandarin Chinese. Here, Glottolog locates the coordinates at the geographical, or, if known, historical, centres of the respective languages. The coordinates for English, for example, are placed in England's East Midlands, where Modern English has its fifteenth century roots.
(b). Subsistence data
For subsistence data, we rely on a list of 1 205 languages spoken by HG populations from a recent compilation [21], augmented by 23 further languages from another source [22]. The HG list was matched to Glottolog language names by hand, starting from available match sets [23]. Since the list is meant to exhaust all known HG populations, we then associated all Glottolog languages that are not on the list with FP subsistence. This may introduce some error by underestimating HG populations, but there are at present no databases of similar size that would allow us to correct this. In total, our database contains 6 672 FP- and 1 251 HG-associated languages, classified into 240 FP- and 247 HG-associated families. Some of these families (around 30%) contain both languages of HG and languages of FP populations. However, we perform our analyses separately for HG and FP populations, and so double classification is unproblematic.
(c). Environmental data
We incorporate 10 variables that capture various aspects of the natural environment (see table 2, and electronic supplementary material, S3 for further details). We selected these variables in line with what previous studies have used [7,11,12,14], in order to maintain comparability with earlier results. Following a recent review and classification [7], we use six ‘environmental variables’, representing climate and vegetation. For climate [24], we focus on precipitation, temperature, and derivatives thereof, such as the number of months with average temperatures higher than 15°C (n_warm_months), which is akin to the length of the mean growing season, representing the ecological risk that a population is exposed to [14]. Vegetation, as an outcome of climate, is approximated by grassland coverage [25]. As ‘spatial heterogeneity variables’ [7], we use distance to nearest ocean, distance to the nearest large river, elevation above sea level, and terrain surface roughness (measured as the standard deviation of elevation within a given search radius). In addition to these variables, we include population counts as a control since, as noted in the introduction, sparsely populated areas are expected to support fewer language groups.
Table 2.
List of environmental variables. More detailed information is given in electronic supplementary material, table S1.
| variable | description | type | aggregation | local scale radius (km) | meso scale radius (km) | macro scale radius (km) |
|---|---|---|---|---|---|---|
| n_warm_months | n months with mean temperature >15°C | climate | median | 0 | 10 | 100 |
| temp_mean | mean annual temperature | climate | median | 0 | 10 | 100 |
| warmest | mean temperature of the warmest quarter | climate | median | 0 | 10 | 100 |
| precipitation_var | seasonal variance of precipitation | climate | median | 0 | 10 | 100 |
| wettest | precipitation in the wettest quarter | climate | median | 0 | 10 | 100 |
| dist_ocean | distance to the nearest ocean | hydrology | min | 0 | 10 | 100 |
| dist_river | distance to the nearest large river | hydrology | min | 0 | 10 | 100 |
| elevation | elevation above sea level | terrain | max | 0 | 10 | 100 |
| roughness | altitude variation | terrain | s.d. | 50 | 100 | 500 |
| grass | km2 of grassland/pasture | vegetation | median | 0 | 10 | 100 |
| population | population count | population | median | 0 | 10 | 100 |
A key issue in environmental variables is that they can be coded at several levels of spatial resolution. Choices here can have a wide-reaching impact known as the modifiable areal unit problem [26]. In order to control for this and to capture the uncertainty in spatial resolution, we adopt a multi-scale approach [27] and extract the environmental information at three spatial scales, namely a local, a meso, and a macro scale (table 2).
Language-level models capture environmental drivers of the present densities, while family-level models capture the reduced densities that were caused by prehistoric spreads. For some of the variables in table 2, such as hydrology and terrain variables, this difference in time is negligible. By contrast, climate, vegetation, and population variables have changed between the time of the spreads and now. The exact time of the spreads is unknown for most families, but given their reconstructed age range of 4 000–8 000 years [15,28], the spreads must have occurred in the Mid-Holocene. Therefore, for family-level models we use mid-Holocene projections for all climate [24] and all vegetation and population [25] data.
(d). Methods
Language density estimates suffer from a series of methodological problems ([7], electronic supplementary material, S4). We aim to solve these (a) by moving from counts in a raster to counts near points that are uniformly distributed on the Earth's sphere, (b) by evaluating observed language densities against what is expected under a random baseline, (c) by incorporating environmental information at multiple scales of resolution, and (d) by resolving the multicollinearity of environmental variables (see electronic supplementary material, S5).
We first create a spherical grid of points that are uniformly distributed so that nearest-neighbour distances are approximately similar between all points. There is no analytical solution to achieve this, a problem known as the Thomson Problem [29]. In response, we use an algorithmic approximation, as available through the function regularCoordinates in the R geosphere package [30]. We compute three uniform point grids with approximately 300, 1 000, and 3 000 points, representing different spatial resolutions while taking into account landmass geometry. We then count the number of distinct languages and distinct families that are nearest to each grid point (figure 2a,b; and electronic supplementary material, figures S3–S4 and table S2 for real-world examples). These counts, and all subsequent analyses, are performed separately for HG and FP populations, and separately for each grid resolution.
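The gridding and counting step can be sketched as follows. This is a minimal Python illustration, not the authors' R implementation: it assumes a Fibonacci lattice as the approximately uniform point set (the geosphere function may use a different construction), it ignores landmass geometry, and the names `fibonacci_grid`, `great_circle_km`, and `count_at_nearest` are hypothetical.

```python
import math
from collections import Counter

def fibonacci_grid(n):
    """Return n (lat, lon) points in degrees, roughly evenly spread on a sphere.
    Assumption: a Fibonacci lattice stands in for the grid used in the paper."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n  # evenly spaced heights in (-1, 1)
        lat = math.degrees(math.asin(z))
        lon = math.degrees((i * golden_angle) % (2.0 * math.pi)) - 180.0
        points.append((lat, lon))
    return points

def great_circle_km(p, q, radius_km=6371.0):
    """Haversine distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2.0) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * radius_km * math.asin(math.sqrt(min(1.0, a)))

def count_at_nearest(grid, locations):
    """For each grid point, count the locations whose nearest grid point it is."""
    counts = Counter()
    for loc in locations:
        nearest = min(range(len(grid)), key=lambda i: great_circle_km(grid[i], loc))
        counts[nearest] += 1
    return [counts[i] for i in range(len(grid))]
```

For family-level counts, one would count distinct family labels per grid point rather than locations; the brute-force nearest-point search is chosen here for readability, not speed.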
Figure 2.
Procedure for counting languages in the data (langC, 1) and in the random baseline distribution (randC, 2) for each grid point. Language counts: language locations (crosses in a) are counted (numbers in b) at nearest grid points (black dots). Random counts: randomly generated locations (crosses in c) are counted at nearest grid points. Repeating this B times yields a random baseline distribution of counts at each grid point (blue histograms in e). (Online version in colour.)
We then compare the observed counts at each grid point to a random baseline, i.e. to what would be expected under a random distribution of languages and language families over the world's landmasses (excluding oceans and sparsely populated polar regions). For this, we generate random language locations equal in number to the observed languages. We do this with the randomCoordinates algorithm from the geosphere package [30]. This algorithm solves the Thomson Problem by approximation and ensures equal probability of locations across latitudes and longitudes. In order to compute expected counts for families, the generated language locations are furthermore assigned family labels. To do so, we generate a vector of family labels with the number and frequencies of labels taken from the data. We then randomly sort this vector by family and linearly assign labels to random locations along longitudes and latitudes. This ensures that neighbouring locations have a higher probability of being assigned to the same family while keeping the geographical distribution of small and large families random. This procedure mimics the phylogenetic auto-correlation in the observed distribution of families (figure 1), ensuring comparability between observed and expected counts.
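One possible reading of the random baseline and the family-label assignment is sketched below in Python. It rests on several assumptions: locations are drawn uniformly per unit area of the whole sphere (no land or polar mask is applied), "linearly assign along longitudes and latitudes" is interpreted as sorting locations by longitude and then latitude, and the function names are hypothetical rather than taken from geosphere.

```python
import math
import random

def random_locations(n, rng):
    """Draw n (lat, lon) points in degrees, uniform per unit area of a sphere.
    Assumption: no mask for oceans or polar regions is applied in this sketch."""
    points = []
    for _ in range(n):
        lon = rng.uniform(-180.0, 180.0)
        lat = math.degrees(math.asin(rng.uniform(-1.0, 1.0)))  # area-preserving latitude
        points.append((lat, lon))
    return points

def assign_family_labels(locations, family_sizes, rng):
    """Assign labels so that spatial neighbours tend to share a family.
    family_sizes maps a family label to its number of languages."""
    families = list(family_sizes)
    rng.shuffle(families)  # random order keeps the placement of large and small families random
    labels = [fam for fam in families for _ in range(family_sizes[fam])]
    # Assumption: walk through locations sorted by longitude, then latitude.
    order = sorted(range(len(locations)), key=lambda i: (locations[i][1], locations[i][0]))
    assigned = [None] * len(locations)
    for label, idx in zip(labels, order):
        assigned[idx] = label
    return assigned
```

Because labels are laid out in contiguous runs along the sorted locations, members of one family end up near each other, which is the property the paragraph above asks of the baseline.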
For each of B=500 random sets, we count the number of locations (randC) that are nearest to each grid point (figure 2c,d), in parallel to how we count the observed languages and families. This yields a random baseline distribution of language and language family counts at each grid point (figure 2e), directly comparable to the observed counts. At each grid point, we determine the proportion of B random counts that are smaller than the observed count. We take this proportion as an estimate of the cumulative probability P(langC) of the observed count to exceed what is expected under the random baseline process, i.e. P(langC)=mean(langC > randC) (electronic supplementary material, figures S5–S6).
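The estimate P(langC) = mean(langC > randC) amounts to a one-line empirical proportion. The Python sketch below uses a hypothetical function name and made-up toy counts purely for illustration.

```python
def cumulative_probability(observed_count, random_counts):
    """Share of random baseline counts that the observed count strictly exceeds."""
    return sum(observed_count > r for r in random_counts) / len(random_counts)

# Toy example with B = 8 random counts at one grid point (illustrative values only).
p = cumulative_probability(3, [0, 1, 1, 2, 3, 4, 0, 2])  # 6 of 8 random counts are below 3
```

Note that ties do not count as exceedances, which follows the strict inequality in the formula.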
By picking a suitable cut-off interval α, these cumulative probabilities can be evaluated as to whether there are fewer languages than expected under the random baseline (P(langC) below the lower bound of α), or more languages than expected (P(langC) above the upper bound of α). We use the term ‘trough’ for points with fewer than expected languages and ‘peak’ for points with more than expected languages. Consider, for example, a count of 3 in figure 2b. If this count mostly exceeds the counts under randomization (figure 2e), it will qualify as a peak; if it hardly ever exceeds the counts under randomization, it will be a trough. The choice of the cut-off interval α outside which a count qualifies as a peak or trough depends on the needs of the model that is fitted. We return to this below, after giving more details on our random baseline and introducing our modelling strategy.
As noted in the introduction, language densities can be affected by population size. Although the correlation is weak outside sparsely populated areas (electronic supplementary material, S1), we control for possible confounds in two ways. First, we include population size among the environmental predictors and test its influence. Second, we generate a second set of random locations that is informed by population density. Specifically, we let the probability that a location is selected during the generation of random distributions increase in proportion to the population count at that location. For example, densely populated regions like South and Southeast Asia (electronic supplementary material, figure S1) are given a higher chance of gaining random locations than sparsely populated regions like Australia or Siberia. As a result, language counts under this baseline must be higher to qualify as a peak in densely populated regions than in sparsely populated regions. This de-correlates our estimates from the distribution of human presence.
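The population-informed baseline can be illustrated with weighted sampling. The Python sketch below assumes that candidate locations are discrete cells with known population counts; the cell representation and the function name are hypothetical and are not taken from the paper.

```python
import random

def sample_weighted_cells(cells, population, n, rng):
    """Draw n cells with replacement, with probability proportional to population."""
    return rng.choices(cells, weights=population, k=n)

# Toy example: cell "dense" holds nine times the population of cell "sparse".
rng = random.Random(1)
draws = sample_weighted_cells(["sparse", "dense"], [1, 9], 10000, rng)
share_dense = draws.count("dense") / len(draws)  # expected to be close to 0.9
```

Under such a baseline, a densely populated cell collects more random locations, so an observed count there has to be larger before it exceeds most random counts.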
To assess the impact of the environmental variables on language densities, most previous work has relied on regression models. Such models are challenged by the extreme skewing of language distributions and by the massive multicollinearity between the environmental predictors, e.g. both temperature and precipitation strongly correlate with latitude. We opted for Random Forest models [31] instead, which avoid the basic problems of multicollinearity and are more robust against the extremely uneven distribution that characterizes language densities (electronic supplementary material, S6). Random Forest models are ensemble, machine-learning classification algorithms. Ensemble classification means that many (here, 5 000) classification trees are trained with bootstrapped samples of the input data (a technique known as bagging), and each split in each tree is based only on a random selection of all available predictors. We use Random Forests to classify grid points as peaks or troughs based on environmental information. The classification result is based on the majority vote over all 5 000 available trees. The main statistic of interest in the result is the out-of-bag classification error, i.e. the classification error for data that has not been seen by the classifier. The better the model classifies grid points in line with the observations, the higher the impact of environmental factors. Additional statistics of interest are the predictor significance and the predictor importance: the degree to which individual factors contribute to the classification. We estimate predictor significance through a permutation test that avoids problems of any remaining multicollinearity [32]. The importance of a variable is given by the decrease in classification accuracy when the classification uses only those trees that do not include the predictor, as compared to the classification result from all trees.
Classification is facilitated and errors naturally decrease when one type is considerably more frequent than the other. We avoid this over-fitting effect by applying a down-sampling procedure [33] as implemented in the randomForest package in R [31]. This means that during bagging (see above), the algorithm takes bootstrapped sub-samples of the data with balanced frequencies (electronic supplementary material, S6). Another concern of the sample structure is that classifier performance might depend on the spatial extent. In order to assess whether our classification model is robust against this, we fit the Random Forests not only to the global data, but add two regional zoom-in studies. For this, we choose Africa and Australia because they are more comparable with each other than other pairs of continents. Both Africa and Australia have undergone large family spreads, southwards across similar latitudes: Bantu in Africa [34] and Pama–Nyungan in Australia [35]. At the same time, they form an interesting contrast set for our hypotheses because the Bantu spread is associated with FP populations, while the Pama–Nyungan spread is associated with HG populations.
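The idea behind balanced down-sampling during bagging can be sketched as follows. This is a simplified Python illustration with a hypothetical function name; it is not the internals of the randomForest package. It assumes that each tree receives a bootstrap sample in which every class contributes as many cases as the rarest class has.

```python
import random

def balanced_bootstrap(labels, rng):
    """Return indices of a bootstrap sample with equal numbers of each class."""
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    n_min = min(len(indices) for indices in by_class.values())
    sample = []
    for indices in by_class.values():
        sample.extend(rng.choices(indices, k=n_min))  # draw with replacement
    return sample

# Toy example: 90 troughs and 10 peaks yield 10 draws from each class per tree.
labels = ["trough"] * 90 + ["peak"] * 10
sample = balanced_bootstrap(labels, random.Random(0))
```

With balanced samples, a tree cannot lower its error simply by predicting the majority class by default.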
While Random Forest models avoid the problems of multicollinearity in basic model performance, resolving the most striking collinearities helps model interpretation. We therefore replace the raw climate measures by the residuals that are left when the measures are regressed against latitude, for example, the residuals from temp_mean ∼β0 + β1 × |latitude|. In turn, the absolute values of latitude are added as a predictor to the model. The overall structure of the model is as follows, where R refers to residuals and all variables except latitude are entered three times each, as measured on a local, meso, and macro scale (table 2):
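$$
\text{peak/trough} \sim |\text{latitude}| + R(\text{n\_warm\_months}) + R(\text{temp\_mean}) + R(\text{warmest}) + R(\text{precipitation\_var}) + R(\text{wettest}) + \text{dist\_ocean} + \text{dist\_river} + \text{elevation} + \text{roughness} + \text{grass} + \text{population}
$$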
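The residualization of a climate measure against absolute latitude, as in temp_mean ∼ β0 + β1 × |latitude|, can be sketched with ordinary least squares. The Python function below is an illustration under that assumption; its name and the toy values are hypothetical.

```python
def residuals_on_abs_latitude(values, latitudes):
    """Residuals of values ~ b0 + b1 * |latitude|, fitted by ordinary least squares."""
    x = [abs(lat) for lat in latitudes]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(values) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, values))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, values)]

# Toy example: temperatures that fall linearly with |latitude|, plus one warm anomaly.
lats = [-60.0, -30.0, 0.0, 30.0, 60.0]
temps = [30.0 - 0.5 * abs(lat) for lat in lats]
temps[1] += 4.0  # the site at -30 degrees is warmer than its latitude suggests
res = residuals_on_abs_latitude(temps, lats)
```

The residuals carry the part of the climate signal that latitude does not explain, so the anomalous site receives the largest positive residual.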
Models are trained separately for FP and HG populations, for the language and language family levels, and for all three spatial resolutions (300, 1 000, and 3 000 grid points). Together with the two regional models, each again fitted separately at the language and at the family level, this leads to a total of 16 models.
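The total of 16 models follows from enumerating the design: 2 subsistence types × 2 levels × 3 resolutions for the global models, plus 2 regions × 2 levels for the regional ones. A short Python check, with illustrative labels only:

```python
from itertools import product

# Global models: subsistence type x level x grid resolution.
global_models = list(product(["FP", "HG"], ["language", "family"], [300, 1000, 3000]))

# Regional models: one region per subsistence type, fitted at both levels.
regional_models = list(product([("Africa", "FP"), ("Australia", "HG")], ["language", "family"]))

n_models = len(global_models) + len(regional_models)  # 12 + 4
```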
Approaching the relationship between language density and the environment as a classification task means that we need a suitable threshold α that determines whether a grid point is a trough (fewer languages than expected), a peak (more languages than expected), or in line with what is expected given the random baseline. Setting α to the conventional 0.05 and 0.95 levels of statistical tests results in a distribution where virtually all troughs have zero languages in their vicinity. Counts of 1 or 2 would already fall into the range of expected values. As a result, we would lose information on low-density regions, and at the same time re-introduce a very strong correlation of troughs with low population size. In order to obtain a classification that better captures language density, we therefore opt for a more liberal α = [0.25, 0.75], i.e. we consider a grid point a trough if P(langC) < 0.25 and a peak if P(langC) > 0.75. Grid points with 0.25 ≤ P(langC) ≤ 0.75 are left out of further analysis because we assume them to be the result of random processes of spatial dispersal alone. This decision notwithstanding, we also conduct a sensitivity analysis on alternative choices of α.
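The thresholding rule with α = [0.25, 0.75] can be written as a small function. The Python sketch below uses hypothetical names; the label "expected" for the excluded middle range is our shorthand, not a term from the text.

```python
def classify_grid_point(p, lower=0.25, upper=0.75):
    """Map a cumulative probability P(langC) to 'trough', 'peak', or 'expected'."""
    if p < lower:
        return "trough"
    if p > upper:
        return "peak"
    return "expected"  # consistent with the random baseline; excluded from the models

labels = [classify_grid_point(p) for p in (0.10, 0.25, 0.50, 0.75, 0.90)]
```

Passing other bounds, such as `lower=0.05, upper=0.95`, reproduces the stricter conventional cut-off discussed above.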
Figures S7–S9 illustrate what a threshold of α = [0.25, 0.75] means in terms of the actual counts of languages that qualify as peaks and troughs, respectively, for the different grid resolutions. Note that counts vary with local geography. Grid points located along the coastline capture smaller areas on land and thus often have smaller counts (e.g. the East coast of Africa in FP troughs). The opposite holds if coastal grid points are geographically exposed, for instance on the tip of a peninsula. In these circumstances, they have a lower number of neighbouring grid points to share random points with, which in turn increases the counts needed to qualify as peaks and troughs (e.g. the Horn of Africa in FP peaks).
3. Results
(a). Language and language family densities
In line with the strong spatial clustering observed in figure 1, the distribution of language counts around grid points is heavily skewed. For languages associated with food producers (FP), 43% of the grid points have zero languages as neighbours, 35% have between one and five, and 21% have over five. For languages associated with hunter–gatherers (HG), the distribution is even more skewed, with 73% of grid points counting zero languages as neighbours, 24% between one and five, and 3% over five. When grid points have any languages nearby at all, these languages mostly come from one or two distinct families (72% of grid points among FP, and 88% of grid points among HG languages). These are the observations in a 1 000 point resolution, where grid points are about 350 km apart. The maximum count in this grid is found near a point in Papua New Guinea, with 345 languages (electronic supplementary material, figure S4). While the absolute numbers are different, the distributions show similar patterns under higher and lower grid resolutions (electronic supplementary material, figure S3). Language counts are also largely independent of population sizes at grid points; the correlations are very weak (electronic supplementary material, figure S2).
When compared to the random baseline, the vast majority of counts (91% among FP and 81% among HG populations) have a probability below or above what is expected, i.e. they are troughs or peaks (see Material and methods). The distribution is similar at the level of families, with 79% peaks and troughs among FP and 76% among HG populations. Figure 3 shows the spatial distribution of peaks and troughs. Alternative resolutions show the same overall pattern (electronic supplementary material, S7).
Figure 3.
Peaks (red) and troughs (blue) of language and language family counts on a spherically uniform grid of 1 000 geographical points, across subsistence type (FP: food producers versus HG: hunter–gatherers). A count at a given grid point is a peak if its cumulative probability under a random location process is P(langC) > 0.75. A count is a trough if P(langC) < 0.25. Grey dots are grid points with counts that are consistent with the random baseline, 0.25 ≤ P(langC) ≤ 0.75 (see Material and methods). For alternative grid resolutions, see electronic supplementary material, figure S10. The distribution of peaks and troughs differs substantially between food producers and hunter–gatherers, but much less between languages and language families within the respective subsistence types.
Within each subsistence type, the family-level distributions are very similar to the language-level distributions. They are basically just reduced versions of the latter. By contrast, the distributions strongly differ between FP and HG populations. To assess the reasons for this, we now turn to environmental factors.
(b). Language densities and the environment
Pairwise inspection of peak and trough distributions in each variable suggests that peaks tend to be associated with different values than troughs in most variables (electronic supplementary material, S8). The association patterns are similar across subsistence types, and also across language and language family levels, but they are generally stronger among FP than among HG populations (electronic supplementary material, figure S11). This difference in association strength is confirmed by the results from the Random Forest classification model that we trained to predict peaks and troughs from environmental variables (see Material and methods). Figure 4 summarizes the classification errors across subsistence types and across the language and language family levels under a 1 000 point resolution; detailed confusion matrices and results for alternative resolutions and model assumptions are given in electronic supplementary material, S9.
Figure 4.
Classification errors for peaks (red) and troughs (blue) in global models (a) and in two regional zoom-in models capturing FP distributions in Africa and HG distributions in Australia (b). Classification errors are higher for peaks than for troughs across all models and higher for HG than for FP distributions for the global model. In the regional models, classification errors of FP in Africa are in agreement with the global model, and the same is true for HG in Australia on the language level. On the family level, however, classification errors in Australia are substantially smaller than in the global HG model. Results are robust across alternative grid resolutions (electronic supplementary material, figure S12), across different criteria for peaks and troughs (electronic supplementary material, figure S13), and across models that control for population size when estimating peaks and troughs (electronic supplementary material, figure S14). (Online version in colour.)
The accuracy of the Random Forest classification is higher than the degree of explained variation typically achieved by regression [12] or evolutionary models [36]. Performance is furthermore better for troughs than for peaks. Since we control for sample size when fitting the models (see Material and methods) this is unlikely to be an artefact of troughs being more frequent and therefore easy to classify as a default. A more likely reason for the increased model fit is that troughs are generally distributed over more differentiated environments (electronic supplementary material, figure S11). The combined effect of this across many variables facilitates classification.
In contrast to troughs, density peaks are affected by considerably higher classification errors. Here, figure 4 reveals a striking difference: peaks are much better predicted by environmental factors for FP than for HG populations, suggesting that FP peaks are distributed across more highly differentiated environments. At the level of languages, global error rates are 2.4 times higher for HG (36%) than for FP (15%) populations. At the level of families, the difference is slightly attenuated, but HG error rates are still about 1.8 times higher than FP error rates (46% versus 25%). This suggests that our results are robust against the phylogenetic auto-correlation that affects the spatial distribution of languages within families. The results are also robust against differences in grid resolution. Higher and lower grid resolutions (electronic supplementary material, figure S12) show very similar differences between FP and HG societies, with 2.1 (300 point grid) and 1.9 (3 000 point grid) times more errors for HG populations at the language level, and 2.8 (300 point grid) and 2.5 (3 000 point grid) times more errors for HG populations at the family level. Moreover, a sensitivity analysis shows that results are robust against different threshold values (α) used for classifying counts as peaks and troughs (electronic supplementary material, figure S13). Finally, our results are also robust against differences in population size. When population size is controlled for during peak and trough estimation (see Material and methods), HG populations still incur 2.3 times more errors than FP populations at the language level, and 1.7 times more at the family level (electronic supplementary material, figure S14).
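The ratios quoted for the global models follow directly from the reported error rates. As a small worked check in Python (the function name is a hypothetical helper; the percentages are those given in the text):

```python
def error_ratio(hg_error_pct, fp_error_pct):
    """How many times higher the HG error rate is than the FP error rate."""
    return hg_error_pct / fp_error_pct


# Global models, 1 000 point resolution (values from the text).
language_ratio = error_ratio(36, 15)  # language level: 36% versus 15%
family_ratio = error_ratio(46, 25)    # family level: 46% versus 25%
```

Rounded to one decimal, these give the factors of 2.4 and about 1.8 reported above.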
The difference between HG and FP populations furthermore persists when we zoom in on regional models in Africa and Australia (figure 4b), albeit in weaker form. At the language level, the error rates in the regional models replicate those in the global model, with 20% error for HG populations in Australia and 14% for FP populations in Africa. At the level of families, the global trend also persists in Africa (30% error, close to the 25% error for FP populations in the global model). However, the pattern is reversed in Australia, where the error rate for HG populations is only 11%, much lower than in any other model. We return to this exception in the discussion section.
Figure 5 shows the environmental variables with significant contributions to the model fit (classification success), as estimated by a permutation procedure (see electronic supplementary material, S10 for more detailed results). Overall, the weight of individual variables is relatively small, suggesting that classification success chiefly depends on the combined effects of variables.
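The general idea behind permutation-based importance can be sketched as follows: shuffle the values of one variable, re-evaluate the classifier, and record how much accuracy is lost. This Python sketch is illustrative only. It uses a hand-written threshold rule in place of a Random Forest, and all names, the number of repeats and the toy data are assumptions rather than the authors' procedure (see electronic supplementary material, S10 for theirs).

```python
import random


def accuracy(predict, rows, labels):
    """Share of rows for which the prediction equals the true label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(labels)


def permutation_importance(predict, rows, labels, column, n_repeats=20, seed=0):
    """Mean drop in accuracy when the values in one column are shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        permuted = [
            row[:column] + [value] + row[column + 1:]
            for row, value in zip(rows, shuffled)
        ]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / n_repeats


# Toy data: column 0 determines the label, column 1 is unrelated noise.
toy_rows = [[float(i), float((i * 7) % 5)] for i in range(20)]
toy_labels = ["peak" if row[0] >= 10 else "trough" for row in toy_rows]


def toy_predict(row):
    """Stand-in classifier that only looks at column 0."""
    return "peak" if row[0] >= 10 else "trough"


imp_informative = permutation_importance(toy_predict, toy_rows, toy_labels, column=0)
imp_noise = permutation_importance(toy_predict, toy_rows, toy_labels, column=1)
```

Shuffling the informative column lowers accuracy, whereas shuffling the noise column leaves it unchanged. When many variables each carry a small part of the signal, individual drops of this kind remain small even if the full model classifies well.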
Figure 5.
(a–d) Importance of individual variables for accurate classification of peaks and troughs. The y-axis tracks the decrease in prediction accuracy if a variable is removed from a model, as an index of its importance. Error bars show deviations (if any) across the spatial scales at which the variables are measured (table 2). The chart only contains variables whose contribution is significant under a permutation test (p < 0.05). As explained in the methods section, climate is mainly captured by latitude; the other climate variables enter the model only through the residuals they leave when regressed on latitude, i.e. through the variation they contribute in addition to sheer latitude.
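The treatment of climate variables described in the caption, where a variable enters only through what latitude leaves unexplained, can be illustrated with a simple least-squares regression. The Python sketch below assumes a linear fit on a single predictor; whether the authors used this exact specification is not stated here, and the function name and toy values are hypothetical.

```python
def ols_residuals(x, y):
    """Residuals of y after a simple linear regression on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Toy values for illustration only: temperature falling with latitude.
toy_latitude = [0.0, 10.0, 20.0, 30.0, 40.0]
toy_temperature = [27.0, 25.5, 21.0, 17.5, 12.0]
toy_residuals = ols_residuals(toy_latitude, toy_temperature)
```

By construction, the residuals are uncorrelated with latitude, so any contribution they make to the classification reflects variation beyond the latitudinal gradient.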
For FP populations at the language level, the model fit is predominantly informed by climate factors (latitude plus precipitation and temperature), population size, and terrain roughness. This is true both globally (figure 5a) and regionally (figure 5b), except that the mean temperature of the warmest quarter (variable ‘warmest’) does not contribute to the model fit in Africa, presumably because it is relatively stable there. The basic pattern is the same under alternative grid resolutions (electronic supplementary material, figure S15), except that distance to oceans is also relevant and that temperature has no impact under the coarsest (300 point) resolution.
At the family level, the FP model fit is driven by a more diverse set of variables than at the language level. The pattern is similar in the global and the regional model, except that in the regional Africa model, latitude is slightly less important. This reflects the fact that during the family spreads (chiefly the Bantu spreads), family diversity in Africa survived mainly in the Northern tropics, at similar latitudes (figure 3). Apart from the larger set of relevant variables, family-level models also differ from language-level models in the specific selection of variables: at the family level, population size is unimportant while ocean and river access, as well as elevation, make significant contributions to model fit. The overall pattern is again replicated in analyses under alternative grid resolutions (electronic supplementary material, figure S15).
The more diverse distribution that characterizes FP family-level models is recapitulated in the HG models at both the language and the family level, and across resolutions. Beyond the population size, climate, and terrain roughness variables that also inform FP models, HG models are significantly affected by the variables dist_ocean and dist_river, which capture ocean and river access (figure 5c). The regional models in Australia are different (figure 5d). Here, terrain and hydrology variables are more important, while climate variables only have a weak impact, essentially only through variation beyond latitude (which itself is not a significant contributor to model fit). Population size does not significantly improve model fit in Australia at either the language or the family level.
4. Discussion
One concern about language density estimates is the extent to which they indeed capture population diversity rather than simply patterns in population size or even just human presence. Our results show that while population size significantly informs model fit, it does so only at the language level. This suggests that population size has impacted language distributions only since population size increased to modern levels. Even in this case, however, the contributions are relatively weak since plots of language counts against population size suggest only very weak associations (electronic supplementary material, figure S2). Moreover, population size does not appear to interfere with the impact of environmental factors. The difference between the environmental impact on FP versus HG language densities is the same, regardless of whether population size is treated as an independent factor in the model or as a covariate when estimating language densities (electronic supplementary material, figure S14). These findings suggest that our models capture the relationship between environment and language density independently of variation in population size and human presence.
Turning to the three hypotheses we set out to test (table 1), our results best support a theory of higher environmental impact on FP than on HG language density, rejecting the null hypothesis of no difference. The difference persists across language and family levels, but it is weaker and more diverse at the family level. This reduction in effect size is consistent with the finding that the Mid-Holocene language spreads that family-level models capture are in general only weakly constrained by latitudinal, and therefore climate-related, patterns [37]. The effect is fully reversed, however, in Australia, where environmental factors are highly successful in classifying the density of HG language families (though not of languages). The likely reason for this exception is the rapid but continent-wide spread of Pama–Nyungan that severely reduced language density for some time [35]. The reduced peak density (figure 3) that resulted from this greatly facilitates classification based on environmental factors. Apart from this historically transient peculiarity, the global trend is that FP language density is more strongly driven by the environment than HG language density. The environmental variables that significantly contribute to FP model fit are concentrated on climate and, to a considerably weaker extent, terrain roughness. This is in line with the Ecological Risk Hypothesis [6,14]: a wet and warm climate allows more food production and makes groups less dependent on others, furthering social and linguistic diversification into smaller groups. These factors matter less for HG populations, which are more mobile and adaptive and therefore less at the mercy of climate risks. Consistent with this, we find that to the extent that the environment shapes HG language density at all, the set of significant variables is highly diverse, at both the language and the family level.
Our findings do not support theories that assume that post-Neolithic technology has reduced the impact of the environment on language density. This contradicts earlier findings [12,13]. The main reason for this discrepancy is likely to be the fact that we use a much larger sample and a new, exhaustive list of hunter–gatherer languages [21]. In addition, there are two methodological reasons that possibly contribute to explaining the discrepancy. First, we count languages near uniformly distributed points, while previous research counted them in polygons [12]. The polygon approach risks overestimating counts of small languages in the vicinity of large languages. For instance, even though Europe contains a range of small-area languages (Basque, Welsh, Sorbian, Romansh, etc.), its language density is far from that of a region such as Papua New Guinea, where all languages tend to cover small areas. Second, the Robinson projection that is used in earlier work [13] entails a correlation of polygon size with latitude. At lower latitudes, where FP populations dominate (figure 3), counts risk overestimation and this might outweigh impacts from the climate, which is relatively stable in the tropics and sub-tropics. Conversely, the latitude-size correlation risks underestimating counts at high latitudes, which are dominated by hunter–gatherers and show more variation in the environment.
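To make the point-based counting concrete, the sketch below places near-uniform points on a sphere and counts language locations within a fixed great-circle distance of each point. It is a simplified Python illustration under several assumptions that are not taken from the paper: a Fibonacci lattice is substituted for the grid construction used by the authors, each language is reduced to a single coordinate, and the 100 km radius, function names and toy locations are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def fibonacci_sphere(n):
    """Return n roughly evenly spaced (lat, lon) points in degrees."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n  # evenly spaced heights in (-1, 1)
        lat = math.degrees(math.asin(z))
        lon = math.degrees((i * golden_angle) % (2.0 * math.pi)) - 180.0
        points.append((lat, lon))
    return points


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def counts_near_points(grid, languages, radius_km):
    """Number of language locations within radius_km of each grid point."""
    return [
        sum(haversine_km(point, lang) <= radius_km for lang in languages)
        for point in grid
    ]


toy_grid = fibonacci_sphere(1000)
# Toy language locations placed exactly on two grid points (illustration only).
toy_languages = [toy_grid[10], toy_grid[10], toy_grid[500]]
toy_counts = counts_near_points(toy_grid, toy_languages, radius_km=100.0)
```

Because every point has the same search radius, counts of this kind do not depend on the size or shape of language polygons or on the map projection, which is the contrast with polygon-based counting drawn above.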
Our results are also incompatible with a theory of niche differentiation or displacement which assumes that HG and FP peaks are in complementary distribution, driven by the same environmental factors but with opposite directions of the effects. The map of peak and trough distributions in figure 3 at first suggests complementary distributions in most continents, but the strong differences in the predictive force and importance of environmental variables suggest that the environment affects language densities very differently in FP and HG populations. Thus, the apparent complementarity is likely to be an artefact of local spreads of farming in Eurasia and Africa and of the fact that before the recent European conquests, food production had not set foot at all in Australia.
The weak impact of environmental factors on HG language density raises the question of what drives the considerable spatial clustering that we detect nevertheless (figure 1). We leave this question for further research, but a promising explanation points to cultural factors in the form of ‘a self-organized property of preferential attachment behaviour whereby foraging populations preferentially occupied certain locations on the landscape to take advantage of material culture or symbolic resources’ ([38] p. 22).
5. Conclusion
Our findings suggest that environmental factors, especially climate, drive language densities in FP much more than in HG populations. This difference is robust against various controls of spatial resolution, population size, and model assumptions. Since differences between HG and FP populations approximate differences in pre- versus post-Neolithic population dynamics [16], we tentatively infer a substantial change in how the environment impacted language densities before and after the transition to food production. After this transition, the patterns appear to have remained relatively stable, since the higher impact of environmental variables on FP populations can also be detected, albeit to a weaker extent, in family-level models that capture the reduced density incurred by Mid-Holocene language spreads.
Our findings support the Ecological Risk Hypothesis [6], which predicts social and linguistic diversification from a population's food access. The ecological risk increases with the transition to agriculture and pastoralism because in this type of subsistence, food access is arguably more strongly subject to climate risks (e.g. lack of rainfall during crop or grass growth) than for a HG subsistence (which is more flexible). As a result, linguistic diversification and language density become inversely correlated with ecological risk: more linguistic diversification and social compartmentalization in low-risk (wet and warm) than in high-risk (dry and cold) climates. The many technological innovations (such as irrigation systems) that came in the wake of agriculture do not seem to have cancelled out the risks, at least not until very recently. By contrast, HG populations are more mobile and adaptive, and thus their linguistic diversification is less systematically driven by the environment and more open to cultural evolution and local contingencies.
From a more general perspective, our findings suggest that the transition to agriculture increased the role of the environment in how human populations are fragmented into separate groups. To a considerable extent, the environment has shaped the boundaries between languages and, with this, the social and economic networks within which food-producing populations operate. It is likely that the role of the environment in population diversity is currently decreasing again since agricultural technologies can increasingly override climate constraints. However, effects of this cannot yet be detected in the data because the diversification of languages into dialects and eventually further languages is a gradual process that usually takes several centuries and is likely to be slowed down even more by ongoing linguistic globalization.
Acknowledgements
We would like to thank the editor and the reviewers for their thoughtful comments, which helped to improve this manuscript.
Data accessibility
All R scripts required for reproducing the results presented in the paper can be downloaded from the following repository: https://github.com/curdon/linguisticDensity_ProcB_derungsEtAl. The repository additionally contains links to all input data, as well as preprocessed results.
Authors' contributions
C.D., R.W., and B.B. designed the research; M.K. compiled the data and performed preliminary analyses; C.D. performed the main analysis; C.D. and B.B. wrote the paper, with contributions from R.W.
Competing interests
We declare we have no competing interests.
Funding
Swiss National Science Foundation Sinergia grant no. CRSII1_160739; University of Zurich Research Priority Program ‘Language and Space’.
References
- 1. Hammarström H, Forkel R, Haspelmath M. 2017. Glottolog 3.0. Jena: Max Planck Institute for the Science of Human History. See http://glottolog.org (accessed on 15 May 2017).
- 2. Darwin C. 1871. The descent of man, and selection in relation to sex. London, UK: Murray.
- 3. Bromham L. 2017. Curiously the same: swapping tools between linguistics and evolutionary biology. Biol. Philos. 32, 855–886.
- 4. Nichols J. 1992. Linguistic diversity in space and time. Chicago, IL: University of Chicago Press.
- 5. Mace R, Pagel M. 1995. A latitudinal gradient in the density of human languages in North America. Proc. R. Soc. Lond. B 261, 117–121. (doi:10.1098/rspb.1995.0125)
- 6. Nettle D. 1999. Linguistic diversity. Oxford, UK: Oxford University Press.
- 7. Gavin MC. et al. 2013. Toward a mechanistic understanding of linguistic diversity. BioScience 63, 524–535. (doi:10.1525/bio.2013.63.7.6)
- 8. Pagel M, Mace R. 2004. The cultural wealth of nations. Nature 428, 275–278. (doi:10.1038/428275a)
- 9. Sutherland WJ. 2003. Parallel extinction risk and global distribution of languages and species. Nature 423, 276–279. (doi:10.1038/nature01607)
- 10. Gorenflo LJ, Romaine S, Mittermeier RA, Walker-Painemilla K. 2012. Co-occurrence of linguistic and biological diversity in biodiversity hotspots and high biodiversity wilderness areas. Proc. Natl Acad. Sci. USA 109, 8032–8037. (doi:10.1073/pnas.1117511109)
- 11. Axelsen JB, Manrubia S. 2014. River density and landscape roughness are universal determinants of linguistic diversity. Proc. R. Soc. B 281, 20141179. (doi:10.1098/rspb.2014.1179)
- 12. Currie TE, Mace R. 2009. Political complexity predicts the spread of ethnolinguistic groups. Proc. Natl Acad. Sci. USA 106, 7339–7344. (doi:10.1073/pnas.0804698106)
- 13. Currie TE, Mace R. 2012. The evolution of ethnolinguistic diversity. Adv. Complex Syst. 15, 1150006. (doi:10.1142/S0219525911003372)
- 14. Nettle D. 1998. Explaining global patterns of language diversity. J. Anthropol. Archaeol. 17, 354–374. (doi:10.1006/jaar.1998.0328)
- 15. Greenhill SJ, Wu CH, Hua X, Dunn M, Levinson SC, Gray RD. 2017. Evolutionary dynamics of language systems. Proc. Natl Acad. Sci. USA 114, E8822–E8829. (doi:10.1073/pnas.1700388114)
- 16. Sikora M. et al. 2017. Ancient genomes show social and reproductive behavior of early Upper Paleolithic foragers. Science 358, 659–662. (doi:10.1126/science.aao1807)
- 17. Naroll R. 1965. Galton's problem: The logic of cross-cultural analysis. Soc. Res. (New York) 32, 428–451.
- 18. Austin PK, Sallabank J (eds). 2011. The Cambridge handbook of endangered languages. Cambridge, UK: Cambridge University Press.
- 19. Nichols J. 2004. The origin of the Chechen and Ingush: a study in Alpine linguistic and ethnic geography. Anthropol. Linguist. 46, 129–155.
- 20. Good J. 2013. A (micro-)accretion zone in a remnant zone? Lower Fungom in areal-historical perspective. In Language typology and historical contingency (eds Bickel B, Grenoble LA, Peterson DA, Timberlake A), pp. 265–282. Amsterdam, The Netherlands: Benjamins.
- 21. Güldemann T, McConvell P, Rhodes R. 2018. The language of hunter–gatherers. Cambridge, UK: Cambridge University Press.
- 22. Binford LR. 2001. Constructing frames of reference: an analytical method for archaeological theory building using ethnographic and environmental data sets. Berkeley, CA: University of California Press.
- 23. Bickel B, Nichols J, Zakharko T, Witzlack-Makarevich A, Hildebrandt K, Rießler M, Bierkandt L, Zúñiga F, Lowe JB. 2017. The AUTOTYP typological databases, version 0.1.0. GitHub, https://github.com/autotyp/autotyp-data/tree/0.1.0.
- 24. Hijmans RJ, Cameron SE, Parra JL, Jones PG, Jarvis A. 2005. Very high resolution interpolated climate surfaces for global land areas. Int. J. Climatol. 25, 1965–1978. (doi:10.1002/joc.1276)
- 25. Klein Goldewijk K, Beusen A, Van Drecht G, De Vos M. 2011. The HYDE 3.1 spatially explicit database of human-induced global land-use change over the past 12,000 years. Glob. Ecol. Biogeogr. 20, 73–86. (doi:10.1111/j.1466-8238.2010.00587.x)
- 26. Openshaw S. 1983. The modifiable areal unit problem. Norwich, UK: Geo Books.
- 27. Fisher P, Wood J, Cheng T. 2004. Where is Helvellyn? Fuzziness of multi-scale landscape morphometry. Trans. Inst. Br. Geogr. 29, 106–128. (doi:10.1111/j.0020-2754.2004.00117.x)
- 28. Nichols J. 2008. Language spread rates as indicators of glacial-age peopling of the Americas. Curr. Anthropol. 49, 1109–1117.
- 29. Wales DJ, Ulker S. 2006. Structure and dynamics of spherical crystals characterized for the Thomson problem. Phys. Rev. B 74, 212101. (doi:10.1103/PhysRevB.74.212101)
- 30. Hijmans RJ, Williams E, Vennes C. 2012. geosphere: spherical trigonometry. R package.
- 31. Liaw A, Wiener M. 2002. Classification and regression by randomForest. R News 2, 18–22.
- 32. Altmann A, Tolocsi L, Sander O, Lengauer T. 2010. Permutation importance: a corrected feature importance measure. Bioinformatics 26, 1340–1347. (doi:10.1093/bioinformatics/btq134)
- 33. Chen C, Liaw A, Breiman L. 2004. Using random forest to learn imbalanced data. Technical Report No. 666, University of California, Berkeley, Department of Statistics (https://statistics.berkeley.edu/tech-reports/666).
- 34. Grollemund R, Branford S, Bostoen K, Meade A, Venditti C, Pagel M. 2015. Bantu expansion shows that habitat alters the route and pace of human dispersals. Proc. Natl Acad. Sci. USA 112, 13 296–13 301. (doi:10.1073/pnas.1503793112)
- 35. Bouckaert RR, Bowern C, Atkinson QD. 2018. The origin and expansion of Pama–Nyungan languages across Australia. Nat. Ecol. Evol. 2, 741–749. (doi:10.1038/s41559-018-0489-3)
- 36. Gavin MC. et al. 2017. Process-based modelling shows how climate and demography shape language diversity. Glob. Ecol. Biogeogr. 26, 584–591. (doi:10.1111/geb.12563)
- 37. Hammarström H. 2010. A full-scale test of the language farming dispersal hypothesis. Diachronica 27, 197–213. (doi:10.1075/dia.27.2.02ham)
- 38. Haas WR Jr, Klink CJ, Maggard GJ, Aldenderfer MS. 2015. Settlement-size scaling among Prehistoric hunter–gatherer settlement systems in the New World. PLoS ONE 10, e0140127. (doi:10.1371/journal.pone.0140127)