Version Changes
Revised. Amendments from Version 1
The new version contains an extended investigation of the relationship between Stan chain convergence and clade credibility. It also contains a link to all model and data preprocessing files used for the analyses. Further, minor changes to wording and clarity have been applied.
Abstract
Geographical confounding in phylogenetic inference models has long been an issue. Models often have great difficulty detecting whether congruences or similarities between languages in phylogenetic datasets stem from common genetic descent or from geographical proximity effects such as language contact. In this study, we introduce a distance-based Gaussian process approach with latent phylogenetic distances that can detect potential geographic contact zones and subsequently account for geospatial biases in the resulting tree topologies. We find that this approach is able to determine potential high-contact areas, making it possible to calculate the strength of this influence at both the tree level (clade support) and the language level (pairwise distances).
Keywords: Phylogenetics, Gaussian processes, geospatial confounding, control for language contact
1 Introduction
Phylogenetic linguistics started with the aim to apply phylogenetic inference algorithms from computational biology to data based on collections of cross-linguistic lexical and grammatical data. The goal of this enterprise is to infer phylogenies, i.e., tree diagrams representing the diversification history of language families. These family trees are interesting for historical linguistics in their own right, but they also provide stepping stones for investigations into deep population history, statistical control for typological studies, and several other applications (see, inter alia, Bouckaert et al., 2012; Bowern & Atkinson, 2012; Hruschka et al., 2015; Jäger & Wahle, 2021).
A well-known potential problem of phylogenetic inference algorithms in this domain is that they model language change exclusively as vertical transmission with random mutations. However, historical linguists have been well aware since the 19th century (Schmidt, 1872) that horizontal transmission via language contact plays an important role in language change. Modeling the effects of language contact as independent random mutations potentially introduces a bias. It also ignores an important source of information about language history and concomitant population history processes.
This problem has not gone unnoticed. Studies such as Greenhill et al. (2009) assess the robustness of phylogenetic inference. A novel approach by Neureiter et al. (2022) enriches phylogenies with contact edges, i.e., connections between simultaneous points on different branches of a phylogeny enabling information transfer. The phylogenetic skeleton and the contact edges are inferred simultaneously.
This approach, however, ignores an important source of information, namely the fact that language contact almost always happens under spatial proximity. It therefore seems natural to model the effect of language contact as a spatial stochastic process. Spatial statistics have recently gained popularity in cross-linguistic studies, e.g., to identify linguistic areas or as a statistical control for typological studies (Guzmán Naranjo & Becker, 2021; Ranacher et al., 2021). In this context, however, phylogenetic information is either not used at all or only in a rudimentary fashion.
The present paper presents a hybrid approach combining spatial with phylogenetic modeling. Briefly put, the probability of a certain language L having a certain feature f is assumed to depend on both the presence or absence of f in L’s spatial neighbors and in L’s genetic relatives. Phylogenies are obtained via statistical inference while controlling for language contact. By comparing the results with vanilla phylogenetic inference, regions of intense language contact can be identified.
First, we present the architecture and mechanics of the model. We then test it on datasets of language families with known contact effects, namely ASJP (Wichmann et al., 2016) in the modified form with cognate inferences produced by Jäger (2018) and Northeuralex (Dellert et al., 2020).
2 Model architecture
The goal of this model is to infer phylogenetic similarity between languages based on data while geographical confounds are accounted for. This can be achieved by building a model that treats phylogenetic similarity as a latent variable that is inferred as the residual similarity of two languages when linguistic similarity based on geographic proximity is accounted for. Such a model structure can be graphically visualized as the graph in Figure 1.
Figure 1. Graphical representation of the inference model.

Grey nodes represent observed nodes (i.e. data) while white nodes are inferred. Black arrows indicate model-internal connections while dashed arrows show the structure of information flow in the model.
In this model, the statistical model M takes in the geographical similarity G and the phylogenetic similarity P and relates them to the linguistic data D. Crucially, P is an inferred (latent) node which is the estimand in this analysis. More concretely, the model is built with G and D as inputs, whereas P remains in variable form. As a result, information in the model flows from the linguistic data and the geographic similarity to the model, which makes it possible to infer the phylogenetic similarity. The following model architecture was used for this analysis:
Here, the outcome C is a binary linguistic character of a certain language l and site s. C is modelled as a binomial outcome with a probability of p for this specific site and language. ρ is the phylogenetic similarity effect to be inferred from the latent variable and γ is the geographic similarity effect inferred from the geographical data. Both variables are modelled as Gaussian processes, drawn from multivariate normal distributions with a mean vector of 0 and the covariance matrices P and G, respectively. The sum of ρ and γ lies on the logit scale and enters the binomial process as p after the inverse-logit transformation, effectively making ρ_{s,l} and γ_{s,l} the log-odds values of site s of language l being inherited or borrowed.
The covariance matrices are derived from kernel functions: the kernel of the geographic similarity (G) is a quadratic kernel (cf. McElreath, 2018, ch. 14.5), and the kernel of P is an Ornstein–Uhlenbeck kernel. The kernels take in the pairwise distances (phylogenetic δ_P or geographic δ_G) between two languages and output the covariance matrix of all languages. The pairwise distances for the phylogenetic kernel are drawn as a latent variable from an exponential distribution with rate 1, while the parameters for the kernels are sampled from an exponential distribution with rate 2. This results in less prior probability being allocated to high maximum covariances and high covariance values for each language pair. In effect, this means that the model assumes a priori that there is little covariance between languages.
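The generative structure described above can be sketched as a forward simulation. This is a minimal illustration under stated assumptions, not the Stan model used in the analysis: the kernel forms, the parameter names `max_cov` and `decay`, the jitter term, and the made-up coordinates are all hypothetical, and the Ornstein–Uhlenbeck kernel is assumed to have the form exp(-δ / λ).

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distances between all rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def quadratic_kernel(dist, max_cov=1.0, decay=1.0):
    """Quadratic (squared-exponential) kernel; parameter names are hypothetical."""
    return max_cov * np.exp(-decay * dist**2)

def ou_kernel(dist, lam=3.0):
    """Ornstein-Uhlenbeck kernel, ASSUMED here to be exp(-dist / lam)."""
    return np.exp(-dist / lam)

def simulate_characters(geo_dist, phylo_dist, n_sites, rng):
    """Draw binary characters C[s, l] with logit(p[s, l]) = rho[s, l] + gamma[s, l]."""
    n_lang = geo_dist.shape[0]
    jitter = 1e-8 * np.eye(n_lang)  # numerical stabiliser, not part of the model
    G = quadratic_kernel(geo_dist) + jitter
    P = ou_kernel(phylo_dist) + jitter
    zero = np.zeros(n_lang)
    gamma = rng.multivariate_normal(zero, G, size=n_sites)  # geographic effect
    rho = rng.multivariate_normal(zero, P, size=n_sites)    # phylogenetic effect
    p = 1.0 / (1.0 + np.exp(-(rho + gamma)))                # inverse logit
    return rng.binomial(1, p)

rng = np.random.default_rng(0)
# Made-up inputs: four languages placed at random points so that both
# distance matrices are valid Euclidean distances.
geo_dist = pairwise_distances(rng.normal(size=(4, 2)))
phylo_dist = pairwise_distances(rng.normal(size=(4, 2)))
C = simulate_characters(geo_dist, phylo_dist, n_sites=5, rng=rng)
```

In the inference setting the direction is reversed: C and the geographic distances are observed, and the phylogenetic distances are the latent quantity to be estimated.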
Note that the Ornstein–Uhlenbeck kernel has a fixed parameter λ = 3. This follows a modelling decision that takes into account the latent property of the phylogenetic distances: were λ (the magnitude of the covariance between languages) inferred, there would be more than one solution for every covariance-distance pair. In practice, the distance function at δ = 1, λ = 1 has the same solution (≈ 0.368) as at δ = 2, λ = 2. If both parameters are inferred, both solutions are equivalent for the model. In other words, if the model infers a covariance of 0.368 between two languages, this is mathematically compatible with either δ = 1, λ = 1 or δ = 2, λ = 2. This means that if both δ and λ are inferred parameters, there cannot be a unique solution for any value, leading to extreme collinearity between both parameters. We therefore fixed λ to avoid this issue.
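The identifiability problem can be checked numerically. The sketch below assumes a kernel of the form exp(-δ / λ), which is one form that reproduces the value of ≈ 0.368 quoted above; the function names are hypothetical and the exact parameterisation in the model files may differ.

```python
import math

def ou_covariance(delta, lam):
    """ASSUMED kernel form: exp(-delta / lam)."""
    return math.exp(-delta / lam)

def distance_from_covariance(cov, lam=3.0):
    """With lam fixed, a covariance maps back to exactly one distance."""
    return -lam * math.log(cov)

# Two different (delta, lam) pairs give the same covariance.
cov_a = ou_covariance(delta=1.0, lam=1.0)
cov_b = ou_covariance(delta=2.0, lam=2.0)
print(round(cov_a, 3), round(cov_b, 3))

# Once lam is fixed at 3, the same covariance corresponds to a single delta.
delta_fixed = distance_from_covariance(cov_a, lam=3.0)
print(delta_fixed)
```

Under this assumed form, only the ratio δ / λ is constrained by the covariance, which is why fixing λ removes the ambiguity.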
3 Data and contact area selection
As datasets on which to test the model, we used the databases ASJP (Wichmann et al., 2016) with cognate inferences produced by Jäger (2018) and Northeuralex (Dellert et al., 2020). Both datasets are large multi-language lexical cognate databases intended primarily for phylogenetic inference. Northeuralex was specifically selected since Dellert (2019) provides a detailed prior study of geographical contact effects on loanwords, particularly in the Baltic Sea region. Therefore, the results of the study at hand, and thus the performance of the model, can be compared directly to a prior study.
From both datasets, a subset of languages (25 from ASJP and 24 from Northeuralex) was selected in such a way that known high-contact areas are reflected in the dataset on which the model’s performance can be tested. Those areas are: contact between Celtic and English in the British Isles; the Balkan Sprachbund that includes South Slavic languages alongside Albanian, Romanian, and Greek; the Baltic Sea area with contact between Germanic, Slavic, Uralic, and Baltic; the contact zone between Indo-Iranian languages. As a condition for the Gaussian process approach to succeed, we expect the model to recognize these zones by increasing the geographical covariance between languages in these areas, thus increasing the likelihood for individual cognates to be borrowed between constituent languages.
The datasets were adapted to a format suited for the model architecture presented in Section 2: after conversion, the data consist of binary vectors where each vector represents a shared cognate between two or more languages. Table 1 illustrates this data shape.
Table 1. Example data table to illustrate the data shape with two cognates and binary assignments for four hypothetical languages A–D.
| | A | B | C | D |
|---|---|---|---|---|
| Cognate 1 | 1 | 1 | 1 | 0 |
| Cognate 2 | 0 | 1 | 0 | 1 |
Here, hypothetical languages A, B, and C share hypothetical cognate 1 whereas only languages B and D share cognate 2. Due to this structure, the model can calculate a probability for each language to have that cognate given the geographic and phylogenetic distances to all other languages.
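The conversion into this binary shape can be sketched as follows. The input format (a mapping from cognate class to the set of languages attesting it) and the function name are assumptions made for illustration; the actual preprocessing files may organise the data differently.

```python
def cognate_matrix(cognate_sets, languages):
    """Return one binary row per cognate class: 1 if the language attests it."""
    return {
        cognate: [1 if lang in members else 0 for lang in languages]
        for cognate, members in cognate_sets.items()
    }

# Hypothetical input reproducing Table 1.
languages = ["A", "B", "C", "D"]
cognate_sets = {
    "Cognate 1": {"A", "B", "C"},
    "Cognate 2": {"B", "D"},
}
table = cognate_matrix(cognate_sets, languages)
for cognate, row in table.items():
    print(cognate, row)
```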
4 Results
We present the results of the analysis on three levels: firstly, we investigate the strength of the geographical confound between languages on a general level, before analyzing the effects the geographical control has on the tree topologies of the datasets. Lastly, we look at the results on a character level, comparing the model’s inference for the likelihood of individual borrowings in the Northeuralex dataset with the borrowings identified in Dellert (2019).
The model was run separately on each dataset so as not to add noise to the analysis due to potentially different coding schemes. Likewise, a ‘confounded’ model was run where the geographical variable was omitted. This allows for a direct comparison between the models with and without the geographical confound.
For modelling, we used the Bayesian programming language Stan (Stan Development Team, 2022). All results were based on posterior samples. Concretely, we extracted the posterior samples of the inferred pairwise phylogenetic distances between languages and constructed a UPGMA phylogram for each posterior sample. This yields a large number of trees from which clade support values were extracted and the consensus and maximum clade credibility (MCC) trees were constructed.
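The step from posterior distance samples to clade support can be sketched with SciPy, whose average-linkage clustering corresponds to UPGMA. This is a minimal illustration, not the pipeline used for the analysis: the helper names and the three toy "posterior samples" are hypothetical.

```python
from collections import Counter

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def upgma_clades(dist, names):
    """Return the non-trivial clades (as frozensets) of the UPGMA tree."""
    n = len(names)
    Z = linkage(squareform(dist, checks=False), method="average")
    members = {i: frozenset([names[i]]) for i in range(n)}
    clades = set()
    for step, (left, right, _height, _size) in enumerate(Z):
        merged = members[int(left)] | members[int(right)]
        members[n + step] = merged
        if len(merged) < n:  # skip the root, which contains every language
            clades.add(merged)
    return clades

def clade_support(distance_samples, names):
    """Fraction of sampled trees in which each clade occurs."""
    counts = Counter()
    for dist in distance_samples:
        counts.update(upgma_clades(dist, names))
    return {clade: k / len(distance_samples) for clade, k in counts.items()}

def sym(ab, ac, bc):
    """Symmetric 3x3 distance matrix for languages A, B, C."""
    return np.array([[0, ab, ac], [ab, 0, bc], [ac, bc, 0]], dtype=float)

# Three made-up samples: A and B are closest in two of them, B and C in one.
samples = [sym(1.0, 3.0, 3.2), sym(0.8, 2.5, 2.9), sym(2.0, 2.4, 0.9)]
support = clade_support(samples, ["A", "B", "C"])
consensus = {clade for clade, s in support.items() if s > 0.5}
```

The final line applies the inclusion rule used for the consensus trees, keeping only clades whose support exceeds 0.5.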
It was observed that the pairwise distances in the four-chain setup of the analysis did not achieve convergence. However, when analysing the individual chains, they yield the same tree topology. We found that this convergence issue is in part due to the fact that the same tree topology can be expressed with different distance matrices. For example, the following hypothetical upper-triangular distance matrices, M 1 and M 2, yield equivalent tree topologies between three hypothetical languages (not considering branch length).
Here, matrix M 2 is a scalar multiple of M 1, meaning that the relative pairwise distances are equal. Since these distances are inferred as a latent variable in the model, each chain of the Bayesian model can converge at either M 1 or M 2, yielding the same tree topology but, by virtue of having converged on different values, inter-chain convergence is not reached. As a result, one needs to check model accuracy based on the tree topologies obtained from posterior samples directly. As described above, the different sampling chains of the models yielded tree topologies with minor differences in topology and clade support (see discussion in Section 6.3 in the appendix).
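The scale invariance of the topology can be verified numerically. In the sketch below, the two matrices are hypothetical stand-ins written as full symmetric matrices; they are not the M 1 and M 2 of the original example, and the scaling factor of 2.5 is arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def upgma_merges(dist):
    """Return the merge pairs and merge heights of the UPGMA tree."""
    Z = linkage(squareform(dist), method="average")
    return Z[:, :2].astype(int).tolist(), Z[:, 2]

# Hypothetical distances between three languages.
M1 = np.array([[0.0, 1.0, 4.0],
               [1.0, 0.0, 3.0],
               [4.0, 3.0, 0.0]])
M2 = 2.5 * M1  # same relative distances, different absolute values

pairs1, heights1 = upgma_merges(M1)
pairs2, heights2 = upgma_merges(M2)
print(pairs1 == pairs2)     # identical merge order, i.e. identical topology
print(heights2 / heights1)  # branch heights differ by the scaling factor
```

Two chains that settle on such rescaled solutions would disagree on every individual distance while agreeing on the tree, which is consistent with the behaviour described above.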
4.1 Evaluating the model
Before we can investigate the model results concerning the accuracy in identifying contact effects, we need to ascertain whether the model succeeds at identifying the major phylogenetic clades. Recall that the goal of this analysis is not to infer the correct tree topology of the given languages. Rather, we analyse the tree topologies here solely to check if the model is able to capture the clades well established in earlier research. The consensus trees constructed from the posterior samples of the model show that the model was able to capture the general topology of relationships between the languages in question (see Figure 2 and Figure 3). As reference trees for this comparison, we use the trees in Jäger (2018) and Dellert (2019) for the ASJP and Northeuralex databases respectively.
Figure 2. Consensus tree (cladogram) of the ASJP language similarity inference.
The values at each internal node indicate the posterior clade support of that node. Clade support threshold for inclusion in the consensus tree was 0.5.
Figure 3. Consensus tree (cladogram) of the Northeuralex language similarity inference.
The included language families are Indo-European and Uralic. The values at each internal node indicate the posterior clade support of that node. Clade support threshold for inclusion in the consensus tree was 0.5.
The models correctly identify several clades with high support. In the ASJP dataset, these include Slavic, Indo-Iranian, Romance, Germanic, and Celtic. In the Northeuralex dataset, they include Germanic (subdivided into North and West Germanic), Balto-Slavic, Slavic (split into East and West Slavic), Baltic, and Uralic (further divided into Finnic and Sámi). Further, in the ASJP dataset, Albanian is identified as an Indo-European outgroup, as is common in such analyses (cf. Ringe et al., 2002, p. 90). However, the model was not able to retrieve more opaque relationships that are sometimes proposed, such as Graeco-Armenian (Clackson, 1994; Jørgensen, 2022) or Italo-Celtic (see discussion in Eska, 2009; Weiss, 2022). Even if their status as subgroups is unclear, these groupings frequently emerge as subgroups in phylogenetic analyses, e.g. in Ringe, Warnow, and Taylor (2002), Chang et al. (2015), and Heggarty et al. (2023).
The level of support that well-established clades exhibit in this model indicates that this inference technique was successful. Despite the drawbacks of distance-based methods (see Section 5), the method can be deemed accurate enough to continue analyzing the results.
4.2 Confound strength
Figure 4 and Figure 5 show the geographical confound strengths between each language pairing. Those strengths are the normalized covariance values of the geographical covariance matrices. The line segments in the figures indicate the strength of the geographical confound. From the modelling perspective, the confound strength has to be interpreted as the geographically conditioned similarity of the linguistic data. In other words, if two languages exhibit a strong geographical signal, it means they are in part similar because of the geographical variable. Interpreted from the perspective of phylogenetic similarity, strong geographical covariance between two languages means that they are less similar phylogenetically than they seem, as some of the similarity can be attributed to geographical proximity.
Figure 4. Strength of the geographical confounds between selected languages in the ASJP dataset.
Figure 5. Strength of the geographical confounds between selected languages in the Northeuralex dataset.
Both maps show that strong geographical covariances are inferred for the Balkan region, northwest Europe, the northeastern Baltic Sea area and northern Scandinavia. This coincides with the previous assumptions about the linguistic contact zones. Further, strong pairwise covariances can be detected for Italian and Corsican, Czech and Polish, and Western and Eastern Armenian. The findings in Figure 5 are compatible with previous computational analyses of contact lines such as Dellert (2019, pp. 262–265), which show strong contact effects especially in the area of the eastern Baltic coastline and northern Scandinavia.
4.3 Clade-level analysis
For the tree-level analysis, we compare the posterior consensus trees of the model with and without the geographical variable. Going forward, the pure phylogenetic model is referred to as the confounded model while the deconfounded model refers to the model with geographical control.
Firstly, we investigate changes in the consensus trees between the two models. Figure 6 shows a direct comparison between the two consensus trees of the ASJP dataset.
Figure 6. Comparison of the consensus trees of the ASJP dataset (created using the R package phytools; Revell, 2012).
Left: confounded tree.
Right: deconfounded tree.
In this tree comparison, we can observe that Greek was moved from being clustered with Romance to a not clearly determinable outgroup position within Indo-European, a reclassification that is more in line with the consensus view of Greek not being a Romance language. Further, Corsican is then clustered more closely with Italian instead of French.
However, not in line with the consensus view is that in the geographically controlled model, Polish and Czech are no longer clustered together, but are moved to an indeterminate position within Slavic. This suggests that the model might be over-correcting for geospatial effects in this instance (see discussion in Section 5).
The tree topologies of the Northeuralex dataset do not show differences between the confounded and deconfounded consensus trees. This does not mean that the geographic control did not affect the results, merely that the differences in clade support do not change a clade support from > 0.5 to < 0.5 or vice versa since the consensus trees show a clade whenever support is > 0.5.
As a next step, we compare the differences in tree support for individual clades between the confounded and deconfounded models on both ASJP and Northeuralex databases. Table 2 and Table 3 show the ten clades with the strongest differences (both positive and negative). A reduction in clade support when transitioning from the confounded to the deconfounded model indicates that adding the geographical control variable leads the model to interpret some of the similarities among the constituent languages as contact-induced rather than phylogenetic. Conversely, an increase in support indicates that the clade in question is phylogenetically more plausible based on the geographical analysis.
Table 2. Differences in clade support in the ASJP database between the confounded and the deconfounded models.
Column change indicates the change in tree support when switching on the geographical control.
| change | difference | clade constituents |
|---|---|---|
| 0.43 → 0.16 | -0.27 | Greek, Italian, French, Corsican, Danish, Dutch, English, Icelandic, Romanian |
| 0.60 → 0.39 | -0.21 | Czech, Polish |
| 0.60 → 0.40 | -0.20 | Greek, Italian, French, Corsican, Romanian |
| 0.34 → 0.15 | -0.20 | Italian, French, Romanian |
| 0.92 → 0.74 | -0.17 | Urdu, Hindi, Persian |
| 0.00 → 0.25 | 0.25 | Latvian, Danish, Dutch, English, Icelandic |
| 0.06 → 0.26 | 0.20 | Urdu, Hindi, Persian, Italian, French, Corsican, Romanian |
| 0.05 → 0.25 | 0.20 | Czech, Polish, Bulgarian, Macedonian |
| 0.44 → 0.60 | 0.17 | Italian, Corsican, Romanian |
| 0.07 → 0.23 | 0.16 | French, Romanian |
Table 3. Differences in clade support in the Northeuralex database between the confounded and the deconfounded models.
Column change indicates the change in tree support when switching on the geographical control.
| change | difference | clade constituents |
|---|---|---|
| 1.00 → 0.60 | -0.40 | English, Dutch, German |
| 1.00 → 0.60 | -0.40 | Icelandic, Norwegian (Bokmål), Danish, Swedish |
| 0.26 → 0.15 | -0.11 | Inari Sami, Skolt Sami |
| 0.26 → 0.15 | -0.11 | Inari Sami, Skolt Sami, Kildin Sami |
| 0.26 → 0.15 | -0.11 | Lule Sami, Southern Sami, Northern Sami |
| 0.72 → 0.85 | 0.13 | Lule Sami, Southern Sami, Northern Sami, Inari Sami |
| 0.74 → 0.85 | 0.11 | Skolt Sami, Kildin Sami |
| 0.33 → 0.41 | 0.08 | Lule Sami, Southern Sami |
| 0.66 → 0.72 | 0.06 | Northern Sami, Inari Sami |
| 0.83 → 0.88 | 0.05 | Belarusian, Russian |
It has to be noted, however, that these changes in clade support are not independent. In essence, an increase in clade support for one clade could be caused by a reduction in the competitor clade. For example, the increase in clade support for Italian, Corsican, and Romanian we observe in Table 2 might be caused by the fact that the model downweighs the clade consisting of Italian, French, and Romanian based on inferred contact strength. It is therefore not necessarily the case that these three languages are more likely to constitute a clade based on geography, rather that the alternative hypothesis – Italian, French, and Romanian – has been deemed less likely.
In Table 2, we see that the largest reduction in clade support is in a clade consisting of the Romance and Germanic languages plus Greek, which is strongly decreased due to geographical effects. As observed above, support for the clade of Czech and Polish, as well as for a clade grouping Greek with the Romance languages, was reduced. Moreover, the support for Indo-Iranian as well as for a clade of Italian, French, and Romanian was decreased. This indicates that some of the similarities between those languages are partly geographically conditioned.
The results from the Northeuralex database show many changes pertaining to Sámi, particularly identifying contact networks in Lule Sami, Southern Sami, and Northern Sami and Inari Sami, Skolt Sami, and Kildin Sami. Further, the North Germanic (Icelandic, Norwegian (Bokmål), Danish, and Swedish) and West Germanic (English, Dutch, and German) clades receive less support in the deconfounded model.
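The relationship between changes in clade support and changes in the consensus trees can be illustrated with a few values taken from Table 2 and Table 3. The sketch below is a minimal check, assuming the inclusion rule of support > 0.5 stated for the consensus trees; the function names are hypothetical.

```python
def in_consensus(support, threshold=0.5):
    """A clade is shown in the consensus tree when its support exceeds the threshold."""
    return support > threshold

def crosses_threshold(before, after, threshold=0.5):
    """True if the clade enters or leaves the consensus tree."""
    return in_consensus(before, threshold) != in_consensus(after, threshold)

# (confounded, deconfounded) support values copied from Table 2 and Table 3.
changes = {
    "Czech, Polish (Table 2)": (0.60, 0.39),
    "Greek, Italian, French, Corsican, Romanian (Table 2)": (0.60, 0.40),
    "Italian, Corsican, Romanian (Table 2)": (0.44, 0.60),
    "English, Dutch, German (Table 3)": (1.00, 0.60),
    "Inari Sami, Skolt Sami (Table 3)": (0.26, 0.15),
}
flags = {name: crosses_threshold(b, a) for name, (b, a) in changes.items()}
for name, changed in flags.items():
    print(f"{name}: consensus membership changed = {changed}")
```

The ASJP rows cross the threshold, matching the topology changes visible in Figure 6, whereas the Northeuralex rows change by up to 0.40 without crossing it, so the consensus tree remains the same.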
4.4 Language-level analysis
In a second step, we can retrieve the pairwise phylogenetic distances of both models and compare the changes. Doing this enables the direct comparison of language pairs which decreased/increased their phylogenetic distance due to adding the geographical variable. Table 4 and Table 5 display the top 10 distances for each dataset, demonstrating the most significant changes.
Table 4. Top ten changes in pairwise phylogenetic distance in the ASJP dataset between the confounded and deconfounded models.
| difference (norm.) | Language pair |
|---|---|
| -0.34 | Macedonian, Albanian |
| -0.28 | Macedonian, Latvian |
| -0.26 | Bulgarian, Albanian |
| -0.20 | Russian, Latvian |
| -0.19 | Czech, Latvian |
| 0.56 | Albanian, Greek |
| 0.47 | Albanian, Persian |
| 0.46 | Albanian, Eastern Armenian |
| 0.36 | Corsican, Albanian |
| 0.35 | Albanian, Western Armenian |
Table 5. Top ten changes in pairwise phylogenetic distance in the Northeuralex dataset between the confounded and deconfounded models.
| difference (norm.) | Language pair |
|---|---|
| -0.10 | North Karelian, Swedish |
| -0.09 | Olonets Karelian, Swedish |
| -0.07 | North Karelian, Norwegian (Bokmål) |
| -0.07 | Finnish, Swedish |
| -0.04 | Veps, Swedish |
| 0.23 | Kildin Sami, Swedish |
| 0.21 | Polish, Finnish |
| 0.21 | Belarusian, Finnish |
| 0.19 | Russian, Finnish |
| 0.19 | Northern Sami, Danish |
In Table 4, we find that, predominantly, distances involving Latvian and Albanian changed the most between the two models: the phylogenetic distances between Albanian and other languages in the Balkan Sprachbund were reduced while distances between other languages were increased. However, the increase in similarity between these languages calls for comment: according to the consensus tree (Figure 2), Albanian is still an outgroup even under the deconfounded model. The changes in the pairwise distances therefore mean that Albanian has been, relatively speaking, moved away in phylogenetic similarity from the other Balkan Sprachbund languages, which resulted in an inevitable decrease in distance from other languages. As can be seen in the consensus tree, this does not mean that Albanian is therefore particularly close to e.g. Persian, just that a movement away from geographically close languages results in an overall decrease in distance from other languages.
Further, Latvian shows decreased similarity with both western and eastern Slavic languages.
In the Northeuralex dataset, we find weaker changes overall; most of the reduction in phylogenetic similarity is found in pairwise distances between Germanic and Sámi languages on the Scandinavian Peninsula. Conversely, the model reports increases in phylogenetic similarity between Finnish and some Indo-European languages, which goes against the common consensus. The reason for this result may be that the model overcompensates for Finnish showing geographical effects in the Uralic context by moving it closer to other languages instead.
4.5 Character-level analysis
Recall that the model infers the geographical and phylogenetic distances between languages on the character level, which means that for every character, the model calculates a corresponding log-likelihood value for the character to be geographically conditioned (γ in the model formula in Section 2). High values of this parameter for a certain character indicate that it has a strong geographical conditioning; i.e., the character is likely to be borrowed. If the model succeeds at finding geographical contact patterns, we would expect to find that the actual loanwords we see in the languages in question correspond to high values of γ in the model results.
To evaluate the character-level accuracy of the inferences, we relied on the gold-standard loanword dataset provided in the supplements of Dellert (2019), which is itself a modification of the World Loanword Database (WOLD) (Haspelmath & Tadmor, 2009). This enables the direct comparison between the Northeuralex dataset model output and the loanword database. For this analysis, we selected all loanwords where the source and target languages are present in the model, which yields a total of 82 borrowed characters. Figure 7 shows the distributions of log-likelihood values for the geographical variable for each character by whether or not they are indicated as loans in Dellert (2019).
Figure 7. Distribution of per-character geographical log-likelihood by whether or not they are loanwords.
The figure shows that the borrowed characters are attributed a larger log-likelihood in the model. To further corroborate the difference between the groups, we ran a univariate Bayesian regression model with standard normal priors using the R package brms (Bürkner, 2017). In this model, the log-likelihood was the outcome variable and the binary variable of the borrowings was the predictor. The results show a detectable difference between borrowings = ‘yes’ (mean 0.63, 95% CI [0.42, 0.84]) and borrowings = ‘no’ (mean -0.40, 95% CI [-0.41, -0.40]). We can therefore conclude that the model succeeds in assigning borrowed words a higher log-likelihood. However, the difference is only present in population means. Concretely, although the model assigns borrowings a higher geographical conditioning on the population level, it fails at clearly identifying individual characters as borrowings. An accurate classification of characters was therefore not possible. This is crucial for evaluating the usefulness of the model, since it means that the model cannot be used for identifying loanwords.
5 Advantages, biases and limitations of the approach
After evaluating the results on several levels of analysis, we now turn to a more in-depth scrutiny of the implications and usefulness of the model for phylogenetic research.
5.1 Summary of the analysis
The analysis has shown that the Gaussian process model approach yields mostly accurate inferences of contact zones and their effects on phylogenetic inference. The model succeeds in many respects at deconfounding phylogenetic distances between languages. However, the model seems to overestimate certain confounds especially with languages that are both highly phylogenetically related and geographically close (see discussion in Section 5.2).
Overall, however, the method provides insights at different levels of analysis, which is useful to investigate micro-level effects such as pairwise distances as well as macro-level effects on a tree topology or individual clades. Despite this, the granularity is not infinitely scalable as the model cannot be used for loanword classifications (see discussion in Section 4.5).
Moreover, it has been noticed that due to the strong correlation between reductions and increases in clade support or pairwise language distances, it is not always discernible whether an increase in tree support or language distance for a clade or pair of languages is due to a reduction in another part of the tree.
5.2 Biases and limitations that reduce the efficacy of the model
Several issues limit the accuracy of the approach; some of which are methodological limitations due to how the model is set up, and some of which are biasing factors that interfere with the model’s inferences. This section is intended to sketch out these issues along with some considerations about how to address them.
First of all, the most obvious limitation of this approach is that it relies on a distance-based architecture rather than inferring character evolution as most state-of-the-art phylogenetic models do. This is an issue that cannot be overcome without entirely re-designing the model under an evolutionary model paradigm: the Gaussian processes used in this study are inherently distance-based as they take in information about geographical and phylogenetic relationships in the form of pairwise distances. Related to this point, the model cannot capture historical contact lines accurately, as the geospatial distance matrices used in these analyses are stationary across time. In effect, this approach does not model changes in contact strength over time as would be possible in a character-evolution model.
A related limitation that encompasses several sub-issues is the inability of singular Gaussian process kernels to handle non-positive-definite matrices. This means that each one of the categories (languages in this case) needs to be proportionally distant from any other category. This means that if there are three languages A, B, and C, A cannot be at the same time close to B but different from C without B also being different from C. Consider the two distance matrices D 1 and D 2.
Here, D 1 yields a positive definite matrix outcome of the GP kernel while D 2 does not. The reason for this is that D 2 is not representable in 2D Euclidean space since the distances between the points are not proportional.
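The effect can be reproduced with a small numerical check. The two matrices below are hypothetical stand-ins and not the D 1 and D 2 of the original example, and the quadratic kernel is assumed to be exp(-d²) with all parameters set to 1. The first matrix places three languages on a line; the second makes A and C both very close to B but far from each other, which violates the triangle inequality.

```python
import numpy as np

def quadratic_kernel(dist):
    """Quadratic kernel with unit parameters (an assumption for this sketch)."""
    return np.exp(-dist**2)

def min_eigenvalue(matrix):
    """Smallest eigenvalue; positive definiteness requires it to be > 0."""
    return float(np.linalg.eigvalsh(matrix).min())

def violates_triangle_inequality(dist, tol=1e-12):
    """True if some d[i, k] exceeds d[i, j] + d[j, k]."""
    n = dist.shape[0]
    return any(
        dist[i, k] > dist[i, j] + dist[j, k] + tol
        for i in range(n) for j in range(n) for k in range(n)
    )

# Languages A, B, C on a line: representable in Euclidean space.
D_ok = np.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 1.0],
                 [2.0, 1.0, 0.0]])
# A and C both close to B, yet far from each other: not representable.
D_bad = np.array([[0.0, 0.1, 3.0],
                  [0.1, 0.0, 0.1],
                  [3.0, 0.1, 0.0]])

for name, D in [("D_ok", D_ok), ("D_bad", D_bad)]:
    print(name, violates_triangle_inequality(D), min_eigenvalue(quadratic_kernel(D)))
```

In this sketch the second matrix produces a kernel matrix with a negative eigenvalue, so it cannot serve as a covariance matrix for a multivariate normal distribution.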
Why is this issue important in linguistic datasets? Geospatial (or, for that matter, all contact-induced) relationships between languages cannot be accurately be represented as points in Euclidean space. That is, some languages have several contact points with other languages or nonlinear contact relationships. For example, French is in close contact with both Flemish and Spanish, hence the pairwise distances between French and Flemish and French and Spanish would both be very small while the distances between Flemish and Spanish would be much greater. Further, there can be long-distance or nonlinear contacts such as contacts between English and Spanish in the Americas or French and Arabic in Northern Africa. Additionally, there are contact relationships that are non-geographic, such as Latin influence on several central and western European languages in the middle ages. One could argue, however, that a point-like representation is a sufficiently informative abstraction, yet in such models, we cannot rule out potential mismatches between the Euclidean representations and the in reality observable contacts.
This modelling issue could be addressed by giving each language a covariance matrix of its own that is inferred separately at each step from a custom distance matrix with bespoke distances based on topology and actual contact lines. Doing so, however, would increase computation times by a factor equal to the number of languages (i.e., for the approach at hand, by a factor of 25). Even with a moderate number of languages in the dataset (10–30), this would result in computation times of several weeks, even with fully optimized code.
6 Conclusion
We have shown in this study that the proposed Gaussian process model can identify areas, clades and language pairs that have been subject to geospatial effects, making them seem genetically closer than they are. This approach succeeds at partially removing those geospatial confounds, thereby mitigating the risk of mistaking contact-induced similarity between languages for support for closer genetic relationships.
While this model is effective overall, we identified some weaknesses that are due to how Gaussian processes in general handle distance matrices and the known issue that languages are represented as point-like units in such models. Further, we found that the character-based loanword identification failed in the model, meaning that the model in its current form cannot be used to identify borrowings in the dataset, as the model accuracy is too low on such a fine-grained level. We identified problems with chain convergence in the model stemming from the Gaussian process setup and the large number of latent parameters. However, these issues can be mitigated by running more chains and averaging over the posterior samples.
Acknowledgements
Many thanks to Johannes Dellert and Johanna Nichols for their input and comments on the project. Many thanks also to Miri Mertner and Molly Rolf for commenting on the final draft. This research was supported by the DFG Centre for Advanced Studies in the Humanities Words, Bones, Genes, Tools (DFG-KFG 2237) and by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant agreement 834050).
Funding Statement
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No [834050]) and from the DFG Centre for Advanced Studies in the Humanities Words, Bones, Genes, Tools (DFG-KFG 2237).
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
[version 2; peer review: 1 approved, 3 approved with reservations, 1 not approved]
Footnotes
1The structure of the data is explained in Section 3.
2In fact, this is true for any real number where δ = λ.
3The code to recreate the datasets as well as the code for the models can be found at 10.5281/zenodo.10247032.
4The same figures can be found in the appendix without the map data and with language labels ( Figure 8, Figure 9).
5Note that in order to make the distance matrices comparable, the pairwise distances were normalized such that the largest inferred distance corresponds to 1 in each model before calculating the differences between the models. A change of -0.34 means that from the confounded to the deconfounded model, the pairwise distance between the two languages in question shrank by 34 percent of the largest phylogenetic distance.
6This, however, is only a cursory explanation, as the mathematical reason why positive definiteness is not fulfilled is grounded in the definition of the Gaussian process kernel.
Appendix
6.1 List of languages included
Languages selected from the ASJP database:
Hindi, Urdu, Persian, Western Armenian, Eastern Armenian, Romanian, Italian, Corsican, French, Greek, English, Dutch, Icelandic, Danish, Polish, Czech, Macedonian, Bulgarian, Russian, Ukrainian, Latvian, Scottish Gaelic, Irish Gaelic, Albanian
Languages selected from the Northeuralex database:
Danish, Norwegian, Swedish, Icelandic, German, Dutch, English, Russian, Belarusian, Polish, Lithuanian, Latvian, Veps, Olonets Karelian, North Karelian, Finnish, Livonian, Estonian, Inari Sami, Northern Sami, Southern Sami, Lule Sami, Kildin Sami, Skolt Sami
6.2 Topologies of confounded trees
Figure 8. Strength of the geographical confounds between selected languages in the ASJP dataset, map data removed.
Figure 9. Strength of the geographical confounds between selected languages in the Northeuralex dataset, map data removed.
6.3 A note on convergence
As discussed in Section 4, some convergence issues were noticed that can partially be explained by the fact that distance matrices that are multiples of each other yield the same tree topology. Nevertheless, we found further convergence issues that resulted in disagreements between the topologies of the consensus trees of the individual chains. Those disagreements were chiefly found in the sub-structure of the Germanic clade and the clade support values of East Slavic and Sámi in the Northeuralex dataset, and Germanic, Romance, and Slavic in the ASJP dataset.
We calculated that the average standard deviation of node support values in the consensus trees is 0.016 (ASJP) and 0.031 (Northeuralex), while the same measure for clades with support higher than ten percent is 0.077 (ASJP) and 0.069 (Northeuralex).
This means that the chains mostly disagree in low-support clades, which can be interpreted to mean that individual chains tend to rate some low-support nodes higher than others while the general tree structure is in agreement between the chains. Nevertheless, the standard deviations of node support are high enough to be a potential issue for this method and warrant further scrutiny. To analyze further how convergence issues arise in this approach, we use the following experimental design: the investigation proceeds as a simulation-based grid search that simulates phylogenies and geographic distances for varying numbers of languages and observed sites. Then, the model presented in the main approach is run on each simulated dataset. Further, we aim to investigate whether covariance between phylogenetic and geographic distance is responsible for convergence issues in the model, which is why we add a covariance term to the grid search (see below). Each simulated run goes through the following steps:
1. Obtaining a phylogeny and simulated character set: A random tree topology is generated with exclusively extant taxa, with the number of taxa specified by the associated parameter of the simulation. Taking this topology, a data generation function simulates the evolution of a binary character (‘0’ and ‘1’, with ‘0’ being the character at the root) by selecting one edge in the topology whose descendant nodes receive the innovated character ‘1’. The edge is selected with a probability proportional to its length. This simulates a character evolution process with a uniform likelihood of a change in state across all lineages. Since exactly one innovation occurs in the tree over the course of the evolution process, homoplastic events do not occur. To simulate loss of the trait, each node, together with its descendants, switches to ‘0’ with a probability of 10%. As a result, we simulate a uniform-rate character evolution with one innovation per site and a loss rate of 10%. This process is repeated for every site specified by the number of sites parameter of the simulation.
2. Obtaining a geographic distance: For each taxon, we generate a random point on a square two-dimensional surface with bounds 0 and 3. Afterwards, we calculate the Euclidean distances between all pairs of points and obtain a geographic distance matrix that can be fed to the model. To track the effects of potential covariance of phylogenetic and geographic distance on convergence, we run the models both with the aforementioned randomly generated geographic distance and with a correlated geographic distance. For the latter, we calculate the cophenetic distance between the extant taxa in the simulated tree and multiply it by 1.2. The resulting matrix is a multiple of the phylogenetic distance in the tree. The correlation between phylogenetic and geographic distance is therefore 1, which is the worst-case scenario where geographic distance is entirely predictable from phylogenetic distance. We opted for this worst-case implementation (although this correlation strength is unrealistic in real-world cases) since, if covariance plays a role in chain convergence, we can gauge its maximal effect. Thus, the effect we see with the maximally covariant matrices is the strongest it can be.
3. Running the model: The model was run on the simulated dataset and geographical distance.
4. Obtaining split frequencies and convergence metrics: After the model run, the average r_hat was extracted both for all parameters and for the inferred phylogenetic distance matrix alone, in order to investigate whether overall convergence differs from convergence of the phylogenetic distances. Further, the posterior phylogenetic distance matrices of each sample were used to construct a tree for each of those samples, for which the split frequencies of each individual split in the trees were calculated. This was done for each chain separately to obtain the mean standard deviation of the split frequencies for each one of the chains. This was done both for the clades that had majority support and for all clades to be able to distinguish convergence in majority-supported clades.
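The data-generating steps 1 and 2 can be sketched roughly as follows. This is an illustrative reading of the description above, not the code used for the study: the tree generator, the range of edge lengths, the restriction of losses to the innovated clade, and all function names are assumptions made for the sketch.

```python
import math
import random

def random_tree(n_taxa, rng):
    """Join random pairs of subtrees until one root remains (assumed generator).
    Tips are 0..n_taxa-1; every non-root node owns the edge above it."""
    children, edge_length = {}, {}
    active, next_id = list(range(n_taxa)), n_taxa
    while len(active) > 1:
        a, b = rng.sample(active, 2)
        active = [x for x in active if x not in (a, b)]
        children[next_id] = (a, b)
        edge_length[a] = rng.uniform(0.1, 1.0)  # arbitrary edge-length range
        edge_length[b] = rng.uniform(0.1, 1.0)
        active.append(next_id)
        next_id += 1
    return children, edge_length

def subtree(node, children):
    """All nodes at or below `node`."""
    nodes = [node]
    for child in children.get(node, ()):
        nodes.extend(subtree(child, children))
    return nodes

def simulate_site(children, edge_length, n_taxa, rng, loss_prob=0.1):
    """One binary site: a single innovation on an edge drawn proportionally
    to its length, followed by losses inside the innovated clade."""
    edges = list(edge_length)
    innovated = rng.choices(edges, weights=[edge_length[e] for e in edges])[0]
    clade = subtree(innovated, children)
    state = {node: 1 for node in clade}
    for node in clade:
        if rng.random() < loss_prob:  # this node and its descendants lose the trait
            for lost in subtree(node, children):
                state[lost] = 0
    return [state.get(tip, 0) for tip in range(n_taxa)]

def random_geo_distances(n_taxa, rng, bound=3.0):
    """Euclidean distances between uniform random points on [0, bound]^2."""
    pts = [(rng.uniform(0, bound), rng.uniform(0, bound)) for _ in range(n_taxa)]
    return [[math.dist(p, q) for q in pts] for p in pts]

def cophenetic_distances(children, edge_length, n_taxa):
    """Path-length distance between every pair of tips."""
    parent = {c: p for p, kids in children.items() for c in kids}
    def to_root(tip):
        dist, node, out = 0.0, tip, {tip: 0.0}
        while node in parent:
            dist += edge_length[node]
            node = parent[node]
            out[node] = dist
        return out
    paths = [to_root(t) for t in range(n_taxa)]
    D = [[0.0] * n_taxa for _ in range(n_taxa)]
    for i in range(n_taxa):
        for j in range(i + 1, n_taxa):
            # the closest shared ancestor gives the shortest connecting path
            D[i][j] = D[j][i] = min(paths[i][a] + paths[j][a]
                                    for a in paths[i] if a in paths[j])
    return D

rng = random.Random(1)
n_taxa, n_sites = 5, 100
children, edge_length = random_tree(n_taxa, rng)
characters = [simulate_site(children, edge_length, n_taxa, rng) for _ in range(n_sites)]
geo_random = random_geo_distances(n_taxa, rng)
phylo = cophenetic_distances(children, edge_length, n_taxa)
geo_correlated = [[1.2 * d for d in row] for row in phylo]  # worst case: correlation of 1
```

Because `geo_correlated` is an exact multiple of `phylo`, the two matrices are perfectly correlated, which corresponds to the worst-case condition of step 2.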
The grid search parameters are tabulated in Table 6.
Table 6. Parameters used in the grid search.
| Parameter | Grid intervals |
|---|---|
| Number of languages | 5, 10, 20 |
| Number of total sites | 100, 300, 500 |
| Covariance geo-phylo | ‘yes’, ‘no’ |
Each grid search parameter combination was run 4 times in total to capture the variance between runs with the same parameter settings. The total number of data points is 72.
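As a quick check of the design size, the grid can be enumerated directly. The dictionary keys are illustrative names, not identifiers from the original code.

```python
from itertools import product

n_languages = [5, 10, 20]
n_sites = [100, 300, 500]
covariance = ["yes", "no"]
n_repeats = 4  # each parameter combination is run four times

# One entry per simulated run: every combination of the three grid
# parameters, repeated n_repeats times.
runs = [
    {"n_languages": n_lang, "n_sites": sites, "covariance": cov, "repeat": rep}
    for n_lang, sites, cov in product(n_languages, n_sites, covariance)
    for rep in range(1, n_repeats + 1)
]

print(len(runs))  # 3 * 3 * 2 * 4
```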
The results of the grid search show three properties. Firstly, none of the runs with correlated distance matrices have supported clades in their consensus trees. This means that at extreme covariance between geographic and phylogenetic distance, the model cannot distinguish between the two effects. This is expected since, in the simulation, there is no residual information from the geographic distance that can be inferred. Secondly, the correlation between the r_hat values of all parameters and the r_hat values of only the phylogenetic distance matrix is 0.99, which means that the convergence of the phylogenetic distance parameters does not differ from overall model convergence. Thirdly, the mean standard deviation of the split frequencies in the majority clades is on average 0.04 (72%) lower than that of all clades taken together. This is considerable and means that the majority clades show much better chain convergence.
To further analyze the results, we used linear models to determine the effects of each factor on the different convergence metrics. Table 7 shows the results.
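The split-frequency metric can be sketched as follows, under one possible reading of the procedure: split frequencies are computed within each chain, the standard deviation of each split's frequency is taken across chains, and these standard deviations are averaged. Representing a tree as a set of clades, the use of the sample standard deviation, and the majority threshold of 0.5 on the mean frequency are all assumptions of this sketch. The toy chains are invented solely to exercise the functions.

```python
from statistics import mean, stdev

def split_frequencies(trees):
    """Share of trees in one chain that contain each split.
    Each tree is a set of splits; each split is a frozenset of taxon names."""
    counts = {}
    for tree in trees:
        for split in tree:
            counts[split] = counts.get(split, 0) + 1
    return {split: n / len(trees) for split, n in counts.items()}

def mean_sd_split_freq(chains, majority_only=False):
    """Mean over splits of the across-chain standard deviation of split frequency."""
    per_chain = [split_frequencies(trees) for trees in chains]
    all_splits = set().union(*per_chain)
    sds = []
    for split in all_splits:
        freqs = [freq.get(split, 0.0) for freq in per_chain]
        if majority_only and mean(freqs) <= 0.5:
            continue  # keep only splits with majority support on average
        sds.append(stdev(freqs))
    return mean(sds) if sds else 0.0

# Toy posterior samples for two chains over four taxa (invented for illustration).
ab, cd, abc = frozenset("AB"), frozenset("CD"), frozenset("ABC")
chain_1 = [{ab, cd}, {ab, cd}, {ab, abc}, {ab, cd}]
chain_2 = [{ab, cd}, {ab, abc}, {ab, abc}, {ab, cd}]

all_sd = mean_sd_split_freq([chain_1, chain_2])
majority_sd = mean_sd_split_freq([chain_1, chain_2], majority_only=True)
print(all_sd, majority_sd)
```

In the toy data, the clade present in every tree contributes a standard deviation of zero, so restricting the average to majority clades lowers the metric.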
Table 7. Model coefficients (rows) of different linear models with different outcomes (columns).
Boldface figures indicate significance of the predictor of p<0.05.
| outcome:predictor | total r_hat | split freq. | split freq. (majority) |
|---|---|---|---|
| No. languages | 0.09 | 0.01 | 0.00 |
| No. sites | 0.00 | -0.00 | -0.00 |
| covariance (yes) | -0.10 | -0.02 | |
| No. languages:No. sites | -0.00 | 0.00 | -0.00 |
| No. languages:covariance (yes) | 0.00 | -0.00 | |
| No. sites:covariance (yes) | -0.00 | 0.00 | |
As Table 7 shows, it is the number of languages in the tree that primarily causes convergence issues. This is the case both for the r_hat convergence metric and the split frequencies convergence metric. This effect, however, does not appear as strong in the split frequencies of only the majority clades, which indicates that even with convergence issues, the results of the posterior trees concerning majority clades are less affected.
To summarize, convergence issues are related to the following factors:
- Convergence of posterior trees is related to the number of languages in the sample. That is, the more languages are in the dataset, the more posterior chains fail to converge. This can be explained by the fact that, with an increasing number of languages, the number of pairwise distances to be inferred as parameters increases drastically, namely to n(n-1)/2 for n languages, which results in 190 latent parameters for 20 languages and 1,225 for 50 languages. Each of these parameters could potentially be responsible for divergences in the tree topology.
- The disagreements were mostly confined to those clades where there is considerable uncertainty in the tree structure, i.e. where small deviations in pairwise distances can result in a change in tree topology.
- The disagreements between the chains seem to average out with an increasing number of chains. This can be seen in the fact that disagreements in low-support languages lead to an increase in uncertainty in the average posterior and therefore have reduced impact on the consensus tree. In other words, with an increasing number of chains, the risk of false positive inferences is reduced. In potential future applications of this method, it might therefore be advisable to run more than four chains to optimally harness the effect of this averaging.
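The parameter counts of 190 and 1,225 quoted above are consistent with one latent distance per unordered pair of languages. Writing P(n) for the number of pairwise distances among n languages (a symbol introduced here only for illustration):

$$
P(n) = \binom{n}{2} = \frac{n(n-1)}{2}, \qquad P(20) = \frac{20 \cdot 19}{2} = 190, \qquad P(50) = \frac{50 \cdot 49}{2} = 1225 .
$$

The count therefore grows quadratically with the number of languages.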
References
- Bouckaert R, Lemey P, Dunn M, et al.: Mapping the origins and expansion of the Indo-European language family. Science. 2012;337(6097):957–960. 10.1126/science.1219669
- Bowern C, Atkinson Q: Computational phylogenetics and the internal structure of Pama-Nyungan. Language. 2012;88(4):817–845.
- Bürkner PC: Advanced Bayesian multilevel modeling with the R package brms. arXiv preprint arXiv:1705.11123. 2017. 10.48550/arXiv.1705.11123
- Chang W, Cathcart C, Hall D, et al.: Ancestry-constrained phylogenetic analysis supports the Indo-European steppe hypothesis. Language. 2015;91(1):194–244. 10.1353/lan.2015.0005
- Clackson J: The linguistic relationship between Armenian and Greek. Oxford: Blackwell, 1994.
- Dellert J: Information-theoretic causal inference of lexical flow. Berlin: Language Science Press, 2019. 10.5281/zenodo.3247415
- Dellert J, Daneyko T, Münch A, et al.: NorthEuraLex: A wide-coverage lexical database of Northern Eurasia. Lang Resour Eval. 2020;54(1):273–301. 10.1007/s10579-019-09480-6
- Eska JF: The Celtic languages. In: The Celtic Languages. Ed. by Martin Ball and Nicole Müller. New York: Routledge, 2009;36–41.
- Greenhill SJ, Currie TE, Gray RD: Does horizontal transmission invalidate cultural phylogenies? Proc Biol Sci. 2009;276(1665):2299–2306. 10.1098/rspb.2008.1944
- Guzmán Naranjo M, Becker L: Statistical bias control in typology. Linguist Typol. 2021;26(3):605–670. 10.1515/lingty-2021-0002
- Haspelmath M, Tadmor U: WOLD. Leipzig: Max Planck Institute for Evolutionary Anthropology, 2009.
- Heggarty P, Anderson C, Scarborough M, et al.: Language trees with sampled ancestors support a hybrid model for the origin of Indo-European languages. Science. 2023;381(6656):eabg0818. 10.1126/science.abg0818
- Hruschka DJ, Branford S, Smith ED, et al.: Detecting Regular Sound Changes in Linguistics as Events of Concerted Evolution. Curr Biol. 2015;25(1):1–9. 10.1016/j.cub.2014.10.064
- Jäger G: Global-scale phylogenetic linguistic inference from lexical resources. Sci Data. 2018;5(1):180189. 10.1038/sdata.2018.189
- Jäger G, Wahle J: Phylogenetic Typology. Front Psychol. 2021;12:682132. 10.3389/fpsyg.2021.682132
- Jørgensen AR: Celtic. In: The Indo-European Language Family: A Phylogenetic Perspective. Ed. by Thomas Olander. Cambridge University Press, 2022;135–151. 10.1017/9781108758666.009
- McElreath R: Statistical rethinking: a Bayesian course with examples in R and Stan. Boca Raton: Chapman and Hall/CRC, 2018.
- Neureiter N, Ranacher P, Efrat-Kowalsky N, et al.: Detecting contact in language trees: a Bayesian phylogenetic model with horizontal transfer. Humanit Soc Sci Commun. 2022;9:1–14. 10.1057/s41599-022-01211-7
- Ranacher P, Neureiter N, van Gijn R, et al.: Contact-tracing in cultural evolution: a Bayesian mixture model to detect geographic areas of language contact. J R Soc Interface. 2021;18(181):20201031. 10.1098/rsif.2020.1031
- Revell LJ: phytools: An R package for phylogenetic comparative biology (and other things). Methods Ecol Evol. 2012;3(2):217–223. 10.1111/j.2041-210X.2011.00169.x
- Ringe D, Warnow T, Taylor A: Indo-European and computational cladistics. Trans Philol Soc. 2002;100(1):59–129. 10.1111/1467-968X.00091
- Schmidt J: Die Verwantschaftsverhältnisse der indogermanischen Sprachen. H. Böhlau, 1872.
- Stan Development Team: RStan: the R interface to Stan. R package version 2.21.7. 2022.
- Weiss M: Italo-Celtic. In: The Indo-European Language Family: A Phylogenetic Perspective. Ed. by Thomas Olander. Cambridge University Press, 2022;102–113. 10.1017/9781108758666.007
- Wichmann S, Holman EW, Brown CH: The ASJP Database. (version 17). 2016.








