The word for “sky” in the indigenous Saaroa language of Taiwan is laŋica. Across the South China Sea in the Philippines, the speakers of Ilonggo use laŋit, whereas, on the far-flung islands of the Pacific, Hawaiians say lani and Rarotongans and New Zealand Maori raŋi (1). Systematic sound correspondences between many such words tell us that these languages have evolved from a common ancestor to form part of the Austronesian language family. By meticulously comparing the sounds of words across many languages, linguists can learn about the genealogical relationships between languages and the people who speak them, how sounds change through time and even how long-extinct ancestral languages would have sounded. In PNAS, Bouchard-Côté et al. (2) automate this process by using probabilistic models of sound change to trace the evolution of thousands of words across more than 600 Austronesian languages.
The conventional technique for studying language change on the basis of contemporary variation is known as the comparative method (3). This approach identifies shared “cognates” between putatively related languages. Cognates are homologous words of similar meaning that show systematic sound correspondences indicating common ancestry (Fig. 1). Since the 19th century, historical linguists have understood that sound changes occur in a regular but context-sensitive way across the vocabulary of a language. Hence, where Hawaiian has lani and lima for “sky” and “five,” Rarotongan and Maori have raŋi and rima, reflecting a shift in their ancestral lineage from this l sound to r. In deciding whether two words are genuinely cognate, linguists can therefore look beyond superficial similarities by attempting to reconstruct a protolanguage (the common ancestor of the languages in question) and identify regular sound changes acting across the sound systems of its descendants.
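The idea of a regular correspondence can be made concrete with a small sketch. It uses only the four Hawaiian and Maori forms quoted above and assumes, for simplicity, that cognate words have equal length and can be compared position by position; real cognates need proper alignment, and the function name `correspondences` is purely illustrative.

```python
from collections import Counter

# Word pairs quoted in the text: (Hawaiian, Maori) for "sky" and "five".
PAIRS = [("lani", "raŋi"), ("lima", "rima")]

def correspondences(pairs):
    """Count position-by-position sound pairs across equal-length word pairs."""
    counts = Counter()
    for a, b in pairs:
        if len(a) != len(b):
            raise ValueError("this toy sketch only handles equal-length words")
        counts.update(zip(a, b))
    return counts

counts = correspondences(PAIRS)
# Keep only the positions where the two languages differ.
shifts = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
print(shifts)
```

A correspondence that recurs across independent words, as l/r does here, is the kind of evidence the comparative method treats as systematic rather than accidental.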
The rigorous application of the comparative method can be a complex and labor-intensive task. Accurate comparisons between words must incorporate likely changes to pronunciation and the phonological system and correctly align words, allowing for insertions, deletions, metathesis (reversals, such as Old English brid to the modern bird), reduplication (such as Maori paki “to pat” vs. pakipaki “to clap”), and haplology (loss of repeated syllables, such as English library vs. the colloquial libry) among numerous other kinds of change. Change can also be context dependent. For example, in the development of Proto-Germanic, the inherited voiceless stops (*p, *t, and *k) ended up voiced (*b, *d, and *g) but only after an unstressed syllable (Verner's Law); in other contexts, a different rule applied. This predictability allows linguists to distinguish true cognates from chance resemblances (such as the word for “eye” in Maori, mata, and Greek, mati) or likely borrowings (e.g., English mountain borrowed from Old French montaigne). All this is done at the same time as evaluating the underlying ancestral genealogy, which depends on and informs the observed patterns of sound change. The result is an iterative process in which multiple parameters are being optimized simultaneously across hundreds or thousands of data points.
Evolutionary biologists face an analogous and equally complex task in reconstructing species ancestry from gene sequence data (4). Like historical linguists, they seek to simultaneously infer homology, ancestral states, the ancestral genealogy, and underlying models of change. Biologists must also deal with alignment problems (including insertions, deletions, reversals, and reduplications) (5), context-dependent rates of change (6), multiple data types (7), and horizontal transmission (8). In response to these challenges, biologists have developed a suite of computational modeling tools that can efficiently explore parameter space and quantify uncertainty for even complex models and large datasets.
Recently, these tools have been applied to language data to model the evolution of words through time and test hypotheses about the origins of major language families (9–11). Until now, most computational models of vocabulary evolution have ignored information on the sounds of specific words, preferring simpler models of the gain and loss of cognates through time. However, this simplification relies on existing cognate judgments from expert linguists, discards useful information in the source data, and cannot provide insight into the process of sound change.
Bouchard-Côté et al. (2) bring evolutionary modeling and historical linguistics one step closer by developing a probabilistic model of sound change that automates the process of ancestral state reconstruction and cognate assignment directly from vocabulary data. Previous attempts to solve this problem have been restricted to small datasets (12, 13), limiting the power and utility of the methods. Others have sought to quantify language diversification by using simple edit distances (14), but these efforts lack any explicit model of change or the ability to infer ancestral forms or cognates.
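For readers unfamiliar with edit distances, the following is a minimal sketch of the standard Levenshtein distance, applied to the words for “sky” quoted at the start of this commentary. It is not the specific measure used in ref. 14, which may differ in normalization and transcription; it only illustrates why such a score, taken alone, says nothing about which sounds changed or what the ancestral form was.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-sound insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

# Forms for "sky" quoted in the text.
SKY = {"Saaroa": "laŋica", "Ilonggo": "laŋit", "Hawaiian": "lani", "Maori": "raŋi"}

names = list(SKY)
for i, x in enumerate(names):
    for y in names[i + 1:]:
        print(x, y, levenshtein(SKY[x], SKY[y]))
```

The distance treats every mismatch alike, so a regular shift such as l to r costs exactly as much as an arbitrary substitution.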
Bouchard-Côté et al.’s (2) approach adapts probabilistic string transducer algorithms developed by biologists for ancestral genome reconstruction and alignment (15). These computationally efficient algorithms make it possible to analyze large datasets and can handle many of the complexities of sound change considered by the comparative method. Bouchard-Côté et al. (2) infer ancestral sounds by estimating the probability of all possible sound changes occurring along each branch of the language family tree. By linking these probabilities across cognate sets, they can incorporate the regularity of sound change. The string transducer also allows for insertions, deletions, and a degree of context dependence. By adding into the model the further possibility of wholesale replacement with noncognate word forms, the method can reconstruct the birth and death of new cognates and so infer cognate words.
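The flavor of inferring an ancestral sound from branch-wise change probabilities can be conveyed with a deliberately tiny sketch. It is not the string transducer model of Bouchard-Côté et al.: it handles a single sound with no insertions, deletions, or context, and uses the standard pruning recursion from phylogenetics instead. The tree shape, the two-sound alphabet, and the change probability `CHANGE_PROB` are all assumptions made for illustration; only the initial sounds at the tips come from the words for “sky” quoted earlier.

```python
ALPHABET = ("l", "r")
CHANGE_PROB = 0.1  # assumed chance that a sound changes along any one branch

# Hypothetical toy tree (not taken from refs. 9 or 16). Leaves hold the
# initial sound of "sky": (Saaroa, Ilonggo), (Hawaiian, (Rarotongan, Maori)).
TOY_TREE = (("l", "l"), ("l", ("r", "r")))

def transition(parent, child):
    """Probability of seeing `child` at the end of a branch that starts at `parent`."""
    if parent == child:
        return 1.0 - CHANGE_PROB
    return CHANGE_PROB / (len(ALPHABET) - 1)

def partial_likelihood(node, state):
    """P(leaf sounds below `node` | `node` has sound `state`)."""
    if isinstance(node, str):  # a leaf: the sound is observed
        return 1.0 if node == state else 0.0
    likelihood = 1.0
    for child in node:
        likelihood *= sum(transition(state, s) * partial_likelihood(child, s)
                          for s in ALPHABET)
    return likelihood

def root_posterior(tree):
    """Posterior over the ancestral sound at the root, under a uniform prior."""
    weights = {s: partial_likelihood(tree, s) / len(ALPHABET) for s in ALPHABET}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

posterior = root_posterior(TOY_TREE)
print(posterior)
```

Under these assumptions the root favors l, because a single l to r change on the branch leading to Rarotongan and Maori explains the data with fewer events than any alternative.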
Based on two alternative Austronesian language trees (9, 16), Bouchard-Côté et al. (2) are able to reconstruct ancestral “protoforms” for each cognate set. They benchmark their reconstructions against manual reconstructions of Proto-Oceanic (the common ancestor of modern languages from the Oceanic subgroup) and find an error rate midway between that achieved by randomly assigning cognate words from modern Oceanic languages and the level of disagreement between two linguists’ manual reconstructions. Bouchard-Côté et al. (2) also compare cognate sets inferred under their model to known Oceanic cognate sets (1) and find they can group more than 90% of the words correctly.
One major limitation of the current implementation of Bouchard-Côté et al.’s (2) method is that it requires an existing language tree and so can only be applied to well-studied families in which the hard work of establishing the genealogy has already been done. In principle, however, the approach could be extended to simultaneously infer cognates and the tree directly from word string data. An analogous problem has already been solved in biology with the simultaneous estimation of gene alignment and phylogeny (5).
Regardless, by explicitly modeling probabilities of change across the tree, this new approach makes it possible to statistically test hypotheses that embody long-standing questions about the nature of sound change. Bouchard-Côté et al. (2) demonstrate this ability by revealing decisive support for the “functional load” hypothesis (17): the more work a sound contrast does in differentiating between words in a language, the less likely that contrast is to be lost. Identifying this pattern required integrating over thousands of data points and would simply not have been practical via manual reconstruction. The same tools could be used to answer questions about other functional dependencies and frequency effects (18), conditioning (19), and whether proposed laws are universal or family-specific.
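One crude way to picture functional load is to count how many words in a lexicon would become indistinguishable if two sounds merged. The sketch below does this for an invented toy lexicon; the word list and the function name `collapsed_words` are hypothetical, and this proxy is far simpler than the measure defined in ref. 17 or the analysis carried out by Bouchard-Côté et al.

```python
# Invented toy lexicon, for illustration only (not data from any language).
LEXICON = ["pata", "bata", "pita", "bita", "tapa", "laka", "raka"]

def collapsed_words(lexicon, keep, drop):
    """Number of distinct words lost if every `drop` sound merged into `keep`."""
    distinct = set(lexicon)
    after_merger = {word.replace(drop, keep) for word in distinct}
    return len(distinct) - len(after_merger)

for keep, drop in [("p", "b"), ("l", "r"), ("t", "k")]:
    print(f"{keep}/{drop}:", collapsed_words(LEXICON, keep, drop))
```

In this toy lexicon the p/b contrast separates more words than l/r, and t/k separates none, so the functional load hypothesis would predict that t/k is the contrast most at risk of being lost.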
Bouchard-Côté et al. (2) are careful to point out the limits of their current model and that it is not a replacement for careful linguistic scholarship. Besides not yet inferring the tree, the method falls short of being able to recover ancestral forms with the reliability of an expert linguist. Much of the shortfall may result from the fact that the string transducer algorithm does not permit metatheses, reduplications, or haplologies, and allows context dependency based only on the previous character in the string. However, these should be viewed as challenges to be solved, rather than inherent weaknesses of a computational approach. It is worth noting that biologists have achieved considerable success by starting with very simple models of complex phenomena and gradually increasing realism. Bouchard-Côté et al.’s (2) contribution can be seen as a first step toward a comprehensive computational model of sound change. Indeed, compared with the rudimentary models of nucleotide substitution first used by biologists, Bouchard-Côté et al.’s model of sound change is highly sophisticated. It seems reasonable to expect that computer algorithms will become an increasingly important tool for studying the descent of words. Although they cannot yet outcompete the grand masters of historical linguistics, Bouchard-Côté et al. show that they can certainly play the game.
Footnotes
The authors declare no conflict of interest.
See companion article on page 4224.
References
1. Greenhill SJ, Blust R, Gray RD. 2003–2013. Austronesian Basic Vocabulary Database. Available at http://language.psy.auckland.ac.nz/austronesian, accessed January 12, 2013.
2. Bouchard-Côté A, Hall D, Griffiths TL, Klein D. Automated reconstruction of ancient languages using probabilistic models of sound change. Proc Natl Acad Sci USA. 2013;110:4224–4229. doi: 10.1073/pnas.1204678110.
3. Campbell L, Poser WJ. Language Classification: History and Method. Cambridge, UK: Cambridge Univ Press; 2008.
4. Atkinson QD, Gray RD. Curious parallels and curious connections—phylogenetic thinking in biology and historical linguistics. Syst Biol. 2005;54(4):513–526. doi: 10.1080/10635150590950317.
5. Suchard MA, Redelings BD. BAli-Phy: Simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics. 2006;22(16):2047–2048. doi: 10.1093/bioinformatics/btl175.
6. Nevarez PA, DeBoever CM, Freeland BJ, Quitt MA, Bush EC. Context dependent substitution biases vary within the human genome. BMC Bioinformatics. 2010;11(1):462. doi: 10.1186/1471-2105-11-462.
7. Nylander JAA, Ronquist F, Huelsenbeck JP, Nieves-Aldrey JL. Bayesian phylogenetic analysis of combined data. Syst Biol. 2004;53(1):47–67. doi: 10.1080/10635150490264699.
8. Dagan T, Martin W. Ancestral genome sizes specify the minimum rate of lateral gene transfer during prokaryote evolution. Proc Natl Acad Sci USA. 2007;104(3):870–875. doi: 10.1073/pnas.0606318104.
9. Gray RD, Drummond AJ, Greenhill SJ. Language phylogenies reveal expansion pulses and pauses in Pacific settlement. Science. 2009;323(5913):479–483. doi: 10.1126/science.1166858.
10. Kitchen A, Ehret C, Assefa S, Mulligan CJ. Bayesian phylogenetic analysis of Semitic languages identifies an Early Bronze Age origin of Semitic in the Near East. Proc Biol Sci. 2009;276(1668):2703–2710. doi: 10.1098/rspb.2009.0408.
11. Bouckaert R, et al. Mapping the origins and expansion of the Indo-European language family. Science. 2012;337(6097):957–960. doi: 10.1126/science.1219669.
12. Ellison TM. Bayesian identification of cognates and correspondences. Proceedings of Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology. Stroudsburg, PA: Association for Computational Linguistics; 2007. pp. 15–22.
13. Oakes MP. Computer estimation of vocabulary in a protolanguage from word lists in four daughter languages. J Quant Linguist. 2000;7(3):233–243.
14. Brown CH, Holman EW, Wichmann S, Velupillai V. Automated classification of the world’s languages: A description of the method and preliminary results. STUF Language Typology Universals. 2008;61(4):285–308.
15. Holmes I, Bruno WJ. Evolutionary HMMs: A Bayesian approach to multiple alignment. Bioinformatics. 2001;17(9):803–820. doi: 10.1093/bioinformatics/17.9.803.
16. Lewis PM. Ethnologue: Languages of the World. 16th ed. Dallas: SIL; 2009.
17. King R. Functional load and sound change. Language. 1967;43:831–852.
18. Bybee JL. Phonology and Language Use. Cambridge, UK: Cambridge Univ Press; 2001.
19. Blust R. *t to k: An Austronesian sound change revisited. Oceanic Linguistics. 2004;43(2):365–410.