Open Res Eur. 2023 Jun 20;3:99. [Version 1] doi: 10.12688/openreseurope.16017.1

Lexical data for the historical comparison of Rgyalrongic languages

Yunfan Lai 1,2,a, Johann-Mattis List 1,3
PMCID: PMC10446046  PMID: 37645481

Abstract

As one of the most morphologically conservative branches of the Sino-Tibetan language family, most of the Rgyalrongic languages are still understudied and poorly understood, not to mention their vulnerable or endangered status. It is therefore important for available data of these languages to be made accessible. The present lexical data sets provide comparative word lists of 20 modern and medieval Rgyalrongic languages, consisting of word lists from fieldwork carried out by the first author and other colleagues as well as published word lists by other authors. In particular, data of the two Khroskyabs varieties are collected by the first author from 2011 to 2016. Cognate identification is based on the authors' expertise in Rgyalrong historical linguistics through the neogrammarian comparative method. We curated the data by conducting phonemic segmentation and partial cognate annotation. The data sets can be used by historical linguists interested in the etymology and the phylogeny of the languages in question, and they can use them to answer questions regarding individual word histories or the subgrouping of languages in this important branch of Sino-Tibetan.

Keywords: lexical data; historical linguistics; language phylogeny; Rgyalrongic; Sino-Tibetan; language subgrouping; partial cognate annotation; endangered languages

Plain language summary

Rgyalrongic languages are mainly spoken in Western Sichuan, China, albeit including a now extinct, medieval language, namely Tangut, attested in today’s Ningxia Province. For most non-native speakers, they are the most difficult branch of languages to learn in the Sino-Tibetan family, as they exhibit a large array of word formation strategies such as inflection and derivation. Their word complexity points to their old age and their value in finding out the origin of Sino-Tibetan. This database aims at gathering lexical information of Rgyalrongic languages, in preparation for future work on historical and evolutionary linguistics.

Introduction

Rgyalrongic languages are a Sino-Tibetan branch, mainly spoken in Rngaba Tibetan and Qiang Autonomous Prefecture in Sichuan, China [1, 2]. They are closely related to Lolo-Burmese languages [3]. Apart from their modern varieties, which are mostly endangered or vulnerable, a medieval language, Tangut, has recently been recognised as belonging to this branch [4]. Rgyalrongic languages are traditionally divided into two sub-branches, the east sub-branch and the west sub-branch. East Rgyalrongic comprises four main languages, Situ, Zbu, Japhug and Tshobdun, and West Rgyalrongic comprises three further sub-branches, Khroskyabs, Stau (a.k.a. Daofu) and Tangut. Recent phylogenetic studies, however, show that Zhaba also clusters with Rgyalrongic [3]. Thus, it is very likely that language varieties closely linked to Zhaba, such as Queyu and Minyag (a.k.a. Muya, Menya), as well as Zlarong, spoken in the Tibetan Autonomous Region, are also Rgyalrongic languages. These “newcomers” are provisionally termed “Peripheral Rgyalrongic” in the following.

Rgyalrongic is one of the most morphologically conservative branches in the Sino-Tibetan family, and certainly the closest to Proto-Sino-Tibetan compared to all other varieties in China. Therefore, understanding the history of Rgyalrongic languages is vital for the study of the evolution of Sino-Tibetan. A dated phylogeny of Rgyalrongic is thus an essential step towards this goal. Lexical data is the most accessible means of approaching language phylogeny and has been shown to yield accurate results [3, 5]. In order to infer the phylogenetic subgrouping of Rgyalrongic, a lexical dataset with high-quality curation is indispensable. This database provides the first annotated resource for the phylogenetic analysis of Rgyalrongic languages. It includes twenty varieties in East, West and Peripheral Rgyalrongic, as shown in the map in Figure 1 and in Table 1.

Figure 1. Geographical distribution of the languages in the database.


Table 1. Languages selected for the database.

| Language ID | Language name | Glottolog | Longitude | Latitude | Source |
|---|---|---|---|---|---|
| Bantawa | Kiranti Bantawa | bant1281 | 87.05 | 27.12 | Doornenbal (2009) [6] |
| Daofu | rGyalrong Daofu | horp1240 | 101.12 | 30.98 | Huang (1992) [7] |
| Japhug | rGyalrong Japhug | japh1234 | 101.96 | 32.21 | Jacques (2015) [8] |
| Tangut | Tangut | tang1334 | 106.29 | 38.48 | Li (1997) [9] |
| WobziKhroskyabs | Khroskyabs Wobzi | eree1240 | 101.43 | 31.63 | Lai (2017) [10] |
| Zhaba | Qiangic Zhaba | zhab1238 | 101.06 | 30.54 | Huang (1992) [7] |
| MaerkangrGyalrong | rGyalrong Maerkang | situ1238 | 102.30 | 31.87 | Huang (1992) [7] |
| MuyaKangding | Muya-Kangding | west2417 | 101.96 | 30.06 | Huang (1992) [7] |
| MenyaGao | Menya-Gao | west2417 | 101.53 | 29.80 | Gao (2015) [11] |
| OldBurmese | OldBurmese | oldb1235 | 96.60 | 21.46 | Dictionary |
| QueyuXinlong | Queyu Xinlong | quey1238 | 100.26 | 30.98 | Huang (1992) [7] |
| QueyuPubarong | Queyu Pubarong | quey1238 | 100.78 | 30.22 | Guan Xuan’s fieldwork |
| SiyuewuKhroskyabs | Siyuewu Khroskyabs | siya1242 | 101.42 | 31.75 | Author’s fieldwork |
| BragbarSitu | Bragbar Situ | situ1238 | 101.91 | 31.82 | Zhang (2020) [12] |
| Geshiza | Geshiza | horp1240 | 101.65 | 31.03 | Honkasalo (2019) [13] |
| GuanyinqiaoKhroskyabs | Guanyinqiao Khroskyabs | guan1252 | 101.66 | 31.78 | Huang (2007) [14] |
| NjorogsKhroskyabs | Njorogs Khroskyabs | yelo1242 | 101.86 | 31.82 | Yin (2007) [15] |
| Tshobdun | Tshobdun | tsho1240 | 101.83 | 32.21 | Sun (2019) [16] |
| NgyaltsuZbu | Ngyaltsu Zbu | zbua1234 | 101.74 | 32.15 | Gong (2018) [17] |
| Zlarong | Zlarong | zlar1234 | 98.09 | 29.93 | Zhao (2019) [18] |
| KyomkyoSitu | Kyomkyo Situ | situ1238 | 102.04 | 31.99 | Prins (2017) [19] |
| MazurStau | MazurStau | daof1238 | 101.05 | 31.04 | Gates (2021) [20] |

Methods

The entire workflow to build our database is illustrated in Figure 2. We started by collecting raw data from original fieldwork and from existing word lists. We then organised our raw data into a specifically designed and curated word list. In the third step, we standardised the data to conform to the standards outlined by the Cross-Linguistic Data Formats Initiative. Finally, we identified and annotated cognate sets for individual morphemes, also known as partial cognates [21].

Figure 2. Workflow of building up the Rgyalrongic lexical database.


Major sources of the dataset

The major sources of the dataset include original fieldwork by one of the authors of this study (YFL) and by various colleagues who generously shared their lexical data, i.e. word lists. The author’s original field data involve two varieties of Khroskyabs, Siyuewu and Wobzi. All the vocabulary needed for this dataset had already been collected before 2017, prior to any of the research projects acknowledged in this paper. Fieldwork involved verbal exchanges with native speakers, requesting pronunciations of words and expressions, without physical contact. An additional source of data was published dictionaries and word lists which were judged as reliable by the authors (see Table 1). Dictionaries and word lists typically contain word forms and translations in Chinese, French or English. Some word lists also provide morphological information and example sentences. Reliability was assessed on two criteria: i) internal phonological consistency of the source data: the authors checked whether phonemes are correctly identified and whether allophones conditioned by phonological environments and morphological alternations are adequately represented in the original sources; ii) external regularity of sound correspondences under the comparative method: the authors checked whether cognate forms in the sources exhibit regular correspondences, or correspondences that can potentially be explained through alternations or analogy. Languages in our dataset are listed with their sources, their Glottocodes [22] and approximate coordinates in Table 1. There are several cases where two or three languages share the same Glottocode (for the general idea behind Glottocodes, see Forkel and Hammarström [23]). Maerkang (MaerkangrGyalrong), Bragbar Situ (BragbarSitu) and Kyomkyo Situ (KyomkyoSitu) share the Glottocode “situ1238”; however, they are distinct varieties of Situ with limited intelligibility. The two dialects of Minyag sharing the Glottocode “west2417”, labelled MuyaKangding and MenyaGao, are closely related dialects with minor differences. QueyuXinlong and QueyuPubarong (quey1238) are closely related dialects with significant differences in phonology and vocabulary.
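The shared Glottocodes mentioned above can be found mechanically. The following is a minimal Python sketch, using a subset of the identifiers and Glottocodes listed in Table 1; the function name `shared_glottocodes` is a hypothetical helper introduced here for illustration and is not part of the dataset's tooling.

```python
from collections import defaultdict

# Subset of Table 1: language ID -> Glottocode.
GLOTTOCODES = {
    "MaerkangrGyalrong": "situ1238",
    "BragbarSitu": "situ1238",
    "KyomkyoSitu": "situ1238",
    "MuyaKangding": "west2417",
    "MenyaGao": "west2417",
    "QueyuXinlong": "quey1238",
    "QueyuPubarong": "quey1238",
    "Japhug": "japh1234",
}


def shared_glottocodes(codes):
    """Return only those Glottocodes that are used by more than one variety."""
    groups = defaultdict(list)
    for variety, code in codes.items():
        groups[code].append(variety)
    return {code: sorted(names) for code, names in groups.items() if len(names) > 1}


shared = shared_glottocodes(GLOTTOCODES)
for code, names in shared.items():
    print(code, names)
```

Because a Glottocode may cover several distinct varieties, the language ID rather than the Glottocode should be treated as the unique key when working with the data.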

Apart from Rgyalrongic languages, we have included two outgroup Sino-Tibetan languages to improve the accuracy of phylogenetic inference: Bantawa and Old Burmese. The outgroup is used as a reference point to locate and root the ingroup (Rgyalrongic languages). Bantawa belongs to the Kiranti branch, mainly spoken in Nepal. Old Burmese was an ancient Lolo-Burmese language attested between the 12th and 16th centuries in present-day Myanmar. These two languages are remotely related to Rgyalrongic. According to Sagart et al. [3], Kiranti languages branched off from other Sino-Tibetan subgroups approximately 5500 years before present, and Lolo-Burmese separated from Rgyalrongic some 4300 years before present. These two languages are suitable as outgroups in the present study, as Bantawa is sufficiently remote from Rgyalrongic, and Old Burmese has a clear date of attestation and can be used for the calibration of dating.

Data presentation

An extended concept list based on the one used in Sagart et al. [3] is employed as a guideline for our word selection in each language. It includes 313 concepts linked to Concepticon [24], which provides a unique identifier for every concept and thus greatly facilitates language documentation and the historical comparison of lexicon. The concept list is specially designed for Sino-Tibetan languages and is therefore a suitable starting point for the present dataset. In light of our data quality and coverage, we made minor modifications to that concept list by adding and deleting some of the concepts. In particular, we added concepts that have wide coverage in Rgyalrongic languages but are not widely distributed across the Sino-Tibetan family as a whole. For instance, instead of ‘the man (male human)’ used in Sagart et al. [3], we use the general concept ‘person, human’, which has been shown to be indicative of language subgrouping by Lai [25]; we also included ‘girl’, as a significant innovation in West Rgyalrongic with an s-prefix (compare Stau (West) s-mi and Japhug (East) me), discussed in Lai et al. [4, p. 177]. In addition, concepts such as ‘knife’, ‘work’ and ‘sit’ exhibit similar types of innovations across Rgyalrongic languages; we therefore consider them worth including in the dataset.

Data standardisation

After collecting the raw word list of each language, we standardised the data, because the original phonetic transcriptions differ from each other and some do not adhere strictly to the rules of the International Phonetic Alphabet. The revised transcriptions are based on the transcription conventions of the Cross-Linguistic Transcription Systems (CLTS) reference catalog of the Cross-Linguistic Data Formats initiative (https://clts.clld.org [21, 26–28]). We set up an orthography profile [29] that helped us automatically convert all transcriptions according to our standard. The standardised data aims specifically at the computation of language phylogeny.
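To make the conversion step more concrete, the following is a minimal Python sketch of how an orthography profile can drive segmentation and conversion, assuming a greedy longest-match strategy. The profile entries, the placeholder for unknown characters and the function name `segment` are hypothetical illustrations; they are not the profile or the tooling shipped with the dataset.

```python
# Hypothetical mini-profile: source grapheme -> standardised sound.
PROFILE = {
    "tsh": "tsʰ",
    "kh": "kʰ",
    "ts": "ts",
    "k": "k",
    "s": "s",
    "n": "n",
    "a": "a",
    "i": "i",
    "ə": "ə",
}


def segment(form, profile, unknown="\ufffd"):
    """Split `form` into sounds by always matching the longest grapheme first."""
    graphemes = sorted(profile, key=len, reverse=True)
    sounds = []
    position = 0
    while position < len(form):
        for grapheme in graphemes:
            if form.startswith(grapheme, position):
                sounds.append(profile[grapheme])
                position += len(grapheme)
                break
        else:
            # No profile entry matches: flag the character for manual review.
            sounds.append(unknown)
            position += 1
    return sounds


print(" ".join(segment("tshi", PROFILE)))
print(" ".join(segment("khas", PROFILE)))
```

Matching the longest grapheme first ensures that a trigraph such as "tsh" is converted as one sound rather than as "ts" followed by an unmatched "h"; characters missing from the profile are flagged instead of being silently dropped.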

Partial cognate annotation

Cognates are words or parts of words in different languages that share the same origin, such as English foot and German Fuß, both originating from Proto-Germanic *fōts. Cognate forms in daughter languages can be derived from the proto-form through regular sound laws. In Sino-Tibetan languages, more often than not, we find cognates in word parts rather than in entire words. For instance, Mandarin Chinese yuè-liang ‘moon’ and Taiwanese Southern Min guēh-niûⁿ ‘moon’ share only their first parts, yuè and guēh, respectively, as cognates. This type of cognacy is termed “partial cognacy”. Working with partial cognates means segmenting word forms into morphemes, which improves the accuracy of the computation of language subgrouping. It is thus essential to annotate partial cognates, rather than full cognates, in our Rgyalrongic database. Partial cognate identification is conducted manually with the expert knowledge of the authors, using the web-based EDICTOR tool (https://digling.org/edictor [30, 31]).
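The difference between full and partial cognacy can be sketched in a few lines of Python, using the ‘moon’ example above. This is an illustrative data structure only: the `Form` class, the function names and the numeric cognate IDs (1, 2, 3) are assumptions made for this sketch and do not reproduce the annotation format of EDICTOR or of the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Form:
    variety: str
    concept: str
    morphemes: tuple  # one entry per morpheme
    cogids: tuple     # one (hypothetical) cognate ID per morpheme


def shared_morphemes(a, b):
    """Return (morpheme_a, morpheme_b) pairs annotated with the same cognate ID."""
    return [
        (morph_a, morph_b)
        for morph_a, cogid_a in zip(a.morphemes, a.cogids)
        for morph_b, cogid_b in zip(b.morphemes, b.cogids)
        if cogid_a == cogid_b
    ]


def is_full_cognate(a, b):
    """Two forms are fully cognate only if every morpheme matches, in order."""
    return a.cogids == b.cogids


mandarin = Form("Mandarin Chinese", "moon", ("yuè", "liang"), (1, 2))
min_nan = Form("Taiwanese Southern Min", "moon", ("guēh", "niûⁿ"), (1, 3))

print(shared_morphemes(mandarin, min_nan))
print(is_full_cognate(mandarin, min_nan))
```

Under a word-level annotation the two forms would have to be coded either as cognate or as non-cognate as a whole; the morpheme-level annotation records that only the first parts are related.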

Statistics

The current dataset contains a total of 6335 word forms for 22 distinct language varieties. Word forms correspond to 305 different concepts, and use a total of 413 distinct sounds, with an average inventory size of 72 different sounds per language variety. The word forms have been morphologically segmented, comprising a total of 9116 morphemes. These morphemes have been assigned to 3109 cognate sets. Of these cognate sets, 1665 are unique, which means that we could not identify related words in other languages.
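Figures of this kind can be recomputed from morpheme-segmented, cognate-annotated forms. The following Python sketch shows one possible way of doing so on invented toy rows; the variety names, forms, cognate IDs and the function name `summarise` are all hypothetical, and a "unique" cognate set is taken here to mean a set attested in a single variety only, following the description above.

```python
# Invented toy rows: (variety, concept, morphemes, cognate IDs per morpheme).
ROWS = [
    ("VarietyA", "yesterday", ["pə", "snə"], [1, 2]),
    ("VarietyB", "yesterday", ["x", "snə"], [3, 2]),
    ("VarietyC", "yesterday", ["ma", "sni"], [4, 2]),
    ("VarietyA", "moon", ["lə"], [5]),
    ("VarietyB", "moon", ["lə"], [5]),
]


def summarise(rows):
    """Count forms, varieties, concepts, morphemes, cognate sets and unique sets."""
    varieties_per_set = {}
    morpheme_count = 0
    for variety, _concept, morphemes, cogids in rows:
        morpheme_count += len(morphemes)
        for cogid in cogids:
            varieties_per_set.setdefault(cogid, set()).add(variety)
    unique_sets = sum(1 for members in varieties_per_set.values() if len(members) == 1)
    return {
        "forms": len(rows),
        "varieties": len({row[0] for row in rows}),
        "concepts": len({row[1] for row in rows}),
        "morphemes": morpheme_count,
        "cognate_sets": len(varieties_per_set),
        "unique_sets": unique_sets,
    }


stats = summarise(ROWS)
print(stats)
```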

Quality control

We carefully verified the data to ensure its accuracy. We used our expert knowledge to review every lexical entry in the database. Whenever in doubt, we contacted the authors of the original sources for confirmation and correction.

Using an orthography profile, the transcriptions were converted to a unified standard for potential reuse. Phonemic and morphemic segmentation, as well as cognate judgements, were carried out carefully with expert knowledge. See Figure 3.

Figure 3. Partial cognate annotation: The concept ‘yesterday’ in Rgyalrongic is a compound of ‘past’ (cognates ID-7566 or ID-7567) and ‘day’ (cognates ID-7579 or ID-7615).


Languages differ in the choices of the two parts. Annotating the two parts separately allows us to visualise and analyse the internal morphology of the forms for more accurate inference of phylogeny.
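One common way of feeding such annotations into phylogenetic software is to recode them as a binary presence/absence matrix with one column per cognate set. The text does not state which encoding the authors use, so the following Python sketch is an assumption for illustration only: the cognate IDs are those named in the caption of Figure 3, but their assignment to the placeholder varieties and the function name `presence_absence` are hypothetical.

```python
def presence_absence(annotations):
    """Turn variety -> set of cognate IDs into variety -> string of 0/1 characters."""
    cogids = sorted(set().union(*annotations.values()))
    matrix = {
        variety: "".join("1" if cogid in present else "0" for cogid in cogids)
        for variety, present in annotations.items()
    }
    return cogids, matrix


# Hypothetical assignment of the 'past' (7566, 7567) and 'day' (7579, 7615) sets.
ANNOTATIONS = {
    "VarietyA": {7566, 7579},
    "VarietyB": {7567, 7579},
    "VarietyC": {7566, 7615},
}

columns, matrix = presence_absence(ANNOTATIONS)
print(columns)
for variety, row in matrix.items():
    print(variety, row)
```

With morpheme-level columns, VarietyA shares one character with VarietyB (the ‘day’ morpheme) and one with VarietyC (the ‘past’ morpheme), a distinction that a single word-level cognate judgement could not express.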

Discussion and conclusion

Rgyalrongic languages are one of the most essential keys to the reconstruction of Proto-Sino-Tibetan as well as to the subgrouping of this language family. Although searchable databases exist, such as STEDT [32] (https://stedt.berkeley.edu/) and the rGyalrongic Languages Database [33] (https://htq.minpaku.ac.jp/databases/rGyalrong/), the present database is the first Rgyalrongic lexical database that involves expert data curation and cognate annotation, and the only one that is ready for phylogenetic analyses.

For now, morphemes are assigned to the same cognate set only if they occur in words sharing the same meaning. In the future, we hope to extend this analysis to account for cognates across meaning slots, concentrating specifically on language-internal partial cognates along the lines of the analysis pioneered in Hill and List [21] and further extended in Schweikhard and List [34]. With the data annotated in this form, cognacy can also be annotated on the word level [35], and computational approaches to the phylogenetic reconstruction of Rgyalrongic (and beyond) can be carried out. Thus, the present contribution may serve as a foundation for future phylogenetic work on one of the most conservative sub-branches of Sino-Tibetan.

Acknowledgements

Our sincere gratitude goes to Gao Yang, Jesse Gates, Gong Xun, Guan Xuan, Sami Honkasalo, Guillaume Jacques, Zhang Shuya and Zhao Haoliang for generously contributing data and for their detailed answers to our questions. We also thank Mei-Shin Wu for her help in refining the data.

Funding Statement

This research was funded by the European Union under the Horizon 2020 research and innovation program (ERC Starting Grant CALC, grant agreement number 715618, DOI: https://doi.org/10.3030/715618, awarded to JML) and under the Horizon Europe research and innovation program (ERC Consolidator Grant ProduSemy, grant agreement number 101044282, DOI: https://doi.org/10.3030/101044282, awarded to JML), and by the Irish Research Council under the SFI-IRC Pathway Programme (Project id: 21/path-a/9374, Gyalrongic unveiled: Languages, Heritage, Ancestry, awarded to Yunfan Lai). Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

[version 1; peer review: 1 approved, 2 approved with reservations]

Data and software availability

Data and Software available from: https://github.com/lexibank/lairgyalrong/.

Archived source code and data at time of publication: https://doi.org/10.5281/zenodo.7866796

License: Creative Commons Attribution 4.0 International license (CC-BY 4.0)
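As a hedged illustration of how the data might be reused, the following Python sketch parses a forms table with the standard library only. It assumes that the repository follows the usual Lexibank layout, with a `cldf/forms.csv` file using CLDF column names such as `Language_ID`, `Parameter_ID` and `Segments`, and that morpheme boundaries inside `Segments` are marked with `+`; none of this is stated in the text above, and the inline sample rows are invented.

```python
import csv
import io

# Invented sample in the assumed CLDF forms.csv layout.
SAMPLE = """ID,Language_ID,Parameter_ID,Form,Segments
1,VarietyA,yesterday,pəsnə,p ə + s n ə
2,VarietyB,yesterday,xsnə,x + s n ə
3,VarietyA,moon,lə,l ə
"""


def read_forms(handle):
    """Read forms and split each segmented form into morphemes (lists of sounds)."""
    forms = []
    for row in csv.DictReader(handle):
        morphemes = [part.split() for part in row["Segments"].split("+")]
        forms.append(
            {
                "variety": row["Language_ID"],
                "concept": row["Parameter_ID"],
                "morphemes": morphemes,
            }
        )
    return forms


forms = read_forms(io.StringIO(SAMPLE))
for form in forms:
    print(form["variety"], form["concept"], form["morphemes"])
```

For the real data, the same function could be called on a file opened with `open("cldf/forms.csv", encoding="utf-8", newline="")`, provided the assumed path and column names match the repository.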

References

  • 1. Sun JTS: Stem alternations in Puxi verb inflection: Toward validating the rGyalrongic subgroup in Qiangic. Lang Linguistics. 2000;1(2):211–232.
  • 2. Sun JTS: Parallelisms in the verb morphology of Sidaba rGyalrong and Lavrung in rGyalrongic. Lang Linguistics. 2000;1(1):161–190.
  • 3. Sagart L, Jacques G, Lai Y, et al.: Dated language phylogenies shed light on the ancestry of Sino-Tibetan. Proc Natl Acad Sci U S A. 2019;116(21):10317–10322. doi: 10.1073/pnas.1817972116
  • 4. Lai Y, Gong X, Gates JP, et al.: Tangut as a West Rgyalrongic language. Folia Linguistica Historica. 2020;41(1):171–203. doi: 10.1515/flih-2020-0006
  • 5. Gray RD, Atkinson QD: Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature. 2003;426(6965):435–439. doi: 10.1038/nature02029
  • 6. Doornenbal M: A Grammar of Bantawa: Grammar, paradigm tables, glossary and texts of a Rai language of Eastern Nepal. PhD thesis, Leiden University, 2009.
  • 7. Huang B, Dai Q: A Tibeto-Burman Lexicon. Beijing: Central Minzu University, 1992.
  • 8. Jacques G: Dictionnaire Japhug-Chinois-Français. Paris: Projet HimalCo, 2015.
  • 9. Li F: Xia-Han zidian [Tangut-Chinese dictionary]. Beijing: China Social Sciences Press, 1997.
  • 10. Lai Y: Grammaire du khroskyabs de Wobzi. Paris: Université Sorbonne Nouvelle (Paris 3) dissertation, 2017.
  • 11. Gao Y: Description de la langue menya: phonologie et syntaxe. PhD thesis, École des hautes études en sciences sociales, 2015.
  • 12. Zhang S: Le rgyalrong situ de Brag-bar et sa contribution à la typologie de l’expression des relations spatiales: l’orientation et le mouvement associé. PhD thesis, Institut National des Langues et Civilisations Orientales, 2020.
  • 13. Honkasalo S: A grammar of Eastern Geshiza. PhD thesis, University of Helsinki, 2019.
  • 14. Huang B: Lawurongyu yanjiu [Study on the Lavrung language]. Beijing: Nationalities Press, 2007.
  • 15. Yin W: Yelong Lawurongyu Yanjiu [Study on the ’Jorogs Lavrung language]. Beijing: Nationalities Press, 2007.
  • 16. Sun JTS: Tshobdun Rgyalrong Spoken Texts With a Grammatical Introduction. Taipei: Academia Sinica, 2007.
  • 17. Gong X: Le rgyalrong zbu, une langue tibéto-birmane de Chine du Sud-ouest. Une étude descriptive, typologique et comparative. PhD thesis, Paris: Institut National des Langues et Civilisations Orientales, 2018.
  • 18. Zhao H: A sketch of Zlarong the newly discovered language: its phonology, lexicon, morphology and genealogy. Guangzhou: Sun Yat-Sen University master’s dissertation, 2019.
  • 19. Prins M: A Grammar of rGyalrong, Jiǎomùzú (Kyom-kyo) Dialects: A Web of Relations. Brill, 2017.
  • 20. Gates JP: Grammaire du stau de Mazur. Paris: École des Hautes Études en Sciences Sociales dissertation, 2021.
  • 21. Hill NW, List JM: Challenges of annotation and analysis in computer-assisted language comparison: A case study on Burmish languages. Yearbook of the Poznan Linguistic Meeting. 2017;3(1):47–76. doi: 10.1515/yplm-2017-0003
  • 22. Hammarström H, Forkel R, Haspelmath M, et al.: Glottolog [Dataset, Version 4.7]. Zenodo, Geneva, 2023. doi: 10.5281/zenodo.7398962
  • 23. Forkel R, Hammarström H: Glottocodes: Identifiers linking families, languages and dialects to comprehensive reference information. Semantic Web. 2022;13(6):917–924. doi: 10.3233/SW-212843
  • 24. List JM, Tjuka A, van Zantwijk M, et al.: Concepticon [Dataset, Version 3.1]. Zenodo, Geneva, 2023. doi: 10.5281/zenodo.7777629
  • 25. Lai Y: Preinitial denasalisation and palatal fortition in Khroskyabs and the Gyalrongic word for ‘man’. Linguistics of the Tibeto-Burman Area. 2022;45(2):211–229. doi: 10.1075/ltba.22006.lai
  • 26. Forkel R, List JM, Greenhill SJ, et al.: Cross-linguistic data formats, advancing data sharing and re-use in comparative linguistics. Sci Data. 2018;5(1):180205. doi: 10.1038/sdata.2018.205
  • 27. List JM, Hill NW, Foster CJ: Towards a standardized annotation of rhyme judgments in Chinese historical phonology (and beyond). Journal of Language Relationship. 2019;17(1–2):26–43. doi: 10.31826/jlr-2019-171-207
  • 28. Anderson C, Tresoldi T, Chacon T, et al.: A cross-linguistic database of phonetic transcription systems. Yearbook of the Poznan Linguistic Meeting. 2018;4(1):21–53. doi: 10.2478/yplm-2018-0002
  • 29. Moran S, Cysouw M: The Unicode Cookbook for Linguists: Managing writing systems using orthography profiles. Language Science Press, 2018;145.
  • 30. List JM: A web-based interactive tool for creating, inspecting, editing, and publishing etymological datasets. In: Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics. 2017;9–12. doi: 10.18653/v1/E17-3003
  • 31. List JM: EDICTOR. A web-based interactive tool for creating, inspecting, editing, and publishing etymological datasets [Software Tool, Version 2.0.0]. Zenodo, Geneva, 2021. doi: 10.5281/zenodo.4685130
  • 32. Richard S, John B: The Sino-Tibetan Etymological Dictionary and Thesaurus: STEDT project data resources and protocols. University of California, Berkeley, 2000.
  • 33. Nagano Y, Prins M: rGyalrongic Languages Database. National Museum of Ethnology, Osaka, 2013.
  • 34. Schweikhard NE, List JM: Developing an annotation framework for word formation processes in comparative linguistics. SKASE Journal of Theoretical Linguistics. 2020;17(1):2–26.
  • 35. Wu MS, List JM: Annotating cognates in phylogenetic studies of Southeast Asian languages. Language Dynamics and Change. 2023;1–37. doi: 10.1163/22105832-bja10023
Open Res Eur. 2023 Jul 31. doi: 10.21956/openreseurope.17295.r33646

Reviewer response for version 1

Norihiko Hayashi 1

First of all, I would like to express my excitement over the recent release of a new database for the rGyalrongic (spelled ‘Rgyalrongic’ in this paper) languages. However, I must clarify that I am not an expert in the rGyalrongic languages; rather, my specialization lies in Lolo-Burmese languages. Moreover, I have limited familiarity with statistical methods. Please note, therefore, that my review may incorporate some perspectives from classical historical linguistics.

As stated in the paper, this database is a compilation of lexical data from nearly 20 languages classified as rGyalrongic. The authors themselves conducted research on two of the languages, while the remaining data either underwent careful analysis by experts in the respective languages or were extracted from dictionaries with detailed word annotations. The finalized database has been published on their website with the cooperation of these experts.

The rGyalrongic languages, as the authors point out, exhibit exceptionally morphologically complex lexicons among the Tibeto-Burman languages. This complexity not only makes it challenging to explore their historical development but also hinders the investigation of historical relationships among these languages. Until now, historical linguists familiar with these languages have relied on comparative linguistics methods and their own specialized judgment to classify the historical development and phylogeny based on morphological analysis of each language's vocabulary. However, this approach naturally has its limitations.

This paper introduces a system that mitigates assumptions about the historical development of lexicons in each language by incorporating new technologies, search techniques, and even statistical methods. It unveils new possibilities for comparative studies. To achieve this, the paper aims to create a database with maximum scientific transparency, relying on a scrupulous selection of data sets and reliable information provided by experts proficient in each language.

While it has its merits, it also has its drawbacks. To begin with, the attempt is still in its infancy in terms of its overall goal, so one can only hope for further development. The amount of rGyalrongic data employed is still very limited. Of course, I understand that the principal objective of this project is to meticulously curate and present data in collaboration with experts. However, it is still important to increase the amount of data in the future. The author himself mentions that a database of the rGyalrongic languages, supervised by Nagano & Prins (2013), is available online, where the basic vocabulary and example sentences of 81 rGyalrongic languages are uploaded with their sound files. Indeed, similar to the STEDT database, it is essentially a collection of data lacking morphological analysis. However, in the future, with the cooperation of the supervisors, it would be possible to use the data from Nagano & Prins (2013) to conduct a more precise analysis of historical change.

It should be noted that five of the languages in the dataset (Daofu, Zhaba, MaerkangrGyalrong, MuyaKangding, and QueyuXinlong) are taken from Huang (1992). Huang (1992) was a groundbreaking work at the time, a comparative lexicon of primary sources of Tibeto-Burman languages in China, which is very useful for historical and comparative studies. Unfortunately, Huang (1992) contains many printing errors, and the researcher, Prof. Huang Bufan, is deceased, thus we cannot ask her about her research. There are researchers, such as Satoko Shirai and Huang Yang on Zhaba (Gong 2007 is also a reference grammar of Zhaba), and Takumi Ikeda on Muya. For Situ rGyalrong, Yasuhiko Nagano also recently published a grammar in English (Nagano 2022). With the cooperation of these researchers who are still active in the field, there should be room for additional data and comparative studies in the future.

The paper presents the partial cognate annotation by exemplifying the word ‘yesterday’ in Figure 3, which this paper analyses as a compound of ‘day’ and ‘past’. This figure leads us to think that /ɣ/ in the Wobzi word /snə ɣ/ corresponds to /x/ in the Siyuewu word /xsnə/ and that both /ɣ/ and /x/ denote the concept ‘past’. This is really inspiring, and if we search for the same word in the Nagano & Prins (2013) database, there are many similar cases of {pə/ma/wa}+{snə}, like /pəʃe’ʃni/ (Jinchuan Jimu Zhouchan), /ma sȵi lo/ (Maerkang Soman), /mbasni’lo/ (Hongyuan Shuajinsi Selong), /wə 3ʂo 3ʃnə 5/ (Jinchuan Manai Genza), etc. The initial syllable of these words might indicate 'past,' which should be noted in connection with the Rangtang Puxi Siyaowucun word /ɣsnə/, which likewise means 'yesterday'.

Additionally, Nagano (2022) also provides a few intriguing expressions for ‘yesterday’, warranting a comparative analysis. Collaborating with experts and integrating data from other databases could significantly enhance the development of this project.

Regarding the dataset of this project, while Bantawa and Old Burmese, languages that are quite distant historically from rGyalrongic languages, are included, Written Tibetan and Amdo Tibetan are not included in this dataset. Of course, this may be because there are many cases of Tibetan languages providing loanwords to rGyalrongic, but this situation may require some explanation. As for Old Burmese, it certainly has a phonetic writing system and the date of use is clear, so it is possible that the historical relationship between the two languages can be put on an absolute time scale. However, if the Bantawa data are to be combined, it seems to me that more Qiangic languages should be introduced, including Pumi and various Qiang dialects.

At any rate, as mentioned earlier, this database will continue to develop, and it is expected that the data will be updated in the future. At the moment, this database still needs more user-friendly interfaces, such as those already offered by the STEDT and Nagano & Prins databases. However, at the same time, if researchers of Tibeto-Burman languages apply the methodology of this project and develop datasets for each subgroup, it will be possible to discover new problems of historical change and to solve existing problems. At the very least, if the Lolo-Burmese dataset, which is the reviewer's specialty, is developed in a similar manner, it will allow for a more scientific analysis of the relationships between languages within Lolo-Burmese (which may modify the framework of Matisoff 1972, 2003, and Bradley 1979). It will also lead to a more detailed elucidation of the relationship between rGyalrongic and Lolo-Burmese, which was discussed here, than has been the case so far. We are convinced that this project deserves further attention.

Are sufficient details of methods and materials provided to allow replication by others?

Yes

Is the rationale for creating the dataset(s) clearly described?

Yes

Are the datasets clearly presented in a useable and accessible format?

Yes

Are the protocols appropriate and is the work technically sound?

Yes

Reviewer Expertise:

My specialties are Tibeto-Burman Linguistics, especially Lolo-Burmese languages. I also have interests in documenting Tai-Kadai languages in Mainland Southeast Asia.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

References

  • 1. Proto-Loloish. Malmö: Curzon Press, 1979.
  • 2. Zhaba-yu yanjiu [Study of the Zhaba Language (in Chinese)]. Beijing: Minzu chubanshe, 2007.
  • 3. Zangmian yuzu yuyan cihui [A Tibeto-Burman Lexicon (in Chinese)]. Beijing: Zhongyang minzu xueyuan chubansh, 1992.
  • 4. Handbook of Proto-Tibeto-Burman. Berkeley: University of California Press.
  • 5. Handbook of Proto-Tibeto-Burman. Berkeley: University of California Press, 2003.
  • 6. rGyalrong: A Comprehensive Grammar. Tokyo: Fukyosha, 2022.
  • 7. The rGyalrongic Languages Database. 2013.
Open Res Eur. 2023 Jul 27. doi: 10.21956/openreseurope.17295.r33643

Reviewer response for version 1

Mark Alves 1

SUMMARY: This data report presents the status of ongoing historical linguistic research on the Rgyalrongic language group within Sino-Tibetan. It has involved assembling lexical data, identifying cognates, and preparing the data to use in phylogenetic study. The data includes thousands of words from 20 languages (mostly modern languages, but also the long-dead Tangut language) from a variety of previous studies. The goal is to make this data available for others with similar goals to use and, potentially, re-evaluate/re-test.

METHODOLOGY: The approach to assembling the data is proper: selection and organization of lexical data from as many doculects of Rgyalrongic as possible, identification of cognates / partial cognates, semantic and phonetic standardization (to a reasonable degree), and preparation of files usable with software that can do phylogenetic work.

There is one methodological matter I think worth attention. The assessment of cognates certainly must be done by researchers with relevant training and experience in working with the languages, such that they have trustworthy intuition about cognate status (hence your repeated use of the word "expert" in the article, which I think should be reduced). However, it would also be useful to acknowledge that phonological patterns were part of your mental criteria. You cannot and should not cover Rgyalrongic historical phonology in this short report, but a sentence or two would help clarify the cognate identification process beyond "trustworthy intuition": a mention that your intuition as researchers familiar with the languages involved noting recurring patterns (e.g., instances of consistency, but also patterns of differences in consonants or vowels, etc.), along with a couple of key references addressing some of these patterns.

DATA AND SOFTWARE: While all the data is readily available, some guidance / tips on using the data would be helpful to some potential users.

SUGGESTIONS

Below is a list of suggestions for editing, wording, and clarity of details. Some are optional, some are necessary, and some are strongly recommended.

  • Regarding "The present lexical data sets," consider "The lexical data sets the authors have assembled."

  • "are collected...from 2011 to 2016" > "were collected...from 2011 to 2016"

  • In "expertise in Rgyalrong historical linguistics through the neogrammarian comparative method", the phrase "through the neogrammarian comparative method" is unclear in relation to the preceding main clause. I suggest either "through application of the neogrammarian comparative method" or perhaps "expertise in research on Rgyalrongic historical linguistics through the neogrammarian comparative method" (I don't think "neogrammarian" is needed).

  • segmantation > segmentation 

  • "albeit including" doesn't connect to the previous sentence smoothly. Consider something like "though Tangut, an extinct medieval language, was spoken in Ningxia Province."

  • "For most non-native speakers, they are the most difficult branch of languages to learn..." is odd. Non-native speakers of all Sino-Tibetan languages? Just remove "for most non-native speakers". 

  • "...they exhibit a large array of word formation strategies such as inflection and derivation..." is not logical since for "a large array", you can only list inflectional and derivational morphology (but perhaps reduplication). I think you mean that the inflectional and derivational morphology has a complex range of morphophonological patterns, combined with complex morphosyntax.

  • Instead of "old age," perhaps "historical depth in the language group".

  • Instead of "finding out the origin of Sino-Tibetan", perhaps "exploring Sino-Tibetan language history."

  • Regarding "This database aims...", since this is an article, not a database, you need to write something like "The database considered in this article" or something like that.

  • Instead of "Rgyalrongic languages are a Sino-Tibetan branch, mainly spoken...", perhaps "Rgyalrongic languages form a Sino-Tibetan branch and are mainly spoken..."

  • "They are closely related to Lolo-Burmese languages" is minimal information, especially for non-specialists. Consider "They belong to the Qiangic sub-branch of Burmo-Qiangic and are thus closely related to Lolo-Burmese languages."

  • "...a medieval language, Tangut, is recently recognised to belong to said branch" > "...the extinct Tangut language has been recently recognized as a Rgyalrongic language" or "has been recently determined to belong to this sub-branch".

  • "...languages, Situ, ..." > "...languages: Situ, ..."

  • fuß > Fuß

  • The sentence "Thus, it is very likely that language varieties closely linked to Zhaba, such as Queyu and Minyag (aka. Muya, Menya) and Zlarong spoken in the Tibetan Autonmous (> Autonomous) Region, are also Rgyalrongic languages" is confusing. Since you include these languages in the study, "it is very likely" sounds too tentative, so consider "we have considered....to be Rgyalrongic languages in our study".

  • In the phrase "These “newcomers”...", "newcomers" is somewhat ambiguous. Perhaps "new additions to the Rgyalrongic group"?

  • The statement "certainly the closest to Proto-Sino-Tibetan compared to all other varieties in China" is problematic. While there are many reconstructions of aspects of Sino-Tibetan morphology, these are not (to my knowledge) agreed upon to a major degree, which makes your claim premature. I agree that Rgyalrongic's complex morphological system suggests historical depth and undoubtedly sheds light on Proto-Sino-Tibetan. But a statement such as "certainly the closest...compared to all..." would need substantial support, and providing that support is not the goal of this report. Consider a statement such as "and it has a complex and, in our view, highly conservative morphological system that may give hints as to ancient features in the Sino-Tibetan language family", or, instead of "may give", "has given" if you have a publication to cite with such support.

  • "for the study on" > "for the study of"

  • Instead of "The phylogeny of Rgyalrongic with dating information" ("with dating information" is awkward), consider "Phylogenetic research on Rgyalrongic to provide chronological information."

  • For "has been proven to show accurate results3,5", consider "has been proven to show accurate results in both Sino-Tibetan and other language families3,5" since that is what the two references show.

  • "high quality curation is inevitable" > "high quality curation is essential"

  • "the phylogenetic analyses of Rgyalrongic languages" > "the phylogenetic analysis of Rgyalrongic languages"

  • Instead of "It includes", consider "It contains lexical data from".

  • Instead of "The entire workflow", consider "The workflow".

  • Instead of "a specifically designed and curated," consider "a designed and curated".

  • "has already been collected before 2017" > "was collected before 2017"

  • "Dictionaries and word lists" > "These dictionaries and word lists"

  • "Some word lists" > "Some of the word lists"

  • Reorganize the following for standard presentation of items: "Reliability is assessed through two aspects: i) internal phonological consistency of the source data. The authors check if phonemes are correctly identified and that allophones conditioned by phonological environments and morphological alternations are adequately represented in the original sources. ii) external regularity of sound correspondences with the comparative method." > "Reliability is assessed through two aspects: i) internal phonological consistency of the source data and ii) external regularity of sound correspondences with the comparative method. The authors check if phonemes are correctly identified and that allophones conditioned by phonological environments and morphological alternations are adequately represented in the original sources. In addition,..."

  • "The authors check if..." > "The authors checked if..." (Parallel to the past tense in the subsequent sentence)

  • In "...minor differences. Queyuxinlong and QueyuPubarong", consider "minor differences. In contrast, Queyuxinlong and QueyuPubarong" for evident contrast emphasis.

  • Be careful: "These two languages are remotely related to Rgyalrongic" of which one is Old  Burmese in Lolo-Burmese appears to contrast with your statement that Lolo-Burmese is closely related. You might need to replace "closely" in the previous instance since a time depth of 4,300 years suggests more linguistic distance, albeit within a common branch.

  • "...sufficiently remote to..." > "...sufficiently remote from..." 

  • Instead of "largely facilitates", consider "facilitates".

  • Suggestion: ", however," > ". However," and ", therefore" > ". Therefore,"

  • Instead of "which are however not widely distributed from the perspectives of the entire Sino-Tibetan family", consider "which are not widely distributed in other branches of Sino-Tibetan."

  • "included ‘girl’, as" > "included ‘girl’ as"

  • "languages, we therefore" > "languages. We therefore"

  • "consider them as worthwhile to be included in the dataset" > "consider them worth including in the dataset"

  • "from each other, some may" > "from each other, and some may"

  • "can be deducted" > "can be deduced"

  • The sentence "For instance, Mandarin Chinese yuè-liang ‘moon’ and Taiwanese Southern Min guēh-niûn ‘moon’ only share the first part yuè and guēh, respectively, as cognates" is not ideal, as Sinitic is typologically divergent, and this example of bisyllabic compounds doesn't, in my view, sufficiently highlight the challenges of cognacy identification in polysyllabic languages in other branches of Sino-Tibetan. It would be best to show an example of this complex situation from Sino-Tibetan languages in other branches, one involving cognate roots with affixation from different sources, not compounding. However, if you do keep this example, it would help to show clearly that they share 月, but that liang is for 亮 while Taiwanese niu is for 娘 (with the kind of certainty that cannot be offered for languages outside of Sinitic), and that these are thus distinct etyma.

  • The sentence "Partial cognates enable us to segment word forms into morphemes, which improves the accuracy of the computation of language subgrouping" is not quite right. First, the cognates don't allow segmentation: identification of partial cognates does this. Next, the cause of improved accuracy of computational subgrouping of the languages is the improved accuracy of cognate identification by marking both full and partial cognates. Consider rewording this.

  • Question regarding "...annotate partial cognates, rather than full cognates...", does this mean there are no full cognates in the data? That may be possible if all words are morphologically marked with affixes. If so, make that clear. If not, "rather than" needs to be fixed.

  • Perhaps there are different styles, but I generally use commas with numbers such as "6335" > "6,335" except when they are years. If this is standard for this publication, that's fine. Otherwise, please check throughout the paper.

  • Regarding "...22 distinct language varieties", your abstract mentions 20. Please clarify.

  • The phrase "413 distinct sounds" is ambiguous. I assume it means "speech sounds" (not phonemes, since that would require phonemic assessment). You might add "(i.e., consonants, vowels, and tones)" (if tones are among the sounds).

  • "average inventory size of 72 different sounds" > Consider "average inventory of 72 distinct sounds".

  • Regarding "The current dataset contains a total of 6335 word forms for 22 distinct language varieties. Word forms correspond to 305 different concepts," does this mean that 6,335 word forms fit into just 305 meanings? It's possible, but I would expect a larger number unless the wordlists are all short, as fieldwork lists sometimes are. Or do you mean that you focused on identification of words in 305 concepts? Please make this clear.

  • Regarding "These morphemes have been assigned to 3109 cognate sets. Of these cognate sets, 1665 are unique, which means that we could not identify related words in other languages," this seems to say that 1,665 cognate sets do not have cognates in other languages, which is contradictory. Please rewrite to clarify.

  • Regarding "We carefully verified the data to ensure the accuracy and correctness," the last part is too general. First, "accuracy" and "correctness" are synonyms, so just use "accuracy". Then briefly state what specifically about the data was checked for accuracy. That it was entered correctly from the sources, and/or other aspects? Clarifying this can increase confidence.

  • I don't recommend overusing "expert", so for "We use our expert knowledge to review every lexical entry in the database", consider "We reviewed every lexical entry in the database." However, reviewed them for what specifically? What kinds of problems or aspects were you concerned with? You can just say "We checked the data," but that's uninformative.

  • "potential reuses" > "potential reuse" (noncount noun)

  • "judg ements are" > "judgments, are"

  • Again "...are carefully processed with expert knowledge", perhaps just "...were processed based on our experience in working with languages in the region" or something like that.

  • Regarding "Languages differ in the choices of the two parts.", please clarify. Perhaps "The word forms differ in terms of cognacy of different morphemes"???

  • Regarding "...for more accurate inference of phylogeny," this is data that is computationally analyzed, so not inferenced, right? Perhaps "...for more accurate data used for computational phylogenetic analysis."

  • Regarding "Rgyalrongic languages are one of the most essential keys to the reconstruction of Proto-Sino-Tibetan as well as to the subgrouping of this language family," you must have at least a couple of references for this claim. If so, add them. If not, you need to hedge this statement, such as "We believe the Rgyalrongic languages have great potential value in reconstruction...".

  • Again, "involves expert data curation and cognate annotation", do you need the word "expert"? One use in the article is sufficient, and this sentence is already clear without it.

  • "phylogenetic analyses" > "phylogenetic analysis" (The noncount meaning is best here)

  • Regarding "For now, only those morphemes are assigned to the same cognate set which occur in words sharing the same meaning.", this is confusing. Perhaps "For now, only those morphemes assigned to the same cognate set occur in words sharing the same meaning." Is that what you mean?

  • Regarding "...to account for cognates across meaning slots", by "across meaning slots," do you mean near synonyms and/or instances of semantic extension, or something else? Please clarify.

  • "concentrating also on" > "concentrating on" (Is "also" needed?)

  • "on the word level" > "at the word level"
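
The questions raised in the list above about the reported counts (word forms, concepts, cognate sets, and "unique" cognate sets) can be made concrete with a short sketch. This is a minimal illustration and not the authors' pipeline: the rows, the field names (`doculect`, `concept`, `cogid`), and the IDs are invented for the example. It assumes one possible reading of "unique", namely a cognate set whose members all come from a single language variety.

```python
from collections import defaultdict

# Toy rows (hypothetical data): one row per morpheme with its cognate-set ID.
rows = [
    {"doculect": "A", "concept": "nose", "cogid": 1},
    {"doculect": "B", "concept": "nose", "cogid": 1},
    {"doculect": "C", "concept": "nose", "cogid": 2},
    {"doculect": "A", "concept": "moon", "cogid": 3},
    {"doculect": "B", "concept": "moon", "cogid": 3},
    {"doculect": "C", "concept": "moon", "cogid": 3},
    {"doculect": "B", "concept": "foot", "cogid": 4},
]


def cognate_set_stats(rows):
    """Return (number of cognate sets, number of singleton sets)."""
    varieties = defaultdict(set)
    for row in rows:
        # Group the language varieties attested in each cognate set.
        varieties[row["cogid"]].add(row["doculect"])
    # A set is a singleton when only one variety contributes to it.
    singletons = [c for c, langs in varieties.items() if len(langs) == 1]
    return len(varieties), len(singletons)


total, unique = cognate_set_stats(rows)
print(total, unique)
```

Under this reading, a "unique" set is a form with no identified cognates elsewhere in the sample, which is how a figure such as 1,665 out of 3,109 could arise without contradiction.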

MY RESEARCH BACKGROUND

My research focus is on Austroasiatic, with Vietnamese at its center. The study of Vietnamese language history in particular requires much information about Chinese and Kra-Dai language history, and about Southeast Asian language history broadly, all of which are part of my broader research agenda. This involves sifting lexical data in digital databases for etymological study, naturally including cognate identification and interdisciplinary evidence of language contact. In Austroasiatic historical linguistics, I must also deal with neighboring Tibeto-Burman languages and frequently consult STEDT (the Sino-Tibetan Etymological Dictionary and Thesaurus). While I have seen studies of Rgyalrongic languages, I have little specific knowledge of them. However, as I am interested in the ongoing research into Sino-Tibetan (aka Trans-Himalayan) language history, this brief report is of interest, both in terms of its potential use in Sino-Tibetan historical linguistics and in terms of the methodology the authors used and the challenges they face in identifying cognates.

Are sufficient details of methods and materials provided to allow replication by others?

Partly

Is the rationale for creating the dataset(s) clearly described?

Yes

Are the datasets clearly presented in a useable and accessible format?

Yes

Are the protocols appropriate and is the work technically sound?

Yes

Reviewer Expertise:

Historical linguistics (phonology, syntax, etymology, dialectology, language contact, inter-disciplinary approaches) of Vietnamese, Austroasiatic, and Southeast Asia broadly

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Open Res Eur. 2023 Jul 7. doi: 10.21956/openreseurope.17295.r32973

Reviewer response for version 1

Alexis Michaud 1

Summary of the submission:

This database gathers lexical information of Rgyalrongic languages to facilitate future research on historical and evolutionary linguistics. It includes word lists from fieldwork and published sources, providing data for phylogenetic analysis of the Sino-Tibetan family.

An important disclaimer is that this reviewer has no first-hand knowledge of any Rgyalrongic language, and only contributes to historical-comparative work as part of a team that includes professional historical linguists. I guess that it is relevant to further clarify that my main research focus is on the Naish subgroup of Sino-Tibetan (Trans-Himalayan) – the Naxi, Na (Mosuo) and Laze languages –, and that the topic of Rgyalrongic reconstruction is of great interest to me, as progress in Rgyalrongic historical linguistics provides a gradually improved basis for establishing cognate sets between Rgyalrongic and Naish.

The data note «Lexical data for the historical comparison of Rgyalrongic languages» introduces the current version of a database which, as can be looked up from the stats of the GitHub repository, was set up in early 2020 and underwent significant reworking in the spring of 2022. Publication on the Open Research Europe platform will hopefully help clarify to a wide audience the usefulness of the database in its present state, as well as draw attention to a number of forward-looking features in its architecture. The database is a cutting-edge research tool for the comparative study of Rgyalrongic in its Sino-Tibetan context. The database does not aim to aggregate lexical data for historical-linguistic purposes, as done e.g. in the Sino-Tibetan Etymological Dictionary and Thesaurus (STEDT) project at the University of California at Berkeley. The database under review only contains 305 hand-picked cognate sets, currently amounting to a total of 6,335 word forms (for 22 distinct language varieties). By contrast, the STEDT lexical file is more than an order of magnitude larger, containing over 376,000 words in about 200 languages and dialects. Looking for the word ‘nose’ in STEDT (https://stedt.berkeley.edu/~stedt-cgi/rootcanal.pl/gnis?t=nose), one gets a set of no fewer than 1,241 records, with twelve separate etyma. The entries are arranged by language subgroups, without cognacy judgments. There are 129 entries for six Rgyalrongic languages. These languages are (adding, in brackets, the corresponding names in the work under review): Ganzi Danba Geshenzha (Geshizha), Daofu/Ergong (Daofu/Mazur), Lavrung (Khroskyabs), Minyag (Muya/Menya), and several dialects of Rgyalrong.
The database under review is much smaller in terms of number of items (again, by more than an order of magnitude), but has better language coverage (covering about twice as many languages) and, crucially, only returns one item for the intended word: the cognate decided upon manually by the expert in charge of the database (the corresponding author of the data note, LAI Yunfan). There are thus fundamental differences between the STEDT database and the database under review. The STEDT, complete with an online user interface, is a functional and versatile large-scale tool that offers ample opportunity for ‘data crunching’, including peeking at items that are semantically related (‘to blow one’s nose’, ‘nose flaps’...). By contrast, the database under review is a diamond point: it is small by design, intended as a building block for state-of-the-art explorations in computer-assisted historical linguistics. Use of standardized formats allows it to be plugged into a range of computational pipelines, to be used in association with other datasets as required for a specific research purpose.

From the point of view of research methods, the database under review is up to the highest standards from several perspectives: that of Open Science (the data note itself is published under a permissive CC BY license), that of Sino-Tibetan historical linguistics (as the data curation process is exemplary), and that of interdisciplinary work associating linguistics and computer science. Such lovingly handcrafted, computation-friendly cognate sets allow for a seamless integration of the time-honoured methods of historical linguistics with computational approaches. It is to be hoped that other datasets that are key to progress in Sino-Tibetan historical linguistics, such as Old Chinese reconstructions in the Baxter-Sagart system (currently hosted at a custom website, and thus not looking very ‘future-proof’: https://ocbaxtersagart.lsait.lsa.umich.edu/), will find their way to (i) archives ensuring long-term conservation and (ii) hubs routinely used by computational linguists and computer scientists, such as GitHub, GitLab, Bitbucket and such – the landscape here changes rapidly, but the investment of publishing datasets on these platforms is well worth the effort, as it greatly facilitates interdisciplinary collaborations.

On a slightly critical note, the data note presenting the database does not appear to me to do full justice to the database’s design and applications. In the «Plain language summary», the authors’ description of the database as work «in preparation for future work on historical and evolutionary linguistics» strikes me as unnecessarily modest. Perspectives of uses of the database are pushed back into a somewhat vague future, as if the database were intended as food for thought for linguists living at some point along the (theoretical) line of digital eternity. True, posterity may be grateful and appreciative, but the same very general point could be made about any other data set that gets curated and archived. In real life, not that many of the datasets in digital archives such as Zenodo may eventually be used in the future, and still fewer may prove useful in research. By contrast, the database of rGyalrongic languages is not only a functional tool for research: it could be argued to be, by now, a tool whose usefulness has been tried and tested. The tool was being used at the same time as it was being set up. In computational terms, one could say that public release of the database is a transition from active ‘alpha-testing’ to ‘beta-testing’ by an open-ended list of end-users. Moreover, new end-users could potentially also serve as contributors in the medium term, for other cognate sets that may include a different choice of languages and/or concepts. The authors of the database are part of the team of authors of an influential 2019 article arguing that «Dated language phylogenies shed light on the ancestry of Sino-Tibetan»; there are clear shared features between the cognate sets used in the 2019 article and in the database under review, such as use of the second author’s Concepticon. Thus, the database under review is part of a growing body of data and tools that possesses demonstrated value to the field of Sino-Tibetan diachronic research.
I would therefore suggest rephrasing the relevant passage of the «Plain language summary» from «in preparation for future work on historical and evolutionary linguistics» to «as a tool for research in the field of historical and evolutionary linguistics».

Concerning matters of form, the data note could do with an additional round of text editing to ensure best readability for readers outside the first circle of specialists of this area of Sino-Tibetan. There are 20 languages in Figure 1, and 22 in Table 1. It would be useful to provide clarifications in the caption to Figure 1. Quick fix: ‘Figure 1. Geographical distribution of the Rgyalrongic languages in the database.’ Full explicitness would be a service to some readers: adding a clarification to the effect that ‘(Bantawa and Old Burmese, which are not Rgyalrongic languages, are not shown on this map.)’ Some redundancy would help here. The profusion of proper names specific to the area, and the presence of concepts specific to historical linguistics, are perfectly natural in work of this nature, and are essentially inevitable. But their combination tends to put off members of an extended readership of linguists, when one’s best efforts at imbibing the area-specific information in an article leave one baffled. So, efforts at clarity are most needed, to accompany readers accustomed to softer linguistic landscapes. (Decisions at the authors’ choice.) The issue extends to language names and language identifiers. There is room for improvement here, and a small amount of further work in this space would entail large benefits for the legibility of the article – and arguably, for the usability of the database for a not-too-narrow public. Currently, language names do not have a unified syntax. Taking «Menya-Gao» as an example, where «Menya» is the language name, and «Gao» the family name of the author who collected the original data: «Menya-Gao» is OK as an identifier (for computational purposes), but not too great as a language name. The syntax is, to say the least, unusual.
It could make sense if the authors wanted to lay emphasis on the specific ‘doculects’ used in their study, and to give credit to the authors who provided the indispensable basis for the database, and who bear responsibility for data quality. If so, the syntax would yield: Menya Gao, Muya Huang, Situ Zhang, Japhug Jacques, Stau Gates, Khroskyabs Lai, Rma Sims, Zbu Gong, and so on. If the authors wish to float this new practice, it should be applied consistently. But since information on the source of data is provided in the table, it would make excellent sense to go by common practices in English-language publications, and use a syntax such as:

In detail, «Kangding Muya» would be somewhat under-specific, since «Menya-Gao» is spoken in Kangding, too, and the main text describes «Muya» and «Menya» as alternative names for Minyag: «Minyag (aka. Muya, Menya)», seemingly in the same way as «Sino-Tibetan» and «Trans-Himalayan» are different labels for the same language family, or, at the language level, «Pumi» and «Prinmi», «Na» and «Mosuo», «Standard Mandarin» and «Putonghua», «Shixing» and «Xumi», etc. But crucially, there is a dialect difference between the scopes of the two labels as used in the database: Figure 1 has Muya and Menya as distinct data points. Changes to the language names in Table 1 do not imply any modifications to the suite of computer files & scripts: only a change to the table and to the main text of the presentation article under review here.

The repository where the dataset is made available could do with quick additional UX improvements, too: improving the experience of less advanced users – basically, linguists with an awareness of the usefulness of computer scripts but without regular practice of scripting. To begin with, a path to the main reference files should be indicated clearly on the repo’s landing page, and perhaps repeated in some of the subfolders, along the lines of ‘This folder contains . In case you are looking for the database in CSV format, please refer to .’ Thus, in the presentation text (the article under review) the language ‘MazurStau’ appears in Figure 1 (as ‘Mazur’) and in Table 1. The forms.csv file, which (at the time of this review) had been updated on April 26th, 2023, has the MazurStau forms. But this reviewer has not been able to locate any ‘MazurStau’ data in the list of cognates (wordlist-cognates.tsv) in the ‘analysis’ folder. For instance: for ‘nose’, the .tsv file has 21 entries; the Stau form /sni/ (Gates 2021:50) is conspicuously cognate, and would make for a complete cognate set of 22 varieties. Is the .tsv file, last updated on April 20th, 2020, an old file, not relevant to the database in the present state? If so, some cleanup of the GitHub repo or additional explanations would be a service to potential users, so they don’t get led down a garden path when following the enticing lead of an ‘analysis’ folder name. Providing such folders with a short README file would constitute sufficient guidance, and would be well worth the effort – pending the release of future newfangled tools on the GitHub web interface that allow for easy navigation to key files: those that are updated most often, that are referenced at key points in scripts, and other such criteria.
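
The consistency check described above (a variety present in forms.csv but absent from wordlist-cognates.tsv) can be automated. The following is a minimal sketch under stated assumptions, not a description of the repository's actual contents: it assumes the forms table has a `Language_ID` column and the cognate list a `DOCULECT` column, and it uses small invented tables with placeholder forms in place of the real files.

```python
import csv
import io

# Toy stand-ins for the two files (contents invented for illustration;
# the column names Language_ID and DOCULECT are assumptions).
FORMS_CSV = """ID,Language_ID,Parameter_ID,Form
1,Japhug,nose,form1
2,MazurStau,nose,form2
3,Khroskyabs,nose,form3
"""

COGNATES_TSV = """ID\tDOCULECT\tCONCEPT\tCOGID
1\tJaphug\tnose\t1
2\tKhroskyabs\tnose\t1
"""


def doculects(text, column, delimiter):
    """Collect the distinct values of `column` in a delimited table."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return {row[column] for row in reader}


def missing_from_cognates(forms_text, cognates_text):
    """Varieties present in the forms table but absent from the cognate list."""
    in_forms = doculects(forms_text, "Language_ID", ",")
    in_cognates = doculects(cognates_text, "DOCULECT", "\t")
    return sorted(in_forms - in_cognates)


print(missing_from_cognates(FORMS_CSV, COGNATES_TSV))
```

To run this against real files, the two strings would be replaced by the text read from the corresponding paths, with the column names adjusted to whatever the files actually use.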

Minor points:

Introduction: «In order to infer the phylogenetic subgrouping of Rgyalrongic, a lexical dataset with high quality curation is inevitable»: inevitable > indispensable

In the discussion of elicitation, the mention «without physical contact» is puzzling. What are the intended implications? Does it have to do with the differences in legal frameworks for experiments involving human subjects depending on the protocol, with/without physical contact? If the mention is intended as a legal disclaimer, the statement should be made in a relevant section or footnote, not in the main text, where it acts as a distractor. I would suggest making a reference to a publication that sets out the essentials of linguistic fieldwork, with any qualifications and additions the authors may wish to make. Thus, interested readers would get a useful pointer to a resource explaining about linguistic fieldwork, and broaching the all-important human topic of the investigator’s relationship with consultants.

p. 5 German fuß: why not capitalize the F, thus: Fuß?

Acknowledgements: and detailed explanations to our questions > and detailed explanations in answer to our questions

References: the URL provided for

Gates JP: Grammaire du stau de Mazur. Paris: Écoles des Hautes Études en Sciences Sociales dissertation, 2021.

is unhelpful: it is a link to an online announcement of the PhD defense, not a link to the document itself. Since the dissertation is available online, the authors should cite an institutional repository: either the dissertation’s entry on theses.fr (https://theses.fr/2021EHES0054), or a direct link to the PDF, though the latter option is likely to prove less ‘time-proof’ (https://www.theses.fr/2021EHES0054/abes/Gates_Jesse_these_2020.pdf).

Are sufficient details of methods and materials provided to allow replication by others?

Partly

Is the rationale for creating the dataset(s) clearly described?

Yes

Are the datasets clearly presented in a useable and accessible format?

Yes

Are the protocols appropriate and is the work technically sound?

Yes

Reviewer Expertise:

Naish subgroup of Sino-Tibetan (Naxi, Na/Mosuo, Laze); phonetics/phonology; language documentation and conservation; Computational Language Documentation

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Open Res Eur. 2023 Jul 8.
Yunfan Lai 1

Dear Alexis, Thank you very much for your detailed comments. We will improve the paper accordingly after we have received one more review.

Associated Data


    Data Availability Statement

    Data and Software available from: https://github.com/lexibank/lairgyalrong/.

    Archived source code and data at time of publication: https://doi.org/10.5281/zenodo.7866796

    License: Creative Commons Attribution 4.0 International license (CC-BY 4.0)


    Articles from Open Research Europe are provided here courtesy of European Commission, Directorate General for Research and Innovation
