PLOS One. 2026 Mar 11;21(3):e0343164. doi: 10.1371/journal.pone.0343164

Community size rather than grammatical complexity better predicts Large Language Model accuracy in a novel Wug Test

Nikoleta Pantelidou 1,*,#, Evelina Leivada 1,2,#, Raquel Montero 1, Paolo Morosi 1,*
Editor: Wei Lun Wong3
PMCID: PMC12978473  PMID: 41811800

Abstract

The linguistic abilities of Large Language Models are a matter of ongoing debate. This study contributes to this discussion by investigating model performance in a morphological generalization task that involves novel words. Using a multilingual adaptation of the Wug Test, six models were tested across four partially unrelated languages (Catalan, English, Greek, and Spanish) and compared with human speakers. The aim is to determine whether model accuracy approximates human competence and whether it is shaped primarily by linguistic complexity or by the size of the linguistic community, which affects the quantity of available training data. Consistent with previous research, the results show that the models are able to generalize morphological processes to unseen words with human-like accuracy. However, accuracy patterns align more closely with community size and data availability than with structural complexity, refining earlier claims in the literature. In particular, languages with larger speaker communities and stronger digital representation, such as Spanish and English, yielded higher accuracy than less-resourced ones like Catalan and Greek. Overall, our findings suggest that model behavior is mainly driven by the richness of linguistic resources rather than by sensitivity to grammatical complexity, reflecting a form of performance that resembles human linguistic competence only superficially.

Introduction

Large Language Models (LLMs) are Artificial Intelligence systems designed to interact using human language. Their high performance across different domains, including education, medicine, finance, and translation [1] stems from their ability to generate contextually appropriate and syntactically diverse responses, which in turn reflects a sophisticated manipulation of linguistic structures and rules [2]. Despite these achievements, however, the linguistic abilities of LLMs remain a matter of ongoing debate. It has been observed, for instance, that the mechanisms through which LLMs learn and process language differ fundamentally from those underlying human cognition [3]. In the semantic domain, it has been argued that signifiers are not accessible to LLMs [4] and that models lack representations of words comparable to those in the human mind [5]. Consequently, LLMs are often described as possessing only functional competence, but lacking conceptual meaning [4]. In addition, several studies report that models underperform compared to humans in tasks requiring grammaticality judgments, both in terms of accuracy and consistency [6–10]. It thus remains to be established whether the functional competence of LLMs not only approaches that of humans but also extends to novel material not included in their training data [11].

This question is particularly relevant because LLMs are trained on vast amounts of data from the internet, whose quantity and quality are crucial determinants of their performance [1,12,13]. If the data are not carefully selected and preprocessed, models may produce biased, harmful, and stereotypical responses [14–16]. Since such biases are present in all textual sources, companies developing LLMs attempt to mitigate them through the choice of resources and subsequent filtering and evaluation procedures [16].

Against this backdrop, several studies have investigated the extent to which LLMs can manipulate different languages in ways that demonstrate human-like abilities and an understanding of underlying linguistic rules, even when confronted with inputs absent from their training data. The existing literature, however, has focused predominantly on the syntactic and semantic abilities of LLMs, with relatively little attention to morphology – the component of language that generates words or lexemes according to systematic patterns of covariation in form and meaning [17]. Two relevant studies nonetheless examine the morphological capacities of LLMs, using multilingual adaptations of the Wug Test [18]. The Wug Test was originally designed to assess whether children apply grammatical rules to novel words, by asking participants to provide an inflected or derived form of a nonce word. In the original paradigm, participants were introduced to a fictional character with a sentence such as “This is a wug.”, containing the nonce word wug. They were then shown two of these characters and prompted to complete the sentence “Now there are two ___.”. The target answer, wugs, indicated that participants possessed an internal representation of the English pluralization rule, extending beyond rote memorization.

In the context of LLMs, previous work has assessed the morphological capabilities of ChatGPT-3.5 through a multilingual adaptation of the Wug Test [19]. Their experiment used invented words in English, German, Tamil, and Turkish, to evaluate the model’s ability to generalize morphological rules – particularly, plural formation – to unseen data. The findings revealed that, although never reaching the performance of the best human annotator or the strongest baselines, ChatGPT-3.5 performed best in German, surpassing English, Turkish, and Tamil. This outcome is intriguing given that English exhibits a simpler morphological system than German in the nominal domain, typically involving the suffix -s or -es [20], with only a limited number of irregular forms. German, by contrast, employs a variety of pluralization strategies, including several suffix classes (e.g., -e, -er, -n/-en, -s) and frequent stem modifications (e.g., umlaut), which collectively contribute to a high degree of morphological complexity [21]. [19] therefore suggest that factors beyond morphological complexity must have influenced the model’s generalizations. At the same time, however, since English is far more represented in the training data, the findings also imply that multiple proxies – including but not limited to data exposure – likely shaped the model’s performance. Two related questions follow from this study: (i) does morphological complexity influence the performance of LLMs on morphological tasks such as the Wug Test? And (ii) does the interaction between linguistic complexity and language-community size affect LLMs’ performance across languages?

A partial response comes from [22], who ran the Wug Test in French, German, Portuguese, Romanian, Spanish, and Vietnamese with both ChatGPT-3.5 and ChatGPT-4. In their study, the original English Wug Test was translated into the respective languages, and linguistically trained native speakers evaluated the translations. Their results show that both models generally succeeded in generating the target morphemes for nonce words, with GPT-4 slightly outperforming GPT-3.5. More broadly, [22] argue that LLMs’ success in generating correct forms is predicted by the language’s morphological complexity, particularly integrative complexity, which refers to the degree of predictability of inflected forms. A summary of the most relevant studies is given in Table 1.

Table 1. Summary of previous studies.

Study | Title | Key finding
Dang et al. 2024a | Morphology Matters: Probing the Cross-linguistic Morphological Generalization Abilities of Large Language Models through a Wug Test | The amount of training data is more important than a language’s morphological complexity.
Dang et al. 2024b | Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of mT5 and ByT5 | Languages with many irregularities benefit more, since they have a larger presence in the training data.
Weissweiler et al. 2023 | Counting the Bugs in ChatGPT’s Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model | The influence of factors beyond linguistic complexity on the model’s morphological generalization should be considered.

Given the contrasting findings of [19] and [22], several questions remain unresolved. Concretely, it is unclear whether LLM performance in morphological tasks is primarily driven by linguistic complexity or by the size of the language community and its representation in training data. Furthermore, the scope of existing work is limited: [19] tested only ChatGPT-3.5, while [22] compared ChatGPT-3.5 and ChatGPT-4, leaving the overall picture fragmented and model-specific. As a result, we lack a comprehensive account of how LLMs handle morphological generalization across languages. The present study addresses this gap by systematically testing six models (ChatGPT-3.5, ChatGPT-4, Grok 3, BERT, DeepSeek, and Mistral) across four partially unrelated languages (Catalan, English, Greek, and Spanish), comparing their performance with human speakers through a multilingual adaptation of the Wug Test.

On linguistic complexity and community size

Discovering what modulates the linguistic abilities of LLMs is paramount for understanding what model scaling can do, and whether it can lead to better linguistic performance. Based on previous literature [19,22], the two important concepts behind our study design are linguistic complexity and community size. Linguistic complexity is a multifaceted notion whose definition varies across different subfields and can be identified with structural, cognitive, and developmental complexity [23]. Since the present study focuses on LLM morphological performance, the relevant dimension is that of structural complexity, understood as the quantity of overt formal features in a given language and the way in which they are organized and interconnected.

With respect to morphology, various measures of complexity have been proposed. For verbal morphology, for instance, the frequency of tensed forms, the variety of past tense structures, and the number of distinct verb inflections are often considered [24]. The present study, however, focuses exclusively on nominal inflection, and in particular plural formation. Therefore, to operationalize morphological complexity across the four languages under investigation, we adopted the method proposed by [25]. This approach distinguishes two dimensions of morphological complexity: fusion and informativity. The fusion dimension captures the extent to which a given language uses phonologically bound markers (i.e., affixes) rather than phonologically independent markers. Languages with affixes encoding tense, aspect, and mood on verbs, or case, gender, and number on nouns and pronouns, receive higher fusion scores. The informativity dimension, by contrast, reflects the number of obligatory grammatical distinctions marked in a language: the more categories obligatorily expressed, the higher the informativity score.

To calculate the fusion and informativity scores in the languages we test, we employ Grambank v1.0 [26], a global database covering 2,467 languages and coding 195 structural features relevant to fusion and informativity. Complexity was calculated as a global measure, following [25]’s procedure. Using Python [27] in the Spyder environment [28], we computed scores for each language as follows. For fusion, only features with a Fusion weight of 1 were considered: each language received one point if the feature was present (coded as 1) and zero points if absent (coded as 0). The Fusion score for each language was calculated as the mean of these features. Features with Fusion weights of 0, 0.5, or missing values were excluded. For informativity, features were grouped by grammatical function (e.g., singular, tense). A language was counted as marking a function if at least one feature in the group was coded as present, and the Informativity score was calculated as the proportion of marked groups relative to the total number of groups with at least one present feature. The code, input files, and results are available at https://osf.io/4z5n6/.
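
The scoring procedure just described can be sketched in Python. The feature codes, Fusion weights, and function groups below are invented for illustration; the actual computation uses the Grambank v1.0 feature set and is available in the OSF repository.

```python
# Illustrative sketch of the fusion/informativity scoring procedure.
# Feature IDs, weights, and groups here are toy values, not Grambank data.

def fusion_score(features, fusion_weights):
    """Mean presence (1/0) over features whose Fusion weight is exactly 1.
    Features with weight 0 or 0.5, or with missing values, are excluded."""
    relevant = [value for feat, value in features.items()
                if fusion_weights.get(feat) == 1 and value in (0, 1)]
    return sum(relevant) / len(relevant) if relevant else 0.0

def informativity_score(features, groups):
    """Proportion of grammatical-function groups (e.g., number, tense)
    containing at least one feature coded as present, out of all groups
    with at least one coded (non-missing) feature."""
    coded = [feats for feats in groups.values()
             if any(features.get(f) in (0, 1) for f in feats)]
    marked = sum(1 for feats in coded
                 if any(features.get(f) == 1 for f in feats))
    return marked / len(coded) if coded else 0.0

def mean_complexity(fusion, informativity):
    """Global complexity as the mean of the two scores (cf. Table 2)."""
    return (fusion + informativity) / 2
```

Applied per language over the Grambank coding, these three quantities yield the Fusion, Informativity, and Mean columns reported below.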

The results show clear differences across languages (Table 2). English displays the lowest fusion score (0.29), while Greek scores highest (0.53). Greek also ranks highest in informativity (0.44), closely followed by Spanish (0.42), with English again lowest. Catalan and Spanish show very similar values on both metrics, though Catalan’s average score (0.4206) is slightly higher than Spanish’s (0.4130). Ordering the languages from least to most complex yields: English < Spanish < Catalan < Greek.

Table 2. Fusion and informativity scores per language.

Language_ID | Language | Fusion score | Informativity score | Mean score
mode1248 | Greek | 0.538462 | 0.448980 | 0.493721
stan1289 | Catalan | 0.418182 | 0.423077 | 0.4206295
stan1288 | Spanish | 0.400000 | 0.425926 | 0.412963
stan1293 | English | 0.291667 | 0.285714 | 0.2886905

Turning to community size and LLMs’ training data, it is important to note that English, spoken by hundreds of millions of native speakers and used globally as a lingua franca, overwhelmingly dominates the online domain, resulting in a vast representation in the training corpora. Spanish, with a large global community of native and second-language speakers, also benefits from an abundant digital footprint, though still smaller in scale than English. By contrast, languages with smaller speaker populations, such as Greek or Catalan, are usually represented less extensively. That said, the correlation between population size and training data availability is not strictly linear. Catalan, for example, has fewer speakers than Greek, but benefits from a relatively strong digital infrastructure thanks to cultural and political initiatives promoting its use online. Conversely, widely spoken languages with large populations but less online visibility—such as Hindi or Bengali—remain underrepresented relative to their number of speakers. Thus, while larger speaker communities generally increase the likelihood of richer training data, factors such as digitization policies, cultural prestige, and technological adoption also play a decisive role.

For the purposes of this study, and in the absence of precise information regarding the exact amount of training data used by each model, we make the simplifying assumption that community size directly correlates with the amount of training data available to the models. Accordingly, the ordering of tested languages by community size/training data is: English (~1.5 billion speakers) > Spanish (~488 million speakers) > Greek (~12 million speakers) > Catalan (~8 million speakers).

Finally, it is also worth noting that the traditional view often assumes a close correlation between linguistic complexity and community size. According to the linguistic niche hypothesis [29,30], the sociolinguistic environment shapes linguistic complexity: small, homogeneous communities with mostly native speakers (i.e., esoteric communities) tend to preserve or develop greater morphological complexity, whereas large, heterogeneous communities with many L2 learners (i.e., exoteric communities) tend toward simplification. This relationship, however, remains debated. [31] indeed highlighted the role of non-native speakers, arguing that exoteric communities with high proportions of L2 speakers tend toward simplification, while esoteric communities preserve irregularities. [32] provided further evidence in support of this view, showing that larger societies may evolve simpler grammatical systems. By contrast, [33] found that it is the absolute number of speakers, rather than the proportion of L2 speakers, that correlates with complexity. More recently, [25] reported only a very weak correlation between population size and (reduced) complexity. Taken together, these findings suggest that while community size can affect linguistic complexity, the effect is neither straightforward nor uniform but shaped by additional sociolinguistic and demographic factors. This debate makes it particularly relevant to examine the interplay between linguistic complexity and community size in the context of LLM performance across languages.

The present study

This study investigates how LLMs extend morphological generalizations across languages, with particular attention to the impact of linguistic complexity and community size. To frame this investigation, we articulate three guiding research questions (RQs):

  • RQ1: Do LLMs exhibit human-like behavior in the generalization of novel morphological forms, or do they deviate from human baselines?

  • RQ2: Does the linguistic complexity of a language influence model performance? If so, to what extent?

  • RQ3: Alternatively, is model accuracy primarily conditioned by the amount of training data and the size of the speaker community?

To address these questions, we designed a multilingual adaptation of the Wug Test, systematically evaluating six models (ChatGPT-3.5 ([34]), ChatGPT-4 ([35]), Grok 3 ([36]), BERT ([37]), DeepSeek ([38]), and Mistral ([39])) across four partially unrelated languages (Catalan, English, Greek, and Spanish), and comparing their performance with the responses of human speakers.

The RQs give rise to three testable predictions. With respect to RQ1, prior work suggests that LLMs perform reliably in tasks that involve relatively constrained morphological operations (e.g., [19,22]). Accordingly, we expect models to approximate human behavior on a task as elementary yet revealing as the Wug Test.

For RQ2, if structural complexity exerts a decisive influence, then LLMs should perform better in languages with lower complexity. Specifically, their performance is expected to follow the ranking:

English > Spanish > Catalan > Greek.

Regarding RQ3, if the main factor influencing LLMs’ performance is community size, and, by extension, the amount of training data, we instead expect the ranking to be: English > Spanish > Greek > Catalan.

These predictions were evaluated by directly comparing model and human performance across the four tested languages. The analysis focuses on overall accuracy rates, cross-linguistic patterns, and the interaction between linguistic complexity and data availability. By examining where and how models diverge from human baselines, the study aims to determine whether their behavior reflects genuine morphological competence or merely sensitivity to distributional properties in the input. This approach also allows us to assess which of the two factors (i.e., structural complexity or resource availability) better accounts for performance differences across languages.

Methodology

The task employed in this study is a modified version of the Wug Test [18]. The Wug Test assesses the ability to apply grammatical rules pertinent to inflectional morphology to words that participants have never encountered before. As mentioned in the Introduction, the choice of this test is motivated by previous literature [19,22], which provided interesting results regarding the morphological abilities of LLMs in this task, but also generated questions regarding what drives their performance. Specifically, we designed a novel version of the Wug Test featuring 30 test items: 15 words consisting of two syllables and 15 words of three syllables. The full task is available in the OSF repository (https://osf.io/4z5n6/).

The task was created and hosted on the PCIbex Farm platform ([40]). The sentences were presented in the written modality on the screen, and participants were asked to type their responses in the designated blank spaces using a computer or a mobile device. The primary objective of the task was to elicit the plural form of each nonce word. To minimize confounds, each test item followed an identical structure, differing only in the novel word presented, as exemplified below.

[ENG] Continue the phrase with only one word. Here is an example: Yesterday a glorp appeared in my garden. Today there was another one. Now there are two? [response: glorps].

Continue the phrase with only one word: Yesterday a sottle appeared in my garden. Today there was another one. Now there are two? ______

This uniformity ensured that neither semantic properties nor contextual cues could influence responses, thereby allowing a direct comparison between human participants and language models under equivalent linguistic conditions. The test was developed in four languages differing in community size and level of complexity, namely Catalan, English, Greek, and Spanish, and evaluated by native speakers of each language.

The nonce words created in this study respect the morphophonological constraints of each target language. They were systematically derived from existing lexical items in each target language by altering the initial consonant. This design choice was motivated by evidence showing that consonants play a more critical role than vowels in lexical identification and word recognition for both humans and LLMs [41]. Consequently, manipulating consonants provides more robust and reliable stimuli. In addition, the novel words were balanced within each language according to syllable count and grammatical gender. This control minimized the risk of inadvertent biases toward particular phonological patterns or gendered forms and ensured the cross-linguistic comparability of the stimuli. Table 3 provides sample items from each language, illustrating the distribution of two- and three-syllable words and gender marking where applicable.

Table 3. Sample of the stimuli per tested language.

English Spanish Catalan Greek
2-syllable 3-syllable 2-syllable 3-syllable 2-syllable 3-syllable 2-syllable 3-syllable
jater mucumber danta (FEM) sestino (MASC) gavall (MASC) frúixola (FEM) λέφα (FEM) λύννεφο (NEU)
bocket rospital ñafa (FEM) meclado (MASC) famí (MASC) zoquina (FEM) τάμπα (FEM) ζεβύρι (NEU)
capkin forpedo zulta (FEM) fepillo (MASC) deó (MASC) flimona (FEM) φέστη (FEM) ρεχνίδι (NEU)
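
The derivation principle described above (altering the initial consonant of an existing word) can be sketched as follows. The helper name is ours and the substitution is mechanical here, whereas the actual stimuli were hand-crafted per language to respect each language’s morphophonological constraints.

```python
# Illustrative sketch of the stimulus-construction principle: a nonce
# word is derived from a real lexical item by replacing its initial
# consonant. (English-oriented consonant set; for illustration only.)

CONSONANTS = set("bcdfghjklmnpqrstvwxz")

def derive_nonce(real_word, new_onset):
    """Swap the initial consonant of `real_word` for `new_onset`."""
    if not real_word or real_word[0].lower() not in CONSONANTS:
        raise ValueError("word must begin with a consonant")
    return new_onset + real_word[1:]
```

Two of the English items in Table 3 illustrate the principle: mucumber can be obtained from cucumber, and rospital from hospital.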

Participants

A total of 160 participants (78 female) took part in the study, with 40 adult native speakers per language to ensure balanced representation across the tested languages. Participants were recruited via the online platform Prolific and compensated for their participation. All participants provided written informed consent before taking part in the study. The experiment was carried out in accordance with the Declaration of Helsinki and had the written approval of the ethics committee (Comitè d’Ètica en la Recerca (CERec)) of the Autonomous University of Barcelona (application no. 7150). Recruitment started on March 15th, 2025, and ended on May 15th, 2025. Exclusion criteria included self-reported cognitive, neurological, hearing, or speech-related impairments. In addition, participants who failed to respond in the target way to at least 50% of the task items were removed from the final sample. No time limits were imposed on task completion, although participants were required to respond to all prompts in order to finish the experiment.

The same test was also administered to six LLMs: ChatGPT-3.5, ChatGPT-4, Grok 3, BERT, DeepSeek, and Mistral. Model evaluation was primarily conducted manually through the respective user interfaces, with the exception of the BERT model, which was evaluated programmatically using Python due to its comparatively slow response time when accessed manually. For the manual evaluation, each prompt was entered independently into a new chat session without any prior conversational context. After receiving the model’s response, the conversation history was deleted to avoid contextual bias across prompts.

Since no paid subscriptions were used for any of the evaluated models, usage was constrained by the limitations imposed on free-tier access. Consequently, testing was periodically interrupted upon reaching usage limits, requiring waiting periods before further evaluation could proceed. As a result, the manual testing process extended over approximately one month. In contrast, the automated evaluation of the BERT model, conducted via Python, was completed in under two hours, highlighting a substantial difference in efficiency between manual and automated testing procedures.
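
A minimal sketch of how such a programmatic masked-LM evaluation can be set up is given below. It assumes the bert-base-uncased checkpoint and the Hugging Face Transformers library; the exact checkpoints and scripts used in the study may differ (see the OSF repository for the actual code). Because BERT is a masked language model rather than a chat model, the blank in the carrier sentence is rendered as a [MASK] token and the top-ranked filler is compared with the target plural.

```python
# Sketch of a masked-LM version of the Wug item and its scoring.
# Checkpoint name and prompt wording adapted for a masked-LM setting.

def build_masked_item(nonce_word, mask_token="[MASK]"):
    """Render the carrier sentence with a mask in the plural slot."""
    return (f"Yesterday a {nonce_word} appeared in my garden. "
            f"Today there was another one. Now there are two {mask_token}.")

def score(prediction, target):
    """Binary accuracy: 1 for an exact match with the target plural."""
    return int(prediction.strip().lower() == target.strip().lower())

def predict_filler(text, checkpoint="bert-base-uncased"):
    """Return the model's top-ranked filler for the [MASK] slot.
    Requires `transformers` and `torch`; the checkpoint is an assumption."""
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM
    tok = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForMaskedLM.from_pretrained(checkpoint)
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs.input_ids[0] == tok.mask_token_id).nonzero()[0].item()
    pred_id = logits[0, mask_pos].argmax().item()
    return tok.decode([pred_id]).strip()
```

Looping `predict_filler(build_masked_item(w))` over the stimulus list and scoring each prediction reproduces the kind of batch evaluation that completed in under two hours.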

Regarding computational resources, access to BERT was free of charge. The implementation relied on standard open-source libraries for natural language processing, including AutoTokenizer and AutoModelForMaskedLM from the Hugging Face Transformers library [42]. No specialized hardware (e.g., GPUs or TPUs) was employed, and all computations were performed on general-purpose computing resources. The raw datasets, the code used for the analyses, and the experimental stimuli are available in the OSF repository. Fig 1 diagrams the experimental design and testing process.

Fig 1. Experimental design and testing process.


Data annotation

The central measure of interest in the present study was accuracy in the pluralization of nonce words across human participants and models, as the aim was to examine possible effects of agent (i.e., humans vs. LLMs), language complexity, and community size on accuracy. Accuracy was measured by comparing each response with a predefined target based on the morphophonological rules of the relevant language. Correct responses were scored as 1 and incorrect responses as 0. Both stressed and unstressed forms were accepted, provided that the stress placement was accurate. Responses with misplaced stress, in contrast, were considered violations of the phonological rules of the language and coded as inaccurate. With respect to pluralization strategies, all test items conformed to regular patterns based on the morphological rules of each language. Misspelled answers or misapplied irregular forms were also scored as inaccurate. The same coding procedures were applied to human and model responses.

Results and data analysis

Turning first to human participants, descriptive analyses revealed some cross-linguistic variation. English speakers achieved the lowest accuracy (84.8%), followed by Catalan (90.9%), Greek (94.8%), and Spanish (95.7%). These differences are shown in Fig 2.

Fig 2. Humans’ performance across languages in the Wug Test.


In order to test whether these differences were statistically significant, we fitted in R [43] (version 4.5.1) a mixed-effects logistic regression model (glmer) [44] with Accuracy (correct, incorrect) as the dependent variable and Language (Catalan, English, Greek, and Spanish) as the independent variable. Participants and Items were added as crossed random factors. The null model (accuracy ~ 1 + (1|item) + (1|participant)) was compared with the model including Language as a fixed effect (accuracy ~ Language + (1|item) + (1|participant)) via the anova() function. Including Language significantly improved model fit (p = 3.1e-08). Additionally, random slopes for Language were included within the items, as this also significantly improved model fit (as estimated by comparing the log-likelihoods of the models using the anova() function). Results showed a main effect of Language (χ² = 24.792, p < 0.001) on accuracy. Post-hoc comparisons were then run using the emmeans() function [45] with Tukey adjustment. Table 4 shows the results obtained. While English was significantly different from the other languages, no significant differences were found among the other languages. These results indicate that, even in humans, morphological generalizations show language-dependent variation, with English pluralization posing greater challenges. As we discuss in the Discussion section, the difficulty of specific test items may have contributed to English being ranked lowest.
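
As a back-of-the-envelope check on the likelihood-ratio statistic, the tail probability for χ² = 24.792 can be computed in closed form. This sketch is not part of the original analysis pipeline; it assumes 3 degrees of freedom for the four-level Language factor, for which the chi-square survival function has the closed form P(X > x) = erfc(√(x/2)) + √(2x/π)·e^(−x/2).

```python
import math

def chi2_sf_df3(x):
    """Survival function of the chi-square distribution with 3 df:
    P(X > x) = erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))
```

With this, `chi2_sf_df3(24.792)` is on the order of 1e-5, consistent with the reported p < 0.001, and the conventional 5% cutoff for 3 df (7.815) is recovered as a sanity check.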

Table 4. Results of the post-hoc test on human responses. Values are given in log-odds ratio. P-value adjustment was conducted with the Tukey method.

Contrast | β | SE | z | p
Catalan – English | 1.671 | 0.574 | 2.913 | 0.0188*
Catalan – Greek | 0.145 | 0.598 | 0.243 | 0.9950
Catalan – Spanish | −0.655 | 0.724 | −0.904 | 0.8027
English – Greek | −1.526 | 0.443 | −3.441 | 0.0032*
English – Spanish | −2.326 | 0.560 | −4.151 | 0.0002**
Greek – Spanish | −0.800 | 0.593 | −1.348 | 0.5322

Direct comparisons between humans and models showed broadly similar levels of performance, except for the BERT model, whose performance was at floor across all languages (Fig 3). In English and Spanish, models generally outperformed humans, with ChatGPT-4, DeepSeek, Grok 3, and Mistral reaching 100% accuracy in Spanish, whereas humans averaged 95.6%. In English, models also surpassed humans, with Mistral at 100% compared to the human mean of 84.8%. In contrast, in Greek and Catalan, models did not (generally) outperform humans.

Fig 3. Agent comparisons: Accuracy scores.


In order to determine whether LLMs performed statistically differently from humans, a mixed-effects logistic regression model (glmer) was also run with Accuracy (correct vs. incorrect) as the dependent variable and Agent (human vs. LLM) as the independent variable. Participants and items were added as crossed random factors, allowing for random intercepts. Given the variability present across the tested LLMs, three different glmers were run: (i) humans vs. all LLMs, (ii) humans vs. all LLMs except BERT, and (iii) humans vs. only the best-performing LLM (ChatGPT-4). Table 5 shows that Agent was only a significant predictor of accuracy when BERT was included in the analysis (χ² = 5.917, p = 0.015). When BERT was excluded, Agent was not significant (χ² = 0.085, p = 0.77). Similarly, comparing the best-performing model to humans yielded no significant effect of Agent (χ² = 0.002, p = 0.96).

Table 5. Summary of mixed-effects regression model with Agent as main effect.

Comparison | β | SE | z | p
Humans vs. LLMs (excl. BERT) | −0.1813 | 0.6231 | −0.2910 | 0.7711
Humans vs. ChatGPT-4 | −0.0625 | 1.2927 | −0.0484 | 0.9614
Humans vs. all LLMs | −1.6087 | 0.6613 | −2.4325 | 0.015*

Despite LLMs performing on par with humans when their scores were averaged across languages, we wanted to determine whether LLM and human performance was alike within each language. To test this, another set of glmers was run, this time including Language and Agent as fixed factors: in the first model as two main effects, and in the second model as an interaction. Model selection was performed by comparing the log-likelihoods of the models using the anova() function. When all LLMs (i.e., also BERT) were included in the data analysis, results show a significant main effect of both Agent (χ² = 5.837, p = 0.016) and Language (χ² = 31.846, p < 0.001), with humans performing significantly better than models across all languages (β = 1.55, SE = 0.641, z = 2.416, p = 0.016). When BERT was excluded from the data analysis, a significant interaction between Agent and Language was found (χ² = 28.323, p < 0.001). Post-hoc comparisons using the emmeans() function with Tukey adjustment show that Agent did not play a significant role across the tested languages, with the exception of English, where humans performed statistically worse than LLMs (Table 6).

Table 6. Results of the post-hoc test (excluding BERT) on the effect of Agent (human – LLM) per language. Values are given in log-odds ratio. P-value adjustment was conducted with the Tukey method. Asterisks indicate statistically significant contrasts.

Language (Human – LLMs) | β | SE | z | p
English | −1.948 | 0.736 | −2.646 | 0.0081*
Spanish | −0.975 | 0.884 | −1.103 | 0.2701
Catalan | 0.986 | 0.593 | 1.663 | 0.0964
Greek | 0.708 | 0.635 | 1.114 | 0.2652

Error analyses provide further insight into language-specific patterns. In English, both humans and models occasionally overgeneralized irregular plural forms (e.g., sungi for sungus). Humans additionally produced real-word substitutions (e.g., fomputer → computer), typos (e.g., macumber for mucumbers) and inserted irrelevant words such as “more” or “yes”; these types of errors were not observed in models. In Greek, humans and models alike struggled with particular stems (e.g., pluralizing λίγρης as λίγρες or λίγρηδες), though humans produced typos absent in models. In Catalan, both agents often failed to apply orthographic adjustments, yielding forms such as fronjes instead of fronges. In Spanish, errors were rare, though some humans inserted extraneous words (e.g., or no), while ChatGPT-3.5 occasionally failed to pluralize. Overall, models tended to mirror human error patterns, though human responses contained a wider range of idiosyncratic deviations, including especially irrelevant insertions. Table 7 provides examples of correct and incorrect responses produced by both agents. A full breakdown of error types is available in the results files of the “Humans” and “Models” folders of the OSF repository.

Table 7. Sample of correct and incorrect responses from humans and LLMs.

Language Given example in singular Humans’ correct answers Humans’ responses with anomalies & failures LLMs’ correct answers LLM(s’) responses with anomalies & failures
English nandy nandies nandys nandies child
English sungus sunguses sungi/sungu/sungus’s sunguses sungi/sung
English luty luties lutties/lutys/luty’s luties lutys/rain
English watellite watellites watellies/wattelites/satellites watellites tree
Catalan famí famins Famís/famils/ famíssos famís/ famílics/ més
Catalan madira madires madires més
Catalan fronja fronges fronja/franges/fronjes/frongues fronjes/ més
Catalan claça claces classes/ claçes/calces claçes/claques/ més
Spanish zoche zoches zoche/zocheas zoches árbol
Spanish damino Daminos daños/dominos daminos hombre
Spanish troplema troplemas troplemas árbol
Spanish danta dantas dantes/dardas/frutos dantas niño
Greek φοντίκι (fodiki) φοντίκια (fodikia) ροντίκια(rodikia)/ ποντίκια (podikia) φοντίκια (fodikia) Φωντίκια (fodikia misspelled)
Greek λύννεφo (linnefo) λύννεφα (linnefa) λύννεφο (linnefo) λύννεφα (linnefa) λύνεφα (linefa)/ βιβλίο (vivlio)
Greek τίλος (tilos) τίλοι (tili) τίλοι (tili) ακόμη (akomi)
Greek λαρέκλα (larekla) λαρέκλες (larekles) λαρέκλες (larekles) άλλο (allo)

Taken together, these findings confirm that both humans and models perform well in generalizing morphological rules to novel words, with comparable overall accuracy. However, performance is modulated by language: as shown above, English emerges as the most difficult system for humans, while Catalan proves most challenging for models.

In order to better understand the effect of Language on the performance of LLMs, another mixed-effects logistic regression model (glmer) was run on the LLM data. As before, Accuracy (correct vs. incorrect) was coded as the dependent variable and Language (Catalan, English, Spanish, Greek) as the independent variable. Items and models were added as crossed random factors, allowing for random intercepts when possible. Given that BERT behaved significantly differently from all the other LLMs, and that including it in the data analysis caused the statistical models to violate some of the required assumptions (e.g., within-group deviations from uniformity), the rest of the paper reports the results of the data analysis excluding BERT. It must be highlighted that the main findings are not affected by whether this model is included or not, and any discrepancies will be reported (see the OSF repository, file Model_analyis.pdf, for a comparison of the analyses with and without this model).

Results show a significant main effect of Language on accuracy (χ² = 26.517, p < 0.001). A post-hoc analysis was conducted with the emmeans()-function to better understand the differences across languages. Table 8 shows the results. As can be seen, LLMs performed significantly worse in Catalan than in all the other tested languages.

Table 8. Results of the post-hoc test on LLMs (excluding BERT). Values are given in log-odds ratio. P-value adjustment was conducted with the Tukey method. Asterisks indicate statistically significant contrasts.

Contrast β SE z p
Catalan – English −2.250 0.584 −3.855 0.0007**
Catalan – Greek −0.989 0.408 −2.426 0.0722
Catalan – Spanish −2.984 0.770 −3.875 0.0006**
English – Greek 1.261 0.613 2.057 0.1674
English – Spanish −0.734 0.890 −0.826 0.8424
Greek – Spanish −1.995 0.792 −2.520 0.0569

The pattern observed is thus very different from the one obtained from humans (Table 4), a result which raises the question of which factor best determines the accuracy of LLMs cross-linguistically. As mentioned in the introduction, two different hypotheses have been put forth in the literature to explain LLM performance: the language’s complexity and the language’s community size. To determine which one is the better predictor, we ran two additional mixed-effects logistic regressions with Accuracy as the dependent variable and Items and Model (when the model converged) as crossed random effects. In one of the models, the independent variable was the Complexity Score of the tested languages; in the other, it was Community Size (z-scored). Results show a significant effect of both Complexity Score (χ² = 5.678, p = 0.017) and Community Size (χ² = 11.991, p = 0.0005). Model selection was performed by comparing the Akaike Information Criterion (AIC). The AIC was lower in the model with Community Size as the main effect (AIC = 277.858) than in the model with Complexity Score as the main effect (AIC = 290.659). This suggests that, of the two predictors, Community Size is the better one. The same conclusions are reached if BERT is included in the data analysis.
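The AIC used for this model selection is defined as AIC = 2k − 2·logLik, where k is the number of estimated parameters; the model with the lower value is preferred. A minimal sketch (the log-likelihoods and parameter counts below are hypothetical placeholders, not our fitted values):

```python
def aic(loglik, n_params):
    """Akaike Information Criterion: 2k - 2*logLik.
    Lower values indicate a better fit once model complexity
    has been penalized."""
    return 2 * n_params - 2 * loglik

# Hypothetical fits, for illustration only
aic_community = aic(loglik=-135.0, n_params=4)
aic_complexity = aic(loglik=-142.0, n_params=4)
preferred = "Community Size" if aic_community < aic_complexity else "Complexity Score"
print(aic_community, aic_complexity, preferred)
```

Because both candidate models here have the same number of parameters, the AIC comparison reduces to a comparison of log-likelihoods; the penalty term matters only when the models differ in complexity.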

Lastly, we wanted to determine whether there is an interaction between the two predictors (Community Size and Complexity Score). The previous two models (the one with Community Size as a main effect and the one with Complexity Score as a main effect) were compared with a model in which both Community Size and Complexity Score were main effects (model 2), and with a model in which Community Size and Complexity Score interacted (model 3). The models were compared by means of their log-likelihoods using the anova()-function. Results showed that the best-fitting model was the one with the interaction (p < 0.001). Importantly, this model shows a significant interaction between the two main predictors (χ² = 5.837, p = 0.016). To better understand the effect of the interaction, the predictions of the model were plotted using the ggeffects package [46] (Fig 4).

Fig 4. Predictions of LLMs accuracy based on complexity score and Community Size.

Fig 4

Overall, these results highlight the role not only of Complexity but also of Community Size in determining LLMs’ accuracy in deriving plural regularization patterns. The bigger the community, the better the models are predicted to perform. Interestingly, if two languages have a similar number of speakers, the more complex language is predicted to outperform the other.

Discussion

This study examines how LLMs generalize morphological patterns across languages by addressing three main research questions: whether LLMs exhibit human-like behavior in the generalization of novel morphological forms (RQ1); whether linguistic complexity influences performance (RQ2); and whether accuracy is instead primarily driven by training data size and speaker community (RQ3).

In relation to RQ1, our findings show that LLMs performed remarkably similarly to humans. Excluding clear outliers such as BERT, whose performance was uniformly at floor across languages, mixed-effects analyses revealed no statistically significant differences between the two agents. Moreover, models matched human accuracy in three of the four languages and even significantly outperformed humans in English. These results indicate that LLMs are able to generalize morphological rules to unseen items with a level of reliability comparable to that of humans, replicating previous evidence that models can successfully reproduce certain aspects of human morphological reasoning, at least in constrained, rule-based tasks [19,22].

The main contribution of the present study concerns RQ2 and RQ3, which probe the relative influence of linguistic complexity and data availability on the models’ performance. Our results reveal that both factors affected accuracy, but community size emerged as the stronger predictor. While linguistic complexity did introduce some variability in LLM accuracy – especially in morphologically rich languages like Catalan and Greek – model performance in languages with smaller speaker populations and, consequently, limited digital presence was systematically worse: LLMs were less accurate in Catalan and Greek, while English and Spanish, both supported by vast digital corpora, achieved more consistent and higher accuracy.

One might object that these results could also reflect an effect of linguistic complexity, since Catalan and Greek are ranked highest also according to this metric. However, a closer examination of the results in Fig 3 reveals a clear asymmetry: the ranking of model performance observed in our study (i.e., Spanish > English > Greek > Catalan) aligns more closely with the distribution of community size (English > Spanish > Greek > Catalan) than with that of linguistic complexity (English < Spanish < Catalan < Greek). Statistical analyses corroborate this pattern, indicating that although both factors contribute, community size is the strongest predictor of accuracy. Interestingly, moreover, the effect of linguistic complexity runs counter to conventional expectations found in the literature: when community sizes are comparable, our model predicts that greater complexity is in fact associated with higher accuracy.

A related and somewhat surprising observation concerns English, the least complex and most widely spoken language in our sample. Despite this advantage, it was not the best-performing language for either humans or models. Instead, Spanish consistently yielded the highest model accuracy, even though it is more morphologically complex and has a smaller speaker community.

This unexpected outcome —which potentially contradicts our main claim that community size better predicts model accuracy— deserves some clarification. In our study, the English task contained tokens that resemble words with irregular plural forms, thus complicating rule generalization for both humans and models. Spanish, by contrast, exhibits highly regular paradigms across all stimuli, which likely facilitated the models’ consistent accuracy. Indeed, in English, both human participants and models alike occasionally extended familiar pluralization patterns inappropriately, producing forms such as sungi (target: sunguses) or lutys (target: luties). These analogical extensions mirror human tendencies in morphological generalizations and align with [47]’s observation that LLMs and humans both rely on structural analogy when confronted with novel linguistic material. The inherent irregularities of the English stimuli may thus be argued to contribute to lower accuracy scores.

Moreover, the comparatively weaker performance of English relative to Spanish may reflect broader typological differences: Germanic languages generally display greater morphological irregularities than Romance languages [48], which can hinder token- or pattern-based generalization in models. Similarly, Greek, despite its smaller speaker community, shows high accuracy, likely due to its morphological regularity. Together, these observations suggest that while data availability is the primary driver of LLM performance, internal morphological consistency may also moderate accuracy.

This nuanced conclusion refines existing claims in the literature. Whereas [22] identify linguistic complexity as the stronger predictor of model behavior, our results instead underscore the predominant importance of training resources, tempered by their interaction with morphological regularities. This becomes particularly evident when English performance is set aside. For instance, Greek, which is the most complex language according to our metrics, did not occupy the lowest position; on the contrary, it systematically outperformed Catalan, which is linguistically simpler. This finding would be unexpected if grammatical complexity were the main determinant of model performance. Even more strikingly, Spanish and Catalan are ranked very closely in terms of linguistic complexity, yet their results diverge significantly, as shown in Table 8. This difference aligns with their relative representation in the training data: Spanish, far more digitally present, achieved markedly higher model accuracy than Catalan, which is poorly represented. Taken together, these findings challenge the assumption that linguistic complexity is a direct proxy for enhanced model performance and instead foreground the decisive role of data resources.

Put differently, the results support the conclusion that LLMs exhibit a high degree of resource sensitivity while remaining largely insensitive to linguistic complexity. Although these two factors undoubtedly interact, the evidence indicates that model performance is conditioned primarily by the quantity and representativeness of the training data, rather than by sensitivity to the internal structural complexity of natural language systems. Models excel in languages with abundant and well-digitized corpora, but they do not exhibit systematic sensitivity to the morphological intricacies of those languages. This finding suggests that LLMs may rely on more mechanistic processes, such as tokenization, to parse language [3,7,8]. Such an interpretation aligns with research showing that sub-word tokenization methods like Byte Pair Encoding [49] privilege high-frequency morphological patterns and thereby benefit resource-rich languages like Spanish.
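To illustrate why BPE privileges high-frequency patterns, the sketch below (a toy re-implementation of the BPE merge rule on invented word frequencies loosely modeled on our Spanish stimuli, not any model’s actual tokenizer) learns merges from a handful of regular plurals. Because the ‘-as’ ending dominates the pair counts, that suffix is merged into a single subword before any stem-internal pair:

```python
from collections import Counter

def bpe_merges(word_freqs, num_merges):
    """Learn BPE merges from a toy word-frequency dictionary.
    Each word starts as a sequence of characters; at every step the
    most frequent adjacent symbol pair is merged into one symbol."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq, freq in vocab.items():
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for seq, freq in vocab.items():
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(seq[i] + seq[i + 1])
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab

# Toy corpus with frequent regular plural endings (invented frequencies)
corpus = {"gatos": 5, "zoches": 4, "dantas": 5, "troplemas": 3}
merges, vocab = bpe_merges(corpus, 3)
print(merges[0])  # the plural ending 'a'+'s' is the first merge learned
```

The point of the sketch is frequency sensitivity: a suffix shared by many high-frequency types becomes an early merge and hence a stable subword, whereas rare morphological patterns in under-resourced languages are split into less informative pieces.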

Turning briefly to human accuracy, several factors may help explain why performance did not reach ceiling levels. Some participants produced incorrect responses in the initial test items, likely due to a misinterpretation of the task instructions. For example, when prompted with “Now there are two ___?”, several participants replied with the equivalent of ‘yes’, which implies that they were evaluating the truth value of the sentence rather than filling in the gap. In other cases, responses that were semantically appropriate (i.e., that correctly performed the task at stake, namely pluralization) were coded as incorrect due to orthographic or prosodic inaccuracies, such as misspellings, typos, or misplaced stress. Such deviations, which were absent from the models’ output, likely reflect human-specific performance factors such as attention lapses, cognitive fatigue, or time pressure.

Certain limitations of this study warrant consideration. First, although linguistic complexity scores were computed using data from a typologically diverse linguistic database, these scores encompass a broad range of linguistic features extending beyond morphology to syntax and semantics. While such holistic measures provide valuable insights into cross-linguistic diversity, they may not align precisely with the specific focus of this study. Given that the Wug Test primarily targets inflectional morphological processes, it would be more accurate to isolate and assess morphological complexity as a distinct construct. A morphology-focused metric would likely alter the relative scores and enable more precise cross-linguistic comparisons. Future research would therefore benefit from constructing dedicated indices of morphological complexity, which take into consideration factors such as paradigm size, rule regularity, allomorphy, and morphological transparency, to better contextualize both model and human performance.

A second limitation concerns the relatively small number of languages tested. Although the selected languages were chosen to ensure diversity in both training data size and structural complexity, this sample restricts the generalizability of the findings across the world’s languages, especially those that are underrepresented. Including languages with radically different typological profiles (e.g., agglutinative, polysynthetic, or tonal systems) would offer a more comprehensive understanding of how LLMs handle morphological generalization under varying linguistic and data conditions.

A third limitation concerns the strength of the link we have established between the size of the community and the size of the training data. While there is a connection between the two —a small community that speaks a minoritized, non-official language will likely have fewer resources, and this scarcity of high-quality digitized data places an upper limit on LLM performance—, taking community size as a proxy for the volume of training data is neither simple nor straightforward. For example, Basque, a language isolate with no demonstrable relationship with any other language, has a community of about 800,000 people. Swahili, a Bantu language spoken across several countries, has a community of 150 million people. However, Basque has a much stronger digital presence than Swahili, with six times more Wikipedia articles in the former than in the latter [50]. This suggests that while a link between community size and training data size exists, it is modulated by many geopolitical factors, such that the so-called ‘low-resourced’ languages are not a uniform category. This weakens any attempt to establish a direct relationship between ‘low-resourced’ vs. ‘high-resourced’ languages and LLM performance.

A last limitation concerns prompt-related bias. Even subtle changes in wording can significantly alter LLM replies in linguistic tasks. It is possible that choosing different prompts or different nonce words would have led to different results. However, as our results agree with what has been reported in the literature using different prompts and task designs ([19,22]), we consider this possibility relatively unlikely.

Conclusion

This study examined how LLMs extend morphological generalizations to novel words across languages differing in linguistic complexity and community size. The results revealed that while models can replicate human-like accuracy, their success is determined primarily by the availability of training data rather than by grammatical complexity. Languages with larger community sizes and greater digital representation, such as Spanish and English, consistently yielded higher model accuracy than less represented ones like Greek and Catalan. At the same time, the regularity of a language’s morphological system can moderate performance, allowing even poorly represented languages to achieve higher accuracy when structural patterns are transparent and consistent. These findings suggest that while LLMs remain powerful pattern learners, their behavior reflects a distributional, rather than structural, grasp of language: one that mirrors human performance only in output but not necessarily in the underlying mechanisms.

Acknowledgments

We would like to thank Sergi Balari and Elena Pagliarini for their valuable feedback on previous versions of this work. We are also grateful to M. Teresa Espinal for her help with the Catalan stimuli, and to Olena Shcherbakova and Hedvig Skirgård for their assistance with the Grambank complexity scores. All remaining errors are our own.

Data Availability

All data files are available from the OSF database (https://osf.io/4z5n6/).

Funding Statement

EL acknowledges funding from the Spanish Ministry of Science, Innovation & Universities (MCIN/AEI, https://doi.org/10.13039/501100011033) under the research project CNS2023-144415. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Naveed H, Khan AU, Qiu S, Saqib M, Anwar S, Usman M. A comprehensive overview of large language models. arXiv. 2024. https://arxiv.org/abs/2307.06435 [Google Scholar]
  • 2.Atox N, Clark M. Evaluating Large Language Models through the Lens of Linguistic Proficiency and World Knowledge: A Comparative Study. Wiley. 2024. doi: 10.22541/au.172479372.22580887/v1 [DOI] [Google Scholar]
  • 3.Cuskley C, Woods R, Flaherty M. The Limitations of Large Language Models for Understanding Human Language and Cognition. Open Mind (Camb). 2024;8:1058–83. doi: 10.1162/opmi_a_00160 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Dentella V, Günther F, Leivada E. Language in vivo vs. in silico: Size matters but Larger Language Models still do not comprehend language on a par with humans due to impenetrable semantic reference. PLoS One. 2025;20(7):e0327794. doi: 10.1371/journal.pone.0327794 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Leivada E, Marcus G, Günther F, Murphy E. A sentence is worth a thousand pictures: Can large language models understand hum4n l4ngu4ge and the w0rld behind w0rds?. arXiv. 2023. doi: 10.48550/arXiv.2308.00109 [DOI] [Google Scholar]
  • 6.Barattieri Di San Pietro C, Frau F, Mangiaterra V, Bambini V. The pragmatic profile of ChatGPT: Assessing the communicative skills of a conversational agent. Sistemi intelligenti. 2023;35(2):379–400. doi: 10.1422/108136 [DOI] [Google Scholar]
  • 7.Dentella V, Günther F, Leivada E. Systematic testing of three Language Models reveals low language accuracy, absence of response stability, and a yes-response bias. Proc Natl Acad Sci U S A. 2023;120(51):e2309583120. doi: 10.1073/pnas.2309583120 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Dentella V, Günther F, Murphy E, Marcus G, Leivada E. Testing AI on language comprehension tasks reveals insensitivity to underlying meaning. Sci Rep. 2024;14(1):28083. doi: 10.1038/s41598-024-79531-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Javier Vazquez Martinez H, Lea Heuser A, Yang C, Kodner J. Evaluating Neural Language Models as Cognitive Models of Language Acquisition. In: Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP, 2023. 48–64. doi: 10.18653/v1/2023.genbench-1.4 [DOI] [Google Scholar]
  • 10.Zhou H, Hou Y, Li Z, Wang X, Zhang D, How M. How well do large language models understand syntax? An evaluation by asking natural language questions. 2023. [Google Scholar]
  • 11.Hupkes D, Giulianelli M, Dankers V, Artetxe M, Elazar Y, Pimentel T, et al. A taxonomy and review of generalization research in NLP. Nat Mach Intell. 2023;5(10):1161–74. doi: 10.1038/s42256-023-00729-y [DOI] [Google Scholar]
  • 12.Kandpal N, Deng H, Roberts A, Wallace E, Raffel C. Large Language Models Struggle to Learn Long-Tail Knowledge. In: ICML’ 23: Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii USA, 2023. 15696–707. [Google Scholar]
  • 13.Piantadosi S. Modern language models refute Chomsky’s approach to language. In: Gibson E, Poliak M. From fieldwork to linguistic theory: A tribute to Dan Everett. Berlin: Language Science Press. 2024. 353–414. [Google Scholar]
  • 14.Abid A, Farooqi M, Zou J. Persistent Anti-Muslim Bias in Large Language Models. In: Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, 2021. 298–306. doi: 10.1145/3461702.3462624 [DOI] [Google Scholar]
  • 15.Shaikh O, Zhang H, Held W, Bernstein M, Yang D. On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023. 4454–70. doi: 10.18653/v1/2023.acl-long.244 [DOI] [Google Scholar]
  • 16.Navigli R, Conia S, Ross B. Biases in Large Language Models: Origins, Inventory, and Discussion. J Data and Information Quality. 2023;15(2):1–21. doi: 10.1145/3597307 [DOI] [Google Scholar]
  • 17.Haspelmath M, Sims A. Understanding Morphology. 2nd ed. Oxford (UK): Routledge. 2010. [Google Scholar]
  • 18.Berko J. The Child’s Learning of English Morphology. Word. 1958;14(2–3):150–77. doi: 10.1080/00437956.1958.11659661 [DOI] [Google Scholar]
  • 19.Weissweiler L, Hofmann V, Kantharuban A, Cai A, Dutt R, Hengle A, et al. Counting the Bugs in ChatGPT’s Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023. 6508–24. doi: 10.18653/v1/2023.emnlp-main.401 [DOI] [Google Scholar]
  • 20.Yule G. Morphology. In: Yule G. The Study of Language. 4 ed. New York: Cambridge University Press. 2010. 66–79. [Google Scholar]
  • 21.Wiese R. The grammar and typology of plural noun inflection in varieties of German. J Comp German Linguistics. 2009;12(2):137–73. doi: 10.1007/s10828-009-9030-z [DOI] [Google Scholar]
  • 22.Anh D, Raviv L, Galke L. Morphology Matters: Probing the Cross-linguistic Morphological Generalization Abilities of Large Language Models through a Wug Test. In: Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, 2024. 177–88. doi: 10.18653/v1/2024.cmcl-1.15 [DOI] [Google Scholar]
  • 23.Pallotti G. A simple view of linguistic complexity. Second Language Research. 2014;31(1):117–34. doi: 10.1177/0267658314536435 [DOI] [Google Scholar]
  • 24.Bulté B, Housen A. Defining and operationalising L2 complexity. In: Housen A, Kuiken F, Vedder I. Dimensions of L2 performance and proficiency: Investigating complexity, accuracy and fluency in SLA. John Benjamins. 2012. 21–46. [Google Scholar]
  • 25.Shcherbakova O, Michaelis SM, Haynie HJ, Passmore S, Gast V, Gray RD, et al. Societies of strangers do not speak less complex languages. Sci Adv. 2023;9(33):eadf7704. doi: 10.1126/sciadv.adf7704 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Skirgård H, Haynie HJ, Hammarström H, Blasi DE, Collins J, Latarche J. Grambank v1.0. 2023. doi: 10.5281/zenodo.7740140 [DOI] [Google Scholar]
  • 27.Python Software Foundation. Python. 2023. https://www.python.org/downloads/release/python-3120/
  • 28.Spyder. 2024. https://docs.spyder-ide.org/current/quickstart.html
  • 29.Lupyan G, Dale R. Language structure is partly determined by social structure. PLoS One. 2010;5(1):e8559. doi: 10.1371/journal.pone.0008559 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Dale R, Lupyan G. Understanding the origins of morphological diversity: the linguistic niche hypothesis. Advs Complex Syst. 2012;15(03n04):1150017. doi: 10.1142/s0219525911500172 [DOI] [Google Scholar]
  • 31.Wray A, Grace GW. The consequences of talking to strangers: Evolutionary corollaries of socio-cultural influences on linguistic form. Lingua. 2007;117(3):543–78. doi: 10.1016/j.lingua.2005.05.005 [DOI] [Google Scholar]
  • 32.Raviv L, Meyer A, Lev-Ari S. Larger communities create more systematic languages. Proc Biol Sci. 2019;286(1907):20191262. doi: 10.1098/rspb.2019.1262 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Koplenig A. Language structure is influenced by the number of speakers but seemingly not by the proportion of non-native speakers. R Soc Open Sci. 2019;6(2):181274. doi: 10.1098/rsos.181274 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Ouyang L, Wu J, Jiang X, Almeida D, Wainwright CL, Mishkin P. Training language models to follow instructions with human feedback. arXiv. 2022. https://arxiv.org/abs/2203.02155 [Google Scholar]
  • 35.OpenAI. GPT-4 Technical Report. 2024.
  • 36.xAI. Grok-3. 2024. https://docs.x.ai/docs/models/grok-3
  • 37.Devlin J, Chang M-W, Lee K, Toutanova K. In: Proceedings of the 2019 Conference of the North, 2019. 4171–86. doi: 10.18653/v1/n19-1423 [DOI] [Google Scholar]
  • 38.DeepSeek A. DeepSeek. 2023. https://deepseek.com
  • 39.Mistral AI Team. Mistral AI Version 7B. 2023. https://mistral.ai/news/announcing-mistral-7b/
  • 40.Zehr J, Schwarz F. PennController for Internet Based Experiments (IBEX). 2023. [Google Scholar]
  • 41.Toro JM. Emergence of a phonological bias in ChatGPT. arXiv. 2023. https://arxiv.org/abs/2305.15929 [Google Scholar]
  • 42.Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, et al. Transformers: State-of-the-Art Natural Language Processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020. 38–45. doi: 10.18653/v1/2020.emnlp-demos.6 [DOI] [Google Scholar]
  • 43.R Core Team. The R Project for Statistical Computing. 2024. https://www.R-project.org/
  • 44.Generalized Linear Mixed-Effects Model (GLMM). https://search.r-project.org/CRAN/refmans/lme4/html/glmer.html
  • 45.Estimated Marginal Means (EMMs). 1980. https://cran.r-project.org/web/packages/emmeans/index.html
  • 46.Lüdecke D. ggeffects: Tidy Data Frames of Marginal Effects from Regression Models. JOSS. 2018;3(26):772. doi: 10.21105/joss.00772 [DOI] [Google Scholar]
  • 47.Musker S, Duchnowski A, Millière R, Pavlick E. LLMs as Models for Analogical Reasoning. In: 2025. https://arxiv.org/abs/2406.13803 [Google Scholar]
  • 48.Hernández-Gómez C, Basurto-Flores R, Obregón-Quintana B, Guzmán-Vargas L. Evaluating the Irregularity of Natural Languages. Entropy. 2017;19(10):521. doi: 10.3390/e19100521 [DOI] [Google Scholar]
  • 49.Dang TA, Raviv L, Galke L. Tokenization and morphology in multilingual language models: A comparative analysis of mT5 and ByT5. arXiv. 2024. https://arxiv.org/abs/2410.11627 [Google Scholar]
  • 50.Claus H. Now you are speaking my language: why minoritised LLMs matter. Ada Lovelace Institute. 2024. https://www.adalovelaceinstitute.org/blog/why-minoritised-llms-matter/ [Google Scholar]

Decision Letter 0

Wei Lun Wong

15 Dec 2025

Dear Dr. Morosi,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jan 22 2026 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org . When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Wei Lun Wong

Academic Editor

PLOS One

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. If the reviewer comments include a recommendation to cite specific previously published works, please review and evaluate these publications to determine whether they are relevant and should be cited. There is no requirement to cite these works unless the editor has indicated otherwise.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Partly

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #5: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #5: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?


Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #5: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #5: Yes

**********

Reviewer #1: This paper examines how Large Language Models apply morphological rules to novel words using a multilingual version of the Wug Test. By comparing six models across four languages with human speakers, the study evaluates whether model performance reflects true linguistic competence or merely the amount of training data available. The findings suggest that data availability and community size, rather than linguistic complexity, primarily shape model accuracy.

- Please explain the motivation / scientific importance for comparing particularly community size and grammar complexity.

- In the abstract, please distinguish between resource size and community size, as the current wording creates confusion.

- Please add a model diagram in the methodology section.

- It is suggested to use more appropriate (specific) wording than "language-blind" as language could mean many things. At present it leads to confusion.

- Colors in Fig 2 are indistinguishable, please add visibly separate patterns over the bars.

- The study would benefit if the authors separately tested the generative LLMs and the reasoning/thinking LLMs.

- "English, the least complex of the four languages we tested, was not the best-performing language for either humans or models. Instead, Spanish consistently yielded the highest model accuracy, despite its greater linguistic complexity." This conclusion is in contrast to the authors' main claim in the paper. The authors need to investigate the actual reason behind this, given that English is simple, has a larger community, and more resources.

- Similarly, the authors need to investigate why: "At the same time, Greek, which is the most complex language according to our metrics, did not occupy the lowest position; on the contrary, it systematically outperformed Catalan, which is relatively linguistically simpler."

- The authors should perform statistical significance tests to verify the relevance of these claims.

- Please explain the compute time and resources that were invested to conduct the study.

- For the Wug Test, give multiple examples in the results showing how different models performed and how humans performed. Also add the cases where reported anomalies were seen. Like the failure cases and the unexpected anomalistic cases.

- The length of the paper appears rather short; more emphasis is given to the literature review, and the presentation can be improved. It is suggested to make a chronological table of all the studies performed on this topic, listing their conclusions/key findings, experimental setup, and datasets used.

- Furthermore, the results section needs to be strengthened by showing multiple examples/results.

- Please explain why and how accuracy is selected as the metric of choice. In many cases it is not the correct measure of performance; also add the AUC, recall, sensitivity, F1-score, and precision scores.

Overall, the manuscript appears to make useful contribution, but further justifications are required.

Reviewer #2: Dear Author,

The manuscript presents a clear and well-designed study examining how Large Language Models generalize morphological rules across four languages using a multilingual Wug Test. The research question is timely, and the methodology—especially the construction of nonce stimuli, the balanced design, and the use of GLMMs—is appropriate and transparent. Ethical approval, participant recruitment, and data availability are all thoroughly documented.

The results are clearly presented, and the interpretation is reasonable, particularly the conclusion that model performance aligns more with community size and data exposure than with structural complexity. Some claims, however, would benefit from slightly more cautious wording. The limitations section could also briefly address potential prompt-related biases when interacting with different models.

Overall, this is a strong and relevant contribution. With minor clarifications and small stylistic adjustments, the manuscript would be suitable for publication in PLOS ONE.

Reviewer #3: Thank you for the opportunity to review this interesting and timely manuscript. The study raises valuable questions; however, several areas would benefit from clarification, refinement, and further detail to strengthen the overall contribution. My detailed comments are as follows:

* Lines 252–254: This paragraph does not appear necessary and could be removed without affecting the clarity or structure of the manuscript.

* Wug test materials: You mention 30 items; however, the file provided (“Humans: Task and stimulus: novel words.xlsx”) shows 15 two-syllable and 15 three-syllable words. It is important to clarify this breakdown in the manuscript and explicitly indicate that the full list is available in the supplementary materials.

* Selection criteria for nonce words: Please provide more information on how the nonce words were selected. Clarifying the linguistic criteria used would improve methodological transparency.

* Targeted morphological processes: Briefly state which morphological processes were targeted in the Wug test (e.g., inflectional morphology). Explain the motivation behind selecting these particular processes and whether this selection is supported by prior research.

* Lines 296–297: The exclusion criteria include cognitive, neurological, hearing, or speech-related impairments. Please comment on whether these criteria could have influenced the results. If these factors are not expected to affect outcomes, clarify the rationale for including them.

* Mode and medium of testing: The manuscript should specify whether the test was administered in written form, spoken form, or both. If spoken, please indicate whether any recording requirements or controls were implemented for human participants.

* Line 339: The link provided earlier is repeated here; the duplication is unnecessary.

* Table 3: It is unclear why Catalan was not reported in this table. Please clarify or revise accordingly. Additionally, the inclusion of “(Intercept)” requires explanation, and it would be helpful to comment on this table in the main text.

* Language quality: A careful proofreading pass is needed to address minor issues with capitalization and punctuation.

* Line 394: The phrase “errors not observed in models” requires clarification. Does this refer to all errors, or only the specific error types discussed?

* Lines 401–402: When referencing the full breakdown of error types available in the OSF repository, please specify the file name to guide readers.

* Lines 476–477: The statement that “Germanic languages display higher levels of morphological irregularity than Romance languages” introduces a comparative rationale. If this comparison is central to the study, please present it earlier in the Methods section. It would also be helpful to explain the rationale for selecting these four languages and what distinguishes them.

* Choice of languages: More broadly, please articulate the logic behind selecting these four languages. What features or contrasts make them suitable for comparison in this context?

* Line 497: Consider rephrasing the sentence beginning with “Before concluding, it is important to acknowledge the limitations…” to avoid a dialogic construction.

* Statistical Analysis: The statistical framework chosen for this study is appropriate for the data structure. Using generalized linear mixed models (GLMMs) to model binary accuracy is sound, and treating test items as random effects is a clear strength. However, several aspects of the analysis would benefit from refinement to ensure full rigor and transparency. Most importantly, the models for human data do not include participants as random effects, despite repeated observations, which may underestimate variance and inflate the apparent significance of fixed effects. Additionally, the structure of the combined human–model analysis is under-specified, particularly regarding whether an Agent × Language interaction was included, and the reporting relies primarily on p-values without providing effect sizes or diagnostic checks. Finally, the conclusions about linguistic complexity versus community size extend beyond what is directly supported by the statistical tests presented. Addressing these points would substantially strengthen the analytical robustness of the manuscript.

* Use of tools: Please ensure that all tools, platforms, or software referenced in the Methods and Results sections also appear in the References list.

* References: The reference list contains inconsistencies in formatting. Please revise to ensure adherence to the journal’s required style.

Overall, the study presents promising results, and addressing the points above will help enhance the clarity, rigor, and completeness of the manuscript.

Reviewer #4: The manuscript presents a technically sound and methodologically rigorous study that clearly supports its conclusions. The experimental design, which is a controlled multilingual Wug Test administered to both human participants and six LLMs, is appropriate for examining morphological generalization, and the authors justify their choice of languages, stimuli construction, and procedures in detail. The statistical analyses are correctly executed and suitable for the research questions. Accuracy is modeled with GLMMs including random effects for items, model comparisons are performed through likelihood ratio tests, and estimated marginal means are used to interpret cross-linguistic differences. These methods are transparent, replicable, and provide robust support for the claims made. Data availability fully complies with PLOS ONE policies, with all raw data, code, and stimuli openly accessible on OSF without restrictions. The manuscript itself is clearly written, logically structured, and expressed in standard academic English. While minor phrasing issues appear occasionally, they do not impede comprehension or interpretation of the findings. Overall, the authors deliver a well-executed study that contributes meaningfully to ongoing debates about LLM linguistic competence and presents results in a way that is both empirically grounded and accessible to a broad readership.

Room for improvement:

Although the manuscript is well executed, several weaknesses limit the strength of its conclusions. First, the operationalization of “linguistic complexity” is not fully aligned with the study’s goals. The authors rely on global Grambank fusion and informativity scores (e.g., lines 147–166), but these metrics incorporate many grammatical domains irrelevant to nominal morphology. This weakens claims about the relationship between morphological complexity and model accuracy, especially when the authors later acknowledge (lines 499–507) that a morphology-specific measure would be more appropriate. Second, the English stimuli appear to contain several nonce forms that unintentionally resemble irregular plural patterns (e.g., sungus, lutie), which the authors note may have misled both humans and models (lines 467–474). Because these irregularity-triggering forms are not systematically described or quantified, it is difficult to assess whether English’s lower accuracy reflects linguistic complexity, task design artifacts, or item-specific biases. Finally, some claims in the discussion appear stronger than the data justify. For example, the conclusion that models are “language-blind” and guided primarily by resource availability (lines 453–457) may overgeneralize from only four languages, two of which are typologically similar (Catalan and Spanish). These weaknesses should be addressed to strengthen the study’s empirical and theoretical claims.

Reviewer #5: Good paper, interesting idea. I like the Wug Test setup across languages. The main finding is solid and worth publishing. But there are some pretty big problems to fix first.

• Stats need a major clean-up. The numbers in the paper don't add up enough for me to really trust them yet.

• That Table 3 is confusing. What's it comparing to? I need to see the full stats table.

• Saying results aren't "significant" with p=.055 and .092 is kind of shaky. That's really close. You can't just say "they're the same" and move on. Talk about what those borderline numbers might mean.

• Kicking BERT out because it got 0% feels like cheating. If it's that bad, that's actually a cool finding! Either put it back in and explain why it's so different, or give a better reason upfront for why it doesn't count.

• "Community Size" = Training Data? Not so fast. Your whole argument rests on this, but it's a huge assumption. Just because more people speak Spanish doesn't mean GPT saw exactly that much more Spanish text. The internet is weird. You need to either defend this link way better with evidence, or tone down your conclusions a lot and call this a major guess you had to make.

• You're over-selling it. Calling LLMs "language-blind" in the title is too much. Your own data shows they kind of notice if a language is regular (like Spanish). Tone it down. Also, be careful saying they only have "superficial" competence: that's a philosophy paper. Your experiment just shows they're good at this specific pattern-matching task, and their skill depends on how much stuff they've read.

• Tell us exactly what you typed into ChatGPT. The prompt matters.

• Figure 2 is a mess of lines. Make it simpler.

• Fix the references. Some are missing info.

• Don't blame "hard test items" for English being tough. Just stick with the "Germanic languages are irregular" explanation, which is better.

The core idea is cool and the paper should eventually be published. But you got to fix the stats, be more honest about the "community size" guess, and don't claim more than you actually proved. Do that, and you'll have a much stronger paper.

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy

Reviewer #1: No

Reviewer #2: No

Reviewer #3: Yes:  Nada AlJamal

Reviewer #4: Yes:  Parisa Etemadfar

Reviewer #5: Yes:  EBA TERESA GAROMA

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

To ensure your figures meet our technical requirements, please review our figure guidelines: https://journals.plos.org/plosone/s/figures

You may also use PLOS’s free figure tool, NAAS, to help you prepare publication quality figures: https://journals.plos.org/plosone/s/figures#loc-tools-for-figure-preparation.

NAAS will assess whether your figures meet our technical requirements by comparing each figure against our figure specifications.

PLoS One. 2026 Mar 11;21(3):e0343164. doi: 10.1371/journal.pone.0343164.r002

Author response to Decision Letter 1


20 Jan 2026

Dear Reviewers and editor,

We greatly appreciate the comments, questions and suggestions from the five Reviewers, as they have significantly contributed to the improvement of the revised paper we are now resubmitting.

Please see below the responses to the specific comments made by the five reviewers. You can also access these responses in the file "Response to Reviewers" we have uploaded with our resubmission.

Sincerely,

Dr. Paolo Morosi on behalf of all the authors

We thank the Editor for the helpful reviews they secured for us and the five Reviewers for their comments, which have contributed to strengthening our work. Below, we respond to each comment and list the changes we made in the revised version of the manuscript.

Reviewer #1

This paper examines how Large Language Models apply morphological rules to novel words using a multilingual version of the Wug Test. By comparing six models across four languages with human speakers, the study evaluates whether model performance reflects true linguistic competence or merely the amount of training data available. The findings suggest that data availability and community size, rather than linguistic complexity, primarily shape model accuracy.

We thank the Reviewer for their assessment and for the very helpful comments which we address below, linking them to the revised manuscript.

- Please explain the motivation / scientific importance for comparing particularly community size and grammar complexity.

In the revised manuscript, the motivation has been added on pp. 6-7, lines 130-133.

- In the abstract, please distinguish between resource size and community size, as the current wording creates confusion.

The abstract has been rephrased in order to make the connection between the two notions, community size and volume of training data, clearer. To some degree, the latter is conditioned by the former. For example, a small community that speaks a minoritized language is unlikely to have a strong digital presence. This explains why the performance of LLMs has been found to be better in big, standard, official languages. On p. 29 (lines 643-654) of the revised manuscript, we have also added discussion on low-resourced vs. high-resourced languages in order to better justify the link we put forth between training data/resource size and community size.

- Please add a model diagram in the methodology section.

A model diagram has been added as Fig 1 on p. 15.

- It is suggested to use more appropriate (specific) wording than "language-blind" as language could mean many things. At present it leads to confusion.

The title has been changed.

- Colors in Fig 2 are indistinguishable, please add visibly separate patterns over the bars.

We have changed the colors and also the presentation of Figure 3. Now the models appear on the x-axis and the languages are separated across different panels.

- The study would benefit if the authors separately tested the generative LLMs and the reasoning/thinking LLMs.

In the Agent_analysis file in the OSF repository (https://osf.io/4z5n6/, go to: Code > Humans_and_Models_Wug_Test), Sections 3.4 and 3.5 now also include this analysis comparing reasoning and general LLMs with humans. Only the general-purpose models differ from humans, and this is due to the effect of BERT; once this model is removed (Section 3.5), neither reasoning nor non-reasoning models are statistically different from humans.

- "English, the least complex of the four languages we tested, was not the best-performing language for either humans or models. Instead, Spanish consistently yielded the highest model accuracy, despite its greater linguistic complexity." This conclusion is in contrast to the authors' main claim in the paper. The authors need to investigate the actual reason behind this, given that English is simple, has a larger community, and more resources.

We thank the reviewer for raising this important point, which we addressed in the revised manuscript. Our new statistical analysis indeed confirms that both community size and linguistic complexity are significant predictors of model accuracy, and that community size is in fact the stronger predictor. These results align with the core claim of our manuscript.

However, the reviewer is right that, given that English is linguistically simpler and has the largest speaker community, the fact that it is not the best-performing language is unexpected. Nonetheless, it is important to note that the two factors, community size and linguistic complexity, interact in ways that are difficult to fully disentangle with currently available metrics. As discussed in the revised version of the manuscript, we relied on a global complexity score, following previous work. Although Spanish comes out as more complex than English overall, specific subcomponents of complexity (e.g., morphological complexity in plural formation in the case at hand) may differ across languages. It is conceivable, for instance, that English plural morphology presents greater irregularity in the specific Wug-test domain we examined, which could depress model performance relative to Spanish. This interpretation, however, is only speculative, given the lack of fine-grained complexity metrics.

Importantly, however, this line of reasoning would also predict Spanish and Catalan to pattern similarly, as both languages share comparable morphological properties, especially in plural formation. Yet our data show that Catalan performed substantially worse than both English and Spanish. This discrepancy is one of the reasons that led us to hypothesize that community size – and thus representation in the training data – is likely to be a better predictor of model accuracy. Our new statistical analysis supports this interpretation.

A further factor that may have disproportionately affected English performance relates to broader typological tendencies. As discussed in the manuscript, Germanic languages have been reported to exhibit higher levels of morphological irregularity than Romance languages, which can hinder token- or pattern-based generalization in models. These irregularities are also reflected in our prompt design: we deliberately used pseudo-words modeled on irregular nouns (e.g., sungus from fungus) to probe the limits of the models’ generalizations. Together, these factors may have increased error rates specifically in English, as models—like humans—sometimes produced the irregular plural form rather than the target generalization. This effect would be less pronounced in Spanish or Catalan, where irregular plural patterns are less frequent.

In sum, although we can identify plausible contributing factors, we agree that a complete explanation for the relative performance of English cannot be provided with certainty at present. We have now addressed this point more explicitly in the revised manuscript and noted it as an avenue for future research.

- Similarly, the authors need to investigate why: "At the same time, Greek, which is the most complex language according to our metrics, did not occupy the lowest position; on the contrary, it systematically outperformed Catalan, which is relatively linguistically simpler."

We thank the reviewer for raising this point. As noted in our response to the previous comment, disentangling the contributions of linguistic complexity and community size is challenging, given that both factors are significant predictors of model accuracy in our new statistical analysis. However, the specific pattern highlighted here, namely Greek outperforming Catalan despite being the more complex language, does not contradict our main claim. In fact, it supports it: if linguistic complexity were the primary determinant of performance (i.e., the less complex a language is, the better LLMs perform at it), Catalan should have ranked higher than Greek. The opposite outcome suggests that community size and, more broadly, digital representation play a more decisive role. Greek has a substantially larger speaker community than Catalan. Therefore, this asymmetry provides a plausible explanation for why Greek systematically outperformed Catalan in our results. We have clarified this point in the revised manuscript on p. 27, lines 585-592.

- The authors should perform statistical significance tests to verify the relevance of these claims.

Table 8 now includes the pairwise comparisons across the different languages in the LLMs' performance. In the model, Catalan is statistically different from English and Spanish; there are two borderline cases (i.e., Greek-Spanish and Catalan-Greek), but they do not reach significance.

- Please explain the compute time and resources that were invested to conduct the study.

Three paragraphs have been added to the Participants section (lines 320-340).

- For the Wug Test, give multiple examples in the results showing how different models performed and how humans performed. Also add the cases where reported anomalies were seen. Like the failure cases and the unexpected anomalistic cases.

In the Results and analysis section, p. 26, Table 7 has been added, which includes examples across all languages with the target responses as well as with the anomalies and failures by both humans and LLMs.

- The length of the paper appears rather short; more emphasis is given to the literature review, and the presentation can be improved. It is suggested to make a chronological table of all the studies performed on this topic, listing their conclusions/key findings, experimental setup, and datasets used.

The Introduction section was updated with Table 1, including the relevant studies that directly relate to ours.

- Furthermore, the results section needs to be strengthened by showing multiple examples/results.

In the Results section, p. 26, Table 7 has been added, which includes examples across all languages with the target responses as well as with the anomalies and failures by both humans and models.

- Please explain why and how accuracy is selected as the metric of choice. In many cases it is not the correct measure of performance; also add the AUC, recall, sensitivity, F1-score, and precision scores.

The Wug Test taps into the ability to generalize rules to novel words that have not been encountered before. We do not measure precision or recall in this task. In line with previous literature, we coded Accuracy in this linguistic task as target/accurate (1) or non-target/inaccurate (0). AUC-ROC curves or F1 score are used with heavily imbalanced datasets, when traditional metrics like accuracy can be misleading, which is not the case here.

Overall, the manuscript appears to make useful contribution, but further justifications are required.

We thank the Reviewer once again for their very helpful feedback. We hope that we have provided the required justifications, but we remain available to further address any point the Reviewer may deem unclear.

Reviewer #2

The manuscript presents a clear and well-designed study examining how Large Language Models generalize morphological rules across four languages using a multilingual Wug Test. The research question is timely, and the methodology—especially the construction of nonce stimuli, the balanced design, and the use of GLMMs—is appropriate and transparent. Ethical approval, participant recruitment, and data availability are all thoroughly documented.

The results are clearly presented, and the interpretation is reasonable, particularly the conclusion that model performance aligns more with community size and data exposure than with structural complexity.

We thank the Reviewer for their positive assessment.

Some claims, however, would benefit from slightly more cautious wording. The limitations section could also briefly address potential prompt-related biases when interacting with different models.

We agree with the Reviewer. Various claims have been reworked in favor of more cautious wording, and the limitations section in particular has been significantly expanded, also discussing the role of prompt-related biases (p. 30, lines 656-660).

Reviewer #3

Thank you for the opportunity to review this interesting and timely manuscript. The study raises valuable questions; however, several areas would benefit from clarification, refinement, and further detail to strengthen the overall contribution. My detailed comments are as follows:

We thank the Reviewer for their helpful and detailed comments. We address each one of them below.

* Lines 252–254: This paragraph does not appear necessary and could be removed without affecting the clarity or structure of the manuscript.

We agree with the Reviewer. This paragraph has been removed from the revised manuscript.

* Wug test materials: You mention 30 items; however, the file provided (“Humans: Task and stimulus: novel words.xlsx”) shows 15 two-syllable and 15 three-syllable words. It is important to clarify this breakdown in the manuscript and explicitly indicate that the full list is available in the supplementary materials.

In the revised version, both these changes have been implemented (pp. 12-13).

* Selection criteria for nonce words: Please provide more information on how the nonce words were selected. Clarifying the linguistic criteria used would improve methodological transparency.

In the revised version, this information is given on p. 13, lines 289-291.

* Targeted morphological processes: Briefly state which morphological processes were targeted in the Wug test (e.g., inflectional morphology). Explain the motivation behind selecting these particular processes and whether this selection is supported by prior research.

In the revised version, this information has been added on p. 12, lines 263-271.

* Lines 296–297: The exclusion criteria include cognitive, neurological, hearing, or speech-related impairments. Please comment on whether these criteria could have influenced the results. If these factors are not expected to affect outcomes, clarify the rationale for including them.

We did not have any concrete expectations about the results, but these are typical exclusion criteria in studies that aim to establish neurotypical human baselines.

* Mode and medium of testing: The manuscript should specify whether the test was administered in written form, spoken form, or both. If spoken, please indicate whether any recording requirements or controls were implemented for human participants.

This information has been added on p. 13.

* Line 339: The link provided earlier is repeated here; the duplication is unnecessary.

The duplication has been removed.

* Table 3: It is unclear why Catalan was not reported in this table. Please clarify or revise accordingly. Additionally, the inclusion of “(Intercept)” requires explanation, and it would be helpful to comment on this table in the main text.

Table 3 has now become Table 4 and has been updated with the pairwise comparisons of the post-hoc test, showing all the comparisons across all the languages. This was calculated using the emmeans() function with Tukey adjustment.

* Language quality: A careful proofreading pass is needed to address minor issues with capitalization and punctuation.

We thank the reviewer for this recommendation. We have carefully proofread the revised manuscript and corrected minor issues with capitalization and punctuation.

* Line 394: The phrase “errors not observed in models” requires clarification. Does this refer to all errors, or only the specific error types discussed?

We thank the reviewer for this observation. We have revised the sentence to clarify that we were referring only to the specific error types discussed in the preceding sentence, not to all possible errors.

* Lines 401–402: When referencing the full breakdown of error types available in the OSF repository, please specify the file name to guide readers.

In lines 453-455 of the Results section, the file and folder names have been specified.

* Lines 476–477: The statement that “Germanic l

Attachment

Submitted filename: Response to Reviewers.docx

pone.0343164.s002.docx (46.6KB, docx)

Decision Letter 1

Wei Lun Wong

2 Feb 2026

Community size rather than grammatical complexity better predicts Large Language Model accuracy in a novel Wug Test

PONE-D-25-55707R1

Dear Dr. Morosi,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information' link at the top of the page. For questions related to billing, please contact billing support.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible, and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Wei Lun Wong

Academic Editor

PLOS One

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #2: All comments have been addressed

Reviewer #3: All comments have been addressed

Reviewer #5: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #5: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #5: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #5: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #2: (No Response)

Reviewer #3: Yes

Reviewer #5: Yes

**********

Reviewer #2: The author has appropriately implemented all the required revisions. The revised manuscript shows clear improvements and meets the journal's scientific publication standards, with no major remaining concerns.

Reviewer #3: (No Response)

Reviewer #5: Review points properly addressed

• Motivation for comparing community size and grammatical complexity: The authors have added a clear explanation of the motivation in the Introduction (pp. 6-7). The connection between community size, digital representation, and training data availability is now better articulated, strengthening the rationale for the study.

• Clarification of “community size” vs. “resource size” in the Abstract: The abstract has been rephrased to clarify the relationship between community size and training data volume. The revised version is clearer and avoids potential confusion.

• Addition of a model diagram: A clear experimental design diagram (Fig. 1) has been added to the Methodology section, which improves the readability and reproducibility of the study.

• Removal of “language-blind” from the title: The title has been changed to a more precise and less overreaching formulation, which aligns better with the empirical findings.

• Improvement of Figure 2/3 readability: The revised Figure 3 now uses clearer color distinctions and a more intuitive layout (models on the x-axis, languages in separate panels), addressing my earlier concern about visual clarity.

• Additional analysis separating generative and reasoning LLMs: The authors conducted additional analyses comparing reasoning and non-reasoning models, which are now available in the OSF repository. This adds depth to the results and addresses my suggestion for further model-type comparisons.

• Statistical significance testing and post-hoc comparisons: The authors have rerun their statistical models with improved specifications (including participant random effects) and provided post-hoc comparisons in Table 8. The reporting is now more transparent and rigorous.

• Strengthened Introduction and Results sections: The Introduction now includes Table 1 summarizing relevant prior work, and the Results section is more detailed with added tables and examples, addressing my concern about the paper’s initial brevity.

Remaining Minor Suggestions

While the authors have done an excellent job revising the manuscript, a few minor points could still be polished:

• Reference formatting: Although the authors state that references have been corrected, I noticed a few inconsistencies in formatting (e.g., capitalization, use of “et al.”, DOI presentation). A final careful pass to ensure adherence to PLOS ONE style is recommended.

• Clarity in limitations: The limitation regarding the link between community size and training data is well-acknowledged. However, the authors might briefly suggest how future work could better operationalize this relationship (e.g., using corpus size estimates rather than speaker counts).

**********

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Reviewer #2: Yes: Dr. Neamah Dahash Farhan, Professor, University of Baghdad / College of Islamic Sciences, Iraq

Reviewer #3: Yes: Nada AlJamal

Reviewer #5: Yes: EBA TERESA GAROMA

**********

Acceptance letter

Wei Lun Wong

PONE-D-25-55707R1

PLOS One

Dear Dr. Morosi,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS One. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

You will receive further instructions from the production team, including instructions on how to review your proof when it is ready. Please keep in mind that we are working through a large volume of accepted articles, so it may take a few days for us to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Wei Lun Wong

Academic Editor

PLOS One

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.


    Data Availability Statement

    All data files are available from the OSF database (https://osf.io/4z5n6/).
