Proceedings of the National Academy of Sciences of the United States of America. 2023 Apr 17;120(17):e2218367120. doi: 10.1073/pnas.2218367120

Do some languages sound more beautiful than others?

Andrey Anikin, Nikolay Aseyev, Niklas Erben Johansson
PMCID: PMC10151606  PMID: 37068255

Significance

Despite the abiding popular interest, there is hardly any empirical research on whether some languages sound more beautiful than others and whether some phonetic features are universally attractive. We carefully controlled for language familiarity and cultural biases in the first large-scale, cross-cultural comparison of hundreds of languages and did not find any widely shared preferences for specific languages or phonetic features. While some types of human voices may be generally attractive, the languages themselves were surprisingly uniform in terms of their esthetic appeal to the average person in our sample. This initial finding promotes an egalitarian view of extant world languages, demonstrates the feasibility of cross-cultural phonesthetic research, and raises important questions about the role of esthetics in language evolution.

Keywords: language attitudes, phonesthetics, cross-cultural, voice

Abstract

Italian is sexy, German is rough—but how about Páez or Tamil? Are there universal phonesthetic judgments based purely on the sound of a language, or are preferences attributable to language-external factors such as familiarity and cultural stereotypes? We collected 2,125 recordings of 228 languages from 43 language families, including 5 to 11 speakers of each language to control for personal vocal attractiveness, and asked 820 native speakers of English, Chinese, or Semitic languages to indicate how much they liked these languages. We found a strong preference for languages perceived as familiar, even when they were misidentified, a variety of cultural-geographical biases, and a preference for breathy female voices. The scores by English, Chinese, and Semitic speakers were weakly correlated, indicating some cross-cultural concordance in phonesthetic judgments, but overall there was little consensus between raters about which languages sounded more beautiful, and average scores per language remained within ±2% after accounting for confounds related to familiarity and voice quality of individual speakers. None of the tested phonetic features—the presence of specific phonemic classes, the overall size of phonetic repertoire, its typicality and similarity to the listener’s first language—were robust predictors of pleasantness ratings, apart from a possible slight preference for nontonal languages. While population-level phonesthetic preferences may exist, their contribution to perceptual judgments of short speech recordings appears to be minor compared to purely personal preferences, the speaker’s voice quality, and perceived resemblance to other languages culturally branded as beautiful or ugly.


It has long been debated which languages are esthetically pleasing, and why. Phonesthetics—the perception of beauty in spoken language that is independent of meaning—is mentioned already in the Talmud: “Four languages are pleasing for use in the world: Greek for song, Latin for battle, Syriac (Aramaic) for dirges, Hebrew for speech” (1). Popular discussions have continued in more recent times, from Tolkien’s mellifluous Elvish and the Black Speech of Mordor (2) to speculations about the perfect language for singing. In contrast, and until very recently (3, 4), academic research has been all but silent on the topic of phonesthetics. Labeling some languages as intrinsically beautiful and others as ugly is politically incendiary, and there are difficult methodological challenges because the perception of beauty in a language depends on idiosyncratic factors such as previous exposure. Nevertheless, a scientific investigation of phonesthetics is becoming more feasible due to improved accessibility both of recordings from less familiar world languages and of international samples of raters for perceptual studies, and it offers valuable theoretical insights beyond mere claims that language X is more beautiful than Y. Speech is of paramount importance to human societies, and a proper understanding of its esthetic properties will be a crucial addition to the active research on the perceptual features behind the esthetics of visual arts (5) and music (6, 7). Furthermore, an esthetic evaluation of spectrotemporal features in speech, such as specific phonemes or prosodic patterns, may affect their typological prevalence, making phonesthetics a potentially relevant factor in language evolution. We therefore took advantage of a newly available corpus of recordings from hundreds of world languages (https://live.bible.is) to test whether the phonetic structure of some languages makes them universally appealing, and if so, what phonetic features are responsible for this effect.

To test whether phonesthetic preferences exist, it is essential both to understand what makes this a theoretical possibility and to consider what other factors may be involved when a subject in a perceptual experiment reports liking or disliking the sound of a particular language. Starting from the lowest level of auditory perception, speech is only one type of input to a neurological system that evolved to process sound in general. In fact, the human auditory brain has changed very little compared to our primate ancestors (8): if anything, speech itself may be adapted for exploiting the sensitivity of the auditory system (9, 10). When we hear “How do you do?”, the initial stages of processing are the same as for any environmental sound. As a result, the same primitive acoustic features that we enjoy or dislike in environmental sounds should affect our esthetic perception of speech. Unfortunately, Feynman’s pessimistic view that “in understanding why only certain sounds are pleasant to our ear.... [we are] probably no further advanced now than in the time of Pythagoras” (11) still rings true. There are notoriously unpleasant sounds, such as fingernails on a blackboard (12), and the aversiveness of industrial noises is an important practical concern, but the fundamental reasons why these sounds are so disturbing remain mysterious. While it is doubtful that any phones (elementary sounds used in human speech, different from phonemes in that they do not need to distinguish between words) are strongly aversive as acoustic primitives, some classes may well be more pleasing than others. For instance, there are claims that consonants l and m and high vowels are overrepresented in the words that native English speakers consider beautiful (13), while German speakers perceive short vowels, voiceless consonants, and hissing sibilants as affectively negative (14). Word meaning is a hopeless confound when speakers evaluate the “melody” of words in their own language, but the hypothesis that speech sounds lie on a phonesthetic continuum is testable. If so, the presence of specific phones, such as clicks or retroflexes, could make an unfamiliar language more or less pleasant to hear.

As individual phones are strung together, the spectrotemporal complexity of the resulting speech signal may have some esthetically optimal level at which the signal is neither too predictable nor too complex, based on the widely accepted general principle that preferred sensory input should sufficiently activate the corresponding brain areas without being exceedingly difficult to process (6, 7, 15). Just as music performed with some minor irregularities by a human performer is more pleasant than the same piece played impeccably by a machine (7), algorithmically perfect, rule-based speech is not appealing (15), but neither is hard-to-follow speech that is poorly enunciated or masked by noise (16). The implication for phonesthetics is that the overall phonetic complexity of a language could have an inverted-U-shaped relation to its esthetic appeal: for example, the number of vowels and consonants in a language should be high enough to encode semantic information but not so high as to overtax the processing capacity of the auditory system. Alternatively, the limit on the number of phonemes may be set by vocal production rather than perception, or listeners might even fail to notice whether an unfamiliar language is phonetically rich or simple because categorical perception of phonemes is a trained skill (17). The voice itself also has an important esthetic dimension. Voices are more appealing if they sound healthy and sex-typical (18), presumably because we have evolved to look for signs of fitness in the voice, creating some universal standards of auditory beauty analogous to the appeal of population-typical, symmetrical faces and unblemished skin (19–21). Such voice-specific preferences do not translate into any concrete predictions about which phonetic features should be esthetically appealing, but they constitute important confounds in phonesthetic research, necessitating the inclusion of multiple speakers per language or within-speaker comparisons (2), ideally with fully fluent bilingual speakers, who can provide recordings of multiple languages with identical voice quality (4).

Moving from auditory perception to meaning, words and utterances are recognized and processed semantically. The first stage of matching input to template is potentially a major source of esthetic experience because of the so-called mere exposure effect: we like what has become familiar from repeated exposure (22–24). Intrinsically noxious stimuli generally do not become likable with exposure (23), but we may learn to appreciate questionable sounds such as harsh metal music (25). In the case of speech, listeners may recognize the language as a whole, particular words, or perhaps even specific phones or phone combinations such as distinct consonant clusters. In fact, a vague sense of familiarity may be enough to affect preferences as explicit recognition is not a prerequisite of a mere exposure effect (23). We therefore predicted that languages should be judged more beautiful if they phonologically overlap with the listener’s first language and contain typologically common phones (1) due to sounding both more familiar and more prototypical, just as averaged faces and voices are often judged to be more beautiful (21, 26).

Finally, we must consider sociocultural factors that affect the desirability or “prestige” of languages. Just as dialects of the same language are often preferred or disliked (27–29), languages associated with a particular geographical region, country, or social category (e.g., migrants or ethnic minorities) may be perceived as socially more or less desirable, potentially complicating the effect of familiarity. For instance, among the 16 European languages rated by European listeners in ref. 3, the seldom-recognized Icelandic was rated as more likable than the always-recognized German, although both languages are quite close phonetically. A language does not need to be identified correctly for this effect to occur: actual or perceived resemblance to a marked cultural category may suffice.

Considering all these confounds, the optimal design for studying phonesthetics would be to record multiple speakers from a large number of phonetically diverse and completely unrecognizable languages, to be rated by listeners from several linguistic-cultural groups. In our best attempt to approach this design, we obtained recordings of 228 languages from 43 language families (Fig. 1), including between 5 and 11 speakers per language to control for personal vocal attractiveness. These recordings were then rated on pleasantness by speakers of English, Chinese, or Semitic languages. To provide a measure of familiarity, listeners were also asked to indicate if they recognized the language and, if so, in which part of the world it was spoken (SI Appendix, Fig. S1). The study addressed three main questions: 1) How strong are phonesthetic effects—that is, how pronounced are the differences between world languages in their intrinsic esthetic appeal? 2) Do speakers of different languages rank other languages similarly, demonstrating some cross-cultural concordance in phonesthetic preferences? 3) What phonetic characteristics make a language more or less pleasant?

Fig. 1.

Included languages (N = 228) colored by language family (N = 43). See SI Appendix, Table S1 for the full list.

Results

Extralinguistic Confounds.

The average pleasantness score was 12.2% higher (i.e., 12.2 points on a scale of 0 to 100, 95% CI [11.1, 13.3]) in the 14% of trials in which participants explicitly indicated that they recognized a language. In line with this strong and culture-specific (SI Appendix, Fig. S2) familiarity effect, the most beautiful languages according to Chinese speakers were Mandarin, English, and Japanese, whereas speakers of Semitic languages preferred Spanish, English, Italian, and Arabic (SI Appendix, Fig. S3). Any attempt to estimate intrinsic phonesthetic appeal of different languages therefore requires that such obvious confounds be identified and controlled for. Interestingly, languages were misidentified more than half the time (50.3%), but the boost in pleasantness was very similar regardless of whether a language was recognized correctly (12.7% [11.6, 13.9]) or incorrectly (11.6% [10.5, 12.8]). In other words, perceived rather than actual familiarity with a language made it more attractive.

Listeners may not always report familiarity: for example, in ~26% of trials, participants failed to report that English sounded familiar, although they performed the experiment in English. A language may also feel vaguely familiar, but not enough to place it on a map. To account for unreported or semirecognition, we defined “residual familiarity” of each language as the proportion of trials in which it was recognized by a particular group of listeners (English, Chinese, or Semitic) and then statistically controlled for it after excluding trials with explicit recognition. Indeed, residual familiarity remained an important predictor of pleasantness ratings (+16.3% [13.1, 19.4] over the observed range of residual familiarity).
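The definition of residual familiarity given above can be illustrated with a minimal Python sketch. The record layout (keys `group`, `language`, `recognized`), the function names, and the toy trials are assumptions made for illustration only; they are not taken from the authors' materials, and the statistical control itself (done in mixed models) is not reproduced here.

```python
from collections import defaultdict


def residual_familiarity(trials):
    """Proportion of trials in which each language was recognized,
    computed separately for each listener group.

    `trials` is an iterable of dicts with the hypothetical keys
    'group', 'language', and 'recognized' (bool).
    """
    counts = defaultdict(lambda: [0, 0])  # (group, language) -> [recognized, total]
    for t in trials:
        key = (t["group"], t["language"])
        counts[key][0] += int(t["recognized"])
        counts[key][1] += 1
    return {key: rec / total for key, (rec, total) in counts.items()}


def unrecognized_trials(trials):
    """Drop trials with explicit recognition before modeling the ratings."""
    return [t for t in trials if not t["recognized"]]


# Toy trials, invented purely for illustration (not study data)
trials = [
    {"group": "English", "language": "lang_A", "recognized": True},
    {"group": "English", "language": "lang_A", "recognized": False},
    {"group": "English", "language": "lang_A", "recognized": False},
    {"group": "English", "language": "lang_A", "recognized": False},
    {"group": "Chinese", "language": "lang_A", "recognized": False},
]
fam = residual_familiarity(trials)
kept = unrecognized_trials(trials)
```

In this sketch the familiarity rate is computed from all trials, and only afterwards are the explicitly recognized trials removed, so the rate can still serve as a covariate for the remaining ratings.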

There were only negligible differences between world regions when the language was not recognized (SI Appendix, Fig. S4A), suggesting that languages spoken in different parts of the world do not sound intrinsically beautiful or unpleasant, regardless of the listeners’ own first language. English speakers displayed little or no preference for specific perceived regions and rated any familiar-sounding language as more pleasant (SI Appendix, Fig. S4B). In contrast, Chinese speakers preferred the languages that they thought were spoken in North Asia and North America and had a bias against the (supposedly) African languages, whereas speakers of Semitic languages preferred North and South America. In sum, genuine psycholinguistic preferences may be masked both by a general familiarity effect and by culture-specific biases. Such biases cannot be accounted for by residual familiarity because they can be either positive or negative, and therefore, we replicated the analyses below after simply excluding all languages with substantial familiarity. A cutoff of 20% was chosen based on the distribution of reported familiarity rates, which are clearly inflated due to listeners trying to guess blindly (SI Appendix, Fig. S2), resulting in the exclusion of 13% of languages rated by English-speaking listeners, 7% for Chinese raters, and 26% for Semitic raters.
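The exclusion step described above, dropping every language whose reported familiarity exceeds 20% within a listener group, could be sketched as follows. The dictionary layout, function names, and familiarity rates are hypothetical; only the 20% cutoff comes from the text.

```python
def languages_to_keep(familiarity, cutoff=0.20):
    """Per listener group, keep languages whose familiarity rate
    does not exceed the cutoff.

    `familiarity` maps (group, language) -> proportion of trials
    in which the language was reported as recognized.
    """
    kept = {}
    for (group, language), rate in familiarity.items():
        kept.setdefault(group, set())
        if rate <= cutoff:
            kept[group].add(language)
    return kept


def share_excluded(familiarity, cutoff=0.20):
    """Proportion of languages excluded in each listener group."""
    totals, excluded = {}, {}
    for (group, _language), rate in familiarity.items():
        totals[group] = totals.get(group, 0) + 1
        excluded[group] = excluded.get(group, 0) + int(rate > cutoff)
    return {g: excluded[g] / totals[g] for g in totals}


# Invented familiarity rates, for illustration only
familiarity = {
    ("English", "lang_A"): 0.05,
    ("English", "lang_B"): 0.35,
    ("Chinese", "lang_A"): 0.20,
    ("Chinese", "lang_B"): 0.10,
}
kept = languages_to_keep(familiarity)
shares = share_excluded(familiarity)
```

Because the cutoff is applied per group, the same language can be retained for one group of listeners and excluded for another, which is why the excluded shares differ between groups.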

Finally, some speakers’ voices may be intrinsically appealing, whatever the language. There was a statistically uncertain general preference for female voices (+2.1% [−0.7, 4.2]), but no appreciable main effect of listener’s sex (−0.1% [−2.4, 2.1] for female vs male listeners) or clear interaction between the speaker’s and listener’s sex. We also found a mild positive effect of background music (+1.0% [0.2, 1.8]) but not of audio quality (−0.2% [−0.9, 0.5]). In addition, we extracted 19 acoustic characteristics and estimated their effect on pleasantness ratings (SI Appendix, Fig. S5 and Table S3). The most consistent effect was preference for low-pitched and breathy voices within each sex, as well as for lower-pitch variability and spectral novelty. An important caveat is that some of these features may be affected by nonlinguistic content in the recordings, while others may be capturing language-specific phonetic peculiarities: For example, tonal languages would tend to have higher pitch variability. Therefore, we replicated the analyses in the following sections both with (Figs. 2 and 3) and without (SI Appendix, Figs. S6 and S8) statistically controlling for five of the most important acoustic features and either after excluding languages with familiarity over 20% (Figs. 2 and 3) or after controlling for residual familiarity (SI Appendix, Figs. S7 and S9). Trials with explicit recognition of a language were always excluded.

Fig. 2.

Pleasantness scores by speakers of English, Chinese, and Semitic languages are weakly correlated. (A and B) Conditional scores of 228 languages (A) and 43 language families (B) after excluding all languages with familiarity >20% per group and controlling for five robust acoustic predictors of the ratings: cepstral peak prominence, entropy, spectral novelty, pitch, and pitch variability (SI Appendix, Fig. S9 and Table S5). Pearson’s correlations with 95% CI and blue regression lines are calculated pointwise from posterior distributions of centered conditional language scores from two separate mixed models, not merely from the most credible point estimates (solid points = medians of posterior distributions). (C and D) Conditional scores averaged across all three groups of listeners highlight some outliers among languages (C) and families (D). The x-coordinate is only added to reduce clutter. All scores are on a scale of 0 to 100.

Fig. 3.

Phonetic features do not have a consistent effect on pleasantness ratings. All shown predictors are tested simultaneously in one multilevel multiple regression model per listener group after excluding all languages with familiarity over 20% per group and controlling for acoustic predictors. Each point shows the predicted effect of changing one phonetic feature, while holding all other predictors constant, on the pleasantness score of a single recording. Medians of posterior distribution and 95% CIs from four mixed models. To focus on the most robust effects, we grayed out the points with <95% of posterior probability to one side of zero. N (trials/languages) = 16,792/198 for English, 15,928/210 for Chinese, 12,457/164 for Semitic, and 46,928/199 for all groups combined. The phonetic features are explained in SI Appendix, Table S3.

Comparing Language Scores by English, Chinese, and Semitic Speakers.

Conditional language scores, calculated from mixed models after accounting for familiarity and acoustic controls, were weakly, but reliably correlated between English, Chinese, and Semitic raters (Pearson’s r = 0.21 to 0.23; Fig. 2A), suggesting some cross-cultural concordance in preferring specific languages. These correlations were slightly higher when calculated for language families instead of individual languages (Fig. 2B), but overall, the majority of languages and families had fairly similar conditional scores that lay within ±2 to 3% on the rating scale of 0 to 100%, which is a modest difference compared to, for example, the 12% boost due to familiarity. Interestingly, the concordance between groups of listeners increased when we omitted acoustic predictors (SI Appendix, Fig. S6), suggesting that cross-cultural convergence in pleasantness scores can partly be a consequence of preferences for a specific voice quality and manner of speaking, rather than language-specific phonetics.
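The caption of Fig. 2 notes that these correlations are calculated from posterior distributions of language scores rather than from point estimates. One way to picture that idea is the rough Python sketch below: compute Pearson's r once per posterior draw, then summarize the resulting distribution of r values. Pairing draw d of one model with draw d of the other, the function names, and the simulated draws are all assumptions for illustration; the authors' exact procedure may differ.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_over_draws(draws_a, draws_b):
    """draws_a[d][i] is the score of language i in posterior draw d of model A
    (likewise for model B). Returns the median r and a 95% interval."""
    rs = [pearson(a, b) for a, b in zip(draws_a, draws_b)]
    cuts = statistics.quantiles(rs, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
    return statistics.median(rs), (cuts[0], cuts[-1])


# Simulated posterior draws (synthetic, not study data): a weak shared
# signal per language plus independent noise in each model
rng = random.Random(0)
n_languages, n_draws = 30, 200
shared = [rng.gauss(0, 1) for _ in range(n_languages)]
draws_a = [[s + rng.gauss(0, 2) for s in shared] for _ in range(n_draws)]
draws_b = [[s + rng.gauss(0, 2) for s in shared] for _ in range(n_draws)]

median_r, (lo, hi) = correlation_over_draws(draws_a, draws_b)
print(f"r = {median_r:.2f}, 95% interval [{lo:.2f}, {hi:.2f}]")
```

Summarizing r over draws carries the uncertainty of each language's score into the correlation estimate, which a single correlation of posterior medians would hide.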

Although we excluded all languages recognized in over 20% of trials, likely effects of some unaccounted-for familiarity remained, which is easier to see in conditional scores aggregated across English, Chinese, and Semitic raters (Fig. 2 C and D). The high placement of the English-based creole Tok Pisin and the Indo-European family is particularly striking. Likewise, languages from the Uto-Aztecan family may have scored so high because they sounded vaguely familiar due to strong Spanish influences. The effect of familiarity was not always positive: for example, Thai and Yongbei Zhuang from the Tai–Kadai family were rated low by Mandarin speakers, who usually recognized both of these languages, in an alternative model with all languages included (SI Appendix, Fig. S7). Some of the other families with low scores were represented by only one (e.g., Ticuna–Yuri in the Amazon) or two (e.g., Kru in West Africa) languages, so unpleasant voices of individual speakers or other extralinguistic factors could have affected the scores. Generally, however, it seems harder to explain why some languages and families were considered unattractive, and familiarity cannot be the only reason. Avar and Chechen from the Nakh-Daghestanian family in northern Caucasus were hardly ever recognized, yet they received unusually low scores, and so did Karakalpak (spoken in Uzbekistan) from the otherwise above-average Turkic family.

The Effect of Phonetic Features.

The case for universal phonesthetic preferences can become much stronger if we can not only demonstrate that listeners around the world agree about which languages sound more or less beautiful but also pinpoint the phonetic or prosodic features responsible. However, none of the tested phonetic features predicted pleasantness scores in all three groups of listeners (English, Chinese, or Semitic; Fig. 3). This is unlikely to be a consequence of simultaneously considering too many predictors in multiple regression as voice-specific acoustic controls preserved clear effects. The few detected group-specific effects were small and hard to interpret. For instance, speakers of tonal languages in the Chinese group surprisingly rated other tonal languages as 1.5% [0.6, 2.4] less pleasant than nontonal languages. These findings were replicated without controlling for acoustic measures of voice quality to ensure that the effects of phonetic features were not masked by acoustic predictors, except that the negative effect of tonality became more robust for all groups of listeners (SI Appendix, Fig. S8). Notably, the diversity of phonemic repertoire was not associated with pleasantness ratings: The overall number of vowels in a language was a vanishingly weak negative predictor of pleasantness scores (Chinese group −0.4% [−0.8, 0.0], English −0.4% [−0.9, 0.1], Semitic +0.2% [−0.2, 0.6]), and no pronounced effect was found for the number of consonants in any group. Phonemic typicality of a language was a marginal positive predictor only in the Semitic group (+0.6% [−0.1, 1.3]), and we did not find any effect of phonemic similarity between the listener’s first language and the rated language (−0.1% [−0.4, 0.2] overall). Considering the theoretical possibility of U-shaped relationships, we also estimated the effect of quantitative predictors (e.g., the number of vowels or the typicality index) with generalized additive models, but did not find marked nonlinear effects (SI Appendix, Fig. S10).
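The robustness criterion mentioned in the caption of Fig. 3 (points are grayed out when less than 95% of posterior probability lies to one side of zero) can be sketched in a few lines of Python. The function names and the example draws are hypothetical; the sketch assumes that the posterior of each effect is available as a list of sampled values.

```python
def prob_one_side_of_zero(draws):
    """Largest share of posterior draws lying on one side of zero."""
    positive = sum(d > 0 for d in draws) / len(draws)
    negative = sum(d < 0 for d in draws) / len(draws)
    return max(positive, negative)


def is_robust(draws, threshold=0.95):
    """True if at least `threshold` of the posterior mass is on one side of zero."""
    return prob_one_side_of_zero(draws) >= threshold


# Invented posterior draws of two effects, for illustration only
mostly_negative = [-1.0] * 97 + [0.5] * 3   # 97% of draws below zero
undecided = [-1.0] * 60 + [1.0] * 40        # 60% below, 40% above zero

flags = {
    "mostly_negative": is_robust(mostly_negative),
    "undecided": is_robust(undecided),
}
```

An effect can pass this check and still be tiny in magnitude, so the criterion speaks only to the direction of an effect, not to its practical importance.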

Discussion

If there is something intrinsically beautiful about the sound of certain languages, even listeners who are unfamiliar with these languages should reliably rate them as more pleasant. However, once we have accounted for familiarity and preferences for specific voice types, the scores of all unfamiliar languages varied within just a few percentage points. The 228 world languages that we tested, with all their phonetic and prosodic diversity, thus sounded comparably attractive to the average listener in our sample as long as they were not familiar. Of course, individual listeners may have their personal preferences, and we found both positive and negative cultural biases as well as a general preference for languages perceived as familiar, confirming the crucial role of sociolinguistic factors (3, 27–29) and the mere exposure effect (22–24). Beyond that, however, there was little agreement between listeners about what languages or phonetic features they found attractive. The listeners clearly attended to the task and did not answer at random as familiarity and several acoustic features had consistent and strong effects on pleasantness ratings. Thus, if genuine phonesthetic differences between languages exist, they appear to be relatively small at the population level.

The second question addressed by the study was the degree to which phonesthetic preferences are consistent across cultures. Conditional scores of languages and families by English, Chinese, and Semitic groups of listeners, calculated after accounting for familiarity- and speaker-dependent preferences, aligned better than expected by chance, suggesting some cross-cultural agreement on which languages are intrinsically more beautiful. However, the correlation between the scores by English, Chinese, and Semitic raters was low and sensitive to outliers—a handful of languages or families with particularly high or low scores—which means that the apparent cross-cultural concordance in phonesthetic judgments might be caused by indirect familiarity effects such as lexical-phonetic resemblance to widely recognized languages with strong cultural connotations. An intriguing finding was the greater cross-cultural agreement about which languages were particularly unattractive compared to which ones were uncommonly beautiful, as well as a general negative skew in the distribution of average scores of unfamiliar languages. Unless this is an artifact caused by unpleasant voices of individual speakers, it suggests that phonesthetic research might obtain better traction by focusing on the negative pole—that is, on the inherently unpleasant acoustic and phonetic features that languages normally avoid.

Third, we attempted to determine what phonetic characteristics, if any, make some languages beautiful and others unpleasant. In particular, we tested the contribution of several discrete phonetic features, overall phonetic complexity (number of vowels and consonants), phonetic typicality, and the overlap with the listener’s mother tongue. None of these features noticeably affected pleasantness scores, with the possible exception of a slight preference for nontonal languages. In a few cases, there were not enough languages with a particular feature to estimate its effect reliably (e.g., only two languages had clicks), but the effects of most phonetic features were estimated quite precisely and were very close to zero, so we can be confident that they did not strongly affect listeners’ preferences. Phonetic overlap with the listener’s mother tongue was not associated with pleasantness, and speakers of tonal languages expressed no general preference for tonality. This is surprising, considering the strong familiarity effects, and suggests that listeners may recognize words or suprasegmental prosodic patterns rather than individual phonemes or lexical tones. Of particular significance to our theoretical predictions, neither the overall phonetic complexity of a language nor the typicality of its phonetic repertoire predicted pleasantness ratings. The number of phonemes may be an inadequate measure of complexity, and the recordings may not have been sufficiently long to be phonetically representative of each language, so it is possible that phonetic complexity is indeed a relevant factor, but the present study was not powerful enough to detect its effect. Another interesting possibility is that the complexity of all languages may be kept within the optimal zone defined by the need to encode and extract information efficiently. It has been suggested that all languages have similar information-carrying capacity per second, compensating for differences in phonological complexity by the corresponding variation in grammatical complexity or the rate of syllable production (30). However, the amount of variation should not be exaggerated: There are no natural languages with only two or three phonemes, which would use the communication channel inefficiently, or with 10,000, which would be impossible to produce and discriminate reliably. Instead, the languages in our sample had between 14 and 51 phonemes, counting as in (31). Thus, continuous cultural selection may ensure that the size of phonetic repertoire remains reasonably consistent, making all languages comparable in terms of both information-carrying capacity and esthetic appeal.

While the attractiveness of voices, rather than languages, was not the main focus of this study but rather a confound, it was interesting to observe that all acoustic effects were very similar in English, Chinese, and Semitic groups of listeners, confirming that preferences for specific voice types are not culture-specific (18). This has important implications for future studies of phonesthetics: While the attractiveness of individual voices must be accounted for, the relevant acoustic confounds probably do not have to be estimated separately for each tested group of listeners. An important caveat is that acoustic measures may be affected not only by a speaker’s voice quality but also by recording conditions (e.g., background noises and the distance to the microphone) as well as language-specific phonetics (e.g., the number of fricatives) and prosody (e.g., lexical tones), so these effects need to be interpreted with caution. Likewise, cinematic portrayals of gender stereotypes may impact the observed slight preference for female voices. However, we replicated the analyses with and without controlling for acoustic predictors, and the main conclusions remained unchanged with regard to both cross-cultural concordance and the effect of phonetic features.

While this is the largest cross-cultural study on phonesthetic preferences to date, it has a number of important limitations. It will be important to obtain longer and more controlled speech stimuli for future phonesthetic research, enabling more nuanced acoustic and phonetic analyses of the recordings. For instance, we used standard phonetic inventories that described each language in general, but with more standardized recordings, it could be possible to phonetically transcribe the actual rated passages. Other potentially relevant phonetic measures can then be obtained, including specific consonant clusters (1), prosodic features such as speech rate and dynamic range of loudness, more robust measures of voice quality, and improved estimates of phonetic typicality. Other types of speech recordings can also be tested, including unstaged conversations representative of the language heard in everyday life. Likewise, while we did include three linguistically and culturally distinct groups of listeners, they were all recruited and tested on a British online testing platform. Thus, raters in all three groups were literate and exposed to European languages. In future, it will be important to extend phonesthetic experiments to monolingual, non-WEIRD (32) samples more representative of humanity as a whole. Another approach to explore in future studies would be to look for evidence of sound symbolism in words like “beautiful” and “ugly”. If particular phonetic features are universally overrepresented in these words, this could indicate a phonesthetic preference for these features. No such effect has been found so far (31), although there is emerging language-specific evidence that certain phonemes are associated with affective meanings (13, 14). Thus, the likely connection between sound symbolism and phonesthetics ought to be fertile ground for future studies.

Despite these limitations, the present study has made two important contributions. First, we now know that the expected phonesthetic effects are very strongly affected by real or even imagined familiarity, which means that large numbers of diverse languages must be tested. An interesting alternative is to use artificial languages, such as versions of Elvish (2) or meaningless synthetic speech. Second, by setting an upper boundary on population-level esthetic preferences, we have emphasized the fundamental phonetic and esthetic unity of world languages.

Materials and Methods

Stimuli.

Recordings were obtained from the soundtrack of a religious film publicly available in hundreds of languages (https://live.bible.is/jesus-film). This is part of the bible.is project, which is increasingly used for linguistic research due to its unique breadth in terms of the number of included languages and families (e.g., refs. 33 and 34). We included all languages in which the film was available provided that there were several male and female voice actors and the soundtrack was of sufficient quality, using clean dubbing rather than a voice-over with two languages audible. We identified 11 scenes with relatively noise- and music-free monologues by ten actors and the narrator (normally four female and seven male voices). Importantly, these were the same scenes from the same film for all languages; thus, the context, type of speech (neutral narration, two friends conversing, a speaker addressing a crowd, etc.), expressed emotion, and for the most part, the nonlinguistic noises were identical across the compared languages. The scene number was then taken into account when analyzing the data, providing a better-controlled comparison of the sampled languages. The audio was prepared in Audacity (https://www.audacityteam.org/) by trimming long pauses and removing occasional low-frequency noise; all recordings were then normalized for RMS amplitude. A.A. and N.E.J. rated audio quality and the presence of background music, removing ~5% of poor-quality recordings. The final sample consisted of 2,125 recordings from 228 languages (Fig. 1 and SI Appendix, Table S1). Each language was represented by 5 to 11 recordings, and thus up to 11 unique speakers per language. The recordings were 5 to 19 s in duration (mean ± SD = 10.7 ± 3.2 s), so the total duration of audio per language was 55 to 127 s (100 ± 13 s).
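As an illustration of the RMS normalization step, the following minimal Python sketch scales two clips to a common RMS amplitude. The authors performed this step in Audacity; the function names, the target level of 0.1, and the toy sample values are assumptions made here for illustration only.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sequence of audio samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def normalize_rms(samples, target_rms=0.1):
    """Scale samples so their RMS equals target_rms (0.1 is an arbitrary, assumed level)."""
    if not samples:
        return []
    current = rms(samples)
    if current == 0:
        return list(samples)  # a silent clip cannot be rescaled
    gain = target_rms / current
    return [x * gain for x in samples]

# Toy clips (hypothetical values): the second is the first recorded 40 times louder.
quiet = [0.01, -0.02, 0.015, -0.005]
loud = [0.4, -0.8, 0.6, -0.2]

quiet_norm = normalize_rms(quiet)
loud_norm = normalize_rms(loud)
```

After normalization both clips have the same RMS level, so differences in recording gain between language versions should not by themselves make one clip sound louder than another.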

Participants.

We recruited 820 raters (514 women, 298 men, 8 other/unspecified; mean age = 35, range 18 to 77) on the online platform Prolific (https://prolific.co/). With this sample size, each of the 2,125 unique recordings was rated on average 19 times, range [12, 68]. Three target groups were tested separately: native speakers of English (predominantly British English, filtered on Prolific as “first language = English + geographical location = UK”), Chinese (“first language = Chinese/Mandarin/Hakka/Cantonese”), or Semitic languages (“first language = Arabic/Hebrew/Maltese”). These three groups were chosen as they are culturally influential, with different writing systems and profound phonetic differences, yet are well represented in the available pool of participants. We asked each participant to report their first or best language and other languages that they speak fluently. The English group consisted only of those participants who explicitly reported English as their first/best language (SI Appendix, Table S4). We considered anyone who reported being fluent in Chinese or Cantonese as belonging to the Chinese group, including 72 individuals who reported being fluent in a Sinitic language but listed English as their first language. Likewise, anyone who reported fluency in a Semitic language was placed into the Semitic group.
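The grouping rules described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the language labels are taken from the Prolific filters quoted in the text, the function name is hypothetical, and the order in which Sinitic and Semitic fluency are checked is an assumption, since the text does not say how a participant fluent in both would be classified.

```python
# Language labels taken from the Prolific filters quoted in the text.
SINITIC = {"chinese", "mandarin", "hakka", "cantonese"}
SEMITIC = {"arabic", "hebrew", "maltese"}

def assign_group(first_language, fluent_languages):
    """Return 'Chinese', 'Semitic', 'English', or None for one participant.

    Fluency in a Sinitic or Semitic language takes priority over the reported
    first language, as described in the text. Checking Sinitic before Semitic
    is an arbitrary assumption made for this sketch.
    """
    first = first_language.strip().lower()
    spoken = {lang.strip().lower() for lang in fluent_languages} | {first}
    if spoken & SINITIC:
        return "Chinese"
    if spoken & SEMITIC:
        return "Semitic"
    if first == "english":
        return "English"
    return None  # does not belong to any of the three target groups

# A participant listing English as first language but fluent in Mandarin
# is placed in the Chinese group, mirroring the 72 cases mentioned above.
example = assign_group("English", ["Mandarin"])
```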

Procedure.

The main perceptual experiment was written in JavaScript and performed online. The study was exempt from ethical approval in accordance with the Swedish Ethical Review Act (2003:460). Participants were first informed about the nature and goals of the study, provided informed consent by agreeing to terms and conditions, and filled in a questionnaire on linguistic background and general demographics. In each trial, a participant heard a spoken phrase and was asked: “How much do you like the sound of this language?” Responses were given on a horizontal Visual Analog Scale (VAS) marked from Not at all to Very much. If a participant checked the box “I think I recognize this language”, they were asked to identify in which region the language was spoken (SI Appendix, Fig. S10). The recording could be replayed, and there was no time limit for responding. Participants completed 50 trials each, where each trial contained one randomly chosen recording from a randomly chosen language. To validate the perceptual rating scale, we also performed three additional experiments using a subset of 50 recordings from 50 relatively unfamiliar languages (average familiarity <10%), selected at random in inverse proportion to the typicality of their pleasantness ratings so as to include recordings with both high and low average ratings. Three alternative wordings were used: 1) How much do you like the sound of this language? Not at all... Very much (as in the main study); 2) How beautiful do you find this language? Ugly... Beautiful; 3) How beautiful do you find this language—the language itself, not the voice? Ugly... Beautiful. We recruited 20 native speakers of British English on Prolific for each validation experiment, so each sound was rated 20 times on each version of the response scale. The average ratings per recording in these three experiments and the original study correlated with Pearson’s r between 0.84 and 0.92, with an overall Intraclass Correlation Coefficient of 0.73, 95% CI [0.63, 0.82] over the aggregated ratings (SI Appendix, Fig. S11). Thus, listeners understood the question as intended, namely as referring to the intrinsic beauty of a language, regardless of the precise formulation.
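The validation logic (average the ratings per recording, then correlate the averages obtained under two wordings) can be sketched as follows. The recording identifiers and ratings are made up for illustration, the helper names are hypothetical, and the intraclass correlation reported above is not reproduced here.

```python
import math
from collections import defaultdict

def mean_per_recording(trials):
    """Aggregate (recording_id, rating) pairs into {recording_id: mean rating}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for recording_id, rating in trials:
        sums[recording_id] += rating
        counts[recording_id] += 1
    return {rec: sums[rec] / counts[rec] for rec in sums}

def pearson_r(x, y):
    """Pearson correlation between two equally long sequences of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up ratings on a 0-1 scale under two hypothetical question wordings.
wording_like = [("rec1", 0.2), ("rec1", 0.4), ("rec2", 0.6),
                ("rec2", 0.8), ("rec3", 0.5), ("rec3", 0.5)]
wording_beautiful = [("rec1", 0.3), ("rec1", 0.3), ("rec2", 0.7),
                     ("rec2", 0.9), ("rec3", 0.4), ("rec3", 0.6)]

means_a = mean_per_recording(wording_like)
means_b = mean_per_recording(wording_beautiful)
shared = sorted(set(means_a) & set(means_b))  # recordings rated under both wordings
r = pearson_r([means_a[k] for k in shared], [means_b[k] for k in shared])
```

A high correlation between the per-recording averages indicates that the two wordings rank the recordings in much the same way, which is the sense in which the rating scale was validated.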

Acoustic and Phonetic Features.

Each recording was analyzed acoustically with the R package soundgen (35) to extract voice pitch and its variability, breathiness, and other commonly used measures of voice quality (SI Appendix, Table S5). Information about tonality, the number of vowels and consonants, the presence of specific phonemes such as clicks, and so on was obtained from standard references, such as Phoible (36), and a range of phonological descriptions derived from language grammars. We also estimated the phonetic typicality of each language and its similarity to each listener’s first language from full phonetic inventories, which were available for 226 out of 228 languages. Full details are available in the SI Appendix, Supplementary text and Tables S2 and S3.
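To give a concrete sense of what comparing phonetic inventories can involve, the sketch below computes a simple set-overlap score between a listener's first language and two other languages. This is not the measure used in the study, which is defined in the SI Appendix; the Jaccard index is an assumed stand-in, and the toy inventories are hypothetical.

```python
def jaccard(inventory_a, inventory_b):
    """Share of phonemes common to both inventories out of all phonemes in either."""
    a, b = set(inventory_a), set(inventory_b)
    if not a and not b:
        return 1.0  # two empty inventories are treated as identical
    return len(a & b) / len(a | b)

# Hypothetical, heavily simplified inventories (not taken from Phoible).
listener_l1 = {"p", "t", "k", "m", "n", "s", "a", "i", "u"}
language_x = {"p", "t", "k", "m", "n", "s", "a", "i", "u", "e", "o"}
language_y = {"p", "t", "k", "ǃ", "ǀ", "a", "i"}

similarity_x = jaccard(listener_l1, language_x)  # 9 shared out of 11 phonemes
similarity_y = jaccard(listener_l1, language_y)  # 5 shared out of 11 phonemes
```

Under this assumed measure, language_x would count as more similar to the listener's first language than language_y, which contains click consonants absent from the listener's inventory.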

Data Analysis.

Unaggregated responses were analyzed using Bayesian multilevel models fit with the R package brms (37). The outcome variable was a single rating of a recording on a continuous scale (0 to 1), which was modeled with a zero-one-inflated beta distribution (38). Each model predicted the rating in an individual trial as a function of population-level predictors, such as language familiarity, and random or group-level effects such as language, family, clip number (one of 11 scenes in the source film), and subject. We allowed the effect of language to vary across scenes because scene-specific sound effects, such as echo and background noise, were not identical for all languages. Each language and each sound were thus assigned a unique group-level intercept, the effect of language on ratings was assumed to vary across subjects, and the variance of responses (governed by the precision parameter phi) was assumed to vary across participants to account for individual differences in using the response scales. Posterior distributions of model parameters and fitted values were summarized by their medians and 95% credible intervals (CIs). The audio, datasets, and R code for audio manipulation and data analysis are available in online supplements (https://osf.io/nhxkv/).
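For readers unfamiliar with the zero-one-inflated beta distribution, the following Python sketch evaluates its log-density for a single rating. It uses the mean-precision parameterization adopted by brms (mu, phi, zoi, coi) but is not the authors' model code, which is available on OSF; the function name and the example parameter values are assumptions for illustration.

```python
import math

def zoib_logpdf(y, mu, phi, zoi, coi):
    """Log-density of a zero-one-inflated beta distribution at rating y in [0, 1].

    mu  : mean of the beta component (0 < mu < 1)
    phi : precision of the beta component (larger phi means less spread)
    zoi : probability that a rating is exactly 0 or exactly 1
    coi : probability that such an extreme rating is 1 rather than 0
    """
    if not 0.0 <= y <= 1.0:
        raise ValueError("rating must lie in [0, 1]")
    if y == 0.0:
        return math.log(zoi) + math.log(1.0 - coi)
    if y == 1.0:
        return math.log(zoi) + math.log(coi)
    # Convert mean and precision to the usual beta shape parameters.
    a = mu * phi
    b = (1.0 - mu) * phi
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_beta_pdf = (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y) - log_beta_fn
    return math.log(1.0 - zoi) + log_beta_pdf

# Hypothetical parameter values: 10% of ratings at the ends of the scale,
# split evenly between 0 and 1; mu = 0.5 and phi = 2 give a flat beta component.
ll_interior = zoib_logpdf(0.3, mu=0.5, phi=2.0, zoi=0.1, coi=0.5)
ll_zero = zoib_logpdf(0.0, mu=0.5, phi=2.0, zoi=0.1, coi=0.5)
```

The separate treatment of 0 and 1 matters for slider responses because participants can drag the marker to either end of the scale, and an ordinary beta distribution assigns no probability mass to those exact values.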

Supplementary Material

Appendix 01 (PDF)

Acknowledgments

We are grateful to Gabriel Vogel for his comments on the study.

Author contributions

A.A., N.A., and N.E.J. designed research; A.A., N.A., and N.E.J. performed research; A.A. contributed new reagents/analytic tools; A.A. analyzed data; and A.A., N.A., and N.E.J. wrote the paper.

Competing interests

The authors declare no competing interest.

Footnotes

Although PNAS asks authors to adhere to United Nations naming conventions for maps (https://www.un.org/geospatial/mapsgeo), our policy is to publish maps as provided by the authors.

This article is a PNAS Direct Submission.

Data, Materials, and Software Availability

Data (audio, datasets, and R scripts) have been deposited at https://osf.io/nhxkv/ (DOI: 10.17605/OSF.IO/NHXKV) (39).

References

  • 1. Deutscher G., Through the Language Glass: Why the World Looks Different in Other Languages (Metropolitan Books, 2010).
  • 2. Mooshammer C., Hornecker H., Walch M. C., Xia Q., The influence of the mother tongue on the perception of constructed fantasy languages. Phon. Phonol. Im Deutschsprachigen Raum (2022). https://www.linguistik-in-frankfurt.de/pundp/#pll_switcher.
  • 3. Reiterer S. M., Kogan V., Seither-Preisler A., Pesek G., “Foreign language learning motivation: Phonetic chill or Latin lover effect? Does sound structure or social stereotyping drive FLL?” in Psychology of Learning and Motivation (Elsevier, 2020), pp. 165–205.
  • 4. Hilton N. H., Gooskens C., Schüppert A., Tang C., Is Swedish more beautiful than Danish? Matched guise investigations with unknown languages. Nord. J. Linguist. 45, 30–48 (2022).
  • 5. Marin M. M., Lampatz A., Wandl M., Leder H., Berlyne revisited: Evidence for the multifaceted nature of hedonic tone in the appreciation of paintings and music. Front. Hum. Neurosci. 10, 536 (2016).
  • 6. Delplanque J., De Loof E., Janssens C., Verguts T., The sound of beauty: How complexity determines aesthetic preference. Acta Psychol. (Amst.) 192, 146–152 (2019).
  • 7. Brattico P., Brattico E., Vuust P., Global sensory qualities and aesthetic experience in music. Front. Neurosci. 11, 159 (2017).
  • 8. Fitch W. T., The biology and evolution of speech: A comparative analysis. Annu. Rev. Linguist. 4, 255–279 (2018).
  • 9. Smith E. C., Lewicki M. S., Efficient auditory coding. Nature 439, 978 (2006).
  • 10. Theunissen F. E., Elie J. E., Neural processing of natural sounds. Nat. Rev. Neurosci. 15, 355–366 (2014).
  • 11. Feynman R. P., Leighton R. B., Sands M., The Feynman Lectures on Physics, Vol. I: The New Millennium Edition: Mainly Mechanics, Radiation, and Heat (Basic Books, 2011).
  • 12. Halpern D. L., Blake R., Hillenbrand J., Psychoacoustics of a chilling sound. Percept. Psychophys. 39, 77–80 (1986).
  • 13. Crystal D., Phonaesthetically speaking. Engl. Today 11, 8–12 (1995).
  • 14. Aryani A., Conrad M., Schmidtke D., Jacobs A., Why ‘piss’ is ruder than ‘pee’? The role of sound in affective meaning making. PLoS One 13, e0198430 (2018).
  • 15. Schröder M., “Emotional speech synthesis: A review” in Seventh European Conference on Speech Communication and Technology (2001).
  • 16. Dragojevic M., Giles H., I don’t like you because you’re hard to understand: The role of processing fluency in the language attitudes process. Hum. Commun. Res. 42, 396–420 (2016).
  • 17. Liberman A. M., Harris K. S., Hoffman H. S., Griffith B. C., The discrimination of speech sounds within and across phoneme boundaries. J. Exp. Psychol. 54, 358–368 (1957).
  • 18. Pisanski K., Feinberg D. R., “Vocal attractiveness” in The Oxford Handbook of Voice Perception (2019), pp. 606–626.
  • 19. Cunningham M. R., Roberts A. R., Barbee A. P., Druen P. B., Wu C.-H., “Their ideas of beauty are, on the whole, the same as ours”: Consistency and variability in the cross-cultural perception of female physical attractiveness. J. Pers. Soc. Psychol. 68, 261 (1995).
  • 20. Langlois J. H., et al., Maxims or myths of beauty? A meta-analytic and theoretical review. Psychol. Bull. 126, 390 (2000).
  • 21. Rhodes G., The evolutionary psychology of facial beauty. Annu. Rev. Psychol. 57, 199–226 (2006).
  • 22. Zajonc R. B., Attitudinal effects of mere exposure. J. Pers. Soc. Psychol. 9, 1 (1968).
  • 23. Bornstein R. F., Exposure and affect: Overview and meta-analysis of research, 1968–1987. Psychol. Bull. 106, 265 (1989).
  • 24. Montoya R. M., Horton R. S., Vevea J. L., Citkowicz M., Lauber E. A., A re-examination of the mere exposure effect: The influence of repeated exposure on recognition, familiarity, and liking. Psychol. Bull. 143, 459 (2017).
  • 25. Ollivier R., Goupil L., Liuni M., Aucouturier J.-J., Enjoy the violence: Is appreciation for extreme music the result of cognitive control over the threat response system? bioRxiv [Preprint] (2019). 10.1101/510008 (Accessed 14 August 2020).
  • 26. Bruckert L., et al., Vocal attractiveness increases by averaging. Curr. Biol. 20, 116–120 (2010).
  • 27. Edwards J., Refining our understanding of language attitudes. J. Lang. Soc. Psychol. 18, 101–110 (1999).
  • 28. Fridland V., Bartlett K., Correctness, pleasantness, and degree of difference ratings across regions. Am. Speech 81, 358–386 (2006).
  • 29. Cargile A. C., Giles H., Understanding language attitudes: Exploring listener affect and identity. Lang. Commun. 17, 195–217 (1997).
  • 30. Pellegrino F., Coupé C., Marsico E., A cross-language perspective on speech information rate. Language 87, 539–558 (2011).
  • 31. Erben Johansson N., Anikin A., Carling G., Holmer A., The typology of sound symbolism: Defining macro-concepts via their semantic and phonetic features. Linguist. Typology 24, 253–310 (2020).
  • 32. Henrich J., Heine S. J., Norenzayan A., Most people are not WEIRD. Nature 466, 29–29 (2010).
  • 33. Black A. W., “CMU Wilderness multilingual speech dataset” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2019), pp. 5971–5975.
  • 34. Salesky E., et al., A corpus for large-scale phonetic typology. ArXiv [Preprint] (2020). 10.48550/arXiv.2005.13962 (Accessed 14 August 2020).
  • 35. Anikin A., Soundgen: An open-source tool for synthesizing nonverbal vocalizations. Behav. Res. Methods 51, 778–792 (2019).
  • 36. Moran S., McCloy D., PHOIBLE 2.0 (Max Planck Institute for the Science of Human History, Jena, 2019).
  • 37. Bürkner P.-C., brms: An R package for Bayesian multilevel models using Stan. J. Stat. Softw. 80, 1–28 (2017).
  • 38. Ospina R., Ferrari S. L., A general class of zero-or-one inflated beta regression models. Comput. Stat. Data Anal. 56, 1609–1623 (2012).
  • 39. Anikin A., Aseyev N., Erben Johansson N., “Do some languages sound more beautiful than others?”. The Open Science Framework. https://osf.io/nhxkv/ (DOI: 10.17605/OSF.IO/NHXKV). Deposited 9 January 2023.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences
