J. Acoust. Soc. Am. 2024 May 13;155(5):3206–3212. doi: 10.1121/10.0026020

What's special about human speech? A student exercise for comparing speech production between humans and chimpanzees

William P. Shofner
PMCID: PMC11219077  PMID: 38738937

Abstract

Modern humans and chimpanzees share a common ancestor on the phylogenetic tree, yet chimpanzees do not spontaneously produce speech or speech sounds. The lab exercise presented in this paper was developed for undergraduate students in a course entitled “What's Special About Human Speech?” The exercise is based on acoustic analyses of the words “cup” and “papa” as spoken by Viki, a home-raised, speech-trained chimpanzee, as well as the words spoken by a human. The analyses allow students to relate differences in articulation and vocal abilities between Viki and humans to the known anatomical differences in their vocal systems. Anatomical and articulation differences between humans and Viki include (1) potential tongue movements, (2) presence or absence of laryngeal air sacs, (3) presence or absence of vocal membranes, and (4) exhalation vs inhalation during production.

I. INTRODUCTION

Modern humans are the only species known to produce speech for acoustic communication. Chimpanzees share a common ancestor on the phylogenetic tree with modern humans and are the closest living relatives to humans. Despite sharing a common ancestor with humans, chimpanzees do not spontaneously produce speech or speech sounds. Students studying speech science, especially speech production, often learn that there are several anatomical differences between humans and chimpanzees related to speech production. For example, modern humans are unique among primates in having lost the vocal membranes of the larynx during human evolution. The laryngeal vocal membranes are thin flaps of tissue that extend up from the vocal folds and are a common trait across all other primates (Nishimura et al., 2022). The loss of the vocal membrane allows humans to produce more stable, less chaotic speech sounds (Mergell et al., 1999). In addition, modern humans are unique among the great apes in having lost the laryngeal air sacs during their evolution (Fitch, 2000). The laryngeal air sacs are large pouches that extend into the thorax from the larynx and can be inflated with air (Fitch, 2000; de Boer, 2012). The absence of laryngeal air sacs changes the formant structure of vowels and improves vowel discrimination in noise (de Boer, 2012). The laryngeal air sacs may also be another source of chaotic phenomena in primate vocalizations (Fitch et al., 2002). Finally, humans can produce a wide range of phonemes because of a wide range of tongue movements during vocalization. This wide range of tongue movements is due in part to a descended larynx and hyoid bone. The tongue is attached to the hyoid, so the tongue root is descended along with the larynx. This descent of the larynx and tongue root in human evolution allows for front–back tongue movements (i.e., tongue advancement) as well as up–down tongue movements (i.e., tongue height). 
Most mammals are limited to up–down tongue movements (Fitch, 2000). Although the larynx descends in chimpanzees, the hyoid bone does not, indicating that the tongue root does not descend with the larynx (Nishimura et al., 2003); this suggests limited tongue movement in chimpanzees (Ekström and Edlund, 2023).

Viki was a chimpanzee that was home-raised, beginning a few days after birth, by Dr. Keith Hayes and his wife Catherine (Hayes and Hayes, 1951). Viki was raised like a human child but did not spontaneously produce speech or speech sounds. With extensive speech training beginning at 5 weeks of age, Viki was ultimately able to produce the words “mama,” “papa,” and “cup” by the age of 3 yr. The difficulty Viki had learning to produce these three spoken words has been attributed to the lack of language abilities of chimpanzees (Kellogg, 1968). Although the 1951 paper indicates that Viki produced these words, no acoustical analysis was presented. Peterson and Barney (1952) published their now classic paper on the acoustic formant structure of vowels the following year, but a follow-up acoustical study on Viki's speech production does not appear to have been carried out until recently, when Ekström (2023) published wideband spectrograms of Viki's speech.

In 2021, I began teaching a course on the topic “What's Special about Human Speech?” This course takes a comparative approach towards understanding the evolution of speech production and speech perception in humans. As part of this course, Viki is discussed. The story of Viki is often a topic in courses about language development as well. Because an acoustical analysis of Viki's speech was not available when I began teaching the course, I developed a lab exercise that allows students to analyze the acoustic features of Viki's speech and then make comparisons between Viki's speech and human speech. A comparison between the acoustic features of Viki's speech and human speech provides students with important insights into the anatomical differences between humans and chimpanzees related to vocal production. The exercise described could be used in any course on speech science, childhood language, or acoustic phonetics. Upon completion of the lab

  • students will be able to analyze complex sounds, including speech, using digital signal processing software

  • students should understand why human speech is clear compared to speech sounds produced by Viki based on the anatomical differences in the vocal system of humans and chimpanzees.

II. LAB EXERCISE

I found a video of Viki and Dr. Hayes on YouTube, and the audio can be extracted from the video using free online software (see footnote 1). From this extracted audio file, I was then able to isolate and extract Viki's production of the words “cup” and “papa.” Note that Viki did not produce the word “mama” in the video. Students use Praat (Boersma and van Heuven, 2001) to analyze these words. Praat is a free program that can be downloaded online and is used extensively for speech analysis. Students are provided with the .wav files for Viki's “cup” and “papa” along with versions of these words spoken by the author. Alternatively, students could record their own versions of “cup” and “papa” for analysis. Students are also provided with a step-by-step set of instructions for how to use Praat for this analysis.
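The isolation step can also be done outside of Praat. The following is a minimal Python sketch, assuming an uncompressed PCM .wav file, that copies a time window into a new file using only the standard `wave` module. The function name `extract_segment` and the use of Python are assumptions made for illustration; the paper does not state which tool was used to isolate the words.

```python
import wave


def extract_segment(src_path, dst_path, start_s, end_s):
    """Copy the frames between start_s and end_s (in seconds) to a new WAV file.

    Returns the duration, in seconds, of the segment actually written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        n_frames = src.getnframes()
        # Convert times to frame indices and clamp them to the file length.
        start_frame = min(max(int(round(start_s * rate)), 0), n_frames)
        end_frame = min(max(int(round(end_s * rate)), start_frame), n_frames)
        src.setpos(start_frame)
        frames = src.readframes(end_frame - start_frame)

    with wave.open(dst_path, "wb") as dst:
        # Reuse the channel count, sample width, and rate of the source file;
        # the frame count in the header is corrected when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)

    return (end_frame - start_frame) / rate
```

For example, if the vowel window reported in Sec. III A for Viki's “cup” (0.253 to 0.366 s) is measured from the start of SuppPub2.wav, then `extract_segment("SuppPub2.wav", "viki_cup_vowel.wav", 0.253, 0.366)` would write only that vowel to a new file. The output file name here is hypothetical.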

Prior to this exercise, background material is presented in lecture describing how the human vocal tract filters the output of the larynx to produce vowel sounds. Students learn that changes in the placement of the tongue change the shape and length of the vocal tract and thus change the formant structure of the vowels. Moreover, students have also learned about the differences between humans and other non-human mammals regarding the descended larynx and tongue root, vocal membranes, and laryngeal air sacs. By comparing the acoustic features of the words produced by humans and Viki as analyzed from Praat, students can make inferences regarding (1) the movement of the tongue, (2) the presence of vocal membranes, (3) the presence of laryngeal air sacs, and (4) speech production during exhalation and inhalation.

III. RESULTS FROM PRAAT ANALYSIS

A. Vowels

Figure 1(A) illustrates the acoustic waveform for Viki's production of the word “cup.” It appears that there are two formants: formant frequency 1 and formant frequency 2 (F1 and F2) in Viki's production of the vowel /Λ/ [Figs. 1(B) and 1(C)]. The average F1 and F2 frequencies estimated between 0.253 and 0.366 s are 770 and 1310 Hz, respectively [shaded area in Fig. 2(A)]. These formants for Viki's /Λ/ are similar to those produced by human speakers (Fig. 3). Figure 4(A) illustrates the acoustic waveform for the word “cup” as spoken by a human male. In contrast to /Λ/ as produced by Viki, Fig. 4(B) shows multiple formants for the wideband spectrogram of vowel /Λ/ as produced by a human male. The average F1 and F2 frequencies estimated between 0.094 and 0.176 s are 662 and 1188 Hz, respectively [shaded area in Fig. 2(B)]. The differences in F1 and F2 between Viki and the human male presumably reflect a shorter vocal tract length in Viki than in the male human (see Nishimura, 2005). Although the formant structure of the /Λ/ produced by Viki is less clear and less distinct than that for the human /Λ/ (Fig. 2), this acoustic analysis suggests that Viki is capable of producing a low-central vowel like /Λ/ (Ladefoged, 1999) with a formant structure like that of human /Λ/s (Fig. 3). Thus, students should conclude that Viki can make some tongue height movements (i.e., up–down movements of the tongue).
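The formant comparison above can be made concrete with a short calculation. The sketch below uses the average F1 and F2 values reported in the text. As an assumption that goes beyond the paper, it treats the vocal tract as a uniform tube whose resonances scale inversely with its length, so the ratio of formant frequencies gives a rough estimate of relative vocal tract length. The variable and function names are illustrative only.

```python
# Average formant frequencies reported in the text for the vowel in "cup" (Hz).
viki = {"F1": 770.0, "F2": 1310.0}
human = {"F1": 662.0, "F2": 1188.0}


def formant_ratios(a, b):
    """Return the ratio a/b for each formant shared by the two speakers."""
    return {name: a[name] / b[name] for name in a if name in b}


ratios = formant_ratios(viki, human)

# ASSUMPTION (not from the paper): uniform-tube scaling, where formant
# frequency is inversely proportional to vocal tract length.
relative_length = {name: 1.0 / r for name, r in ratios.items()}

for name in sorted(ratios):
    print(f"{name}: Viki/human = {ratios[name]:.3f}, "
          f"implied relative length = {relative_length[name]:.3f}")
```

Under this simplification, Viki's formants are roughly 10% to 16% higher than the human male's, which would correspond to a vocal tract roughly 86% to 91% as long. Because the two formants give different ratios, uniform scaling alone is only a first approximation.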

FIG. 1.

(A) Waveform for Viki's production of “cup.” The phonemes are indicated at the top. (B) Wideband and (C) narrowband spectrograms for Viki's “cup.” The apparent formant frequencies are labeled F1 and F2.

FIG. 2.

(Color online) Formant contours from Praat for the vowel /Λ/ in “cup” as spoken by (A) Viki and (B) a human male. The shaded areas indicate the duration of the vowel.

FIG. 3.

(Color online) Relationship between formant frequency 1 and formant frequency 2 for English vowels. The red shaded cloud indicates the distribution of F1 and F2 for the vowel in “cup.” The red “+” indicates F1 and F2 in Fig. 1 for Viki. The black “×” indicates F1 and F2 for a human male. Adapted with permission from Peterson, G. E., and Barney, H. L., J. Acoust. Soc. Am. 24, 175–184 (1952). Copyright 1952 Acoustical Society of America.

FIG. 4.

(A) Waveform for a human male's production of “cup.” The phonemes are indicated at the top. (B) Wideband and (C) narrowband spectrograms for human “cup.” The apparent formant frequencies are labeled F1, F2, F3, F4, and F5.

There does not appear to be a strong periodicity in the waveform for the vowel /Λ/ as produced by Viki [Fig. 1(A)], and the narrowband spectrogram [Fig. 1(C)] does not show a strong harmonic structure. In contrast, for the human male, there does appear to be a strong periodicity in the waveform for the vowel /Λ/ [Fig. 4(A)], and the narrowband spectrogram [Fig. 4(C)] does show a strong harmonic structure. Figure 5 compares the autocorrelation functions (ACFs) for the envelope of the vowel /Λ/ as produced by Viki and a human male. The envelope was extracted from the isolated vowel following half-wave rectification and low-pass filtering (500 Hz cutoff frequency). There are small, weak peaks in the envelope ACF for Viki's /Λ/, with a weak first peak in the ACF around 0.0053 s. Thus, Viki's /Λ/ appears to be largely a voiceless vowel. In contrast, there are strong, clear peaks in the ACF for the human /Λ/, with a large peak at the fundamental period of 0.00744 s [Fig. 5(B)].
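The envelope analysis described above (half-wave rectification, low-pass filtering at 500 Hz, then autocorrelation) can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the author's implementation: the paper does not specify the filter type or order, so a first-order (one-pole) low-pass filter is assumed here, and all function names are hypothetical.

```python
import math


def half_wave_rectify(x):
    """Keep positive samples and set negative samples to zero."""
    return [v if v > 0.0 else 0.0 for v in x]


def one_pole_lowpass(x, cutoff_hz, rate_hz):
    """First-order low-pass filter (ASSUMPTION: filter order is not given in the paper)."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / rate_hz)
    out, state = [], 0.0
    for v in x:
        state += alpha * (v - state)
        out.append(state)
    return out


def normalized_acf(x, max_lag):
    """Autocorrelation of the mean-removed signal, scaled so that lag 0 equals 1."""
    max_lag = min(max_lag, len(x) - 1)
    mean = sum(x) / len(x)
    c = [v - mean for v in x]
    r0 = sum(v * v for v in c)
    if r0 == 0.0:
        return [0.0] * (max_lag + 1)
    return [sum(c[i] * c[i + lag] for i in range(len(c) - lag)) / r0
            for lag in range(max_lag + 1)]


def envelope_acf(x, rate_hz, cutoff_hz=500.0, max_lag_s=0.02):
    """Envelope extraction followed by autocorrelation, as outlined in the text."""
    envelope = one_pole_lowpass(half_wave_rectify(x), cutoff_hz, rate_hz)
    return normalized_acf(envelope, int(round(max_lag_s * rate_hz)))


def strongest_peak_lag(acf, min_lag):
    """Lag (in samples) of the largest ACF value at or beyond min_lag."""
    return max(range(min_lag, len(acf)), key=lambda k: acf[k])
```

Dividing the returned lag by the sampling rate gives the period in seconds, and its reciprocal gives the corresponding frequency. For reference, the fundamental period of 0.00744 s reported for the human /Λ/ corresponds to about 134 Hz, and the weak peak near 0.0053 s for Viki corresponds to about 189 Hz.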

FIG. 5.

Autocorrelation functions of the extracted envelope for the vowel /Λ/ produced by (A) Viki and (B) a human male. The vertical dashed line indicates the fundamental period. (C) Autocorrelation function for the vowel /Λ/ produced by a human male during inhalation.

The acoustic analysis shows that Viki's /Λ/ has more of a noisy structure, which may be the result of the vibration of the vocal membrane. Note that the formant contours for Viki's “cup” are much more chaotic than the more stable formant contours for the human “cup” (Fig. 2). The vocal membrane, which is found in all nonhuman primates, produces a more chaotic, nonlinear vibration at the output of the larynx (Mergell et al., 1999; Nishimura et al., 2022). During evolution, modern humans have lost the vocal membranes that are found in all other primates, including chimpanzees (Nishimura et al., 2022), resulting in a more stable, periodic output from the larynx.

Figure 6(A) illustrates the acoustic waveform for Viki's production of the word “papa.” The wideband spectrogram in Fig. 6(B) clearly shows that a vowel was not produced by Viki. There is clearly no formant structure corresponding to the vowel /a/; indeed, it appears that no vocalization is present following the consonants. Figure 7 shows the results of the acoustic analysis for the word “papa” as spoken by a human male. For the human /a/, there is a clear formant structure with multiple formant frequencies [Fig. 7(B)] and a clear periodicity [Figs. 7(A) and 7(C)]. The vowel /a/ is a low-back vowel (Ladefoged, 1999). Here, the tongue height position is low and tongue advancement position is towards the pharyngeal cavity (i.e., back). Thus, students should conclude that although Viki can make some up–down tongue movements, she does not appear to be capable of making front–back movements of the tongue. The arrows in Fig. 6 point to a brief periodic sound in Viki's vocalization that has a frequency estimated to be around 285–295 Hz. This frequency may reflect the resonant frequency of Viki's laryngeal air sac (Riede et al., 2008; de Boer, 2012).
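The paper does not say how the frequency of the brief periodic sound was estimated. One simple possibility, offered here only as an assumed approach, is to measure the average spacing between upward zero crossings in the isolated segment. The function name `zero_crossing_frequency` is hypothetical.

```python
import math


def zero_crossing_frequency(samples, rate_hz):
    """Estimate frequency (Hz) from the mean spacing of upward zero crossings.

    Returns None when fewer than two upward crossings are found.
    """
    crossings = []
    for i in range(1, len(samples)):
        a, b = samples[i - 1], samples[i]
        if a < 0.0 <= b:
            # Linear interpolation locates the crossing between the two samples.
            crossings.append((i - 1) + (-a) / (b - a))
    if len(crossings) < 2:
        return None
    mean_period_samples = (crossings[-1] - crossings[0]) / (len(crossings) - 1)
    return rate_hz / mean_period_samples
```

This estimate works best on a short, clean, nearly sinusoidal segment; noise or strong higher harmonics can add spurious crossings. A frequency in the reported 285–295 Hz range corresponds to a period of roughly 3.4 to 3.5 ms.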

FIG. 6.

(A) Waveform and (B) wideband spectrogram of Viki's “papa.” The black arrows indicate a presumptive sound from the laryngeal air sacs.

FIG. 7.

(A) Waveform for a human male's production of “papa.” (B) Wideband spectrogram for human “papa.” The apparent formant frequencies are labeled F1, F2, F3, F4, and F5. (C) Autocorrelation function of the envelope for the first /a/. The vertical dashed line indicates the fundamental period.

B. Consonants

Comparison of Viki's words with those produced by the human male shows differences in the acoustics of the consonants /k/ and /p/. Although the spectrograms indicate that the consonant /k/ in the word “cup” is broadband in both Viki [Fig. 1(B)] and the human male [Fig. 4(B)], there are differences in the acoustic waveforms of these consonants. For example, the consonant /k/ in Viki's “cup” has a duration of approximately 0.034 s and perhaps has more of a click-like waveform [Fig. 1(A)], whereas the human /k/ has a duration of approximately 0.063 s and has more of a noise-burst waveform [Fig. 4(A)]. Likewise, the consonants /p/ in Viki's “cup” and “papa” are more click-like [Figs. 1(A) and 6(A)], whereas the /p/ produced by the human male has more of a noise-burst waveform [Figs. 4(A) and 7(A)].

C. Human imitations of Viki's speech

The above discussion suggests that the differences in the acoustic features of Viki's production of “cup” and “papa” from those of a human speaker may be related to limitations in tongue movement as well as the presence of vocal membranes and laryngeal air sacs in Viki. When one listens to Viki's words during playback, one perceives words that are indistinct and unclear, but nevertheless, the words “cup” and “papa” can be recognized. Humans are capable of understanding speech when the speech is highly degraded. Our recognition of Viki's words may be similar to the perception of degraded words, such as with noise-vocoding, for example. When one listens to Viki's “papa” in the video for instance, one does seem to recognize an indistinct or degraded version of “papa,” and yet no vowel is produced. In the video, the narrator does state that Viki can produce the word “papa.” Our recognition of the word “papa” as produced by Viki may reflect aspects of linguistic priming (Remez et al., 1981) and phonemic restoration (Warren, 1970).

My own perception of the consonant /p/ in Viki's “papa” reminds me of a “lip smacking” sound. Fitch (2010, pp. 300–301) states that “lip smacking” is common in primates and that chimpanzee hoots contain exhalation and inhalation components. I tried to imitate Viki's pronunciation of “cup” and “papa” without much success until I tried producing these words during inhalation rather than exhalation. Figures 8 and 9 show acoustic waveforms and spectrograms for my imitations of Viki's “cup” and “papa” produced during inhalation. Note that the /p/ was produced with “lip smacking” rather than as an exhaled plosive. There is a surprising similarity in waveforms and spectrograms between these imitations and Viki's words. This similarity suggests that although Viki is trying to articulate the words, she could be doing it during inhalation rather than exhalation as humans do. Note the wideband spectrogram for the imitated “cup” shows a formant structure for the /Λ/ [Fig. 8(B)] that is as indistinct as it is in Viki's “cup” [Fig. 1(B)]. Also, the autocorrelation functions for both Viki's /Λ/ in “cup” [Fig. 5(A)] and the imitated /Λ/ in “cup” [Fig. 5(C)] show little periodicity, whereas the autocorrelation function for the human version of /Λ/ in “cup” [Fig. 5(B)] shows strong periodicity. It proved difficult to produce a voiced vowel during inhalation. Thus, students should conclude that Viki's production of these two specific speech sounds (i.e., “cup” and “papa”) may be occurring during inhalation rather than exhalation, as in humans.

FIG. 8.

(A) Waveform and (B) wideband spectrogram for “cup” as imitated by a human male during inhalation.

FIG. 9.

(A) Waveform and (B) wideband spectrogram for “papa” as imitated by a human male during inhalation.

IV. SUMMARY

Students studying speech science and language science generally learn that humans are the only species that produce speech spontaneously. Chimpanzees and humans share a common ancestor on the phylogenetic tree, but chimpanzees do not spontaneously produce speech or speech sounds. The exercise described above provides students with an opportunity to compare speech sounds produced by humans with those produced by Viki, a speech-trained, home-raised chimpanzee. It should be abundantly clear to students that the speech produced by Viki is not the same as human speech. Although the exercise is based only on two words and is by no means exhaustive, it does provide students with an opportunity to infer from their acoustic analyses how the characteristics of the vocal apparatus in chimpanzees can limit clear speech production. Students can thus infer that the indistinct speech produced by Viki reflects (1) a tongue root that is not descended, which limits movements to tongue height, (2) the presence of the vocal membranes and laryngeal air sac, and (3) articulation during inhalation. In contrast, students should infer that the clear speech naturally produced by humans reflects (1) the descended larynx and tongue root in humans, which enables movements in both tongue height and tongue advancement, (2) the loss of the vocal membranes and laryngeal air sac, and (3) articulation during exhalation. Although the indistinct words produced by Viki reflect anatomical differences in the vocal system of humans and chimpanzees, students should also be made aware that differences in the forkhead box protein P2 (FOXP2) gene (see Fitch, 2018) and differences in neural circuits (see Fitch, 2010, pp. 350–352, 2018) as related to speech production also exist between humans and chimpanzees.

Finally, the laboratory exercise developed and presented in this paper focused specifically on speech produced by a chimpanzee and humans, given the proximity of these two species on the phylogenetic tree. However, it should be noted that other animals can produce vocalizations that are similar to human speech, namely, mynah birds, parrots, and a harbor seal (see Fitch, 2010, p. 14). A comparative analysis among these speech sounds could easily be carried out using the step-by-step instructions for Praat, once an instructor has obtained the necessary .wav files.

SUPPLEMENTARY MATERIAL

See the supplementary material for Praat instructions and .wav files of Viki's and human words; SuppPub1.docx (Student Instructions for using Praat); SuppPub2.wav (wav file for Viki's “cup”); SuppPub3.wav (wav file for Viki's “papa”); SuppPub4.wav (wav file for human “cup”); SuppPub5.wav (wav file for human “papa”); SuppPub6.wav (wav file for human imitating Viki's “cup”); SuppPub7.wav (wav file for human imitating Viki's “papa”).

ACKNOWLEDGMENTS

The author would like to thank Dr. Ishanti Gangopadhyay and Dr. Steven Lulich for their comments on the manuscript. The author would also like to thank his students who have played an important role in the development of this lab exercise.

Footnotes

1

The video of Viki and Dr. Hayes can be found on YouTube at https://youtu.be/1-_LsVQb3t0. The audio can be extracted from the video using free online software (http://www.videolan.org/).

Author Declarations

Conflict of Interest

The author has no conflicts to disclose.

Ethics Approval

Human and living animal participants were not used. Chimpanzee speech sounds were obtained from the internet; human speech sounds were spoken and recorded by the author. IRB and IACUC approvals were not required.

DATA AVAILABILITY

The data that support the findings of this study are available within the supplemental material.

References

  1. Boersma, P., and van Heuven, V. (2001). “Praat, a system for doing phonetics by computer,” Glot Int. 5(9/10), 341–345.
  2. de Boer, B. (2012). “Loss of air sacs improved hominin speech abilities,” J. Hum. Evol. 62, 1–6. doi: 10.1016/j.jhevol.2011.07.007
  3. Ekström, A. G. (2023). “Viki's first words: A comparative phonetics case study,” Int. J. Primatol. 44, 249–253. doi: 10.1007/s10764-023-00350-1
  4. Ekström, A. G., and Edlund, J. (2023). “Evolution of the human tongue and emergence of speech biomechanics,” Front. Psychol. 14, 1150778. doi: 10.3389/fpsyg.2023.1150778
  5. Fitch, W. T. (2000). “The evolution of speech: A comparative review,” Trends Cogn. Sci. 4, 258–267. doi: 10.1016/S1364-6613(00)01494-7
  6. Fitch, W. T. (2010). The Evolution of Language (Cambridge University Press, Cambridge, UK).
  7. Fitch, W. T. (2018). “The biology and evolution of speech: A comparative analysis,” Annu. Rev. Linguist. 4, 255–279. doi: 10.1146/annurev-linguistics-011817-045748
  8. Fitch, W. T., Neubauer, J., and Herzel, H. (2002). “Calls out of chaos: The adaptive significance of nonlinear phenomena in mammalian vocal production,” Anim. Behav. 63, 407–418. doi: 10.1006/anbe.2001.1912
  9. Hayes, K. J., and Hayes, C. (1951). “The intellectual development of a home-raised chimpanzee,” Proc. Am. Philos. Soc. 95, 105–109, available at https://www.jstor.org/stable/3143327.
  10. Kellogg, W. N. (1968). “Communication and language in the home-raised chimpanzee: The gestures, ‘words,’ and behavioral signals of home-raised apes are critically examined,” Science 162, 423–427. doi: 10.1126/science.162.3852.423
  11. Ladefoged, P. (1999). “American English,” in Handbook of the International Phonetic Association (Cambridge University Press, Cambridge, UK), pp. 41–44.
  12. Mergell, P., Fitch, W. T., and Herzel, H. (1999). “Modeling the role of nonhuman vocal membranes in phonation,” J. Acoust. Soc. Am. 105, 2020–2028. doi: 10.1121/1.426735
  13. Nishimura, T. (2005). “Developmental changes in the shape of the supralaryngeal vocal tract in chimpanzees,” Am. J. Phys. Anthropol. 126, 193–204. doi: 10.1002/ajpa.20112
  14. Nishimura, T., Mikami, A., Suzuki, J., and Matsuzawa, T. (2003). “Descent of the larynx in chimpanzee infants,” Proc. Natl. Acad. Sci. U.S.A. 100, 6930–6933. doi: 10.1073/pnas.1231107100
  15. Nishimura, T., Tokuda, I. T., Miyachi, S., Dunn, J. C., Herbst, C. T., Ishimura, K., Kaneko, A., Kinoshita, Y., Koda, H., Saers, J. P. P., Imai, H., Matsuda, T., Larsen, O. N., Jürgens, U., Hirabayashi, H., Kojima, S., and Fitch, W. T. (2022). “Evolutionary loss of complexity in human vocal anatomy as an adaptation for speech,” Science 377, 760–763. doi: 10.1126/science.abm1574
  16. Peterson, G. E., and Barney, H. L. (1952). “Control methods used in a study of the vowels,” J. Acoust. Soc. Am. 24, 175–184. doi: 10.1121/1.1906875
  17. Remez, R. E., Rubin, P. E., Pisoni, D. B., and Carroll, T. D. (1981). “Speech perception without traditional speech cues,” Science 212, 947–950. doi: 10.1126/science.7233191
  18. Riede, T., Tokuda, I. T., Munger, J. B., and Thomson, S. L. (2008). “Mammalian laryngeal air sacs add variability to the vocal tract impedance: Physical and computational modeling,” J. Acoust. Soc. Am. 124, 634–647. doi: 10.1121/1.2924125
  19. Warren, R. M. (1970). “Perceptual restoration of missing speech sounds,” Science 167, 392–393. doi: 10.1126/science.167.3917.392
