Journal of Speech, Language, and Hearing Research (JSLHR). 2022 Nov 9;65(12):4679–4689. https://doi.org/10.1044/2022_JSLHR-22-00245

An Acoustic-Phonetic Approach to Effects of Face Masks on Speech Intelligibility

Yunjung Kim, Austin Thompson
PMCID: PMC9934909  PMID: 36351244

Abstract

Purpose:

This study aimed to examine the effects of wearing a face mask on speech acoustics and intelligibility, using an acoustic-phonetic analysis of speech. In addition, the effects of speakers' behavioral modification while wearing a mask were examined.

Method:

Fourteen female adults were asked to read a set of words and sentences under three conditions: (a) conversational, mask-off; (b) conversational, mask-on; and (c) clear, mask-on. Seventy listeners rated speech intelligibility using two methods: orthographic transcription and visual analog scale (VAS). Acoustic measures for vowels included duration, first (F1) and second (F2) formant frequency, and intensity ratio of F1/F2. For consonants, spectral moment coefficients and consonant–vowel (CV) boundary (intensity ratio between consonant and vowel) were measured.

Results:

Face masks had a negative impact on speech intelligibility as measured by both rating methods. However, intelligibility recovered in the clear speech condition for VAS ratings but not for transcription scores. Analysis of the orthographic transcriptions showed that listeners frequently confused word-initial consonants (particularly fricatives, affricates, and stops) but rarely vowels. The acoustic data indicated a significant effect of condition on the CV intensity ratio only.

Conclusions:

Our data demonstrate a negative effect of face masks on speech intelligibility, mainly affecting consonants. However, intelligibility can be enhanced by speaking clearly, likely driven by prosodic alterations.


With the advent of the COVID-19 pandemic, face masks have become part of daily speech communication at work, in grocery stores, and in classrooms, as they are (or have been) mandatory in many public settings. Furthermore, ongoing guidance suggests that face masks will remain in public use to minimize the spread of the virus even after COVID-19 is contained (Rab et al., 2020). Unfortunately, this essential personal protective equipment is perceived as an obstacle to clear and effective oral communication. First, several timely publications have agreed that a mask functions as a low-pass filter. Second, most face masks block visual cues from the speaker's lip and jaw movement. A recent study reported that, despite the poorer acoustic performance of transparent masks compared with nontransparent masks, the presence of visual cues such as lipreading and other facial cues remains important for both verbal and nonverbal communication (Atcherson et al., 2021).

One common finding across studies on the effects of various face masks on speech acoustics is the significant attenuation of high-frequency signals, although the specifics regarding the degree and range of frequencies affected vary slightly across studies. For example, Goldin et al. (2020) reported that speech signals in the frequency range of 2–7 kHz are attenuated by 3–4 dB for a simple medical mask and close to 9–12 dB for a respiratory protective mask (e.g., N95). Atcherson et al. (2020) found a reduction in maximum sound pressure level in all types of masks included in the study (surgical, respiratory protective, and cloth), ranging from 5 dB to 21.2 dB.

Unlike other studies that used a mouth simulator as the talker and a microphone as the listener (Atcherson et al., 2020; Goldin et al., 2020; Rab et al., 2020), Magee et al. (2020) included multiple human speakers and listeners in their experiment. In that study, single-word and sentence stimuli were recorded under four mask conditions (N95, surgical, cloth, and no mask), and four listeners rated speech intelligibility using a transcription method. Despite changes in power spectral distribution (i.e., power was significantly lowered between 3 and 10 kHz), no significant effects of masks were found for single-word or sentence intelligibility.

A more recent study by Yi et al. (2021) compared word identification scores across four factors: mask type, presentation mode, speaking style, and background noise. One hundred twenty sentences recorded by one speaker were transcribed by 26 listeners for intelligibility scores. The results indicated that the combination of face masks and background noise at a signal-to-noise ratio (SNR) of −5 dB negatively impacts intelligibility.

The studies summarized above capture some important limitations in the delivery of acoustic signals between speaker and listener, primarily an attenuation of high-frequency signals that results in decreased speech intelligibility. Despite this general description, our understanding of the phonetic-based sources of perceptual difficulty when speakers wear a mask remains limited. For example, the observation of dampened signals in the 2–7 kHz range does not indicate which speech sounds are vulnerable to mask-induced degradation.

Nguyen et al. (2021) provided acoustic-phonetic data focusing on spectral changes in three corner vowels and four fricative consonants (/f, s, ʃ, v/) extracted from The Rainbow Passage under mask-wearing conditions. The authors found reduced root-mean-square amplitude for the fricatives but no changes in the amplitude of vowel formants. Among the four spectral moment coefficients, only the first moment coefficient (M1) of /f/ showed a significant decrease. These findings hint at varying effects of face masks on acoustic characteristics across speech sounds, which in turn negatively impact their perception. Previous literature on the effects of high-pass and low-pass filtering on speech likewise supports a varying degree of contribution across speech sounds to intelligibility (Pollack, 1948; Winn et al., 2013).

The lack of a sophisticated, phonetic-based understanding of face mask effects makes it difficult to provide the public with practical tips for speaking efficiently with face masks. Currently, general, and largely generic, tips for talking with masks on are provided by several organizations, such as "be loud and clear, consider a face mask with a clear window" (National Institute on Deafness and Other Communication Disorders, 2021) and "talk a little louder/slower, make sure you have your communication partner's attention, use your hands and your body language, face your partner directly" (American Speech-Language-Hearing Association, 2022). Therefore, this study was motivated by the need for an explanatory account of the degraded intelligibility of masked speech, using methods well rooted in the literature on speech intelligibility for talkers with communication disorders (Ansel & Kent, 1992; Kent et al., 1989; Kim et al., 2011). Furthermore, we intended to include a relatively large number of speakers and listeners who represent our communication partners in daily life, given that existing data have been obtained from small numbers of participants (as few as one in Yi et al., 2021) who may not represent typical, real-world speakers (e.g., a mouth simulator: Atcherson et al., 2020; Rab et al., 2020; speech-language pathologists: Nguyen et al., 2021) and that some studies have failed to match the native languages of speakers and listeners.

Specifically, two research questions and corresponding hypotheses were posed. First, are speech acoustics and intelligibility degraded in the mask-on condition compared with the mask-off condition? Second, does the “clear speech” condition enhance speech acoustics and intelligibility? We hypothesized that acoustic contrasts and speech intelligibility ratings would decrease for the mask-on condition. Speech sounds that entail a great amount of high-frequency energy, such as stops and fricatives, would be more affected compared with those that do not (e.g., vowels). In addition, we hypothesized that the clear speech cues (clear, mask-on condition) would enhance speech intelligibility compared with other mask conditions.

Method

The study protocol was approved by the Florida State University Institutional Review Board (FSU-IRB 00001982). Written informed consent was obtained from all participants.

Participants

Seventeen female adult speakers, aged 21–24 years and recruited from an undergraduate course, participated in the study. Three of the 17 speakers were excluded from analyses due to noise in their recordings, leaving 14 speakers. Because of the high proportion of female students in the course, all speakers were female. All speakers were native speakers of American English. Eighty-seven listeners were recruited to provide ratings of speech intelligibility; however, only 70 listeners (68 female, two male), aged 18–40 years (M = 21.95, SD = 4.29), were retained in the final analysis (see the Reliability section for details). All speakers and listeners self-reported no speech, language, or hearing disorders.

Data Collection

The speakers were asked to read a set of words and phrases twice in three conditions: (a) conversational, mask-off; (b) conversational, mask-on; and (c) clear, mask-on. Conversational speech for the mask-on and mask-off conditions was elicited by instructing the speakers to speak as they typically do in everyday conversation. Clear, mask-on speech was elicited by instructing the speakers to speak more clearly to overcome the communication barrier of wearing a mask, as if someone had difficulty understanding them because of the mask. The order of the three speaking conditions was fixed across speakers so that any clear speech adjustments induced by the mask condition would not carry over into the two "conversational" conditions. Furthermore, our experimental instructions explicitly discouraged changes in speech behavior in the conversational, mask-on condition (i.e., speakers were asked to keep their typical, comfortable voice).

Speech recordings were obtained remotely at each speaker's home, following the university's COVID-19 guidelines for research activities. Each participant provided all the experimental devices and materials, including the face mask, computer, and microphone. Participants were asked to wear the type of face mask they used daily: the majority (n = 12) used fabric masks, and two used medical masks. We did not separate participants by mask type, as prior work has reported largely similar effects across mask types.

Speakers were instructed to record with the built-in microphone of their personal laptops using Praat (Boersma & Weenink, 2019), at a sampling rate of 22.1 kHz with 16-bit quantization. Although the participants reported that they were familiar with Praat, having taken or being enrolled in a speech science course at the time of data collection, video instructions and PowerPoint slides were provided to ensure the quality of the recordings.

For the phonetic analysis, a set of single-word stimuli was used. Every target word was part of a minimal pair. The use of minimal pairs allows both intelligibility analysis and direct acoustic analysis at the segmental level and has been widely used for individuals with various speech and hearing disorders (Ansel & Kent, 1992; Chin et al., 2001; Kent et al., 1989). The word stimuli consisted of monosyllabic consonant–vowel–consonant real-word pairs selected from Ansel and Kent (1992). The word pairs contained phonetic contrasts such as front-back vowel (e.g., beat-boot), high-low vowel (e.g., keen-cap), and null-/h/ (e.g., eat-heat). The speakers were also asked to read five sentences, as daily communication mostly occurs in sentences rather than isolated words. Sentence stimuli ranged from four to nine words and have been used in previous studies of sentence intelligibility (Kim et al., 2011; Levy et al., 2017). Table 1 lists the word and sentence stimuli of the study.

Table 1.

Speech stimuli included in the study.

Words
back, bag, bat, beat, book, boot, cab, can, cap, chai, cheap, chip, dot, eat, fat, feet, fit, gap, heat, hit, hot, it, keen, knot, mad, neat, pad, pat, sheep, ship, shy, sigh, sip, tan, teen, ten, tip, tot, vat
Sentences
Find all the crayons.
Three little pink pigs.
Don't splash any water.
Put the high stack of cards on the table.
Combine all the ingredients in a large bowl.

Intelligibility Ratings

Listeners provided orthographic transcriptions for the single-word stimuli and perceptual ratings of intelligibility on a visual analog scale (VAS) for the sentence stimuli. Perceptual data were collected remotely using Qualtrics. Listeners were asked to complete the ratings in a quiet room using their personal computers without headphones and to set the volume at a comfortable level. Headphones were not used in our remotely conducted experiment because their quality could not be controlled and the compliance of crowdsourced listeners cannot be guaranteed (Lansford et al., 2016).

Speech recordings were mixed with multitalker babble using Praat. This method has frequently been used in clear speech and dysarthria research to prevent ceiling effects and increase task difficulty (Bunton, 2006; Smiljanić & Bradlow, 2009; Uchanski, 2005). An SNR of +3 dB was applied to each stimulus across the three conditions, such that the speaker audio was mixed with the multitalker babble audio at 70 dB and 67 dB, respectively. This SNR noise mixing procedure is consistent with prior work examining the effects of masks on speech perception (Toscano & Toscano, 2021).
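The mixing step can be summarized computationally. The following R sketch is illustrative only (the study performed the mixing in Praat), and the vector names speech and babble are hypothetical:

```r
# Illustrative sketch (the study used Praat): scale multitalker babble so
# that the speech-to-babble ratio is +3 dB, then sum the two waveforms.
# `speech` and `babble` are hypothetical numeric vectors of equal length.
rms <- function(x) sqrt(mean(x^2))

mix_at_snr <- function(speech, babble, snr_db = 3) {
  # Choose a gain so that 20 * log10(rms(speech) / rms(gain * babble)) = snr_db
  gain <- rms(speech) / (rms(babble) * 10^(snr_db / 20))
  speech + gain * babble
}
```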

Listeners were randomly assigned to provide intelligibility ratings for three speakers. For the single-word transcriptions, listeners orthographically transcribed all 39 randomized words for each of the three speaking conditions (i.e., 117 words per speaker). A single presentation of each word was allowed. Sentence intelligibility ratings were based on the five sentences produced using each of the three speaking conditions (i.e., a total of 15 sentences per speaker). Listeners were tasked with rating “how clearly [they] understood the speaker” for each recording, using a horizontally oriented continuous scale with endpoints labeled “cannot understand anything” and “understand everything” on the left and right sides of the scale, respectively. The scale ranged from 0 to 100; however, the value was not displayed to the listener, and the VAS did not contain tick marks (Kent & Kim, 2011). The conditions were presented in random order for both intelligibility measures, and the listeners were blind to the condition.

Data Analysis

The data were cleaned and analyzed using R (R Core Team, 2020). The Autoscore package within R was used to derive the transcription accuracy scores for the single-word stimuli (Borrie et al., 2019). The transcription accuracy was used to calculate the percent words correct for each speaker and to analyze perceptual errors.
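To make the scoring step concrete, the sketch below shows a simplified percent-words-correct computation in R. The study itself used the Autoscore package, which applies more flexible matching rules; the data frame and column names here are hypothetical:

```r
# Simplified sketch of percent words correct (Autoscore applies more
# flexible spelling/matching rules than the exact match used here).
# `df` is a hypothetical data frame with columns: speaker, condition,
# target (stimulus word), and response (listener transcription).
library(dplyr)

word_scores <- df %>%
  mutate(correct = tolower(trimws(response)) == tolower(trimws(target))) %>%
  group_by(speaker, condition) %>%
  summarise(pct_correct = 100 * mean(correct), .groups = "drop")
```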

The acoustic analysis included three subsets: vowels, consonants, and segmental boundaries. For each subset, measures were selected based on the literature identifying phonetic contrasts that contribute to speech intelligibility (Ansel & Kent, 1992; Kent et al., 1989). Table 2 summarizes the selected acoustic measures, the target phonetic features/contrasts, and supporting studies. Acoustic measurements were made using TF32 (Milenkovic, 2005) with conventional settings, including combined waveform and wideband (300–400 Hz bandwidth) spectrographic displays.

Table 2.

Acoustic measures, their corresponding phonetic features, and supporting studies.

Acoustic measure | Phonetic feature/contrast | Supporting studies
Vowel
 Duration | Tense vs. lax; low vs. high | Fant (1970); Hillenbrand et al. (1995)
 F1 and F2 | Low vs. high; front vs. back | Stevens (1980); Hillenbrand et al. (1995)
 F1/F2 ratio | F1/F2 ratio | Stevens (1980)
Consonant
 M1, M2, M3, and M4 | Alveolar vs. velar | Kim (2017); McRae et al. (2002); Tjaden and Turner (1997)
Consonant–vowel boundary
 C/V intensity ratio | Integrity of segmental boundary | Fairbanks et al. (1957); Kim (2017); Mattys and Liss (2008)

Note. F1 = first formant frequency; F2 = second formant frequency; C/V = consonant/vowel; M1, M2, M3, and M4 = spectral moment coefficients.
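For readers unfamiliar with the spectral moment coefficients in Table 2, the following R sketch illustrates the standard computation, in which the normalized power spectrum is treated as a probability distribution over frequency. This is an illustration under common conventions, not the TF32 implementation, and the input vectors are hypothetical:

```r
# Illustrative computation of spectral moments M1-M4 (conventions vary;
# M2 is sometimes reported as variance rather than standard deviation).
# `freq` (e.g., in kHz) and `power` are hypothetical vectors from an FFT
# of the consonant segment.
spectral_moments <- function(freq, power) {
  p  <- power / sum(power)                # normalize to unit probability mass
  m1 <- sum(freq * p)                     # M1: spectral mean (centroid)
  m2 <- sqrt(sum((freq - m1)^2 * p))      # M2: spectral standard deviation
  m3 <- sum(((freq - m1) / m2)^3 * p)     # M3: skewness (spectral tilt)
  m4 <- sum(((freq - m1) / m2)^4 * p) - 3 # M4: excess kurtosis (peakedness)
  c(M1 = m1, M2 = m2, M3 = m3, M4 = m4)
}
```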

Verification of Clear Speech

Literature on clear speech frequently reports changes in vocal intensity or speaking rate to verify the clear speech condition (Mefferd, 2017). Because of remote data collection, which did not include an audio calibration procedure and strict control of the distance between the microphone and the speaker's mouth, we did not include vocal intensity measures. Speaking rates were averaged across the five sentences (see Table 1) per speaker in the number of syllables per second (syl/s) and compared between the two conditions (conversational, mask-on; and clear, mask-on). Results of a paired t test indicated that there was a significant difference in speaking rate between conversational, mask-on (M = 4.7 syl/s, SD = 0.4) and clear, mask-on (M = 3.3 syl/s, SD = 0.8) conditions, t(13) = 5.38, p < .001.
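The rate comparison reduces to a paired t test on per-speaker means, as in the following R sketch (the vector names are hypothetical; one mean rate per speaker in each condition):

```r
# Hypothetical length-14 vectors: mean speaking rate (syl/s) per speaker
# in the conversational, mask-on and clear, mask-on conditions.
t.test(rate_conversational_on, rate_clear_on, paired = TRUE)
```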

Statistical Analysis

A series of linear mixed-effects (LME) models was constructed to examine the effects of speaking condition on the target measures. The LME approach was deemed most appropriate given the dependent, nested structure of the data: the target words and sentences formed the first level and were nested within speakers. A total of eleven LME models were constructed for the dependent variables: (a) sentence intelligibility, (b) word intelligibility, (c) vowel duration, (d) first formant frequency (F1), (e) second formant frequency (F2), (f) the F1/F2 ratio of peak intensity, (g–j) the consonant spectral moments (M1–M4), and (k) the consonant/vowel intensity ratio.

Speaking condition was entered into all models as a fixed effect to examine the effects of mask and speaking style on each dependent variable. Speaker (i.e., SpeakerID) was entered as a random effect to account for within- and between-speaker variability. Additional random effects were included when appropriate (e.g., consonant manner [stop vs. fricative] for the consonant-related models; target vowel [/a/, /æ/, /i/, and /u/] for the vowel-related models). The models were fit in R using the lme4 and lmerTest packages (Bates et al., 2015; Kuznetsova et al., 2017). Fixed effects were evaluated at a Bonferroni-corrected alpha level, α = .05/11 = .0045, to account for the increased risk of Type I error when testing multiple models. Finally, conversational, mask-on was coded as the reference speaking condition, so the condition results for the LME models are interpreted relative to the conversational, mask-on condition. Post hoc Tukey's tests were used to compare the conversational, mask-off and clear, mask-on conditions using the lsmeans function in the emmeans package (Lenth et al., 2019).
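As a concrete illustration, one of the vowel models might be specified as in the sketch below. This is a plausible reconstruction from the description above, not the authors' script; the data frame and column names are hypothetical:

```r
library(lme4)
library(lmerTest)  # adds p values for the fixed effects
library(emmeans)

# F1 model: speaking condition as the fixed effect; speaker and target
# vowel as random intercepts. The reference level of `condition` is set
# to "conversational, mask-on" beforehand (e.g., with relevel()).
m_f1 <- lmer(f1 ~ condition + (1 | SpeakerID) + (1 | vowel), data = vowels)
summary(m_f1)  # fixed effects evaluated against alpha = .0045

# Post hoc Tukey-adjusted pairwise comparisons among the three conditions
emmeans(m_f1, pairwise ~ condition, adjust = "tukey")
```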

During model creation, the model assumptions were examined using the performance package in R (Lüdecke et al., 2021). For the sentence intelligibility ratings, the assumption of normality of residuals was violated. This violation is likely due to a minor ceiling effect observed within the VAS ratings where, despite the multitalker babble, some listeners rated speakers to be 100% intelligible. For this reason, an alternative approach was used to model sentence intelligibility. Specifically, a generalized linear mixed-effects model using a gamma distribution with a log link was constructed using the glmer function in the lme4 package.
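A sketch of that alternative model, under the same hypothetical naming assumptions, follows:

```r
# Generalized LME for sentence intelligibility: gamma family, log link.
# VAS ratings must be strictly positive for the gamma distribution, so
# exact-zero ratings (if any) would need handling before fitting.
m_vas <- glmer(vas ~ condition + (1 | SpeakerID),
               data = sentences, family = Gamma(link = "log"))
summary(m_vas)
```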

Reliability

For the listener group, intrarater reliability was established through a two-step process. First, for the sentence stimuli, intrarater reliability was examined by randomly assigning 20% of the sentences to be rated again by the listener. The absolute difference between the listeners' first and second ratings for each sentence was calculated, and the average absolute difference across the three sentences was derived for each listener. Listeners with a mean absolute difference between their first and second ratings greater than 20 were removed from the study. Twelve listeners with poor VAS rating reliability were removed. Second, for the single-word stimuli, intrarater reliability was examined by randomly assigning three words to be transcribed a second time by the listener. Five listeners with poor agreement between their first and second transcription for at least one of the three words were removed from the study, which left 70 listeners in the data analysis.
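The VAS screening step can be expressed compactly, as in this illustrative R sketch (hypothetical data frame and column names):

```r
# Per-listener mean absolute difference between first and second VAS
# ratings of the repeated sentences; listeners above 20 points are dropped.
library(dplyr)

reliable_listeners <- ratings %>%
  group_by(listener) %>%
  summarise(mean_abs_diff = mean(abs(rating1 - rating2)), .groups = "drop") %>%
  filter(mean_abs_diff <= 20) %>%
  pull(listener)
```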

Intermeasurer reliability of acoustic measurements was established by having a second measurer analyze 30% of the data and then examining the correlation between the two sets of measurements. The measures derived from the first and second measurer were strongly correlated, r = .99, indicating high intermeasurer reliability. The mean absolute errors and their standard deviations were the following: vowel durations (M = 9.58, SD = 8.52 in ms), F1 (M = 12.22, SD = 20.55 in Hz), F2 (M = 53.00, SD = 67.22 in Hz), F1/F2 ratio (M = 0.01, SD = 0.01), C/V intensity ratio (M = 0.01, SD = 0.01), and spectral moment coefficients (M1 through M4: M = 0.11, SD = 0.32).

Results

The descriptive statistics for the target measures across the three speaking conditions are displayed in Table 3.

Table 3.

Mean (M) and standard deviation (SD) of each measure across the three conditions.

Variable | Segment | Conversational, mask-off: M (SD) | Conversational, mask-on: M (SD) | Clear, mask-on: M (SD)
Intelligibility | Sentence | 90.73 (14.70) | 85.85 (18.31) | 90.49 (15.73)
Intelligibility | Word | 69.27 (15.90) | 57.86 (19.31) | 66.26 (18.68)
Vowel duration (ms) | /a/ | 238.65 (53.62) | 228.96 (49.83) | 217.30 (64.65)
Vowel duration (ms) | /æ/ | 221.11 (41.65) | 214.34 (38.81) | 208.76 (52.95)
Vowel duration (ms) | /i/ | 186.05 (36.09) | 184.43 (31.41) | 181.50 (44.09)
Vowel duration (ms) | /u/ | 196.38 (35.32) | 197.40 (36.47) | 187.78 (40.93)
F1 (Hz) | /a/ | 831.07 (107.39) | 773.88 (112.51) | 832.46 (117.28)
F1 (Hz) | /æ/ | 942.11 (150.27) | 911.62 (118.42) | 955.68 (175.01)
F1 (Hz) | /i/ | 392.82 (57.82) | 383.57 (64.43) | 395.66 (69.88)
F1 (Hz) | /u/ | 413.61 (39.22) | 426.61 (37.36) | 419.57 (39.08)
F2 (Hz) | /a/ | 1,334.45 (125.68) | 1,235.50 (130.52) | 1,324.48 (307.25)
F2 (Hz) | /æ/ | 1,725.43 (156.35) | 1,684.41 (160.58) | 1,678.68 (195.66)
F2 (Hz) | /i/ | 2,926.95 (195.41) | 2,931.11 (205.40) | 2,933.86 (431.05)
F2 (Hz) | /u/ | 1,672.50 (214.06) | 1,583.54 (302.02) | 1,639.68 (331.30)
F1/F2 intensity ratio | /a/ | 0.89 (0.11) | 0.93 (0.15) | 0.93 (0.16)
F1/F2 intensity ratio | /æ/ | 0.96 (0.32) | 0.97 (0.18) | 0.97 (0.20)
F1/F2 intensity ratio | /i/ | 1.06 (0.16) | 0.95 (0.19) | 0.96 (0.41)
F1/F2 intensity ratio | /u/ | 0.76 (0.17) | 0.74 (0.15) | 0.77 (0.14)
M1 | Fricative | 6.19 (1.73) | 6.06 (1.87) | 6.48 (1.98)
M1 | Stop | 5.68 (1.59) | 5.34 (1.71) | 5.70 (1.66)
M2 | Fricative | 1.26 (0.37) | 1.27 (0.46) | 1.28 (0.43)
M2 | Stop | 1.48 (0.64) | 1.40 (0.63) | 1.54 (0.67)
M3 | Fricative | 0.43 (1.13) | 0.36 (1.17) | 0.19 (1.13)
M3 | Stop | 0.38 (1.39) | 0.42 (1.31) | 0.19 (1.13)
M4 | Fricative | 2.06 (4.77) | 2.17 (3.74) | 2.45 (3.43)
M4 | Stop | 3.72 (8.99) | 4.59 (8.31) | 2.65 (4.26)
C/V intensity ratio | Fricative | 1.96 (0.80) | 1.82 (0.46) | 2.84 (2.47)
C/V intensity ratio | Stop | 1.52 (0.40) | 1.52 (0.26) | 1.84 (0.55)

Note. F1 = first formant frequency; F2 = second formant frequency; C/V = consonant/vowel; M1, M2, M3, and M4 = spectral moment coefficients.

Research Question 1: Effects of Face Masks

The first research question examined whether speech intelligibility (i.e., word and sentence intelligibility) and speech acoustics (i.e., vowel and consonant measures) were degraded by wearing a mask (i.e., conversational, mask-off vs. conversational, mask-on). Both sentence intelligibility ratings, t(605) = 3.382, p < .001, and single-word transcription accuracy, t(605) = 8.490, p < .001, were significantly lower in the conversational, mask-on condition than in the conversational, mask-off condition (top row of Figure 1).

Figure 1. Box plots for the eleven measures of interest. The figure presents z-score transformed measures to visually control for the random effects in the modeling (i.e., speaker, vowel, manner). ** indicates significance at p < .0045; *** indicates p < .001. C/V = consonant/vowel; F1 = first formant frequency; F2 = second formant frequency; VAS = visual analog scale; M1, M2, M3, and M4 = spectral moment coefficients.

From the single-word transcriptions, a descriptive analysis was conducted to examine confusion patterns among sounds. Across all word transcriptions, the percentage of correctly transcribed vowels changed minimally between the three speaking conditions (i.e., 94.51% for conversational, mask-off; 91.52% for conversational, mask-on; and 94.40% for clear, mask-on). In contrast, the transcription accuracy of consonants, especially in the word-initial position, was more sensitive to speaking conditions (i.e., 79.22% for conversational, mask-off; 70.96% for conversational, mask-on; and 76.28% for clear, mask-on). Figure 2 displays the confusion matrices for the initial and final consonantal transcriptions across the three speaking conditions. Within the conversational, mask-on condition, 28.8% of the transcribed words contained class errors in the initial consonant position. Fricatives yielded the largest number of transcription errors (43.36%), followed by affricates (36.22%), nasals (29.74%), and stops (22.61%). The same pattern, but of smaller magnitude, was observed for the conversational, mask-off, and clear, mask-on conditions. Finally, the same pattern was observed for the final consonant position for nasals and stops.
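Confusion matrices of this kind can be tabulated directly from the scored transcriptions. The R sketch below shows one way to do so, with hypothetical column names (target_C1 and response_C1 for the intended and transcribed initial consonants, with "null" marking vowel-initial words):

```r
# Cross-tabulate intended vs. transcribed initial consonants for one
# speaking condition, yielding a confusion matrix like those in Figure 2.
with(subset(df, condition == "conversational, mask-on"),
     table(target = target_C1, transcribed = response_C1))
```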

Figure 2. Confusion matrices for the initial and final consonants across the three speaking conditions. "Null" indicates stimuli with no initial or final consonant, as in it or chai.

Regarding the acoustic measures, none of the vowel- or consonant-related measures differed significantly between the conversational, mask-on and conversational, mask-off conditions (middle and bottom rows of Figure 1). However, trends of decreased F1, F2, and M1 were observed in the conversational, mask-on condition compared with the mask-off condition, although these trends were not significant at the adjusted alpha level.

Research Question 2: Effects of Clear Speech

The second research question examined whether clear speech improved speech acoustics and intelligibility while wearing a mask, to a level comparable to conversational speech without a mask (i.e., conversational, mask-on vs. clear, mask-on; and conversational, mask-off vs. clear, mask-on). For VAS ratings, clear, mask-on speech was rated significantly more intelligible than conversational, mask-on speech, t(605) = 2.981, p = .003. The difference between the conversational, mask-off and clear, mask-on conditions was nonsignificant, suggesting that clear speech overcame the mask-induced intelligibility degradation and was as intelligible as speech produced without a mask.

Single-word transcription scores showed that clear, mask-on speech was significantly more intelligible than conversational, mask-on speech, t(605) = 4.487, p < .001. However, transcription accuracy for the clear, mask-on condition remained significantly lower than for the conversational, mask-off condition, t(589) = 4.011, p < .001. This finding suggests that although clear speech improved single-word intelligibility while wearing a mask, it did not entirely overcome the mask-induced degradation.

Turning to the acoustic analysis, clear, mask-on speech had significantly greater F1 values than the conversational, mask-on condition, t(569) = 3.410, p < .001. The difference between the conversational, mask-off and clear, mask-on conditions was nonsignificant, suggesting that clear speech elicited F1 values comparable to those observed without a mask.

Finally, for consonants, the clear, mask-on condition differed significantly from the other two conditions: both conversational, mask-off, t(480) = −4.917, p < .001, and conversational, mask-on, t(480) = −5.612, p < .001, had smaller C/V intensity ratios than clear, mask-on. This finding suggests that relatively large C/V intensity ratios are a feature of clear speech rather than an effect of the mask itself. No other measures differed significantly between clear, mask-on and the other conditions.

Discussion

Albeit limited, existing data have shown mixed findings on whether face masks reduce perceptual impressions of speech. Based on prior findings and our data, the presence of noise appears to play a role. While Magee et al. (2020) reported no significant effect on word and sentence intelligibility, this study and others that introduced noise to the speech signals (Toscano & Toscano, 2021; Yi et al., 2021) found significant decreases in intelligibility ratings with masks on. When conversation occurs in a quiet environment, a mask wearer may remain as intelligible as when speaking without a mask. However, if the environment includes some degree of noise, including noise from other talkers, speech intelligibility is likely degraded by masks.

Our perceptual error analysis indicated that consonants are more vulnerable than vowels to perceptual confusion in the mask-on condition. This finding may reflect the fact that the primary acoustic cues to vowel identity (i.e., F1 and F2) usually lie below the frequency region significantly attenuated by masks (i.e., roughly above 2–3 kHz). It is consistent with the finding that vowel acoustics did not change significantly in the conversational, mask-on condition. The intensity ratio between F1 and F2 was of particular interest, given the potential effect of mask-induced spectral tilt on the distinctiveness of formant peaks in the vowel spectrum. However, the ratio did not differ significantly across conditions. Similarly, Nguyen et al. (2021) observed no changes in formant amplitude despite a decrease in speech clarity ratings.

Unlike vowels, consonants were frequently confused in the conversational, mask-on condition. For feasibility, only a limited set of acoustic measures was included for consonants; nevertheless, the results point to potential sources of confusion. For example, M1 is considered an important acoustic cue to place of articulation for stops and fricatives (Forrest et al., 1990; Kim, 2017; Tjaden & Turner, 1997). Although not statistically significant, M1 consistently decreased from the conversational, mask-off to the conversational, mask-on condition for both stops and fricatives and then increased again in the clear, mask-on condition. This may reflect greater spectral tilt due to mask-induced attenuation of high frequencies and a flattening of spectral tilt in clear speech, as observed in Lombard speech (Bell et al., 1989; Lu & Cooke, 2009).

When mask wearers adopted the clear speech modification, both intelligibility scores improved significantly. However, while sentence VAS scores returned to the level of the no-mask condition, single-word transcription scores did not. The greater gain in sentence VAS ratings compared with single-word transcription scores hints at the possibility that prosodic modification plays an important role in the intelligibility gains of clear speech. Prosodic changes elicited by clear speech, such as slowed rate and increased mean and variability of fundamental frequency and intensity, have repeatedly been reported to enhance perceptual ratings of speech, including intelligibility and acceptability (Bunton et al., 2001; Dagenais et al., 2006).

Prior studies have documented several acoustic consequences of clear speech, primarily for vowels, such as expanded vowel space area, increased spectral dynamics, and increased duration (Ferguson & Kewley-Port, 2002, 2007; Moon & Lindblom, 1994; Picheny et al., 1986; Wouters & Macon, 2002). The only vowel change observed in this study was increased F1. The scarcity of vowel changes in clear speech may at first appear surprising. One possible explanation is that mask wearers' behavioral strategies for clear speech differ from those used in other situations requiring clear speech (e.g., experimental instruction, loud background noise). Speakers may be aware, consciously or not, that consonants (particularly stops and fricatives) are more perceptually challenging with a mask than vowels, leading to minimal changes in vowel acoustics.

This study had a few limitations. First, because the data were collected remotely, absolute intensity could not be examined, and the sound pressure level was normalized across conditions. Second, our experimental settings, including a microphone and a quiet room, do not mimic daily, real-world conversation environments.

Our findings indicate that face masks degrade the perceptual impression of speech primarily for stops and fricatives. The acoustic data support these findings in that vowel acoustics did not change significantly in the conversational, mask-on condition. Furthermore, speech intelligibility improves when a mask wearer speaks clearly, particularly for sentence-level production.

Data Availability Statement

Because of institutional review board requirements, the speech data generated and analyzed for this study are not publicly available. Data spreadsheets with deidentified participant information may be available from the first author (Yunjung Kim).

Acknowledgments

This work was partly supported by a National Institutes of Health grant (F31 DC020121) awarded to the second author. Part of the data was presented at the American Speech-Language-Hearing Association Convention in November 2021. The authors thank Brianna Russo for her assistance with the acoustic data analyses.


References

1. American Speech-Language-Hearing Association. (2022). Communicating effectively while wearing masks. https://www.asha.org/public/communicating-effectively-while-wearing-masks-and-physical-distancing/
2. Ansel, B. M., & Kent, R. D. (1992). Acoustic-phonetic contrasts and intelligibility in the dysarthria associated with mixed cerebral palsy. Journal of Speech and Hearing Research, 35(2), 296–308. https://doi.org/10.1044/jshr.3502.296
3. Atcherson, S. R., Finley, E. T., McDowell, B. R., & Watson, C. (2020). More speech degradations and considerations in the search for transparent face coverings during the COVID-19 pandemic. American Academy of Audiology.
4. Atcherson, S. R., McDowell, B. R., & Howard, M. P. (2021). Acoustic effects of non-transparent and transparent face coverings. The Journal of the Acoustical Society of America, 149(4), 2249–2254. https://doi.org/10.1121/10.0003962
5. Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01
6. Bell, T. S., Dirks, D. D., & Carterette, E. C. (1989). Interactive factors in consonant confusion patterns. The Journal of the Acoustical Society of America, 85(1), 339–346. https://doi.org/10.1121/1.397685
7. Boersma, P., & Weenink, D. (2019). Praat: Doing phonetics by computer (Version 6.1) [Computer program]. Retrieved April 2021, from http://www.praat.org/
8. Borrie, S. A., Barrett, T. S., & Yoho, S. E. (2019). Autoscore: An open-source automated tool for scoring listener perception of speech. The Journal of the Acoustical Society of America, 145(1), 392–399. https://doi.org/10.1121/1.5087276
9. Bunton, K. (2006). Fundamental frequency as a perceptual cue for vowel identification in speakers with Parkinson's disease. Folia Phoniatrica et Logopaedica, 58(5), 323–339. https://doi.org/10.1159/000094567
10. Bunton, K., Kent, R. D., Kent, J. F., & Duffy, J. R. (2001). The effects of flattening fundamental frequency contours on sentence intelligibility in speakers with dysarthria. Clinical Linguistics & Phonetics, 15(3), 181–193. https://doi.org/10.1080/02699200010003378
11. Chin, S. B., Finnegan, K. R., & Chung, B. A. (2001). Relationships among types of speech intelligibility in pediatric users of cochlear implants. Journal of Communication Disorders, 34(3), 187–205. https://doi.org/10.1016/S0021-9924(00)00048-4
12. Dagenais, P. A., Brown, G. R., & Moore, R. E. (2006). Speech rate effects upon intelligibility and acceptability of dysarthric speech. Clinical Linguistics & Phonetics, 20(2–3), 141–148. https://doi.org/10.1080/02699200400026843
13. Fairbanks, G., Guttman, N., & Miron, M. S. (1957). Effects of time compression upon the comprehension of connected speech. Journal of Speech and Hearing Disorders, 22(1), 10–19. https://doi.org/10.1044/jshd.2201.10
14. Fant, G. (1970). Acoustic theory of speech production (No. 2). Walter de Gruyter.
15. Ferguson, S. H., & Kewley-Port, D. (2002). Vowel intelligibility in clear and conversational speech for normal-hearing and hearing-impaired listeners. The Journal of the Acoustical Society of America, 112(1), 259–271. https://doi.org/10.1121/1.1482078
16. Ferguson, S. H., & Kewley-Port, D. (2007). Talker differences in clear and conversational speech: Acoustic characteristics of vowels. Journal of Speech, Language, and Hearing Research, 50(5), 1241–1255. https://doi.org/10.1044/1092-4388(2007/087)
17. Forrest, K., Weismer, G., Hodge, M., Dinnsen, D. A., & Elbert, M. (1990). Statistical analysis of word-initial /k/ and /t/ produced by normal and phonologically disordered children. Clinical Linguistics & Phonetics, 4(4), 327–340. https://doi.org/10.3109/02699209008985495
18. Goldin, A., Weinstein, B., & Shiman, N. (2020). Speech blocked by surgical masks becomes a more important issue in the era of COVID-19. Hearing Review, 27(5), 8–9. https://hearingreview.com/hearing-loss/health-wellness/how-do-medical-masks-degrade-speech-reception
19. Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics of American English vowels. The Journal of the Acoustical Society of America, 97(5), 3099–3111. https://doi.org/10.1121/1.411872
20. Kent, R. D., & Kim, Y. (2011). The assessment of intelligibility in motor speech disorders. In A. Lowit & R. D. Kent (Eds.), Assessment of motor speech disorders (pp. 21–37). Plural.
21. Kent, R. D., Weismer, G., Kent, J. F., & Rosenbek, J. C. (1989). Toward phonetic intelligibility testing in dysarthria. Journal of Speech and Hearing Disorders, 54(4), 482–499. https://doi.org/10.1044/jshd.5404.482
22. Kim, Y. (2017). Acoustic characteristics of fricatives /s/ and /ʃ/ produced by speakers with Parkinson's disease. Clinical Archives of Communication Disorders, 2(1), 7–14. https://doi.org/10.21849/cacd.2016.00080
23. Kim, Y., Kent, R. D., & Weismer, G. (2011). An acoustic study of the relationships among neurologic disease, dysarthria type, and severity of dysarthria. Journal of Speech, Language, and Hearing Research, 54(2), 417–429. https://doi.org/10.1044/1092-4388(2010/10-0020)
24. Kuznetsova, A., Brockhoff, P. B., & Christensen, R. H. B. (2017). lmerTest package: Tests in linear mixed effects models. Journal of Statistical Software, 82(13), 1–26. https://doi.org/10.18637/jss.v082.i13
25. Lansford, K. L., Borrie, S. A., & Bystricky, L. (2016). Use of crowdsourcing to assess the ecological validity of perceptual-training paradigms in dysarthria. American Journal of Speech-Language Pathology, 25(2), 233–239. https://doi.org/10.1044/2015_AJSLP-15-0059
26. Lenth, R., Singmann, H., Love, J., Buerkner, P., & Herve, M. (2019). emmeans: Estimated marginal means, aka least-squares means (R package Version 1.1) [Computer software].
27. Levy, E. S., Chang, Y. M., Ancelle, J. A., & McAuliffe, M. J. (2017). Acoustic and perceptual consequences of speech cues for children with dysarthria. Journal of Speech, Language, and Hearing Research, 60(6S), 1766–1779. https://doi.org/10.1044/2017_JSLHR-S-16-0274
28. Lu, Y., & Cooke, M. (2009). The contribution of changes in F0 and spectral tilt to increased intelligibility of speech produced in noise. Speech Communication, 51(12), 1253–1262. https://doi.org/10.1016/j.specom.2009.07.002
29. Lüdecke, D., Ben-Shachar, M. S., Patil, I., Waggoner, P., & Makowski, D. (2021). performance: An R package for assessment, comparison and testing of statistical models. Journal of Open Source Software, 6(60). https://doi.org/10.21105/joss.03139
30. Magee, M., Lewis, C., Noffs, G., Reece, H., Chan, J. C., Zaga, C. J., Paynter, C., Birchall, O., Rojas Azocar, S., Ediriweera, A., Kenyon, K., Caverlé, M. W., Schultz, B. G., & Vogel, A. P. (2020). Effects of face masks on acoustic analysis and speech perception: Implications for peri-pandemic protocols. The Journal of the Acoustical Society of America, 148(6), 3562–3568. https://doi.org/10.1121/10.0002873
31. Mattys, S. L., & Liss, J. M. (2008). On building models of spoken-word recognition: When there is as much to learn from natural "oddities" as artificial normality. Perception & Psychophysics, 70(7), 1235–1242. https://doi.org/10.3758/PP.70.7.1235
32. McRae, P. A., Tjaden, K., & Schoonings, B. (2002). Acoustic and perceptual consequences of articulatory rate change in Parkinson disease. Journal of Speech, Language, and Hearing Research, 45(1), 35–50. https://doi.org/10.1044/1092-4388(2002/003)
33. Mefferd, A. S. (2017). Tongue- and jaw-specific contributions to acoustic vowel contrast changes in the diphthong /ai/ in response to slow, loud, and clear speech. Journal of Speech, Language, and Hearing Research, 60(11), 3144–3158. https://doi.org/10.1044/2017_JSLHR-S-17-0114
34. Milenkovic, P. (2005). Time-frequency 32 [Computer software]. University of Wisconsin-Madison.
35. Moon, S. J., & Lindblom, B. (1994). Interaction between duration, context, and speaking style in English stressed vowels. The Journal of the Acoustical Society of America, 96(1), 40–55. https://doi.org/10.1121/1.410492
36. National Institute on Deafness and Other Communication Disorders. (2021). Improving communication when wearing a face covering: Text version. Retrieved March 27, 2022, from https://www.nidcd.nih.gov/about/nidcd-director-message/cloth-face-coverings-and-distancing-pose-communication-challenges-many/improving-communication-when-wearing-face-covering-text-version
37. Nguyen, D. D., McCabe, P., Thomas, D., Purcell, A., Doble, M., Novakovic, D., Chacon, A., & Madill, C. (2021). Acoustic voice characteristics with and without wearing a facemask. Scientific Reports, 11(1), Article 5651. https://doi.org/10.1038/s41598-021-85130-8
38. Picheny, M. A., Durlach, N. I., & Braida, L. D. (1986). Speaking clearly for the hard of hearing II: Acoustic characteristics of clear and conversational speech. Journal of Speech and Hearing Research, 29(4), 434–446. https://doi.org/10.1044/jshr.2904.434
39. Pollack, I. (1948). Effects of high pass and low pass filtering on the intelligibility of speech in noise. The Journal of the Acoustical Society of America, 20(3), 259–266. https://doi.org/10.1121/1.1906369
40. R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
41. Rab, S., Javaid, M., Haleem, A., & Vaishya, R. (2020). Face masks are new normal after COVID-19 pandemic. Diabetes & Metabolic Syndrome: Clinical Research & Reviews, 14(6), 1617–1619. https://doi.org/10.1016/j.dsx.2020.08.021
42. Smiljanić, R., & Bradlow, A. R. (2009). Speaking and hearing clearly: Talker and listener factors in speaking style changes. Language and Linguistics Compass, 3(1), 236–264. https://doi.org/10.1111/j.1749-818X.2008.00112.x
43. Stevens, K. N. (1980). Acoustic correlates of some phonetic categories. The Journal of the Acoustical Society of America, 68(3), 836–842. https://doi.org/10.1121/1.384823
44. Tjaden, K., & Turner, G. S. (1997). Spectral properties of fricatives in amyotrophic lateral sclerosis. Journal of Speech, Language, and Hearing Research, 40(6), 1358–1372. https://doi.org/10.1044/jslhr.4006.1358
45. Toscano, J. C., & Toscano, C. M. (2021). Effects of face masks on speech recognition in multi-talker babble noise. PLOS ONE, 16(2), Article e0246842. https://doi.org/10.1371/journal.pone.0246842
46. Uchanski, R. M. (2005). Clear speech. In D. B. Pisoni & R. E. Remez (Eds.), The handbook of speech perception (pp. 207–235). Wiley. https://doi.org/10.1002/9780470757024.ch9
47. Winn, M. B., Chatterjee, M., & Idsardi, W. J. (2013). Roles of voice onset time and F0 in stop consonant voicing perception: Effects of masking noise and low-pass filtering. Journal of Speech, Language, and Hearing Research, 56(4), 1097–1107. https://doi.org/10.1044/1092-4388(2012/12-0086)
48. Wouters, J., & Macon, M. W. (2002). Effects of prosodic factors on spectral dynamics. I. Analysis. The Journal of the Acoustical Society of America, 111(1), 417–427. https://doi.org/10.1121/1.1428262
49. Yi, H., Pingsterhaus, A., & Song, W. (2021). Effects of wearing face masks while using different speaking styles in noise on speech intelligibility during the COVID-19 pandemic. Frontiers in Psychology, 12. https://doi.org/10.3389/fpsyg.2021.682677


