Abstract
Purpose:
This study examined the relationship between judged speech sound distortions and spectral moment metrics in speakers with Class III malocclusion.
Methods:
A quantitative online survey was distributed to 30 speech specialists (clinicians and/or students) and 100 lay listeners to judge the clarity of the sounds /s/, /ʃ/, /t/ and /k/ using a visual analog scale (VAS) from recordings of 11 Class III (underbite) Dentofacial Disharmony (DFD) patients and eight Class I controls. Patients and controls were grouped according to high, moderate, and low /s/-/ʃ/ first spectral moment differences. A linear mixed model was used to analyze the data.
Results:
VAS ratings increased as a function of decreasing spectral contrast for both groups of listeners. VAS ratings of speech specialists were more homogeneous than those of lay listeners, and speech specialists rated distortions as less severe than lay listeners did.
Conclusions:
Recordings of Class III DFD patients with low /s/-/ʃ/ first spectral moment differences received higher VAS ratings from listeners, indicative of more pronounced perceived speech-sound distortions. Spectral moment analysis appears to be a promising approach for characterizing the speech of patients with DFD and other craniofacial disorders.
Introduction
Patients with Dentofacial Disharmonies (DFD) demonstrate significant malocclusions due to skeletal jaw disproportions (Proffit et al., 2003). DFD presentations include underbite (Class III), severe overjet (e.g., “overbite,” Class II), and open bite with one or both jaws being too large, too small, or abnormally positioned for proper jaw and tooth relationships (Figure 1) (Proffit et al., 2019). Patients with DFD require orthodontics and orthognathic jaw surgery for full correction (Proffit & White, 1990; Proffit et al., 2003). Orthognathic jaw surgery is an involved procedure to reposition the mandible, maxilla, and/or chin (Khechoyan, 2013; Proffit et al., 2003). Following recommendations from their orthodontist, patients may choose to pursue jaw surgery to improve facial and dental esthetics, enhance their ability to chew, resolve joint pain, correct their dental relationships, address speech concerns, and improve quality of life (Khechoyan, 2013; Proffit et al., 2003). Proffit and White (1990) reported that a quarter of DFD surgery candidates ranked speech concerns as a chief complaint, with 80-90% of DFD patients demonstrating perceptual speech distortions, compared to 3.5-5% of the general population (Keyser et al., 2022; Lathrop-Marshall et al., 2021; Proffit & White, 1990).
Figure 1. Sagittal occlusal schematics of craniofacial structures.
(A) Schematic of Class I anatomy. (B) Schematic of Class III anatomy with maxillary deficiency (retrognathism) and mandibular excess (prognathism). Class III jaw relationships can occur with a maxillary deficiency, mandibular excess, or relative contributions of both jaws. Anterior space is typically reduced, with decreased or negative overjet (OJ) as the maxillary incisors are posterior to the mandibular incisors (i.e., an underbite). Class III patients can also present with proclined (i.e., leaning out labially) maxillary incisors, retroclined (i.e., leaning back lingually) mandibular incisors, condylar hyperplasia, anterior positioning of the condyle, a short anterior cranial base length, acute cranial base angle, an obtuse gonial angle, and an excessive lower anterior face height (if high mandibular plane angle). Labels: U1 = upper 1, L1 = lower 1, UL = upper lip, LL = lower lip, SP = soft palate (or velum), HP = hard palate, TT = tongue tip, Mx = maxilla, and Md = mandible. OJ is the extent of horizontal (anterior–posterior) overlap of the maxillary central incisors over the mandibular central incisors.
DFD patients with skeletal Class III malocclusions demonstrate changes in their tongue position, influencing articulation and resulting in a lisp (Starr, 2013). The sound most likely affected in severe Class III patients is /s/, contributing to a high incidence of lisping (Görgülü et al., 2011; Guay et al., 1978; Starr, 2013). Vallino and Tompson (1993) described various types of lisping, including interdental, dental, and lateral. The interdental and dental lisps can also be classified as a frontal lisp, where the tongue tip is too far forward; a lateral lisp occurs when air is sent over the sides of the tongue rather than centrally, creating a slurry or slushy sound (Vallino & Tompson, 1993). This can occur when the tongue does not form a seal along the lateral mandibular teeth. Vallino and Tompson (1993) also point out that some individuals with malocclusion may exhibit visual distortions – that is, the tongue is seen touching the teeth – but have no auditory-perceptual distortions.
Most clinicians will recommend speech therapy for a lisp associated with severe malocclusion only after orthognathic surgery if the lisp persists. Previous studies on the effects of surgical correction on speech used perceptual evaluations by speech-language pathologists (SLPs) (Ruscello et al., 1986). However, the perceptual approach is qualitative and introduces subjectivity, leading to limited interjudge and intrajudge reliability of articulation characteristics, even by experienced SLPs (Tanner et al., 2005). To objectively evaluate speech, our team aimed to validate the use of an established quantitative approach for measuring voiceless consonant speech distortions in our DFD population, known as spectral moment analysis (SMA). SMA is a quantitative method for statistically measuring and describing speech for voiceless stop consonants and sibilant fricatives through four spectral moments (Forrest et al., 1988). The four moments include mean frequency (M1, centroid or center of gravity), energy spread or variance (M2), skewness (M3), and kurtosis (M4) of the sound energy (Jiang et al., 2016). This method has been recently used to quantitatively describe speech characteristics in children with repaired cleft lip/palate and conductive hearing loss (Zajac et al., 2021) and in adolescents and adults with DFD (Lathrop-Marshall et al., 2021). These studies indicate that SMA could provide objective indicators of speech change with treatment, especially for evaluating speech in DFD patients following surgical correction.
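The four spectral moments described above can be illustrated with a minimal sketch that treats a normalized power spectrum as a probability distribution over frequency. This is not the TF32 implementation; the function name `spectral_moments` is hypothetical, M2 is returned as a standard deviation in kHz, and M4 is returned as excess kurtosis (kurtosis minus 3). Both conventions are assumptions, since the text does not state which the software uses.

```python
import math


def spectral_moments(freqs_khz, power):
    """Return (M1, M2, M3, M4) for one analysis window.

    freqs_khz: center frequency of each spectral bin, in kHz.
    power:     non-negative spectral power in each bin.
    """
    total = sum(power)
    if total <= 0:
        raise ValueError("spectrum contains no energy")
    # Normalize power so the spectrum behaves like a probability distribution.
    weights = [p / total for p in power]

    # M1: mean frequency (centroid, center of gravity).
    m1 = sum(f * w for f, w in zip(freqs_khz, weights))
    variance = sum((f - m1) ** 2 * w for f, w in zip(freqs_khz, weights))
    if variance == 0:
        raise ValueError("all energy is in one bin; M3 and M4 are undefined")
    # M2: energy spread, reported here as a standard deviation (assumption).
    m2 = math.sqrt(variance)
    # M3: skewness of the energy distribution.
    m3 = sum((f - m1) ** 3 * w for f, w in zip(freqs_khz, weights)) / m2 ** 3
    # M4: kurtosis, reported here as excess kurtosis (assumption).
    m4 = sum((f - m1) ** 4 * w for f, w in zip(freqs_khz, weights)) / m2 ** 4 - 3
    return m1, m2, m3, m4


# Toy symmetric spectrum (hypothetical values, not study data).
m1, m2, m3, m4 = spectral_moments([4.0, 5.0, 6.0], [1.0, 2.0, 1.0])
```

For a symmetric toy spectrum such as this one, M1 falls on the central bin and M3 is zero; a real fricative spectrum would have many more bins.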
Our objective was to apply SMA to recordings of selected DFD patients and evaluate the quantitative output relative to listeners’ perception of speech distortions. We focused on M1, specifically the M1 difference between /s/-/ʃ/, in this study given that this spectral distinction has been shown to be sensitive to place of articulation of alveolar sounds in children with repaired cleft palate (Zajac et al., 2021). We hypothesized that listeners’ perceptions of speech sound distortions would increase as speakers’ /s/-/ʃ/ spectral distinctions decreased. Speech specialists have training in perception of speech distortions, yet DFD patients interact primarily with lay people in their day-to-day life. As a result, we evaluated the quantitative output of SMA relative to both specialists’ and lay listeners’ perceptions of articulation. (Supplemental Table 1)
Methods
Speakers
Nineteen speakers participated in this study (Table 1). There were 11 speakers with Class III (underbite) and 8 speakers with Class I who served as controls. Speakers with Class III ranged in age from 15 to 34 years (mean=21.5); there were 7 males and 4 females. Control speakers ranged in age from 19 to 34 years (mean=24.5); there were 8 females. Controls had no history of speech-language disorders or hearing loss and were judged by an experienced SLP (the 6th author) to have normal articulation. All speakers were part of a larger study investigating spectral characteristics of DFD patients (Keyser et al., 2022; Lathrop-Marshall et al., 2021). The speakers for the current study were selected to represent distinct ranges of /s/-/ʃ/ spectral distinction as described below. Because of this, no attempt was made to balance sex of the speakers.
Table 1.
Class III Dentofacial Disharmony (DFD) and Class I control speakers’ demographics and spectral characteristics
| Group | DFD ID | M/F^ | Age | Race* | Ethnicity** | M1 /s/ | M1 /ʃ/ | M1 /s/-/ʃ/ | M2 /s/ | M2 /ʃ/ | M3 /s/ | M3 /ʃ/ | M4 /s/ | M4 /ʃ/ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Class III-Low Spectral Distinction for /s/-/ʃ/ (0.0-1.99 kHz) | DFD_124 | M | 15 | C | NH | 6.54 | 5.12 | 1.43 | 1.96 | 3.07 | 1.77 | 1.31 | 4.40 | 1.46 |
| | DFD_144 | M | 15 | C | H | 8.30 | 7.31 | 1.00 | 2.55 | 2.91 | 1.07 | 0.77 | 2.40 | 1.80 |
| | DFD_190 | M | 17 | AA | NH | 8.00 | 7.49 | 0.51 | 3.37 | 3.92 | 0.49 | 0.78 | 0.09 | 0.71 |
| | DFD_201 | M | 23 | AA | NH | 9.11 | 7.16 | 1.95 | 2.38 | 3.29 | 0.18 | 0.18 | 1.86 | 2.58 |
| Mean | NA | NA | 17.5 | NA | NA | 7.99 | 6.7 | 1.22 | 2.05 | 2.64 | 0.70 | 0.61 | 1.75 | 1.31 |
| Range | NA | NA | 15-23 | NA | NA | 2.57 | 2.37 | 1.44 | 1.40 | 1.01 | 1.59 | 1.13 | 4.31 | 1.87 |
| Class III-Moderate Spectral Distinction for /s/-/ʃ/ (2.0-3.99 kHz) | DFD_123 | M | 16 | C | NH | 8.40 | 5.25 | 3.15 | 2.48 | 2.66 | 1.05 | 1.32 | 1.89 | 1.95 |
| | DFD_159 | M | 15 | C | H | 9.56 | 6.26 | 3.31 | 2.64 | 2.76 | 0.58 | 1.22 | 1.36 | 0.96 |
| | DFD_167 | M | 19 | AA | NH | 9.30 | 5.95 | 3.35 | 1.77 | 2.14 | 1.00 | 1.24 | 4.01 | 2.81 |
| Mean | NA | NA | 16.7 | NA | NA | 9.09 | 5.82 | 3.27 | 2.30 | 2.52 | 0.87 | 1.26 | 2.42 | 1.91 |
| Range | NA | NA | 15-19 | NA | NA | 1.16 | 1.00 | 0.20 | 0.87 | 0.62 | 0.47 | 0.10 | 2.65 | 1.86 |
| Class III-High Spectral Distinction for /s/-/ʃ/ (4.0-6.7 kHz) | DFD_108 | F | 28 | AA | NH | 10.85 | 5.20 | 5.64 | 1.34 | 2.14 | 0.22 | 2.49 | 8.02 | 8.99 |
| | DFD_145 | F | 34 | AA | NH | 12.63 | 6.02 | 6.61 | 1.69 | 2.27 | 0.54 | 1.79 | 4.65 | 5.01 |
| | DFD_171 | F | 31 | C | NH | 10.85 | 6.28 | 4.57 | 2.21 | 2.44 | 0.87 | 1.07 | 1.63 | 2.15 |
| | DFD_250 | F | 23 | AA | NH | 11.57 | 4.99 | 6.58 | 2.27 | 2.11 | −0.094 | 1.68 | 0.77 | 4.15 |
| Mean | NA | NA | 29 | NA | NA | 11.47 | 5.62 | 5.85 | 1.88 | 2.24 | 0.38 | 1.76 | 3.77 | 5.07 |
| Range | NA | NA | 23-34 | NA | NA | 1.78 | 1.29 | 2.04 | 0.93 | 0.32 | 0.97 | 1.42 | 7.25 | 6.84 |
| Overall Class III Mean | NA | NA | 21.5 | NA | NA | 9.56 | 6.09 | 3.46 | 2.29 | 2.7 | 0.70 | 1.26 | 2.83 | 2.96 |
| Control-Spectral Distinction for /s/-/ʃ/ (4.0-6.7 kHz) | DFD_099 | F | 33 | C* | NH | 10.06 | 5.62 | 4.45 | 1.31 | 1.90 | 1.00 | 1.39 | 6.67 | 3.83 |
| | DFD_100 | F | 34 | C | NH | 10.16 | 5.44 | 4.72 | 1.28 | 2.34 | 1.38 | 1.90 | 10.09 | 5.43 |
| | DFD_102 | F | 23 | C | NH | 10.13 | 5.98 | 4.15 | 1.41 | 1.58 | 0.63 | 1.70 | 8.38 | 7.81 |
| | DFD_173 | F | 26 | A | NH | 10.40 | 4.82 | 5.58 | 1.50 | 2.08 | 0.88 | 1.91 | 4.75 | 5.58 |
| | DFD_233 | F | 20 | C | NH | 9.54 | 5.33 | 4.22 | 1.51 | 2.08 | 1.28 | 2.35 | 6.10 | 7.31 |
| | DFD_237 | F | 22 | C | NH | 11.77 | 5.45 | 6.32 | 2.08 | 2.15 | −0.16 | 1.18 | 2.16 | 2.00 |
| | DFD_241 | F | 19 | A | NH | 10.12 | 5.95 | 4.17 | 2.19 | 1.81 | 1.53 | 1.15 | 3.31 | 3.36 |
| | DFD_243 | F | 19 | C | NH | 9.35 | 5.05 | 4.30 | 1.31 | 1.88 | 1.05 | 2.26 | 5.22 | 7.68 |
| Mean | NA | NA | 24.5 | NA | NA | 10.19 | 5.45 | 4.74 | 1.57 | 1.98 | 0.95 | 1.73 | 5.84 | 5.37 |
| Range | NA | NA | 19-34 | NA | NA | 2.43 | 1.16 | 2.17 | 0.91 | 0.77 | 1.69 | 1.20 | 7.93 | 5.81 |
^Sex: M = male, F = female.
*Race: C = Caucasian, AA = African American, A = Asian.
**Ethnicity: H = Hispanic, NH = Non-Hispanic.
Notes: M1=first spectral moment, M2=second spectral moment, M3=third spectral moment, M4=fourth spectral moment, NA=not applicable
Listeners
Listeners included 30 speech specialists and 100 laypeople. The speech specialists were either SLPs (n=10, 33%), SLP graduate clinicians (n=13, 43%), or speech and hearing sciences undergraduate students (n=7, 23%). All listeners learned English as a first language. None of the laypeople had a background in speech pathology or speech sciences. None of the listeners had a history of speech therapy or hearing loss. Demographics of our listeners are found in Table 2. A majority of listeners in both groups were women (81%, n=81 for laypeople; 80%, n=24 for speech specialists). Listeners ranged in age from 18 to 35 years (mean=23.1 for laypeople, mean=24.6 for speech specialists). For both groups, most listeners were Caucasian (68%, n=68 for laypeople; 80%, n=24 for speech specialists), which is representative of the population that was recruited for this study. All listeners attested to completing the survey in a quiet place, with a majority using headphones (73%, n=73 for laypeople; 77%, n=23 for speech specialists).
Table 2.
Survey participant demographics
| Group | Lay Listeners (N=100) | Speech and Hearing Specialists (N=30) |
|---|---|---|
| Age Mean | 23.09 +/− 4.17 SD | 24.6 +/− 4.33 SD |
| Age Range | 18-35 years old | 18-35 years old |
| Gender | | |
| Women | 81% (n=81) | 80% (n=24) |
| Men | 18% (n=18) | 20% (n=6) |
| Nonbinary | 1% (n=1) | -- |
| Race | | |
| Caucasian | 68% (n=68) | 80% (n=24) |
| Asian | 20% (n=20) | 3.3% (n=1) |
| African American | 3% (n=3) | 6.7% (n=2) |
| Mixed Race | 6% (n=6) | 10% (n=3) |
| Other | 1% (n=1) | -- |
| Prefer not to say | 2% (n=2) | -- |
| Ethnicity | | |
| Hispanic | 7% (n=7) | 10% (n=3) |
| Non-Hispanic | 91% (n=91) | 90% (n=27) |
| Prefer not to say | 2% (n=2) | -- |
| Listening Method | | |
| Over-ear headphones | 15% (n=15) | 30% (n=9) |
| In-ear headphones | 58% (n=58) | 47% (n=14) |
| Computer Speakers | 27% (n=27) | 23% (n=7) |
| Quiet setting | 100% yes (n=100) | 100% yes (n=30) |
| Group Makeup | | |
| Lay people - no speech or hearing training | 100% (n=100) | N/A |
| Speech-Language Pathologists (SLP) | N/A | 33% (n=10) |
| SLP Graduate Clinicians | N/A | 43% (n=13) |
| Speech and hearing science students | N/A | 23% (n=7) |
As indicated previously, DFD patients interact primarily with lay people in their day-to-day life. Therefore, we included lay listeners to determine whether they judge speech distortions differently from speech specialists. Although previous research has indicated that lay listeners can make valid judgments of various aspects of disordered speech (e.g., Burgi and Matthews, 1960; Eadie et al., 2010), we are unaware of whether this has been shown for sibilant distortions of speakers with Class III malocclusion.
Speech Sample and SMA
All speakers were audio recorded in a sound-attenuated booth (Eckoustic Noise Control Products: Eckel Industries of Canada Limited, Morrisburg, Ontario) using a head-mounted microphone (AKG, model C-520, Vienna, Austria). The speakers produced 60 words including see, she, tea, and key three times in the carrier phrase “say ___ again” (see S2 for a list of all words included in the larger study). We targeted the /s/-/ʃ/ phonetic contrast as these sounds are most often associated with lisping. We also included words with the /t/-/k/ phonetic contrast to explore if the alveolar stop may be distorted given its similar place of production to the alveolar fricative. Thus, a total of 228 target words were produced (19 speakers x 4 words x 3 repetitions).
Recordings were digitized using the Computerized Speech Laboratory (CSL Model 4500, Pentax Medical, NJ, USA). CSL software was configured to record at a sampling rate of 44.1 kHz with a low-pass filter at 80% of the Nyquist frequency (~18 kHz). SMA was performed using TF32 software (Milenkovic, 2001). Briefly, the target sounds were analyzed via the Fast Fourier Transform (FFT) algorithm using a linear frequency scale, simplifying the wave to resemble a statistical distribution curve within a static window of the spectra. For the sibilants /s/ and /ʃ/, spectral moments were determined using a 20ms window placed at the temporal midpoint of the frication noise; for the stops /t/ and /k/, a 20ms window was positioned at the beginning of the burst release.
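The recording and windowing parameters above imply some simple arithmetic, sketched below. The helper names are hypothetical, and the segment boundaries passed to them are illustrative sample indices rather than values from the study; how TF32 positions its window internally is not described here, so the centering rule for fricatives is an assumption.

```python
SAMPLE_RATE_HZ = 44_100  # sampling rate reported for the CSL recordings
WINDOW_MS = 20           # analysis window duration reported in the text


def lowpass_cutoff_hz(rate_hz=SAMPLE_RATE_HZ, fraction=0.8):
    """Cutoff at a fraction of the Nyquist frequency (half the sampling rate)."""
    return fraction * rate_hz / 2


def window_length_samples(rate_hz=SAMPLE_RATE_HZ, window_ms=WINDOW_MS):
    """Number of samples in one analysis window."""
    return round(rate_hz * window_ms / 1000)


def fricative_window(frication_start, frication_end, n_samples):
    """Window centered on the temporal midpoint of the frication noise.

    Centering the window on the midpoint is an assumption of this sketch.
    """
    midpoint = (frication_start + frication_end) // 2
    begin = midpoint - n_samples // 2
    return begin, begin + n_samples


def stop_window(burst_onset, n_samples):
    """Window starting at the beginning of the burst release."""
    return burst_onset, burst_onset + n_samples


n = window_length_samples()
cutoff = lowpass_cutoff_hz()
# Hypothetical frication spanning samples 1000 to 5000.
fric = fricative_window(1000, 5000, n)
# Hypothetical burst onset at sample 2000.
stop = stop_window(2000, n)
```

With these parameters each window holds 882 samples, and the cutoff works out to 17.64 kHz, which the text rounds to about 18 kHz.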
SMA of the speakers’ recordings yielded M1 values from each consonant token. These M1 values were averaged for each consonant (/s/, /ʃ/, /t/, /k/) to yield the speakers’ average M1 value for that consonant. Speakers were selected based on M1 /s/-/ʃ/ differences and grouped into high, moderate, and low spectral distinctions (see Table 1). All control (Class I) and high-Class III speakers had spectral separation of at least 4.0 kHz; moderate-Class III speakers had spectral separation of 2.0-3.9 kHz; and low-Class III speakers had spectral separation of 0-1.9 kHz. Although these grouping categories were somewhat arbitrary, Haley et al. (2010) reported /s/-/ʃ/ spectral differences of approximately 2 to 3 kHz for normal speakers.
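The grouping rule can be expressed as a short sketch. The function name is hypothetical, the cut points are the 2.0 and 4.0 kHz boundaries stated above, and the handling of values exactly on a boundary (assigned to the higher group) is an assumption. The example inputs are per-speaker mean M1 values taken from Table 1.

```python
def spectral_distinction_group(m1_s_khz, m1_sh_khz):
    """Classify a speaker by the M1 /s/-/ʃ/ difference in kHz."""
    difference = m1_s_khz - m1_sh_khz
    if difference >= 4.0:
        return "high"
    if difference >= 2.0:
        return "moderate"
    # Anything below 2.0 kHz is treated as low spectral distinction.
    return "low"


# Mean M1 values for /s/ and /ʃ/ from Table 1.
groups = {
    "DFD_124": spectral_distinction_group(6.54, 5.12),
    "DFD_123": spectral_distinction_group(8.40, 5.25),
    "DFD_108": spectral_distinction_group(10.85, 5.20),
}
```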
Perceptual Analysis
We conducted separate cross-sectional, quantitative surveys of both groups of listeners. For lay listeners, two surveys consisting of 114 words each (228 in total) were created with equal distributions of control, high-, moderate-, and low-spectral distinction speakers. Two surveys, each with half the recordings, were utilized to shorten the overall length of the survey to maintain lay listener engagement and ensure completion. The recordings in both surveys were randomly ordered. Lay listeners were enrolled sequentially with odd numbered listeners completing survey 1 and even numbered listeners completing survey 2. For speech specialists, the survey included all control and Class III recordings (228 recordings); a single longer survey was used, as we believed the speech specialists would be more accustomed to making repetitive speech ratings.
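The stimulus count and the alternating survey assignment can be sketched as follows. The function names and the integer speaker labels are hypothetical; the sketch only reproduces the counts and the odd/even rule stated above, not the way recordings were allocated between the two lay-listener surveys.

```python
from itertools import product

TARGET_WORDS = ("see", "she", "tea", "key")
REPETITIONS = 3


def build_stimuli(speakers):
    """One (speaker, word, repetition) entry per recorded target word."""
    return list(product(speakers, TARGET_WORDS, range(1, REPETITIONS + 1)))


def survey_for_lay_listener(enrollment_number):
    """Odd-numbered listeners complete survey 1; even-numbered, survey 2."""
    return 1 if enrollment_number % 2 == 1 else 2


# 19 speakers, labeled 0-18 here purely for illustration.
stimuli = build_stimuli(range(19))
total_recordings = len(stimuli)
per_lay_survey = total_recordings // 2
```

Nineteen speakers, four words, and three repetitions give the 228 recordings reported, or 114 per lay-listener survey.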
All listeners were asked to rate the level of perceived distortion of the initial sound in the words see, she, tea, and key using a visual analog scale (VAS) from 0 to 100, with 0 representing completely clear and 100 representing severely distorted (Supplemental Figure 1). Audio files were edited to include only the target word. Participants were instructed to focus on the initial sound of the word. The survey was pre-tested by 5 people, including 4 laypeople and 1 SLP, and iteratively revised to ensure demographic question clarity and proper technical function.
The survey was administered online remotely, due to COVID-19 pandemic restrictions on in-person research at our institution. All responses were collected using a secure UNC Qualtrics link (Qualtrics XM, Inc., Provo, Utah, USA). All listeners were recruited from the community via flyers, website recruitment advertisements, and emailed listserv announcements using IRB-approved recruitment materials. Participants consented, answered screening questions, attested to completing the survey in a quiet place with headphones, and completed the survey remotely via personal computer (Appendix). This research was approved by the Institutional Review Board of the University of North Carolina at Chapel Hill (IRB #20-2207).
Data and Statistical Analyses
The VAS ratings for the three instances of each word were averaged for each speaker. Multiple listener agreements were calculated for the speech specialists and lay listeners as separate groups; multiple listener agreements were also calculated between the lay listeners and speech specialists. This was done by aligning all possible pairs of listeners and determining the closeness of ratings for a given target word of the DFD speakers. Statistical analyses were conducted using SAS 9 Software (SAS Institute, Cary, NC, USA). Type III and pairwise p-values were calculated using a mixed model, with the word as a random variable. Pairwise p-values were calculated using least-squares means with Tukey adjusted p-values, for multiple testing error adjustment. Each DFD group was compared relative to the controls for pairwise tests. Significance was set as p<0.05.
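The multiple-listener agreement calculation can be illustrated with a minimal sketch: form every possible pair of listeners for a given target and report the percentage of pairs whose ratings fall within a set number of VAS points. The function name is hypothetical, the ratings are invented for illustration, and treating "within" as inclusive of the threshold is an assumption. This is not the SAS code used in the study.

```python
from itertools import combinations


def cumulative_agreement(ratings, thresholds=(10, 20, 30)):
    """Percent of listener pairs whose VAS ratings differ by <= each threshold.

    ratings: one VAS rating (0-100) per listener for the same target.
    """
    pairs = list(combinations(ratings, 2))
    if not pairs:
        raise ValueError("at least two ratings are needed to form a pair")
    return {
        t: 100.0 * sum(abs(a - b) <= t for a, b in pairs) / len(pairs)
        for t in thresholds
    }


# Four hypothetical listeners rating the same target word.
agreement = cumulative_agreement([10, 15, 40, 70])
```

Because the percentages are cumulative, the value for a wider threshold can never be lower than the value for a narrower one.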
Results
Listeners’ Agreement
Table 3 shows multiple-listener cumulative agreements of visual analog scale (VAS) ratings of target sounds by speech specialists and lay listeners for DFD speakers with low, moderate, and high spectral distinction. As shown in the table, cumulative percentages of agreement were calculated for pairs of ratings falling within 10, 20, and 30 VAS points. Approximately 67% of speech specialist pairs rated a given target word within 30 points on the VAS (i.e., within 30% of the scale range). For lay listeners, about 63% of pairs rated a given target word within 30 points on the VAS across all categories of spectral distinction. Although these percentages reflect only moderate levels of agreement, it must be emphasized that (a) judging the clarity of a single sound of a word is a somewhat difficult perceptual task, especially for lay listeners, and (b) because the VAS ratings were averaged across the three repetitions of each target word, reduced agreement for a single repetition was mitigated to some extent.
Table 3.
Multiple-listener cumulative agreements of visual analog scale (VAS) ratings of target sounds by speech specialists and lay listeners for DFD speakers with low, moderate, and high spectral distinction
| Listener and DFD Groups | Sound | Within 10 VAS points (%) | Within 20 VAS points (%) | Within 30 VAS points (%) | Number of listener pairs* |
|---|---|---|---|---|---|
| Speech Specialists | | | | | |
| Low Spectral Distinction** | | | | | |
| | /s/ | 26.8 | 44.4 | 57.9 | 3,346 |
| | /ʃ/ | 23.6 | 43.2 | 60.0 | 3,366 |
| | /t/ | 42.2 | 58.0 | 70.0 | 3,366 |
| | /k/ | 37.7 | 56.8 | 69.6 | 3,366 |
| Moderate Spectral Distinction** | | | | | |
| | /s/ | 28.6 | 48.1 | 61.1 | 2,988 |
| | /ʃ/ | 26.2 | 44.8 | 58.8 | 2,968 |
| | /t/ | 39.4 | 57.7 | 71.4 | 2,988 |
| | /k/ | 44.4 | 65.4 | 77.3 | 2,942 |
| High Spectral Distinction** | | | | | |
| | /s/ | 27.5 | 46.3 | 60.8 | 3,326 |
| | /ʃ/ | 32.9 | 53.1 | 66.5 | 3,366 |
| | /t/ | 44.9 | 61.1 | 72.7 | 3,366 |
| | /k/ | 40.6 | 61.2 | 72.9 | 3,366 |
| Lay Listeners | | | | | |
| Low Spectral Distinction*** | | | | | |
| | /s/ | 25.3 | 43.2 | 59.5 | 14,700 |
| | /ʃ/ | 22.4 | 40.3 | 55.2 | 14,700 |
| | /t/ | 31.3 | 48.7 | 62.3 | 14,700 |
| | /k/ | 27.6 | 45.7 | 59.8 | 14,700 |
| Moderate Spectral Distinction*** | | | | | |
| | /s/ | 24.3 | 41.2 | 56.6 | 14,700 |
| | /ʃ/ | 23.3 | 41.5 | 56.7 | 14,700 |
| | /t/ | 31.4 | 50.4 | 67.0 | 14,700 |
| | /k/ | 32.2 | 51.4 | 67.5 | 14,700 |
| High Spectral Distinction*** | | | | | |
| | /s/ | 29.1 | 48.7 | 61.9 | 14,700 |
| | /ʃ/ | 27.2 | 47.0 | 63.8 | 13,475 |
| | /t/ | 40.7 | 60.8 | 74.2 | 14,700 |
| | /k/ | 32.0 | 51.5 | 67.2 | 14,700 |
*Total number of pairs of listeners by difference group scored for degree of distortion
**Comparison of pairs, made up of two separate raters from the speech professionals group scoring recordings with low, moderate or high spectral distinctions
***Comparison of pairs, made up of two separate raters from the lay listeners group scoring recordings with low, moderate or high spectral distinctions
Table 4 shows multiple-listener cumulative agreements of VAS ratings of target sounds between speech specialists and lay listeners. (Supplemental Table 3) Approximately 50% of listener pairs rated a given target word within 30 points on the VAS across all categories of spectral distinction. This shows somewhat low agreement between speech specialists and lay listeners.
Table 4.
Multiple-listener cumulative agreements of visual analog scale (VAS) ratings of target sounds between pairs of speech specialists and lay listeners for DFD speakers with low, moderate, and high spectral distinction
| DFD Group | Sound | Within 10 VAS points (%) | Within 20 VAS points (%) | Within 30 VAS points (%) | Number of listener pairs* |
|---|---|---|---|---|---|
| Spectral Distinction | | | | | |
| Low** | | | | | |
| | /s/ | 22.2 | 38.3 | 53.3 | 30,300 |
| | /ʃ/ | 17.8 | 37.4 | 52.7 | 30,300 |
| | /t/ | 28.8 | 42.9 | 58.1 | 30,300 |
| | /k/ | 23.2 | 39.0 | 54.1 | 30,300 |
| Moderate** | | | | | |
| | /s/ | 18.3 | 32.1 | 43.3 | 33,000 |
| | /ʃ/ | 17.9 | 35.0 | 49.0 | 33,000 |
| | /t/ | 21.5 | 34.6 | 49.3 | 33,000 |
| | /k/ | 21.7 | 33.7 | 44.7 | 33,000 |
| High** | | | | | |
| | /s/ | 18.6 | 32.7 | 44.4 | 30,300 |
| | /ʃ/ | 19.3 | 34.5 | 48.9 | 28,200 |
| | /t/ | 24.1 | 38.1 | 47.0 | 30,300 |
| | /k/ | 20.0 | 33.8 | 46.3 | 30,300 |
*Total number of pairs of listeners by difference group scored for degree of distortion
**Comparison of pairs, made up of one rater from the speech professionals group scoring recordings with low, moderate or high spectral distinctions and one rater from the lay listeners’ group scoring recordings with low, moderate or high spectral distinctions
Speech Specialists’ Perceptions of Class III Surgical Speakers
Speech specialists rated speakers with low spectral distinctions as more distorted than those with moderate and high spectral distinctions (an inverse relationship), consistent with our hypothesis (Figure 2B, Table 6). Mixed models showed significant spectral distinction group effects for /s/ (p<.0001), /ʃ/ (p<.0001), and /t/ (p=.01), but not for /k/ (p=.09). There were significant pairwise differences between controls and low (p<.0001), moderate (p<.0001), and high (p<.0001) spectral distinction speakers for /s/; between controls and low (p<.0001) and moderate (p=.0002) spectral distinction speakers for /ʃ/; and between controls and low (p=.02) spectral distinction speakers for /t/ (Table 6).
Figure 2. Lay listener and speech specialists’ perceptions of Class III distortions.
Classification was based on /s/-/ʃ/ spectral distinction. A. Lay people sound distortion perceptions. B. Speech specialists’ sound distortion perceptions. Blue circle: Controls. Green square: high spectral differences. Red triangle: moderate spectral differences. Yellow upside-down triangle: low spectral differences. Bars represent standard error. Conventions: **p < 0.05 significant.
Table 6.
Means, standard errors (in parentheses), and mixed model results of visual analog scale ratings by speech specialist for controls and Dentofacial Disharmony speakers
| Sound | Controls (Reference) | Low | Moderate | High | Type III p-value* | Low p-value*^ | Mod p-value*^ | High p-value*^ |
|---|---|---|---|---|---|---|---|---|
| /s/ | 11.9 (1.72) | 28.0 (2.43) | 25.9 (2.78) | 25.9 (2.44) | <0.0001 | <0.0001 | <0.0001 | <0.0001 |
| /ʃ/ | 11.8 (2.48) | 32.2 (3.51) | 29.2 (4.04) | 19.1 (3.51) | <0.0001 | <0.0001 | 0.0002 | 0.09^^ |
| /t/ | 9.37 (1.33) | 15.9 (1.88) | 15.5 (2.15) | 13.5 (1.88) | 0.01 | 0.02 | 0.08^^ | 0.28^^ |
| /k/ | 12.2 (1.24) | 17.6 (1.75) | 12.97 (1.24) | 14.2 (1.75) | 0.09^^ | 0.06^^ | 0.99^^ | 0.80^^ |
*Type III and pairwise p-values calculated using a mixed model, with word as a random variable.
^Pairwise p-value calculated using least-squares means with Tukey adjusted p-values. Each group was compared relative to the controls for pairwise tests.
**Mean (Standard error)
^^No significant difference (p ≥ 0.05). Significance defined as p<0.05
Notes: Low=low spectral distinction, Mod=moderate spectral distinction, High=high spectral distinction
Lay Listeners’ Perception of Class III Surgical Speakers
Lay listeners rated speakers with low spectral distinctions as the most distorted and controls as the least distorted (an inverse relationship), consistent with speech specialists and our hypothesis (Figure 2A, Table 5). Mixed models showed significant spectral distinction group effects for /s/, /ʃ/, /t/ and /k/ (p<.0001, Table 5). There were significant pairwise differences between controls and low (p<.0001), moderate (p<.0001), and high (p=.001) spectral distinction speakers for /s/; between controls and low (p<.0001) and moderate (p=.0001) spectral distinction speakers for /ʃ/; between controls and low (p<.0001) and moderate (p=.0004) spectral distinction speakers for /t/; and, between controls and low (p<.0001), moderate (p<.0001), and high (p<.0001) spectral distinction speakers for /k/ (Table 5). (Supplemental Table 4)
Table 5.
Means, standard errors (in parentheses), and mixed model results of visual analog scale ratings by lay listeners for controls and Dentofacial Disharmony speakers
| Sound | Controls (High) | Class III Low | Class III Moderate | Class III High | Type III p-value* | Control v. Low*^ | Control v. Mod*^ | Control v. High*^ |
|---|---|---|---|---|---|---|---|---|
| /s/ | 13.5 (2.16) | 33.1 (3.06) | 33.9 (3.50) | 27.5 (3.06) | <0.0001 | <0.0001 | <0.0001 | 0.001 |
| /ʃ/ | 13.9 (2.98) | 39.8 (4.21) | 38.1 (4.84) | 25.8 (4.21) | <0.0001 | <0.0001 | 0.0001 | 0.09^^ |
| /t/ | 9.57 (1.58) | 24.1 (2.23) | 21.6 (2.54) | 16.6 (2.23) | <0.0001 | <0.0001 | 0.0004 | 0.050^^ |
| /k/ | 14.0 (0.95)** | 26.3 (1.34) | 21.7 (1.47) | 21.8 (1.34) | <0.0001 | <0.0001 | <0.0001 | <0.0001 |
*Type III and pairwise p-values calculated using a mixed model, with word as a random variable.
^Pairwise p-value calculated using least-squares means with Tukey adjusted p-values. Each group was compared relative to the controls for pairwise tests.
**Mean (Standard error)
^^No significant difference (p ≥ 0.05). Significance defined as p<0.05
Notes: Low=low spectral distinction, Mod=moderate spectral distinction, High=high spectral distinction
It should be noted that control speakers were rated as less distorted than Class III high-distinction speakers for /s/ (p=.001, Table 5), despite having similar M1 spectral distinction. This was also found for the speech specialists and is discussed below.
Comparison of Lay Listeners and Speech Specialists
The lay listeners and speech specialists were well matched demographically in terms of age, sex, ethnic and racial makeup, in addition to listening methods (Table 2). Both lay listeners and speech specialists rated controls and Class III speakers overall similarly, with a clear inverse relationship between spectral distinction and perceived distortion (Figure 2). However, lay listeners showed slightly less overall agreement as a group and tended to rate speakers as having more serious speech sound distortions than speech specialists (Figure 2). To determine if these differences were statistically significant for DFD speakers, we averaged each listener’s VAS ratings across all speech sounds and compared the lay listeners and speech specialists using Mann-Whitney tests. For the low spectral distinction speakers, the median VAS ratings for speech specialists and lay listeners were 19 and 31, respectively (p=.025). For the moderate spectral distinction speakers, the median VAS ratings for speech specialists and lay listeners were 18 and 29, respectively (p=.003). For the high spectral distinction speakers, the median VAS ratings for speech specialists and lay listeners were 14 and 22, respectively (p=.117). Thus, lay listeners tended to rate speakers with low and moderate spectral distinctions as more distorted than speech specialists.
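The group comparison above relies on the Mann-Whitney test, which ranks the pooled per-listener mean ratings and compares rank sums between groups. The sketch below computes only the U statistics, using average ranks for ties; it does not compute a p-value, it is not the software used in the study, and the ratings passed to it are invented for illustration.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two independent samples, averaging tied ranks."""
    # Pool the samples, tagging each value with its group (0 = x, 1 = y).
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        # Every member of the run gets the average of its 1-based ranks.
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


# Hypothetical per-listener mean VAS ratings (not study data).
specialists = [12, 18, 19, 25]
lay_listeners = [22, 29, 31, 40, 45]
u_specialists, u_lay = mann_whitney_u(specialists, lay_listeners)
```

A small U for one group means its ratings mostly rank below those of the other group, which is the pattern reported for specialists relative to lay listeners.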
Discussion
The purpose of this study was to determine if spectral distinction metrics were associated with listeners’ perceptions of distorted speech sound production by speakers with Class III malocclusion. We hypothesized that perceived distortions would increase as speakers’ /s/-/ʃ/ spectral distinctions decreased. Overall, the hypothesis was largely confirmed, particularly by speech specialists (clinicians and students), who rated DFD speakers with low spectral distinction as having the greatest speech sound distortions, especially for /s/ and /ʃ/.
As noted in the introduction, a quarter of DFD patients rank speech concerns as their highest priority (Proffit & White, 1990). Indeed, distortion (lisping) of sibilants is a frequent speech finding (Lathrop-Marshall et al., 2021). In the present study, all DFD speakers with low spectral distinction were judged to have lingual fronting distortions by the sixth author (DJZ), an experienced craniofacial SLP. In addition, listeners rated these speakers as having significantly distorted /s/ and /ʃ/ speech sounds, consistent with the craniofacial SLP and spectral distinction metric. SMA, therefore, has promise for several uses in the clinical setting. First, DFD patients can be recorded for SMA to supplement perceptual assessment of speech. As shown by the results of this study, there is variation among SLPs’ judgments of speech distortions. The use of valid and well-defined spectral metrics can only facilitate accurate diagnosis. Relatedly, DFD patients are often referred to SLPs by surgeons to document speech impairment as part of the insurance approval process required for orthognathic surgeries. Objective measures of spectral contrast can provide persuasive information to insurers. Lastly, the use of SMA as a treatment outcome measure for DFD patients can also provide objective insight into the impact of orthodontics and/or orthognathic jaw surgery on speech, bypassing potential bias of the treating clinician(s).
Several findings of the present study deserve additional discussion. One finding of particular interest was that control speakers were rated as significantly less distorted than Class III high-distinction speakers for /s/, despite both groups having high /s/-/ʃ/ spectral distinction. This may have occurred due to listeners being influenced by other spectral characteristics such as M2, M3, and/or M4. Indeed, Table 1 shows somewhat large M3 (spectral slope) and M4 (peakedness of spectral energy) differences for /ʃ/ between controls and Class III high-distinction speakers.
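To make the spectral moments referred to above concrete, the following Python sketch treats a power spectrum as a probability distribution over frequency and computes M1 (mean), M2 (variance), M3 (skewness), and M4 (kurtosis), then takes the /s/-/ʃ/ difference in M1 as a spectral distinction value. The function name spectral_moments, the choice to return variance for M2 and excess kurtosis for M4, and the toy spectra are assumptions of this sketch; they do not reproduce the TF32 analysis settings or any data from this study.

```python
def spectral_moments(freqs_hz, power):
    """Return (M1, M2, M3, M4) for a power spectrum.

    M1 = mean frequency, M2 = variance, M3 = skewness,
    M4 = excess kurtosis (kurtosis minus 3). Illustrative conventions only.
    """
    if len(freqs_hz) != len(power):
        raise ValueError("freqs_hz and power must have the same length")
    total = sum(power)
    if total <= 0:
        raise ValueError("spectrum must contain positive energy")
    # Normalize power so the weights sum to 1.
    weights = [p / total for p in power]

    m1 = sum(f * w for f, w in zip(freqs_hz, weights))
    m2 = sum((f - m1) ** 2 * w for f, w in zip(freqs_hz, weights))
    if m2 == 0:
        raise ValueError("all energy lies in a single frequency bin")
    m3 = sum((f - m1) ** 3 * w for f, w in zip(freqs_hz, weights)) / m2**1.5
    m4 = sum((f - m1) ** 4 * w for f, w in zip(freqs_hz, weights)) / m2**2 - 3
    return m1, m2, m3, m4


# Toy spectra on a coarse 1-10 kHz grid (hypothetical values, arbitrary units).
freqs = [1000 * k for k in range(1, 11)]
s_power = [0, 0, 0, 1, 2, 4, 8, 6, 3, 1]   # energy concentrated at higher frequencies
sh_power = [0, 1, 4, 8, 5, 3, 1, 1, 0, 0]  # energy concentrated at lower frequencies

m1_s = spectral_moments(freqs, s_power)[0]
m1_sh = spectral_moments(freqs, sh_power)[0]
distinction_hz = m1_s - m1_sh  # larger values indicate greater /s/-/ʃ/ contrast
print("M1 /s/:", round(m1_s), "Hz; M1 /sh/:", round(m1_sh), "Hz")
print("M1 difference:", round(distinction_hz), "Hz")
```

Two speakers can share a similar M1 difference while differing in the higher moments, which is one way the listener ratings discussed above could diverge between groups with comparable spectral distinction.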
Another finding of interest was that listeners judged /t/ to be significantly more distorted when produced by low-spectral-contrast DFD speakers with Class III malocclusion than when produced by controls. Although lisping is primarily associated with sibilant sounds, /t/ is produced at the same place of articulation as /s/ and also includes frication noise during the release (plosive) phase when produced with an oral release. The present findings, therefore, highlight that SLPs need to be cognizant of possible distortions of /t/ among DFD patients.
Validity of Lay Listeners’ Ratings
Previous research has indicated that lay listeners generally provide valid judgments of various aspects of disordered speech (e.g., Burgi & Matthews, 1960; Eadie et al., 2010). The validity of some of the lay listeners’ VAS ratings in the present study, however, needs to be questioned. Although their overall ratings paralleled the speech specialists’ ratings, there were discrepancies. First, lay listeners rated DFD speakers with low and moderate spectral distinctions significantly higher than speech specialists did. This may have occurred because speech specialists are more familiar with speech distortions and have encountered a wider range of severity. Second, the lay listeners’ ratings of /k/ differed significantly between controls and DFD speakers at all levels of spectral distinction, whereas the speech specialists rated this sound similarly across speaker groups. As noted in the methods, we included /k/ to derive a spectral distinction metric for /t/, but we did not expect this sound to be distorted by Class III DFD speakers (Supplemental Table 2). The overall pair-wise interrater reliability of the lay listeners’ VAS judgments was also more variable than that of the speech specialists. The reasons for these discrepant findings are not entirely clear but may be related to the difficulty of judging specific speech segments versus more global aspects such as intelligibility and/or vocal quality, as done in previous studies. It is also possible that our use of online recruitment of listeners was a factor, at least for the lay listeners. That is, some lay listeners may not have diligently attended to the online task and/or may have lost interest, even though we attempted to keep the length of the listening task manageable.
Limitations
Limitations of this study include a small number of control and Class III DFD speakers, a lack of balance among speakers relative to sex, a focus on M1 spectral distinction metrics, and only modest reliability of listeners’ VAS ratings. A larger sample of speakers and the inclusion of other spectral moment metrics, including the use of multi-taper analysis techniques such as that proposed by Reidy (2015), should be explored in future studies. Additionally, our study did not include listener groups who themselves exhibit lisping, such as individuals with cleft lip and palate (CLP) and/or DFD patients with Class III skeletal malocclusions. Future studies would benefit from including listeners from these patient groups to capture their perspectives and perceptions.
Conclusions
Results of this study suggest that SMA metrics may be a valid way to quantify perceptual distortions of Class III DFD speakers, both prior to and following orthognathic surgery. SMA techniques are readily available to clinicians via free downloadable programs such as TF32 (Milenkovic, 2001) and Praat (Boersma & Weenink, 2021). Trained speech clinicians, however, may provide more valid perceptual ratings of speech distortions than lay listeners. Additional research is needed to further validate this promising approach.
Supplementary Material
Supplemental Table 1. Table shows the list of words used to trigger specific vowel sounds during the speech recordings.
Supplemental Table 2. List of Class III DFD patient demographics and spectral distinction for /t/ and /k/.
Supplemental Table 3. Correlation between mean word sounds between lay people and speech professionals.
Supplemental Table 4. Correlation between log10 mean word sounds between lay people and speech professionals.
Supplemental Figure 1. Example of visual analog scale for recording scales within the distributed survey. A. Question and scale for a /s/ word. B. Question and scale for a /ʃ/ word.
Learning Outcomes.
You will be able to:
Describe how Dentofacial Disharmony patients with Class III malocclusion demonstrate changes in their tongue position and how it influences articulation.
Assess how spectral metrics relate to perceived speech distortions by lay people and speech specialists.
Compare perceived speech distortion ratings between lay people and speech specialists for speakers with Class III malocclusion with varying levels of spectral distinction.
Acknowledgements
We would like to thank the survey respondents for their time and efforts filling out this survey. We appreciate the suggestions and support provided by Dr. Steven Oliver and assistance with manuscript preparation by Nare Ghaltakhchyan.
Funding
This research was supported by the American Association of Orthodontics Martin ‘Bud’ Schulman Postdoctoral Fellowship Award (to LAJ). Additionally, this work was supported by the Oral and Maxillofacial Surgery Foundation Research Support Grant (to LAJ). Finally, the project described was supported by the National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), through Grant Award Number UL1TR002489 (to LAJ); by the National Institute of Dental and Craniofacial Research (NIDCR), NIH, through a K08 award (to LAJ), Grant Award Number 1K08DE030235; and by the NIDCR, NIH, through an R01 award (to DJZ), Grant Award Number R01DE022566. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Footnotes
Conflict of Interest
The authors have no conflicts of interest to declare.
References
- Boersma P, & Weenink D (2021). Praat: doing phonetics by computer (Version 6.2.01). [Computer program]. Retrieved November 17, 2021, from http://www.praat.org/
- Burgi EJ, & Matthews J (1960). Effects of listener sophistication upon global ratings of speech behavior. Journal of Speech and Hearing Research, 3(4), 348–353.
- Eadie TL, Kapsner M, Rosenzweig J, Waugh P, Hillel A, & Merati A (2010). The role of experience on judgments of dysphonia. J Voice, 24(5), 564–573. 10.1016/j.jvoice.2008.12.005
- Forrest K, Weismer G, Milenkovic P, & Dougall RN (1988). Statistical analysis of word-initial voiceless obstruents: Preliminary data. J Acoust Soc Am, 84(1), 115–123. 10.1121/1.396977
- Görgülü S, Sağdıç D, Akin E, Karaçay S, & Bulakbası N (2011). Tongue movements in patients with skeletal Class III malocclusions evaluated with real-time balanced turbo field echo cine magnetic resonance imaging. Am J Orthod Dentofacial Orthop, 139(5), e405–414. 10.1016/j.ajodo.2009.07.022
- Guay AH, Maxwell DL, & Beecher R (1978). A radiographic study of tongue posture at rest and during the phonation of /s/ in class III malocclusion. Angle Orthod, 48(1), 10–22. 10.1043/0003-3219(1978)048<0010:Arsotp>2.0.Co;2
- Haley KL, Seelinger E, Mandulak KC, & Zajac DJ (2010). Evaluating the spectral distinction between sibilant fricatives through a speaker-centered approach. J Phon, 38(4), 548–554. 10.1016/j.wocn.2010.07.006
- Jiang C, Whitehill TL, McPherson B, & Ng ML (2016). Spectral moment analysis of affricates produced by Mandarin-speaking pre-adolescents with repaired cleft palate. Int J Pediatr Otorhinolaryngol, 84, 137–142. 10.1016/j.ijporl.2016.01.029
- Keyser MMB, Lathrop-Marshall H, Jhingree S, Giduz N, Bocklage C, Couldwell S, Oliver S, Moss K, Frazier-Bowers S, Phillips C, Turvey T, Blakey G, White R, Mielke J, Zajac DJ, & Jacox LA (2022). Impacts of skeletal anterior open bite malocclusion on speech. Submitted to FACE.
- Khechoyan DY (2013). Orthognathic surgery: General considerations. Seminars in Plastic Surgery, 27(3), 133–136. 10.1055/s-0033-1357109
- Lathrop-Marshall H, Keyser MMB, Jhingree S, Giduz N, Bocklage C, Couldwell S, Edwards H, Glesener T, Moss K, Frazier-Bowers S, Phillips C, Turvey T, Blakey G, White R, Mielke J, Zajac D, & Jacox LA (2021). Orthognathic speech pathology: Impacts of Class III malocclusion on speech. Eur J Orthod. 10.1093/ejo/cjab067
- Milenkovic P (2001). TF32 [Computer program]. Department of Electrical and Computer Engineering, University of Wisconsin, Madison.
- Proffit WR, Fields HW, Larson BE, & Sarver DM (2019). Contemporary orthodontics (6th ed.). Elsevier.
- Proffit WR, & White RP Jr. (1990). Who needs surgical-orthodontic treatment? Int J Adult Orthodon Orthognath Surg, 5(2), 81–89.
- Proffit WR, White RP, & Sarver DM (2003). Contemporary treatment of dentofacial deformity (Vol. 283). Mosby.
- Reidy PF (2015). A comparison of spectral estimation methods for the analysis of sibilant fricatives. J Acoust Soc Am, 137(4), EL248–EL254. 10.1121/1.4915064
- Ruscello DM, Tekieli ME, Jakomis T, Cook L, & Van Sickels JE (1986). The effects of orthognathic surgery on speech production. Am J Orthod, 89(3), 237–241. 10.1016/0002-9416(86)90038-2
- Starr S (2013). 16 - Speech, language and swallowing. In Cameron AC & Widmer RP (Eds.), Handbook of Pediatric Dentistry (Fourth Edition) (pp. 463–473). Mosby. 10.1016/B978-0-7234-3695-9.00016-X
- Tanner K, Roy N, Ash A, & Buder EH (2005). Spectral Moments of the Long-term Average spectrum: Sensitive indices of voice change after therapy? J Voice, 19(2), 211–222. 10.1016/j.jvoice.2004.02.005
- Vallino LD, & Tompson B (1993). Perceptual characteristics of consonant errors associated with malocclusion. J Oral Maxillofac Surg, 51(8), 850–856. 10.1016/s0278-2391(10)80101-6
- Zajac DJ, Whitt H, Baylis A, Tourian M, & Garcia K (2021). Alveolar Backing in 3-Year-Old Children With and Without Repaired Cleft Palate: Preliminary Findings Related to Cleft Type and History of Otitis Media. Perspectives of the ASHA Special Interest Groups, 6(6), 1889–1899.