Pause Postures: The relationship between articulation and cognitive processes during pauses

Jelena Krivokapić; Will Styler; Benjamin Parrell

doi:10.1016/j.wocn.2019.100953

Author manuscript; available in PMC: 2021 Mar 1.

Published in final edited form as: J Phon. 2020 Feb 21;79:100953. doi: 10.1016/j.wocn.2019.100953

PMCID: PMC7098615 NIHMSID: NIHMS1569445 PMID: 32218635

Abstract

Studies examining articulatory characteristics of pauses have identified language-specific postures of the vocal tract in inter-utterance pauses and different articulatory patterns in grammatical and non-grammatical pauses. Pause postures—specific articulatory movements that occur during pauses at strong prosodic boundaries—have been identified for Greek and German. However, the cognitive function of these articulations has not been examined so far. We start addressing this question by investigating the effect of 1) utterance type and 2) planning on pause posture occurrence and properties in American English. We first examine whether pause postures exist in American English. In an electromagnetic articulometry study, seven participants produced sentences varying in linguistic structure (stress, boundary, sentence type). To determine the presence of pause postures, as well as to lay the groundwork for their future automatic annotation and detection, a Support Vector Machine Classifier was built to identify pause postures. Results show that pause postures exist for all speakers in this study but that the frequency of occurrence is speaker dependent. Across participants, we find that there is a stable relationship between the pause posture and other events (boundary tones and vowels) at prosodic boundaries, parallel to previous work in Greek. We find that the occurrence of pause postures is not systematically related to utterance type. Lastly, pause postures increase in frequency and duration as utterance length increases, suggesting that pause postures are at least partially related to speech planning processes.

Keywords: Articulatory settings, pause postures, pauses, speech planning, prosodic boundaries, speech production

1. Introduction

A long line of research has examined acoustic pauses during connected speech, which can be grouped into grammatical and non-grammatical pauses. Grammatical pauses are a part of prosodic boundaries, which are planned events that indicate linguistic structure. Non-grammatical pauses, on the other hand, are not planned events and are, broadly, the result of speech planning processes (e.g., pauses that are related to the time a speaker needs to plan an upcoming word or utterance, or filled pauses, such as uh and uhm). Recent work has started examining articulations during pauses (e.g., Gick, Wilson, Koch, & Cook 2004, Ramanarayanan, Goldstein, Byrd, & Narayanan 2013, Katsika, Krivokapić, Mooshammer, Tiede, & Goldstein 2014, Rasskazova, Mooshammer & Fuchs 2018) and has identified various articulatory patterns during pauses. These seem to have language-specific characteristics, but also exhibit large variability within languages in terms of speaker and context, and crucially depend on the type of pause (grammatical vs. non-grammatical).

A major unanswered question in this research is what the cognitive function of these patterns is. The current study addresses this question by examining the production of pause postures at prosodic boundaries. Pause postures are specific movements of articulators during acoustic pauses (described in detail in sections 1.2 and 1.3). Four questions are addressed in this study:

1) Do pause postures exist in American English? We examine whether specific movement patterns, as have been identified for Greek during pauses, termed pause postures, also exist in American English. Foreshadowing the results for this question, we indeed find evidence of pause postures in American English.

2) We examine how pause postures are timed relative to other gestures at the boundary. Evidence of systematic timing patterns with other gestures at the boundary would provide further evidence of pause postures as cognitive units.

3) The main focus of our study is the cognitive processes underlying these pause postures. Specifically, we examine whether their occurrence and articulatory properties are related to utterance type and to speech planning processes.

4) Finally, we address a methodological question, namely, can pause postures be detected automatically using machine learning models? This provides both a method for future annotation, as well as secondary validation of the presence and detectability of this phenomenon.

1.1. Cognitive processes related to pauses during speech

Studies of grammatical pauses have established a number of factors determining the likelihood of occurrence and length of a pause (see overview in Fletcher 2010, Fuchs, Petrone, Krivokapić, & Hoole 2013). For example, faster speech typically leads to shorter and fewer pauses (e.g., Goldman Eisler 1968, Lane & Grosjean 1973, Fletcher 1987). In read speech, but not in spontaneous speech, pauses occur only at syntactic boundaries (Goldman Eisler 1968).

Discourse content has also been shown to affect pausing in speech. Analyses based on theoretical approaches to discourse, regardless of the specific theory, consistently show that hierarchically higher discourse boundaries are associated with longer pauses (Den Ouden, Noordman, & Terken 2009, Yang, Xu, & Yang 2014, Tyler 2013, Hirschberg & Nakatani 1996). Other studies have examined how a change in topic affects pause duration. A robust finding in these studies is that topic change has an effect on pause duration, such that topic shift between utterances leads to longer pauses than topic continuation (Swerts & Geluykens 1994, Bannert, Botinis, Gawronska, Katsika, & Sandblom 2003, Smith 2004, Yang, Xu, & Yang 2014), though there is some evidence that this could be dependent on speaking style (Gustafson-Capkova & Megyesi 2002).

The occurrence and duration of pauses in general are further influenced by a number of other structural factors. The more complex the linguistic (syntactic or prosodic) structure preceding or following a boundary, the likelier a pause is to occur and to be longer in duration (e.g., Oller 1973, Cooper & Paccia-Cooper 1980, Ferreira 1991, Grosjean, Grosjean, & Lane 1979, Ferreira 1993, Sanderman & Collier 1995, Watson & Gibson 2004). Similarly, the longer the preceding or following utterance (in terms of feet, syllables, or phonological words), the likelier a pause is to occur and to be longer in duration (e.g., Sternberg, Monsell, Knoll, & Wright 1978, Ferreira 1991, Wheeldon & Lahiri 1997, Zvonik & Cummins 2003, Kentner 2007, Krivokapić 2007a, 2007b, Fuchs et al. 2013), though the strength of the effect of each of these factors is not well understood (see Krivokapić 2007a, Yang et al. 2014). While the relationship between prosodic boundary strength and pause duration has been examined only in a few studies, there is evidence that the stronger the boundary, the more likely a pause is to occur and to increase in length (Strangert 1991, Ferreira 1993, Zellner 1994, Horne, Strangert, & Heldner 1995, Choi 2003, Gollrad 2013, Petrone, Truckenbrodt, Wellmann, Holzgrefe-Lang, Wartenburger, & Höhle 2017).¹

Pauses at prosodic boundaries (the type we are examining here) are grammatical pauses, but they have been argued to have multiple functions (see Ferreira 2007 for an overview). Specifically, pauses are one of the phonetic markers of prosodic boundaries (the structural function), but processing of preceding and following utterances is also known to take place during prosodic boundary pauses. The effect of the material preceding the pause has been argued to be related either to the time for a listener to process information, or the time for the speaker to deactivate information processed in the preceding phrase, though of course it could be related to both (e.g., Watson & Gibson 2004, Krivokapić 2007a), while the effect of the upcoming material is related to the planning of the upcoming utterance (e.g., Gee & Grosjean 1983, Watson & Gibson 2004, Krivokapić 2007a, 2012). The idea behind this is that more upcoming structural units (syntactic, phonological, prosodic) will lead to longer pauses, because cognitive load increases with the number of units to be processed and the longer pauses allow more time for the speaker to plan the upcoming utterance (e.g., Goldman Eisler 1968, Grosjean et al. 1979, Cooper & Paccia-Cooper 1980, Butcher 1981, Levelt 1989, Ferreira 1991, 1993, Watson & Gibson 2004, Krivokapić 2007a, 2012, Fuchs et al. 2013).

The present study builds on the findings reviewed in this section to examine if pause posture occurrence and properties are related to discourse structure and speech planning as has been shown for pause acoustics.

1.2. Articulatory behavior during pauses

Articulation during pauses has only been investigated recently, as technological advances have allowed scientists to do so. It has long been postulated that during pauses, the vocal tract assumes a default position, i.e., an “articulatory setting” (Honikman 1964, Laver 1978, Jenner 2001), and the first observations of articulatory settings in kinematic data come from Öhman (1967) and Perkell (1969). Gick et al. (2004) were the first to systematically investigate these settings with the goal of understanding the phonological status of articulatory settings. For read speech in French and English, with five speakers from each language, they examined seven articulatory parameters (pharynx width, velopharyngeal port width, tongue body to palate distance, tongue tip to alveolar ridge distance, jaw aperture, upper lip protrusion, and lower lip protrusion) during pauses, at a point in time after articulators had stopped moving for the preceding utterance and before they started moving for the upcoming utterance. They found that English and French speakers differ in four of these parameters: upper and lower lip protrusion, pharynx width, tongue tip-to-alveolar ridge distance, and tongue body-to-palate distance. These differences indicate that articulatory settings are at least partially language specific. They further argued, based on the spatial stability of five of the vocal tract parameters, that the differences between languages in articulation during pauses were caused by targeted, language-specific articulatory settings, and as such might be part of the phonological and phonetic inventory of the language. Further evidence for this argument comes from Wilson & Gick (2014), who, in a study of eight English-French bilinguals, found that bilinguals who are perceived as native-like in both languages, but not those who aren’t, used distinct articulatory settings for the different languages.

In read speech, all pauses are typically planned, in the sense that they are structurally determined and encode prosodic structure. However, in spontaneous speech differences may exist between different types of pauses. Thus we can distinguish between planned vs. unplanned pauses, where unplanned, or non-grammatical pauses are pauses specifically introduced to allow speakers more time to plan the upcoming chunk of speech, whether to find an appropriate word or for purposes of structural encoding. It should be clarified that we are discussing here two types of planning: In one case we are talking about planning in the sense of linguistic (in this case prosodic) structure encoding, and in the other, we mean planning in the sense of planning an upcoming chunk of speech. Note also that we assume that in both types of pauses, planning of the upcoming utterance might take place, the difference being that in the case of unplanned pauses, they do not mark a structural unit, instead occurring because the speaker needs additional planning time. Ramanarayanan, Bresch, Byrd, Goldstein, & Narayanan (2009) examined pause articulation during spontaneous speech for seven speakers, capturing both grammatical pauses (defined there as pauses occurring at major syntactic constituents) and non-grammatical pauses (all other pauses). Ramanarayanan et al. found that grammatical pauses, but not non-grammatical ones, showed a significant decrease in speed of articulator movement during the pause as compared to the pre-pause period. The period following both grammatical and non-grammatical pauses showed an increase in speed of articulators. There was also higher variation in articulator speed during and after the pause for non-grammatical pauses in comparison to grammatical pauses. More variability is generally assumed to mean less targeted, less structurally controlled movements, indicating, as Ramanarayanan et al. discuss, that the grammatical, but not the non-grammatical pauses are planned, targeted articulations. As Ramanarayanan et al. argue, these results suggest that different types of articulation during pauses can reflect different cognitive processes (planned grammatical breaks encoding linguistic structure vs. active cognitive planning processes).

Ramanarayanan et al. (2013) examined, for five speakers, vocal tract postures during acoustic pauses before or after speech (“absolute rest positions”; these were pauses at the beginning and end of a data acquisition interval), pauses directly prior to speech onset (speech-ready pauses), and grammatical silent or filled acoustic pauses during both spontaneous and read speech (inter-speech pauses). They found that the vocal tract postures differed during absolute rest position (with the articulators indicating a more closed vocal tract) compared to during inter-speech pauses and pauses prior to speech onset. They further identified differences in postures between read and spontaneous speech (with a higher jaw and lower tongue in spontaneous compared to read speech). Finally, they found a trend such that absolute rest positions showed higher variability than pauses directly prior to speech onset, which in turn showed more variability than pauses during read speech. Based on this, Ramanarayanan et al. suggest that inter-speech pauses in read speech are planned in the sense of structurally controlled, targeted positions, while the absolute rest pauses are likely to be the least planned and least linguistically controlled.

Taken together, these studies provide articulatory evidence that pausing during speech can arise through multiple cognitive processes, and that these processes differentially affect the control of vocal tract articulation during the production of the pause. Articulatory configurations during pauses can be the result of targeted movements controlled by linguistic representations, but may also reflect other cognitive processes, such as speech planning in the sense of planning an upcoming utterance, which gives rise to non-grammatical pauses (Ramanarayanan et al. 2009, Ramanarayanan et al. 2013), or a variation in cognitive load (such as more demanding speech planning in spontaneous than in read speech, Ramanarayanan et al. 2013).²

The studies discussed so far examined specific positions of articulators or movements of the whole vocal tract. Katsika (Katsika et al. 2014, Katsika 2012) examined movements of individual articulators for eight speakers of Greek and identified pause postures which occurred during pauses in read speech at strong prosodic boundaries between sentences. These pause postures, which were visible on both the lip aperture and tongue dorsum trajectories, were spatially stable across repetitions and can be described as a movement away from a straight interpolation between a pre-boundary vowel and a post-boundary preparatory position for the upcoming post-boundary gesture (see Figure 3), thus introducing an additional movement between the gestures of the consonants and vowels. Katsika et al. suggested that the pause posture could be the default articulatory setting for Greek. They developed an account of articulatory events at prosodic boundaries within the π-gesture model (Byrd & Saltzman 2003), showing how pause postures exhibit specific timing patterns with temporal, constriction, and tonal gestures (this model is described in the next section). Katsika et al. (2014) thus provide an account of how articulatory settings could arise in relation to other linguistic events. Katsika further suggested that the identified properties of pause postures (their temporal relationship to other linguistic units and spatial stability) indicate that they may be targeted, controlled movements (i.e., cognitive units).³

Figure 3. — Pause posture labeling for the sentence “I don’t know about Mima, Mini does, but I know about birds”. The identified landmarks are pause posture onset, maximum constriction (target), and offset. LA: lip aperture trajectory and velocity.
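The description of a pause posture as a movement away from a straight interpolation between the pre-boundary and post-boundary positions can be illustrated with a minimal Python sketch. It is not the labeling procedure used in any of the studies cited here; the function names and the lip aperture values are hypothetical and chosen only for illustration.

```python
def linear_interpolation(start, end, n):
    """Return n equally spaced values from start to end (inclusive)."""
    if n < 2:
        raise ValueError("need at least two samples")
    step = (end - start) / (n - 1)
    return [start + i * step for i in range(n)]


def max_deviation(trajectory):
    """Largest absolute departure from the straight line joining the endpoints.

    Returns (deviation, sample_index).
    """
    baseline = linear_interpolation(trajectory[0], trajectory[-1], len(trajectory))
    deviations = [abs(x - b) for x, b in zip(trajectory, baseline)]
    peak = max(range(len(deviations)), key=deviations.__getitem__)
    return deviations[peak], peak


# Hypothetical lip aperture samples (mm) across a pause; the values are invented.
transition = [12.0, 11.0, 10.0, 9.0, 8.0]    # direct move between the two gestures
with_posture = [12.0, 9.0, 5.0, 7.0, 8.0]    # additional movement during the pause

print(max_deviation(transition))    # (0.0, 0): no departure from the straight line
print(max_deviation(with_posture))  # (5.0, 2): departure peaks at the third sample
```

A real trajectory would also contain measurement noise, so any threshold on this deviation would have to be chosen empirically.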

Rasskazova et al. (2018) also examined articulatory movements during grammatical pauses for eight speakers of German in read speech and found “rest” trajectories and “transitions”. Transitions refer to tongue movements that proceed from the pre-pause utterance to the post-pause utterance in a “smooth” movement, while “rest” trajectories consist of either articulators not moving after completion of the pre-boundary gesture, or of the tongue moving to the palate and staying there (given that this is an additional movement, it might be a pause posture). Rasskazova et al. found that the frequency of occurrence of these two types of articulations during pauses differs between speakers, with some speakers predominantly having “transitions” and other speakers predominantly producing “rest” trajectories (see also Schaeffler, Scobbie, & Mennen 2008 for speaker-specific articulatory behavior during pauses), and that speakers who predominantly produced “rest” trajectories also had longer pauses, but not slower speech rate, which seemed fairly constant across speakers.

The above studies showed large variability in the articulation of acoustic pauses. What is not clear from these studies is what cognitive function this articulatory behavior reflects, which is an essential question if we are to understand pauses. It is evident that articulatory behavior during pauses is not just language-specific and speaker-specific, but also pause-type specific, and variation is shown to occur even during the same type of boundary. On the assumption that movements in the vocal tract are not random but reflect either cognitive processes (possibly each with their own articulatory target) or physiological needs, these differences indicate that, absent a physiological explanation, variation in cognitive processes could underlie these different types of articulations. Thus, the primary goal of this study is to begin to examine which cognitive processes underlie articulations during pauses. We will focus on pause postures at prosodic boundaries, with the understanding that they might be the default articulatory settings of the vocal tract.

1.3. Theoretical account of pause postures and prosodic boundaries

Before we can examine the cognitive processes associated with pause postures, we first need to establish their existence in American English. The only account of properties of pause postures is given in Katsika et al. (2014), within the framework of Articulatory Phonology. This model, which accounts for the interdependence of tonal and temporal properties of boundaries, stress, and pause posture, will be introduced in some detail in this section, as the predictions of the model will be used to identify the timing of pause postures with other gestures. We will only discuss properties relevant for this study; many aspects of the model, and the motivation behind it, will not be discussed (for a more detailed review see Krivokapić 2014 and Krivokapić 2020).

Within Articulatory Phonology, boundaries are understood to arise through the interplay of prosodic gestures (π-gesture, μ-gesture, and tone gestures) with constriction gestures. Prosodic gestures model prosodic properties, while constriction gestures model segments. The π-gesture (Byrd & Saltzman 2003) extends over an interval and during that interval slows the clock that controls a speaker’s speech rate. The scope of the π-gesture is at this point an empirical question, as is the question of whether a boundary has one or two π-gestures. There are two possibilities: it could be one gesture spanning the period starting somewhere towards the end of the prosodic phrase and ending somewhere at the beginning of the following prosodic phrase, or one for the end of the phrase and one for the beginning of the following phrase (see Byrd & Saltzman 2003, Katsika 2016 for discussion). The effect of the π-gesture is that co-active gestures become slower, spatially larger and temporally longer (accounting, among other things, for the well-known lengthening at prosodic boundaries), and less overlapped. The strength of the effect of the π-gesture is determined by its activation level, with stronger activation levels leading to stronger boundary effects (such as more lengthening for a more strongly activated π-gesture than for a less strongly activated π-gesture). Another prosodic gesture, the μ-gesture (Saltzman, Nam, Krivokapić, & Goldstein 2008), models temporal effects of lexical stress, also lengthening gestures co-active with it (both the π-gesture and the μ-gesture lengthen gestures co-active with them; the difference between these two types of temporal gestures is mainly in their implementation within the computational model of Articulatory Phonology). Finally, tone gestures have been proposed to model lexical tone (Gao 2008). Tone gestures have as their goal linguistically relevant F0 targets (such as H and L tones). 
They have also been suggested to account for pitch accents (e.g., Mücke, Nam, Hermes, & Goldstein 2012) and boundary tones (Katsika et al. 2014). Based on their analysis of the effect of prosodic boundaries and prominence in Greek, Katsika et al. (2014) suggest the following account of the temporal relationships of prosodic gestures at the boundary (the schematic representation of the model is shown in Figure 1). As one aim of this study is to establish the existence and grammatical status of pause postures, we present aspects of this model that make predictions about the temporal patterns in our data:

Figure 1. — Simplified schematic representation of boundary events as given in Katsika et al. 2014 for words with stress on the second syllable (a) and for words with stress on the first syllable (b). The model is simplified to only represent aspects of it relevant for the issue to be examined here. “$\acute{v}$” indicates stressed vowels. The lines indicate coordination between gestures, and the dashed lines indicate a weaker coordination. The triangle marks the onset of the boundary tone and the rhomboid the onset of the pause posture. Figure adapted from Katsika et al. 2014.

1) π-gestures are coordinated with the phrase-final vowel at the boundary and, weakly, with the μ-gesture of the stressed syllable. Depending on where the lexical stress of the phrase-final word is, the μ-gesture co-occurs either with the final vowel (if stress is on the last syllable of a word, Figure 1a) or with an earlier vowel in the phrase-final word (if stress occurs earlier in the word, Figure 1b). If stress (and the μ-gesture) occurs earlier in the word, then the coordination of the π-gesture with the μ-gesture will lead to a slight shifting of the π-gesture towards the μ-gesture. This means that the π-gesture starts earlier when stress is on the first syllable than when it is on the second (compare Figure 1b to Figure 1a).

2) The boundary tone is triggered when the π-gesture reaches a certain level of activation. While “certain level of activation” is not a clearly defined point, the implication of this is that there will not be a boundary tone without lengthening (note that by virtue of having a boundary tone, this boundary will be an Intonation Phrase (IP) boundary in the model of Beckman & Pierrehumbert 1986). Given the shift of the π-gesture towards the stressed syllable (as described in 1), this threshold is reached earlier, and thus the boundary tone occurs earlier, when stress is on the first than when it is on the second syllable of a bisyllabic word.

3) Pause postures are triggered by an even stronger activation of the π-gesture than boundary tones are. Again, the term “stronger” is not well defined, but evidence for a stronger activation of the π-gesture is in general more lengthening, and for this particular assumption, the implication is that there will be no cases of pause postures occurring in utterances without boundary tones. This prediction is consistent with the fact that only strong IPs have pauses. A further implication is that the temporal relationship between the boundary tone and the pause posture onset is stable, independent of the position of the stressed syllable: both boundary tone and pause posture will be triggered by specific levels of activation of the π-gesture, and these will occur earlier when the stressed syllable is earlier in the word than when it is later (as described in point 2). However, the relationship between the boundary tone and the pause posture will be stable (schematically shown in Figure 1).
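The reasoning in points 1) to 3) can be made concrete with a small Python sketch. It assumes, purely for illustration, that π-gesture activation rises linearly from 0 to 1 and that the boundary tone and the pause posture are triggered at fixed activation thresholds. The ramp shape, the rise time, the threshold values, and the onset times are all hypothetical and are not taken from Katsika et al. (2014).

```python
RISE_TIME = 200.0         # ms for activation to rise from 0 to 1 (hypothetical)
TONE_THRESHOLD = 0.5      # activation that triggers the boundary tone (hypothetical)
POSTURE_THRESHOLD = 0.75  # higher activation that triggers the pause posture (hypothetical)


def crossing_time(onset, rise_time, threshold):
    """Time at which a linearly rising activation (0 to 1) reaches threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    return onset + threshold * rise_time


def boundary_events(pi_onset):
    """Return (boundary tone onset, pause posture onset, lag between them) in ms."""
    tone = crossing_time(pi_onset, RISE_TIME, TONE_THRESHOLD)
    posture = crossing_time(pi_onset, RISE_TIME, POSTURE_THRESHOLD)
    return tone, posture, posture - tone


# Stress on the second syllable: later pi-gesture onset (hypothetical value).
print(boundary_events(100.0))  # (200.0, 250.0, 50.0)
# Stress on the first syllable: the pi-gesture is shifted earlier.
print(boundary_events(60.0))   # (160.0, 210.0, 50.0)
```

Under these assumptions, shifting the π-gesture earlier moves both events earlier by the same amount, so the lag between boundary tone onset and pause posture onset is unchanged, which is the stability predicted in point 3).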

These patterns are evidenced in temporal relations between the boundary tone, the final vowel, and the pause posture, and thus lead to specific predictions for the temporal relationships between certain articulatory landmarks of these gestures. We will describe these in the methods section.

Finally, it is generally assumed that controlled, targeted movements show relatively little variability (for discussion of this point and a more nuanced view on variability see Gick et al. 2004, Riley & Turvey 2002, Whalen, Chen, Tiede, & Nam 2018). Thus, in order to examine the cognitive status of pause postures, we will also examine variability of the pause posture (see also Katsika et al. 2014 for such an analysis).

1.4. Motivation for the current study

One of the main goals of this study is to examine when pause postures occur. As pause postures seem to be tied to strong prosodic boundaries, they also might be tied to discourse structure, given the close link between prosodic and discourse boundaries. Prosodic and discourse boundaries are related in the sense that while discourse boundaries serve to mark larger discourse units (such as change of topic) at or above the level of a sentence and prosodic boundaries mark smaller units, namely prosodic phrases (which often but not always correspond to syntactic phrases), both are marked phonetically in a similar way, with the difference that discourse boundaries are stronger, for example having longer pauses than typical sentence level prosodic boundaries (Lehiste 1975, Swerts & Geluykens 1994, Beckman, Hirschberg & Shattuck-Hufnagel 2005). Given the observed variability in articulations at different types of pauses, the first question we address is whether articulatory settings are tied to specific discourse-pragmatic uses. Specifically, we examine the role of discourse in the occurrence of pause postures. However, the existing studies examining discourse used longer stretches of spontaneous speech, which, while ideal for the analysis of discourse and pragmatics, is not feasible for a study as controlled as the current one needs to be in order to investigate the articulation during pauses. We therefore constructed sentences varying in syntactic structure, meaning, and punctuation, with the goal of eliciting a variety of discourse-pragmatic interpretations, in the expectation that some of the sentences will elicit more pause postures than others. 
At this point, we did not formulate a hypothesis more specific than this; as there are no indications in the literature as to the cognitive functions of pause postures/articulatory settings (other than the brief points in Ramanarayanan et al., 2009, 2013 that various cognitive processes could underlie articulatory settings, and that they could be interacting with other processes of speech production), more well-founded hypotheses cannot be made. If our expectation is met, future studies will examine more specific discourse-related questions.

We further examine how pause postures may relate to cognitive function. As mentioned above, acoustic pauses serve multiple functions. Ferreira (2007, see also 1988, 1993) suggests, as a strong hypothesis, that the acoustic pause at prosodic boundaries can be divided into two fundamentally different parts: she suggests that the first part is the implementation of the prosodic boundary, while the second is related to planning. While we expect that both parts of the pause are implementations of the prosodic boundary, and that planning proceeds throughout the utterance, including through both parts of the pause, it is an empirical question whether planning takes place predominantly in some parts of the boundary or evenly throughout. One interesting possibility is that these two functions of the boundary are indicated by different articulatory behavior. Although the current study was not initially designed with this question in mind, the stimuli used allow us to additionally examine the role of speech planning in the occurrence of pause postures and thus shed some light on the question of whether the boundary is divided into different cognitive parts which are reflected in articulation. Specifically, as discussed above, it is known that longer upcoming phrases take longer to plan. We examine whether there is a relationship between the amount of planning needed (as indicated here by the number of syllables in the upcoming prosodic phrase) and pause posture occurrence and duration. We test the hypothesis that pause postures are more frequent and longer before longer upcoming phrases, allowing speakers more time to plan an upcoming utterance.

Before examining these two questions, we first need to examine whether there is a pause posture (PP) in American English, specifically asking 1) whether we see evidence of a consistent articulatory pattern during pauses (similar to the pause posture in Katsika et al. 2014) and 2) whether this articulatory pattern shows a consistent temporal relation to other linguistic events at the boundary, which would be additional evidence of the status of the PP as a cognitive, controlled unit. We use a subset of the measures Katsika used in developing her model of gestural coordination at prosodic boundaries (Katsika et al. 2014) as a diagnostic for addressing the second question.⁴ As an additional measure, we further examine spatial variability (also following Katsika et al. 2014, Gick et al. 2004, Ramanarayanan et al. 2009, 2013). To begin the investigation of pause postures in American English, we decided to focus on pause postures of the lip aperture (LA). While previous research has examined various articulators, and each articulator could be used to address this question, we focused on LA as it is relatively straightforward to label, which is useful for a study of a fairly new phenomenon. Based on what we know from existing research, there are two theoretical possibilities for how pause postures could occur: One is that they occur after the last active gesture for each articulator (e.g., on the tongue body after the last vowel in an utterance, on the tongue tip after the last coronal consonant of the utterance), thus occurring at different times for different articulators. The other possibility is that the pause posture occurs across the whole vocal tract simultaneously. According to Katsika et al. (2014) the latter should be the case (since, as discussed in section 1.3, pause postures are triggered at a certain level of activation of the π-gesture). 
This is an empirical question, but we will not address it in our study, where we will focus on one articulator only, as it will suffice to address our main questions.

As the question of pause postures is new, and as established methods for determining a pause posture do not exist, identifying the presence and extent of pause postures is potentially problematic. That is, it may be difficult to distinguish the targeted movements associated with the pause posture from background noise and interpolative motion. To address this issue, our analysis starts with labeling by a human annotator and then also uses machine learning to identify pause postures. The human annotations alone could provide the data required to study the distribution and potential triggers of pause postures, and any supervised machine learning task will necessarily be guided by (but not beholden to) the human judgements that provide the training data. Nevertheless, for an analysis of new and not-well-understood phenomena, we must ensure that the phenomena under discussion are measurable and reproducible through mathematically predictable and transparent means, rather than resting solely on human judgement or heuristics. Finally, we hope that this work will provide a useful and generalizable approach to gestural detection, and ultimately yield a model that can identify pause postures in novel data using the same criteria as those employed here.

Shaw and Kawahara (2018) present one possible approach to this issue as a component of their suite of tools for identifying phonological targets in phonetic data. They use the discrete cosine transform (DCT) to model continuous articulatory data as a series of four coefficients, and then classify tokens according to their likelihood of being targeted movement using a straightforward Bayesian classifier. This technique combines the flexibility of DCT-based curve modeling with the decision-making abilities of a Naïve Bayes classifier, and provides a more nuanced way of determining a token's status than a simple binary human decision. We present a similar but more generally applicable method that uses Support Vector Machines (SVMs) to evaluate the presence or absence of pause postures, again using curvature analysis and machine learning, and we describe how these models can be used to categorically describe annotated data.
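To make the idea of summarizing a trajectory with four DCT coefficients concrete, the following is a minimal sketch, not the implementation of Shaw and Kawahara (2018) or of the present study. It assumes an unnormalized DCT-II, and the function name `dct_coefficients` and both lip-aperture trajectories are hypothetical illustrations rather than data from the experiment.

```python
import math

def dct_coefficients(trajectory, n_coeffs=4):
    """Return the first n_coeffs unnormalized DCT-II coefficients of a trajectory."""
    n = len(trajectory)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(trajectory))
        for k in range(n_coeffs)
    ]

# Hypothetical lip-aperture samples (arbitrary units), NOT data from the study.
N_SAMPLES = 50
# Assumed case 1: the lips hold a constant aperture throughout the pause.
flat = [10.0] * N_SAMPLES
# Assumed case 2: a symmetric opening-and-closing movement centered in the pause.
bump = [10.0 + 3.0 * math.exp(-((i - 24.5) / 6.0) ** 2) for i in range(N_SAMPLES)]

flat_coeffs = dct_coefficients(flat)
bump_coeffs = dct_coefficients(bump)

print("flat:", [round(c, 3) for c in flat_coeffs])
print("bump:", [round(c, 3) for c in bump_coeffs])
```

In this sketch the flat trajectory is described entirely by the zeroth coefficient, whereas the centered movement also contributes to the second-order coefficient. A classifier such as Naïve Bayes or an SVM could then operate on feature vectors of this kind; which features the actual models used is not specified in this section.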

To summarize, the goals of the study are to examine 1) whether pause postures occur in American English, 2) whether they can be considered cognitive units, 3) whether they are related to discourse structure and speech planning, and 4) whether they can be detected automatically using curvature analysis and machine learning.

2. Methods

We present an electromagnetic articulometry (EMA) study examining the existence and kinematic properties of pause postures in American English.

2.1. Participants

Eight participants (four male and four female) with no reported history of speech or hearing disorders participated in the current study. Data from one participant were not processed as he had difficulties with the set-up and with reading the sentences. The participants were students at the University of Southern California and were naive as to the purpose of the experiment.

2.2. Stimuli

Fourteen sentences were designed to elicit a range of boundaries with varying syntactic structures and pragmatic uses (Table I). Because this study focuses on pause postures of the lip aperture (LA), we chose target words containing bilabial consonants so that LA trajectories could be controlled. There were three target words: MIma, miMA, and biBU (capitalization indicates lexical stress), pronounced as ['mimə, mɪ'mɑ, bɪ'bu]. MIma and miMA differ in lexical stress so that the temporal relationship of the pause posture with the utterance-final word could be examined (the predictions of the model developed in Katsika et al. 2014 can be tested using different stress patterns of phrase-final words). The word biBU was included only to vary vowel context, not to examine the effect of stress per se, so lexical stress was not manipulated on this target word. While these target words required participants to learn new words, they allowed for phonetically controlled boundaries, which was necessary for the purposes of this study.

Table I.

Stimuli for the target word “MIma”. The same sentences were recorded with the target words “miMA” and “biBU”. Stress on the target word is indicated by capital letters. Participants read aloud target sentences (T), which were sometimes preceded by a context sentence (C) that participants read silently. The two syllable counts give the number of syllables preceding the boundary and the number of syllables in the first prosodic phrase following the boundary. The investigated boundary is marked by “#” (the pound sign was not in the stimuli presented to the participants).

Stimuli	Number of syllables before the boundary	Number of syllables in the first prosodic phrase after the boundary	Boundary
1. C: You two know everything! T: I don’t know about MIma. # Mini does know though.	7	5	IP boundary
2. C: What should we talk about? T: There’s a lovely story I know about MIma. # Mini doesn’t like it though.	12	7	IP boundary
3. C: I will ask you about MIma later. T: I don’t know about MIma. # Mini doesn’t tell me these things.	7	8	IP boundary
4. T: I don’t know about MIma— # Mini does—but I know about birds.	7	3	IP boundary
5. T: I know about MIma, # Mini, and the rest of the gang.	6	2	IP boundary
6. T: Here is what I know about MIma: # Mini doesn't like her.	9	6	IP boundary
7. T: I don’t know about MIma # mini-dolls going on sale.	7	7	Word boundary
8. C: Does Mina know about MIma? T: Does she know about MIma? # Mini discovered MIma!	7	7	IP boundary
9. C: You two know everything! T: I don’t know about MIma… # Mini does know though.	7	5	IP boundary
10. C: So you think you will get to know about MIma? T: I hope we get to know about MIma… # Mini seems to like her.	10	6	IP boundary
11. C: So Bob certainly knows about MIma? T: Bob may know about MIma … # Mini doesn’t think so though.	7	7	IP boundary
12. C: So Bob certainly knows about MIma? T: Bob could know about MIma … # Mini doesn’t think so though.	7	7	IP boundary
13. T: # MIma mini-dolls are going on sale.	0	10	Phrase initial IP boundary
14. T: I know all about MIma. #	7	0	Phrase-final IP boundary

Participant	Disfluent/prosodic errors	Could not be labeled	Number of utterances included in the SVM model	Number of utterances included in the analysis according to the SVM model
F1	8	5	323	318
F2	2	6	286	286
F3	1	43	292	288
F4	3	4	329	329
M2	7	20	309	306
M3	2	16	318	314
M4	2	1	333	333

	Annotator 'No'	Annotator 'Yes'
SVM 'No'	258	7
SVM 'Yes'	10	100
SVM Accuracy	95.4%
Cohen's Kappa	0.89
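
The agreement figures in the confusion matrix above can be recomputed from the four cell counts. The sketch below uses the standard definitions of accuracy and Cohen's kappa; the function name `agreement_stats` and its argument names are illustrative assumptions, not part of the study's code.

```python
def agreement_stats(both_no, svm_no_ann_yes, svm_yes_ann_no, both_yes):
    """Return (accuracy, Cohen's kappa) for a 2x2 SVM-vs-annotator table."""
    total = both_no + svm_no_ann_yes + svm_yes_ann_no + both_yes
    # Observed agreement: proportion of tokens on the diagonal.
    observed = (both_no + both_yes) / total
    # Marginal totals for each rater.
    svm_no = both_no + svm_no_ann_yes
    svm_yes = svm_yes_ann_no + both_yes
    ann_no = both_no + svm_yes_ann_no
    ann_yes = svm_no_ann_yes + both_yes
    # Agreement expected by chance from the marginals.
    expected = (svm_no * ann_no + svm_yes * ann_yes) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Cell counts taken from the confusion matrix above.
accuracy, kappa = agreement_stats(258, 7, 10, 100)
print(f"accuracy = {accuracy:.4f}, kappa = {kappa:.2f}")
```

With these counts, 358 of 375 tokens lie on the diagonal, which is in line with the reported accuracy of roughly 95%, and kappa rounds to the reported 0.89.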

Speaker	Total number of pause postures and percentage (all target words included)	Total number of miMA/MIma tokens with pause postures for the analysis of temporal patterns (miMA/MIma/total)	Total number of miMA/MIma tokens with pause postures for which the boundary tone could be labeled
F1	117 (37%)	22/34/56	27 (48%)
F2	56 (20%)	5/18/23	2 (9%)
F3	36 (13%)	6/4/10	5 (50%)
F4	17 (5%)	2/0/2 (not included)	0 (0%)
M2	154 (51%)	41/33/74	21 (28%)
M3	144 (46%)	47/35/82	52 (63%)
M4	144 (43%)	22/43/65	41 (63%)

F1	F2	F3	M2	M3	M4
stress 1 152 (35)	stress 1 157 (38)	stress 1 160 (44)	stress 1 306 (108)	stress 1 260 (72)	stress 1 216 (80)
stress 2 208 (50)	stress 2 260 (48)	stress 2 215 (23)	stress 2 376 (121)	stress 2 282 (62)	stress 2 325 (102)
F(1, 55) = 24.3181, p <0.0001	F(1, 21)= 25.0498, p<0.0001	F(1, 9) = 6.9354, p = 0.03	F(1, 72) = 6.9618, p =0.0101	n.s.	F(1, 63) = 23.8961, p<0.0001

F1	F2	F3	M2	M3	M4
stress 1 −53 (32)	NA	NA	NA	stress 1 −22 (34)	stress 1 −9 (27)
stress 2 −37 (29)				stress 2 58 (67)	stress 2 49 (52)
n.s.				F(1, 51) = 26.6012, p < 0.0001	F(1, 40) = 22.6526, p < 0.0001

F1	F2	F3	M2	M3	M4
stress 1 223 (45)	NA	NA	NA	stress 1 298 (72)	stress 1 251 (98)
stress 2 214 (40)				stress 2 267 (91)	stress 2 276 (124)
n.s.				n.s.	n.s.

	Value	Std. Error	t value	Pr(>∣t∣)
(Intercept)	−0.392	0.0805	−4.868	0
constrictionpp_mm	−1.2	0.1314	−9.1329	0
targetwordmiMA	−0.061	0.1067	−0.5714	0.5679
targetwordMIma	−0.3372	0.1103	−3.0564	0.0023
zdur	0.0781	0.0744	1.0485	0.2947
constrictionpp_mm:targetwordmiMA	−0.1766	0.2168	−0.8146	0.4155
constrictionpp_mm:targetwordMIma	0.1567	0.2447	0.6407	0.5219
constrictionpp_mm:zdur	−0.1135	0.1209	−0.9384	0.3482
targetwordmiMA:zdur	0.012	0.1074	0.1119	0.911
targetwordMIma:zdur	−0.0163	0.1094	−0.1491	0.8815
constrictionpp_mm:targetwordmiMA:zdur	0.0547	0.2262	0.2419	0.8089
constrictionpp_mm:targetwordMIma:zdur	−1.369	0.2195	−6.2381	0

	Value	Std. Error	t value	Pr(>∣t∣)
(Intercept)	0.6713	0.0799	8.4072	0
constrictionpp_mm	−0.4969	0.1944	−2.5557	0.0107
targetwordmiMA	−0.1053	0.1064	−0.9896	0.3226
targetwordMIma	−0.6158	0.1045	−5.8915	0
zdur	0.0121	0.0736	0.1645	0.8693
constrictionpp_mm:targetwordmiMA	3.1841	0.2649	12.021	0
constrictionpp_mm:targetwordMIma	2.4523	0.3736	6.5649	0
constrictionpp_mm:zdur	−0.5759	0.1732	−3.3254	9e-04
targetwordmiMA:zdur	−0.0475	0.1017	−0.467	0.6406
targetwordMIma:zdur	0.0536	0.1052	0.5098	0.6103
constrictionpp_mm:targetwordmiMA:zdur	−0.3184	0.2554	−1.2466	0.2128
constrictionpp_mm:targetwordMIma:zdur	−0.426	0.3115	−1.3673	0.1718

Overall:
	Estimate	Std. Error	z value	Pr(>∣z∣)
(Intercept)	−0.9483	0.1796	−5.2795	0
sentence2	0.6162	0.2414	2.5523	0.0107
sentence3	−0.0813	0.2573	−0.316	0.752
sentence4	−0.3221	0.2689	−1.1981	0.2309
sentence5	−1.1457	0.3136	−3.6537	3e-04
sentence6	0.1561	0.2501	0.6242	0.5325
sentence8	0.5746	0.2422	2.3728	0.0177
sentence9	0.3763	0.2444	1.5398	0.1236
sentence10	0.7195	0.2407	2.9897	0.0028
sentence11	0.5325	0.2412	2.2073	0.0273
sentence12	0.9861	0.2396	4.1147	0
sentence13	0.0408	0.2506	0.1627	0.8707
sentence14	−0.1256	0.2549	−0.4926	0.6223
F1:
	Estimate	Std. Error	z value	Pr(>∣z∣)
(Intercept)	−0.8473	0.488	−1.7364	0.0825
sentence2	0.6802	0.6371	1.0677	0.2857
sentence3	−1.0498	0.7883	−1.3317	0.1829
sentence4	−17.7188	1331.4281	−0.0133	0.9894
sentence5	−17.7188	1331.4281	−0.0133	0.9894
sentence6	−0.4336	0.7026	−0.6172	0.5371
sentence8	0.7603	0.6421	1.184	0.2364
sentence9	1.1097	0.6442	1.7225	0.085
sentence10	2.1823	0.7005	3.1153	0.0018
sentence11	2.4567	0.7335	3.3491	8e-04
sentence12	2.4054	0.7353	3.2712	0.0011
sentence13	−1.5041	0.8864	−1.6968	0.0897
sentence14	0.1542	0.6524	0.2363	0.8132
M3:
	Estimate	Std. Error	z value	Pr(>∣z∣)
(Intercept)	0.1671	0.4097	0.4078	0.6834
sentence2	0.4616	0.5996	0.7698	0.4414
sentence3	−0.2624	0.599	−0.438	0.6614
sentence4	−0.3494	0.5926	−0.5896	0.5555
sentence5	−1.4198	0.6995	−2.0298	0.0424
sentence6	0.0953	0.5872	0.1623	0.8711
sentence8	1.3911	0.6859	2.0281	0.0425
sentence9	−0.1671	0.5784	−0.2888	0.7727
sentence10	0	0.5794	0	1
sentence11	−0.1671	0.5913	−0.2825	0.7775
sentence12	0.6596	0.6109	1.0798	0.2802
sentence13	−1.2657	0.6245	−2.0265	0.0427
sentence14	−2.5649	0.8443	−3.038	0.0024
F2:
	Estimate	Std. Error	z value	Pr(>∣z∣)
(Intercept)	−18.5661	1458.5063	−0.0127	0.9898
sentence2	17.1191	1458.5064	0.0117	0.9906
sentence3	16.8921	1458.5065	0.0116	0.9908
sentence4	17.4029	1458.5064	0.0119	0.9905
sentence5	16.7743	1458.5065	0.0115	0.9908
sentence6	0	2037.9363	0	1
sentence8	17.6498	1458.5064	0.0121	0.9903
sentence9	16.3688	1458.5065	0.0112	0.991
sentence10	18.0806	1458.5064	0.0124	0.9901
sentence11	17.4675	1458.5064	0.012	0.9904
sentence12	18.0806	1458.5064	0.0124	0.9901
sentence13	16.7743	1458.5065	0.0115	0.9908
sentence14	18.4607	1458.5064	0.0127	0.9899
F4:
	Estimate	Std. Error	z value	Pr(>∣z∣)
(Intercept)	−3.1355	1.0215	−3.0695	0.0021
sentence2	−17.4306	3619.1967	−0.0048	0.9962
sentence3	0.7376	1.2605	0.5852	0.5584
sentence4	−17.4306	3619.1967	−0.0048	0.9962
sentence5	0	1.4446	0	1
sentence6	−17.4306	3619.1967	−0.0048	0.9962
sentence8	0	1.4446	0	1
sentence9	−17.4306	3697.0378	−0.0047	0.9962
sentence10	0	1.4446	0	1
sentence11	−17.4306	3619.1967	−0.0048	0.9962
sentence12	−17.4306	3780.1277	−0.0046	0.9963
sentence13	1.6314	1.1615	1.4046	0.1601
sentence14	2.2482	1.1159	2.0147	0.0439
M2:
	Estimate	Std. Error	z value	Pr(>∣z∣)
(Intercept)	−0.5596	0.4432	−1.2627	0.2067
sentence2	1.6582	0.647	2.5628	0.0104
sentence3	−0.2025	0.6371	−0.3179	0.7506
sentence4	−0.539	0.7278	−0.7405	0.459
sentence5	−0.069	0.623	−0.1107	0.9118
sentence6	1.0451	0.6312	1.6559	0.0977
sentence8	2.6997	0.869	3.1066	0.0019
sentence9	0.5596	0.615	0.9099	0.3629
sentence10	1.7228	0.6774	2.543	0.011
sentence11	0.7267	0.6035	1.204	0.2286
sentence12	1.8946	0.6701	2.8273	0.0047
sentence13	1.6582	0.647	2.5628	0.0104
sentence14	−1.8383	0.8611	−2.1349	0.0328
M4:
	Estimate	Std. Error	z value	Pr(>∣z∣)
(Intercept)	0.5108	0.4216	1.2115	0.2257
sentence2	0.1823	0.6044	0.3017	0.7629
sentence3	−0.5108	0.5869	−0.8704	0.3841
sentence4	−0.5108	0.5869	−0.8704	0.3841
sentence5	−3.6463	1.1049	−3.3001	0.001
sentence6	−0.3438	0.5879	−0.5848	0.5587
sentence8	−0.8473	0.5909	−1.4338	0.1516
sentence9	0.3765	0.616	0.6112	0.5411
sentence10	−0.6779	0.5879	−1.1531	0.2489
sentence11	−0.1744	0.5909	−0.295	0.768
sentence12	0.1823	0.6044	0.3017	0.7629
sentence13	−1.5523	0.635	−2.4444	0.0145
sentence14	−2.8134	0.8531	−3.2979	0.001
F3:
	Estimate	Std. Error	z value	Pr(>∣z∣)
(Intercept)	−19.5661	2404.6704	−0.0081	0.9935
sentence2	17.4866	2404.6705	0.0073	0.9942
sentence3	17.8921	2404.6705	0.0074	0.9941
sentence4	16.927	2404.6706	0.007	0.9944
sentence5	0	3359.9889	0	1
sentence6	18.3133	2404.6705	0.0076	0.9939
sentence8	0	3287.955	0	1
sentence9	17.2635	2404.6705	0.0072	0.9943
sentence10	17.3688	2404.6705	0.0072	0.9942
sentence11	16.475	2404.6706	0.0069	0.9945
sentence12	17.7202	2404.6705	0.0074	0.9941
sentence13	18.7394	2404.6704	0.0078	0.9938
sentence14	19.399	2404.6704	0.0081	0.9936

	Estimate	Std. Error	z value	Pr(>∣z∣)
(Intercept)	−6.0099	0.4163	−14.4367	0
pausedur	0.0073	4e-04	18.5775	0
preboundary_length	0.0256	0.0385	0.6648	0.5062
upcoming_phrase_length	0.1043	0.0421	2.4758	0.0133

Overall:
	Estimate	Std. Error	z value	Pr(>∣z∣)
(Intercept)	−1.9554	0.2005	−9.7514	0
upcoming_phrase_length	0.2202	0.0323	6.8101	0
F1:
	Estimate	Std. Error	t value	Pr(>∣t∣)
(Intercept)	−0.1304	0.0979	−1.3323	0.1839
upcoming_phrase_length	0.0961	0.0163	5.8853	0
M3:
	Estimate	Std. Error	t value	Pr(>∣t∣)
(Intercept)	0.2257	0.1119	2.0175	0.0447
upcoming_phrase_length	0.0562	0.0186	3.0251	0.0028
F2:
	Estimate	Std. Error	t value	Pr(>∣t∣)
(Intercept)	0.0906	0.0892	1.0156	0.3109
upcoming_phrase_length	0.0182	0.0149	1.2211	0.2233
F4:
	Estimate	Std. Error	t value	Pr(>∣t∣)
(Intercept)	0.0112	0.0315	0.3564	0.7218
upcoming_phrase_length	0.0021	0.0053	0.3905	0.6965
M2:
	Estimate	Std. Error	t value	Pr(>∣t∣)
(Intercept)	0.1986	0.1092	1.8195	0.0701
upcoming_phrase_length	0.0629	0.018	3.4857	6e-04
M4:
	Estimate	Std. Error	t value	Pr(>∣t∣)
(Intercept)	0.2022	0.1029	1.9651	0.0505
upcoming_phrase_length	0.0553	0.0172	3.2197	0.0014
F3:
	Estimate	Std. Error	t value	Pr(>∣t∣)
(Intercept)	−0.0128	0.064	−0.1997	0.8419
upcoming_phrase_length	0.0163	0.0106	1.5364	0.1259
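
To illustrate what the "Overall" coefficients in the group of tables above imply, the sketch below converts them into predicted probabilities of a pause posture. It assumes that the overall model is a binomial regression with a logit link, which the reported z values suggest but which this section does not state. The function name `pp_probability` and the chosen syllable counts are hypothetical.

```python
import math

# Coefficients copied from the "Overall" table above.
INTERCEPT = -1.9554
SLOPE = 0.2202  # per syllable of the upcoming prosodic phrase

def pp_probability(n_syllables):
    """Predicted probability of a pause posture, ASSUMING a logit link."""
    linear_predictor = INTERCEPT + SLOPE * n_syllables
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# Illustrative upcoming-phrase lengths (in syllables).
probs = {n: pp_probability(n) for n in (2, 5, 8)}
for n, p in probs.items():
    print(f"{n} syllables: p = {p:.3f}")
```

Under this assumption, the predicted probability rises from roughly 0.18 for a two-syllable upcoming phrase to roughly 0.45 for an eight-syllable one. The per-speaker tables report t values and may come from a different model type, so the same conversion is not applied to them here.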

	Estimate	Std. Error	z value	Pr(>∣z∣)
(Intercept)	−5.8542	0.342	−17.1173	0
pausedur	0.0073	4e-04	18.6036	0
upcoming_phrase_length	0.1116	0.0406	2.753	0.0059

	Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
1	1706	1535.7111	-	-	-
2	1707	1536.1519	−1	−0.4408	0.5067

	Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
1	1707	1536.1519	-	-	-
2	1708	1543.9438	−1	−7.7919	0.0052
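
The p-values in the two model-comparison tables above can be reproduced from the residual deviances. The sketch below assumes that each comparison is a likelihood-ratio (deviance) test between nested models differing by one parameter, so the drop in deviance is referred to a chi-square distribution with one degree of freedom. The helper names are hypothetical.

```python
import math

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # For 1 df, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))

def deviance_test(dev_larger_model, dev_smaller_model):
    """Return (deviance drop, p-value) for nested models differing by 1 parameter."""
    drop = dev_smaller_model - dev_larger_model
    return drop, chi2_sf_df1(drop)

# Residual deviances copied from the two comparison tables above.
drop_a, p_a = deviance_test(1535.7111, 1536.1519)
drop_b, p_b = deviance_test(1536.1519, 1543.9438)
print(f"first comparison:  drop = {drop_a:.4f}, p = {p_a:.4f}")
print(f"second comparison: drop = {drop_b:.4f}, p = {p_b:.4f}")
```

The residual degrees of freedom (1706, 1707, 1708) are consistent with a sequence in which the preboundary-length term is dropped first and the upcoming-phrase-length term second, but that reading is an inference from the tables rather than something stated in this section.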

	Estimate	Std. Error	z value	Pr(>∣z∣)
(Intercept)	−5.2673	0.2554	−20.6224	0
pausedur	0.0074	4e-04	18.9937	0

	Estimate	Std. Error	t value	Pr(>∣t∣)
(Intercept)	684.6877	41.7584	16.3964	0
upcoming_phrase_length	16.4539	6.6227	2.4845	0.0133

	Estimate	Std. Error	t value	Pr(>∣t∣)
(Intercept)	404.738	16.6963	24.2411	0
upcoming_phrase_length	16.7744	2.8537	5.8781	0

	Estimate	Std. Error	t value	Pr(>∣t∣)
(Intercept)	376.7806	30.8174	12.2262	0
upcoming_phrase_length	11.1603	4.8714	2.291	0.0223

	Estimate	Std. Error	t value	Pr(>∣t∣)
(Intercept)	138.3371	16.5912	8.338	0
upcoming_phrase_length	0.3085	2.6201	0.1177	0.9063

	Estimate	Std. Error	t value	Pr(>∣t∣)
(Intercept)	−184.068	22.6541	−8.1251	0
upcoming_phrase_length	−4.1741	3.581	−1.1656	0.2443
