Abstract
This short report describes a longitudinal examination of the acquisition of English aspirated stops by an initial cohort of 24 adult Slavic-language (Russian, Ukrainian, and Croatian) speakers. All had arrived in Canada with low oral English proficiency, and all were enrolled in the same language instruction program at the outset. Initial bilabial stops in CVCs were recorded at eight testing times: six during the first year of the study, again at year 7, and finally at year 10. Intelligibility was evaluated through a blind listening assessment of the stop productions from the first seven testing times. Voice onset times (VOT) were measured for /p/ from all eight times. Mean /p/ intelligibility improved, mainly during a proposed Window of Maximal Opportunity for L2 speech acquisition, but remained below 100%, even after 7 years. For some speakers, early /p/ productions were minimally aspirated, with VOT increasing over time but remaining intermediate between L1 English and L1 Slavic-language values at 10 years. However, inter-speaker variability was dramatic, with some speakers showing full intelligibility throughout the study and others showing many unintelligible productions at all times. Individual learning trajectories tended to be non-linear and often non-cumulative. Overall, these findings point to a developmental process that varies considerably from one learner to another. They also demonstrate the serious drawbacks of relying on group means to characterize the process of L2 segmental learning.
Keywords: Voice onset time, longitudinal, acquisition
1 Introduction
Despite several decades of investigation into second language (L2) perception and production, the developmental details of L2 segmental learning remain poorly understood. To a great extent, this is because longitudinal research exploring how learners progress over time is limited, and with few exceptions, individual learner variability has been given only cursory consideration. In this study, we aim to partially address these limitations through a longitudinal investigation of learners’ segmental learning trajectories. The results may shed new light not only on the learners’ degree of success in segmental acquisition but also on the route they take in achieving that success. In particular, we examine non-instructed segmental learning through a longitudinal analysis of productions of English [pʰ] by Slavic-language (Russian, Ukrainian, and Croatian) learners at multiple testing times over 10 years. The data are drawn from a larger investigation of oral language development in adult ESL learners living in Canada. The broad purpose of that investigation was to examine how Canadian immigrants with low English oral proficiency progress in their L2 acquisition in terms of their achievement levels and their rate of acquisition.
We hasten to point out that this project was not originally designed to test predictions from any particular theoretical model. Rather, given the relative dearth of longitudinal data at the time it was conceived, we opted to collect a variety of data types that we expected would provide a basis for informative retrospective interpretation. In previous publications, we have reported on various dependent measures from these learners, including global comprehensibility, fluency, and accentedness (Derwing et al., 2006, 2008; Derwing & Munro, 2013; Thomson et al., 2024), aspects of L2 perception (Derwing et al., 2012), and vowel production (Munro & Derwing, 2008). The picture that has emerged after 10 years reveals large individual differences, with some learners showing relatively steady improvement and others showing idiosyncratic periods of improvement combined with plateaus and regressions. To some degree, differences in trajectories can be linked to variability in life events affecting opportunities to use spoken English (Thomson et al., 2024). In addition, the learners’ ages at the project’s outset strongly predicted global foreign accent scores at the 7-year point (Derwing & Munro, 2013). Even though all speakers had immigrated after reaching adulthood, older arrivals exhibited stronger accents.
1.1 Acquisition of L2 plosives
Numerous studies examining the perception and production of plosives by L2 learners have played a role in current theorizing about segmental acquisition, including the Revised Speech Learning Model (SLM-r, Flege & Bohn, 2021). Flege (1991), for instance, observed that late-learning (adult) Spanish L1 speakers produced word-initial English /t/ with VOT values intermediate between those of Spanish unaspirated /t/ and English aspirated /t/. In contrast, childhood learners produced native-like English VOTs. Flege et al. (1995) reported a similar finding for Italian L1 immigrants in Canada, noting that mean VOTs of speakers who had arrived at a mean of 21 years were “almost exactly intermediate” (p. 15) between those of Italian unaspirated plosives and those of Canadian English aspirated stops. These findings may relate to the ways in which learners establish phonetic categories for L2 segments. The SLM-r (Flege & Bohn, 2021) assumes that L1 and L2 speech perception and production can never exactly match because of interactions between L1 and L2 systems, but that segmental category learning can nonetheless occur irrespective of the age at which L2 learning begins. Another aspect of the SLM-r is its concern with sources of individual differences in L2 performance. For instance, Flege and Bohn (2021) reflect on data from Flege et al. (1998), showing that Spanish-speaking learners of English, all of whom arrived in the USA after age 16, varied in their productions of English /t/. Although some speakers produced tokens with native-like VOT, others had notably shorter VOT durations. The authors propose that such variability may result from differences in L1 category specification at the time of first exposure to L2, individual differences in interlingual identification, different perceptions of phonetic distance from L1 sounds, and differences in L2 phonetic input (Flege & Bohn, 2021, p. 66). 
Other work on individual variability has focused on between-learner differences in auditory processing. In particular, Saito et al. (2020) found that learners’ temporal and spectral processing abilities predicted L2 proficiency independently of language experience factors.
Much of the research on L2 segmental learning applies inferential statistics to single-timepoint and cross-sectional data. Despite the merits of such work, a detailed understanding of the learning process is not possible without longitudinal investigation. Because cross-sectional designs compare means from groups of different speakers at different stages of learning, they do not yield data on individual learning trajectories. As Lowie and Verspoor (2019) explain, mean data can be misleading because of the “ergodicity problem.” This concern relates to the nature of individual differences in a particular set of findings. Of course, individual differences appear in all L2 data, and it is theoretically possible that all learners in a group might pattern in a very similar way during acquisition of an L2 feature. For instance, they might all improve linearly at roughly the same rate. But as Lowie and Verspoor point out, learning trajectories are often non-ergodic. By this, they mean that the developmental trajectories of individuals depart considerably from the mean performance of the group. Exactly such discrepancies were observed by Thomson et al. (2024) on global speech dimensions of the same speakers as in the present study. Although the mean data suggested cumulative improvement over time, individual trajectories revealed multiple divergent patterns. For some learners, production development was non-linear and non-cumulative, with plateaus and reversals of direction, running counter to the view that learning continuously builds on previously-acquired knowledge toward some maximal level. In fact, if the acquisition process includes intervals of both improvement and decline that may occur at any time, then the commonly-assumed notion of “ultimate attainment” appears not to be viable.
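The ergodicity problem can be illustrated with a minimal sketch. The three trajectories below are invented percent-correct scores, not data from this or any other study, and the learner labels and the helper `has_regression` are hypothetical. The group mean rises at every step, yet no individual learner follows that steadily rising path.

```python
# Hypothetical percent-correct scores at four testing times (invented for illustration).
trajectories = {
    "A": [100, 100, 100, 100],  # at ceiling throughout
    "B": [20, 60, 40, 80],      # gains with one regression
    "C": [70, 60, 90, 60],      # gains and regressions, ending below the start
}

def group_means(data):
    """Mean score across learners at each testing time."""
    return [sum(scores) / len(scores) for scores in zip(*data.values())]

def has_regression(scores):
    """True if any score is lower than the one immediately before it."""
    return any(later < earlier for earlier, later in zip(scores, scores[1:]))

means = group_means(trajectories)
mean_is_cumulative = not has_regression(means)
individual_regressions = {name: has_regression(s) for name, s in trajectories.items()}

print([round(m, 1) for m in means])
print(mean_is_cumulative, individual_regressions)
```

In this toy case the mean looks cumulative, while one learner is flat and the other two regress at least once, which is the kind of mismatch that only individual-level inspection reveals.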
True longitudinal research (with multiple testing points over years) on the detailed aspects of adult L2 phonetic learning is extremely rare; however, a handful of recent shorter-term studies have shed light on the development of plosive consonants. In particular, some evidence points to more rapid gains during the early stages of learning than later on, a phenomenon that has been described by Derwing and Munro (2015) as a Window of Maximal Opportunity (WMO) for speech learning. They proposed that the largest gains in pronunciation performance are most likely to occur during the early stages of L2 learning—generally the first few months following adult learners’ initial massive exposure to L2. Evidence favoring a WMO in ESL contexts has emerged from data on global speech properties (Thomson et al., 2024), vowels (Munro & Derwing, 2008), and consonants such as /ɹ/ (Saito & Munro, 2014). The WMO also appears relevant in Hanzawa (2018), who studied Japanese learners’ acquisition of English voiceless stops over an academic year. In that case, learning occurred during the students’ first major exposure to English in a content-based language class, as opposed to one in which specific pronunciation instruction was provided. With respect to foreign language learning, Nagle (2019) studied productions of Spanish unaspirated /p/ and prevoiced /b/ by native English students enrolled in a communicatively-focused Spanish course in the United States. The speakers were “novice” learners unlikely to use Spanish outside the classroom, and they received little or no pronunciation instruction in their classes. The finding of greater improvement during the first half of the study than the second may also be consistent with the WMO, though the interpretation of foreign language studies is complicated by the general lack of massive exposure that occurs in L2-speaking environments. 
Moreover, neither González López and Counselman (2013) nor Schuhmann and Huffman (2019) observed VOT improvement in similar learners of Spanish without focused pronunciation instruction.
With respect to speaker variability, Nagle (2019) found noteworthy individual differences in learning trajectories for plosives, as did Holliday (2015), who examined Mandarin learners’ productions of Korean stops. Of particular importance here is the distinction between random variability in performance that is best understood as “noise” and patterns of behavior that are consistently different from each other. The latter type was observed by Wade et al. (2021) in L1 imitations of voiceless stops. When repeating after a model, speakers showed individual VOT divergences of reliable magnitude, even on different recording days. This finding led them to conclude that considering only the mean data from a group could lead to overly simplistic interpretations.
1.2 Stops in English and Slavic languages
Cross-linguistically, syllable-initial oral voiceless plosive production entails an interval of complete vocal tract obstruction followed by a release at the point of articulation and the onset of quasi-periodic vibration of the vocal folds (voicing) sometime later. VOT is the time (typically measured in seconds or milliseconds) between the release and the beginning of voicing. Ringen and Kulikov (2012) refer to “aspirating languages” such as English, in which aspirated /p/, /t/, and /k/ are produced with long positive VOTs, and contrast with unaspirated /b/, /d/, /g/, which generally show short positive VOTs. The authors distinguish these languages from “true voice” languages in which prevoiced stops (negative VOT) contrast with short-lag (short positive VOT) stops. The temporal properties of English plosives have been documented in numerous studies, while data on Russian, Ukrainian, and Croatian indicate that all three fit the “true voice” description. Ringen and Kulikov (2012) observed minimal aspiration of /p/ in Russian, with a mean VOT of .018 s, though a few VOTs exceeded .05 s. Matsui’s (2012) data are similar, with VOT means of .017 s and .012 s for slow and fast speech, respectively. Bondarenko (2015) notes that because of the lasting effects of oppression of the Ukrainian-speaking community in the former Soviet Union, the Ukrainian language has not been well-documented by linguists. For the same reason, Ukrainian speakers are typically simultaneous bilinguals of both Ukrainian and Russian. In the only study of Ukrainian VOT we know of, Nagy and Kochetov (2013) reported a mean of less than .03 s for /p/ in L1 Ukrainians who had immigrated to Canada. For all Croatian voiceless stops, Smiljanic and Bradlow (2008) measured a mean VOT of .021–.023 s.
1.3 Approach and research questions
Given that Slavic-language plosives differ in terms of aspiration (and therefore VOT) from their English counterparts, these segments provide a suitable focus for research. Our chief objective in this investigation is to examine longitudinally the acquisition of aspirated stops among learners whose L1 has no aspirated stop categories and who therefore need to acquire an unfamiliar type of plosive production. Broadly speaking, we determine the degree to which the learners, all recent immigrants to Canada with low spoken English proficiency, succeed in achieving this goal. We also examine the acquisition trajectories they follow, focusing on such possible patterns as continuous linear improvement over time, improvement with plateaus, improvement with regression, or attrition. Two longitudinal dependent measures are employed. First, we undertake an intelligibility analysis based on listener judgments of the L2 speakers’ stops as either /p/ or /b/. Second, we measure VOT (in s) from waveform and spectral representations of the L2 /p/ productions.
Our research questions are as follows:
Over the course of the study, to what degree will Slavic-language (SL) speakers approach Western Canadian English (WCE) norms in their production of English aspirated /p/? We address this question by assessing the intelligibility of word-initial plosive productions in English CVC words and by comparing L2 VOTs to those of a cohort of WCE speakers.
What shape will the SL group’s learning trajectory assume for /p/ over the course of the study? In light of the dearth of longitudinal research on the acquisition of L2 consonants and the call for more longitudinal work by Piske et al. (2001) and Nagle (2021), we trace the group’s trajectory of naturalistic development of /p/. Intelligibility scores are determined from productions recorded at six times during year 1 and again at year 7. Acoustic VOT measurements are made at the same time points, as well as at year 10. The resulting data allow us to identify periods of improvement, stability, and regression for both dependent variables. We also examine whether a hypothesized WMO (Derwing & Munro, 2015) is evident during the first year of learning. If so, we would expect the most rapid improvement to occur during that time.
In light of the ergodicity problem raised by Lowie and Verspoor (2019), to what degree will individual SL learners differ from each other and from group mean performance in English /p/ acquisition? To answer this question, we inspect individual learner trajectories on the dependent measures.
2 Method
2.1 Speakers
At the outset of the larger project, 25 SL speakers were recruited. Because of missing data, however, only 24 (18 Russian, 5 Ukrainian, and 1 Croatian) are considered in this report over the first year, with attrition to 18 at the 7-year point, and to 12 at 10 years. Initially, their mean age was 39 years (range: 19 to 49). All Ukrainian speakers also had competence in Russian. This particular cohort was selected in part because of their accessibility: At the outset and during the first year of the study, all were enrolled in government-funded ESL classes at a school familiar to the researchers. In addition, they were a relatively homogeneous learner group: All had been well-educated in their countries of origin and had studied English prior to arrival in Canada, but all were classified as Stage 1 (Basic Proficiency), the lowest of the three stages of speaking and listening competencies in the Canadian Language Benchmarks (Pawlikowska-Smith, 2000). With respect to the research questions posed in this report (i.e., concerning the acquisition of English /p/), they were a very suitable group because of the absence of aspirated stops in their L1 phonological inventories. Because the participants did not receive explicit English pronunciation instruction, we regard them as naturalistic learners with respect to the segmental issues addressed here. All speakers had normal hearing, as confirmed by a pure-tone hearing screen.
2.2 Procedure
Speech data for this study were evaluated from recordings collected at eight times: six recording sessions were scheduled at roughly 2-month intervals during the first year, another session was held at the end of year 7, and a final one took place at 10 years. The first-year recordings were made in a quiet room at the learners’ language schools; the remainder were made in a TESL research laboratory. Several tests were conducted during each session, but for this report, we focus exclusively on plosive productions. While wearing headphones, the speakers heard a randomized sequence of bVt and pVt words recorded by a speaker of WCE. These were presented in the frame, “The next word is _____.” Their task was to repeat each target item in the frame, “Now I say ______.” We assume that this reformulation entailed slightly more processing than Wade et al.’s (2021) shadowing task, in which the speakers simply repeated target items after a model voice. A sufficient pause was given to allow easy, unrushed productions. The productions were digitally-recorded (Sennheiser MD46 microphone with either HHB MDP500 or Marantz MD670 recorder) and the target words were excised from their frames and saved as audio files with a sampling rate of 44.1 kHz and 16-bit resolution. For comparison purposes, we selected stimuli from a database produced by 15 WCE speakers.
2.3 Intelligibility assessment
Intelligibility was evaluated for the SL /p/ and /b/ tokens collected at the outset of the study and at 2 months, 4 months, 6 months, 8 months, 1 year, and 7 years, but not 10 years, as explained later. The /p/ productions were in the words peat, pet, pat, and pot, and the /b/ productions were in beat, bet, bat, and bot. The first two authors pre-screened the stimuli in informal listening sessions. Two tokens were removed because of background noise. Also, many tokens were heard as /b/ and some were ambiguous—heard as /p/ by one listener and as /b/ by the other. A total of 646 /pVt/ tokens were saved; to these we added 96 of the /bVt/ items (6 per speaker) for evaluation. We chose not to use an equal number of /bVt/ and /pVt/ tokens for two reasons: first, to minimize listener fatigue, and second, because of the preponderance of /pVt/ tokens heard as /bVt/ in the pre-screening. Adding an equal number of /bVt/ tokens might bias the responses in favor of /pVt/ (Repp et al., 1984), thus giving us an overestimation of the speakers’ stop intelligibility. Our preference was to err (if at all) in the direction of under-estimation.
Intelligibility was measured via a blind listener identification task, which is commonly used for consonant intelligibility assessment (e.g., Hazan & Simpson, 2000). For the SL productions, the first two authors and three research assistants (all speakers of WCE) served as assessors. Evaluations were completed soon after the 7-year point in the study. Because it was not possible for all assessors to re-convene after the 10-year point, we did not carry out an intelligibility evaluation on the 10-year data. The task was administered with a customized script (available on request from the corresponding author) using Praat’s Multiple Forced Choice listening experiment capability (Boersma & Weenink, 2023). The assessors sat in front of a computer screen, listened to a blind, random presentation of the tokens, and labeled the initial consonants via one of two screen buttons. They were instructed not to make any assumptions about the relative proportions of /p/ and /b/ tokens and were simply told that the numbers might be equal or not equal. Up to three replays were permitted. To determine the listeners’ accuracy in identifying /p/ and /b/ produced by L1 English speakers, the first two authors and two research assistants subsequently completed a separate intelligibility assessment of the WCE stimuli. The same forced-choice task was used under the same conditions. However, an equal number of /b/ and /p/ tokens were presented because it was not expected that the WCE speakers’ productions of /p/ would tend to be heard as /b/.
2.4 Acoustic analysis
VOTs were measured for all SL /p/ tokens used in the intelligibility assessments and for those collected at an additional testing point 10 years into the investigation. Measurements were made to the nearest .001 s from the release of the /p/ to the onset of voicing, as determined from waveform and spectrogram displays in Praat. The same was done for the WCE speakers’ productions of /b/ and /p/ to obtain a clear picture of how the L2 VOTs would compare with those of both WCE categories.
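As a minimal sketch of the arithmetic behind each measurement, VOT can be expressed as the voicing-onset time minus the release time, rounded to the nearest .001 s. The function name and the landmark times below are hypothetical; in the study itself the landmarks were located by inspecting waveform and spectrogram displays in Praat, not by a script like this one.

```python
def voice_onset_time(release_s, voicing_onset_s):
    """VOT in seconds: voicing onset minus release, rounded to the nearest .001 s."""
    return round(voicing_onset_s - release_s, 3)

# Hypothetical landmark times (seconds from the start of an audio file).
long_lag = voice_onset_time(release_s=0.512, voicing_onset_s=0.566)
prevoiced = voice_onset_time(release_s=1.200, voicing_onset_s=1.115)

print(long_lag)   # positive: voicing starts after the release
print(prevoiced)  # negative: voicing starts before the release
```

A positive value corresponds to a voicing lag, as in aspirated or short-lag stops, and a negative value corresponds to prevoicing.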
3 Results
3.1 Group assessment of intelligibility and VOT
For the SL /p/ productions, all 5 judges agreed on the perceived target in 75% of tokens, and at least 4 of 5 agreed on 88%. Given the ambiguity in voicing categories noted in pre-screening, we deemed this level of agreement acceptable. The vast majority of SL /b/ productions were judged correct (95%), while the overall score on /p/ was 83%. Of the 600 identifications of the WCE productions (4 listeners × 150 items) only 2 misidentifications occurred, each on a different item.
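The two agreement figures can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the procedure actually used: the response strings are invented, each character stands for one judge's forced-choice label, and `agreement_rates` is a hypothetical helper.

```python
from collections import Counter

def agreement_rates(token_responses, n_judges=5):
    """Return (share of tokens with unanimous labels, share with at least n_judges - 1 agreeing)."""
    # Size of the largest block of identical labels for each token.
    largest_block = [Counter(r).most_common(1)[0][1] for r in token_responses]
    n_tokens = len(largest_block)
    unanimous = sum(size == n_judges for size in largest_block) / n_tokens
    near_unanimous = sum(size >= n_judges - 1 for size in largest_block) / n_tokens
    return unanimous, near_unanimous

# Invented labels from five judges for four tokens.
responses = ["ppppp", "ppbpp", "bbbbb", "pbbpp"]
unanimous, near_unanimous = agreement_rates(responses)
print(unanimous, near_unanimous)
```

Note that agreement here is counted on the label the judges converge on, whether or not that label matches the intended target.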
For the SL group, an intelligibility score out of a maximum of 5 was computed for each token by summing the number of correct identifications from the judges. Scores were then pooled over the productions from each speaker to give a single score for each speaker at each testing time. The circle symbols in Figure 1 show mean /p/ intelligibility development up to year 7. At the outset, the SL group scored about 70% correct. A repeated-measures ANOVA (6 levels of time) showed significant improvement over year 1, F(2.682, 61.686) = 7.146, p < .001, η² = .237 (Greenhouse-Geisser-adjusted due to non-sphericity). However, the figure suggests that gains were confined to about the first 6 months of the study while the speakers were still enrolled in English language classes in Canada. After that, the amount of change appears small. Because of attrition, the mean at 7 years comes from a subset of the original group. An additional repeated-measures ANOVA was computed for the attrited sub-group, with 7 levels of time. The overall effect of time was again significant, though the effect size was smaller, F(2.687, 45.686) = 3.799, p = .02, η² = .183. A t-test for correlated samples from only that subgroup revealed a non-significant difference between 1 and 7 years, t(17) = .432, p = .671.
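The scoring steps described above can be sketched as follows. The speaker label, testing time, and judge responses are invented, and expressing the pooled score as a percentage of the maximum possible is an assumption about how the pooling was reported; `token_score` and `pooled_percent` are hypothetical names.

```python
from collections import defaultdict

def token_score(responses, target="p"):
    """Number of judges whose label matches the intended target (0 to 5 with five judges)."""
    return sum(1 for label in responses if label == target)

def pooled_percent(tokens, target="p"):
    """Pool token scores per (speaker, time) and express them as percent correct."""
    correct = defaultdict(int)
    possible = defaultdict(int)
    for speaker, time, responses in tokens:
        correct[(speaker, time)] += token_score(responses, target)
        possible[(speaker, time)] += len(responses)
    return {key: 100 * correct[key] / possible[key] for key in correct}

# Invented example: four /p/ tokens from one speaker at one testing time, five judges each.
tokens = [
    ("S01", "outset", "ppppp"),  # score 5
    ("S01", "outset", "ppbpp"),  # score 4
    ("S01", "outset", "bbbbb"),  # score 0
    ("S01", "outset", "pbbpp"),  # score 3
]
scores = pooled_percent(tokens)
print(scores[("S01", "outset")])
```

Here the four token scores sum to 12 out of a possible 20, so the pooled score for that speaker and time is 60%.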
Figure 1.

Mean intelligibility (circles) and VOT (triangles) data with standard error for the L2 speakers at each time point. Dotted lines are used to emphasize the time gap between 12 and 84 months.
Descriptive VOT statistics are provided in Table 1. Figure 1 also shows the SL group’s mean VOTs in triangle symbols, using the scale on the right-hand y-axis. The initial mean of .032 s rose to a first-year peak of .054 s at 8 months. In general, the VOT data paralleled the intelligibility data. A repeated-measures ANOVA yielded a significant effect of time (6 levels) on VOT over the first year, F(5, 115) = 9.723, p < .001, η² = .297. An additional ANOVA on data from only the 10-year attrited subgroup (8 levels of time) yielded a similar outcome, F(7, 77) = 4.18, p < .001, η² = .275. The VOT difference between 1 and 7 years for the attrited group missed significance, t(17) = .8, p = .435, as did the difference between 7 and 10 years, t(11) = .524, p = .611. Mean VOTs for the WCE speakers were .082 s for /p/ (range: .057 to .128 s) and .012 s for /b/ (range: .005 to .035 s). Thus, while the SL group’s mean at each time exceeded the WCE mean for /b/, it remained .024 s below the WCE mean for /p/, even at its highest point.
Table 1.
Descriptive Statistics (VOT of /p/) at All Testing Points for SL and for WCE.
| Statistic | SL: Outset | SL: 2 months | SL: 4 months | SL: 6 months | SL: 8 months | SL: 12 months | SL: 84 months | SL: 120 months | WCE |
|---|---|---|---|---|---|---|---|---|---|
| Valid | 95 | 96 | 96 | 96 | 96 | 96 | 72 | 48 | 60 |
| Mean (s) | 0.032 | 0.041 | 0.048 | 0.047 | 0.054 | 0.047 | 0.048 | 0.057 | 0.082 |
| St. Dev. | 0.027 | 0.031 | 0.031 | 0.029 | 0.029 | 0.029 | 0.029 | 0.028 | 0.016 |
| Min (s) | 0.000 | 0.000 | 0.009 | 0.006 | 0.011 | 0.004 | 0.012 | 0.018 | 0.052 |
| Max (s) | 0.118 | 0.124 | 0.130 | 0.140 | 0.142 | 0.143 | 0.149 | 0.123 | 0.128 |
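As a quick worked check on how far the SL means in Table 1 sit from the WCE /p/ mean, the shortfall at each testing time can be computed directly. The sketch below uses the rounded table means converted to milliseconds, and the variable names are hypothetical.

```python
# Mean /p/ VOT in milliseconds, transcribed from Table 1 (rounded values).
sl_means_ms = {"outset": 32, "2 mo": 41, "4 mo": 48, "6 mo": 47,
               "8 mo": 54, "12 mo": 47, "84 mo": 48, "120 mo": 57}
wce_p_mean_ms = 82

# Shortfall of each SL mean relative to the WCE /p/ mean.
shortfall_ms = {time: wce_p_mean_ms - mean for time, mean in sl_means_ms.items()}
smallest_gap_time = min(shortfall_ms, key=shortfall_ms.get)

for time, gap in shortfall_ms.items():
    print(f"{time:>7}: {gap} ms below the WCE /p/ mean")
print("smallest shortfall at:", smallest_gap_time)
```

Because the inputs are rounded table entries, the results can differ by about 1 ms from figures computed on unrounded data.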
3.2 Individual intelligibility assessment
Individual intelligibility trajectories departed markedly from the group performance in Figure 1. A full set of individual figures is provided in the supplementary data file. To facilitate further discussion, the first two authors assigned one of four ad hoc descriptors to each trajectory. The labeling, based on visual impressions of the figures, was determined by consensus. The following labels were used: “perfect/near perfect” (n = 7), “rapid” (n = 6), “slow” (n = 6), and “gains + regression” (n = 5). To illustrate the categories, we selected four speakers (Figure 2) who remained in the study for its full duration, each showing a different development pattern. Speaker SL30, along with six other speakers, showed perfect (or nearly perfect) intelligibility from the outset and at each subsequent testing point. Speaker SL44 was classified as a rapid learner, showing an initial score of 65%, with an increase to 80% at 4 months, and a subsequent regression followed by improvement to perfect performance at year 1. Four additional speakers patterned in a similar way, all reaching high scores. SL29, labeled a slow learner, began at 20%, reached 25% at 4 months, and thereafter showed improvement, a slight regression, and further improvement to 80% at 7 years. Four other speakers followed a similar pattern. Finally, speaker SL35 began at 75%, peaking at 85% at 2 months, then regressing, improving, and regressing again to 30%, well below the initial score. The other four speakers in the gains + regression category showed 7-year scores that were equal to or lower than their initial scores, all of them also showing both gains and regressions over time.
Figure 2.

Individual intelligibility data from 4 divergent speakers from the 2-month to 7-year points illustrating the “perfect/near perfect” (SL30), “rapid” (SL44), “slow” (SL29), and “gains + regression” (SL35) patterns seen in the data. Dotted lines are used to emphasize the gap between 12 and 84 months. (Individual trajectories for all speakers are provided in the supplementary data file.)
There was no evidence that the L1 Russian and L1 Ukrainian speakers performed differently in terms of intelligibility. Two L1 Ukrainian speakers fell into the perfect/near-perfect category, two were categorized as slow, and the fifth fell into the gains + regression group. The sole L1 Croatian speaker also fell into the latter group; however, the performance of a single speaker cannot be used to draw conclusions about Croatian speakers in general. (See the supplementary data, in which these speakers are identified.)
As noted earlier, Derwing and Munro (2013) observed a close negative relationship between learners’ ages at the outset of the study and global foreign accents at the 7-year point. In the present study, however, a correlational analysis yielded no significant relationship between age and /p/ intelligibility at any of the test timepoints (rs ranging from −.278 to .306, p > .05).
3.3 Acoustic analysis
As with the intelligibility data, the VOT trajectories of individual learners varied considerably and tended to be non-linear. Figure 3 presents VOT data for the same 4 speakers as in Figure 2. A perfect performer in terms of intelligibility, SL30 produced the longest VOTs at every time point, exceeding the WCE mean of .082 s at all but one testing point. Speaker SL44 began with a VOT of only .026 s at the outset, showed a large VOT increase after the end of year 1, and reached the WCE mean at the last two testing points. Both other speakers produced relatively short VOTs over the 10-year period. SL29 produced a mean VOT of .010 s at the outset and appeared to improve after the 1-year point, reaching a maximum of .041 s at 10 years, noticeably lower than SL44’s VOT. Speaker SL35 peaked at .033 s at the 2-month point and then declined slightly, with slight changes in both directions thereafter. As noted earlier, that same speaker’s intelligibility had reached its maximum at 2 months as well. A full set of individual VOT results is included in the supplementary data.
Figure 3.

Individual VOT data for 4 divergent speakers from 2 months to 10 years corresponding to the “perfect/near perfect” (SL30), “rapid” (SL44), “slow” (SL29), and “gains + regression” (SL35) patterns seen in the intelligibility data. Dotted lines are used to emphasize the gap between 12 and 84 months. (Individual trajectories for all speakers are provided in the supplementary data file.)
3.4 VOT distributions and changes in variability over time
The violin and box plots in Figure 4 show the VOT medians and ranges for all /p/ productions by the SL and WCE groups. The SL data are more widely distributed than the WCE data, with VOTs often falling below the WCE minimum of .052 s and sometimes above the WCE maximum of .128 s.
Figure 4.

Violin and box plots for all VOT measurements for the two speaker groups.
4 Discussion
This report focuses on two independent measures of English aspirated stop production by SL speakers over an extended time period: plosive intelligibility over the speakers’ first year of residence in Canada and at seven years, and plosive VOT at the same time points and at 10 years. We posed three research questions. In response to the first—the degree to which SL speakers would approach WCE norms for aspirated /p/—our results are mixed. First, intelligibility at the outset of the study averaged 70%, a level that would presumably not be possible if the speakers had simply drawn upon their L1 categorical knowledge when producing English /p/. Moreover, some speakers performed at 100% at the beginning and at all other points. These findings suggest that some of the Slavic speakers did not encounter difficulty when learning English aspirated /p/, despite the absence of aspirated consonant categories in their L1. At 7 and 10 years, however, the mean VOTs of SL /p/ productions fell considerably short of the WCE mean, even though intelligibility was high (88%) at 7 years. This result reconfirms that, like other aspects of L2 pronunciation, aspiration in /p/ need not reach native speaker norms for full intelligibility to be observed (Munro & Derwing, 2020). At the same time, however, it points to performance that, for many speakers, was not fully native-like. We cannot be certain why many of the VOTs fell short of native values. One possibility is that the speakers’ well-established L1 production routines for unaspirated stops hampered their ability to produce English-like VOT durations, even after many years of speaking English in Canada. An interesting complication arises here. If, at some point in their learning, the speakers succeeded in producing L2 VOTs that were non-native-like, but long enough to ensure intelligibility, they would not necessarily be motivated to increase VOTs still more because their productions would be fully intelligible to their interlocutors.
Our second question concerned the shape of the SL group’s learning trajectory for /p/ intelligibility and VOT. The mean results seem to indicate relatively steady improvement over approximately the first 8 months of the study, with plateauing thereafter. However, as explained below, this observation requires considerable qualification. In general, the shape of the mean trajectory supports the existence of a WMO for speech learning during the first months of L2 exposure (Derwing & Munro, 2015; Munro & Derwing, 2008).
Research question 3 focused on the degree to which individual learners would differ from each other and from the group as a whole. Although the trajectories of some speakers (labeled by us as “rapid” learners) were relatively similar to the means, most speakers diverged noticeably. One cohort (labeled “perfect/near perfect”) showed full intelligibility from the outset onward, and another (labeled “slow”) showed relatively slow improvement but eventually achieved high intelligibility. The final group (“gains + regression”) showed periods of improvement combined with regressions, ultimately achieving no net gain in intelligibility over the course of the study. Across the sub-groups, changes in intelligibility and VOT over time were often non-cumulative and entailed both improvements and regressions. It is also worth noting that even at the seven-year point, nearly 60% of speakers produced both intelligible and unintelligible /p/ tokens. In accordance with the SLM-r (Flege & Bohn, 2021), these outcomes might indicate that categorical knowledge of English /p/ continued to be non-native-like.
To sum up, our mean production data show increasing intelligibility and VOT over the first few months of the study with plateauing thereafter. In that respect, the outcome supports the assumption of the SLM-r (Flege & Bohn, 2021) that phonetic learning remains possible in adulthood. At the same time, the large individual departures from mean performance point to non-ergodic patterning. As explained by Lowie and Verspoor (2019), non-ergodicity occurs when the mean trajectory followed by a group of learners is not a good indicator of the trajectories followed by individual members of the group. In this study, few of the individual speakers came close to matching the group’s mean performance over time; in fact, the trajectories of most participants departed substantially from the means. For instance, those in the “perfect/near perfect” category showed little or no evidence of learning because their performance was highly intelligible right from the start. In contrast, learners in the “slow” and “gains + regression” categories showed periods of learning combined with plateaus and periods of regression. Such variability cannot be seen in the mean data. We therefore concur with Wade et al. (2021) that mean production data tend to conceal individual complexities in phonetic behavior. It follows that an understanding of L2 phonetic learning processes requires a careful examination of individual data.
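The point about non-ergodicity can be illustrated with a minimal sketch. The scores below are invented purely for illustration and are not the study's data; the speaker labels merely echo the four trajectory categories, and the helper names (`group_mean`, `max_abs_deviation`, `is_monotonic`) are hypothetical. Under these assumptions, the group mean rises steadily even though only one of the four illustrative speakers stays close to it.

```python
from statistics import mean

# Hypothetical intelligibility scores (% correct) at four testing times.
# These numbers are invented for illustration; they are NOT the study's data.
trajectories = {
    "near_perfect": [100, 100, 100, 100],
    "rapid": [60, 80, 90, 90],
    "slow": [40, 50, 70, 90],
    "gains_regression": [60, 80, 60, 60],
}


def group_mean(trajs):
    """Mean score across speakers at each testing time."""
    return [mean(scores) for scores in zip(*trajs.values())]


def max_abs_deviation(scores, reference):
    """Largest absolute gap between one speaker and the reference trajectory."""
    return max(abs(s - r) for s, r in zip(scores, reference))


def is_monotonic(scores):
    """True if the scores never decrease from one testing time to the next."""
    return all(later >= earlier for earlier, later in zip(scores, scores[1:]))


means = group_mean(trajectories)
print("group mean:", means, "| monotonic:", is_monotonic(means))
for label, scores in trajectories.items():
    print(label, "| max deviation from mean:", max_abs_deviation(scores, means),
          "| monotonic:", is_monotonic(scores))
```

In this toy case the mean trajectory shows no regression at all, whereas the "gains_regression" speaker ends where they began. That is the kind of pattern that inspection of group means alone would miss.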
4.1 Implications
The alignment of our findings with the proposed WMO suggests that the first few months of massive exposure to the L2 are an optimal time for learning and likely the best time for pronunciation instruction. This may be because learners’ phonetic representations are more malleable at early times than later in the language acquisition process. Alternatively, or in addition, L2 production routines may be more amenable to adjustment during the early stages of acquisition, but become more fixed over time as a result of greater L2 production experience.
Another implication of the current findings is that pedagogical specialists should not be overly concerned with linguistically based segmental error prediction. This study, along with others, indicates that production difficulties and learning trajectories displayed even by learners from the same or similar L1 backgrounds are not homogeneous. Some SL speakers appeared to have no difficulty in producing intelligible English /p/ even at the study’s outset. Others had serious difficulties that persisted right to the end. Our study does not permit us to pinpoint the reasons for this variability. However, possible causes for individual differences in performance have been offered in connection with the SLM-r (Flege & Bohn, 2021). Among these are differences in the way L1 categories (in this case, L1 /p/) were specified at the beginning of L2 learning and differences in the nature of L2 phonetic input. Differential success in acquisition may also result from variations in speakers’ perceptual-cognitive abilities (e.g., auditory processing). As observed by Saito et al. (2020), even with identical amounts and qualities of input, the perceptual-cognitive individual differences among speakers may influence their ability to make the most of input opportunities, leading to differing learning outcomes over time. We cannot establish a connection between variables such as these and the outcome of the study because we do not have data on L1 category knowledge, amount of exposure to productions of stop consonants, or auditory processing abilities. In future longitudinal work, researchers should consider including appropriate measures of these variables in their research designs. We should add here that we failed to find a relationship between the learners’ ages at the study’s outset and their /p/ intelligibility, even though we had previously observed that later immigration age predicted stronger L2 accentedness (Derwing & Munro, 2013).
That difference may be explained by the fact that accentedness encompasses a wide range of phonetic properties and is usually assessed from speech samples lasting at least a few seconds. Here we have constrained the evaluation to individual speech sounds produced in one-syllable words. Moreover, we have focused on intelligibility, which does not correlate well with accentedness.
With respect to L2 classroom applications, the large differences seen among learners point to a need for individualized assessment and instructional plans. For all learners, distinguishing English /p/ and /b/ is potentially important because English has many minimal pairs involving these sounds, and they occur in high frequency words. For learners who experience difficulty producing aspirated consonants, instruction may be helpful. Research evidence suggests that VOT is indeed trainable and that more than one instructional approach may be effective. Dahmen et al. (2023), for instance, observed significant increases in the VOT of Italian learners of German when they were provided with a simple explanation and demonstration of aspiration, followed by practice with peer feedback. Offerman and Olson (2016) achieved the reverse result—a reduction of VOT among English learners of Spanish—by providing feedback training in the form of visual representations of speech. Nevertheless, for many learners of English, training may be unnecessary, given that high levels of intelligibility were achievable in this study without focused pronunciation instruction.
4.2 Limitations and future directions
A few limitations of this investigation deserve attention. First, as with virtually all long-term longitudinal studies, some participants dropped out before the end of the project. This complicates interpretation of the 10-year data because those who did not complete the study may have progressed further (or not) in ways different from those who continued to participate. Another concern is that findings for a single plosive category, based on productions obtained through a delayed-repetition task, cannot automatically be assumed to generalize to spontaneous speech. Finally, our focus is on only one member of the English /p/-/b/ opposition. To gain a full understanding of how L2 plosive acquisition takes place, data from /b/ would also have to be examined, and contrasts from other places of articulation would need to be included.
To our knowledge this study is the first to extend L2 VOT research to 10 years with multiple testing points. Its outcomes indicate that future longitudinal studies must examine data at the level of the individual learner, given that mean data may present a misleading picture of how segmental learning processes actually unfold. We also believe that other researchers should see our findings as encouraging in the sense that changes in adult L2 segmental production are indeed observable over a 10-year time interval. Further longitudinal work of this type is therefore likely to be a good investment of time and resources.
Supplemental Material
Supplemental material, sj-docx-1-las-10.1177_00238309241264296 for Aspiring to Aspirate: L2 Acquisition of English Word-Initial /p/ Over 10 Years by Murray J. Munro, Tracey M. Derwing and Kazuya Saito in Language and Speech
Acknowledgments
We acknowledge the invaluable role of the research participants in the success of this study and are grateful to the administrators at NorQuest College, Metro, and Sacred Heart in Edmonton, who facilitated contact and data collection. In addition, we thank J. Foote, G. Mellesmoen, and R. Thomson for their assistance, and J. Flege for helpful comments on an earlier draft.
Footnotes
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by research grants to the authors from the Social Sciences and Humanities Research Council of Canada.
ORCID iDs: Murray J. Munro https://orcid.org/0000-0002-5936-7750; Kazuya Saito https://orcid.org/0000-0002-4718-2943
Supplemental material: Supplemental material for this article is available online.
Contributor Information
Murray J. Munro, Department of Linguistics, Simon Fraser University, Canada.
Tracey M. Derwing, University of Alberta, Canada; Simon Fraser University, Canada.
Kazuya Saito, University College London, UK.
References
- Boersma P., Weenink D. (2023). Praat: Doing phonetics by computer [Computer program]. http://www.praat.org/
- Bondarenko M. (2015). Influence of L1 on VOT production in Spanish: English and Ukrainian comparison. In LSO Working Papers in Linguistics 10, 1–16. University of Wisconsin-Madison. https://langsci.wisc.edu/lso-working-papers-in-linguistics-volume-10
- Dahmen S., Grice M., Roessig S. (2023). Prosodic and segmental aspects of pronunciation training and their effects on L2. Languages, 8(1), 74.
- Derwing T. M., Munro M. J. (2013). The development of L2 oral language skills in two L1 groups: A seven-year study. Language Learning, 63, 163–185. https://doi.org/10.1111/lang.12000
- Derwing T. M., Munro M. J. (2015). Pronunciation fundamentals: Evidence-based perspectives for L2 teaching and research. John Benjamins.
- Derwing T. M., Munro M. J., Thomson R. I. (2008). A longitudinal study of ESL learners’ fluency and comprehensibility development. Applied Linguistics, 29, 359–380. https://doi.org/10.1093/applin/amm041
- Derwing T. M., Thomson R. I., Foote J. A., Munro M. J. (2012). A longitudinal study of listening perception in adult learners of English: Implications for teachers. The Canadian Modern Language Review, 68(3), 247–266.
- Derwing T. M., Thomson R. I., Munro M. J. (2006). English pronunciation and fluency development in Mandarin and Slavic speakers. System, 34(2), 183–193.
- Flege J. E. (1991). Age of learning affects the authenticity of Voice-Onset Time (VOT) in stop consonants produced in a second language. The Journal of the Acoustical Society of America, 89(1), 395–411.
- Flege J. E., Bohn O.-S. (2021). The Revised Speech Learning Model (SLM-r). In Wayland R. (Ed.), Second language speech learning: Theoretical and empirical progress (pp. 3–83). Cambridge University Press. https://doi.org/10.1017/9781108886901.002
- Flege J. E., Frieda E. M., Walley A. C., Randazza L. A. (1998). Lexical factors and segmental accuracy in second language speech production. Studies in Second Language Acquisition, 20, 155–187.
- Flege J. E., Munro M. J., MacKay I. R. A. (1995). Effects of age of second-language learning on the production of English consonants. Speech Communication, 16, 1–26.
- González López V., Counselman D. (2013). L2 acquisition and category formation of Spanish voiceless stops by monolingual English novice learners. In Cabrelli Amaro J., Lord G., de Prada Pérez A., Aaron J. (Eds.), Selected proceedings of the 16th Hispanic linguistics symposium, October 25-28, 2012, University of Florida (pp. 118–127). Cascadilla Proceedings Project.
- Hanzawa K. (2018). The development of Voice Onset Time (VOT) in a content-based instruction university program by Japanese learners of English: A longitudinal study. The Canadian Modern Language Review, 74(4), 502–522.
- Hazan V., Simpson A. (2000). The effect of cue-enhancement on consonant intelligibility in noise: Speaker and listener effects. Language and Speech, 43(3), 273–294.
- Holliday J. J. (2015). A longitudinal study of the second language acquisition of a three-way stop contrast. Journal of Phonetics, 50, 1–14.
- Lowie W. M., Verspoor M. H. (2019). Individual differences and the ergodicity problem. Language Learning, 69, 184–206. https://doi.org/10.1111/lang.12324
- Matsui M. (2012). Asymmetric effects of speaking rate on voice-onset time: The case of Russian. In Proceedings of the international conference of experimental linguistics, ExLing 2012, August 2012 (pp. 27–29). https://exlingsociety.com/wp-content/uploads/proceedings/exling-2012/05_0022_000228.pdf
- Munro M. J., Derwing T. M. (2008). Segmental acquisition in adult ESL learners: A longitudinal study of vowel production. Language Learning, 58, 479–502. https://doi.org/10.1111/j.1467-9922.2008.00448.x
- Munro M. J., Derwing T. M. (2020). Foreign accent, comprehensibility and intelligibility, redux. Journal of Second Language Pronunciation, 6(3), 283–309.
- Nagle C. L. (2019). A longitudinal study of voice onset time development in L2 Spanish stops. Applied Linguistics, 40(1), 86–107.
- Nagle C. L. (2021). Assessing the state of the art in longitudinal L2 pronunciation research: Trends and future directions. Journal of Second Language Pronunciation, 7(2), 154–182. https://doi.org/10.1075/jslp.20059.nag
- Nagy N., Kochetov A. (2013). Voice onset time across the generations: A crosslinguistic study of contact-induced change. In Siemund P., Gogolin I., Schulz M. E., Davydova J. (Eds.), Multilingualism and language diversity in urban areas: Acquisition, identities, space, education (pp. 19–38). John Benjamins.
- Offerman H. M., Olson D. J. (2016). Visual feedback and second language segmental production: The generalizability of pronunciation gains. System, 59, 45–60.
- Pawlikowska-Smith G. (2000). Canadian Language Benchmarks 2000: English as a second language for adults. Centre for Canadian Language Benchmarks.
- Piske T., MacKay I. R. A., Flege J. E. (2001). Factors affecting degree of foreign accent in an L2: A review. Journal of Phonetics, 29(2), 191–215. https://doi.org/10.1006/jpho.2001.0134
- Repp B. H., Liberman A. M., Harnad S. (1984). Phonetic category boundaries are flexible (Haskins Laboratories Status Report on Speech Research, SR-77). https://apps.dtic.mil/sti/tr/pdf/ADA145585.pdf
- Ringen C., Kulikov V. (2012). Voicing in Russian stops: Cross-linguistic implications. Journal of Slavic Linguistics, 20, 269–286. https://doi.org/10.1353/jsl.2012.0012
- Saito K., Kachlicka M., Sun H., Tierney A. (2020). Domain-general auditory processing as an anchor of post-pubertal second language pronunciation learning: Behavioural and neurophysiological investigations of perceptual acuity, age, experience, development, and attainment. Journal of Memory and Language, 115, 104168.
- Saito K., Munro M. J. (2014). The early phase of /ɹ/ production development in adult Japanese learners of English. Language and Speech, 57(4), 451–469.
- Schuhmann K. S., Huffman M. K. (2019). Development of L2 Spanish VOT before and after a brief pronunciation training session. Journal of Second Language Pronunciation, 5(3), 402–434.
- Smiljanic R., Bradlow A. R. (2008). Stability of temporal contrasts across speaking styles in English and Croatian. Journal of Phonetics, 36(1), 91–113.
- Thomson R. I., Derwing T. M., Munro M. J. (2024). How long can naturalistic L2 pronunciation learning continue in adults? A ten-year study. Language Awareness, 33(2), 201–223. https://doi.org/10.1080/09658416.2023.2227559
- Wade L., Lai W., Tamminga M. (2021). The reliability of individual differences in VOT imitation. Language and Speech, 64(3), 576–593.
